Streaming data is data that is generated continuously by thousands of data sources, which typically send their data records simultaneously and in small sizes (on the order of kilobytes).
Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, online shopping data, in-game player activity, information from social networking sites, financial trading floors, or geospatial services, as well as telemetry data from connected devices or instruments in your data center.
This data is processed sequentially and incrementally, either record by record or over sliding time windows, and is used for a wide variety of analyses, including correlation, aggregation, filtering, and sampling. The information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering and billing), server activity, website clicks, and the geographic location of devices, people, and physical goods, and lets them respond promptly to emerging situations.
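As an illustration of incremental processing over a sliding time window, here is a minimal sketch in Python. The class name `SlidingWindowSum`, its parameters, and the assumption that records arrive in timestamp order are all hypothetical choices made for this example, not part of any particular streaming platform.

```python
from collections import deque


class SlidingWindowSum:
    """Running count and sum of the values seen in the last `window_seconds`.

    Assumes records arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        # Fold the new record into the aggregate.
        self.events.append((timestamp, value))
        self.total += value
        # Evict records that have slid out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return len(self.events), self.total
```

Each call to `add` updates the aggregate in place, so the work per record does not depend on how much of the stream has already been processed.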
Advantages of streaming data
Streaming data processing is beneficial in most scenarios where new, dynamic data is generated continuously. It applies to most industries and big data use cases. Companies typically start with simple applications, such as collecting system logs, and rudimentary processing, such as rolling min-max computations. These applications then evolve toward more sophisticated near-real-time processing.
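A rolling min-max computation of the kind mentioned above might look like the following sketch. It tracks the minimum and maximum of the most recent `size` records using monotonic deques; the class name `RollingMinMax` and the choice of a count-based (rather than time-based) window are assumptions made for illustration.

```python
from collections import deque


class RollingMinMax:
    """Minimum and maximum over the most recent `size` records."""

    def __init__(self, size):
        self.size = size
        self.index = 0
        self.min_q = deque()  # (index, value), values increasing front to back
        self.max_q = deque()  # (index, value), values decreasing front to back

    def add(self, value):
        i = self.index
        self.index += 1
        # Drop candidates that can never be the minimum or maximum again.
        while self.min_q and self.min_q[-1][1] >= value:
            self.min_q.pop()
        self.min_q.append((i, value))
        while self.max_q and self.max_q[-1][1] <= value:
            self.max_q.pop()
        self.max_q.append((i, value))
        # Expire candidates that have left the window.
        oldest = i - self.size + 1
        while self.min_q[0][0] < oldest:
            self.min_q.popleft()
        while self.max_q[0][0] < oldest:
            self.max_q.popleft()
        return self.min_q[0][1], self.max_q[0][1]
```

Every record is pushed and popped at most once per deque, so the amortized cost per record stays constant regardless of window size.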
Initially, applications may process data streams to produce simple reports and take simple actions in response, such as raising an alert when a key metric exceeds a threshold. Eventually, they perform more sophisticated forms of data analysis, such as applying machine learning algorithms, to extract deeper insights from the data. Over time, complex stream and event processing algorithms are applied, such as decaying time windows to find the most recent popular movies, further enriching the insights.
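One way to read the "most recent popular movies" example is as an exponentially decayed counter, where each view adds one point and every score halves after a fixed half-life. The sketch below follows that interpretation; it is an assumption about what a time-decay window could mean here, and the names `DecayingCounter`, `record`, and `top` are hypothetical.

```python
import math
from collections import defaultdict


class DecayingCounter:
    """Per-key scores that decay exponentially with the given half-life."""

    def __init__(self, half_life_seconds):
        self.decay_rate = math.log(2) / half_life_seconds
        self.scores = defaultdict(float)  # score as of last_seen[key]
        self.last_seen = {}

    def score(self, key, now):
        # Decay the stored score from its last update time to `now`.
        elapsed = now - self.last_seen.get(key, now)
        return self.scores[key] * math.exp(-self.decay_rate * elapsed)

    def record(self, key, timestamp):
        # Each event adds one point on top of the decayed score.
        self.scores[key] = self.score(key, timestamp) + 1.0
        self.last_seen[key] = timestamp

    def top(self, n, now):
        keys = list(self.scores)
        keys.sort(key=lambda k: self.score(k, now), reverse=True)
        return keys[:n]
```

With a 60-second half-life, three views recorded five minutes ago are worth less than a single view recorded now, so recently popular items rise to the top without any explicit window eviction.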
Examples of streaming data
1. Sensors in transportation vehicles, industrial equipment, and farm machinery send data to a stream processing application. The application monitors performance, detects potential defects in advance, and automatically orders spare parts, preventing equipment downtime.
2. Financial institutions track stock market fluctuations in real time, calculate value-at-risk, and automatically rebalance portfolios based on stock price movements.
3. Real estate websites track a subset of the data from customers' mobile devices and recommend, in real time, properties to visit based on their geographic location.
4. A solar power company must maintain enough generation capacity to meet customer demand or pay a penalty. It implemented a streaming data application that monitors all panels in the power system and schedules service in real time, minimizing the periods of low capacity for each panel and the associated penalty payments.
5. Media publishers stream billions of clickstream records from their online content, aggregate and enrich the data with demographic information about users, and optimize content placement on their websites to deliver relevance and a better experience for their audiences.
6. Online gaming companies collect streaming data about player-game interactions and feed it to their gaming platforms, which analyze the data in real time and offer incentives and dynamic experiences to engage players.
Challenges of using streaming data
Streaming data processing requires two layers: a storage layer and a processing layer. The storage layer needs to support record ordering and strong consistency so that large streams of data can be read and written quickly, inexpensively, and repeatably. The processing layer consumes data from the storage layer, runs computations on that data, and then notifies the storage layer to delete data that is no longer needed.
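The division of labor between the two layers can be sketched with a toy, in-memory example. The classes `StreamStore` and `SumProcessor` below are hypothetical and exist only to illustrate the idea: the store keeps records in order and addressable by sequence number, and the processor reads a batch, computes on it, and then tells the store which records it may delete. A real platform would add durability, partitioning, and fault tolerance.

```python
class StreamStore:
    """Toy storage layer: an ordered, append-only log of records."""

    def __init__(self):
        self.records = []  # retained records, oldest first
        self.base = 0      # sequence number of records[0]

    def append(self, record):
        self.records.append(record)
        return self.base + len(self.records) - 1  # assigned sequence number

    def read(self, start, limit):
        """Return up to `limit` (sequence, record) pairs from `start`, in order."""
        offset = max(start - self.base, 0)
        batch = self.records[offset:offset + limit]
        return [(self.base + offset + i, r) for i, r in enumerate(batch)]

    def trim(self, up_to):
        """Delete every record whose sequence number is below `up_to`."""
        drop = min(max(up_to - self.base, 0), len(self.records))
        del self.records[:drop]
        self.base += drop


class SumProcessor:
    """Toy processing layer: keeps a running sum of numeric records."""

    def __init__(self, store):
        self.store = store
        self.next_seq = 0
        self.total = 0

    def poll(self, limit=100):
        for seq, value in self.store.read(self.next_seq, limit):
            self.total += value
            self.next_seq = seq + 1
        # Tell the storage layer that processed records are no longer needed.
        self.store.trim(self.next_seq)
        return self.total
```

Because reads are addressed by sequence number, reading the same range twice before it is trimmed returns the same records in the same order, which is the repeatability property described above.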
You must also plan for scalability, data durability, and fault tolerance in both the storage and processing layers. As a result, many platforms have emerged that provide the infrastructure needed to build streaming data applications, including Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Apache Flume, Apache Spark Streaming, and Apache Storm.