The user community around Apache Spark is exploding, with 300,000 people taking part in global Spark meetups, a 3.6x increase, thanks in part to novel features like the Structured Streaming API and the many enhancements to existing features arriving in 2017. A Databricks study of 1,400 Spark users found that 56% more users globally ran Spark Streaming applications in 2015 than in 2014, and 48% of Spark users named Spark Streaming the most-used and important Spark component. Spark Streaming's architecture focuses on programming perks for Spark developers, and its user base keeps growing: CloudPhysics, Uber, eBay, Amazon, ClearStory, Yahoo, Pinterest, Netflix, and more. Apache Spark is a big data technology well worth taking note of and learning about. This blog explores the need for Spark Streaming, what Spark Streaming is, and how various companies are using the Spark Streaming component to enhance business productivity.
Need for Spark Streaming
Spark Streaming has garnered a lot of popularity and attention in the big data enterprise computation industry. As companies generate more data than ever before and seek to extract value from it for real-time business scenarios, that data needs to be closely monitored and acted upon quickly. Earlier, programmers used to build two stacks to process the same data: one for batch processing and one for streaming. The existing processing frameworks could not achieve both: they could either perform batch processing of hundreds of terabytes of data with high latency, or stream processing of hundreds of megabytes of data with low latency. This made development difficult and painful, as developers had to maintain multiple programming models, doubling the operational and implementation effort. The move to embrace both batch processing and stream processing is not an easy one, even for fast-moving web companies. Thus, large-scale, real-time data processing with Spark Streaming became extremely important.
Let's consider traditional streaming systems like Apache Storm, which aim to guarantee low latency. Whenever an incoming event arrives, whether it is 10 bytes or a large volume of data, such event-at-a-time systems try to process it as soon as it comes in. If the data needs to pass through six machines, it is forwarded through those machines one after another, then and there. A major consequence of this design is state: every node in the graph of computation holds its own mutable state, and as an incoming event moves from one node to another, the state of each processing node gets modified. The modified state can be written out to databases, but the real issue is what happens when there is a failure in the system. If a node fails and goes down, its associated state goes down with it, i.e. the mutable state is lost whenever a node fails, which makes fault-tolerant stateful stream processing a challenging task. Workarounds such as the Lambda Architecture pair a batch layer with a speed layer, but at the cost of maintaining two parallel systems. Fault-tolerant stateful stream processing is therefore hard to implement, and the most practical way to achieve it is micro-batching. Spark has an elegant implementation of this idea, known as Spark Streaming.
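The micro-batch idea can be shown with a minimal pure-Python sketch (this is deliberately not the Spark API; it only illustrates the concept): events are grouped into small batches, and each batch is processed with ordinary, stateless batch logic, so a failed batch can simply be re-run from its input instead of restoring lost node state.

```python
from itertools import islice

def micro_batches(event_stream, batch_size):
    """Group an unbounded event stream into small batches (the
    micro-batch idea behind Spark Streaming)."""
    it = iter(event_stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    """Per-batch processing: a deterministic, stateless word count.
    Its output depends only on the batch's input, never on node state."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

events = ["spark", "storm", "spark", "flink", "spark", "storm"]
results = [process(b) for b in micro_batches(events, 3)]
# Each batch is re-computable from its input alone, so a failed
# batch can be re-executed rather than its state recovered.
```

Because each task is a pure function of its input, fault tolerance reduces to re-running tasks, which is exactly the property event-at-a-time systems with mutable per-node state lack.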
What is Spark Streaming?
“A data processing framework to build streaming applications.”
Added to the Apache Spark framework in 2013, Spark Streaming (also known as the micro-batching framework) is an integral part of the core Spark API that allows data scientists and big data engineers to process real-time data from multiple sources like Kafka, Kinesis, and Flume. It supports real-time processing of streaming data such as tweets from Twitter, production web server log files from Amazon S3, Flume, or HDFS, and messaging queues like Apache Kafka.
Why use Spark Streaming?
- Fault-tolerant semantics
- Simpler and Modular
- Support for merging data with historical data
- Ease of Code Reuse
- Highly Scalable
- High level language operators for streaming data
Spark Streaming Architecture
Apache Spark is built on Resilient Distributed Datasets (RDDs), which maintain a lineage graph recording how each partition of the data was created. Whenever there is a failure, Spark can recreate the data and run the computations again. When there is a stream of incoming data, you can take a sliding window, grab a little bit of data within that window, and run it as if it were a batch. This process is repeated again and again. The key abstraction in Spark Streaming is the Discretized Stream, or DStream, built on RDDs. A DStream represents a stream of data divided into small batches: in Spark Streaming, the live data stream is chopped into batches of x seconds each. Spark then treats each batch of data as an RDD and processes it using ordinary RDD operations. The results are returned in batches, which can be sent to HDFS or any other downstream system.
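The sliding-window behaviour described above can be sketched in a few lines of plain Python. This is only a conceptual simulation: the real DStream API exposes operations such as `window(windowLength, slideInterval)` on a stream, while here the window length and slide interval are expressed in whole batches.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Mimic DStream windowing: each window spans `window_length`
    consecutive batches and advances by `slide_interval` batches."""
    for start in range(0, len(batches) - window_length + 1, slide_interval):
        # Flatten the batches inside the window and aggregate them,
        # much like a reduce-by-window operation.
        window = [x for b in batches[start:start + window_length] for x in b]
        yield sum(window)

# One inner list per x-second micro-batch of numeric readings.
batches = [[1, 2], [3], [4, 5], [6]]
totals = list(sliding_windows(batches, window_length=2, slide_interval=1))
```

Overlapping windows (slide interval smaller than window length) let each batch contribute to several aggregates, which is how rolling statistics over a live stream are computed.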
Spark Streaming architecture consists of 3 important components –
- Master Node – It is responsible for tracking the DStream lineage graph and also schedules various tasks to compute any new RDD partitions.
- Client Library – Used to send data into the system.
- Worker Nodes – They receive data, store partitions of the computed RDDs and execute tasks.
The major difference between the Spark Streaming architecture and traditional streaming architectures is that in Spark Streaming, computations are divided into short, stateless, deterministic tasks that can run on any node in the Spark cluster, or on multiple nodes. This makes it straightforward to balance load across the cluster and to react to failures.
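Why stateless, deterministic tasks matter can be illustrated with a toy scheduler (a hypothetical sketch, not Spark's actual scheduler): because a task's result depends only on its input partition, the same task can be dispatched to any worker, and re-dispatched elsewhere if a worker fails, without changing the answer.

```python
def run_task(partition):
    # A deterministic, stateless task: its output depends only on the
    # input partition, so any worker can run (or re-run) it.
    return sum(partition)

def schedule(partitions, workers, failed_worker=None):
    """Toy scheduler: assign each partition's task to a worker in
    round-robin order; if that worker has failed, re-dispatch the
    identical task to a surviving worker."""
    results = {}
    for i, part in enumerate(partitions):
        worker = workers[i % len(workers)]
        if worker == failed_worker:
            worker = next(w for w in workers if w != failed_worker)
        results[i] = (worker, run_task(part))
    return results

partitions = [[1, 2], [3, 4], [5, 6]]
ok = schedule(partitions, ["w1", "w2"])
after_failure = schedule(partitions, ["w1", "w2"], failed_worker="w2")
# The computed values are identical either way; only placement changes.
```

This placement freedom is what lets Spark Streaming balance load and recover from failures without the lost-state problem of per-node mutable state.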
Data Sources for Spark Streaming
- Apache Kafka
- Amazon Kinesis
- TCP Sockets
- Apache Flume
Advantages of Spark Streaming over Traditional Streaming Systems
- It unifies streaming, batch processing and interactive analytics. The fusion of these disparate data processing capabilities makes it easy for big data developers to use a single framework for all their big data processing needs. For instance, Spark developers can use the machine learning library to train models offline and then apply those models directly to score live data in Spark Streaming.
- A major selling point for the rapid adoption of Apache Spark Streaming is increased programmer productivity as the code used for batch processing can be used with minor tweaks for real-time computations as well.
- Native integration with advanced processing libraries such as MLlib, GraphX and Spark SQL.
- Spark Streaming helps recover from failures faster as computations are in the form of discretized small streams making it easy to re-launch failed tasks in parallel on other nodes in a spark cluster.
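The code-reuse advantage above can be sketched in plain Python (an illustrative example with made-up log records, not Spark code): a single transformation function serves both the offline batch job over historical data and the per-micro-batch streaming job, with no second implementation to maintain.

```python
def transform(records):
    """One transformation reused unchanged in both modes: keep error
    records and normalise them (trim whitespace, lowercase)."""
    return [r.strip().lower() for r in records if "error" in r.lower()]

# Batch mode: run once over the full historical dataset.
historical = ["INFO boot", "ERROR disk full ", "WARN slow"]
batch_result = transform(historical)

# Streaming mode: run the exact same function on every micro-batch.
stream = [["ERROR net down"], ["INFO ok", " Error retry "]]
stream_result = [transform(micro_batch) for micro_batch in stream]
```

In Spark the analogue is applying the same RDD transformation logic in a batch job and inside a streaming job, which is what keeps the two code paths from diverging.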
Spark Streaming Use Cases
Spark Streaming is a perfect fit for any use case that requires real-time data statistics and response. Organizations are using spark streaming for various real-time data processing applications like recommendations and targeting, network optimization, personalization, scoring of analytic models, stream mining, etc.
General Ways Spark Streaming is Used Today
- Streaming ETL – Data is cleaned and aggregated continuously before it is pushed into the data stores. Popular Spark Streaming examples of this are Uber and Pinterest. Pinterest uses Spark Streaming to gain insights on how users interact with pins across the globe in real-time. Similarly, Uber uses streaming ETL pipelines to collect event data for real-time telemetry analysis.
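The clean-then-aggregate shape of a streaming ETL stage can be sketched as follows. This is a minimal pure-Python illustration with invented event records; a real pipeline would run logic like this over each micro-batch before loading the aggregates into a data store.

```python
import json

def clean(event):
    """Extract: drop malformed events; Transform: normalise fields."""
    try:
        record = json.loads(event)
        return {"user": record["user"].lower(), "action": record["action"]}
    except (json.JSONDecodeError, KeyError):
        return None  # discard events that cannot be parsed

def aggregate(records):
    """Count actions per user, ready to Load into the data store."""
    counts = {}
    for r in records:
        key = (r["user"], r["action"])
        counts[key] = counts.get(key, 0) + 1
    return counts

raw_batch = [
    '{"user": "Alice", "action": "pin"}',
    'not json',
    '{"user": "alice", "action": "pin"}',
    '{"user": "Bob", "action": "ride"}',
]
cleaned = [r for r in (clean(e) for e in raw_batch) if r is not None]
store_ready = aggregate(cleaned)
```

Doing the cleaning and aggregation continuously, batch by batch, is what distinguishes streaming ETL from a nightly batch ETL job over the same data.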
- Complex Session Analysis – Spark Streaming can be used to analyse events relating to live sessions, such as tracking user activity after a user logs in to an app or website. One popular Spark Streaming example of this use case is Netflix, which uses Spark Streaming to glean valuable insights on how users engage with its website.
- Trigger Event Detection – Companies use Spark Streaming to respond to unusual behaviours or events that could signal a potential threat or serious problem within the system. A popular example of this use case is hospitals monitoring patient vitals to detect potentially dangerous conditions, so that an automatic alert is sent to caretakers who can act in time. Another company, CloudPhysics, uses Spark Streaming to detect anomalies in machine data.
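The trigger-detection pattern amounts to scanning each micro-batch against alert rules. The sketch below uses hypothetical patient-vitals thresholds purely for illustration (real clinical thresholds and field names would differ):

```python
def detect_alerts(vitals_batch, max_heart_rate=120, min_spo2=92):
    """Scan one micro-batch of vitals readings and flag any that
    breach (hypothetical) thresholds, so alerts can be dispatched."""
    alerts = []
    for reading in vitals_batch:
        if reading["heart_rate"] > max_heart_rate:
            alerts.append((reading["patient"], "high heart rate"))
        if reading["spo2"] < min_spo2:
            alerts.append((reading["patient"], "low oxygen saturation"))
    return alerts

batch = [
    {"patient": "p1", "heart_rate": 80, "spo2": 97},
    {"patient": "p2", "heart_rate": 135, "spo2": 90},
]
alerts = detect_alerts(batch)
```

Because batches arrive every few seconds, the gap between an anomalous reading and the alert is bounded by the batch interval, which is what makes micro-batching viable for this class of near-real-time detection.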
Common Spark Streaming Use Cases
- Fraud Detection / Intrusion Detection
- Stock Market
- Real- Time Bidding/ Ad-Auction platforms
- Real-Time Data Warehousing
- Clickstream Analysis
- Log Processing
- Trend Analysis
Spark Streaming Example Use Cases for Mobile Phones
- Location based Advertisements
- Network Metrics Analysis
Spark Streaming Example Use Cases for Web
- Website Analytics
- Sentiment Analysis
Spark Streaming Example Use Cases for Sensors
- Supply Chain Planning
- Malfunction Detection
- Dynamic Process Optimisation