The user community around Apache Spark is exploding, with 300,000 people taking part in global Spark meetups, a 3.6x increase, thanks to novel features like the Structured Streaming API and the many enhancements to existing features coming up in 2017. A Databricks survey of 1,400 Spark users found that 56% more users globally ran Spark Streaming applications in 2015 than in 2014. Also, 48% of Spark users named Spark Streaming as the most-used and important Spark component. The Spark Streaming architecture focuses on programming perks for Spark developers, and its user base keeps growing: CloudPhysics, Uber, eBay, Amazon, ClearStory, Yahoo, Pinterest, Netflix, and others. Apache Spark is a big data technology well worth taking note of and learning about. This blog explores the need for Spark Streaming, what Spark Streaming is, and how various companies are using the Spark Streaming component to enhance business productivity.
Spark Streaming has garnered a lot of popularity and attention in the big data enterprise computation industry. As companies generate more data than ever before, extracting value from it in real-time business scenarios means the data must be closely monitored and acted upon quickly. Earlier, programmers used to build two stacks, one for batch processing and one for stream processing, to handle the same data. Moreover, the existing processing frameworks could not achieve both: they could either perform batch processing of hundreds of terabytes of data with high latency, or stream processing of hundreds of megabytes of data with low latency. This made development difficult and painful, as developers had to maintain multiple programming models, requiring double the operational and implementation effort. The move to embrace both batch processing and stream processing is not an easy one, even for fast-flying web companies. Thus, large-scale, real-time data processing with Spark Streaming became extremely important.
Let’s consider traditional streaming systems like Apache Storm that aim to guarantee low latencies. Whenever there is an incoming input event, whether it is 10 bytes of data or a large volume of data, such record-at-a-time systems try to process it as soon as it arrives. If the data needs to pass through six machines, it is sent through those machines immediately, one after another. A major consequence of this kind of design is state. Every node in the graph of computation holds its own mutable state, and as an incoming event flows from one node to another, the state of each processing node gets modified. The modified state can be written out to databases, but the issue is what happens when there is a failure in the system. If a node fails and goes down, the state associated with it also goes down, i.e. the mutable state is lost whenever a node fails, making fault-tolerant stateful stream processing a challenging task. This limitation led to a new design called the Lambda Architecture, which compensates for the fragile streaming layer by running a separate fault-tolerant batch layer alongside it, again doubling the systems to maintain. Thus, fault-tolerant stateful stream processing is hard to implement, and one of the best ways to achieve it is micro-batching. Spark has an amazing implementation of this known as Spark Streaming.
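To make the problem concrete, here is a toy pure-Python sketch (not Storm's actual API) of a record-at-a-time operator that keeps mutable state on its node, illustrating why a node failure loses that state:

```python
# Hypothetical sketch of record-at-a-time processing with node-local state.

class CountingNode:
    """A processing node that updates its own mutable state per event."""
    def __init__(self):
        self.counts = {}  # mutable state lives only on this node

    def on_event(self, word):
        # Each event is processed the moment it arrives.
        self.counts[word] = self.counts.get(word, 0) + 1

node = CountingNode()
for event in ["spark", "storm", "spark"]:
    node.on_event(event)

print(node.counts)   # {'spark': 2, 'storm': 1}

# If this node crashes, self.counts is gone: there is no lineage or input
# log from which to rebuild it. A "restarted" node starts from empty state:
node = CountingNode()
print(node.counts)   # {}
```

This is exactly the fault-tolerance gap that micro-batching closes: instead of accumulating state event by event, the work is redone from recorded inputs.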
“A data processing framework to build streaming applications.”
Added to the Apache Spark framework in 2013, Spark Streaming (also known as a micro-batching framework) is an integral part of the core Spark API that allows data scientists and big data engineers to process real-time data from multiple sources like Apache Kafka, Amazon Kinesis, and Flume. It supports real-time processing of streaming data such as tweets from Twitter, production web server log files from Amazon S3, Flume, or HDFS, and messages from queues like Apache Kafka.
Apache Spark has Resilient Distributed Datasets (RDDs) that maintain a lineage graph recording how each partition of the data was created. Whenever there is a failure, Spark can recreate the data and run the computations again. When there is a stream of incoming data, you can take a sliding window, grab a little bit of data in that window, and run it as if it were a batch. This process can be repeated again and again. The key abstraction in Spark Streaming is the Discretized Stream, or DStream, built on RDDs. A DStream represents a stream of data divided into small batches. In Spark Streaming, the live data stream is chopped into small batches of x seconds each. Spark then treats each batch of data as an RDD and processes it using the usual RDD operations. The results are returned in batches, which can be sent to HDFS or any other storage or streaming system.
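The micro-batch idea behind a DStream can be sketched in a few lines of plain Python (a conceptual illustration, not the actual Spark API): chop the incoming stream into small batches, then process each batch with ordinary batch operations.

```python
# Minimal sketch of DStream-style micro-batching (not real Spark code).
from collections import Counter

def micro_batches(events, batch_size):
    """Group a live event stream into small batches; a stand-in for
    Spark's x-second batching interval."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final partial batch

# Each batch is treated like a small RDD: because the batch's input and
# transformation are known (its "lineage"), a lost result can simply be
# recomputed by re-running the same transformation on the same input.
stream = ["a", "b", "a", "c", "b", "a"]
results = [dict(Counter(batch)) for batch in micro_batches(stream, batch_size=3)]
print(results)  # [{'a': 2, 'b': 1}, {'c': 1, 'b': 1, 'a': 1}]
```

In real Spark Streaming the batching is driven by a time interval rather than a count, but the principle is the same: a stream becomes a sequence of small, recomputable batches.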
The Spark Streaming architecture consists of three important components.
The major difference between the Spark Streaming architecture and traditional streaming system architectures is that in Spark Streaming, computations are divided into short, stateless, deterministic tasks that can run on any node in the Spark cluster, or on multiple nodes. This makes it straightforward to balance load across the cluster and to react to failures.
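A toy sketch (hypothetical, not Spark internals) shows why short, stateless, deterministic tasks make load balancing and recovery easy: any worker can run any task, and a failed task can simply be re-run with the same result.

```python
# Sketch of stateless, deterministic tasks over partitions of a micro-batch.

def word_count_task(partition):
    """A deterministic, stateless task: its output depends only on its input."""
    counts = {}
    for w in partition:
        counts[w] = counts.get(w, 0) + 1
    return counts

batch = ["a", "b", "a", "c"]
partitions = [batch[:2], batch[2:]]  # split one micro-batch into two tasks

# The tasks carry no node-local state, so it does not matter which
# worker executes which task; any scheduler assignment is valid.
results = [word_count_task(p) for p in partitions]

# Simulate a failure: partition 1's result is lost, so the scheduler
# re-runs the same task on any other worker and gets the same answer.
assert word_count_task(partitions[1]) == results[1]

# Merge the partial results.
total = {}
for r in results:
    for w, c in r.items():
        total[w] = total.get(w, 0) + c
print(total)  # {'a': 2, 'b': 1, 'c': 1}
```

Contrast this with the mutable-state operator earlier: here nothing is lost on failure, because every task can be deterministically recomputed from its input partition.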
Spark Streaming is a perfect fit for any use case that requires real-time statistics and rapid response. Organizations are using Spark Streaming for various real-time data processing applications such as recommendations and targeting, network optimization, personalization, scoring of analytic models, and stream mining.