Apache Spark and Apache Flink are both open-source, distributed processing frameworks that were built to reduce the latency of Hadoop MapReduce in fast data processing. A common question is whether Apache Flink is going to replace Spark, or whether both of these big data technologies can co-exist, serving similar needs for fault-tolerant, fast data processing. Apache Spark and Flink may seem similar to someone who has not worked with either of them and is only familiar with Hadoop, and such a reader may well feel that the development of Apache Flink is mostly superfluous. But Flink has managed to stay ahead in the game because of its stream processing feature, which processes rows upon rows of data in real time as they arrive – something that Apache Spark’s batch-oriented processing method cannot do. This gives Flink lower latency than Spark on streaming workloads.
According to an IBM study, we are creating about 2.5 quintillion bytes of data every day – and this rate of data generation continues to increase at an unprecedented pace. To put things in perspective, about 90% of all data existing in the world was created in the last two years, even though the World Wide Web has been accessible to the public for well over two decades. As the Internet grew, so did the number of users, and the ever-increasing demand for content paved the way for Web 2.0 in the last decade. For the first time, users were able to create their own data on the Internet, ready to be consumed by a data-hungry audience.
Then it was social media’s turn to invade our lives. According to the wersm (we are social media) report, Facebook gets more than 4 million likes a minute! The data generated by other popular sources is shown in the infographic (taken from the same wersm study); after that, we look at how this data is consumed.
“How to store these enormous amounts of data?” was a problem statement that kept tech geeks busy for most of the previous decade. The sudden rise of social media did not make their task any easier. However, new-age storage solutions such as cloud computing have revolutionized the industry and presented the best possible solution. In the present decade, the problem statement has shifted to “What to do with huge chunks of data?” Data analytics emerged as the ultimate goal, but before that, a lot of work needs to be done to integrate data stored in different formats at different sources and prepare it for processing and analytics, which is a demanding task.
Our two topics for today – Apache Spark and Apache Flink – attempt to answer that question and more.
Spark is an open-source cluster computing framework with a large global user base. It is written primarily in Scala and offers programming interfaces in Scala, Java, Python and R. It gives programmers an Application Programming Interface (API) built on the resilient distributed dataset (RDD), a fault-tolerant, read-only multiset of data items distributed across the cluster. In the short span of 2 years since its 1.0 release (May 2014), it has seen wide acceptance for real-time, in-memory, advanced analytics – owing to its speed, ease of use and ability to handle sophisticated analytical requirements.
Advantages of Spark
Apache Spark has several advantages over traditional Big Data and MapReduce-based technologies. It essentially takes MapReduce to the next level, with performance that is several times faster. One of the key differentiators for Spark is its ability to hold intermediate results in memory, rather than writing them back to disk and reading them again, which is critical for iteration-based use cases. The prominent advantages are:
- Speed – Spark can execute batch processing jobs up to 10 to 100 times faster than MapReduce. That doesn’t mean it lags behind when data has to be written to (and fetched from) disk, as it has also set the world record for large-scale on-disk sorting.
- Ease of Use – Apache Spark has easy-to-use APIs, built for operating on large datasets.
- Unified Engine – Spark can run on top of Hadoop, making use of its cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). However, it can also run independently of Hadoop, joining hands with other cluster managers and storage platforms (the likes of Cassandra and Amazon S3). It also comes with higher-level libraries that support SQL queries, data streaming, machine learning and graph processing.
- Choice of languages – Spark doesn’t tie you down to a particular language and lets you choose from popular ones such as Java, Scala, Python, R and even Clojure.
- In-memory data sharing – Different jobs can share data in memory, which makes Spark an ideal choice for iterative, interactive and event stream processing tasks.
- Active, expanding user community – An active user community has led to a stable release of Spark (in June, 2016) within 2 years of its initial release. This speaks volumes of its worldwide acceptability, which is on the rise.
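To make the in-memory advantage concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a small JSON file stands in for data on disk, and the function names `run_passes` and `demo` are hypothetical. It counts how many times an iterative job has to go back to disk when intermediate data is kept in memory compared with when it is not.

```python
import json
import os
import tempfile


def run_passes(path, passes, keep_in_memory):
    """Sum the records `passes` times; return (result, number_of_disk_reads)."""
    disk_reads = 0
    cached = None
    total = 0
    for _ in range(passes):
        if keep_in_memory and cached is not None:
            # Reuse the records already held in memory.
            records = cached
        else:
            # Simulates the MapReduce-style round trip to disk on every pass.
            with open(path) as handle:
                records = json.load(handle)
            disk_reads += 1
            if keep_in_memory:
                cached = records
        total = sum(records)
    return total, disk_reads


def demo():
    """Run the same 5-pass job with and without in-memory reuse."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "records.json")
        with open(path, "w") as handle:
            json.dump(list(range(1000)), handle)
        disk_based = run_passes(path, passes=5, keep_in_memory=False)
        in_memory = run_passes(path, passes=5, keep_in_memory=True)
    return disk_based, in_memory
```

Both runs produce the same result, but the in-memory run reads from disk once instead of five times. In Spark itself, the comparable step is marking an RDD for reuse with `cache()` or `persist()`.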
German for ‘quick’ or ‘nimble’, Apache Flink is the latest entrant to the list of open-source frameworks focused on Big Data analytics that are trying to replace Hadoop’s aging MapReduce, just like Spark. Flink’s first API-stable version was released in March 2016, and it is built for in-memory processing of batch data, just like Spark. This model comes in really handy when repeated passes need to be made over the same data, which makes Flink an ideal candidate for machine learning and other use cases that require adaptive learning, self-learning networks, etc. With the inevitable boom of the Internet of Things (IoT) space, the Flink user community has some exciting challenges to look forward to.
Advantages of Flink
- Actual stream processing engine that can approximate batch processing, rather than being the other way around.
- Better memory management – Explicit memory management gets rid of the occasional spikes found in the Spark framework.
- Speed – It manages faster speeds by allowing iterative processing to take place on the same node rather than having the cluster run each iteration independently. Its performance can be tuned further by having it re-process only the part of the data that has changed rather than the entire set. This offers up to a five-fold boost in speed compared to the standard processing algorithm.
- Less configuration
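The idea of re-processing only the data that has changed can be sketched in plain Python (this is not the Flink API). The sketch assumes a connected-components job in which every vertex ends up labelled with the smallest vertex id in its component. The function name and the `workset` bookkeeping are illustrative assumptions, loosely modelled on how delta-style iterations keep a set of changed elements.

```python
def connected_components(edges, vertices):
    """Return (labels, workset_sizes) using changed-only re-processing."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {v: v for v in vertices}  # current result for every vertex
    workset = set(vertices)            # vertices whose label changed last round
    workset_sizes = []

    while workset:
        workset_sizes.append(len(workset))
        proposals = {}
        # Only vertices that changed in the previous round do any work.
        for v in workset:
            for n in neighbours[v]:
                if labels[v] < labels[n]:
                    proposals[n] = min(labels[v], proposals.get(n, labels[n]))
        # Apply the updates after the round so each round sees a fixed state.
        for n, label in proposals.items():
            labels[n] = label
        workset = set(proposals)
    return labels, workset_sizes
```

For the graph with edges `(1, 2), (2, 3), (3, 4), (5, 6)`, the number of vertices processed per round shrinks from 6 to 4, 2 and 1. A naive iteration would touch all 6 vertices in every round.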
Apache Flink vs Spark
By the time Flink came along, Apache Spark was already the de facto framework for fast, in-memory big data analytics at a number of organizations around the world. This made Flink appear superfluous. After all, why would one require another data processing engine while the jury was still out on the existing one? One has to dig deeper into the capabilities of Flink to see what sets it apart, though a number of analysts have billed it as the “4G of Data Analytics”.
Deeply embedded in Spark’s design is a little weakness that Flink has targeted and is trying to capitalize upon. Although it passes for one in casual discussion, Spark is not purely a stream-processing engine. As observed by Ian Pointer in the InfoWorld article ‘Apache Flink: New Hadoop contender squares off against Spark’, Spark’s streaming is essentially a fast batch operation that works on only a small part of the incoming data during each time unit. Spark refers to this as “micro-batching” in its official documentation. This is unlikely to have any practical significance unless the use case requires low latency (financial systems, for example), where a delay of the order of milliseconds can have a significant impact. That being said, Flink is still very much a work in progress and cannot yet stake a claim to replace Spark.
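The latency cost of micro-batching can be illustrated with a simplified model in plain Python (neither Spark nor Flink code). It assumes that processing itself takes no time, that a batch is processed exactly when its interval closes, and that an event arriving exactly on a boundary joins the next batch. The function names are hypothetical.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms):
    """Wait time of each event until the batch containing it is processed."""
    latencies = []
    for arrival in arrivals_ms:
        # The event is held until the end of the interval it arrived in.
        boundary = (arrival // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(boundary - arrival)
    return latencies


def per_event_latencies(arrivals_ms):
    """In an event-at-a-time model each event is handled on arrival."""
    return [0 for _ in arrivals_ms]


arrivals = [0, 120, 499, 500, 730]  # arrival times in milliseconds
batched = micro_batch_latencies(arrivals, batch_interval_ms=500)
streamed = per_event_latencies(arrivals)
```

With a 500 ms batch interval, `batched` is `[500, 380, 1, 500, 270]`: an event can wait up to a full interval before it is processed, while the event-at-a-time model adds no waiting of this kind. Real systems add processing and network time on top of both figures.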
Flink is a stream processing framework that can also run chores requiring batch processing, giving you the option to use the same algorithm in both modes, without having to turn to a technology like Apache Storm when low-latency responses are required.
Both Spark and Flink support in-memory processing, which gives them a distinct speed advantage over other frameworks. When it comes to real-time processing of incoming data, Spark does not stand up against Flink, though its micro-batching gives it the capability to carry out near-real-time processing tasks.
Both Spark and Flink can handle iterative, in-memory processing. When it comes to speed, Flink gets the upper hand, as it can be programmed to process only the data that has changed.
Growth stories – Spark and Flink
Any software framework needs more than technical merit to help businesses derive the maximum value. In this section we dig into the Apache Spark 2015 Year in Review article by Databricks to see how Spark has fared in the global community of users and developers. The year saw 4 releases (1.3 to 1.6), each one with hundreds of fixes to improve the framework. What caught our eye is the growth in the number of contributing developers – from 500 in 2014 to over 1000 in 2015! Another noticeable thing about Spark is the ease with which its users transition to new versions. The report mentions that a majority of users adopt the latest release within three months. These facts enhance its reputation as the most actively developed (and adopted) open-source data tool.
Flink has been relatively late to the race, but the 2015 year in review on its official website shows why it is here to stay as one of the most complete open-source stream processing frameworks available. Flink’s GitHub repository shows that the community doubled in size in 2015 – from 75 contributors to 150. Repository forks more than tripled in the year, and so did the number of stars. Starting out from Berlin, Germany, Flink has seen its user community grow across continents to North America and Asia. The Flink Forward Conference was another milestone: it drew over 250 participants, more than 100 of whom travelled from across the globe to attend technical talks from organizations including Google, MongoDB, Telecom, NFLabs, RedHat, IBM, Huawei, Ericsson, Capital One, Amadeus and many more.
Though it is still too early to single out one of these two as a clear winner, we are of the view that instead of having many frameworks do the same thing, the tech world would be better served by new entrants that do different things and complement the existing frameworks rather than compete against them.