No discussion at the top big data conferences in 2016 is complete without a debate over which framework to choose for your next big data deployment: Hadoop or Spark. Apache Spark is currently winning the popularity vote, but Hadoop still holds the top position as the big data framework of choice. Hadoop does not have a monopoly on big data, yet there is a stubborn misconception that Apache Spark is an alternative to Hadoop and is likely to bring the Hadoop era to an end. Framing the question as “Hadoop vs Spark” is misleading, because the two frameworks are not mutually exclusive and often work better when paired. Companies know that Hadoop and Spark are the go-to frameworks for working with big data, but they are often unsure whether they have to choose Apache Spark over Hadoop or vice versa. Let’s take a look at how Hadoop and Spark complement each other by working together as one big data system.
Say you are a Hadoop developer working on your very first project for your organization, analysing petabytes of data and extracting meaningful insights using a combination of Hadoop MapReduce jobs and SQL-on-Hadoop tools. Within a few weeks you notice that something other than Hadoop is trending in the big data space. All of a sudden, everyone is saying that Apache Spark is here to replace Hadoop and that companies are moving away from Hadoop towards Spark. You might have come across headlines like these in news blogs:
Apache Spark's Marriage to Hadoop Will Be Bigger Than Kim and Kanye - Forrester.com
Apache Spark: A Killer or Saviour of Apache Hadoop? - O’Reilly
Adios Hadoop, Hola Spark - t3chfest
All these headlines show the hype surrounding the fiery Spark vs Hadoop debate. Some of them claim that Hadoop is dead and that Apache Spark is replacing it. Should you stop working with the Hadoop ecosystem that you so diligently learnt and love using? The answer is a definite NO.
Hadoop forms a strong foundation for future big data initiatives, and Apache Spark is one of those initiatives, adding features such as in-memory processing and machine learning capabilities.
The Hadoop stack has evolved over time from SQL to interactive querying, and from the MapReduce processing framework to various lightning-fast processing frameworks like Apache Spark and Tez. Hadoop MapReduce and Spark were both developed to solve the problem of processing big data efficiently. Apache Hadoop is a foundational distributed computing framework for storing and distributing data across the nodes of a cluster of servers. Apache Spark was developed mainly to process big data more efficiently than Hadoop MapReduce, thanks to its in-memory processing capabilities. There has been a lot of excitement around Apache Spark, with growing numbers of contributors, enterprise adopters and learners.
Hadoop MapReduce is used for reliable batch processing of data stored in HDFS, whereas Apache Spark is used for stream processing and in-memory distributed processing for faster, near-real-time analysis.
Apache Hadoop has two main components: HDFS and YARN. The Hadoop Distributed File System (HDFS) lets users distribute huge amounts of data across the nodes in a cluster of servers. HDFS stores data cost-effectively because it runs on commodity hardware and does not require any specialized hardware. YARN is the resource management layer that schedules the processing of data stored in Hadoop. It can host various open source computing frameworks such as MapReduce, Tez or Apache Spark. So when people say that Spark is replacing Hadoop, what they actually mean is that big data professionals now prefer to use Apache Spark instead of Hadoop MapReduce for processing the data. MapReduce and Hadoop are not the same: MapReduce is just one component for processing data in Hadoop, and so is Spark.
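To make the idea of MapReduce as "just a processing component" more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, written in plain Python. It is only an illustration of the programming model, not Hadoop's actual API: the function names and the sample lines are made up for this example, and a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for lines of a file stored in HDFS.
sample_lines = ["Hadoop stores data", "Spark processes data", "Hadoop and Spark"]
counts = word_count(sample_lines)
print(counts)
```

Spark expresses the same kind of job with its own operators, which is why it can take over the processing role while the storage and resource management layers stay in place.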
Apache Spark is a data processing engine that can work on data stored in HDFS, as it does not have its own storage system for organizing distributed files. Spark processes large amounts of data resiliently and can run workloads such as machine learning up to 100 times faster than MapReduce when the data fits in memory.
The market research firm MarketAnalysis.com reports that the Hadoop market is anticipated to grow at a CAGR of 58%, crossing the $1 billion mark by the end of 2020. So this is definitely not the end of Hadoop; it is likely to keep adding value to organizational big data endeavours alongside Spark.
“Some people take Hadoop to mean a whole ecosystem (HDFS, Hive, MapReduce, etc.), in which case Spark is designed to fit well within the ecosystem (reading from any input source that MapReduce supports through the InputFormat interface, being compatible with Hive and YARN, etc.). Others refer to Hadoop MapReduce in particular, in which case I think it’s very likely that non-MapReduce engines will take over in a lot of domains, and in many cases they already have,” said Matei Zaharia, the CTO of Databricks.
Organizations can make the best use of Hadoop’s capabilities in production environments by integrating Spark with Hadoop. Apache Spark can run directly on top of Hadoop to leverage its storage and cluster manager, or it can run separately from Hadoop and integrate with other storage systems and cluster managers. Hadoop has built-in disaster recovery capabilities, so together the two can be used for data management and cluster administration for analysis workloads.
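The two deployment options described above mostly come down to which cluster manager a Spark application is submitted to. The sketch below builds the corresponding spark-submit command lines in Python. The `--master` and `--deploy-mode` options are standard spark-submit options, but the helper function, the application file `analysis_job.py` and the host `spark-master.example.com` are hypothetical names chosen for illustration.

```python
def spark_submit_command(app_path, cluster_manager, standalone_master=None):
    """Build a spark-submit argument list for the chosen cluster manager.

    cluster_manager: "yarn" to run on top of Hadoop, or "standalone" to use
    Spark's own cluster manager without Hadoop.
    """
    if cluster_manager == "yarn":
        # On Hadoop, YARN allocates the resources for the Spark application.
        master_args = ["--master", "yarn", "--deploy-mode", "cluster"]
    elif cluster_manager == "standalone":
        if not standalone_master:
            raise ValueError("standalone mode needs a spark://host:port URL")
        master_args = ["--master", standalone_master]
    else:
        raise ValueError(f"unsupported cluster manager: {cluster_manager}")
    return ["spark-submit", *master_args, app_path]


# Hypothetical application file and master host, for illustration only.
yarn_cmd = spark_submit_command("analysis_job.py", "yarn")
standalone_cmd = spark_submit_command(
    "analysis_job.py", "standalone", "spark://spark-master.example.com:7077"
)
print(" ".join(yarn_cmd))
print(" ".join(standalone_cmd))
```

The application code itself does not have to change between the two; only the submission target does, which is part of what makes it easy to add Spark to an existing Hadoop cluster.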
In sectors such as healthcare and finance, where data security is of critical importance, Hadoop and Spark can work together. Spark gains a security bonus from Hadoop, as it can use HDFS’s access control lists and file-level permissions. Hadoop also allows Spark workloads to be deployed on the available resources in a distributed cluster without anyone having to allocate and track every task manually.
Using Spark and Hadoop together helps users leverage the power of machine learning through Spark’s MLlib library. Machine learning algorithms can be executed faster in memory, unlike in Hadoop MapReduce, where data has to be moved to and from disk between processing steps. Apache Spark uses RDDs (Resilient Distributed Datasets) for faster data access, which adds value to a Hadoop cluster by reducing lag and improving performance. When part of the system fails, the lost RDD data can be recomputed from the recorded information about how it was derived.
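The recovery idea behind RDDs can be illustrated with a toy model in plain Python. This is not Spark's implementation or API: the `ToyDataset` class and its methods are invented for this sketch. It only shows the principle that a dataset which remembers its parent and the transformation applied to it can rebuild lost in-memory results instead of needing a stored copy.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # original records (root dataset only)
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to each parent record
        self._cache = None          # in-memory copy of the computed records

    def map(self, fn):
        """Record a new derived dataset; nothing is computed yet."""
        return ToyDataset(parent=self, transform=fn)

    def collect(self):
        """Return the records, recomputing from the lineage if not cached."""
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = [self.transform(x) for x in self.parent.collect()]
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory copy."""
        self._cache = None


# Hypothetical sample data and transformations.
readings = ToyDataset(source=[1, 2, 3, 4])
scaled = readings.map(lambda x: x * 10)
shifted = scaled.map(lambda x: x + 1)

before_failure = shifted.collect()
shifted.lose_cache()
scaled.lose_cache()
after_failure = shifted.collect()  # rebuilt by replaying the recorded steps
print(before_failure, after_failure)
```

Repeated calls to `collect()` are served from memory, and after the simulated failure the same result is rebuilt from the source data, which mirrors the two properties the paragraph above attributes to RDDs.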
Many organizations are already using Hadoop and Spark together.
Will the bond between Hadoop and Spark continue to blossom? That is the big “Big Data” question.
Apache Spark does not require Hadoop to run; it can also run on other storage systems. If Databricks, the company that leads the Spark community, develops its own file system so that Spark can exist as an independent big data ecosystem, then Spark will no longer need to rely on Hadoop to deliver the best performance. This implies that Hadoop and Spark may not continue to coexist if the Spark community develops its own Hadoop-less ecosystem.
There is also always a possibility that the open source Hadoop community and the top Hadoop vendors like Cloudera, Hortonworks or MapR could develop an open source technology that competes well with the features offered by Spark.
Spark and Hadoop each have their own specialities and excel in different respects, as described above, but they are designed to achieve the same goal. Apache Spark is not a challenger to Hadoop; it is meant to enhance the Hadoop stack. Organizations should consider Apache Spark as an additional capability that can be added to the existing Hadoop infrastructure, depending on the use case. When processing speed is a primary factor for data science applications, Apache Spark can join Hadoop on the big data scene to derive valuable insights. When the use case demands only ordinary processing speed and a limited set of tasks to be performed on the data, Hadoop alone is sufficient. There are many other scenarios, like the Internet of Things, where Hadoop and Spark make a lovely combination for faster analytics.
Apache Spark’s agility, speed and comparative ease of use complement Hadoop MapReduce’s low cost of operation on commodity hardware very well. There is no either/or proposition for Hadoop and Spark: organizations that use both frameworks in tandem can maximize their big data investments through faster analytics and better storage capabilities.
The Hadoop and Spark duo makes an excellent big data infrastructure for faster data processing and analytics. Do you agree? Let us know in the comments below, with some real-world examples where Hadoop and Spark make a perfect match.