Hadoop and Spark are popular Apache projects in the big data ecosystem. Apache Spark is an improvement on the original Hadoop MapReduce component of the Hadoop big data ecosystem. There is great excitement around Apache Spark because it provides a real advantage in interactive interrogation of in-memory data sets and in multi-pass iterative machine learning algorithms. However, there is a heated debate on whether Spark can challenge Apache Hadoop by replacing it and becoming the top big data analytics tool. What follows is a detailed Spark vs. Hadoop comparison that helps users understand why Spark is faster than Hadoop.
|Apache Spark|Apache Hadoop|
|---|---|
|Easy to program and does not require any abstractions.|Difficult to program and requires abstractions.|
|Programmers can perform streaming, batch processing and machine learning, all in the same cluster.|Used for generating reports that help find answers to historical queries.|
|Has a built-in interactive mode.|No built-in interactive mode, other than through tools like Pig and Hive.|
|Executes jobs 10 to 100 times faster than Hadoop MapReduce.|Hadoop MapReduce does not leverage the memory of the Hadoop cluster to the maximum.|
|Programmers can process data in real time through Spark Streaming.|Allows you to process only a batch of stored data.|
Various approaches in the world of big data try to make Apache Hadoop a good fit for iterative data processing, interactive queries and ad hoc queries. However, every Hadoop user knows that the Hadoop MapReduce framework is meant mainly for batch processing, so using it for machine learning, ad hoc data exploration and similar workloads is not apt.
Most big data vendors have been working to find an ideal solution to this challenging problem, and that has paved the way for a very popular and much-demanded alternative named Apache Spark. Spark makes development a pleasurable activity and has a better-performing execution engine than MapReduce, while using the same storage engine, Hadoop HDFS, to process huge data sets.
Apache Spark has attracted a great deal of attention in the past few months and is now regarded as the most active project in the Hadoop ecosystem.
Before discussing what gives Apache Spark its edge over Hadoop MapReduce, let us briefly look at what Apache Spark actually is and then move on to the differences between the two.
Spark is a fast cluster computing system that originated in UC Berkeley's AMP Lab and has been developed with contributions from about 250 developers across 50 companies. Its goal is to make data analytics faster and easier both to write and to run.
Apache Spark is open source and available as a free download, which makes it a user-friendly face of distributed programming for big data. Spark follows a general execution model that supports in-memory computing and optimization of arbitrary operator graphs, so querying data becomes much faster than with disk-based engines like MapReduce.
Apache Spark has a well-designed application programming interface that consists of parallel collections with methods such as groupByKey, map and reduce, so it feels as though you are programming locally. With Apache Spark you can write collection-oriented algorithms using the functional programming language Scala.
MapReduce, which was envisioned at Google and successfully implemented in Apache Hadoop, is an extremely well-known and widely used execution engine. Many applications already know how to decompose their work into a sequence of MapReduce jobs, and all these production applications will have to continue operating without any change.
However, users have consistently complained about the high latency of Hadoop MapReduce, stating that the batch-mode response of these applications is highly painful when it comes to processing and analyzing data.
This paved the way for Apache Spark, a successor system that is more powerful and flexible than Hadoop MapReduce. It might not be possible for every existing or future application to abandon Hadoop MapReduce completely, but most future applications have scope to use a general-purpose execution engine such as Spark, which comes with many more innovative features and can accomplish much more than is possible with Hadoop MapReduce.
Apache Spark is an open source standalone project that was developed to work together with HDFS. Spark already has a huge community of vocal contributors and users, because programming with Spark using Scala is much easier and Spark is much faster than the Hadoop MapReduce framework, both on disk and in memory.
Thus, Spark is an apt choice for future big data applications that are likely to require lower-latency queries, iterative computation and real-time processing on similar data.
Spark has many advantages over the Hadoop MapReduce framework, both in the wide range of computing workloads it can handle and in the speed at which it executes batch processing jobs.
Spark has been reported to execute batch processing jobs about 10 to 100 times faster than the Hadoop MapReduce framework, mainly by cutting down on the number of reads and writes to disk.
In MapReduce, each round of Map and Reduce tasks is followed by a synchronization barrier, and the data must be written to disk at that point. This feature of the MapReduce framework was designed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum.
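The map, barrier, reduce flow described above can be sketched in plain Python. This is a single-process illustration rather than Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are invented for this sketch, and on a real cluster the grouped data would be written to disk at the barrier.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like a word-count mapper."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Barrier: all map output is grouped by key before any reducer starts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word, like a word-count reducer."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["spark and hadoop", "hadoop mapreduce", "spark streaming and spark sql"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

No reducer can start until `shuffle` has seen every mapper's output, which is the barrier the text refers to.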
With Spark, however, the concept of RDDs (Resilient Distributed Datasets) lets you keep data in memory and write it to disk only when required, and it avoids the write-to-disk synchronization barriers that could slow down the process. Thus the general execution engine of Spark, by using memory, is much faster than Hadoop MapReduce.
Organizations can now simplify their data processing infrastructure, because with Spark it is possible to perform streaming, batch processing and machine learning all in the same cluster.
Most real-world applications use Hadoop MapReduce to generate reports that help answer historical queries, and then deploy an altogether different system for stream processing in order to get key metrics in real time. Organizations thus have to manage and maintain separate systems and develop applications for both computational models.
With Spark, however, these complexities can be eliminated, because both stream and batch processing can be implemented on the same system, which simplifies the development, deployment and maintenance of the application. Spark can handle different kinds of workloads, so when various workloads interact in the same process they are easier to manage and secure, something that is a limitation with MapReduce.
With Hadoop MapReduce you only get to process a batch of stored data, but with Spark it is also possible to process and transform data in real time through Spark Streaming.
With Spark Streaming, data can be passed through various software functions, for instance to perform analytics on the data as and when it is collected.
Developers can also use Apache Spark for graph processing, which maps the relationships in data among entities such as people and objects. Organizations can also use Apache Spark with predefined machine learning libraries, so that machine learning can be performed on data stored in various Hadoop clusters.
Spark ensures lower-latency computations by caching partial results in the memory of its distributed workers, unlike MapReduce, which is completely disk oriented. Spark is steadily turning out to be a huge productivity boost compared to writing complex Hadoop MapReduce pipelines.
Spark code is almost always more compact than the equivalent Hadoop MapReduce code. Comparing the classic word count program written for Spark with the same program written for Hadoop MapReduce makes this evident: the Hadoop MapReduce version is noticeably longer and more verbose.
Apache Hadoop and Apache Spark both provide a good level of fault tolerance, however, the approach used by these two systems to ensure fault tolerance varies significantly.
Hadoop has fault tolerance due to its mode of operation. In Hadoop, there is a cluster of machines and the data is replicated across multiple machines, also known as nodes, in the cluster. If there is a failure in one of the nodes, either due to a system failure or due to a planned exit for any maintenance related issues, the system can proceed by tracking any missing data from other nodes which have replicas of the data. There is a master node that keeps a track of the status of all the slave nodes. The slave nodes are required to send messages to the master node periodically. These are known as Heartbeat messages. If the master node does not receive these messages from a particular slave node for more than 10 minutes, the data being replicated onto that particular slave node is copied to another slave node. The master node, however, is a single point of failure for the cluster. If the master node fails, data is not lost but the cluster will be down for a while.
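The heartbeat mechanism described above can be sketched as follows. This is a toy model, not Hadoop's implementation: the function names, the node names and the rule for choosing a replacement node are assumptions made for illustration, and only the 10-minute limit comes from the text.

```python
HEARTBEAT_TIMEOUT_SECONDS = 10 * 60  # the 10-minute limit mentioned in the text

def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECONDS):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)

def re_replicate(block_locations, dead_nodes, live_nodes):
    """Replace every replica held by a dead node with a copy on a live node."""
    repaired = {}
    for block, holders in block_locations.items():
        survivors = [node for node in holders if node not in dead_nodes]
        # Hypothetical placement rule: first live nodes that lack the block.
        candidates = [node for node in live_nodes if node not in survivors]
        missing = len(holders) - len(survivors)
        repaired[block] = survivors + candidates[:missing]
    return repaired

# Timestamps are seconds; node2 was last heard from 650 seconds ago.
last_heartbeat = {"node1": 1000, "node2": 350, "node3": 990}
dead = find_dead_nodes(last_heartbeat, now=1000)

block_locations = {"blk_1": ["node1", "node2"], "blk_2": ["node2", "node3"]}
repaired = re_replicate(block_locations, dead, live_nodes=["node1", "node3", "node4"])
print(dead, repaired)
```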
In Spark, fault tolerance is achieved by means of RDDs. A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. In case of a failure, Spark tracks how the immutable dataset was created and recomputes it accordingly. To do this, Spark uses Directed Acyclic Graphs (DAGs) to track the workflows and rebuild the data in the cluster.
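The idea of rebuilding lost data from its lineage can be illustrated with a small stand-in class. `LineageDataset` and its methods are hypothetical names for this sketch and not part of Spark's API. The class only models a linear chain of map steps over stable source partitions.

```python
class LineageDataset:
    """Toy stand-in for an RDD: partitions plus the recipe that built them."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # stable input, e.g. files on HDFS
        self.transforms = tuple(transforms)          # lineage: ordered functions
        self.cache = {}                              # partition index -> computed data

    def map(self, fn):
        """Return a new dataset; the original is never modified."""
        return LineageDataset(self.source_partitions, self.transforms + (fn,))

    def compute(self, index):
        """Compute one partition by replaying the lineage, reusing the cache."""
        if index not in self.cache:
            data = self.source_partitions[index]
            for fn in self.transforms:
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one cached partition."""
        self.cache.pop(index, None)

base = LineageDataset([[1, 2], [3, 4]])
derived = base.map(lambda x: x * 2).map(lambda x: x + 1)

before = derived.compute(1)   # computed and cached
derived.lose_partition(1)     # the "node" holding partition 1 fails
after = derived.compute(1)    # rebuilt from the source and the lineage
print(before, after)
```

Only the lost partition is recomputed; partition 0 is never touched, which is why lineage-based recovery can be cheaper than restoring a full replica.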
Spark supports authentication for RPC channels by means of a shared secret. Event logging and Web UIs in Spark can be secured by using javax servlet filters. In addition, since Spark can use HDFS and run on YARN, it can make use of Kerberos for authentication, file permissions on HDFS, and encryption between nodes. Hadoop MapReduce provides the same security benefits as Hadoop and has more fine-grained security features, which are available from HDFS. It can also integrate with other Hadoop tools, including Knox Gateway and Apache Sentry.
In terms of security features, Spark is not as advanced as MapReduce. Another issue is that security in Spark is turned off by default, which can leave the application vulnerable to attack.
In Hadoop, all files passed into HDFS (Hadoop Distributed File System) are divided into blocks based on a configured block size. Each block is replicated a specific number of times across the nodes of the cluster; the number of replicas is determined by the replication factor. The NameNode keeps track of the cluster: it assigns blocks to the various DataNodes, and the blocks are written to the DataNodes they are assigned to. The MapReduce engine is built on top of HDFS and has its own JobTracker. Whenever an application has to be executed, the JobTracker accepts the job and assigns the work to TaskTrackers, which run on the other nodes. YARN (Yet Another Resource Negotiator) is responsible for allocating the resources needed to perform these tasks, and it monitors and reallocates resources when needed for greater efficiency.
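A minimal sketch of block splitting and replica placement follows. The 128 MB block size, the replication factor of 3 and the DataNode names are example values, and the round-robin placement is an assumption made for simplicity. It is not how the NameNode actually chooses DataNodes.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, datanodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct DataNodes, round-robin."""
    if replication_factor > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + offset) % len(datanodes)]
            for offset in range(replication_factor)
        ]
    return placement

blocks = split_into_blocks(300)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```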
In Spark, computations are carried out in memory and kept there until the user actively decides to persist them. Spark first reads data from a file on a filestore (which can be HDFS too) through the SparkContext. From that data, Spark creates a structure containing a collection of immutable elements that can be operated on in parallel. This structure is known as a Resilient Distributed Dataset or RDD. While the RDD is being created, Spark also builds a Directed Acyclic Graph (DAG) that records the order of operations and their relationships. DAGs have stages and steps. RDDs support two kinds of operations: transformations, which are the intermediate steps, and actions, which are the final steps. Transformations are recorded in the DAG and evaluated lazily, so their results are not persisted to disk. Actions trigger the actual computation and either return the result to the driver program or write it to storage. DataFrames, an abstraction introduced in Spark 1.3 and unified with the Dataset API in Spark 2.0, are similar to RDDs, but their data is organized into named columns, which makes them more user-friendly than RDDs. SparkSQL allows querying of data from DataFrames.
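The difference between transformations and actions can be shown with a toy pipeline in plain Python. `LazyPipeline` is a hypothetical class for this sketch and not part of Spark: its `map` and `filter` only record a step, and nothing runs until the `collect` action is called.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)  # recorded plan, comparable to a linear DAG

    def map(self, fn):
        """Transformation: record the step and return a new pipeline."""
        return LazyPipeline(self.data, self.steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: record the step and return a new pipeline."""
        return LazyPipeline(self.data, self.steps + (("filter", predicate),))

    def collect(self):
        """Action: execute every recorded step and return the result."""
        result = list(self.data)
        for kind, fn in self.steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

calls = []

def square(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x * x

pipeline = LazyPipeline(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
evens = pipeline.collect()        # the action triggers the computation
print(calls_before_action, evens)
```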
ix) Spark vs. Hadoop - Machine learning
Spark and Hadoop are both provided with their own built-in libraries that can be used for the purpose of machine learning.
Hadoop provides Mahout as a library for machine learning. Mahout supports techniques such as classification, clustering, and batch-based collaborative filtering, which all run on top of MapReduce. Mahout is, however, being replaced by another tool called Samsara, which is a domain-specific language (DSL) written in Scala that provides a platform for in-memory and algebraic operations and also allows users to incorporate their own algorithms.
Spark provides MLlib as its library for machine learning. MLlib can be used for iterative machine learning applications in memory. The library is available in Java, Scala, Python, and R. It allows users to perform classification and regression, and to build user-defined machine learning pipelines with hyperparameter tuning. Since Spark allows machine learning algorithms to run in memory, it is found to be faster for several machine learning algorithms, including k-means and Naive Bayes.
Spark works better than Hadoop for iterative processing. Spark’s RDDs allow multiple map operations to be carried out in memory, whereas MapReduce has to write the intermediate results to disk.
Due to its faster computational speed, Spark is a better choice to handle real-time processing or for immediate insights.
Spark supports graph processing better than Hadoop since it is not only better for iterative computations but also has its own API called GraphX, solely to handle graph computations.
Spark works better for machine learning since it has a dedicated machine learning library, called MLlib, which has its own built-in algorithms that can also run in-memory.
Spark has also been found to be faster when working with certain machine learning applications such as Naive Bayes and k-means.
Spark is said to perform better than Hadoop in terms of processing speed. This is because Spark, unlike MapReduce, does not have to deal with input-output overhead every time it runs a task, and hence Spark is found to be much faster for many applications. In addition, Spark's DAG enables optimizations between steps. Since Hadoop does not track any relationship between the steps of a MapReduce pipeline, it cannot perform any performance tuning at that level.
Many users criticize Hadoop MapReduce as a bottleneck in Hadoop clusters, because MapReduce executes all jobs in batch mode, which means that analyzing data in real time is not possible. With the advent of Spark, which has proven to be a great alternative to Hadoop MapReduce, the biggest question on the minds of data scientists is: Hadoop vs. Spark, who wins the battle?
Apache Spark executes streaming jobs in micro-batches that are very short, approximately 5 seconds or less. Over time, Apache Spark has succeeded in providing more stability than the real-time, stream-oriented Hadoop frameworks.
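The micro-batch idea can be illustrated by grouping timestamped events into fixed 5-second windows. The function name `micro_batches` and the sample events are invented for this sketch, and a real streaming engine would also emit empty batches on a timer, which this simplified version skips.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds=5):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the batches in time order; intervals with no events are omitted.
    return [batches[key] for key in sorted(batches)]

events = [(0.5, "a"), (1.2, "b"), (5.1, "c"), (9.9, "d"), (12.0, "e")]
batches = micro_batches(events)
print(batches)
```

Each batch is then processed like a small batch job, which is how a batch engine can approximate stream processing.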
Nevertheless, every coin has two faces, and Spark comes with some drawbacks of its own: difficulty handling intermediate data that is larger than the memory of a node, problems in case of node failure and, most important of all, the cost factor.
Spark makes use of journaling (also known as “recomputation”) to provide resiliency in case of a node failure. As a result, the recovery behavior in case of node failure is similar to that of Hadoop MapReduce, except that the recovery process is much faster.
Spark also has a spill-to-disk feature: if a particular node has insufficient RAM to store the data partitions, Spark degrades gracefully to disk-based data handling. When it comes to cost, with street RAM prices at 5 USD per GB, about 1 TB of RAM costs roughly 5K USD, which makes memory a very minor fraction of the overall node cost.
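The cost estimate above is simple arithmetic and can be checked directly. The price per GB is the figure quoted in the text; counting 1024 GB per TB is an assumption of this sketch.

```python
PRICE_PER_GB_USD = 5  # street price quoted in the text
GB_PER_TB = 1024      # binary convention; with 1000 GB per TB the total is 5000 USD

ram_cost_usd = PRICE_PER_GB_USD * GB_PER_TB
print(ram_cost_usd)  # 5120, i.e. roughly 5K USD for 1 TB of RAM
```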
One great advantage of Hadoop MapReduce over Apache Spark is that if the data size is greater than the available memory, Spark will not be able to leverage its cache, and there is a good chance that it will be far slower than the batch processing of MapReduce.
If you are unsure whether to choose Hadoop MapReduce or Apache Spark, or in other words disk-based computing or RAM-based computing, the answer is straightforward: it all depends, and the variables on which the decision depends keep changing over time.
Nevertheless, current industry trends favor in-memory techniques like Apache Spark. To conclude, the choice between Hadoop MapReduce and Apache Spark depends on the use case, and no single choice fits every situation.