"Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.” - said Rajiv Bhat, senior vice president of data sciences and marketplace at InMobi.
Apache Spark is considered a powerful complement to Hadoop, big data's original technology of choice. Spark is a more accessible, powerful, and capable big data tool for tackling various big data challenges. With more than 500 contributors from across 200 organizations responsible for code, and a user base of 225,000+ members, Apache Spark has become the mainstream and most in-demand big data framework across all major industries.
E-commerce companies like Alibaba, social networking companies like Tencent, and the Chinese search engine Baidu all run Apache Spark operations at scale. Here are a few features that are responsible for its popularity.
Fast Processing Speed: The first and foremost advantage of using Apache Spark for your big data is speed – it can run workloads up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce.
Supports a variety of programming languages: Spark applications can be implemented in a variety of languages like Scala, R, Python, Java, and Clojure. This makes it easy for developers to work in the language they prefer.
Powerful Libraries: Spark contains more than just map and reduce functions. It includes libraries for SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, which offer powerful tools for data analytics.
Near real-time processing: Spark can run batch jobs over data stored in Hadoop, and Spark Streaming can handle incoming data in near real-time.
Compatibility: Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it can access diverse data sources.
Now that you are aware of its exciting features, let us explore Spark architecture to realize what makes it so special. This article is a single-stop resource that gives an overview of Spark architecture with the help of a Spark architecture diagram, and it is a good beginner's resource for anyone looking to learn Spark.
Apache Spark has a well-defined and layered architecture where all the Spark components and layers are loosely coupled and integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions – the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).
RDDs are collections of data items that are split into partitions and can be stored in memory on the worker nodes of the Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs – Hadoop Datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations – transformations and actions. An important property of RDDs is that they are immutable: a transformation never modifies an RDD in place and never returns a single value. Instead, a transformation function simply reads an RDD and generates a new RDD. An action, on the other hand, evaluates the RDD and produces a value. When an action is applied to an RDD, all the pending data processing requests are evaluated at that time and the resulting value is returned to the driver.
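The lazy-transformation versus eager-action contract described above can be sketched in plain Python. This is a toy stand-in, not the real Spark API: the class `ToyRDD` and its methods are invented for illustration only.

```python
# Toy illustration of Spark's RDD contract: transformations are lazy
# (they only record lineage and return a new, immutable "RDD"),
# while actions trigger evaluation and return a value to the caller.
# ToyRDD is a hypothetical class, not part of any Spark library.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data          # source partition data
        self._lineage = lineage    # recorded transformations, not yet run

    def map(self, fn):             # transformation: returns a NEW ToyRDD
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):        # transformation: also lazy
        return ToyRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):             # action: evaluates the lineage now
        items = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

base = ToyRDD([1, 2, 3, 4])
doubled_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
# Nothing has been computed yet, and `base` is untouched (immutability).
print(doubled_evens.collect())   # [4, 8]
```

Note how each transformation only appends to the recorded lineage; work happens solely inside the action `collect`, mirroring the evaluation order described above.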
Directed – each transformation transitions a data partition from state A to state B.
Acyclic – transformations cannot return to an older partition state.
A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
Apache Spark follows a master/slave architecture with two main daemons – the master daemon (the driver process) and the worker daemon (the executor process) – and a cluster manager.
Spark Architecture Diagram – Overview of Apache Spark Cluster
A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.
For classic Hadoop platforms, it is true that handling complex assignments requires developers to link together a series of MapReduce jobs and run them in sequence. Each job has high latency, and the output data of each step has to be saved to HDFS before the next step can start. The advantage of having DAGs and RDDs is that they replace disk I/O with in-memory operations and support in-memory data sharing across DAGs, so that different jobs can be performed on the same data, allowing complicated workflows.
Spark Driver – Master Node of a Spark Application
It is the central point and the entry point of the Spark Shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the SparkContext and RDDs are created, and where transformations and actions are performed. The Spark driver contains various components – the DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager – responsible for translating Spark user code into actual Spark jobs executed on the cluster.
The Spark driver performs two main tasks: converting user programs into tasks and scheduling the execution of those tasks on executors.
An executor is a distributed agent responsible for the execution of tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application, a pattern known as "static allocation of executors". However, users can also opt for dynamic allocation, in which Spark executors are added or removed dynamically to match the overall workload.
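As an illustration, dynamic allocation is switched on through configuration properties. The property and flag names below are standard Spark settings; the application name and the numeric values are placeholders, not recommendations.

```shell
# Hypothetical spark-submit invocation enabling dynamic executor allocation.
# Property names are standard Spark configuration; values are placeholders.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  my_app.py
```

With these settings, Spark scales the executor count between the configured minimum and maximum as the number of pending tasks changes; the external shuffle service preserves shuffle output when executors are removed.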
The cluster manager is an external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. A Spark application can leverage several different cluster managers for the allocation and deallocation of physical resources, such as memory and CPU cores for client Spark jobs: Hadoop YARN, Apache Mesos, Kubernetes, or the simple standalone Spark cluster manager. Any of these can be launched on-premise or in the cloud for a Spark application to run.
Cluster Mode: In this mode, the driver runs inside the standalone cluster as another process on one of the worker nodes, and then it connects back to request executors.
One important point to note about the Standalone cluster manager is that it spreads out each application over the maximum number of executors by default.
Choosing a cluster manager for a Spark application depends on the goals of the application, because cluster managers provide different sets of scheduling capabilities. To get started with Apache Spark, the standalone cluster manager is the easiest one to use when developing a new Spark application.
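The cluster manager is selected through the master URL passed to the application. The URL schemes below are the standard ones; the host names, ports, and application file are placeholders.

```shell
# Hypothetical --master values, one per cluster manager.
# URL schemes are standard; hosts, ports, and app.py are placeholders.
spark-submit --master local[4]                  app.py   # local threads, no cluster
spark-submit --master spark://host:7077         app.py   # Spark standalone cluster
spark-submit --master yarn                      app.py   # Hadoop YARN
spark-submit --master mesos://host:5050         app.py   # Apache Mesos
spark-submit --master k8s://https://host:443    app.py   # Kubernetes
```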
When a client submits Spark user application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, like pipelining transformations, and then it converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. The tasks are then bundled and sent to the Spark cluster.
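The pipelining optimization mentioned above can be sketched in plain Python: a chain of per-element (narrow) transformations is fused into a single function, so each element flows through the whole chain in one pass instead of materializing an intermediate collection per step. This is a toy model of the idea, not Spark's actual planner; `pipeline` is a hypothetical helper.

```python
# Toy model of pipelining: fuse a chain of per-element (narrow)
# transformations into one function, so a single pass over the
# partition executes the whole stage.
def pipeline(*fns):
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

stage = pipeline(lambda x: x + 1, lambda x: x * 10)
partition = [1, 2, 3]
print([stage(x) for x in partition])   # [20, 30, 40]
```

Because the fused function touches each element once, no intermediate list is built between the two transformations – the same reason Spark runs pipelined transformations within a single stage.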
The driver program then talks to the cluster manager and negotiates for resources. The cluster manager launches executors on the worker nodes on behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. Before executors begin execution, they register themselves with the driver program, so that the driver has a holistic view of all the executors. The executors then start executing the various tasks assigned by the driver program, and the driver monitors the set of running executors for as long as the Spark application is running. The driver program also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, it terminates all the executors and releases the resources from the cluster manager.
At a high level, the structure of a Spark program is: RDDs are created from the input data, new RDDs are derived from existing RDDs using different transformations, and then an action is performed on the data. In any Spark program, the DAG of operations is created implicitly, and whenever the driver runs, the Spark DAG is converted into a physical execution plan.
spark-submit is the single script used to submit a Spark program; it launches the application on the cluster. The script provides multiple options to connect with different cluster managers and to control the number of resources the application gets. For some cluster managers, spark-submit can run the driver within the cluster (e.g., on a YARN worker node), while for others it runs only on the local machine.
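A typical submission might look like the following. The flag names are standard spark-submit options; the class name, jar path, and resource sizes are placeholders for illustration.

```shell
# Hypothetical spark-submit call: flags are standard, values are placeholders.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --num-executors 4 \
  my-app.jar
```

Here --deploy-mode cluster asks YARN to run the driver on a worker node inside the cluster; with --deploy-mode client, the driver would instead run on the machine where spark-submit was invoked.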