Apache Spark Architecture Explained in Detail


"Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.” -  said Rajiv Bhat, senior vice president of data sciences and marketplace at InMobi.

Apache Spark Architecture Explained

Apache Spark is considered a powerful complement to Hadoop, big data's original technology of choice. Spark is a more accessible, powerful, and capable big data tool for tackling various big data challenges. With more than 500 contributors from across 200 organizations and a user base of 225,000+ members, Apache Spark has become one of the most mainstream and in-demand big data frameworks across all major industries.

E-commerce companies like Alibaba, social networking companies like Tencent, and the Chinese search engine Baidu all run Apache Spark operations at scale. Here are a few features that are responsible for its popularity.

  1. Fast Processing Speed: The first and foremost advantage of using Apache Spark for your big data is that it runs workloads up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce.

  2. Supports a variety of programming languages: Spark applications can be implemented in a variety of languages like Scala, R, Python, Java, and Clojure. This makes it easy for developers to work according to their preferences.

  3. Powerful Libraries: It contains more than just map and reduce functions. It contains libraries SQL and dataframes, MLlib (for machine learning), GraphX, and Spark streaming which offer powerful tools for data analytics.

  4. Near real-time processing: Spark can batch-process data stored in Hadoop, and its Spark Streaming module can also handle data in near real-time.

  5. Compatibility: Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can also access diverse data sources such as HDFS, Cassandra, HBase, and S3.

Now that you are aware of its exciting features, let us explore Spark architecture to see what makes it so special. This article is a single-stop resource that gives a Spark architecture overview with the help of a Spark architecture diagram, and is a good resource for beginners looking to learn Spark.




Understanding Apache Spark Architecture

Apache Spark has a well-defined and layered architecture where all the Spark components and layers are loosely coupled and integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions:

  • Resilient Distributed Datasets (RDD)
  • Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

RDDs are collections of data items that are split into partitions and can be stored in-memory on the worker nodes of a Spark cluster. In terms of datasets, Apache Spark supports two types of RDDs – Hadoop datasets, which are created from files stored on HDFS, and parallelized collections, which are based on existing Scala collections. Spark RDDs support two different types of operations – transformations and actions. An important property of RDDs is that they are immutable, so a transformation never modifies an RDD in place and never returns a single value. Instead, a transformation function simply reads an RDD and generates a new RDD. An action operation, on the other hand, evaluates and produces a value. When an action function is applied to an RDD object, all the pending data processing requests are evaluated at that time and the resulting value is returned.
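This laziness can be sketched in pure Python. The following is a toy model, not Spark's actual API – `ToyRDD` and its methods are invented for illustration; the point is that transformations only record a step and return a new immutable object, while an action walks the recorded lineage and computes a value:

```python
# Toy model of RDD laziness -- NOT Spark's API, just an illustration.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data          # source partition data
        self._lineage = lineage    # recorded transformations

    # --- transformations: return a NEW ToyRDD, compute nothing yet ---
    def map(self, f):
        return ToyRDD(self._data, self._lineage + (("map", f),))

    def filter(self, p):
        return ToyRDD(self._data, self._lineage + (("filter", p),))

    # --- action: evaluate the whole lineage and return a plain value ---
    def collect(self):
        out = list(self._data)
        for kind, f in self._lineage:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

base = ToyRDD(range(6))
doubled_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(doubled_evens.collect())   # [0, 4, 8]
```

Note that building `doubled_evens` touches no data at all; only the `collect()` call triggers computation, and `base` itself is left unchanged.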



Directed Acyclic Graph (DAG)

Directed – each transformation is an operation that transitions a data partition from state A to state B.

Acyclic – a transformation can never return to an older partition, so the graph contains no cycles.

A DAG is a sequence of computations performed on data, where each node is an RDD partition and each edge is a transformation on top of the data. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
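The idea can be sketched with Python's standard-library `graphlib`. The stage names below are hypothetical; the point is only that an acyclic dependency graph always yields a valid execution order:

```python
# Toy DAG of computation stages -- stage names are invented for illustration.
from graphlib import TopologicalSorter  # Python 3.9+

# stage -> the stages it depends on
dag = {
    "read":    set(),
    "map":     {"read"},
    "filter":  {"read"},
    "join":    {"map", "filter"},
    "collect": {"join"},
}

# Because the graph is acyclic, a topological order always exists:
# every stage appears after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Independent stages ("map" and "filter" here) have no ordering constraint between them, which is exactly what lets a scheduler run them in parallel.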

Spark Architecture Overview

Apache Spark follows a master/slave architecture with two main daemons and a cluster manager –

  1. Master Daemon – (Master/Driver Process)
  2. Worker Daemon – (Slave Process)

Spark Architecture Diagram

Spark Architecture Diagram – Overview of Apache Spark Cluster

A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on the same machine (a horizontal Spark cluster), on separate machines (a vertical Spark cluster), or in a mixed machine configuration.

For classic Hadoop platforms, handling complex assignments requires developers to link together a series of MapReduce jobs and run them in a sequential manner, where each job has high latency: the output data of each step has to be saved to HDFS before the next process can start. The advantage of having DAGs and RDDs is that they replace disk I/O with in-memory operations and support in-memory data sharing across DAGs, so that different jobs can be performed on the same data, allowing complicated workflows.


Role of Driver in Spark Architecture

Spark Driver – Master Node of a Spark Application

It is the central point and the entry point of the Spark shell (Scala, Python, and R). The driver program runs the main() function of the application and is the place where the SparkContext and RDDs are created, and also where transformations and actions are performed. The Spark driver contains various components – DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager – responsible for translating Spark user code into actual Spark jobs executed on the cluster.

Spark Driver performs two main tasks: Converting user programs into tasks and planning the execution of tasks by executors. A detailed description of its tasks is as follows:

  • The driver program that runs on the master node of the spark cluster schedules the job execution and negotiates with the cluster manager.
  • It translates the RDD’s into the execution graph and splits the graph into multiple stages.
  • The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
  • Cockpit of jobs and tasks execution – the driver program converts a user application into smaller execution units known as tasks. Tasks are then executed by the executors, i.e., the worker processes which run individual tasks.
  • After the task has been completed, all the executors submit their results to the Driver.
  • Driver exposes the information about the running spark application through a Web UI at port 4040.
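The division of labor above can be sketched in pure Python. This is a toy model, not Spark code: the thread pool stands in for executor processes, the function names are invented, and the "driver" is simply the main program that partitions the data, hands out one task per partition, and combines the partial results:

```python
# Toy sketch of the driver/executor split -- names invented for illustration.
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # an "executor" computes a partial result for one partition
    return sum(x * x for x in partition)

# "driver": split the dataset into partitions, one task per partition
data = list(range(10))
partitions = [data[i::4] for i in range(4)]

# "executors": worker threads run the tasks in parallel
with ThreadPoolExecutor(max_workers=4) as workers:
    partials = list(workers.map(run_task, partitions))

# "driver": combine the results the executors sent back
print(sum(partials))  # 285 == 0**2 + 1**2 + ... + 9**2
```

The real system adds fault tolerance, data locality, and shuffle machinery on top of this pattern, but the task-per-partition shape is the same.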


Role of Executor in Spark Architecture

An executor is a distributed agent responsible for the execution of tasks. Every spark application has its own executor process. Executors usually run for the entire lifetime of a Spark application and this phenomenon is known as “Static Allocation of Executors”. However, users can also opt for dynamic allocations of executors wherein they can add or remove spark executors dynamically to match with the overall workload.
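Dynamic allocation is switched on through configuration properties. A minimal `spark-defaults.conf` sketch is shown below; the executor counts are example values, not recommendations:

```properties
# Illustrative spark-defaults.conf entries enabling dynamic allocation;
# the executor counts are example values, not recommendations.
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
# Typically also requires the external shuffle service, so executors can
# be removed without losing shuffle data:
spark.shuffle.service.enabled          true
```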

  • The executor performs all the data processing and returns the results to the driver.
  • Reads from and writes data to external sources.
  • The executor stores computation results in memory, in cache, or on hard disk drives.
  • Interacts with the storage systems.
  • Provides in-memory storage for RDDs that are cached by user programs, via a service called the Block Manager that resides within each executor. Because RDDs are cached directly inside executors, tasks can run in parallel with the cached data.
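As a rough illustration of the caching idea, here is a toy model in which a plain dict stands in for the executor's block store; the names are invented, not Spark internals:

```python
# Toy "block manager" cache -- a plain dict, names invented for illustration.
block_store = {}

def compute_partition(pid):
    # stand-in for an expensive computation over one partition
    return [x * x for x in range(pid * 3, pid * 3 + 3)]

def get_partition(pid):
    # tasks check the cache first, so repeated jobs reuse cached data
    if pid not in block_store:
        block_store[pid] = compute_partition(pid)
    return block_store[pid]

first = get_partition(0)
second = get_partition(0)        # served from the cache this time
print(first, second is first)    # [0, 1, 4] True
```

The second lookup returns the very same cached object instead of recomputing it, which is the benefit persisted RDD partitions give to later tasks.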

Role of Cluster Manager in Spark Architecture

The cluster manager is an external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. There are several types of cluster managers a Spark application can leverage for the allocation and deallocation of various physical resources, such as memory for client Spark jobs, CPU cores, etc. – Hadoop YARN, Apache Mesos, Kubernetes, or the simple standalone Spark cluster manager; any of them can be launched on-premise or in the cloud for a Spark application to run.

  • Standalone Cluster Manager
    Standalone Cluster Manager of Apache Spark provides an effortless method of executing applications on a cluster. It contains one master and several workers, each having a configured size of memory and CPU cores. When one submits an application, they can decide beforehand what amount of memory the executors will use, and the total number of cores for all executors. One can run the Standalone cluster manager either by starting a master and workers manually or through launch scripts of Spark’s ‘sbin’ directory.

    There are two deploy modes that the Standalone cluster manager offers for where the driver program of an application can execute. They are:
    1. Client Mode (Default Mode): In this mode, the driver is launched on the machine where the spark-submit command was executed.
    2. Cluster Mode: In this mode, the driver runs inside the standalone cluster as another process on one of the worker nodes, and then links back to request executors.

One important point to note about the Standalone cluster manager is that it spreads out each application over the maximum number of executors by default.

  • Hadoop YARN
    YARN is another option for Cluster Manager in Spark. It was introduced in Hadoop 2.0 and supports utilizing varied data processing frameworks on a distributed resource pool. It is essentially placed on the same nodes as Hadoop’s Distributed File System (HDFS).

Which Cluster Manager should we use and when?

Choosing a cluster manager for any Spark application depends on the goals of the application, because all cluster managers provide different sets of scheduling capabilities. To get started with Apache Spark, the standalone cluster manager is the easiest one to use when developing a new Spark application.


Understanding the Run-Time Architecture of a Spark Application

What happens when a Spark Job is submitted?

When a client submits Spark user application code, the driver implicitly converts the code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, the driver program also performs certain optimizations, like pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units referred to as tasks under each stage. The tasks are then bundled to be sent to the Spark cluster.

The driver program then talks to the cluster manager and negotiates for resources. The cluster manager then launches executors on the worker nodes on behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. Before the executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors. Now the executors start executing the various tasks assigned by the driver program. At any point of time while the Spark application is running, the driver program will monitor the set of executors that run. The driver program in the Spark architecture also schedules future tasks based on data placement by tracking the location of cached data. When the driver program's main() method exits, or when it calls the stop() method of the SparkContext, it will terminate all the executors and release the resources from the cluster manager.

The structure of a Spark program at a higher level is: RDDs are created from the input data, and new RDDs are derived from the existing RDDs using different transformations, after which an action is performed on the data. In any Spark program, the DAG operations are created by default, and whenever the driver runs, the Spark DAG is converted into a physical execution plan.

Launching a Spark Program

spark-submit is the single script used to submit a Spark program; it launches the application on the cluster. There are multiple options through which the spark-submit script can connect with different cluster managers and control the number of resources the application gets. For some cluster managers spark-submit can run the driver within the cluster (for example, on a YARN worker node), while for others it runs only on the local machine.
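A hypothetical invocation is shown below, assuming a standalone cluster master at `master-host:7077` and an application file `my_app.py` (both placeholders); the flags used are standard spark-submit options:

```shell
# Illustrative spark-submit invocation -- host, port, and app name are
# placeholders. --deploy-mode cluster runs the driver on a worker node;
# the resource flags cap executor memory and the total executor cores.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --total-executor-cores 8 \
  my_app.py
```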
