Apache Spark Ecosystem and Spark Components

Synopsis of what constitutes the Apache Spark ecosystem and Spark components, language support, cluster management, execution engine and shared memory.


Last Updated: 14 Apr 2024 | BY ProjectPro

With the growing demand for real-time insights, Apache Spark has moved from a talking point in boardroom discussions to enterprise deployments in production. Apache Spark is not just a component of the Hadoop ecosystem; it has become the lingua franca of big data analytics for many organizations. With over 750 contributors from 200+ companies and an increasing number of use cases in industries like retail, healthcare, finance, advertising and education, Apache Spark continues to gain attention in the big data space. Spark promises faster data processing and easier development. How does Spark achieve this? To answer this question, let’s introduce the Apache Spark ecosystem and explain the Spark components that make Apache Spark fast and reliable. Many of these Spark components were built to resolve issues that cropped up while using Hadoop MapReduce.


Apache Spark Ecosystem
Language Support in Apache Spark
- 1) Scala
- 2) Python
Spark Components
Cluster Management in Apache Spark

Apache Spark Community Growth

Yahoo, Uber, Amazon, eBay, Pinterest, Spotify, Baidu, Alibaba, Shopify and Verizon are some of the top companies already using Spark in production.
IBM, Cloudera, DataStax and BlueData provide commercialized Spark distributions.
The largest known Apache Spark cluster has 8000 nodes.
Spark had 14,763 commits from 818 contributors as of February 17, 2016.

These facts and figures show how the Spark ecosystem has grown since 2010, with the development of various libraries and frameworks that allow faster and more advanced data analytics than Hadoop.


Apache Spark Ecosystem

Spark Ecosystem and Its Components

Apache Spark is a powerful alternative to Hadoop MapReduce, with rich functionality such as machine learning, real-time stream processing and graph computation. A 2015 survey on Apache Spark reported that 91% of Spark users consider performance a vital factor in its growth. With benchmark results showing big data applications running up to 100 times faster on Hadoop clusters, Apache Spark allows for entirely new use cases that enhance the value of big data. Apache Spark is gaining popularity and is emerging as the standard execution engine in Hadoop because of its extensible and flexible APIs, high performance, ease of use and increased developer productivity.


Language Support in Apache Spark

The Apache Spark ecosystem is built on top of a core execution engine that has extensible APIs in different languages. A 2015 Spark survey that evaluated language use among 62% of Spark users found that 71% were using Scala, 58% were using Python, 31% were using Java and 18% were using the R programming language.


1) Scala

The Spark framework is built in Scala, so programming Spark in Scala can provide access to some of the latest features that might not yet be available in the other supported languages.

2) Python

Python has excellent libraries for data analysis, like pandas and scikit-learn, but is comparatively slower than Scala.

3) R Language

The R programming language has a rich environment for machine learning and statistical analysis, which helps increase developer productivity. Data scientists can use R with Spark through SparkR to process data that cannot be handled by a single machine.

4) Java

Java is verbose, and Spark does not ship an interactive shell (REPL) for it, but it is a good choice for developers coming from a Java and Hadoop background.

Spark Components

The Apache Spark components include:

Spark Core
Spark SQL
Spark Streaming
MLlib (machine learning library)
GraphX
SparkR

Let us discuss them in detail.

Apache Spark Components

1) Spark Core Component

The Spark Core component is the foundation for parallel and distributed processing of large datasets. Spark Core is responsible for all the basic I/O functionality, scheduling and monitoring of jobs on Spark clusters, task dispatching, networking with different storage systems, fault recovery and efficient memory management.

Spark Core makes use of a special data structure known as the RDD (Resilient Distributed Dataset). Data sharing or reuse in distributed computing systems like Hadoop MapReduce requires the data to be written to intermediate stores like Amazon S3 or HDFS. This slows down the overall computation because of the replication, I/O operations and serialization involved in storing the data in these intermediate stable data stores. Resilient Distributed Datasets overcome this drawback of Hadoop MapReduce by allowing fault-tolerant in-memory computations.

Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel. RDDs can contain any kind of Python, Scala or Java objects, including objects of user-defined classes. RDDs are usually created either by transforming existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.

Operations on RDDs

i) Transformations

Coarse-grained operations like join, union, filter or map, which are applied to existing RDDs and produce a new RDD holding the result, are referred to as transformations. All transformations in Spark are lazy: Spark does not execute them immediately but instead records a lineage that tracks all the transformations to be applied to an RDD.

ii) Actions

Operations like count, first and reduce which return values after computations on existing RDDs are referred to as Actions.
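The split between lazy transformations and actions can be illustrated with a small single-process sketch in plain Python. This is not Spark's API: `ToyRDD` and its methods are hypothetical names invented for this illustration, and the sketch ignores partitioning and distribution entirely. It only shows that transformations record lineage while actions trigger the computation.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's class)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable copy of the input records
        self._lineage = tuple(lineage)  # recorded transformations, not yet run

    # Transformations: return a new ToyRDD and compute nothing.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def lineage(self):
        return [kind for kind, _ in self._lineage]

    def _compute(self):
        # Replay the recorded lineage over the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the lineage and return a value.
    def count(self):
        return sum(1 for _ in self._compute())

    def first(self):
        return next(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


seen = []  # records every element that square() is actually called on


def square(x):
    seen.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
ran_before_action = list(seen)  # still empty: no transformation has run yet
total = even_squares.reduce(lambda a, b: a + b)  # the action triggers the work
```

Because each `ToyRDD` keeps its source and lineage, a result can always be recomputed from scratch, which is the same idea Spark relies on for fault recovery.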

2) Spark SQL Component

The Spark SQL component is a library on top of Apache Spark that grew out of Shark. Spark developers can leverage the power of declarative queries and optimized storage by running SQL-like queries on Spark data held in RDDs and other external sources. Users can perform extract, transform and load (ETL) functions on data coming from various formats like JSON, Parquet or Hive and then run ad-hoc queries using Spark SQL. Spark SQL eases the process of extracting and merging various datasets so that they are ready to use for machine learning.

The DataFrame is the main abstraction in Spark SQL. A DataFrame is a distributed collection of data organized into named columns. In earlier versions of Spark SQL, DataFrames were referred to as SchemaRDDs. The DataFrame API integrates with procedural Spark code to provide tight integration between procedural and relational processing. The DataFrame API evaluates operations lazily, which enables relational optimizations across the overall data processing workflow. All relational functionality in Spark is accessed through a SQLContext or a HiveContext.

Catalyst, an extensible optimizer, is at the core of Spark SQL. It is an optimization framework embedded in Scala that helps developers improve their productivity and the performance of the queries they write. Using Catalyst, Spark developers can concisely specify complex relational optimizations and query transformations in a few lines of code by making use of Scala’s powerful programming constructs like pattern matching and runtime metaprogramming. Catalyst eases the process of adding optimization rules, data sources and data types for machine learning domains.
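To give a feel for rule-based optimization, here is a toy sketch in plain Python. It is an analogy only: Catalyst itself is written in Scala and uses pattern matching over query plan trees, whereas every name below (`Lit`, `Col`, `Add`, `fold_constants`, `drop_add_zero`, `optimize`) is hypothetical and invented for this illustration. Each rule rewrites one kind of expression node, and the optimizer applies the rules bottom-up until the tree stops changing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    value: int  # a literal constant


@dataclass(frozen=True)
class Col:
    name: str  # a reference to a named column


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(node):
    # Add(Lit(a), Lit(b)) -> Lit(a + b)
    if isinstance(node, Add) and isinstance(node.left, Lit) and isinstance(node.right, Lit):
        return Lit(node.left.value + node.right.value)
    return node


def drop_add_zero(node):
    # x + 0 -> x and 0 + x -> x
    if isinstance(node, Add):
        if node.right == Lit(0):
            return node.left
        if node.left == Lit(0):
            return node.right
    return node


RULES = (fold_constants, drop_add_zero)


def transform_up(node, rule):
    # Rewrite the children first, then the node itself.
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def optimize(tree):
    # Apply every rule repeatedly until a full pass changes nothing.
    while True:
        new_tree = tree
        for rule in RULES:
            new_tree = transform_up(new_tree, rule)
        if new_tree == tree:
            return tree
        tree = new_tree


expr = Add(Col("x"), Add(Lit(1), Lit(-1)))  # x + (1 + -1)
optimized = optimize(expr)                  # simplifies to Col("x")
```

Adding a new optimization in this sketch means writing one more small function and appending it to `RULES`, which mirrors the extensibility the paragraph above attributes to Catalyst.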

3) Spark Streaming

Spark Streaming is a lightweight API that allows developers to perform batch processing and stream processing with ease, in the same application. Discretized Streams form the base abstraction in Spark Streaming. A continuous stream of input data is represented as a Discretized Stream (DStream), which is a series of RDDs, so that data can be processed in near real time. Spark Streaming leverages the fast scheduling capacity of Apache Spark Core to perform streaming analytics by ingesting data in mini-batches. Transformations are then applied to those mini-batches of data. Data in Spark Streaming is ingested from various data sources and live streams like Twitter, Apache Kafka, Akka Actors, IoT sensors, Amazon Kinesis and Apache Flume, in event-driven, fault-tolerant and type-safe applications.
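The mini-batch idea can be sketched in a few lines of plain Python. This is not the Spark Streaming API, and it makes one simplifying assumption that should be stated loudly: Spark Streaming cuts batches by time interval, while this sketch cuts them by record count to stay deterministic. The names `micro_batches` and `word_count` and the sample events are hypothetical.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterable into consecutive small batches.

    Assumption: batches are cut by record count here; Spark Streaming cuts
    them by time interval instead.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_count(lines):
    """Ordinary batch logic: count the words in a collection of lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


events = ["spark streaming", "spark core", "kafka spark", "flume", "kafka"]

# Streaming view: the same function is applied to each mini-batch in turn.
per_batch = [word_count(batch) for batch in micro_batches(events, batch_size=2)]

# Batch view: the same function is applied to the whole dataset at once.
overall = word_count(events)
```

The point of the design is that `word_count` knows nothing about streaming; the same function serves both the per-batch and the whole-dataset case.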

Features of Spark Streaming

Easy, reliable and fast processing of live data streams.
Spark developers can reuse the same code for stream and batch processing and can also integrate the streaming data with historical data.
Spark Streaming has exactly-once message guarantees and helps recover lost work without having to write any extra code or adding additional configurations.
Spark streaming supports inclusion of Spark MLlib for machine learning pipelines into data pathways.

Applications of Spark Streaming

Spark Streaming is used in applications that require real-time statistics and rapid response, such as alarms, IoT sensors, diagnostics and cyber security. It is well suited to log processing, intrusion detection and fraud detection.
Spark Streaming is also useful for online advertising and campaigns, finance and supply chain management.


4) Spark Component MLlib

As more companies aim to build user-focused data products and services, the need for machine learning to develop recommendations, predictive insights and personalized results is increasing. Data scientists can solve this problem with popular data science programming tools like Python and R. However, a majority of data scientists’ time is spent supporting the infrastructure for these languages instead of building machine learning models to solve business data problems. The Spark MLlib library is a solution to this problem.


MLlib is a low-level machine learning library that can be called from the Scala, Python and Java programming languages. MLlib is simple to use, scalable, compatible with various programming languages and can be easily integrated with other tools. MLlib eases the development and deployment of scalable machine learning pipelines. The MLlib library has implementations of various common machine learning algorithms:

Clustering: K-means
Classification: naïve Bayes, logistic regression, SVM
Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Regression: linear regression
Collaborative filtering: Alternating Least Squares for recommendations

According to the 2015 Spark Survey by Databricks, production use of the MLlib Spark component increased from 11% in 2014 to 15% in 2015. MLlib has code contributions from 200 developers across 75 organizations, who have provided 2000+ patches to MLlib alone. Spark MLlib is 9 times as fast as the disk-based version of Mahout.

Why do data scientists use MLlib?

MLlib has a simple application programming interface for data scientists who are already familiar with data science programming tools like R and Python.
It comes with the Spark framework as a standard component that helps data scientists write applications in the Python, Scala, Java or R programming languages.
Data scientists can iterate through data problems up to 100 times faster than with Hadoop MapReduce, helping them solve machine learning problems at large scale in an interactive fashion.
Data scientists can build machine learning models as a multi-step journey from data ingestion through trial and error to production.
Data scientists can run the same machine learning code on a large cluster and on a single PC without it breaking.

Business Use Cases of Spark MLlib

Supply chain optimization and maintenance
Advertising optimization: to find out the probability of users clicking on available ads.
Marketing: to recommend products to customers to maximize revenue or engagement.
Fraud detection: to track the anomalous behaviour of users.

5) Spark GraphX

There are several data-parallel tools for graph computations, but they do not tackle the challenges of graph construction and transformation. Graph computations in these tools are also inefficient and cumbersome because of complex joins. GraphX is an API on top of Apache Spark for cross-world manipulations, spanning graph-parallel and data-parallel computation, that solves this problem. Spark GraphX introduces the Resilient Distributed Graph (RDG), an abstraction built on Spark RDDs. RDGs associate records with the vertices and edges in a graph. RDGs help data scientists perform graph operations through various expressive computational primitives. These primitives help developers implement the Pregel abstraction and PageRank in approximately 20 lines of code or fewer.

The GraphX component of Spark supports multiple use cases like social network analysis, recommendation and fraud detection. Other graph databases can also be used, but they require several systems to create the entire computation pipeline. Using Spark GraphX, data scientists can work with graph and non-graph sources to achieve flexibility and resilience in graph computing.
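To show why an algorithm like PageRank fits in roughly 20 lines, here is a single-machine sketch in plain Python. It does not use the GraphX API, which expresses the same iteration over distributed vertex and edge collections. The function name `pagerank`, its parameters and the sample edge list are assumptions made for this illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for src, targets in out_links.items():
            # A node with no out-links spreads its rank over every node.
            receivers = targets if targets else nodes
            share = ranks[src] / len(receivers)
            for dst in receivers:
                incoming[dst] += share
        ranks = {
            node: (1 - damping) / len(nodes) + damping * incoming[node]
            for node in nodes
        }
    return ranks


# Hypothetical four-page graph: a, b and d all link to c; c links back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
top_page = max(ranks, key=ranks.get)
```

Each iteration is a "send rank along edges, then combine at vertices" step, which is the message-passing shape that Pregel-style primitives are designed to express.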

Shared Memory in Apache Spark

Apache Spark’s Cousin Tachyon: A Reliable In-Memory File System

In-memory computation has gained traction recently because it lets data scientists run fast, interactive queries. However, in-memory processing can run into issues such as:

In a distributed system, one job’s output is given as input to another job. In-memory computation is fast, but the rate at which output data is written depends on network or disk bandwidth and is often slow.
When the JVM crashes, in-memory data is lost. It is time-consuming to load the data into memory again.
There could be duplicate input data for different jobs.

With dataset sizes increasing and storage becoming a major bottleneck for different workloads, Tachyon supports reliable file sharing at memory speed across cluster computing frameworks like Spark and Hadoop. Tachyon is a reliable shared-memory layer that forms an integral part of the Spark ecosystem and helps achieve the desired throughput and performance by avoiding unnecessary replication.

Tachyon is already used in production at popular companies like Red Hat, Yahoo, Intel, IBM and Baidu. Tachyon code is contributed by 100+ developers from 30 organizations.


Cluster Management in Apache Spark

Apache Spark applications can run on three different cluster managers:

Standalone Cluster: If only Spark is running, this is one of the easiest cluster managers to set up and can be used for new deployments. In standalone mode, Spark manages its own cluster, and each application runs an executor on every node within the cluster.
Apache Mesos: A dedicated cluster and resource manager that provides rich resource scheduling capabilities. Mesos has a fine-grained sharing option, so the Spark shell can scale down its CPU allocation between commands, which is especially useful when several users are running interactive shells.
YARN: YARN comes with most Hadoop distributions and, at the time this article was originally written, was the only cluster manager in Spark that supported security. The YARN cluster manager allows dynamic sharing and central configuration of the same pool of cluster resources between the various frameworks that run on YARN. Unlike in standalone mode, the user can select the number of executors to use. YARN is a better choice when a big Hadoop cluster is already in use in production.

Apache Spark is evolving at a rapid pace because of its interactive performance, fault tolerance and productivity benefits. Apache Spark is likely to become the industry standard for big data processing in 2016.
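In practice, the cluster manager is selected through the master URL passed to Spark, for example with the `--master` option of `spark-submit`. The small helper below sketches the URL formats for the three managers described above. The helper itself (`master_url`, `DEFAULT_PORTS`) and the host names are hypothetical; the URL schemes and default ports follow Spark's documented conventions, and the plain `yarn` value assumes Spark 2.0 or later.

```python
# Default master ports: 7077 for a standalone master, 5050 for Mesos.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}
SCHEMES = {"standalone": "spark", "mesos": "mesos"}


def master_url(manager, host=None, port=None):
    """Build the master URL for a given cluster manager (hypothetical helper)."""
    if manager == "yarn":
        # With YARN, the cluster location is read from the Hadoop
        # configuration, so the master URL carries no host or port.
        return "yarn"
    if manager not in SCHEMES:
        raise ValueError(f"unknown cluster manager: {manager!r}")
    if host is None:
        raise ValueError(f"{manager} requires a master host")
    return f"{SCHEMES[manager]}://{host}:{port or DEFAULT_PORTS[manager]}"


# Host names below are placeholders, not real machines.
standalone = master_url("standalone", "spark-master.example.internal")
mesos = master_url("mesos", "mesos-master.example.internal")
yarn = master_url("yarn")
```

A call such as `spark-submit --master <url> ...` would then take one of these values as `<url>`.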
