With the growing demand for real-time insights, Apache Spark has moved from a talking point in boardroom discussions to enterprise deployments in production. Apache Spark is no longer just a component of the Hadoop ecosystem; for many organizations it has become the lingua franca of big data analytics. With over 750 contributors from 200+ companies and a growing number of use cases in industries such as retail, healthcare, finance, advertising and education, Apache Spark continues to gain attention in the big data space. Spark promises faster data processing and easier development. Curious how Spark achieves this? To answer this question, let’s introduce the Apache Spark ecosystem and explain the Spark components that make Apache Spark fast and reliable. Many of these components were built to resolve issues that cropped up while using Hadoop MapReduce.
- Yahoo, Uber, Amazon, eBay, Pinterest, Spotify, Baidu, Alibaba, Shopify and Verizon are some of the top companies already using Spark in production.
- IBM, Cloudera, DataStax and BlueData provide commercial Spark distributions.
- The largest known cluster of Apache Spark has 8000 nodes.
- Spark has 14,763 commits from 818 contributors as of February 17th, 2016.
These facts and figures show how the Spark ecosystem has grown since 2010, with the development of libraries and frameworks that allow faster and more advanced data analytics than Hadoop.
Apache Spark Ecosystem
Apache Spark is a powerful alternative to Hadoop MapReduce, with rich functionality such as machine learning, real-time stream processing and graph computation. A 2015 survey on Apache Spark reported that 91% of Spark users consider performance a vital factor in its growth. With benchmarks showing big data applications running up to 100 times faster on Hadoop clusters, Apache Spark opens up entirely new use cases that enhance the value of big data. Apache Spark is gaining popularity and is emerging as the standard execution engine in Hadoop because of its extensible and flexible APIs, high performance, ease of use and increased developer productivity.
Language Support in Apache Spark
The Apache Spark ecosystem is built on top of a core execution engine that has extensible APIs in different languages. A 2015 Spark Survey of 62% of Spark users evaluated the languages used with Spark: 71% of respondents were using Scala, 58% Python, 31% Java and 18% R.
1) Scala Language
The Spark framework is built on Scala, so programming in Scala gives access to some of the latest features, which might not yet be available in the other languages Spark supports.
2) Python Language
Python has excellent libraries for data analysis, such as pandas and scikit-learn, but is comparatively slower than Scala.
3) R Language
R has a rich environment for machine learning and statistical analysis, which helps increase developer productivity. Data scientists can now use R with Spark through SparkR to process data that cannot be handled by a single machine.
4) Java Language
Java is verbose and does not support a REPL, but it is a good choice for developers coming from a Java and Hadoop background.
1) Spark Core Component
The Spark Core component is the foundation for parallel and distributed processing of large datasets. It is responsible for basic I/O functionality, scheduling and monitoring jobs on Spark clusters, task dispatching, networking with different storage systems, fault recovery and efficient memory management.
Spark Core makes use of a special data structure known as the RDD (Resilient Distributed Dataset). In distributed computing systems like Hadoop MapReduce, sharing or reusing data requires storing it in intermediate stores such as Amazon S3 or HDFS. This slows down the overall computation because of the replication, I/O and serialization involved in writing to these stable stores. Resilient Distributed Datasets overcome this drawback of Hadoop MapReduce by allowing fault-tolerant in-memory computations.
Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel. RDDs can contain any kind of object: Python, Scala or Java objects, or even objects of user-defined classes. RDDs are usually created either by transforming existing RDDs or by loading an external dataset from stable storage such as HDFS or HBase.
Operations on RDDs
Coarse-grained operations such as join, union, filter or map, which take existing RDDs and produce a new RDD holding the result, are referred to as transformations. All transformations in Spark are lazy: Spark does not execute them immediately but instead records a lineage that tracks all the transformations to be applied to an RDD.
Operations such as count, first and reduce, which return a value after running a computation on existing RDDs, are referred to as actions. Calling an action is what triggers execution of the recorded transformations.
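To make the difference between lazy transformations and actions concrete, here is a minimal sketch in plain Python. It is a toy, single-machine model, not Spark's API: the class name ToyRDD and its internals are made up for illustration, and a real RDD is partitioned across a cluster. It only shows the idea that transformations record a lineage while actions replay it.

```python
import functools


class ToyRDD:
    """Toy, single-machine stand-in for an RDD. Not Spark's API."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # base data, never modified
        self._lineage = tuple(lineage)  # recorded (kind, function) steps

    # Transformations only record a step and return a new ToyRDD.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Replay the recorded lineage over the base data,
        # using Python's built-in map() and filter().
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    # Actions trigger the computation and return a plain value.
    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: nothing has executed yet

total = even_squares.reduce(lambda a, b: a + b)  # the action runs map + filter
```

Here calls_before_action stays empty because map and filter only extended the lineage; square runs for the first time when reduce is called. Because the lineage is kept rather than the intermediate results, the same steps can be replayed from the base data, which is the idea behind RDD fault recovery.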
2) Spark SQL Component
The Spark SQL component acts as a library on top of Apache Spark and was built based on Shark. Spark developers can leverage the power of declarative queries and optimized storage by running SQL-like queries on Spark data held in RDDs and external sources. Users can perform extract, transform and load (ETL) functions on data coming from various formats and sources, such as JSON, Parquet or Hive, and then run ad-hoc queries using Spark SQL. Spark SQL eases the process of extracting and merging datasets so that they are ready to use for machine learning.
The DataFrame is the main abstraction in Spark SQL: a distributed collection of data organized into named columns. In earlier versions of Spark SQL, DataFrames were referred to as SchemaRDDs. The DataFrame API integrates with Spark’s procedural code, giving tight integration between procedural and relational processing. It evaluates operations lazily, which allows relational optimizations across the overall data processing workflow. All relational functionality in Spark is accessed through a SQLContext or HiveContext.
Catalyst, an extensible optimizer, is at the core of Spark SQL. It is an optimization framework embedded in Scala that helps developers improve their productivity and the performance of the queries they write. Using Catalyst, Spark developers can concisely specify complex relational optimizations and query transformations in a few lines of code by making use of Scala’s powerful programming constructs, such as pattern matching and runtime metaprogramming. Catalyst also eases the process of adding optimization rules, data sources and data types for domains such as machine learning.
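The idea of a rule-based optimizer that rewrites expression trees by matching patterns can be sketched in a few lines. The following is a plain-Python analogue, not Catalyst itself: the node classes Literal, Column and Add and the function fold_constants are hypothetical names chosen for this illustration, and Catalyst expresses such rules with Scala pattern matching rather than isinstance checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite rule applied bottom-up:
    Add(Literal, Literal) becomes one Literal, and adding Literal(0) is dropped."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        if right == Literal(0):
            return left
        if left == Literal(0):
            return right
        return Add(left, right)
    return expr  # literals and columns are already as simple as they get


# price + (1 + 2) can be simplified before any data is touched.
plan = Add(Column("price"), Add(Literal(1), Literal(2)))
optimized = fold_constants(plan)
```

After the rule runs, optimized is Add(Column("price"), Literal(3)): the constant part is computed once at planning time instead of once per row.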
3) Spark Streaming
Spark Streaming is a lightweight API that allows developers to perform batch processing and stream processing with ease in the same application. Discretized Streams form the base abstraction in Spark Streaming: a continuous stream of input data is represented as a Discretized Stream (DStream), a series of RDDs, and processed in near real time. Spark Streaming leverages the fast scheduling capability of Spark Core to perform streaming analytics by ingesting data in mini-batches, and transformations are applied to those mini-batches. Data is ingested from various sources and live streams, such as Twitter, Apache Kafka, Akka Actors, IoT sensors, Amazon Kinesis and Apache Flume, in event-driven, fault-tolerant and type-safe applications.
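The mini-batch idea can be illustrated without a cluster. The sketch below is plain Python, not the Spark Streaming API: the function names discretize and word_count and the sample events are made up for this example. It cuts a stream of timestamped events into fixed-interval batches and applies the same batch function to each one, which is roughly what happens to the RDDs inside a DStream.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into batches by time interval.
    Intervals that received no events are skipped in this toy version."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


def word_count(batch):
    """An ordinary batch computation, reused unchanged for every mini-batch."""
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "spark"), (0.7, "kafka"), (1.1, "spark"),
          (1.9, "spark"), (2.5, "flume")]

batches = discretize(events, batch_interval=1.0)
per_batch_counts = [word_count(batch) for batch in batches]
```

With a one-second interval the five events fall into three batches, and word_count runs once per batch. The same function could be applied to a historical dataset as a single large batch, which is why stream and batch code can be shared.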
Features of Spark Streaming
- Easy, reliable and fast processing of live data streams.
- Spark developers can reuse the same code for stream and batch processing and can also integrate the streaming data with historical data.
- Spark Streaming offers exactly-once message guarantees and helps recover lost work without writing extra code or adding configuration.
- Spark Streaming supports including Spark MLlib machine learning pipelines in data pathways.
Applications of Spark Streaming
- Spark Streaming is used in applications that require real-time statistics and rapid response, such as alarms, IoT sensors, diagnostics and cyber security. It is well suited to log processing, intrusion detection and fraud detection.
- Spark Streaming is also useful for online advertisements and campaigns, finance and supply chain management.
4) Spark Component MLlib
With an increasing number of companies aiming to build user-focused data products and services, the need for machine learning to develop recommendations, predictive insights and personalized results is growing. Data scientists can solve this problem with popular data science programming tools like Python and R. However, a majority of data scientists’ time is spent supporting the infrastructure for these languages instead of building machine learning models to solve business data problems. The Spark MLlib library is a solution to this problem.
MLlib is a low-level machine learning library that can be called from Scala, Python and Java. MLlib is simple to use, scalable, compatible with various programming languages and easy to integrate with other tools. It eases the development and deployment of scalable machine learning pipelines. MLlib has implementations of various common machine learning algorithms:
- Clustering: K-means
- Classification: naïve Bayes, logistic regression, SVM
- Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
- Regression: linear regression
- Collaborative filtering: Alternating Least Squares (ALS) for recommendations
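As a reminder of what one of these algorithms does, here is a minimal K-means sketch on one-dimensional data. It is plain Python, not MLlib's implementation or API: the function name kmeans_1d, the sample points and the starting centers are assumptions made for this illustration, and MLlib distributes the same kind of assign-and-update loop across a cluster.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain K-means on numbers: assign each point to its nearest center,
    then move each center to the mean of the points assigned to it."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(point - centers[i]))
            clusters[nearest].append(point)
        # A center with no assigned points stays where it is.
        centers = [sum(cluster) / len(cluster) if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers


# Two obvious groups, around 2 and around 11 (made-up data).
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers = kmeans_1d(points, centers=[0.0, 5.0])
```

Starting from centers 0.0 and 5.0, the loop settles on 2.0 and 11.0 after a couple of iterations and then stops changing.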
According to the 2015 Spark Survey by Databricks, production use of the MLlib component increased from 11% in 2014 to 15% in 2015. MLlib has code contributions from 200 developers across 75 organizations, who have provided 2000+ patches to MLlib alone. Spark MLlib is 9 times as fast as the disk-based version of Mahout.
Why do data scientists use MLlib?
- MLlib has a simple application programming interface for data scientists who are already familiar with data science programming tools like R and Python.
- It ships with the Spark framework as a standard component, which helps data scientists write applications in Python, Scala, Java or R.
- Data scientists can iterate through data problems up to 100 times faster than with Hadoop MapReduce, helping them solve machine learning problems at large scale in an interactive fashion.
- Data scientists can build machine learning models as a multi-step journey from data ingestion through trial and error to production.
- Data scientists can run the same machine learning code on a large cluster and on a PC without it breaking.
Business Use Cases of Spark MLlib
- Supply chain optimization and maintenance
- Advertising optimization: to find the probability of users clicking on available ads.
- Marketing: to recommend products to customers to maximize revenue or engagement.
- Fraud detection: to track the anomalous behaviour of users.
5) Spark GraphX
There are several data-parallel tools for graph computation, but they do not tackle the challenges of graph construction and transformation, and their graph computations are inefficient and cumbersome because of complex joins. GraphX, an API on top of Apache Spark for cross-world manipulations, addresses this problem. Spark GraphX introduces the Resilient Distributed Graph (RDG), an abstraction built on Spark RDDs. RDGs associate records with the vertices and edges of a graph and let data scientists perform graph operations through expressive computational primitives. These primitives help developers implement abstractions such as Pregel and algorithms such as PageRank in approximately 20 lines of code or fewer.
The GraphX component of Spark supports use cases such as social network analysis, recommendation and fraud detection. Graph databases can also be used, but they require several systems to build the entire computation pipeline. Using Spark GraphX, data scientists can work with graph and non-graph sources to achieve flexibility and resilience in graph computing.
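To give a feel for how compact an algorithm like PageRank is, here is a single-machine sketch in plain Python. It is not GraphX code and does not use the Pregel API: the function name pagerank, the damping factor of 0.85 and the tiny example graph are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = ranks[src] / len(targets)
            for dst in targets:
                incoming[dst] += share
        ranks = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return ranks


# Hypothetical graph: a and b both link to c, and c links back to a.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")], iterations=50)
top_node = max(ranks, key=ranks.get)
```

In this small graph c collects rank from both a and b, so it ends up with the highest score, while b, which nobody links to, keeps only the baseline share. The ranks always sum to 1.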
Shared Memory in Apache Spark
Apache Spark’s Cousin Tachyon: An In-Memory Reliable File System
In-memory computation has gained traction recently because it lets data scientists run fast, interactive queries. However, in-memory processing can run into issues such as:
- In a distributed system, one job’s output is given as input to another job. In-memory computation is fast, but the rate at which output data is written depends on network or disk bandwidth and is often slow.
- When the JVM crashes, in-memory data is lost. It is time consuming to load the data into memory again.
- There could be duplicate input data for different jobs.
As datasets grow and storage becomes a bottleneck for different workloads, Tachyon supports reliable file sharing at memory speed across cluster computing frameworks like Spark and Hadoop. Tachyon is a reliable shared memory that forms an integral part of the Spark ecosystem and helps achieve the desired throughput and performance by avoiding unnecessary replication.
Tachyon is already used in production at companies such as Red Hat, Yahoo, Intel, IBM and Baidu. Its code is contributed by 100+ developers from 30 organizations.
Cluster Management in Apache Spark
Apache Spark applications can run on three different cluster managers:
- Standalone Cluster: If only Spark is running, this is one of the easiest cluster managers to set up and is well suited to new deployments. In standalone mode, Spark manages its own cluster, and each application runs an executor on every node within the cluster.
- Apache Mesos: A dedicated cluster and resource manager that provides rich resource scheduling capabilities. Mesos has a fine-grained sharing option, so the Spark shell scales down its CPU allocation between commands, which is especially useful when several users are running interactive shells.
- YARN: YARN comes with most Hadoop distributions and is the only cluster manager in Spark that supports security. The YARN cluster manager allows dynamic sharing and central configuration of the same pool of cluster resources between the frameworks that run on YARN. Unlike in standalone mode, the user can select the number of executors to use. YARN is a better choice when a large Hadoop cluster is already in use in production.
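In practice, the cluster manager is selected through the master URL an application is launched with. The helper below is a small plain-Python sketch of the URL formats Spark accepts for each manager. The function name master_url and the host names are hypothetical, the ports shown are the usual defaults, and older Spark 1.x releases used yarn-client or yarn-cluster instead of plain yarn.

```python
def master_url(manager, host=None, port=None):
    """Build the master URL string for a given cluster manager.
    Hypothetical helper for illustration; Spark itself only needs the string."""
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"  # 7077: default standalone port
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"  # 5050: default Mesos port
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration,
        # so no host or port appears in the URL.
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


# Made-up host names.
standalone = master_url("standalone", host="spark-master.example.com")
mesos = master_url("mesos", host="mesos-master.example.com")
yarn = master_url("yarn")
```

The resulting string is what would be passed as the master when submitting an application, so switching cluster managers usually means changing this one value rather than the application code.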
Apache Spark is evolving at a rapid pace because of its interactive performance, fault tolerance and productivity benefits. It is likely to become the industry standard for big data processing in 2016.