As demand for real-time insights has grown, Apache Spark has moved from a boardroom talking point to enterprise production deployments. Apache Spark is no longer just a component of the Hadoop ecosystem; for many organizations it has become the lingua franca of big data analytics. With over 750 contributors from 200+ companies and a growing number of use cases across industries like retail, healthcare, finance, advertising and education, Apache Spark continues to gain attention in the big data space. Spark promises faster data processing and easier development. How does Spark achieve this? To answer that question, let's introduce the Apache Spark ecosystem and walk through the components that make Spark fast and reliable. Many of these components were built to resolve issues that cropped up while using Hadoop MapReduce.
These facts and figures show how the Spark ecosystem has grown since 2010, with the development of libraries and frameworks that enable faster and more advanced data analytics than Hadoop.
Apache Spark is a powerful alternative to Hadoop MapReduce, with rich functionality such as machine learning, real-time stream processing and graph computation. A 2015 survey on Apache Spark reported that 91% of Spark users consider performance a vital factor in its growth. With benchmarks showing big data applications running up to 100 times faster on Hadoop clusters, Apache Spark enables entirely new use cases that enhance the value of big data. Spark is gaining popularity and emerging as the standard execution engine in Hadoop because of its extensible and flexible APIs, high performance, ease of use and increased developer productivity.
The Apache Spark ecosystem is built on top of the core execution engine, which exposes extensible APIs in several languages. When the 2015 Spark Survey asked users which languages they use with Spark, 71% reported Scala, 58% Python, 31% Java and 18% R.
1) Scala

The Spark framework itself is built in Scala, so programming Spark in Scala provides access to the latest and greatest features that may not yet be available in the other supported languages.
2) Python

Python has excellent libraries for data analysis, such as pandas and scikit-learn, but as a language it is comparatively slower than Scala.
3) R Language
The R programming language offers a rich environment for machine learning and statistical analysis, which helps increase developer productivity. Through SparkR, data scientists can now use R with Spark to process data that cannot be handled by a single machine.
4) Java

Java is verbose and does not offer a REPL, but it is definitely a good choice for developers coming from a Java-plus-Hadoop background.
The Spark Core component is the foundation for parallel and distributed processing of large datasets. It is responsible for all the basic I/O functionality, scheduling and monitoring of jobs on Spark clusters, task dispatching, networking with different storage systems, fault recovery and efficient memory management.
Spark Core makes use of a special data structure known as the RDD (Resilient Distributed Dataset). Sharing or reusing data in distributed computing systems like Hadoop MapReduce requires storing it in intermediate stores such as Amazon S3 or HDFS. This slows down the overall computation because of the replication, disk I/O and serialization involved in writing to these intermediate stable stores. Resilient Distributed Datasets overcome this drawback of Hadoop MapReduce by enabling fault-tolerant, in-memory computation.
Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel. RDDs can contain any kind of Python, Scala or Java objects, including user-defined classes. RDDs are usually created either by transforming existing RDDs or by loading an external dataset from stable storage such as HDFS or HBase.
Coarse-grained operations such as join, union, filter or map, which produce a new RDD holding the result of the operation, are referred to as transformations. All transformations in Spark are lazy: Spark does not execute them immediately but instead builds a lineage that tracks all the transformations to be applied to an RDD.
Operations such as count, first and reduce, which return values after computing over existing RDDs, are referred to as actions.
The Spark SQL component is a library on top of Apache Spark that was built based on Shark. Spark developers can leverage the power of declarative queries and optimized storage by running SQL-like queries on Spark data held in RDDs and other external sources. Users can perform extract, transform and load (ETL) operations on data in formats such as JSON, Parquet or Hive and then run ad-hoc queries against it using Spark SQL. Spark SQL eases the process of extracting and merging datasets so that they are ready to use for machine learning.
The DataFrame is the main abstraction of Spark SQL. A DataFrame is a distributed collection of data organized into named columns; in earlier versions of Spark SQL, DataFrames were referred to as SchemaRDDs. The DataFrame API integrates with procedural Spark code to provide tight integration between procedural and relational processing. It evaluates operations lazily in order to support relational optimizations and optimize the overall data processing workflow. All relational functionality in Spark can be accessed through the SQLContext or HiveContext.
Catalyst, an extensible optimizer, is at the core of Spark SQL. It is an optimization framework embedded in Scala that helps developers improve both their productivity and the performance of the queries they write. Using Catalyst, Spark developers can concisely specify complex relational optimizations and query transformations in a few lines of code, making full use of Scala's powerful constructs such as pattern matching and runtime metaprogramming. Catalyst also eases the process of adding optimization rules, data sources and data types for domains such as machine learning.
Spark Streaming is a lightweight API that allows developers to perform batch processing and streaming of data with ease, in the same application. Discretized Streams (DStreams), each a continuous series of RDDs, form the base abstraction in Spark Streaming and are used to process data in near real time. Spark Streaming leverages the fast scheduling capability of Spark Core to perform streaming analytics by ingesting data in mini-batches and applying transformations to each mini-batch. Data is ingested from sources and live streams such as Twitter, Apache Kafka, Akka Actors, IoT sensors, Amazon Kinesis and Apache Flume, in event-driven, fault-tolerant and type-safe applications.
As more companies aim to build user-focused data products and services, the need for machine learning to develop recommendations, predictive insights and personalized results keeps growing. Data scientists can address this with popular data science tools like Python and R. However, a majority of a data scientist's time is spent supporting the infrastructure for these languages instead of building machine learning models to solve business data problems. The Spark MLlib library is a solution to this problem.
MLlib is Spark's scalable machine learning library, callable from Scala, Python and Java. MLlib is simple to use, compatible with multiple programming languages and easily integrated with other tools, and it eases the development and deployment of scalable machine learning pipelines. The library provides implementations of common machine learning algorithms for classification, regression, clustering, collaborative filtering and dimensionality reduction.
According to the 2015 Spark Survey by Databricks, production use of the MLlib component increased from 11% in 2014 to 15% in 2015. MLlib has code contributions from 200 developers across 75 organizations, who have provided 2,000+ patches to MLlib alone. Spark MLlib has been benchmarked at nine times the speed of the disk-based version of Apache Mahout.
There are several data-parallel tools for graph computation, but they do not tackle the challenges of graph construction and transformation, and their graph computations tend to be inefficient and cumbersome because of complex joins. GraphX, an API on top of Apache Spark for graph and graph-parallel computation, solves this problem. Spark GraphX introduces the Resilient Distributed Graph (RDG), an abstraction built on Spark RDDs. RDGs associate records with the vertices and edges of a graph and let data scientists perform graph operations through expressive computational primitives. These primitives allow developers to implement the Pregel and PageRank abstractions in roughly 20 lines of code or fewer.
The GraphX component of Spark supports use cases such as social network analysis, recommendation and fraud detection. Dedicated graph databases can also be used, but they require several systems to create the entire computation pipeline. Using Spark GraphX, data scientists can work with graph and non-graph sources alike, achieving flexibility and resilience in graph computing.
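GraphX itself exposes Scala APIs, so as a language-neutral illustration of the PageRank computation those primitives implement, here is a small pure-Python sketch (this is the algorithm, not GraphX code):

```python
def pagerank(edges, num_iters=20, damping=0.85):
    """Iterative PageRank over an edge list [(src, dst), ...]."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 for n in nodes}  # start every page with equal rank
    for _ in range(num_iters):
        contribs = {n: 0.0 for n in nodes}
        for src, dsts in out_links.items():
            for dst in dsts:  # each page splits its rank among its out-links
                contribs[dst] += rank[src] / len(dsts)
        rank = {n: (1 - damping) + damping * contribs[n] for n in nodes}
    return rank
```

In GraphX the same loop is expressed with the Pregel primitive: vertices exchange rank contributions along edges each superstep until the ranks converge.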
In-memory computation has gained traction recently because it lets data scientists run fast, interactive queries. However, in-memory processing brings its own issues: cached data is lost when the framework crashes, and data cannot be shared cheaply across jobs and frameworks without writing it back to disk.
With datasets growing and storage becoming a major bottleneck for different workloads, Tachyon supports reliable, memory-speed file sharing across cluster computing frameworks like Spark and Hadoop. Tachyon is a reliable shared-memory layer that forms an integral part of the Spark ecosystem, helping achieve the desired throughput and performance by avoiding unnecessary replication.
Tachyon is already used in production at companies like Red Hat, Yahoo, Intel, IBM and Baidu, and its code has contributions from 100+ developers across 30 organizations.
Apache Spark applications can run under three different cluster managers – the built-in standalone cluster manager, Apache Mesos and Hadoop YARN.
Apache Spark is evolving at a rapid pace because of its interactive performance, fault tolerance and productivity benefits, and it is likely to become the industry standard for big data processing in 2016.