Apache Spark

Spark: a fast, general-purpose cluster computing framework for large-scale data processing.

Originally developed at UC Berkeley in 2009, Apache Spark is fast becoming enterprises' favorite framework for data processing. Apache Spark is a fast and reliable data processing framework that enables real-time and advanced analytics on Hadoop. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Spark has high-level development APIs in Scala, Java, Python and R. It also supports higher-level tools such as Spark SQL, MLlib and GraphX.

What is Spark in Big Data?

  • Spark was developed to address the limitations of MapReduce computing. It provides users with a robust, fault-tolerant programming abstraction called RDDs - Resilient Distributed Datasets. RDDs operate in direct contrast to MapReduce's cluster computing model.
  • MapReduce reads input data from disk, maps a function over the data, and reduces the results of the map, going back to disk between jobs. Spark RDDs keep data in memory, which suits iterative algorithms that visit the dataset multiple times in a loop, as well as interactive and exploratory data analysis.
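The contrast can be sketched in plain Scala, without Spark itself. The sketch below is only an analogy: `readInput()` is a hypothetical stand-in for a disk read, and the counter shows how often each style rescans the input.

```scala
object IterativeSketch {
  // MapReduce style: every iteration goes back to the input on disk.
  def runMapReduceStyle(iterations: Int): Int = {
    var diskReads = 0
    def readInput(): Seq[Int] = { diskReads += 1; Seq(1, 2, 3, 4) }
    for (_ <- 1 to iterations) readInput().map(_ * 2).sum
    diskReads
  }

  // RDD style: load once, keep the parsed data in memory (like rdd.cache()),
  // then iterate over the in-memory copy.
  def runRddStyle(iterations: Int): Int = {
    var diskReads = 0
    def readInput(): Seq[Int] = { diskReads += 1; Seq(1, 2, 3, 4) }
    val cached = readInput()
    for (_ <- 1 to iterations) cached.map(_ * 2).sum
    diskReads
  }

  def main(args: Array[String]): Unit = {
    println(s"MapReduce-style input scans over 3 iterations: ${runMapReduceStyle(3)}")
    println(s"RDD-style input scans over 3 iterations: ${runRddStyle(3)}")
  }
}
```

For three iterations the MapReduce-style loop scans the input three times, while the RDD-style loop scans it once, which is the essence of Spark's advantage for iterative workloads.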

How does Spark work?

  • Spark allows interactive SQL queries for exploring data. Spark provides a programming abstraction called DataFrames, which also acts as a distributed SQL query engine, enabling unmodified Hadoop Hive queries to run up to 100x faster. Thanks to its ease of use and fault tolerance, Spark integrates with a variety of data sources like HDFS, Flume, Twitter and Kafka. Spark MLlib delivers high-quality algorithms at speeds that leave MapReduce far behind, and is usable from Scala, Java and Python. GraphX is a Spark component that lets users build, transform and scale graph-structured data.
  • Spark Core is the underlying execution engine for the Spark platform, on top of which all other functionality is built. It provides in-memory computing for speed and supports a wide variety of applications.

Apache Spark Blogs

Hadoop vs. Spark: Not Mutually Exclusive but Better Together
Organizations can make the best use of Hadoop capabilities in production environments by integrating Spark with Hadoop. Apache Spark can run directly on top of Hadoop to leverage the storage and cluster managers or Spark can run separately from Hadoop to integrate with other storage and cluster managers. Hadoop has in-built disaster recovery capabilities so the duo collectively can be used for data management and cluster administration for analysis workloads. Click to read more.
Spark SQL vs. Apache Drill-War of the SQL-on-Hadoop Tools
Spark SQL is a real-time, in-memory, parallelized SQL-on-Hadoop engine that borrows some of its features from its predecessor Shark to retain Hive compatibility, and provides querying up to 100x faster than Hive. It is used for ingesting and manipulating data in various formats like JSON, Hive tables, EDWs or Parquet. Spark SQL allows users to run advanced analytics on data, from stream processing to machine learning. Click to read more.
Apache Spark makes Data Processing & Preparation Faster
Businesses can no longer afford to view data preparation tools as an add-on to Big Data analytic solutions. Before we get to the latest buzzword in data preparation tools - Apache Spark - we would like to re-iterate that data preparation tools are not competing with Big Data analytic software. If anything, they are enablers. Click to read more.
Apache Spark Ecosystem and Spark Components
Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features like machine learning, real-time stream processing and graph computation. A 2015 survey of Apache Spark users reported that 91% consider performance a vital factor in its growth. With benchmark performance of running big data applications up to 100 times faster on Hadoop clusters, Apache Spark allows for entirely new use cases that enhance the value of big data. Click to read more.
Scala vs. Python for Apache Spark
Scala and Python are both easy to program with and help data experts become productive fast. Data scientists often learn both Scala for Spark and Python for Spark, but Python is usually the second-favorite language for Apache Spark, as Scala came first. Here are some important factors that can help data scientists or data engineers choose the best programming language for their requirements. Click here to read more.

Apache Spark Tutorials

Tutorial: Introduction to Apache Spark
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009 as an academic project. The initial idea was to build a cluster management tool that could support different kinds of cluster computing systems; the tool that resulted is Mesos. After Mesos was created, developers built a cluster computing framework on top of it, resulting in the creation of Spark. Click to read more.
Step-by-Step Tutorial for Apache Spark Installation
This tutorial presents a step-by-step guide to installing Apache Spark. Spark can be configured with multiple cluster managers like YARN, Mesos etc. It can also run in local mode and standalone mode. Click to read more.

Spark Interview Questions

  1. What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?

    • Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.
    • Spark is up to 100 times faster than Hadoop for big data processing, as it stores data in memory in Resilient Distributed Datasets (RDDs).
    • Spark is easier to program as it comes with an interactive mode.
    • It provides complete recovery using the lineage graph whenever something goes wrong. Read more.
  2. What is Shark?

    • Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data. Read more.
  3. What is a Sparse Vector?

    • A sparse vector has two parallel arrays - one for indices and the other for values. These vectors store only the non-zero entries to save space. Read more.
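A minimal sketch of this layout in plain Scala (the `SparseVector` class below is a hypothetical illustration of the parallel-array idea, not Spark MLlib's own implementation):

```scala
// Only non-zero entries are stored: indices(i) gives the position
// in the dense vector at which values(i) sits.
case class SparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
  // Look up position i: return its stored value, or 0.0 if it was not stored.
  def apply(i: Int): Double = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else 0.0
  }
}

object SparseVectorDemo {
  def main(args: Array[String]): Unit = {
    // Dense form [0.0, 3.5, 0.0, 0.0, 1.2] -> only two entries stored.
    val v = SparseVector(5, Array(1, 4), Array(3.5, 1.2))
    println(v(1)) // 3.5
    println(v(0)) // 0.0
  }
}
```

The space saving comes from storing two short arrays instead of one long, mostly-zero array, which is why MLlib uses this representation for high-dimensional feature vectors.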

Apache Spark Slides

Advanced MapReduce (Part 1)

Introduction to Apache Spark

Apache Spark Videos

Apache Spark Q&A

  1. Where is the course resources folder for winutils.exe?

    • From the first class installation instructions, Step 6 says to "download winutils.exe from the course resources folder", but I could not find such a folder anywhere. At the top of my navigation tabs I have, from left to right: Home, Blog, Tutorials, Contact Us, DeZyre for Business, My Account, Sign Out. Can someone give me a pointer to its location? Much appreciated! Click to read answer.

Apache Spark Assignments

Spark Programming Exercises

Sample Data: The sample data presented here is retail sales data from an online store that needs to be analyzed.

Sales.csv - schema
Column #1 Transaction ID
Column #2 Customer ID
Column #3 Item ID
Column #4 Amount Paid
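Parsing one row of this schema can be sketched in plain Scala; the `Sale` case class and its field names below are hypothetical, chosen only to mirror the four columns above:

```scala
// One record of Sales.csv, matching the four-column schema.
case class Sale(transactionId: String, customerId: String, itemId: String, amountPaid: Double)

object SalesParser {
  // Split a CSV line into the schema's columns.
  def parse(line: String): Sale = {
    val cols = line.split(",")
    Sale(cols(0), cols(1), cols(2), cols(3).toDouble)
  }

  def main(args: Array[String]): Unit = {
    println(parse("1,c101,i305,45.50"))
  }
}
```

The programs below apply the same `split(",")` and column-index logic directly on each RDD row.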

API used - textFile, collect

Program Arguments used - local src/main/resources/sales.csv 

import org.apache.spark.SparkContext

object LoadData {

    def main(args: Array[String]) {
        //create spark context
        val sc = new SparkContext(args(0), "apiexamples")

        //it creates RDD[String] of type MappedRDD

        val dataRDD = sc.textFile(args(1))

        //print the content. Converting to List just for nice formatting
        println(dataRDD.collect().toList)
    }
}

package - apiexamples - serialization  
Arguments - local src/main/resources/sales.csv 2

import org.apache.spark.SparkContext

object ItemFilter {
    def main(args: Array[String]) {

        val sc = new SparkContext(args(0), "apiexamples")
        val dataRDD = sc.textFile(args(1))
        val itemIDToSearch = args(2)

        val itemRows = dataRDD.filter(row => {
            val columns = row.split(",")
            val itemId = columns(2)
            if (itemId.equals(itemIDToSearch)) true
            else false
        })

        //print the rows matching the given item id
        println(itemRows.collect().toList)
    }
}

package - anatomy - CacheExample
Arguments - local src/main/resources/sales.csv

import org.apache.spark.storage.RDDBlockId
import org.apache.spark.{SparkContext, SparkEnv}

object CacheExample {

    def main(args: Array[String]) {

        val sc = new SparkContext(args(0), "cache example")
        val salesData = sc.textFile(args(1), 2)

        val salesByCustomer = salesData.map(value => {
            val colValues = value.split(",")
            (colValues(1), colValues(3).toDouble)
        })

        //mark salesData to be cached when it is first evaluated
        salesData.cache()

        println("salesData rdd id is " + salesData.id)
        println("salesBy customer id is " + salesByCustomer.id)

        val firstPartition = salesData.partitions.head


        println(" the persisted RDD's " + sc.getPersistentRDDs)

        //check whether it's in the cache
        val blockManager = SparkEnv.get.blockManager
        val key = RDDBlockId(salesData.id, firstPartition.index)

        println("before evaluation " + blockManager.get(key))

        // force evaluation so the partition actually gets cached
        salesByCustomer.collect()

        println("after evaluation " + blockManager.get(key))
    }
}