The questions asked at a big data developer or Apache Spark developer job interview may fall into one of the following categories, based on Spark ecosystem components -
In addition, displaying project experience in the following is key -
With the increasing demand from the industry to process big data at a faster pace, Apache Spark is gaining huge momentum in enterprise adoption. Hadoop MapReduce served the need to process big data well, but developers have always needed more flexible tools to keep up with the growing market of midsize big data sets and to process data in real time, within seconds.
To support the momentum for faster big data processing, there is increasing demand for Apache Spark developers who can validate their expertise in implementing best practices for Spark to build complex big data solutions. In collaboration with big data industry experts, we have curated a list of top 50 Apache Spark Interview Questions and Answers that will help students and professionals nail a big data developer interview and bridge the talent supply gap for Spark developers across various industry segments.
Companies like Amazon, Shopify, Alibaba and eBay are adopting Apache Spark for their big data deployments, and the demand for Spark developers is expected to grow exponentially. Google Trends confirms "hockey-stick-like growth" in Spark enterprise adoption and awareness among organizations across various industries. Spark is becoming popular because of its ability to handle event streaming and to process big data faster than Hadoop MapReduce. 2017 is the best time to hone your Apache Spark skills and pursue a fruitful career as a data analytics professional, data scientist or big data developer.
These Apache Spark Projects will help you develop skills which will make you eligible to apply for Spark developer job roles.
Preparation is very important to reduce the nervous energy at any big data job interview. Regardless of the big data expertise and skills one possesses, every candidate dreads the face-to-face big data job interview. Though there is no way of predicting exactly what questions will be asked in any big data or Spark developer job interview, these Apache Spark interview questions and answers might help you prepare for these interviews better.
1) Compare Spark vs Hadoop MapReduce
|Hadoop MapReduce|Apache Spark|
|---|---|
|Does not leverage the memory of the Hadoop cluster to the maximum.|Lets you save data in memory with the use of RDDs.|
|MapReduce is disk oriented.|Spark caches data in memory and ensures low latency.|
|Only batch processing is supported.|Supports real-time processing through Spark Streaming.|
|Is bound to Hadoop.|Is not bound to Hadoop.|
Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.
2) What is Shark?
Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
3) List some use cases where Spark outperforms Hadoop in processing.
4) What is a Sparse Vector?
A sparse vector has two parallel arrays, one for indices and the other for values. Only the non-zero entries are stored, which saves space.
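As an illustration, the two-parallel-arrays layout can be sketched in plain Python. The class name `SimpleSparseVector` and its methods are hypothetical and are not Spark's MLlib API; the sketch only shows the storage idea.

```python
from typing import List


class SimpleSparseVector:
    """Hypothetical sparse vector: parallel lists of indices and values."""

    def __init__(self, size: int, indices: List[int], values: List[float]):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = indices  # positions of the non-zero entries
        self.values = values    # the non-zero entries themselves

    def to_dense(self) -> List[float]:
        # Expand back to a full-length list, filling the gaps with zeros.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, other: List[float]) -> float:
        # Only the stored (non-zero) entries contribute to the dot product.
        return sum(v * other[i] for i, v in zip(self.indices, self.values))


# A vector of length 6 with non-zero entries at positions 1 and 4.
vec = SimpleSparseVector(6, [1, 4], [3.0, 5.0])
dense = vec.to_dense()
```

Only two index/value pairs are stored here instead of six numbers; the saving grows with the proportion of zeros.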
5) What is RDD?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are -
6) Explain about transformations and actions in the context of RDDs.
Transformations are functions that are applied to an RDD to produce a new RDD. They are executed on demand: a transformation does not run until an action is called. Some examples of transformations include map, filter and reduceByKey.
Actions trigger the computation of the RDD transformations and return the results. After an action is performed, the data from the RDD moves back to the driver program on the local machine. Some examples of actions include reduce, collect, first, and take.
7) What are the languages supported by Apache Spark for developing big data applications?
Scala, Java, Python and R are officially supported. Clojure can be used through third-party bindings.
8) Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use the Spark Cassandra Connector.
9) Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
10) Explain about the different cluster managers in Apache Spark
The three cluster managers supported in Apache Spark are Standalone, Apache Mesos and Hadoop YARN.
11) How can Spark be connected to Apache Mesos?
To connect Spark with Mesos-
12) How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
13) Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables that are cached in memory on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which makes retrieval more efficient than an RDD lookup().
14) Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
15) What is lineage graph?
RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persisted RDD is lost, the lost data can be recomputed using the lineage graph information.
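The idea of recomputing lost data from recorded dependencies can be sketched in plain Python. The `LineageNode` class below is a hypothetical stand-in for an RDD and is not how Spark implements lineage internally; it only shows that remembering the parent and the deriving function is enough to rebuild a result.

```python
from typing import Callable, List, Optional


class LineageNode:
    """Hypothetical stand-in for an RDD that remembers how it was derived."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["LineageNode"] = None,
                 fn: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source    # set only for the root dataset
        self.parent = parent    # dependency edge in the lineage graph
        self.fn = fn            # how to derive this node from its parent
        self.cached: Optional[List[int]] = None

    def derive(self, fn: Callable[[List[int]], List[int]]) -> "LineageNode":
        # Record the dependency; nothing is computed yet.
        return LineageNode(parent=self, fn=fn)

    def compute(self) -> List[int]:
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            data = list(self.source or [])
        else:
            # Walk back along the lineage and re-apply the recorded function.
            data = self.fn(self.parent.compute())
        self.cached = data
        return data


root = LineageNode(source=[1, 2, 3, 4])
doubled = root.derive(lambda xs: [x * 2 for x in xs])
over_four = doubled.derive(lambda xs: [x for x in xs if x > 4])

first = over_four.compute()

# Simulate losing the persisted results, then recover them from lineage.
over_four.cached = None
doubled.cached = None
recovered = over_four.compute()
```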
16) How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
17) Explain about the major libraries that constitute the Spark Ecosystem
18) What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
19) What is the significance of Sliding Window operation?
In networking, a sliding window controls the transmission of data packets between computer networks. In Spark Streaming, the library provides windowed computations, where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
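A windowed computation can be sketched in plain Python, treating each inner list as one batch (one RDD of the stream). The function `windowed_sums` and its parameter names are hypothetical and only mirror the window length and slide interval concepts; this is not the Spark Streaming API.

```python
from typing import List


def windowed_sums(batches: List[List[int]], window_length: int,
                  slide_interval: int) -> List[int]:
    """Sum the elements of every window of `window_length` batches,
    moving forward `slide_interval` batches at a time."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        # Combine the batches that fall inside the current window.
        window = batches[end - window_length:end]
        results.append(sum(x for batch in window for x in batch))
    return results


# Five batches; each window covers three batches and slides by two.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)
```

With these inputs the first window covers batches 1 to 3 and the second covers batches 3 to 5, so the middle batch is counted in both windows.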
20) What is a DStream?
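A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations: transformations, which produce a new DStream, and output operations, which write data to an external system.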
21) When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed on all the nodes when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
22) What is Catalyst framework?
Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
23) Name a few companies that use Apache Spark in production.
Pinterest, Conviva, Shopify, OpenTable
24) Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
25) Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:
Sample building engine: determines the stratified samples to be built based on workload history and data distribution.
Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
26) How can you compare Hadoop and Spark in terms of ease of use?
Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL lovers - making it comparatively easier to use than Hadoop.
27) What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of-
Developers need to be careful with this, as Spark makes use of memory for processing.
28) What is the advantage of a Parquet file?
Parquet file is a columnar format file that helps –
29) What are the various data sources available in SparkSQL?
30) How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
31) What are the key features of Apache Spark that you like?
32) What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data for each key and a join() method that combines different RDDs together, based on the elements having the same key.
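The semantics of these two operations can be sketched on plain Python lists of (key, value) tuples. The functions `reduce_by_key` and `join` below are hypothetical single-machine analogues, not the Spark API, and they ignore partitioning and parallelism entirely.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Pair = Tuple[Any, Any]


def reduce_by_key(pairs: List[Pair], fn: Callable[[Any, Any], Any]) -> Dict[Any, Any]:
    """Combine all values that share a key using `fn`."""
    result: Dict[Any, Any] = {}
    for key, value in pairs:
        result[key] = fn(result[key], value) if key in result else value
    return result


def join(left: List[Pair], right: List[Pair]) -> List[Tuple[Any, Tuple[Any, Any]]]:
    """Inner join: pair up values from both sides that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]


sales = [("apple", 2), ("pear", 1), ("apple", 3)]
prices = [("apple", 0.5), ("pear", 0.75)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(list(totals.items()), prices)
```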
33) Which one will you choose for a project –Hadoop MapReduce or Apache Spark?
The answer to this question depends on the given project scenario, as it is known that Spark makes use of memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark varies with the requirements of the project and the budget of the organization.
34) Explain about the different types of transformations on DStreams?
35) Explain about the popular use cases of Apache Spark
Apache Spark is mainly used for
36) Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.
37) What is Spark Core?
It has all the basic functionalities of Spark, like - memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.
38) How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
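The effect of removing elements by key can be sketched on plain Python lists of (key, value) tuples. The helper `subtract_by_key` is a hypothetical single-machine analogue and not the Spark method itself.

```python
from typing import Any, List, Tuple

Pair = Tuple[Any, Any]


def subtract_by_key(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Keep the pairs from `left` whose key does not appear in `right`."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


inventory = [("a", 1), ("b", 2), ("c", 3)]
discontinued = [("b", 99)]
remaining = subtract_by_key(inventory, discontinued)
```

Only the keys of the second collection matter; its values are ignored.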
39) What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
40) What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations, however it is often suggested that users call persist () method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels.
The various storage/persistence levels in Spark are -
41) How does Spark handle monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
42) Does Apache Spark provide check pointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
43) How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without requiring any admin rights.
44) How does Spark use Akka?
Spark uses Akka basically for scheduling. After registering, all the workers request tasks from the master, and the master simply assigns the tasks. Spark uses Akka for messaging between the workers and the master.
45) How can you achieve high availability in Apache Spark?
46) Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
Data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. RDD always has the information on how to build from other datasets. If any partition of a RDD is lost due to failure, lineage helps build only that particular lost partition.
47) Explain about the core components of a distributed Spark application.
48) What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them so that it does not forget, but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
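Python's built-in lazy `map` gives a small, Spark-free analogue of this behaviour. In the sketch below, the helper `traced_double` and the `calls` list are hypothetical names used only to record when work actually happens.

```python
calls = []


def traced_double(x):
    # Record every invocation so we can see when evaluation happens.
    calls.append(x)
    return x * 2


numbers = [1, 2, 3]

# Analogous to a transformation: the operation is only noted, not executed.
lazy_result = map(traced_double, numbers)
calls_before_action = list(calls)

# Analogous to an action: asking for the final result forces evaluation.
materialized = list(lazy_result)
```

No call to `traced_double` is made until the result is materialized, which is the same deferral Spark applies to transformations.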
49) Define a worker node.
A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
50) What do you understand by SchemaRDD?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
51) What are the disadvantages of using Apache Spark over Hadoop MapReduce?
Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
52) Is it necessary to install Spark on all the nodes of a YARN cluster while running Apache Spark on YARN?
No, it is not necessary because Apache Spark runs on top of YARN.
53) What do you understand by Executor Memory in a Spark application?
Every Spark application has the same fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
54) What does the Spark Engine do?
Spark engine schedules, distributes and monitors the data application across the spark cluster.
55) What makes Apache Spark good at low-latency workloads like graph processing and machine learning?
Apache Spark stores data in memory for faster model building and training. Machine learning algorithms require multiple iterations to generate an optimal model, and similarly graph algorithms traverse all the nodes and edges. Keeping the data in memory across iterations improves the performance of these low-latency workloads. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be processed.
56) Is it necessary to start Hadoop to run any Apache Spark application?
Starting Hadoop is not mandatory to run a Spark application. As Apache Spark has no separate storage layer of its own, it commonly uses Hadoop HDFS, but this is not mandatory. The data can also be stored in, loaded from and processed on the local file system.
57) What is the default level of parallelism in apache spark?
If the user does not specify it explicitly, the number of partitions is taken as the default level of parallelism in Apache Spark.
58) Explain about the common workflow of a Spark program
59) In a given Spark program, how will you identify whether a given operation is a transformation or an action?
One can identify the operation based on the return type -
i) The operation is an action if the return type is anything other than an RDD.
ii) The operation is a transformation if the return type is an RDD.
60) What according to you is a common mistake apache spark developers make when using spark ?
61) Suppose that there is an RDD named ProjectPrordd that contains a huge list of numbers. The following Spark code is written to calculate the average -
def ProjectProAvg(x, y):
    return (x + y) / 2.0
avg = ProjectPrordd.reduce(ProjectProAvg)
What is wrong with the above code and how will you correct it?
The pairwise average function is commutative but not associative, so the result of reduce() depends on how the elements are grouped. The best way to compute the average is to first sum the numbers and then divide by the count, as shown below -
def sum(x, y):
    return x + y
total = ProjectPrordd.reduce(sum)
avg = total / ProjectPrordd.count()
However, the above code could lead to an overflow if the total becomes big. So, a safer way to compute the average is to divide each number by the count and then add up the results, as shown below -
cnt = ProjectPrordd.count()
def divideByCnt(x):
    return x / cnt
myrdd1 = ProjectPrordd.map(divideByCnt)
avg = myrdd1.reduce(sum)
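The order dependence of a pairwise average can be checked without a cluster. The sketch below uses `functools.reduce` on a small plain list as a stand-in for an RDD; the variable names are illustrative. In Spark, the grouping is decided by how the data is partitioned, so the result of such a reduce is not predictable.

```python
from functools import reduce

numbers = [1.0, 2.0, 3.0, 4.0]


def pairwise_avg(x, y):
    # Not associative: avg(avg(a, b), c) != avg(a, avg(b, c)) in general.
    return (x + y) / 2.0


# The same data reduced in two different orders gives two different answers.
left_to_right = reduce(pairwise_avg, numbers)
right_to_left = reduce(pairwise_avg, reversed(numbers))

# Summing first and dividing by the count is independent of grouping.
correct = sum(numbers) / len(numbers)
```

Neither fold order reproduces the true mean here, which is why the sum and the count have to be computed separately.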
62) Compare map() and flatMap() in Spark.
In Spark, map() transformation is applied to each row in a dataset to return a new dataset. flatMap() transformation is also applied to each row of the dataset, but a new flattened dataset is returned. In case of flatMap, if a record is nested (e.g. a column which is in itself made up of a list, array), the data within that record gets extracted and is returned as a new row of the returned dataset.
Both map() and flatMap() transformations are narrow, which means that they do not result in shuffling of data in Spark.
flatMap() is said to be a one-to-many transformation because each input record can produce zero, one or many output records, so the result may contain more (or fewer) rows than the input DataFrame. map() is one-to-one: it returns the same number of records as were present in the input DataFrame.
flatMap() can give a result which contains redundant data in some columns.
flatMap() can be used to flatten a column which contains arrays or lists. It can be used to flatten any other nested collection too.
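The difference in output shape can be sketched with plain Python lists, using a list comprehension as the analogue of map() and `itertools.chain.from_iterable` as the analogue of flatMap(). This is an illustration of the semantics only, not Spark code.

```python
from itertools import chain

lines = ["hello world", "", "apache spark streaming"]

# map-like: exactly one output element (a list of words) per input line.
mapped = [line.split() for line in lines]

# flatMap-like: the per-line lists are flattened into a single sequence,
# so an empty line contributes nothing and a long line contributes many.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
```

Here the map-like result keeps three elements, one per line, while the flattened result has five words.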
1) Explain the difference between Spark SQL and Hive.
2) What is the purpose of BlinkDB?
BlinkDB is an approximate query engine that is built on top of Hive and Spark. Its purpose is to allow users to trade-off query accuracy for a shorter response time and in the process allow interactive queries on the data.
1) Name some sources from where Spark streaming component can process real-time data.
Apache Flume, Apache Kafka, Amazon Kinesis
2) Name some companies that are already using Spark Streaming.
Uber, Netflix, Pinterest.
3) What is the bottom layer of abstraction in the Spark Streaming API ?
4) What do you understand by receivers in Spark Streaming ?
Receivers are special entities in Spark Streaming that consume data from various data sources and move them to Apache Spark. Receivers are usually created by streaming contexts as long running tasks on various executors and scheduled to operate in a round robin manner with each receiver taking a single core.
5) How will you calculate the number of executors required to do real-time processing using Apache Spark? What factors need to be considered when deciding on the number of nodes for real-time processing?
The number of nodes can be decided by benchmarking the hardware and considering multiple factors such as optimal throughput (network speed), memory usage, the execution frameworks being used (YARN, Standalone or Mesos) and considering the other jobs that are running within those execution frameworks along with spark.
6) What is the difference between transform and map on a DStream?
The transform function in Spark Streaming allows developers to apply Apache Spark transformations to the underlying RDDs of the stream. The map function is an element-to-element transformation and can itself be implemented using transform. In short, map works on the elements of a DStream, whereas transform allows developers to work with the RDDs of the DStream: map is an elementary transformation, whereas transform is an RDD transformation.
1) What is SparkContext in PySpark?
A SparkContext represents the entry point to connect to a Spark cluster. It can be used to create RDDs, accumulators and broadcast variables on that particular cluster. Only one SparkContext can be active per JVM, and it has to be stopped before a new one is created. PySpark uses the Py4J library to launch a JVM and create a JavaSparkContext. By default, the PySpark shell makes a SparkContext available as 'sc', so creating another SparkContext there will not work.
2) What is SparkConf in PySpark?
SparkConf allows one to set the configurations and parameters that are needed to run a Spark application. The configuration details are provided through the attributes and setter methods of the SparkConf object.
The SparkConf class in PySpark has the following signature: pyspark.SparkConf(loadDefaults=True, _jvm=None, _jconf=None)
Initially, a SparkConf object can be created with SparkConf(), which will also load the values from the spark.* Java system properties. Parameters that are set explicitly on the SparkConf object take priority over the system properties.
3) What are SparkFiles in PySpark
SparkFiles in PySpark provides access to files that have been uploaded with sc.addFile(), where sc is the default SparkContext in PySpark. SparkFiles.get() can be used to get the path of such a file on a worker; in other words, SparkFiles resolves the paths to files that are added through SparkContext.addFile().
SparkFiles contains the following class methods:
get(filename): returns the path of the file that was added through SparkContext.addFile().
getRootDirectory(): returns the path to the root directory that contains the files added through SparkContext.addFile().
4) Explain serializers in PySpark.
Serializers play an important role in performance tuning in Apache Spark. All data that is sent over the network, written to disk or kept in memory has to be serialized, so the choice of serializer matters for costly operations.
PySpark supports custom serializers, two of which are:
MarshalSerializer: This serializer is faster than the PickleSerializer but supports fewer datatypes.
PickleSerializer: This serializer is slower than the MarshalSerializer, but supports almost all Python data types.
Objects can be serialized in PySpark using the custom serializers.
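The trade-off between the two can be illustrated with Python's standard `marshal` and `pickle` modules directly, on the assumption that the two PySpark serializers build on these modules, as their names suggest. The sketch does not use PySpark itself, and the record contents are made up for illustration.

```python
import datetime
import marshal
import pickle

# A record made only of basic types, and one containing a date object.
simple_record = {"id": 1, "tags": ["a", "b"]}
rich_record = {"id": 1, "created": datetime.date(2017, 1, 1)}

# marshal handles basic built-in types.
simple_roundtrip = marshal.loads(marshal.dumps(simple_record))

# pickle handles a much wider range of Python objects, including dates.
rich_roundtrip = pickle.loads(pickle.dumps(rich_record))

# marshal rejects types it does not know how to encode.
try:
    marshal.dumps(rich_record)
    marshal_handles_date = True
except ValueError:
    marshal_handles_date = False
```

This is the "fewer datatypes" limitation in practice: the faster format works only when the data consists of basic built-in types.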
5) What are some key differences in the Python API (PySpark) compared to the original Apache Spark?
PySpark is an API developed and released by the Apache Spark project to enable Python engineers to work with Spark. Apache Spark itself is written in Scala, and it works well with other languages such as Java, R and Python.
Another point to note is that since Scala is a compile-time, type-safe language, Apache Spark offers certain features that cannot be supported by PySpark, one such example is Datasets. Datasets are a strongly typed collection of domain-specific objects on which computations can be performed in parallel.
6) What are the different cluster managers provided by Apache Spark?
Apache Spark supports several cluster managers, including the following:
Standalone Cluster Manager: The Standalone Cluster Manager is a simple cluster manager which is responsible for the management of resources based on the requirements from applications. The Standalone Cluster Manager is resilient in that it can handle task failures. It is designed in such a way that it has masters and workers, which are configured with a certain amount of allocated memory and CPU cores. Using this cluster manager, Spark allocates resources based on the core.
Apache Mesos: Apache Mesos uses dynamic resource sharing and isolation in order to handle the workload in a distributed environment. Mesos is useful for managing and deploying applications in large-scale clusters. Apache Mesos works by combining existing physical resources present on the nodes in a cluster into a single virtual resource. Apache Mesos contains three components:
Mesos masters: The Mesos master is an instance of the cluster. In order to provide fault tolerance, a cluster will have many Mesos masters. However, only one instance of master is considered the leading master. The Mesos master is in charge of sharing the resources between the applications.
Mesos agent: The Mesos agent is responsible for managing the resources present on physical nodes in order to run the framework.
Mesos frameworks: Applications that run on top of Mesos are referred to as Mesos frameworks. A framework in turn comprises the scheduler, which acts as a controller, and the executor, which carries out the work to be done.
7) What is shuffling in Spark and when does it occur?
In Spark, shuffling is a mechanism by which redistribution of data is performed across partitions. Spark performs shuffling to repartition the data across different executors or across different machines in a cluster. Shuffling, by default, does not change the number of partitions but only the content within the partitions. Shuffling is an expensive operation and is recommended to be avoided as much as possible as it involves data being written to the disk and transferred across the network. Shuffling also involves deserialization and serialization of the data.
Shuffling is performed when a transformation requires data from other partitions. An example is to find the mean of all values in a column. In such cases, Spark will gather the necessary data from various partitions and combine it into a new partition.
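The redistribution step can be sketched in plain Python by modelling partitions as lists and assigning each record to a target partition by the hash of its key. The function `shuffle_by_key` is a hypothetical illustration; it ignores the disk writes, network transfer and serialization that make a real shuffle expensive.

```python
from typing import Any, List, Tuple

Pair = Tuple[Any, Any]


def shuffle_by_key(partitions: List[List[Pair]],
                   num_partitions: int) -> List[List[Pair]]:
    """Move every record to the partition chosen by the hash of its key."""
    shuffled: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Records with the same key always land in the same partition.
            shuffled[hash(key) % num_partitions].append((key, value))
    return shuffled


# Integer keys are used so that the hash values are easy to follow.
partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")], [(2, "e")]]
shuffled = shuffle_by_key(partitions, 2)
```

After the shuffle, all values for a given key sit in one partition, which is what key-based operations need before they can combine them.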
8) What is meant by coalesce in Spark?
Coalesce in Spark is a method used to reduce the number of partitions in a DataFrame. Reducing the number of partitions with the repartition method is an expensive operation because it performs a full shuffle. Coalesce avoids a full shuffle: instead of creating new partitions, it merges the data into a smaller number of the existing partitions. The coalesce method can only be used to decrease the number of partitions. It is best suited to cases where one wants to store the same data in a smaller number of files.
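Merging partitions without redistributing individual records can be sketched in plain Python. The function `coalesce` below is a hypothetical illustration in which neighbouring partitions are simply concatenated; the grouping rule is an assumption made for the example and is not Spark's actual placement logic.

```python
from typing import Any, List


def coalesce(partitions: List[List[Any]], num_partitions: int) -> List[List[Any]]:
    """Merge neighbouring partitions until only `num_partitions` remain."""
    if num_partitions >= len(partitions):
        # Like a shuffle-free coalesce, this never increases the count.
        return [list(p) for p in partitions]
    merged: List[List[Any]] = [[] for _ in range(num_partitions)]
    for i, partition in enumerate(partitions):
        # Whole partitions are appended; records are never split up by key.
        merged[i * num_partitions // len(partitions)].extend(partition)
    return merged


parts = [[1, 2], [3], [4, 5], [6]]
coalesced = coalesce(parts, 2)
unchanged = coalesce(parts, 10)
```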
9) How does Spark Streaming handle caching?
Caching can be handled in Spark Streaming by means of a change in settings on DStreams. A Discretized Stream (DStream) allows users to keep the stream’s data persistent in memory. By using the persist() method on a DStream, every RDD of that particular DStream is kept persistent on memory, and can be used if the data in the DStream has to be used for computation multiple times. Unlike RDDs, in the case of DStreams, the default persistence level involves keeping the data serialized in memory.