MapReduce Interview Questions and Answers for 2018

Last updated on January 11, 2018.

A Hadoop job interview is a tough road to cross, with many pitfalls that can make good opportunities fall off the edge. One often-overlooked part of a Hadoop job interview is thorough preparation. Here's how DeZyre helps you get ready for your interview for a Hadoop developer job role. This blog contains commonly asked Hadoop MapReduce interview questions and answers that will help you ace your next Hadoop job interview.

Without much ado, let's gear you up for your next Hadoop job interview with these commonly asked Hadoop MapReduce interview questions and answers.

Hadoop MapReduce Interview Questions and Answers for 2018

1) Compare RDBMS with Hadoop MapReduce.

RDBMS vs Hadoop MapReduce

  • Size of data: Traditional RDBMS can handle up to gigabytes of data, whereas Hadoop MapReduce can handle petabytes of data or more.
  • Updates: RDBMS supports reading and writing data multiple times; Hadoop follows a write-once, read-many-times model.
  • Schema: RDBMS requires a static schema that is pre-defined; Hadoop has a dynamic schema.
  • Processing model: RDBMS supports both batch and interactive processing; Hadoop MapReduce supports only batch processing.
  • Scalability: RDBMS scales non-linearly; Hadoop scales linearly.


2) Explain the basic parameters of the mapper and reducer functions.

Mapper Function Parameters

The basic parameters of a mapper function are LongWritable, Text, Text, and IntWritable.

LongWritable, Text – input parameters

Text, IntWritable – intermediate output parameters

Here is a sample code on the usage of Mapper function with basic parameters –

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
}

Reducer Function Parameters

The basic parameters of a reducer function are Text, IntWritable, Text, IntWritable.

The first two parameters, Text and IntWritable, represent the intermediate output parameters (the reducer's input).

The next two parameters, Text and IntWritable, represent the final output parameters.
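Hadoop itself is written in Java, but the way these parameter types line up can be illustrated with a pure-Python sketch of a word count job (this is not Hadoop API code, just an analogy for the type flow):

```python
def mapper(offset, line):
    # Input pair: (LongWritable offset, Text line)
    # Intermediate output pairs: (Text word, IntWritable 1)
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Intermediate input: (Text word, IntWritable counts)
    # Final output: (Text word, IntWritable total)
    return word, sum(counts)

pairs = list(mapper(0, "to be or not to be"))
total = reducer("to", [1, 1])
```

The mapper consumes the (offset, line) input pairs and emits the intermediate (word, 1) pairs, which the reducer then aggregates into the final (word, total) output.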


3) How is data split in Hadoop?

The InputFormat used in the MapReduce job creates the splits. The number of mappers is then decided based on the number of splits. Splits are not always created based on the HDFS block size; it all depends on the programming logic within the getSplits() method of the InputFormat.
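The core idea behind getSplits() can be sketched in a few lines of Python (an illustration of the logic, not Hadoop's actual implementation; the function name and parameters are my own):

```python
def get_splits(file_length, split_size):
    """Chop a file of file_length bytes into (offset, length)
    splits of at most split_size bytes, the way an InputFormat's
    getSplits() method divides input for the mappers."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300-byte file with a 128-byte split size yields 3 splits,
# hence 3 mappers.
splits = get_splits(300, 128)
```

Note that the split size here is a parameter of the logic, not a property of HDFS: changing it changes the number of mappers without touching the block layout on disk.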


4) What is the fundamental difference between a MapReduce Split and a HDFS block?

A MapReduce split is a logical piece of data fed to the mapper; it does not contain the data itself but is a reference to it. An HDFS block is a physical piece of data.

5) When is it not recommended to use MapReduce paradigm for large scale data processing?

It is not suggested to use MapReduce for iterative processing use cases, as it is not cost-effective; Apache Pig can be used instead.

6) What happens when a DataNode fails during the write process?

When a DataNode fails during the write process, a new replication pipeline that contains the other DataNodes opens up and the write process resumes from there until the file is closed. NameNode observes that one of the blocks is under-replicated and creates a new replica asynchronously.

7) List the configuration parameters that have to be specified when running a MapReduce job.

  • Input and Output location of the MapReduce job in HDFS.
  • Input and Output Format.
  • Classes containing the Map and Reduce functions.
  • JAR file that contains driver classes and mapper, reducer classes.

8) Is it possible to split 100 lines of input as a single split in MapReduce?

Yes, this can be achieved using the NLineInputFormat class.
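The grouping that NLineInputFormat performs can be sketched in plain Python (an illustration of the idea only; the function name is my own, not part of Hadoop):

```python
def nline_splits(lines, n):
    """Group the input into splits of n lines each, so each mapper
    receives exactly n lines (the last split may be shorter)."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

# 250 input lines with n=100 produce 3 splits, so 3 mappers run.
splits = nline_splits(list(range(250)), 100)
```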

9) Where is Mapper output stored?

The intermediate key-value data of the mapper output is stored on the local file system of the mapper nodes. This directory location is set in the config file by the Hadoop admin. Once the Hadoop job completes execution, the intermediate data is cleaned up.

10) Explain the differences between a combiner and reducer.

A combiner can be considered a mini reducer that performs a local reduce task. It runs on the map output, and its output forms the input to the reducers. It is usually used for network optimization when the map generates a large number of outputs.

  • Unlike a reducer, the combiner has a constraint: its input and output key and value types must match the output types of the mapper.
  • Combiners can operate only on a subset of keys and values, i.e. combiners can be applied only to functions that are commutative and associative.
  • Combiner functions get their input from a single mapper whereas reducers can get data from multiple mappers as a result of partitioning.
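The savings a combiner buys can be shown with a small pure-Python sketch (conceptual only, not Hadoop code; function names are my own):

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for every word on the line."""
    return [(w, 1) for w in line.split()]

def combine(pairs):
    """Combiner: locally sum the counts in one mapper's output
    before anything is sent over the network to the reducers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

raw = map_words("the quick the lazy the dog")       # 6 pairs
combined = combine(raw)                              # 4 pairs
```

Here six raw (word, 1) pairs shrink to four combined pairs, which is exactly the network-transfer reduction the combiner exists for. Summation qualifies because it is commutative and associative, so combining locally first cannot change the final result.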

11) When is it suggested to use a combiner in a MapReduce job?

Combiners are generally used to enhance the efficiency of a MapReduce program by aggregating the intermediate map output locally on individual mapper nodes. This reduces the volume of data that needs to be transferred to the reducers. The reducer code can be reused as the combiner only if the operation performed is commutative and associative. However, the execution of a combiner is not guaranteed.

12) What is the relationship between Job and Task in Hadoop?

A single job can be broken down into one or many tasks in Hadoop.

13)  Is it important for Hadoop MapReduce jobs to be written in Java?

It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R, or Awk, through the Hadoop Streaming API.

14) What is the process of changing the split size if there is limited storage space on Commodity Hardware?

If there is limited storage space on commodity hardware, the split size can be changed by implementing a custom splitter. The call to the custom splitter can be made from the main (driver) method.

15)  What are the primary phases of a Reducer? 

The 3 primary phases of a reducer are –

1) Shuffle

2) Sort

3) Reduce
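The three phases above can be simulated with a short pure-Python sketch (conceptual only; Hadoop performs the shuffle and sort across the network, not in memory like this):

```python
from itertools import groupby
from operator import itemgetter

def shuffle_sort_reduce(mapper_outputs):
    # Shuffle: collect the (key, value) pairs from every mapper.
    pairs = [p for output in mapper_outputs for p in output]
    # Sort: order by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: aggregate the values for each distinct key.
    return [(key, sum(v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

result = shuffle_sort_reduce([[("b", 1), ("a", 1)], [("a", 1)]])
```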

16) What is a TaskInstance? 

The actual map and reduce tasks that run on each slave node are referred to as task instances. Every task instance runs in its own JVM process; by default, a new JVM process is spawned for each task.

17) Can reducers communicate with each other? 

Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.

18) What is the difference between Hadoop and RDBMS?

  • In RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
  • RDBMS is generally used for OLTP processing whereas Hadoop is used for analytical requirements on huge volumes of data.
  • Database cluster in RDBMS uses the same data files in shared storage whereas in Hadoop the storage is independent of each processing node.

19) Can we search files using wildcards?

Yes, it is possible to search for files using wildcards.

20) How is reporting controlled in Hadoop?

The hadoop-metrics.properties file controls reporting.

21) What is the default input type in MapReduce?


Text input is the default input type in MapReduce. The default InputFormat is TextInputFormat, which presents each line of the input as a (LongWritable byte offset, Text line) pair.

22) Is it possible to rename the output file?

Yes, this can be done by implementing the MultipleOutputFormat class.

23) What do you understand by compute and storage nodes?

A storage node is the system where the file system resides to store the data for processing.

A compute node is the system where the actual business logic is executed.

24) When should you use a reducer?

It is possible to process the data without a reducer but when there is a need to combine the output from multiple mappers – reducers are used. Reducers are generally used when shuffle and sort are required.

25) What is the role of a MapReduce partitioner?

The partitioner is responsible for ensuring that the map output is evenly distributed over the reducers. By identifying the reducer for a particular key, the mapper output is routed to the respective reducer.
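The idea behind Hadoop's default HashPartitioner can be sketched in Python (an analogy only: Hadoop hashes with key.hashCode(), while this sketch uses CRC32 so the result is deterministic; the function name is my own):

```python
import zlib

def hash_partition(key, num_reducers):
    """Route a key to a reducer via hash(key) mod numReduceTasks,
    so every occurrence of the same key reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

Because the mapping depends only on the key, all values for a given key end up at one reducer, which is what makes the reduce-side grouping correct.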

26) What is identity Mapper and identity reducer?

IdentityMapper is the default Mapper class in Hadoop. It is executed when no mapper class is defined in the MapReduce job.

IdentityReducer is the default Reducer class in Hadoop. It is executed when no reducer class is defined in the MapReduce job, and it merely passes the input key-value pairs through to the output.

27) What do you understand by the term Straggler ?

A map or reduce task that takes an unusually long time to finish is referred to as a straggler.

Please share your interview experience on mapreduce questions asked in your interview in the comments below to help the big data community.




