MapReduce Interview Questions and Answers for 2017

Last Update made on January 11, 2017.

A Hadoop job interview is a tough road to cross, with many pitfalls that can make good opportunities fall off the edge. One often overlooked part of a Hadoop job interview is thorough preparation. So, here’s how DeZyre helps you get ready for your interview for a Hadoop developer job role. This blog contains commonly asked Hadoop MapReduce interview questions and answers that will help you ace your next Hadoop job interview.

Without further ado, let’s prepare you for your next Hadoop job interview with commonly asked Hadoop MapReduce interview questions and answers.

Hadoop MapReduce Interview Questions and Answers for 2017

1) Compare RDBMS with Hadoop MapReduce.

RDBMS vs Hadoop MapReduce

| Feature | RDBMS | MapReduce |
| --- | --- | --- |
| Size of Data | Traditional RDBMS can handle up to gigabytes of data. | Hadoop MapReduce can handle up to petabytes of data or more. |
| Updates | Read and write multiple times. | Read many times but write once model. |
| Schema | Static schema that needs to be pre-defined. | Has a dynamic schema. |
| Processing Model | Supports both batch and interactive processing. | Supports only batch processing. |
| Scalability | Non-linear | Linear |



2) Explain the basic parameters of the mapper and reducer functions.

Mapper Function Parameters

The basic parameters of a mapper function are LongWritable, Text, Text and IntWritable.

LongWritable, Text: input parameters (key and value)

Text, IntWritable: intermediate output parameters (key and value)

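Here is sample code showing a Mapper declared with these basic parameters (old org.apache.hadoop.mapred API; the map method itself is omitted):

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1); // constant count emitted for each word
    private Text word = new Text();                            // reusable output key
    // map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) goes here
}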

Reducer Function Parameters

The basic parameters of a reducer function are Text, IntWritable, Text and IntWritable.

The first two parameters, Text and IntWritable, represent the intermediate output parameters (the reducer's input key and value).

The next two parameters, Text and IntWritable, represent the final output parameters.
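To make the flow of these four types concrete, here is a minimal plain-Java sketch of a word count. It does not use the Hadoop API: long, String and int stand in for LongWritable, Text and IntWritable, and the class name WordCountSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in: long ~ LongWritable, String ~ Text, int ~ IntWritable.
final class WordCountSketch {

    // Mapper: input (byte offset, line) -> intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reducer: intermediate (word, [counts]) -> final (word, total).
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: maps every line, groups values by key in sorted order
    // (standing in for shuffle and sort), then reduces once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline, assuming single-byte characters
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be")));
    }
}
```

Running main prints {be=2, not=1, or=1, to=2}. In a real job the framework, not a driver loop, performs the grouping between map and reduce.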


3) How is data split in Hadoop?

The InputFormat used in the MapReduce job creates the splits. The number of mappers is then decided based on the number of splits. Splits are not always created based on the HDFS block size; it all depends on the programming logic within the getSplits() method of the InputFormat.


4) What is the fundamental difference between a MapReduce split and an HDFS block?

A MapReduce split is a logical piece of data fed to the mapper. It does not contain any data itself; it is just a pointer to the data. An HDFS block is a physical piece of data.
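The idea of a split as a pointer can be sketched in plain Java. The class SplitSketch and its members are hypothetical names, not Hadoop classes. The clamping rule in computeSplitSize mirrors the one used by FileInputFormat, but the rest is a simplification: this sketch ignores host locations and the small tolerance Hadoop allows on the last split.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {

    // A split holds no data: only where to start reading and how many bytes to read.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Clamp the block size between the configured minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Cut a file of fileLength bytes into consecutive splits of at most splitSize bytes.
    static List<Split> getSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (Split s : getSplits(300 * mb, splitSize)) {
            System.out.println("start=" + s.start / mb + "MB length=" + s.length / mb + "MB");
        }
    }
}
```

With a 300 MB file and 128 MB blocks this yields three splits (128 MB, 128 MB and 44 MB), so three mappers. Lowering the maximum split size to 64 MB yields five splits over the same physical blocks, which is why the split count and the block count need not match.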

5) When is it not recommended to use MapReduce paradigm for large scale data processing?

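MapReduce is not recommended for iterative processing use cases. Each iteration runs as a separate job that writes its output to disk and reads it back, which is not cost effective. Apache Pig can be used instead to express such workloads, although Pig scripts themselves typically still execute as MapReduce jobs.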

6) What happens when a DataNode fails during the write process?

When a DataNode fails during the write process, a new replication pipeline containing the remaining DataNodes is opened, and the write resumes from there until the file is closed. The NameNode notices that one of the blocks is under-replicated and creates a new replica asynchronously.

7) List the configuration parameters that have to be specified when running a MapReduce job.

  • Input and Output location of the MapReduce job in HDFS.
  • Input and Output Format.
  • Classes containing the Map and Reduce functions.
  • JAR file that contains driver classes and mapper, reducer classes.

8) Is it possible to split 100 lines of input as a single split in MapReduce?

Yes, this can be achieved using the NLineInputFormat class, which places a fixed number of input lines (here, 100) in each split.
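The effect of line-based splitting can be sketched in plain Java. This is only an illustration of the grouping, not the Hadoop class itself; NLineSketch and splitByLines are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class NLineSketch {

    // Group input lines into consecutive splits of at most linesPerSplit lines each.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            lines.add("record " + i);
        }
        // 250 lines at 100 lines per split: two full splits and one partial split.
        for (List<String> split : splitByLines(lines, 100)) {
            System.out.println(split.size() + " lines");
        }
    }
}
```

Each resulting group would be handed to its own mapper, so a 250-line input at 100 lines per split runs three mappers.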

9) Where is Mapper output stored?

The intermediate key-value data of the mapper output is stored on the local file system of the mapper nodes. This directory location is set in the config file by the Hadoop admin. Once the Hadoop job completes execution, the intermediate data is cleaned up.

10) Explain the differences between a combiner and reducer.

A combiner can be considered a mini reducer that performs a local reduce task. It runs on the map output, and its output becomes the reducer's input. It is usually used for network optimization when the map generates a large number of outputs.

  • Unlike a reducer, the combiner has a constraint that both its input and output key and value types must match the output types of the Mapper.
  • Combiners can be applied only to some operations, i.e. combiners can be executed on functions that are commutative and associative.
  • Combiner functions get their input from a single mapper whereas reducers can get data from multiple mappers as a result of partitioning.

11) When is it suggested to use a combiner in a MapReduce job?

Combiners are generally used to enhance the efficiency of a MapReduce program by aggregating the intermediate map output locally on each mapper node. This helps reduce the volume of data that needs to be transferred to the reducers. Reducer code can be used as a combiner only if the operation performed is commutative and associative. However, the execution of a combiner is not assured: Hadoop may run it zero, one or several times, so the job must produce correct results without it.
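A small plain-Java sketch shows why the operation matters when reducer code is reused as a combiner. The class CombinerSketch and its methods are hypothetical names and no Hadoop API is involved: the same function is applied once per mapper and then again on the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class CombinerSketch {

    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    // Reducer only: every map output value for the key reaches the reducer unchanged.
    static double withoutCombiner(List<List<Double>> mapOutputs, Function<List<Double>, Double> fn) {
        List<Double> all = new ArrayList<>();
        for (List<Double> output : mapOutputs) {
            all.addAll(output);
        }
        return fn.apply(all);
    }

    // Combiner + reducer: fn runs once per mapper, then once more on the partial results.
    static double withCombiner(List<List<Double>> mapOutputs, Function<List<Double>, Double> fn) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> output : mapOutputs) {
            partials.add(fn.apply(output));
        }
        return fn.apply(partials);
    }

    public static void main(String[] args) {
        List<List<Double>> mapOutputs = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0), // values for one key from mapper A
                Arrays.asList(10.0));         // values for the same key from mapper B
        System.out.println(withoutCombiner(mapOutputs, CombinerSketch::sum));  // 16.0
        System.out.println(withCombiner(mapOutputs, CombinerSketch::sum));     // 16.0
        System.out.println(withoutCombiner(mapOutputs, CombinerSketch::mean)); // 4.0
        System.out.println(withCombiner(mapOutputs, CombinerSketch::mean));    // 6.0
    }
}
```

Sum gives the same answer with or without the combiner, so the reducer can double as the combiner. Mean does not (6.0 instead of 4.0), so a mean job would need a combiner that emits partial sums and counts instead.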

12) What is the relationship between Job and Task in Hadoop?

A single job can be broken down into one or many tasks in Hadoop.

13) Is it necessary for Hadoop MapReduce jobs to be written in Java?

It is not necessary to write Hadoop MapReduce jobs in Java. Users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R or Awk, through the Hadoop Streaming API.

14) What is the process of changing the split size if there is limited storage space on Commodity Hardware?

If there is limited storage space on commodity hardware, the split size can be changed by implementing a custom splitter, that is, custom split logic for the job's input. The call to the custom splitter can be made from the main method.

15)  What are the primary phases of a Reducer? 

The 3 primary phases of a reducer are –

1) Shuffle

2) Sort

3) Reduce

16) What is a TaskInstance? 

The actual MapReduce tasks that run on each slave node are referred to as task instances. Every task instance has its own JVM process; by default, a new JVM process is spawned for each task instance.

17) Can reducers communicate with each other? 

Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.

18) What is the difference between Hadoop and RDBMS?

  • In RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
  • RDBMS is generally used for OLTP processing whereas Hadoop is used for analytical requirements on huge volumes of data.
  • Database cluster in RDBMS uses the same data files in shared storage whereas in Hadoop the storage is independent of each processing node.

19) Can we search files using wildcards?

Yes, it is possible to search for files using wildcards.

20) How is reporting controlled in hadoop?

The hadoop-metrics.properties file controls reporting.

21) What is the default input type in MapReduce?

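Text. The default input format is TextInputFormat, which reads plain text files line by line: the byte offset of each line is the key (LongWritable) and the contents of the line are the value (Text).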

22) Is it possible to rename the output file?

Yes, this can be done by implementing a multiple-output format class, for example by extending MultipleOutputFormat and overriding how output file names are generated.

23) What do you understand by compute and storage nodes?

A storage node is the system where the file system resides to store the data for processing.

A compute node is the system where the actual business logic is executed.

24) When should you use a reducer?

A job can process data without a reducer (a map-only job), but when the output from multiple mappers needs to be combined, reducers are used. Reducers are generally used when shuffle and sort are required.

25) What is the role of a MapReduce partitioner?

The MapReduce partitioner is responsible for ensuring that the map output is evenly distributed over the reducers. It identifies the reducer for each key, so that all mapper output for a given key is redirected to the same reducer.
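The rule behind hash partitioning can be sketched in a few lines of plain Java. The formula is the one used by Hadoop's default HashPartitioner; the class PartitionerSketch and its main method are hypothetical scaffolding for illustration.

```java
final class PartitionerSketch {

    // Clear the sign bit so the result is never negative, then take the remainder.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[] {"a", "b", "c", "a"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

Because the result depends only on the key's hash code and the number of reducers, the same key always lands on the same reducer. How evenly keys spread out depends on the hash function and on how skewed the keys are.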

26) What is identity Mapper and identity reducer?

IdentityMapper is the default Mapper class in Hadoop. This mapper is executed when no mapper class is defined in the MapReduce job.

IdentityReducer is the default Reducer class in Hadoop. This reducer is executed when no reducer class is defined in the MapReduce job. This class merely passes the input key-value pairs through to the output directory.

27) What do you understand by the term Straggler ?

A map or reduce task that takes an unusually long time to finish is referred to as a straggler.

Please share your interview experience on mapreduce questions asked in your interview in the comments below to help the big data community.

 



Answers
Q: What do you understand by chain Mapper and chain Reducer?
tomer chaining a mapper / reducer multiple time without the need of the other - reducer/mapper
Jul 31 2017, 09:01 PM
Arvind Chain Mapper allows setting a number of mappers for parallel processing; the output of the first mapper will be the input for the second mapper, and so on
Jun 29 2017, 12:02 PM
Bunny The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task. For each record output by the Reducer, the Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output. The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed after the Reducer or in a chain. This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task. Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed that all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code. Using the ChainMapper and the ChainReducer classes, it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.
May 12 2017, 02:21 PM
Madhu ChainMapper class allows to use multiple mapper classes within a same map task. Same way, chainReducer class allows to use multiple reducer classes within a same reduce task. These multiple mapper and reducer classes will execute sequentially or in a pipeline fashion.
Jan 06 2017, 05:36 AM
RK Recursive Mapper & Reducer
Oct 20 2016, 05:33 PM
