Last Update made on January 11, 2017.
A Hadoop job interview is a tough road with many pitfalls that can cost you a good opportunity. One often-overlooked part of a Hadoop job interview is thorough preparation. Here is how DeZyre helps you get ready for an interview for a Hadoop developer role. This blog contains commonly asked Hadoop MapReduce interview questions and answers that will help you ace your next Hadoop job interview.
Without further ado, let's get you ready for your next Hadoop job interview with commonly asked Hadoop MapReduce interview questions and answers.
Hadoop MapReduce Interview Questions and Answers for 2017
1) Compare RDBMS with Hadoop MapReduce.
|Criteria||RDBMS||Hadoop MapReduce|
|Size of Data||Traditional RDBMS can handle up to gigabytes of data.||Hadoop MapReduce can handle up to petabytes of data or more.|
|Updates||Read and write multiple times.||Write once, read many times.|
|Schema||Static schema that needs to be pre-defined.||Dynamic schema.|
|Processing Model||Supports both batch and interactive processing.||Supports only batch processing.|
2) Explain the basic parameters of the mapper and reducer functions.
Mapper Function Parameters
The basic parameters of a mapper function are LongWritable, Text, Text and IntWritable.
LongWritable, Text: input parameters (key and value)
Text, IntWritable: intermediate output parameters (key and value)
Here is a sample fragment showing a Mapper declared with these parameters (the map() method body is omitted):
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    // Reusable output value (the count 1) and output key (the current word)
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
}
Reducer Function Parameters
The basic parameters of a reducer function are Text, IntWritable, Text and IntWritable.
The first two parameters, Text and IntWritable, are the intermediate output parameters (the key and value types emitted by the mapper).
The next two parameters, Text and IntWritable, are the final output parameters.
3) How is data split in Hadoop?
The InputFormat used in the MapReduce job creates the splits. The number of mappers is then decided based on the number of splits. Splits are not always created based on the HDFS block size; it all depends on the programming logic within the getSplits() method of the InputFormat.
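As one illustration of how the logic behind getSplits() can decouple split size from block size, here is a minimal plain-Java sketch of the rule that file-based input formats commonly apply: max(minSize, min(maxSize, blockSize)). The class and method names are hypothetical, the sizes are example values, and the ceiling division used to count splits is a simplification of what Hadoop actually does.

```java
// Hypothetical stand-alone sketch; it does not use the Hadoop libraries.
final class SplitSizeSketch {

    // Applies the max(minSize, min(maxSize, blockSize)) rule.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count for one file: ceiling of fileLength / splitSize.
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;   // example block size
        long fileLength = 1024 * mb; // example 1 GB file

        // With no limits configured, the split size equals the block size.
        long byBlock = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Capping the maximum at 64 MB yields smaller splits, so more mappers.
        long capped = computeSplitSize(blockSize, 1, 64 * mb);

        System.out.println(countSplits(fileLength, byBlock)); // prints 8
        System.out.println(countSplits(fileLength, capped));  // prints 16
    }
}
```

Lowering the maximum split size below the block size increases the number of splits, and therefore the number of mappers, without changing how the file is stored in HDFS.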
4) What is the fundamental difference between a MapReduce split and an HDFS block?
A MapReduce split is a logical piece of data fed to the mapper. It does not contain any data itself; it is just a pointer to the data. An HDFS block is a physical piece of data.
5) When is it not recommended to use the MapReduce paradigm for large-scale data processing?
MapReduce is not recommended for iterative processing use cases, as it is not cost-effective for them. Apache Pig can be used instead.
6) What happens when a DataNode fails during the write process?
When a DataNode fails during the write process, a new replication pipeline that contains the other DataNodes opens up and the write process resumes from there until the file is closed. NameNode observes that one of the blocks is under-replicated and creates a new replica asynchronously.
7) List the configuration parameters that have to be specified when running a MapReduce job.
- Input and Output location of the MapReduce job in HDFS.
- Input and Output Format.
- Classes containing the Map and Reduce functions.
- JAR file that contains driver classes and mapper, reducer classes.
8) Is it possible to split 100 lines of input as a single split in MapReduce?
Yes, this can be achieved using the NLineInputFormat class.
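To picture what "a fixed number of lines per split" means, here is a minimal plain-Java sketch. It only mimics the idea behind NLineInputFormat and does not call Hadoop; the class and method names are hypothetical and the input is made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use the Hadoop libraries.
final class NLineSplitSketch {

    // Groups the input lines into consecutive chunks of at most n lines each.
    static List<List<String>> splitByLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 250; i++) {
            lines.add("line " + i);
        }
        // 250 lines at 100 lines per split gives splits of 100, 100 and 50 lines.
        List<List<String>> splits = splitByLines(lines, 100);
        System.out.println(splits.size()); // prints 3
    }
}
```

Each chunk would be handed to its own mapper, so the number of lines per split directly controls the number of map tasks.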
9) Where is Mapper output stored?
The intermediate key-value data of the mapper output is stored on the local file system of the mapper nodes. This directory location is set in the config file by the Hadoop admin. Once the Hadoop job completes execution, the intermediate data is cleaned up.
10) Explain the differences between a combiner and reducer.
A combiner can be considered a mini reducer that performs a local reduce task. It runs on the map output, and its output becomes the reducers' input. It is usually used to reduce network traffic when the map phase generates a large number of output records.
- Unlike a reducer, the combiner has a constraint that its input and output key and value types must match the output types of the Mapper.
- Combiners can be applied only to a subset of operations, i.e. functions that are commutative and associative.
- Combiner functions get their input from a single mapper whereas reducers can get data from multiple mappers as a result of partitioning.
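The points above can be illustrated with a small word-count simulation in plain Java. This is only a sketch under simplifying assumptions: it does not use the Hadoop API, the class and method names are hypothetical, and the two "mappers" are just two input strings.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it does not use the Hadoop libraries.
final class CombinerSketch {

    // Emits (word, 1) for every word, as a word-count mapper would.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Sums the counts per word. The same logic can serve as combiner and
    // reducer because addition is commutative and associative.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapper1 = map("a b a a");
        List<Map.Entry<String, Integer>> mapper2 = map("b a b");

        // The combiner runs on each mapper's output separately.
        Map<String, Integer> combined1 = sumByKey(mapper1);
        Map<String, Integer> combined2 = sumByKey(mapper2);

        // The reducer receives the combined records from all mappers.
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>(combined1.entrySet());
        shuffled.addAll(combined2.entrySet());

        System.out.println(mapper1.size() + mapper2.size()); // prints 7 (records without a combiner)
        System.out.println(shuffled.size());                 // prints 4 (records with a combiner)
        System.out.println(sumByKey(shuffled));              // prints {a=4, b=3}
    }
}
```

The final totals are the same with or without the combiner; only the number of records sent to the reducer changes.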
11) When is it suggested to use a combiner in a MapReduce job?
Combiners are generally used to improve the efficiency of a MapReduce program by aggregating the intermediate map output locally, on each mapper's own output. This reduces the volume of data that needs to be transferred to the reducers. Reducer code can be reused as a combiner only if the operation performed is commutative and associative. However, the execution of a combiner is not guaranteed, so the job must produce correct results even if the combiner never runs.
12) What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
13) Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in Java. Users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R or Awk, through the Hadoop Streaming API.
14) What is the process of changing the split size if there is limited storage space on Commodity Hardware?
If there is limited storage space on commodity hardware, the split size can be changed by implementing the “Custom Splitter”. The call to Custom Splitter can be made from the main method.
15) What are the primary phases of a Reducer?
The 3 primary phases of a reducer are:
- Shuffle
- Sort
- Reduce
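The reducer's phases are commonly described as shuffle, sort and reduce. Here is a minimal plain-Java sketch of that flow for a word count. It is a simulation only, with hypothetical class and method names and made-up mapper outputs; in real Hadoop the shuffle copies data across the network, which is not modelled here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it does not use the Hadoop libraries.
final class ReducerPhasesSketch {

    // Shuffle and sort: gather the pairs from every mapper and group the
    // values by key, with keys kept in sorted order.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: called once per key with all of the values for that key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperA = new ArrayList<>();
        mapperA.add(new AbstractMap.SimpleEntry<>("b", 1));
        mapperA.add(new AbstractMap.SimpleEntry<>("a", 1));

        List<Map.Entry<String, Integer>> mapperB = new ArrayList<>();
        mapperB.add(new AbstractMap.SimpleEntry<>("a", 1));
        mapperB.add(new AbstractMap.SimpleEntry<>("c", 1));

        List<List<Map.Entry<String, Integer>>> all = new ArrayList<>();
        all.add(mapperA);
        all.add(mapperB);

        // Keys arrive at reduce() in sorted order: a, b, c.
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(all).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```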
16) What is a TaskInstance?
The actual Hadoop MapReduce tasks that run on each slave node are referred to as task instances. Every task instance runs in its own JVM process; by default, a new JVM process is spawned for each new task instance.
17) Can reducers communicate with each other?
Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.
18) What is the difference between Hadoop and RDBMS?
- In RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
- RDBMS is generally used for OLTP processing whereas Hadoop is used for analytical requirements on huge volumes of data.
- Database cluster in RDBMS uses the same data files in shared storage whereas in Hadoop the storage is independent of each processing node.
19) Can we search files using wildcards?
Yes, it is possible to search for files using wildcards.
20) How is reporting controlled in Hadoop?
The hadoop-metrics.properties file controls reporting.
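21) What is the default input type in MapReduce?
The default input format is TextInputFormat, which reads the input line by line. The key is the byte offset of the line (LongWritable) and the value is the contents of the line (Text).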
22) Is it possible to rename the output file?
Yes, this can be done by implementing a multiple-output format class (for example, by extending MultipleOutputFormat in the older mapred API).
23) What do you understand by compute and storage nodes?
A storage node is the system where the file system resides to store the data for processing.
A compute node is the system where the actual business logic is executed.
24) When should you use a reducer?
It is possible to process the data without a reducer but when there is a need to combine the output from multiple mappers – reducers are used. Reducers are generally used when shuffle and sort are required.
25) What is the role of a MapReduce partitioner?
The MapReduce partitioner is responsible for ensuring that the map output is evenly distributed over the reducers. It identifies the reducer for a particular key, and the mapper output is redirected to that reducer accordingly.
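As a concrete picture of how a key is mapped to a reducer, here is a minimal plain-Java sketch of the hash-based rule used by Hadoop's default HashPartitioner: (hashCode & Integer.MAX_VALUE) % numReduceTasks. The class name and sample keys are hypothetical, and the sketch does not use the Hadoop API.

```java
// Hypothetical stand-alone sketch; it does not use the Hadoop libraries.
final class PartitionerSketch {

    // Masks off the sign bit so the result is never negative, then maps
    // the hash onto one of the numReduceTasks partitions.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // example value
        String[] keys = {"hadoop", "mapreduce", "hdfs", "hadoop"};
        for (String key : keys) {
            // The same key always lands on the same reducer.
            System.out.println(key + " -> reducer " + getPartition(key, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key's hash, every record with the same key reaches the same reducer. How even the distribution is depends on how well the keys' hash codes spread.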
26) What is identity Mapper and identity reducer?
IdentityMapper is the default Mapper class in Hadoop. This mapper is executed when no mapper class is defined in the MapReduce job.
IdentityReducer is the default Reducer class in Hadoop. This reducer is executed when no reducer class is defined in the MapReduce job. It merely passes the input key-value pairs through to the output directory.
27) What do you understand by the term straggler?
A map or reduce task that takes an unusually long time to finish is referred to as a straggler.
Please share your interview experience on mapreduce questions asked in your interview in the comments below to help the big data community.