Scenario-Based Hadoop Interview Questions to prepare for in 2024

A list of commonly asked Hadoop interview questions and answers for experienced professionals that will help you ace your next Hadoop job interview.

BY ProjectPro


Having completed diverse big data Hadoop projects at ProjectPro, most students have these questions in mind –

  1. “How to prepare for a Hadoop job interview?”
  2. “Where can I find real-time or scenario-based hadoop interview questions and answers for experienced?”


We have answered the first question for our students through a series of blog posts on commonly asked Hadoop interview questions and tips for preparing a Hadoop resume. This blog post aims to familiarize students with a list of Hadoop interview questions and answers for experienced professionals that are scenario-based or real-time questions, the kind most likely to be asked in a Hadoop job interview. What makes answering scenario-based Hadoop interview questions challenging is that there is an almost infinite number of situations the interviewer can ask about, so it is difficult to know in advance what kind of scenario-based question you will be asked.


Scenario-based Hadoop interview questions are a big part of Hadoop job interviews. Big data recruiters and employers use these kinds of interview questions to get an idea of whether you have the competencies and Hadoop skills required for the open Hadoop job position. It is easy to list a set of big data and Hadoop skills on your resume, but you need to demonstrate, to the interviewer's satisfaction, how you go about solving big data problems and gleaning valuable insights.

What would you do if you were presented with the following hadoop interview question-

“How will you select the right hardware for your hadoop cluster?”

The purpose of asking such real-time or scenario-based Hadoop interview questions is to test how you would apply your Hadoop skills and approach a given big data problem. As a Hadoop developer or Hadoop administrator, getting into the details of the Hadoop architecture and studying it carefully almost becomes your key to success in answering real-time Hadoop job interview questions.

To know more about the Hadoop ecosystem and its components, refer to the Hadoop Wiki.


Scenario-Based Hadoop Interview Questions and Answers for Experienced

Real-Time or Scenario-Based Hadoop Interview Questions

1) If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system, logs, etc.) and the initial data size is 600 TB, how will you estimate the number of data nodes (n) required?

Estimating the hardware requirement is always challenging in a Hadoop environment because we never know when a business's data storage demand will increase. We must understand the following factors in detail to decide how many nodes to add to the cluster in the current scenario:

  1. The actual size of data to store – 600 TB
  2. The pace at which the data will grow in the future (per day/week/month/quarter/year) – based on data trend analysis or a business requirement justification (prediction)
  3. We are in the Hadoop world, so the replication factor plays an important role – default 3x replicas
  4. Hardware machine overhead (OS, logs, etc.) – 2 disks were set aside
  5. Intermediate mapper and reducer output on disk – 1x
  6. Disk space utilization between 60% and 70% – as careful designers, we never want our hard drives to run at full capacity
  7. Compression ratio

Let’s do some calculation to find the number of data nodes required to store 600 TB of data:

Rough calculation:

  • Data Size – 600 TB
  • Replication factor – 3
  • Intermediate data – 1
  • Total Storage requirement – (3+1) * 600 = 2400 TB
  • Available disk size for storage – 8 TB
  • Total number of required data nodes (approx.): 2400/8 = 300 machines

Actual Calculation: Rough Calculation + Disk space utilization + Compression ratio

  • Disk space utilization – 65% (differs from business to business)
  • Compression ratio – 2.3
  • Total Storage requirement – 2400/2.3 = 1043.5 TB
  • Available disk size for storage – 8*0.65 = 5.2 TB
  • Total number of required data nodes (approx.): 1043.5/5.2 = 201 machines
  • Actual usable cluster capacity (at 100% disk utilization): (201*8*2.3)/4 = 925 TB

Case: The business has predicted 20% data growth per quarter, and we need to estimate how many new machines will have to be added over the next year

  • Data increase – 20 % over a quarter
  • Additional data:
  • 1st quarter: 1043.5 * 0.2 = 208.7 TB
  • 2nd quarter: 1043.5 * 1.2 * 0.2 = 250.44 TB
  • 3rd quarter: 1043.5 * (1.2)^2 * 0.2 = 300.5 TB
  • 4th quarter: 1043.5 * (1.2)^3 * 0.2 = 360.6 TB
  • Additional data nodes requirement (approx.):
  • 1st quarter: 208.7/5.2 = 41 machines
  • 2nd quarter: 250.44/5.2 = 49 machines
  • 3rd quarter: 300.5/5.2 = 58 machines
  • 4th quarter: 360.6/5.2 = 70 machines

With these numbers, you can extrapolate the additional machines required in the following years by continuing the same 20% quarterly growth pattern.
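To make the arithmetic above concrete, here is a minimal Java sketch that reproduces the node-count estimate; the hard-coded values are the assumptions from this scenario, not defaults of any tool:

  public class ClusterSizingEstimate {
      public static void main(String[] args) {
          double rawDataTb = 600;          // actual data to store
          double replication = 3;          // default HDFS replication factor
          double intermediate = 1;         // 1x for intermediate mapper/reducer output
          double compressionRatio = 2.3;   // assumed average compression
          double diskPerNodeTb = 8;        // 10 x 1 TB disks minus 2 disks for OS/logs
          double diskUtilization = 0.65;   // keep drives at ~65% of capacity

          double totalStorageTb = rawDataTb * (replication + intermediate) / compressionRatio; // ~1043.5 TB
          double usablePerNodeTb = diskPerNodeTb * diskUtilization;                            // 5.2 TB
          long dataNodes = (long) Math.ceil(totalStorageTb / usablePerNodeTb);                 // ~201 nodes

          System.out.printf("Total storage required: %.1f TB%n", totalStorageTb);
          System.out.printf("Data nodes required: %d%n", dataNodes);
      }
  }

Running the same formula with the quarterly data increments above gives the 41, 49, 58, and 70 additional machines listed for each quarter.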

2) You have a directory ProjectPro that has the following files – HadoopTraining.txt, _SparkTraining.txt, #DataScienceTraining.txt, .SalesforceTraining.txt. If you pass the ProjectPro directory to the Hadoop MapReduce jobs, how many files are likely to be processed?

Only HadoopTraining.txt and #DataScienceTraining.txt will be processed by the MapReduce job. When we process files (either in a directory or individually) in Hadoop using any FileInputFormat, such as TextInputFormat, KeyValueTextInputFormat, or SequenceFileInputFormat, we must make sure that none of the files has a hidden-file prefix such as “_” or “.”, because FileInputFormat by default uses the hiddenFileFilter class to ignore all files whose names start with these prefixes:

  // Default filter inside FileInputFormat: any file whose name starts
  // with "_" or "." is treated as hidden and skipped.
  private static final PathFilter hiddenFileFilter = new PathFilter() {
      public boolean accept(Path p) {
          String name = p.getName();
          return !name.startsWith("_") && !name.startsWith(".");
      }
  };

However, we can set our own custom filter, for example via FileInputFormat.setInputPathFilter, to apply additional criteria, but remember that hiddenFileFilter is always active on top of it.
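As an illustration, a custom filter can be registered on the job through FileInputFormat.setInputPathFilter; this is a minimal sketch (the filter class and input path are hypothetical), and the built-in hiddenFileFilter still applies on top of it:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.PathFilter;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  // Hypothetical custom filter: also skip files whose names start with "#".
  public class SkipHashFileFilter implements PathFilter {
      public boolean accept(Path p) {
          return !p.getName().startsWith("#");
      }
  }

  // In the job driver:
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "projectpro-input-filter-demo");
  FileInputFormat.addInputPath(job, new Path("/user/projectpro/ProjectPro"));
  FileInputFormat.setInputPathFilter(job, SkipHashFileFilter.class);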


3) Imagine that you are uploading a file of 500 MB into HDFS. 100 MB of data has been successfully uploaded and another client wants to read the uploaded data while the upload is still in progress. What will happen in such a scenario? Will the 100 MB of data that has been uploaded be visible?

The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, but for this scenario let us consider the block size to be 100 MB, which means the file will be split into 5 blocks, each replicated 3 times (the default replication factor). Let's walk through how a block is written to HDFS:

We have 5 blocks (A/B/C/D/E) for the file, a client, a namenode, and datanodes. First, the client takes Block A and approaches the namenode for the datanode locations where this block and its replicas should be stored. Once the client has the datanode information, it reaches out to the datanode directly and starts copying Block A, which is simultaneously replicated to the other two datanodes. Once the block is copied and replicated, the client receives confirmation that Block A is stored and then initiates the same process for the next block, Block B.

So, during this process, if the first 100 MB block has been written to HDFS and the client has started writing the next block, the first block will be visible to readers. Only the block currently being written is not visible to readers.
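For completeness, a writer can also make partially written data visible to other clients explicitly by flushing it. Below is a minimal sketch with a hypothetical path, assuming the standard FSDataOutputStream.hflush() behaviour of pushing the written bytes to the datanode pipeline so that new readers can see them even while the block is still open:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  byte[] chunk = new byte[64 * 1024 * 1024];   // a 64 MB portion of the 500 MB upload
  try (FSDataOutputStream out = fs.create(new Path("/user/projectpro/upload.dat"))) {
      out.write(chunk);
      out.hflush();   // bytes written so far become visible to new readers,
                      // even though the current block is still being written
  }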


4) When decommissioning the nodes in a Hadoop Cluster, why should you stop all the task trackers?

The process for decommissioning a datanode is well documented, and there is plenty of material available on the internet to do so, but what about a task tracker running a MapReduce job on a datanode that is about to be decommissioned? Unlike the datanode, there is no graceful way to decommission a tasktracker. To move a task to another node, we have to rely on letting the task fail and be rescheduled elsewhere on the cluster. It is possible that a task on its final attempt is running on that tasktracker, in which case a final failure may cause the entire job to fail, and unfortunately it is not always possible to prevent this. So, while decommissioning stops the datanode, to move the currently running tasks to another node we need to manually stop the task tracker running on the decommissioned node.

Get More Practice, More Big Data and Analytics Projects, and More guidance.Fast-Track Your Career Transition with ProjectPro

5) When does a NameNode enter the safe mode?

The Namenode is responsible for managing the metadata of the cluster, and if something is missing from the cluster, the Namenode holds the cluster in safe mode. This lets the Namenode check all the necessary information before making the cluster writable to users. There are a couple of reasons for the Namenode to enter safe mode during startup:

i) The Namenode loads the filesystem state from the fsimage and edits log files and then waits for the datanodes to report their blocks, so that it does not prematurely start replicating blocks that already exist elsewhere in the cluster.

ii) It waits for heartbeats from all the datanodes and checks whether any corrupt blocks exist in the cluster. Once the Namenode has verified all this information, it leaves safe mode and makes the cluster accessible. Sometimes we need to manually enter or leave safe mode on the Namenode, which can be done from the command line with “hdfs dfsadmin -safemode enter/leave”.
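The same can be done programmatically; a minimal sketch, assuming the cluster's default FileSystem is HDFS:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.DistributedFileSystem;
  import org.apache.hadoop.hdfs.protocol.HdfsConstants;

  Configuration conf = new Configuration();
  DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
  boolean inSafeMode = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);  // query current state
  dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_ENTER);                     // equivalent of -safemode enter
  dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_LEAVE);                     // equivalent of -safemode leave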

6) Have you ever run a lopsided job that resulted in an out-of-memory error? If yes, how did you handle it?

“OutOfMemoryError” is the most common error in MapReduce jobs because data keeps growing in different sizes, which makes it challenging for a developer to estimate the right amount of memory to allocate for a job. In the Hadoop world, it is not only the administrator's job to look after the configuration; the developer also has the opportunity to manage the configuration of their own jobs. We must make sure that the following properties are set appropriately, considering the resources available in the cluster, to avoid out-of-memory errors:

mapreduce.map.memory.mb: Maximum amount of memory used by a mapper within a container

mapreduce.map.java.opts: Maximum amount of heap size used by a mapper which must be less than the above

mapreduce.reduce.memory.mb: Maximum amount of memory used by a reducer within a container

mapreduce.reduce.java.opts: Maximum amount of heap size used by a reducer which must be less than the above

yarn.scheduler.maximum-allocation-mb: The maximum allocation size allowed for a container; changing it requires administrative privileges.
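As a rough sketch of how a developer can set these per job (the values below are purely illustrative, not recommendations):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  Configuration conf = new Configuration();
  conf.set("mapreduce.map.memory.mb", "2048");          // container size for each map task
  conf.set("mapreduce.map.java.opts", "-Xmx1638m");     // mapper heap, kept below the container size
  conf.set("mapreduce.reduce.memory.mb", "4096");       // container size for each reduce task
  conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");  // reducer heap, kept below the container size
  Job job = Job.getInstance(conf, "memory-tuned-job");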

There are some other factors that may also impact your memory usage, such as spilling data over to disk, which can be tuned using the following configuration properties:

mapreduce.reduce.shuffle.input.buffer.percent

mapreduce.reduce.shuffle.memory.limit.percent

mapreduce.reduce.shuffle.parallelcopies

7) What are the steps followed when running a YARN job, starting from the call to the submitApplication method?

  1. All jobs submitted by a client go to the resource manager. The resource manager is provided with a scheduler, and it is the resource manager's responsibility to determine the resources required to run that particular job.

  2. Once the required resources are determined, the resource manager launches an application master specific to the application that is to be run.

  3. The application master associated with a specific application, also known as the application master daemon, remains available until the job is completed.

  4. The duty of the application master is to negotiate resources from the resource manager. It requests the containers it needs, and the resource manager grants them through its scheduler.

  5. The application master then launches containers on the node managers, preferably on the nodes where the data is available.

  6. Node managers monitor the containers; a node manager is responsible for all containers running on that particular node. Each container gives periodic updates to the application master about the task it is executing.

  7. Once the job is completed, the containers and their resources are freed up, after which the application master informs the resource manager that the job is complete. The client then receives the corresponding update from the resource manager.
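A minimal client-side sketch of submitApplication using the YarnClient API (the ApplicationMaster launch command and resource values are hypothetical):

  import java.util.Collections;
  import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.YarnClient;
  import org.apache.hadoop.yarn.client.api.YarnClientApplication;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;

  YarnConfiguration conf = new YarnConfiguration();
  YarnClient yarnClient = YarnClient.createYarnClient();
  yarnClient.init(conf);
  yarnClient.start();

  YarnClientApplication app = yarnClient.createApplication();
  ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
  appContext.setApplicationName("projectpro-yarn-demo");
  appContext.setQueue("default");
  appContext.setResource(Resource.newInstance(1024, 1));        // memory (MB) and vcores for the AM container
  appContext.setAMContainerSpec(ContainerLaunchContext.newInstance(
          null, null, Collections.singletonList("/bin/launch-app-master.sh"), null, null, null));

  yarnClient.submitApplication(appContext);                      // hands the job over to the resource manager (step 1)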

8) Suppose you want to get an HDFS file into a local directory; how would you go about it?

There are two commands that can be used to get HDFS files into the local file system; both take the HDFS source path followed by the local destination path:

hadoop fs -get <hdfs_source_path> <local_destination_path>
hadoop fs -copyToLocal <hdfs_source_path> <local_destination_path>
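The same can also be done programmatically through the FileSystem API; a minimal sketch with hypothetical paths:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Equivalent of "hadoop fs -get": copy a file from HDFS to the local file system.
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  fs.copyToLocalFile(new Path("/user/projectpro/report.csv"),   // HDFS source (hypothetical)
                     new Path("/tmp/report.csv"));              // local destination (hypothetical)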

9) Suppose you have one table in HBase. It is required to create a Hive table on top of it, where there should not be any manual movement of data. Changes made to the HBase table should be replicated in the Hive table without explicitly making any changes to it. How can you achieve this?

An approach that can be used is to create a Hive table that points to the HBase table as its data source. Existing HBase tables can be mapped to Hive: Hive can be given access to an existing HBase table, containing multiple column families and columns, using the CREATE EXTERNAL TABLE statement. The Hive columns have to be mapped to the column families and columns of the HBase table, and the mapping is validated against the existing table in HBase. Specifying the HBase table name is optional (it defaults to the Hive table name). With this setup, any changes made to the table in HBase are reflected in the table on Hive as well.
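For illustration, the mapping can be created by executing the DDL through Hive's JDBC driver; the table names, column family, and connection URL below are hypothetical:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
       Statement stmt = con.createStatement()) {
      // Map an existing HBase table ("projectpro_hbase") to a Hive external table.
      stmt.execute(
          "CREATE EXTERNAL TABLE hive_on_hbase (rowkey STRING, val STRING) " +
          "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
          "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:val') " +
          "TBLPROPERTIES ('hbase.table.name' = 'projectpro_hbase')");
  }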

10) What command will you use to copy data from one node in Hadoop to another?

hadoop distcp hdfs://source_namenode/apache_hadoop hdfs://dest_namenodeB/Hadoop

11) How can you kill an application running on YARN?

Use the following command to list all the applications running on YARN:

yarn application -list

Identify the application ID of the application you want to kill, and then use the following command to kill it:

yarn application -kill <application_id>

12) In MapReduce tasks, each reduce task writes its output to a file named part-r-nnnnn. Here nnnnn is the partition ID associated with the reduce task. Is it possible to ultimately merge these files? Explain your answer.

The files do not get automatically merged by Hadoop. The number of files generated is equal to the number of reduce tasks that take place. If you need that as input for the next job, there is no need to worry about having separate files. Simply specify the entire directory as input for the next job. If the data from the files must be pulled out of the cluster, they can be merged while transferring the data. 

The following command may be used to merge the files while pulling the data off the cluster:

hadoop fs -cat <hdfs_output_directory>/part-r-* > <local_destination_file>

The merging of the output files from the reduce tasks can also be achieved with the getmerge command:

hadoop fs -getmerge <hdfs_output_directory> <local_destination_file>
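Alternatively, the part files can be merged programmatically; a minimal sketch with hypothetical paths:

  import java.io.FileOutputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  try (OutputStream out = new FileOutputStream("/tmp/merged-output.txt")) {          // local destination
      FileStatus[] parts = fs.globStatus(new Path("/user/projectpro/output/part-r-*"));
      if (parts != null) {
          for (FileStatus part : parts) {
              try (InputStream in = fs.open(part.getPath())) {
                  IOUtils.copyBytes(in, out, conf, false);   // append each reducer output, keep 'out' open
              }
          }
      }
  }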

13) There is a YARN cluster in which the total amount of memory available is 40GB. There are two application queues, ApplicationA and ApplicationB. The queue of ApplicationA has 20GB allocated, while that of ApplicationB has 8GB allocated. Each map task requires an allocation of 32GB. How will the fair scheduler assign the available memory resources under the DRF (Dominant Resource Fairness) policy?

The allocation of resources within a particular queue is controlled separately. Within one queue:

The FairScheduler can apply either the FIFO policy, the Fair policy, or the DominantResourceFairness policy, while the CapacityScheduler may use either the FIFO policy or the Fair policy.

The default scheduling policy of the FairScheduler is the Fair policy, where memory is the only resource considered. The DRF policy uses both memory and CPU as resources and allocates them accordingly. DRF is quite similar to fair scheduling; the difference is that DRF primarily applies to the allocation of resources among queues, which is already heavily handled by queue weights. Hence, the most important job of DRF is to manage multiple resources rather than to enforce equal resource allocation.

In this case, initially, both ApplicationA and ApplicationB have some resources allocated to the jobs in their corresponding queues, so only 12GB (40GB - (20GB + 8GB)) remains free in the cluster. Each queue then requests to run a map task of size 32GB, while the total memory available is only 40GB, so the rest of the required resources has to come from the other dominant resource, CPU. ApplicationA currently holds 20GB and needs another 12GB for its map task to execute; here, the fair scheduler grants the container requesting 12GB of memory to ApplicationA. The memory allocated to ApplicationB is 8GB, and it would require another 24GB to run its map task. Since that much memory is not available, under DRF it uses the 8GB of memory it holds, and the remaining requirement is satisfied from CPU.


14) How does a NameNode know that one of the DataNodes in a cluster is not functioning?

Hadoop clusters follow a master-slave architecture, in which the NameNode acts as the master and the DataNodes act as the slaves. The data nodes contain the actual data that has to be stored and processed in the cluster. Each data node sends a heartbeat message to the NameNode every 3 seconds to confirm that it is active or alive. If the NameNode fails to receive a heartbeat message from a particular data node for more than ten minutes, the NameNode considers that data node to be dead or no longer active. The NameNode then initiates replication of the blocks that were stored on the dead data node to other active data nodes in the cluster. Data nodes are able to talk to each other to rebalance data and its replicas within the cluster; they can copy data around and transfer it to keep the replication factor valid. Each data node also maintains metadata about its local blocks, including a checksum for each block. When data is written to HDFS, the checksum value is written to the data node at the same time, and when the data is read, the same checksum value is used for verification by default.

Data nodes are responsible for sending block reports to the name node at regular intervals and for verifying block checksums. If the checksum value is not correct for a specific block, that block is considered to have disk-level corruption. Because the block information is reported to the name node, the name node knows that there is disk-level corruption and can take the necessary steps to copy the data to alternate locations on other active data nodes so that the replication factor is maintained.

15) How can you determine the number of map tasks and reduce tasks based on requirements?

The performance of Hadoop is heavily influenced by the number of map and reduce tasks. More tasks increase the framework overhead, but they also improve load balancing and reduce the cost of failures. At one end of the spectrum, a single map task and a single reduce task provide no distribution at all; at the other end, the framework may run out of resources to service the number of tasks. The number of map tasks is driven by the number of input splits (by default, roughly one per HDFS block of the input data), while the number of reduce tasks is set explicitly for the job.
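As an illustrative sketch (the values are examples, not recommendations), the split size can be tuned to influence the number of map tasks, and the number of reduce tasks is set directly on the job:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  Configuration conf = new Configuration();
  // Lowering the maximum split size below the block size yields more, smaller map tasks.
  conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(64L * 1024 * 1024)); // 64 MB
  Job job = Job.getInstance(conf, "task-count-example");
  job.setNumReduceTasks(8);   // number of reduce tasks, chosen from data volume and cluster capacity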


16) There are 100 map tasks running, of which 99 have completed and one is running very slowly. The slow-running map task is replicated on a different machine, the output is gathered from whichever attempt completes first, and the other attempts are killed. What is this phenomenon referred to as in Hadoop?

17) There is an external jar file of size 1.5 MB containing all the dependencies required to run your Hadoop MapReduce jobs. How will you copy the jar file to the task tracker, and what are the steps to follow?

18) If there are ‘m’ mappers and ‘r’ reducers in a given Hadoop MapReduce job, how many copy and write operations will be required for the shuffle and sort phase?

19) When a job is run, the properties file is copied to the distributed cache so that the map tasks can access it. How can you access the properties file?

20) How will you calculate the size of your Hadoop cluster?

21) How will you estimate the Hadoop storage requirement, given the size of the data to be moved, the average compression ratio, and the intermediate and replication factors?

22) Given 200 billion unique URLs, how will you find the first unique URL using Hadoop?

 


About the Author

ProjectPro

ProjectPro is the only online platform designed to help professionals gain practical, hands-on experience in big data, data engineering, data science, and machine learning related technologies, with over 270 reusable project templates in data science and big data, each with step-by-step walkthroughs.
