Last Update Made on January 3, 2017.
Cracking a Hadoop Admin interview becomes a tedious job if you do not spend enough time preparing for it. This article lists the top Hadoop Admin interview questions and answers that are likely to come up when you interview for Hadoop administration jobs.
In 2010, nobody knew what Hadoop was, and today the elephant in the big data room has become the big data darling. According to Wikibon, the Hadoop market crossed $256 million in vendor revenue in 2012 and is anticipated to increase to $1.7 billion by the end of 2017. Programmers, architects, system administrators and data warehousing professionals are leaving no stone unturned in learning Hadoop for storing and processing large data sets.
Professionals trying for a Hadoop Developer or Hadoop Admin job do not necessarily put much effort into preparing specifically for Hadoop Admin interview questions. People going for Hadoop Developer positions can treat administration questions as one part of their overall Hadoop interview preparation, but people preparing solely for the Hadoop Admin role need to get into the details of Hadoop admin interview questions. In our previous posts, Top 100 Hadoop Interview Questions and Answers and Top 50 Hadoop Interview Questions, we listed the Hadoop interview questions that can be asked in a Hadoop Developer job interview.
Computing research found that the skills gap for Hadoop is one of the biggest in the entire big data spectrum. In the big data space, where Hadoop is used across industries, the importance of Hadoop administration cannot be overlooked. Many industries are hiring Hadoop Administrators to ensure that their big data systems keep ticking in the most complex and dynamic situations. From finance to government, organizations are hiring Hadoop Admins to manage their big data platforms, and demand for these professionals is rising because of the dearth of expert talent.
Without much ado, let’s help you get started on bridging the talent gap by helping you nail your next Hadoop Administration job interview.
How to prepare for a Hadoop Admin Interview?
Hadoop Admin interviews test a candidate’s knowledge of the installation, configuration and maintenance of Hadoop software. A Hadoop Administrator is required to research and implement platform-specific big data solutions based on the requirements of the stakeholders. A candidate appearing for a Hadoop Admin interview needs to be well-versed in the concepts of large-scale data management. To present yourself as a quality candidate for the Hadoop Admin job profile, make sure that you discuss your knowledge and ability to manage Hadoop projects, and that you exhibit multitasking and leadership skills in your specific areas of interest and expertise.
If you have applied for a Hadoop Admin job, it is worth your time to review the Hadoop Admin interview questions listed below while you prepare for your interview.
Hadoop Admin Interview Questions and Answers
1) How will you decide whether you need to use the Capacity Scheduler or the Fair Scheduler?
Fair Scheduling is the process in which resources are assigned to jobs such that all jobs get an equal share of resources over time. The Fair Scheduler can be used under the following circumstances -
i) If you want the jobs to make equal progress instead of following the FIFO order, then you must use Fair Scheduling.
ii) If you have slow connectivity, and data locality plays a vital role and makes a significant difference to the job runtime, then you must use Fair Scheduling.
iii) Use Fair Scheduling if there is a lot of variability in the utilization between pools.
The Capacity Scheduler runs the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize the utilization and throughput of the Hadoop cluster. The Capacity Scheduler can be used under the following circumstances -
i) If the jobs require scheduler determinism, then the Capacity Scheduler can be useful.
ii) The Capacity Scheduler's memory-based scheduling method is useful if the jobs have varying memory requirements.
iii) If you want to enforce resource allocation because you know the cluster utilization and workload very well, then use the Capacity Scheduler.
2) What are the daemons required to run a Hadoop cluster?
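In Hadoop 1.x: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. In Hadoop 2.x (YARN), the ResourceManager and NodeManager take the place of the JobTracker and TaskTracker.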
3) How will you restart a NameNode?
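The easiest way of doing this is to run the stop-all.sh shell script, which stops all the running daemons, and then run start-all.sh, which starts them again, including the NameNode. To restart only the NameNode, hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode can be used instead.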
4) Explain the different schedulers available in Hadoop.
- FIFO Scheduler - This scheduler does not consider the heterogeneity in the system but orders the jobs in a queue based on their arrival times.
- COSHH - This scheduler considers the workload, cluster and user heterogeneity when making scheduling decisions.
- Fair Sharing - This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute jobs.
5) List a few Hadoop shell commands that are used to perform a copy operation.
- fs -put
- fs -copyToLocal
- fs -copyFromLocal
6) What is the jps command used for?
The jps command lists the Java processes running on a machine, so it is used to verify whether the daemons that run the Hadoop cluster are working or not. On a Hadoop 1.x node, the output of jps shows whichever of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker processes are running on that machine.
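As a rough illustration, this check can be scripted. The sketch below assumes Hadoop 1.x daemon names and the usual jps output of one "<pid> <name>" pair per line; the helper name missing_daemons and the sample output are hypothetical, not part of Hadoop.

```python
# Hadoop 1.x daemons we expect to find (an assumption; adjust for your cluster).
EXPECTED_DAEMONS = {
    "NameNode", "SecondaryNameNode", "DataNode", "JobTracker", "TaskTracker",
}

def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <name>"; skip blank or malformed lines.
        if len(parts) >= 2:
            running.add(parts[1])
    return sorted(set(expected) - running)

# Hypothetical output captured from a single machine.
sample = """4821 NameNode
4990 DataNode
5312 JobTracker
5620 Jps
"""
print(missing_daemons(sample))
```

On a real node, the text passed to the function would come from running jps as the user that owns the Hadoop processes.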
7) What are the important hardware considerations when deploying Hadoop in a production environment?
- Memory - The system’s memory requirements will vary between the worker services and the management services, based on the application.
- Operating System - A 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.
- Storage - It is preferable to design a Hadoop platform by moving the compute activity to the data to achieve scalability and high performance.
- Capacity - Large Form Factor (3.5”) disks cost less and store more when compared to Small Form Factor disks.
- Network - Two top-of-rack (TOR) switches per rack provide better redundancy.
- Computational Capacity - This can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.
8) How many NameNodes can you run on a single Hadoop cluster?
Only one active NameNode serves a given HDFS namespace. Hadoop 1.x runs a single NameNode, while Hadoop 2.x high availability adds a standby NameNode alongside the active one.
9) What happens when the NameNode on the Hadoop cluster goes down?
The file system goes offline whenever the NameNode is down.
10) What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
This file sets up the environment in which Hadoop runs and contains variables such as HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The JAVA_HOME variable must be set for Hadoop to run.
11) Apart from using the jps command, is there any other way to check whether the NameNode is working or not?
Use the command /etc/init.d/hadoop-0.20-namenode status.
12) In a MapReduce system, the HDFS block size is 64 MB and there are 3 files of size 127 MB, 64 KB and 65 MB, read with FileInputFormat. Under this scenario, how many input splits are likely to be made by the Hadoop framework?
2 splits each for the 127 MB and 65 MB files and 1 split for the 64 KB file, for a total of 5 splits.
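The arithmetic behind this answer can be sketched in a few lines. The model below is a simplification that assumes one split per HDFS block, rounded up per file; Hadoop's FileInputFormat applies further rules (minimum and maximum split sizes, and some slack on the final split), so counts on a real cluster can differ. The function name count_splits is hypothetical.

```python
import math

MB = 1024 * 1024
KB = 1024

def count_splits(file_size, block_size=64 * MB):
    """Simplified model: one split per HDFS block, rounding up."""
    if file_size == 0:
        return 0
    return math.ceil(file_size / block_size)

# The three files from the question.
files = {"127 MB": 127 * MB, "64 KB": 64 * KB, "65 MB": 65 * MB}
splits = {name: count_splits(size) for name, size in files.items()}
total = sum(splits.values())
print(splits, total)
```

Each file is counted separately because, in this model, a split never spans two files, which is why the tiny 64 KB file still gets a split of its own.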
13) Which command is used to verify if the HDFS is corrupt or not?
The hadoop fsck (file system check) command is used to check HDFS for missing or corrupt blocks.
14) List some use cases of the Hadoop Ecosystem
Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
15) How can you kill a Hadoop job?
hadoop job -kill jobID
16) How can you see all the jobs running in a Hadoop cluster?
The command hadoop job -list gives the list of jobs running in a Hadoop cluster.
17) Is it possible to copy files across multiple clusters? If yes, how can you accomplish this?
Yes, it is possible to copy files across multiple Hadoop clusters, and this can be achieved using distributed copy. The DistCp command (hadoop distcp) is used for intra-cluster or inter-cluster copying.
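A minimal sketch of how such a copy might be scripted is shown here. It only builds the hadoop distcp argument list; the NameNode hosts, port and paths are hypothetical placeholders, and the helper name distcp_command is an assumption for illustration.

```python
def distcp_command(source_uri, target_uri):
    """Build the argument list for a copy with DistCp."""
    return ["hadoop", "distcp", source_uri, target_uri]

# Hypothetical NameNode hosts and paths.
cmd = distcp_command("hdfs://nn1:8020/data/logs", "hdfs://nn2:8020/backup/logs")
print(" ".join(cmd))

# On a machine with the Hadoop client installed, this could be run with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.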
18) Which is the best operating system to run Hadoop?
Linux is the preferred operating system for running Hadoop, with Ubuntu a common choice of distribution. Windows can also be used to run Hadoop, but it leads to several problems and is not recommended.
19) What are the network requirements to run Hadoop?
- SSH is required to launch the server processes on the slave nodes.
- A passwordless SSH connection is required between the master, the secondary machines and all the slaves.
20) The mapred.output.compress property is set to true to make sure that all output files are compressed for efficient space usage on the Hadoop cluster. Suppose a cluster user does not need compressed output for a particular job. What would you suggest that the user do?
If the user does not want compressed output for a particular job, the user should create a separate configuration file that sets the mapred.output.compress property to false. This configuration file should then be loaded as a resource into the job.
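As an illustration, the per-job override file could be generated as follows. This sketch assumes the standard Hadoop configuration layout of property elements with name and value children inside a configuration root; the function name build_job_config is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_job_config(properties):
    """Return Hadoop-style configuration XML for the given name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Override compression for one job only; the cluster default stays untouched.
xml_text = build_job_config({"mapred.output.compress": "false"})
print(xml_text)
```

The resulting text would be saved to a file and loaded into the job, for example through Configuration.addResource in the driver or, for jobs that use the generic options parser, through the -conf command-line option.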
21) What is the best practice to deploy a secondary NameNode?
It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.
22) How often should the NameNode be reformatted?
The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.
23) If Hadoop spawns 100 tasks for a job and one of the tasks fails, what does Hadoop do?
The task will be started again on another TaskTracker, and if it fails 4 times, which is the default limit on attempts (the default value can be changed), the job will be killed.
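The retry behaviour can be pictured with a small simulation. This is only a sketch, not Hadoop code: it assumes a limit of 4 attempts per task as the default, and the function name run_task and its inputs are hypothetical.

```python
def run_task(attempt_outcomes, max_attempts=4):
    """Simulate retries of one task.

    attempt_outcomes: list of booleans, True meaning that attempt succeeded.
    Returns (succeeded, attempts_used).
    """
    attempts = 0
    # Only the first max_attempts outcomes are ever tried.
    for ok in attempt_outcomes[:max_attempts]:
        attempts += 1
        if ok:
            return True, attempts
    return False, attempts

# A task that succeeds on its third attempt keeps the job alive.
print(run_task([False, False, True]))
# A task that exhausts every attempt causes the job to fail.
print(run_task([False] * 4))
```

In this model, a single task that uses up all its attempts is enough to fail the whole job, regardless of how the other 99 tasks fared.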
24) How can you add and remove nodes from the Hadoop cluster?
- To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file, and then the DataNode and TaskTracker daemons should be started on the new node.
- To remove or decommission nodes from the HDFS cluster, the hostnames should be listed in the exclude file (the file named by dfs.hosts.exclude) and hadoop dfsadmin -refreshNodes should be executed. Once decommissioning completes, the hostnames can be removed from the slaves file.
25) You increase the replication level but notice that the data is under-replicated. What could have gone wrong?
Possibly nothing has gone wrong. If there is a huge volume of data, replication takes time that depends on the data size, because the cluster has to copy the data, and it might take a few hours.
26) Explain the different configuration files and where they are located.
The configuration files are located in the “conf” subdirectory. Hadoop has 3 main configuration files: hdfs-site.xml, core-site.xml and mapred-site.xml.
Hadoop Admin Interview Questions
- How will you initiate the installation process if you have to setup a Hadoop Cluster for the first time?
- How will you install a new component or add a service to an existing Hadoop cluster?
- If Hive Metastore service is down, then what will be its impact on the Hadoop cluster?
- How will you decide the cluster size when setting up a Hadoop cluster?
- How can you run Hadoop and real-time processes on the same cluster?
- If you get a connection refused exception when logging onto a machine of the cluster, what could be the reason? How will you solve this issue?
- How can you identify and troubleshoot a long running job?
- How can you decide the heap memory limit for a NameNode and Hadoop Service?
- If the Hadoop services are running slow in a Hadoop cluster, what would be the root cause for it and how will you identify it?
- How many DataNodes can be run on a single Hadoop cluster?
- Configure slots in Hadoop 2.0 and Hadoop 1.0.
- In a high availability setup, if the connectivity between the Standby and Active NameNode is lost, how will this impact the Hadoop cluster?
- What is the minimum number of ZooKeeper services required in Hadoop 2.0 and Hadoop 1.0?
- If the hardware quality of a few machines in a Hadoop cluster is very low, how will it affect the performance of the job and the overall performance of the cluster?
- How does a NameNode confirm that a particular node is dead?
- Explain the difference between blacklist node and dead node.
- How can you increase the NameNode heap memory?
- Configure capacity scheduler in Hadoop.
- After restarting the cluster, if the MapReduce jobs that were working earlier are failing now, what could have gone wrong while restarting?
- Explain the steps to add and remove a DataNode from the Hadoop cluster.
- In a large, busy Hadoop cluster, how can you identify a long running job?
- When NameNode is down, what does the JobTracker do?
- When configuring Hadoop manually, which property file should be modified to configure slots?
- How will you add a new user to the cluster?
- What is the advantage of speculative execution? Under what situations might speculative execution not be beneficial?
Open Ended Hadoop Admin Interview Questions
These interview questions are asked on a case-by-case basis, depending on where you are applying for the role of Hadoop admin, whether you have prior experience in this role, and so on. Please do share your Hadoop Admin interview experience in the comments below.
- Describe your Hadoop journey, along with your roles and responsibilities in your present project.
- Which tool have you used in your project for monitoring clusters and nodes in Hadoop?
- How many nodes do you think can be present in one cluster?
- Have you worked on any go-live project in your organization?
- Which MapReduce version have you configured on your Hadoop cluster?
- Explain any notable Hadoop use case by a company that helped maximize its profitability.
- Can you create a Hadoop cluster from scratch?
- Do you follow a standard procedure to deploy Hadoop?
- How will you manage a Hadoop system?
- Which tool will you prefer to use for monitoring Hadoop and HBase clusters?
The above list gives an overview of the different types of Hadoop Admin interview questions that can be asked. However, the actual questions can vary based on your working experience and the business domain you come from. Do not worry if you are inexperienced, as companies would love to hire you if you are clear on your basics and have hands-on experience working on Hadoop projects. The foremost thing is to prepare well for a career in Hadoop Administration, and with that preparation one can definitely succeed in nailing a Hadoop Admin interview. Strive for excellence and success will follow.
We would love to answer any questions you have about honing your Hadoop skills for a lucrative career, so please leave a comment below.