HDFS Interview Questions and Answers for 2018


Last updated on March 3, 2017.

The next in this series of articles highlighting the most commonly asked Hadoop interview questions, covering each of the tools in the Hadoop ecosystem, is Hadoop HDFS Interview Questions and Answers.

Commonly Asked HDFS Interview Questions and Answers for 2018

1) What is the difference between HDFS and GFS?

  • Block size - the default block size in HDFS is 128 MB, while the default chunk size in GFS is 64 MB.
  • Writes - HDFS supports only appending data to existing files, whereas GFS allows random file writes.
  • Data unit - data is represented in blocks in HDFS and in chunks in GFS.
  • Metadata log - HDFS maintains an edit log (journal), while GFS maintains an operation log.
  • Access model - HDFS works on a single-write, multiple-read model, while GFS works on a multiple-write, multiple-read model.

Hadoop job interviews can at times be really simple, with commonly asked questions like "What do you mean by heartbeat in HDFS?" or "What do you understand by a mapper and a reducer?" However, candidates who are not prepared to hear such common questions can end up fumbling the entire Hadoop interview. The way candidates answer these simple, straightforward questions not only reveals their understanding of the Hadoop ecosystem but also signals their genuine interest in the position. To ease this step of the Hadoop job interview, DeZyre presents a list of the most commonly asked HDFS Hadoop interview questions and answers.

Before we dive into the list of HDFS Interview Questions and Answers for 2018, here’s a quick overview on the Hadoop Distributed File System (HDFS) -

HDFS is the key tool for managing pools of big data. It is the primary file system used by Hadoop applications for reliably storing and streaming large datasets. It stores application data and file system metadata separately: application data is stored on servers known as DataNodes, while file system metadata is stored on a dedicated server called the NameNode. HDFS uses a master-slave architecture. Every Hadoop cluster consists of a single NameNode, which manages file system operations, and supporting DataNodes, which manage data storage on the individual compute nodes.

HDFS does not depend on data protection mechanisms like RAID to make data durable; instead, it replicates content on multiple DataNodes to ensure reliability. HDFS breaks data into pieces and distributes them to various nodes in a Hadoop cluster to allow parallel processing. Each piece of data is copied multiple times and distributed to individual nodes, with at least one copy stored on a different server rack than the others. Consequently, whenever a node crashes, the data can be found elsewhere within the same Hadoop cluster, and processing can resume while the failure is being resolved.

HDFS Interview Questions and Answers to prepare for Hadoop Job Interview in 2018



2) How will you measure HDFS space consumed?

The two popular utilities or commands to measure HDFS space consumed are hdfs dfs -du and hdfs dfsadmin -report. HDFS provides reliable storage by copying data to multiple nodes; the number of copies it creates, usually greater than one, is referred to as the replication factor.

  • hdfs dfs -du - this command shows the space consumed by the data itself, without counting replicas.
  • hdfs dfsadmin -report - this command shows the real disk usage, taking replication into account. Therefore, the usage reported by hdfs dfsadmin -report will always be greater than the output of hdfs dfs -du.
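Assuming the default replication factor of 3, the relationship between the two numbers can be illustrated with a quick calculation (the sizes below are hypothetical):

```python
# Hypothetical sizes illustrating why "hdfs dfsadmin -report" shows more
# usage than "hdfs dfs -du": the report counts every replica on disk.
replication_factor = 3          # HDFS default
logical_size_gb = 10            # what "hdfs dfs -du" reflects (one logical copy)

# Real disk consumption includes all replicas.
physical_size_gb = logical_size_gb * replication_factor

print(physical_size_gb)         # 30 GB of actual disk usage for 10 GB of data
```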


3) Is it a good practice to use HDFS for many small files?

It is not a good practice to use HDFS for a large number of small files, because the NameNode is an expensive, high-performance system that holds metadata for every file and block in memory. Occupying NameNode space with the metadata generated for each of many small files is not sensible. If there is one large file holding the same data, it is always a wiser move to use HDFS, because the large file occupies far less metadata and provides optimized performance.
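A commonly quoted rule of thumb is that each file and each block costs on the order of 150 bytes of NameNode heap. Under that assumption (and the default 128 MB block size), a quick sketch shows why many small files are costly:

```python
# Rough sketch: NameNode memory cost of many small files vs. one large file.
# Assumes ~150 bytes of heap per namespace object (file or block) -- a
# commonly quoted rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150
BLOCK_SIZE_MB = 128

def namenode_bytes(num_files, file_size_mb):
    # Each file needs at least one block; ceil-divide to count blocks.
    blocks_per_file = max(1, -(-file_size_mb // BLOCK_SIZE_MB))
    objects = num_files * (1 + blocks_per_file)   # file object + block objects
    return objects * BYTES_PER_OBJECT

one_large = namenode_bytes(1, 1024)     # one 1 GB file -> 1 file + 8 blocks
many_small = namenode_bytes(1024, 1)    # 1024 files of 1 MB -> 2048 objects

print(one_large, many_small)            # 1350 vs 307200 bytes of metadata
```

The same gigabyte of data costs over 200 times more NameNode memory when stored as 1 MB files instead of one large file.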



4) I have a file “Sample” on HDFS. How can I copy this file to the local file system?

This can be accomplished using the following command -

bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path

5) What do you understand by Inodes?

HDFS namespace consists of files and directories. Inodes are used to represent these files and directories on the NameNode. Inodes record various attributes like the namespace quota, disk space quota, permissions, modified time and access time.

6) Replication causes data redundancy, so why is it still preferred in HDFS?

Hadoop works on commodity hardware, where the probability of node failure is relatively high. To make the entire system highly fault tolerant, HDFS replicates data even though this creates multiple copies of the same data at different locations. By default, data on HDFS is stored in three different locations. If one copy of the data is corrupted and another copy is unavailable due to a technical glitch, the data can still be accessed from the third location without any loss.

7) Data is replicated at least thrice on HDFS. Does it imply that any alterations or calculations done on one copy of the data will be reflected in the other two copies also?

Calculations or transformations are performed on one replica of the data and are not propagated to the other copies. The NameNode identifies where the data is located, and the computation runs against that replica. Only if that node is not responding, or the data on it is corrupted, will the desired calculations be performed on another replica.

8) How will you compare two HDFS files?

UNIX has a diff command to compare two files, but there is no equivalent diff command in Hadoop. However, process substitution can be used in the shell to apply diff to HDFS files as follows -

diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)

If the goal is just to find whether the two files are similar or not without having to know the exact differences, then a checksum-based approach can also be followed to compare two files. Get the checksums for both files and compare them.
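As a local sketch of the checksum approach (using Python's hashlib on ordinary files; on a real cluster you could feed the output of hadoop fs -cat to a checksum utility instead):

```python
import hashlib

def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Stream a file through a hash so large files never load fully into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def files_match(path_a, path_b):
    # Equal digests imply (with overwhelming probability) identical content.
    return file_digest(path_a) == file_digest(path_b)
```

HDFS itself also exposes hdfs dfs -checksum <path>, which prints a block-level checksum; note that its output is only directly comparable between files written with the same block size and checksum settings.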


9) How will you copy a huge file of size 80 GB into HDFS in parallel?

Using the DistCp (distributed copy) tool, huge files can be copied in parallel within a Hadoop cluster or between clusters, for example -

hadoop distcp hdfs://namenode1/source/path hdfs://namenode2/destination/path

10) Are Job Tracker and Task Tracker present on the same machine?

No, they run on separate machines. The JobTracker runs on the master node, while TaskTrackers run on the slave (DataNode) machines. The JobTracker is a single point of failure in Hadoop MapReduce; if it goes down, all running Hadoop jobs halt.

11) Can you create multiple files in HDFS with varying block sizes?

Yes, it is possible to create multiple files in HDFS with different block sizes using the FileSystem API, which lets the block size be specified at file creation time. Below is the signature of the create method that helps achieve this -

public FSDataOutputStream create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException

12) What happens if two clients try writing into the same HDFS file?

HDFS supports only exclusive writes, so when one client is already writing a file, another client cannot open the same file in write mode. When a client asks the NameNode to open a file for writing, the NameNode grants that client a lease on the file; any other client requesting a lease on the same file is rejected.
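The exclusive-write lease can be sketched conceptually as follows (a simplified toy model, not the actual NameNode implementation, which also handles lease expiry, renewal heartbeats, and recovery):

```python
# Toy model of HDFS write leases: at most one writer per file at a time.
class LeaseManager:
    def __init__(self):
        self._leases = {}   # file path -> client currently holding the lease

    def request_lease(self, path, client):
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            return False    # another client is already writing: reject
        self._leases[path] = client
        return True

    def release_lease(self, path, client):
        if self._leases.get(path) == client:
            del self._leases[path]

leases = LeaseManager()
print(leases.request_lease("/data/file.txt", "client-A"))  # True: lease granted
print(leases.request_lease("/data/file.txt", "client-B"))  # False: rejected
leases.release_lease("/data/file.txt", "client-A")
print(leases.request_lease("/data/file.txt", "client-B"))  # True: lease freed
```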

13) What do you understand by Active and Passive NameNodes?

The NameNode that works and runs in the Hadoop cluster is referred to as the Active NameNode. The Passive NameNode, also known as the Standby NameNode, is similar to the Active NameNode but comes into action only when the Active NameNode fails. Whenever the Active NameNode fails, the Passive/Standby NameNode takes over, ensuring that the Hadoop cluster is never without a NameNode.

14) How will you balance the disk space usage on a HDFS cluster?

The Balancer tool helps achieve this. It takes a threshold percentage (10 by default) as an input parameter. The HDFS cluster is said to be balanced if, for every DataNode, the ratio of used space on the node to the total capacity of the node differs from the ratio of used space in the cluster to the total capacity of the cluster by no more than the threshold value.
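The balance condition described above can be written out directly (a sketch with made-up utilization numbers; the threshold is expressed here as a fraction, so 10% = 0.1):

```python
# Sketch of the HDFS Balancer's balance test: a cluster is balanced when
# every DataNode's utilization is within `threshold` of the cluster-wide
# utilization. All ratios here are fractions between 0 and 1.
def is_balanced(node_used, node_capacity, cluster_used, cluster_capacity,
                threshold=0.1):
    node_ratio = node_used / node_capacity
    cluster_ratio = cluster_used / cluster_capacity
    return abs(node_ratio - cluster_ratio) <= threshold

# Cluster at 50% utilization overall (hypothetical numbers):
print(is_balanced(55, 100, 500, 1000))   # True:  |0.55 - 0.50| <= 0.1
print(is_balanced(80, 100, 500, 1000))   # False: this node is 30% over
```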

15) If a DataNode is marked as decommissioned, can it be chosen for replica placement?

Whenever a DataNode is marked as decommissioned, it cannot be considered for replica placement, but it continues to serve read requests until the node enters the decommissioned state completely, i.e. until all the blocks on the decommissioning DataNode have been replicated elsewhere.


16) How will you reduce the size of a large cluster by removing a few nodes?

A set of existing nodes can be removed using the decommissioning feature. The nodes to be removed should be added to an exclude file, whose path is stated in the configuration parameter dfs.hosts.exclude, after which the NameNode is told to re-read the file with hdfs dfsadmin -refreshNodes. Removing a node from the exclude file and refreshing again ends the decommissioning process for that node.
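The relevant setting in hdfs-site.xml might look like this (the file path below is a hypothetical example):

```xml
<!-- hdfs-site.xml: point the NameNode at the exclude file.
     The path below is a hypothetical example. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```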

17)  What do you understand by Safe Mode in Hadoop?

The state in which the NameNode performs no replication or deletion of blocks is referred to as Safe Mode in Hadoop. The NameNode enters safe mode automatically at startup, and while in safe mode it only collects block report information from the DataNodes.

18) How will you manually enter and leave Safe Mode in Hadoop?

Below command is used to enter Safe Mode manually -

$ hdfs dfsadmin -safemode enter

Once safe mode has been entered manually, it must also be left manually.

Below command is used to leave Safe Mode manually -

$ hdfs dfsadmin -safemode leave

19) What are the advantages of a block transfer?

The size of a file can be larger than the size of any single disk in the network. Blocks from a single file need not be stored on the same disk; they can make use of different disks across the Hadoop cluster. This simplifies the entire storage subsystem while providing fault tolerance and high availability.
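For example, assuming the default 128 MB block size, a file larger than any single disk is simply split into many blocks (a quick sketch):

```python
import math

# Sketch: how many 128 MB blocks an 80 GB file occupies, and why no
# single disk ever needs to hold the whole file.
BLOCK_SIZE_MB = 128
file_size_mb = 80 * 1024          # 80 GB expressed in MB

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
print(num_blocks)                 # 640 blocks, spread across many disks
```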

20) How will you empty the trash in HDFS?

Just like desktop operating systems that move deleted files into a recycle bin rather than erasing them immediately, HDFS moves deleted files into a trash folder stored at /user/<username>/.Trash. The trash can be emptied by running the following command -

hdfs dfs -expunge

21) What does the HDFS error “File could only be replicated to 0 nodes, instead of 1” mean?

This exception occurs when the NameNode cannot find an available DataNode to write to (i.e. the client is unable to communicate with a DataNode) due to one of the following reasons -

  • The block size in the hdfs-site.xml file is set to a negative value.
  • Network fluctuations between the DataNode and the NameNode, as a result of which the primary DataNode goes down while the write is in progress.
  • The disk of the DataNode is full.
  • The DataNode is busy with block reporting and scanning.


You might be interested to read a series of blogs on the most frequently asked Hadoop Interview Questions-

Top 100 Hadoop Interview Questions and Answers 

Hadoop Developer Interview Questions at Top Tech Companies

Top Hadoop Admin Interview Questions and Answers

Top 50 Hadoop Interview Questions

Hadoop Pig Interview Questions and Answers

Hadoop Hive Interview Questions and Answers

Hadoop MapReduce Interview Questions and Answers




