Last updated on March 3, 2017.
The next in the series of articles highlighting the most commonly asked Hadoop interview questions for each of the tools in the Hadoop ecosystem is Hadoop HDFS Interview Questions and Answers.
|HDFS (Hadoop Distributed File System)||GFS (Google File System)|
|Default block size in HDFS is 128 MB.||Default chunk size in GFS is 64 MB.|
|Only data append operation is possible in HDFS.||GFS allows random file writes.|
|Data is represented in blocks.||Data is represented in chunks.|
|HDFS has the edit log and journal.||GFS has the operation log.|
|Works on Single Write and Multiple Read Model.||Works on Multiple Write and Multiple Read Model.|
Hadoop job interviews can at times be really simple, with commonly asked Hadoop interview questions like: What do you mean by heartbeat in HDFS? Or what do you understand by a mapper and a reducer? However, it becomes difficult when you are not prepared for such common questions and end up spoiling the entire Hadoop interview through lack of preparation. The way candidates answer such simple, straightforward Hadoop interview questions not only shows their understanding of the Hadoop ecosystem but also their genuine interest in the position. To ease this step of the Hadoop job interview, DeZyre presents a list of the most commonly asked HDFS interview questions and answers.
Before we dive into the list of HDFS Interview Questions and Answers for 2018, here is a quick overview of the Hadoop Distributed File System (HDFS) -
HDFS is the key tool for managing pools of big data. It is the primary file system used by Hadoop applications for storing and streaming large datasets reliably. It stores application data and file system metadata separately: application data is stored on servers known as DataNodes, and file system metadata is stored on a dedicated server called the NameNode. HDFS uses a master-slave architecture. Every Hadoop cluster has a single active NameNode, which manages file system operations, while the DataNodes manage data storage on the individual computing nodes.
HDFS does not depend on data protection mechanisms like RAID to make data durable. Instead, it replicates the content on multiple DataNodes to ensure reliability. HDFS breaks the information into pieces and distributes them to various nodes in a Hadoop cluster to allow parallel processing. Each piece of data is copied multiple times and distributed to individual nodes, with at least one copy stored on a different server rack from the others. Consequently, whenever a node crashes, the data can be found elsewhere within the same Hadoop cluster, and processing can resume while the failure is being resolved.
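The rack rule described above can be sketched in a few lines of plain Python. This is a simplified model, not the NameNode's actual placement policy: the function name `place_replicas`, the `nodes_by_rack` mapping and the `seed` parameter are all hypothetical, and the only property it tries to guarantee is that replicas span at least two racks whenever more than one rack exists.

```python
import random


def place_replicas(nodes_by_rack, replication=3, seed=0):
    """Pick `replication` distinct (rack, node) pairs, using two racks if possible.

    Simplified illustration only; the real HDFS policy has more rules.
    """
    rng = random.Random(seed)
    all_nodes = [(rack, node)
                 for rack, nodes in nodes_by_rack.items()
                 for node in nodes]
    if replication > len(all_nodes):
        raise ValueError("not enough DataNodes for the requested replication")

    first = rng.choice(all_nodes)
    chosen = [first]

    # Prefer a node on a different rack for the second replica.
    off_rack = [n for n in all_nodes if n[0] != first[0]]
    if off_rack and replication > 1:
        chosen.append(rng.choice(off_rack))

    # Fill any remaining slots from nodes that have not been used yet.
    remaining = [n for n in all_nodes if n not in chosen]
    rng.shuffle(remaining)
    chosen.extend(remaining[: replication - len(chosen)])
    return chosen


cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(cluster))
```

With two racks available, losing a whole rack still leaves at least one replica of every block reachable.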
The two popular utilities or commands to measure HDFS space consumed are hdfs dfs -du and hdfs dfsadmin -report. HDFS provides reliable storage by copying data to multiple nodes. The number of copies it creates is referred to as the replication factor, which is usually greater than one (the default is 3).
It is not good practice to use HDFS for a large number of small files, because the NameNode is an expensive, high-performance system that holds the metadata for every file. Filling the NameNode's space with the metadata generated for each of many small files is not sensible. For a single large file holding the same data, HDFS needs far less metadata and provides better performance.
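A rough back-of-the-envelope model makes the small-files cost concrete. The sketch below assumes one NameNode object per file plus one per block, and a commonly quoted rule of thumb of about 150 bytes of NameNode memory per object. Both figures are assumptions for illustration, not measurements, and `namenode_objects` is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150           # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size in bytes


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count one object per file plus one object per block it occupies."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division
        total += 1 + blocks
    return total


one_large = namenode_objects([10 * 1024**3])        # a single 10 GiB file
many_small = namenode_objects([1024**2] * 10240)    # 10,240 files of 1 MiB each

print(one_large, one_large * BYTES_PER_OBJECT)
print(many_small, many_small * BYTES_PER_OBJECT)
```

Both layouts hold the same 10 GiB of data, yet under these assumptions the small-file layout needs 20,480 objects against 81 for the single file.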
Files can be copied from HDFS to the local file system using the following command -
bin/hadoop fs -copyToLocal /hdfs/source/path /localfs/destination/path
HDFS namespace consists of files and directories. Inodes are used to represent these files and directories on the NameNode. Inodes record various attributes like the namespace quota, disk space quota, permissions, modified time and access time.
Hadoop runs on commodity hardware, so there is an increased probability of nodes crashing. To make the entire Hadoop system highly fault tolerant, replication is used even though it creates multiple copies of the same data at different locations. By default, data on HDFS is stored in 3 different locations. If one copy of the data is corrupted and another copy is unavailable due to a technical glitch, the data can still be accessed from the third location without any data loss.
Calculations or transformations are performed on the original data and are not reflected in all the copies of the data. The master node identifies where the original data is located and performs the calculations there. Only if that node is not responding or the data is corrupted will it perform the desired calculations on the second replica.
UNIX has a diff command to compare two local files, but Hadoop has no diff command for HDFS files. However, process substitution can be used in a shell such as bash to feed two HDFS files to the diff command as follows-
diff <(hadoop fs -cat /path/to/file) <(hadoop fs -cat /path/to/file2)
If the goal is just to find whether the two files are similar or not without having to know the exact differences, then a checksum-based approach can also be followed to compare two files. Get the checksums for both files and compare them.
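The checksum approach can be sketched as follows. The example hashes raw byte streams, such as the output of `hadoop fs -cat` read in chunks, rather than calling any Hadoop API. The helper names `stream_checksum` and `same_content` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def stream_checksum(chunks, algorithm="sha256"):
    """Hash an iterable of byte chunks without loading the whole file."""
    digest = hashlib.new(algorithm)
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def same_content(chunks_a, chunks_b):
    """Return True when both streams produce the same checksum."""
    return stream_checksum(chunks_a) == stream_checksum(chunks_b)


# The chunk boundaries do not matter, only the bytes themselves.
print(same_content([b"ab", b"c"], [b"abc"]))
print(same_content([b"abc"], [b"abd"]))
```

Hadoop also ships a `hadoop fs -checksum` command. Its result is derived from per-block CRCs, so two files with identical bytes may report different values if they were written with different block or checksum settings; hashing the byte stream avoids that dependency.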
Using the DistCp tool, huge files can be copied within or between Hadoop clusters.
No, they are present on separate machines, because the Job Tracker is a single point of failure in Hadoop MapReduce: if the Job Tracker goes down, all the running Hadoop jobs will halt.
Yes, it is possible to create multiple files in HDFS with different block sizes using an API. The block size can be specified during the time of file creation. Below is the signature of the method that helps achieve this –
public FSDataOutputStream create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException
HDFS supports only exclusive writes, so when one client is already writing a file, another client cannot open that file in write mode. When a client asks the NameNode to open a file for writing, the NameNode grants the client a lease on the file. If another client then requests a lease on the same file, the request is rejected.
The NameNode that works and runs in the Hadoop cluster is referred to as the Active NameNode. The Passive NameNode, also known as the Standby NameNode, is similar to the Active NameNode but comes into action only when the Active NameNode fails. Whenever the Active NameNode fails, the Standby NameNode replaces it, to ensure that the Hadoop cluster is never without a NameNode.
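The lease behaviour can be modelled with a small in-memory table. This is a toy sketch, not the NameNode's lease manager: the class `LeaseManager`, its method names and the use of `PermissionError` are hypothetical, and lease expiry and renewal are left out.

```python
class LeaseManager:
    """Toy model of write leases: at most one writer per path."""

    def __init__(self):
        self._holders = {}  # path -> client id holding the lease

    def open_for_write(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            # A second writer is rejected while the lease is held.
            raise PermissionError(f"{path} is already leased to {holder}")
        self._holders[path] = client

    def close(self, path, client):
        # Only the current holder can release the lease.
        if self._holders.get(path) == client:
            del self._holders[path]

    def holder(self, path):
        return self._holders.get(path)


leases = LeaseManager()
leases.open_for_write("/data/log", "client-a")
try:
    leases.open_for_write("/data/log", "client-b")
except PermissionError as error:
    print("rejected:", error)
```

Once `client-a` closes the file, the same call from `client-b` succeeds.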
The Balancer tool helps achieve this by taking a threshold value as an input parameter, which is a fraction between 0 and 1. The HDFS cluster is said to be balanced if, for every DataNode, the ratio of used space at the node to the total capacity of the node differs from the ratio of used space in the cluster to the total capacity of the cluster by no more than the threshold value. On the command line, hdfs balancer -threshold expects this value as a percentage.
Whenever a DataNode is being decommissioned, it is no longer considered as a target for replication, but it continues to serve read requests until it enters the decommissioned state completely, i.e. until all the blocks on the decommissioning DataNode have been replicated.
A set of existing nodes can be removed using the decommissioning feature to reduce the size of a large cluster. The nodes that have to be removed should be added to the exclude file, whose name is specified through the configuration parameter dfs.hosts.exclude. The decommissioning process can be ended by editing the exclude file or the configuration file.
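The balance condition translates directly into code. The sketch below is a plain-Python check of the definition, not the Balancer itself: `is_balanced` is a hypothetical helper, each node is given as a `(used, capacity)` pair, and the threshold is expressed as a fraction.

```python
def is_balanced(nodes, threshold):
    """Check the balance condition for a list of (used, capacity) pairs.

    `threshold` is a fraction between 0 and 1.
    """
    cluster_used = sum(used for used, _ in nodes)
    cluster_capacity = sum(capacity for _, capacity in nodes)
    cluster_ratio = cluster_used / cluster_capacity

    # Every node's utilisation must stay within `threshold` of the cluster's.
    return all(abs(used / capacity - cluster_ratio) <= threshold
               for used, capacity in nodes)


nodes = [(50, 100), (60, 100), (40, 100)]  # cluster utilisation is 0.5
print(is_balanced(nodes, 0.15))
print(is_balanced(nodes, 0.05))
```

A smaller threshold demands a more evenly filled cluster, so the Balancer has to move more data to satisfy it.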
The state in which NameNode does not perform replication or deletion of blocks is referred to as Safe Mode in Hadoop. In safe mode, NameNode only collects block reports information from the DataNodes.
The command below is used to enter Safe Mode manually –
$ hdfs dfsadmin -safemode enter
Once Safe Mode has been entered manually, it must also be left manually.
The command below is used to leave Safe Mode manually –
$ hdfs dfsadmin -safemode leave
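The Safe Mode behaviour described above, where block reports are still collected but replication is not performed, can be pictured with a toy state model. Everything here is hypothetical (`NameNodeModel`, `SafeModeError` and the method names) and the model ignores how a real NameNode decides to leave Safe Mode automatically.

```python
class SafeModeError(RuntimeError):
    """Raised when a block operation is attempted in Safe Mode."""


class NameNodeModel:
    """Toy NameNode: accepts block reports always, replicates only outside Safe Mode."""

    def __init__(self):
        self.safe_mode = False
        self.block_reports = {}       # DataNode name -> set of block ids
        self.replication_queue = []   # blocks scheduled for re-replication

    def enter_safe_mode(self):
        self.safe_mode = True

    def leave_safe_mode(self):
        self.safe_mode = False

    def receive_block_report(self, datanode, block_ids):
        # Block reports are collected in every state, including Safe Mode.
        self.block_reports[datanode] = set(block_ids)

    def schedule_replication(self, block_id):
        if self.safe_mode:
            raise SafeModeError("NameNode is in Safe Mode")
        self.replication_queue.append(block_id)


namenode = NameNodeModel()
namenode.enter_safe_mode()
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
try:
    namenode.schedule_replication("blk_1")
except SafeModeError as error:
    print("refused:", error)
```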
The size of a file can be larger than the size of a single disk within the network. Blocks from a single file need not be stored on the same disk and can make use of different disks present in the Hadoop cluster. This simplifies the entire storage subsystem providing fault tolerance and high availability.
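The way a file larger than any one disk is handled can be illustrated by splitting it into fixed-size blocks and spreading them out. This is a sketch under simple assumptions: `split_into_blocks` and `spread_over_disks` are hypothetical helpers, and the round-robin assignment stands in for the NameNode's real placement decisions.

```python
MIB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MIB):
    """Return an (offset, length) pair for every block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks


def spread_over_disks(blocks, disks):
    """Assign blocks to disks round-robin (illustrative policy only)."""
    layout = {disk: [] for disk in disks}
    for index, block in enumerate(blocks):
        layout[disks[index % len(disks)]].append(block)
    return layout


blocks = split_into_blocks(300 * MIB)
print(len(blocks), blocks[-1][1] // MIB)
print(spread_over_disks(blocks, ["disk-a", "disk-b"]))
```

A 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and no single disk has to hold all of them.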
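Just as many desktop operating systems move deleted files to a recycle bin instead of removing them immediately, HDFS moves deleted files into a trash folder under the user's home directory, /user/<username>/.Trash (for example, /user/hdfs/.Trash for the hdfs user), provided the trash feature is enabled. The trash can be emptied by running the following command-
hdfs dfs -expunge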
21) What does the HDFS error “File could only be replicated to 0 nodes, instead of 1” mean?
This exception occurs when the DataNode is not available to the NameNode (i.e. the client is not able to communicate with the DataNode) due to one of the following reasons –