Java-based Distributed File System for Storing Big Data

HDFS is the primary distributed storage system in Hadoop, managing pools of big data that span large clusters of commodity servers. HDFS is often regarded as the storage bucket of the Hadoop ecosystem: data is dumped into it and sits there until the user wants to export it to another tool for analysis. Any machine that supports the Java programming language can run HDFS.

How does HDFS work?

HDFS uses a master-slave architecture in which each cluster has a NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes (usually there is one DataNode per node in the Hadoop cluster). HDFS stores each file as a sequence of one or more segments, which are placed on particular DataNodes. These segments are called blocks, and a block is the minimum amount of data that HDFS reads or writes in one operation. By default the block size is 64 MB (newer Hadoop releases default to 128 MB), and it can be changed in the configuration file.
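The block arithmetic described above can be sketched in a few lines of Java. This is an illustrative calculation, not Hadoop's own code; the 64 MB block size is the default quoted in this section.

```java
// Sketch: how a file is divided into HDFS-style blocks.
// The block size is configurable; 64 MB is assumed here.
class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Number of blocks needed to store a file of the given size.
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        System.out.println(blockCount(fileSize)); // prints 4 (64 + 64 + 64 + 8 MB)
    }
}
```

Note that only the last block of a file can be smaller than the block size; a 200 MB file occupies three full 64 MB blocks plus one 8 MB block.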

Components of HDFS Architecture:

  • NameNode acts as the master server and manages the cluster metadata. It is responsible for providing client access to files, managing the file system namespace, and regulating file system operations such as renaming, opening, or closing files.
  • DataNode is where the data actually resides and where read and write operations are performed. Based on instructions from the NameNode, a DataNode performs block creation, block deletion, and replication.

HDFS Example

Understanding the Working of HDFS with an Example
Suppose a file contains the contact numbers of people in the US; the numbers of people whose last names start with S are stored on Server 1, those starting with T on Server 2, and so on. In Hadoop, all these pieces of the contact information are stored as blocks across the cluster. To rebuild the complete phonebook, a program must access the blocks from every server in the cluster. To remain highly available when a component fails, HDFS by default keeps three copies of each block, placing the replicas on different servers. Distributing the data this way also lets the Hadoop cluster break work into smaller chunks and run them on all the servers within the cluster for better scalability.
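The phonebook example can be sketched in Java. The round-robin placement below is a simplification for illustration (real HDFS uses a rack-aware placement policy), but it shows why keeping three copies of each chunk lets any single server fail without losing data.

```java
import java.util.*;

// Sketch of the phonebook example: records are partitioned into chunks,
// and each chunk is kept on three servers (the HDFS default replication
// factor), so losing any one server leaves every chunk readable.
class ReplicationSketch {
    static final int REPLICATION = 3;

    // Place each chunk on REPLICATION distinct servers, round-robin style
    // (illustrative only; HDFS's real policy is rack-aware).
    static Map<Integer, List<Integer>> place(int numChunks, int numServers) {
        Map<Integer, List<Integer>> placement = new HashMap<>();
        for (int chunk = 0; chunk < numChunks; chunk++) {
            List<Integer> servers = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                servers.add((chunk + r) % numServers);
            }
            placement.put(chunk, servers);
        }
        return placement;
    }

    // Can every chunk still be read if one server fails?
    static boolean survivesFailure(Map<Integer, List<Integer>> placement, int deadServer) {
        for (List<Integer> servers : placement.values()) {
            boolean readable = false;
            for (int s : servers) if (s != deadServer) readable = true;
            if (!readable) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 26 chunks (one per last-name initial) spread over 5 servers.
        Map<Integer, List<Integer>> placement = place(26, 5);
        System.out.println(survivesFailure(placement, 2)); // prints true
    }
}
```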

What is HDFS good at?

  • HDFS is designed around data reliability, to avoid data loss due to failures. It has proven reliable across different use cases and cluster sizes, from internet giants like Yahoo and Facebook to small startups.
  • Highly fault-tolerant; it can detect faults and recover from them automatically.
  • HDFS is open source and can be used with zero licensing and support costs. It uses direct-attached commodity storage and shares the cost of the servers and network it runs on with Hadoop's processing layer (MapReduce), resulting in an extremely low cost per byte of storage.
  • Provides a high level of efficiency, because processing logic is moved to the nodes where the data actually resides and runs there in parallel.
  • HDFS provides the high aggregate bandwidth that MapReduce workloads need, delivering data into the MapReduce layer at rates of up to roughly 2 gigabits per second per computer.
  • HDFS has a simple coherency model: files are written once and read many times.

HDFS Blogs

Hadoop Components and Architecture: Big Data Hadoop Training
The default big data storage layer for Apache Hadoop is HDFS. HDFS is the secret sauce among the Apache Hadoop components, as users can dump huge datasets into HDFS and the data will sit there until the user wants to leverage it for analysis. Click to read more.
What is Hadoop 2.0 High Availability?
Hadoop 2.0 introduced YARN, which can process the terabytes and petabytes of data present in HDFS using various non-MapReduce applications such as Giraph and MPI. Click to read more.
What are the Prerequisites to learn Hadoop?
Professionals exploring opportunities in Hadoop need some basic Linux knowledge to set up Hadoop. We have listed some basic commands that can be used to manage files on HDFS clusters. Click to read more.
Hadoop Explained: How does Hadoop work and how to use it?
Hadoop is used in big data applications that gather data from disparate data sources in different formats. HDFS is flexible in storing diverse data types, irrespective of whether your data contains audio or video files (unstructured), record-level data as in an ERP system (structured), or log and XML files (semi-structured). Click to read more.
5 Reasons Why Business Intelligence Professionals Should Learn Hadoop
Business Intelligence professionals use various tools to draw useful data that is used to generate customized reports, and this is where the Hadoop Distributed File System (HDFS) proves itself. Click here to read more.

HDFS Tutorials

Hadoop HDFS Tutorial
Hadoop HDFS tolerates disk failures by storing multiple copies of each data block on different servers in the Hadoop cluster. Each file is stored as small blocks that are replicated across multiple servers in the cluster. Click to read more.
Tutorial- Hadoop Multinode Cluster Setup on Ubuntu
HDFS is the primary distributed storage system used by Hadoop applications to hold large volumes of data. It is scalable and fault-tolerant and works closely with a wide variety of concurrent data access applications. Click to read more.
Hadoop Impala Tutorial
Impala's main goal is to make SQL-on-Hadoop operations fast and efficient, to appeal to new categories of users and open up Hadoop to new types of use cases. Impala-Hive integration lets you use either Hive or Impala for processing, or to create tables under the single shared file system HDFS, without any changes to the table definitions. Click to read more.

HDFS Interview Questions

  1. What is a block and block scanner in HDFS?

    • Block - The minimum amount of data that can be read or written in HDFS is referred to as a block. The default block size in HDFS is 64 MB. Read more.
  2. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

    • NameNode is at the heart of the HDFS file system and manages the metadata; the contents of files are not stored on the NameNode. Rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. Read more.
  3. What is commodity hardware?

    • Commodity hardware refers to inexpensive, widely available systems that do not offer special high-availability or high-end features. Such machines still need sufficient RAM, because several Hadoop services execute in memory. Read more.

HDFS Questions & Answers

  1. HDFS Commands and Basic Shell Commands

    • I am trying to practice the commands from module 2, but I am getting different results from the document provided in the LMS. Can someone please help me solve this issue? Click to read the answer.
  2. How to fetch data from HDFS into HBase and vice versa?

    • How do HDFS and HBase interact? That is, how do you fetch data from HDFS into HBase, and how do you insert data from HBase into HDFS? Click to read the answer.
  3. When I create a folder in HDFS, where is it created?

    • For example, I created a new folder in HDFS with hadoop fs -mkdir wordcount. If I want to look into this created directory, where should I look among the HDFS folders via WinSCP? Click to read the answer.
  4. How does HDFS split files with non-fixed record sizes, such as log/text files?

    • If HDFS uses a fixed block size, the end of a block may fall in the middle of a log line. If a line of a log/text file is split in the middle, the information from that row cannot be read from one DataNode alone, because the continuation of the line is on another node. Click to read the answer.
  5. File Format and Data manipulation in HDFS

    • How to convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS or Hive/Hcatalog? Click to read the answer.
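Question 4 above (a record straddling a block boundary) has a standard resolution in Hadoop's line-oriented text input handling, which can be sketched as follows: each reader skips the first partial line of its block (unless the block starts at offset 0) and reads past the block's end to finish its last line, so every record is read exactly once across readers. The code below is an illustrative in-memory sketch, not Hadoop's actual reader.

```java
import java.util.*;

// Sketch: cooperative line reading over fixed-size "blocks" of one string.
// A reader owns every line that *starts* inside its block, even if the
// line finishes past the block's end; it skips a line begun in the
// previous block. Together, readers cover every line exactly once.
class SplitReader {
    // Read the lines owned by the block [start, end) of data.
    static List<String> readBlock(String data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // Skip the tail of a line begun in the previous block.
            int nl = data.indexOf('\n', start);
            if (nl < 0) return lines;          // no complete line starts here
            pos = nl + 1;
        }
        while (pos < end) {                    // a line starting before `end` is ours
            int nl = data.indexOf('\n', pos);  // read past `end` to finish it
            if (nl < 0) { lines.add(data.substring(pos)); break; }
            lines.add(data.substring(pos, nl));
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String log = "alpha\nbravo\ncharlie\n";
        // Split at byte 8, which falls in the middle of "bravo".
        List<String> all = new ArrayList<>(readBlock(log, 0, 8));
        all.addAll(readBlock(log, 8, log.length()));
        System.out.println(all); // prints [alpha, bravo, charlie]
    }
}
```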

HDFS Assignments

Anyone working on HDFS will appreciate the following questions, curated by the industry experts at DeZyre. You will need a working knowledge of HDFS to solve them. The first three questions have been solved for you.

  1. Is HDFS designed for lots of small files or bigger files?

    Bigger files - Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
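The rule-of-thumb arithmetic above can be checked with a short sketch. The 150-bytes-per-object figure is the approximation quoted in the answer; actual NameNode memory use varies by version and configuration.

```java
// Sketch of the rule of thumb above: each file, directory, and block
// costs roughly 150 bytes of NameNode heap.
class NamenodeMemory {
    static final long BYTES_PER_OBJECT = 150;

    // One file occupying N blocks = 1 + N namespace objects (file + blocks).
    static long heapBytes(long numFiles, long blocksPerFile) {
        return numFiles * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million files, each occupying a single block.
        long mb = NamenodeMemory.heapBytes(1_000_000, 1) / 1_000_000;
        System.out.println(mb + " MB"); // prints 300 MB
    }
}
```

This is why HDFS favors fewer, bigger files: the NameNode's heap, not disk capacity, bounds the number of files the cluster can hold.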

  2. Which type of nodes store the data in HDFS?

    DataNode - The NameNode is the master and keeps the metadata, while the DataNodes are slaves that store the actual data.

  3. How does the NameNode find out that a DataNode is down?

    DataNodes send heartbeat signals to the NameNode every three seconds. When the NameNode stops receiving heartbeats from a DataNode for a configured period of time, it considers that DataNode to be down and assumes the blocks it held are lost.
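The liveness rule can be sketched as a simple staleness check. The 3-second heartbeat interval matches the answer above; the 10.5-minute timeout is an assumption based on Hadoop's common default of 2 × recheck-interval (5 min) + 10 × heartbeat-interval (3 s), and both are configurable in a real cluster.

```java
// Sketch: how a NameNode-style monitor decides a DataNode is dead.
// Intervals below mirror common Hadoop defaults (assumed, configurable).
class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;          // 3 s heartbeats
    static final long RECHECK_INTERVAL_MS = 5 * 60 * 1_000;   // 5 min recheck
    // Timeout = 2 * recheck + 10 * heartbeat = 630 s (10.5 minutes).
    static final long TIMEOUT_MS =
        2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;

    // True if no heartbeat has arrived within the timeout window.
    static boolean isDead(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long last = 0;
        System.out.println(isDead(last, 600_000)); // prints false (10 min elapsed)
        System.out.println(isDead(last, 700_000)); // prints true (past 630 s)
    }
}
```

Once a DataNode is marked dead, the NameNode schedules re-replication of its blocks from the surviving copies, which is why the timeout is deliberately long: a brief network hiccup should not trigger a cluster-wide copy storm.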

  4. Can HDFS be used for low latency data access?

  5. Why is the block size in HDFS large (around 64 MB-128 MB) compared to disk blocks?

  6. Why do we need replication in HDFS?

  7. What is the advantage of hosting a DataNode and a TaskTracker on the same node?

  8. Can the Secondary NameNode take over as the NameNode if the NameNode fails?
