How does HDFS work?
HDFS uses a master/slave architecture in which each cluster has one NameNode that manages file system operations and a set of supporting DataNodes that manage data storage on the individual computing nodes (usually there is one DataNode per node in the Hadoop cluster). HDFS stores data in files, and each file is divided into one or more segments that are stored on particular DataNodes. These segments are referred to as blocks, and a block is the smallest unit of data that HDFS reads or writes. The default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward), and it can be changed in the configuration file.
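The way a file is cut into blocks can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the function name `split_into_blocks` is hypothetical, and the 64 MB default simply mirrors the figure quoted above.

```python
MB = 1024 * 1024

def split_into_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be cut into.

    Hypothetical helper for illustration; 64 MB is the default quoted in
    this article and is an assumption about the cluster configuration.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes

blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore occupies three full blocks plus one 8 MB block, and each of those blocks can live on a different DataNode.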
Components of HDFS Architecture:
- NameNode runs on commodity hardware, manages the cluster metadata, and acts as the master server. It is responsible for providing client access to files, managing the file system namespace, and regulating file system operations such as renaming, opening, or closing files.
- DataNode is where the data actually resides for performing read or write operations. Based on the instructions provided by the NameNode, DataNode performs block creation, block deletion and replication.
Understanding the Working of HDFS with an Example
Suppose a file contains the phone numbers of people in the US; the numbers of people whose last name starts with S are stored on Server 1, those starting with T are stored on Server 2, and so on. In Hadoop, all these pieces of the contact information are stored across the cluster. To rebuild the complete phonebook, a program needs to access the blocks from every server in the cluster. To stay available when a component fails, HDFS replicates each of these smaller pieces: by default every block is stored three times, so two other servers hold a copy in addition to the first. Splitting the data this way also allows the Hadoop cluster to break the work into smaller chunks and run them on all the servers within the cluster for better scalability.
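The phonebook example can be mimicked with a toy placement routine. The sketch below assumes a replication factor of three (the HDFS default) and spreads copies round-robin over the servers; the round-robin rule and the names `place_replicas` and `readable_blocks` are assumptions made for illustration, since real HDFS placement is rack-aware and more involved.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy policy for illustration only; not the placement HDFS really uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have a copy on a live node."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

servers = ["server1", "server2", "server3", "server4"]
placement = place_replicas(num_blocks=4, nodes=servers)
# Even with two servers down, every block still has one live copy.
print(sorted(readable_blocks(placement, {"server1", "server2"})))  # [0, 1, 2, 3]
```

With three copies per block, a block becomes unreadable only if all three of its holders fail at the same time.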
What is HDFS good at?
The design of HDFS is based on data reliability, so that failures do not cause data loss. HDFS has proven to be reliable under different use cases and cluster sizes, from big internet giants like Google and Facebook to small startups.
HDFS is highly fault-tolerant and can easily detect faults for automatic recovery.
HDFS is open-source and can be used with zero licensing and support costs. It uses direct commodity storage and shares the cost of the systems and network it runs on with the computing/processing layer of Hadoop (MapReduce), which results in an extremely low cost per byte of storage.
HDFS provides a high level of efficiency, because data and logic are processed in parallel on the nodes where the data actually resides.
HDFS provides high bandwidth for MapReduce workloads, delivering data into the MapReduce layer at up to 2 gigabits per second per computer.
HDFS has a simple, robust coherency model: files are typically written once and then read many times.
Anyone working on HDFS will appreciate the following questions curated by the industry experts at DeZyre. You will need a working knowledge of HDFS to solve them. The first three questions have been solved for you.
Is HDFS designed for lots of small files or bigger files?
Bigger files - Since the NameNode holds the filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the NameNode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, one million files, each taking one block, amount to two million objects (one million files plus one million blocks), and you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
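The rule of thumb above is easy to turn into a quick estimate. The following sketch only restates that arithmetic; the function name `namenode_memory_bytes` is hypothetical, and the 150 bytes per object is the approximate figure quoted in the answer, not a measured value.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the answer above

def namenode_memory_bytes(num_files, blocks_per_file=1, num_directories=0):
    """Estimate NameNode memory: every file, block and directory is one object."""
    num_objects = num_files + num_files * blocks_per_file + num_directories
    return num_objects * BYTES_PER_OBJECT

# One million single-block files: 2,000,000 objects * 150 bytes each.
print(namenode_memory_bytes(1_000_000) / 1e6, "MB")  # 300.0 MB
```

Because the estimate grows with the number of objects rather than with the number of bytes stored, the same data packed into fewer, larger files needs far less NameNode memory.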
Which type of nodes store the data in HDFS?
DataNode - The NameNode is the master and keeps the metadata, while each DataNode is a slave that stores the actual data.
How does the NameNode find out that a DataNode is down?
Through heartbeats - DataNodes send heartbeat signals to the NameNode (not the other way around), every three seconds by default. When the NameNode stops receiving heartbeats from a DataNode for a certain period of time, it assumes that DataNode is lost and considers it down.
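The heartbeat bookkeeping can be sketched as a small class that remembers when each DataNode was last heard from. This is a simplified model, not NameNode code: the class `HeartbeatMonitor` is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the 630-second timeout is an assumed value chosen to resemble a common default, so check your own cluster's settings.

```python
class HeartbeatMonitor:
    """Toy model of how a master could track worker liveness via heartbeats."""

    def __init__(self, timeout_seconds=630.0):
        # Assumed timeout; a real cluster derives it from its configuration.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id, now):
        """Called whenever a heartbeat arrives from a DataNode."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node_id
            for node_id, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_seconds
        )

monitor = HeartbeatMonitor()
monitor.record_heartbeat("datanode1", now=0)
monitor.record_heartbeat("datanode2", now=0)
monitor.record_heartbeat("datanode1", now=600)  # datanode2 has gone silent
print(monitor.dead_nodes(now=700))  # ['datanode2']
```

The timeout is deliberately much longer than the three-second heartbeat interval, so that a few delayed or dropped heartbeats do not cause a healthy DataNode to be declared lost.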
Can HDFS be used for low latency data access?
Why is the block size in HDFS large (around 64MB-128MB) compared to disk blocks?
Why do we need replication in HDFS?
What is the advantage of hosting the DataNode and the TaskTracker on the same node?
Can the Secondary NameNode take over as NameNode if the NameNode fails?