Hortonworks' founder predicted that by the end of 2020, 75% of Fortune 2000 companies would be running 1,000-node Hadoop clusters in production. The tiny toy elephant in the big data room has become the most popular big data solution across the globe. However, implementing Hadoop in production still comes with deployment and management challenges around scalability, flexibility and cost-effectiveness.
Many organizations that venture into enterprise adoption of Hadoop, whether driven by business users or by an analytics group within the company, have little knowledge of what a good Hadoop architecture design should look like or how a Hadoop cluster actually works in production. This lack of knowledge leads to a Hadoop cluster design that is more complex than necessary for a particular big data application, making it a pricey implementation. Apache Hadoop was developed with the purpose of having a low-cost, redundant data store that lets organizations leverage big data analytics economically and maximize the profitability of the business.
A good Hadoop architectural design requires various design considerations in terms of computing power, networking and storage. This blog post gives an in-depth explanation of the Hadoop architecture and the factors to be considered when designing and building a Hadoop cluster for production success. But before we dive into the architecture of Hadoop, let us have a look at what Hadoop is and the reasons behind its popularity.
“Data, Data everywhere; not a disk to save it.” This quote feels relatable to most of us today, as we regularly run out of space on our laptops or mobile phones. External hard disks have become a popular solution to this problem. But what about companies like Facebook and Google, whose users are constantly generating new data in the form of pictures, posts, etc., every millisecond? How do these companies handle their data so smoothly that every time we log in to our accounts on these websites, we can access all our chats, emails, etc., without any difficulty? The answer is that these tech giants use frameworks like Apache Hadoop, which make it easy to access information in large datasets using simple programming models. These frameworks achieve this feat through distributed processing of the datasets across clusters of computers. To understand this better, consider the case where we want to transfer 100 GB of data in 10 minutes. One way is to have one colossal computer that can transmit at a lightning speed of 10 GB per minute. Another is to store the data on 10 computers and let each computer transfer its share at 1 GB per minute. The latter task of parallel processing sounds more straightforward than the former, and this is the kind of task that Apache Hadoop achieves.
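The arithmetic behind this example can be sketched in a few lines; the numbers below are the illustrative figures from the paragraph above, not benchmarks:

```python
# Back-of-envelope comparison of serial vs. parallel transfer times
# (illustrative numbers from the 100 GB example above).
total_gb = 100

# One fast machine moving 10 GB per minute:
single_rate_gb_per_min = 10
serial_minutes = total_gb / single_rate_gb_per_min

# Ten commodity machines, each moving 1 GB per minute in parallel;
# each machine only handles its own 10 GB slice:
machines = 10
parallel_rate_gb_per_min = 1
parallel_minutes = (total_gb / machines) / parallel_rate_gb_per_min

print(serial_minutes)    # 10.0 minutes on the single fast machine
print(parallel_minutes)  # 10.0 minutes across ten slower machines
```

Both approaches finish in the same wall-clock time, but the second needs only ten ordinary machines instead of one expensive specialized one, which is exactly the trade-off Hadoop exploits.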
Let us now take a look at how Apache Hadoop implements this parallel processing by understanding the architecture of Hadoop.
Apache Hadoop offers a scalable, flexible and reliable distributed big data computing framework for a cluster of systems with storage capacity and local computing power by leveraging commodity hardware. Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm.
The components that play a vital role in the Hadoop architecture are -
Hadoop Common – It contains the libraries and utilities that other Hadoop modules require;
Hadoop Distributed File System (HDFS) – A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; it is patterned after the UNIX file system and provides POSIX-like semantics.
Hadoop YARN – Introduced in 2012, this is the platform in charge of managing computing resources in clusters and using them to schedule users' applications.
Hadoop MapReduce – This is an implementation of the MapReduce programming model for large-scale data processing. Both Hadoop's MapReduce and HDFS were inspired by Google's papers on MapReduce and the Google File System.
Hadoop Ozone – This is a scalable, redundant, and distributed object store for Hadoop. It is a newer addition (introduced in 2020) to the Hadoop family and, unlike HDFS, it can handle small and large files alike.
A Hadoop skill set requires thorough knowledge of every layer in the Hadoop stack, from understanding the various components in the Hadoop architecture to designing a Hadoop cluster, tuning its performance, and setting up the top chain responsible for data processing.
In the previous section, we talked about how one can use more machines to transfer data faster. But using more machines brings a problem of its own: the more devices in use, the higher the chance of hardware failure. The Hadoop Distributed File System (HDFS) addresses this problem by storing multiple copies of the data, so that another copy can be accessed in case of failure. Another problem with using many machines is that there has to be a concise way of combining the data using a standard programming model. Hadoop's MapReduce solves this problem. It allows reading and writing data from the machines (or disks) by performing computations over sets of keys and values. Together, HDFS and MapReduce form the core of the Hadoop architecture's functionality.
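The key/value model described above can be illustrated with a tiny in-process word count. This is only a sketch of the programming model, not how Hadoop itself is invoked; in a real cluster these phases run distributed across machines:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Merge the values for one key into a single output.
    return key, sum(values)

lines = ["big data big clusters", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

The map step only sees one record at a time and the reduce step only sees one key at a time, which is what makes both phases easy to spread across many machines.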
Hadoop follows a master-slave architecture design for data storage and distributed data processing using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the Job Tracker. The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster, which store data and perform complex computations. Every slave node has a Task Tracker daemon and a DataNode that synchronize their processes with the Job Tracker and NameNode respectively. In a Hadoop architectural implementation, the master and slave systems can be set up in the cloud or on-premise.
Image Credit: OpenSource.com
As mentioned above, Hadoop Common is a collection of libraries and utilities that other Hadoop components require. It was formerly known as Hadoop Core and forms the base of the Hadoop framework. The critical responsibilities of Hadoop Common are:
a) Supplying source code and documents, and a contribution section.
b) Performing basic tasks- abstraction of the file system, generalization of the operating system, etc.
c) Supporting the Hadoop Framework by keeping Java Archive files (JARs) and scripts needed to initiate Hadoop.
A file on HDFS is split into multiple blocks, and each block is replicated within the Hadoop cluster. A block on HDFS is a blob of data within the underlying file system, with a default size of 128 MB in recent Hadoop versions (64 MB in Hadoop 1.x). The block size can be configured, for example to 256 MB, based on requirements.
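As a rough sketch, the number of blocks a file occupies, and the raw storage it consumes once each block is replicated, can be computed like this. The 128 MB block size and replication factor of 3 are common defaults, and `hdfs_footprint` is a hypothetical helper for illustration, not part of any Hadoop API:

```python
import math

# Hypothetical helper: how many HDFS blocks a file occupies and how
# much raw cluster storage it consumes once replicated. Defaults are
# the common 128 MB block size and replication factor of 3.
def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)   # a 1000 MB file
print(blocks)  # 8 blocks (7 full 128 MB blocks + 1 partial block)
print(raw)     # 3000 MB of raw storage across the cluster
```

Note that the last block of a file only occupies as much underlying disk as its actual data, so a partial block does not waste a full 128 MB.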
Image Credit: Apache.org
Hadoop Distributed File System (HDFS) stores the application data and file system metadata separately on dedicated servers. NameNode and DataNode are the two critical components of the Hadoop HDFS architecture. Application data is stored on servers referred to as DataNodes, and file system metadata is stored on a server referred to as the NameNode. HDFS replicates the file content on multiple DataNodes based on the replication factor to ensure reliability of the data. The NameNode and DataNodes communicate with each other using TCP-based protocols. For the Hadoop architecture to be performance efficient, HDFS must satisfy certain prerequisites.
All the files and directories in the HDFS namespace are represented on the NameNode by inodes that contain attributes like permissions, modification timestamp, disk space quota, namespace quota and access times. The NameNode maps the entire file system structure into memory. Two files, fsimage and edits, are used for persistence across restarts.
When the NameNode starts, the fsimage file is loaded and the contents of the edits file are then applied to recover the latest state of the file system. The problem is that over time the edits file grows, consuming disk space and slowing down the restart process. If the Hadoop cluster has not been restarted for months, the edits file will be very large and the downtime correspondingly long. This is where the Secondary NameNode comes to the rescue. The Secondary NameNode fetches the fsimage and edits log from the primary NameNode at regular intervals and merges them in main memory by applying each operation from the edits log to the fsimage. It then copies the new fsimage file back to the primary NameNode and records the update time in the fstime file to track when the fsimage file was last updated.
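The replay-and-swap idea behind this checkpoint can be modeled in toy form, with the fsimage as a dictionary of path metadata and the edits log as a list of operations. Real HDFS uses binary on-disk formats; every name here is illustrative:

```python
# Toy model of the Secondary NameNode checkpoint: replay the edits
# log onto the loaded fsimage, then swap in the merged image and
# truncate the log so the next restart is fast.

def apply_edit(fsimage, edit):
    op, path = edit[0], edit[1]
    if op == "create":
        fsimage[path] = {"replication": edit[2]}
    elif op == "delete":
        fsimage.pop(path, None)
    return fsimage

def checkpoint(fsimage, edits):
    # Replay every logged operation onto the loaded fsimage ...
    for edit in edits:
        apply_edit(fsimage, edit)
    # ... then the merged image becomes the new fsimage and the
    # edits log can be truncated.
    return fsimage, []

fsimage = {"/logs/app.log": {"replication": 3}}
edits = [("create", "/data/part-0", 3), ("delete", "/logs/app.log")]
fsimage, edits = checkpoint(fsimage, edits)
print(fsimage)  # {'/data/part-0': {'replication': 3}}
```

The key point is that the expensive merge happens off the critical path, so the primary NameNode never has to replay a months-long edits log at startup.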
A DataNode manages the state of an HDFS node and interacts with its blocks. A DataNode can perform CPU-intensive jobs like semantic and language analysis, statistics and machine learning tasks, and I/O-intensive jobs like clustering, data import, data export, search, decompression, and indexing. A DataNode needs a lot of I/O for data processing and transfer.
On startup, every DataNode connects to the NameNode and performs a handshake to verify the namespace ID and the software version of the DataNode. If either does not match, the DataNode shuts down automatically. A DataNode verifies the block replicas in its ownership by sending a block report to the NameNode; the first block report is sent as soon as the DataNode registers. The DataNode then sends a heartbeat to the NameNode every 3 seconds to confirm that it is operating and that the block replicas it hosts are available.
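The handshake-then-heartbeat sequence can be sketched schematically as follows. The function names and record layouts are assumptions for illustration, not the real HDFS wire protocol (which runs over TCP with 3-second heartbeat intervals):

```python
# Schematic of DataNode registration: a handshake checks namespace ID
# and software version; only a matching node goes on to heartbeat.

NAMESPACE_ID = "ns-1432"      # illustrative cluster namespace ID
SOFTWARE_VERSION = "3.3.6"    # illustrative expected version

def handshake(datanode):
    # A mismatch on either field means the DataNode must shut down.
    return (datanode["namespace_id"] == NAMESPACE_ID
            and datanode["version"] == SOFTWARE_VERSION)

def heartbeat(datanode):
    # "I am alive, and here are the block replicas I am hosting."
    return {"node": datanode["host"], "blocks": datanode["blocks"]}

node = {"host": "dn01", "namespace_id": "ns-1432",
        "version": "3.3.6", "blocks": ["blk_1001", "blk_1002"]}

if handshake(node):
    print(heartbeat(node))
else:
    print("namespace/version mismatch: shutting down")
```

The handshake is what prevents a node from a different cluster, or one running incompatible software, from silently joining and corrupting the namespace.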
YARN stands for Yet Another Resource Negotiator. It came into the picture with the Hadoop 2.x versions. The critical function of YARN is to supply the computational resources required for application execution by separating resource management from job scheduling.
The two essential components of the YARN infrastructure are the Resource Manager (RM) and the Node Manager (NM). Let us consider each in a bit more detail.
a) Resource Manager (RM): In the two-level cluster, the Resource Manager is the master of the YARN infrastructure. It lies in the top tier (one per cluster) and knows where the slaves in the lower tier are located and how many resources they have. The Resource Manager performs various functions, the most important being resource scheduling, that is, deciding how to allocate resources.
b) Node Manager (NM): In the two-level cluster, the Node Manager lies in the lower tier (many per cluster) and is the slave of the YARN infrastructure. On startup, each Node Manager registers itself with the Resource Manager. Each node contributes resources to the cluster, and its resource capacity is the amount of memory and other resources it offers. During execution, the Resource Scheduler (the Resource Manager component that performs resource scheduling) decides how to utilize this capacity.
Important functions of Hadoop YARN
1. Hadoop YARN promotes the dynamic distribution of cluster resources between various Hadoop components and frameworks like MapReduce, Impala, and Spark.
2. At present, YARN controls memory and CPU. In the future, we expect it to be responsible for the allocation of additional resources like disk and network I/O.
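The core decision the Resource Scheduler makes, fitting container requests (memory, vcores) onto the capacities reported by Node Managers, can be sketched with a naive first-fit policy. Real YARN schedulers (Capacity, Fair) are far more sophisticated, and every name and number here is illustrative:

```python
# First-fit sketch of container placement: each request is placed on
# the first node with enough free memory and cores, and the node's
# remaining capacity is reduced accordingly.

def schedule(requests, nodes):
    placements = {}
    for name, mem, cores in requests:
        for node in nodes:
            if node["mem"] >= mem and node["cores"] >= cores:
                node["mem"] -= mem       # reserve the capacity
                node["cores"] -= cores
                placements[name] = node["host"]
                break
        else:
            placements[name] = None      # no node can host it right now
    return placements

nodes = [{"host": "nm01", "mem": 8192, "cores": 4},
         {"host": "nm02", "mem": 4096, "cores": 2}]
requests = [("map-task-1", 2048, 1),
            ("map-task-2", 8192, 2),
            ("reduce-task-1", 4096, 2)]

placements = schedule(requests, nodes)
print(placements)
```

Running this places `map-task-1` and `reduce-task-1` on the larger node, while `map-task-2` cannot be placed until capacity frees up, which is the kind of queueing decision the real scheduler resolves against configured queues and priorities.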
The heart of the distributed computation platform Hadoop is its Java-based programming paradigm, Hadoop MapReduce. A MapReduce computation is a special type of directed acyclic graph that can be applied to a wide range of business use cases. The map function transforms a piece of data into key-value pairs; the keys are then sorted, and a reduce function merges the values for each key into a single output.
The execution of a MapReduce job begins when the client submits the job configuration to the Job Tracker, specifying the map, combine and reduce functions along with the locations for input and output data. On receiving the job configuration, the Job Tracker identifies the number of splits based on the input path and selects Task Trackers based on their network vicinity to the data sources. The Job Tracker then sends a request to the selected Task Trackers.
The map phase begins when the Task Tracker extracts the input data from its splits. The map function is invoked for each record parsed by the “InputFormat” and produces key-value pairs in a memory buffer. The buffer contents are sorted and partitioned for the different reducer nodes, with the combine function invoked to merge values locally before they are written to disk. On completion of the map task, the Task Tracker notifies the Job Tracker. When all map tasks are complete, the Job Tracker notifies the selected Task Trackers to begin the reduce phase. Each Task Tracker reads the region files and sorts the key-value pairs for each key. The reduce function is then invoked, and it collects the aggregated values into the output file.
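The routing of map output to reducers during this shuffle is driven by a partitioner; Hadoop's default hashes the key modulo the number of reduce tasks, so every pair with the same key lands on the same reducer. A small sketch of that idea, using `zlib.crc32` as a stand-in for Java's `hashCode()` so the result is stable across runs:

```python
import zlib

NUM_REDUCERS = 3  # illustrative reduce-task count

def partition(key, num_reducers=NUM_REDUCERS):
    # Same key -> same partition, every time, which is all the
    # default hash partitioner guarantees.
    return zlib.crc32(key.encode()) % num_reducers

pairs = [("big", 1), ("data", 1), ("big", 1), ("clusters", 1)]
partitions = {r: [] for r in range(NUM_REDUCERS)}
for key, value in pairs:
    partitions[partition(key)].append((key, value))

# Every occurrence of "big" is routed to the same reducer partition.
print(partitions)
```

Because the grouping happens by partition rather than by node, adding reducers changes only the modulus, not the map-side logic.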
With 1.59 billion accounts (approximately one-fifth of the world's total population), 30 million users updating their status at least once a day, 10+ million videos uploaded every month, 1+ billion pieces of content shared every week and more than 1 billion photos uploaded every month, Facebook uses Hadoop to interact with petabytes of data. Facebook runs the world's largest Hadoop cluster, with more than 4,000 machines storing hundreds of millions of gigabytes of data. The biggest Hadoop cluster at Facebook has about 2,500 CPU cores and 1 PB of disk space; engineers at Facebook load more than 250 GB of compressed data (more than 2 TB uncompressed) into HDFS daily, and hundreds of Hadoop jobs run daily on these datasets.
Where is the data stored at Facebook?
135 TB of compressed data is scanned daily and 4 TB of compressed data is added daily. Wondering where all this data is stored? Facebook has a Hadoop/Hive warehouse with a two-level network topology, 4,800 cores and 5.5 PB of storage, holding up to 12 TB per node. More than 7,500 Hadoop Hive jobs run in the production cluster per day, consuming an average of 80K compute hours. Non-engineers, i.e. analysts at Facebook, use Hadoop through Hive, and approximately 200 people per month run jobs on Apache Hadoop.
Hadoop at Yahoo spans 36 different Hadoop clusters across Apache HBase, Storm and YARN, totalling 60,000 servers built from hundreds of different hardware configurations accumulated over generations. Yahoo runs the largest multi-tenant Hadoop installation in the world, with a broad set of use cases, and executes 850,000 Hadoop jobs daily.
For organizations planning to implement the Hadoop architecture in production, the best way to determine whether Hadoop is right for the company is to determine the cost of storing and processing data using Hadoop, and then compare it to the cost of the legacy approach for managing data.