What is a Hadoop Cluster?
In general, a computer cluster is a collection of various computers that work collectively as a single system.
“A Hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource.”
“A Hadoop cluster can be referred to as a computational computer cluster for storing and analysing big data (structured, semi-structured and unstructured) in a distributed environment.”
“A computational computer cluster that distributes the data analysis workload across various cluster nodes that work collectively to process the data in parallel.”
Hadoop clusters are also known as “Shared Nothing” systems because nothing is shared between the nodes in a Hadoop cluster except for the network that connects them. Because nodes do not contend for shared disks or memory, the shared-nothing design helps keep cluster-wide latency low when queries have to run over huge amounts of data.
Advantages of a Hadoop Cluster Setup
- As big data grows exponentially, the parallel processing capabilities of a Hadoop cluster help speed up the analysis process. The processing power of a Hadoop cluster might still become inadequate as the volume of data increases. In such scenarios, a Hadoop cluster can be scaled out easily by adding extra cluster nodes to keep up the speed of analysis, without having to modify the application logic.
- A Hadoop cluster setup is inexpensive because it is built on cheap commodity hardware. Any organization can set up a powerful Hadoop cluster without having to spend on expensive server hardware.
- Hadoop clusters are resilient to failure: whenever data is sent to a particular node for analysis, it is also replicated to other nodes in the Hadoop cluster. If that node fails, the replicated copy of the data on another node in the cluster can be used for analysis.
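The resilience point can be illustrated with a toy model. This is only a sketch, not HDFS code: the helper names, the node and block names, the round-robin placement, and the replication factor of 3 are all assumptions made for illustration.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration; real HDFS placement
    also takes rack location into account.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return {
        block: [nodes[(i + k) % len(nodes)] for k in range(replication)]
        for i, block in enumerate(blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_a", "blk_b", "blk_c", "blk_d", "blk_e"]

# With three replicas per block, losing one node leaves every block readable.
placement = place_replicas(blocks, nodes, replication=3)
survivors = readable_blocks(placement, failed_nodes=["node2"])

# With a single replica, losing node1 makes the blocks stored there unreadable.
unreplicated = place_replicas(blocks, nodes, replication=1)
survivors_unreplicated = readable_blocks(unreplicated, failed_nodes=["node1"])
```

In this model, the replicated layout keeps all five blocks available after a node failure, while the unreplicated layout loses every block that lived only on the failed node.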
Hadoop Cluster Architecture
A Hadoop cluster architecture consists of a data centre, racks, and the nodes that actually execute the jobs. A data centre consists of racks, and a rack consists of nodes. A medium to large cluster uses a two- or three-level Hadoop cluster architecture built with rack-mounted servers. The servers in every rack are interconnected through 1 gigabit Ethernet (1 GigE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch, and the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.
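The data centre, rack, and node hierarchy can be sketched as a small data structure. The class names, the path format, and the example names (`dc1`, `rack1`, `n1`) are hypothetical and chosen only to mirror the description above; they are not part of any Hadoop API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str


@dataclass
class Rack:
    name: str
    nodes: list = field(default_factory=list)


@dataclass
class DataCentre:
    name: str
    racks: list = field(default_factory=list)

    def node_paths(self):
        """Return a '/datacentre/rack/node' path for every node."""
        return [
            f"/{self.name}/{rack.name}/{node.name}"
            for rack in self.racks
            for node in rack.nodes
        ]

    def rack_of(self, node_name):
        """Return the name of the rack that holds the given node."""
        for rack in self.racks:
            if any(node.name == node_name for node in rack.nodes):
                return rack.name
        raise KeyError(node_name)

    def same_rack(self, a, b):
        """True if both nodes sit behind the same rack-level switch."""
        return self.rack_of(a) == self.rack_of(b)


dc = DataCentre(
    "dc1",
    [
        Rack("rack1", [Node("n1"), Node("n2")]),
        Rack("rack2", [Node("n3")]),
    ],
)
paths = dc.node_paths()
```

Traffic between `n1` and `n2` stays within one rack-level switch, whereas traffic between `n1` and `n3` has to cross a cluster-level switch.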
Components of a Hadoop Cluster
A Hadoop cluster consists of three components:
- Master Node – The master node in a Hadoop cluster is responsible for managing data storage in HDFS and coordinating parallel computation on the stored data using MapReduce. In the classic (Hadoop 1.x) layout described here, the master node runs three daemons: NameNode, Secondary NameNode and JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function with HDFS. The NameNode keeps track of all the information on files (i.e. the file metadata), such as the access time of a file, which user is currently accessing a file, and where in the Hadoop cluster each file is stored. The Secondary NameNode periodically checkpoints the NameNode's metadata; it keeps a copy of that data but is not a hot standby for the NameNode.
- Slave/Worker Node – This component in a Hadoop cluster is responsible for storing the data and performing computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node in the cluster. The DataNode service is subordinate to the NameNode, and the TaskTracker service is subordinate to the JobTracker.
- Client Node – A client node has Hadoop installed with all the required cluster configuration settings and is responsible for loading data into the Hadoop cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and it retrieves the output once job processing is completed.
Single Node Hadoop Cluster vs. Multi Node Hadoop Cluster
As the names suggest, a single-node Hadoop cluster has only one machine, whereas a multi-node Hadoop cluster has more than one.
In a single-node Hadoop cluster, all the daemons, i.e. DataNode, NameNode, TaskTracker and JobTracker, run on the same machine/host. In the simplest (standalone) setup, everything runs in a single JVM instance, and the Hadoop user need not change any configuration settings except for setting the JAVA_HOME variable. A single-node setup normally uses a replication factor of 1, since there is only one DataNode to hold each block.
In a multi-node Hadoop cluster, the essential daemons run on different machines/hosts. A multi-node Hadoop cluster setup has a master-slave architecture in which one machine acts as the master and runs the NameNode daemon, while the other machines act as slave or worker nodes and run the other Hadoop daemons. Usually, in a multi-node Hadoop cluster, cheaper machines (commodity computers) run the TaskTracker and DataNode daemons, while the other services run on more powerful servers. The machines in a multi-node Hadoop cluster can be located anywhere, irrespective of the location of the physical server.
Best Practices for Building a Hadoop Cluster
Hadoop’s performance depends on the hardware resources available, namely hard drives (I/O storage), CPU, memory and network bandwidth, and on well-configured software layers. Building a Hadoop cluster is a complex task that requires consideration of several factors, such as choosing the right hardware, sizing the Hadoop cluster and configuring it correctly.
Choosing the Right Hardware for a Hadoop Cluster
Many organizations struggle when setting up Hadoop infrastructure because they do not know what kind of machines to purchase for an optimized Hadoop environment or what configuration to use. The first concern for most users is deciding on the hardware for the Hadoop cluster. Hadoop runs on industry-standard hardware, but there is no single ideal cluster configuration, such as a fixed list of hardware specifications, for setting up a Hadoop cluster. The hardware chosen should balance performance and economy for a particular workload. Choosing it is a classic chicken-and-egg problem: the hardware can be fully optimized only after thorough testing and validation, which requires a complete understanding of the workloads (I/O-bound or CPU-bound). The number of machines and their hardware specification depend on factors like:
- Volume of the data
- The type of workload that needs to be processed (CPU-bound or I/O-bound)
- Data storage methodology (data container and data compression technique used, if any)
- Data retention policy (how long you can afford to keep the data before flushing it out)
Sizing a Hadoop Cluster
The data volume that users will process on the Hadoop cluster should be a key consideration when sizing it. Knowing the data volume helps decide how many nodes or machines are required to process the data efficiently and how much memory each machine needs. The best practice is to size a Hadoop cluster based on the amount of storage required. Whenever a new node is added to the Hadoop cluster, it brings additional computing resources along with its new storage capacity.
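Storage-based sizing comes down to simple arithmetic. The sketch below is a rough estimate under stated assumptions: the `estimate_nodes` helper is hypothetical, and its defaults (a replication factor of 3, 25% extra space for intermediate and temporary data, 24 TB of usable disk per node) are illustrative values that should be replaced with figures from your own environment.

```python
import math


def estimate_nodes(data_tb, replication=3, overhead=0.25, usable_tb_per_node=24.0):
    """Estimate how many worker nodes are needed to store `data_tb` of data.

    All defaults are illustrative assumptions, not Hadoop recommendations.
    """
    # Raw storage = data volume x number of replicas x headroom for temp data.
    raw_tb = data_tb * replication * (1 + overhead)
    return math.ceil(raw_tb / usable_tb_per_node)


# 100 TB of data -> 100 * 3 * 1.25 = 375 TB raw -> 375 / 24 = 15.625 -> 16 nodes
nodes_needed = estimate_nodes(100)
```

Because each added node also contributes CPU and memory, an estimate like this gives a starting point for compute capacity as well as storage.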
Configuring the Hadoop Cluster
To obtain maximum performance from a Hadoop cluster, it needs to be configured correctly. However, finding the ideal configuration for a Hadoop cluster is not easy, because the Hadoop framework needs to be adapted both to the cluster it runs on and to the job. The best way to arrive at a good configuration is to first run the Hadoop jobs with the default configuration to get a baseline. The job history log files can then be analysed to see whether any resource is a bottleneck or whether the jobs take longer than expected. Repeating this process helps fine-tune the cluster configuration until it best fits the business requirements. The number of CPU cores and the amount of memory allocated to the daemons also have a great impact on the performance of the cluster. For small to medium data volumes, one CPU core is reserved on each DataNode for the HDFS and MapReduce daemons; for very large data volumes, two CPU cores are reserved on each DataNode.
Are Hadoop clusters a good solution for all data analysis requirements?
Having listed the benefits of a Hadoop cluster setup, it is important to understand whether it is ideal for all data analysis needs. For example, if a company has intense data analysis requirements but relatively little data, it might not benefit from a Hadoop cluster setup. A Hadoop cluster is optimized for large datasets. For instance, 10 MB of data fed to a Hadoop cluster will take more time to process than it would on a traditional system.
Hadoop clusters assume that data can be split apart and analysed by parallel processes running on different cluster nodes. Thus, a Hadoop cluster is the right tool only when the analysis can be parallelized. Whether you should consider building a Hadoop cluster depends on whether your organization’s data analysis needs can be met by the capabilities of a Hadoop cluster setup.
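The split-and-combine assumption can be sketched in a few lines. This is a single-machine illustration, not Hadoop code: worker threads stand in for cluster nodes, and the function names and the word-count task are assumptions chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Count words in one chunk of the data (the work done on one 'node')."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Merge the per-chunk counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_count(lines, workers=3):
    """Split the input into chunks, process them concurrently, then combine."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


lines = ["big data", "big cluster", "data node data"]
word_counts = parallel_word_count(lines)
```

The approach works here because each chunk can be counted without looking at any other chunk. An analysis whose steps depend on one another's intermediate results does not split this cleanly.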
Here are a few scenarios where a Hadoop cluster setup might not be the right fit:
- If the analysis requires processing a large number of small files, a Hadoop cluster might not be ideal, because the amount of memory required to store the metadata within the NameNode will be huge.
- If the task requires multiple writers or frequent in-place updates to files, since HDFS is designed around a write-once, read-many access pattern.
- Tasks that require near-real-time data access, or any other low-latency tasks.
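The small-files scenario can be made concrete with a rough estimate. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of NameNode memory per namespace object (file or block) and a 128 MB block size. Both figures are assumptions for illustration, and the helper name is hypothetical.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024**2      # assumed HDFS block size of 128 MB


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def namenode_metadata_bytes(total_bytes, file_size_bytes, block_size=BLOCK_SIZE):
    """Estimate NameNode memory for `total_bytes` stored as equal-sized files."""
    n_files = ceil_div(total_bytes, file_size_bytes)
    blocks_per_file = max(1, ceil_div(file_size_bytes, block_size))
    # One object per file plus one object per block of that file.
    objects = n_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


one_tib = 1024**4
as_small_files = namenode_metadata_bytes(one_tib, 1024**2)   # 1 MiB files
as_large_files = namenode_metadata_bytes(one_tib, 1024**3)   # 1 GiB files
```

Under these assumptions, 1 TiB stored as 1 MiB files needs roughly 300 MiB of NameNode memory, while the same data stored as 1 GiB files needs a little over 1 MiB, which is why many small files strain the NameNode long before disk space runs out.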