Hadoop Multinode Cluster Setup for Ubuntu 12.04
Setting up a multi-node Hadoop cluster is easier than it sounds. This tutorial is a step-by-step guide to installing a multi-node cluster on Ubuntu 12.04.
Before setting up the cluster, let’s first understand Hadoop and its modules.
What is Apache Hadoop?
Apache Hadoop is an open-source, Java-based framework that supports the processing of large data sets in a distributed computing environment.
What are the different modules in Apache Hadoop?
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common – a collection of common utilities and libraries that support the other Hadoop modules
- Hadoop Distributed File System (HDFS) – the primary distributed storage system used by Hadoop applications to hold large volumes of data. HDFS is scalable and fault-tolerant and works closely with a wide variety of concurrent data access applications.
- Hadoop YARN (Yet Another Resource Negotiator) – a framework for job scheduling and cluster resource management. It is the architectural center of Hadoop, allowing multiple data processing engines to work with data stored in HDFS.
- Hadoop MapReduce – a YARN-based system for the reliable, parallel processing of large data sets.
A minimum of two Ubuntu machines is needed for a multi-node installation, but three machines are advisable for a balanced test environment. This article uses Hadoop 2.5.2 on three Ubuntu machines, where one machine serves as both master and slave and the other two as slaves.
Machine 1: dzmnhdp01  IP address: 192.168.56.11
Machine 2: dzmnhdp02  IP address: 192.168.56.12
Machine 3: dzmnhdp03  IP address: 192.168.56.13
Hadoop daemons (think of daemons as Windows services) are Java services that run in their own JVM (Java Virtual Machine) and therefore require a Java installation on each machine. Secure Shell (SSH) is also required for remote logins that operate securely over an unsecured network.
Install Java and SSH on all machines (nodes):
# Download packages and install Java and SSH
$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk
$ sudo apt-get install openssh-server

# Update the JAVA path in your bashrc file
$ sudo vi ~/.bashrc    OR    $ sudo gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
$ source ~/.bashrc
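Before moving on, it is worth confirming that the installs above actually landed on each machine. A minimal sketch; the require_cmds helper is illustrative, not part of Hadoop:

```shell
# Hypothetical helper: check that each required command is on the PATH.
require_cmds() {
  for c in "$@"; do
    if ! command -v "$c" >/dev/null 2>&1; then
      echo "missing: $c"
      return 1
    fi
  done
  echo "all present"
}

# On each node, after the installs above:
# require_cmds java ssh
```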
Add the hostnames and their static IP addresses to /etc/hosts for host name resolution, and comment out the localhost entry. This helps avoid unreachable-host errors.
$ sudo vi /etc/hosts
192.168.56.11 dzmnhdp01
192.168.56.12 dzmnhdp02
192.168.56.13 dzmnhdp03
# 127.0.0.1 localhost
Ping your machines to validate host name resolution:
$ ping dzmnhdp01
$ ping dzmnhdp02
$ ping dzmnhdp03
Set up Hadoop user:
There must be a common user on all machines to administer the cluster; this common user lets all nodes talk to each other over password-less connections.
# Create Hadoop group
$ sudo addgroup hadoop

# Create a user inside the "hadoop" group (enter/confirm the password; other fields can be left blank)
$ sudo adduser --ingroup hadoop hduser

# Give hduser sudo rights by adding it to the sudoers file (to close the file press CTRL+X, then Y, then Enter)
$ sudo visudo
hduser ALL=(ALL) ALL
Log in as "hduser" and generate an SSH key to enable password-less connections between the nodes. These steps must be performed on each node.
Alternate option: copying the private and authorized keys from one node to another also enables the password-less connection.
Generate SSH key:
# Login as hduser
$ su - hduser

# Generate ssh key
$ ssh-keygen -t rsa -P ""

# Enable the authorization
$ cp /home/hduser/.ssh/id_rsa.pub /home/hduser/.ssh/authorized_keys

# Copy the public key from master to slaves and vice versa
#   dzmnhdp01 to dzmnhdp02/03
#   dzmnhdp02 to dzmnhdp01/03
#   dzmnhdp03 to dzmnhdp01/02
$ ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@dzmnhdp02

# Modify permissions
$ sudo chmod 700 /home/hduser/.ssh
$ sudo chmod 640 /home/hduser/.ssh/authorized_keys
$ sudo chmod 600 /home/hduser/.ssh/id_rsa

# Check your connection between the nodes
$ ssh hduser@dzmnhdp02

# To disable the known-host warning
$ sudo sed -i 's/^.*StrictHostKeyChecking.*$/StrictHostKeyChecking=no/' /etc/ssh/ssh_config
$ sudo service ssh restart
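Once the keys are exchanged, a quick way to confirm that every hop really is password-less is to loop over the hosts with SSH in batch mode, which fails instead of prompting. The check_passwordless helper below is a sketch, not a required step of the tutorial; it takes the connection command as a parameter:

```shell
# Hypothetical helper: run "<cmd> <host> true" for each host and report the result.
check_passwordless() {
  cmd="$1"; shift
  for host in "$@"; do
    if $cmd "$host" true 2>/dev/null; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
    fi
  done
}

# Real use, with the hosts from this tutorial:
# check_passwordless "ssh -o BatchMode=yes -o ConnectTimeout=5" dzmnhdp01 dzmnhdp02 dzmnhdp03
```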
Hadoop can run in three different modes:
- Standalone mode – the default mode of Hadoop, which uses the local file system for input and output operations instead of HDFS; mainly used for debugging.
- Pseudo-distributed mode (single-node cluster) – the Hadoop cluster is set up on a single server running all Hadoop daemons on one node; mainly used to test real code against HDFS.
- Fully distributed mode (multi-node cluster) – the Hadoop cluster is set up on more than one server, enabling a distributed environment for storage and processing; this is the mode used in production.
# Fix a base directory for the Hadoop ecosystem
$ cd /usr/local

# Download the Hadoop 2.5.2 binary file
$ sudo wget http://www.us.apache.org/dist/hadoop/core/hadoop-2.5.2/hadoop-2.5.2.tar.gz

# Un-tar it, assign a soft link, change permissions
$ sudo tar -xzf hadoop-2.5.2.tar.gz
$ sudo ln -s hadoop-2.5.2 hadoop
$ sudo chown -R hduser:hadoop hadoop
$ sudo chown -R hduser:hadoop hadoop-2.5.2

# Set up Hadoop environment variables
$ sudo vi ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
$ source ~/.bashrc
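After sourcing ~/.bashrc, `hadoop version` should report 2.5.2; its first output line has the form "Hadoop 2.5.2". A small sketch that extracts just the version number from that output (the hadoop_version helper is illustrative):

```shell
# Print just the version number from the first line of `hadoop version` output.
hadoop_version() {
  awk 'NR==1 {print $2}'
}

# Real use:
# hadoop version | hadoop_version    # expect 2.5.2
```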
Hadoop configuration files:
Every installation needs basic configuration, so add the parameters given below to seven important Hadoop configuration files to run the cluster in a safe environment.
Note: all child "property" elements in an XML configuration file must fall under the parent "configuration" element.
$ cd $HADOOP_HOME/etc/hadoop
- hadoop-env.sh – contains the environment variables required by Hadoop, such as the log file location, Java path, heap size, PIDs etc.
$ sudo vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export HADOOP_HEAPSIZE=400
- core-site.xml – tells the Hadoop daemons in the cluster where the Namenode is located. It also includes I/O settings for HDFS and MapReduce.
$ sudo vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://dzmnhdp01:9000</value>
    <description>The name of the default file system (HDFS). A URI whose scheme and authority determine the FileSystem implementation.</description>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>30</value>
    <description>Number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled.</description>
  </property>
</configuration>
- hdfs-site.xml – configuration for HDFS daemon-related parameters such as the block replication factor, permission checking, storage location etc.
$ sudo vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hdfs_storage/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
</configuration>
- mapred-site.xml – configuration for MapReduce daemons and jobs; in Hadoop 2.x it is also used to point to the YARN framework.
# Create a copy of the mapred file from its template
$ cp mapred-site.xml.template mapred-site.xml

# Edit the mapred file
$ sudo vi mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>dzmnhdp01:9001</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
  </property>
</configuration>
- yarn-site.xml – configuration for YARN daemon-related parameters such as the resource manager, node manager, container class, MapReduce class etc.
$ sudo vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>dzmnhdp01:8032</value>
    <description>The address of the applications manager interface in the RM.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>dzmnhdp01:8030</value>
    <description>The address of the scheduler interface in the RM.</description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>dzmnhdp01:8031</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:59392</value>
    <description>The address of the container manager in the NM.</description>
  </property>
</configuration>
After updating the five configuration files above, create the "hdfs_storage" directory on all nodes and copy the complete Hadoop 2.5.2 folder with SCP to the other two nodes (dzmnhdp02 and dzmnhdp03).
## Node 1 (dzmnhdp01)
# Create hdfs directory and assign permissions
$ sudo mkdir /hdfs_storage
$ sudo mkdir /hdfs_storage/data
$ sudo chown -R hduser:hadoop /hdfs_storage

# Transfer the Hadoop folder to the other nodes
$ scp -r /usr/local/hadoop-2.5.2 hduser@dzmnhdp02:/home/hduser
$ scp -r /usr/local/hadoop-2.5.2 hduser@dzmnhdp03:/home/hduser

## Nodes 2 & 3 (dzmnhdp02 & dzmnhdp03)
# Create hdfs directory and assign permissions
$ sudo mkdir /hdfs_storage
$ sudo mkdir /hdfs_storage/data
$ sudo chown -R hduser:hadoop /hdfs_storage

# Move the Hadoop folder to the base directory, create a soft link and assign permissions
$ sudo mv /home/hduser/hadoop-2.5.2 /usr/local
$ cd /usr/local
$ sudo ln -s hadoop-2.5.2 hadoop
$ sudo chown -R hduser:hadoop hadoop
$ sudo chown -R hduser:hadoop hadoop-2.5.2

# Set up Hadoop environment variables
$ sudo vi ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
$ source ~/.bashrc
Update the remaining two configuration files in the master node (dzmnhdp01):
- slaves – list of hosts, one per line, where the Hadoop slave daemons will run.
- Keep it blank on other nodes
# Add the hosted nodes for slaves in dzmnhdp01
$ sudo vi slaves
dzmnhdp01
dzmnhdp02
dzmnhdp03
- masters – list of hosts, one per line, where the Secondary Namenode will run.
- Keep it blank on other nodes
# Add the hosted node for the secondary namenode in dzmnhdp01
$ sudo vi masters
dzmnhdp02
Update the parameters for the Namenode directory and the Secondary Namenode in nodes 1 & 2 (dzmnhdp01 & dzmnhdp02):
## Node 1 (dzmnhdp01)
$ sudo vi hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///hdfs_storage/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>

## Node 2 (dzmnhdp02)
$ cd $HADOOP_HOME/etc/hadoop
$ sudo vi hdfs-site.xml

<property>
  <name>dfs.secondary.http.address</name>
  <value>dzmnhdp02:50090</value>
  <description>The secondary namenode http server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.http.address</name>
  <value>dzmnhdp01:50070</value>
  <description>The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>600</value>
  <description>Every 600 seconds the SNN will checkpoint the NN for the edits and fsimage merge.</description>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hdfs_storage/snnfsi</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge.</description>
</property>
<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>/hdfs_storage/snnedits</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge.</description>
</property>
Create the secondary namenode FSimage and edits directories in node 2 (dzmnhdp02):
## Node 2 (dzmnhdp02)
# Create FSimage and edits directories and assign permissions
$ sudo mkdir /hdfs_storage/snnfsi
$ sudo mkdir /hdfs_storage/snnedits
$ sudo chown -R hduser:hadoop /hdfs_storage
Format Namenode and start DFS daemons:
# Format the Namenode on the master node (dzmnhdp01)
$ hdfs namenode -format

# Confirm formatting by checking the VERSION file
$ more /hdfs_storage/name/current/VERSION

# Start the DFS daemons (dzmnhdp01)
$ $HADOOP_HOME/sbin/start-dfs.sh

# Confirm the same cluster ID for Namenode, Secondary Namenode and Datanode
$ more /hdfs_storage/name/current/VERSION      (on dzmnhdp01)
$ more /hdfs_storage/snnfsi/current/VERSION    (on dzmnhdp02)
$ more /hdfs_storage/snnedits/current/VERSION  (on dzmnhdp02)
$ more /hdfs_storage/data/current/VERSION      (on any node)

# Confirm the DFS daemons on each node
$ jps

## jps output
dzmnhdp01
2463 DataNode
2680 Jps
2241 NameNode

dzmnhdp02
2174 DataNode
2367 SecondaryNameNode
2436 Jps

dzmnhdp03
2209 DataNode
2277 Jps
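If you plan to automate these health checks, comparing `jps` output against the daemon list expected on each node catches a missing daemon immediately. A sketch, assuming the `jps` output format shown above; the check_daemons helper is illustrative:

```shell
# Hypothetical helper: verify every expected daemon name appears in jps output.
check_daemons() {
  jps_out="$1"; shift
  for d in "$@"; do
    if ! printf '%s\n' "$jps_out" | grep -qw "$d"; then
      echo "missing: $d"
      return 1
    fi
  done
  echo "all daemons up"
}

# Real use, e.g. on dzmnhdp01 after start-dfs.sh:
# check_daemons "$(jps)" NameNode DataNode
```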
Start YARN daemons:
# Start the YARN daemons (dzmnhdp01)
$ $HADOOP_HOME/sbin/start-yarn.sh

# Confirm the DFS + YARN daemons on each node
$ jps
Validate the functional Hadoop cluster:
# Check the Namenode safe mode status on any node
# (it must be OFF; it normally takes a couple of minutes to reach this state)
$ hdfs dfsadmin -safemode get

# Check the cluster report
$ hdfs dfsadmin -report
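`hdfs dfsadmin -safemode get` prints a line such as "Safe mode is OFF", so the check can be scripted. A sketch (the safemode_is_off helper is illustrative):

```shell
# Exit 0 only when the safemode status read from stdin says OFF.
safemode_is_off() {
  grep -q 'Safe mode is OFF'
}

# Real use:
# if hdfs dfsadmin -safemode get | safemode_is_off; then echo "cluster ready"; fi
```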
If safe mode is OFF and the report shows a clear picture of your cluster, you have set up a working Hadoop multi-node cluster.
Test the environment with MapReduce:
Download an ebook to the local file system and copy it into HDFS. By default, the Hadoop folder includes example jar files to help test the environment.
# Download an ebook and rename it
$ cd ~
$ wget http://www.gutenberg.org/ebooks/4300.txt.utf-8
$ mv 4300.txt.utf-8 4300.txt

# Copy the book into HDFS under a new directory
$ hdfs dfs -mkdir /mn_test
$ hdfs dfs -put ~/4300.txt /mn_test/

# Verify the content
$ hdfs dfs -ls /mn_test
Found 1 items
-rw-r--r--   3 hduser supergroup    1573151 2015-12-29 13:00 /mn_test/4300.txt

# Run the wordcount mapreduce program on the downloaded book
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /mn_test/4300.txt /mn_wcoutput

# Check the output in HDFS
$ hdfs dfs -cat /mn_wcoutput/part-r-00000
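The wordcount output is tab-separated `word<TAB>count` lines, so ordinary Unix tools can summarize it. A sketch of pulling out the most frequent words (the top_words helper is illustrative):

```shell
# Read "word<TAB>count" lines on stdin and print the top N (default 10) by count.
top_words() {
  sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}

# Real use:
# hdfs dfs -cat /mn_wcoutput/part-r-00000 | top_words 10
```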
Track through Web Consoles:
By default Hadoop HTTP web-consoles allow access without any form of authentication.
# NameNode
http://<namenode-ip>:50070
http://192.168.56.11:50070

# Resource Manager
http://192.168.56.11:8088

# Secondary NameNode
http://192.168.56.12:50090
Troubleshooting the environment:
By default, all Hadoop logs are stored in $HADOOP_HOME/logs. For any installation issue, these logs will help you troubleshoot the cluster.