Getting Started with Hadoop
What will you learn from this Hadoop tutorial for beginners?
This big data Hadoop tutorial covers the pre-installation environment setup for installing Hadoop on Ubuntu and details the steps for a Hadoop single node setup, so that you can perform basic data analysis operations on HDFS and Hadoop MapReduce. This Hadoop tutorial has been tested with:
- Ubuntu Server 12.04.5 LTS (64-bit)
- Java Version 1.7.0_101
Attractions of the Hadoop Installation Tutorial
- Steps to install the prerequisite software, Java
- Configuring the Linux Environment
- Hadoop Configuration
- Hadoop Single Node Setup: Installing Hadoop in Standalone Mode
- Hadoop Single Node Setup: Installing Hadoop in Pseudo Distributed Mode
- Common errors encountered while installing Hadoop on Ubuntu and how to troubleshoot them
Apache Hadoop is supported by all flavors of Linux, so it is suggested to install a Linux OS before setting up the environment for Hadoop installation. If you have an OS other than Linux, you can install Hadoop on Ubuntu inside a virtual machine that runs Linux. Apache Hadoop can be installed in 3 different modes of execution:
- Standalone Mode – Single node hadoop cluster setup
- Pseudo Distributed Mode – Single node hadoop cluster setup
- Fully Distributed Mode – Multi-node hadoop cluster setup
Hadoop Pre-installation Environment Setup
In this Hadoop tutorial, we are using Ubuntu Server 12.04.5 LTS (64-bit).
Check for updates or update the source index
Before you begin to install Hadoop on Ubuntu, ensure that it is updated with the latest packages from all the repositories and PPAs. Execute the command below to refresh the package index and see whether any updates are available:
$ sudo apt-get update
Java is the main prerequisite software for running Hadoop. To run Hadoop on Ubuntu, Java 1.6 or later must be installed on your machine, preferably from Sun/Oracle or OpenJDK. You can check whether Java is already installed on your machine using the command java -version. We are using OpenJDK 7 to install Java in this Hadoop tutorial:
$ sudo apt-get install openjdk-7-jdk
You can check the java version installed on your machine by using the command - java -version
SSH is required to manage the remote machines and your local machine before using Hadoop on them. In this Hadoop tutorial, we will use the OpenSSH server, which can be installed as follows:
$ sudo apt-get install openssh-server
Adding a dedicated Hadoop User Account
Creating a dedicated Hadoop user helps separate HDFS from the UNIX file system. We can begin by creating a “hadoop” group as follows:
$ sudo addgroup hadoop
The next step is to create a Hadoop user named hduser and add it to the “hadoop” group created in the above step.
$ sudo adduser --ingroup hadoop hduser
The command prompts you for a password and other details for the new user.
Grant All Permissions to the Hadoop User “hduser” Created in the Above Step
To grant all permissions to the created Hadoop user, you must configure the sudoers file located at /etc/sudoers. This file should not be edited directly; use the visudo command instead, as follows:
$ sudo visudo
Move to the end of the file and type “o” to insert a new line, then add the following entry, which grants all permissions to the Hadoop user “hduser”:
hduser ALL=(ALL) ALL
Hit Escape to exit the insert mode of the editor and type “:x” to save the changes and exit from the file.
Configuring SSH Access
SSH setup is needed to perform various operations on the Hadoop cluster, so that the master node can log in to the slave nodes to start or stop them. SSH must also be set up on the Secondary NameNode listed in the masters file, so that it can be started from the NameNode using the command ./start-dfs.sh, and on the JobTracker node, which is started with ./start-mapred.sh.
To configure SSH access, you must log in as the Hadoop user hduser:
$ su - hduser
The next step is to generate the SSH key for the hadoop user using the following command-
$ ssh-keygen -t rsa -P ""
The command ssh-keygen -t rsa -P "" creates an RSA key pair with an empty passphrase. Using an empty passphrase is generally not recommended, but it is needed here if you do not want to enter the passphrase every time Hadoop interacts with its nodes. It ensures that Hadoop can reach the nodes without your intervention.
The next step is to enable SSH access with the key generated in the previous step -
$ sudo cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
As Hadoop does not support IPv6 and is tested to work only on IPv4 networks, it is suggested to disable IPv6. However, if you are not using IPv6, you can simply skip this step of the Hadoop installation process.
To disable IPv6, you need to open the file /etc/sysctl.conf using the vi editor as shown below:
$ sudo vi /etc/sysctl.conf
Copy the below lines of code to disable IPv6 -
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
To ensure that IPv6 has been disabled , run the following command -
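The flag can be read back from the kernel once the settings are active. A minimal sketch, assuming the standard Linux path /proc/sys/net/ipv6/conf/all/disable_ipv6 and that the new values have already been loaded (after a reboot, or after reloading the file with sudo sysctl -p); the variable names are illustrative:

```shell
# Read the kernel flag that the sysctl.conf lines above control.
ipv6_flag_file=/proc/sys/net/ipv6/conf/all/disable_ipv6

if [ -r "$ipv6_flag_file" ]; then
  ipv6_disabled=$(cat "$ipv6_flag_file")
else
  # The file is absent when the kernel has no IPv6 support at all.
  ipv6_disabled=unknown
fi

echo "disable_ipv6 = $ipv6_disabled"
```

A value of 1 means IPv6 is disabled; 0 means it is still enabled.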
Having disabled IPv6, it is suggested that you restart the machine using the command below:
$ sudo reboot
Hurray, you have completed the environment setup to install hadoop. Now, let’s get started with Hadoop installation in standalone mode.
Hadoop Single Node Setup- Standalone Mode
Hadoop on a single node in standalone mode runs as a single Java process. This mode is of great help for debugging: it lets you run your MapReduce application on small data before you run it on a Hadoop cluster with big data.
Hadoop can be downloaded using the “wget” command as shown below -
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
The compressed Hadoop file needs to be extracted as follows:
$ tar -xvzf /home/hduser/hadoop-1.2.1.tar.gz
Verify if Hadoop is installed
To confirm that hadoop has been installed, you can run the following command -
$ ls /home/hduser/
Listing the contents of /home/hduser/ shows the extracted hadoop-1.2.1 directory. You can now move this directory to the location of your choice. Let's move the Hadoop directory to /usr/local/:
$ sudo mv /home/hduser/hadoop-1.2.1 /usr/local/hadoop
Assign Ownership of Hadoop to hduser
Ensure that you change the ownership of all files to “hduser” and the “hadoop” group.
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Before hadoop is up and running, you need to configure the hadoop environment.
Update .bashrc: we are updating .bashrc with the following Hadoop environment variables:
- Set HADOOP_HOME
- Add the Hadoop bin directory to PATH
$ sudo vi /home/hduser/.bashrc
Enter the following lines at the end of the file:
#Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
Enter the following command to update bashrc:
$ exec bash
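Once the shell has been reloaded, a quick check can confirm that the variables took effect. The sketch below repeats the two exports from this tutorial so that it can run on its own, then tests whether $HADOOP_HOME/bin appears as an entry in PATH; hadoop_on_path is only an illustrative variable name.

```shell
# Same values as the .bashrc lines in this tutorial.
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# Wrap PATH in colons so the match works for the first and last entries too.
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) hadoop_on_path=yes ;;
  *)                      hadoop_on_path=no ;;
esac

echo "HADOOP_HOME=$HADOOP_HOME (bin on PATH: $hadoop_on_path)"
```

If the check reports "no" in a fresh terminal, the lines were probably added to a different user's .bashrc or the shell was not reloaded.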
Java is a pre-requisite for hadoop to run, so you need to inform hadoop where java is installed by setting the variable JAVA_HOME in the hadoop-env.sh file.
$ sudo vi /usr/local/hadoop/conf/hadoop-env.sh
Enter the following lines at the end of the file:
#JAVA HOME variable
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
Note: If you are installing Hadoop on a 32-bit Ubuntu Server or Ubuntu Desktop, your JAVA_HOME would be JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
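If you are unsure which directory applies to your machine, one way to find a candidate is to resolve the real location of the java binary and strip the trailing bin/java part. This is a rough sketch, not an official procedure: java_home_from_binary is a hypothetical helper, and the directory it prints depends on how your JDK package lays out its files.

```shell
# Hypothetical helper: turn the resolved path of the java binary into a JAVA_HOME candidate
# by removing a trailing /jre/bin/java or /bin/java.
java_home_from_binary() {
  printf '%s\n' "$1" | sed -e 's#/jre/bin/java$##' -e 's#/bin/java$##'
}

# On a machine with Java installed, follow the symlinks behind the java command.
if command -v java >/dev/null 2>&1; then
  java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```

The printed directory is what you would assign to JAVA_HOME in hadoop-env.sh, after confirming that it contains a bin/java file.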
Confirm the successful installation of Hadoop in standalone mode
$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar wordcount /usr/local/hadoop/README.txt /home/hduser/Output
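To sanity-check the result, it helps to know what wordcount computes: it splits the input on whitespace and emits each distinct token with its number of occurrences. The sketch below approximates that with standard Unix tools on a small local file, so the counts can be compared by eye with the job's output. Here wordcount_local and the sample file are illustrative names, not part of Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): count whitespace-separated tokens in a file.
wordcount_local() {
  tr -s ' \t\r\f' '\n' < "$1" |   # put one token on each line
    grep -v '^$' |                # drop empty lines
    LC_ALL=C sort |               # byte-order sort so equal tokens are adjacent
    uniq -c |                     # prefix each token with its count
    awk '{ print $2 "\t" $1 }'    # print as token<TAB>count
}

sample_file="${TMPDIR:-/tmp}/wordcount-sample.txt"
printf 'to be or not to be\n' > "$sample_file"
wordcount_local "$sample_file"
```

Note that the Hadoop job refuses to run if the output directory (here /home/hduser/Output) already exists, so remove it or choose a new name before re-running the example.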
Hadoop Single Node Setup – Pseudo Distributed Mode
In pseudo distributed mode, Hadoop is also installed on a single machine, just as in standalone mode, but all the daemons (NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode) run as separate Java processes on that machine.
Create data directory for Hadoop - HDFS
Create a data folder named hdfs using mkdir and assign permissions to it. The Hadoop user has to read from and write to this directory, so its permissions must allow access for that user.
$ mkdir /usr/local/hadoop/hdfs
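The permission step can be sketched as follows. To keep the example runnable without sudo it works on a scratch directory rather than /usr/local/hadoop/hdfs, and the mode 755 (full access for the owner, read and execute for everyone else) is an assumption, not a value taken from this tutorial; adjust both to your setup.

```shell
# Scratch location for the demo; the tutorial's real path is /usr/local/hadoop/hdfs.
hdfs_dir="${TMPDIR:-/tmp}/hadoop-hdfs-demo"

mkdir -p "$hdfs_dir"   # -p: no error if the directory already exists
chmod 755 "$hdfs_dir"  # assumed mode: owner rwx, group and others r-x
ls -ld "$hdfs_dir"     # show the resulting owner and permissions
```

On the real path, the owner shown by ls -ld should be hduser, since that is the account the Hadoop daemons run under in this tutorial.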
Hadoop configuration files are located in the HADOOP_HOME/conf directory; in this tutorial the path is /usr/local/hadoop/conf/.
The core-site.xml file contains properties common to HDFS, MapReduce, and YARN. Hadoop provides default values for these properties in the core-default.xml file. The default properties and their values can be found at the following GitHub link: https://github.com/facebookarchive/hadoop-20/blob/master/src/core/core-default.xml.
Open the core-site.xml file using the vi editor -
$ sudo vi /usr/local/hadoop/conf/core-site.xml
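Enter the following lines between the <configuration> and </configuration> tags:
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.81.139:10001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hdfs</value>
</property>
- fs.default.name: URI of the NameNode.
- hadoop.tmp.dir: Path to HDFS, used as the base for temporary directories locally.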
The mapred-site.xml file contains MapReduce override properties. The default properties and their values can be found here: https://github.com/facebookarchive/hadoop-20/blob/master/src/mapred/mapred-default.xml.
Open the mapred-site.xml using the vi editor -
$ sudo vi /usr/local/hadoop/conf/mapred-site.xml
The value to configure in this file is the host or IP and port of the JobTracker.
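In Hadoop 1.x the JobTracker address is set through the mapred.job.tracker property. A minimal sketch of such a file, written to a scratch directory so it can be tried safely; the host and port shown (localhost:10002) are placeholders chosen for illustration, not values from this tutorial:

```shell
# Scratch directory for the demo; the real file lives in /usr/local/hadoop/conf/.
conf_dir="${TMPDIR:-/tmp}/hadoop-conf-demo"
mkdir -p "$conf_dir"

# Quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- Placeholder host:port; replace with your JobTracker's address. -->
    <value>localhost:10002</value>
  </property>
</configuration>
EOF

cat "$conf_dir/mapred-site.xml"
```

To apply it, the same property block would go between the configuration tags of /usr/local/hadoop/conf/mapred-site.xml, with the host and port of your own JobTracker.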
Configuring the masters
The masters file contains the IPs or hostnames of all Secondary NameNodes or checkpoint servers, one per line. Execute the following command to edit masters:
$ sudo vi /usr/local/hadoop/conf/masters
Enter your checkpoint server IP, or your system IP for pseudo distributed mode.
Configuring the Slaves
The slaves file contains the IPs or hostnames of all DataNodes, one per line. Execute the following command to edit slaves:
$ sudo vi /usr/local/hadoop/conf/slaves
Enter your DataNode IP, or your system IP for pseudo distributed mode.
The hdfs-site.xml file contains HDFS override properties. The default properties and their values for hdfs-site.xml can be found here: https://github.com/facebookarchive/hadoop-20/blob/master/src/hdfs/hdfs-default.xml.
Execute the following command to edit hdfs-site.xml:
$ sudo vi /usr/local/hadoop/conf/hdfs-site.xml
Enter the replication property between the <configuration> and </configuration> tags. Its value is a positive integer that sets the number of replicas (duplicate copies) kept of each block.
Here is a rule of thumb for choosing the replication factor:
Replication factor = the number of DataNodes, or fewer (depending on the probability of node failure)
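The rule above can be sketched in shell. The property that carries this value in hdfs-site.xml is dfs.replication; replication_for is a hypothetical helper that caps the factor at the number of DataNodes and never goes above 3, which is HDFS's default. Treat the cap as one reasonable choice, not a requirement. The file is written to a scratch directory so the example can be run safely.

```shell
# Hypothetical helper: replication factor = min(number of DataNodes, 3).
replication_for() {
  datanodes=$1
  if [ "$datanodes" -lt 3 ]; then
    echo "$datanodes"
  else
    echo 3
  fi
}

# A pseudo distributed setup has a single DataNode.
replication=$(replication_for 1)

# Scratch directory for the demo; the real file lives in /usr/local/hadoop/conf/.
conf_dir="${TMPDIR:-/tmp}/hadoop-conf-demo"
mkdir -p "$conf_dir"

# Unquoted EOF so that $replication is expanded into the file.
cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>$replication</value>
  </property>
</configuration>
EOF

cat "$conf_dir/hdfs-site.xml"
```

With one DataNode the factor comes out as 1, since HDFS cannot place more replicas of a block than there are DataNodes to hold them.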
Formatting the NameNode
The first step in getting Hadoop up and running is to format the Hadoop Distributed File System (HDFS) of your Hadoop cluster. The NameNode should be formatted when the Hadoop cluster is set up for the first time. If you format an already running HDFS, you will lose all the data stored in HDFS for the cluster.
Execute the following command to format the NameNode:
$ hadoop namenode -format
Starting the Hadoop Cluster
Start the Hadoop server by executing the following command:
$ start-all.sh
To verify that all the services are up and running, execute the command below, which lists the running Java processes:
$ jps
Stop the hadoop servers
Stop the Hadoop server by executing the following command:
$ stop-all.sh
Apache Hadoop uses the $HADOOP_HOME/logs directory to maintain all the error logs, so whenever you face any issues while installing Hadoop on Ubuntu, look at the log files.
Common Errors Encountered during Hadoop Installation
Error: JAVA_HOME is not set.
Solution to the JAVA_HOME is not set error:
Open the hadoop-env.sh file located in HADOOP_HOME/conf/ using the vi editor and set the Java home path:
$ sudo vi $HADOOP_HOME/conf/hadoop-env.sh
Just to ensure that the error is resolved, try starting the hadoop server again using the hadoop command start-all.sh.
Error: hadoop: command not found
To resolve this issue, open the .bashrc file using the vi editor and add HADOOP_HOME/bin to the PATH:
$ sudo vi /home/hduser/.bashrc
Execute the command below to reload the .bashrc:
$ exec bash
Error: ls cannot access .: No such file or directory.
When you do not specify a path after the ls command, it uses the default path /user/<username> in HDFS, which does not exist until you create it:
$ hadoop fs -mkdir /user/hduser
Error: Some index files failed to download or old index file used.
This error can be resolved by removing the old index files with the command below and then running sudo apt-get update again:
$ sudo rm -r /var/lib/apt/lists/*
If you encounter an “Incompatible namespaceIDs” exception, troubleshoot it as follows:
- Stop all the services
- Delete /tmp/hadoop/dfs/data
- Start all the services again.