Hadoop 2.0 (YARN) Framework - The Gateway to Easier Programming for Hadoop Users

Big Data is evolving rapidly, and its processing frameworks are evolving with it. Hadoop has progressed from the restricted, batch-oriented MapReduce model of Hadoop 1.0 to the more specialized and interactive processing models of Hadoop 2.0. With Hadoop 2.0, organizations can build data processing approaches within Hadoop that the architectural limitations of Hadoop 1.0 ruled out. This article gives an overview of Hadoop 2.0 (YARN) and explains why users should switch from Hadoop 1.0 to Hadoop 2.0.

Evolution of Hadoop 2.0 (YARN) - Swiss Army Knife of Big Data

Hadoop was introduced in 2005 to support distributed processing of large-scale data workloads on clusters through the MapReduce processing engine, and it has been substantially reworked since then. The result is a more advanced Hadoop framework that supports not only MapReduce but also various other distributed processing models.

Large web companies such as Yahoo and Facebook that adopted Apache Hadoop had to depend on the pairing of Hadoop HDFS with MapReduce, which served as both the resource management environment and the programming model. Together these technologies let users store, manage and process huge amounts of structured, semi-structured or unstructured data within Hadoop clusters. Nevertheless, the Hadoop and MapReduce pairing had certain intrinsic drawbacks. For instance, users of Hadoop 1.0 struggled to keep pace with the flood of information they were collecting online because of the batch processing design of MapReduce.

Introduction to Hadoop YARN (Hadoop 2.0)


Difference between Hadoop 1 and Hadoop 2


Hadoop 2.0, popularly known by the name of its resource management layer YARN (Yet Another Resource Negotiator), was introduced in October 2013 and is now widely used for processing and managing distributed big data.

Hadoop YARN is an advancement over Hadoop 1.0 that provides performance enhancements benefiting the technologies of the Hadoop ecosystem, including the Hive data warehouse and the Hadoop database (HBase). YARN ships with the Hadoop 2.x distributions provided by Hadoop distributors. It performs job scheduling and resource management without requiring users to run Hadoop MapReduce on their Hadoop systems.

Hadoop YARN modifies the Hadoop 1.0 architecture so that systems can scale to new levels and responsibilities are clearly assigned to separate components of the Hadoop cluster.

Need to Switch from Hadoop 1.0 to Hadoop 2.0 (YARN)

The first version of Hadoop had both advantages and disadvantages. Hadoop MapReduce is an established standard for big data processing, but its architecture has some drawbacks that generally surface when dealing with huge clusters.

Limitations of Hadoop 1.0

Issue of Availability:

The Hadoop 1.0 architecture had a single point of failure, the Job Tracker: if the Job Tracker fails, all the jobs have to restart.

Issue of Scalability:

The Job Tracker runs on a single machine and performs monitoring, job scheduling, task scheduling and resource management. Although several machines (Data Nodes) were present, they were not utilized efficiently, which limited the scalability of the system.
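
One way to picture the utilization problem is the fixed-slot model of Hadoop 1.0, where each worker is configured with separate map slots and reduce slots and a slot of one kind cannot run a task of the other kind. The Python sketch below is a toy model, not Hadoop code: the function names and the numbers are made up for illustration, and it assumes every task needs the same amount of capacity.

```python
def slot_based_running(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Tasks that can run at once when slots are typed (Hadoop 1.0 style)."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def pooled_running(total_units, pending_maps, pending_reduces):
    """Tasks that can run at once when capacity is one shared pool."""
    return min(total_units, pending_maps + pending_reduces)


# Hypothetical node: 8 map slots and 4 reduce slots, i.e. 12 units in total.
# A map-heavy phase has 12 map tasks waiting and no reduce tasks yet.
TOTAL_UNITS = 12
slot_count = slot_based_running(8, 4, pending_maps=12, pending_reduces=0)
pool_count = pooled_running(TOTAL_UNITS, pending_maps=12, pending_reduces=0)

print(f"typed slots: {slot_count} running, {TOTAL_UNITS - slot_count} units idle")
print(f"shared pool: {pool_count} running, {TOTAL_UNITS - pool_count} units idle")
```

In this made-up scenario the reduce slots sit idle while map tasks queue, which is the kind of inefficiency a shared pool of capacity avoids.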

Cascading Failure Issue:

With Hadoop MapReduce, some instability is observed when the number of nodes in a cluster is greater than 4000. The most commonly observed failure is the cascading failure, in which overloading the nodes or replicating data floods the network and can cause the overall cluster to deteriorate.

Multi-Tenancy Issue:

The major issue with Hadoop MapReduce that paved the way for Hadoop YARN was multi-tenancy. As clusters in Hadoop systems grow in size, they can be employed for a wide range of processing models.

Hadoop 1.0 dedicates the nodes of the cluster to MapReduce, so they cannot be repurposed for other big data workloads and applications. With Big Data and Hadoop dominating data processing applications in cloud deployments, the number of nodes per cluster is likely to increase, and the switch from 1.x to 2.x addresses this issue.

The limitations of Hadoop MapReduce do not end there. Hadoop programmers raised several other concerns with version 1.0, such as inefficient utilization of resources, constraints on running non-MapReduce applications, and limited support for ad-hoc queries, real-time analysis and the message passing approach.

Understanding the Differences between the Components of Hadoop 1.0 and Hadoop 2.0


Hadoop 1.0 vs Hadoop 2.0

Hadoop 1.0, also called MRv1, mainly consists of 3 important components, namely:

1) Resource Management:

This is an infrastructure component that takes care of monitoring the nodes, allocating the resources and scheduling various jobs.

2) Application Programming Interface (API):

This component is for the users to program various MapReduce applications.

3) Framework:

This component is for all the runtime services such as Shuffling, Sorting and executing Map and Reduce processes.
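
The runtime services named above can be illustrated with a word count written in plain Python. This is a minimal single-process sketch of the map, shuffle and sort, and reduce steps, not the Hadoop API; the function names and the sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["Hadoop stores data", "YARN schedules work", "Hadoop runs YARN"]

# Run the three steps in order, as the framework would across many machines.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))

print(word_counts)
```

In a real cluster the framework runs many map and reduce processes in parallel and moves the grouped data between machines; the sketch only shows the order of the steps.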

Hadoop 2.0 YARN

The major difference in Hadoop 2.0 is that the cluster resource management capabilities are moved into YARN.


YARN has taken over the cluster management responsibilities from MapReduce, so MapReduce now handles only the data processing while YARN takes care of the other responsibilities.

Hadoop 2.0 (YARN) and Its Components


Hadoop 2.0 Architecture


In Hadoop 2.0, the responsibilities of the Job Tracker are split in YARN among 3 important components:

1. Resource Manager Component:

This component is the negotiator of all the resources in the cluster. The Resource Manager is further divided into an Application Manager, which manages all the user jobs on the cluster, and a pluggable scheduler. It is a long-running YARN service designed for receiving and running applications on the Hadoop cluster. In Hadoop 2.0, a MapReduce job is considered an application.

2. Node Manager Component:

This is the per-node agent of YARN, which launches and monitors containers on its node and reports to the Resource Manager. The NM keeps track of the users’ jobs and their workflow on its particular node. Information about completed jobs is furnished by a separate job history server.

3. Application Master Component (aka User Job Life Cycle Manager):

This is the component where the job actually resides. The Application Master manages an individual MapReduce job through its life cycle and terminates once the job completes processing.
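
The interplay of the three components can be sketched as a toy model in Python. This is not the YARN API: the class and method names, the placement rule (pick the node with the most free capacity) and the one-unit-per-container assumption are all simplifications made for illustration.

```python
class NodeManager:
    """Per-node agent: runs the containers granted on its node."""

    def __init__(self, name, capacity):
        self.name = name
        self.free = capacity  # container units not yet granted
        self.running = []

    def launch(self, label):
        self.running.append(label)

    def stop(self, label):
        self.running.remove(label)
        self.free += 1


class ResourceManager:
    """Cluster-wide negotiator: decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        # Simplest possible policy: grant a unit on the emptiest node.
        node = max(self.nodes, key=lambda n: n.free)
        if node.free == 0:
            return None  # cluster is full
        node.free -= 1
        return node

    def submit(self, app_name, num_tasks):
        # Every application first gets one container for its Application Master.
        am_node = self.allocate()
        if am_node is None:
            raise RuntimeError("no capacity for the Application Master")
        am_node.launch(f"{app_name}/AM")
        return ApplicationMaster(app_name, num_tasks, self, am_node)


class ApplicationMaster:
    """Per-application manager: requests containers and tracks its own tasks."""

    def __init__(self, app_name, num_tasks, rm, am_node):
        self.app_name = app_name
        self.num_tasks = num_tasks
        self.rm = rm
        self.am_node = am_node
        self.placements = []  # (task label, node) pairs

    def run(self):
        for i in range(self.num_tasks):
            node = self.rm.allocate()
            if node is None:
                break  # a real Application Master would wait and retry
            label = f"{self.app_name}/task-{i}"
            node.launch(label)
            self.placements.append((label, node))
        return [(label, node.name) for label, node in self.placements]

    def finish(self):
        # Stop the task containers, then the Application Master's own container.
        for label, node in self.placements:
            node.stop(label)
        self.placements = []
        self.am_node.stop(f"{self.app_name}/AM")


nodes = [NodeManager("node-1", 2), NodeManager("node-2", 2)]
rm = ResourceManager(nodes)
am = rm.submit("wordcount", num_tasks=3)
placed = am.run()
print(placed)
am.finish()
print([(n.name, n.free) for n in nodes])
```

The point of the sketch is the division of labour: the Resource Manager only hands out capacity, each Node Manager only runs what lands on its node, and the per-job logic lives in an Application Master that goes away when the job is done.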

A Gist on Hadoop 2.0 Components

RM-Resource Manager

1. It is the global resource scheduler.

2. It runs on the Master Node of the Cluster.

3. It is responsible for negotiating the resources of the system amongst the competing applications.

4. It keeps track of the heartbeats from the Node Managers.

NM-Node Manager

1. Node Manager communicates with the Resource Manager.

2. It runs on the Slave Nodes of the Cluster.

AM-Application Master

1. There is one AM per application, which is application specific or framework specific.

2. The AM runs in Containers that are created by the Resource Manager on request.
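
The heartbeat tracking mentioned for the Resource Manager can be pictured with a small Python sketch. It is a toy model rather than YARN code: the class name, the 10-second timeout and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Toy version of the liveness tracking a resource manager performs."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of the latest heartbeat

    def heartbeat(self, node, now):
        """Record that a node reported in at time `now`."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        """Nodes whose latest heartbeat is within the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def lost_nodes(self, now):
        """Known nodes that have been silent for longer than the timeout."""
        return sorted(set(self.last_seen) - set(self.live_nodes(now)))


monitor = HeartbeatMonitor(timeout_seconds=10)  # hypothetical timeout
monitor.heartbeat("node-1", now=0)
monitor.heartbeat("node-2", now=0)
monitor.heartbeat("node-1", now=8)  # node-2 stays silent from here on

print(monitor.live_nodes(now=12))
print(monitor.lost_nodes(now=12))
```

A node that stops reporting is eventually treated as lost, so the work it was running can be scheduled elsewhere.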

Migration from Hadoop 1.0 to Hadoop 2.0

With the advent of the YARN framework as part of the Hadoop 2.0 platform, Hadoop programmers now have several applications and tools that help them get more out of big data than they previously thought possible.

By separating the cluster resource management function completely from the data processing function, YARN gives organizations something that goes far beyond MapReduce. Because it relies on less burdensome programming protocols and is cost effective, companies prefer to migrate their applications from Hadoop 1.0 to Hadoop 2.0. A further edge that YARN offers Hadoop users is backward compatibility (i.e. an existing MapReduce job can run on Hadoop 2.0 without modification), which gives companies a strong reason to migrate from Hadoop 1.0 to Hadoop 2.0.

Although most Hadoop applications have migrated from Hadoop 1.0 to Hadoop 2.0, some migrations are still in progress, and companies continue to work on this long-needed upgrade for their applications.

With Hadoop YARN, Hadoop developers can build applications directly on Hadoop instead of bolting on third-party vendor tools, as was the case with Hadoop 1.0. This is another important reason why companies currently using Hadoop will adopt Hadoop 2.0 as a platform for creating applications and manipulating data more effectively and efficiently.

YARN is the elephant-sized change that Hadoop 2.0 has brought in. Companies will undoubtedly face challenges as they migrate from Hadoop 1.0 to Hadoop 2.0, but the basic changes to the MR framework will make Hadoop more usable in upcoming big data scenarios. Because Hadoop 2.0 offers better isolation and scalability than the earlier version, several novel tools that make the most of the new features in YARN (Hadoop 2.0) are anticipated soon.
