Hadoop Wiki

Apache Hadoop

Hadoop is an open source distributed processing framework, written in Java, for storing and processing large volumes of structured and unstructured data on clusters of commodity hardware. It is a big data platform that offers substantial processing power and can run many jobs concurrently across the cluster.

Apache Pig

Apache Pig is a Hadoop component that provides an abstraction over MapReduce so that programmers can analyse large volumes of data using the procedural data-flow language Pig Latin. The Pig engine converts Pig Latin scripts into Hadoop MapReduce jobs internally. Pig can also execute jobs on Apache Tez or Apache Spark.

Apache Hive

Apache Hive is a data warehouse-like infrastructure built on top of Hadoop for data querying, summarization and analysis. It provides a SQL-like interface, the Hive Query Language (HiveQL), for running MapReduce jobs. The Hive service compiles each query into one or more MapReduce jobs, which are then executed across the Hadoop cluster.

Apache HBase

HBase is an open source, column-oriented, distributed NoSQL database built on top of HDFS that provides real-time read/write access to large datasets. It scales horizontally and offers low-latency lookups even on very large tables. HBase works well for sparse datasets and brings Google Bigtable-like features to Hadoop.

Apache Sqoop

Sqoop gets its name from two well-known technologies, SQL and Hadoop: "Sq" from SQL and "oop" from Hadoop. It is a tool used primarily for bulk data transfer, so that data can be imported from and exported to relational databases, data warehouses, or even NoSQL data stores with little effort. Its connector-based architecture lets other tools be plugged into Sqoop and lets Sqoop be plugged into other tools. For example, Sqoop can be run from Apache Oozie, a workflow management tool, to automate import/export tasks.

Apache Flume

Flume is a data ingestion tool used to send streaming data, such as log files and events, from different sources to HDFS. It is an efficient, reliable, distributed tool for collecting, aggregating and transporting data from multiple web servers to a centralized data store.

Apache Oozie

Oozie is a Java-based web application used for scheduling Hadoop jobs. Hadoop developers can run a series of jobs on a given schedule by arranging them into an ordered pipeline in the distributed environment. Oozie integrates closely with other Hadoop components such as Pig, Hive and Sqoop, and can therefore run many kinds of Hadoop jobs.
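
The idea of an ordered pipeline can be illustrated with a minimal Python sketch. This is not Oozie's own workflow format (Oozie workflows are defined in XML); the job names and the `execution_order` helper are hypothetical and only show how a set of job dependencies resolves into a run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "sqoop_import": set(),            # no prerequisites
    "pig_clean": {"sqoop_import"},    # runs after the import
    "hive_report": {"pig_clean"},     # runs after the cleaning step
}

def execution_order(dependencies):
    """Return job names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())

if __name__ == "__main__":
    for job in execution_order(workflow):
        print("run", job)
```

A scheduler built on this idea would start each job only after all of its prerequisites have finished; a circular dependency makes `TopologicalSorter` raise `graphlib.CycleError`.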

Big Data

Big Data refers to large and complex datasets (structured and unstructured) that cannot be stored and processed effectively using traditional applications. Big data is characterized by three important V's: Volume, Velocity and Variety.
  • Volume - The size of big data, typically measured in terabytes or petabytes rather than megabytes or gigabytes.
  • Variety - Big data can exist in many forms: different file formats, SQL database stores, sensor data, social media data, or data in any other form.
  • Velocity - The speed at which data arrives and must be analysed to yield meaningful business gains.


DataNode

A DataNode is a machine on which the actual data resides within the Hadoop cluster. The data in a file is replicated on several DataNodes, according to the replication factor, so that it remains available if a node fails. In the Hadoop architecture, the DataNode is referred to as the slave machine.
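
The effect of the replication factor can be sketched in a few lines of Python. This is a simplified illustration, not HDFS's actual placement policy: the round-robin assignment, the function names and the node names are all assumptions made for the example. It relies only on the facts that HDFS stores a file as blocks and that the default replication factor is 3.

```python
def place_replicas(block_ids, datanodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct DataNodes.

    Simplified round-robin placement (an assumption for illustration);
    real HDFS placement also takes rack topology into account.
    """
    if replication_factor > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            datanodes[(i + r) % len(datanodes)]
            for r in range(replication_factor)
        ]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live DataNode."""
    failed = set(failed_nodes)
    return [
        block for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    ]

if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    placement = place_replicas(["b0", "b1"], nodes)
    print(placement)
    print(readable_blocks(placement, failed_nodes=["dn1", "dn2"]))
```

With three replicas per block, every block stays readable after any two DataNodes fail; data is lost only when all nodes holding a given block are down at once.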

Hadoop Cluster

A Hadoop cluster is a special form of computer cluster designed for storing and analysing structured and unstructured data, and it runs the open source distributed processing software Hadoop. Unlike a conventional computer cluster built from high-end servers, a Hadoop cluster is composed of low-cost commodity computers. A Hadoop cluster is made up of a NameNode, DataNodes, a JobTracker and TaskTrackers (in Hadoop 2, YARN's ResourceManager and NodeManagers take over the roles of the JobTracker and TaskTrackers).

Hadoop Common

Hadoop Common is an integral component of the Hadoop ecosystem that consists of generic libraries and basic utilities supporting the other Hadoop components: HDFS, MapReduce, and YARN.


HDFS

The Hadoop Distributed File System (HDFS) is the primary storage component of the Hadoop framework. HDFS is a scalable, Java-based file system that reliably stores large datasets of structured or unstructured data.


MapReduce

MapReduce is a programming paradigm, implemented in Java in Hadoop, for processing large volumes of data stored in HDFS or other storage file systems. It is the heart of the Apache Hadoop framework and provides scalability across thousands of nodes in a Hadoop cluster. As the name suggests, every MapReduce job performs two distinct kinds of task: map and reduce. The map task takes a set of data, processes it at node level and generates output (another set of data). The reduce task takes the output of the map task as its input and combines it into a smaller set of tuples, reducing the large dataset to a smaller one according to the transformations and business logic.
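
The map and reduce steps can be illustrated with the classic word-count example. The sketch below is plain, single-process Python, not the Hadoop Java API; the function names are hypothetical, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster the map calls run in parallel on the nodes that hold the input data, and each reducer receives only the keys assigned to it; the sketch keeps the same data flow but runs everything in one process.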


YARN

Yet Another Resource Negotiator (YARN) is the cluster management technology introduced in Hadoop 2.0. Its main idea is to split resource management and job scheduling/monitoring into separate daemons: a single global ResourceManager and a per-application ApplicationMaster. YARN allows Hadoop to be used for operational applications that cannot wait for batch jobs to complete. It makes the Hadoop ecosystem more robust and provides an open source computing environment that can be scaled easily in the future.
