Hadoop PIG Tutorial

Fundamentals of Apache Pig

Apache Pig is designed to handle any kind of data. Apache Pig is a high level extensible language designed to reduce the complexities of coding MapReduce applications. Pig was developed at Yahoo to help people use Hadoop to emphasize on analysing large unstructured data sets by minimizing the time spent on writing Mapper and Reducer functions.

All tasks are encoded in a manner that helps the system to optimize the execution automatically because typically 10 lines of code in Pig equal 200 lines of code in Java. Pig converts its operators into MapReduce code. Apache Pig is composed of 2 components mainly-on is the Pig Latin programming language and the other is the Pig Runtime environment in which Pig Latin programs are executed.

What is Apache Pig

Apache Pig is a high-level language platform developed to execute queries on huge datasets that are stored in HDFS using Apache Hadoop. It is similar to SQL query language but applied on a larger dataset and with additional features. The language used in Pig is called PigLatin. It is very similar to SQL. It is used to load the data, apply the required filters and dump the data in the required format. It requires a Java runtime environment to execute the programs. Pig converts all the operations into Map and Reduce tasks which can be efficiently processed on Hadoop. It basically allows us to concentrate upon the whole operation irrespective of the individual mapper and reducer functions.

For example, Pig can be used to run a query to find the rows which exceed a threshold value. It can be used to join two different types of datasets based upon a key. Pig can be used to iterative algorithms over a dataset. It is ideal for ETL operations i.e; Extract, Transform and Load. It allows a detailed step by step procedure by which the data has to be transformed. It can handle inconsistent schema data.

Apache Pig Basic Commands and Syntax

Pig Latin is the language used to write Pig programs. It is similar to algorithm broken down into various steps that could be written using SQL transformations. Pig platform follows lazy evaluation strategy. No value or operation is evaluated until the the value or the transformed data is required. This reduces the repeated calculations.

Learn Hadoop by working on interesting Big Data and Hadoop Projects

Every pig program consists of three parts. Loading, Transforming, Dumping of the data.

Below given is a sample data load command. We provide the file location which can be a directory or a specific file. We select the load function through which data is parsed from the file. PigStorage function parses each line in the file and splits the data based on the argument provided with the function to generate the fields. We provide the schema (field names with data type) in the load function after the keyword ‘as‘.

 modified_data = GROUP data BY ;
 counts = FOREACH modified_data GENERATE group, 
          COUNT(data);

We can either dump the processed data or store it in a file based upon the requirements. Using dump method, the processed data is displayed on the standard output.

store counts into '’ ;
DUMP counts;

User Defined Function

This is an important feature in PigLatin language. It allows to extend the present evaluation function and mathematical function by using custom used written functions. It allows users to write the custom functions using Java, Python,Jython, Ruby, Groovy and Javascript. Java functions are written by extending the evaluation function class. Then the scripts have to be added to pig library using the ‘register’ command.

In case of Java, the required data types are imported from the resective classes and the custom function is written by extending the resecting class. In case of Jython the script is registered using jython which imports the required scripts to interpret the jython script. The output schema for every function is specified so that pig can parse the data. The same goes with Javascript. In case of Ruby ‘pigudf’ library is extended and jruby is used to register the script. In case of python udf, python command line is used in which the data is streamed in and out to execute the script.

Executing Pig programs

A pig program can be executed in three methods. We can write a pig script file containing all the commands and execute it from the command line. We can use the interactive shell, Grunt to execute commands line by line. It can also run scripts using run or exec commands. We can execute the required commands by extending the PigRunner class. It provides the access to run the commands from any program.

To run the program using the script, run the following command. The script can be stored in hdfs which can be distributed to other machines in case the program is run in cluster mode.

$ pig

Get FREE Access to Data Analytics Example Codes for Data Cleaning, Data Munging, and Data Visualization