Apache Pig Tutorial: User Defined Function (Python)

This Apache Pig case study covers how to write a user defined function (UDF). A student grades database is used to illustrate writing and registering a custom Python script for Apache Pig. The theme of the example is analyzing student performance: the database contains student, subject, credit and score details, and the custom Python script presented here calculates each student's weighted average of grades, i.e. the grade point average (GPA).


The sample data used in this example has the following format:

3750059, 3691, 3, 7
3750059, 1060, 2, 9
3750059, 6614, 4, 8

The first column is the roll number of the student, the second is the course number (like PHY-101), the third is the credit value of the course, and the last is the grade the student scored in that course (on a scale of 1-10). The objective is to calculate the credit-weighted average of all the subject grades, which gives the final GPA.

Solution:

The first step is to write the custom program that calculates the grade point average. The credits and grades are taken as input; corresponding elements are multiplied and summed, and the total is then divided by the sum of credits to obtain the GPA. The 'division' feature is imported from __future__ so that all division operations in the program produce float output. The output schema of the function has to be declared with the outputSchema decorator so that Pig can parse the output; in this case it is of 'double' type. The inputs are Pig bags (passed in as lists of single-field tuples), so their values are temporarily unpacked into arrays. The sum of credits is accumulated in 'denom', and 'num' will hold the weighted sum of the grades.

from __future__ import division

@outputSchema("num1:double")
def get_gpa(credit, grade):
    num, denom = 0, 0
    temp1, temp2 = [], []
    for t in credit:          # each element of the bag is a one-field tuple
        temp1.append(t[0])
        denom += t[0]         # running total of credits
    for u in grade:
        temp2.append(u[0])

The 'num' variable holds the weighted sum of the grades, obtained by multiplying each grade with its respective credit. Dividing it by the sum of credits gives the required value.

    for i in range(0, len(temp1)):
        num += temp1[i] * temp2[i]
    val = num / denom
    return val
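Outside Pig, the same logic can be sanity-checked with plain Python by stubbing the decorator that Pig's jython runtime normally provides (a sketch; in Python 3 the __future__ import is unnecessary, and `outputSchema` below is a no-op stand-in):

```python
# Stand-in for the decorator injected by Pig's jython runtime.
def outputSchema(schema):
    def wrap(func):
        return func
    return wrap

@outputSchema("num1:double")
def get_gpa(credit, grade):
    # Pig passes bags of single-field tuples, e.g. [(3,), (2,), ...]
    num, denom = 0, 0
    temp1, temp2 = [], []
    for t in credit:
        temp1.append(t[0])
        denom += t[0]          # running total of credits
    for u in grade:
        temp2.append(u[0])
    for i in range(0, len(temp1)):
        num += temp1[i] * temp2[i]   # weighted sum of grades
    return num / denom

# Student 3750059 from the worked example below:
credits = [(3,), (2,), (4,), (3,)]
grades = [(7,), (9,), (8,), (4,)]
print(get_gpa(credits, grades))  # (3*7 + 2*9 + 4*8 + 3*4) / 12 = 83/12
```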

This Python script needs to be registered in Apache Pig before it can be used in Pig scripts. This is done by running the following command, either in the interactive shell (grunt) or from a Pig script. Here we register the script under the namespace 'func'. The jython library shipped with Pig provides the runtime required for executing Python programs.

grunt> REGISTER 'test.py' USING jython AS func;

Before we import the dataset, let's look at the load/store options available. Pig can handle compressed files (.gz) without any external function or library. BinStorage is used to load and store temporary data in binary format. JsonLoader and JsonStorage read and store data in JSON format, taking the schema of the document as an argument. PigDump stores results as tuples in UTF-8 format. Now we need to import the dataset. As it is comma-separated data with one record per line, the PigStorage function is used with ',' as the argument. The data schema is defined for ease of transformation.

grunt> data = LOAD 'sample_data' USING PigStorage(',') AS 
              (student_id:int, subject_id:int, credit:int, grade:int) ;

The data is then grouped on the student ID value. The credits and grades of each student are then passed to the function to calculate the GPA. Observe that the function is referenced using the namespace 'func' and the function name 'get_gpa'.

grunt> grp_student = GROUP data BY student_id ;
grunt> gpa = FOREACH grp_student GENERATE group AS stud_id,
             data.subject_id AS subj_id,
             func.get_gpa(data.credit, data.grade) AS gpa;

Let's consider an example to get an overview of how the code is executed.

3750059, 3691, 3, 7
3750059, 1060, 2, 9
3750059, 6614, 4, 8
3750059, 9857, 3, 4

Upon grouping the data by student, it takes the following form: each group has a single key, the student ID, and under that key sits a bag with all the records for that student. The record format remains the same, with the subject ID, credit and grade values in the same order.

(3750059, {(3691,3,7), (1060,2,9), (6614,4,8), (9857,3,4)})
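The grouped structure above can be mimicked in plain Python with a dictionary of bags, one per student ID (a sketch using the example rows, with field names following the LOAD schema):

```python
from collections import defaultdict

# (student_id, subject_id, credit, grade) rows from the worked example
rows = [
    (3750059, 3691, 3, 7),
    (3750059, 1060, 2, 9),
    (3750059, 6614, 4, 8),
    (3750059, 9857, 3, 4),
]

# GROUP data BY student_id: one bag of (subject_id, credit, grade) per student
grouped = defaultdict(list)
for student_id, subject_id, credit, grade in rows:
    grouped[student_id].append((subject_id, credit, grade))

print(dict(grouped))
```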

With the 'GENERATE' command, all the values of a specific field within a group are gathered into a bag. For instance, the statement above produces three such bags: data.subject_id, data.credit and data.grade, each holding the respective values for that student. The function is called once per student ID with the credit and grade bags as inputs, and the calculated GPA values are stored in the 'gpa' relation. The relation's schema can be viewed with the 'DESCRIBE' command, the data can be saved for future reference with the 'STORE' command, and the results can be ordered by the gpa attribute with the 'RANK' command.


grunt> DESCRIBE gpa;
gpa: {stud_id: int,subj_id: {(subject_id: int)},gpa: double}
grunt> gpa = RANK gpa by gpa DESC ;
grunt> STORE gpa INTO 'gpa_results' USING PigStorage(',');
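What RANK does here can be sketched in plain Python: sort descending on the gpa field and prepend a 1-based rank (the two records are taken from the sample output below, reduced to (student_id, gpa) pairs):

```python
# (student_id, gpa) pairs from the sample output
gpa_rows = [(4519926, 1.1666666666666667), (4512199, 1.1875)]

# RANK gpa BY gpa DESC: sort descending on gpa, prepend a 1-based rank field
ranked = [(i + 1, sid, g)
          for i, (sid, g) in enumerate(
              sorted(gpa_rows, key=lambda r: r[1], reverse=True))]
print(ranked)  # [(1, 4512199, 1.1875), (2, 4519926, 1.1666666666666667)]
```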

We can run various queries on subject and GPA to analyze the data. Let's calculate the average grade for each subject: first group the data by subject ID, then compute the average for each group, i.e. each subject.

grunt> subject_data = GROUP data BY subject_id;
grunt> subject_avg = FOREACH subject_data GENERATE group,
                     AVG(data.grade);
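What GROUP followed by AVG computes can be sketched in plain Python (the three sample rows here are hypothetical, for illustration only):

```python
from collections import defaultdict

# Hypothetical (student_id, subject_id, credit, grade) rows for one subject
rows = [
    (3750059, 3691, 3, 7),
    (3750060, 3691, 3, 9),
    (3750061, 3691, 3, 5),
]

# GROUP data BY subject_id, then AVG(data.grade) over each group
by_subject = defaultdict(list)
for student_id, subject_id, credit, grade in rows:
    by_subject[subject_id].append(grade)

subject_avg = {s: sum(g) / len(g) for s, g in by_subject.items()}
print(subject_avg)  # {3691: 7.0}
```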

We can use a 'JOIN' statement to join the relation containing the calculated GPA with the student details table, using the student ID as the key. Assume the student ID is stored in the first field of the 'student_info' relation. Since, after ranking, the student ID sits in the second field of the 'gpa' relation, we specify '$1' for it.

grunt> final_data = JOIN gpa BY $1 FULL, student_info BY $0;
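A full outer join on the student ID can be illustrated in plain Python (a sketch; the 'student_info' record and its name field are hypothetical):

```python
gpa_rows = [(1, 3750059, 83 / 12)]           # (rank, student_id, gpa)
student_info = [(3750059, "A. Student")]     # (student_id, name) -- hypothetical

# JOIN gpa BY $1 FULL, student_info BY $0: keep ids present on either side
ids = {r[1] for r in gpa_rows} | {s[0] for s in student_info}
final_data = []
for sid in ids:
    g = next((r for r in gpa_rows if r[1] == sid), None)      # None if missing
    s = next((t for t in student_info if t[0] == sid), None)  # None if missing
    final_data.append((g, s))
print(final_data)
```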

We can run further queries on the combined student information and GPA data, such as calculating the average grades for a given institution. For this we can use the 'GROUP' command together with built-in mathematical functions such as AVG, MIN, MAX and COUNT; beyond that, custom functions can be written to calculate percentiles, medians, and so on.

Sample Output:

Relation(gpa) :-

Index number, Student ID, { Subject Codes }, GPA

98948,4519926,{(2426),(7054),(9294)},1.1666666666666667
98938,4512199,{(3691),(1060),(6614),(9857)},1.1875

Relation(subject_avg) :-

9534,5.00871105861509
9844,4.990113405059611
