Apache Pig Tutorial: User Defined Function (Python)

This Apache Pig case study covers how to write a user defined function (UDF). A student grades database is used to illustrate writing and registering a custom Python script for Apache Pig. The theme of the example is analyzing student performance: the database contains student, subject, credit and score details, and the custom Python script presented here calculates each student's weighted average of grades, i.e. the grade point average (GPA).


The sample data used in this example has the following format:

3750059, 3691, 3, 7
3750059, 1060, 2, 9
3750059, 6614, 4, 8

The first column is the roll number of the student, the second is the course number (like PHY-101), the third is the credit value of the course, and the last is the grade the student scored in that course (on a scale of 1-10). The objective is to calculate the credit-weighted average of all the subject grades, which gives the final GPA.

Solution:

The first step is to write the custom program that calculates the grade point average. The credits and grades are taken as input; corresponding elements are multiplied and summed, and the total is then divided by the sum of credits to obtain the GPA. The 'division' feature is imported from __future__ so that all division operations in the program produce float output. The output schema of the function has to be declared with the outputSchema decorator so that Pig can parse the output; in this case it is of 'double' type. The inputs are Pig bags (passed in as lists of single-field tuples), so their values are temporarily unpacked into arrays. The sum of credits is accumulated in 'denom', and 'num' will hold the weighted sum of the grades.

from __future__ import division

@outputSchema("num1:double")
def get_gpa(credit, grade):
    num, denom = 0, 0
    temp1, temp2 = [], []
    for t in credit:          # each element of the bag is a one-field tuple
        temp1.append(t[0])
        denom += t[0]         # running total of credits
    for u in grade:
        temp2.append(u[0])

The 'num' variable holds the weighted sum of the grades, obtained by multiplying each grade with its respective credit. Dividing it by the sum of credits gives the required value.

    for i in range(0, len(temp1)):
        num += temp1[i] * temp2[i]
    val = num / denom
    return val
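Outside Pig, the same logic can be sanity-checked with plain Python by stubbing the decorator that Pig's jython runtime normally provides (a sketch; in Python 3 the __future__ import is unnecessary, and `outputSchema` below is a no-op stand-in):

```python
# Stand-in for the decorator injected by Pig's jython runtime.
def outputSchema(schema):
    def wrap(func):
        return func
    return wrap

@outputSchema("num1:double")
def get_gpa(credit, grade):
    # Pig passes bags of single-field tuples, e.g. [(3,), (2,), ...]
    num, denom = 0, 0
    temp1, temp2 = [], []
    for t in credit:
        temp1.append(t[0])
        denom += t[0]          # running total of credits
    for u in grade:
        temp2.append(u[0])
    for i in range(0, len(temp1)):
        num += temp1[i] * temp2[i]   # weighted sum of grades
    return num / denom

# Student 3750059 from the worked example below:
credits = [(3,), (2,), (4,), (3,)]
grades = [(7,), (9,), (8,), (4,)]
print(get_gpa(credits, grades))  # (3*7 + 2*9 + 4*8 + 3*4) / 12 = 83/12
```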

This Python script needs to be registered in Apache Pig before it can be used in Pig scripts. This is done by running the following command, either in the interactive shell (grunt) or from a Pig script. Here we register the script under the namespace 'func'. The jython library shipped with Pig provides the runtime required for executing Python programs.

grunt> REGISTER 'test.py' USING jython AS func;

Before we import the dataset, let's look at the load/store options available. Pig can handle compressed files (.gz) without any external function or library. BinStorage is used to load and store temporary data in binary format. JsonLoader and JsonStorage read and store data in JSON format, taking the schema of the document as an argument. PigDump stores results as tuples in UTF-8 format. Now we need to import the dataset. As it is comma-separated data with one record per line, the PigStorage function is used with ',' as the argument. The data schema is defined for ease of transformation.

grunt> data = LOAD 'sample_data' USING PigStorage(',') AS 
              (student_id:int, subject_id:int, credit:int, grade:int) ;

The data is then grouped on the student ID value. The credits and grades of each student are then passed to the function to calculate the GPA. Observe that the function is referenced using the namespace 'func' and the function name 'get_gpa'.

grunt> grp_student = GROUP data BY student_id ;
grunt> gpa = FOREACH grp_student GENERATE group AS stud_id,
             data.subject_id AS subj_id,
             func.get_gpa(data.credit, data.grade) AS gpa;

Let's consider an example to get an overview of how the code is executed.

3750059, 3691, 3, 7
3750059, 1060, 2, 9
3750059, 6614, 4, 8
3750059, 9857, 3, 4

Upon grouping the data by student, it takes the following form: each group has a single key, the student ID, and under that key sits a bag with all the records for that student. The record format remains the same, with the subject ID, credit and grade values in the same order.

(3750059, {(3691,3,7), (1060,2,9), (6614,4,8), (9857,3,4)})
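The grouped structure above can be mimicked in plain Python with a dictionary of bags, one per student ID (a sketch using the example rows, with field names following the LOAD schema):

```python
from collections import defaultdict

# (student_id, subject_id, credit, grade) rows from the worked example
rows = [
    (3750059, 3691, 3, 7),
    (3750059, 1060, 2, 9),
    (3750059, 6614, 4, 8),
    (3750059, 9857, 3, 4),
]

# GROUP data BY student_id: one bag of (subject_id, credit, grade) per student
grouped = defaultdict(list)
for student_id, subject_id, credit, grade in rows:
    grouped[student_id].append((subject_id, credit, grade))

print(dict(grouped))
```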

With the 'GENERATE' command, all the values of a specific field within a group are gathered into a bag. For instance, the statement above produces three such bags: data.subject_id, data.credit and data.grade, each holding the respective values for that student. The function is called once per student ID with the credit and grade bags as inputs, and the calculated GPA values are stored in the 'gpa' relation. The relation's schema can be viewed with the 'DESCRIBE' command, the data can be saved for future reference with the 'STORE' command, and the results can be ordered by the gpa attribute with the 'RANK' command.


grunt> DESCRIBE gpa;
gpa: {stud_id: int,subj_id: {(subject_id: int)},gpa: double}
grunt> gpa = RANK gpa by gpa DESC ;
grunt> STORE gpa INTO 'gpa_results' USING PigStorage(',');
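What RANK does here can be sketched in plain Python: sort descending on the gpa field and prepend a 1-based rank (the two records are taken from the sample output below, reduced to (student_id, gpa) pairs):

```python
# (student_id, gpa) pairs from the sample output
gpa_rows = [(4519926, 1.1666666666666667), (4512199, 1.1875)]

# RANK gpa BY gpa DESC: sort descending on gpa, prepend a 1-based rank field
ranked = [(i + 1, sid, g)
          for i, (sid, g) in enumerate(
              sorted(gpa_rows, key=lambda r: r[1], reverse=True))]
print(ranked)  # [(1, 4512199, 1.1875), (2, 4519926, 1.1666666666666667)]
```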

We can run various queries on subject and GPA to analyze the data. Let's calculate the average grade for each subject: first group the data by subject ID, then compute the average for each group, i.e. each subject.

grunt> subject_data = GROUP data BY subject_id;
grunt> subject_avg = FOREACH subject_data GENERATE group,
                     AVG(data.grade);
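What GROUP followed by AVG computes can be sketched in plain Python (the three sample rows here are hypothetical, for illustration only):

```python
from collections import defaultdict

# Hypothetical (student_id, subject_id, credit, grade) rows for one subject
rows = [
    (3750059, 3691, 3, 7),
    (3750060, 3691, 3, 9),
    (3750061, 3691, 3, 5),
]

# GROUP data BY subject_id, then AVG(data.grade) over each group
by_subject = defaultdict(list)
for student_id, subject_id, credit, grade in rows:
    by_subject[subject_id].append(grade)

subject_avg = {s: sum(g) / len(g) for s, g in by_subject.items()}
print(subject_avg)  # {3691: 7.0}
```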

We can use a 'JOIN' statement to join the relation containing the calculated GPA with the student details table, using the student ID as the key. Assume the student ID is stored in the first field of the 'student_info' relation. Since, after ranking, the student ID sits in the second field of the 'gpa' relation, we specify '$1' for it.

grunt> final_data = JOIN gpa BY $1 FULL, student_info BY $0;
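A full outer join on the student ID can be illustrated in plain Python (a sketch; the 'student_info' record and its name field are hypothetical):

```python
gpa_rows = [(1, 3750059, 83 / 12)]           # (rank, student_id, gpa)
student_info = [(3750059, "A. Student")]     # (student_id, name) -- hypothetical

# JOIN gpa BY $1 FULL, student_info BY $0: keep ids present on either side
ids = {r[1] for r in gpa_rows} | {s[0] for s in student_info}
final_data = []
for sid in ids:
    g = next((r for r in gpa_rows if r[1] == sid), None)      # None if missing
    s = next((t for t in student_info if t[0] == sid), None)  # None if missing
    final_data.append((g, s))
print(final_data)
```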

We can run further queries on the combined student information and GPA data, such as calculating the average grades for a given institution. For this we can use the 'GROUP' command together with built-in mathematical functions such as AVG, MIN, MAX and COUNT; beyond that, custom functions can be written to calculate percentiles, medians, and so on.

Sample Output:

Relation(gpa) :-

Index number, Student ID, { Subject Codes }, GPA

98948,4519926,{(2426),(7054),(9294)},1.1666666666666667
98938,4512199,{(3691),(1060),(6614),(9857)},1.1875

Relation(subject_avg) :-

9534,5.00871105861509
9844,4.990113405059611
