How to implement GROUPING_ID function in Hive


Recipe Objective: How to implement the GROUPING_ID function in Hive?

In this recipe, we implement the GROUPING_ID function in Hive. When querying a table, combining several GROUP BY queries with UNION is inefficient because it reads the entire table once per query, whereas GROUPING SETS can produce the same result in a single scan. However, GROUPING SETS has a limitation: it can be hard to tell which grouping set produced a given aggregation row. In such cases, we can use the GROUPING_ID function to identify the grouping set behind each result row.
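To make the single-scan point concrete, here is a sketch using the user_info table that appears later in this recipe (its exact schema is assumed). The two forms below return the same rows, but the GROUPING SETS version reads the table only once:

```sql
-- Multiple GROUP BY queries combined with UNION ALL:
-- the table is scanned once per SELECT.
SELECT id, profession, sum(reviews) AS total_reviews
FROM user_info GROUP BY id, profession
UNION ALL
SELECT id, NULL, sum(reviews) AS total_reviews
FROM user_info GROUP BY id;

-- Equivalent GROUPING SETS form: a single scan of user_info.
SELECT id, profession, sum(reviews) AS total_reviews
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id);
```

In the GROUPING SETS output, rows aggregated over a coarser grouping set show NULL in the columns that were rolled up, which is exactly the ambiguity GROUPING_ID resolves.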


Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the link below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in via PuTTY/terminal and check whether HDFS and Hive are installed. If they are not installed, please find the links provided above for installation.
  • Type “&lt;your public IP&gt;:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If the required services are not visible in the Cloudera cluster, you may add them by clicking “Add Services” in the cluster.

Implementing GROUPING_ID function in Hive:

Throughout the recipe, we used the “user_info” table present in the “demo” database. Firstly, enter the database using the use demo; command and list all the tables in it using the show tables; command. Let us also look at the user_info table schema using the describe user_info; command.
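The exploration commands from the paragraph above can be run in the Hive shell as follows (the user_info columns referenced later suggest a schema along the lines of id, profession, and reviews, but the exact schema is what the describe command will show):

```sql
USE demo;            -- switch to the demo database
SHOW TABLES;         -- list its tables; user_info should appear
DESCRIBE user_info;  -- inspect the columns of the user_info table
```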

(Screenshot: output of the describe command showing the user_info table schema.)

We now apply the GROUPING_ID function over the attributes id, profession, and reviews to fetch these details aggregated by the grouping sets (id, profession), (id), and (). The query for the same is given below. Note that Hive spells the virtual column with two underscores: GROUPING__ID.

SELECT id, profession, sum(reviews) AS total_reviews, GROUPING__ID
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id, ());

(Screenshot: the query being run in the Hive shell.)

The above query calculates the total reviews given by each user within each profession. Where the profession column is null, the row holds the total sum of reviews given by that user across all professions; where both columns are null, it holds the grand total. The final column takes three values in the output: 0, 1, and 3.

  • 0 => no grouping columns, i.e., aggregation over the empty set () — the grand total.
  • 1 => aggregation performed over one column; here, id.
  • 3 => aggregation performed over both columns; here, id and profession.

The sample output is given below.
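The 0/1/3 values above are a bitmask over the GROUP BY columns. In the Hive version used here, a set bit indicates the column participated in the grouping set (id as bit 0, profession as bit 1); note that this convention was reversed to match the SQL standard in Hive 2.3.0, where a set bit instead means the column was rolled up. As an illustrative sketch under the pre-2.3.0 semantics, the bitmask can be decoded into a readable label directly in the query (the grouping_level alias is hypothetical):

```sql
-- Decode GROUPING__ID into a human-readable grouping label.
-- Assumes pre-2.3.0 Hive semantics: bit set = column was grouped on.
SELECT id, profession, sum(reviews) AS total_reviews, GROUPING__ID,
       CASE GROUPING__ID
         WHEN 3 THEN 'grouped by (id, profession)'
         WHEN 1 THEN 'grouped by id only'
         WHEN 0 THEN 'grand total'
       END AS grouping_level
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id, ());
```

This makes the output self-describing, which is especially helpful when more grouping sets are added and the NULL patterns alone become ambiguous.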

(Screenshot: sample query output showing GROUPING__ID values 0, 1, and 3.)

