How to implement GROUPING_ID function in Hive


Recipe Objective: How to implement the GROUPING_ID function in Hive?

In this recipe, we implement the GROUPING_ID function in Hive. When querying a table, combining several GROUP BY queries with UNION is inefficient because it reads the entire table once per query, whereas GROUPING SETS can produce the same result in a single scan. However, GROUPING SETS has a limitation: it can be hard to tell which grouping set produced a given aggregation row. In such cases, we can use the GROUPING_ID function to identify the grouping set behind each result row.
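To make the single-scan point concrete, here is a sketch using the user_info table that appears later in this recipe (its exact schema is assumed). The two forms below return the same rows, but the GROUPING SETS version reads the table only once:

```sql
-- Multiple GROUP BY queries combined with UNION ALL:
-- the table is scanned once per SELECT.
SELECT id, profession, sum(reviews) AS total_reviews
FROM user_info GROUP BY id, profession
UNION ALL
SELECT id, NULL, sum(reviews) AS total_reviews
FROM user_info GROUP BY id;

-- Equivalent GROUPING SETS form: a single scan of user_info.
SELECT id, profession, sum(reviews) AS total_reviews
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id);
```

In the GROUPING SETS output, rows aggregated over a coarser grouping set show NULL in the columns that were rolled up, which is exactly the ambiguity GROUPING_ID resolves.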


Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the link below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in via PuTTY/terminal and check whether HDFS and Hive are installed. If they are not installed, please find the links provided above for installation.
  • Type “&lt;your public IP&gt;:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If the required services are not visible in the Cloudera cluster, you may add them by clicking “Add Services” in the cluster.

Implementing GROUPING_ID function in Hive:

Throughout the recipe, we used the “user_info” table present in the “demo” database. Firstly, enter the database using the use demo; command and list all the tables in it using the show tables; command. Let us also look at the user_info table schema using the describe user_info; command.
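The exploration commands from the paragraph above can be run in the Hive shell as follows (the user_info columns referenced later suggest a schema along the lines of id, profession, and reviews, but the exact schema is what the describe command will show):

```sql
USE demo;            -- switch to the demo database
SHOW TABLES;         -- list its tables; user_info should appear
DESCRIBE user_info;  -- inspect the columns of the user_info table
```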

(Screenshot: output of the describe command showing the user_info table schema.)

We now apply the GROUPING_ID function over the attributes id, profession, and reviews to fetch these details aggregated by the grouping sets (id, profession), (id), and (). The query for the same is given below. Note that Hive spells the virtual column with two underscores: GROUPING__ID.

SELECT id, profession, sum(reviews) AS total_reviews, GROUPING__ID
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id, ());

(Screenshot: the query being run in the Hive shell.)

The above query calculates the total reviews given by each user within each profession. Where the profession column is null, the row holds the total sum of reviews given by that user across all professions; where both columns are null, it holds the grand total. The final column takes three values in the output: 0, 1, and 3.

  • 0 => no grouping columns, i.e., aggregation over the empty set () — the grand total.
  • 1 => aggregation performed over one column; here, id.
  • 3 => aggregation performed over both columns; here, id and profession.

The sample output is given below.
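The 0/1/3 values above are a bitmask over the GROUP BY columns. In the Hive version used here, a set bit indicates the column participated in the grouping set (id as bit 0, profession as bit 1); note that this convention was reversed to match the SQL standard in Hive 2.3.0, where a set bit instead means the column was rolled up. As an illustrative sketch under the pre-2.3.0 semantics, the bitmask can be decoded into a readable label directly in the query (the grouping_level alias is hypothetical):

```sql
-- Decode GROUPING__ID into a human-readable grouping label.
-- Assumes pre-2.3.0 Hive semantics: bit set = column was grouped on.
SELECT id, profession, sum(reviews) AS total_reviews, GROUPING__ID,
       CASE GROUPING__ID
         WHEN 3 THEN 'grouped by (id, profession)'
         WHEN 1 THEN 'grouped by id only'
         WHEN 0 THEN 'grand total'
       END AS grouping_level
FROM user_info
GROUP BY id, profession
GROUPING SETS ((id, profession), id, ());
```

This makes the output self-describing, which is especially helpful when more grouping sets are added and the NULL patterns alone become ambiguous.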

(Screenshot: sample query output showing GROUPING__ID values 0, 1, and 3.)

