How to implement the OVER clause with COUNT, SUM, and AVG in Hive

This recipe helps you implement the OVER clause with COUNT, SUM, and AVG in Hive.

Recipe Objective: How to implement the OVER clause with COUNT, SUM, and AVG in Hive?

Hive has become one of the leading tools in the big data ecosystem and a standard for SQL queries over petabytes of data in Hadoop. Because Hive deals with large amounts of data, aggregation and windowing help gather and summarize that data to extract more information about particular groups based on specific conditions. This recipe implements the windowing functions COUNT, SUM, and AVG with the OVER clause.
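Before the setup details, it helps to see the shape of the syntax: an aggregate followed by OVER with a PARTITION BY keeps every input row and attaches the aggregate computed over that row's partition, whereas a plain GROUP BY collapses each group to a single row. Below is a minimal sketch with placeholder names (some_table, grp, and val are illustrative, not from this recipe):

-- Windowed aggregate: one output row per input row,
-- each carrying its partition's total
SELECT grp, val,
       SUM(val) OVER (PARTITION BY grp) AS partition_total
FROM some_table;

-- GROUP BY aggregate: one output row per group
SELECT grp, SUM(val) AS group_total
FROM some_table
GROUP BY grp;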

Prerequisites:

Before proceeding with the recipe, make sure that single-node Hadoop and Hive are installed on your EC2 instance. If they are not already installed, follow the links below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to PuTTY/terminal and check whether HDFS and Hive are installed (a quick sanity check from the Hive shell is shown after this list). If they are not installed, please use the links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If they are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.
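Once the services are up, the sanity check mentioned above can be run from the Hive shell. This is a minimal sketch, assuming only that Hive is reachable from the terminal:

-- launched from the terminal with the hive command (or beeline)
SHOW DATABASES;            -- should list at least the default database
SELECT current_database(); -- shows which database the session is using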

Implementing OVER clause with COUNT, SUM, and AVG in Hive:

Throughout this recipe, we use the “user_info” table in the “demo” database. First, enter the database with the use demo; command and list all the tables in it with the show tables; command. Let us also look at the user_info table schema using the describe user_info; command.
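Run together in the Hive shell, these commands look like the following (assuming the demo database and user_info table exist as described above):

USE demo;            -- switch to the demo database
SHOW TABLES;         -- list all tables in demo
DESCRIBE user_info;  -- inspect the schema of the user_info table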

[Screenshot: user_info table schema]

COUNT(): Let us count the reviews given by the users for each profession. The query to do this is shown below:

SELECT profession, COUNT(id) OVER(PARTITION BY profession) FROM user_info;
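Note that a windowed COUNT repeats the partition's count on every row, so a profession with ten users appears ten times in the output. If one row per profession is preferred, wrapping the window in a subquery and applying DISTINCT collapses the duplicates. This is a minimal sketch; the alias review_count is added here for readability:

SELECT DISTINCT profession, review_count
FROM (
  SELECT profession,
         COUNT(id) OVER (PARTITION BY profession) AS review_count
  FROM user_info
) t;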

Sample output is given below:

[Screenshot: sample output of the COUNT query]

SUM(): Let us find the sum of all the reviews given by the users for each profession. The query for doing this is shown below:

SELECT profession, SUM(id) OVER(PARTITION BY profession) FROM user_info;
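Adding an ORDER BY inside the OVER clause turns the partition-wide sum into a running total, since Hive's default window frame with ORDER BY runs from the start of the partition to the current row. A minimal sketch (ordering by id is an arbitrary choice for illustration):

SELECT profession, id,
       SUM(id) OVER (PARTITION BY profession ORDER BY id) AS running_sum
FROM user_info;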

Sample output:

[Screenshot: sample output of the SUM query]

AVG(): Let us find the average of all the reviews given by the users for each profession. The query for doing this is shown below:

SELECT profession, AVG(id) OVER(PARTITION BY profession) FROM user_info;
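Since all three functions accept the same OVER clause, they can also be computed side by side in a single pass over the table; the aliases below are added for readability:

SELECT profession,
       COUNT(id) OVER (PARTITION BY profession) AS review_count,
       SUM(id)   OVER (PARTITION BY profession) AS review_sum,
       AVG(id)   OVER (PARTITION BY profession) AS review_avg
FROM user_info;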

Sample output:

[Screenshot: sample output of the AVG query]
