How to implement the OVER clause with COUNT, SUM, and AVG in Hive

This recipe helps you implement the OVER clause with COUNT, SUM, and AVG in Hive.

Recipe Objective: How to implement the OVER clause with COUNT, SUM, and AVG in Hive?

Hive has become one of the leading tools in the big data ecosystem and a standard for SQL queries over petabytes of data in Hadoop. Because Hive deals with large amounts of data, aggregation and windowing help gather and summarize that data to extract more information about particular groups based on specific conditions. This recipe implements the windowing functions COUNT, SUM, and AVG with the OVER clause.
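Before the setup details, it helps to see the shape of the syntax: an aggregate followed by OVER with a PARTITION BY keeps every input row and attaches the aggregate computed over that row's partition, whereas a plain GROUP BY collapses each group to a single row. Below is a minimal sketch with placeholder names (some_table, grp, and val are illustrative, not from this recipe):

-- Windowed aggregate: one output row per input row,
-- each carrying its partition's total
SELECT grp, val,
       SUM(val) OVER (PARTITION BY grp) AS partition_total
FROM some_table;

-- GROUP BY aggregate: one output row per group
SELECT grp, SUM(val) AS group_total
FROM some_table
GROUP BY grp;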

Prerequisites:

Before proceeding with the recipe, make sure that single-node Hadoop and Hive are installed on your EC2 instance. If they are not already installed, follow the links below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to PuTTY/terminal and check whether HDFS and Hive are installed (a quick sanity check from the Hive shell is shown after this list). If they are not installed, please use the links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If they are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.
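Once the services are up, the sanity check mentioned above can be run from the Hive shell. This is a minimal sketch, assuming only that Hive is reachable from the terminal:

-- launched from the terminal with the hive command (or beeline)
SHOW DATABASES;            -- should list at least the default database
SELECT current_database(); -- shows which database the session is using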

Implementing OVER clause with COUNT, SUM, and AVG in Hive:

Throughout this recipe, we use the “user_info” table in the “demo” database. First, enter the database with the use demo; command and list all the tables in it with the show tables; command. Let us also look at the user_info table schema using the describe user_info; command.
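Run together in the Hive shell, these commands look like the following (assuming the demo database and user_info table exist as described above):

USE demo;            -- switch to the demo database
SHOW TABLES;         -- list all tables in demo
DESCRIBE user_info;  -- inspect the schema of the user_info table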

[Screenshot: user_info table schema]

COUNT(): Let us count the reviews given by the users for each profession. The query to do this is shown below:

SELECT profession, COUNT(id) OVER(PARTITION BY profession) FROM user_info;
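Note that a windowed COUNT repeats the partition's count on every row, so a profession with ten users appears ten times in the output. If one row per profession is preferred, wrapping the window in a subquery and applying DISTINCT collapses the duplicates. This is a minimal sketch; the alias review_count is added here for readability:

SELECT DISTINCT profession, review_count
FROM (
  SELECT profession,
         COUNT(id) OVER (PARTITION BY profession) AS review_count
  FROM user_info
) t;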

Sample output is given below:

[Screenshot: sample output of the COUNT query]

SUM(): Let us find the sum of all the reviews given by the users for each profession. The query for doing this is shown below:

SELECT profession, SUM(id) OVER(PARTITION BY profession) FROM user_info;
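Adding an ORDER BY inside the OVER clause turns the partition-wide sum into a running total, since Hive's default window frame with ORDER BY runs from the start of the partition to the current row. A minimal sketch (ordering by id is an arbitrary choice for illustration):

SELECT profession, id,
       SUM(id) OVER (PARTITION BY profession ORDER BY id) AS running_sum
FROM user_info;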

Sample output:

[Screenshot: sample output of the SUM query]

AVG(): Let us find the average of all the reviews given by the users for each profession. The query for doing this is shown below:

SELECT profession, AVG(id) OVER(PARTITION BY profession) FROM user_info;
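Since all three functions accept the same OVER clause, they can also be computed side by side in a single pass over the table; the aliases below are added for readability:

SELECT profession,
       COUNT(id) OVER (PARTITION BY profession) AS review_count,
       SUM(id)   OVER (PARTITION BY profession) AS review_sum,
       AVG(id)   OVER (PARTITION BY profession) AS review_avg
FROM user_info;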

Sample output:

[Screenshot: sample output of the AVG query]
