How to implement the OVER clause with MIN and MAX in Hive

Recipe Objective: How to implement the OVER clause with MIN and MAX in Hive?

Hive has become one of the leading tools in the big data ecosystem and a standard for SQL queries over petabytes of data in Hadoop. Because it deals with such large volumes of data, aggregation and windowing functions help summarize the data and surface information about particular groups based on specific conditions. In this recipe, we implement the windowing functions MAX and MIN with the OVER clause.
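As a quick orientation, a windowed aggregate in HiveQL takes the general shape below. This is a minimal sketch: some_table, group_column, and value_column are hypothetical placeholders, not part of this recipe's dataset.

-- General shape of a windowed aggregate in HiveQL:
-- the aggregate is computed per partition, but every input row is kept.
SELECT group_column,
       value_column,
       MAX(value_column) OVER (PARTITION BY group_column) AS partition_max
FROM some_table;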


Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the installation links to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to PuTTY/terminal and check whether HDFS and Hive are installed; if not, please use the installation links mentioned above. A quick sanity check from the Hive shell is sketched after this list.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If the services are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.
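Once the services are up, a minimal sanity check from the Hive shell (assuming the setup described above) is to list the databases and confirm the session responds:

-- Run inside the hive (or beeline) shell to confirm Hive is reachable.
SHOW DATABASES;
-- The demo database used in this recipe should appear in the list.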

Implementing OVER clause with MIN and MAX in Hive:

Throughout the recipe, we use the “user_info” table present in the “demo” database. First, enter the database using the use demo; command and list all the tables in it using the show tables; command. Let us also look at the user_info table schema using the describe user_info; command. These commands are shown together below.
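-- Switch to the demo database and inspect what it contains.
use demo;
show tables;          -- lists the tables in the demo database
describe user_info;   -- shows the column names and types of user_info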


MAX(): The MAX function returns the maximum value of the rows in a column or expression. Let us implement this by calculating the maximum number of reviews given by users within each profession. To do so, we partition the data over the profession column and pass the reviews column to the MAX function. The query is given below.

SELECT profession, MAX(reviews) OVER (PARTITION BY profession) AS max_reviews FROM user_info;

Sample output: each row pairs a profession with the maximum reviews value among users of that profession.

MIN(): Similar to the MAX function, the MIN function returns the minimum value of the rows in a column or expression. Let us implement this by calculating the minimum number of reviews given by users within each profession. Again, we partition the data over the profession column and pass the reviews column, this time to the MIN function. The query is given below.

SELECT profession, MIN(reviews) OVER (PARTITION BY profession) AS min_reviews FROM user_info;

Sample output: each row pairs a profession with the minimum reviews value among users of that profession.
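Both window functions can also be combined in a single query. The sketch below assumes reviews is a numeric column (as the aggregations above imply) and uses hypothetical aliases, including the spread between the maximum and minimum per profession:

-- Minimum, maximum, and spread of reviews within each profession.
SELECT profession,
       reviews,
       MIN(reviews) OVER (PARTITION BY profession) AS min_reviews,
       MAX(reviews) OVER (PARTITION BY profession) AS max_reviews,
       MAX(reviews) OVER (PARTITION BY profession)
         - MIN(reviews) OVER (PARTITION BY profession) AS review_range
FROM user_info;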

