Window functions in Hive: LEAD, LAG, FIRST_VALUE, LAST_VALUE

This recipe shows how to work with the Hive windowing functions LEAD, LAG, FIRST_VALUE, and LAST_VALUE, and how to use them.

Recipe Objective: How to work with windowing functions in Hive - LEAD, LAG, FIRST_VALUE, LAST_VALUE?

In this recipe, we will work with windowing functions in Hive - LEAD, LAG, FIRST_VALUE, LAST_VALUE.


Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the links below to set them up.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in through PuTTY/terminal and check whether HDFS and Hive are installed. If not, use the installation links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If they are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.

Working with windowing functions in Hive:

Windowing creates a window over the data set on which analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE operate. The syntax for the query is given below:

SELECT <column_names>, <analytical_function>(column_name) OVER (<windowing_specification>) FROM <table_name>;

Throughout this recipe, we use the “user_info” table present in our database. The windowing specification can include PARTITION BY, which takes one or more columns of the table as a reference; ORDER BY, which specifies the sort order of the column(s) (ascending by default); and a frame, which defines the boundaries of the window with start and end values. The following is the schema of the table.

[Screenshot: schema of the user_info table]
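As a sketch of the full syntax, a query combining PARTITION BY, ORDER BY, and an explicit frame might look like the one below. The column names (id, name, age, profession) are assumed for illustration, matching the examples that follow.

```sql
-- Hypothetical example on the user_info table; columns are assumed.
SELECT
  id,
  name,
  age,
  profession,
  -- Average age over a 3-row window: the previous, current, and next
  -- row within the same profession, ordered by age.
  AVG(age) OVER (
    PARTITION BY profession
    ORDER BY age
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS avg_age_nearby
FROM user_info;
```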

LEAD: An analytics function that returns data from a following row in the window. By default, the offset is 1 row, and it returns NULL if the offset goes beyond the end of the current window. Below is the output when the “id” column is passed to the LEAD() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LEAD() query]
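Assuming the user_info table has columns id, age, and profession (as suggested by the description above), the LEAD() query could be written as:

```sql
-- For each user, return the id of the next user (1 row ahead)
-- within the same profession, ordered by age.
SELECT
  id,
  profession,
  age,
  LEAD(id) OVER (PARTITION BY profession ORDER BY age) AS next_id
FROM user_info;
```

LEAD() also accepts an explicit offset and a default value, e.g. LEAD(id, 2, 0), which looks 2 rows ahead and returns 0 instead of NULL when the offset runs past the end of the window.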

Sample output:

[Screenshot: LEAD() sample output]

LAG: The opposite of the LEAD function; it returns data from a preceding row in the window. By default, the offset is 1 row, and it returns NULL if the offset reaches before the beginning of the window. Following is the output where we passed the “id” column to the LAG() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LAG() query]
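Under the same assumed columns (id, age, profession), the LAG() query could be sketched as:

```sql
-- For each user, return the id of the previous user (1 row behind)
-- within the same profession, ordered by age.
SELECT
  id,
  profession,
  age,
  LAG(id) OVER (PARTITION BY profession ORDER BY age) AS prev_id
FROM user_info;
```

Like LEAD(), LAG() takes an optional offset and default, e.g. LAG(id, 2, 0) to look 2 rows back and substitute 0 for NULL at the start of the window.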

Sample output:

[Screenshot: LAG() sample output]

FIRST_VALUE: This function returns the value from the first row of the window, as determined by the windowing clause, and assigns it to all rows of the same group. Following is the output where we passed the “id” column to the FIRST_VALUE() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: FIRST_VALUE() query]
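With the same assumed columns, the FIRST_VALUE() query could be sketched as:

```sql
-- Assign each row the id of the first (youngest) user
-- in its profession partition.
SELECT
  id,
  profession,
  age,
  FIRST_VALUE(id) OVER (PARTITION BY profession ORDER BY age) AS first_id
FROM user_info;
```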

Sample output:

[Screenshot: FIRST_VALUE() sample output]

LAST_VALUE: The opposite of FIRST_VALUE; it returns the value from the last row of the window, as determined by the windowing clause, and assigns it to all rows of the same group. Following is the output where we passed the “id” column to the LAST_VALUE() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LAST_VALUE() query]
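With the same assumed columns, the LAST_VALUE() query could be sketched as:

```sql
-- Assign each row the id of the last user seen so far in its
-- profession (up to the current row and its peers, per the default frame).
SELECT
  id,
  profession,
  age,
  LAST_VALUE(id) OVER (PARTITION BY profession ORDER BY age) AS last_id
FROM user_info;

-- To get the id of the truly last row of each partition instead,
-- widen the frame explicitly:
-- LAST_VALUE(id) OVER (
--   PARTITION BY profession ORDER BY age
--   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
-- )
```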

Sample output:

[Screenshot: LAST_VALUE() sample output]

Here, employees of the same age are assigned the emp_id of the employee who is last in that age group. This is because, with an ORDER BY and no explicit frame, the default window frame ends at the current row and its peers, so LAST_VALUE looks no further than the current age group.

