Window functions in Hive: LEAD, LAG, FIRST_VALUE, LAST_VALUE

This recipe shows how to work with the Hive windowing functions LEAD, LAG, FIRST_VALUE, and LAST_VALUE, and how to use them.

Recipe Objective: How to work with windowing functions in Hive - LEAD, LAG, FIRST_VALUE, LAST_VALUE?

In this recipe, we will work with windowing functions in Hive - LEAD, LAG, FIRST_VALUE, LAST_VALUE.


Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the links below to set them up.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in through PuTTY/terminal and check whether HDFS and Hive are installed. If not, use the installation links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If they are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.

Working with windowing functions in Hive:

Windowing creates a window over the data set on which analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE operate. The syntax for the query is given below:

SELECT <column_names>, <analytical_function>(column_name) OVER (<windowing_specification>) FROM <table_name>;

Throughout this recipe, we use the “user_info” table present in our database. The windowing specification can include PARTITION BY, which takes one or more columns of the table as a reference; ORDER BY, which specifies the sort order of the column(s) (ascending by default); and a frame, which defines the boundaries of the window with start and end values. The following is the schema of the table.

[Screenshot: schema of the user_info table]
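As a sketch of the full syntax, a query combining PARTITION BY, ORDER BY, and an explicit frame might look like the one below. The column names (id, name, age, profession) are assumed for illustration, matching the examples that follow.

```sql
-- Hypothetical example on the user_info table; columns are assumed.
SELECT
  id,
  name,
  age,
  profession,
  -- Average age over a 3-row window: the previous, current, and next
  -- row within the same profession, ordered by age.
  AVG(age) OVER (
    PARTITION BY profession
    ORDER BY age
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
  ) AS avg_age_nearby
FROM user_info;
```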

LEAD: An analytics function that returns data from a following row in the window. By default, the offset is 1 row, and it returns NULL if the offset goes beyond the end of the current window. Below is the output when the “id” column is passed to the LEAD() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LEAD() query]
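Assuming the user_info table has columns id, age, and profession (as suggested by the description above), the LEAD() query could be written as:

```sql
-- For each user, return the id of the next user (1 row ahead)
-- within the same profession, ordered by age.
SELECT
  id,
  profession,
  age,
  LEAD(id) OVER (PARTITION BY profession ORDER BY age) AS next_id
FROM user_info;
```

LEAD() also accepts an explicit offset and a default value, e.g. LEAD(id, 2, 0), which looks 2 rows ahead and returns 0 instead of NULL when the offset runs past the end of the window.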

Sample output:

[Screenshot: LEAD() sample output]

LAG: The opposite of the LEAD function; it returns data from a preceding row in the window. By default, the offset is 1 row, and it returns NULL if the offset reaches before the beginning of the window. Following is the output where we passed the “id” column to the LAG() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LAG() query]
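Under the same assumed columns (id, age, profession), the LAG() query could be sketched as:

```sql
-- For each user, return the id of the previous user (1 row behind)
-- within the same profession, ordered by age.
SELECT
  id,
  profession,
  age,
  LAG(id) OVER (PARTITION BY profession ORDER BY age) AS prev_id
FROM user_info;
```

Like LEAD(), LAG() takes an optional offset and default, e.g. LAG(id, 2, 0) to look 2 rows back and substitute 0 for NULL at the start of the window.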

Sample output:

[Screenshot: LAG() sample output]

FIRST_VALUE: This function returns the value from the first row of the window, as determined by the windowing clause, and assigns it to all rows of the same group. Following is the output where we passed the “id” column to the FIRST_VALUE() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: FIRST_VALUE() query]
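With the same assumed columns, the FIRST_VALUE() query could be sketched as:

```sql
-- Assign each row the id of the first (youngest) user
-- in its profession partition.
SELECT
  id,
  profession,
  age,
  FIRST_VALUE(id) OVER (PARTITION BY profession ORDER BY age) AS first_id
FROM user_info;
```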

Sample output:

[Screenshot: FIRST_VALUE() sample output]

LAST_VALUE: The opposite of FIRST_VALUE; it returns the value from the last row of the window, as determined by the windowing clause, and assigns it to all rows of the same group. Following is the output where we passed the “id” column to the LAST_VALUE() function, partitioning the table by the user's “profession” and ordering by “age” in ascending order.

[Screenshot: LAST_VALUE() query]
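With the same assumed columns, the LAST_VALUE() query could be sketched as:

```sql
-- Assign each row the id of the last user seen so far in its
-- profession (up to the current row and its peers, per the default frame).
SELECT
  id,
  profession,
  age,
  LAST_VALUE(id) OVER (PARTITION BY profession ORDER BY age) AS last_id
FROM user_info;

-- To get the id of the truly last row of each partition instead,
-- widen the frame explicitly:
-- LAST_VALUE(id) OVER (
--   PARTITION BY profession ORDER BY age
--   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
-- )
```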

Sample output:

[Screenshot: LAST_VALUE() sample output]

Here, employees of the same age are assigned the emp_id of the employee who is last in that age group. This is because, with an ORDER BY and no explicit frame, the default window frame ends at the current row and its peers, so LAST_VALUE looks no further than the current age group.

