How to Select Group By and Order By in hive?

How to Select Group By and Order By in hive?

How to Select Group By and Order By in hive?

This recipe helps you Select Group By and Order By in hive


Recipe Objective

In most of the big data scenarios, it will be required to group by the rows that have the same values and we sort the rows in ascending or descending order as required . Tables have varying number of columns and using * in the select statement will all the retrieve data but is sometimes used to mention all the column names in the select query manually. It is very hard to do because of the high number of columns. So, here our requirement is to exclude column(s) from select query in hive.

System requirements :

Step 1 : Prepare the dataset

Here we are using the employee related comma separated values (csv) dataset for the create hive table in local.

Data of Output looks as follows:

Before create a table open the hive shell and we need to create a database as follows : Open the hive shell as below

To create database using below queries :

Create database dezyre_db; use dezyre_db;

As follows below:

Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we are going create a hive table for loading the data from this table to created bucketed tables, Use below to create a hive table:

CREATE TABLE employee ( employee_id int, company_id int, seniority int, salary int, join_date string, quit_date string, dept string ) ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

the above query runs as follows :

Loading the data into hive table and verifying the data

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verifying the data by running the select query as follows

Step 3 : Group by usage in hive

The GROUP BY clause is used to group all the rows in a result set using a particular collection column. It is used to query a group of rows.

Here we are going run an example query using group by on the hive table as follows

Select dept, count(*) as countof_emp from employee group by dept ;

Output of the above query : the above query will give count of the employees in each dept as a result as below:

Step 4: Order by usage in Hive

The ORDER BY is used to retrieve the rows based on one column and sort the rows set by ascending or descending order, the default order value is ascending order Here we are going run an example query using order by on the hive table as follows

Select * from employee order by salary desc;

Output of the above query: The above query give the highest salaries of the employees details in the descending order as follows below

Step 5: Excluding the column

The use of the Exclude column is when you have less columns in your data then we will use the select the column names using the select statements, but your table contains many columns ex 90 columns then that time you need to exclude a few columns in the select statement as follows:

Before going to exclude column we need to set or enable a property

hive> set;

And also enable property to print the columns as follows

hive> set hive.cli.print.header=true;

Here we are going to use below query to excluding the quit date column as follows and also we use the regulate expression to exclude columns

hive> select `(quit_date)?+.+` from employee;

Output of the above query as follows:

We can also exclude the multiple columns using the below query as follows:

hive> select `(join_date|quit_date)?+.+` from employee;

we can use regular expressions to exclude multiple columns

Output of the above query : as we can see that excluding the join_date and quit_date

Relevant Projects

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this apache kafka project is to process log entries from applications in real-time using Kafka for the streaming architecture in a microservice sense.

Movielens dataset analysis for movie recommendations using Spark in Azure
In this Databricks Azure tutorial project, you will use Spark Sql to analyse the movielens dataset to provide movie recommendations. As part of this you will deploy Azure data factory, data pipelines and visualise the analysis.

Spark Project -Real-time data collection and Spark Streaming Aggregation
In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Analysing Big Data with Twitter Sentiments using Spark Streaming
In this big data spark project, we will do Twitter sentiment analysis using spark streaming on the incoming streaming data.

Airline Dataset Analysis using Hadoop, Hive, Pig and Impala
Hadoop Project- Perform basic big data analysis on airline dataset using big data tools -Pig, Hive and Impala.

Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
In this Databricks Azure project, you will use Spark & Parquet file formats to analyse the Yelp reviews dataset. As part of this you will deploy Azure data factory, data pipelines and visualise the analysis.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous hive project "Data engineering on Yelp Datasets using Hadoop tools" and do the entire data processing using spark.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive
The goal of this hadoop project is to apply some data engineering principles to Yelp Dataset in the areas of processing, storage, and retrieval.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Online Hadoop Projects -Solving small file problem in Hadoop
In this hadoop project, we are going to be continuing the series on data engineering by discussing and implementing various ways to solve the hadoop small file problem.