What are the types of bucketing in Hive?

This recipe explains the types of bucketing in Apache Hive.

Recipe Objective

In most big data scenarios, bucketing is a technique offered by Apache Hive for managing large datasets by dividing them into more manageable parts, called buckets, which can be retrieved easily and help reduce query latency. Bucketing improves query performance, and it can be combined with partitioning, in which case each partition is further divided into buckets.
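Conceptually, Hive routes each row to a bucket by hashing the bucketing column and taking the result modulo the number of buckets. The following is a rough Python sketch of that assignment, assuming an integer bucketing column (for which Hive's hash is the value itself; string columns use a Java-style hash instead). The four-bucket count matches the tables created later in this recipe:

```python
def bucket_for(key: int, num_buckets: int) -> int:
    # Hive routes a row to hash(bucketing_column) % num_buckets;
    # for an integer column the hash is simply the value itself.
    return key % num_buckets

# Sample salary values; with 4 buckets, each lands in bucket 0-3.
salaries = [48000, 51000, 22000, 17001]
assignments = {s: bucket_for(s, 4) for s in salaries}
print(assignments)
```

Because rows with equal bucketing-column values always hash to the same bucket, a query filtering on that column only needs to scan one bucket file instead of the whole table.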

System requirements:

Step 1 : Prepare the dataset

Here we are using an employee-related comma-separated values (CSV) dataset to create a Hive table locally.

The dataset contains the columns employee_id, company_id, seniority, salary, join_date, quit_date, and dept.

Before creating a table, open the Hive shell and create a database with the queries below:

create database dezyre_db;
use dezyre_db;


Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we are going to create a Hive table whose data will later be loaded into the bucketed tables. Use the query below:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");


Load the data into the Hive table and verify it:

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verify the load by running a select query.
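A quick verification query might look like the following (the limit of 10 rows is arbitrary):

```sql
select * from employee limit 10;
```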

Step 3 : Create a Bucketed table with Partition

Here we are going to create a table that is partitioned with "partitioned by" and bucketed with "clustered by":

CREATE TABLE emp_bucketed_patitioned_tbl (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
PARTITIONED BY (dept string)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Note: in Hive versions earlier than 2.x, bucketing enforcement is disabled by default, so you must enable it by setting the property below to true:

SET hive.enforce.bucketing=TRUE;

Step 4 : Load data into Bucketed table with Partition

INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept) SELECT * FROM employee;

Note: when loading data into a partitioned table dynamically, set this property first:

set hive.exec.dynamic.partition.mode=nonstrict;
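If dynamic partitioning itself is disabled in your environment, you may also need to enable it alongside the nonstrict mode (both are standard Hive configuration properties):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```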

When you load the data into the table, Hive performs a MapReduce job in the background.


Step 5: Create a Bucketed table without Partition

Here we are going to create a bucketed table using only "clustered by":

CREATE TABLE emp_bucketed_tbl_only (
  employee_id int,
  company_id int,
  dept string,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Step 6 : Load data into Bucketed table without Partition

Here we are going to insert data into the bucketed table which is without any partition:

INSERT OVERWRITE TABLE emp_bucketed_tbl_only SELECT * FROM employee;

When you load the data into the table, Hive performs a MapReduce job in the background.

You can verify the bucketed data, with and without partitions, by listing the warehouse directory with an hdfs command:

hdfs dfs -ls /user/hive/warehouse/dezyre_db.db/

The listing shows the data organized into bucket files.

You can also verify the bucketed data through the Hadoop web UI by browsing to the HDFS directory; in this setup, Hive tables are created under the /user/hive/warehouse/ location. The partitioned table's directory is subdivided by the partition column, while the bucket-only table holds its bucket files directly.

Step 7 : Performance between Bucketed table query and Normal base table query

Below we run queries to compare performance between the bucketed table and the base table, which hold the same data. Because the data in the bucketed table is organized into parts, it returns results more quickly than the base table. Bucketed tables also allow much more efficient sampling than non-bucketed tables: with sampling, we can try out queries on a section of the data for testing and debugging when the original datasets are very large.
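The sampling mentioned above uses Hive's TABLESAMPLE clause. For example, the query below reads only the first of the four buckets instead of scanning the whole table (the table and column names are the ones created earlier in this recipe):

```sql
select * from emp_bucketed_tbl_only
tablesample (bucket 1 out of 4 on salary);
```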

Bucketed table query :

select * from emp_bucketed_patitioned_tbl where dept='marketing' and salary=48000;


Base table query :

select * from employee where dept='marketing' and salary=48000;


Comparing the execution times of the two queries, the bucketed table returns results more quickly.

