What are the types of bucketing in Hive?

This recipe explains the types of bucketing in Hive.

Recipe Objective

In most big data scenarios, bucketing is a technique offered by Apache Hive to manage large datasets by dividing them into more manageable parts, called buckets, which can be retrieved easily and help reduce query latency. Bucketing improves query performance, and it can be combined with partitioning, in which case each partition is further divided into buckets.
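Conceptually, Hive decides which bucket a row lands in by hashing the bucketing column: bucket_id = hash(column) mod num_buckets. As a sketch, you can preview this assignment yourself with Hive's built-in hash and pmod functions (assuming the employee table and 4-bucket count set up later in this recipe):

```sql
-- Preview which of 4 buckets each row would land in,
-- using the same hash-and-modulo rule Hive applies internally.
SELECT salary,
       pmod(hash(salary), 4) AS bucket_id
FROM employee
LIMIT 5;
```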

System requirements : a working installation of Apache Hive (with Hadoop/HDFS) and access to the Hive shell.

Step 1 : Prepare the dataset

Here we use an employee comma-separated values (CSV) dataset to create a Hive table locally.


Before creating a table, open the Hive shell and create a database with the queries below:

CREATE DATABASE dezyre_db;
USE dezyre_db;

Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we create a Hive table that will serve as the source for loading data into the bucketed tables. Use the query below to create it:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");


Load the data into the Hive table and verify it:

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verify the data by running a select query.
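For example, a quick check (any small SELECT will do):

```sql
-- Sample a few rows to confirm the CSV loaded correctly
SELECT * FROM employee LIMIT 5;

-- Confirm the row count looks right
SELECT count(*) FROM employee;
```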

Step 3 : Create a Bucketed table with Partition

Here we create a bucketed table with a partition, partitioning with PARTITIONED BY and bucketing with CLUSTERED BY:

CREATE TABLE emp_bucketed_patitioned_tbl (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
PARTITIONED BY (dept string)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Note : On Hive versions earlier than 2.1, bucketing enforcement is disabled by default, so enable it by setting the property below to true:

SET hive.enforce.bucketing=TRUE;

Step 4 : Load Data into the Bucketed Table with Partition

INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept) SELECT * FROM employee;

Note : When loading data into a partitioned table using dynamic partitions, set the property below first:

set hive.exec.dynamic.partition.mode=nonstrict;
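Putting this step together, a full load session might look like the sketch below (the property names are standard Hive settings):

```sql
-- Allow dynamic partitioning without a static partition spec
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- On Hive versions before 2.1, also enforce bucketing
SET hive.enforce.bucketing=true;

-- dept (the partition column) must come last in the SELECT list;
-- it is the last column of the employee table, so SELECT * works here.
INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept)
SELECT * FROM employee;
```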

Loading the data into the table runs a MapReduce job in the background.

Step 5: Create a Bucketed table without Partition

Here we create a bucketed table without a partition, using only CLUSTERED BY:

CREATE TABLE emp_bucketed_tbl_only (
  employee_id int,
  company_id int,
  dept string,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Step 6 : Load data into Bucketed table without Partition

Here we insert data into the bucketed table that has no partition. Note that the column order of emp_bucketed_tbl_only differs from the employee table (dept is third, not last), so list the columns explicitly instead of using SELECT *:

INSERT OVERWRITE TABLE emp_bucketed_tbl_only
SELECT employee_id, company_id, dept, seniority, salary, join_date, quit_date
FROM employee;

This load also runs a MapReduce job in the background.

You can verify both the partitioned and non-partitioned bucketed data by listing the warehouse directory with an HDFS command:

hdfs dfs -ls /user/hive/warehouse/dezyre_db.db/

The listing shows the data organized into bucket files.

You can also verify the bucketed data through the Hadoop web UI by browsing to the HDFS directory. By default, Hive tables are created under the /user/hive/warehouse/ location.

For the partitioned table, HDFS shows one subdirectory per dept value, each containing the bucket files; for the bucket-only table, the bucket files sit directly under the table directory.

Step 7 : Compare Query Performance Between the Bucketed Table and the Base Table

Below we run the same query against the bucketed table and the base table, which hold identical data. Because the data in the bucketed table is organized into parts, the query returns results more quickly than on the base table. Bucketed tables also allow much more efficient sampling than non-bucketed tables: with sampling, we can try out queries on a small section of the data for testing and debugging when the original dataset is very large.
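Bucket-based sampling uses the TABLESAMPLE clause. As a sketch, this query reads only the first of the four buckets of the unpartitioned table, scanning roughly a quarter of the data:

```sql
-- Scan only bucket 1 of 4, sampled on the bucketing column
SELECT * FROM emp_bucketed_tbl_only
TABLESAMPLE (BUCKET 1 OUT OF 4 ON salary);
```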

Bucketed table query :

select * from emp_bucketed_patitioned_tbl where dept='marketing' and salary=48000;


Base table query :

select * from employee where dept='marketing' and salary=48000;


Comparing the execution times of the two queries, the bucketed table returns results more quickly.

