What are the types of bucketing in Hive?

This recipe explains the types of bucketing in Hive.

Recipe Objective

In most big data scenarios, bucketing is a technique offered by Apache Hive to manage large datasets by dividing them into more manageable parts, called buckets, which can be retrieved easily and help reduce query latency. Bucketing improves query performance, and it can be combined with partitioning, in which case each partition is further divided into buckets.
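Conceptually, Hive decides which bucket a row lands in by hashing the bucketing column: bucket_id = hash(column) mod num_buckets. As a sketch, you can preview this assignment yourself with Hive's built-in hash and pmod functions (assuming the employee table and 4-bucket count set up later in this recipe):

```sql
-- Preview which of 4 buckets each row would land in,
-- using the same hash-and-modulo rule Hive applies internally.
SELECT salary,
       pmod(hash(salary), 4) AS bucket_id
FROM employee
LIMIT 5;
```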

System requirements : a working installation of Apache Hive (with Hadoop/HDFS) and access to the Hive shell.

Step 1 : Prepare the dataset

Here we use an employee comma-separated values (CSV) dataset to create a Hive table locally.


Before creating a table, open the Hive shell and create a database with the queries below:

CREATE DATABASE dezyre_db;
USE dezyre_db;

Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we create a Hive table that will serve as the source for loading data into the bucketed tables. Use the query below to create it:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");


Load the data into the Hive table and verify it:

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verify the data by running a select query.
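For example, a quick check (any small SELECT will do):

```sql
-- Sample a few rows to confirm the CSV loaded correctly
SELECT * FROM employee LIMIT 5;

-- Confirm the row count looks right
SELECT count(*) FROM employee;
```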

Step 3 : Create a Bucketed table with Partition

Here we create a bucketed table with a partition, partitioning with PARTITIONED BY and bucketing with CLUSTERED BY:

CREATE TABLE emp_bucketed_patitioned_tbl (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
PARTITIONED BY (dept string)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Note : On Hive versions earlier than 2.1, bucketing enforcement is disabled by default, so enable it by setting the property below to true:

SET hive.enforce.bucketing=TRUE;

Step 4 : Load Data into the Bucketed Table with Partition

INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept) SELECT * FROM employee;

Note : When loading data into a partitioned table using dynamic partitions, set the property below first:

set hive.exec.dynamic.partition.mode=nonstrict;
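Putting this step together, a full load session might look like the sketch below (the property names are standard Hive settings):

```sql
-- Allow dynamic partitioning without a static partition spec
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- On Hive versions before 2.1, also enforce bucketing
SET hive.enforce.bucketing=true;

-- dept (the partition column) must come last in the SELECT list;
-- it is the last column of the employee table, so SELECT * works here.
INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept)
SELECT * FROM employee;
```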

Loading the data into the table runs a MapReduce job in the background.

Step 5: Create a Bucketed table without Partition

Here we create a bucketed table without a partition, using only CLUSTERED BY:

CREATE TABLE emp_bucketed_tbl_only (
  employee_id int,
  company_id int,
  dept string,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;


Step 6 : Load data into Bucketed table without Partition

Here we insert data into the bucketed table that has no partition. Note that the column order of emp_bucketed_tbl_only differs from the employee table (dept is third, not last), so list the columns explicitly instead of using SELECT *:

INSERT OVERWRITE TABLE emp_bucketed_tbl_only
SELECT employee_id, company_id, dept, seniority, salary, join_date, quit_date
FROM employee;

This load also runs a MapReduce job in the background.

You can verify both the partitioned and non-partitioned bucketed data by listing the warehouse directory with an HDFS command:

hdfs dfs -ls /user/hive/warehouse/dezyre_db.db/

The listing shows the data organized into bucket files.

You can also verify the bucketed data through the Hadoop web UI by browsing to the HDFS directory. By default, Hive tables are created under the /user/hive/warehouse/ location.

For the partitioned table, HDFS shows one subdirectory per dept value, each containing the bucket files; for the bucket-only table, the bucket files sit directly under the table directory.

Step 7 : Compare Query Performance Between the Bucketed Table and the Base Table

Below we run the same query against the bucketed table and the base table, which hold identical data. Because the data in the bucketed table is organized into parts, the query returns results more quickly than on the base table. Bucketed tables also allow much more efficient sampling than non-bucketed tables: with sampling, we can try out queries on a small section of the data for testing and debugging when the original dataset is very large.
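Bucket-based sampling uses the TABLESAMPLE clause. As a sketch, this query reads only the first of the four buckets of the unpartitioned table, scanning roughly a quarter of the data:

```sql
-- Scan only bucket 1 of 4, sampled on the bucketing column
SELECT * FROM emp_bucketed_tbl_only
TABLESAMPLE (BUCKET 1 OUT OF 4 ON salary);
```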

Bucketed table query :

select * from emp_bucketed_patitioned_tbl where dept='marketing' and salary=48000;


Base table query :

select * from employee where dept='marketing' and salary=48000;


Comparing the execution times of the two queries, the bucketed table returns results more quickly.

