What are the types of bucketing in Hive?


This recipe explains the types of bucketing in Apache Hive.


Recipe Objective

In most big data scenarios, bucketing is a technique offered by Apache Hive for managing large datasets by dividing them into smaller, more manageable parts called buckets, which can be retrieved easily and help reduce query latency. Bucketing improves query performance, and it can be combined with partitioning, in which case each partition is further divided into buckets.
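Conceptually, Hive assigns a row to a bucket by hashing the bucketing column and taking the result modulo the number of buckets, so rows with the same column value always land in the same bucket. The sketch below is a simplified Python model of that assignment, not Hive's actual implementation (for integer columns Hive's hash is effectively the value itself, which is what we model here):

```python
# Simplified model of Hive bucketing:
# bucket_id = hash(bucketing_column) % num_buckets.
# For integer columns Hive's hash is effectively the value itself,
# so two rows with the same salary always land in the same bucket.

def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket index for an integer bucketing column."""
    return value % num_buckets

salaries = [48000, 52000, 48000, 47999]
buckets = [bucket_for(s, 4) for s in salaries]
print(buckets)  # identical salaries map to identical buckets
```

Because the mapping is deterministic, a point query on the bucketing column only ever needs to look inside one bucket.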


Step 1 : Prepare the dataset

Here we use an employee-related comma-separated values (CSV) dataset to create a Hive table locally.
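The original empdata.csv is not reproduced here, but a file matching the schema used in this recipe can be sketched as follows. The rows below are hypothetical sample data, invented purely for illustration:

```python
import csv
import os
import tempfile

# Hypothetical sample rows matching the employee schema used in this
# recipe; the real empdata.csv from the original tutorial is not shown.
header = ["employee_id", "company_id", "seniority", "salary",
          "join_date", "quit_date", "dept"]
rows = [
    [1, 10, 3, 48000, "2014-03-24", "2015-10-30", "marketing"],
    [2, 10, 5, 61000, "2013-04-29", "", "engineering"],
]

path = os.path.join(tempfile.gettempdir(), "empdata.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(header)  # header row -> "skip.header.line.count"="1"
    w.writerows(rows)
print(path)
```

Note the header row: it is why the table definition later sets "skip.header.line.count"="1".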

The dataset contains the columns employee_id, company_id, seniority, salary, join_date, quit_date, and dept.

Before creating a table, open the Hive shell and create a database using the queries below:

CREATE DATABASE dezyre_db;
USE dezyre_db;


Step 2: Create a Hive table, load the data, and verify it

Here we create a Hive table that will serve as the source for loading data into the bucketed tables. Use the query below:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");


Load the data into the Hive table and verify it:

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verify the data by running a SELECT query against the table.

Step 3: Create a bucketed table with a partition

Here we create a bucketed table, partitioning with PARTITIONED BY and bucketing with CLUSTERED BY:

CREATE TABLE emp_bucketed_patitioned_tbl (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
PARTITIONED BY (dept string)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;
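On disk, a table that is partitioned by dept and clustered into 4 buckets becomes one directory per partition value, with one file per bucket inside each directory. The Python sketch below simulates that layout; the directory and file names follow Hive's usual conventions (dept=value directories, files like 000000_0), but the rows and the simplified salary % 4 bucket hash are illustrative assumptions:

```python
# Sketch of the HDFS layout for a table PARTITIONED BY (dept) and
# CLUSTERED BY (salary) INTO 4 BUCKETS: one directory per partition
# value, one bucket file per bucket inside it. Rows are illustrative.
NUM_BUCKETS = 4

rows = [
    {"salary": 48000, "dept": "marketing"},
    {"salary": 52000, "dept": "engineering"},
    {"salary": 61001, "dept": "marketing"},
]

layout = {}
for r in rows:
    part_dir = f"dept={r['dept']}"                      # partition directory
    bucket_file = f"{r['salary'] % NUM_BUCKETS:06d}_0"  # e.g. 000000_0
    layout.setdefault(part_dir, set()).add(bucket_file)

for part, files in sorted(layout.items()):
    print(part, sorted(files))
```

Partitioning prunes whole directories at query time; bucketing then narrows the scan to a single file within the surviving directory.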


Note: In Hive versions older than 2.1, bucketing is disabled by default, so you have to enable it by setting the property below to true:

SET hive.enforce.bucketing=TRUE;

Step 4: Load data into the bucketed table with partition

INSERT OVERWRITE TABLE emp_bucketed_patitioned_tbl PARTITION (dept) SELECT * FROM employee;

Note: When loading data into a dynamically partitioned table, set the following property:

set hive.exec.dynamic.partition.mode=nonstrict;

When you load the data into the table, Hive runs a MapReduce job in the background.


Step 5: Create a Bucketed table without Partition

Here we create a bucketed table, without any partition, using CLUSTERED BY:

CREATE TABLE emp_bucketed_tbl_only (
  employee_id int,
  company_id int,
  dept string,
  seniority int,
  salary int,
  join_date string,
  quit_date string
)
CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS;
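The CLUSTERED BY (salary) SORTED BY (salary ASC) clause means rows are first split into buckets by the bucketing column, and each bucket file is then kept sorted on the sort column. The toy Python sketch below models this (bucket assignment simplified to salary % 4; the salary values are made up):

```python
# Toy model of CLUSTERED BY (salary) SORTED BY (salary ASC) INTO 4 BUCKETS:
# rows are split into buckets by salary % 4, then each bucket is sorted.
NUM_BUCKETS = 4
salaries = [61001, 48000, 52002, 44003, 48001]  # illustrative values

buckets = {b: [] for b in range(NUM_BUCKETS)}
for s in salaries:
    buckets[s % NUM_BUCKETS].append(s)          # cluster step

sorted_buckets = {b: sorted(v) for b, v in buckets.items()}  # sort step
print(sorted_buckets)
```

Keeping each bucket sorted makes merge joins and range lookups within a bucket cheaper.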


Step 6 : Load data into Bucketed table without Partition

Here we insert data into the bucketed table that has no partition:

INSERT OVERWRITE TABLE emp_bucketed_tbl_only SELECT * FROM employee;

When you load the data into the table, Hive runs a MapReduce job in the background.

You can verify both the partitioned and the non-partitioned data by listing the warehouse directory with an HDFS command:

hdfs dfs -ls /user/hive/warehouse/dezyre_db.db/

The listing shows that the data is organized into buckets.

You can also verify the bucketed data through the Hadoop web UI by browsing to the HDFS directory. In this setup, Hive tables are created under the /user/hive/warehouse/ location.

One listing shows the table partitioned by a specific column (dept); another shows the bucket files of the table that is bucketed only.

Step 7: Compare query performance between the bucketed table and the normal base table

Below we run the same query against the bucketed table and the base table, both holding identical data. Because the data in the bucketed table is organized into parts, it returns results more quickly than the base table. Bucketed tables also allow much more efficient sampling than non-bucketed tables: with sampling, we can try out queries on a section of the data for testing and debugging purposes when the original dataset is very large.
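The reason for the speedup can be pictured as follows: for a point query on the bucketing column, Hive only has to read the single bucket that the hash selects, rather than scanning every row of the table. The toy Python illustration below is not Hive code; the 100-row table and the salary % 4 bucket hash are simplifying assumptions:

```python
# Toy illustration of why a point query on the bucketing column is
# cheaper on a bucketed table: only the one bucket selected by
# hash(salary) % 4 needs to be scanned, not the whole table.
NUM_BUCKETS = 4
table = [{"salary": s} for s in range(40000, 40100)]  # 100 rows

# Base table: full scan of every row.
full_scan = [r for r in table if r["salary"] == 40050]

# Bucketed table: rows are pre-split by salary % NUM_BUCKETS.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for r in table:
    buckets[r["salary"] % NUM_BUCKETS].append(r)

target = buckets[40050 % NUM_BUCKETS]  # only this bucket is scanned
bucket_scan = [r for r in target if r["salary"] == 40050]

print(len(table), len(target), full_scan == bucket_scan)
```

Both scans return the same rows, but the bucketed lookup touched roughly a quarter of the data; with more buckets, the scanned fraction shrinks further.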

Bucketed table query :

select * from emp_bucketed_patitioned_tbl where dept='marketing' and salary=48000;


Base table query :

select * from employee where dept='marketing' and salary=48000;


Comparing the execution times of the two queries, the bucketed table returns results more quickly.
