How to create static and dynamic partitions in hive?


This recipe helps you create static and dynamic partitions in hive


Recipe Objective

In most big data scenarios, Hive serves as an ETL and data warehousing tool on top of the Hadoop ecosystem, used for processing different types of structured and semi-structured data. We often need to store data in partitions: partitioning splits a table into parts based on the values of particular columns. Partitions are mainly useful for Hive query optimisation, reducing query latency because only the relevant data has to be scanned. Hive supports two kinds of partitions: static and dynamic.
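For example, partitioning a table by year means Hive stores each year's rows under its own subdirectory in the warehouse, so a query filtering on year only reads that directory. A sketch of the resulting layout (the paths below are illustrative, not taken from this recipe's output):

```
/user/hive/warehouse/food_prices/year=2018/
/user/hive/warehouse/food_prices/year=2019/
/user/hive/warehouse/food_prices/year=2020/
```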

System requirements :

Step 1 : Prepare the dataset

Here we are using a food-prices dataset stored as a comma-separated values (CSV) file.

Use the following Linux command to inspect the data:

head -10 food_prices.csv

The output looks as follows:

Before creating a table, open the hive shell and create a database as follows:

create database dezyre_db;
use dezyre_db;

Step 2 : Create a Hive Table and Load the data

Here we are going to create a hive table using the script below:

CREATE TABLE food_prices (
  series_reference string,
  Period string,
  data_value int,
  status string,
  units string,
  subject string,
  product string,
  series_title string,
  year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

In the above HQL script we create a hive table that will serve as a staging table, skipping the header line of the CSV file we are going to load.

Step 3 : Load data into hive table

load data local inpath '/home/bigdata/Downloads/food_prices.csv' into table food_prices;

Step 4 : Query and verify the data

Query the data in the hive shell using the query below:

Select * from food_prices limit 10;

Output of the above query :

Step 5 : Create a Partition table with Partition key

Here we are going to create a partitioned table by specifying the "partitioned by" clause while creating the table.

Static partition : In static partitioning we pass the value of the partition column manually when we load data into the table, so each partition must be created before data is inserted into it.

Syntax to create a partition table :

create table partioned_food_prices (
  series_reference string,
  data_value int,
  status string,
  units string,
  subject string,
  product string
)
partitioned by (year string);

Load the data by specifying the partition key as follows:

hive> insert overwrite table partioned_food_prices partition(year = '2019') select series_reference, data_value, status, units, subject, product from food_prices where year='2019';

In the query above we load the data by specifying the partition key and value, so only that year's data is loaded into the partition.
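A static partition can also be created ahead of time and loaded directly from a pre-filtered file. A sketch of that approach (the year value '2020' and the file name food_prices_2020.csv are illustrative, not part of this recipe's dataset):

```sql
-- create the partition manually before loading any data into it
alter table partioned_food_prices add partition (year = '2020');

-- load a file containing only 2020 rows straight into that partition
load data local inpath '/home/bigdata/Downloads/food_prices_2020.csv'
into table partioned_food_prices partition (year = '2020');
```

Unlike insert overwrite, load data does not run a MapReduce job; it simply moves the file into the partition's directory, so the file must already contain only the matching rows.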

Output of the query: it runs a MapReduce job to insert the data into the partitioned table from the staging table.

We can verify the data by running the hdfs command below:

hdfs dfs -cat /user/hive/warehouse/partioned_food_prices/year=2019/000000_0 | head -10

Output of the above command:

We can also verify the data by running select query as below:
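A query such as the following works for this check (the partition filter on year is an assumption; filtering on the partition column lets Hive read only that partition's directory):

```sql
select * from partioned_food_prices where year = '2019' limit 10;
```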

Dynamic partition : In dynamic partitioning, partitions are created automatically based on the values in the data inserted into the table.

Syntax to create partition table:

create table partitioned_food_prices (
  series_reference string,
  Period string,
  data_value int,
  status string,
  units string,
  subject string,
  product string,
  series_title string
)
partitioned by (year string);

Enable the dynamic partition properties in the hive shell:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Loading the data into partitioned table from the temp table :

hive> insert overwrite table partitioned_food_prices partition(year) select * from default.food_prices;

In a static partition we gave a fixed value for the partition column while loading the data. In a dynamic partition, partitioning happens automatically: Hive creates a partition for each distinct value of the partition column in the inserted data.

The query runs as below:

Output of the above query : it will create partitioned directories based on the year.

We can also verify the data by running the below query
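One way to list the partitions that were created dynamically is the standard show partitions command, sketched here:

```sql
show partitions partitioned_food_prices;
```

Each line of the output corresponds to one partition directory, e.g. year=2019.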

Step 6 : Drop or delete a static/dynamic partition

To drop or delete a partition, use the alter table command with the drop partition clause.
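For example, to drop the 2019 partition created in the earlier steps (the table and year value are from this recipe):

```sql
alter table partitioned_food_prices drop partition (year = '2019');
```

For a managed table this removes both the partition metadata and the partition's data directory in HDFS; for an external table only the metadata is dropped and the files remain.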
