How to create static and dynamic partitions in hive?
BIG DATA RECIPES DATA CLEANING PYTHON DATA MUNGING MACHINE LEARNING RECIPES PANDAS CHEATSHEET     ALL TAGS

How to create static and dynamic partitions in hive?

How to create static and dynamic partitions in hive?

This recipe helps you create static and dynamic partitions in hive

0

Recipe Objective

In most of the big data scenarios , Hive is an ETL and data warehouse tool on top of the hadoop ecosystem, it is used for the processing of the different types structured and semi-structured data, it is a database. Most of the times, we need to store the data based on conditions as partitions. Partitions are used to arrange table data into partitions by splitting tables into different parts based on the values to create partitions. Partitions are mainly useful for hive query optimisation to reduce the latency in the data. In hive we have two different partitions that are static and dynamic

System requirements :

Step1 : Prepare the dataset

Here we are using the food related comma separated values dataset to perform the csv file

Use linux command to check data as follows:

head -10 food_prices.csv

Data of Output looks as follows:

Before create a table open the hive shell , we need to create a database as follows Open the hive shell

To create database :

Create database dezyre_db; use dezyre_db;

Step 2 : Create a Hive Table and Load the data

Here we are going create a hive table as follows below hive script to create a table as follows

CREATE TABLE food_prices ( series_reference string, Period string, data_value int, status string, units string, subject string, product string, series_title string, year string )ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");

In the above hql script we are creating a hive temporary table also skipping the headers which we are going to read from the csv file .

Step 3 : Load data into hive table

load data local inpath '/home/bigdata/Downloads/food_prices.csv' into table food_prices;

Step 4 : Query and verify the data

To query the data using below query in hive shell

Select * from food_prices limit 10;

Output of the above query :

Step 5 : Create a Partition table with Partition key

Here we are going to create a partition table by specifying the "partition by" while creating the table.

Static Partition : In static partitioning we need to pass the values of the partitioned column manually when we load the data into the table. So, We need to manually create each partition before inserting data into a partition

Syntax to create a partition table :

create table partioned_food_prices ( series_reference string, data_value int, status string, units string, subject string, product string ) partitioned by (year string) ;

Loading the data by specific the partition key as follows :

hive> insert overwrite table partioned_food_prices partition(year = '2019') select series_reference, data_value, status, units, subject, product from food_prices where year='2019';

Here in the above we are loading the data specifying year as partition key and value, that particular data will load into the table.

Output of the query: it will run the mapreduce to insert the data into the partitioned table from the temporary table.

We can verify the data by run the hdfs command as below:

hdfs dfs -cat /user/hive/warehouse/partioned_food_prices/year=2019/000000_0 head -10

Output of the above command:

We can also verify the data by running select query as below:

Dynamic partition : In Dynamic partitioning, Partitions will be created dynamically based on input data to the table.

Syntax to create partition table:

create table partitioned_food_prices ( series_reference string, Period string, data_value int, status string, units string, subject string, product string, series_title string )partitioned by (year string);

Enable properties for the partition set hive.exec.dynamic.partition.mode=nonstrict;

Loading the data into partitioned table from the temp table :

hive> insert overwrite table partitioned_food_prices partition(year) select * from default.food_prices;

in the Static partition while creating the partition we were giving some static value for the partition column. In a dynamic partition, the partition will happen dynamically i.e. it will create a partition based on the value of the partition column.

Query run as below: -

Output of the above query : it will create partitioned directories based on the year.

We can also verify the data by running the below query

Step 6 : To drop or delete the static/dynamic partition column

To drop or delete the partition column by using the alter table with delete partition command

Relevant Projects

Spark Project -Real-time data collection and Spark Streaming Aggregation
In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this apache kafka project is to process log entries from applications in real-time using Kafka for the streaming architecture in a microservice sense.

Yelp Data Processing using Spark and Hive Part 2
In this spark project, we will continue building the data warehouse from the previous project Yelp Data Processing Using Spark And Hive Part 1 and will do further data processing to develop diverse data products.

Design a Hadoop Architecture
Learn to design Hadoop Architecture and understand how to store data using data acquisition tools in Hadoop.

PySpark Tutorial - Learn to use Apache Spark with Python
PySpark Project-Get a handle on using Python with Spark through this hands-on data processing spark python tutorial.

Real-Time Log Processing using Spark Streaming Architecture
In this Spark project, we are going to bring processing to the speed layer of the lambda architecture which opens up capabilities to monitor application real time performance, measure real time comfort with applications and real time alert in case of security

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Airline Dataset Analysis using Hadoop, Hive, Pig and Impala
Hadoop Project- Perform basic big data analysis on airline dataset using big data tools -Pig, Hive and Impala.

Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous hive project "Data engineering on Yelp Datasets using Hadoop tools" and do the entire data processing using spark.

Data Warehouse Design for E-commerce Environments
In this hive project, you will design a data warehouse for e-commerce environments.