In most of the big data scenarios , Hive is an ETL and data warehouse tool on top of the hadoop ecosystem, it is used for the processing of the different types structured and semi-structured data, it is a database. Most of the times, we need to store the data based on conditions as partitions. Partitions are used to arrange table data into partitions by splitting tables into different parts based on the values to create partitions. Partitions are mainly useful for hive query optimisation to reduce the latency in the data. In hive we have two different partitions that are static and dynamic
Here we are using the food related comma separated values dataset to perform the csv file
Use linux command to check data as follows:
head -10 food_prices.csv
Data of Output looks as follows:
Before create a table open the hive shell , we need to create a database as follows Open the hive shell
To create database :
Create database dezyre_db;
Here we are going create a hive table as follows below hive script to create a table as follows
CREATE TABLE food_prices ( series_reference string, Period string, data_value int, status string, units string, subject string, product string, series_title string, year string )ROW FORMAT DELIMITED fields terminated by ',' TBLPROPERTIES ("skip.header.line.count"="1");
In the above hql script we are creating a hive temporary table also skipping the headers which we are going to read from the csv file .
load data local inpath '/home/bigdata/Downloads/food_prices.csv' into table food_prices;
To query the data using below query in hive shell
Select * from food_prices limit 10;
Output of the above query :
Here we are going to create a partition table by specifying the "partition by" while creating the table.
Static Partition : In static partitioning we need to pass the values of the partitioned column manually when we load the data into the table. So, We need to manually create each partition before inserting data into a partition
Syntax to create a partition table :
create table partioned_food_prices ( series_reference string, data_value int, status string, units string, subject string, product string ) partitioned by (year string) ;
Loading the data by specific the partition key as follows :
hive> insert overwrite table partioned_food_prices partition(year = '2019') select series_reference, data_value, status, units, subject, product from food_prices where year='2019';
Here in the above we are loading the data specifying year as partition key and value, that particular data will load into the table.
Output of the query: it will run the mapreduce to insert the data into the partitioned table from the temporary table.
We can verify the data by run the hdfs command as below:
hdfs dfs -cat /user/hive/warehouse/partioned_food_prices/year=2019/000000_0 head -10
Output of the above command:
We can also verify the data by running select query as below:
Dynamic partition : In Dynamic partitioning, Partitions will be created dynamically based on input data to the table.
Syntax to create partition table:
create table partitioned_food_prices ( series_reference string, Period string, data_value int, status string, units string, subject string, product string, series_title string )partitioned by (year string);
Enable properties for the partition
Loading the data into partitioned table from the temp table :
hive> insert overwrite table partitioned_food_prices partition(year) select * from default.food_prices;
in the Static partition while creating the partition we were giving some static value for the partition column. In a dynamic partition, the partition will happen dynamically i.e. it will create a partition based on the value of the partition column.
Query run as below: -
Output of the above query : it will create partitioned directories based on the year.
We can also verify the data by running the below query
To drop or delete the partition column by using the alter table with delete partition command