Explain how to write SQL code to extract column mappings and store into final table?

Explain how to write SQL code to extract column mappings and store into final table?

Explain how to write SQL code to extract column mappings and store into final table?

This recipe explains what how to write SQL code to extract column mappings and store into final table


Recipe Objective

In most of the big data scenarios , We will be using an Optimized Row Columnar (ORC) file format which provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.in this we are going to create an ORC table and will do transformations and extraction of the columns, date formats..etc. .

System requirements :

Step 1: Prepare a Dataset

Here we are using the employee related comma separated values (csv) dataset for the create hive table in local.

Use linux command to check data as follows:

head -10 emp_data.csv

Data is looks as follows :

Step 2 : Copy the CSV data to HDFS (hadoop distributed file system)

Here we are going to copy the csv file to hdfs. Before that we need to create a directory to store the file. To create a directory in the HDFS :

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/emp/

above command as follows:

In the above commands we create /data folder 1st then inside the data folder created another folder emp.

To check whether a directory is created or not.

To copy the csv file use below command

hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/Downloads/empdata.csv /data/emp/

To verify the file is exists or not in directory path follow as below:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/emp/

Note: In our scenario directory path in the hadoop is /data/emp/

Before create an external table open the hive shell and we need to create a database as follows : Open the hive shell

To create a database :

Create database dezyre_db; use dezyre_db;

Step 3 : Create an External Hive Table and Load the data

Here we are going create an external hive table on the top of the csv file for that use below hive script to create a table as follows :

CREATE EXTERNAL TABLE IF NOT EXISTS employee( `emp_id` string, `firstname` string, `lastname` string, `gender` string, `email_id` string, `father_name` string, `mother_name` string, `date_of_birth` string, `time_of_birth` string, `age_in_yrs` string, `weight_in_kgs` string, `date_of_joining` string, `quarter_of_Joining` string, `month_name_of_joining` string, `age_in_company_years` string, `salary` string, `last_hike` string, `ssn` string, `ph_no` string, `place` string, `county` string, `city` string, `state` string, `zip` string, `region` string, `user_name` string, `password` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"" ) STORED AS TEXTFILE LOCATION '/data/emp' TBLPROPERTIES ("skip.header.line.count"="1");

In the above hql script we are creating an external by specifying the data format and location and also skipping the headers of the csv file.

Step 4 : Query and verify the data

To query the data using below query in hive shell

Select * from employee limit 4;

Output of the above code:

Step 5 : Create a final table (ORC table)

Now we are going to create a hive final table, which is in ORC format.To create a final table follow as below

CREATE TABLE IF NOT EXISTS final_emp_table ( `emp_id int, `firstname` string, `lastname` string, `gender` string, `email_id` string, `mail_service` string, `father_name` string, `mother_name` string, `date_of_birth` string, `time_of_birth` string, `age_in_yrs` float, `weight_in_kgs` string, `date_of_joining` string, `year_of_joining` string, `quarter_of_joining` string, `month_name_of_joining` string, `age_in_company_years` float, `salary` float, `last_hike` string, `ssn` string, `ph_no` string, `place` string, `county` string, `city` string, `state` string, `region` string, `user_name` string, `password` string ) PARTITIONED BY (`zip` int) STORED AS ORC;

Above command as follows

Step 6 : Copy data from external table to final table

In this step we are going to copy the data from the external table and insert into the final table we will create new columns from the existing external table using insert statement and also we made transformations changing the date formats extracting the data as follows

Setting the properties to creating the partition as below

hive>SET hive.exec.compress.intermediate=true; hive>SET hive.exec.dynamic.partition = true; hive>SET hive.exec.dynamic.partition.mode = nonstrict; hive>SET hive.mapred.mode = nonstrict;

Above command as follows

Insert statement to insert the data from the temporary or external table

INSERT OVERWRITE TABLE final_emp_table partition(`zip`) SELECT cast(`emp_id` as int), cast(`firstname` as string), cast(`lastname` as string), cast(`gender` as string), cast(`email_id` as string), regexp_extract (`email_id`,'([A-z])*\\.([A-z])*$',0) as mail_service, cast(`father_name` as string), cast(`mother_name` as string), cast(`date_of_birth` as string), cast(`time_of_birth` as string), cast(`age_in_yrs` as float), cast(`weight_in_kgs` as string), coalesce( CAST( from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy-MM-dd') as string) , CAST( from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy-MM-dd') as string) ) , coalesce( CAST( from_unixtime(unix_timestamp(date_of_joining, 'MM/dd/yyyy'), "yyyy") as string) , CAST(from_unixtime(unix_timestamp(date_of_joining, 'dd-MM-yyyy'), "yyyy")AS STRING) )as year_of_joining, cast(`quarter_of_joining` as string), cast(`month_name_of_joining` as string), cast(`age_in_company_years`as float), cast(`salary` as float), cast(`last_hike` as float), cast(`ssn`as string), cast(`ph_no`as string), cast(`place` as string), cast(`county`as string), cast(`city` as string), cast(`state` as string), cast(`region` as string), cast(`user_name`as string), cast(`password` as string), cast(`zip` as int) FROM employee ;

In the above insert statement we convert data format and extract year from the date and partition on zip column ,also extracting the domain from email id and creating a new column mailservice.

Above command as follows

Output of the above query:

Step 6 : Copy data from external table to final table

To verify the data in final table run the query as below

Select * from final_emp_table limit 5;

Output of the above query:

Relevant Projects

Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Online Hadoop Projects -Solving small file problem in Hadoop
In this hadoop project, we are going to be continuing the series on data engineering by discussing and implementing various ways to solve the hadoop small file problem.

Event Data Analysis using AWS ELK Stack
This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Data Warehouse Design for E-commerce Environments
In this hive project, you will design a data warehouse for e-commerce environments.

Hadoop Project for Beginners-SQL Analytics with Hive
In this hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.

Implementing Slow Changing Dimensions in a Data Warehouse using Hive and Spark
Hive Project- Understand the various types of SCDs and implement these slowly changing dimesnsion in Hadoop Hive and Spark.

Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this apache kafka project is to process log entries from applications in real-time using Kafka for the streaming architecture in a microservice sense.

Build a big data pipeline with AWS Quicksight, Druid, and Hive
Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis
In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Real-time Auto Tracking with Spark-Redis
Spark Project - Discuss real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume. The ingestion will be done using Spark Streaming.