Explain how to write SQL code to extract column mappings and store into final table?


This recipe explains how to write SQL code to extract column mappings and store the results into a final table.


Recipe Objective

In most big data scenarios, we use the Optimized Row Columnar (ORC) file format, which provides a highly efficient way to store Hive data. It was designed to overcome the limitations of other Hive file formats, and using ORC files improves performance when Hive reads, writes, and processes data. In this recipe we create an ORC table and then perform transformations and extractions on the columns, such as reformatting dates and deriving new columns.

System requirements:

Step 1: Prepare a Dataset

Here we are using an employee-related comma separated values (CSV) dataset to create the Hive table locally.

Use the following Linux command to check the data:

head -10 emp_data.csv

The data looks as follows:

Step 2 : Copy the CSV data to HDFS (hadoop distributed file system)

Here we are going to copy the CSV file to HDFS. Before that, we need to create a directory to store the file. To create a directory in HDFS:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/emp/

In the above commands, we first create the /data folder and then create the emp folder inside it.

To check whether the directory was created, list it (for example, hdfs dfs -ls /data).

To copy the CSV file, use the below command:

hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/Downloads/empdata.csv /data/emp/

To verify that the file exists in the directory path, run:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/emp/

Note: In our scenario, the directory path in HDFS is /data/emp/

Before creating an external table, open the Hive shell; we first need to create a database as follows:

To create a database:

CREATE DATABASE dezyre_db;
USE dezyre_db;
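To confirm that the database was created and selected, an optional check (not shown in the original recipe) is to list the databases from the same Hive shell:

SHOW DATABASES;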

Step 3 : Create an External Hive Table and Load the data

Here we are going to create an external Hive table on top of the CSV file. Use the below Hive script to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS employee(
  `emp_id` string,
  `firstname` string,
  `lastname` string,
  `gender` string,
  `email_id` string,
  `father_name` string,
  `mother_name` string,
  `date_of_birth` string,
  `time_of_birth` string,
  `age_in_yrs` string,
  `weight_in_kgs` string,
  `date_of_joining` string,
  `quarter_of_joining` string,
  `month_name_of_joining` string,
  `age_in_company_years` string,
  `salary` string,
  `last_hike` string,
  `ssn` string,
  `ph_no` string,
  `place` string,
  `county` string,
  `city` string,
  `state` string,
  `zip` string,
  `region` string,
  `user_name` string,
  `password` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/data/emp'
TBLPROPERTIES ("skip.header.line.count"="1");

In the above HQL script, we are creating an external table by specifying the data format and location, and also skipping the header line of the CSV file.
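As an optional sanity check (not part of the original recipe), you can confirm the table definition, SerDe, and HDFS location of the external table from the Hive shell:

DESCRIBE FORMATTED employee;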

Step 4 : Query and verify the data

To query the data, use the below query in the Hive shell:

SELECT * FROM employee LIMIT 4;

Output of the above code:
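Optionally (this check is not in the original recipe), a row count confirms that the data loaded and that the header line was skipped:

SELECT COUNT(*) FROM employee;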

Step 5 : Create a final table (ORC table)

Now we are going to create the final Hive table, which is in ORC format. To create the final table, use the script below:

CREATE TABLE IF NOT EXISTS final_emp_table (
  `emp_id` int,
  `firstname` string,
  `lastname` string,
  `gender` string,
  `email_id` string,
  `mail_service` string,
  `father_name` string,
  `mother_name` string,
  `date_of_birth` string,
  `time_of_birth` string,
  `age_in_yrs` float,
  `weight_in_kgs` string,
  `date_of_joining` string,
  `year_of_joining` string,
  `quarter_of_joining` string,
  `month_name_of_joining` string,
  `age_in_company_years` float,
  `salary` float,
  `last_hike` string,
  `ssn` string,
  `ph_no` string,
  `place` string,
  `county` string,
  `city` string,
  `state` string,
  `region` string,
  `user_name` string,
  `password` string
)
PARTITIONED BY (`zip` int)
STORED AS ORC;

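To verify the schema of the final table and its partition column, an optional check (not shown in the original recipe) is:

DESCRIBE final_emp_table;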

Step 6 : Copy data from external table to final table

In this step, we copy the data from the external table into the final table using an INSERT statement. While doing so, we derive new columns from the existing external table and apply transformations such as changing date formats and extracting values from existing columns.

First, set the properties required to create the dynamic partitions, as below:

hive> SET hive.exec.compress.intermediate=true;
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> SET hive.mapred.mode = nonstrict;


Use the following INSERT statement to load the data from the temporary (external) table:

INSERT OVERWRITE TABLE final_emp_table PARTITION(`zip`)
SELECT
  cast(`emp_id` as int),
  cast(`firstname` as string),
  cast(`lastname` as string),
  cast(`gender` as string),
  cast(`email_id` as string),
  regexp_extract(`email_id`, '([A-z])*\\.([A-z])*$', 0) as mail_service,
  cast(`father_name` as string),
  cast(`mother_name` as string),
  cast(`date_of_birth` as string),
  cast(`time_of_birth` as string),
  cast(`age_in_yrs` as float),
  cast(`weight_in_kgs` as string),
  coalesce(
    CAST(from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy-MM-dd') as string),
    CAST(from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy-MM-dd') as string)
  ),
  coalesce(
    CAST(from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy') as string),
    CAST(from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy') as string)
  ) as year_of_joining,
  cast(`quarter_of_joining` as string),
  cast(`month_name_of_joining` as string),
  cast(`age_in_company_years` as float),
  cast(`salary` as float),
  cast(`last_hike` as float),
  cast(`ssn` as string),
  cast(`ph_no` as string),
  cast(`place` as string),
  cast(`county` as string),
  cast(`city` as string),
  cast(`state` as string),
  cast(`region` as string),
  cast(`user_name` as string),
  cast(`password` as string),
  cast(`zip` as int)
FROM employee;

In the above INSERT statement, we convert the date format, extract the year from the date of joining, and partition on the zip column; we also extract the mail domain from the email id into the new mail_service column.
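If you want to preview these column mappings before running the full INSERT above, a minimal sketch (not part of the original recipe, reusing the same functions against the external table; the alias names here are just illustrative) is:

SELECT `email_id`,
       regexp_extract(`email_id`, '([A-z])*\\.([A-z])*$', 0) AS mail_service,
       `date_of_joining`,
       coalesce(
         from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy-MM-dd'),
         from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy-MM-dd')
       ) AS date_of_joining_formatted
FROM employee
LIMIT 5;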


Output of the above INSERT query:
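Since the final table is dynamically partitioned on zip, an optional check (not in the original recipe) is to list the partitions that were created:

SHOW PARTITIONS final_emp_table;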

Step 7 : Verify the data in the final table

To verify the data in the final table, run the query below:

SELECT * FROM final_emp_table LIMIT 5;

Output of the above query:
