How to write SQL code to extract column mappings and store the result in a final table


This recipe explains how to write SQL code to extract column mappings and store the result in a final table.

Recipe Objective

In most big data scenarios, we use the Optimized Row Columnar (ORC) file format, which provides a highly efficient way to store Hive data. It was designed to overcome the limitations of other Hive file formats, and using ORC files improves performance when Hive reads, writes, and processes data. In this recipe we create an ORC table and perform transformations and extractions on the columns, date formats, etc.


Step 1: Prepare a Dataset

Here we use an employee-related comma-separated values (CSV) dataset to create a Hive table locally.

Use the following Linux command to inspect the data:

head -10 emp_data.csv


Step 2 : Copy the CSV data to HDFS (Hadoop Distributed File System)

Here we copy the CSV file to HDFS. Before that, we need to create a directory to store the file. To create a directory in HDFS:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/emp/


In the above commands we first create the /data folder, then create the emp folder inside it.
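As a side note, the same directory tree can be created in a single step with the -p flag, which creates any missing parent directories (standard hdfs dfs behavior):

```shell
# Creates /data and /data/emp in one command; no error if they already exist
hdfs dfs -mkdir -p /data/emp/
```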

To check whether the directory was created, list it:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/

To copy the CSV file, use the command below:

hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/Downloads/empdata.csv /data/emp/

To verify that the file exists in the directory, run:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/emp/

Note: In our scenario, the directory path in Hadoop is /data/emp/.

Before creating the external table, open the Hive shell and create a database. To open the Hive shell:

hduser@bigdata-VirtualBox:~$ hive

To create a database :

create database dezyre_db;
use dezyre_db;

Step 3 : Create an External Hive Table and Load the data

Here we create an external Hive table on top of the CSV file. Use the Hive script below to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS employee (
  `emp_id` string,
  `firstname` string,
  `lastname` string,
  `gender` string,
  `email_id` string,
  `father_name` string,
  `mother_name` string,
  `date_of_birth` string,
  `time_of_birth` string,
  `age_in_yrs` string,
  `weight_in_kgs` string,
  `date_of_joining` string,
  `quarter_of_joining` string,
  `month_name_of_joining` string,
  `age_in_company_years` string,
  `salary` string,
  `last_hike` string,
  `ssn` string,
  `ph_no` string,
  `place` string,
  `county` string,
  `city` string,
  `state` string,
  `zip` string,
  `region` string,
  `user_name` string,
  `password` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/data/emp'
TBLPROPERTIES ("skip.header.line.count"="1");

In the above HQL script we create an external table by specifying the data format and location, and we also skip the header line of the CSV file.

Step 4 : Query and verify the data

To query the data, run the query below in the Hive shell:

SELECT * FROM employee LIMIT 4;
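Besides sampling rows, the table definition itself can be inspected with a standard Hive command (an optional sanity check, not part of the original recipe):

```sql
-- Shows the columns, SerDe, HDFS location, and table properties of the external table
DESCRIBE FORMATTED employee;
```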


Step 5 : Create a final table (ORC table)

Now we create the final Hive table, which is in ORC format. To create it, run the statement below:

CREATE TABLE IF NOT EXISTS final_emp_table (
  `emp_id` int,
  `firstname` string,
  `lastname` string,
  `gender` string,
  `email_id` string,
  `mail_service` string,
  `father_name` string,
  `mother_name` string,
  `date_of_birth` string,
  `time_of_birth` string,
  `age_in_yrs` float,
  `weight_in_kgs` string,
  `date_of_joining` string,
  `year_of_joining` string,
  `quarter_of_joining` string,
  `month_name_of_joining` string,
  `age_in_company_years` float,
  `salary` float,
  `last_hike` string,
  `ssn` string,
  `ph_no` string,
  `place` string,
  `county` string,
  `city` string,
  `state` string,
  `region` string,
  `user_name` string,
  `password` string
)
PARTITIONED BY (`zip` int)
STORED AS ORC;


Step 6 : Copy data from external table to final table

In this step we copy the data from the external table into the final table. Using an INSERT statement, we derive new columns from the existing external table columns and apply transformations such as changing date formats and extracting parts of the data, as follows.

Set the following properties to enable dynamic partitioning:

hive> SET hive.exec.compress.intermediate=true;
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> SET hive.mapred.mode = nonstrict;


Use the INSERT statement below to copy the data from the temporary (external) table:

INSERT OVERWRITE TABLE final_emp_table PARTITION (`zip`)
SELECT
  cast(`emp_id` as int),
  cast(`firstname` as string),
  cast(`lastname` as string),
  cast(`gender` as string),
  cast(`email_id` as string),
  regexp_extract(`email_id`, '([A-z])*\\.([A-z])*$', 0) as mail_service,
  cast(`father_name` as string),
  cast(`mother_name` as string),
  cast(`date_of_birth` as string),
  cast(`time_of_birth` as string),
  cast(`age_in_yrs` as float),
  cast(`weight_in_kgs` as string),
  coalesce(
    cast(from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy-MM-dd') as string),
    cast(from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy-MM-dd') as string)
  ) as date_of_joining,
  coalesce(
    cast(from_unixtime(unix_timestamp(`date_of_joining`, 'MM/dd/yyyy'), 'yyyy') as string),
    cast(from_unixtime(unix_timestamp(`date_of_joining`, 'dd-MM-yyyy'), 'yyyy') as string)
  ) as year_of_joining,
  cast(`quarter_of_joining` as string),
  cast(`month_name_of_joining` as string),
  cast(`age_in_company_years` as float),
  cast(`salary` as float),
  cast(`last_hike` as string),
  cast(`ssn` as string),
  cast(`ph_no` as string),
  cast(`place` as string),
  cast(`county` as string),
  cast(`city` as string),
  cast(`state` as string),
  cast(`region` as string),
  cast(`user_name` as string),
  cast(`password` as string),
  cast(`zip` as int)
FROM employee;

In the above INSERT statement we convert the date format, extract the year from the date, and partition on the zip column; we also extract the mail domain from the email id and store it in the new column mail_service.
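The extraction logic can be spot-checked in the Hive shell against literal values before running the full insert (a hypothetical check; the literals below are made up, not from the dataset):

```sql
-- Extract the mail domain with the same regex used in the INSERT above
SELECT regexp_extract('jane.doe@gmail.com', '([A-z])*\\.([A-z])*$', 0);
-- Normalize a MM/dd/yyyy date to yyyy-MM-dd and pull out just the year
SELECT from_unixtime(unix_timestamp('12/25/2015', 'MM/dd/yyyy'), 'yyyy-MM-dd');
SELECT from_unixtime(unix_timestamp('12/25/2015', 'MM/dd/yyyy'), 'yyyy');
```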


Step 7 : Verify the data in the final table

To verify the data in the final table, run the query below:

SELECT * FROM final_emp_table LIMIT 5;

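Since the final table is partitioned dynamically on zip, it is also worth confirming that the partitions were actually created (a standard Hive command, added here as an optional check):

```sql
-- Lists one partition per distinct zip value loaded by the INSERT
SHOW PARTITIONS final_emp_table;
```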
