How to write SQL code to create a Hive table to query the data

This recipe explains how to write SQL code to create a Hive table and query the data.

Recipe Objective

In most big data scenarios, Hive serves as an ETL and data warehouse tool on top of the Hadoop ecosystem. It is used for processing different types of structured and semi-structured data, it supports DDL and DML operations like a database, and it provides a flexible query language called HQL (Hive Query Language) for querying and processing data.

System requirements: a running Hadoop installation (HDFS) with Hive installed.

Step 1: Prepare a Dataset

Here we are using a bank-related comma-separated values (CSV) dataset to create the Hive table locally.

Use the following Linux command to inspect the first few rows of the data:

head -10 listbankdataset.csv

The command prints the first ten lines of the file: the header row followed by the first nine data records.

Step 2: Copy the CSV Data to HDFS (Hadoop Distributed File System)

Here we are going to copy the CSV file to HDFS. Before that, we need to create a directory to store the file. To create the directories in HDFS:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/client_bank_details/

The above commands first create the /data folder and then create the client_bank_details folder inside it.
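
As a shortcut (an equivalent one-liner, assuming the same paths), the -p flag creates the parent directory automatically:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir -p /data/client_bank_details/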

To check whether the directories were created or not, list them as shown below.
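
A quick check (the exact listing output depends on your cluster):

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data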

To copy the CSV file, use the below command:

hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/listbankdataset.csv /data/client_bank_details/

To verify that the file exists in the directory path, run the below command:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/client_bank_details/

Note: In our scenario, the directory path in Hadoop is /data/client_bank_details/

Before creating the external table, open the Hive shell and create a database as follows.
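
To open the Hive shell (assuming the hive binary is on the PATH), run it from the terminal:

hduser@bigdata-VirtualBox:~$ hive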

To create a database and start using it:

CREATE DATABASE dezyre_db;
USE dezyre_db;
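
Optionally, the new database can be confirmed from the shell (output formatting varies between Hive versions):

SHOW DATABASES;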

Step 3: Create an External Hive Table and Load the Data

Here we are going to create an external Hive table on top of the CSV file. Use the below Hive script to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS `client_bank_details` (
  `client_num` string,
  `attrition_flag` string,
  `customer_age` string,
  `gender` string,
  `dependent_count` string,
  `education_level` string,
  `marital_status` string,
  `income_category` string,
  `card_category` string,
  `months_on_book` string,
  `total_relationship_count` string,
  `months_inactive_12_mon` string,
  `contacts_count_12_mon` string,
  `credit_limit` string,
  `total_revolving_bal` string,
  `avg_open_to_buy` string,
  `total_amt_chng_q4_q1` string,
  `total_trans_amt` string,
  `total_trans_ct` string,
  `total_ct_chng_q4_q1` string,
  `avg_utilization_ratio` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/data/client_bank_details'
TBLPROPERTIES ("skip.header.line.count"="1");

In the above HQL script, we are creating an external table by specifying the data format (OpenCSVSerde) and the HDFS location, and also skipping the header row of the CSV file. Note that the OpenCSVSerde reads every column as a string, which is why all columns are declared as string; values can be cast to numeric types at query time.
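
For example, a minimal sketch of casting string columns at query time (column names taken from the table definition above):

SELECT client_num,
       CAST(customer_age AS INT) AS customer_age,
       CAST(credit_limit AS DOUBLE) AS credit_limit
FROM client_bank_details
LIMIT 5;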

Step 4: Query and Verify the Data

To query the data, run the below query in the Hive shell:

SELECT * FROM client_bank_details LIMIT 10;

The query returns the first ten rows of the table; because of the skip.header.line.count property, the header row is not included.
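
As an extra sanity check (a simple sketch; the expected count depends on your dataset), the row count in Hive should equal the source file's line count minus one for the skipped header:

SELECT COUNT(*) FROM client_bank_details;

And on the local machine:

wc -l /home/hduser/listbankdataset.csv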
