How to write SQL code to create a Hive table to query the data?

This recipe explains how to write SQL code to create a Hive table and query the data.

Recipe Objective

In most big data scenarios, Hive serves as an ETL and data warehouse tool on top of the Hadoop ecosystem. It is used for processing different types of structured and semi-structured data, supports DDL and DML operations like a database, and provides a flexible query language called HQL (Hive Query Language) for querying and processing data.

System requirements: a working Hadoop installation (for HDFS) with Hive installed on top of it.

Step 1: Prepare a Dataset

Here we are using a bank-related comma-separated values (CSV) dataset to create the Hive table from a local file.

Use the following Linux command to inspect the data:

head -10 listbankdataset.csv
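Optionally, wc -l gives a quick record count for the file; note that the total includes the header row:

wc -l listbankdataset.csv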

The command prints the first ten rows of the file, including the header row.

Step 2: Copy the CSV Data to HDFS (Hadoop Distributed File System)

Here we are going to copy the CSV file to HDFS. Before that, we need to create a directory to store the file. To create a directory in HDFS:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/client_bank_details/


In the above commands, we first create the /data folder and then create the client_bank_details folder inside it.
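As a side note, the same result can be achieved in a single command with the -p flag, which creates any missing parent directories along the path:

hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir -p /data/client_bank_details/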

To check whether the directory was created, list the contents of /data:
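hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data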

To copy the CSV file, use the below command:

hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/listbankdataset.csv /data/client_bank_details/

To verify that the file exists in the directory path, run the below command:

hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/client_bank_details/

Note: In our scenario, the directory path in Hadoop is /data/client_bank_details/

Before creating the external table, open the Hive shell (by running the hive command) and create a database as follows:

CREATE DATABASE dezyre_db;
USE dezyre_db;
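To confirm that the database was created, you can list the available databases from the Hive shell:

SHOW DATABASES;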

Step 3: Create an External Hive Table and Load the Data

Here we are going to create an external Hive table on top of the CSV file. Use the below Hive script to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS `client_bank_details` (
  `client_num` string,
  `attrition_flag` string,
  `customer_age` string,
  `gender` string,
  `dependent_count` string,
  `education_level` string,
  `marital_status` string,
  `income_category` string,
  `card_category` string,
  `months_on_book` string,
  `total_relationship_count` string,
  `months_inactive_12_mon` string,
  `contacts_count_12_mon` string,
  `credit_limit` string,
  `total_revolving_bal` string,
  `avg_open_to_buy` string,
  `total_amt_chng_q4_q1` string,
  `total_trans_amt` string,
  `total_trans_ct` string,
  `total_ct_chng_q4_q1` string,
  `avg_utilization_ratio` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/data/client_bank_details'
TBLPROPERTIES ("skip.header.line.count"="1");

In the above HQL script, we create an external table by specifying the data format and location, and skip the header line of the CSV file.
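Once the table exists, its definition (the columns, SerDe, and location we just specified) can be verified from the Hive shell:

DESCRIBE FORMATTED client_bank_details;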


Step 4: Query and Verify the Data

To query the data, run the below query in the Hive shell:

SELECT * FROM client_bank_details LIMIT 10;

The query returns the first ten rows of the bank dataset from the external table.
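As an additional check that all rows were loaded and the header was skipped, the row count can be compared against the file's line count (minus one for the header):

SELECT COUNT(*) FROM client_bank_details;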
