In most of the big data scenarios , Hive is an ETL and data warehouse tool on top of the hadoop ecosystem, it is used for the processing the different types structured and semi-structured data, it is a database we can also perform DDL and DML operations, and it provides flexible query language called as HQL for better querying and processing of data.
Here we are using the bank related comma separated values (csv) dataset for the create hive table in local.
Use linux command to check data as follows:
head -10 listbankdataset.csv
Data is looks as follows :
Here we are going to copy the csv file to hdfs. Before that we need to create a directory to store the file. To create a directory in the HDFS :
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data
hduser@bigdata-VirtualBox:~$ hdfs dfs -mkdir /data/client_bank_details/
above command as follows:
In the above commands we create /data folder 1st then inside the data folder created another folder client_bank_details.
To check whether a directory is created or not.
To copy the csv file use below command
hduser@bigdata-VirtualBox:~$ hdfs dfs -put /home/hduser/listbankdataset.csv /data/client_bank_details/
To verify the file is exists or not in directory path follow as below:
hduser@bigdata-VirtualBox:~$ hdfs dfs -ls /data/client_bank_details/
Note: In our scenario directory path in the hadoop is /data/client_bank_details/
Before create an external table open the hive shell and we need to create a database as follows : Open the hive shell
To create a database :
Create database dezyre_db; use dezyre_db;
Here we are going create an external hive table on the top of the csv file for that use below hive script to create a table as follows :
CREATE EXTERNAL TABLE IF NOT EXISTS `client_bank_details` ( `client_num` string, `attrition_flag` string, `customer_age` string, `gender` string, `dependent_count` string, `education_level` string, `marital_atatus` string, `Income_xategory` string, `card_xategory` string, `months_on_book` string, `total_relationship_count` string, `months_inactive_12_mon` string, `contacts_count_12_mon` string, `credit_cimit` string, `total_revolving_bal` string, `avg_open_to_buy` string, `total_amt_chng_q4_q1` string, `total_trans_amt` string, `total_trans_ct` string, `total_ct_chng_q4_q1` string, `avg_utilization_ratio` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"" ) STORED AS TEXTFILE LOCATION '/data/client_bank_details' TBLPROPERTIES ("skip.header.line.count"="1");
In the above hql script we are creating an external by specifying the data format and location and also skipping the headers of the csv file.above command looks as follows:
To query the data using below query in hive shell
Select * from client_bank_details limit 10;
Output of the above code: