Apache HIVE

HIVE- A Data Warehousing Solution for Big Data on Hadoop

Hive is a data warehousing solution developed on top of Hadoop to meet the big data challenges of storing, managing and processing large data sets without having to write complex Java based MapReduce programs. Hive is a familiar programming model for big data professionals who know SQL but do not have a good grip in programming. Hive is not a relational database or an architecture for online transaction processing. It is particularly designed for online analytical processing systems (OLAP).

Hive compiler converts the queries written in HiveQL into MapReduce jobs so that Hadoop developers need not worry much about the complex programming code beyond the processing and they can focus on the business problem. The three important functions performed by Hive include - data summarization, data querying and data analysis. Apache Hive is extensively used by data scientists and data analysts for data exploration, building data pipelines and for processing ad-hoc queries.

Hive Components

Apache Flume

Image Credit: Hortonworks
  1. CLI, JDBC, ODBC or any other Web GUI form the external interfaces to the Hive framework interface for creating interaction between user and HDFS.
  2. Metastore Thrift API keeps tracks of what data is stored in which part of the HDFS .It is like a system catalog.
  3. Driver is heart of the Hive architecture responsible for compilation, optimization and execution of HiveQL statements.
  4. Thrift Server is a client side API for executing HiveQL statements.

How a HiveQL query is executed in Apache Hive?

Whenever a user submits a HiveQL query, it is first compiled. The compiled query is then executed by an execution engine like Hadoop MapReduce or Apache Tez. Data in diverse formats like ORC, AVRO, Parquet, or Text reside in HDFS on which the query is to be executed. YARN then allocates desired resources across the Hadoop cluster for execution. The results of the query execution are sent over a JDBC or ODBC connection.

Features of Apache Hive

  • Hive architecture has a system catalog that make lookup easy.
  • Hive Metastore provides flexibility in schema designs and helps in data serialization or data deserialization.
  • Enhanced performance through data partitioning.
  • External tables make processing data possible even without actually storing it into HDFS.
  • Automatically optimizes logical plans through rule based optimizer.
  • Bucketing in Hive enhances the join performance especially when the bucket key and join key are same.

Hive Advantages

  • Provides enhanced performance when compared to slow MapReduce jobs through optimized index support, query vectorization, cost based optimization and the Tez engine.
  • Provides enterprise level security through object level securables so that the managers can ensure that only required team members can access the data.
  • Most of the big data developers or data analysts are familiar with SQL and Hive provides SQL ANSI compliant code and several advanced SQL like features-GROUP BY , HAVING, OLAP functions, date and timestamp data types, XML ,etc.
  • Every business requires that data be stored in a non-corrupted state and Hive ensures this level of data integrity by providing complete ACID transaction support.
  • Reduces the development time by eliminating the need to write lengthy Java MapReduce code.

Companies Using Hive

  • VideoEgg uses Hive to analyse its overall usage data.
  • Facebook uses Hive for machine learning, ad-hoc analysis and reporting.
  • Hi5 uses Hive for social graph analysis.
  • Grooveshark uses hive for cleaning data sets and performing user analytics.

HIVE Blogs

Impala vs Hive: Difference between Sql on Hadoop components
Cloudera Impala is 6 to 69 times faster than Apache Hive.To conclude, Impala does have a number of performance related advantages over Hive but it also depends upon the kind of task at hand. That being said, Jamie Thomson has found some really interesting results through dumb querying published on sqlblog.com, especially in terms of execution time. Click to read more.
MapReduce vs. Pig vs. Hive
There is no need to import any additional libraries and anyone with basic knowledge of SQL and without a Java background can easily understand it. The learning curve for Java MapReduce is high when compared to that of Pig Latin or HiveQL. Click to read more.
Hive vs Impala - SQL War in the Hadoop Ecosystem
Apache Hive is an effective standard for SQL-in-Hadoop. Hive is a front end for parsing SQL statements, generating logical plans, optimizing logical plans, translating them into physical plans which are executed by MapReduce jobs. Click to read more.
Difference between Pig and Hive-The Two Key Components of Hadoop Ecosystem
Hive is similar to a SQL Interface in Hadoop. The data that is stored in HBase component of the Hadoop Ecosystem can be accessed through Hive. Hive is of great use for developers who are not well-versed with the MapReduce framework for writing data queries that are transformed into MapReduce jobs in Hadoop. Click to read more.
Innovation in Big Data Technologies aides Hadoop Adoption
Today, Hive is used at Facebook to store more than 2 petabytes of uncompressed data and regularly loads more than 15 terabytes of data daily. Hadoop Hive is extensively used at Facebook for BI, machine learning and other simple summarization jobs. Click to read more.

HIVE Tutorials

Fundamentals of Hive
Hive is an easy way to work with data stored in HDFS (Hadoop file system). It is SQL oriented query language. Basically Hive is SQL for Hadoop cluster. It is an open source data warehouse system on top of HDFS that adds structure to the data. Just like database, Hive has features of creating database, making tables and crunching data with query language. Click to read more.
Hive: Internal Tables
There are 2 types of tables in Hive, Internal and External. This case study describes creation of internal table, loading data in it, creating views, indexes and dropping table on weather data. Click to read more.

HIVE Interview Questions

  1. What happens on executing the below query? After executing the below query, if you modify the column –how will the changes be tracked?

    • Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus).
    • AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'; Read more.
  2. Can you list few commonly used Hive services?

    • Command Line Interface (cli)
    • Hive Web Interface (hwi)
    • HiveServer (hiveserver) Read more.
  3. What is the use of Hcatalog?

    • Hcatalog can be used to share data structures with external systems. Hcatalog provides access to hive metastore to users of other tools on Hadoop so that they can read and write data to hive's data warehouse. Read more.

HIVE Slides

Hive Introduction

Advanced Hive

HIVE Videos

HIVE Questions & Answers

  1. how to delete database from hive?

    • is that possible? i created a database but its messed up now i want to delete that and again create the database of same name.is that possible?please help me Click to read answer.
  2. count columns in Hive

    • Anyone know a query to count the number of columns in a table using a Hive QL? Ex : if the table "t1" has columns c1,c2,c3...........cn (where n can be 200+), i want to know the number of columns using a hive QL, the output of the Hive QL should be columns=223. Click to read answer.
  3. Regarding saving data in Hive table

    • For example: Michigan, USA. I am trying to save data in hive table with delimited fields terminated by ','. Instead of just Michigan and USA, why am I getting numbers in front of Michigan. Click to read answer.
  4. Problem creating a database in hive

    • after giving rwx permission to all user using a sudo command and removing .lck files I tried to create a database by running the following commands: hive> CREATE DATABASE hivedemo > LOCATION '/user/hive/warehouse/hivedemo/';Click to read answer.
  5. Partitioned Hive table question

    • when create hive table, if I use partition, I can't decalre the column in the "Create table" part of the ddl any more (otherwise, I received an error). But then, how do I make sure data from each column from the file go to the right column in the table that created? Click to read answer.

HIVE Assignments

You will need to have a working knowledge in HiveQL to be able to work on these assignments.

Click here to download the respective datasets from the zip file.

Hive Assignment - 1

Data Sets: NASDAQ Exchange Daily 1970-2010 Open, Close, High, Low and Volume

Download links: NASDAQ_dividends_A.csv    NASDAQ_daily_prices_A_sample.csv

Summary of data: There are two types of data

  1. File name starting with NASDAQ_daily_prices*: These files is a CSV (comma seperated values) file which contains following fields which are self explanatory.

    exchange,stock_symbol,date,stock_price_open,stock_price_high,stock_price_low,stock_price,_close,stock_volume,stock_price_adj_close

  2. File name starting with NASDAQ_daily_prices*: These files is a CSV (comma seperated values) file which contains following fields which are self explanatory.

    exchange,stock_symbol,date,dividends

Problems & Solutions:

CREATE EXTERNAL TABLE nasdaq_dividend ( exchange1 STRING, stock_symbol STRING, modified_date STRING, dividends FLOAT ) 
COMMENT 'This is the nasdaq dividend table' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' 
STORED AS TEXTFILE 
LOCATION '/input/nasdaq/dividends';
CREATE EXTERNAL TABLE nasdaq_prices (exchange1 STRING, stock_symbol STRING, modified_date STRING,stock_price_open FLOAT,stock_price_high FLOAT,stock_price_low FLOAT,stock_price_close FLOAT,stock_volume FLOAT,stock_price_adj_close FLOAT )
COMMENT 'This is the nasdaq prices table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/input/nasdaq/daily_price/';
select stock_symbol, sum(stock_volume) as stock_volume 
from nasdaq_prices 
where stock_price_close > 5 
group by stock_symbol;
select stock_symbol, max(stock_price_close) as highest_price 
from nasdaq_prices 
group by stock_symbol;
select stock_symbol, max(dividends) as highest_dividends 
from nasdaq_dividend 
group by stock_symbol;
select np.stock_symbol, highest_price, highest_dividends 
from (select stock_symbol, max(stock_price_close) as highest_price 
      from nasdaq_prices 
      group by stock_symbol) np 
join (select stock_symbol, max(dividends) as highest_dividends 
      from nasdaq_dividend 
      group by stock_symbol) nd 
on np.stock_symbol = nd.stock_symbol;
select np.stock_symbol, highest_price, highest_dividends 
from (select stock_symbol, max(stock_price_close) as highest_price 
      from nasdaq_prices 
      group by stock_symbol)  np 
full outer join (select stock_symbol, max(dividends) as highest_dividends 
      from nasdaq_dividend 
      group by stock_symbol) nd 
on np.stock_symbol = nd.stock_symbol;

PIG Assignment - 2

Data Set: The NBER U.S. Patent Data

Data Link: http://www.nber.org/patents/

We will use two of the data sets:

  1. Patent data, including constructed variables:

    Data link: http://www.nber.org/patents/pat63_99.zip

    Summary of this data is given here - http://www.nber.org/patents/pat63_99.txt

  2. Class codes with corresponding class names:

    Data Link: http://www.nber.org/patents/list_of_classes.txt

    Summary of data: This data need to be cleaned in a way so that it only contains class Id and title in tab separated form. Just remove first 9 lines.

Problems & Solutions:

For this problem we have divided data for patents into multiple files, one file containing patents granted in one particular year.

CREATE EXTERNAL TABLE patents ( PATENT STRING , year String, GDATE STRING ,APPYEAR STRING ,COUNTRY STRING ,POSTATE STRING ,ASSIGNEE STRING ,ASSCODE STRING ,CLAIMS STRING ,NCLASS STRING ,CAT STRING ,SUBCAT STRING ,CMADE STRING ,CRECEIVE STRING ,RATIOCIT STRING ,GENERAL STRING ,ORIGINAL STRING ,FWDAPLAG STRING ,BCKGTLAG STRING ,SELFCTUB STRING ,SELFCTLB STRING ,SECDUPBD STRING ,SECDLWBD STRING ) COMMENT 'This is the patents table' PARTITIONED BY(GYEAR STRING) CLUSTERED BY(NCLASS) SORTED BY(NCLASS) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/input/patents/data';

--For creating data in different partition:
-------------------------------------------
hadoop dfs -put patent_1963.txt /input/patents/data/gyear=1963/
hadoop dfs -put patent_1964.txt /input/patents/data/gyear=1964/
hadoop dfs -put patent_1965.txt /input/patents/data/gyear=1965/
hadoop dfs -put patent_1999.txt /input/patents/data/gyear=1999/

--For creating partitions in table
---------------------------------
ALTER TABLE patents ADD PARTITION(gyear=1963);
ALTER TABLE patents ADD PARTITION(gyear=1964);
ALTER TABLE patents ADD PARTITION(gyear=1965);
ALTER TABLE patents ADD PARTITION(gyear=1999);

--You can also use directly this:
--------------------------------
LOAD DATA LOCAL INPATH "patent_1963.txt" INTO TABLE patents PARTITION(gyear=1963);
LOAD DATA LOCAL INPATH "patent_1964.txt" INTO TABLE patents PARTITION(gyear=1964);
LOAD DATA LOCAL INPATH "patent_1965.txt" INTO TABLE patents PARTITION(gyear=1965);
LOAD DATA LOCAL INPATH "patent_1999.txt" INTO TABLE patents PARTITION(gyear=1999);
select count(*) from patents where GYEAR=1963;
select COUNTRY, count(*) from patents where GYEAR=1999 group by COUNTRY;

Hive Commands:

CREATE TABLE player_runs(player_id INT, year_of_play STRING, runs_scored INT)
COMMENT 'This is the player_run table'
STORED AS TEXTFILE;
describe player_runs
ALTER TABLE player_runs RENAME TO runs_of_player;
ALTER TABLE runs_of_player ADD COLUMNS (balls_played INT COMMENT 'a new int column');
drop table runs_of_player;

CREATE TABLE player_runs(player_id INT, year_of_play STRING, runs_scored INT, balls_played INT, country STRING )
COMMENT 'This is the player_run table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH "data/runs.csv" INTO TABLE player_runs;

select * from player_runs limit 10;
CREATE EXTERNAL TABLE player_runs_p(player_id INT, year_of_play STRING, runs_scored INT, balls_played INT)
COMMENT 'This is the staging player_runs table'  PARTITIONED BY(country STRING)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/input/runs';

hadoop dfs -mkdir /input/runs/country=India
hadoop dfs -put data/runs_India.csv /input/runs/country=India/

ALTER TABLE player_runs_p ADD PARTITION(country='India');
LOAD DATA LOCAL INPATH "data/runs_US.csv" INTO TABLE player_runs_p PARTITION(country='US');
LOAD DATA LOCAL INPATH "data/runs_India.csv" INTO TABLE player_runs_p PARTITION(country='INDIA');

FROM player_runs_p
INSERT OVERWRITE TABLE player_runs 
SELECT player_id, year_of_play, runs_scored,balls_played 
where player_id=10;
select player_id, runs_scored from player_runs;
select * from player_runs where player_id=10;
select player_id, sum(runs_scored) from player_runs group by player_id;
CREATE EXTERNAL TABLE players(player_id INT, name STRING)
COMMENT 'This is the staging player table'  
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/input/player';

LOAD DATA LOCAL INPATH "data/player.csv" INTO TABLE players;

select * from players join player_runs on players.player_id=player_runs.player_id;

select * from players full outer join player_runs on players.player_id=player_runs.player_id;

select * from players left outer join player_runs on players.player_id=player_runs.player_id;
CREATE TABLE player_total_runs(player_id INT, runs_scored INT)
COMMENT 'This is the player_total_run table'
STORED AS TEXTFILE;

CREATE TABLE yearly_runs(year_of_play STRING, runs_scored INT)
COMMENT 'This is the yearly_run table'
STORED AS TEXTFILE;

//Insert into two tables
FROM player_runs
INSERT OVERWRITE TABLE player_total_runs
    select player_id, sum(runs_scored) 
    group by player_id 
INSERT OVERWRITE TABLE yearly_runs
    select year_of_play, sum(runs_scored) 
    group by year_of_play;

//Insert into one table and one HDFS file
FROM player_runs
INSERT OVERWRITE TABLE player_total_runs
    select player_id, sum(runs_scored) 
    group by player_id 
INSERT OVERWRITE DIRECTORY '/output/yearly_runs'
    select year_of_play, sum(runs_scored) 
    group by year_of_play;
CREATE EXTERNAL TABLE player_runs_distribute(player_id INT, year_of_play STRING, runs_scored INT, balls_played INT)
COMMENT 'This is the staging player_runs table'  PARTITIONED BY(country STRING)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/input/runs_distribute/';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM player_runs 
INSERT OVERWRITE TABLE player_runs_distribute PARTITION(country)
SELECT player_id,  year_of_play , runs_scored , balls_played, country 
DISTRIBUTE BY  country;

select * from player_runs_distribute where country='India' limit 10;
CREATE EXTERNAL TABLE player_runs_clustered(player_id INT, year_of_play STRING, runs_scored INT, balls_played INT)
COMMENT 'This is the  player_runs table'  PARTITIONED BY(country STRING) 
clustered by (player_id) INTO 10 buckets 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/input/runs_clustered/';

LOAD DATA LOCAL INPATH "data/runs_extra.csv" INTO TABLE player_runs;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;

FROM player_runs 
INSERT OVERWRITE TABLE player_runs_clustered PARTITION(country)
SELECT player_id,  year_of_play , runs_scored , balls_played, country 
DISTRIBUTE BY  country;

select avg(runs_scored) from player_runs_clustered TABLESAMPLE(BUCKET 1 OUT OF 10);
select /*+ MAPJOIN(players)*/ * from players join player_runs on players.player_id=player_runs.player_id;

select * from players join player_runs on players.player_id=player_runs.player_id;
processing person-icon