Fundamentals of Hive
The size of the datasets used in industry for business intelligence is growing rapidly. Hadoop solved the problem of large-scale data storage, but using Hadoop was not an easy task for end users, especially those unfamiliar with the MapReduce programming model. Hive offers an easy way to work with data stored in HDFS (the Hadoop Distributed File System): it provides a SQL-oriented query language. Basically, Hive is SQL for a Hadoop cluster. It is an open-source data warehouse system on top of HDFS that adds structure to the data. Just like a database, Hive supports creating databases, making tables, and crunching data with a query language. The query language used by Hive is called Hive Query Language (HQL). HQL and SQL have similar DDL statements for creating objects and DML statements for querying objects, and HQL also has features for working with unstructured data in HDFS.
Hive System Architecture
The figure above shows how Hive connects to Hadoop (HDFS + MapReduce) and the internal structure of Hive.
The main components of Hive are:
- Metastore: It stores all of Hive's metadata: information about databases, tables, columns, etc.
- Driver: It includes the compiler, optimizer, and executor used to parse, plan, and run HiveQL statements.
- Query compiler: It compiles HiveQL into a graph of MapReduce tasks.
- Execution engine: It executes the tasks produced by the compiler.
- Thrift server: It provides an interface through which other applications, such as MySQL clients, Oracle clients, or Excel, can connect to Hive via JDBC/ODBC drivers.
- Command line interface: Also called the Hive shell, it is used to work with data either interactively or in batch mode.
- Web interface: A graphical interface to Hive, used for interacting with data through a browser.
Data Storage in Hive:
Hive offers several forms of storage and bookkeeping:
- Metastore: The metastore keeps track of all metadata: databases, tables, columns, data types, etc. It also keeps track of the mapping to HDFS.
- Tables: There are two kinds of tables in Hive. First, normal (managed) tables, like any other database table. Second, external tables, which are like normal tables except for deletion: an external table is created over an HDFS mapping and acts as a pointer to data in HDFS. When an external table is dropped, only its definition is removed and its data remains in HDFS, whereas dropping a normal table deletes the data as well.
- Partitions: A partition is a slice of a table stored in its own subdirectory within the table's directory. Partitioning improves query performance, especially for SELECT statements with a WHERE clause on the partition column.
- Buckets: Buckets are hashed partitions of the data within a table or partition; they speed up joins and data sampling.
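As a sketch of how these storage concepts combine, the following DDL creates an external, partitioned, bucketed table; the table name, column names, and path are hypothetical:

```sql
-- External table: dropping it removes only the metadata, not the HDFS files.
CREATE EXTERNAL TABLE page_views (
    user_id BIGINT,
    url     STRING,
    ip      STRING
)
-- Each distinct dt value becomes its own subdirectory under the table's directory.
PARTITIONED BY (dt STRING)
-- Rows are hashed on user_id into 32 files, which speeds up joins and sampling.
CLUSTERED BY (user_id) INTO 32 BUCKETS
-- Pointer to existing data in HDFS.
LOCATION '/data/page_views';
```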
Hive vs. RDBMS (Relational database)
Hive and an RDBMS look very similar, but they serve different applications and are based on different schema models.
- RDBMSs are built for OLTP (online transaction processing), that is, real-time reads and writes in the database. They can also handle a small amount of OLAP (online analytical processing).
- Hive is built for OLAP, that is, analytical reporting over large datasets. Hive does not support inserting rows into an existing table or updating table data the way an RDBMS does, which is an essential part of OLTP; all data is either inserted into a new table or overwritten in an existing one.
- An RDBMS is based on schema on write: when data is entered into a table, it is checked against the table's schema to ensure that it meets the requirements. Loading data into an RDBMS is therefore slower, but reading is very fast.
- Hive is based on schema on read: data is not checked when it is loaded, so loading is fast, but reading is slower because the data is parsed and validated at query time.
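Schema on read can be illustrated with a small hypothetical example (the table name and file path are made up): the load is essentially a file move, and any mismatch with the schema only surfaces when the data is queried.

```sql
CREATE TABLE users (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Fast: the file is moved into the table's directory; no row is validated.
LOAD DATA INPATH '/tmp/users.csv' INTO TABLE users;

-- Only now is each line parsed; a non-numeric id simply reads back as NULL.
SELECT * FROM users;
```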
Hive Query Language (HQL)
HQL is very similar to a traditional database's SQL. Data is stored in tables, where each table consists of columns and each column holds values of a specific data type. Hive supports primitive as well as complex data types. Primitive types such as INT, BIGINT, SMALLINT, TINYINT, FLOAT, DOUBLE, BOOLEAN, STRING, and BINARY are supported. Complex types include associative arrays (MAP), structs (STRUCT), and lists (ARRAY).
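A sketch of a table mixing primitive and complex types (the table and column names are illustrative): struct fields are read with dot notation, map values by key, and list elements by index.

```sql
CREATE TABLE employees (
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,                          -- list
    deductions   MAP<STRING, FLOAT>,                     -- associative array
    address      STRUCT<street:STRING, city:STRING>      -- struct
);

SELECT name,
       subordinates[0],     -- first element of the list
       deductions['tax'],   -- value stored under the key 'tax'
       address.city         -- field of the struct
FROM employees;
```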
Data definition (DDL) statements like CREATE TABLE, ALTER TABLE, and DROP TABLE are supported. All of these DDL statements can be used on databases, tables, partitions, views, functions, indexes, etc. Data manipulation (DML) statements like LOAD, INSERT, SELECT, and EXPLAIN are supported. LOAD is used for taking data from HDFS and moving it into Hive. INSERT is used for moving data from one Hive table to another. SELECT is used for querying data. EXPLAIN shows the execution plan Hive will use for a query.
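The four DML statements might be used together as follows; the table names and path here are hypothetical:

```sql
-- LOAD: move a file from HDFS into the Hive table's directory.
LOAD DATA INPATH '/staging/sales.txt' INTO TABLE sales_raw;

-- INSERT: move data from one Hive table to another.
INSERT OVERWRITE TABLE sales_clean
SELECT * FROM sales_raw WHERE amount IS NOT NULL;

-- SELECT: query the data.
SELECT COUNT(*) FROM sales_clean;

-- EXPLAIN: show the plan of MapReduce stages Hive would run for the query.
EXPLAIN SELECT COUNT(*) FROM sales_clean;
```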
Create Table Statement
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
The statement above creates a table with the given column names and data types. The table can be partitioned by a column with PARTITIONED BY, or bucketed with CLUSTERED BY, specifying the number of buckets; each bucket can be kept in ascending or descending order with SORTED BY at creation time. ROW FORMAT specifies the delimiters used to terminate fields and lines. The file format of the stored data can be specified with STORED AS. Finally, the default HDFS location of the Hive table can be overridden with LOCATION.
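Filling in that syntax, a concrete statement might look like the following (the database, table, columns, and path are hypothetical):

```sql
CREATE TABLE IF NOT EXISTS logs.web_hits (
    ip       STRING COMMENT 'client address',
    hit_time BIGINT
)
COMMENT 'raw web server hits'
PARTITIONED BY (dt STRING)
-- Fields are separated by tabs, records by newlines.
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/warehouse/logs/web_hits';
```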
FROM from_statement
INSERT OVERWRITE TABLE tablename1
  [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
  select_statement1
[INSERT OVERWRITE TABLE tablename2
  [PARTITION ... [IF NOT EXISTS]]
  select_statement2]
[INSERT INTO TABLE tablename2
  [PARTITION ...]
  select_statement2] ...;
The insert statement above is used to populate multiple tables from a single FROM clause. Note that the primary form is explicitly INSERT OVERWRITE, reflecting the fact that Hive replaces a table's (or partition's) contents rather than updating rows in place, while INSERT INTO appends data to a table.
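A small sketch of the multi-table form, with hypothetical table names: one scan of a staging table feeds two target tables.

```sql
FROM staging_events se
-- One pass over staging_events populates both tables.
INSERT OVERWRITE TABLE clicks
    SELECT se.user_id, se.url WHERE se.type = 'click'
INSERT OVERWRITE TABLE views PARTITION (dt = '2015-09-01')
    SELECT se.user_id, se.url WHERE se.type = 'view';
```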
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list]
[SORT BY col_list]
[LIMIT number]
The statement above is like a database SELECT: SELECT lists the expressions to return, FROM specifies the table name, and WHERE gives the filtering condition. GROUP BY aggregates rows over the listed columns, CLUSTER BY distributes rows by bucket, SORT BY sorts the selected columns, and LIMIT gives the freedom to choose the number of records included in the result.
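Putting those clauses together, a query over the hypothetical page_views table from earlier might read:

```sql
-- Count checkout-page hits per day, most recent partitions first.
SELECT dt, COUNT(*) AS hits
FROM page_views
WHERE url LIKE '%checkout%'   -- filter rows
GROUP BY dt                   -- aggregate per day
SORT BY dt                    -- order within each reducer
LIMIT 10;                     -- cap the result size
```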
Prerequisites for installing Hive:
- Java 1.7
- Hadoop 2.x
Steps to install:
- Download a stable version of Hive from http://hive.apache.org/
- Go to downloads and select the latest mirror, then download the latest tarball, apache-hive-1.2.1-bin.tar.gz
- Unzip the tarball using the following command: tar -xzvf apache-hive-1.2.1-bin.tar.gz
- Set the environment variable HIVE_HOME to point to the installation directory:
export HIVE_HOME=/usr/local/hive
- Finally, add HIVE_HOME/bin to PATH:
export PATH=$PATH:$HIVE_HOME/bin
To start the Hive shell, just type hive after setting the path and the shell will fire up. To verify that Hive has started, use the command set -v at the Hive prompt. All the Hive and Hadoop properties will show up; look for mapred.job.tracker=hname:1002 (the job tracker's host and port) to verify that Hive has found the Hadoop cluster. Hive is then installed successfully, and databases can be created, followed by tables and queries.