Fundamentals of Impala
What is Impala?
Impala is an open source, massively parallel processing (MPP) query engine that runs on top of clustered systems such as Apache Hadoop. It was inspired by Google's Dremel paper. It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS), which Impala uses as its underlying storage.
It integrates with the Hive metastore to share table information between both components. Impala makes use of the existing Apache Hive infrastructure (initiated by Facebook and open sourced to Apache) that many Hadoop users already have in place for batch-oriented, long-running jobs in the form of SQL queries. Impala's main goal is to make SQL-on-Hadoop operations fast and efficient, to appeal to new categories of users and open up Hadoop to new types of use cases. The Impala-Hive integration makes it possible to use either Hive or Impala for processing, or to create tables on the single shared file system (HDFS), without any changes to the table definitions. However, not all SQL queries are supported by Impala, and there can be a few syntactical differences. Impala's SQL dialect is a subset of HiveQL, with some functional limitations such as transforms. Impala can also query HBase tables. Like Hadoop, Impala is distributed across the cluster. It has different types of daemons running on specific hosts of the cluster, such as the Impala daemon, the Statestore and the Catalog service, which we will discuss in the coming sections.
Why is it used?
- Impala provides parallel-processing database technology on top of the Hadoop ecosystem. It allows users to run low-latency queries interactively.
- A Hive MapReduce job takes some minimum time just to launch and process a query, whereas Impala returns results in seconds.
- Being a real-time query engine, Impala is well suited for analytics, and for data scientists performing analytics on data stored in the Hadoop file system.
- As Impala gives results in real time, it fits well with reporting and visualization tools such as Pentaho and Tableau, which already come with connectors that allow users to query and build visualizations directly from a graphical user interface. Many other open source visualization tools are also available in the market.
- Impala comes with built-in support for common Hadoop file formats (such as Parquet and Avro).
Impala uses Parquet as its default file format. Parquet was initiated with inspiration from Google's Dremel paper, published in 2010, and is optimized for large-scale queries. Parquet is a columnar storage format, which stores data vertically (by column) rather than horizontally (by row) as in a conventional data warehouse. This brings a huge performance gain to queries that use aggregation functions on numeric fields, because only the relevant column files are read rather than the entire data set, as Hive would read it. Parquet files can also be processed using Hive and Pig. The format is built to support efficient compression and encoding schemes. Impala performs well for real-time interaction with data on HDFS, or with tables that already exist in Hive. Impala also supports Snappy compression, the default compression codec used in Hive and Hadoop.
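As a sketch of how this is typically used (the table and column names here are illustrative, not from a real deployment), a Snappy-compressed Parquet table can be created in impala-shell like this:

```sql
-- Ask Impala to write Snappy-compressed Parquet files for this session.
SET COMPRESSION_CODEC=snappy;

-- Columnar Parquet table: an aggregation over server_hits scans only
-- that column's data, not entire rows.
CREATE TABLE user_hits_parquet (user_name STRING, server_hits BIGINT)
STORED AS PARQUET;

-- Populate it from an existing (for example, text-format) table.
INSERT INTO user_hits_parquet
SELECT user_name, server_hits FROM user_hits_text;
```

COMPRESSION_CODEC is a per-session query option, so it affects only INSERT statements run after it is set.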
Where is it used?
- Impala can be used when there is a need for low-latency results.
- When partial data needs to be analyzed.
- When a quick analysis needs to be done.
- To obtain results in real time (in this context, "real time" means the user can wait for the result).
- It is best when the same kind of query needs to be processed several times.
Three daemons play a major role in the Impala architecture.
The core part of Impala is a daemon called impalad that runs on each node of the cluster. It reads and writes data files, and accepts queries transmitted from Hue or connection requests from data-analytics tools over JDBC/ODBC. It distributes the work to other nodes in the Impala cluster and transmits intermediate results back to the coordinator node. The node on which a job is launched is known as the coordinator node: you can submit a query to the impalad running on any node of your cluster, and that node serves as the coordinator for that query. The coordinator distributes the query for processing among the impalad daemons on the other nodes. Those nodes transmit their partial results back to the coordinator, which then constructs the final result set for the query.
The Statestore checks the availability of the Impala daemons on all nodes of the cluster. It is physically represented by a daemon process called statestored, and only one node in the cluster needs to run it. The Impala daemons are in continuous communication with the Statestore to confirm which nodes are healthy and can accept new work. If an impalad goes down for any reason, such as a hardware failure or a network connection issue, the statestored daemon informs all the other impalad daemons in the cluster, so that future queries avoid assigning work to the unreachable node. If the Statestore itself is not available, the other nodes continue running and distributing work among themselves as usual, on the assumption that all impalad daemons are fine; the cluster just becomes less robust if an impalad fails while the Statestore is offline. When statestored comes back online, it re-establishes communication with the other nodes and resumes its monitoring.
The Catalog service relays metadata changes from Impala DDL (Data Definition Language) and DML (Data Manipulation Language) statements to all nodes in the cluster. It is represented by the catalogd daemon, and only one such process is needed, on one host in the cluster. Running catalogd on the same host as statestored is recommended, because the catalog service's metadata broadcasts are relayed through the Statestore. When a table has been created, altered to a new schema, or loaded with new data through Hive, we need to notify Impala using the INVALIDATE METADATA statement (or REFRESH, for new data files in an existing table) so that the changes become known to the impalad processes. The same applies to HDFS-level changes: if a file or directory is created or deleted directly via HDFS commands, we should update the Impala catalog with INVALIDATE METADATA.
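A minimal sketch of the two statements, assuming a table named dezyre_user_info that was changed outside of Impala:

```sql
-- After new data files are added to an existing table (for example via
-- Hive or direct HDFS commands): reload that table's file metadata.
REFRESH dezyre_user_info;

-- After a table is created or its schema is changed outside Impala:
-- discard and reload the cached metadata for that table.
INVALIDATE METADATA dezyre_user_info;

-- With no table name, INVALIDATE METADATA flushes the entire catalog
-- cache, which is much more expensive on large deployments.
INVALIDATE METADATA;
```

REFRESH is the lighter-weight of the two, so prefer it when only the data files of a known table have changed.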
Query requests from third-party applications are received by an impalad, which prepares an execution plan and submits it to the rest of the impalad daemons in the cluster. Before building the execution plan, the impalad communicates with the Statestore for the list of live impalad nodes, and with the Hive metastore and the NameNode for table metadata and file information.
Basic Impala Commands and Syntax
Impala provides a command line interface, which gives interactive query results. Impala supports DDL and most DML operations (historically with no UPDATE or DELETE). Any change made through Impala is reflected in the Hive metastore immediately, with the help of catalogd. The reverse is not true: if a change is made directly in the Hive metastore or in the HDFS file system, the INVALIDATE METADATA command must be executed manually in Impala.
Some basic command syntax is given below:
- Impala supported Datatypes
It supports the usual numeric, character, date/time and Boolean datatypes:
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DECIMAL, DOUBLE, REAL, CHAR, VARCHAR, STRING, TIMESTAMP, BOOLEAN.
- Impala table queries
CREATE TABLE DEZYRE_USER_INFO (user_name string, unique_accounts bigint, server_hits bigint, topic_name string, accessed_date string) STORED AS PARQUET;
The above statement creates a table called “DEZYRE_USER_INFO” with five columns in it and the underlying file format is specified as PARQUET.
SELECT * FROM dezyre_user_info LIMIT 10;
The above SELECT statement returns a sample of 10 records from the Impala table.
SELECT user_name, COUNT(DISTINCT topic_name) FROM dezyre_user_info WHERE accessed_date >= '2015-01-01' GROUP BY user_name;
The above query gives the total number of topics (subjects) followed or accessed by each user (the GROUP BY clause) within the given period (the filter condition).
SELECT user_name, topic_name, DATEDIFF(MAX(accessed_date), MIN(accessed_date)) FROM dezyre_user_info GROUP BY user_name, topic_name;
The above query gives the number of days each user spent on each topic (note that DATEDIFF takes the later date first, so the result is non-negative).
SELECT user_name, SUM(server_hits) AS tot_hits, MAX(server_hits) FROM dezyre_user_info GROUP BY user_name ORDER BY tot_hits DESC;
The above query gives the total number of times each user accessed the DeZyre server, sorted in descending order, so the most active users appear first.
Impala has built-in support for all the usual aggregation functions, such as SUM, AVG, MIN, MAX and COUNT. It offers the flexibility of changing a datatype at runtime using the CAST function, and it has conditional functions for making decisions based on values dynamically. All the JOIN operations can be performed, although joining very large tables is not advised, since Impala performs its operations in memory.
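As an illustrative sketch (the column names reuse the dezyre_user_info table above; the thresholds and labels are made up), CAST and a conditional expression can be combined like this:

```sql
-- CAST changes a value's type at query time (here, to avoid integer
-- division); CASE picks a label based on the value.
SELECT user_name,
       CAST(server_hits AS DOUBLE) / 100 AS hits_per_hundred,
       CASE WHEN server_hits > 1000 THEN 'heavy'
            WHEN server_hits > 100  THEN 'regular'
            ELSE 'light'
       END AS usage_band
FROM dezyre_user_info;
```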
Impala also supports VIEWS, which act as named aliases for defined queries. If a complicated query needs to run multiple times, a view can be created for it, and subsequent queries can run against the view rather than re-executing the complicated query each time.
CREATE VIEW query1 AS
SELECT B.user, SUM(A.a), SUM(A.b), SUM(A.c)
FROM (SELECT a, b, c, user_id FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE a > 10) A
JOIN (SELECT user, user_id FROM user_info WHERE date > '2015-01-01') B
ON A.user_id = B.user_id
GROUP BY B.user;
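Once created (assuming the view is named query1), the view can be queried like any table; Impala expands the underlying join logic at query time:

```sql
-- Query the saved definition instead of re-writing the joins.
SELECT * FROM query1 LIMIT 10;
```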