Impala Case Study: Airport Flight Data
In this use case we are going to deal with Airport information systems data, which gives us the information regarding flight delays, reason flight get delayed, time in different formats, source and destination details including diverted routes.
The data that maintained is big in size and it is increasing. Processing the data multiple times is a time taking process. Visualization tools needs to fetch the data in real time and the graphs or charts made on top of data needs to be updated quickly.
Using Impala processing data makes easy. It gives results quickly and in real time (time in seconds). Repeated queries gives even more performance. Impala caches results in memory(for some time and deletes based on memory and queries running in the cluster).
Impala is a distributed process runs on top of HDFS. It requires
- Running Hadoop Cluster with all the services.
- Active Hive
- MySQL or PostgreSQL which will be used as metastore for both Hive and Impala.
- Java Virtual Machine (JVM). Oracle JVM is suggested.
- For hardware requirements Impala H/W
- Multiple components need to be installed on various nodes of the cluster.
(Suggested way of installing is by using Cloudera Manager, which gives automatic installation of components via graphical user interface)
Existing system runs on top of DB2 system. Data needs to be copied to HDFS on daily basis and tables in Impala needs to be updated(Configurable).
sqoop import –connect jdbc:db2://airportinfo.com:50001/testdb –username dezyre –password password –table airport_delay –m 1 –target-dir /user/dezyre/airport.db/airport_delay –fields-terminated-by ‘,’
The above command opens a JDBC connection to fetch the data from given table using provided authentication and dumps the data in specified location. It will launch a single connection as we have specified –m 1, need to increase -m to n(any number of mappers) and –split-by is must and should for more than one mapper.
The above command prints some log information which can tell us, how many records have been retrieved, how much time it has taken to complete the sqoop import, at what speed the data transfer has been done, how much data sqoop has transferred to hdfs and so on. This information can be helpful in increasing the performance, which we call performance tuning and also for statistics.
Now data is available on HDFS in Impala table and can be queried for any kind of information. Use “invalidate metadata” in case of table or data not found in impala.
- Using the data copied to HDFS, lets create a table exposing on top of it. In Impala we can create a table by pointing to a location that already exists in Hadoop file system.
+--------------------------------------------------------------------------------------+ | CREATE TABLE airport.airport_delay(| |year STRING,| |quarter STRING,| |month STRING,| |dayofmonth STRING,| |dayofweek STRING,| |flightdate STRING,| |uniquecarrier STRING,| |airlineid STRING,| |carrier STRING,| |tailnum STRING,| |flightnum STRING,| |originairportid STRING,| |originairportseqid STRING,| |origincitymarketid STRING,| | origin STRING,| |origincityname STRING,| |originstate STRING,| |originstatefips STRING,| |originstatename STRING,| |originwac STRING,| |destairportid STRING,| |destairportseqid STRING,| |destcitymarketid STRING,| |dest STRING,| |destcityname STRING,| |deststate STRING,| |deststatefips STRING,| |deststatename STRING,| |destwac STRING,| |crsdeptime STRING,| |deptime STRING,| |depdelay STRING,| |depdelayminutes STRING,| |depdel15 STRING,| |departuredelaygroups STRING,| |deptimeblk STRING,| |taxiout STRING,| |wheelsoff STRING,| |wheelson STRING,| |taxiin STRING,| |crsarrtime STRING,| |arrtime STRING,| |arrdelay STRING,| |arrdelayminutes STRING,| |arrdel15 STRING,| |arrivaldelaygroups STRING,| |arrtimeblk STRING,| |cancelled STRING,| |cancellationcode STRING,| |diverted STRING,| |crselapsedtime STRING,| |actualelapsedtime STRING,| |airtime STRING,| |flights STRING,| |distance STRING,| |distancegroup STRING,| |carrierdelay STRING,| |weatherdelay STRING,| |nasdelay STRING,| |securitydelay STRING,| |lateaircraftdelay STRING,| |firstdeptime STRING,| |totaladdgtime STRING,| |longestaddgtime STRING,| |divairportlandings STRING,| |divreacheddest STRING,| |divactualelapsedtime STRING,| |divarrdelay STRING,| |divdistance STRING,| |div1airport STRING,| |div1airportid STRING,| |div1airportseqid STRING,| |div1wheelson STRING,| |div1totalgtime STRING,| |div1longestgtime STRING,| |div1wheelsoff STRING,| |div1tailnum STRING,| |div2airport STRING,| |div2airportid STRING,| |div2airportseqid STRING,| |div2wheelson STRING,| |div2totalgtime STRING,| |div2longestgtime STRING,| |div2wheelsoff STRING,| |div2tailnum STRING,| |div3airport STRING,| |div3airportid STRING,| |div3airportseqid STRING,| |div3wheelson STRING,| |div3totalgtime STRING,| |div3longestgtime STRING,| |div3wheelsoff STRING,| |div3tailnum STRING,| |div4airport STRING,| |div4airportid STRING,| |div4airportseqid STRING,| |div4wheelson STRING,| |div4totalgtime STRING,| |div4longestgtime STRING,| |div4wheelsoff STRING,| |div4tailnum STRING,| |div5airport STRING,| |div5airportid STRING,| |div5airportseqid STRING,| |div5wheelson STRING,| |div5totalgtime STRING,| |div5longestgtime STRING,| |div5wheelsoff STRING,| |div5tailnum STRING| | )| | WITH SERDEPROPERTIES ('serialization.format'='1')| | STORED AS TEXTFILE | | Location ‘/user/dezyre/airport.db/airport_delay’;
- Now we have a table with predefined column structure. Lets query the table and get some useful informationn.
Select * from dezyre_airport.airport_delay where arrdelayminutes > 15;
We are selecting all the flight information including start time, origin, destination, carrier information, weather delay, diverted information from the table airport_delay in dezyre_airport database for those flights which got delayed for more than 15 minutes.
This information can be plotted as a graph on any Television or Display boards in airport, so that customers can get real time updates about the flight delays and estimates.
We can also find out the possible ways of delay by calculating the weather delay details and using delay caused. This can be helpful in sending alerts to passengers or customers waiting at departure block about the possible delay.
Using the historical data of delay, Airport Authority Team can take preventive measures in advance from which we can make sure there would be no delay and can satisfy the passengers comfort.
Impala could be the best solution for any interactive real time results which can get results in seconds mostly minutes and best suited for visualization tools and dashboards. Companies providing visualization products are also providing connectors to connect to Impala via JDBC drivers and trift services.