Airline Dataset Analysis using Hadoop, Hive, Pig and Impala

Hadoop project: perform basic big data analysis on an airline dataset using the big data tools Pig, Hive and Impala.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

SUBHABRATA BISWAS

Lead Consultant, ITC Infotech

The project orientation is very much unique and it helps to understand the real time scenarios most of the industries are dealing with. And there is no limit, one can go through as many projects...

Shailesh Kurdekar

Solutions Architect at Capital One

I have worked for more than 15 years in Java and J2EE and have recently developed an interest in Big Data technologies and Machine learning due to a big need at my workspace. I was referred here by a...

What will you learn

Introduction to data infrastructures
Methods for ingestion of data (Backend Service, Data Warehouse)
Tackling the small file problem
Roadmap of the project and business problem
Hive JDBC and Impala ODBC driver
Extracting and loading the data in Cloudera VMware
Data preprocessing with Pig
Writing queries in the Hue Hive editor to create tables
Hive vs. MPP database systems (Hive vs. Impala/Drill)
Basic EDA using Hive
Hive/Impala partitioning and clustering
Writing data from Pig to Hive directly using HCatalog's HCatStorer
Data compression, tuning and query optimization using Parquet
Using database views to represent data
Clustering, Sampling and Bucketed Tables
Building time series data model
Impala COMPUTE STATS and file formats
Visualizing data using Microsoft Excel via ODBC

Project Description

Before data on any platform can become an asset to an organization, it must pass through a processing stage that ensures its quality and availability. That data then has to be made available to its users, both human and system. The availability of quality data is what underpins the value that data science, in general, brings to that organization.

We use the airline on-time performance dataset (flights data CSV) to demonstrate these principles and techniques in this Hadoop project, and we answer the following questions:

  • When is the best time of day/day of week/time of year to fly to minimize delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
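
The first question can be tried out on a tiny in-memory sample before running it at scale. The following is a minimal Python sketch, not the project's Hive or Pig solution: the column names (`DayOfWeek`, `CRSDepTime`, `DepDelay`) are assumptions about the layout of the flights CSV, the `NA` marker for missing values is also an assumption, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are assumptions about the flights CSV.
SAMPLE_CSV = """DayOfWeek,CRSDepTime,DepDelay
1,0630,2
1,0645,4
1,1730,25
2,1715,35
2,0900,NA
3,1745,15
"""

def average_delay_by(rows, key_func):
    """Return {key: mean departure delay}, skipping rows with a missing delay."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, row count]
    for row in rows:
        delay = row["DepDelay"]
        if delay in ("", "NA"):
            continue
        bucket = totals[key_func(row)]
        bucket[0] += float(delay)
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def departure_hour(row):
    """Scheduled departure hour from an hhmm value such as 0630 or 630."""
    return int(row["CRSDepTime"].zfill(4)[:2])

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
delay_by_hour = average_delay_by(rows, departure_hour)
delay_by_weekday = average_delay_by(rows, lambda row: int(row["DayOfWeek"]))

# The "best" hour here is simply the one with the lowest mean delay.
best_hour = min(delay_by_hour, key=delay_by_hour.get)
print(delay_by_hour)
print("hour with lowest average delay:", best_hour)
```

The same grouping logic maps naturally onto a `GROUP BY` over the hour or weekday column in Hive or Impala; the Python version only illustrates the shape of the aggregation.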

We will also transform the data access model into a time series and demonstrate how clients can access data in our big data infrastructure using a simple tool such as Microsoft Excel.
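
As a rough illustration of what a time series view of the flight data could look like, the sketch below rolls individual flight records up into a monthly count for one route. It is an assumption-laden example rather than the project's data model: the record layout, the helper name `monthly_series`, and the sample flights are all hypothetical, and flight counts are used only as a stand-in for traffic between two locations.

```python
from collections import Counter
from datetime import date

# Made-up flight records: (year, month, day, origin, destination).
FLIGHTS = [
    (2007, 1, 3, "SFO", "JFK"),
    (2007, 1, 17, "SFO", "JFK"),
    (2007, 2, 5, "SFO", "JFK"),
    (2007, 1, 9, "ORD", "LAX"),
    (2008, 1, 4, "SFO", "JFK"),
]

def monthly_series(flights, origin, dest):
    """Return [(first day of month, flight count)] for one route, oldest first."""
    counts = Counter(
        date(year, month, 1)
        for year, month, _day, o, d in flights
        if (o, d) == (origin, dest)
    )
    return sorted(counts.items())

series = monthly_series(FLIGHTS, "SFO", "JFK")
for month_start, n_flights in series:
    print(month_start.isoformat(), n_flights)
```

Keying each point on a real date value, rather than on separate year and month columns, keeps the series sortable and easy to chart in a spreadsheet client.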

Curriculum For This Mini Project

Introduction to Data Infrastructure
Methods to ingest data in a data infrastructure
Messaging Layer Example
Small File Problem
Business problem overview and topics covered
Hive JDBC and Impala ODBC drivers
Data Pre-processing
Data Extraction and Loading
Setting up the Data Warehouse
Creating Data Tables
Impala Architecture
Working with Hive versus Impala & File Formats
Hive query for Airline data analysis + Parquet - 1
Hive query for Airline data analysis + Parquet - 2
Hive query for Airline data analysis + Parquet - 3
Read and write data to tables
Parquet data compression
Calculate average flight delay
Partitioning Basics
Where to do the data processing - Hive or Impala?
Partitioning Calculations
Dynamic Partitioning
Clustering, Sampling, Bucketed Tables
Hive Compression and Execution Engine
Impala COMPUTE STATS and File Formats
Using database views to represent data
Using Excel or QlikView for Visualization