Airline Dataset Analysis using Hadoop, Hive, Pig and Impala

Hadoop project: perform basic big data analysis on an airline dataset using the big data tools Pig, Hive and Impala.

What will you learn

  • Data preprocessing with Pig
  • Hive vs. MPP database systems (Hive vs. Impala/Drill)
  • Hive/Impala partitioning and clustering
  • Data compression, tuning and query optimization
  • Using database views to represent data
  • Building a time series data model
  • Visualizing data using Microsoft Excel via ODBC

What will you get

  • Access to a recording of the complete project
  • Access to all material related to the project, such as data files and solution files

Prerequisites

  • For visualization, you are expected to have Microsoft Excel or an equivalent on your host machine.

Project Description

Before data on any platform can become an asset to an organization, it has to pass through a processing stage that ensures its quality and availability. After that, the data has to be accessible to its users, both human and system. How much value data science delivers to an organization depends on the availability of quality data.

We use the airline on-time performance dataset (flights data CSV) to demonstrate these principles and techniques in this Hadoop project, and we will answer the following questions:

  • When is the best time of day/day of week/time of year to fly to minimize delays?
  • Do older planes suffer more delays?
  • How does the number of people flying between different locations change over time?
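As a rough illustration of the first question, the sketch below averages arrival delay per day of week and picks the day with the lowest average. It is plain Python on a tiny inline sample rather than the project's Pig/Hive/Impala solution, and the column names `DayOfWeek` and `ArrDelay` are assumptions about the flights CSV header.

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample standing in for the flights CSV. The column names
# (DayOfWeek, ArrDelay) are assumptions about the real file's header.
SAMPLE_CSV = """DayOfWeek,ArrDelay
1,10
1,20
2,-5
2,NA
3,0
"""


def average_delay_by_key(rows, key_column, delay_column="ArrDelay"):
    """Return {key: mean delay}, skipping rows whose delay is missing."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        try:
            delay = float(row.get(delay_column))
        except (TypeError, ValueError):
            continue  # e.g. "NA" or a missing value
        totals[row[key_column]] += delay
        counts[row[key_column]] += 1
    return {key: totals[key] / counts[key] for key in totals}


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
averages = average_delay_by_key(reader, "DayOfWeek")

# The "best" day is the one with the lowest average arrival delay.
best_day = min(averages, key=averages.get)
print(averages, best_day)
```

Passing a different `key_column` (for example an hour-of-day or month column, if the file has one) would apply the same grouping to the time-of-day and time-of-year variants of the question.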

We will also transform the data access model into time series and demonstrate how clients can access data in our big data infrastructure using a simple tool like the Excel spreadsheet.
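One way to picture the time series model is as flight counts keyed by route and period. The sketch below is a minimal Python illustration on made-up records, not the project's actual data model; the tuple layout `(Year, Month, Origin, Dest)` is an assumption, and it counts flights as a proxy because passenger numbers may not be in the dataset.

```python
from collections import Counter

# Hypothetical flight records: (Year, Month, Origin, Dest).
flights = [
    (2007, 1, "SFO", "JFK"),
    (2007, 1, "SFO", "JFK"),
    (2007, 2, "SFO", "JFK"),
    (2008, 1, "ORD", "LAX"),
]


def monthly_route_counts(records):
    """Count flights per (route, 'YYYY-MM') period."""
    counts = Counter()
    for year, month, origin, dest in records:
        period = f"{year:04d}-{month:02d}"
        counts[(f"{origin}-{dest}", period)] += 1
    return counts


def series_for_route(counts, route):
    """Return one route's counts as (period, flights) pairs, oldest first."""
    return sorted((period, n) for (r, period), n in counts.items() if r == route)


counts = monthly_route_counts(flights)
sfo_jfk = series_for_route(counts, "SFO-JFK")
print(sfo_jfk)
```

Zero-padded `YYYY-MM` strings sort chronologically, which is why a plain `sorted` is enough to order the series.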

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...

Curriculum For This Mini Project

 
  Introduction to Data Infrastructure (07m)
  Methods to ingest data in a data infrastructure (06m)
  Messaging Layer Example (11m)
  Small File Problem (03m)
  Business problem overview and topics covered (02m)
  Hive JDBC and Impala ODBC drivers (02m)
  Data Pre-processing (06m)
  Data Extraction and Loading (03m)
  Setting up the Data Warehouse (13m)
  Creating Data Table (02m)
  Impala Architecture (14m)
  Working with Hive versus Impala & File Formats (08m)
  Hive query for Airline data analysis + Parquet - 1 (21m)
  Hive query for Airline data analysis + Parquet - 2 (05m)
  Hive query for Airline data analysis + Parquet - 3 (16m)
  Read and write data to tables (16m)
  Parquet data compression (06m)
  Calculate average flight delay (10m)
  Partitioning Basics (02m)
  Where to do the data processing: Hive or Impala? (10m)
  Partitioning Calculations (15m)
  Dynamic Partitioning (04m)
  Clustering, Sampling, Bucketed Tables (13m)
  Hive Compression and Execution Engine (15m)
  Impala COMPUTE STATS and File Formats (13m)
  Using database views to represent data (15m)