Predicting Flight Delays using Apache Spark and Kylin

In this project, we will build and query an OLAP cube for flight delays on the Hadoop platform.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Installing Apache Kylin in a Hortonworks sandbox
Designing a star schema for the flight dataset
Implementing the star schema in Kylin
Building and merging Kylin segments incrementally
Building cubes with the Kylin RESTful API (see the sketch after this list)
Building cubes with the Spark engine
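As a preview of the REST-driven workflow in the list above, here is a minimal sketch of triggering an incremental cube build against Kylin's /kylin/api/cubes/{cube}/rebuild endpoint. The host, credentials, cube name, and dates are placeholders chosen for illustration, not values fixed by the project.

```python
from datetime import datetime, timezone
import requests

KYLIN_HOST = "http://sandbox.hortonworks.com:7070"  # placeholder sandbox host
AUTH = ("ADMIN", "KYLIN")                           # Kylin's default credentials
CUBE_NAME = "flight_delays_cube"                    # hypothetical cube name

def epoch_ms(year, month, day):
    """Kylin segment boundaries are epoch timestamps in milliseconds (UTC)."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)

def build_segment(start_ms, end_ms, build_type="BUILD"):
    """Submit a BUILD (or MERGE/REFRESH) job for one time segment of the cube."""
    url = f"{KYLIN_HOST}/kylin/api/cubes/{CUBE_NAME}/rebuild"
    payload = {"startTime": start_ms, "endTime": end_ms, "buildType": build_type}
    resp = requests.put(url, json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()  # the job instance Kylin created for this build

if __name__ == "__main__":
    # Build January 2015 of the flight data as one incremental segment.
    job = build_segment(epoch_ms(2015, 1, 1), epoch_ms(2015, 2, 1))
    print("Submitted build job:", job.get("uuid"))
```

Repeating the call with later date ranges appends new segments to the cube, which is what makes incremental builds easy to script from a scheduler.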

Project Description

In previous Hackerday sessions, we introduced how to bring OLAP to extremely large datasets with Apache Kylin. For those who don't know it, Kylin (kylin.apache.org) is a distributed analytics engine that provides a SQL interface and multidimensional analysis (OLAP) on very large datasets using MapReduce or Spark. This means we can answer classical aggregate queries on the Hadoop platform with low latency over billions of records.
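As a concrete example of such an aggregate, the sketch below pushes a SQL query through Kylin's /kylin/api/query endpoint to compute the average arrival delay per carrier per month. The project name and the FLIGHTS/CARRIER/FL_DATE/ARR_DELAY identifiers are assumptions about how the flight data might be modelled, not names fixed by the dataset.

```python
import requests

KYLIN_HOST = "http://sandbox.hortonworks.com:7070"  # placeholder sandbox host
AUTH = ("ADMIN", "KYLIN")                           # Kylin's default credentials

# Hypothetical fact-table and column names; adjust to your own star schema.
SQL = """
SELECT CARRIER,
       YEAR(FL_DATE)  AS FL_YEAR,
       MONTH(FL_DATE) AS FL_MONTH,
       AVG(ARR_DELAY) AS AVG_ARR_DELAY,
       COUNT(*)       AS NUM_FLIGHTS
FROM FLIGHTS
GROUP BY CARRIER, YEAR(FL_DATE), MONTH(FL_DATE)
ORDER BY AVG_ARR_DELAY DESC
"""

resp = requests.post(
    f"{KYLIN_HOST}/kylin/api/query",
    json={"sql": SQL, "project": "flight_delays", "limit": 100},
    auth=AUTH,
)
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row)
```

If the carrier and month dimensions and the delay measure are part of the cube, Kylin answers this from precomputed cuboids rather than scanning the raw records, which is where the low latency comes from.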

In this Hackerday, we will design an OLAP cube over the flight on-time dataset. Since Kylin has already been introduced, this session will focus on more involved features such as incremental builds, segment merging, and performance-tuning considerations; we will also discuss the Spark build engine and how to build different types of models.
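As a taste of the incremental workflow, the rebuild endpoint used in the sketch after the learning objectives also accepts a MERGE build type, which consolidates the segments in a time range into a single one. The cube name and dates below are again placeholders.

```python
from datetime import datetime, timezone
import requests

KYLIN_HOST = "http://sandbox.hortonworks.com:7070"  # placeholder sandbox host
AUTH = ("ADMIN", "KYLIN")                           # Kylin's default credentials
CUBE_NAME = "flight_delays_cube"                    # hypothetical cube name

def epoch_ms(year, month, day):
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)

# Merge the monthly segments covering Q1 2015 into one segment,
# which keeps the segment count manageable as incremental builds accumulate.
payload = {
    "startTime": epoch_ms(2015, 1, 1),
    "endTime": epoch_ms(2015, 4, 1),
    "buildType": "MERGE",
}
resp = requests.put(f"{KYLIN_HOST}/kylin/api/cubes/{CUBE_NAME}/rebuild",
                    json=payload, auth=AUTH)
resp.raise_for_status()
print("Submitted merge job:", resp.json().get("uuid"))
```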

Similar Projects

In this Hackerday, we will go through the basics of statistics and see how Spark enables us to perform descriptive and inferential statistics over very large datasets.

In this Hive project, you will design a data warehouse for e-commerce environments.

In this Databricks Azure project, you will use Spark and the Parquet file format to analyse the Yelp reviews dataset. As part of this, you will deploy Azure Data Factory, build data pipelines, and visualise the analysis.

Curriculum For This Mini Project