Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

End-to-end implementation of a Big Data pipeline on AWS
Scalable, reliable, secure data architecture followed by top-notch Big Data leaders
Detailed explanation of the Ws of Big Data, and of building and automating data pipelines
Real-time streaming data import from an external API using NiFi
Parsing the complex JSON data into CSV using NiFi and storing it in HDFS
Encrypting one of the PII fields in the data using NiFi
Sending the parsed data to Kafka, processing it with PySpark, and writing the results to an output Kafka topic
Consuming data from Kafka and storing the processed data in HDFS
Creating a Hive external table on top of the data stored in HDFS, followed by querying the data
Data cleaning, transformation, and storage in the data lake
Visualisation of the key performance indicators using industry-standard Big Data tools
Data flow orchestration for continuous integration of the data pipeline using Airflow (a minimal DAG sketch follows this list)
Visualisation of the data using AWS QuickSight and Tableau
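
To make the orchestration step concrete, below is a minimal sketch of an Airflow DAG that chains the Spark processing job and a Hive refresh. The DAG id, file paths, commands, and schedule are assumptions for illustration; the project's actual DAG may be wired differently.

```python
# Hypothetical Airflow DAG sketch: the dag_id, paths, commands and schedule
# below are illustrative assumptions, not the project's actual configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="covid_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Submit the PySpark job that processes the streamed Covid-19 data
    process = BashOperator(
        task_id="spark_process",
        bash_command="spark-submit /opt/jobs/covid_kafka_to_hive.py",
    )

    # Pick up any new files/partitions in the Hive external table
    refresh_hive = BashOperator(
        task_id="refresh_hive",
        bash_command="hive -e 'MSCK REPAIR TABLE covid_stats;'",
    )

    process >> refresh_hive
```

Chaining the tasks with `>>` keeps the Hive refresh from running until the Spark job has finished, which is the same continuous-integration idea the course implements with Airflow.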

Project Description

In this Big Data project, a senior Big Data Architect will demonstrate how to implement a Big Data pipeline on AWS at scale. You will work with the Covid-19 dataset, which will be streamed in real time from an external API using NiFi. The complex JSON data will be parsed into CSV format using NiFi, and the result will be stored in HDFS.
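
For illustration, here is a minimal PySpark sketch of the same flattening step that NiFi performs in the project: reading the nested JSON and writing flat CSV records to HDFS. The field names and HDFS paths are assumptions, since the exact Covid-19 API schema is not shown here.

```python
# Hypothetical sketch of flattening nested Covid-19 JSON into CSV columns.
# In the project this step is done by NiFi processors; the field names and
# paths below are illustrative assumptions only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid_json_to_csv").getOrCreate()

# Read the raw JSON pulled from the external API (path is an assumption)
raw = spark.read.option("multiLine", True).json("hdfs:///data/covid/raw/")

# Flatten nested fields into plain columns so the result can be written as CSV
flat = raw.select(
    F.col("country"),
    F.col("region.name").alias("region_name"),
    F.col("stats.confirmed").alias("confirmed"),
    F.col("stats.deaths").alias("deaths"),
    F.col("lastUpdate").alias("last_update"),
)

# Store the flattened records in HDFS as CSV, mirroring the NiFi output
flat.write.mode("overwrite").option("header", True).csv("hdfs:///data/covid/parsed/")
```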

This data will then be sent to Kafka and processed using PySpark, with the results written to an output Kafka topic. The processed data will be consumed from Kafka and stored in HDFS, and a Hive external table will be created on top of it. Finally, the cleaned, transformed data is stored in the data lake and the pipeline is deployed. Visualisation is then done using Tableau and AWS QuickSight.
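
As a rough sketch of the processing stage, the PySpark Structured Streaming job below consumes the parsed records from Kafka, writes the processed output to HDFS, and creates a Hive external table over that location. The topic name, bootstrap server address, paths, and column schema are assumptions for illustration, not the project's exact configuration.

```python
# Hypothetical sketch: Kafka -> PySpark Structured Streaming -> HDFS -> Hive.
# Topic names, servers, paths and the schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType

spark = (SparkSession.builder
         .appName("covid_kafka_to_hive")
         .enableHiveSupport()
         .getOrCreate())

schema = (StructType()
          .add("country", StringType())
          .add("confirmed", IntegerType())
          .add("deaths", IntegerType())
          .add("last_update", StringType()))

# Read the input Kafka topic and parse the JSON payload into columns
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "covid_parsed")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Write the processed stream to HDFS as Parquet (checkpointing keeps it restartable)
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/covid/processed/")
         .option("checkpointLocation", "hdfs:///checkpoints/covid/")
         .outputMode("append")
         .start())

# A Hive external table over the HDFS output so the data can be queried
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS covid_stats (
        country STRING,
        confirmed INT,
        deaths INT,
        last_update STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/covid/processed/'
""")

query.awaitTermination()
```

Because the table is external, dropping it in Hive never deletes the underlying HDFS files, which is why this pattern is commonly used on top of pipeline output directories.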

Similar Projects

The goal of this Spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark, i.e. Spark 2.0.

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

In this big data project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from the Wikipedia archives, uploads it to S3, processes it in Hive, and finally analyses it in Zeppelin notebooks.

Curriculum For This Mini Project

Introduction to building data pipeline (08m)
Big Data pipeline - Roles in Big Data industry (06m)
Business Impact of Data Pipelines (04m)
System Requirements (07m)
Data Architecture (05m)
Hive vs Flume vs Presto vs Druid (12m)
Spark vs Airflow vs Oozie (11m)
AWS EC2 - Dataset (05m)
Start AWS Services (02m)
Data Extraction with NiFi (04m)
Data Encryption - Parsing (08m)
Data Sources - HDFS - Kafka (05m)
Streaming Data from Kafka to PySpark (08m)
PySpark Streaming output: Kafka - NiFi - HDFS (07m)
HDFS to Hive Table (04m)
Dataflow Orchestration with Airflow (08m)
QuickSight Visualisation (09m)
Tableau Visualisation (06m)