Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. The project uses the following tech stack: NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau, and AWS QuickSight.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

James Peebles

Data Analytics Leader, IQVIA

This is one of the best investments you can make with regards to career progression and growth in technological knowledge. I was pointed in this direction by a mentor in the IT world who I highly...

What will you learn

End-to-end implementation of a Big Data pipeline on AWS
Scalable, reliable, and secure data architecture followed by top-notch Big Data leaders
Detailed explanation of Ws in Big Data, data pipeline building, and automation of the processes
Real-time streaming data import from an external API using NiFi
Parsing of the complex JSON data into CSV using NiFi and storing it in HDFS
Encryption of one of the PII fields in the data using NiFi
Sending parsed data to Kafka for data processing using PySpark and writing the data to an output Kafka topic
Consuming data from Kafka and storing the processed data in HDFS
Creating a Hive external table on top of the data stored in HDFS, followed by data queries
Data cleaning, transformation, and storage in the data lake
Visualisation of the key performance indicators using industry-leading Big Data tools
Data flow orchestration for continuous integration of the data pipeline using Airflow
Visualisation of the data using AWS QuickSight and Tableau
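
As a rough illustration of the Hive external table step listed above, the sketch below builds a HiveQL `CREATE EXTERNAL TABLE` statement for comma-delimited files already sitting in HDFS. The table name, column names, and HDFS path are hypothetical placeholders, not taken from the project, and the helper only assembles the statement text; running it against Hive is left to whichever client you use.

```python
def hive_external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text files.

    columns is a list of (name, hive_type) pairs; location is an HDFS directory.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and path, for illustration only.
ddl = hive_external_table_ddl(
    "covid.cases_processed",
    [("report_date", "STRING"), ("country", "STRING"), ("confirmed", "BIGINT")],
    "/data/covid/processed",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files in HDFS in place, which suits a pipeline where other jobs keep writing to the same directory.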

Project Description

In this Big Data project, a senior Big Data Architect will demonstrate how to implement a Big Data pipeline on AWS at scale. You will be using the Covid-19 dataset, which will be streamed in real time from an external API using NiFi. The complex JSON data will be parsed into CSV format using NiFi, and the result will be stored in HDFS.
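
The project performs the JSON-to-CSV parsing and the PII protection inside NiFi. To make the idea concrete outside NiFi, here is a minimal Python sketch of the same two steps: flattening nested JSON records into CSV rows and masking one field. The sample records, the `reporter` field, and the salt are invented for illustration. The masking uses a salted SHA-256 hash, which is a one-way stand-in and not the reversible encryption the project describes.

```python
import csv
import hashlib
import io
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def mask(value, salt):
    """Salted SHA-256 hash: a one-way stand-in for encryption (not reversible)."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def json_to_csv(payload, pii_field, salt):
    """Parse a JSON array of records, mask one field, and return CSV text."""
    records = [flatten(record) for record in json.loads(payload)]
    for record in records:
        if pii_field in record:
            record[pii_field] = mask(record[pii_field], salt)
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical sample payload; the real API response has a different shape.
sample = json.dumps([
    {"country": "X", "stats": {"confirmed": 10, "deaths": 1}, "reporter": "alice@example.com"},
    {"country": "Y", "stats": {"confirmed": 5, "deaths": 0}, "reporter": "bob@example.com"},
])
csv_text = json_to_csv(sample, pii_field="reporter", salt="demo-salt")
print(csv_text)
```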

This data is then sent to Kafka and processed using PySpark. The processed data is written to an output Kafka topic, then consumed from Kafka and stored in HDFS. Next, a Hive external table is created on top of the data in HDFS. Finally, the cleaned, transformed data is stored in the data lake and deployed. Visualisation is done using Tableau and AWS QuickSight.
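
The Kafka-to-PySpark hop could look roughly like the sketch below, which uses Spark Structured Streaming to read lines from one topic, drop blank ones, and write the rest to another topic. This is an assumption about the shape of the job, not the project's actual code: the broker address, topic names, and checkpoint path are placeholders, the blank-line filter stands in for the real processing, and the Spark session must be started with the Kafka connector package on its classpath.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Options for Spark's Kafka sink; streaming writes need a checkpoint dir."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        "checkpointLocation": checkpoint_dir,
    }


def start_stream(spark, source_opts, sink_opts):
    """Read lines from one Kafka topic, drop blank ones, write to another."""
    # Imported here so the option helpers work without PySpark installed.
    from pyspark.sql.functions import col, trim

    raw = spark.readStream.format("kafka").options(**source_opts).load()
    # Kafka delivers the payload as bytes in the `value` column.
    lines = raw.select(col("value").cast("string").alias("value"))
    cleaned = lines.filter(trim(col("value")) != "")
    # The Kafka sink expects a string or binary `value` column.
    return cleaned.writeStream.format("kafka").options(**sink_opts).start()


# Placeholder broker, topics, and checkpoint path, for illustration only.
source_opts = kafka_source_options("localhost:9092", "covid_parsed")
sink_opts = kafka_sink_options("localhost:9092", "covid_processed", "/tmp/covid_checkpoint")
```

With a running cluster, `start_stream(spark, source_opts, sink_opts)` returns a streaming query handle, and calling `awaitTermination()` on it keeps the job alive.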

Curriculum For This Mini Project

Introduction to building data pipeline
Big Data pipeline - Roles in Big Data industry
Business Impact of Data Pipelines
System Requirements
Data Architecture
Hive vs Flume vs Presto vs Druid
Spark vs Airflow vs Oozie
AWS EC2 - Dataset
Start AWS Services
Data Extraction with NiFi
Data Encryption - Parsing
Data Sources - HDFS - Kafka
Streaming Data from Kafka to PySpark
PySpark Streaming output: Kafka - NiFi - HDFS
HDFS to Hive Table
Dataflow Orchestration with Airflow
QuickSight Visualisation
Tableau Visualisation