Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

In this PySpark project, you will simulate a complex real-world data pipeline based on messaging. This project is deployed using the following tech stack - NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau and AWS QuickSight.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real-world, hands-on experience. I wish I had this...

Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop...

What will you learn

End-to-end implementation of a Big Data pipeline on AWS
Scalable, reliable, secure data architecture of the kind followed by top Big Data teams
Detailed explanation of the Ws of Big Data, and of building and automating data pipelines
Real-time streaming data ingestion from an external API using NiFi
Parsing of complex JSON data into CSV using NiFi and storing it in HDFS
Encryption of one of the PII fields in the data using NiFi
Sending parsed data to Kafka, processing it with PySpark, and writing the results to an output Kafka topic
Consuming data from Kafka and storing the processed data in HDFS
Creating a Hive external table on top of the data stored in HDFS, followed by querying the data
Data cleaning, transformation, and storage in the data lake
Visualisation of key performance indicators using industry-standard Big Data tools
Data flow orchestration for continuous integration of the data pipeline using Airflow
Visualisation of the data using AWS QuickSight and Tableau
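The JSON-parsing and PII-protection steps listed above are performed in NiFi in the project itself. As a rough illustration of what those steps accomplish, here is a minimal plain-Python sketch that flattens a nested JSON record into a CSV row and pseudonymises a PII field with SHA-256 hashing (a one-way transform, so strictly pseudonymisation rather than reversible encryption). The field names and record shape are hypothetical, not taken from the actual dataset:

```python
import csv
import hashlib
import io
import json

def protect_pii(value: str) -> str:
    """One-way SHA-256 hash of a PII field (pseudonymisation, not reversible)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def record_to_csv_row(raw_json: str) -> list:
    """Flatten a nested JSON record into a flat row, hashing the PII field."""
    rec = json.loads(raw_json)
    return [
        rec["date"],
        rec["location"]["country"],
        protect_pii(rec["patient"]["name"]),  # PII field hashed before storage
        rec["stats"]["confirmed"],
        rec["stats"]["deaths"],
    ]

# Hypothetical nested record, loosely resembling a Covid-19 API payload.
raw = json.dumps({
    "date": "2020-05-01",
    "location": {"country": "US"},
    "patient": {"name": "Jane Doe"},
    "stats": {"confirmed": 1039909, "deaths": 60966},
})

buf = io.StringIO()
csv.writer(buf).writerow(record_to_csv_row(raw))
print(buf.getvalue().strip())
```

In the project, NiFi processors perform the equivalent parsing and encryption before the records ever reach HDFS or Kafka.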

Project Description

In this Big Data project, a senior Big Data Architect will demonstrate how to implement a Big Data pipeline on AWS at scale. You will work with a Covid-19 dataset streamed in real time from an external API using NiFi. The complex JSON data will be parsed into CSV format using NiFi, and the result will be stored in HDFS.

This parsed data will then be sent to Kafka for processing with PySpark. The processed data will be written to an output Kafka topic, consumed from there, and stored in HDFS. A Hive external table is then created on top of the HDFS data. Finally, the cleaned, transformed data is stored in the data lake, and visualisation is done using Tableau and AWS QuickSight.
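As a small illustration of the kind of transformation the PySpark processing stage might apply to Covid-19 data, here is a plain-Python sketch that turns cumulative case counts into daily new cases. The column layout is an assumption for illustration; the real project runs this kind of logic with PySpark over a Kafka stream:

```python
from typing import List, Tuple

def daily_new_cases(cumulative: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Convert (date, cumulative_count) rows into (date, new_cases) rows.

    Rows are assumed to be sorted by date; the first day's new cases
    equal its cumulative count, since there is no earlier baseline.
    """
    out = []
    prev = 0
    for date, total in cumulative:
        out.append((date, total - prev))
        prev = total
    return out

# Hypothetical cumulative confirmed-case counts for one country.
rows = [("2020-03-01", 100), ("2020-03-02", 150), ("2020-03-03", 230)]
print(daily_new_cases(rows))
# → [('2020-03-01', 100), ('2020-03-02', 50), ('2020-03-03', 80)]
```

In PySpark itself, the same difference-from-previous-row logic would typically be expressed with a `lag()` window function over a DataFrame rather than a Python loop.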

Similar Projects

In this project, we will look at running various use cases in the analysis of crime data sets using Apache Spark.

In this big data project, we will look at how to mine and make sense of connections in a simple way by building a Spark GraphX Algorithm and a Network Crawler.

Explore efficient Hive usage in this Hadoop Hive project using various file formats such as JSON, CSV, ORC, and AVRO, and compare their relative performance.

Curriculum For This Mini Project

Introduction to building data pipeline
Big Data pipeline - Roles in Big Data industry
Business Impact of Data Pipelines
System Requirements
Data Architecture
Hive vs Flume vs Presto vs Druid
Spark vs Airflow vs Oozie
AWS EC2 - Dataset
Start AWS Services
Data Extraction with NiFi
Data Encryption - Parsing
Data Sources - HDFS - Kafka
Streaming Data from Kafka to PySpark
PySpark Streaming output: Kafka - NiFi - HDFS
HDFS to Hive Table
Dataflow Orchestration with Airflow
Quicksight Visualisation
Tableau Visualisation
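The "HDFS to Hive Table" step in the curriculum typically amounts to declaring an external table over the parsed files already sitting in HDFS. A hedged sketch of what that DDL might look like follows; the table name, columns, and HDFS path are all illustrative assumptions, not taken from the project:

```python
# Illustrative Hive DDL for an external table over parsed CSV files in HDFS.
# Table name, column names, and LOCATION are assumptions for this sketch.
create_covid_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS covid_stats (
    record_date DATE,
    country     STRING,
    confirmed   BIGINT,
    deaths      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/covid_parsed'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# In the project this statement would be run from the Hive shell or beeline,
# or via spark.sql(create_covid_table) from a PySpark session.
print(create_covid_table.strip())
```

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying HDFS files written by the pipeline remain intact.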