In this PySpark project, you will simulate a complex real-world data pipeline built around messaging data. The project is deployed using the following tech stack: NiFi, PySpark, Hive, HDFS, Kafka, Airflow, Tableau, and AWS QuickSight.
This project builds an understanding of Apache Spark, with the main focus on one of its components, Spark SQL. We will look at how Spark and Spark SQL work internally, their capabilities, and their advantages over other data processing tools. We will take up a business problem in the supply chain domain, with Databricks and Spark 3.0 as our tech stack. We will use Spark SQL to explore the business data and generate insights that should help us frame a solution to the business problem.
This is a typical Big Data ETL and visualization project implemented in the AWS cloud using cloud-native tools: Glue, which runs Spark jobs without the need to maintain cluster infrastructure; Step Functions, which schedules jobs based on their dependencies; Redshift, a petabyte-scale data warehouse service in AWS; and QuickSight, an AWS-managed visualization tool for creating business reports.
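The dependency-based scheduling with Step Functions can be sketched as a state machine definition in the Amazon States Language. The job names below are hypothetical; the `glue:startJobRun.sync` integration makes each state wait for its Glue job to finish before the next one starts, which is how the dependency between the transform and load steps is enforced.

```json
{
  "Comment": "Hypothetical state machine: run the Glue transform job, then the Redshift load job",
  "StartAt": "TransformWithGlue",
  "States": {
    "TransformWithGlue": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "transform-raw-data" },
      "Next": "LoadToRedshift"
    },
    "LoadToRedshift": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "load-to-redshift" },
      "End": true
    }
  }
}
```

Once the load job completes, the data lands in Redshift, where QuickSight can query it directly for reporting.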