HANDS-ON-LAB

Real Time Data Processing Example using Kafka and Azure Blob

Problem Statement

This hands-on Real Time Data Processing Example using Kafka and Azure Blob aims to build an end-to-end pipeline: order and product data is streamed through Kafka, landed in Azure Blob Storage, organized into Delta Tables, joined and refreshed by a scheduled batch job, and aggregated in real time with a Spark Streaming application.

Dataset

Download the dataset from here.

Tasks

  1. Set up and configure Kafka: Install and configure Kafka in your environment. Create the topics, producers, and consumers needed for testing. Configure a Kafka sink to write data into Azure Blob Storage.

  2. Ingest data into Azure Blob Storage: Download the provided dataset and upload it to Azure Blob Storage. Configure the ingestion pipeline to consume data from Kafka and store it in Azure Blob Storage, then verify the ingested data with Azure Storage Explorer or the portal (a minimal consumer-to-Blob sketch follows this list).

  3. Create Delta Tables: Create Delta Tables for the Order and Product data stored in Azure Blob Storage, defining the appropriate schema and data type for each column (see the schema tables and the Delta sketch further down). Load the data into the Delta Tables.

  4. Join tables and schedule batch processing: Join the Order and Product tables using the product_id column. Schedule a batch processing job to run daily at 7 PM local time to update the joined table with new data.

  5. Develop a Spark Streaming application: Create a Spark Streaming application that reads data from Kafka. Implement the logic to aggregate the data on trigger events and display the results in the console. Test the application by simulating trigger events and validating the output (a streaming sketch covering this task and the next appears below).

  6. Implement window function: Extend the Spark Streaming application to incorporate a window function that calculates the total order value for every hour. Configure the window size and sliding interval to meet the desired aggregation requirements.
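
If you prefer to script the landing step yourself instead of using a managed sink connector, a minimal consumer-to-Blob sketch for tasks 1-2 could look like the following. The topic name "orders", the container name "raw-data", the batch size, and the connection-string environment variable are illustrative assumptions, not part of the lab's fixed setup; adjust them to your environment.

import os
from kafka import KafkaConsumer                    # pip install kafka-python
from azure.storage.blob import BlobServiceClient   # pip install azure-storage-blob

# Consume raw order events from Kafka (assumed topic: "orders").
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: v.decode("utf-8"),
)

# Write batches of records to an assumed "raw-data" container in Blob Storage.
blob_service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = blob_service.get_container_client("raw-data")

batch, batch_no = [], 0
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 100:                          # flush every 100 records
        batch_no += 1
        container.upload_blob(
            f"orders/batch_{batch_no}.json", "\n".join(batch), overwrite=True
        )
        batch = []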

 

Join the hands-on lab to master real-time data processing with Kafka, Azure Blob Storage, and Spark Streaming.
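
As a starting point for tasks 5-6, here is one possible Spark Structured Streaming sketch that reads order events from Kafka and totals the order value per one-hour window. The topic name is assumed, and only the columns needed for the hourly total are parsed (field names follow the Order schema shown further down; the actual dataset's names may differ).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType

spark = SparkSession.builder.appName("order-stream").getOrCreate()

# Subset of the Order schema needed for the hourly aggregation.
order_schema = StructType([
    StructField("order_date", TimestampType()),
    StructField("order_id", IntegerType()),
    StructField("order_quantity", IntegerType()),
    StructField("order_amount", IntegerType()),
])

# Read the assumed "orders" topic and parse the JSON payload.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), order_schema).alias("o"))
          .select("o.*"))

# Tumbling one-hour window over the event time; use window(..., "1 hour", "15 minutes")
# instead if a sliding interval is required.
hourly_totals = (orders
                 .withWatermark("order_date", "1 hour")
                 .groupBy(F.window("order_date", "1 hour"))
                 .agg(F.sum("order_amount").alias("total_order_value")))

query = (hourly_totals.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()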

Learnings

  • Setting up and configuring Kafka for real-time data processing.

  • Configuring and utilizing Azure Blob Storage for data ingestion and storage.

  • Creating Delta Tables and performing joins between tables.

  • Scheduling batch processing jobs using Delta Tables.

  • Developing Spark Streaming applications for real-time data processing.

  • Implementing window functions in Spark Streaming for data aggregation.



Order table schema

S.no    Table Column           Data Type
1.      Order Date             TimestampType()
2.      Order ID               IntegerType()
3.      Order Quantity         IntegerType()
4.      Order Amount           IntegerType()
5.      Product_ID             MapType(StringType(), IntegerType())




Product table schema

S.no    Table Column           Data Type
1.      Product_ID             Number
2.      Product Description    String
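
Tasks 3-4 can be sketched against the schemas above roughly as follows. This assumes a Delta Lake-enabled Spark session (e.g. Databricks or delta-spark), raw JSON files reachable under an assumed Blob Storage mount at /mnt/blob, and interprets the Product types above as IntegerType() and StringType(); paths, table locations, and file formats are illustrative and should be adapted.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, TimestampType,
                               IntegerType, StringType, MapType)

spark = SparkSession.builder.appName("delta-batch").getOrCreate()

# Schemas mirroring the tables above.
order_schema = StructType([
    StructField("order_date", TimestampType()),
    StructField("order_id", IntegerType()),
    StructField("order_quantity", IntegerType()),
    StructField("order_amount", IntegerType()),
    StructField("product_id", MapType(StringType(), IntegerType())),
])
product_schema = StructType([
    StructField("product_id", IntegerType()),
    StructField("product_description", StringType()),
])

base = "/mnt/blob"   # assumed mount point of the Blob Storage container

# Task 3: load the raw files and persist them as Delta Tables.
orders = spark.read.schema(order_schema).json(f"{base}/orders/")
products = spark.read.schema(product_schema).json(f"{base}/products/")
orders.write.format("delta").mode("overwrite").save(f"{base}/delta/orders")
products.write.format("delta").mode("overwrite").save(f"{base}/delta/products")

# Task 4: join on product_id (the map column is exploded into one row per product,
# and its string key is cast to int to match the Product table). Schedule this script
# to run daily at 7 PM local time, e.g. as a Databricks job or with cron: 0 19 * * *
joined = (
    spark.read.format("delta").load(f"{base}/delta/orders")
    .select("order_id", "order_date", "order_amount",
            F.explode("product_id").alias("product_id", "quantity"))
    .withColumn("product_id", F.col("product_id").cast("int"))
    .join(spark.read.format("delta").load(f"{base}/delta/products"), "product_id")
)
joined.write.format("delta").mode("overwrite").save(f"{base}/delta/order_product")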

FAQs

Q1. What is the role of Kafka in this exercise?

Kafka serves as the messaging system in the real-time data pipeline. It allows for efficient and scalable data streaming between producers and consumers, ensuring reliable and low-latency data processing.
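
For example, a producer that publishes a single order event can be this small (kafka-python shown; the topic name and event fields are illustrative assumptions):

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 1, "order_amount": 250})   # illustrative event
producer.flush()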

 

Q2. How does Azure Blob Storage contribute to the pipeline?

Azure Blob Storage is used for data ingestion and storage. It acts as a repository for the streamed data from Kafka, enabling durable storage and easy access for further processing and analysis.
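
Besides browsing the container in the portal or Azure Storage Explorer, the landed blobs can also be checked programmatically; the connection-string environment variable, container name, and blob prefix below are assumptions:

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
# List what has landed under the assumed "orders/" prefix in the "raw-data" container.
for blob in service.get_container_client("raw-data").list_blobs(name_starts_with="orders/"):
    print(blob.name, blob.size)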

 

Q3. What are the learning outcomes of this exercise?

By completing this exercise, you will gain experience in setting up and configuring Kafka for real-time data processing, utilizing Azure Blob Storage for data ingestion and storage, creating Delta Tables and performing joins between tables, scheduling batch processing jobs, developing Spark Streaming applications, and implementing window functions for data aggregation.