HANDS-ON-LAB

Spark Streaming Databricks Example

Problem Statement

This hands-on Spark Streaming Databricks example builds a streaming pipeline for YouTube video statistics: the raw data files are uploaded to Azure Blob Storage, mounted and processed with Spark Structured Streaming in Databricks (the "trending_date" column is dropped and the "thumbnail_link" column is renamed to "url"), and the processed records are streamed into an Azure Event Hub.

The YouTube video data files are uploaded to a folder within an Azure Blob Storage container, reachable at a URI of the form:

wasbs://<container_name>@<storage_account>.blob.core.windows.net/youtube/

Tasks

  1. Set up Azure Blob Storage: Create a new container in Azure Blob Storage and upload the three files containing YouTube video data to a folder within the container.

  2. Mount Azure Blob Storage in Databricks: Mount the Azure Blob Storage container in Databricks to access the data files, configuring the necessary credentials and permissions for the mount (see the mount sketch after this list).

  3. Develop Spark code: Write Spark code in Databricks to stream the data from the files. Remove the "trending_date" column and rename the "thumbnail_link" column to "url" using Spark transformations (see the streaming sketch after this list).

  4. Print the output: Display the processed data in the console output to verify the column removal and renaming operations.

  5. Create an Azure Event Hub: Set up an Azure Event Hub to receive the streamed data from Spark. Configure the necessary Event Hub settings and obtain the connection string.

  6. Stream data into Event Hub: Extend the Spark code to send the processed stream to the Azure Event Hub. Configure the connection details and verify that data arrives successfully (see the Event Hub sketch after this list).
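
The following is a minimal sketch of the mount step (task 2), using the dbutils handle built into Databricks notebooks. The storage account, container, and secret scope/key names are placeholders, not values from the lab:

# Mount an Azure Blob Storage container in Databricks.
# <storage_account>, <container_name>, <scope_name>, and <key_name>
# are placeholders -- replace them with your own values.
storage_account = "<storage_account>"
container_name = "<container_name>"

dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/youtube",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope_name>", key="<key_name>"),
    },
)

# List the mounted folder to confirm the uploaded files are visible
display(dbutils.fs.ls("/mnt/youtube"))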

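Next, a sketch of the streaming read, the column transformations, and the console output (tasks 3 and 4). The file format ("csv" with a header row) and the mount path are assumptions; a streaming file source requires the schema up front, so it is declared from the dataset schema documented below:

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, LongType)

# Schema matching the dataset columns documented below
schema = StructType([
    StructField("video_id", StringType()),
    StructField("trending_date", StringType()),
    StructField("title", StringType()),
    StructField("channel_title", StringType()),
    StructField("category_id", IntegerType()),
    StructField("publish_time", StringType()),
    StructField("tags", StringType()),
    StructField("views", LongType()),
    StructField("likes", LongType()),
    StructField("dislikes", LongType()),
    StructField("comment_count", LongType()),
    StructField("thumbnail_link", StringType()),
    StructField("comments_disabled", StringType()),
    StructField("ratings_disabled", StringType()),
    StructField("video_error_or_removed", StringType()),
])

raw_stream = (spark.readStream
              .format("csv")              # assumption: the files are CSV
              .option("header", "true")
              .schema(schema)
              .load("/mnt/youtube/"))

# Task 3: drop "trending_date" and rename "thumbnail_link" to "url"
transformed = (raw_stream
               .drop("trending_date")
               .withColumnRenamed("thumbnail_link", "url"))

# Task 4: print the processed stream to the console to verify the changes
query = (transformed.writeStream
         .format("console")
         .outputMode("append")
         .start())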

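Finally, a sketch of streaming the processed data into the Event Hub (task 6). It assumes the azure-eventhubs-spark connector is installed on the cluster; the connection string is a placeholder for the one obtained in task 5:

from pyspark.sql.functions import to_json, struct

# Placeholder: the Event Hub connection string from task 5
connection_string = "<event_hub_connection_string>"

# The connector expects the connection string to be encrypted
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Event Hubs expects the payload in a single column named "body"
events = transformed.select(to_json(struct("*")).alias("body"))

(events.writeStream
 .format("eventhubs")
 .options(**eh_conf)
 .option("checkpointLocation", "/mnt/youtube/checkpoints/eventhub")
 .start())
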
Join the hands-on lab to master streaming YouTube video data with Spark and Azure Event Hub.

Learnings

  • Setting up Azure Blob Storage and uploading files.

  • Mounting Azure Blob Storage in Databricks for data access.

  • Processing streaming data using Spark in Databricks.

  • Removing and renaming columns in Spark DataFrames.

  • Printing the output of processed data.

  • Creating and configuring Azure Event Hub.

  • Streaming data into Azure Event Hub using Spark.



Dataset Schema

Column                    Data type
video_id                  String
trending_date             String
title                     String
channel_title             String
category_id               Integer
publish_time              String
tags                      String
views                     Long
likes                     Long
dislikes                  Long
comment_count             Long
thumbnail_link            String
comments_disabled         String
ratings_disabled          String
video_error_or_removed    String

FAQs

Q1. What is the role of Spark in this exercise?

Answer: Spark is used to process and analyze the streaming YouTube video data. It allows for efficient data transformations, such as removing and renaming columns, and provides the capability to stream the processed data to external systems like Azure Event Hub.

 

Q2. How does Azure Blob Storage contribute to the pipeline?

Answer: Azure Blob Storage acts as the source of the YouTube video data files. It provides durable storage and allows seamless access to the data for Spark processing in Databricks.

 

Q3. What are the learning outcomes of this exercise?

Answer: By completing this exercise, you will gain experience in setting up Azure Blob Storage and accessing files in Databricks, processing streaming data using Spark, performing column transformations, printing processed data output, creating and configuring Azure Event Hub, and streaming data to Event Hub using Spark.