HANDS-ON-LAB

Spark Streaming Databricks Example

Problem Statement

This hands-on Spark Streaming Databricks example builds a streaming pipeline for YouTube video statistics: the raw data files are uploaded to Azure Blob Storage, mounted and processed with Spark Structured Streaming in Databricks (the "trending_date" column is dropped and the "thumbnail_link" column is renamed to "url"), and the processed records are streamed into an Azure Event Hub.

The YouTube video data files are uploaded to a folder within an Azure Blob Storage container, reachable at a URI of the form:

wasbs://<container_name>@<storage_account>.blob.core.windows.net/youtube/

Tasks

  1. Set up Azure Blob Storage: Create a new container in Azure Blob Storage and upload the three files containing YouTube video data to a folder within the container.

  2. Mount Azure Blob Storage in Databricks: Mount the Azure Blob Storage container in Databricks to access the data files, configuring the necessary credentials and permissions for the mount (see the mount sketch after this list).

  3. Develop Spark code: Write Spark code in Databricks to stream the data from the files. Remove the "trending_date" column and rename the "thumbnail_link" column to "url" using Spark transformations (see the streaming sketch after this list).

  4. Print the output: Display the processed data in the console output to verify the column removal and renaming operations.

  5. Create an Azure Event Hub: Set up an Azure Event Hub to receive the streamed data from Spark. Configure the necessary Event Hub settings and obtain the connection string.

  6. Stream data into Event Hub: Extend the Spark code to send the processed stream to the Azure Event Hub. Configure the connection details and verify that data arrives successfully (see the Event Hub sketch after this list).
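
The following is a minimal sketch of the mount step (task 2), using the dbutils handle built into Databricks notebooks. The storage account, container, and secret scope/key names are placeholders, not values from the lab:

# Mount an Azure Blob Storage container in Databricks.
# <storage_account>, <container_name>, <scope_name>, and <key_name>
# are placeholders -- replace them with your own values.
storage_account = "<storage_account>"
container_name = "<container_name>"

dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/youtube",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope_name>", key="<key_name>"),
    },
)

# List the mounted folder to confirm the uploaded files are visible
display(dbutils.fs.ls("/mnt/youtube"))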

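Next, a sketch of the streaming read, the column transformations, and the console output (tasks 3 and 4). The file format ("csv" with a header row) and the mount path are assumptions; a streaming file source requires the schema up front, so it is declared from the dataset schema documented below:

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, LongType)

# Schema matching the dataset columns documented below
schema = StructType([
    StructField("video_id", StringType()),
    StructField("trending_date", StringType()),
    StructField("title", StringType()),
    StructField("channel_title", StringType()),
    StructField("category_id", IntegerType()),
    StructField("publish_time", StringType()),
    StructField("tags", StringType()),
    StructField("views", LongType()),
    StructField("likes", LongType()),
    StructField("dislikes", LongType()),
    StructField("comment_count", LongType()),
    StructField("thumbnail_link", StringType()),
    StructField("comments_disabled", StringType()),
    StructField("ratings_disabled", StringType()),
    StructField("video_error_or_removed", StringType()),
])

raw_stream = (spark.readStream
              .format("csv")              # assumption: the files are CSV
              .option("header", "true")
              .schema(schema)
              .load("/mnt/youtube/"))

# Task 3: drop "trending_date" and rename "thumbnail_link" to "url"
transformed = (raw_stream
               .drop("trending_date")
               .withColumnRenamed("thumbnail_link", "url"))

# Task 4: print the processed stream to the console to verify the changes
query = (transformed.writeStream
         .format("console")
         .outputMode("append")
         .start())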

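Finally, a sketch of streaming the processed data into the Event Hub (task 6). It assumes the azure-eventhubs-spark connector is installed on the cluster; the connection string is a placeholder for the one obtained in task 5:

from pyspark.sql.functions import to_json, struct

# Placeholder: the Event Hub connection string from task 5
connection_string = "<event_hub_connection_string>"

# The connector expects the connection string to be encrypted
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Event Hubs expects the payload in a single column named "body"
events = transformed.select(to_json(struct("*")).alias("body"))

(events.writeStream
 .format("eventhubs")
 .options(**eh_conf)
 .option("checkpointLocation", "/mnt/youtube/checkpoints/eventhub")
 .start())
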
Join the hands-on lab to master streaming YouTube video data with Spark and Azure Event Hub.

Learnings

  • Setting up Azure Blob Storage and uploading files.

  • Mounting Azure Blob Storage in Databricks for data access.

  • Processing streaming data using Spark in Databricks.

  • Removing and renaming columns in Spark DataFrames.

  • Printing the output of processed data.

  • Creating and configuring Azure Event Hub.

  • Streaming data into Azure Event Hub using Spark.



Dataset Schema

Column                    Data type
video_id                  String
trending_date             String
title                     String
channel_title             String
category_id               Integer
publish_time              String
tags                      String
views                     Long
likes                     Long
dislikes                  Long
comment_count             Long
thumbnail_link            String
comments_disabled         String
ratings_disabled          String
video_error_or_removed    String

FAQs

Q1. What is the role of Spark in this exercise?

Answer: Spark is used to process and analyze the streaming YouTube video data. It allows for efficient data transformations, such as removing and renaming columns, and provides the capability to stream the processed data to external systems like Azure Event Hub.

 

Q2. How does Azure Blob Storage contribute to the pipeline?

Answer: Azure Blob Storage acts as the source of the YouTube video data files. It provides durable storage and allows seamless access to the data for Spark processing in Databricks.

 

Q3. What are the learning outcomes of this exercise?

Answer: By completing this exercise, you will gain experience in setting up Azure Blob Storage and accessing files in Databricks, processing streaming data using Spark, performing column transformations, printing processed data output, creating and configuring Azure Event Hub, and streaming data to Event Hub using Spark.