HANDS-ON-LAB

Azure Synapse Analytics and Azure Databricks

Problem Statement

This hands-on Azure Synapse Analytics and Azure Databricks code aims to create a Lambda function to cleanse YouTube statistics reference data and store it in an S3 bucket in CSV format. Additionally, the cleansed data should be exposed in the Glue catalog. 

The statistics reference data (the JSON files) is placed in the raw S3 bucket:

s3://<raw_bucket_name>/youtube/raw_statistics_reference_data/

Tasks

  1. Download the Movie Lens dataset and upload it to Azure Blob Storage within a designated container and folders. (https://grouplens.org/datasets/movielens/)

  2. Create external tables in Synapse for each dataset in the Movie Lens zip folders, ensuring appropriate naming.

  3. Write SQL statements to validate the dataset conformity and entity relationships within Synapse.

  4. Create a Synapse Pipeline with a Dataflow, leveraging the low code/no code features, to process Movies, Rating, Links, and Tags datasets with appropriate joins and load the resultant dataset into a Delta table within Synapse as an external table.

  5. Use a Python Notebook to perform analysis on the Delta table, retrieving the top 10 movies with the highest ratings.

  6. Integrate the saved Python Notebook into the Synapse Pipeline.

  7. Use Azure Databricks to repeat the exercise, building transformations using Delta Live Tables.

Unlock valuable insights from the Movie Lens dataset using Synapse and Azure Databricks. Enroll in our lab and gain practical skills in data analysis and transformation.

Learnings

  • Uploading data to Azure Blob Storage and creating external tables in Synapse.

  • Validating dataset conformity and entity relationships using SQL statements in Synapse.

  • Building a Synapse Pipeline with a Dataflow for data transformation and loading into a Delta table.

  • Performing analysis on the Delta table using a Python Notebook.

  • Integrating the Python Notebook into the Synapse Pipeline.

  • Repeating the exercise using Azure Databricks and Delta Live Tables for data transformations.

FAQs

Q1. What is the Movie Lens dataset?

The Movie Lens dataset is a popular dataset containing movie ratings and other related information. It is often used for movie recommendation systems and data analysis.

 

Q2. What will I learn from this exercise?                                                   

By completing this exercise, you will learn how to upload data to Azure Blob Storage, create external tables in Synapse, perform SQL statements to validate data, build Synapse Pipelines with Dataflows, perform data analysis using Python Notebooks, and utilize Azure Databricks for data transformations.

 

Q3. Why is Delta table used in this exercise?

Delta table is used as a storage format in Synapse and Azure Databricks because it offers performance optimizations and built-in data versioning and transaction capabilities, making it suitable for handling large-scale data processing and analytics tasks.