HANDS-ON-LAB

Airline Dataset Analysis with Spark and Azure Blob

Problem Statement

This hands-on lab analyzes the MovieLens dataset using Azure Blob Storage and Synapse. The raw datasets are uploaded to Blob storage, exposed as Synapse external tables, joined and transformed with a Dataflow, loaded into a Delta table, and analyzed in a Python notebook.

Tasks

  1. Download the MovieLens dataset from the GroupLens website: https://grouplens.org/datasets/movielens/

  2. Upload the datasets into an Azure Blob storage account within a container and appropriate folders.

  3. Create a Synapse external table for each dataset in the MovieLens zip archive, naming the tables appropriately.

  4. Write SQL statements to confirm the dataset conformity and entity relationships between the created tables in Synapse.

  5. Create a Synapse pipeline with a Dataflow to join and transform the Movies, Ratings, Links, and Tags datasets.

  6. Load the resulting dataset into a Delta Table within Synapse as an external table.

  7. Use a Python Notebook to perform analysis on the Delta table and retrieve the movies with the highest ratings and specific tags.

  8. Add the saved Python notebook into the created Synapse pipeline.
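The conformity and entity-relationship checks in task 4 can be sketched as follows. This is a minimal illustration, not the lab's actual solution: it uses an in-memory SQLite database as a stand-in for the Synapse external tables, the column names are assumed to match the MovieLens CSV headers, the sample rows are made up, and the helper names (orphan_count, duplicate_movie_ids, out_of_range_ratings) are hypothetical. The 0.5 to 5.0 rating range is also an assumption about the dataset version in use.

```python
import sqlite3

# In-memory SQLite database standing in for the Synapse external tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies  (movieId INTEGER, title TEXT, genres TEXT);
CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER);
CREATE TABLE tags    (userId INTEGER, movieId INTEGER, tag TEXT, timestamp INTEGER);
CREATE TABLE links   (movieId INTEGER, imdbId TEXT, tmdbId TEXT);

-- Illustrative sample rows only.
INSERT INTO movies  VALUES (1, 'Movie A', 'Comedy'), (2, 'Movie B', 'Drama');
INSERT INTO ratings VALUES (10, 1, 4.0, 1000), (11, 2, 3.5, 1001),
                           (12, 99, 5.0, 1002);  -- movieId 99 has no movies row
INSERT INTO tags    VALUES (10, 1, 'funny', 1003);
INSERT INTO links   VALUES (1, 'tt-a', '100'), (2, 'tt-b', '200');
""")

CHILD_TABLES = {"ratings", "tags", "links"}


def orphan_count(conn, child):
    """Count rows in a child table whose movieId is missing from movies."""
    if child not in CHILD_TABLES:
        raise ValueError(f"unexpected table name: {child}")
    sql = f"""
        SELECT COUNT(*)
        FROM {child} AS c
        LEFT JOIN movies AS m ON m.movieId = c.movieId
        WHERE m.movieId IS NULL
    """
    return conn.execute(sql).fetchone()[0]


def duplicate_movie_ids(conn):
    """Return movieId values that appear more than once in movies."""
    sql = "SELECT movieId FROM movies GROUP BY movieId HAVING COUNT(*) > 1"
    return [row[0] for row in conn.execute(sql)]


def out_of_range_ratings(conn):
    """Count ratings outside the assumed 0.5 to 5.0 scale."""
    sql = "SELECT COUNT(*) FROM ratings WHERE rating < 0.5 OR rating > 5.0"
    return conn.execute(sql).fetchone()[0]


for table in sorted(CHILD_TABLES):
    print(table, "orphans:", orphan_count(conn, table))
print("duplicate movieIds:", duplicate_movie_ids(conn))
print("out-of-range ratings:", out_of_range_ratings(conn))
```

The SELECT statements are plain SQL, so the same queries should carry over to the Synapse tables with little change; only the table setup here is specific to SQLite.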

Ready to dive into movie data analysis? Enroll in our lab and learn how to use Azure Blob Storage and Synapse to analyze the MovieLens dataset and extract meaningful insights.

Learnings

  • Uploading datasets into Azure Blob storage for further processing.

  • Creating Synapse external tables and verifying dataset conformity and entity relationships using SQL statements.

  • Building a Synapse pipeline with a Dataflow for data transformation and joining.

  • Loading the transformed dataset into a Delta Table as an external table in Synapse.

  • Performing analysis on the Delta table using a Python Notebook to retrieve specific movie information based on ratings and tags.

  • Incorporating the Python Notebook into a Synapse pipeline for automation and workflow management.
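The analysis step mentioned above (retrieving the highest-rated movies that carry a specific tag) can be sketched in plain Python. This is only an illustration of the logic under stated assumptions: in a Synapse notebook the same aggregation would more likely be written against the Delta table with Spark, the column names are assumed to match the MovieLens CSV headers, the inline sample rows are made up, and top_rated_with_tag is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows using the assumed MovieLens column names.
MOVIES_CSV = """movieId,title,genres
1,Movie A,Comedy
2,Movie B,Drama
3,Movie C,Comedy
"""
RATINGS_CSV = """userId,movieId,rating,timestamp
10,1,4.0,1000
11,1,5.0,1001
10,2,3.0,1002
12,3,4.5,1003
11,3,3.5,1004
"""
TAGS_CSV = """userId,movieId,tag,timestamp
10,1,funny,1005
12,3,funny,1006
11,2,slow,1007
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def top_rated_with_tag(movies, ratings, tags, tag, min_ratings=1, limit=10):
    """Return (title, average rating, rating count) for movies with the tag,
    sorted by average rating (highest first), then by title."""
    # Sum and count the ratings for each movie.
    totals = defaultdict(lambda: [0.0, 0])
    for row in ratings:
        entry = totals[row["movieId"]]
        entry[0] += float(row["rating"])
        entry[1] += 1

    # Movies that carry the requested tag (case-insensitive match).
    tagged = {t["movieId"] for t in tags if t["tag"].lower() == tag.lower()}
    titles = {m["movieId"]: m["title"] for m in movies}

    results = []
    for movie_id in tagged:
        if movie_id not in titles:
            continue  # tag refers to a movie missing from movies
        total, count = totals.get(movie_id, (0.0, 0))
        if count >= max(min_ratings, 1):
            results.append((titles[movie_id], round(total / count, 2), count))

    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:limit]


movies = read_rows(MOVIES_CSV)
ratings = read_rows(RATINGS_CSV)
tags = read_rows(TAGS_CSV)

for title, avg, count in top_rated_with_tag(movies, ratings, tags, "funny"):
    print(f"{title}: {avg} ({count} ratings)")
```

The min_ratings parameter is a design choice rather than part of the lab description: requiring a minimum number of ratings keeps a movie with a single 5.0 rating from topping the list.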

FAQs

Q1. What is the MovieLens dataset?

The MovieLens dataset is a popular movie recommendation dataset containing information about movies, ratings, links, and tags. It is commonly used for movie analysis and recommendation system development.

Q2. How do Azure Blob Storage and Synapse help in analyzing the MovieLens dataset?

Azure Blob Storage provides a scalable and secure storage solution for hosting the dataset, while Synapse offers powerful data integration and analytics capabilities. Together, they enable efficient data processing, table creation, transformation, and analysis.

Q3. What can I learn from this exercise?

By completing this exercise, you will learn how to upload datasets to Azure Blob Storage, create external tables in Synapse, validate dataset conformity using SQL statements, build data transformation pipelines, load data into Delta Tables, and perform analysis using Python Notebooks.