Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks

In this Azure Databricks project, you will use Spark and the Parquet file format to analyse the Yelp reviews dataset. As part of this, you will deploy Azure Data Factory and data pipelines, and visualise the analysis.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Mohamed Yusef Ahmed

Software Developer at Taske

Recently I became interested in Hadoop as I think it's a great platform for storing and analyzing large structured and unstructured data sets. The experts did a great job not only explaining the...

Nathan Elbert

Senior Data Scientist at Tiger Analytics

This was great. The use of Jupyter was great. Prior to learning Python I was a self-taught SQL user with advanced skills. I hold a Bachelor's in Finance and have 5 years of business experience. I...

What will you learn

Introduction to the Yelp dataset
Uploading raw datasets to Azure Data Lake Storage Gen2
Data ingestion using Azure Data Factory
JSON to CSV file conversion
Spinning up a cluster on Azure Databricks
Configuring ADLS on Azure Databricks
Saving CSV data in the Parquet file format for better performance
Optimization using partition and coalesce
Deciding what to partition by, based on the analysis needed
PySpark dataframes
How autoscaling clusters work in Azure Databricks
Analyse Yelp dataset - top 10 categories
Analyse Yelp dataset - number of available categories
Analyse Yelp dataset - number of restaurants per state
Analyse Yelp dataset - top restaurants per city/state
How to use a broadcast join to join two dataframes
Analyse Yelp dataset - number of Italian restaurants
Visualize business insights and sentiments
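
The JSON to CSV conversion step listed above can be sketched in plain Python. This is a minimal illustration, not the project's own conversion code: it assumes the raw file holds one JSON object per line, and the function name, field names (business_id, name, state, categories) and sample records are hypothetical stand-ins.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, fieldnames):
    """Convert an iterable of JSON-object strings into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the requested columns; missing fields become empty cells.
        writer.writerow({name: record.get(name, "") for name in fieldnames})
    return buffer.getvalue()


# Hypothetical sample records (not taken from the real dataset).
sample_lines = [
    '{"business_id": "b1", "name": "Pasta House", "state": "AZ", "categories": "Italian, Restaurants"}',
    '{"business_id": "b2", "name": "Quick Cuts", "state": "NV", "stars": 4.5}',
]

csv_text = json_lines_to_csv(sample_lines, ["business_id", "name", "state", "categories"])
print(csv_text)
```

Note that the csv module quotes the categories value because it contains a comma, which is one reason a proper CSV writer is safer than joining fields by hand.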

Project Description

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data. In the dataset you'll find information about businesses across 11 metropolitan areas in 4 countries. In this Azure Databricks project, you will learn how to ingest this data, read it, clean it, manipulate it, optimize it, and get business insights out of it using the Microsoft Azure tech stack.
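
Two of the insights this project covers, the most common categories and the number of restaurants per state, come down to simple counting logic. The sketch below shows that logic in plain Python rather than PySpark, so it is an illustration of the idea and not the project's solution. It assumes each business record stores its categories as one comma-separated string; the records, field names and helper functions are hypothetical.

```python
from collections import Counter


def split_categories(raw):
    """Split a comma-separated category string into clean labels."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]


def top_categories(businesses, n=10):
    """Return the n most frequent categories as (category, count) pairs."""
    counts = Counter()
    for business in businesses:
        counts.update(split_categories(business.get("categories")))
    return counts.most_common(n)


def restaurants_per_state(businesses):
    """Count businesses tagged 'Restaurants', grouped by state."""
    counts = Counter()
    for business in businesses:
        if "Restaurants" in split_categories(business.get("categories")):
            counts[business["state"]] += 1
    return dict(counts)


# Hypothetical sample records (not taken from the real dataset).
businesses = [
    {"name": "Pasta House", "state": "AZ", "categories": "Italian, Restaurants"},
    {"name": "Taco Stop", "state": "NV", "categories": "Mexican, Restaurants"},
    {"name": "Quick Cuts", "state": "AZ", "categories": "Hair Salons"},
    {"name": "No Label", "state": "NV", "categories": None},
]

top = top_categories(businesses, n=2)
per_state = restaurants_per_state(businesses)
print(top[0], per_state)
```

Records with no categories are tolerated rather than dropped, which mirrors the kind of cleaning decision the project asks you to make before aggregating.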

Similar Projects

The goal of this Hadoop project is to apply some data engineering principles to the Yelp dataset in the areas of processing, storage, and retrieval.

In this project, we will evaluate and demonstrate how to handle unstructured data using Spark.

In this NoSQL project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or querying.

Curriculum For This Mini Project

Introduction to Yelp - 03m
Solution Architecture - 03m
Uploading Data and Storage - 04m
Data Ingestion Using Azure Data Factory - 05m
Spin Up a Databricks Cluster - 06m
Configuring ADLS - Databricks - 09m
Parquet File Format & Partitioning - 08m
Coalesce & Re-partition - 06m
Creating Dataframes from Parquet Files - 06m
Analyse Top 10 Categories - 05m
Analyse Number of Categories Available - 03m
Analyse Number of Restaurants per State - 03m
Analyse Top Restaurants per City/State - 06m
Analysis Using Broadcast Join - 11m
Analysis of Italian Restaurants - 04m
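
The idea behind the broadcast join covered in the curriculum can be shown without a cluster. In a broadcast join, the smaller table is copied in full to every worker, so each partition of the larger table is joined locally with no shuffle. The plain-Python sketch below imitates that under stated assumptions: the function name, the sample reviews and businesses, and the choice of business_id as the join key are all hypothetical, and the small side is assumed to have unique keys. In PySpark itself the same intent is expressed with the broadcast hint from pyspark.sql.functions, for example large_df.join(broadcast(small_df), "business_id").

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of a large table against a small table.

    The small table is turned into one in-memory lookup (the "broadcast"
    copy) that every partition reads, so no rows of the large table move.
    """
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for partition in large_partitions:  # on a cluster, one task per partition
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:  # inner join: drop unmatched rows
                joined.append({**row, **match})
    return joined


# Hypothetical data: reviews are the large side, split into two partitions.
review_partitions = [
    [{"business_id": "b1", "stars": 5}, {"business_id": "b2", "stars": 3}],
    [{"business_id": "b1", "stars": 4}, {"business_id": "b9", "stars": 1}],
]
# Businesses are the small side that gets "broadcast".
businesses = [
    {"business_id": "b1", "name": "Pasta House"},
    {"business_id": "b2", "name": "Quick Cuts"},
]

joined = broadcast_hash_join(review_partitions, businesses, "business_id")
for row in joined:
    print(row)
```

The trade-off is memory: this approach only pays off when the small table fits comfortably on each worker.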