Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks

In this Azure Databricks project, you will use Spark and the Parquet file format to analyse the Yelp reviews dataset. Along the way, you will deploy Azure Data Factory, build data pipelines, and visualise the results of the analysis.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Nathan Elbert

Senior Data Scientist at Tiger Analytics

This was great. The use of Jupyter was great. Prior to learning Python I was a self-taught SQL user with advanced skills. I hold a Bachelor's in Finance and have 5 years of business experience. I...

Mohamed Yusef Ahmed

Software Developer at Taske

Recently I became interested in Hadoop as I think it's a great platform for storing and analyzing large structured and unstructured data sets. The experts did a great job not only explaining the...

What you will learn

Introduction to the Yelp dataset
Uploading raw datasets to Azure Data Lake Storage Gen2
Data ingestion using Azure Data Factory
JSON to CSV file conversion
Spinning up a cluster on Azure Databricks
Configuring ADLS on Azure Databricks
Saving CSV into the Parquet file format for better performance
Optimisation using partition and coalesce
Deciding what to partition by based on the analysis needed
PySpark DataFrames
How auto-scaling clusters work in Azure Databricks
Analyse the Yelp dataset - top 10 categories
Analyse the Yelp dataset - number of available categories
Analyse the Yelp dataset - number of restaurants per state
Analyse the Yelp dataset - top restaurants per city/state
Using a broadcast join to join two DataFrames
Analyse the Yelp dataset - number of Italian restaurants
Visualise business insights and sentiments

Project Description

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, covering businesses across 11 metropolitan areas in 4 countries. In this Azure Databricks project, you will learn how to ingest, read, clean, manipulate, and optimize this data, and extract business insights from it using the Microsoft Azure tech stack.

Similar Projects

This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

In this project, we are going to talk about insurance forecast by using regression techniques.

In this project, we will be building and querying an OLAP Cube for Flight Delays on the Hadoop platform.

Curriculum For This Mini Project

Introduction to Yelp
Solution Architecture
Uploading data and storage
Data Ingestion Using Azure Data Factory
Spin Up a Databricks Cluster
Configuring ADLS - Databricks
Parquet File Format & Partitioning
Coalesce & Repartition
Creating Dataframes from Parquet Files
Analyse Top 10 Categories
Analyse Number Of Categories Available
Analyse Number Of Restaurants Per State
Analyse Top Restaurants Per City/State
Analysis Using Broadcast Join
Analysis Of Italian Restaurants