Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks

In this Azure Databricks project, you will use Spark and the Parquet file format to analyse the Yelp reviews dataset. As part of this, you will deploy Azure Data Factory and data pipelines, and visualise the analysis.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing I must say, the sessions are challenging and go to the depths. I looked at graduate...

Swati Patra

Systems Advisor, IBM

I have 11 years of experience and work with IBM. My domain is Travel, Hospitality and Banking - both sectors process lots of data. The way the projects were set up and the mentors' explanation was...

What will you learn

Introduction to the Yelp dataset
Uploading raw datasets to Azure Data Lake Storage Gen2
Data ingestion using Azure Data Factory
JSON to CSV file conversion
Spinning up a cluster on Azure Databricks
Configuring ADLS on Azure Databricks
Saving CSV data in the Parquet file format for better performance
Optimization using partition and coalesce
Deciding what to partition on based on the analysis needed
PySpark DataFrames
How autoscaling clusters work in Azure Databricks
Analyse Yelp dataset - top 10 categories
Analyse Yelp dataset - number of available categories
Analyse Yelp dataset - number of restaurants per state
Analyse Yelp dataset - top restaurants per city/state
How to use a broadcast join to join two DataFrames
Analyse Yelp dataset - number of Italian restaurants
Visualize business insights and sentiments

Project Description

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data. In the dataset you'll find information about businesses across 11 metropolitan areas in 4 countries. In this Azure Databricks project, you will learn how to ingest this data, read it, clean it, manipulate it, optimize it, and get business insights out of it using the Microsoft Azure tech stack.

Similar Projects

Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Curriculum For This Mini Project

Introduction to Yelp
Solution Architecture
Uploading data and storage
Data Ingestion Using Azure Data Factory
Spinning Up A Databricks Cluster
Configuring ADLS - Databricks
Parquet File Format & Partitioning
Coalesce & Re-partition
Creating Dataframes from Parquet Files
Analyse Top 10 Categories
Analyse Number Of Categories Available
Analyse Number Of Restaurants Per State
Analyse Top Restaurants Per City/State
Analysis Using Broadcast Join
Analysis Of Italian Restaurants