Data Mining Project on Yelp Dataset using Hadoop Hive

Use the Hadoop ecosystem to glean valuable insights from the Yelp dataset. You will analyze the different patterns found in the Yelp dataset and come up with various approaches to solving business problems.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the roadmap of the project
Downloading and understanding the dataset
Setting up Xshell and PuTTY
Converting JSON files into text files
Loading the data in HDFS
Creating Databases in Hive
Creating tables in the databases and how to access different elements of the table
Creating new tables to copy the data into
Understanding Hive's complex data types: arrays, structs, and maps
Creating tables with a SerDe and importing data by typecasting datatypes (see the sketch after this list)
Data preprocessing and analyzing the dataset using queries
Creating queries to populate and filter the data
Analyzing log files in Hive
Debugging your queries to filter out errors
Writing queries for ranking and partitioning your tables
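
To make these steps concrete, here is a minimal HiveQL sketch of creating a database, defining an external table over the raw Yelp business JSON with a SerDe, and copying the data into a managed table. The column names, HDFS path, and SerDe class are illustrative assumptions; the project videos walk through the exact schema used.

    -- Assumes the raw JSON has already been copied to HDFS,
    -- e.g. with: hdfs dfs -put business.json /user/hive/yelp/business
    CREATE DATABASE IF NOT EXISTS yelp;
    USE yelp;

    -- External table over the raw JSON; org.apache.hive.hcatalog.data.JsonSerDe
    -- ships with Hive's HCatalog jars (the course may use a different SerDe).
    CREATE EXTERNAL TABLE IF NOT EXISTS business_raw (
      business_id  STRING,
      name         STRING,
      city         STRING,
      state        STRING,
      stars        DOUBLE,
      review_count INT,
      categories   ARRAY<STRING>,                        -- complex type: array
      attributes   STRUCT<wifi:STRING, parking:STRING>,  -- complex type: struct (illustrative fields)
      hours        MAP<STRING, STRING>                   -- complex type: map
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/user/hive/yelp/business';

    -- Copy the data into a managed table, typecasting columns as desired.
    CREATE TABLE IF NOT EXISTS business AS
    SELECT business_id, name, city, state,
           CAST(stars AS DECIMAL(2,1)) AS stars,
           review_count, categories, hours
    FROM business_raw;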

Project Description

"Yelp it!" is the phrase people use when reviewing a local business, restaurant, or product across the major US states and cities. Yelp has grown from a simple reviews site into something much more: it is now a strong community of users who contribute reviews of their own volition. Let us look at what this means in terms of the data Yelp generates. Since its inception in 2004, Yelp has collected a staggering 25 million reviews of local businesses, restaurants, doctors, services, and more. It averages 66 million unique visitors to its site every month, and the Yelp app is used on 5.7 million mobile devices. Yelp also posts impressive year-over-year growth, with reviews growing by 64%, visitors by 67%, local businesses by 97%, and active local advertisers by 118%. That is a LOT of data! Phew!

It comes as no surprise that Yelp has managed to crush its competition, mainly because it is so good at big data analysis. Data of this magnitude has a story to tell, and businesses need to figure out what their data is telling them in order to make smarter decisions than their competitors. In this project, we take a Yelp dataset and use Hive to analyze it. Hive is the easiest of the Hadoop tools to learn: if you come from a data warehousing background and know SQL well, working in Hive will be a breeze. Hive is a data warehouse infrastructure built on top of Hadoop and is quite versatile, supporting storage types such as plain text, RCFile, Amazon S3, HBase, and ORC. It has its own SQL-like language, HiveQL, with schemas, and it transparently converts queries into MapReduce or Apache Spark jobs.
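
For a flavour of how close HiveQL is to SQL, a simple aggregation like the sketch below (using the illustrative business table defined earlier) is transparently compiled into MapReduce or Spark jobs:

    -- Count businesses per state; Hive turns this into MapReduce (or Spark) jobs under the hood.
    SELECT state, COUNT(*) AS business_count
    FROM business
    GROUP BY state
    ORDER BY business_count DESC
    LIMIT 10;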

You will be working on solving these business problems for the end-user:

  • Overall business review counts for a particular period of time. 
  • Local Businesses' opening and closing timings. 
  • Amenities provided by each business. 
  • List the top restaurants in a state by the number of reviews.
  • List the top restaurants by number of listed categories.
  • Filter the top categories by number of reviews.
  • Join the reviews dataset with the businesses dataset for further analysis of the businesses (see the sketch after this list).
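
As an illustration, two of these problems might be expressed roughly as follows. This is a sketch only: the review table and columns such as review_count, categories, and stars are assumptions about the schema built in the project.

    -- Top restaurants in a state by number of reviews (the state code is just an example).
    SELECT name, city, review_count
    FROM business
    WHERE state = 'AZ'
      AND array_contains(categories, 'Restaurants')
    ORDER BY review_count DESC
    LIMIT 10;

    -- Join the reviews dataset with the businesses for further analysis,
    -- e.g. the average star rating each heavily reviewed business actually receives.
    SELECT b.business_id, b.name,
           AVG(r.stars) AS avg_review_stars,
           COUNT(*)     AS num_reviews
    FROM review r
    JOIN business b ON r.business_id = b.business_id
    GROUP BY b.business_id, b.name
    ORDER BY num_reviews DESC
    LIMIT 10;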

Curriculum For This Mini Project

  • Introduction (04m)
  • Dataset Overview (11m)
  • Setting up the machine configuration (01m)
  • Troubleshooting configuration (12m)
  • Process flow of the solution (11m)
  • Convert JSON data to text (18m)
  • Convert JSON data to text - separate fields from values (21m)
  • Summary of the process flow (08m)
  • Explanation of the Yelp dataset (08m)
  • Load data into HDFS (13m)
  • Create Database (13m)
  • Working with the tables (13m)
  • Working with map, struct and arrays (08m)
  • Analyse Yelp reviews (08m)
  • Analyse number of reviews by state and city (15m)
  • Analyse top 10 categories by number of reviews (02m)
  • Analyse top businesses which have over 2000 good reviews (06m)
  • Analyse number of sub-categories available (05m)
  • Analyse top 10 restaurants in each city with over 100 reviews (07m)
  • Analyse top 50 restaurants by reviews that are open more than 11 hours/day (26m)
  • Debugging query (11m)
  • Ranking query (07m)
  • Analyse top 3 businesses by average daily users, where users found the reviews useful (12m)
  • Analyse increase in ratings for a business in a year (16m)
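
The ranking and partitioning lessons toward the end of the curriculum rely on Hive window functions. Here is a minimal sketch of that kind of query, again against the illustrative business table from above:

    -- Rank restaurants within each city by review count and keep the top 10 per city.
    SELECT city, name, review_count, rnk
    FROM (
      SELECT city, name, review_count,
             RANK() OVER (PARTITION BY city ORDER BY review_count DESC) AS rnk
      FROM business
      WHERE review_count > 100
        AND array_contains(categories, 'Restaurants')
    ) ranked
    WHERE rnk <= 10;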