1-844-696-6465 (US)        +91 77600 44484        help@dezyre.com
analyze-yelp-data.jpg

Data Mining Project on Yelp Dataset using Hadoop Hive

Use the Hadoop ecosystem to glean valuable insights from the Yelp dataset. You will be analyzing the different patterns that can be found in the Yelp data set, to come up with various approaches in solving a business problem.
4.74.7

Users who bought this project also bought

What will you learn

  • Analyze JSON dataLoading JSON format to Hive
  • Create a Schema to the fields in the table.
  • Creating queries to set up the EXTERNAL TABLE in Hive
  • Create new desired TABLE to copy the data.
  • Creating query to populate and filter the data.
  • Analyze log files in HIVE.

What will you get

  • Access to recording of the complete project
  • Access to all material related to project like data files, solution files etc.

Prerequisites

  • Minimum RAM 8GB
  • Oracle Virtual Box
  • Three Ubuntu lab machines (setup documented)
  • Networking
  • Yelp data sets

Project Description

Yelp it! is the term people use to review a local business, restaurant or products across the main US states and cities. Yelp has grown from a simple reviews site to something much more. It is now a strong community of users who contribute reviews per their own volition. Now let us understand what does this mean in terms of data that is generated in Yelp. Since its inception in 2004, Yelp has collected a staggering 25 million reviews for its local businesses, restaurants, doctors, services, etc. They have an average of 66 million unique visitors to their site every month. Yelp App is used on 5.7 million mobile devices. They have an impressive Y-O-Y growth with reviews growing by 64%, visitors growing at 67%, local businesses at 97% and active local advertisers at 118%. That is a LOT of data! Phew! 

It comes as no surprise when we say that Yelp has managed to crush all their competition mainly because they are so good at big data analysis. Data of this magnitude has a story to tell and businesses need to figure out what their data is telling them in order to make smarter business decisions than their competitors. In the following project, we have taken a Yelp data-set and we will be using Hive to analyze this data. Hive is the easiest of the Hadoop tools to learn. If you are from a data warehousing background and know SQL well - it will be a breeze to work on Hive. Hive is a data warehouse infrastructure built on top of Hadoop and is quite versatile in its usage, as it supports different storage types such as plain text, RCFile, Amazon S3, HBase, ORC, etc. Hive has its own SQL like language called HiveQL with schemas - which transparently converts queries to MapReduce or Apache Spark jobs. 

You will be working on solving these business problems for the end-user:

  • Overall business review counts for a particular period of time. 
  • Local Businesses' opening and closing timings. 
  • Amenities provided by each business. 
  • List the top restaurants in a state by the number of reviews.
  • List the top restaurants in number of listed categories.
  • Filter the top categories by number of review counts.
  • Joining reviews dataset with the businesses to do further analysis about the businesses. 

 

Instructors

 
Sakhuja

Senior Hadoop Engineer at Sirius Computer Solutions

Abhishek has a corporate experience for 5 years in the fields of Hadoop R&D, Big Data technologies, Hadoop administration, IBM Netezza Database Administration, Data Warehousing, Data Mining (Netezza, Oracle PL/SQL and Microsoft SQL Server), Development, ETL and Advanced analytics. He has a vast exposures on various pro see more...