NoSQL Project on Yelp Dataset using HBase and MongoDB

In this NoSQL project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or querying.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Shailesh Kurdekar

Solutions Architect at Capital One

I have worked for more than 15 years in Java and J2EE and have recently developed an interest in Big Data technologies and Machine learning due to a big need at my workspace. I was referred here by a...

Arvind Sodhi

VP - Data Architect, CDO at Deutsche Bank

I have extensive experience in data management and data processing. Over the past few years I saw the data management technology transition into the Big Data ecosystem and I needed to follow suit. I...

What will you learn

Why store data in a NoSQL database
Difference between sparse and densely distributed data
Understanding the document-term matrix
Downloading and understanding the Yelp dataset
Writing queries in Hue-Impala for visualizing the dataset
Denormalization: why it is needed and how to denormalize the dataset
Integrating Spark with Hive
Clustering business data based on different attributes
Revisiting NoSQL database concepts
Consistency, Availability, and Partition tolerance in traditional RDBMS
Setting up the connection between MongoDB and Spark for collecting the data
Storing sparse business attributes in HBase
Storing sparse business attributes in MongoDB
Creating a recursive function for iterating over and reading the data
Using a DAG scheduler to schedule the data analysis tasks to run automatically
Integrating Hive and NoSQL databases for data retrieval using query
Integrating Spark and NoSQL databases for retrieving data for processing
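
Two of the items above, storing sparse business attributes and writing a recursive function to read the data, can be sketched together in plain Python. This is only an illustration under stated assumptions, not the project's actual solution: the sample record is made up, the function names `flatten_attributes` and `to_hbase_row` are hypothetical, and the column family name `attr` is an arbitrary choice. The idea is that nested attributes are walked recursively, missing values are skipped, and each business ends up with only the columns it really has.

```python
from typing import Any, Dict


def flatten_attributes(attrs: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Recursively flatten nested attributes into dotted keys.

    Keys whose value is None are skipped, so each business only
    carries the attributes it actually has (a sparse row).
    """
    flat: Dict[str, str] = {}
    for key, value in attrs.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested attribute groups.
            flat.update(flatten_attributes(value, name))
        elif value is not None:
            # Store leaf values as strings, as a wide-column store would.
            flat[name] = str(value)
    return flat


def to_hbase_row(flat: Dict[str, str], family: str = "attr") -> Dict[str, str]:
    """Prefix each flattened key with a column family (hypothetical name)."""
    return {f"{family}:{name}": value for name, value in flat.items()}


# Made-up sample record, shaped like a nested attributes object.
sample = {
    "WiFi": "free",
    "Ambience": {"romantic": False, "casual": True},
    "Parking": {"garage": None, "street": True},
}

flat = flatten_attributes(sample)
row = to_hbase_row(flat)

for column, value in sorted(row.items()):
    print(column, "=", value)
```

The flattened dictionary maps naturally onto `family:qualifier` columns in a wide-column store such as HBase, while the original nested `sample` could be stored as-is in a document store such as MongoDB. Which client library performs the actual write is left open here.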

Project Description

Continuing the data engineering series on the Yelp dataset, we have already established several concepts, from data warehousing to graph analysis. Well done.

But in today's world, not all data is best stored on HDFS. Some requirements and scenarios call for data storage with very low latency that can also handle large datasets. This is where NoSQL databases come in.

In this NoSQL project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or querying. We will demonstrate the value of these storage options compared with using HDFS alone, and show how to join their data with data stored in HDFS in real time.

Since MongoDB is not available in the Cloudera Quickstart VM, you are encouraged to install MongoDB on your host machine and set up a host network interface between the host and the VM for this big data project.

Similar Projects

Learn to design Hadoop Architecture and understand how to store data using data acquisition tools in Hadoop.

In this big data project, we will be performing an OLAP cube design using AdventureWorks database. The deliverable for this session will be to design a cube, build and implement it using Kylin, query the cube and even connect familiar tools (like Excel) with our new cube.

Hive Project - Learn to write a Hive program to find the first unique URL, given 'n' number of URLs.

Curriculum For This Mini Project

10-June-2017: 02h 48m
11-June-2017: 02h 59m