Building a Data Warehouse using Spark on Hive

In this Hive project, we will build a Hive data warehouse from a raw dataset stored in HDFS and present the data in a relational structure so that querying it is natural.

What will you learn

• How to run Hive queries on Spark
• Hadoop data warehousing with Hive
• Using the interactive Scala Build Tool (sbt) with Spark
• Data serialization with Kryo (see the sketch after this list)
• Performance optimization using caching
• Broadcast variables
• Writing Spark RDDs to Hive using Spark SQL
• Exploring the Parquet storage format and the reasons for choosing Parquet
• Building Hive external tables over Parquet datasets
• Writing queries against the datasets using Impala
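
As a taste of the Spark-side topics above, here is a minimal sketch of enabling Kryo serialization, broadcasting a small lookup table, and caching a reused RDD. The HDFS path, the record layout, and the genre lookup are illustrative assumptions, not details taken from the project.

import org.apache.spark.sql.SparkSession

object SparkConceptsSketch {
  def main(args: Array[String]): Unit = {
    // Kryo is typically faster and more compact than Java serialization
    // for data that gets shuffled or cached.
    val spark = SparkSession.builder()
      .appName("MovieLensWarehouse")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext

    // Broadcast a small, read-only lookup map to every executor once,
    // instead of shipping it with each task (hypothetical genre codes).
    val genreNames = sc.broadcast(Map(1 -> "Action", 2 -> "Comedy", 3 -> "Drama"))

    // Cache an RDD that several downstream computations will reuse,
    // so it is materialized only once. Path is an assumed HDFS location.
    val ratings = sc.textFile("hdfs:///data/movielens/ratings.dat").cache()
    println(s"ratings: ${ratings.count()} lines")
    println(s"genre 1 is ${genreNames.value(1)}")

    spark.stop()
  }
}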

Project Description

This Hive project aims to build a Hive data warehouse from a raw dataset stored in HDFS and present the data in a relational structure so that querying it is natural. The dataset for this big data project is the MovieLens open dataset of movie ratings.
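
For illustration, the ratings file can be parsed from HDFS into a typed structure; the HDFS path and the ::-delimited layout (userId::movieId::rating::timestamp) are assumptions based on the standard MovieLens 1M format.

import org.apache.spark.sql.SparkSession

// Hypothetical case class mirroring one line of the MovieLens ratings file.
case class Rating(userId: Int, movieId: Int, rating: Double, timestamp: Long)

object LoadRatings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LoadRatings")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val ratings = spark.sparkContext
      .textFile("hdfs:///data/movielens/ratings.dat") // assumed location
      .map(_.split("::"))
      .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toLong))
      .toDS()

    ratings.show(5)
    spark.stop()
  }
}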

The Spark project makes use of some advanced concepts in Spark programming and stores its final output incrementally in Hive tables built on the Parquet storage format. We will also demonstrate some complex queries on these tables using Hive and Impala. The Spark application will be written in Scala, and the development process will be automated using the Scala Build Tool (sbt).
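
A minimal build.sbt for such an application might look like the following; the project name and the Spark and Scala versions are assumptions, not the versions used in the recorded solution.

// build.sbt: hypothetical build definition for the Spark application.
name := "movielens-warehouse"
version := "0.1.0"
scalaVersion := "2.12.18"

// "provided" because the cluster supplies Spark at runtime;
// spark-hive brings in Hive support for SparkSession.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.3.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "3.3.2" % "provided"
)

From the interactive sbt shell, package produces the jar that spark-submit then deploys to the cluster.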

The data warehouse is built by extracting, transforming, and loading the dataset into structures that give data scientists a basis for different forms of model discovery.
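
As a sketch of that transform-and-load step, the snippet below aggregates average ratings per movie and appends the result to a Parquet-backed Hive table; the table and column names are hypothetical.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{avg, count}

object LoadWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LoadWarehouse")
      .enableHiveSupport()
      .getOrCreate()

    // Assumes the ratings Dataset from the earlier sketch has been
    // registered as a temporary view named "ratings".
    val movieStats = spark.table("ratings")
      .groupBy("movieId")
      .agg(avg("rating").as("avg_rating"), count("*").as("num_ratings"))

    // Append keeps each run incremental; Parquet is the storage format,
    // so the table can also be read by Hive and Impala.
    movieStats.write
      .mode(SaveMode.Append)
      .format("parquet")
      .saveAsTable("movielens.movie_stats") // hypothetical database.table

    spark.stop()
  }
}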

We will use the following tools in this project: Spark, Hive, HDFS, Scala, sbt, Parquet, and Impala.
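
Once the warehouse tables exist, the same SQL can be run from the Hive CLI, from impala-shell, or from Spark itself. Here is a hypothetical query against the movie_stats table sketched above:

import org.apache.spark.sql.SparkSession

object QueryWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QueryWarehouse")
      .enableHiveSupport()
      .getOrCreate()

    // The same statement could be pasted into impala-shell or the Hive CLI.
    spark.sql(
      """SELECT movieId, avg_rating, num_ratings
        |FROM movielens.movie_stats
        |WHERE num_ratings >= 100
        |ORDER BY avg_rating DESC
        |LIMIT 10""".stripMargin
    ).show()

    spark.stop()
  }
}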

Similar Projects

Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website, from the first mouse click to the last.

In this Spark project, we bring processing to the speed layer of the Lambda architecture, which opens up capabilities to monitor application performance in real time, measure real-time comfort with applications, and provide real-time security alerting.

In this project, we will look at two database platforms, MongoDB and Cassandra, examine the philosophical differences in how these databases work, and perform analytical queries against them.

Curriculum For This Mini Project

24-Sep-2016: 02h 37m
25-Sep-2016: 03h 40m