Hadoop Project for Beginners-SQL Analytics with Hive

In this Hadoop project, learn about the features in Hive that allow us to perform analytical queries over large datasets.
What will you learn

Roadmap of the project
Understanding serialization and deserialization and how they work
Setting up the environment in Cloudera Manager
Downloading and understanding the dataset
Understanding the schema of the dataset
Moving the data from MySQL to HDFS
Data ingestion/transformation using Sqoop, Spark, and Hive
Creating and executing a Sqoop job
Using Append to increase the performance and speed of loading the data to HDFS
Creating your Hive table and troubleshooting it
Using Parquet and XPath to access the schema
Writing aggregate and SELECT queries using UDAFs
Hive versus MySQL database
Rollup and Cube in the context of grouping sets, and aggregation using windowing functions
Query optimizations in Hive
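
As a quick preview of the Rollup and Cube topic listed above, here is a minimal Python sketch of which grouping sets each one expands to. It is an illustration only, not Hive code and not taken from the project material; the column names country, state and city are hypothetical. ROLLUP produces every prefix of the column list, while CUBE produces every subset.

```python
from itertools import combinations


def rollup_sets(cols):
    """Grouping sets for ROLLUP: each prefix of the column list, longest first."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Grouping sets for CUBE: every subset of the column list, largest first."""
    return [combo
            for n in range(len(cols), -1, -1)
            for combo in combinations(cols, n)]


# Hypothetical column names, chosen only for illustration.
cols = ["country", "state", "city"]
rollup = rollup_sets(cols)
cube = cube_sets(cols)

for name, sets in (("ROLLUP", rollup), ("CUBE", cube)):
    print(name, len(sets), "grouping sets")
    for s in sets:
        print("   ", s)
```

For n columns, ROLLUP yields n + 1 grouping sets and CUBE yields 2^n, which is why Cube is the more expensive of the two on wide groupings.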

Project Description

In this Hive project, we want to take a deeper dive into some analytical features in Hive. SQL is still very dominant and will remain so for the foreseeable future. Most big data tools have been adapted to allow users to interact with them using the familiar SQL language. This is because of the years of knowledge and skill that have gone into training, acceptance, tooling, standards development and re-engineering. So in many cases, using these cool features of SQL to access data answers a lot of analytical questions without our ever needing to resort to machine learning, BI or data mining.

In this big data project, we want to look at the features in Hive that allow us to perform analytical queries over large datasets.

We will be using the AdventureWorks dataset stored in a MySQL database. Therefore, we will need to ingest and transform the data before we proceed to analytics.

Curriculum For This Mini Project

Cloning the dataset
Understanding the dataset
Load the data
Query the data
Create a Sqoop job
Executing the Sqoop job
Why is append used?
Build Hive tables on top of the data
Troubleshooting Hive table
Using Parquet and XPath
Select statement
Use case based aggregations
Q&A - the problem statement
Q&A - Hive versus MySQL database
Enhancing aggregate functions
Grouping sets
Rollup versus Cube
Windowing analytic functions
Properties of windowing analytic functions
Solving an example - finding %
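
To give a flavour of the final topic, the sketch below computes each row's percentage of the grand total with a windowed SUM. It is an assumption-laden illustration rather than the project's own solution: it runs against an in-memory SQLite database (version 3.25 or later, for window function support) instead of Hive, and the sales table, its columns and its rows are made up. An empty OVER () clause makes the window span the whole result set.

```python
import sqlite3

# In-memory SQLite stands in for Hive here; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 50.0), ("West", 30.0), ("North", 20.0)],
)

# SUM(amount) OVER () is the total over all rows, repeated on every row,
# so each row can be divided by it without a separate subquery or join.
query = """
SELECT region,
       amount,
       100.0 * amount / SUM(amount) OVER () AS pct_of_total
FROM sales
ORDER BY region
"""

rows = conn.execute(query).fetchall()
for region, amount, pct in rows:
    print(f"{region:<6} {amount:>6.1f} {pct:>6.1f}%")

conn.close()
```

Adding PARTITION BY inside the OVER clause would change the denominator from the grand total to a per-group total, which is the usual next step for "percentage within category" questions.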