Tough engineering choices with large datasets in Hive Part - 2

This project continues the previous Hive project, "Tough engineering choices with large datasets in Hive Part - 1", in which we work on processing large datasets using Hive.

What will you learn

  • Common misuses and abuses of Hive
  • How to use and interpret Hive's EXPLAIN command (see the sketch after this list)
  • File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC, and Parquet)
  • Compression
  • Spark and Hive for transformations
  • Hive and Impala - making choices
  • Execution engines and performance
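
As a small taste of the EXPLAIN command, here is a minimal HiveQL sketch; the page_views table and its columns are hypothetical:

    -- Hypothetical table: page_views(user_id STRING, url STRING, view_time TIMESTAMP)
    -- EXPLAIN prints the query plan (stages and operators) instead of running the query.
    EXPLAIN
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    GROUP BY user_id;

    -- EXPLAIN EXTENDED adds lower-level detail such as input paths and serde settings.
    EXPLAIN EXTENDED
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    GROUP BY user_id;

Reading the printed stages (for example, how many map-reduce stages a query compiles to) is one of the quickest ways to spot an expensive query before it runs.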

Project Description

The use of Hive, or of the Hive metastore, is so ubiquitous in big data engineering that using the tool efficiently is a factor in the success of many projects. Whether integrating with Spark or using Hive as an ETL tool, many projects fail or succeed as they grow in scale and complexity because of decisions made early on.

In this big data project on Hive, we will explore how to use Hive efficiently. This Hive project follows an exploratory pattern rather than a project-building pattern; the goal is to explore Hive in uncommon ways, toward mastery.

We will use different datasets across these sessions, exploring Hadoop file formats such as text, CSV, JSON, ORC, Parquet, Avro, and SequenceFile. We will also look at compression and the different codecs, and examine the performance of each format when integrating with either Spark or Impala.
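
To make that concrete, here is a minimal sketch of what such a comparison can look like in HiveQL; the logs tables and their columns are hypothetical, and the codecs shown are simply common choices:

    -- Hypothetical raw table in plain text.
    CREATE TABLE logs_text (id BIGINT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- The same schema stored as ORC, with ZLIB compression set per table.
    CREATE TABLE logs_orc (id BIGINT, msg STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB');

    -- The same schema stored as Parquet; Snappy compression via a session setting.
    SET parquet.compression = SNAPPY;
    CREATE TABLE logs_parquet (id BIGINT, msg STRING)
    STORED AS PARQUET;

    -- Copy the data into each format, then compare on-disk size and scan times.
    INSERT OVERWRITE TABLE logs_orc SELECT * FROM logs_text;
    INSERT OVERWRITE TABLE logs_parquet SELECT * FROM logs_text;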

The idea is to explore enough that we can make a reasoned argument about what to do, and what not to do, in any given big data scenario.

Curriculum For This Mini Project

 
  • Agenda for the Session (03m)
  • What is ordering? (07m)
  • Order By and Group By (32m)
  • Order By, Sort By, Distribute By and Cluster By (see the sketch after this list) (13m)
  • Sampling - Random and Bucket Sampling (18m)
  • Installing Hortonworks Sandbox and Ambari Overview (04m)
  • Block Sampling (06m)
  • Hive Execution Engines - Spark, Tez and MapReduce (39m)
  • Cross Join (11m)
  • Q & A (19m)
  • Recap of the Previous Session (08m)
  • Joins (05m)
  • Map Join and Reduce Join (16m)
  • Left Semi Join (06m)
  • Bucket Map Join (08m)
  • Sort Merge Bucket Map Join (01m)
  • Skew Join (03m)
  • Explain (03m)
  • Indexes (03m)
  • Bitmap Indexes (09m)
  • File Types - Sequence, Avro, ORC, Parquet (35m)
  • Compression and Compression Codecs (21m)
  • Interaction with Spark (12m)
  • Q & A (06m)
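
To give a flavor of the ordering sessions above, here is a minimal HiveQL sketch contrasting the four ordering clauses; the sales table and its columns are hypothetical:

    -- ORDER BY: a total order over the result, forced through a single reducer.
    SELECT * FROM sales ORDER BY amount DESC;

    -- SORT BY: sorts within each reducer only, so the overall output is partially ordered.
    SELECT * FROM sales SORT BY amount DESC;

    -- DISTRIBUTE BY: sends rows with the same region to the same reducer, without sorting.
    SELECT * FROM sales DISTRIBUTE BY region;

    -- DISTRIBUTE BY + SORT BY: group rows by region per reducer, then sort within each.
    SELECT * FROM sales DISTRIBUTE BY region SORT BY amount DESC;

    -- CLUSTER BY region: shorthand for DISTRIBUTE BY region SORT BY region.
    SELECT * FROM sales CLUSTER BY region;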