Tough engineering choices with large datasets in Hive Part - 2

This project continues the previous Hive project, "Tough engineering choices with large datasets in Hive Part - 1", and focuses on processing big data sets using Hive.
What are the prerequisites for this project?
  • It is expected that students have a fair knowledge of Hadoop and Hive.
  • An installation of the Cloudera Quickstart VM or any other Hadoop cluster.
  • We will also explore the Tez execution engine. Tez is currently available in the Hortonworks HDP sandbox, so students are encouraged to download and set up this sandbox as well. It is not mandatory, but it is a useful complement.

What will you learn

  • Common misuse/abuse of Hive
  • How to use and interpret Hive's explain command
  • File formats and their relative performance (Text, JSON, SequenceFile, Avro, ORC and Parquet)
  • Compression
  • Spark and Hive for transformation
  • Hive and Impala - making choices
  • Execution engines and performance
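
As a rough illustration of how the explain command and the execution engine topics fit together, here is a minimal Python sketch that only builds HiveQL text. It assumes the standard `hive.execution.engine` setting (values `mr`, `tez`, `spark`) and the `EXPLAIN` statement; the helper name `explain_script` and the `sales` table are hypothetical, and nothing here connects to a cluster.

```python
# Hypothetical helper: builds HiveQL text only; it does not talk to Hive.
VALID_ENGINES = ("mr", "tez", "spark")  # values of hive.execution.engine


def explain_script(query: str, engine: str = "mr") -> str:
    """Return a script that selects an engine and asks Hive for the query plan."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {VALID_ENGINES}")
    # Drop surrounding whitespace and any trailing semicolon before wrapping.
    body = query.strip().rstrip(";")
    return f"SET hive.execution.engine={engine};\nEXPLAIN {body};"


if __name__ == "__main__":
    # 'sales' is a made-up table name used purely for illustration.
    print(explain_script("SELECT country, COUNT(*) FROM sales GROUP BY country", "tez"))
```

Running the same query through this helper with different engines, and then comparing the plans Hive prints, is one way to start reading the stages that the explain command reports.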

Project Description

Hive and the Hive metastore are so ubiquitous in big data engineering that using the tool efficiently is a factor in the success of many projects. Whether integrating with Spark or using Hive as an ETL tool, many projects succeed or fail as they grow in scale and complexity because of decisions made early on.

In this big data project, we will explore how to use Hive efficiently. The format is exploratory rather than a project build: the goal is to explore Hive in uncommon ways on the path towards mastery.

We will use different datasets in these sessions and explore Hadoop file formats such as text, CSV, JSON, ORC, Parquet, Avro, and SequenceFile. We will also look at compression and different codecs, and compare the performance of each when integrating with either Spark or Impala.
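
To make the file format comparison concrete, the sketch below generates one `CREATE TABLE ... AS SELECT` statement per storage format, so the same data can be copied into each layout and then measured. It is only an illustration: the `ctas` helper and the `events` table are hypothetical, the `STORED AS` keywords are the standard Hive ones (`STORED AS AVRO` needs a reasonably recent Hive release), and JSON is left out because it is typically handled through a SerDe rather than a `STORED AS` keyword.

```python
# Hypothetical helper: emits HiveQL text only; nothing is executed against Hive.
STORAGE_CLAUSES = {
    "text": "STORED AS TEXTFILE",
    "sequencefile": "STORED AS SEQUENCEFILE",
    "avro": "STORED AS AVRO",
    "orc": "STORED AS ORC",
    "parquet": "STORED AS PARQUET",
}


def ctas(source: str, fmt: str) -> str:
    """Build a CTAS statement copying `source` into a table stored as `fmt`."""
    clause = STORAGE_CLAUSES[fmt]  # raises KeyError for an unsupported format
    return f"CREATE TABLE {source}_{fmt} {clause} AS SELECT * FROM {source};"


if __name__ == "__main__":
    # 'events' is a made-up source table used purely for illustration.
    for fmt in STORAGE_CLAUSES:
        print(ctas("events", fmt))
```

Once the copies exist, running the same query against each table gives a like-for-like basis for comparing formats.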

The idea is to explore enough that we can make a reasoned argument about what to do, or not do, in any given big data scenario.

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...

What is Hackerday?

Stay up to date with technology trends by working on projects

Live online coding sessions led by industry experts

Build 2-4 projects a month, each lasting 6 hours and designed to teach you advanced concepts

Code in groups and connect with your community