Online Hadoop Projects - Solving the Small File Problem in Hadoop

In this Hadoop project, we continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Overview of Hadoop small file problem, its causes, and solutions

  • Understanding the Hadoop small file problem: what small files are and how they are generated

  • Effects of the small file problem

  • What an InputSplit is and how it works

  • The small file problem when ingesting with the CLI and Sqoop

  • The small file problem in streaming

  • Solution (streaming): preprocessing and storing in a NoSQL database

  • Solving the small file problem in a streaming context using Flume

  • What HDFS is and how it is architected

  • Solving the small file problem in batch mode by merging files before storing them in HDFS

  • Understanding SequenceFiles and how to access them

  • Solving the small file problem in batch mode using SequenceFile

  • Solving the small file problem in batch mode using compression

  • Solving the small file problem in batch mode using CombineFileInputFormat
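As an illustration of the "merge before storing in HDFS" idea listed above, here is a minimal sketch that concatenates many small local files into fewer, larger ones before they are uploaded. It is not taken from the project itself: the function name `merge_small_files`, the `merged-NNNNN.txt` naming scheme, and the 128 MiB target are assumptions chosen for the example, and it only makes sense for line-oriented data where simple concatenation preserves the records.

```python
from pathlib import Path


def merge_small_files(src_dir, dest_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate the files in src_dir into larger files in dest_dir.

    A new output file is started once the current one has reached
    target_bytes (assumed here to match a typical HDFS block size).
    Returns the list of merged file paths. Hypothetical helper.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    merged = []      # paths of the output files created so far
    out = None       # currently open output file handle
    written = 0      # bytes written to the current output file

    try:
        for path in sorted(p for p in src.iterdir() if p.is_file()):
            # Roll over to a new output file when the current one is full.
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                out_path = dest / f"merged-{len(merged):05d}.txt"
                out = out_path.open("wb")
                merged.append(out_path)
                written = 0

            data = path.read_bytes()
            out.write(data)
            written += len(data)
            # Keep records from different files on separate lines.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
                written += 1
    finally:
        if out is not None:
            out.close()

    return merged
```

The merged files can then be copied into HDFS in the usual way, for example with `hdfs dfs -put`, so that the cluster stores a handful of block-sized files instead of thousands of tiny ones.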

Project Description

Hadoop's distributed file system (HDFS) was engineered to favor a small number of large files over a large number of small files. However, we usually have no control over how data arrives. Much of the data ingested into a data infrastructure arrives in small pieces, and whether or not we are implementing a data lake on HDFS, we will have to deal with these inputs.
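To make the cost of many small files concrete, the following back-of-the-envelope sketch compares the NameNode metadata needed to hold the same volume of data as small files versus block-sized files. The figure of roughly 150 bytes per file or block object is a commonly quoted rule of thumb, not an exact number, and the 128 MiB block size is the default in recent Hadoop releases but is configurable; both are assumptions here, as are the function names.

```python
import math

# Assumed rule of thumb: each file and each block is an object in
# NameNode memory costing on the order of 150 bytes.
BYTES_PER_OBJECT = 150
# Assumed HDFS block size (configurable per cluster).
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(num_files, file_size, block_size=BLOCK_SIZE):
    """Count the file and block objects needed for num_files equal-sized files."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Rough NameNode memory estimate in bytes, under the assumptions above."""
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


MIB = 1024 * 1024
total = 10 * 1024 * MIB  # 10 GiB of data in both scenarios

small = namenode_bytes(total // MIB, MIB)                  # 10,240 files of 1 MiB
large = namenode_bytes(total // (128 * MIB), 128 * MIB)    # 80 files of 128 MiB

print(f"1 MiB files:   {small:,} bytes of metadata")
print(f"128 MiB files: {large:,} bytes of metadata")
print(f"ratio: {small // large}x")
```

Under these assumptions the same 10 GiB needs 128 times more NameNode metadata when stored as 1 MiB files. With a typical one-split-per-file input format, it would also be processed by 10,240 map tasks rather than 80.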

In this online Hadoop project, we continue the series on data engineering by discussing and implementing various ways to resolve the small file problem in Hadoop.

We will start by defining what the problem means, how it can arise unavoidably, how to identify bottlenecks in a Hadoop cluster caused by small files, and a variety of ways to solve them.

Similar Projects

Big Data Project: Hive Project - Denormalize JSON Data and Analyse It with Hive Scripts
In this Hive project, you will work on denormalizing JSON data and create Hive scripts with the ORC file format.
Big Data Project: Data Processing with Spark SQL
In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.
Big Data Project: NoSQL Project on Yelp Dataset using HBase and MongoDB
In this NoSQL project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or querying.
Big Data Project: Design a Hadoop Architecture
Learn to design a Hadoop architecture and understand how to store data using data acquisition tools in Hadoop.

Curriculum For This Mini Project

 
  Overview of the Project (02m)
  Understanding the Hadoop Small File Problem (33m)
  Effect of the Small File Problem (05m)
  How InputSplit Works (09m)
  InputSplit and Block Boundary Overlap (08m)
  How the Small File Problem Arises in Batch Mode (05m)
  Small File Problem in Batch Mode - Using CLI and Sqoop (11m)
  How the Small File Problem Arises in a Streaming Context (17m)
  Solving the Small File Problem in a Streaming Context using Flume (18m)
  Solving the Small File Problem in a Streaming Context by Storing in NoSQL (09m)
  Quick Recap of the Previous Session (22m)
  Solving the Small File Problem in Batch Mode - Merging before Storing in HDFS (42m)
  Solving the Small File Problem in Batch Mode using SequenceFile (23m)
  Solving the Small File Problem in Batch Mode using Compression (11m)
  Solving the Small File Problem in Batch Mode using CombineFileInputFormat (07m)