Each project comes with 2-5 hours of micro-videos explaining the solution.
Code & Dataset
Get access to 50+ solved projects with IPython notebooks and datasets.
Add project experience to your LinkedIn/GitHub profiles.
Overview of Hadoop small file problem, its causes, and solutions
Understanding the Hadoop small file problem: what small files are and how they are generated
Effects of the small file problem
What is Get Input Split and how does it work?
Small file problem using CLI and Sqoop
Small file problem in streaming
Solution (Streaming): Preprocessing and storing in a NoSQL database
Solving small file problem in the streaming context using Flume
What is HDFS, and what is its architecture?
Solving small file problem in the Batch Mode context by merging before storing in HDFS
Understanding Sequence files and how to access them
Solving small file problem in the Batch Mode context by using Sequence File
Solving small file problem in the Batch Mode context by using Compression
Solving small file problem in the Batch Mode context by using CombineFileInputFormat
We have learned that Hadoop's distributed file system (HDFS) was engineered to favor a small number of large files over a large number of small files. In practice, however, we rarely control how data arrives. Much of the data ingested into a data platform arrives in small pieces, and whether or not we are building a data lake on HDFS, we have to deal with these inputs.
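To make the cost of many small files concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the commonly cited rule of thumb of about 150 bytes of NameNode memory per namespace object (one object per file plus one per block) and a 128 MiB block size. Both figures are assumptions for illustration, not measurements from any particular cluster, and the function names are hypothetical.

```python
import math

# Assumption: ~150 bytes of NameNode heap per namespace object (rule of thumb only).
BYTES_PER_OBJECT = 150
# Assumption: 128 MiB HDFS block size; clusters may be configured differently.
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_count, file_size, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_count, file_size, block_size=BLOCK_SIZE):
    """Estimate NameNode memory needed to track the given files."""
    return namenode_objects(file_count, file_size, block_size) * BYTES_PER_OBJECT


TOTAL = 1024 ** 4   # 1 TiB of data in both scenarios
MIB = 1024 ** 2
GIB = 1024 ** 3

small = namenode_bytes(TOTAL // MIB, MIB)   # stored as 1 MiB files
large = namenode_bytes(TOTAL // GIB, GIB)   # stored as 1 GiB files

print(f"1 MiB files: {small / MIB:.1f} MiB of NameNode memory")
print(f"1 GiB files: {large / MIB:.1f} MiB of NameNode memory")
```

Under these assumptions, the same 1 TiB of data needs roughly 300 MiB of NameNode memory as 1 MiB files but only about 1.3 MiB as 1 GiB files, which is why the file count matters more than the data volume.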
In this online Hadoop project, we continue the data engineering series by discussing and implementing various ways to resolve the small file problem in Hadoop.
We will start by defining the problem and explaining why it is often unavoidable, then look at how to identify the bottlenecks that small files cause in a Hadoop cluster and the various ways to resolve them.
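As a taste of the batch-mode approach of merging files before storing them in HDFS, the following is a minimal sketch rather than the project's actual solution. It concatenates local small files into larger batch files of roughly a target size. The function name, the `batch-00000.dat` naming scheme, and the assumption that files can be safely concatenated (for example, newline-terminated text records) are all hypothetical choices for illustration.

```python
from pathlib import Path


def merge_small_files(src_dir, dst_dir, target_bytes):
    """Concatenate files in src_dir into batch files of roughly target_bytes each.

    Assumes plain concatenation preserves record boundaries, e.g. every
    input file consists of newline-terminated records.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    batches = []   # paths of the batch files written so far
    out = None     # handle of the batch file currently being filled
    written = 0    # bytes written to the current batch

    for path in sorted(p for p in src.iterdir() if p.is_file()):
        # Start a new batch once the current one has reached the target size.
        if out is None or written >= target_bytes:
            if out is not None:
                out.close()
            batch_path = dst / f"batch-{len(batches):05d}.dat"
            out = batch_path.open("wb")
            batches.append(batch_path)
            written = 0
        data = path.read_bytes()
        out.write(data)
        written += len(data)

    if out is not None:
        out.close()
    return batches
```

The resulting batch files could then be uploaded with a command such as `hdfs dfs -put`, so that HDFS tracks a handful of large files instead of thousands of small ones. A target close to the HDFS block size is a reasonable starting point, though the right value depends on the cluster.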