Online Hadoop Projects - Solving the Small File Problem in Hadoop

In this Hadoop project, we continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Overview of the Hadoop small file problem, its causes, and solutions
Understanding the Hadoop small file problem: what small files are and how they are generated
Effects of the small file problem
What is InputSplit and how does it work?
The small file problem when ingesting with the CLI and Sqoop
The small file problem in streaming
Solution (streaming): preprocessing and storing in a NoSQL database
Solving the small file problem in the streaming context using Flume
What HDFS is and how it is architected
Solving the small file problem in the batch mode context by merging before storing in HDFS
Understanding SequenceFiles and how to access them
Solving the small file problem in the batch mode context by using SequenceFile
Solving the small file problem in the batch mode context by using compression
Solving the small file problem in the batch mode context by using CombineFileInputFormat

Project Description

Hadoop's distributed file system (HDFS) was engineered to favor a small number of large files over a large number of small files. In practice, however, we usually have no control over how data arrives. Much of the data ingested into a data infrastructure arrives in small pieces, and whether or not we are implementing a data lake on HDFS, we will have to deal with these inputs.
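To make the cost of many small files concrete, the sketch below counts the file and block objects the NameNode would have to track for the same volume of data stored in two different ways. It assumes the 128 MB default block size of recent Hadoop versions and the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per object. Both figures are approximations, and the function names are hypothetical, not part of any Hadoop API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MB)
BYTES_PER_OBJECT = 150           # rule-of-thumb heap cost per object (assumption)


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count the file objects plus block objects for a list of file sizes in bytes."""
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    return len(file_sizes) + blocks


def estimated_heap_bytes(file_sizes, block_size=BLOCK_SIZE,
                         bytes_per_object=BYTES_PER_OBJECT):
    """Rough NameNode heap estimate; a ballpark figure, not a measurement."""
    return namenode_objects(file_sizes, block_size) * bytes_per_object


if __name__ == "__main__":
    mib = 1024 * 1024
    small = [1 * mib] * 10240    # 10 GiB stored as 1 MiB files
    large = [128 * mib] * 80     # the same 10 GiB stored as 128 MiB files
    print("small files:", namenode_objects(small), "objects")
    print("large files:", namenode_objects(large), "objects")
```

The data volume is identical in both cases. Only the number of objects the NameNode must hold in memory changes, which is why consolidating small files matters.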

In this online Hadoop project, we continue the data engineering series by discussing and implementing various ways to resolve the small file problem in Hadoop.

We will start by defining what the small file problem means and how it inevitably arises, then look at how to identify the bottlenecks it causes in a Hadoop cluster and at a variety of ways to solve it.
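One of the batch-mode approaches covered in this project is merging small files before they are stored in HDFS. A minimal sketch of that idea follows. It is not the project's own implementation. It groups local files into batches close to a target size and concatenates each batch into one larger file. It assumes the inputs are record-oriented files that can safely be concatenated byte for byte (for example, newline-terminated text), and the function and file names are hypothetical.

```python
import shutil
from pathlib import Path


def plan_batches(paths, target_bytes):
    """Greedily group files, in sorted order, into batches of at most target_bytes.

    A single file larger than target_bytes gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path in sorted(Path(p) for p in paths):
        size = path.stat().st_size
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def merge_batches(batches, out_dir):
    """Concatenate each batch into one output file and return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = []
    for index, batch in enumerate(batches):
        target = out_dir / f"merged-{index:05d}.dat"
        with target.open("wb") as sink:
            for source in batch:
                with source.open("rb") as src:
                    shutil.copyfileobj(src, sink)
        merged.append(target)
    return merged
```

With a target close to the HDFS block size, the merged files could then be uploaded with a command such as `hdfs dfs -put`, so that the cluster stores a few block-sized files instead of thousands of tiny ones.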

Similar Projects

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

Analyze clickstream data of a website using Hadoop Hive to increase sales by optimizing every aspect of the customer experience on the website from the first mouse click to the last.

In this big data project, we will continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing using Spark.

Curriculum For This Mini Project

Overview of the Project (02m)
Understanding the Hadoop Small File Problem (33m)
Effect of the Small File Problem (05m)
How InputSplit Works (09m)
InputSplit and Block Boundary Overlap (08m)
How the Small File Problem Arises in Batch Mode (05m)
Small File Problem in Batch Mode: Using CLI and Sqoop (11m)
How the Small File Problem Arises in a Streaming Context (17m)
Solving the Small File Problem in a Streaming Context using Flume (18m)
Solving the Small File Problem in a Streaming Context by Storing in NoSQL (09m)
Quick Recap of the Previous Session (22m)
Solving the Small File Problem in Batch Mode: Merging before Storing in HDFS (42m)
Solving the Small File Problem in Batch Mode using SequenceFile (23m)
Solving the Small File Problem in Batch Mode using Compression (11m)
Solving the Small File Problem in Batch Mode using CombineFileInputFormat (07m)