
Solving the Hadoop Small File Problem

In this project, we continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.


What will you learn

  • What the small file problem in Hadoop is
  • How it arises (batch and streaming modes)
  • Solution (Streaming): Using Flume
  • Solution (Streaming): Preprocessing and storing in a NoSQL database
  • Solution (Batch): Merging before storing in HDFS
  • Solution (Batch): SequenceFile
  • Solution (Batch): Compression
  • Solution (Batch): CombineFileInputFormat

What will you get

  • Access to a recording of the complete project
  • Access to all material related to the project, such as data files and solution files

Prerequisites

  • A little knowledge of Java is nice to have but not mandatory

Project Description

Hadoop's distributed file system (HDFS) was engineered to favor a small number of large files over many small ones. In practice, however, we rarely control how data arrives. Much of the data ingested into a data infrastructure comes in small pieces, and whether or not we are implementing a data lake on HDFS, we have to deal with these inputs.
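To make the batch-side idea of merging before storing concrete, here is a minimal sketch in Python. It is not the project's own solution: the function name `merge_small_files`, the `*.log` pattern, and the assumption that each input holds newline-delimited records are all hypothetical choices made for illustration.

```python
from pathlib import Path


def merge_small_files(src_dir, dest_file, pattern="*.log"):
    """Concatenate every file in src_dir matching pattern into dest_file.

    Assumes (hypothetically) that inputs hold newline-delimited records.
    Returns the number of files merged.
    """
    # Materialize the list first so the output file is never picked up as input.
    sources = sorted(Path(src_dir).glob(pattern))
    with open(dest_file, "wb") as out:
        for path in sources:
            data = path.read_bytes()  # inputs are small, so reading whole is fine
            out.write(data)
            # Keep records from adjacent files from running together.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(sources)
```

The single merged file could then be loaded into HDFS in one step, for example with `hdfs dfs -put`, so that the cluster stores one larger file instead of many tiny ones.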

In this hackerday, we continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.

We will start by defining what the small file problem is and how it inevitably arises, then look at how to identify the bottlenecks it causes in a cluster and the various ways to solve it.

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...