CombineFileInputFormat vs SequenceFile in Hadoop


1 Answer(s)


Hi Sonu,

The main trade-off between SequenceFiles and CombineFileInputFormat is that CombineFileInputFormat only works around part of the small files problem; it does not get rid of it. You still have lots of small files stored in HDFS, and every one of them adds to your NameNode and DataNode memory footprints. Processed as-is via MapReduce, each of those files is a very small unit of work. The special InputFormat combines those units of work into larger ones at job time, but it does not solve the real problem, which is how the data is stored. If you fix the real problem by merging your data ahead of time (for example, into SequenceFiles), you won't have to worry about combining small tasks into larger ones.
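To put rough numbers on that, here is a minimal back-of-the-envelope sketch in plain Java (no Hadoop dependencies). Everything in it is an assumption for illustration: the class and method names are made up, the 150 bytes per NameNode object is only a commonly quoted rule of thumb, the file counts and sizes are invented, and the greedy packing is a simplified stand-in for what CombineFileInputFormat does (the real implementation also takes node and rack locality into account).

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of any Hadoop API.
class SmallFilesSketch {
    // Assumed rule of thumb: roughly 150 bytes of NameNode heap per file or block object.
    static final long BYTES_PER_NAMENODE_OBJECT = 150L;

    // Counts NameNode objects as one per file plus one per block of that file.
    static long nameNodeObjects(long[] fileSizes, long blockSize) {
        long objects = 0;
        for (long size : fileSizes) {
            long blocks = (size + blockSize - 1) / blockSize; // ceiling division
            objects += 1 + blocks;
        }
        return objects;
    }

    // Simplified model of combining: greedily pack whole files into splits
    // until adding the next file would exceed maxSplitSize.
    static int combinedSplits(long[] fileSizes, long maxSplitSize) {
        int splits = 0;
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitSize) {
                splits++;
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long blockSize = 128 * mib;

        // Made-up workload: 10,000 files of 1 MiB each.
        long[] small = new long[10_000];
        Arrays.fill(small, mib);

        // The same data merged ahead of time into a single large file.
        long[] merged = { 10_000 * mib };

        long smallObjects = nameNodeObjects(small, blockSize);
        long mergedObjects = nameNodeObjects(merged, blockSize);

        System.out.println("Map tasks, one per small file:   " + small.length);
        System.out.println("Map tasks after combining:       " + combinedSplits(small, blockSize));
        System.out.println("NameNode objects, small files:   " + smallObjects
                + " (~" + smallObjects * BYTES_PER_NAMENODE_OBJECT + " bytes)");
        System.out.println("NameNode objects, merged file:   " + mergedObjects
                + " (~" + mergedObjects * BYTES_PER_NAMENODE_OBJECT + " bytes)");
    }
}
```

With these made-up numbers, combining cuts the map tasks from 10,000 down to 79, but the NameNode still tracks 20,000 objects. Merging the data first brings that down to 80 objects, and the job gets reasonably sized splits without any special InputFormat.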

Hope this helps.

Thanks.