May 16 2014 12:58 AM
Agreed, all the examples involve data that was already available in HDFS; that was for illustration.
Before moving on to the answer, let us understand the issues with small files:
1. Storage: Each file stored in HDFS consumes roughly 150 bytes of NameNode (NN) memory for its metadata (and each of its blocks costs about the same again).
So the more small files there are, the more NN memory is consumed, while the actual data stored in HDFS remains small.
This forces users to pack their files into sequence files so that NN memory is freed up to track more data.
In this case, once the small files are packed into a sequence file and cleaned up from HDFS, the freed NN memory can be utilised for storing more data.
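As a back-of-envelope illustration of the storage point: the ~150 bytes per namespace object is the commonly quoted figure, but the file sizes and counts below are made-up assumptions, not measurements.

```python
# Rough NameNode memory estimate: each file and each of its blocks is
# tracked as an in-memory object of roughly 150 bytes (commonly quoted
# figure; exact size varies by Hadoop version).
BYTES_PER_OBJECT = 150

def nn_memory_bytes(num_files, blocks_per_file=1):
    """Approximate NN heap used by the file objects plus their block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Hypothetical: 10 million 1 KB files (~10 GB of actual data).
print(nn_memory_bytes(10_000_000))  # 3_000_000_000 bytes (~3 GB of NN heap)

# Same ~10 GB packed into 80 sequence files of ~128 MB each.
print(nn_memory_bytes(80))          # 24_000 bytes of NN heap
```

The data volume is identical in both cases; only the metadata footprint on the NameNode changes, which is exactly why packing small files into sequence files helps.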
2. Processing: If a job processes many small files with the map-reduce framework, one map task is instantiated per file, and each task handles only a tiny amount of data.
The time taken for map task initialisation would be much higher than the time taken to actually process the small file, so those map task slots are largely wasted.
This also blocks other jobs, since all map slots would be occupied at that moment.
If we keep the small files in larger sequence files instead, each map task gets enough data to process in parallel with the other tasks, and the true benefits of parallel processing can be achieved.
So even in this case, where the files are already on HDFS and are then combined into a large sequence file, processing that sequence file for further analysis really helps.
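The processing argument can be sketched with the same kind of arithmetic. The per-task startup cost and throughput below are assumed round numbers for illustration only; real values depend on the cluster.

```python
# Illustrative map-task overhead comparison (startup cost and throughput
# are assumptions, not measurements from a real cluster).
TASK_STARTUP_S = 3        # assumed per-map-task JVM spin-up + scheduling cost
PROCESS_RATE_MB_S = 50    # assumed per-task processing throughput

def job_task_seconds(num_tasks, total_mb):
    """Total task-seconds: fixed startup overhead plus actual processing."""
    return num_tasks * TASK_STARTUP_S + total_mb / PROCESS_RATE_MB_S

total_mb = 10_000  # ~10 GB of input either way

# 10,000 x 1 MB small files -> 10,000 map tasks; almost all time is startup.
print(job_task_seconds(10_000, total_mb))  # 30200.0 task-seconds

# Same data as 80 x ~128 MB sequence-file splits -> 80 map tasks.
print(job_task_seconds(80, total_mb))      # 440.0 task-seconds
```

With small files the startup overhead dominates and the occupied map slots do almost no useful work; with sequence files the same data is processed with a fraction of the task-seconds, freeing slots for other jobs.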