May 16 2014 12:58 AM
Agreed, all the examples involve data that was already available in HDFS; that was for illustration.
Before moving on to the answer, let us understand the issues with small files:
1. Storage: Each file stored in HDFS consumes roughly 150 bytes of NameNode (NN) memory for its metadata (and each of its blocks costs about the same again).
So the more small files there are, the more NN memory is consumed, while the actual data stored in HDFS remains small.
This forces users to pack their files into sequence files so that NN memory is freed up to track more data.
In this case, once the small files are packed into a sequence file and cleaned up from HDFS, the freed NN memory can be utilised for storing more data.
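As a back-of-envelope illustration of the storage point: the ~150 bytes per namespace object is the commonly quoted figure, but the file sizes and counts below are made-up assumptions, not measurements.

```python
# Rough NameNode memory estimate: each file and each of its blocks is
# tracked as an in-memory object of roughly 150 bytes (commonly quoted
# figure; exact size varies by Hadoop version).
BYTES_PER_OBJECT = 150

def nn_memory_bytes(num_files, blocks_per_file=1):
    """Approximate NN heap used by the file objects plus their block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# Hypothetical: 10 million 1 KB files (~10 GB of actual data).
print(nn_memory_bytes(10_000_000))  # 3_000_000_000 bytes (~3 GB of NN heap)

# Same ~10 GB packed into 80 sequence files of ~128 MB each.
print(nn_memory_bytes(80))          # 24_000 bytes of NN heap
```

The data volume is identical in both cases; only the metadata footprint on the NameNode changes, which is exactly why packing small files into sequence files helps.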
2. Processing: If a job processes many small files with the map-reduce framework, one map task is instantiated per file, and each task handles only a tiny amount of data.
The time taken for map task initialisation would be much higher than the time taken to actually process the small file, so those map task slots are largely wasted.
This also blocks other jobs, since all map slots would be occupied at that moment.
If we keep the small files in larger sequence files instead, each map task gets enough data to process in parallel with the other tasks, and the true benefits of parallel processing can be achieved.
So even in this case, where the files are already on HDFS and are then combined into a large sequence file, processing that sequence file for further analysis really helps.
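The processing argument can be sketched with the same kind of arithmetic. The per-task startup cost and throughput below are assumed round numbers for illustration only; real values depend on the cluster.

```python
# Illustrative map-task overhead comparison (startup cost and throughput
# are assumptions, not measurements from a real cluster).
TASK_STARTUP_S = 3        # assumed per-map-task JVM spin-up + scheduling cost
PROCESS_RATE_MB_S = 50    # assumed per-task processing throughput

def job_task_seconds(num_tasks, total_mb):
    """Total task-seconds: fixed startup overhead plus actual processing."""
    return num_tasks * TASK_STARTUP_S + total_mb / PROCESS_RATE_MB_S

total_mb = 10_000  # ~10 GB of input either way

# 10,000 x 1 MB small files -> 10,000 map tasks; almost all time is startup.
print(job_task_seconds(10_000, total_mb))  # 30200.0 task-seconds

# Same data as 80 x ~128 MB sequence-file splits -> 80 map tasks.
print(job_task_seconds(80, total_mb))      # 440.0 task-seconds
```

With small files the startup overhead dominates and the occupied map slots do almost no useful work; with sequence files the same data is processed with a fraction of the task-seconds, freeing slots for other jobs.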