Question about Sequence Files



The examples in module 6 all seem to create the sequence file by reading input that is already in HDFS. I understand this is illustrative, but would this be the way it works in the real world? Doesn't it defeat the purpose to put all the small files into HDFS just to turn around and pack them into a sequence file? If they are already in HDFS, why bother? Even if one was going to delete the source files from HDFS after creating the sequence files this still seems inefficient. Would it not be more common to have the process that creates the sequence file read from a local file system and output the sequence file to HDFS? Or is that not possible?

2 Answer(s)



Hi David,

Agreed, all the examples use data that is already available in HDFS. That was for illustration.
Before answering, let us understand the issues with small files:
1. Storage: Each file stored in HDFS consumes roughly 150 bytes of NameNode (NN) memory, however small the file is.
So the more small files there are, the more NN memory is used, even though the actual data stored in HDFS is small.
This pushes users to keep their files in sequence files so that NN memory is freed up to track more data.
Solution:
If the small files are packed into a sequence file and the originals are then cleaned up from HDFS, the NN memory they occupied can be used for storing more data.
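To put rough numbers on the storage point, here is a minimal back-of-the-envelope sketch. It is only an estimate under stated assumptions: the 150-byte figure is the rule of thumb quoted above, applied to one file object plus one block object per file; the 10 million files of 100 KB each and the 128 MB block size are hypothetical values, and the class and method names are made up for illustration.

```java
// Rough NameNode memory estimate. All figures are illustrative assumptions.
class NameNodeMemoryEstimate {
    // Rule-of-thumb cost of one namespace object (a file or a block) in NN memory.
    static final long BYTES_PER_OBJECT = 150L;

    // Each file contributes one file object plus one object per block it uses.
    static long estimateBytes(long fileCount, long blocksPerFile) {
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    // Number of block-sized sequence files needed to hold the data (ceiling division).
    static long packedFileCount(long totalKb, long blockSizeKb) {
        return (totalKb + blockSizeKb - 1) / blockSizeKb;
    }

    public static void main(String[] args) {
        long smallFiles = 10_000_000L;     // hypothetical: 10 million small files
        long smallFileSizeKb = 100L;       // hypothetical: 100 KB each
        long blockSizeKb = 128L * 1024L;   // assumed 128 MB HDFS block size

        // Before packing: every small file has its own single block.
        long before = estimateBytes(smallFiles, 1);

        // After packing: the same data held in sequence files of one block each.
        long packedFiles = packedFileCount(smallFiles * smallFileSizeKb, blockSizeKb);
        long after = estimateBytes(packedFiles, 1);

        System.out.println("Before packing: " + before + " bytes of NN memory");
        System.out.println("After packing:  " + after + " bytes of NN memory");
    }
}
```

With these assumed inputs the estimate drops from about 3 GB of NN memory to a little over 2 MB, which is why packing pays off even when the small files have already landed in HDFS.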

2. Processing: If a MapReduce job processes many small files, many map tasks are launched, each handling only a small amount of data.
The time taken to initialise each map task is much higher than the time taken to process the small file itself, so the map task slots are largely wasted.
This also blocks other jobs, since all the map slots are occupied for that period.
Solution:
If we keep the small files in larger sequence files, each map task gets enough data to process in parallel with the other tasks, and the real benefits of parallel processing are achieved.
So even when the files are already on HDFS, combining them into a large sequence file really helps when that data is processed for further analysis.
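The processing point can be sketched the same way. This is a simplified estimate, not how the framework computes splits exactly: it assumes one map task per input split, a split size equal to a 128 MB block, and that a file smaller than a split still gets a split of its own. The 10,000 files of 1 MB each and the 2-second startup cost per task are hypothetical numbers, as are the class and method names.

```java
// Rough count of map tasks and their startup overhead. Inputs are illustrative.
class MapTaskEstimate {
    // One map task per split; a file smaller than a split still needs its own split.
    static long mapTasks(long fileCount, long fileSizeMb, long splitSizeMb) {
        long splitsPerFile = Math.max(1L, (fileSizeMb + splitSizeMb - 1) / splitSizeMb);
        return fileCount * splitsPerFile;
    }

    // Total time spent only on starting tasks, ignoring the actual processing.
    static long startupSeconds(long tasks, long secondsPerTask) {
        return tasks * secondsPerTask;
    }

    public static void main(String[] args) {
        long splitSizeMb = 128L;     // assumed split size = HDFS block size
        long startupSec = 2L;        // hypothetical startup cost per map task

        // 10,000 separate 1 MB files: one map task each.
        long smallFileTasks = mapTasks(10_000L, 1L, splitSizeMb);

        // The same 10,000 MB packed into one splittable sequence file.
        long packedTasks = mapTasks(1L, 10_000L, splitSizeMb);

        System.out.println("Small files: " + smallFileTasks + " map tasks, "
                + startupSeconds(smallFileTasks, startupSec) + " s of startup");
        System.out.println("Packed:      " + packedTasks + " map tasks, "
                + startupSeconds(packedTasks, startupSec) + " s of startup");
    }
}
```

Under these assumptions the job goes from 10,000 map tasks to 79, so far fewer slots are tied up by tasks that spend most of their time starting up.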


To answer your second question: yes, it is possible to read small files directly from the local filesystem.
The only thing we have to do is modify the conf object to point to the local filesystem, which can be done as follows:
Configuration conf = new Configuration();
// Java string literals need double quotes; newer Hadoop releases name this key "fs.defaultFS".
conf.set("fs.default.name", "file:///");
Job job = new Job(conf, "ImagesToSequenceFileConverter");
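Now whatever location is passed to this program as the input path, the code will look for the files on the local filesystem and not on HDFS.
Since this changes the default filesystem for every unqualified path, pass the output path as a fully qualified hdfs:// URI if the sequence file should still be written to HDFS.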

If you look at ImagesToSequenceFileConverter [sequence file assignment], just add conf.set("fs.default.name", "file:///"); at line #52, re-create the jar, and pass the local filesystem location where the images are stored. The code will then pick the files up directly from there.

Hope this helps.
Vote up if this works for you.
Happy learning!!