How does HDFS split files with non-fixed record sizes, such as log/text files?


2 Answers


You can specify the mechanism used to split the file.

Please look at the FileInputFormat documentation at the following link:

http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html
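For instance, with the newer org.apache.hadoop.mapreduce API you can bound the split sizes on the Job itself. Below is a minimal sketch, assuming a hypothetical input path /logs/input; the exact sizes are illustrative, not recommendations:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitConfigExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setInputFormatClass(TextInputFormat.class);
            // Hypothetical input path, for illustration only.
            FileInputFormat.addInputPath(job, new Path("/logs/input"));
            // Hint the framework toward splits between 64 MB and 128 MB;
            // actual splits are still derived from these bounds and the
            // HDFS block size.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        }
    }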

Hadoop has a concept of a 'split', which may comprise one or more HDFS blocks.
A map task operates on an individual split, so the problem of a line falling across two or more blocks is handled at the split level rather than the block level.
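Concretely, the split size that FileInputFormat works with is the HDFS block size clamped by the configured minimum and maximum; simplified from Hadoop's own FileInputFormat:

    // Simplified from FileInputFormat: a split defaults to one HDFS block,
    // clamped by the configured minimum and maximum split sizes.
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }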

The next question, then, is how Hadoop ensures a complete line is read when a line may span multiple splits.
To handle this, Hadoop performs a remote-read operation (taken care of in the RecordReader) until it reaches the end of the line (EOL).
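Here is a minimal standalone sketch of that convention using plain java.io (it is not Hadoop's actual LineRecordReader): every split except the first discards its first, possibly partial, line, and every split keeps reading past its nominal end until the straddling line is complete.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // A minimal sketch of the split-boundary convention described above.
    public class LineBoundaryReader {

        public static void readSplit(RandomAccessFile file, long start, long length)
                throws IOException {
            long end = start + length;
            file.seek(start);
            if (start != 0) {
                // Every split except the first discards its first (possibly
                // partial) line; the previous split's reader already consumed
                // it by reading past its own boundary.
                skipToEol(file);
            }
            // A line that *starts* at or before `end` belongs to this split,
            // so we may read past `end` to finish it -- in HDFS terms, a
            // remote read into the next block.
            while (file.getFilePointer() <= end) {
                String line = readLine(file);
                if (line == null) {
                    break; // end of file
                }
                System.out.println(line); // hand the record to the mapper
            }
        }

        private static void skipToEol(RandomAccessFile f) throws IOException {
            int c;
            while ((c = f.read()) != -1 && c != '\n') {
                // discard bytes up to and including the newline
            }
        }

        private static String readLine(RandomAccessFile f) throws IOException {
            int c = f.read();
            if (c == -1) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            while (c != -1 && c != '\n') {
                sb.append((char) c);
                c = f.read();
            }
            return sb.toString();
        }
    }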

Example:
If a map task gets a split that contains the start of a line but not its end, Hadoop will continue reading from the next split until it reaches the EOL of that line.
The next map task first seeks to the first EOL found in its split (that partial line was already read by the previous map task) and then starts reading from the following line.
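To make this concrete, here is a hypothetical driver for the sketch above: the split boundary at byte 8 falls in the middle of "bravo", yet each line is emitted exactly once.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.RandomAccessFile;

    public class LineBoundaryDemo {
        public static void main(String[] args) throws Exception {
            File f = File.createTempFile("split-demo", ".txt");
            f.deleteOnExit();
            try (FileWriter w = new FileWriter(f)) {
                // Byte offsets: "alpha\n" = 0-5, "bravo\n" = 6-11, "charlie\n" = 12-19.
                w.write("alpha\nbravo\ncharlie\n");
            }
            try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
                // The boundary at byte 8 is inside "bravo".
                LineBoundaryReader.readSplit(raf, 0, 8);                // prints alpha, bravo
                LineBoundaryReader.readSplit(raf, 8, raf.length() - 8); // prints charlie
            }
        }
    }

The first split reads past byte 8 to finish "bravo"; the second split skips everything up to the first EOL it finds, exactly as the answer describes.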