How does HDFS split files with non-fixed record sizes, such as log/text files?

If HDFS uses a fixed block size, the end of a block may fall in the middle of a log line. If a line of a log/text file is split this way, the information in that row cannot be read from a single data node, because the continuation of the line is stored on another node. How is this handled?

2 Answer(s)


You can specify the mechanism to split the file.

Please look at the documentation at the following link
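For instance, when using FileInputFormat you can bound the split size through configuration. A sketch of the relevant Hadoop 2+ properties (verify the names and defaults against the linked documentation for your version):

```xml
<!-- mapred-site.xml: bound the input split size for FileInputFormat -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>134217728</value> <!-- 128 MB lower bound -->
</property>
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value> <!-- 256 MB upper bound -->
</property>
```

Note these control split boundaries only; they do not change how HDFS stores blocks.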


Hadoop has the concept of a 'split', which may comprise one or more HDFS blocks.
A map task operates on an individual split, so the issue of a line being divided between two or more blocks is resolved at the split level.
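The split computation itself is simple arithmetic over byte offsets. A minimal sketch, with a hypothetical `compute_splits` helper (a real InputFormat also consults block locations and applies a slack factor to avoid a tiny final split):

```python
def compute_splits(file_len, split_size):
    """Divide a file of file_len bytes into (start, length) splits.

    Hypothetical sketch of the offset arithmetic only; splits are
    logical byte ranges and need not align with record boundaries.
    """
    splits = []
    start = 0
    while start < file_len:
        length = min(split_size, file_len - start)
        splits.append((start, length))
        start += length
    return splits

# A 250-byte file with 100-byte splits yields ranges 100 + 100 + 50.
assert compute_splits(250, 100) == [(0, 100), (100, 100), (200, 50)]
```

Because these ranges are computed without looking at the data, a range boundary can land mid-line, which is exactly the situation the RecordReader has to repair.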

The next question is how Hadoop ensures a complete line is read, since a line may itself straddle two splits.
To handle that, Hadoop performs a remote read (taken care of in the RecordReader) until it reaches the end of the line (EOL).

If a map task gets a split that contains the start of a line but not its end, Hadoop continues reading from the next split until it reaches the EOL of that line.
The next map task first seeks past the first EOL it finds in its split (that partial line was already consumed by the previous map task) and then starts reading from the following line.
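The two rules above — read past the split's end to finish the last line, and skip the partial line at the start of every split after the first — can be sketched as follows. This is an illustrative simulation over an in-memory byte string, not Hadoop's actual LineRecordReader; the `read_split` function is hypothetical:

```python
def read_split(data, start, end):
    """Return the complete lines owned by the split [start, end).

    Sketch of the behaviour described above: a reader that does not
    start at byte 0 skips everything up to and including the first
    '\n' (that partial line belongs to the previous split), then reads
    whole lines, fetching bytes beyond `end` if the last line crosses
    the boundary (a remote read, in HDFS terms).
    """
    pos = start
    if start != 0:
        nl = data.find(b"\n", start)
        if nl == -1:
            return []          # no line starts inside this split
        pos = nl + 1           # skip the partial first line
    lines = []
    # Keep reading while the next line *starts* at or before `end`;
    # the line's bytes themselves may extend past the boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])   # final line, no trailing newline
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines

data = b"alpha\nbravo\ncharlie\ndelta\n"
# Two 13-byte splits; the boundary falls in the middle of "charlie".
collected = [ln for s, e in [(0, 13), (13, 26)]
             for ln in read_split(data, s, e)]
assert collected == [b"alpha", b"bravo", b"charlie", b"delta"]
```

Each line is produced exactly once: the first split finishes "charlie" by reading past byte 13, and the second split skips the tail of "charlie" before emitting "delta".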