Apr 17 2015 02:38 PM
Spilled records are the transient intermediate data written to disk during the map and reduce phases.
Note that it's not just the map tasks that generate spilled records. On the reduce side, when the in-memory shuffle buffer fills past its merge threshold (mapred.job.shuffle.merge.percent) or holds the threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.
What you need to do is:
1. Write your map and reduce functions to use as little memory as possible. They should not
use an unbounded amount of memory. For example, avoid accumulating values in an in-memory map; emit each record as soon as it is produced.
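A minimal sketch of the streaming pattern in plain Java (no Hadoop dependencies; the emit callback is a hypothetical stand-in for Hadoop's context.write):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

public class MapperSketch {
    // Push each (word, 1) pair out as soon as it is produced instead of
    // buffering pairs in the mapper's own data structures, so the mapper's
    // memory footprint stays constant per record.
    static void map(String line, BiConsumer<String, Integer> emit) {
        for (String word : line.split("\\s+")) {
            emit.accept(word, 1);
        }
    }

    public static void main(String[] args) {
        // Here the downstream aggregation (normally a combiner/reducer)
        // is simulated with a simple map for demonstration.
        Map<String, Integer> counts = new TreeMap<>();
        map("to be or not to be", (k, v) -> counts.merge(k, v, Integer::sum));
        System.out.println(counts);  // {be=2, not=1, or=2, to=2} minus the typo: {be=2, not=1, or=1, to=2}
    }
}
```

The point is that aggregation belongs in the combiner or reducer, not in mapper-local collections that grow with input size.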
2. Write a combiner function, and set the minimum number of spill files needed for the
combiner to run via min.num.spills.for.combine (default 3).
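For example, using the old-style property name given above, the spill threshold could be set in the job configuration (a sketch; the value shown is just the default):

```xml
<property>
  <name>min.num.spills.for.combine</name>
  <!-- run the combiner during the merge only when at least
       this many spill files exist (default: 3) -->
  <value>3</value>
</property>
```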
3. Tune the buffer-related variables. Buffering is used to minimize disk writes:
– io.sort.mb: size of the map-side buffer used to store and merge map output before
spilling to disk.
– fs.inmemorysize.mb: size of the reduce-side buffer used to store and merge output
from multiple maps before spilling to disk.
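As a sketch, these could be set in the job configuration like so (property names as given above; the values are illustrative, not recommendations, and the right sizes depend on your task heap and workload):

```xml
<!-- Map-side sort buffer in MB (commonly defaults to 100). -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>

<!-- Reduce-side in-memory merge buffer in MB. -->
<property>
  <name>fs.inmemorysize.mb</name>
  <value>200</value>
</property>
```

Raising these buffers reduces the number of spills, at the cost of heap available to your map and reduce code, so they need to be tuned together with the task JVM size.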