MapReduce - What is Spilled Records count?


2 Answer(s)


"Spilled Records" is the total number of records written to disk during a job, counting both map-side and reduce-side spills. The count can be zero, which is good for memory and I/O performance. If it is greater than 0, the data being buffered exceeded the memory limit reserved for the map output buffer. You can control this limit by setting parameters in mapred-site.xml. For better performance, keep the spilled record count small by tuning the number of tasks and/or the number of cluster nodes. The more splits you have, the fewer spills you get, because each map task then has less output to hold in its buffer.
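
To make the buffer limit concrete, here is a rough back-of-the-envelope sketch in Python. It is not Hadoop code. It assumes the buffer spills each time it fills to io.sort.mb multiplied by io.sort.spill.percent (a Hadoop 1.x property not mentioned above, with a documented default of 0.80), and it ignores the per-record metadata that the real buffer also tracks. The function name is made up for illustration.

```python
import math

def estimate_map_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough number of spill files a single map task writes.

    Simplified model (an assumption, not Hadoop's exact accounting):
    a spill starts whenever the buffer holds io_sort_mb * spill_percent
    of data, and whatever is left is flushed once at the end.
    """
    if map_output_mb <= 0:
        return 0
    usable_mb = io_sort_mb * spill_percent
    return math.ceil(map_output_mb / usable_mb)

print(estimate_map_spills(60))                   # 1
print(estimate_map_spills(400))                  # 5
print(estimate_map_spills(400, io_sort_mb=512))  # 1
```

Under this model a task either needs a bigger buffer or less output per task (more splits) to get down to a single spill.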

Hi Sasikumar,
The spilled records count relates to the transient (intermediate) data produced during the map and reduce operations.
Note that it is not just the map operations that generate spilled records. On the reduce side, when the in-memory buffer fills to the threshold set by mapred.job.shuffle.merge.percent, or the number of buffered map outputs reaches mapred.inmem.merge.threshold, the buffered outputs are merged and spilled to disk.
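
As a sketch of that reduce-side trigger (plain Python, not Hadoop code), the check looks roughly like the following. The function name and arguments are hypothetical, and the defaults of 0.66 and 1000 are the values I remember from Hadoop 1.x, so verify them against your version.

```python
def should_merge_in_memory(buffered_bytes, buffered_outputs, buffer_bytes,
                           merge_percent=0.66, inmem_threshold=1000):
    """Return True when the buffered map outputs would be merged and spilled.

    merge_percent stands in for mapred.job.shuffle.merge.percent and
    inmem_threshold for mapred.inmem.merge.threshold (assumed defaults).
    """
    bytes_limit = buffer_bytes * merge_percent
    return buffered_bytes >= bytes_limit or buffered_outputs >= inmem_threshold

# Buffer of 100 units: 70 units buffered crosses the 66% mark.
print(should_merge_in_memory(70, 10, 100))    # True
# 50 units and only 10 map outputs: nothing happens yet.
print(should_merge_in_memory(50, 10, 100))    # False
# Few bytes, but 1000 small map outputs hits the count threshold.
print(should_merge_in_memory(50, 1000, 100))  # True
```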

What you need to do is:
1. Write your map and reduce functions to use as little memory as possible. They should not
be using an unlimited amount of memory. For example, you can do this by not accumulating values in a map.
2. Write a combiner function and specify the minimum number of spill files needed for the
combiner to run: min.num.spills.for.combine (default 3).
3. Tune the buffer settings. Buffering is used to minimize disk writes:
– io.sort.mb: size of the map-side buffer used to store and merge map output before spilling
to disk. (Map-side buffer)
– fs.inmemory.size.mb: size of the reduce-side buffer used to store and merge output from multiple
maps before spilling to disk. (Reduce-side buffer)
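
To show why the combiner in point 2 helps, here is a small word-count sketch in plain Python (not the Hadoop API; the function names are made up for illustration). The combiner sums the values per key before anything is written out, so fewer records are left to spill.

```python
from collections import Counter

def map_words(line):
    """Map step: emit one (word, 1) record per word."""
    return [(word, 1) for word in line.split()]

def combine(records):
    """Combiner step: sum the counts per key before the records are written."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())

records = map_words("to be or not to be")
combined = combine(records)

print(len(records), len(combined))  # 6 4
print(combined)                     # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The saving depends on how often keys repeat within one map task's output. With mostly unique keys, a combiner removes very little.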

Thanks