How do we find median of a big array using map reduce


1 Answer(s)


Hi,

This is a sort of parallel version of sequential quickselect.

Pick a small, arbitrary chunk of the input set. Sort this sequentially. We're going to use these as a whole bunch of pivots, in parallel. Call this array pivots, and let its size be k.

Perform a map/reduce as follows: for each value x in the input set, binary-search to find x's position relative to pivots; call this position bucket(x). This is an integer between 0 and k. The reduce step is to count the number of elements in each bucket; define bucket[b] to be the number of x with bucket(x) = b.

The median must be in the "median bucket." Pick out all the values in that median bucket, and use a traditional sequential selection algorithm to find the element with the correct index.

Link: http://stackoverflow.com/questions/6968215/find-median-of-a-large-integer-set-using-mapreduce

I hope this helps.

Thanks.