When doing a map-side join, how do I pass the dictionary/abbreviation file?


1 Answer(s)


Hi Igor,

Your understanding of map-side join is correct: the dictionary file has to be placed somewhere that every mapper can access.
The most commonly used way to do this is the Distributed Cache.
Concept:
When one of the two data-sets being joined is a small file, the optimized approach is to store that file in the Distributed Cache. This makes the file available on every node that runs a map task, so each map task can read it locally and perform the join itself.
Steps:
1. Store the small file in the Distributed Cache from the driver class.
2. Inside the map class, retrieve the small file and perform the join there rather than passing the data to a reducer.
3. Further optimization can be applied by:
a. retrieving the small file once in the setup method of the map class instead of on every map call
b. populating a class-level hash-map in the setup method, with the join key as the map key and the matching dictionary entry as the value
c. inside the map method, looking up the join key in the hash-map [O(1)] and performing the join if there is a hit.
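
To make step 3 concrete, here is a minimal sketch in plain Java with no Hadoop dependencies, so it can be run on its own. The class name MapSideJoinSketch, the method names loadDictionary and joinRecord, and the tab-separated record layout (id, then abbreviation) are all assumptions made for illustration, not part of any Hadoop API. In a real job, loadDictionary would correspond to what you do in the mapper's setup method and joinRecord to what you do in the map method.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    // Stands in for setup(): parse the small dictionary file once into an
    // in-memory hash-map keyed by the join key (assumed layout: key<TAB>value).
    public static Map<String, String> loadDictionary(List<String> lines) {
        Map<String, String> dictionary = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {          // skip malformed lines
                dictionary.put(parts[0], parts[1]);
            }
        }
        return dictionary;
    }

    // Stands in for map(): look up the join key (assumed to be the second
    // field) and append the dictionary value on a hit. Returns null on a
    // miss, which gives inner-join behaviour.
    public static String joinRecord(String record, Map<String, String> dictionary) {
        String[] fields = record.split("\t");
        if (fields.length < 2) {
            return null;
        }
        String expanded = dictionary.get(fields[1]);   // O(1) average lookup
        return expanded == null ? null : record + "\t" + expanded;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the cached file and the big input.
        Map<String, String> dictionary =
                loadDictionary(Arrays.asList("NY\tNew York", "CA\tCalifornia"));
        for (String record : Arrays.asList("1\tNY", "2\tTX", "3\tCA")) {
            String joined = joinRecord(record, dictionary);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

In an actual Hadoop 2.x job, the driver would typically register the small file with Job.addCacheFile(URI), and the mapper would locate it in setup via context.getCacheFiles() before reading it line by line into the hash-map. Check the API of the Hadoop version you are running, since older releases used the DistributedCache class for this instead.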

A good example can be found here: http://hadooped.blogspot.in/2013/09/map-side-join-sample.html


Hope this helps.
Post here if you run into any issues while doing this.


Happy Learning @ Dezyre !!!