There are two kinds of joins in MapReduce: the map-side join and the reduce-side join.
- Map-side join - This join is performed in the map phase, as the input is read and before anything is shuffled to the reducers. It is suited to two scenarios:
  - One of the inputs is small enough to fit in memory - Consider some metadata that needs to be associated with a much larger set of records. Here the smaller input can be replicated to all the tasktracker nodes and held in memory, and the join can be performed as the mapper reads the bigger input.
  - Both inputs are sorted and partitioned into the same number of partitions, with the guarantee that all records for a given key fall in the same partition - Consider the outputs of multiple MapReduce jobs that had an equal number of reducers and emitted the same keys. In this case, an index (key, filename, offset) can be built from one of the inputs and looked up as the other input is read.
- Reduce-side join - This join happens in the reduce phase. It places no restrictions on the size of the inputs; its only disadvantage is that all the records from both inputs have to go through the shuffle and sort phase. It works as follows: the map phase tags each record with an identifier of its source, so that the reducer can tell the sources apart and pick the right parsing logic. Records with the same key reach the same reducer, which performs the join, parsing and handling the records from each source tag accordingly.
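
To make the first map-side scenario concrete, here is a minimal sketch in plain Python (not the Hadoop API). It assumes records are `(key, value)` tuples and that the small input fits in a dictionary; the function name `replicated_join` and the sample data are made up for illustration.

```python
from collections import defaultdict

def replicated_join(small, large):
    """Join a small in-memory input against a large streamed input."""
    # Build the lookup table once; in a real job each map task would
    # load its own copy of the small input before reading any records.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Stream the large input and emit one joined record per match.
    for key, value in large:
        for small_value in lookup.get(key, []):
            yield key, small_value, value

# Hypothetical sample data: department metadata and employee records.
departments = [("d1", "Sales"), ("d2", "Ops")]
employees = [("d1", "alice"), ("d2", "bob"), ("d1", "carol"), ("d3", "dave")]

replicated_result = list(replicated_join(departments, employees))
```

This is an inner join, so the `d3` record, which has no matching metadata, is dropped. No shuffle is involved: the output order simply follows the order in which the large input is read.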
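
The second map-side scenario (both inputs sorted and partitioned the same way) could be sketched as below, again in plain Python rather than Hadoop code. The assumptions here are mine: partitions are in-memory lists, a CRC32-based partitioner stands in for whatever the earlier jobs used, and a list position stands in for the (filename, offset) pair of the index.

```python
import zlib
from collections import defaultdict

def partition(records, num_partitions):
    """Split records into sorted partitions, mimicking reducer output files."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        # A deterministic partitioner sends equal keys to the same partition.
        parts[zlib.crc32(key.encode()) % num_partitions].append((key, value))
    return [sorted(part) for part in parts]

def copartitioned_join(left_parts, right_parts):
    """Join partition i of the left input with partition i of the right."""
    if len(left_parts) != len(right_parts):
        raise ValueError("inputs must have the same number of partitions")
    for left, right in zip(left_parts, right_parts):
        # Index one side: key -> offsets of its records in that partition.
        index = defaultdict(list)
        for offset, (key, _) in enumerate(right):
            index[key].append(offset)
        # Read the other side and look up matches through the index.
        for key, left_value in left:
            for offset in index.get(key, []):
                yield key, left_value, right[offset][1]

# Hypothetical sample data: orders and users, both keyed by user id.
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp"), ("u4", "desk")]
users = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]

copartitioned_result = sorted(
    copartitioned_join(partition(orders, 3), partition(users, 3))
)
```

Because both inputs go through the same partitioner with the same partition count, a key can only ever match inside its own partition pair, so each pair can be joined independently.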
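
The reduce-side join described above can be simulated in a few lines of plain Python. This is only a sketch under assumptions: the tags `"user"` and `"order"`, the helper names, and the in-memory `shuffle` function are illustrative stand-ins, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(records, tag):
    """Tag every record with its source so the reducer can tell them apart."""
    for key, value in records:
        yield key, (tag, value)

def shuffle(mapped):
    """Group tagged values by key, as the shuffle and sort phase would."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return sorted(groups.items())

def reduce_phase(key, tagged_values):
    """Separate the values by source tag, then emit every user/order pair."""
    users = [value for tag, value in tagged_values if tag == "user"]
    orders = [value for tag, value in tagged_values if tag == "order"]
    for user in users:
        for order in orders:
            yield key, user, order

# Hypothetical sample data: users and orders, both keyed by user id.
users = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp"), ("u4", "desk")]

mapped = list(map_phase(users, "user")) + list(map_phase(orders, "order"))
reduce_side_result = [
    row for key, values in shuffle(mapped) for row in reduce_phase(key, values)
]
```

Note that every record from both inputs passes through `shuffle`, including `u3` and `u4`, which end up matching nothing. That is the cost mentioned above: the join places no size limit on either input, but all of the data has to be moved and sorted.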