Map Reduce

Map-Reduce is a paradigm of parallel computation that initially comes from Functional Programming

The main characteristics

  • it's scalable and fault tolerant: it scales lineraly with amount of data
  • it a data processing tool that can handle parallel processing of large volumes of data
  • it typically runs on a Distributed File System


Main Idea:

  • hide details of parallel execution
  • allow user to focus only on data processing strategies


Programming Model

The Map-Reduce model is simple. A programmer needs to specify two primitives:

  • a Map function which produces "intermediate" result
  • a Reduce function which produces the final result
  • And there's a shuffle stage in-between


Map Function

Map($[(k, v)]$) $\to [(k_2, v_2)]$

  • input: a list of $(k, v)$ pairs
  • the map function is applied to each pair in the input
  • and it outputs a list of $(k_2, v_2)$ pairs

Shuffle

Shuffle

  • from a list of $(k_2, v_2)$ pairs form a list of $(k_2, [v_2])$ pairs
  • i.e. group intermediate results by key

Reduce Function

Reduce($[(k_2, [v_2])]$)

  • now for each combined pair we apply reduce


Program Execution

Procession

  • input is broken into pieces
  • each piece is given for a map to execute
  • there are one or more reduces


9f4126a0c41c47c7a21c3f4888f5966e.png


Hadoop MapReduce

For Hadoop implementation refer to Hadoop MapReduce#Map-Reduce Job Execution


Example

Word Count

Word-Counting: need to calculate how many occurrences of each word there are

  • distribute all documents among $k$ computers
  • for each document return a set of (word, freq) pairs (the map phase)
  • now sum the occurrences for each word (the reduce phase)

Pseudo-code

def map(String input_key, String doc):
  for each word w in doc:
    EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
  int res = 0
  for each v in output_vals:
    res += v
  Emit(res)


Matrix-Matrix Multiplication

  • Suppose we have two sparse matrices: $A$, a $m \times k$ matrix and $B$, a $k \times n$ matrix
  • We want to calculate $C = A \times B$

Map:

  • for each element $(i, j) \in A$
    • emit $((i, k), A[i, j])$ for $k \in 1..N$
  • for each element $(j, k) \in B$
    • emit $((i, k), B[j, k])$ for $i \in 1..L$

Reduce:

  • key = $(i, k)$
  • value = $\sum_j (A[i, j] \times B[j, k])$

Implementation in Python


MapReduce Patterns

Joins


Implementations

MapReduce is implemented on the following systems:


See also

Sources

Share your opinion