# ML Wiki

## Map Reduce

Map-Reduce is a paradigm of parallel computation that initially comes from Functional Programming

The main characteristics

• it's scalable and fault tolerant: it scales lineraly with amount of data
• it a data processing tool that can handle parallel processing of large volumes of data
• it typically runs on a Distributed File System

Main Idea:

• hide details of parallel execution
• allow user to focus only on data processing strategies

## Programming Model

The Map-Reduce model is simple. A programmer needs to specify two primitives:

• a Map function which produces "intermediate" result
• a Reduce function which produces the final result
• And there's a shuffle stage in-between

### Map Function

Map($[(k, v)]$) $\to [(k_2, v_2)]$

• input: a list of $(k, v)$ pairs
• the map function is applied to each pair in the input
• and it outputs a list of $(k_2, v_2)$ pairs

### Shuffle

Shuffle

• from a list of $(k_2, v_2)$ pairs form a list of $(k_2, [v_2])$ pairs
• i.e. group intermediate results by key

### Reduce Function

Reduce($[(k_2, [v_2])]$)

• now for each combined pair we apply reduce

## Program Execution

Procession

• input is broken into pieces
• each piece is given for a map to execute
• there are one or more reduces ## Example

### Word Count

Word-Counting: need to calculate how many occurrences of each word there are

• distribute all documents among $k$ computers
• for each document return a set of (word, freq) pairs (the map phase)
• now sum the occurrences for each word (the reduce phase)

Pseudo-code

def map(String input_key, String doc):
for each word w in doc:
EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
int res = 0
for each v in output_vals:
res += v
Emit(res)


### Matrix-Matrix Multiplication

• Suppose we have two sparse matrices: $A$, a $m \times k$ matrix and $B$, a $k \times n$ matrix
• We want to calculate $C = A \times B$

Map:

• for each element $(i, j) \in A$
• emit $((i, k), A[i, j])$ for $k \in 1..N$
• for each element $(j, k) \in B$
• emit $((i, k), B[j, k])$ for $i \in 1..L$

Reduce:

• key = $(i, k)$
• value = $\sum_j (A[i, j] \times B[j, k])$

## Implementations

MapReduce is implemented on the following systems: