Map-Reduce is a paradigm for parallel computation that originally comes from functional programming.

The main characteristics:

- it's scalable and fault-tolerant: it scales linearly with the amount of data
- it's a data processing tool that can handle parallel processing of large volumes of data
- it typically runs on a Distributed File System

Main Idea:

- hide details of parallel execution
- allow user to focus only on data processing strategies

The Map-Reduce model is simple. A programmer needs to specify two primitives:

- a **Map** function, which produces "intermediate" results
- a **Reduce** function, which produces the final result
- and there's a shuffle stage in-between

Map($[(k, v)]$) $\to [(k_2, v_2)]$

- input: a list of $(k, v)$ pairs
- the map function is applied to each pair in the input
- and it outputs a list of $(k_2, v_2)$ pairs

Shuffle

- from a list of $(k_2, v_2)$ pairs form a list of $(k_2, [v_2])$ pairs
- i.e. group intermediate results by key

Reduce($[(k_2, [v_2])]$)

- the reduce function is applied to each grouped $(k_2, [v_2])$ pair to produce the final result
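The three stages above can be sketched as a small in-memory simulation in plain Python (the driver and function names are illustrative, not a real framework's API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(pairs, map_fn, reduce_fn):
    # Map: apply map_fn to every (k, v) pair; each call yields (k2, v2) pairs
    intermediate = [kv2 for kv in pairs for kv2 in map_fn(*kv)]

    # Shuffle: group the intermediate (k2, v2) pairs by key into (k2, [v2])
    intermediate.sort(key=itemgetter(0))
    grouped = [(k2, [v for _, v in group])
               for k2, group in groupby(intermediate, key=itemgetter(0))]

    # Reduce: apply reduce_fn to each grouped (k2, [v2]) pair
    return [reduce_fn(k2, vs) for k2, vs in grouped]

# toy example: sum values grouped by key parity
pairs = [(1, 10), (2, 20), (3, 30), (4, 40)]
result = run_mapreduce(
    pairs,
    map_fn=lambda k, v: [(k % 2, v)],        # map: re-key each pair by k mod 2
    reduce_fn=lambda k2, vs: (k2, sum(vs)),  # reduce: sum the grouped values
)
# result == [(0, 60), (1, 40)]
```

Note that the sort-then-group step stands in for the shuffle; a real framework performs this grouping in a distributed fashion across reduce tasks.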

Processing

- the input is broken into pieces
- each piece is given to a map task to execute
- there are one or more reduce tasks

For Hadoop implementation refer to Hadoop MapReduce#Map-Reduce Job Execution

Word-Counting: we need to calculate how many occurrences of each word there are

- distribute all documents among $k$ computers
- for each document return a set of (word, freq) pairs (the map phase)
- now sum the occurrences for each word (the reduce phase)

Pseudo-code

```
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
```
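The pseudo-code can be turned into runnable Python with a simple in-memory shuffle (the driver below is a sketch, not Hadoop's API):

```python
from collections import defaultdict

def map_fn(input_key, doc):
    # emit an intermediate (word, 1) pair for every word in the document
    for w in doc.split():
        yield (w, 1)

def reduce_fn(output_key, output_vals):
    # sum the counts collected for one word
    return (output_key, sum(output_vals))

def word_count(docs):
    # shuffle: group the intermediate (word, 1) pairs by word
    groups = defaultdict(list)
    for input_key, doc in docs.items():
        for w, one in map_fn(input_key, doc):
            groups[w].append(one)
    # reduce each (word, [1, 1, ...]) group
    return dict(reduce_fn(w, vals) for w, vals in groups.items())

counts = word_count({"d1": "a b a", "d2": "b c"})
# counts == {'a': 2, 'b': 2, 'c': 1}
```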

Matrix Multiplication: multiply two sparse matrices with Map-Reduce

- Suppose we have two sparse matrices: $A$, an $m \times k$ matrix, and $B$, a $k \times n$ matrix
- We want to calculate $C = A \times B$

Map:

- for each element $(i, j) \in A$
- emit $((i, c), A[i, j])$ for $c \in 1..n$

- for each element $(j, c) \in B$
- emit $((i, c), B[j, c])$ for $i \in 1..m$

Reduce:

- key = $(i, c)$
- value = $C[i, c] = \sum_j (A[i, j] \times B[j, c])$
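This scheme can be sketched in Python over sparse matrices stored as `{(row, col): value}` dicts (the function name and in-memory shuffle are illustrative assumptions, not a framework API):

```python
from collections import defaultdict

def matmul_mapreduce(A, B, m, n):
    """A is m x k, B is k x n; both sparse dicts {(row, col): value}."""
    # map + shuffle: every A[i, j] is needed by all n columns of C,
    # and every B[j, c] is needed by all m rows of C; group under key (i, c)
    groups = defaultdict(list)
    for (i, j), a in A.items():
        for c in range(n):
            groups[(i, c)].append(("A", j, a))
    for (j, c), b in B.items():
        for i in range(m):
            groups[(i, c)].append(("B", j, b))

    # reduce: for each key (i, c), pair A[i, j] with B[j, c] and sum over j
    C = {}
    for (i, c), vals in groups.items():
        a_row = {j: v for tag, j, v in vals if tag == "A"}
        b_col = {j: v for tag, j, v in vals if tag == "B"}
        s = sum(a_row[j] * b_col[j] for j in a_row if j in b_col)
        if s:
            C[(i, c)] = s
    return C

A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}   # 2 x 2, i.e. [[1, 2], [0, 3]]
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}   # 2 x 2, i.e. [[4, 0], [5, 6]]
C = matmul_mapreduce(A, B, m=2, n=2)
# C == {(0, 0): 14, (0, 1): 12, (1, 0): 15, (1, 1): 18}
```

The replication in the map phase (each element emitted $n$ or $m$ times) is the price of making every reduce key independent; more communication-efficient blocked variants exist.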

- joins can also be expressed in Map-Reduce: see MapReduce/Joins

MapReduce is implemented in the following systems:

- Hadoop MapReduce
- CouchDB uses MapReduce for querying the database
- Spark and Flink also provide map and reduce functions

- Introduction to Data Science (Coursera)
- Lee et al., "Parallel Data Processing with MapReduce: A Survey" [1]
- Hadoop: The Definitive Guide (book)