
Map-Reduce is a paradigm of parallel computation that originally comes from functional programming.

The main characteristics:

- it's scalable and fault tolerant: it scales linearly with the amount of data
- it is a data processing tool that can handle parallel processing of large volumes of data
- it typically runs on a Distributed File System

Main Idea:

- hide the details of parallel execution
- allow the user to focus only on data processing strategies

## Programming Model

The Map-Reduce model is simple. A programmer needs to specify two primitives:

- a **Map** function, which produces an "intermediate" result
- a **Reduce** function, which produces the final result
- in between, there is a shuffle stage

### Map Function

Map($[(k, v)]$) $\to [(k_2, v_2)]$

- input: a list of $(k, v)$ pairs
- the map function is applied to each pair in the input
- and it outputs a list of $(k_2, v_2)$ pairs

### Shuffle

- from a list of $(k_2, v_2)$ pairs form a list of $(k_2, [v_2])$ pairs
- i.e. group intermediate results by key
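The grouping step can be sketched with a dictionary (a toy single-machine sketch; the name `shuffle` is illustrative, not a framework API):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group a list of (k2, v2) pairs into (k2, [v2]) pairs."""
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return sorted(groups.items())

# shuffle([('a', 1), ('b', 2), ('a', 3)]) -> [('a', [1, 3]), ('b', [2])]
```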

### Reduce Function

Reduce($[(k_2, [v_2])]$)

- now for each combined pair we apply reduce
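Putting the three stages together, a minimal in-memory sketch (the function `run_mapreduce` is illustrative; a real framework distributes these stages across machines):

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-machine sketch of the Map -> Shuffle -> Reduce pipeline."""
    # Map: apply map_fn to every (k, v) pair; each call returns (k2, v2) pairs
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)

    # Shuffle: group the intermediate pairs by key, giving (k2, [v2])
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce: apply reduce_fn to each (k2, [v2]) group
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}
```

For example, `run_mapreduce([(1, 'a'), (2, 'a')], lambda k, v: [(v, 1)], lambda k, vs: sum(vs))` returns `{'a': 2}`.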

## Program Execution

Processing:

- the input is broken into pieces
- each piece is given to a map task to execute
- there are one or more reduce tasks that combine the intermediate results

For the Hadoop implementation, refer to Hadoop MapReduce#Map-Reduce Job Execution.

## Examples

### Word Count

Word counting: we need to calculate how many occurrences of each word there are.

- distribute all documents among $k$ computers
- for each document return a set of (word, freq) pairs (the map phase)
- now sum the occurrences for each word (the reduce phase)

Pseudo-code:

```
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
```
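The pseudo-code above translates into runnable Python; the in-memory shuffle below stands in for what a framework would do between the two phases (the names here are illustrative):

```python
from collections import defaultdict

def word_count(docs):
    """Count word occurrences using the map / shuffle / reduce structure."""
    # Map phase: emit (word, 1) for every word in every document
    intermediate = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: group the emitted 1s by word
    groups = defaultdict(list)
    for word, one in intermediate:
        groups[word].append(one)

    # Reduce phase: sum the occurrences of each word
    return {word: sum(ones) for word, ones in groups.items()}

# word_count(["a rose is a rose"]) -> {'a': 2, 'rose': 2, 'is': 1}
```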

### Matrix-Matrix Multiplication

- Suppose we have two sparse matrices: $A$, an $L \times M$ matrix, and $B$, an $M \times N$ matrix
- We want to calculate $C = A \times B$

Map:

- for each element $(i, j) \in A$:
  - emit $((i, k), A[i, j])$ for $k \in 1..N$
- for each element $(j, k) \in B$:
  - emit $((i, k), B[j, k])$ for $i \in 1..L$

Reduce:

- key = $(i, k)$
- value = $\sum_j (A[i, j] \times B[j, k])$
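As a sketch (assuming the map values are tagged with the matrix of origin and the index $j$, which the reduce needs in order to pair $A[i, j]$ with $B[j, k]$; the names and signature here are illustrative):

```python
from collections import defaultdict

def matmul_mapreduce(A, B, L, N):
    """Sparse C = A x B via map/shuffle/reduce.
    A and B are dicts mapping (row, col) -> value; A is L x M, B is M x N."""
    # Map: each A[i, j] contributes to every C[i, k]; each B[j, k] to every C[i, k].
    # Values are tagged ('A', j, value) / ('B', j, value) so reduce can pair by j.
    intermediate = []
    for (i, j), a in A.items():
        for k in range(N):
            intermediate.append(((i, k), ('A', j, a)))
    for (j, k), b in B.items():
        for i in range(L):
            intermediate.append(((i, k), ('B', j, b)))

    # Shuffle: group by output cell (i, k)
    groups = defaultdict(list)
    for cell, tagged in intermediate:
        groups[cell].append(tagged)

    # Reduce: for each cell, compute the sum over j of A[i, j] * B[j, k]
    C = {}
    for cell, tagged_vals in groups.items():
        a_vals = {j: v for tag, j, v in tagged_vals if tag == 'A'}
        b_vals = {j: v for tag, j, v in tagged_vals if tag == 'B'}
        s = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
        if s != 0:
            C[cell] = s
    return C
```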

## MapReduce Patterns

### Joins

- see MapReduce/Joins

## Implementations

MapReduce is implemented in the following systems:

- Hadoop MapReduce
- CouchDB uses MapReduce for querying the database
- Spark, Flink also provide map and reduce functions

## See also

- Hadoop and Hadoop MapReduce
- Hadoop Distributed File System

## Sources

- Introduction to Data Science (coursera)
- Lee et al., Parallel Data Processing with MapReduce: A Survey (http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf)
- Hadoop: The Definitive Guide (book)