Map-Reduce is a paradigm of parallel computation that initially comes from [[Functional Programming]]

The main characteristics:
* it's scalable and fault tolerant: it scales linearly with the amount of data
* it's a data processing tool that can handle parallel processing of large volumes of data
* it typically runs on a [[Hadoop Distributed File System|Distributed File System]]

Main Idea: split the data into pieces, process the pieces in parallel with Map, and combine the intermediate results with Reduce

== Programming Model ==
The Map-Reduce model is simple. A programmer needs to specify two primitives:
* a '''Map''' function which produces an "intermediate" result
* a '''Reduce''' function which produces the final result
* and there's a '''shuffle''' stage in-between

=== Map Function ===
Map($[(k, v)]$) $\to [(k_2, v_2)]$
* input: a list of $(k, v)$ pairs
* and it outputs a list of $(k_2, v_2)$ pairs
=== Shuffle ===
Shuffle:
* from a list of $(k_2, v_2)$ pairs, form a list of $(k_2, [v_2])$ pairs
* i.e. group intermediate results by key

=== Reduce Function ===
Reduce($[(k_2, [v_2])]$)
* now for each combined pair we apply reduce
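These three stages can be sketched as a toy in-memory runner in Python (illustrative only, not tied to any particular framework's API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: each (k, v) pair may produce several (k2, v2) pairs
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))

    # Shuffle: group the (k2, v2) pairs into (k2, [v2]) pairs;
    # groupby needs the list sorted by key first
    intermediate.sort(key=itemgetter(0))
    grouped = [(k2, [v2 for _, v2 in group])
               for k2, group in groupby(intermediate, key=itemgetter(0))]

    # Reduce: apply reduce_fn to each combined pair
    return [reduce_fn(k2, vals) for k2, vals in grouped]

# toy usage: count word occurrences across two documents
pairs = [("doc1", "a b a"), ("doc2", "b")]
out = run_mapreduce(pairs,
                    map_fn=lambda k, v: [(w, 1) for w in v.split()],
                    reduce_fn=lambda k2, vals: (k2, sum(vals)))
# out == [("a", 2), ("b", 2)]
```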
== Program Execution ==
Processing:
* input is broken into pieces
* each piece is given to a map task to execute
* there are one or more reduces
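A hedged sketch of this flow in Python: the input is split into pieces, each piece is mapped independently, and intermediate keys are routed to one of several reducers by hashing the key (the partitioning scheme here is illustrative, not any framework's actual one):

```python
def run_partitioned(pieces, map_fn, reduce_fn, num_reducers=2):
    # one map task per input piece
    map_outputs = [[kv2 for kv in piece for kv2 in map_fn(*kv)]
                   for piece in pieces]

    # shuffle: route each (k2, v2) to a reducer chosen by hashing the key,
    # so every key lands on exactly one reducer
    reducer_inputs = [dict() for _ in range(num_reducers)]
    for output in map_outputs:
        for k2, v2 in output:
            r = hash(k2) % num_reducers
            reducer_inputs[r].setdefault(k2, []).append(v2)

    # each reducer processes its own group of keys independently
    results = {}
    for groups in reducer_inputs:
        for k2, vals in groups.items():
            results[k2] = reduce_fn(k2, vals)
    return results

# toy usage: word count over two input pieces
pieces = [[("d1", "to be")], [("d2", "or not to be")]]
counts = run_partitioned(pieces,
                         lambda k, v: [(w, 1) for w in v.split()],
                         lambda k2, vals: sum(vals))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```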
{{void|https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/map-reduce.png}}
http://hsto.org/files/9f4/126/a0c/9f4126a0c41c47c7a21c3f4888f5966e.png
=== [[Hadoop MapReduce]] ===
For the Hadoop implementation refer to [[Hadoop MapReduce#Map-Reduce Job Execution]]

== Example ==
=== Word Count ===
Word-Counting: need to calculate how many occurrences of each word there are
* distribute all documents among $k$ computers

Pseudo-code:
 def map(String input_key, String doc):
     for each word w in doc:
         EmitIntermediate(w, 1)

 def reduce(String output_key, Iterator output_vals):
     int res = 0
     for each v in output_vals:
         res += v
     Emit(res)
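The word count example can be written as runnable Python (a toy single-machine version; a grouping dict stands in for the shuffle stage, and the Emit calls are replaced by return values):

```python
from collections import defaultdict

def map_word_count(doc_id, doc):
    # emit an intermediate (word, 1) pair for every word occurrence
    return [(word, 1) for word in doc.split()]

def reduce_word_count(word, counts):
    # sum all partial counts for the word
    return word, sum(counts)

def word_count(docs):
    intermediate = defaultdict(list)
    for doc_id, doc in docs:
        for word, one in map_word_count(doc_id, doc):
            intermediate[word].append(one)   # shuffle: group by word
    return dict(reduce_word_count(w, c) for w, c in intermediate.items())

counts = word_count([("d1", "the quick the"), ("d2", "quick")])
# counts == {"the": 2, "quick": 2}
```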
=== [[Matrix-Matrix Multiplication]] ===
* Suppose we have two sparse matrices: $A$, an $L \times M$ matrix, and $B$, an $M \times N$ matrix
* We want to calculate $C = A \times B$

Map:
* for each element $(i, j) \in A$
** emit $((i, k), A[i, j])$ for $k \in 1..N$
* for each element $(j, k) \in B$
** emit $((i, k), B[j, k])$ for $i \in 1..L$

Reduce:
* for each cell $(i, k)$, compute $C[i, k] = \sum_j A[i, j] \cdot B[j, k]$
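A runnable Python sketch of the whole job, assuming sparse matrices stored as dicts keyed by $(row, col)$. Note one extension over the compact notation above: each emitted value is tagged with its source matrix and the inner index $j$, so the reducer can align matching terms (my own completion of the reduce step):

```python
from collections import defaultdict

def matmul_mapreduce(A, B, L, M, N):
    """A: dict {(i, j): value}, an L x M matrix; B: dict {(j, k): value}, M x N."""
    # Map: replicate each A entry across the N columns of C,
    # and each B entry across the L rows of C; tag values with j
    intermediate = defaultdict(list)
    for (i, j), a in A.items():
        for k in range(N):
            intermediate[(i, k)].append(("A", j, a))
    for (j, k), b in B.items():
        for i in range(L):
            intermediate[(i, k)].append(("B", j, b))

    # Reduce: for each cell (i, k), multiply terms with matching j and sum
    C = {}
    for (i, k), vals in intermediate.items():
        a_row = {j: v for tag, j, v in vals if tag == "A"}
        b_col = {j: v for tag, j, v in vals if tag == "B"}
        s = sum(a_row[j] * b_col[j] for j in a_row if j in b_col)
        if s:
            C[(i, k)] = s
    return C

# toy usage: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}
C = matmul_mapreduce(A, B, L=2, M=2, N=2)
# C == {(0, 0): 14, (0, 1): 12, (1, 0): 15, (1, 1): 18}
```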
== MapReduce Patterns ==
=== Joins ===
* see [[MapReduce/Joins]]

== Implementations ==
MapReduce is implemented in the following systems:
* [[Hadoop MapReduce]]
* [[CouchDB]] uses MapReduce for querying the database
* [[Spark]] and [[Flink]] also provide map and reduce functions
== See also ==
* [[Hadoop]] and [[Hadoop MapReduce]]
* [[Hadoop Distributed File System]]

== Sources ==
* [[Introduction to Data Science (coursera)]]
* Lee et al, Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
* [[Hadoop: The Definitive Guide (book)]]

[[Category:Algorithms]]
[[Category:Hadoop]]
[[Category:MapReduce]]