== Hadoop ==
Usually the term ''Hadoop'' refers to an entire family of products.

So it's usually a set of products:
* parallel storage ([[Hadoop Distributed File System|HDFS]])
* a processing framework (a [[MapReduce]] implementation)
* [[Pig]]/[[Hive]] for declarative high-level language support
* [[HBase]], a non-relational NoSQL database that runs on HDFS
* Mahout, a [[Data Mining]] and [[Machine Learning]] tool that works on top of Hadoop


Hadoop:
* Any combination of these products is still referred to as Hadoop
* Lots of vendors (like HortonWorks) provide their own distributions of Hadoop
* Even though [[MapReduce]] is considered the most important part, it is entirely optional: we may use just HDFS and HBase, and still consider this combination Hadoop


== MapReduce Component ==
This is a data processing tool that works on top of HDFS.

=== [[MapReduce]] Job Execution ===
Processing:
* each processing job is broken down into pieces
* each piece is given to a map task to execute
* there are also one or more reduce tasks

So it's performed in two steps (see the sketch below):
* map phase
* reduce phase
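
A minimal sketch of these two phases in Java, modeled on the classic word-count example from the Hadoop documentation (the class names are illustrative):

<pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each map task reads a piece of the input
// and emits intermediate (word, 1) pairs
public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce phase: all intermediate values for one key arrive together
// and are aggregated into a single output record
class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
</pre>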

=== Algorithm ===
* the master picks some idle workers and assigns them a map task
* preparations (before starting the map task)
** the input file is loaded to DFS
** it's partitioned into blocks (typically 64 MB each)
** each block is replicated 3 times to guarantee fault-tolerance
* '''map phase'''
** each block is assigned to a map worker
** it applies the [[MapReduce#Map Function|map function]] to it
** intermediate results are sorted locally
** then they are stored on the local disk of the mapper
** they are partitioned into $R$ reduce tasks (see the partitioner sketch after this list)
*** $R$ is specified beforehand
*** partitioning is typically done by '''hash(key) % $R$'''
* wait until ''all'' map tasks are completed
* before reduce
** the master assigns reduce tasks to workers
** the intermediate results are shuffled and assigned to reducers
** each reducer pulls its partition from the mappers' disks
*** all map results are already partitioned and stored on the mappers' disks
*** the reducer reads its input and groups it by key
** each record is assigned to only one reducer
* '''reduce phase'''
** now the [[MapReduce#Reduce Function|reduce function]] is applied to each group
** the output is stored and replicated 3 times
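
The partitioning step above is what Hadoop's default HashPartitioner does; a minimal equivalent sketch (the bit mask keeps the hash non-negative before taking it modulo $R$):

<pre>
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate key to one of the R reduce tasks
// via hash(key) mod R, like Hadoop's built-in HashPartitioner
public class ModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
</pre>

Because equal keys always hash to the same partition, each key is guaranteed to end up at exactly one reducer.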


Hadoop in one picture:

https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/hadoop.png

(Figure source: Huy Vo, NYU Poly, and [http://escience.washington.edu/get-help-now/what-hadoop])


In short:
* There's only one Name Node
* the Name Node divides input files into $M$ ''splits'' (by key)
* then the Name Node assigns ''workers'' (servers) to perform the $M$ map tasks
* while they are computing, it keeps track of their progress
* Workers write their results to their local disks, dividing them into $R$ regions
* once the map part is done, the Name Node assigns workers to the $R$ reduce tasks (see the driver sketch below)
* Reduce workers read the regions from the map workers' local disks
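
A minimal driver sketch showing where $M$ and $R$ come from: $M$ follows from how the input is split, while $R$ is set explicitly by the programmer. It assumes the mapper and reducer classes from the sketch above; the value 4 for $R$ is just an example:

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);  // one map task per input split
    job.setReducerClass(IntSumReducer.class);
    job.setNumReduceTasks(4);                   // R is specified beforehand
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // replicated output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre>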


https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/map-reduce2.png


=== Fault-Tolerance ===
This is achieved by the execution scheme:
* it detects failures and re-assigns the tasks of failed nodes to others in the cluster
* this naturally leads to load balancing


=== Runtime Scheduling Scheme ===
* for job execution, the MapReduce component doesn't build any execution plan beforehand
* it relies on the fault-tolerance scheme, which naturally leads to load balancing
* nodes that have completed their tasks are assigned other data blocks

Result: no communication costs
* MR tasks are executed without communication between tasks


== Advantages ==
* [[MapReduce]] is simple and expressive
** computing aggregations is easy
* flexible (see the sketch after this list)
** no dependency on a [[Data Model]] or schema
*** especially good for unstructured data
*** [[Database]]s cannot do that
** can write in any programming language
* fault-tolerance
** failures are detected, and tasks of failed nodes are re-assigned to others in the cluster
* high scalability
** even though not in the most efficient way
* cheap
** runs on commodity hardware
** open source
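
To illustrate the schema-free flexibility, here is a sketch of a mapper that counts HTTP status codes in raw web-server logs. The log layout (the status code as the 9th space-separated field, as in Apache's common log format) is an assumption for illustration; no schema is declared anywhere:

<pre>
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Interprets raw text lines at run time: no schema, no index
public class StatusCodeMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(" ");
    if (fields.length > 8) {                   // tolerate malformed lines
      context.write(new Text(fields[8]), ONE); // assumed: HTTP status field
    }
  }
}
</pre>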


== Disadvantages ==
However, it also has many disadvantages.

=== No Query Language ===
There is no high-level declarative language such as SQL:
* [[MapReduce]] is very low-level: one needs to know a programming language
* programs are expensive to write and to maintain
* programmers who can do that are expensive
* for [[Data Warehousing]]: [[OLAP]] is not that good in MapReduce

Possible solutions:
* [[Pig]] and [[Hive]]


=== Performance ===
Performance issues:
* no schema, no index: each input needs to be parsed
** may cause performance degradation
* not tuned for multidimensional queries
* possible solutions: [[HBase]], [[Hive]]
* because of fault-tolerance and scalability, it's not always optimized for I/O cost
** all intermediate results are materialized (no [[Pipelining]])
** triple replication
* high latency
** big overhead for small queries (job start time + JVM start time)

Solutions for I/O optimization:
* [[HBase]]
** a [[Column-Oriented Databases|Column-Oriented Database]] that has index structures
** data compression (easier for Column-Oriented Databases)
* Hadoop++ [https://infosys.uni-saarland.de/projects/hadoop.php]
** HAIL (Hadoop Aggressive Indexing Library), an enhancement for HDFS
** structured file format
** 20x improvement in Hadoop performance

=== Map and Reduce are Blocking ===
* the transition from the map phase to the reduce phase cannot be made while map tasks are still running
** the reason is that MapReduce relies on [[External Merge Sort]] for grouping intermediate results
** so [[Pipelining]] is not possible
* this blocking processing nature leads to latency problems
* it causes performance degradation, which is bad for [[OLAP|on-line processing]]

Solution:
* Incremental MapReduce (like in [[CouchDB]] [http://stackoverflow.com/questions/11236676/why-is-mapreduce-in-couchdb-called-incremental] [http://eagain.net/articles/incremental-mapreduce/])


=== Bad High Availability ===
* single point of failure: the Name Node
* if the Name Node fails, it brings down the entire cluster
* solutions:
** use special hardware for it
** back it up regularly


=== Fixed Data Flow ===
* Sometimes the abstraction is too simple
** many complex algorithms are hard to express in this paradigm
* The design is:
** read simple input
** generate simple output
* Again, tools like [[Hive]] can help


=== Other ===
And finally, it's a very young technology.


== Hadoop for Data Warehousing ==
{{ Main | Hadoop in Data Warehousing }}


== Links ==
* http://www.stanford.edu/class/ee380/Abstracts/111116.html - a lecture about Hadoop by Cloudera's CTO

== See also ==
* [[MapReduce]]
* [[Hadoop Distributed File System]]
* [[Hadoop in Data Warehousing]]

== Sources ==
* Lee et al., Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
* Ordonez et al., Relational versus Non-Relational Database Systems for Data Warehousing [http://www2.cs.uh.edu/~ordonez/w-2010-DOLAP-relnonrel.pdf]
* Awadallah and Graham (Cloudera and Teradata), Hadoop and the Data Warehouse: When to Use Which [http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/]
* [[Introduction to Data Science (coursera)]]


[[Category:Hadoop]]
[[Category:Distributed Systems]]