== Hadoop ==
Usually the term ''Hadoop'' refers to an entire family of tools, but mostly to:
* [[HDFS]] - distributed storage
* [[Hadoop MapReduce]] - processing framework ([[MapReduce]] implementation)

=== Hadoop Ecosystem ===
Other tools in the Hadoop ecosystem:
* [[Pig]] - a declarative high-level language for writing [[Hadoop MapReduce]] jobs
* [[Impala]], [[Hive]], [[Tez]] - interactive SQL (or SQL-like) query engines
* [[Spark]], [[Flink]] - fast iterative processing, especially when the data fits in memory
* [[Storm]], [[Samza]], [[Flink]] - stream processing
* [[HBase]] - a non-relational [[NoSQL]] database that runs on [[HDFS]]
* [[Mahout]] - a [[Data Mining]] and [[Machine Learning]] library that works on top of Hadoop
* [[Solr]] - [[Inverted Index|document indexing]] on top of [[HDFS]]

=== What is "Hadoop" ===
* Any combination of these tools is still referred to as "Hadoop"
* Lots of vendors (Cloudera, HortonWorks, MapR) provide their own distributions of Hadoop
* Even though [[Hadoop MapReduce]] is the most important part of Hadoop, it is entirely optional: we may use only HDFS and HBase, and this combination is still considered Hadoop

=== Hadoop1 vs Hadoop2 ===
* Hadoop1 uses its own execution engine (the <code>JobTracker</code> and <code>TaskTracker</code> daemons)
* the new generation of Hadoop - Hadoop2 - relies on [[YARN]] for this

== Hadoop Configuration ==
There's a Hadoop configuration folder, and each component typically has a file there with its configuration properties.

Hadoop looks for the configuration in <code>/etc/hadoop</code> or in the directory specified by <code>HADOOP_CONF_DIR</code>.

The common configuration files are:
* <code>core-site.xml</code> - common properties
* <code>hdfs-site.xml</code> - HDFS properties
* <code>mapred-site.xml</code> - MapReduce properties
* <code>yarn-site.xml</code> - YARN properties

Depending on the values in these files, Hadoop can run in several modes:
* Standalone/Local: for testing, runs on a local machine
* [[Hadoop Pseudo Distributed Mode|Pseudodistributed]]: also runs on a local machine, but jobs are executed by the Hadoop services (see [[Hadoop Pseudo Distributed Mode]] for a configuration example)
* Fully Distributed: runs on a cluster (the configuration is usually distributed by cluster managers such as Ambari or Cloudera Manager)
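
For instance, a minimal sketch of a pseudodistributed setup might look like this (these are the standard Hadoop 2 property names; the host and port are just example values, see [[Hadoop Pseudo Distributed Mode]] for a real configuration):

<pre>
<!-- core-site.xml: point the default filesystem at the local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single machine, so do not replicate each block 3 times -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
</pre>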

== MapReduce Component ==
Hadoop MapReduce is a data processing tool that works on top of HDFS.

=== [[MapReduce]] Job Execution ===
Processing:
* each processing job is broken down into pieces
* each piece is given to a map task to execute
* there are also one or more reduce tasks

So a job is performed in two steps:
* the map phase
* the reduce phase

=== Algorithm ===
* the master picks some idle workers and assigns them a map task
* preparations (before starting the map task):
** the input file is loaded into the DFS
** it's partitioned into blocks (typically 64 MB each)
** each block is replicated 3 times to guarantee fault-tolerance
* '''map phase'''
** each block is assigned to a map worker
** the worker applies the [[MapReduce#Map Function|map function]] to it
** the intermediate results are sorted locally
** then they are stored on the local disk of the mapper
** they are partitioned into $R$ reduce tasks
*** $R$ is specified beforehand
*** partitioning is typically done by '''hash(key) % $R$''' (see the simulation below)
* wait until ''all'' map tasks are completed
* before the reduce phase:
** the master assigns reduce tasks to workers
** the intermediate results are shuffled and assigned to reducers
** each reducer pulls its partition from the mappers' disks
*** all map results are already partitioned and stored on the mappers' disks
*** the reducer reads the input and groups it by key
** each record is assigned to only one reducer
* '''reduce phase'''
** now the [[MapReduce#Reduce Function|reduce function]] is applied to each group
** the output is stored and replicated 3 times

Hadoop in one picture:

https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/hadoop.png

(Figure source: Huy Vo, NYU Poly and [http://escience.washington.edu/get-help-now/what-hadoop])

In short:
* there's only one master node (the Name Node)
* the master divides input files into $M$ ''splits'' (by key)
* then it assigns ''workers'' (servers) to perform the $M$ map tasks
* while they are computing, it keeps track of their progress
* workers write their results to local disk, divided into $R$ regions
* once the map part is done, the master assigns workers to the $R$ reduce tasks
* reduce workers read the regions from the map workers' local disks

https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/map-reduce2.png
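
To make the data flow concrete, here is a toy Python simulation of the map, partition (by <code>hash(key) % R</code>), sort, and reduce steps, with word count as the job. The function names and the tiny in-memory "splits" are made up for illustration: it runs on one machine and only shows the order of the steps, while real Hadoop distributes them across workers.

<pre>
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

R = 3  # number of reduce tasks, specified beforehand

def map_fn(line):
    # word count: emit a (word, 1) pair for every word in the split
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    return key, sum(values)

splits = ["a b a", "b c", "a c c"]  # stand-ins for the 64 MB input blocks

# map phase: each split produces intermediate pairs,
# partitioned into R regions by hash(key) % R
regions = defaultdict(list)
for split in splits:
    for key, value in map_fn(split):
        regions[hash(key) % R].append((key, value))

# reduce phase: each reducer sorts its partition, groups it by key
# and applies the reduce function to every group
results = {}
for r in range(R):
    partition = sorted(regions[r], key=itemgetter(0))
    for key, group in groupby(partition, key=itemgetter(0)):
        k, v = reduce_fn(key, [value for _, value in group])
        results[k] = v

print(results)  # {'a': 3, 'b': 2, 'c': 3} (in some order)
</pre>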

=== Fault-Tolerance ===
Fault-tolerance is achieved by the execution scheme:
* the master detects failures and re-assigns the tasks of failed nodes to other nodes in the cluster
* this naturally leads to load balancing, as the sketch below shows
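
A toy Python sketch of this idea (a hypothetical master loop, not Hadoop's actual scheduler, and the timeout value is made up): tasks of workers whose heartbeat has timed out go back into the queue and are picked up by idle healthy nodes.

<pre>
import time
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 10.0  # seconds; an illustrative value, not Hadoop's default

@dataclass
class Worker:
    name: str
    running_tasks: list = field(default_factory=list)
    last_heartbeat: float = field(default_factory=time.time)

def schedule(pending_tasks, workers):
    """One pass of a toy master: detect failures, reassign the lost tasks."""
    now = time.time()
    alive = []
    for w in workers:
        if now - w.last_heartbeat > HEARTBEAT_TIMEOUT:
            # failure detected: return this node's tasks to the queue,
            # so that healthy nodes pick them up (natural load balancing)
            pending_tasks.extend(w.running_tasks)
            w.running_tasks.clear()
        else:
            alive.append(w)
    for w in alive:
        if not w.running_tasks and pending_tasks:
            w.running_tasks.append(pending_tasks.pop())

workers = [Worker("w1", running_tasks=["map-1"]), Worker("w2")]
workers[0].last_heartbeat -= 60  # simulate a dead node
schedule(["map-2"], workers)
print(workers[1].running_tasks)  # ['map-1']: the lost task was re-assigned
</pre>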

=== Runtime Scheduling Scheme ===
* for job execution, the MapReduce component doesn't build any execution plan beforehand
* it relies on the fault-tolerance scheme, which naturally leads to load balancing
* nodes that have completed their tasks are assigned to other data blocks

The result: no communication costs
* MapReduce tasks are executed without communication between the tasks

== Advantages ==
* [[MapReduce]] is simple and expressive
** computing aggregations is easy
* flexible
** no dependency on a [[Data Model]] or schema
*** especially good for unstructured data
*** this is not possible in traditional [[Database]]s
** the jobs can be written in any programming language (see the streaming example below)
* fault-tolerant
** failures are detected, and the tasks of failed nodes are re-assigned to other nodes in the cluster
* highly scalable
** even though not in the most efficient way
* cheap
** runs on commodity hardware
** open source
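
The "any programming language" point is what Hadoop Streaming exploits: the mapper and the reducer are plain executables that read stdin and write tab-separated key-value lines to stdout. A minimal word-count sketch in Python (the file names <code>mapper.py</code>/<code>reducer.py</code> are arbitrary):

<pre>
#!/usr/bin/env python
# mapper.py - reads raw input lines, emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
</pre>

<pre>
#!/usr/bin/env python
# reducer.py - Hadoop sorts the pairs by key, so equal words arrive adjacent
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print("%s\t%d" % (word, sum(int(count) for _, count in group)))
</pre>

A job like this is submitted with something along the lines of <code>hadoop jar hadoop-streaming-*.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py</code> (the path to the streaming jar varies between installations).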

== Disadvantages ==
However, it also has many disadvantages.

=== No Query Language ===
There is no high-level declarative language such as SQL:
* [[MapReduce]] is very low-level: you need to know a programming language
* the programs are expensive to write and to maintain
* the programmers who can write them are expensive
* for [[Data Warehousing]]: [[OLAP]] is hard to do with MapReduce

Possible solutions:
* [[Pig]] and [[Hive]]

=== Performance ===
Performance issues:
* there's no schema and no indexes, so each input has to be parsed
** this may cause performance degradation
* not tuned for multidimensional queries
* possible solutions: [[HBase]], [[Hive]]
* because of fault-tolerance and scalability, it's not always optimized for I/O cost
** all intermediate results are materialized (no [[Pipelining]])
** triple replication
* high latency
** big overhead for small queries (job start time + JVM start time)

Solutions for I/O optimization:
* [[HBase]]
** a [[Column-Oriented Databases|Column-Oriented Database]] that has index structures
** data compression (easier for Column-Oriented Databases)
* Hadoop++ [https://infosys.uni-saarland.de/projects/hadoop.php]
** HAIL (Hadoop Aggressive Indexing Library), an enhancement for HDFS
** a structured file format
** up to 20x improvement in Hadoop performance

=== Map and Reduce are Blocking ===
* the transition from the map phase to the reduce phase cannot be made while map tasks are still running
** the reason is that MapReduce relies on [[External Merge Sort]] for grouping the intermediate results
** so [[Pipelining]] is not possible
* this blocking nature of processing leads to latency problems
* it causes performance degradation, which is bad for [[OLAP|on-line processing]]

Solution:
* Incremental MapReduce (like in [[CouchDB]] [http://stackoverflow.com/questions/11236676/why-is-mapreduce-in-couchdb-called-incremental] [http://eagain.net/articles/incremental-mapreduce/])

=== Bad High Availability ===
* single point of failure: the Name Node
* if the Name Node fails, it brings down the entire cluster
* solutions:
** use special hardware for it
** back it up regularly

=== Fixed Data Flow ===
* Sometimes the abstraction is too simple
** many complex algorithms are hard to express in this paradigm
* By design, a job can only:
** read simple input
** generate simple output
* Again, tools like [[Hive]] can help

=== Other ===
And finally, it's a very young technology.

{{ Main | Hadoop in Data Warehousing }}

== See also ==

== Sources ==
* Lee et al., Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
* Ordonez et al., Relational versus non-relational database systems for data warehousing [http://www2.cs.uh.edu/~ordonez/w-2010-DOLAP-relnonrel.pdf]
* Awadallah and Graham (Cloudera and Teradata), Hadoop and the Data Warehouse: When to Use Which [http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/]
* [[Introduction to Data Science (coursera)]]
* [[Hadoop: The Definitive Guide (book)]]

[[Category:Hadoop]]
[[Category:MapReduce]]
[[Category:Distributed Systems]]