== Hadoop ==
Usually the term ''Hadoop'' refers to an entire family of tools, but mostly to:

* [[HDFS]] - distributed storage
* [[Hadoop MapReduce]] - processing framework (a [[MapReduce]] implementation)
 
  
=== Hadoop Ecosystem ===

Other tools in the Hadoop ecosystem:

* [[Pig]] - a declarative high-level language for running jobs on [[Hadoop MapReduce]]
* [[Impala]], [[Hive]], [[Tez]] - interactive SQL (or SQL-like) query engines
* [[Spark]], [[Flink]] - fast iterative processing, especially when the data can be put in memory
* [[Storm]], [[Samza]], [[Flink]] - stream processing
* [[HBase]] - a non-relational [[NoSQL]] database that runs on [[HDFS]]
* [[Mahout]] - a [[Data Mining]] and [[Machine Learning]] library that works on top of Hadoop
* [[Solr]] - [[Inverted Index|document indexing]] on top of [[HDFS]]
  
=== What is "Hadoop" ===

* Any combination of these tools is still referred to as "Hadoop"
* Many vendors (Cloudera, HortonWorks, MapR) provide their own distributions of Hadoop
* Even though [[Hadoop MapReduce]] is the most important part of Hadoop, it is entirely optional: we may use only HDFS and HBase, and this combination will still be considered Hadoop
  
== MapReduce Component ==

This is a data processing tool that works on top of HDFS.

=== Hadoop1 vs Hadoop2 ===

* Hadoop1 uses its own execution engine (<code>TaskTracker</code>)
* the new generation of Hadoop - Hadoop2 - relies on [[YARN]] for this

=== [[MapReduce]] Job Execution ===

Processing:

* each processing job is broken down into pieces
* each piece is given to a map task to execute
* there are also one or more reduce tasks

So a job is performed in two steps:

* map phase
* reduce phase

=== Algorithm ===

* the master picks some idle workers and assigns them a map task
* preparation (before starting the map tasks):
** the input file is loaded to DFS
** it's partitioned into blocks (typically 64 MB each)
** each block is replicated 3 times to guarantee fault-tolerance
* '''map phase'''
** each block is assigned to a map worker
** the worker applies the [[MapReduce#Map Function|map function]] to it
** intermediate results are sorted locally
** then they are stored on the mapper's local disk
** the results are partitioned into $R$ reduce tasks
*** $R$ is specified beforehand
*** partitioning is typically done by '''hash(key) % $R$'''
* wait until ''all'' map tasks are completed
* before the reduce phase:
** the master assigns reduce tasks to workers
** the intermediate results are shuffled and assigned to reducers
** each reducer pulls its partition from the mappers' disks
*** all map results are already partitioned and stored on the mappers' disks
*** the reducer reads its input and groups it by key
** each record is assigned to only one reducer
* '''reduce phase'''
** now the [[MapReduce#Reduce Function|reduce function]] is applied to each group
** the output is stored and replicated 3 times
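
To make the map and reduce phases concrete, here is a minimal sketch of the classic word count job, written against the Hadoop 2 <code>org.apache.hadoop.mapreduce</code> API (an illustration, not part of the original notes; the input and output paths are taken from the command line):

<pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line) -> (word, 1) for every word in the line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count);
  // all values for one key arrive at exactly one reducer
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in DFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre>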
  
Hadoop in one picture:

https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/hadoop.png

(Figure source: Huy Vo, NYU Poly and [http://escience.washington.edu/get-help-now/what-hadoop])
 
In short:

* there is only one master node
* the master divides the input files into $M$ ''splits'' (by key)
* then the master assigns ''workers'' (servers) to perform the $M$ map tasks
* while they are computing, it keeps track of their progress
* workers write their results to their local disks, dividing them into $R$ regions (the default partitioning rule is sketched after the figure)
* once the map part is done, the master assigns workers to the $R$ reduce tasks
* reduce workers read the regions from the map workers' local disks

https://raw.github.com/alexeygrigorev/ulb-adb-project-couchbd/master/report/images/map-reduce2.png
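
The division into $R$ regions is, by default, exactly the '''hash(key) % $R$''' rule. Below is a sketch of a partitioner that mirrors Hadoop's default <code>HashPartitioner</code> (shown only for illustration, not from the original notes):

<pre>
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: hash(key) % R,
// where R is the number of reduce tasks
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // mask the sign bit so the result is non-negative, then take mod R
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
</pre>

It would be plugged into a job with <code>job.setPartitionerClass(WordPartitioner.class)</code>; with the default <code>HashPartitioner</code> this is already the behavior.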
 
=== Fault-Tolerance ===

Fault-tolerance is achieved through the execution scheme:

* failures are detected, and the tasks of failed nodes are re-assigned to other nodes in the cluster
* this naturally leads to load balancing

=== Runtime Scheduling Scheme ===

* the MapReduce component doesn't build any execution plan for a job beforehand
* instead, it relies on the fault-tolerance scheme, which naturally leads to load balancing
* nodes that have completed their tasks are assigned other data blocks

Result: no communication costs

* MapReduce tasks are executed without communication between tasks
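
As a hedged illustration of this re-assignment scheme, these are the Hadoop 2 configuration knobs behind it (the property names are real; the class and job name are made up for the example):

<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryKnobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // a failed task is re-scheduled on another node, up to this many attempts
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
    // slow ("straggler") map tasks may be speculatively duplicated on idle nodes
    conf.setBoolean("mapreduce.map.speculative", true);

    Job job = Job.getInstance(conf, "retry-knobs-demo");
    System.out.println("map attempts: "
        + job.getConfiguration().getInt("mapreduce.map.maxattempts", 4));
  }
}
</pre>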
 
== Advantages ==

* [[MapReduce]] is simple and expressive
** computing aggregations is easy
* flexible
** no dependency on a [[Data Model]] or schema
*** especially good for unstructured data
*** this is not possible in traditional [[Database]]s
** programs can be written in any programming language
* fault-tolerant
** it detects failures and re-assigns the tasks of failed nodes to other nodes in the cluster
* highly scalable
** even though not always in the most efficient way
* cheap
** runs on commodity hardware
** open source
== Disadvantages ==

However, it has many disadvantages.

=== No Query Language ===

There is no high-level declarative language such as SQL:

* [[MapReduce]] is very low level - you need to know a programming language
* programs are expensive to write and maintain
* programmers who can write them are expensive
* for [[Data Warehousing]]: [[OLAP]] is not that good in MapReduce

Possible solutions:

* [[Pig]] and [[Hive]] (a sketch follows this list)
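
For contrast, the same word count can be expressed declaratively in [[Hive]] and submitted through its JDBC interface. A hedged sketch (the host, port, and the <code>docs</code> table with a single string column <code>line</code> are assumptions for illustration, not from the original notes):

<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
  public static void main(String[] args) throws Exception {
    // not needed with JDBC 4 auto-loading, kept for older setups
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default"); // assumed HiveServer2 URL
         Statement stmt = con.createStatement();
         // split each line into words, then count occurrences of each word
         ResultSet rs = stmt.executeQuery(
             "SELECT word, count(*) AS cnt "
           + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w "
           + "GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
</pre>

The whole map/shuffle/reduce pipeline from the word count sketch above collapses into one query; Hive compiles it into MapReduce (or Tez) jobs.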
 
=== Performance ===

Performance issues:

* no schema and no indexes: each input record has to be parsed
** this may cause performance degradation
* not tuned for multidimensional queries
** possible solutions: [[HBase]], [[Hive]]
* because of fault-tolerance and scalability, it's not always optimized for I/O cost
** all intermediate results are materialized (no [[Pipelining]])
** triple replication
* high latency
** big overhead for small queries (job start time + JVM start time)

Solutions for I/O optimization:

* [[HBase]]
** a [[Column-Oriented Databases|Column-Oriented Database]] with index structures
** data compression (easier in Column-Oriented Databases)
* Hadoop++ [https://infosys.uni-saarland.de/projects/hadoop.php]
** HAIL (Hadoop Aggressive Indexing Library), an enhancement for HDFS
** structured file format
** up to 20x improvement in Hadoop performance
=== Map and Reduce are Blocking ===

* the transition from the map phase to the reduce phase cannot be made while map tasks are still running
** the reason is that MapReduce relies on [[External Merge Sort]] for grouping the intermediate results
** [[Pipelining]] is not possible
* this blocking nature leads to latency problems
* it causes performance degradation - bad for [[OLAP|on-line processing]]

Solution:

* incremental MapReduce (like in [[CouchDB]] [http://stackoverflow.com/questions/11236676/why-is-mapreduce-in-couchdb-called-incremental] [http://eagain.net/articles/incremental-mapreduce/])
=== Bad High Availability ===

* single point of failure: the Name Node
* if the Name Node fails, it brings down the entire cluster
* solutions:
** use special hardware for it
** back it up regularly
=== Fixed Data Flow ===

* sometimes the abstraction is too simple
** many complex algorithms are hard to express in this paradigm
* the design is fixed:
** read simple input
** generate simple output
* again, tools like [[Hive]] can help

=== Other ===

And finally, it's a very young technology.


== Hadoop Configuration ==

There's a Hadoop configuration folder, and each component typically has a file there with its configuration properties.

Hadoop looks for the configuration in <code>/etc/hadoop</code> or in the directory specified by <code>HADOOP_CONF_DIR</code>.

The main configuration files are:

* <code>core-site.xml</code> - common properties
* <code>hdfs-site.xml</code> - HDFS properties
* <code>mapred-site.xml</code> - MapReduce properties
* <code>yarn-site.xml</code> - YARN properties

Depending on the values in these files, Hadoop can run in several modes (a sketch of how these values are read follows the list):

* Standalone/Local: for testing, runs on a local machine
* [[Hadoop Pseudo Distributed Mode|Pseudodistributed]]: also runs on a local machine, but jobs are executed by Hadoop services (see [[Hadoop Pseudo Distributed Mode]] for a configuration example)
* Fully Distributed: runs on a cluster (the configuration is usually obtained from cluster managers, e.g. Ambari or Cloudera Manager)
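
As a minimal sketch of how these files are consumed (illustrative, not from the original notes): Hadoop's <code>Configuration</code> class reads <code>core-default.xml</code> and then <code>core-site.xml</code> from the classpath, which normally includes the configuration folder.

<pre>
import org.apache.hadoop.conf.Configuration;

public class ShowMode {
  public static void main(String[] args) {
    // loads core-default.xml, then core-site.xml from the classpath
    Configuration conf = new Configuration();

    // fs.defaultFS (fs.default.name in Hadoop1) hints at the mode:
    //   file:///                         -> standalone/local (built-in default)
    //   hdfs://localhost:9000            -> pseudodistributed (typical value)
    //   hdfs://namenode.example.com:8020 -> fully distributed (example value)
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
  }
}
</pre>

With an empty configuration folder this prints the default <code>file:///</code>, which corresponds to the standalone mode.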
  
  
== Hadoop for Data Warehousing ==

{{ Main | Hadoop in Data Warehousing }}
  
 
== Links ==
 
* http://www.stanford.edu/class/ee380/Abstracts/111116.html - a lecture about Hadoop by Cloudera's CTO
 
  
 
== See also ==
  
 
== Sources ==
* Lee et al, Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
 
* Ordonez et al, Relational versus non-relational database systems for data warehousing [http://www2.cs.uh.edu/~ordonez/w-2010-DOLAP-relnonrel.pdf]
 
* Paper by Cloudera and Teradata, Awadallah and Graham, Hadoop and the Data Warehouse: When to Use Which. [http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/]
 
 
* [[Introduction to Data Science (coursera)]]
* [[Hadoop: The Definitive Guide (book)]]
  
 
[[Category:Hadoop]]
[[Category:MapReduce]]
[[Category:Distributed Systems]]
