
Hadoop

Usually the term "Hadoop" refers to an entire family of products rather than to a single one.

So it's usually a set of products: HDFS, MapReduce, HBase, and others.

  • Any combination of them is still referred to as Hadoop
  • Many vendors (like Hortonworks) provide their own distributions of Hadoop
  • Even though MapReduce is considered the most important part, it is entirely optional: we may use just HDFS and HBase, and that combination is still considered Hadoop

MapReduce Component

MapReduce is a data processing tool that runs on top of HDFS.

MapReduce Job Execution

Processing

  • each processing job is broken down into pieces
  • each piece is given to a map task to execute
  • there are also one or more reduce tasks

So it's performed in two phases (a minimal sketch follows this list):

  • map phase
  • reduce phase
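
To make the two phases concrete, here is a minimal single-machine word-count sketch in plain Python (Hadoop's real API is Java; the names map_fn and reduce_fn are purely illustrative):

    # Toy word count: the map function emits (word, 1) pairs and the
    # reduce function sums the counts for each word.
    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    lines = ["the quick brown fox", "the lazy dog"]

    # Map phase: apply map_fn to every input record.
    pairs = [kv for line in lines for kv in map_fn(line)]

    # The framework sorts and groups the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))

    # Reduce phase: apply reduce_fn to each group of values.
    result = [reduce_fn(word, [v for _, v in group])
              for word, group in groupby(pairs, key=itemgetter(0))]
    print(result)  # e.g. ('the', 2) appears once, with its total count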

Algorithm

  • the master picks some idle workers and assigns each one a map task
  • preparations (before starting the map tasks)
    • the input file is loaded into the DFS
    • it's partitioned into blocks (typically 64 MB each)
    • each block is replicated 3 times for fault-tolerance
  • map phase
    • each block is assigned to a map worker
    • the worker applies the map function to the block
    • intermediate results are sorted locally
    • then they are stored on the mapper's local disk
    • the output is partitioned into $R$ pieces, one per reduce task
      • $R$ is specified beforehand
      • partitioning is typically done by hash(key) % $R$
  • wait until all map tasks are completed
  • before reduce
    • the master assigns reduce tasks to workers
    • the intermediate results are shuffled and assigned to reducers
    • each reducer pulls its partition from the mappers' disks
      • all map results are already partitioned and stored on the mappers' disks
      • the reducer reads its input and groups it by key
    • each key is assigned to exactly one reducer
  • reduce phase
    • now the reduce function is applied to each group
    • the output is stored and replicated 3 times (the whole flow is sketched below)
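
All of these steps can be imitated in a few lines of single-process Python. This is only a toy model under simplifying assumptions (no DFS, no replication, no real network shuffle), and run_job is an invented name, not a Hadoop API:

    # Toy simulation of the map -> partition -> shuffle -> reduce flow.
    from collections import defaultdict

    def run_job(blocks, map_fn, reduce_fn, R):
        # Map phase: each block is processed independently; the output is
        # partitioned by hash(key) % R and sorted locally.
        mapper_disks = []                      # one "local disk" per mapper
        for block in blocks:
            partitions = defaultdict(list)
            for key, value in map_fn(block):
                partitions[hash(key) % R].append((key, value))
            for part in partitions.values():
                part.sort()                    # local sort on the mapper
            mapper_disks.append(partitions)

        # Shuffle + reduce phase: reducer r pulls partition r from every
        # mapper's disk, groups its records by key, and reduces each group.
        output = []
        for r in range(R):
            groups = defaultdict(list)
            for disk in mapper_disks:
                for key, value in disk.get(r, []):
                    groups[key].append(value)
            output.extend(reduce_fn(k, vs) for k, vs in groups.items())
        return output

    # Word count with the functions from the earlier sketch:
    blocks = ["the quick brown fox", "the lazy dog the end"]
    print(sorted(run_job(blocks,
                         lambda b: ((w, 1) for w in b.split()),
                         lambda k, vs: (k, sum(vs)),
                         R=2)))

Note how the reduce loop only starts after the loop over all blocks has finished; this is exactly the blocking behaviour discussed under Disadvantages below.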

Hadoop in one picture (image omitted; figure source: Huy Vo, NYU Poly, and [http://escience.washington.edu/get-help-now/what-hadoop])

In short:

  • There's only one Name Node
  • the Name Node divides the input files into $M$ splits
  • then the Name Node assigns workers (servers) to perform the $M$ map tasks
  • while they are computing, it keeps track of their progress
  • workers write their results to local disk, dividing them into $R$ regions
  • once the map part is done, the Name Node assigns workers to the $R$ reduce tasks
  • reduce workers read the regions from the map workers' local disks


Fault-Tolerance

Achieved through its execution scheme:

  • the master detects failures and re-assigns the tasks of failed nodes to other nodes in the cluster (a toy scheduler is sketched below)
  • this naturally leads to load balancing
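
The following toy scheduler (the names and the failure model are invented for illustration) shows the idea: a task on a failed worker simply goes back into the master's queue and is picked up by a surviving worker:

    # Toy master: tasks of failed workers are re-queued and picked up by
    # other workers, so failure handling doubles as load balancing.
    from collections import deque
    import random

    def schedule(tasks, workers, fail_prob=0.2):
        pending = deque(tasks)
        done = []
        while pending:
            task = pending.popleft()
            worker = random.choice(workers)
            if random.random() < fail_prob:    # simulated worker failure
                pending.append(task)           # master re-assigns the task
            else:
                done.append((task, worker))
        return done

    random.seed(0)
    print(schedule(["map-%d" % i for i in range(5)], ["w1", "w2", "w3"]))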

Runtime Scheduling Scheme

  • the MapReduce component doesn't build an execution plan before running a job
  • it relies on the fault-tolerance scheme, which naturally leads to load balancing
  • nodes that have completed their tasks are assigned other data blocks

Result: no communication costs

  • MapReduce tasks are executed independently, without communication between tasks

Advantages

  • MapReduce is simple and expressive
    • computing aggregations is easy
  • flexible
    • no dependency on a data model or schema
      • especially good for unstructured data
      • databases cannot handle that
    • jobs can be written in any programming language
  • fault-tolerant
    • detects failures and re-assigns the tasks of failed nodes to other nodes in the cluster
  • highly scalable
    • even though not always in the most efficient way
  • cheap
    • runs on commodity hardware
    • open source

Disadvantages

However, it also has a number of disadvantages.

No Query Language

There is no high-level declarative language such as SQL:

  • MapReduce is very low-level: jobs have to be written as programs
  • such programs are expensive to write and to maintain
  • programmers who can write them are expensive
  • for Data Warehousing: OLAP does not work well in MapReduce (an example follows this list)
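
For example, an aggregation that is a single line of SQL becomes a hand-written map/reduce pair (sketched here in plain Python; a real Hadoop job would add Java mapper/reducer classes plus driver boilerplate on top):

    # SQL: SELECT dept, SUM(salary) FROM employees GROUP BY dept
    # The same aggregation as hand-written map and reduce functions,
    # assuming each record is a dict like {"dept": "IT", "salary": 100}:
    def map_fn(record):
        yield (record["dept"], record["salary"])

    def reduce_fn(dept, salaries):
        return (dept, sum(salaries))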

Possible solutions: tools like Hive, which provide a higher-level query language on top of MapReduce.

Performance

Performance issues:

  • no schema, no indexes: each input has to be parsed
    • this may cause performance degradation
  • not tuned for multidimensional queries
    • possible solutions: HBase, Hive
  • because of fault-tolerance and scalability, it's not always optimized for I/O cost (a rough example follows this list)
    • all intermediate results are materialized (no pipelining)
    • triple replication
  • high latency
    • big overhead even for small jobs (job start time + JVM start time)
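
As a rough illustration of the I/O cost (the numbers are invented for the example): if the map phase emits 1 TB of intermediate data, that 1 TB is first written to the mappers' local disks and then read again by the reducers, because nothing is pipelined; and if the final output is 200 GB, triple replication turns it into roughly 600 GB of writes. A system that pipelines intermediate results avoids most of the first cost.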

Solutions for I/O optimization

  • HBase
  • Hadoop++ [https://infosys.uni-saarland.de/projects/hadoop.php]
    • HAIL (Hadoop Aggressive Indexing Library) as an enhancement for HDFS
    • structured file format
    • 20x improvement in Hadoop performance

Map and Reduce are Blocking

  • the transition from the map phase to the reduce phase cannot be made while map tasks are still running
  • this blocking processing nature causes latency problems (a toy illustration follows this list)
  • it causes performance degradation and is bad for on-line processing
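
A small plain-Python illustration of why the blocking matters (toy data, invented names): a reducer needs all values for its key, so reducing over partial map output silently undercounts:

    # Reducing before all map tasks finish gives wrong answers.
    from collections import Counter

    mapper1 = [("the", 1), ("fox", 1)]     # finished map task
    mapper2 = [("the", 1), ("dog", 1)]     # still running...

    partial = Counter()
    for key, value in mapper1:             # reduce over partial input
        partial[key] += value
    print(partial["the"])                  # 1 -- wrong: the true count is 2

    full = Counter()
    for key, value in mapper1 + mapper2:   # reduce after ALL maps are done
        full[key] += value
    print(full["the"])                     # 2 -- correct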

Solution

One proposed fix is to pipeline intermediate results from mappers to reducers as they are produced (e.g., the MapReduce Online approach).

Poor High Availability

  • single point of failure: the Name Node
  • if the Name Node fails, it brings down the entire cluster
  • solutions:
    • use special hardware for it
    • back it up regularly

Fixed Data Flow

  • sometimes the abstraction is too simple
    • many complex algorithms are hard to express in this paradigm (see the iterative-job sketch after this list)
  • the design assumes jobs that
    • read simple input
    • generate simple output
  • again, tools like Hive can help
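
One concrete case, reusing the run_job toy from the MapReduce Job Execution section (the same simplifying assumptions apply): an iterative algorithm, e.g. a PageRank-style computation, has to be expressed as a chain of separate jobs, with the data materialized and re-read on every pass:

    # Iterative computation as a chain of MapReduce jobs: each pass is a
    # full job whose output is materialized, then re-read by the next job.
    def map_fn(block):
        key, value = block.split()
        yield (key, 2 * int(value))        # toy update step: double the value

    def reduce_fn(key, values):
        return (key, sum(values))

    blocks = ["a 1", "b 2"]                # "key value" records
    for _ in range(3):                     # 3 iterations = 3 separate jobs
        out = run_job(blocks, map_fn, reduce_fn, R=2)
        blocks = ["%s %d" % (k, v) for k, v in out]   # re-encode for next job
    print(sorted(blocks))                  # ['a 8', 'b 16']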

Other

And finally, it's a very young technology.

Hadoop for Data Warehousing

  • http://www.stanford.edu/class/ee380/Abstracts/111116.html - a lecture about Hadoop by Cloudera's CTO


Sources

  • Lee et al., Parallel Data Processing with MapReduce: A Survey [http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf]
  • Ordonez et al., Relational versus Non-Relational Database Systems for Data Warehousing [http://www2.cs.uh.edu/~ordonez/w-2010-DOLAP-relnonrel.pdf]
  • Awadallah and Graham (Cloudera and Teradata), Hadoop and the Data Warehouse: When to Use Which [http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/]
  • Introduction to Data Science (Coursera)