Hadoop

Usually the term Hadoop refers to an entire family of products rather than a single one.

So it's usually a set of components such as HDFS, MapReduce, HBase and Hive.


Hadoop

  • Any combination of them is still referred to as Hadoop
  • Lots of vendors (like Hortonworks) provide their own distributions of Hadoop
  • Even though MapReduce is considered the most important part, it is entirely optional: we may use just HDFS and HBase, and this combination will still be considered Hadoop


MapReduce Component

This is a data processing tool that works on top of HDFS.

MapReduce Job Execution

Processing

  • each processing job is broken down into pieces
  • each piece is given to a map task to execute
  • in addition, there are one or more reduce tasks

So it's performed in two phases (see the sketch after this list):

  • map phase
  • reduce phase
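
For illustration, here is a minimal sketch of the two phases for the classic word-count job, written against the standard Hadoop Java API (the class names and the whitespace tokenization are only illustrative assumptions, not something prescribed above):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // map phase: for every line of the input split, emit (word, 1) pairs
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);                 // intermediate (key, value) pair
          }
        }
      }

      // reduce phase: sum all counts that arrived for the same word
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          ctx.write(word, new IntWritable(sum));  // final (word, total) pair
        }
      }
    }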

Algorithm

  • master picks some idle workers and assigns them a map task
  • preparations (before starting the map tasks)
    • the input file is loaded into DFS
    • it's partitioned into blocks (typically 64 MB each)
    • each block is replicated 3 times to guarantee fault-tolerance
  • map phase
    • each block is assigned to a map worker
    • the worker applies the map function to it
    • intermediate results are sorted locally
    • then they are stored on the mapper's local disk
    • the results are partitioned into $R$ parts, one per reduce task
      • $R$ is specified beforehand
      • partitioning is typically done by hash(key) % $R$ (see the partitioner sketch after this list)
  • wait until all map tasks are completed
  • before reduce
    • the master assigns reduce tasks to workers
    • the intermediate results are shuffled and assigned to reducers
    • each reducer pulls its partition from the mappers' disks
      • all map results are already partitioned and stored on the mappers' disks
      • the reducer reads its input and groups it by key
    • each record is assigned to only one reducer
  • reduce phase
    • now apply the reduce function to each group
    • output is stored and replicated 3 times
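
The partitioning step mentioned above is typically just a hash of the key modulo $R$; this is, roughly, how Hadoop's default HashPartitioner implements it:

    import org.apache.hadoop.mapreduce.Partitioner;

    // assigns each intermediate (key, value) pair to one of the R reduce tasks
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // mask the sign bit so the result is non-negative, then take modulo R
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }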


Hadoop in one picture:

hadoop.png

(Figure source: Huy Vo, NYU Poly and [1])


In short:

  • There's only one Name Node
  • the Name Node divides the input files into $M$ splits
  • then the Name Node assigns workers (servers) to perform the $M$ map tasks
  • while they are computing, it keeps track of their progress
  • workers write their results to their local disks, partitioning them into $R$ regions
  • once the map phase is done, the Name Node assigns workers to the $R$ reduce tasks ($R$ is set when the job is configured; see the driver sketch after this list)
  • reduce workers read the regions from the map workers' local disks
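
To make this concrete, a hypothetical job driver for the word-count sketch above could look like this; it wires the mapper and the reducer together and sets $R$ explicitly, while $M$ follows from the number of input splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // R (the number of reduce tasks, i.e. output regions) is fixed beforehand;
        // M (the number of map tasks) is determined by the number of input splits
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Such a driver is typically packaged into a jar and submitted with hadoop jar wordcount.jar WordCountDriver <input path> <output path>.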


map-reduce2.png


Fault-Tolerance

Fault-tolerance is achieved by the execution scheme:

  • the master detects failures and re-assigns the tasks of failed nodes to other nodes in the cluster
  • this naturally leads to load balancing


Runtime Scheduling Scheme

  • For job execution, the MapReduce component doesn't build any execution plan beforehand
  • It relies on the fault-tolerance scheme, which naturally leads to load balancing
  • Nodes that have completed their tasks are assigned new data blocks

Result: no communication costs

  • MR tasks are done without communication between tasks


Advantages

  • MapReduce is simple and expressive
    • computing aggregation is easy
  • flexible
    • no dependency on Data Model or schema
      • especially good for unstructured data
      • something you cannot do in (relational) databases
    • can write in any programming language
  • fault-tolerance
    • detect failures and re-assigns tasks of failed nodes to others in the cluster
  • high scalability
    • even though not in the most efficient way
  • cheap
    • runs on commodity hardware
    • open source


Disadvantages

However, it has many disadvantages.

No Query Language

No high-level declarative language such as SQL

  • MapReduce is very low-level: you need to know a programming language
  • programs are expensive to write and to maintain
  • programmers who can write them are expensive
  • for Data Warehousing: OLAP-style queries are awkward to express in MapReduce

Possible solutions:

  • Hive: an SQL-like declarative language (HiveQL) on top of MapReduce (see below)


Performance

Performance issues:

  • no schema, no index, need to parse each input
    • may cause performance degradation
  • not tuned for multidimensional queries
  • possible solutions: HBase, Hive
  • because of fault-tolerance and scalability - it's not always optimized for I/O cost
    • all intermediate results are materialized (no Pipelining)
    • triple replication
  • high latency
    • big overhead even for small queries (job start time + JVM start time)

Solutions for I/O optimization

  • HBase
  • Hadoop++ [2]
    • HAIL (Hadoop Aggressive Indexing Library) as an enhancement for HDFS
    • structured file format
    • 20x improvement in Hadoop performance


Map and Reduce are Blocking

  • a transition from the map phase to the reduce phase cannot be made while map tasks are still running
  • this blocking processing nature leads to latency problems
  • it causes performance degradation, which is bad for on-line processing

Solution


Bad High Availability

  • single point of failure: the Name Node
  • if the Name Node fails, it brings down the entire cluster
  • solutions:
    • use special hardware for it
    • back it up regularly


Fixed Data Flow

  • Sometimes the abstraction is too simple
    • many complex algorithms are hard to express with this paradigm
  • Design
    • read simple input
    • generate simple output
  • Again, tools like Hive can help


Other

And finally, the technology is still very young.


Hadoop for Data Warehousing

Main Article: Hadoop in Data Warehousing


Links

See also

Sources

  • Lee et al., Parallel Data Processing with MapReduce: A Survey [5]
  • Ordonez et al., Relational versus Non-Relational Database Systems for Data Warehousing [6]
  • Awadallah and Graham (Cloudera and Teradata), Hadoop and the Data Warehouse: When to Use Which [7]
  • Introduction to Data Science (Coursera)