ML Wiki

Usually the term Hadoop refers to an entire family of tools but mostly to

Other tools in the Hadoop ecosystem

• Any combination of them is still referred to as "Hadoop"
• Many vendors (Cloudera, Hortonworks, MapR) provide their own distributions of Hadoop
• Even though Hadoop MapReduce is the most important part of Hadoop, it is entirely optional: we may use just HDFS and HBase and still consider this combination Hadoop

• Hadoop1 uses its own execution engine (TaskTracker)
• The new generation of Hadoop, Hadoop2, relies on YARN for this

There's a Hadoop configuration folder, and each component typically has a file there with the configuration properties

Hadoop looks for configuration files in /etc/hadoop or in the directory given by the HADOOP_CONF_DIR environment variable
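The lookup described above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's own logic: the function name `resolve_conf_dir` is hypothetical, the environment variable is assumed to be `HADOOP_CONF_DIR` (the name used by the standard Hadoop scripts), and it is assumed that a set variable takes priority over the `/etc/hadoop` default.

```python
import os
from pathlib import Path

# Fallback location named in the text above.
DEFAULT_CONF_DIR = "/etc/hadoop"


def resolve_conf_dir(env=None):
    """Return the directory that would be searched for *-site.xml files.

    `env` is a mapping of environment variables; it defaults to os.environ
    and is a parameter only so the function is easy to try out.
    """
    env = os.environ if env is None else env
    # Assumption: a non-empty HADOOP_CONF_DIR wins over the default.
    return Path(env.get("HADOOP_CONF_DIR") or DEFAULT_CONF_DIR)


if __name__ == "__main__":
    print(resolve_conf_dir())
```

Passing the environment as an argument keeps the sketch independent of the machine it runs on.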

Common configuration files:

• core-site.xml - common properties
• hdfs-site.xml - HDFS properties
• mapred-site.xml - MapReduce properties
• yarn-site.xml - YARN properties
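Each of these files has the same shape: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. The following is a minimal sketch of reading such a file into a dictionary with Python's standard library. The helper `parse_site_xml` and the sample contents are illustrative assumptions; the `hdfs://localhost:9000` address is only an example value.

```python
import xml.etree.ElementTree as ET

# Illustrative contents of a core-site.xml; the value is an example only.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def parse_site_xml(text):
    """Parse the text of a *-site.xml file into a {name: value} dict."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:  # skip malformed entries that have no <name>
            props[name.strip()] = value.strip()
    return props


if __name__ == "__main__":
    print(parse_site_xml(SAMPLE_CORE_SITE))
```

To read a real file, the same function can be given the file's text, for example `parse_site_xml(open(path).read())`.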

Depending on the values in these files, Hadoop can run in several modes:

• Standalone/Local: for testing, runs on a local machine
• Pseudodistributed: also runs on a local machine, but jobs are executed by Hadoop services (see Hadoop Pseudo Distributed Mode for a configuration example)
• Fully Distributed: runs on a cluster (the configuration is usually downloaded from a cluster manager, e.g. Ambari or Cloudera Manager)
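To make the link between configuration values and modes concrete, here is a rough heuristic in Python. It is not something Hadoop itself does, and `guess_mode` is a hypothetical helper. It assumes the properties `fs.defaultFS` (default `file:///`) and `mapreduce.framework.name` (default `local`) have already been read into a dictionary, and that a NameNode on localhost indicates a single-machine setup.

```python
from urllib.parse import urlparse

# Host names treated as "this machine" (an assumption of this sketch).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def guess_mode(props):
    """Guess the run mode from a {name: value} dict of Hadoop properties."""
    fs_uri = urlparse(props.get("fs.defaultFS", "file:///"))
    framework = props.get("mapreduce.framework.name", "local")

    # Local file system and local job runner: nothing runs as a service.
    if fs_uri.scheme == "file" and framework == "local":
        return "standalone"
    # No host in the file system URI: the heuristic cannot decide.
    if fs_uri.hostname is None:
        return "unknown"
    # Services are used, but the NameNode is on this machine.
    if fs_uri.hostname in LOCAL_HOSTS:
        return "pseudo-distributed"
    return "fully distributed"


if __name__ == "__main__":
    print(guess_mode({"fs.defaultFS": "hdfs://localhost:9000",
                      "mapreduce.framework.name": "yarn"}))
```

The heuristic can be wrong: for example, a one-node cluster addressed by its real host name would be reported as fully distributed.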

Hadoop for Data Warehousing

Main Article: Hadoop in Data Warehousing