Hadoop

Usually the term Hadoop refers to an entire family of tools but mostly to


Hadoop Ecosystem

Other tools in the Hadoop ecosystem


What is "Hadoop"

  • Any combination of them is still referred to as "Hadoop"
  • Many vendors (Cloudera, Hortonworks, MapR) provide their own distributions of Hadoop
  • Even though Hadoop MapReduce is the most important part of Hadoop, it is entirely optional: we may use just HDFS and HBase and will still consider this combination Hadoop


Hadoop1 vs Hadoop2

  • Hadoop1 uses its own execution engine (JobTracker and TaskTracker daemons)
  • The new generation of Hadoop, Hadoop2, relies on YARN for this


Hadoop Configuration

Hadoop keeps its settings in a configuration folder, where each component typically has its own file of configuration properties.

Hadoop looks for configuration files in /etc/hadoop or in the directory named by the HADOOP_CONF_DIR environment variable.
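The lookup described above can be sketched in a few lines of Python. This is only an illustration, not how the Hadoop scripts themselves are implemented: the function name resolve_conf_dir is hypothetical, the variable is assumed to be spelled HADOOP_CONF_DIR as in the stock Hadoop scripts, and the /etc/hadoop fallback is taken from the text (real installations may use a different default).

```python
import os
from pathlib import Path

# Fallback location taken from the text; actual installations may differ.
DEFAULT_CONF_DIR = "/etc/hadoop"


def resolve_conf_dir(env=None):
    """Return the directory to search for *-site.xml files.

    `env` is a mapping of environment variables; it defaults to os.environ
    and is a parameter only so the function is easy to try out.
    """
    env = os.environ if env is None else env
    # A set, non-empty HADOOP_CONF_DIR takes precedence over the default.
    return Path(env.get("HADOOP_CONF_DIR") or DEFAULT_CONF_DIR)
```

Calling resolve_conf_dir() with no arguments reads the real environment; passing a dictionary makes it possible to experiment without changing any shell settings.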

Common configuration files:

  • core-site.xml - properties common to all components
  • hdfs-site.xml - HDFS properties
  • mapred-site.xml - MapReduce properties
  • yarn-site.xml - YARN properties
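Each of these files is an XML document whose configuration root element holds property entries, each with a name and a value. A minimal sketch of reading one into a dictionary follows. The helper parse_site_xml is hypothetical, and the sample content (fs.defaultFS pointing at hdfs://localhost:9000) is only an illustrative value, not a recommended setting.

```python
import xml.etree.ElementTree as ET

# Illustrative file content; the address is an assumption for the example.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def parse_site_xml(xml_text):
    """Return {name: value} for every <property> in a *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            # Skip malformed entries that have no <name> element.
            continue
        # A missing or empty <value> is stored as an empty string.
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props
```

The same function should work for hdfs-site.xml, mapred-site.xml and yarn-site.xml, since they share this layout.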

Depending on the values set in these files, Hadoop can run in one of several modes:

  • Standalone/Local: for testing, run on a local machine
  • Pseudodistributed: also run on a local machine, but jobs are executed by Hadoop services (see Hadoop Pseudo Distributed Mode for a configuration example)
  • Fully Distributed: cluster (configuration is usually downloaded from cluster managers, e.g. Ambari or Cloudera Manager)
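To make the link between property values and modes concrete, here is a rough heuristic in Python. It is a sketch under stated assumptions, not something Hadoop itself does: Hadoop has no explicit "mode" switch, the function guess_mode is hypothetical, and the rule used (local file system plus local job runner means standalone, a NameNode on localhost means pseudodistributed, anything else means a cluster) is a simplification. The defaults file:/// for fs.defaultFS and local for mapreduce.framework.name match Hadoop's shipped defaults.

```python
from urllib.parse import urlparse

# Host names treated as "this machine" for the purpose of the heuristic.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def guess_mode(core_site, mapred_site):
    """Guess the run mode from parsed core-site and mapred-site properties.

    Both arguments are plain dicts of property name -> value.
    """
    fs_uri = core_site.get("fs.defaultFS", "file:///")
    framework = mapred_site.get("mapreduce.framework.name", "local")
    parsed = urlparse(fs_uri)

    # Local file system and in-process job runner: nothing runs as a service.
    if parsed.scheme == "file" and framework == "local":
        return "standalone"
    # Services are in use, but the file system lives on this machine.
    if parsed.hostname is None or parsed.hostname in LOCAL_HOSTS:
        return "pseudodistributed"
    # The file system is addressed through a remote host or nameservice.
    return "fully distributed"
```

For example, guess_mode({}, {}) falls back to the defaults and reports standalone, while a fs.defaultFS of hdfs://localhost:9000 is classified as pseudodistributed.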


Hadoop for Data Warehousing

Main Article: Hadoop in Data Warehousing


See also

Sources