Hadoop
Usually the term Hadoop refers to an entire family of tools but mostly to
Hadoop Ecosystem
Other tools in the Hadoop ecosystem
What is "Hadoop"
- Any combination of them is still referred to as "Hadoop"
- Many vendors (Cloudera, Hortonworks, MapR) provide their own distributions of Hadoop
- Even though Hadoop MapReduce is one of the most important parts of Hadoop, it is entirely optional: we may use just HDFS and HBase and still consider this combination Hadoop
Hadoop1 vs Hadoop2
- Hadoop1 uses its own execution engine (TaskTracker)
- The new generation of Hadoop, Hadoop2, relies on YARN for this
Hadoop Configuration
There's a Hadoop configuration folder, and each component typically has a file there with the configuration properties
Hadoop looks for configuration files in /etc/hadoop or in the directory named by the HADOOP_CONF_DIR environment variable
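The lookup described above can be sketched in a few lines of Python. This is only an illustration, not how Hadoop itself resolves the directory: the function names `find_conf_dir` and `list_site_files` are hypothetical, the environment variable is spelled `HADOOP_CONF_DIR` as in Hadoop's own scripts, and the sketch assumes that the variable, when set, takes precedence over /etc/hadoop.

```python
import os
from pathlib import Path

# Per-component configuration files usually found in the configuration folder.
SITE_FILES = ("core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml")


def find_conf_dir(env=None, default="/etc/hadoop"):
    """Return the directory where the configuration files are expected.

    Assumption of this sketch: a non-empty HADOOP_CONF_DIR wins over the default.
    """
    env = os.environ if env is None else env
    return Path(env.get("HADOOP_CONF_DIR") or default)


def list_site_files(conf_dir):
    """Return the known *-site.xml files that actually exist in conf_dir."""
    return [name for name in SITE_FILES if (Path(conf_dir) / name).is_file()]


if __name__ == "__main__":
    conf_dir = find_conf_dir()
    print(conf_dir, list_site_files(conf_dir))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.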
Common configuration files:
- core-site.xml - common properties
- hdfs-site.xml - HDFS properties
- mapred-site.xml - MapReduce properties
- yarn-site.xml - YARN properties
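Each of these files is an XML document with a `configuration` root that holds `property` elements, each with a `name` and a `value`. The sketch below reads such a file into a dictionary using only the Python standard library. The sample content, including the `hdfs://localhost:9000` address, is an illustrative value and not taken from any real cluster, and `parse_site_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml content; the host and port are made-up values.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def parse_site_xml(text):
    """Parse the text of a *-site.xml file into a {name: value} dictionary."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries that have no <name>
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


if __name__ == "__main__":
    print(parse_site_xml(SAMPLE_CORE_SITE))
```

This covers only the plain name/value pairs. It does not handle variable substitution or the layering of default and site files that Hadoop applies.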
Depending on the values in these files, Hadoop can be run in several modes:
- Standalone/Local: for testing, runs on a local machine
- Pseudo-distributed: also runs on a local machine, but jobs are executed by Hadoop services (see Hadoop Pseudo Distributed Mode for a configuration example)
- Fully Distributed: cluster (configuration is usually downloaded from cluster managers, e.g. Ambari or Cloudera Manager)
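As a rough illustration of how configuration values relate to the modes above, the sketch below guesses the mode from the `fs.defaultFS` property of core-site.xml. This is a heuristic assumed for the example, not a rule Hadoop applies: it treats a `file://` default file system as standalone (that is the default value when the property is unset), a file system hosted on the local machine as pseudo-distributed, and anything else as fully distributed. The name `guess_mode` and the host names in the usage lines are hypothetical.

```python
from urllib.parse import urlparse

# Host names treated as "this machine" by the heuristic.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def guess_mode(props):
    """Guess the run mode from a {name: value} dictionary of properties.

    Heuristic only: real deployments can break these assumptions.
    """
    default_fs = props.get("fs.defaultFS", "file:///")
    parsed = urlparse(default_fs)
    if parsed.scheme in ("", "file"):
        return "standalone"
    if parsed.hostname in LOCAL_HOSTS:
        return "pseudo-distributed"
    return "fully distributed"


if __name__ == "__main__":
    print(guess_mode({}))
    print(guess_mode({"fs.defaultFS": "hdfs://localhost:9000"}))
    print(guess_mode({"fs.defaultFS": "hdfs://namenode.example.com:8020"}))
```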
Hadoop for Data Warehousing
- Main Article: Hadoop in Data Warehousing
See also
Sources