
Hadoop Pseudo Distributed Mode

A Hadoop cluster can be emulated with "pseudo-distributed mode":

  • all Hadoop daemons run, and applications behave as if they were executed on a real cluster
  • good for testing Hadoop MapReduce jobs before running them on a fully distributed cluster

Setting Up Locally

Preparation

  • install Hadoop from the binary distribution, e.g. into ~/soft/hadoop-2.6.0/ (see the download sketch after this list)
  • point HADOOP_CONF_DIR to a directory with the configuration, e.g. ~/conf/hadoop-local
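
For example, fetching and unpacking the binaries could look like this (the archive.apache.org mirror URL is an assumption; any Apache mirror will do):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
mkdir -p ~/soft && tar -xzf hadoop-2.6.0.tar.gz -C ~/soft
mkdir -p ~/conf/hadoop-local    # empty config directory, filled in below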

You need to export the following env variables:

#!/bin/bash

export HADOOP_HOME=~/soft/hadoop-2.6.0
export HADOOP_BIN=$HADOOP_HOME/bin

export HADOOP_CONF_DIR=~/conf/hadoop-local
export YARN_CONF_DIR=$HADOOP_CONF_DIR

export PATH=$HADOOP_BIN:$HADOOP_HOME/sbin:$PATH
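
These exports can go into ~/.bashrc (or be sourced from a script), so that the hadoop, hdfs and yarn commands used below are on the PATH in every shell.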

Also, if you don't have Java on your PATH, you need to create hadoop-env.sh in HADOOP_CONF_DIR and add (or replace) the following:

export JAVA_HOME=/home/user/soft/jdk1.8.0_60/
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

Properties

Hadoop in “Pseudo-distributed mode” should have properties similar to these:

cat core-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/agrigorev/tmp/hadoop/</value>
  </property>
</configuration>

cat hdfs-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

cat mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

cat yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

File System

  • Once the configuration is set, format the filesystem (see the sketch after this list):
  • hdfs namenode -format
  • if hadoop.tmp.dir is not specified, it defaults to /tmp/hadoop-${user.name}, which is cleaned on each reboot
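
A minimal sketch, assuming the hadoop.tmp.dir value from core-site.xml above:

mkdir -p /home/agrigorev/tmp/hadoop/    # must exist and be writable by the Hadoop user
hdfs namenode -format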

Setting SSH Access

  • The master and the workers of a Hadoop cluster communicate via ssh
  • it's the same for pseudo-distributed mode, except that the master and all the workers are located on the same machine
  • they still need ssh to communicate
  • so make sure you can do ssh localhost (see the sketch after this list)
  • if not, check that the ssh service and ssh-agent are running
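
A minimal sketch of passwordless ssh to localhost, assuming the standard key locations:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost    # should log in without a password prompt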

Starting Daemons

To start, use

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

Make sure the namenode has started:

telnet localhost 8020
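
Another quick check is jps, which lists the running JVM processes; after the commands above it should show daemons such as NameNode, ResourceManager, NodeManager and JobHistoryServer:

jps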

If the namenode doesn't start, do the following [http://stackoverflow.com/questions/8076439/namenode-not-getting-started]:

  • delete all contents of the Hadoop temporary folder (hadoop.tmp.dir): rm -Rf tmp_dir
  • format the namenode: hadoop namenode -format
  • start the namenode again: start-dfs.sh

Starting Datanodes

  • hadoop-daemon.sh start datanode
  • to check if it works:

hadoop fs -put somefile /home/username/
hadoop fs -ls /home/username/
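
At this point the whole setup can be smoke-tested with one of the MapReduce examples bundled with the distribution (the jar path assumes the 2.6.0 binary layout):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 10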

Troubleshooting:

  • if the datanode doesn't start, see [http://stackoverflow.com/questions/16725804/]
  • if the yarn resourcemanager doesn't start with
    • "Queue configuration missing child queue names for root" [http://stackoverflow.com/questions/28357130/unable-to-start-resourcemanager-capacity-scheduler-xml-not-found-hadoop-2-6-0]
    • copy capacity-scheduler.xml to HADOOP_CONF_DIR (see the sketch after this list)
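
The default capacity-scheduler.xml ships with the binaries, so the fix can be sketched as (the source path assumes the 2.6.0 binary layout):

cp $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml $HADOOP_CONF_DIR/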

Jobs Monitoring

yarn application -list
yarn application -kill application_1445857836386_0002
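
The same information is also available in the ResourceManager web UI, which listens on http://localhost:8088 by default.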
