Hadoop Pseudo Distributed Mode

A Hadoop cluster can be emulated with the "pseudo-distributed mode"

  • all Hadoop daemons run, and applications feel like they are being executed on a real cluster
  • good for testing Hadoop MapReduce jobs before running them on a fully distributed cluster


Setting Up Locally

Preparation

  • install Hadoop from the binaries, e.g. to ~/soft/hadoop-2.6.0/
  • point HADOOP_CONF_DIR to a directory with the config files, e.g. ~/conf/hadoop-local

You need to export the following env variables:

#!/bin/bash

export HADOOP_HOME=~/soft/hadoop-2.6.0
export HADOOP_BIN=$HADOOP_HOME/bin

export HADOOP_CONF_DIR=~/conf/hadoop-local
export YARN_CONF_DIR=$HADOOP_CONF_DIR

export PATH=$HADOOP_BIN:$HADOOP_HOME/sbin:$PATH


Also, if you don't have Java on your PATH, you need to create hadoop-env.sh in HADOOP_CONF_DIR and add (or replace) the following:

export JAVA_HOME=/home/user/soft/jdk1.8.0_60/
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}


Properties

Hadoop in "Pseudo-distributed mode" should have properties similar to these:

cat core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/agrigorev/tmp/hadoop/</value>
  </property>
</configuration>
cat hdfs-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
cat mapred-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
cat yarn-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>


File System

  • Once the configuration is set, format the filesystem
  • hdfs namenode -format
  • if hadoop.tmp.dir is not specified, it defaults to /tmp/hadoop-${user.name}, which is cleared after each reboot


Setting SSH Access

  • the Hadoop startup scripts use ssh to launch the daemons on the master and the worker nodes
  • it's the same in pseudo-distributed mode, except that the master and all the workers are located on the same machine
  • they still connect over ssh, though
  • so make sure you can do ssh localhost without a password prompt
  • if you can't, check that the ssh service and ssh-agent are running
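If ssh localhost prompts for a password, the usual fix is key-based authentication. A minimal sketch (assuming an RSA key; generation is skipped if a key pair already exists):

```shell
# create a passwordless RSA key pair, unless one already exists
mkdir -p ~/.ssh && chmod 700 ~/.ssh
if [ ! -f ~/.ssh/id_rsa ]; then
  ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
fi

# authorize the key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After this, ssh localhost should log in without a password prompt.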


Starting Daemons

To start, use

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver


Make sure the namenode has started:

telnet localhost 8020
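If telnet isn't installed, bash's built-in /dev/tcp redirection can do the same check (8020 is the namenode's default RPC port when fs.defaultFS specifies no explicit port). A sketch:

```shell
# succeeds (exit 0) if host:port accepts TCP connections
check_port() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

check_port localhost 8020 && echo "namenode is up" || echo "namenode is down"
```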


If the namenode doesn't start, do the following [1]:

  • delete all contents of the Hadoop temporary folder (hadoop.tmp.dir): rm -rf tmp_dir
  • format the namenode: hdfs namenode -format
  • start the namenode again: start-dfs.sh

Starting Datanodes

  • hadoop-daemon.sh start datanode
  • to check if it works:
hadoop fs -put somefile /home/username/
hadoop fs -ls /home/username/ 


Troubleshooting:

  • if the datanode doesn't start, see [2]
  • if the yarn resourcemanager doesn't start

Jobs Monitoring

yarn application -list
yarn application -kill application_1445857836386_0002
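The two commands above can be combined to kill every running application at once; a sketch (relies on the -appStates filter of yarn application -list and the standard application_ prefix of YARN application IDs):

```shell
# list running applications, extract the IDs (first column), kill each
for app in $(yarn application -list -appStates RUNNING 2>/dev/null \
               | awk '/^application_/ {print $1}'); do
  yarn application -kill "$app"
done
```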


Links

Sources
