Hadoop Pseudo Distributed Mode
A Hadoop cluster can be emulated with “pseudo-distributed mode”
- all Hadoop daemons run on one machine, and applications behave as if they were executed on a real cluster
- good for testing Hadoop MapReduce jobs before running them on a fully distributed cluster
Setting Up Locally
Preparation
- install Hadoop from the binary distribution, e.g. to ~/soft/hadoop-2.6.0/
- point HADOOP_CONF_DIR to a directory with the configuration, e.g. ~/conf/hadoop-local
You need to export the following environment variables:
#!/bin/bash
export HADOOP_HOME=~/soft/hadoop-2.6.0
export HADOOP_BIN=$HADOOP_HOME/bin
export HADOOP_CONF_DIR=~/conf/hadoop-cluster
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH=$HADOOP_BIN:$HADOOP_HOME/sbin:$PATH
Also, if you don’t have Java on your PATH, you need to create hadoop-env.sh in HADOOP_CONF_DIR and add (or replace):
export JAVA_HOME=/home/user/soft/jdk1.8.0_60/
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
Properties
Hadoop in “Pseudo-distributed mode” should have properties similar to these:
cat core-site.xml <?xml version="1.0"?>
cat hdfs-site.xml <?xml version="1.0"?>
cat mapred-site.xml <?xml version="1.0"?>
cat yarn-site.xml <?xml version="1.0"?>
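As a reference point, a minimal set of values for these files, following the standard Hadoop single-node setup (the port matches the namenode check below; values are typical defaults, adjust as needed):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS namenode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single datanode cannot hold more than one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service MapReduce needs -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```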
File System
- Once the configuration is set, format the filesystem:
hdfs namenode -format
- if hadoop.tmp.dir is not specified, it defaults to /tmp/hadoop-${user.name}, which is cleaned after each reboot
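To keep HDFS data across reboots, hadoop.tmp.dir can be set explicitly in core-site.xml (the path below is only an example; any persistent writable directory works):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path, not from the original notes -->
  <value>/home/user/hadoop-tmp</value>
</property>
```

After changing it, re-run hdfs namenode -format so the namenode metadata lands in the new location.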
Setting SSH Access
- the application master and workers in a cluster communicate via ssh
- it’s the same for pseudo-distributed mode, except that the master and all the workers are located on the same machine
- they still need to use ssh to communicate
- so make sure you can do ssh localhost
- if not, check that the ssh service and ssh-agent are running
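If ssh localhost asks for a password, passwordless login can be set up with a key pair — a standard sketch, assuming the default OpenSSH file locations:

```shell
# Make sure the ssh config directory exists
mkdir -p ~/.ssh
# Create an RSA key pair with an empty passphrase, unless one already exists
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Authorize the public key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, ssh localhost should log in without prompting.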
Starting Daemons
To start the daemons, use:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
Make sure the namenode has started:
telnet localhost 8020
If the namenode doesn’t start in local mode, try the following [http://stackoverflow.com/questions/8076439/namenode-not-getting-started]:
- delete all contents from the hadoop temporary folder:
rm -Rf tmp_dir
- format the namenode:
hadoop namenode -format
- start the namenode again:
start-dfs.sh
Starting datanodes
hadoop-daemon.sh start datanode
- to check if it works:
hadoop fs -put somefile /home/username/
hadoop fs -ls /home/username/
Troubleshooting:
- if the datanode doesn’t start, see [http://stackoverflow.com/questions/16725804/]
- if the yarn resourcemanager doesn’t start with “Queue configuration missing child queue names for root” [http://stackoverflow.com/questions/28357130/unable-to-start-resourcemanager-capacity-scheduler-xml-not-found-hadoop-2-6-0]:
  - copy capacity-scheduler.xml to HADOOP_CONF_DIR
Jobs Monitoring
yarn application -list
yarn application -kill application_1445857836386_0002