Hadoop Pseudo Distributed Mode

A Hadoop cluster can be emulated with the "pseudo-distributed mode"

  • all Hadoop daemons run, and applications feel like they are being executed on a real cluster
  • good for testing Hadoop MapReduce jobs before running them on a fully distributed cluster


Setting Up Locally

Preparation

  • install Hadoop from the binaries, e.g. to ~/soft/hadoop-2.6.0/
  • point HADOOP_CONF_DIR to a directory with the config files, e.g. ~/conf/hadoop-local

You need to export the following env variables:

#!/bin/bash

export HADOOP_HOME=~/soft/hadoop-2.6.0
export HADOOP_BIN=$HADOOP_HOME/bin

export HADOOP_CONF_DIR=~/conf/hadoop-local
export YARN_CONF_DIR=$HADOOP_CONF_DIR

export PATH=$HADOOP_BIN:$HADOOP_HOME/sbin:$PATH


Also, if you don't have Java on your PATH, you need to create hadoop-env.sh in HADOOP_CONF_DIR and add (or replace) the following:

export JAVA_HOME=/home/user/soft/jdk1.8.0_60/
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}


Properties

Hadoop in "Pseudo-distributed mode" should have properties similar to these:

cat core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/agrigorev/tmp/hadoop/</value>
  </property>
</configuration>
cat hdfs-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
cat mapred-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
cat yarn-site.xml 
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>


File System

  • Once the configuration is set, format the filesystem
  • hdfs namenode -format
  • if hadoop.tmp.dir is not specified, it defaults to /tmp/hadoop-${user.name}, which is cleared after each reboot


Setting SSH Access

  • the Hadoop startup scripts use ssh to launch the daemons on the master and the worker nodes
  • it's the same in pseudo-distributed mode, except that the master and all the workers are located on the same machine
  • they still connect over ssh, though
  • so make sure you can do ssh localhost without a password prompt
  • if you can't, check that the ssh service and ssh-agent are running
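If ssh localhost prompts for a password, the usual fix is key-based authentication. A minimal sketch (assuming an RSA key; generation is skipped if a key pair already exists):

```shell
# create a passwordless RSA key pair, unless one already exists
mkdir -p ~/.ssh && chmod 700 ~/.ssh
if [ ! -f ~/.ssh/id_rsa ]; then
  ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
fi

# authorize the key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

After this, ssh localhost should log in without a password prompt.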


Starting Daemons

To start, use

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver


Make sure the namenode has started:

telnet localhost 8020
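If telnet isn't installed, bash's built-in /dev/tcp redirection can do the same check (8020 is the namenode's default RPC port when fs.defaultFS specifies no explicit port). A sketch:

```shell
# succeeds (exit 0) if host:port accepts TCP connections
check_port() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

check_port localhost 8020 && echo "namenode is up" || echo "namenode is down"
```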


If the namenode doesn't start, do the following [1]:

  • delete all contents of the Hadoop temporary folder (hadoop.tmp.dir): rm -rf tmp_dir
  • format the namenode: hdfs namenode -format
  • start the namenode again: start-dfs.sh

Starting Datanodes

  • hadoop-daemon.sh start datanode
  • to check if it works:
hadoop fs -put somefile /home/username/
hadoop fs -ls /home/username/ 


Troubleshooting:

  • if the datanode doesn't start, see [2]
  • if the yarn resourcemanager doesn't start

Jobs Monitoring

yarn application -list
yarn application -kill application_1445857836386_0002
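The two commands above can be combined to kill every running application at once; a sketch (relies on the -appStates filter of yarn application -list and the standard application_ prefix of YARN application IDs):

```shell
# list running applications, extract the IDs (first column), kill each
for app in $(yarn application -list -appStates RUNNING 2>/dev/null \
               | awk '/^application_/ {print $1}'); do
  yarn application -kill "$app"
done
```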


Links

Sources
