Setting up Hadoop V2 on Multi-Node Cluster


The following are instructions for setting up a multi-node cluster on Hadoop version 2.

1. Set up a single node cluster as follows:

  • Create a virtual server at Rackspace with the Ubuntu 14 operating system and 8 GB of memory.
  • Connect to the new server using PuTTY.
  • Sign in using the root password.
  • Create a new user called hduser with a password.
  • Create a group called hadoop and add hduser to the hadoop group.
  • Install the latest version of Java onto the server. The version used in this solution is Java 8.
  • Edit /etc/hosts and make sure the IP address of this server appears in it with an alias. In this case we will be using the server alias Project-Hadoop, which is referred to in the configurations below (see the sketch after this list).
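
A rough sketch of these preparation steps on Ubuntu (the IP address 10.0.0.10 is a placeholder; substitute your server's address, and note that the Java package name varies by Ubuntu release):

  # Create the hadoop group and the hduser account (prompts for a password)
  sudo addgroup hadoop
  sudo adduser --ingroup hadoop hduser

  # Install Java 8 and confirm the version (package name is an assumption)
  sudo apt-get update
  sudo apt-get install -y openjdk-8-jdk
  java -version

  # Map the server alias used throughout this guide in /etc/hosts
  echo "10.0.0.10  Project-Hadoop" | sudo tee -a /etc/hosts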
1a. Install Hadoop V2 as follows:
  • Download the latest Hadoop version. The version used in this solution was Hadoop 2.5.2.
  • From /home/hduser/yarn, run: sudo tar xzf hadoop-2.5.2.tar.gz
  • sudo chown -R hduser:hadoop /home/hduser/yarn/hadoop-2.5.2

The path created for the Hadoop install is: /home/hduser/yarn/hadoop-2.5.2
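
A consolidated sketch of the download and extraction (the Apache archive URL is an assumption; any mirror carrying the 2.5.2 release works):

  # As hduser, download and unpack Hadoop 2.5.2 under ~/yarn
  mkdir -p /home/hduser/yarn && cd /home/hduser/yarn
  wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
  sudo tar xzf hadoop-2.5.2.tar.gz
  sudo chown -R hduser:hadoop /home/hduser/yarn/hadoop-2.5.2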

1b. Set Environment Variables

Edit ~/.bashrc (as hduser) and include the following environment variables (a consolidated snippet follows the list):

  • $ export HADOOP_HOME=$HOME/yarn/hadoop-2.5.2
  • $ export HADOOP_MAPRED_HOME=$HOME/yarn/hadoop-2.5.2
  • $ export HADOOP_COMMON_HOME=$HOME/yarn/hadoop-2.5.2
  • $ export HADOOP_HDFS_HOME=$HOME/yarn/hadoop-2.5.2
  • $ export HADOOP_YARN_HOME=$HOME/yarn/hadoop-2.5.2
  • $ export HADOOP_CONF_DIR=$HOME/yarn/hadoop-2.5.2/etc/hadoop
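
The same variables can be appended to ~/.bashrc in one step. Adding the bin and sbin directories to PATH is an optional extra, not part of the original steps, that lets you drop the $HADOOP_HOME prefix from later commands:

  cat >> ~/.bashrc <<'EOF'
  export HADOOP_HOME=$HOME/yarn/hadoop-2.5.2
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=$HADOOP_HOME
  export HADOOP_YARN_HOME=$HADOOP_HOME
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin   # optional convenience
  EOF
  source ~/.bashrc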
1c. Update Configuration Files

Edit the following XML configuration files, found in $HADOOP_HOME/etc/hadoop, as follows:

Each configuration file and its XML content is shown below.

core-site.xml (note that fs.default.name is the deprecated Hadoop 2 alias of fs.defaultFS; either works here):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Project-Hadoop:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/yarn/hadoop-2.5.2/tmp</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hduser/yarn/yarn_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hduser/yarn/yarn_data/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

mapred-site.xml (this file does not exist by default; copy $HADOOP_HOME/etc/hadoop/mapred-site.xml.template to mapred-site.xml first):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>Project-Hadoop:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>Project-Hadoop:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>Project-Hadoop:8040</value>
  </property>
</configuration>
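
The configurations above reference directories that do not yet exist. A short sketch to create them as hduser (paths taken from the values above):

  # Create the temp and HDFS storage directories named in the configs
  mkdir -p /home/hduser/yarn/hadoop-2.5.2/tmp
  mkdir -p /home/hduser/yarn/yarn_data/hdfs/namenode
  mkdir -p /home/hduser/yarn/yarn_data/hdfs/datanode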

2. Testing the single-node cluster and Hadoop

  • Format the HDFS namenode using the command: $HADOOP_HOME/bin/hadoop namenode -format (this must be done once before HDFS is started for the first time, and it erases anything already stored in HDFS).
  • Run $HADOOP_HOME/sbin/start-dfs.sh
  • Type jps at the command prompt. The following entities should have started: NameNode, SecondaryNameNode, DataNode.
  • Run $HADOOP_HOME/sbin/start-yarn.sh
  • Type jps at the command prompt. The following entities should have started: ResourceManager and NodeManager.
  • Type $HADOOP_HOME/bin/hadoop dfsadmin -report
  • There should be a report describing the setup, including one datanode.
  • Type: $HADOOP_HOME/bin/hadoop
  • A list of hadoop commands should appear.
  • If all of the above function correctly, then Hadoop has been set up and installed correctly (see the example jps output below).
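
As a rough illustration (the process IDs are placeholders and will differ on your machine), the jps output after both start scripts have run might look like:

  $ jps
  2834 NameNode
  2961 DataNode
  3150 SecondaryNameNode
  3302 ResourceManager
  3427 NodeManager
  3610 Jps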
3. Setting up a multi-node cluster

In order to extend this setup to a multi-node cluster, one needs to replicate the above on a new server and then set up the hosts file, the slaves file, and the SSH keys correctly.

The simplest method is to image the working single-node cluster and then restore the image onto a new server. The new server does not need to be the same size; for this project the slave node was set up as a 2 GB server to save on hosting costs.

Once the new server has been restored, add the new server's (slave node's) IP address to the hosts file (/etc/hosts) of the master node. Similarly, confirm that both nodes' IP addresses are in the slave node's hosts file too.
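
For example, the hosts file on both nodes might contain entries like the following (the IP addresses and the slave alias are placeholders; substitute your own):

  10.0.0.10  Project-Hadoop         # master node, alias used in the configs
  10.0.0.11  Project-Hadoop-Slave   # slave node (assumed alias)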

Add the slave node's IP address (or alias) to the slaves file ($HADOOP_HOME/etc/hadoop/slaves) on the master node. Only the master node's slaves file needs to be maintained; the start scripts read it to decide where to launch the worker (DataNode and NodeManager) processes. A sample slaves file follows.
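
The slaves file lists one worker host per line. For this two-node cluster it might read as follows (the master's alias is included because the master also runs a DataNode, which is why the report in step 4 shows two datanodes; the slave alias is an assumed name):

  Project-Hadoop
  Project-Hadoop-Slave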

Generate an SSH key on both servers (nodes) and copy each node's public key to the other. This allows the two nodes to interact with each other without needing to enter a password each time. Generate the key pair by issuing the command: ssh-keygen -t rsa -P "". The public key must be placed in the $HOME/.ssh/authorized_keys file of both servers; locally this is done with: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys, and the key can be pushed to the other node with ssh-copy-id, as in the sketch below.
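A sketch of the full key exchange, run as hduser (Project-Hadoop-Slave is an assumed alias for the slave node; substitute the name used in your hosts file):

  # Generate a passphrase-less RSA key pair (run this on both nodes)
  ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa

  # Authorize the key locally so the node can also ssh to itself
  cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

  # Push the public key to the other node (run on the master; repeat in reverse on the slave)
  ssh-copy-id hduser@Project-Hadoop-Slave

  # Verify that password-less login now works
  ssh hduser@Project-Hadoop-Slave exit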

4. Testing the multi-node cluster and Hadoop v2

Navigate to the master node again and run the next few commands.

  • Run $HADOOP_HOME/sbin/stop-dfs.sh
  • Run $HADOOP_HOME/sbin/stop-yarn.sh
  • Run $HADOOP_HOME/sbin/start-dfs.sh
  • Type jps at the command prompt. The following entities should have started: NameNode, SecondaryNameNode, DataNode.
  • Run $HADOOP_HOME/sbin/start-yarn.sh
  • Type jps at the command prompt. The following entities should have started: ResourceManager and NodeManager.
  • Type $HADOOP_HOME/bin/hadoop dfsadmin -report
  • There should be a report describing the setup, including two datanodes.
  • Try jps on the slave node. The following entities should have started: NodeManager and DataNode.
  • Type $HADOOP_HOME/bin/hadoop

A list of hadoop commands should appear.

If all of the above function correctly, then Hadoop has been set up and installed correctly.

To start from a clean file system, reformat HDFS by issuing $HADOOP_HOME/bin/hadoop namenode -format (with the daemons stopped). Note that after a reformat, the datanode data directories (/home/hduser/yarn/yarn_data/hdfs/datanode) on both nodes must be cleared before the next start, otherwise the datanodes will refuse to register because of a cluster-ID mismatch.

Note that as the slave server was restored from an image, the settings in the configuration XML files are already correct and refer to the master node.
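
As a final end-to-end check (a sketch: the examples jar ships inside the 2.5.2 distribution, and the paths assume the install location used in this guide), run a small MapReduce job and a simple HDFS round trip:

  # Estimate pi with 2 map tasks of 10 samples each; the result prints to the console
  $HADOOP_HOME/bin/hadoop jar \
      $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 2 10

  # Put a file into HDFS and list it back to confirm the file system is writable
  echo "hello hadoop" > /tmp/hello.txt
  $HADOOP_HOME/bin/hadoop fs -put /tmp/hello.txt /hello.txt
  $HADOOP_HOME/bin/hadoop fs -ls /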
