Making a Multi-Node Distributed System with Hadoop 3.2.1

Hello Peppos!!

Ever wanted to share tasks between your computers/servers?

Worry no more, because today you are going to learn how to set up Hadoop on your machines (only the connecting part, though…).

Explanation

  • HDFS — A distributed file system that stores data across the machines in the cluster.
  • MapReduce — A programming model for large-scale data processing.

Requirements

  • VirtualBox.
  • An ISO file for Ubuntu 18.04.
  • Computing power for 3 virtual machines.
  • An Internet Connection.

Preparing the virtual machines

First, update the machine's packages:

sudo apt update && sudo apt upgrade

After the updates, download the Hadoop installation archive:

sudo wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

And now, we will need the following programs:

sudo apt install pdsh ssh openjdk-8-jdk

Testing the programs

Check that Java was installed correctly:

java -version

Next, we will need to edit our .bashrc file, in this case using nano:

sudo nano .bashrc

Go to the last line and enter:

export PDSH_RCMD_TYPE=ssh

To save in nano, press Ctrl+X, then Y to accept, and ENTER.

To test out SSH, you first need to create an SSH key:

ssh-keygen -t rsa -P ""

Copy the key to authorized keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

and now you can connect to localhost:

ssh localhost

To exit, enter the command exit.

At this point we should edit some network settings in VirtualBox so the machines will be able to reach each other (for example, by adding a Host-Only adapter).
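The same network setup can also be done from the command line with VBoxManage instead of the GUI. This is only a sketch: the VM name "master" and the interface name vboxnet0 are assumptions, and yours may differ.

```shell
# Create a host-only network interface on the host
# (VirtualBox usually names the first one vboxnet0)
VBoxManage hostonlyif create

# Attach a second, host-only network adapter to the VM
# ("master" and vboxnet0 are this sketch's assumptions)
VBoxManage modifyvm "master" --nic2 hostonly --hostonlyadapter2 vboxnet0
```

The Settings → Network dialog in the VirtualBox GUI achieves the same result.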

Installing Hadoop

tar xzf hadoop-3.2.1.tar.gz

For easier access to the Hadoop directory, we can rename it:

mv hadoop-3.2.1 hadoop

We will now create a sudo-level user:

sudo adduser hadoopuser

And run the following commands to configure the hadoopuser's groups:

sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo

And now we can edit some files

[environment]

sudo nano /etc/environment

And edit the file as follows (the new parts are the Hadoop bin and sbin entries appended to PATH, plus the JAVA_HOME line):

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

[hadoop-env.sh]

nano hadoop/etc/hadoop/hadoop-env.sh

Replace the existing export JAVA_HOME line with:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

And now we move the hadoop folder to /usr/local:

sudo mv hadoop /usr/local/hadoop

And make hadoopuser the owner of that folder:

sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/

We can now see the machine's IP address with:

ip address

As we can see, my IP address is 192.168.56.111.

That means that for all machines it’ll be:

  • Master: 192.168.56.111
  • Slave1: 192.168.56.112
  • Slave2: 192.168.56.113

[hosts]

sudo nano /etc/hosts
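The hosts file should map each hostname to the static IP chosen above, so the machines can refer to each other by name. Using the addresses from this setup:

```
192.168.56.111 master
192.168.56.112 slave1
192.168.56.113 slave2
```

These same lines must end up in /etc/hosts on all three machines.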

At this point we need to clone our virtual machine (in VirtualBox, right-click the VM and choose Clone) to create the slaves; make sure to generate a new MAC address for each clone.

After creating the other 2 machines, we run all of them at the same time so we can edit the files on each one.

[hostname]

sudo nano /etc/hostname

On each machine, set the hostname to master, slave1, or slave2 accordingly.

now we need to reboot the machines:

sudo reboot

Master Config

On the master machine, log in as hadoopuser:

su - hadoopuser

and create an ssh key for the hadoopuser:

ssh-keygen -t rsa

and copy this key to every machine:

ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
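To check that passwordless SSH now works from the master, a quick loop over the hostnames defined in /etc/hosts can be used (this is just a convenience sketch, not part of the original setup):

```shell
# Each command should print the remote machine's hostname
# without prompting for a password
for host in master slave1 slave2; do
  ssh hadoopuser@"$host" hostname
done
```

If any of these prompts for a password, re-run ssh-copy-id for that machine before continuing.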

[core-site.xml]

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following code inside <configuration>:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>

[hdfs-site.xml]

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following code inside <configuration>:

<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>

[workers]

sudo nano /usr/local/hadoop/etc/hadoop/workers

Add:

slave1
slave2

And now we copy these configurations to slave1 and slave2:

scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/

[.bashrc]

On the master, make sure hadoopuser's .bashrc also contains:

export PDSH_RCMD_TYPE=ssh

Now we format the hdfs file system:

source .bashrc
hdfs namenode -format

And after this we start dfs:

start-dfs.sh

And to confirm that everything is going alright, we run the jps command on the slaves and check that a DataNode process is listed.

Going back to the master, we can see what is going on at master:9870; the Datanodes tab should list both slaves as live nodes.

[yarn]

Add the following environment variables to the .bashrc file:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

And now we enter the yarn file:

sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

And enter the following code in <configuration>:
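A minimal yarn-site.xml entry for this setup (an assumption based on the hostnames used above) points every NodeManager at the master's ResourceManager:

```
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
```

Without this property the slaves would look for a ResourceManager on localhost and never join the cluster.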

And now we can start yarn:

start-yarn.sh

And as before we can confirm on the slaves using jps if NodeManager is running.

And on the master we can open master:8088 and check the cluster overview; as we can see, both nodes are active.
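To verify the cluster end to end, we can run one of the MapReduce examples bundled with the Hadoop release (the jar path assumes the /usr/local/hadoop layout used above; the HDFS paths are this sketch's choices):

```shell
# Create a small input file and copy it into HDFS
echo "hello hadoop hello cluster" > input.txt
hdfs dfs -mkdir -p /user/hadoopuser/input
hdfs dfs -put input.txt /user/hadoopuser/input

# Run the bundled word-count example on YARN
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount /user/hadoopuser/input /user/hadoopuser/output

# Inspect the result
hdfs dfs -cat /user/hadoopuser/output/part-r-00000
```

If the job shows up under master:8088 and the output lists word counts, the whole pipeline (HDFS, YARN, and the workers) is functioning.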

And with that, we have created a fully distributed Hadoop cluster.

A Computer Science Student