Hello Peppos!!
Ever wanted to share tasks between your computers/servers?
Worry no more, because today you are going to learn how to set up Hadoop on your machines (only the connecting part, though…).
Explanation
Hadoop has 2 essential parts:
- HDFS — a distributed file system that stores data across the machines in the cluster.
- MapReduce — a programming model for large-scale processing.
Requirements
For this tutorial to work, we will need the following:
- VirtualBox.
- An ISO file for Ubuntu 18.04.
- Computing power for 3 virtual machines.
- An Internet Connection.
Preparing the virtual machines
After creating the first virtual machine and installing Ubuntu 18.04, you will need to update the system with:
sudo apt update && sudo apt upgrade
After the updates, get the Hadoop installation file:
sudo wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
And now, we will need the following programs:
sudo apt install pdsh ssh openjdk-8-jdk
Testing the programs
We can now check our Java version with:
java -version
And we will need to edit our .bashrc file, in this case using nano:
sudo nano .bashrc
Go to the last line and enter:
export PDSH_RCMD_TYPE=ssh
To save in nano, press Ctrl+X, then Y to accept, and then Enter.
To test out SSH, you need to create an SSH key:
ssh-keygen -t rsa -P ""
Copy the key to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
And now you can connect to localhost:
ssh localhost
To exit, enter the command exit.
At this point we should edit some network settings on VirtualBox so that the virtual machines can reach each other (the 192.168.56.x addresses used below are the ones VirtualBox assigns on its default host-only network).
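If you prefer the command line to the GUI, a rough sketch with VBoxManage would look like this (the VM name "master" and the host-only network name vboxnet0 are assumptions, adjust them to your setup):
# create a host-only network on the host (usually named vboxnet0)
VBoxManage hostonlyif create
# attach the VM's second network adapter to that host-only network
VBoxManage modifyvm "master" --nic2 hostonly --hostonlyadapter2 vboxnet0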
Installing Hadoop
First we need to unpack the Hadoop archive with tar:
tar xzf hadoop-3.2.1.tar.gz
For easier access to the Hadoop directory, we can rename it:
mv hadoop-3.2.1 hadoop
We will now create a sudo-level user:
sudo adduser hadoopuser
And run the following commands to set up hadoopuser's groups:
sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo
And now we can edit some configuration files.
[environment]
Edit the environment file with:
sudo nano /etc/environment
And we can now edit the file in the following way (the new parts are the Hadoop bin/sbin entries added to PATH and the JAVA_HOME line):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
[hadoop-env.sh]
Edit the file with:
nano hadoop/etc/hadoop/hadoop-env.sh
Replace the export JAVA_HOME line with:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
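If you are not sure where the JDK lives on your machine, a quick check is to follow the java symlink (on Ubuntu with openjdk-8 it usually points under /usr/lib/jvm/java-8-openjdk-amd64):
readlink -f /usr/bin/java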
And now we move the Hadoop folder to /usr/local:
sudo mv hadoop /usr/local/hadoop
And make hadoopuser the owner of the folder:
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
We can now see the IP address of the machine with:
ip address
In my case the IP address is 192.168.56.111, which means that the machines will be:
- Master: 192.168.56.111
- Slave1: 192.168.56.112
- Slave2: 192.168.56.113
[hosts]
sudo nano /etc/hosts
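Based on the addresses above, the hosts file should map each machine's IP to its name, like this:
192.168.56.111 master
192.168.56.112 slave1
192.168.56.113 slave2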
At this point we need to clone our virtual machine in VirtualBox to create the two slaves, making sure a new MAC address is generated for each clone.
After creating the other 2 machines, we run all of them at the same time to edit the remaining files.
[hostname]
We need to edit the hostname on all machines (each one gets its own name, as shown below):
sudo nano /etc/hostname
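Each machine's /etc/hostname should contain only its own name, for example on the master:
master
and slave1 / slave2 on the respective slaves.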
Now we need to reboot the machines:
sudo reboot
Master Config
We need to log in as hadoopuser:
su - hadoopuser
And create an SSH key for hadoopuser:
ssh-keygen -t rsa
And copy this key to all the machines:
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
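To confirm that passwordless login works, a quick check (using the hostnames we set in /etc/hosts) should print each slave's hostname without asking for a password:
ssh hadoopuser@slave1 hostname
ssh hadoopuser@slave2 hostname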
[core-site.xml]
We need to edit this file:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following code inside <configuration>:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
[hdfs-site.xml]
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following code inside <configuration>:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
[workers]
sudo nano /usr/local/hadoop/etc/hadoop/workers
Add:
slave1
slave2
And now we copy these configurations to slave1 and slave2:
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
[.bashrc]
Now we again add the following line to hadoopuser's .bashrc:
export PDSH_RCMD_TYPE=ssh
Now we reload .bashrc and format the HDFS file system:
source .bashrc
hdfs namenode -format
And after this we start dfs:
start-dfs.sh
And to confirm that everything is going alright, we run the jps command on the slaves; each one should list a DataNode process. Going back to the master, we can check what is going on at master:9870; the Datanodes tab should show the two slaves as live nodes.
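The same information is available from the terminal; assuming the Hadoop binaries are on the PATH (as set in /etc/environment), this should report two live datanodes:
hdfs dfsadmin -report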
[yarn]
First we need to export the following variables in the terminal:
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
And now we edit the YARN file:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
And enter the YARN properties inside <configuration>:
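A minimal sketch of this block, assuming the ResourceManager runs on the master, would be:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>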
And now we can start yarn:
start-yarn.sh
And as before, we can confirm on the slaves, using jps, that NodeManager is running.
And on the master we can open master:8088 in a browser to check the YARN web UI, where both nodes should show up as active.
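The same check can be done from the terminal; assuming the YARN variables above are exported, this should list both slaves in the RUNNING state:
yarn node -list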
And with that, we have created a fully distributed Hadoop cluster.