Ever wanted to share tasks between your computers/servers?
Worry no more: today you are going to learn how to set up Hadoop across your machines (just the part that connects them, though…).
Hadoop has 2 essential parts:
- HDFS — a distributed file system that distributes data across the machines in the cluster.
- MapReduce — a programming model for large-scale data processing.
For this tutorial to work, we will need the following requirements:
- An ISO file for Ubuntu 18.04.
- Computing power for 3 virtual machines.
- An Internet Connection.
Preparing the virtual machines
After creating the first virtual machine and installing Ubuntu 18.04, you will need to update the system with:
sudo apt update && sudo apt upgrade
After the updates, get the hadoop installation files:
sudo wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
And now, we will need the following programs:
sudo apt install pdsh ssh openjdk-8-jdk
Testing the programs
We can now test our Java version with:
java -version
And we will need to edit our bash file, in this case using nano:
sudo nano .bashrc
Go to the last line and enter:
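The line itself is not shown in the original; since pdsh was installed above, a common addition at this step is to tell pdsh to use ssh as its remote command (this is an assumption, not confirmed by the original):

```shell
# Tell pdsh to use ssh instead of rsh (assumed; not shown in the original)
export PDSH_RCMD_TYPE=ssh
```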
To save in nano, press Ctrl+X, then Y to accept, and Enter.
To test out ssh you need to create an ssh key:
ssh-keygen -t rsa -P ""
Copy the key to authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
And now you can connect to localhost:
ssh localhost
To exit, enter:
exit
At this point we should adjust the VM's network settings in VirtualBox (for instance a Host-Only or Bridged adapter) so the machines will be able to reach each other.
We need first to unpack the hadoop file to install it with
tar xzf hadoop-3.2.1.tar.gz
For easier access to the Hadoop directory, we can rename it:
mv hadoop-3.2.1 hadoop
We will now create a user with sudo privileges:
sudo adduser hadoopuser
Then run the following commands to configure hadoopuser:
sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo
And now we can edit some files
Edit the environment file with:
sudo nano /etc/environment
And we can now edit the file as follows (the new parts were shown in bold):
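The edited contents were shown as an image in the original; a reconstruction of what /etc/environment would plausibly look like (the OpenJDK path is an assumption, based on the openjdk-8-jdk package installed earlier, and the new parts are the Hadoop bin/sbin entries and JAVA_HOME):

```
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
```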
Edit the file with:
Replace the line containing export JAVA_HOME with:
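The file path is not shown here, but it is presumably hadoop-env.sh in Hadoop's configuration directory; assuming the OpenJDK 8 install from earlier, the replacement line would be:

```shell
# Point Hadoop at the JVM (path assumed from the openjdk-8-jdk package above)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```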
And now we move the hadoop folder to local:
sudo mv hadoop /usr/local/hadoop
And make the local folder as hadoopuser property:
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
We can now find the IP address of the machine with:
ip addr
As we can see, my IP address is 192.168.56.111.
Every machine needs the same hostname-to-IP mapping, so edit the hosts file:
sudo nano /etc/hosts
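The hosts entries themselves are not shown; assuming the master's IP address from above and consecutive addresses for the slaves (the slave IPs are assumptions, not confirmed by the original), /etc/hosts would gain lines like:

```
192.168.56.111 master
192.168.56.112 slave1
192.168.56.113 slave2
```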
At this point we clone the virtual machine in VirtualBox to create the slaves; when cloning, make sure a new MAC address is generated for each clone.
After creating the other two machines, run all of them at the same time to edit the remaining files.
We need to edit the hostname on each machine (to master, slave1, and slave2 respectively):
sudo nano /etc/hostname
Now we need to reboot the machines:
sudo reboot
We need to log in as hadoopuser:
su - hadoopuser
and create an ssh key for the hadoopuser:
ssh-keygen -t rsa
and copy the key to every machine:
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
We need to edit this file:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following code inside <configuration>:
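The snippet itself is missing here; a typical core-site.xml entry for this layout (assuming the master hostname used in this tutorial and the conventional HDFS port 9000) would be:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
```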
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following code inside <configuration>:
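The snippet is again missing; a plausible hdfs-site.xml configuration (the storage paths are assumptions, and a replication factor of 2 matches the two datanodes in this cluster):

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```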
And edit the workers file:
sudo nano /usr/local/hadoop/etc/hadoop/workers
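The workers file lists the hostnames that will run datanodes; given the hostnames used in this tutorial, it would contain:

```
slave1
slave2
```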
And now we copy these configurations to slave1 and slave2:
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
Now we once again edit .bashrc, adding the following:
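The lines themselves are missing from the original; a plausible addition, assuming the /usr/local/hadoop install path used above, sets the Hadoop environment variables and extends the PATH:

```shell
# Hadoop environment (install path assumed from the steps above)
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```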
Now we format the HDFS filesystem:
hdfs namenode -format
And after this we start HDFS:
start-dfs.sh
To confirm that everything is going all right, run the jps command on the slaves; a DataNode process should be listed.
Going back to the master, we can see what is going on at master:9870; the Datanodes tab gives us the following results:
First, we need to execute the following commands in the terminal:
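The commands are not shown in the original; plausibly they export the Hadoop home variables that YARN expects in the current shell (an assumption, based on the install path used above):

```shell
# Export the Hadoop component homes for YARN (paths assumed, not from the original)
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME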
And now we open the YARN configuration file:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
And enter the following code in <configuration>:
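The snippet is missing; a minimal yarn-site.xml entry for this cluster (assuming the master hostname runs the ResourceManager) would be:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
```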
And now we can start YARN:
start-yarn.sh
As before, we can confirm with jps on the slaves that NodeManager is running.
And on the master we can enter master:8088 and get the following result:
As we can see both nodes are active.
As we can see, we have created a fully distributed Hadoop cluster.