Hello Peppos!!
Ever wanted to share tasks between your computers/servers?
Worry no more, because today you are going to learn how to set up Hadoop on your machines (only the connecting part, though…).
Explanation
Hadoop has 2 essential parts:
- HDFS — a distributed file system that stores data across the machines in the cluster.
- MapReduce — a programming model for large-scale processing.
Requirements
For this tutorial to work, we will need the following:
- VirtualBox.
- An ISO file for Ubuntu 18.04.
- Computing power for 3 virtual machines.
- An Internet Connection.
Preparing the virtual machines
After creating the first virtual machine and installing Ubuntu 18.04, you will need to update the system with:
sudo apt update && sudo apt upgrade
After the updates, get the Hadoop installation file:
sudo wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
And now, we will need the following programs:
sudo apt install pdsh ssh openjdk-8-jdk
Testing the programs
We can now check our Java version with:
java -version
And we will need to edit our .bashrc file, in this case using nano:
sudo nano .bashrc
Go to the last line and enter:
export PDSH_RCMD_TYPE=ssh
To save in nano, press Ctrl+X, then Y to accept, and then Enter.
To test out SSH, you need to create an SSH key:
ssh-keygen -t rsa -P ""
Copy the key to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
And now you can connect to localhost:
ssh localhost
To exit, enter the command exit.
At this point we should edit some network settings on VirtualBox so that the virtual machines can reach each other (the 192.168.56.x addresses used below are the ones VirtualBox assigns on its default host-only network).
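If you prefer the command line to the GUI, a rough sketch with VBoxManage would look like this (the VM name "master" and the host-only network name vboxnet0 are assumptions, adjust them to your setup):
# create a host-only network on the host (usually named vboxnet0)
VBoxManage hostonlyif create
# attach the VM's second network adapter to that host-only network
VBoxManage modifyvm "master" --nic2 hostonly --hostonlyadapter2 vboxnet0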
Installing Hadoop
First we need to unpack the Hadoop archive with tar:
tar xzf hadoop-3.2.1.tar.gz
For easier access to the Hadoop directory, we can rename it:
mv hadoop-3.2.1 hadoop
We will now create a sudo-level user:
sudo adduser hadoopuser
And run the following commands to set up hadoopuser's groups:
sudo usermod -aG hadoopuser hadoopuser
sudo adduser hadoopuser sudo
And now we can edit some configuration files.
[environment]
Edit the environment file with:
sudo nano /etc/environment
And we can now edit the file in the following way (the new parts are the Hadoop bin/sbin entries added to PATH and the JAVA_HOME line):
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
[hadoop-env.sh]
Edit the file with:
nano hadoop/etc/hadoop/hadoop-env.sh
Replace the export JAVA_HOME line with:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
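If you are not sure where the JDK lives on your machine, a quick check is to follow the java symlink (on Ubuntu with openjdk-8 it usually points under /usr/lib/jvm/java-8-openjdk-amd64):
readlink -f /usr/bin/java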
And now we move the Hadoop folder to /usr/local:
sudo mv hadoop /usr/local/hadoop
And make hadoopuser the owner of the folder:
sudo chown hadoopuser:root -R /usr/local/hadoop/
sudo chmod g+rwx -R /usr/local/hadoop/
We can now see the IP address of the machine with:
ip address
In my case the IP address is 192.168.56.111, which means that the machines will be:
- Master: 192.168.56.111
- Slave1: 192.168.56.112
- Slave2: 192.168.56.113
[hosts]
sudo nano /etc/hosts
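Based on the addresses above, the hosts file should map each machine's IP to its name, like this:
192.168.56.111 master
192.168.56.112 slave1
192.168.56.113 slave2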
At this point we need to clone our virtual machine in VirtualBox to create the two slaves, making sure a new MAC address is generated for each clone.
After creating the other 2 machines, we run all of them at the same time to edit the remaining files.
[hostname]
We need to edit the hostname on all machines (each one gets its own name, as shown below):
sudo nano /etc/hostname
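Each machine's /etc/hostname should contain only its own name, for example on the master:
master
and slave1 / slave2 on the respective slaves.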
Now we need to reboot the machines:
sudo reboot
Master Config
We need to log in as hadoopuser:
su - hadoopuser
And create an SSH key for hadoopuser:
ssh-keygen -t rsa
And copy this key to all the machines:
ssh-copy-id hadoopuser@master
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2
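To confirm that passwordless login works, a quick check (using the hostnames we set in /etc/hosts) should print each slave's hostname without asking for a password:
ssh hadoopuser@slave1 hostname
ssh hadoopuser@slave2 hostname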
[core-site.xml]
We need to edit this file:
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following code inside <configuration>:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
[hdfs-site.xml]
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following code inside <configuration>:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
[workers]
sudo nano /usr/local/hadoop/etc/hadoop/workers
Add:
slave1
slave2
And now we copy these configurations to slave1 and slave2:
scp /usr/local/hadoop/etc/hadoop/* slave1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* slave2:/usr/local/hadoop/etc/hadoop/
[.bashrc]
Now we again add the following line to hadoopuser's .bashrc:
export PDSH_RCMD_TYPE=ssh
Now we reload .bashrc and format the HDFS file system:
source .bashrc
hdfs namenode -format
And after this we start dfs:
start-dfs.sh
And to confirm that everything is going alright, we run the jps command on the slaves; each one should list a DataNode process. Going back to the master, we can check what is going on at master:9870; the Datanodes tab should show the two slaves as live nodes.
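The same information is available from the terminal; assuming the Hadoop binaries are on the PATH (as set in /etc/environment), this should report two live datanodes:
hdfs dfsadmin -report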
[yarn]
First we need to export the following variables in the terminal:
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
And now we edit the YARN file:
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
And enter the YARN properties inside <configuration>:
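A minimal sketch of this block, assuming the ResourceManager runs on the master, would be:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>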
And now we can start yarn:
start-yarn.sh
And as before, we can confirm on the slaves, using jps, that NodeManager is running.
And on the master we can open master:8088 in a browser to check the YARN web UI, where both nodes should show up as active.
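The same check can be done from the terminal; assuming the YARN variables above are exported, this should list both slaves in the RUNNING state:
yarn node -list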
And with that, we have created a fully distributed Hadoop cluster.