Monday, April 30, 2012

Project complete

That is the project completed, and the final submissions are now being handed in.

Wednesday, March 14, 2012

Pseudo cluster back up and running

The Pseudo Cluster is back up and running and able to run Java programs - in this case, the wordcount example. Backups have been taken (again, hopefully whilst stable!). The previous tutorials needed some tweaking, as there were missing bits or parts that have changed in the space of a month...

Two major issues were found in the reinstall - Java, which is now working, and gedit... for some reason, as the admin user I could not run gedit from the command prompt, and so could not edit the hduser files that needed editing. But by logging out of the admin account and logging in as hduser (this is after setting hduser up as an admin!), gedit worked fine in the hduser account...

CONFUSED! but whatever, it works now :)

Installing sun-java6-jdk on Ubuntu 11.10

Intro Waffle:
*****

Frustratingly, I have been banging my head against a wall trying to get beeswax working in an isolated install of Linux on VirtualBox. The issue is that I could not get the Java 6 JDK downloaded and installed. I had originally tried this on a working image of Linux that had my previous Hadoop setup on it, but since it would not work, I tailed off... I did take a backup of the original install, but that clone corrupted... so now I am sat here with no operational honours project work and am forced to restart.

On restarting I hit that Java wall again. This time, with some more determined searching, I found that the package had been removed from some repositories, e.g. Canonical's. Here is what I uncovered during the search and tried, and it worked. Hopefully it will work for you too if you stumble over this while trying to find a solution to a similar issue.

*****

Installing:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java 

(I don't know how long this will be hosted, but as of the time of posting this works, and it has been hosted here for a few months.)


$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
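With those finished, a quick sanity check that the right JVM is actually the default is worthwhile (this assumes the packages register themselves under the java-6-sun alternatives name; update-java-alternatives -l will list what is actually available):

$ java -version
//should report the Sun 1.6.0_xx runtime rather than OpenJDK
$ sudo update-java-alternatives -s java-6-sun
//only needed if another JVM is still set as the system default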

With that done, it's now a good time to test how efficient my own tutorials on setting up Hadoop are.

-Stu

Monday, February 6, 2012

Pseudo cluster test complete

After some continued teething problems with setting up the pseudo cluster and getting everything to work together, I finally have a working pseudo cluster :) I performed the same map-reduce example as in the single node setup, with success. I also recorded a video of the bulk of the required work being done (4 min; it will only play in the Linux video player as far as I know).
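For reference, the pseudo cluster run looked roughly like this - a sketch rather than a transcript, and the input/output directory names are just examples:

:~$ hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt input
//copies a local file into HDFS (which a standalone run doesn't need, but the pseudo cluster does)
:~$ hadoop jar /usr/local/hadoop/hadoop-0.20.2-examples.jar wordcount input output
//runs the packaged wordcount example against the HDFS input
:~$ hadoop fs -cat output/part*
//prints the resulting word counts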

From here, I need to look at the other examples packaged with Hadoop, and then look to write my own program in order to analyse the server log files that the system will be used on.

Wednesday, January 25, 2012

Setting up Pseudo Cluster.

With the success of the single node setup, I wish to get a 'cluster' up and running again. Using the selection of tutorials from before, I hope to achieve this:


Note that since the account was set up and given privileges in the last post's details, that shall be skipped here.

1. Create temp working directory

:~$ mkdir app
:~$ mkdir app/hadoop
:~$ mkdir app/hadoop/tmp
:~$ chmod -R 777 app/hadoop

Since hduser was previously set up as a sudo user, the above commands work in my case.
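(For reference, the same structure can be made in one go with the -p flag, which creates the parent directories as needed:)

:~$ mkdir -p app/hadoop/tmp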

2. Setup config files.

Note: GEDIT works... if sudo is typed before it... weird!!!

2.1 Core-site.xml edit:
:~$ cd /usr/local/hadoop/conf
:~$ sudo gedit core-site.xml

Between the <configuration> </configuration> tags some code needs to be inserted.

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
//this maps hadoop.tmp.dir to the temporary directory created in step 1 above. Note the value is an absolute path, so it needs to match where that directory actually lives (e.g. /home/hduser/app/hadoop/tmp if it was created under the hduser home directory).

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
//this sets fs.default.name, the URI of the default filesystem (i.e. the NameNode address), to localhost on port 54310. This is where I was getting an error before; we shall see if it is recreated shortly... hopefully creating a new virtual disk image and starting from scratch will prevent this, as it helped with the word count example.

2.2 mapred-site.xml
:~$ sudo gedit mapred-site.xml

Between the <configuration> </configuration> tags some code needs to be inserted:

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
//this sets the MapReduce JobTracker address to localhost on port 54311.

2.3 hdfs-site.xml
:~$ sudo gedit hdfs-site.xml

Between the <configuration> </configuration> tags some code needs to be inserted:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
//Sets the number of copies kept of each HDFS block. Value is 1 as only a single node (one drive) will be used.

3. Format Namenode
Prepare HDFS by formatting the NameNode before setting up the cluster:

:~$ hadoop namenode -format

By using the command :~$ ssh localhost we can test if the SSH server is running. In my case, I set a passphrase on the key and it asks for it every_single_time it wants to connect... so I will regenerate the SSH key without a passphrase:

:~$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
//key is generated
:~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
//appends the generated public key to the authorized_keys file.

Typing :~$ ssh localhost now logs in without prompting, giving the usual system information readout and a last-login timestamp.
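The next step will be starting the daemons themselves. A minimal sketch of what that should look like with the 0.20.2 scripts, assuming the install lives in /usr/local/hadoop:

:~$ /usr/local/hadoop/bin/start-all.sh
//starts the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
:~$ jps
//jps (part of the JDK) should list the five Hadoop daemons if everything came up
:~$ /usr/local/hadoop/bin/stop-all.sh
//shuts it all down again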

Tuesday, January 24, 2012

Starting Afresh

Having not had much time, or luck with trying to get sample work up and running, I have decided to start anew and see where the various tutorials out there take me, so here is what I am doing:

Sources:
http://bigdatablog.co.uk/install-hadoop
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://hadoop.apache.org/common/docs/current/single_node_setup.html

1. Install Ubuntu (11.10 at time of writing):
Firstly, Ubuntu needs to be installed. I am doing this using VirtualBox. I already have a few instances installed, but this one shall be specialised just for the project, to try to keep it focused. I have Ubuntu running on a second monitor, whilst the primary monitor is used for generic work such as reading tutorials and filling in this blog.


2. Install Sun Java 1.6 JDK
This seems easiest done from the command line.

Note: the method that was posted here no longer works, to see a working tutorial, go to: http://hadoopproject.blogspot.com/2012/03/installing-sun-java6-jdk-on-ubuntu-1110.html

Note: When installing it, you must accept the license. It took me a while to figure out how to accept - you have to hit Tab to select the 'OK' in the middle of the screen, then hit Enter.

3. Install SSH
In order to access the cluster nodes (even if that is just localhost in this setup), we need to have SSH installed.

:~$ sudo apt-get install ssh

4. Setup a hadoop user account
Create a group called hadoop, then create and add the hduser user account to it.


:~$ sudo addgroup hadoop
:~$ sudo adduser --ingroup hadoop hduser


Note: In order to be able to perform sudo commands, the account needs to be added as an admin from an account that already has admin privileges. To do this, simply type:


:~$ sudo adduser hduser admin






Configure SSH
Switch to the hadoop user account, then create an ssh key.

:~$ su - hduser
:~$ ssh-keygen -t rsa -P ""


Enable SSH access to the local machine:


:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Finally, test the connection, which will add localhost to the list of known hosts:

:~$ ssh localhost
confirm 'yes'

Note: Some tutorials claim you should disable IPv6 at this point... I shall not, at least for now.
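For reference, the suggestion in those tutorials is to append a few lines to /etc/sysctl.conf and reboot. Noting it here in case it is needed later, although I have not applied it:

:~$ sudo gedit /etc/sysctl.conf
//then add the following lines to the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1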

5. Install Hadoop.
Note: Hadoop v1.0.0 is now released, but as the tutorials were all written using 0.20.2 or 0.20.203, I will actually use 0.20.2. Using v1.0.0 may have contributed to some of my problems, and I am not familiar enough with the system to be adapting the code to work with 1.0.0.

From the Apache site, the Hadoop 0.20.2 tar.gz was downloaded, unpacked, renamed for easier access and had its ownership changed to the hduser account:

:~$ cd /home/stu/Downloads
:~$ sudo tar xzf hadoop-0.20.2.tar.gz
:~$ sudo mv hadoop-0.20.2 hadoop
:~$ sudo chown -R hduser:hadoop hadoop

Note: For some reason... gedit won't work from the hduser account... I'm sure I had this problem last time around but have not been able to fix it. Any advice welcome!

6. Update .bashrc
This one is slightly tricky. Files starting with a . are hidden, but by typing the gedit command you can edit the file, even though it is not visible in the file browser.

:~$ gedit $HOME/.bashrc

Alternatively, you can open gedit and use its open-file dialog; when in the home folder, right click on the page and select 'show hidden', although using this method opened a read-only copy for me... so in essence, it is relatively pointless.

Add the following to the bottom of the .bashrc file:


# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
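Once saved, the new variables only apply to new shells, so either open a fresh terminal or reload the file; a quick echo confirms it took:

:~$ source $HOME/.bashrc
:~$ echo $HADOOP_HOME
//should print /usr/local/hadoop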



7. CONFIGURING:
7.1 Hadoop-env.sh
The JAVA_HOME variable needs to be configured. In my configuration this is done by editing:

:~$ sudo gedit /home/stu/Downloads/hadoop/conf/hadoop-env.sh



# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/jvm/java-6-sun
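As shown above, the line ships commented out, so the # needs removing and the path pointing at the JDK installed earlier. In my setup it should end up reading:

export JAVA_HOME=/usr/lib/jvm/java-6-sun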



Note: Sudo is being used as gedit is not working from hduser and these modifications are being made via another account. Unless sudo is used, the file is opened as read-only.


Note - at this point I copied the hadoop folder to /usr/local.


:~$ cd /home/stu/Downloads
:~$ sudo cp -r hadoop /usr/local

and repeated the export commands from the command line to set up the environment variables (as per BigDataBlog):

:~$ export HADOOP_HOME=/usr/local/hadoop
:~$ export PATH=$PATH:$HADOOP_HOME/bin
:~$ export JAVA_HOME=/usr/lib/jvm/java-6-sun

This may have already been done above, but I am just covering the bases to align my project with the tutorials for ease of use/referencing...

7.2 Standalone mode test:
From BIGDATABLOG

:~$ hadoop

Typing this should display a help message - if it does, it is correctly configured. I did it, and it does. #winning.

8. First MapReduce job:
This is where my old version stopped working... so let's set it up right.

:~$ cd /usr/local/hadoop
:~$ sudo chmod 777 hadoop-0.20.2-examples.jar

This should remove any permission issues. Now we enter the following to perform a search for 'the' in the LICENSE.txt file (which has also been universified - chmod 777 - universified sounds quicker...):

:~$ hadoop jar /usr/local/hadoop/hadoop-0.20.2-examples.jar grep /usr/local/hadoop/LICENSE.txt outdir the
//runs the grep example, searching LICENSE.txt for 'the' and writing the output to the outdir folder.
:~$ cat outdir/part-00000

Output: 144 the
In other words, 144 occurrences of 'the' were counted in the LICENSE.txt file. It works! Huzzah!

Now, for some notes after some playing:

It only seems to make a single file, which it puts into the outdir folder. If I try to run another check against a different word, say 'and', then it does nothing, as the output directory already exists. If I change from outdir to outdir2, it creates a new folder called 'outdir2' with the same part-00000 file, except when you 'cat part-00000' on this it says '52 and', indicating 52 occurrences of 'and' were found.
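Since this is standalone mode, the output directory is just a normal local folder, so rather than inventing new names each time it can simply be deleted before re-running. A quick sketch:

:~$ rm -r outdir
//Hadoop refuses to overwrite an existing output directory, so remove it first
:~$ hadoop jar /usr/local/hadoop/hadoop-0.20.2-examples.jar grep /usr/local/hadoop/LICENSE.txt outdir and
:~$ cat outdir/part-00000
//should now show the count for 'and'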

Monday, October 31, 2011

Update

Slowly working my way through the Hadoop book in an attempt to take in as much information as possible and get a sample server set up before the Christmas holidays. I could probably jump the gun, google what I want, and get something running within a day, but it would bypass actually learning the background etc... so let's see how this goes!