CLOUDY THOUGHTS!!!: Installing Apache Mahout on Hadoop Cluster

Mahout is an open source, scalable machine learning library from Apache. To process huge amounts of data, you run Mahout on top of a Hadoop cluster, although Mahout can run as a stand-alone application as well. To run Mahout on a Hadoop cluster, it is sufficient to install Mahout only on the Hadoop master node.
Mahout can be installed from its source distribution, which has to be built with a build tool such as Ant or Maven. Here we use Apache Maven, which builds the source and dynamically downloads dependencies and plug-ins from the appropriate repositories.

Installing Maven on Ubuntu:

1) Download the latest version of apache-maven from the link http://maven.apache.org/download.html and extract it.

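2) Set the MAVEN_HOME variable and extend PATH in the /etc/environment file.

MAVEN_HOME=/your location/apache-maven-*

PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin

sudo gedit /etc/environment

3) Add the above two lines to the environment file, with no spaces around the = sign, and save the file. Log out and log in for the changes to take effect, or use the source command. Note that /etc/environment is not a shell script: the $ references are expanded only when the file is sourced, not at login. For a permanent setting, append the full paths of the Java and Maven bin directories to the existing PATH line instead.

source /etc/environment

4) Maven is now installed. Check your Maven version.

mvn --version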
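
The check in step 4 can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name check_maven_home is made up for this example and is not part of Maven. It only tests that MAVEN_HOME points at a directory containing an executable bin/mvn.

```shell
#!/bin/sh
# check_maven_home is a hypothetical helper (not part of Maven).
# It succeeds only if MAVEN_HOME is set and contains an executable bin/mvn.
check_maven_home() {
    if [ -z "${MAVEN_HOME:-}" ]; then
        echo "MAVEN_HOME is not set"
        return 1
    fi
    if [ ! -x "$MAVEN_HOME/bin/mvn" ]; then
        echo "no executable mvn under $MAVEN_HOME/bin"
        return 1
    fi
    echo "mvn found at $MAVEN_HOME/bin/mvn"
    return 0
}
```

A typical use would be check_maven_home && mvn --version, so that the version is printed only when the variable is set correctly.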

Mahout Installation:

1) Download and extract mahout-distribution-* from the link http://www.apache.org/dyn/closer.cgi/mahout/ . Alternatively, you can check out the Mahout source from the trunk. Then change into the source directory (it is named trunk if you used the svn checkout).

svn co http://svn.apache.org/repos/asf/mahout/trunk
cd mahout-distribution-*

2) Now build the package using Maven.

mvn clean install

[to skip running the tests during the build, add the option -DskipTests]

3) Mahout is now installed. You can check it by running the mahout script from the Mahout directory.

bin/mahout

4) Before running Mahout on top of Hadoop, ensure that Hadoop is installed and running in any of its modes. To run Mahout on top of Hadoop, set the environment variables HADOOP_HOME and HADOOP_CONF_DIR and comment out the MAHOUT_LOCAL variable in the mahout script file.

sudo gedit bin/mahout

5) Add the following lines at the top.

HADOOP_HOME=/your location/hadoop

HADOOP_CONF_DIR=/your location/hadoop/conf

6) Comment out the following line, which makes the Mahout application run locally.

#MAHOUT_LOCAL=/your location/mahout-distribution-*
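
To see which mode the settings from steps 4 to 6 select, the rule can be written down as a small function. This is only a sketch of the decision as described above (local when MAHOUT_LOCAL is set or HADOOP_HOME is missing, Hadoop otherwise); the name mahout_mode is invented for this example, and the mahout script in your version may apply further checks.

```shell
#!/bin/sh
# mahout_mode is a hypothetical helper, not part of Mahout.
# Assumed rule, taken from the steps above: Mahout runs locally when
# MAHOUT_LOCAL is set or when HADOOP_HOME is missing, otherwise on Hadoop.
mahout_mode() {
    if [ -n "${MAHOUT_LOCAL:-}" ]; then
        echo "local"
    elif [ -z "${HADOOP_HOME:-}" ]; then
        echo "local"
    else
        echo "hadoop"
    fi
}
```

Calling mahout_mode in the shell where you intend to start Mahout prints local or hadoop for the variables currently set in that shell.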

Simple Program in Mahout:

1) This example runs the naive Bayes classifier algorithm on the 20 Newsgroups data set. Download the train and test data set from the link http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.

2) Extract the tar file. Now you need to prepare the data set so that it is accepted by the Bayes algorithm: each directory has to be scanned and each file transformed into the relevant format. The Mahout program prepare20newsgroups does this. Use the following commands to convert the train and test data.

bin/mahout prepare20newsgroups -p /your location/20news-bydate-train -o /your location/20news-train -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

bin/mahout prepare20newsgroups -p /your location/20news-bydate-test -o /your location/20news-test -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3) Now train the classifier to create the model.

bin/mahout trainclassifier -i /your location/20news-train -o /your location/20news-model -type bayes -ng 1 -source hdfs

4) The model generated is tested with the test data set to measure the accuracy. A confusion matrix is generated for the classification.

bin/mahout testclassifier -d /your location/20news-test -m /your location/20news-model -type bayes -ng 1 -source hdfs -method parallel

[parallel - runs as a map-reduce job, sequential - runs on the local system]

5) The confusion matrix is generated and printed in the output.
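
The commands from steps 2 to 4 can be collected into one script. The sketch below copies the options exactly as given in this post; DATA_DIR, DRY_RUN, run and classify_20news are names made up for the example. It defaults to a dry run that only prints the commands, because the subcommands depend on the Mahout version: a reader comment on this post reports that Mahout 0.7 rejects prepare20newsgroups and points to examples/bin/classify-20newsgroups.sh instead.

```shell
#!/bin/sh
# Hypothetical wrapper around the three commands from steps 2-4.
# DATA_DIR and DRY_RUN are names invented for this sketch.
# Run it from the Mahout directory so that bin/mahout resolves.
DATA_DIR="${DATA_DIR:-/your-location}"   # placeholder, replace with your path
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands

# run prints the command in dry-run mode and executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

classify_20news() {
    # Step 2: convert the train and test data.
    for part in train test; do
        run bin/mahout prepare20newsgroups \
            -p "$DATA_DIR/20news-bydate-$part" \
            -o "$DATA_DIR/20news-$part" \
            -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 || return 1
    done
    # Step 3: train the model.
    run bin/mahout trainclassifier -i "$DATA_DIR/20news-train" \
        -o "$DATA_DIR/20news-model" -type bayes -ng 1 -source hdfs || return 1
    # Step 4: test the model and produce the confusion matrix.
    run bin/mahout testclassifier -d "$DATA_DIR/20news-test" \
        -m "$DATA_DIR/20news-model" -type bayes -ng 1 -source hdfs \
        -method parallel
}
```

With DRY_RUN=1, calling classify_20news prints the four commands so they can be reviewed first; setting DRY_RUN=0 executes them in order and stops at the first failure.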

That's it!!!! Mahout is configured to run on Hadoop and you can play with it.

>>Venk@_7harun<<

6 comments:

Andrii Vozniuk, 7 August 2012 at 14:02
Thanks for the tutorial!

But when I try to run prepare20newsgroups, I get the following output:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /home/test/code/hadoop-0.20.2-cdh3u4/bin/hadoop and HADOOP_CONF_DIR=/home/b$
MAHOUT-JOB: /home/test/code/mahout-src-0.7/examples/target/mahout-examples-0.7-job.jar
ERROR driver.MahoutDriver: : Try the new vector backed naivebayes classifier see examples/bin/classify-20newsgroups.sh

Any idea about the cause of the error?

Ad, 8 October 2012 at 17:48
Hi,

I am using Mahout 0.7. Can you please tell me how to use the model created by running the command below on a fresh set of data?
bin/mahout trainnb \
-i 20news-train-vectors -el -o model -li labelindex -ow


Sunday, 15 July 2012
