Sunday, 15 July 2012

Installing Apache Mahout on Hadoop Cluster

             Mahout is an open source, scalable machine learning library from Apache. To process large amounts of data, you run Mahout on top of a Hadoop cluster, although it can also run as a stand-alone application. To run Mahout on a Hadoop cluster, it is sufficient to install it only on the Hadoop master node.
             Mahout can be installed from its source distribution. The source could be built with an Ant-like build tool; here we use Apache Maven, which builds the source and dynamically downloads the dependencies and plug-ins from the appropriate repositories.

Installing Maven on Ubuntu:

      1) Download the latest version of apache-maven from the link http://maven.apache.org/download.html and extract it.

      2) Open the /etc/environment file to set the MAVEN_HOME variable.

sudo gedit /etc/environment

      3) Add the following two lines to the file and save it. Do not put spaces around the = sign, and keep any directories that PATH already lists. Log out and log back in for the changes to take effect, or use the source command.

MAVEN_HOME=/your location/apache-maven-*

PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin

source /etc/environment


      4) Maven is now installed. Check your Maven version.

mvn --version
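
If you would rather try the settings in the current shell before touching /etc/environment, the sketch below does roughly the same thing. The path /opt/apache-maven is only a placeholder assumption, not something from the steps above; replace it with the directory you extracted Maven into.

```shell
# Hypothetical install location (assumption); replace with your own Maven directory.
MAVEN_HOME="${MAVEN_HOME:-/opt/apache-maven}"
export MAVEN_HOME

# Append Maven's bin directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$MAVEN_HOME/bin:"*) ;;                  # already present, nothing to do
  *) PATH="$PATH:$MAVEN_HOME/bin" ;;
esac
export PATH

# Report whether the mvn launcher can now be resolved.
if command -v mvn >/dev/null 2>&1; then
  echo "mvn found at: $(command -v mvn)"
else
  echo "mvn not found; check MAVEN_HOME=$MAVEN_HOME"
fi
```

These exports last only for the current shell session, so the /etc/environment change is still what makes the setting permanent.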


Mahout Installation:

      1) Download and extract mahout-distribution-* from the link http://www.apache.org/dyn/closer.cgi/mahout/ . Alternatively, you can check out Mahout from the trunk. Then change into the source directory (mahout-distribution-* for a download, trunk for a checkout).

svn co http://svn.apache.org/repos/asf/mahout/trunk
cd mahout-distribution-*
   
     2) Now build the package using Maven.

mvn clean install

              [To skip running the tests during the build, add the option -DskipTests.]
  
     3) Mahout is now installed. You can check it by running the command below from the Mahout directory.

bin/mahout



     4) Before running Mahout on top of Hadoop, ensure that Hadoop is installed and running in any of its modes. To run Mahout on top of Hadoop, set the environment variables HADOOP_HOME and HADOOP_CONF_DIR and comment out the MAHOUT_LOCAL variable in the mahout script file.

sudo gedit bin/mahout

     5) Add the following lines at the top of the script.

HADOOP_HOME=/your location/hadoop
HADOOP_CONF_DIR=/your location/hadoop/conf
         
     6) Comment out the following line, which makes Mahout run locally.

#MAHOUT_LOCAL=/your location/mahout-distribution-*
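
To make the effect of the Hadoop settings concrete, here is a small sketch of the idea: with MAHOUT_LOCAL unset, the script goes through Hadoop, otherwise it runs locally. The path /usr/local/hadoop and the variable mahout_mode are illustrative assumptions of mine; this is not the actual logic inside bin/mahout.

```shell
# Hypothetical install location (assumption); replace with your own Hadoop directory.
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$HADOOP_HOME/conf}"
export HADOOP_HOME HADOOP_CONF_DIR

# Leave MAHOUT_LOCAL unset so that jobs are submitted to Hadoop.
unset MAHOUT_LOCAL

# mahout_mode is an illustrative variable, not something bin/mahout defines.
if [ -z "${MAHOUT_LOCAL:-}" ]; then
  mahout_mode="hadoop"
else
  mahout_mode="local"
fi
echo "Mahout mode: $mahout_mode (HADOOP_CONF_DIR=$HADOOP_CONF_DIR)"

# Warn early if the configuration directory is missing.
[ -d "$HADOOP_CONF_DIR" ] || echo "warning: $HADOOP_CONF_DIR does not exist" >&2
```

When the setup is right, the first line printed by bin/mahout should say that MAHOUT_LOCAL is not set and that HADOOP_CONF_DIR is being added to the classpath.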

         
Simple Program in Mahout:

     1) This example runs the naive Bayes classifier on the 20 Newsgroups data set. Download the train and test data set from the link http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.
       
     2) Extract the tar file. Now you need to prepare the data set in the form the Bayes algorithm accepts: each directory has to be scanned and each file transformed into the relevant format. The Mahout program prepare20newsgroups does this. Use the following commands to convert the train and test data.
           
bin/mahout prepare20newsgroups -p /your location/20news-bydate-train -o /your location/20news-train -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

           
bin/mahout prepare20newsgroups -p /your location/20news-bydate-test -o /your location/20news-test -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

     3) Now train the classifier to create the model.
           
bin/mahout trainclassifier -i /your location/20news-train -o /your location/20news-model -type bayes -ng 1 -source hdfs

                       
           
    4) Test the generated model against the test data set to measure its accuracy. The command generates a confusion matrix for the classification.
             
bin/mahout testclassifier -d /your location/20news-test -m /your location/20news-model -type bayes -ng 1 -source hdfs -method parallel
                       
                          [-method parallel runs the test as a map-reduce job; -method sequential runs it on the local system.]
           
     5) The command prints the generated confusion matrix.
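
To see how an accuracy figure follows from a confusion matrix, here is a small worked sketch. The 3x3 matrix is made up purely for illustration (it is not Mahout output, and the real data set has 20 classes). Each row is one class and each column is one predicted label, so the correctly classified documents sit on the diagonal.

```shell
# Made-up 3-class confusion matrix (illustrative numbers only, not Mahout output).
confusion_matrix='60 5 5
4 55 6
3 7 55'

# Accuracy = documents on the diagonal / all documents, shown as a whole percentage.
accuracy_percent=$(printf '%s\n' "$confusion_matrix" | awk '
  {
    for (col = 1; col <= NF; col++) {
      total += $col                      # every cell counts toward the total
      if (col == NR) correct += $col     # diagonal cell: predicted label matches the class
    }
  }
  END { printf "%d", 100 * correct / total }')

echo "accuracy: ${accuracy_percent}%"
```

Here 170 of the 200 documents lie on the diagonal, so the sketch reports 85%.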

   
 That's it! Mahout is now configured to run on Hadoop, and you can start playing with it.




>>Venk@_7harun<<


 

6 comments:

  1. Thanks for the tutorial!

    But when I try to run prepare20newsgroups, I get the following output:

    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /home/test/code/hadoop-0.20.2-cdh3u4/bin/hadoop and HADOOP_CONF_DIR=/home/b$
    MAHOUT-JOB: /home/test/code/mahout-src-0.7/examples/target/mahout-examples-0.7-job.jar
    ERROR driver.MahoutDriver: : Try the new vector backed naivebayes classifier see examples/bin/classify-20newsgroups.sh

    Any idea about a cause of the error?

  2. Hi,

    I am using Mahout 0.7. Can you please tell me how to use the model created by the command below on a fresh set of data?
    bin/mahout trainnb \
    -i 20news-train-vectors -el -o model -li labelindex -ow

    Replies
    1. @Priyandarshan Raj: You can feed any new input to the model you created, and the model will classify it. Here is sample code that shows how to work with the created model: https://bitbucket.org/jaganadhg/mahoutexperiments/changeset/a1223ad8fca2
