Installing, Configuring, and Using Mahout

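Mahout is a distributed machine learning and data mining framework from Apache. It includes classic algorithms for clustering, classification, collaborative filtering, and association rule mining.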

1. Install Maven

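Download Maven with wget http://apache.etoak.com//maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz (3.0.4 was the latest version at the time of writing), then extract it with tar xvf apache-maven-3.0.4-bin.tar.gz. To configure the path, open ~/.bashrc with vi ~/.bashrc and add the following two lines: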

export M3_HOME=<actual Maven installation path>
export PATH=${M3_HOME}/bin:${PATH}

Then run . ~/.bashrc to apply the change, and check that the installation succeeded by printing the version with mvn -version.
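The two export lines can also be appended from a script instead of being typed into vi. The following is only a minimal sketch: the function name add_maven_env and its two arguments are made up for illustration and are not part of Maven. It skips the append when an M3_HOME export is already present, so running it twice does not duplicate the lines.

```shell
#!/bin/sh
# add_maven_env RC_FILE MAVEN_DIR
# Appends the M3_HOME and PATH exports to RC_FILE, unless RC_FILE
# already contains an M3_HOME export.
# (Hypothetical helper for illustration; not shipped with Maven.)
add_maven_env() {
    rc_file=$1
    maven_dir=$2
    if grep -q '^export M3_HOME=' "$rc_file" 2>/dev/null; then
        return 0   # already configured; leave the file untouched
    fi
    {
        printf 'export M3_HOME=%s\n' "$maven_dir"
        # Single quotes are deliberate: the variables must be expanded
        # when the rc file is sourced, not when this function runs.
        printf 'export PATH=${M3_HOME}/bin:${PATH}\n'
    } >> "$rc_file"
}
```

For example, add_maven_env ~/.bashrc /opt/apache-maven-3.0.4 (the /opt path is an assumed install location), followed by . ~/.bashrc.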

2. Install Mahout

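Check out the Mahout source from the repository with svn co http://svn.apache.org/repos/asf/mahout/trunk mahout, then run mvn install in the mahout directory. To speed things up by skipping the unit tests, use mvn clean install -DskipTests=true.

If the build finishes without errors, the installation succeeded.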

3. Run the Mahout example programs

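The /mahout/examples/bin directory contains a test script for clustering that we can try out, but a working Hadoop environment must be set up first. The articles "Installing Hadoop on Ubuntu" and "Notes on Installing a Hadoop Cluster" describe how to configure Hadoop. The script to run is cluster-syntheticcontrol.sh: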

xxx@xxx: ./cluster-syntheticcontrol.sh

Please select a number to choose the corresponding clustering algorithm

1. canopy clustering

2. kmeans clustering

3. fuzzykmeans clustering

4. dirichlet clustering

5. meanshift clustering

Enter your choice :

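The script offers several clustering algorithms, all of which use the data at http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data. After running one, the results can be viewed in HDFS under the output/clusters-* directories.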

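Next we run a collaborative filtering example using the SlopeOne recommender. The test data is available from http://www.grouplens.org/node/12. Download it with wget http://www.grouplens.org/system/files/ml-100k.zip, extract it with unzip, find the ua.base file, and put it into HDFS with hadoop fs -put ./ua.base input. Then, in the /mahout/core/target directory, run the following command: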

hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob -Dmapred.input.dir=input/ua.base -Dmapred.output.dir=output/recommend/ --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender

After the job finishes, copy the results from HDFS to the local file system with hadoop fs -get output/recommend/ output. Decompress part-r-00000.gz first, and tail -f part-r-00000 then shows the generated recommendations:

934 [1500:5.612117,1643:5.000663,1463:5.0,1293:5.0,1642:4.8183527,1368:4.740515,1398:4.733873,1639:4.5879145,114:4.5742764,1158:4.508668]

935 [851:5.9419937,1449:5.3944926,1398:5.373123,1158:5.355277,1612:5.3333335,1064:5.3093076,1191:5.2806797,1080:5.2754364,1332:5.226541,1642:5.147603]

936 [1500:5.6213202,1449:5.337122,851:5.314575,1463:5.2984333,1612:5.2581205,1642:5.093908,1398:5.015227,1293:5.0,1064:4.9175024,1158:4.8493795]

937 [1500:5.1213202,1293:5.0,1449:4.864124,1467:4.5690355,1612:4.5,1642:4.4209914,1175:4.369398,851:4.277082,1398:4.2526293,169:4.2230315]

938 [1500:6.0,851:5.4116225,1449:5.261485,1080:5.135357,1639:5.120229,1398:5.0513654,1612:5.0416207,1064:4.988654,1158:4.974,1629:4.9657393]

939 [1294:6.5,1175:6.5,1467:6.226541,851:6.0186186,1080:6.0,1233:5.995873,1629:5.8823447,1449:5.8435216,169:5.7910223,114:5.780447]

940 [1500:5.5841513,1463:5.119504,1293:5.0,1449:4.636525,1643:4.544313,1368:4.500481,1191:4.486469,1398:4.401837,1612:4.361653,1642:4.3293104]

941 [1500:7.5,868:6.0,1398:6.0,1639:6.0,1158:5.512253,1467:5.4142137,1396:5.2928934,1642:5.2928934,1080:5.2132034,851:5.1633635]

942 [1463:5.5285954,1500:5.5118446,1293:5.5,1449:5.4203076,1398:5.29603,1642:5.252299,1467:5.209111,1158:5.1849523,1524:5.1738963,1368:5.1218886]

943 [1080:5.3644133,1233:5.303138,1449:5.2133603,1500:5.178636,1467:5.017004,1293:5.0,1612:5.0,1398:4.874856,1643:4.863265,1607:4.8206778]
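Each result line holds a user ID followed by a bracketed list of item:score pairs, with the highest-scoring item first. If only the top pick per user is of interest, it can be pulled out with awk. This is a rough sketch, not a Mahout tool: the function name top_recommendation is made up, and it assumes the line layout shown in the listing.

```shell
#!/bin/sh
# top_recommendation: reads lines of the form
#   USER [ITEM:SCORE,ITEM:SCORE,...]
# on stdin and prints "USER ITEM SCORE" for the first entry of each line.
# (Hypothetical helper for illustration; not part of Mahout.)
top_recommendation() {
    awk 'NF >= 2 {
        list = $2
        gsub(/\[/, "", list)            # strip the opening bracket
        gsub(/\]/, "", list)            # strip the closing bracket
        split(list, entries, ",")       # entries[1] is the best-ranked pair
        split(entries[1], pair, ":")    # pair[1] = item ID, pair[2] = score
        print $1, pair[1], pair[2]
    }'
}
```

For example, top_recommendation < part-r-00000 prints one line per user; for user 934 in the listing that line is 934 1500 5.612117.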

Pretty powerful, right? Mahout has already implemented all of the algorithms for you; all you need to provide is the data. The data in ua.base looks like this:

943 1044 3 888639903

943 1047 2 875502146

943 1074 4 888640250

943 1188 3 888640250

943 1228 3 888640275

943 1330 3 888692465

Each line contains four fields: user ID, item ID, preference (rating), and timestamp.
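Before handing a file in this format to a recommender, a quick sanity check can be useful. The sketch below counts the ratings and computes the mean rating for each user; the function name user_rating_stats is invented for illustration, and it assumes four whitespace-separated fields per line as described above.

```shell
#!/bin/sh
# user_rating_stats: reads "USER ITEM RATING TIMESTAMP" lines on stdin and
# prints "USER COUNT MEAN" per user, sorted by user ID.
# (Hypothetical helper for illustration; not part of Mahout.)
user_rating_stats() {
    awk 'NF == 4 {
        count[$1]++          # number of ratings by this user
        sum[$1] += $3        # running total of this user'"'"'s ratings
    }
    END {
        for (u in count)
            printf "%s %d %.2f\n", u, count[u], sum[u] / count[u]
    }' | sort -n
}
```

Run as user_rating_stats < ua.base. Fed only the six sample lines above, it prints 943 6 3.00, since the ratings 3, 2, 4, 3, 3, 3 sum to 18.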
