Installing and Configuring a Hadoop 2.2.0 Cluster on 64-bit Ubuntu


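Following on from the previous article on compiling Hadoop 2.2.0, this article explains in detail how to install and configure a Hadoop cluster on Ubuntu 12.04 64-bit server.

To repeat: the Hadoop 2.2 binaries downloaded from the official Apache site are built for 32-bit Linux, so deploying on a 64-bit system requires downloading the src package and compiling it yourself. For the detailed build steps, see: Compiling hadoop2.2.0 http://www.linuxidc.com/Linux/2014-01/95728.htm

To keep the explanation simple, we build a small cluster of three hosts.

OS on all three hosts: Ubuntu 12.04 64-bit server

The three machines are assigned the following roles: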

Master: NameNode/ResourceManager

Slave1: DataNode/NodeManager

Slave2: DataNode/NodeManager

Assume the three virtual machines have the following IP addresses, which are used later.

Master: 129.1.77.6

Slave1: 129.1.77.5

Slave2: 129.1.77.7

Now we start installing and configuring Hadoop.

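1. First, create the same user on all three machines (Hadoop's start scripts expect the same account on every node)

The steps to create the user are:

(1) sudo addgroup hadoop

(2) sudo adduser --ingroup hadoop haduser

Edit /etc/sudoers (preferably with sudo visudo, which checks the syntax before saving) and add the line haduser ALL=(ALL) ALL below the line root ALL=(ALL) ALL. Without this line, haduser cannot run sudo.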

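2. Next:

1) Make sure a JDK is installed on all three machines and the environment variables are set correctly; for JDK installation see http://www.linuxidc.com/Linux/2014-01/95727.htm

2) Install OpenSSH on all three hosts and configure SSH for passwordless login.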

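3. Installing ssh

3.1 ssh is usually installed by default. If it is missing or the version is old, reinstall it:

sudo apt-get install ssh

3.2 Set up passwordless login to the local machine

After installation there should be a hidden directory .ssh under ~ (the current user's home directory, here /home/haduser); ls -a shows hidden files. If the directory does not exist, create it yourself (mkdir .ssh).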

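The steps are:

1. Enter the .ssh directory

2. Run ssh-keygen -t rsa and press Enter at every prompt (this generates the key pair)

3. Append id_rsa.pub to the authorized keys (cat id_rsa.pub >> authorized_keys)

4. Restart the SSH service so the change takes effect

Note: perform the steps above on every machine.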
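Step 3 appends the public key with a plain >>, which adds a duplicate line each time it is repeated, and sshd (with its default StrictModes setting) ignores authorized_keys when the file or the .ssh directory is writable by other users. A minimal sketch of an idempotent version follows. The function name authorize_key and the demo key are illustrative assumptions, not part of the original steps, and the demo works in a scratch directory rather than the real ~/.ssh.

```shell
#!/bin/sh
# authorize_key PUB_FILE AUTH_FILE
# Append the key in PUB_FILE to AUTH_FILE unless that exact line is already
# present, and tighten permissions on AUTH_FILE and its parent directory.
# (Hypothetical helper name; not from the original article.)
authorize_key() {
    pub_file=$1
    auth_file=$2
    auth_dir=$(dirname "$auth_file")
    mkdir -p "$auth_dir"
    chmod 700 "$auth_dir"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x: whole-line match, -F: treat the key as a fixed string
    if ! grep -qxF -- "$(cat "$pub_file")" "$auth_file"; then
        cat "$pub_file" >> "$auth_file"
    fi
}

# Demo in a scratch directory with a fake key, so the real ~/.ssh is untouched.
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAexamplekey haduser@master" > "$demo_dir/id_rsa.pub"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh/authorized_keys"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh/authorized_keys"  # second call adds nothing
cat "$demo_dir/.ssh/authorized_keys"
```

On a real node the call would be authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys. For the master to reach the slaves, the master's public key also has to end up in each slave's authorized_keys; ssh-copy-id haduser@slave1 is one common way to do that.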

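3.3 Passwordless ssh login should now work. For the master to log in to the slaves, the master's id_rsa.pub must also be appended to authorized_keys on each slave. Check that the master can log in to the slaves without a password:

$ ssh slave1

$ ssh slave2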

4. On each of the three hosts, set up /etc/hosts and /etc/hostname

The hosts file defines the mapping between host names and IP addresses.

127.0.0.1 localhost

129.1.77.6 master

129.1.77.5 slave1

129.1.77.7 slave2

The hostname file defines the Ubuntu host name of the machine, for example master (or slave1, and so on).
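A wrong or missing hosts entry tends to show up only later, as a connection error when the daemons start, so it can help to check the mapping right away. The sketch below looks names up in a hosts-format file; lookup_host is a hypothetical helper, and the demo reads a scratch copy of the entries above instead of the real /etc/hosts.

```shell
#!/bin/sh
# lookup_host HOSTS_FILE NAME
# Print the IP address that HOSTS_FILE maps NAME to (first match wins),
# skipping comment lines. Hypothetical helper, not from the original article.
lookup_host() {
    awk -v name="$2" '
        $1 !~ /^#/ {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }' "$1"
}

# Demo on a scratch copy of the entries listed above.
hosts_demo=$(mktemp)
cat > "$hosts_demo" <<'EOF'
127.0.0.1 localhost
129.1.77.6 master
129.1.77.5 slave1
129.1.77.7 slave2
EOF

for h in master slave1 slave2; do
    echo "$h -> $(lookup_host "$hosts_demo" "$h")"
done
```

On a configured machine, pass /etc/hosts as the first argument, or use getent hosts master, which asks the system resolver directly.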

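5. Once the steps above are done correctly, Hadoop itself can be installed

Perform the following steps logged in as haduser.

Because the configuration is essentially the same on every machine in the cluster, we configure and deploy on the NameNode first and then copy the result to the other nodes, so the installation below effectively applies to every machine. Pay attention, though, to any mix of 64-bit and 32-bit systems in the cluster.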

5.1 Download and unpack hadoop-2.2.0.tar.gz

Copy the hadoop-2.2.0 build compiled on the 64-bit machine to /home/haduser/hadoop.

5.2 HDFS installation and configuration

1) Configure /home/haduser/hadoop/etc/hadoop/hadoop-env.sh

Replace export JAVA_HOME=${JAVA_HOME} with:

export JAVA_HOME=/usr/jdk1.7.0_45 (use the path of your own JDK)

Likewise, configure yarn-env.sh by adding:

export JAVA_HOME=/usr/jdk1.7.0_45 (use the path of your own JDK)
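The same edit has to be made in two files on every node, so it may be worth scripting. The sketch below rewrites an existing export JAVA_HOME= line or appends one if the file has none; set_java_home is a hypothetical helper, and it assumes the JDK path contains no | or & characters, which would confuse the sed expression. The demo runs on a scratch file.

```shell
#!/bin/sh
# set_java_home ENV_FILE JAVA_HOME_PATH
# Replace an existing "export JAVA_HOME=..." line in ENV_FILE, or append one
# if none exists. Hypothetical helper, not from the original article.
set_java_home() {
    env_file=$1
    java_home=$2
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        # Write to a temp file, then copy back so the original file mode is kept.
        sed "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file" > "$env_file.tmp" &&
            cat "$env_file.tmp" > "$env_file" &&
            rm -f "$env_file.tmp"
    else
        printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
    fi
}

# Demo on a scratch file holding the line the article says to replace.
env_demo=$(mktemp)
echo 'export JAVA_HOME=${JAVA_HOME}' > "$env_demo"
set_java_home "$env_demo" /usr/jdk1.7.0_45
cat "$env_demo"
```

For the real files the calls would be set_java_home etc/hadoop/hadoop-env.sh /usr/jdk1.7.0_45 and the same for etc/hadoop/yarn-env.sh, run from the Hadoop directory with your own JDK path.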

2) Contents of etc/hadoop/core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000/</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description></description>
</property>
</configuration>

3) Contents of etc/hadoop/hdfs-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/haduser/hadoop/storage/hadoop2/hdfs/name</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/haduser/hadoop/storage/hadoop2/hdfs/data1,/home/haduser/hadoop/storage/hadoop2/hdfs/data2,/home/haduser/hadoop/storage/hadoop2/hdfs/data3</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/haduser/hadoop/storage/hadoop2/hdfs/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
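The hdfs-site.xml above points at five directories under /home/haduser/hadoop/storage/hadoop2/hdfs. Creating them up front as haduser is optional as far as I know, but it is a cheap way to surface ownership or permission problems before the daemons start. A minimal sketch, where make_storage_dirs is a hypothetical helper and the demo uses a scratch directory:

```shell
#!/bin/sh
# make_storage_dirs BASE
# Create the name, data1..data3 and tmp directories under BASE/hdfs, matching
# the layout used in hdfs-site.xml. Hypothetical helper, not from the article.
make_storage_dirs() {
    base=$1
    for d in name data1 data2 data3 tmp; do
        mkdir -p "$base/hdfs/$d" || return 1
    done
}

# Demo under a scratch directory instead of /home/haduser/hadoop.
storage_demo=$(mktemp -d)
make_storage_dirs "$storage_demo/storage/hadoop2"
ls "$storage_demo/storage/hadoop2/hdfs"
```

On the cluster the argument would be /home/haduser/hadoop/storage/hadoop2, run as haduser on each node.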

5.3 YARN installation and configuration

Contents of etc/hadoop/yarn-site.xml:

<?xml version="1.0"?>

<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
<description>host is the hostname of the resource manager and
port is the port on which the NodeManagers contact the Resource Manager.
</description>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
<description>host is the hostname of the resourcemanager and port is the port
on which the Applications in the cluster talk to the Resource Manager.
</description>
</property>

<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
<description>In case you do not want to use the default scheduler</description>
</property>

<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
<description>the host is the hostname of the ResourceManager and the port is the port on
which the clients can talk to the Resource Manager. </description>
</property>

<property>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nodemanager/local</value>
<description>the local directories used by the nodemanager</description>
</property>

<property>
<name>yarn.nodemanager.address</name>
<value>0.0.0.0:8034</value>
<description>the nodemanagers bind to this port</description>
</property>

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>10240</value>
<description>the amount of memory, in MB, available to containers on the NodeManager</description>
</property>

<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>${hadoop.tmp.dir}/nodemanager/remote</value>
<description>directory on hdfs where the application logs are moved to </description>
</property>

<property>
<name>yarn.nodemanager.log-dirs</name>
<value>${hadoop.tmp.dir}/nodemanager/logs</value>
<description>the directories used by Nodemanagers as log directories</description>
</property>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service that needs to be set for Map Reduce to run </description>
</property>
</configuration>


5.4 Configure the slaves file (it lists all slave nodes)

Write the following into it:

slave1

slave2

5.5 When the configuration is complete, copy it to the other nodes (slave1 and slave2).
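The copy to the other nodes could be scripted along the following lines. This is a sketch under assumptions: it reads host names from a slaves-style file, uses scp -r, and assumes the Hadoop directory lives at the same path on every node. The helpers run and copy_to_slaves are hypothetical names, and by default the script only prints the commands (DRY_RUN=1) instead of executing them.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN is 1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# copy_to_slaves SLAVES_FILE SRC_DIR
# Copy SRC_DIR to the same parent directory on every host listed in SLAVES_FILE.
# Hypothetical helper, not from the original article.
copy_to_slaves() {
    slaves_file=$1
    src_dir=$2
    # Read the host list on fd 3 so scp cannot swallow the remaining lines.
    while IFS= read -r host <&3; do
        [ -n "$host" ] || continue          # skip blank lines
        run scp -r "$src_dir" "haduser@$host:$(dirname "$src_dir")/"
    done 3< "$slaves_file"
}

# Demo: a scratch slaves file, dry-run mode, so nothing is copied.
slaves_demo=$(mktemp)
printf 'slave1\n\nslave2\n' > "$slaves_demo"
copy_to_slaves "$slaves_demo" /home/haduser/hadoop
```

Setting DRY_RUN=0 before the call performs the copy for real, which relies on the passwordless ssh login set up earlier.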

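On slave1 and slave2, adjust only the settings that are specific to the local host. The entries that name master (fs.default.name and the yarn.resourcemanager.* addresses) must keep pointing at master, because the DataNodes and NodeManagers use them to find the NameNode and ResourceManager.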

6. Starting the cluster

Starting the HDFS cluster
First, format HDFS with the following command:
haduser@master:~/hadoop/hadoop-2.2.0$ bin/hdfs namenode -format

If formatting succeeds, no exceptions appear in the log and you can go on to start the cluster services.

Start the HDFS cluster with:

haduser@master:~/hadoop/hadoop-2.2.0$ sbin/start-dfs.sh

The following processes should now be visible on the master node:

haduser@master:~/hadoop/hadoop-2.2.0$ jps
6638 Jps
6015 NameNode
6525 SecondaryNameNode

And the following on the slave nodes:

haduser@slave1:~/hadoop/hadoop-2.2.0/etc/hadoop$ jps
4264 Jps
4208 DataNode

haduser@slave2:~/hadoop$ jps
5287 DataNode
5345 Jps
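Reading jps output by eye works for three nodes, but the check is easy to script. The sketch below tests whether a given daemon name appears in jps output read from standard input; has_daemon is a hypothetical helper, and the demo feeds it the master output shown above rather than calling jps.

```shell
#!/bin/sh
# has_daemon NAME
# Read jps output on stdin and succeed if some line's second field is exactly
# NAME (so NameNode does not match SecondaryNameNode).
# Hypothetical helper, not from the original article.
has_daemon() {
    awk -v d="$1" '$2 == d { found = 1 } END { exit !found }'
}

# Sample output copied from the master node listing above.
jps_master='6638 Jps
6015 NameNode
6525 SecondaryNameNode'

for d in NameNode SecondaryNameNode; do
    if printf '%s\n' "$jps_master" | has_daemon "$d"; then
        echo "$d: running"
    else
        echo "$d: MISSING"
    fi
done
```

On a live node the equivalent check is jps | has_daemon DataNode (or NameNode on the master).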

Starting the YARN cluster
Once the configuration is in place, starting YARN is simple: it only takes a few scripts.
Start the ResourceManager with:

haduser@master:~/hadoop/hadoop-2.2.0$ sbin/yarn-daemon.sh start resourcemanager

haduser@master:~/hadoop/hadoop-2.2.0$ jps
12477 ResourceManager
12326 SecondaryNameNode
12691 Jps
12027 NameNode

Start the NodeManager process on each slave with:

haduser@slave1:~/hadoop/hadoop-2.2.0$ sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /home/haduser/hadoop/hadoop-2.2.0/logs/yarn-haduser-nodemanager-slave1.out
haduser@slave1:~/hadoop/hadoop-2.2.0$ jps
