Copying HDFS Files Between Hadoop Clusters

1. Background

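My department needed to copy some HDFS files from an existing Hadoop cluster (no Kerberos authentication, cluster name: bd-stg-hadoop) to a new Hadoop cluster (Kerberos-enabled, cluster name: zp-tt-hadoop). The two clusters can reach each other over the network.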

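When neither cluster has security enabled, distcp handles the transfer quickly. The present case is a little more involved. After some searching, it turns out that the CDH documentation describes a parameter that covers exactly this requirement: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_admin_distcp_secure_insecure.html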

2. Implementation

2.1 Copying Data between Two Insecure Clusters

When neither cluster uses security authentication, the usual approach is:

$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

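WebHDFS also works. Note that webhdfs:// URIs address the NameNode HTTP port (50070 by default on Hadoop 2.x), not the RPC port 8020:

$ hadoop distcp webhdfs://nn1:50070/foo/bar webhdfs://nn2:50070/bar/foo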

To copy between different Hadoop versions, use HftpFileSystem. It is a read-only file system, so DistCp must run on the destination cluster. The source URI takes the form

hftp://<dfs.http.address>/<path> (dfs.http.address defaults to <namenode>:50070).
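As a sketch of the cross-version case, the helper below assembles an hftp:// source URI and hands it to DistCp. The function name copy_via_hftp is made up for illustration, and the hosts and paths are the placeholders used earlier in this section. It assumes the script runs on a node of the destination cluster. As far as I know, HFTP was removed in Hadoop 3, where WebHDFS is the usual substitute.

```shell
# Sketch: copy from a cluster running a different Hadoop version over HFTP.
# Run this on the DESTINATION cluster, because hftp:// is read-only.
# copy_via_hftp is a hypothetical helper; hosts and paths are placeholders.
copy_via_hftp() {
  local src_http_addr="$1"   # dfs.http.address of the source, e.g. nn1:50070
  local src_path="$2"        # absolute path on the source cluster
  local dst_uri="$3"         # full destination URI, e.g. hdfs://nn2:8020/bar/foo
  hadoop distcp "hftp://${src_http_addr}${src_path}" "${dst_uri}"
}

# Example (placeholders):
#   copy_via_hftp nn1:50070 /foo/bar hdfs://nn2:8020/bar/foo
```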

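Some commonly used distcp options:

-i: ignore failures (not recommended)
-log: write logs to <logdir>
-m: maximum number of simultaneous copies, i.e. the number of map tasks used for the copy. More maps do not necessarily mean higher throughput.
-overwrite: overwrite the destination. If a map fails and -i is not set, every file in that split is recopied, not only the files that failed. This option also changes the semantics for generating destination paths, so use it with care.
-update: overwrite if the source and destination sizes differ
-f: use <urilist_uri> as the list of source files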
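The options above can be combined. The following is a minimal sketch, not a recommended configuration: sync_dir and copy_from_list are hypothetical helper names, the default of 20 maps and the log directory /tmp/distcp-logs are arbitrary assumptions, and all URIs are placeholders.

```shell
# Sketch: two typical option combinations. Helper names and defaults are illustrative.

# Incremental sync: only files whose size differs are recopied (-update),
# with a cap on the number of map tasks (-m) and a log directory (-log).
sync_dir() {
  local src="$1" dst="$2" maps="${3:-20}"
  hadoop distcp -update -m "${maps}" -log /tmp/distcp-logs "${src}" "${dst}"
}

# Copy every source listed (one URI per line) in a file stored on HDFS (-f).
copy_from_list() {
  local urilist="$1" dst="$2"
  hadoop distcp -f "${urilist}" "${dst}"
}

# Examples (placeholders):
#   sync_dir hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo 10
#   copy_from_list hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```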

2.2 Copying Data between a Secure and an Insecure Cluster

Add the following to core-site.xml on the secure cluster:

<property> 
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value> 
</property>

Then run one of the following on the secure cluster, depending on the direction of the copy:

$ hadoop distcp webhdfs://insecureCluster webhdfs://secureCluster
$ hadoop distcp webhdfs://secureCluster webhdfs://insecureCluster
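If editing core-site.xml is inconvenient, the same property can usually be supplied for a single job through Hadoop's generic -D option, since DistCp is launched through ToolRunner. The sketch below assumes a valid Kerberos ticket already exists on the secure cluster (for example from kinit); copy_with_fallback is a hypothetical helper name and the URIs are placeholders.

```shell
# Sketch: allow fallback to simple auth for one DistCp job only.
# Assumes it is run on the secure cluster with a valid Kerberos ticket.
# copy_with_fallback is a hypothetical helper name.
copy_with_fallback() {
  local src="$1" dst="$2"
  hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    "${src}" "${dst}"
}

# Example (placeholders):
#   copy_with_fallback webhdfs://insecureCluster/foo webhdfs://secureCluster/foo
```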

In my case both clusters run HDFS in HA mode, and the commands above failed with:

16/09/27 14:47:52 ERROR tools.DistCp: Exception encountered 
java.lang.IllegalArgumentException: java.net.UnknownHostException: bd-stg-hadoop
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:392)
    at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.initialize(WebHdfsFileSystem.java:167)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2643)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2680)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2662)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:379)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:76)
    at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:86)
    at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:365)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:171)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:122)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:429)
Caused by: java.net.UnknownHostException: bd-stg-hadoop

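Solution:

bd-stg-hadoop is an HA nameservice name rather than a real hostname. The client on the secure cluster most likely has no configuration for that nameservice, so it tries to resolve the name as a host and fails. Replacing bd-stg-hadoop with the hostname of its active NameNode fixes this: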

$ hadoop distcp webhdfs://bd-stg-namenode-138/tmp/hivebackup/app/app_stg_session webhdfs://zp-tt-hadoop:8020/tmp/hivebackup/app/app_stg_session
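To find out which NameNode is currently active, something like the following sketch can be run on a node of the source cluster, where the client configuration defines the nameservice. find_active_nn is a hypothetical helper name; it assumes a single nameservice and the standard HA keys dfs.ha.namenodes.<nameservice> and dfs.namenode.http-address.<nameservice>.<id>.

```shell
# Sketch: print the HTTP address of the active NameNode of an HA nameservice.
# Run on a node whose client config defines the nameservice (the source cluster).
# find_active_nn is a hypothetical helper name.
find_active_nn() {
  local nameservice="$1"
  local ids id
  # Comma-separated NameNode IDs, e.g. "nn1,nn2"
  ids="$(hdfs getconf -confKey "dfs.ha.namenodes.${nameservice}")" || return 1
  for id in $(echo "${ids}" | tr ',' ' '); do
    if [ "$(hdfs haadmin -getServiceState "${id}" 2>/dev/null)" = "active" ]; then
      hdfs getconf -confKey "dfs.namenode.http-address.${nameservice}.${id}"
      return 0
    fi
  done
  return 1   # no active NameNode found
}

# Example:
#   find_active_nn bd-stg-hadoop
```

The printed host (and port, if it is not the default) can then be used in the webhdfs:// source URI in place of the nameservice name.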

Once the job succeeds, you can see how it works: DistCp launches a map-only MapReduce job that copies the data between the two clusters.
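After the job finishes, a quick sanity check is to compare file counts and total bytes on both sides. This is only a rough sketch and not part of the original procedure: same_count_and_size is a hypothetical helper, and it relies on `hadoop fs -count` printing directory count, file count, content size, and path in that order. It does not compare checksums.

```shell
# Sketch: compare file count and total size of a source and destination path.
# same_count_and_size is a hypothetical helper; it does not verify checksums.
same_count_and_size() {
  local src="$1" dst="$2" a b
  # Columns of `hadoop fs -count`: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  a="$(hadoop fs -count "${src}" | awk '{print $2, $3}')"
  b="$(hadoop fs -count "${dst}" | awk '{print $2, $3}')"
  [ -n "${a}" ] && [ "${a}" = "${b}" ]
}

# Example (run where both URIs are reachable):
#   same_count_and_size webhdfs://bd-stg-namenode-138/tmp/hivebackup/app/app_stg_session \
#                       /tmp/hivebackup/app/app_stg_session && echo "counts match"
```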

2.3 Copying Data between Two Secure Clusters

This case is more complex because it requires cross-realm Kerberos configuration. It is not covered in this article.

Permanent link to this article: http://www.linuxidc.com/Linux/2017-09/146879.htm

Posted in: Server Applications, 2022-01-21