Fault Description
While the OpenStack + Ceph cluster was in use, a large amount of new data copied into the virtual machines consumed the cluster's disks very quickly. With no free space left, the virtual machines could no longer be operated and no Ceph cluster operation could be carried out.
Symptoms
- Restarting the virtual machines through OpenStack had no effect
- Deleting the block device directly with the rbd command also failed
[root@controller ~]# rbd -p volumes rm volume-c55fd052-212d-4107-a2ac-cf53bfc049be
2015-04-29 05:31:31.719478 7f5fb82f7760 0 client.4781741.objecter FULL, paused modify 0xe9a9e0 tid 6
- Checking the Ceph health status (a short capacity-check sketch follows the output below)
cluster 059f27e8-a23f-4587-9033-3e3679d03b31
health HEALTH_ERR 20 pgs backfill_toofull; 20 pgs degraded; 20 pgs stuck unclean; recovery 7482/129081 objects degraded (5.796%); 2 full osd(s); 1 near full osd(s)
monmap e6: 4 mons at {node-5e40.cloud.com=10.10.20.40:6789/0,node-6670.cloud.com=10.10.20.31:6789/0,node-66c4.cloud.com=10.10.20.36:6789/0,node-fb27.cloud.com=10.10.20.41:6789/0}, election epoch 886, quorum 0,1,2,3 node-6670.cloud.com,node-66c4.cloud.com,node-5e40.cloud.com,node-fb27.cloud.com
osdmap e2743: 3 osds: 3 up, 3 in
flags full
pgmap v6564199: 320 pgs, 4 pools, 262 GB data, 43027 objects
786 GB used, 47785 MB / 833 GB avail
7482/129081 objects degraded (5.796%)
300 active+clean
20 active+degraded+remapped+backfill_toofull
HEALTH_ERR 20 pgs backfill_toofull; 20 pgs degraded; 20 pgs stuck unclean; recovery 7482/129081 objects degraded (5.796%); 2 full osd(s); 1 near full osd(s)
pg 3.8 is stuck unclean for 7067109.597691, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.7d is stuck unclean for 1852078.505139, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.21 is stuck unclean for 7072842.637848, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.22 is stuck unclean for 7070880.213397, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.a is stuck unclean for 7067057.863562, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.7f is stuck unclean for 7067122.493746, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.5 is stuck unclean for 7067088.369629, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.1e is stuck unclean for 7073386.246281, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.19 is stuck unclean for 7068035.310269, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.5d is stuck unclean for 1852078.505949, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.1a is stuck unclean for 7067088.429544, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.1b is stuck unclean for 7072773.771385, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.3 is stuck unclean for 7067057.864514, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.15 is stuck unclean for 7067088.825483, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.11 is stuck unclean for 7067057.862408, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.6d is stuck unclean for 7067083.634454, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.6e is stuck unclean for 7067098.452576, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.c is stuck unclean for 5658116.678331, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.e is stuck unclean for 7067078.646953, current state active+degraded+remapped+backfill_toofull, last acting [2,0]
pg 3.20 is stuck unclean for 7067140.530849, current state active+degraded+remapped+backfill_toofull, last acting [0,2]
pg 3.7d is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.7f is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.6d is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.6e is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.5d is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.20 is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.21 is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.22 is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.1e is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.19 is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.1a is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.1b is active+degraded+remapped+backfill_toofull, acting [0,2]
pg 3.15 is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.11 is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.c is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.e is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.8 is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.a is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.5 is active+degraded+remapped+backfill_toofull, acting [2,0]
pg 3.3 is active+degraded+remapped+backfill_toofull, acting [2,0]
recovery 7482/129081 objects degraded (5.796%)
osd.0 is full at 95%
osd.2 is full at 95%
osd.1 is near full at 93%
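None of the following appears in the original log, but as a quick way to confirm which OSDs are out of space and how much per-pool headroom remains, the standard capacity views are a reasonable starting point (note that ceph osd df only exists on Hammer and later releases):
ceph df              # cluster-wide and per-pool usage
ceph osd df          # per-OSD utilisation (Hammer and later only)
ceph health detail   # names the full / near-full OSDs explicitly, as in the output above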
Solution 1 (verified)
Add OSD nodes. This is also the approach recommended by the official documentation. After the new node was added, Ceph began rebalancing the data and OSD usage started to drop (a hedged sketch of the commands follows the log lines below).
2015-04-29 06:51:58.623262 osd.1 [WRN] OSD near full (91%)
2015-04-29 06:52:01.500813 osd.2 [WRN] OSD near full (92%)
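The exact commands used to add the OSD are not part of the original record and depend on the deployment tool and Ceph release; the argument format also varies between ceph-deploy versions. As a rough sketch only, assuming a ceph-deploy based setup with a new host node-new.cloud.com carrying a spare disk /dev/sdb (both names are placeholders):
ceph-deploy disk zap node-new.cloud.com:/dev/sdb        # wipe the new disk
ceph-deploy osd prepare node-new.cloud.com:/dev/sdb     # create the OSD data filesystem and journal
ceph-deploy osd activate node-new.cloud.com:/dev/sdb1   # bring the new OSD up and into the cluster
ceph -w                                                 # watch backfill progress and the full warnings clear
ceph osd tree                                           # confirm the new OSD is up and in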
Solution 2 (theoretical, not verified)
If no new disks are available, the only remaining option is a different approach. In this state Ceph does not allow any read or write operations, so no Ceph command behaves normally. The way out is to relax Ceph's definition of "full": the log above shows the full ratio is 95%, so raising that ratio slightly buys enough headroom to delete data quickly and bring usage back below the threshold.
- Tried setting it directly with the following command, but it failed: the Ceph cluster did not start resynchronizing data. The suspicion is that the monitor services themselves still need to be restarted.
ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
- Modify the configuration file and then restart the monitor service. Out of concern that something might go wrong, this method was not attempted at the time; it was later confirmed on the mailing list that it should not affect the data, provided that none of the virtual machines write anything further to Ceph during the recovery (a rough roll-out sketch follows the configuration snippet below).
By default the full ratio is 95% and the near full ratio is 85%, so these settings need to be adjusted to the actual situation.
[global]
mon osd full ratio = .98
mon osd nearfull ratio = .80
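A hedged sketch of how the change above could be rolled out and verified (never attempted in this incident; the restart command depends on the init system in use, and the volume name is a placeholder):
# On each monitor host, after adding the settings above to /etc/ceph/ceph.conf:
systemctl restart ceph-mon@$(hostname -s)                              # systemd-managed hosts; sysvinit setups restart via the ceph init script instead
ceph daemon mon.$(hostname -s) config show | grep mon_osd_full_ratio   # confirm the monitor picked up the new threshold
# Once writes are accepted again, free space as quickly as possible, for example:
rbd -p volumes rm volume-<unused-volume-uuid>
# Then restore the default ratios so the safety margin is not left permanently weakened.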
Analysis and Summary
Cause
According to the Ceph official documentation, once an OSD reaches the 95% full ratio the cluster stops accepting read and write requests from Ceph clients, which is why the virtual machines failed to come back up when restarted. The pgmap above is consistent with this: 786 GB used out of 833 GB is roughly 94%, and osd.0 and osd.2 had already crossed 95%.
Solution
The official recommendation clearly favours adding new OSDs. Temporarily raising the full ratio is a workaround, but not a recommended one: it depends on manually deleting data, and as soon as another node fails the cluster can hit the full ratio again. The proper fix is to expand capacity.
Reflections
Two points from this incident are worth thinking about (a small monitoring sketch follows this list):
- Monitoring: a DNS misconfiguration made while the servers were being set up prevented the monitoring emails from going out, so the Ceph WARN notifications were never received
- The cloud platform itself: because of the way Ceph works, storage in the OpenStack platform is usually overcommitted. From the user's point of view, copying in a large amount of data is perfectly reasonable, but the platform had no early-warning mechanism of its own, which is what allowed this incident to happen
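As a minimal sketch of the kind of independent check that would have caught this earlier by relying on Ceph's own near-full warning (the cron scheduling and mail address are assumptions, not part of the original setup):
#!/bin/sh
# Run from cron on a monitor host: alert whenever Ceph reports anything other than HEALTH_OK.
STATUS=$(ceph health 2>&1)
case "$STATUS" in
  HEALTH_OK*) : ;;                                                         # healthy, stay silent
  *) echo "$STATUS" | mail -s "Ceph alert on $(hostname)" ops@example.com ;;
esac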
References
- http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity