Handling unfound objects in a Ceph PG: a step-by-step walkthrough

Summary: While checking a Ceph cluster today, I found a PG with an unfound (lost) object. This article walks through how it was resolved.

1. Check the cluster status

[root@k8snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.73 is active+clean+inconsistent, acting [25,14,8]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
    pg 2.83 is active+clean+inconsistent, acting [14,13,6]
    pg 2.ae is active+clean+inconsistent, acting [14,3,2]
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
    pg 2.da is active+clean+inconsistent, acting [23,14,15]
    pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound

The key line in the output is: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
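When the health output is long, it can help to pull out the affected PGs by state flag. The following is a minimal sketch, not part of the original procedure: it parses a few lines copied from the output above, and the helper names pgs_by_state and pgs_with_flag are made up for illustration.

```python
import re

# Sample lines copied from the `ceph health detail` output above.
HEALTH_DETAIL = """\
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
"""

# Matches indented lines of the form "pg <id> is <state>, acting [a,b,c]".
PG_LINE = re.compile(r"^\s+pg (\S+) is (\S+), acting \[([\d,]+)\]")


def pgs_by_state(text):
    """Map each PG id to (set of state flags, list of acting OSD ids)."""
    result = {}
    for line in text.splitlines():
        match = PG_LINE.match(line)
        if match:
            pgid, state, acting = match.groups()
            flags = set(state.split("+"))
            osds = [int(osd) for osd in acting.split(",")]
            result[pgid] = (flags, osds)
    return result


def pgs_with_flag(text, flag):
    """Return the sorted PG ids whose state contains the given flag."""
    return sorted(pg for pg, (flags, _) in pgs_by_state(text).items() if flag in flags)


print(pgs_with_flag(HEALTH_DETAIL, "recovery_unfound"))
print(pgs_with_flag(HEALTH_DETAIL, "inconsistent"))
```

On a live cluster the text would come from running ceph health detail; the sketch only reads text and never talks to the cluster.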

Next, look at pg 2.2b itself to see its detailed information.

[root@k8snode001 ~]# ceph pg dump_json pools    |grep 2.2b
dumped all
2.2b       2487                  1        1         0       1  9533198403 3048     3048                active+recovery_unfound+degraded 2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]         14  [14,22,4]             14  10371'5437258 2020-07-23 08:56:06.637012   10371'5437258 2020-07-23 08:56:06.637012             0

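The counters after the PG id show 2487 objects, of which 1 is missing on the primary, 1 is degraded and 1 is unfound. The PG is still mapped to three OSDs [14,22,4], but this one object no longer has a complete set of replicas.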
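The row above is printed without its header, so the counters are easy to misread. The sketch below labels the first ten fields. The column names are an assumption based on the header that ceph pg dump prints in releases of that period; they do not appear in the original output, so check them against the header on your own cluster.

```python
# The row copied from the output above.
ROW = (
    "2.2b       2487                  1        1         0       1  9533198403 3048     3048"
    "                active+recovery_unfound+degraded 2020-07-23 08:56:07.669903"
    "  10373'5448370  10373:7312614  [14,22,4]         14  [14,22,4]             14"
    "  10371'5437258 2020-07-23 08:56:06.637012   10371'5437258 2020-07-23 08:56:06.637012"
    "             0"
)

# ASSUMPTION: column order of the first ten fields, not shown in the original output.
COLUMNS = [
    "pg_stat", "objects", "missing_on_primary", "degraded", "misplaced",
    "unfound", "bytes", "log", "disk_log", "state",
]

# zip() stops at the shorter sequence, so only the first ten tokens are labeled.
fields = dict(zip(COLUMNS, ROW.split()))

# Convert the numeric counters to integers; pg_stat and state stay as strings.
for name in COLUMNS[1:9]:
    fields[name] = int(fields[name])

for name in COLUMNS:
    print(f"{name:20} {fields[name]}")
```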

2. Check the PG map

[root@k8snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]

The PG map shows that pg 2.2b is mapped to OSDs [14,22,4], and that the up set and the acting set are identical.
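For scripting, the single line printed by ceph pg map can be split into its parts. This is a minimal sketch that assumes the line format shown above; parse_pg_map is a hypothetical helper name, and treating the first acting OSD as the primary matches the dump row earlier (acting primary 14) but is stated here as an assumption.

```python
import re

# The line copied from the `ceph pg map 2.2b` output above.
PG_MAP_LINE = "osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]"

PATTERN = re.compile(
    r"osdmap e(\d+) pg (\S+) \(\S+\) -> up \[([\d,]*)\] acting \[([\d,]*)\]"
)


def parse_pg_map(line):
    """Split a `ceph pg map` line into epoch, PG id, up set and acting set."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected pg map line: {line!r}")
    epoch, pgid, up, acting = match.groups()

    def to_osds(text):
        return [int(osd) for osd in text.split(",") if osd]

    return {
        "epoch": int(epoch),
        "pgid": pgid,
        "up": to_osds(up),
        "acting": to_osds(acting),
    }


info = parse_pg_map(PG_MAP_LINE)
print(info)
# ASSUMPTION: the first OSD in the acting set is the primary.
print("assumed primary:", info["acting"][0])
print("up == acting:", info["up"] == info["acting"])
```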

3. Check the pool status

[root@k8snode001 ~]# ceph osd pool stats k8s-1
pool k8s-1 id 2
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr
 
[root@k8snode001 ~]# ceph osd pool ls detail|grep k8s-1
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
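The degraded and unfound ratios above use different denominators (1955664 and 651888). A quick check suggests why: the pool is replicated with size 3, and 651888 × 3 = 1955664, which is consistent with degraded being counted per object copy and unfound per object. This reading is mine rather than the author's; the sketch below only verifies the arithmetic on the numbers shown.

```python
import re

# Lines copied from the pool output above (the pool line is shortened to its prefix).
POOL_STATS = """\
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
"""
POOL_LINE = "pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0"


def parse_ratios(text):
    """Return {kind: (count, total)} for lines like '1/651888 objects unfound'."""
    ratios = {}
    for count, total, kind in re.findall(r"(\d+)/(\d+) objects (\w+)", text):
        ratios[kind] = (int(count), int(total))
    return ratios


ratios = parse_ratios(POOL_STATS)
pool_size = int(re.search(r"\bsize (\d+)", POOL_LINE).group(1))

object_total = ratios["unfound"][1]   # counted per object
copy_total = ratios["degraded"][1]    # counted per object copy (assumed)

print("pool size:", pool_size)
print("objects * size:", object_total * pool_size)
print("matches degraded total:", object_total * pool_size == copy_total)
```

The cluster-wide figures in step 1 show the same relation: 973013 × 3 = 2919039.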

4. Try to recover the unfound object in pg 2.2b

[root@k8snode001 ~]# ceph pg repair 2.2b

If the repair never succeeds, query the stuck PG for details. The part to focus on is recovery_state:

[root@k8snode001 ~]# ceph pg 2.2b  query
{
    ......
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {"backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []},
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {"pull_from_peer": [],
                    "pushing": []}
            },
            "scrub": {
                "scrubber.epoch_start": "10370",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0","scrubber.deep": false,"scrubber.waiting_on_whom": []}
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}}

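The recovery_state section of the query output can also be summarized by script. The sketch below is only an illustration: the JSON is a trimmed copy of the fields shown above (the elided part of the real output is left out), and summarize_recovery_state is a hypothetical helper name.

```python
import json

# Trimmed copy of the recovery_state section shown above.
QUERY_JSON = """
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2020-07-21 14:17:05.855923",
      "might_have_unfound": [],
      "recovery_progress": {"backfill_targets": [], "recovering": []}
    },
    {
      "name": "Started",
      "enter_time": "2020-07-21 14:17:04.814061"
    }
  ]
}
"""


def summarize_recovery_state(query):
    """Return one (name, enter_time, might_have_unfound) tuple per state entry."""
    summary = []
    for entry in query.get("recovery_state", []):
        summary.append(
            (
                entry["name"],
                entry["enter_time"],
                entry.get("might_have_unfound", []),
            )
        )
    return summary


for name, entered, candidates in summarize_recovery_state(json.loads(QUERY_JSON)):
    print(f"{name} (since {entered}): might_have_unfound={candidates}")
```

In this case might_have_unfound is empty, so the output gives no further OSDs to look at for the missing object.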
If repair cannot fix the PG, there are two options: revert the unfound object to its previous version, or delete it outright.

5. Solutions

Revert to the previous version:
[root@k8snode001 ~]# ceph pg  2.2b  mark_unfound_lost revert
Delete outright:
[root@k8snode001 ~]# ceph pg  2.2b  mark_unfound_lost delete
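Both commands give up on the unfound object, so a small guard against typos can be worthwhile before running them. The sketch below only builds and prints the command line and never executes it; mark_unfound_lost_cmd is a hypothetical helper, and the PG id check (pool number, a dot, then a hexadecimal suffix) is an assumption based on the ids seen in this article.

```python
import re

VALID_MODES = ("revert", "delete")

# ASSUMPTION: PG ids look like "<pool>.<hex>", e.g. "2.2b" or "2.ae".
PGID_PATTERN = re.compile(r"\d+\.[0-9a-f]+")


def mark_unfound_lost_cmd(pgid, mode):
    """Build the argument list for `ceph pg <pgid> mark_unfound_lost <mode>`."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if not PGID_PATTERN.fullmatch(pgid):
        raise ValueError(f"unexpected PG id: {pgid!r}")
    return ["ceph", "pg", pgid, "mark_unfound_lost", mode]


# Dry run: print the command instead of executing it.
print(" ".join(mark_unfound_lost_cmd("2.2b", "revert")))
print(" ".join(mark_unfound_lost_cmd("2.2b", "delete")))
```

To actually run the command, the returned list could be passed to subprocess.run on a node with admin access; that step is left out here on purpose.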

6. Verify

I chose to delete the object. The cluster then rebuilt the PG, and after a short wait the PG state changed to active+clean.

[root@k8snode001 ~]#  ceph pg  2.2b query 
{ 
    "state": "active+clean", 
    "snap_trimq": "[]", 
    "snap_trimq_len": 0, 
    "epoch": 11069, 
    "up": [ 
        12, 
        22, 
        4 
    ],

Check the cluster status again:

[root@k8snode001 ~]# ceph health detail 
HEALTH_OK
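The two checks in this step can be expressed as small predicates, for example when waiting for recovery in a script. This is a minimal sketch: the JSON is the excerpt from the query output above with a closing brace added so that it parses, and pg_is_clean and cluster_is_healthy are hypothetical helper names.

```python
import json

# Excerpt of the `ceph pg 2.2b query` output above, closed so that it is valid JSON.
QUERY_JSON = """
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 11069,
    "up": [12, 22, 4]
}
"""

HEALTH_TEXT = "HEALTH_OK\n"


def pg_is_clean(query):
    """True if the PG state consists of exactly the flags active and clean."""
    return set(query["state"].split("+")) == {"active", "clean"}


def cluster_is_healthy(health_text):
    """True if the first line of `ceph health detail` starts with HEALTH_OK."""
    lines = health_text.strip().splitlines()
    return bool(lines) and lines[0].startswith("HEALTH_OK")


print("pg clean:", pg_is_clean(json.loads(QUERY_JSON)))
print("cluster healthy:", cluster_is_healthy(HEALTH_TEXT))
```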

Posted in: Linux tutorials

2024-07-25

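Reprint notice: unless otherwise stated, articles on this site are published under the CC-4.0 license. Please credit the source when reprinting.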
