Oracle 11g RAC 报错CRS-4535, CRS-4000解决

共计 15479 个字符，预计需要花费 39 分钟才能阅读完成。

环境：AIX6.1 + Oracle11.2.0.4 RAC(2 nodes)

1. 故障现象
2. 定位问题
3. 处理问题

1. 故障现象

使用 crsctl 查看集群各资源状态，在任一节点都会直接报错 CRS-4535, CRS-4000；但此时数据库是可以被正常访问的。
具体故障现象如下：

 #节点 1 查询
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
 
#节点 2 查询
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

同样的，crs_stat -t 查看一样报错，错误码是 CRS-0184：

 root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

节点 2 也一样！

确定此时数据库是可以被正常访问的。如下：

 #节点 2 模拟客户端登录 RAC 集群，使用 SCAN IP 访问，发现可以正常访问到数据库
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb
 
SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016
 
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
 
SQL>

RAC 环境下的 /etc/hosts 文件相关内容：

 #scan
192.168.103.31  scan-ip

2. 定位问题

首先查看节点 1 的集群相关日志：
Clusterware(GI)的日志存放在 $GRID_HOME/log/nodename 下；
Clusterware(GI)对应几个关键的后台进程 css，crs，evm，它们的日志分别存在 cssd，crsd，evmd 目录下；

节点 1 查看相关日志：

 #查看 GI 的 alert 日志文件，最近的记录只是提示 GI 所在存储空间使用率高，稍后清理下即可，而且目前还有一定空间剩余，显然并非是此次故障的原因。root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
 
#因为 crsctl 不可以使用，进而查看 crs 的日志信息，发现 3 号已经有报错，无法打开裸设备，从而导致无法初始化 OCR；继续看错误信息，发现是这个时候访问共享存储时无法成功。怀疑此刻存储出现问题，需要进一步和现场人员确定此时间点是否有存储相关的施工。root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [  OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [  OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [  OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [  CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
 
2016-10-03 18:04:40.253: [    CRSD][1] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
 
2016-10-03 18:04:40.253: [    CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [    CRSD][1] Done.

节点 2 查看相关日志：

 #查看 GI 的 alert 日志，发现节点 2 的 ctss 有 CRS-2409 的报错，虽然根据 MOS 文档 ID 1135337.1 说明，This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode. 只需要查看两个节点的时间是否一致，但实际上查询节点时间一致:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
 
#查看节点 2 的 crs 日志，发现和节点 1 相近的时间点，同样访问共享存储出现了问题，进而无法初始化 OCR
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [  OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [  OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [  OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [  CRSOCR][1] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
 
2016-10-03 18:04:31.082: [    CRSD][1] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
 
2016-10-03 18:04:31.082: [    CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [    CRSD][1] Done.

现在登入到 grid 用户，确定下 ASM 磁盘组的状态：
sqlplus / as sysasm 直接查询 v$asm_diskgroup;

select name, state, total_mb, free_mb from v$asm_diskgroup;

发现 OCR_VOTE1 磁盘组在两个 ASM 实例上都是没有 mount；

 SQL> select instance_name from v$instance;
 
INSTANCE_NAME
------------------------------------------------
+ASM2
 
SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;
 
NAME                           STATE                               TOTAL_MB    FREE_MB
------------------------------ --------------------------------- ---------- ----------
DATA                           MOUNTED                               737280      88152
FRA_ARCHIVE                    MOUNTED                                10240       9287
OCR_VOTE1                      DISMOUNTED                                 0          0

另一个节点一样；

 节点 mount OCR 相关磁盘组
SQL> select name, state from v$asm_diskgroup;
 
NAME                           STATE
------------------------------ ---------------------------------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      DISMOUNTED

再确认下目前 GI 的一些核心后台进程：

 # 发现 crs 这个进程是没有启动的，查询没有任何结果输出
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

同样，节点 2 查询也是一样没有启动 crs 进程。

简单总结问题现状：故障发生在 10 月 3 日下午 18:04 左右，所有节点都因为无法访问共享存储进而导致 OCR 初始化失败。目前的 crs 进程是没有正常启动的。

3. 处理问题

3.1 尝试手工挂载 OCR 磁盘组

 SQL> alter diskgroup ocr_vote1 mount;
 
Diskgroup altered.
 
SQL> select name, state from v$asm_diskgroup;
 
NAME                           STATE
------------------------------ ---------------------------------
DATA                           MOUNTED
FRA_ARCHIVE                    MOUNTED
OCR_VOTE1                      MOUNTED

3.2 节点 1 启动 CRS
目前，crs 这个进程依然是没有启动的，

 #证明 crsd.bin 当前没有启动
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep

节点 1 尝试正常开启 crs 失败

 root@bjdb1:/>crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

节点 1 尝试正常关闭 crs 失败

 root@bjdb1:/>crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

那么下一步如何处理呢？
最终选择在节点 1 强制停止 crs 再启动成功

 #强制关闭节点 1 的 crs
root@bjdb1:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb1'
CRS-2677: Stop of 'ora.evmd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb1' succeeded
 
CRS-5017: The resource action "ora.crf stop" encountered the following error:
action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".
 
CRS-2675: Stop of 'ora.crf' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.crf' on 'bjdb1'
CRS-2681: Clean of 'ora.crf' on 'bjdb1' succeeded
 
CRS-2675: Stop of 'ora.asm' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb1'
CRS-2681: Clean of 'ora.asm' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb1'
CRS-2677: Stop of 'ora.cssd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb1'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb1'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
root@bjdb1:/>
 
#查看 crsctl 资源状态，此时肯定没有
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
 
#查看 crsd.bin,cssd.bin,evmd.bin, 都已经没有了相关进程
root@bjdb1:/>ps -ef|grep crsd.bin
    root  8126466 25428158   0 15:52:50  pts/0  0:00 grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    root  8126470 25428158   0 15:53:01  pts/0  0:00 grep cssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
    root 35520600 25428158   0 15:53:13  pts/0  0:00 grep evmd.bin
    
#查看 pmon 进程, 也都没有了
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root@bjdb1:/>
 
#尝试再次启动 crs，成功！root@bjdb1:/>crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
 
#查看 crsctl 资源，依然报错，说明还没有完全起来
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
 
#等待一段时间，可以查 GI 相关的核心后台进程
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
root@bjdb1:/>ps -ef|grep cssd.bin|grep -v grep
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep pmon
    root 39387390 25428158   0 15:57:23  pts/0  0:00 grep pmon
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
    root 37814470 25428158   0 15:59:27  pts/0  0:00 grep pmon
    grid 39911466        1   0 15:58:47      -  0:00 asm_pmon_+ASM2
root@bjdb1:/>
root@bjdb1:/>ps -ef|grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
    grid 10747994 26542192   0 15:55:03      -  0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
 
root@bjdb1:/>ps -ef|grep evmd.bin
    grid 40173760        1   0 15:57:10      -  0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
root@bjdb1:/>ps -ef|grep crsd.bin
    root 37683238        1   0 15:59:54      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
root@bjdb1:/>
 
#当核心进程都起来时，再次查看 crsctl 资源情况，发现已经可以正常查询，各资源正在启动
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  OFFLINE
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE                               STARTING
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE                               STARTING

最后等待一段时间后，再次查询，发现节点 1 各资源已经全部正常。

 root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
ora.net1.network
               ONLINE  ONLINE       bjdb1
ora.ons
               ONLINE  ONLINE       bjdb1
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  OFFLINE
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  OFFLINE
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  OFFLINE
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  OFFLINE

3.3 解决节点 1 上 GI 对应存储空间使用率过高
继续观察节点 1 的日志：

 grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 16:03:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:08:30.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:09:50.796:
[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
2016-10-10 16:10:20.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:15:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

其实这个之前也看到过，就是需要清理 /opt/u01 目录空间了! 查找可以删除的一些历史日志，解决完这个提示就不会再出现！

3.4 节点 2 手工挂载 OCR，重启 CRS
节点 1 问题已解决，在节点 2 同样挂载 OCR 后重启 CRS
方法都一样，只是在节点 2 操作，不再赘述。

 #强制停止节点 2 的 crs
root@bjdb2:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb2'
CRS-2677: Stop of 'ora.crf' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.evmd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb2' succeeded
CRS-2675: Stop of 'ora.asm' on 'bjdb2' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb2'
CRS-2681: Clean of 'ora.asm' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb2'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb2'
CRS-2677: Stop of 'ora.cssd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb2'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb2'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb2' has completed
CRS-4133: Oracle High Availability Services has been stopped.

再启动：
crsctl start crs

等待一段时间后查询：

 #观察到 crs 进程已经启动
root@bjdb2:/>ps -ef|grep crsd.bin|grep -v grep
    root 22610148        1   0 16:24:15      -  0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
#最后使用 crsctl 查看资源已经恢复正常
root@bjdb2:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.LISTENER.lsnr
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.OCR_VOTE1.dg
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.asm
               ONLINE  ONLINE       bjdb1                    Started
               ONLINE  ONLINE       bjdb2                    Started
ora.gsd
               OFFLINE OFFLINE      bjdb1
               OFFLINE OFFLINE      bjdb2
ora.net1.network
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.ons
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
ora.registry.acfs
               ONLINE  ONLINE       bjdb1
               ONLINE  ONLINE       bjdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       bjdb2
ora.bjdb.db
      1        ONLINE  ONLINE       bjdb1                    Open
      2        ONLINE  ONLINE       bjdb2                    Open
ora.bjdb1.vip
      1        ONLINE  ONLINE       bjdb1
ora.bjdb2.vip
      1        ONLINE  ONLINE       bjdb2
ora.cvu
      1        ONLINE  ONLINE       bjdb1
ora.oc4j
      1        ONLINE  ONLINE       bjdb1
ora.scan1.vip
      1        ONLINE  ONLINE       bjdb2
      
#查看运行在节点 2 上的监听程序，之前故障时，scan 的���听就在节点 2 上
root@bjdb2:/>ps -ef|grep tns
    grid  5308430        1   0   Aug 17      -  5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
    grid  5505240        1   1   Aug 17      - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root@bjdb2:/>

至此，完成本次 RAC 集群 CRS-4535，CRS-4000 故障的处理；值得注意的是，巡检发现故障后，我整个 troubleshooting 解决过程，RAC 数据库对外都是可以提供服务的，这点也说明了 RAC 的稳定性！

更多 Oracle 相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx?tid=12

本文永久更新链接地址：http://www.linuxidc.com/Linux/2016-11/137432.htm

Oracle 11g RAC 报错CRS-4535, CRS-4000解决

1. 故障现象

2. 定位问题

3. 处理问题

开源堡垒机JumpServer配置教程：使用步骤与配置

申请腾讯混元的API Key并且使用LobeChat调用混元AI

系统加固-Linux不允许用户使用密码登录，只能使用密钥登录

【开源安全保护】如何安装JumpServer堡垒机

基于Docker快速搭建一个开源的IT人员在线工具箱-it-tools

shell字符串比较、判断是否为数字入门案例

Secure Boot什么意思？BIOS中Secure Boot灰色无法更改解决方法详解

Mariadb学习总结（八）：聚合函数及分组查询

CentOS7系统Tab键失效的解决方法

为5G构建网络，MPLS需要改变些什么？

	#节点 1 查询
	grid@bjdb1:/home/grid>crsctl stat res -t
	CRS-4535: Cannot communicate with Cluster Ready Services
	CRS-4000: Command Status failed, or completed with errors.

	#节点 2 查询
	root@bjdb2:/>crsctl stat res -t
	CRS-4535: Cannot communicate with Cluster Ready Services
	CRS-4000: Command Status failed, or completed with errors.

	root@bjdb1:/>crs_stat -t
	CRS-0184: Cannot communicate with the CRS daemon.

	#节点 2 模拟客户端登录 RAC 集群，使用 SCAN IP 访问，发现可以正常访问到数据库
	oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb

	SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016

	Copyright (c) 1982, 2013, Oracle. All rights reserved.

	Connected to:
	Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
	With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
	Data Mining and Real Application Testing options

	SQL>

	#查看 GI 的 alert 日志文件，最近的记录只是提示 GI 所在存储空间使用率高，稍后清理下即可，而且目前还有一定空间剩余，显然并非是此次故障的原因。root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
	2016-10-10 14:18:26.125:
	[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 14:23:31.125:
	[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 14:28:36.125:
	[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 14:33:41.125:
	[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 14:38:46.125:
	[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

	#因为 crsctl 不可以使用，进而查看 crs 的日志信息，发现 3 号已经有报错，无法打开裸设备，从而导致无法初始化 OCR；继续看错误信息，发现是这个时候访问共享存储时无法成功。怀疑此刻存储出现问题，需要进一步和现场人员确定此时间点是否有存储相关的施工。root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
	2016-10-03 18:04:40.248: [ OCRRAW][1]proprinit: Could not open raw device
	2016-10-03 18:04:40.248: [ OCRASM][1]proprasmcl: asmhandle is NULL
	2016-10-03 18:04:40.252: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
	2016-10-03 18:04:40.253: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

	2016-10-03 18:04:40.253: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage

	2016-10-03 18:04:40.253: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
	2016-10-03 18:04:40.253: [ CRSD][1] Done.

	#查看 GI 的 alert 日志，发现节点 2 的 ctss 有 CRS-2409 的报错，虽然根据 MOS 文档 ID 1135337.1 说明，This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode. 只需要查看两个节点的时间是否一致，但实际上查询节点时间一致:
	root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
	2016-10-10 12:29:22.145:
	[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
	2016-10-10 12:59:38.799:
	[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
	2016-10-10 13:34:11.402:
	[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
	2016-10-10 14:12:44.168:
	[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
	2016-10-10 14:44:04.824:
	[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.

	#查看节点 2 的 crs 日志，发现和节点 1 相近的时间点，同样访问共享存储出现了问题，进而无法初始化 OCR
	root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
	2016-10-03 18:04:31.077: [ OCRRAW][1]proprinit: Could not open raw device
	2016-10-03 18:04:31.077: [ OCRASM][1]proprasmcl: asmhandle is NULL
	2016-10-03 18:04:31.081: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
	2016-10-03 18:04:31.081: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

	2016-10-03 18:04:31.082: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage

	2016-10-03 18:04:31.082: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
	2016-10-03 18:04:31.082: [ CRSD][1] Done.

	SQL> select instance_name from v$instance;

	INSTANCE_NAME
	------------------------------------------------
	+ASM2

	SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;

	NAME STATE TOTAL_MB FREE_MB
	------------------------------ --------------------------------- ---------- ----------
	DATA MOUNTED 737280 88152
	FRA_ARCHIVE MOUNTED 10240 9287
	OCR_VOTE1 DISMOUNTED 0 0

	节点 mount OCR 相关磁盘组
	SQL> select name, state from v$asm_diskgroup;

	NAME STATE
	------------------------------ ---------------------------------
	DATA MOUNTED
	FRA_ARCHIVE MOUNTED
	OCR_VOTE1 DISMOUNTED

	# 发现 crs 这个进程是没有启动的，查询没有任何结果输出
	root@bjdb1:/>ps -ef\|grep crsd.bin\|grep -v grep

	SQL> alter diskgroup ocr_vote1 mount;

	Diskgroup altered.

	SQL> select name, state from v$asm_diskgroup;

	NAME STATE
	------------------------------ ---------------------------------
	DATA MOUNTED
	FRA_ARCHIVE MOUNTED
	OCR_VOTE1 MOUNTED

	#证明 crsd.bin 当前没有启动
	root@bjdb1:/>ps -ef\|grep crsd.bin\|grep -v grep

	root@bjdb1:/>crsctl start crs
	CRS-4640: Oracle High Availability Services is already active
	CRS-4000: Command Start failed, or completed with errors.

	root@bjdb1:/>crsctl stop crs
	CRS-2796: The command may not proceed when Cluster Ready Services is not running
	CRS-4687: Shutdown command has completed with errors.
	CRS-4000: Command Stop failed, or completed with errors.

	#强制关闭节点 1 的 crs
	root@bjdb1:/>crsctl stop crs -f
	CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.crf' on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.asm' on 'bjdb1'
	CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb1'
	CRS-2677: Stop of 'ora.evmd' on 'bjdb1' succeeded
	CRS-2677: Stop of 'ora.mdnsd' on 'bjdb1' succeeded
	CRS-2677: Stop of 'ora.ctssd' on 'bjdb1' succeeded

	CRS-5017: The resource action "ora.crf stop" encountered the following error:
	action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".

	CRS-2675: Stop of 'ora.crf' on 'bjdb1' failed
	CRS-2679: Attempting to clean 'ora.crf' on 'bjdb1'
	CRS-2681: Clean of 'ora.crf' on 'bjdb1' succeeded

	CRS-2675: Stop of 'ora.asm' on 'bjdb1' failed
	CRS-2679: Attempting to clean 'ora.asm' on 'bjdb1'
	CRS-2681: Clean of 'ora.asm' on 'bjdb1' succeeded
	CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb1'
	CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb1' succeeded
	CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb1'
	CRS-2677: Stop of 'ora.cssd' on 'bjdb1' succeeded
	CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb1'
	CRS-2677: Stop of 'ora.gipcd' on 'bjdb1' succeeded
	CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb1'
	CRS-2677: Stop of 'ora.gpnpd' on 'bjdb1' succeeded
	CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb1' succeeded
	CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb1' has completed
	CRS-4133: Oracle High Availability Services has been stopped.
	root@bjdb1:/>

	#查看 crsctl 资源状态，此时肯定没有
	root@bjdb1:/>crsctl stat res -t
	CRS-4535: Cannot communicate with Cluster Ready Services
	CRS-4000: Command Status failed, or completed with errors.

	#查看 crsd.bin,cssd.bin,evmd.bin, 都已经没有了相关进程
	root@bjdb1:/>ps -ef\|grep crsd.bin
	root 8126466 25428158 0 15:52:50 pts/0 0:00 grep crsd.bin
	root@bjdb1:/>ps -ef\|grep cssd.bin
	root 8126470 25428158 0 15:53:01 pts/0 0:00 grep cssd.bin
	root@bjdb1:/>ps -ef\|grep evmd.bin
	root 35520600 25428158 0 15:53:13 pts/0 0:00 grep evmd.bin

	#查看 pmon 进程, 也都没有了
	root@bjdb1:/>ps -ef\|grep pmon\|grep -v grep
	root@bjdb1:/>

	#尝试再次启动 crs，成功！root@bjdb1:/>crsctl start crs
	CRS-4123: Oracle High Availability Services has been started.

	#查看 crsctl 资源，依然报错，说明还没有完全起来
	root@bjdb1:/>crsctl stat res -t
	CRS-4535: Cannot communicate with Cluster Ready Services
	CRS-4000: Command Status failed, or completed with errors.

	#等待一段时间，可以查 GI 相关的核心后台进程
	root@bjdb1:/>ps -ef\|grep crsd.bin\|grep -v grep
	root@bjdb1:/>ps -ef\|grep cssd.bin\|grep -v grep
	grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
	root@bjdb1:/>ps -ef\|grep pmon
	root 39387390 25428158 0 15:57:23 pts/0 0:00 grep pmon
	root@bjdb1:/>ps -ef\|grep pmon\|grep -v grep
	grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
	root@bjdb1:/>ps -ef\|grep pmon\|grep -v grep
	root 37814470 25428158 0 15:59:27 pts/0 0:00 grep pmon
	grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
	root@bjdb1:/>
	root@bjdb1:/>ps -ef\|grep crsd.bin
	root@bjdb1:/>ps -ef\|grep cssd.bin
	grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin

	root@bjdb1:/>ps -ef\|grep evmd.bin
	grid 40173760 1 0 15:57:10 - 0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
	root@bjdb1:/>ps -ef\|grep crsd.bin
	root 37683238 1 0 15:59:54 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
	root@bjdb1:/>

	#当核心进程都起来时，再次查看 crsctl 资源情况，发现已经可以正常查询，各资源正在启动
	root@bjdb1:/>crsctl stat res -t
	--------------------------------------------------------------------------------
	NAME TARGET STATE SERVER STATE_DETAILS
	--------------------------------------------------------------------------------
	Local Resources
	--------------------------------------------------------------------------------
	ora.DATA.dg
	ONLINE ONLINE bjdb1
	ora.LISTENER.lsnr
	ONLINE ONLINE bjdb1
	ora.OCR_VOTE1.dg
	ONLINE ONLINE bjdb1
	ora.asm
	ONLINE ONLINE bjdb1 Started
	ora.gsd
	OFFLINE OFFLINE bjdb1
	ora.net1.network
	ONLINE ONLINE bjdb1
	ora.ons
	ONLINE ONLINE bjdb1
	ora.registry.acfs
	ONLINE ONLINE bjdb1
	--------------------------------------------------------------------------------
	Cluster Resources
	--------------------------------------------------------------------------------
	ora.LISTENER_SCAN1.lsnr
	1 ONLINE OFFLINE
	ora.bjdb.db
	1 ONLINE OFFLINE
	2 ONLINE OFFLINE
	ora.bjdb1.vip
	1 ONLINE ONLINE bjdb1
	ora.bjdb2.vip
	1 ONLINE OFFLINE STARTING
	ora.cvu
	1 ONLINE ONLINE bjdb1
	ora.oc4j
	1 ONLINE ONLINE bjdb1
	ora.scan1.vip
	1 ONLINE OFFLINE STARTING

	grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
	2016-10-10 16:03:25.373:
	[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 16:08:30.373:
	[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 16:09:50.796:
	[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
	2016-10-10 16:10:20.373:
	[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
	2016-10-10 16:15:25.373:
	[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.

	#强制停止节点 2 的 crs
	root@bjdb2:/>crsctl stop crs -f
	CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.crf' on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.asm' on 'bjdb2'
	CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb2'
	CRS-2677: Stop of 'ora.crf' on 'bjdb2' succeeded
	CRS-2677: Stop of 'ora.mdnsd' on 'bjdb2' succeeded
	CRS-2677: Stop of 'ora.evmd' on 'bjdb2' succeeded
	CRS-2677: Stop of 'ora.ctssd' on 'bjdb2' succeeded
	CRS-2675: Stop of 'ora.asm' on 'bjdb2' failed
	CRS-2679: Attempting to clean 'ora.asm' on 'bjdb2'
	CRS-2681: Clean of 'ora.asm' on 'bjdb2' succeeded
	CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb2'
	CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb2' succeeded
	CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb2'
	CRS-2677: Stop of 'ora.cssd' on 'bjdb2' succeeded
	CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb2'
	CRS-2677: Stop of 'ora.gipcd' on 'bjdb2' succeeded
	CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb2'
	CRS-2677: Stop of 'ora.gpnpd' on 'bjdb2' succeeded
	CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb2' succeeded
	CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb2' has completed
	CRS-4133: Oracle High Availability Services has been stopped.

	#观察到 crs 进程已经启动
	root@bjdb2:/>ps -ef\|grep crsd.bin\|grep -v grep
	root 22610148 1 0 16:24:15 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot

	#最后使用 crsctl 查看资源已经恢复正常
	root@bjdb2:/>crsctl stat res -t
	--------------------------------------------------------------------------------
	NAME TARGET STATE SERVER STATE_DETAILS
	--------------------------------------------------------------------------------
	Local Resources
	--------------------------------------------------------------------------------
	ora.DATA.dg
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	ora.LISTENER.lsnr
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	ora.OCR_VOTE1.dg
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	ora.asm
	ONLINE ONLINE bjdb1 Started
	ONLINE ONLINE bjdb2 Started
	ora.gsd
	OFFLINE OFFLINE bjdb1
	OFFLINE OFFLINE bjdb2
	ora.net1.network
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	ora.ons
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	ora.registry.acfs
	ONLINE ONLINE bjdb1
	ONLINE ONLINE bjdb2
	--------------------------------------------------------------------------------
	Cluster Resources
	--------------------------------------------------------------------------------
	ora.LISTENER_SCAN1.lsnr
	1 ONLINE ONLINE bjdb2
	ora.bjdb.db
	1 ONLINE ONLINE bjdb1 Open
	2 ONLINE ONLINE bjdb2 Open
	ora.bjdb1.vip
	1 ONLINE ONLINE bjdb1
	ora.bjdb2.vip
	1 ONLINE ONLINE bjdb2
	ora.cvu
	1 ONLINE ONLINE bjdb1
	ora.oc4j
	1 ONLINE ONLINE bjdb1
	ora.scan1.vip
	1 ONLINE ONLINE bjdb2

	#查看运行在节点 2 上的监听程序，之前故障时，scan 的��听就在节点 2 上
	root@bjdb2:/>ps -ef\|grep tns
	grid 5308430 1 0 Aug 17 - 5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
	grid 5505240 1 1 Aug 17 - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
	root@bjdb2:/>