
Symptoms and Handling of an Oracle RAC That Cannot Start Because of a Corrupted OCR File


v$cluster_interconnects

The IP addresses used for communication between cluster nodes.

Error information

  • The public network was used for the interconnect
SQL> select * from v$cluster_interconnects;

NAME IP_ADDRESS IS_ SOURCE CON_ID
eth0 192.168.1.70 OS dependent software 0
  • Log information
Filename=alert_+ASM1.log

~~~~~~~~~~~~~~~~ Normal startup ~~~~~~~~~~~~~~~~~~~~~~~~
Thu Jun 23 12:33:00 2016
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 16
Number of processor sockets in the system is 1
Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.
[name='eth1:1', type=1, ip=169.254.31.89, mac=00-15-5d-75-0b-16, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62] <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Public Interface 'eth0' configured from GPnP for use as a public interface.
[name='eth0', type=1, ip=192.168.1.70, mac=00-15-5d-75-0b-15, net=192.168.1.0/24, mask=255.255.255.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u1/app/12.1.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NOTE: remote asm mode is local (mode 0x301; from cluster type)
NOTE: Volume support enabled
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u1/app/12.1.0/grid
System name: Linux
Node name: test-rac1.sf.net
Release: 2.6.32-358.el6.x86_64
Version: #1 SMP Fri Feb 22 13:35:02 PST 2013
Machine: x86_64
Using parameter settings in server-side spfile +CRSDG/test-cluster/ASMPARAMETERFILE/registry.253.893674255
System parameters with non-default values:
large_pool_size = 12M
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "/dev/asm*"
asm_diskgroups = "DATADG"
asm_diskgroups = "FRADG"
asm_power_limit = 1
NOTE: remote asm mode is local (mode 0x301; from cluster type)
Thu Jun 23 12:33:02 2016
Cluster communication is configured to use the following interface(s) for this instance
169.254.31.89 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

~~~~~~~~~~~~~~~~ Abnormal startup ~~~~~~~~~~~~~~~~~~~~~~~~
Sun Jul 31 10:30:00 2016
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 4
Number of processor cores in the system is 16
Number of processor sockets in the system is 1
WARNING: No cluster interconnect has been specified. Depending on <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
the communication driver configured Oracle cluster traffic
may be directed to the public interface of this machine.
Oracle recommends that RAC clustered databases be configured
with a private interconnect for enhanced security and
performance.
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as /u1/app/12.1.0/grid/dbs/arch
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is enabled
NOTE: remote asm mode is local (mode 0x301; from cluster type)
NOTE: Volume support enabled
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options.
ORACLE_HOME = /u1/app/12.1.0/grid
System name: Linux
Node name: test-rac1.sf.net
Release: 2.6.32-358.el6.x86_64
Version: #1 SMP Fri Feb 22 13:35:02 PST 2013
Machine: x86_64
Using parameter settings in server-side spfile +CRSDG/test-cluster/ASMPARAMETERFILE/registry.253.893674255
System parameters with non-default values:
large_pool_size = 12M
remote_login_passwordfile= "EXCLUSIVE"
asm_diskstring = "/dev/asm*"
asm_diskgroups = "DATADG"
asm_diskgroups = "FRADG"
asm_power_limit = 1
NOTE: remote asm mode is local (mode 0x301; from cluster type)
Sun Jul 31 10:30:05 2016
Cluster communication is configured to use the following interface(s) for this instance
192.168.1.70 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
cluster interconnect IPC version: Oracle UDP/IP (generic) 
  • Cause
Filename=ohasd_orarootagent_root_278.trc

2016-07-31 08:31:10.211739 : USRTHRD:2180118272: {0:9:3} failed to receive ARP request
2016-07-31 08:31:10.211778 : USRTHRD:2180118272: {0:9:3} (null) category: -2, operation: read, loc: lnxrecv:2,os, OS error: 100, other: <<<<<<<<<<<<<<<<<<<<<<<<<<< NIC error
2016-07-31 08:31:10.712354 : USRTHRD:2180118272: {0:9:3} Failed to check 169.254.31.89 on eth1 <<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-07-31 08:31:10.712389 : USRTHRD:2180118272: {0:9:3} (null) category: 0, operation: , loc: , OS error: 0, other: <<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-07-31 08:31:10.712419 : USRTHRD:2180118272: {0:9:3} Assigned IP 169.254.31.89 no longer valid on inf eth1
2016-07-31 08:31:10.712436 : USRTHRD:2180118272: {0:9:3} Attempt to reassign the IP 169.254.31.89 on inf eth1
2016-07-31 08:31:10.712455 : USRTHRD:2180118272: {0:9:3} VipActions::startIp {
2016-07-31 08:31:11.039416 : AGFW:2191427328: {0:0:2} Agent received the message: AGENT_HB[Engine] ID 12293:1232478
2016-07-31 08:31:11.212863 : USRTHRD:2180118272: {0:9:3} Failed to check 169.254.31.89 on eth1
2016-07-31 08:31:11.212896 : USRTHRD:2180118272: {0:9:3} (null) category: 0, operation: , loc: , OS error: 0, other:
2016-07-31 08:31:11.213190 : USRTHRD:2180118272: {0:9:3} Adding 169.254.31.89 on eth1:1
2016-07-31 08:31:11.213400 : USRTHRD:2180118272: {0:9:3} Arp::sCreateSocket {
2016-07-31 08:31:11.227198 : USRTHRD:2180118272: {0:9:3} Arp::sCreateSocket }
2016-07-31 08:31:11.227227 : USRTHRD:2180118272: {0:9:3} Flushing neighbours ARP Cache
2016-07-31 08:31:11.227245 : USRTHRD:2180118272: {0:9:3} Arp::sFlushArpCache {
2016-07-31 08:31:11.227319 : USRTHRD:2180118272: {0:9:3} Arp::sSend: sending type 1
2016-07-31 08:31:11.227477 : USRTHRD:2180118272: {0:9:3} ignoring failure: failed to send arp
2016-07-31 08:31:12.312581 :CLSDYNAM:2169542400: [ora.ctssd]{0:9:3} [check] ClsdmClient::sendMessage clsdmc_respget return: status=0, ecode=0
2016-07-31 08:31:12.312645 :CLSDYNAM:2169542400: [ora.ctssd]{0:9:3} [check] translateReturnCodes, return = 0, state detail = OBSERVERCheckcb data [0x7fab4019bfe0]: mode[0xee] offset[0 ms].
2016-07-31 08:31:13.220363 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.220588 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab58078a40, GIPCD_IF_UPDATE
2016-07-31 08:31:13.220651 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.220681 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.220707 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.220735 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
2016-07-31 08:31:13.220983 : USRTHRD:2187224832: {0:9:3} HAIP: assigned ip '169.254.31.89'
2016-07-31 08:31:13.221001 : USRTHRD:2187224832: {0:9:3} HAIP: check ip '169.254.31.89'
2016-07-31 08:31:13.221017 : USRTHRD:2187224832: {0:9:3} Start: 1 HAIP assignment, 1, 1
2016-07-31 08:31:13.228064 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.229735 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab58078fb0, GIPCD_IF_UPDATE
2016-07-31 08:31:13.229782 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.229858 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.229884 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.229909 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
2016-07-31 08:31:13.230098 : USRTHRD:2187224832: {0:9:3} HAIP: assigned ip '169.254.31.89'
2016-07-31 08:31:13.230121 : USRTHRD:2187224832: {0:9:3} HAIP: check ip '169.254.31.89'
2016-07-31 08:31:13.230138 : USRTHRD:2187224832: {0:9:3} Start: 1 HAIP assignment, 1, 1
2016-07-31 08:31:13.231900 : USRTHRD:2184320768: HAIP: event GIPCD_IF_UPDATE
2016-07-31 08:31:13.232008 : USRTHRD:2187224832: {0:9:3} dequeue change event 0x7fab5808a7d0, GIPCD_IF_UPDATE
2016-07-31 08:31:13.232043 : USRTHRD:2187224832: {0:9:3} HAIP: IF state gipcdadapterstateDown
2016-07-31 08:31:13.232065 : USRTHRD:2187224832: {0:9:3} It is non-cluster network, attr public
2016-07-31 08:31:13.232130 : USRTHRD:2187224832: {0:9:3} to verify routes
2016-07-31 08:31:13.232151 : USRTHRD:2187224832: {0:9:3} to verify start completion 1
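
The difference between the two startups can be checked mechanically. The following is a minimal Python sketch, not part of the original troubleshooting, that reads the address listed after the "Cluster communication is configured..." line of an ASM alert log and reports whether it falls in the 169.254.0.0/16 range that HAIP used in the normal startup above. The function name and the sample strings are illustrative assumptions.

```python
import ipaddress
import re

# HAIP addresses in the normal startup above come from this link-local range.
HAIP_NET = ipaddress.ip_network("169.254.0.0/16")
MARKER = "Cluster communication is configured to use the following interface(s)"

def interconnect_addresses(alert_log_text):
    """Return (ip, is_haip) for each address listed after the marker line."""
    results = []
    lines = alert_log_text.splitlines()
    for i, line in enumerate(lines):
        if MARKER not in line:
            continue
        # Addresses follow the marker, one per line, until a non-IP line.
        for follow in lines[i + 1:]:
            m = re.match(r"\s*(\d{1,3}(?:\.\d{1,3}){3})\b", follow)
            if not m:
                break
            ip = ipaddress.ip_address(m.group(1))
            results.append((str(ip), ip in HAIP_NET))
    return results

normal = MARKER + " for this instance\n169.254.31.89\n"
abnormal = (MARKER + " for this instance\n192.168.1.70\n"
            "cluster interconnect IPC version: Oracle UDP/IP (generic)\n")
print(interconnect_addresses(normal))
print(interconnect_addresses(abnormal))
```

An address outside the HAIP range, as in the abnormal startup, suggests that cluster traffic has fallen back to the public interface.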

Solution provided by Support

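As discussed by phone, the problem is currently on node1. On July 31, between 8:30 and 15:00, the private NIC on node1 reported errors, which caused GI to disable the private network. When node1 restarted at 10:30 it could not find the private network information and temporarily used the public network instead. As a result, when node2 started normally it could not ping node1's private network. The current solution is:
1. Shut down GI on both nodes.
2. Reboot the node2 host.
3. After node2 has started successfully, reboot node1.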

OCR

OCR stands for "Oracle Cluster Registry".
The official Oracle documentation describes it as follows:
Anything that Oracle Clusterware manages is known as a CRS resource. A CRS resource can be a database, an instance, a service, a listener, a VIP address, or an application process. Oracle Clusterware manages CRS resources based on the resource’s configuration information that is stored in the Oracle Cluster Registry (OCR).

Cause of the problem

Based on the CRSD log, the OCR file is suspected to be corrupted.
~~~~~~~~~~~~~~~~~~~~~~~~
2016-08-04 12:41:16.034910 :UiServer:1767343872: {2:30043:2} Done for ctx=0x7fdb040316a0
2016-08-04 12:41:16.040546 :  OCRRAW:1777850112: rtnode:3: invalid tnode 73
2016-08-04 12:41:16.040567 :  OCRRAW:1777850112: propropen:0: could not read tnode addrd=0
2016-08-04 12:41:16.040624 :  OCRRAW:1777850112: proprseterror: Error in accessing physical storage [26] Marking context invalid.    <<<<<<<<<<<<<<<<<Error in accessing physical storage [26]
~~~~~~~~~~~~~~~~~~~~~~~~~
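
A quick way to find such entries in a large trace file is to filter for OCRRAW lines that report a storage error. The sketch below is an illustration only; the regular expression is an assumption derived from the three sample lines above, not a documented trace format.

```python
import re

# Matches trace lines such as:
# 2016-08-04 12:41:16.040624 :  OCRRAW:1777850112: proprseterror: Error in ...
OCRRAW_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s*:\s*"
    r"OCRRAW:\d+:\s*(?P<msg>.*)$"
)

def ocr_storage_errors(trace_text):
    """Return (timestamp, message) for OCRRAW lines reporting storage errors."""
    hits = []
    for line in trace_text.splitlines():
        m = OCRRAW_LINE.match(line)
        if m and "Error in accessing physical storage" in m.group("msg"):
            hits.append((m.group("ts"), m.group("msg").strip()))
    return hits

sample = """\
2016-08-04 12:41:16.034910 :UiServer:1767343872: {2:30043:2} Done for ctx=0x7fdb040316a0
2016-08-04 12:41:16.040546 :  OCRRAW:1777850112: rtnode:3: invalid tnode 73
2016-08-04 12:41:16.040624 :  OCRRAW:1777850112: proprseterror: Error in accessing physical storage [26] Marking context invalid.
"""
for ts, msg in ocr_storage_errors(sample):
    print(ts, msg)
```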

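Resolution

We recommend restoring a valid OCR backup by following the steps below and then restarting the crsd process.
Note: before performing the steps, shut down the GI software on node1, then run the steps on node2.
1. Identify available OCR backups using 'ocrconfig -showbackup'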
2. Dump the contents (as privileged user) of all available backups using
'ocrdump <outputfile> -backupfile <backupfile>'
3. Identify all the backup locations where 'ocrdump' successfully completed
4. Inspect the contents of the ocrdump output, and identify a suitable backup
5. Shutdown the CRS stack on all nodes in the cluster
6. Restore the OCR backup (as privileged user) using
'ocrconfig -restore <file>'

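While performing the steps above, there is no need to shut down the instance and listener on node2 beforehand. However, the operation should be done only after a downtime window has been approved, because the instance may be affected during the procedure. After crsd has started successfully, shut down the instance and the local listener, then start them again with crsctl.

After all of this is complete, start node1.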
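
The six-step procedure can be written down as an ordered dry-run plan before anything is executed. The sketch below only builds the command strings and runs nothing. The output file names `ocrdump_NN.txt` and the function name are hypothetical, and `crsctl stop crs` is used for step 5 on the assumption that it is run on every node, as in the test environment in this article.

```python
def restore_plan(backup_files, chosen):
    """Return the command lines for the procedure above, in order (dry run only)."""
    if chosen not in backup_files:
        raise ValueError("chosen backup must be one of the listed backups")
    plan = ["ocrconfig -showbackup"]
    # Dump every backup so its contents can be inspected before choosing one.
    for i, path in enumerate(backup_files):
        plan.append(f"ocrdump ocrdump_{i:02d}.txt -backupfile {path}")
    plan.append("crsctl stop crs   # on every node")
    plan.append(f"ocrconfig -restore {chosen}")
    return plan

backups = [
    "/u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr",
    "/u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr",
]
for step in restore_plan(backups, backups[0]):
    print(step)
```

Keeping the plan as data makes it easy to review the exact commands with the team before the downtime window.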

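Simulation in a test environment

Checking the automatic OCR backups

On RAC the OCR file is backed up every four hours by default. The command ocrconfig -showbackup shows when and where each backup file was created; each backup is written on only one node.

Data from the morning check: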

[oracle@ol6-121-rac2 ~]$ ocrconfig -showbackup

ol6-121-rac1     2016/08/17 07:32:10     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0

ol6-121-rac1     2016/08/17 03:32:08     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0

ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0

ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr     0

ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available

Data from the check at 4 p.m.:

[oracle@ol6-121-rac2 ~]$ ocrconfig -showbackup

ol6-121-rac1     2016/08/17 15:32:15     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0

ol6-121-rac1     2016/08/17 11:32:12     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0

ol6-121-rac1     2016/08/17 07:32:10     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0

ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr     0

ol6-121-rac1     2016/08/16 23:32:05     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available

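The most recent backup is always named backup00.ocr. One backup from the previous day and one from the previous week are also retained. Because this cluster was first started on 08/16, week.ocr and day.ocr are the same file here.

  • Automatic OCR backups are made by the CRSD process on the master node, so once the CRSD process is gone, automatic OCR backups stop as well.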
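
The backup listing is easy to check programmatically. The sketch below parses `ocrconfig -showbackup` output of the shape shown above, picks the newest backup, and computes the gaps between consecutive backups. The parsing rule (four or more whitespace-separated fields, the fourth ending in `.ocr`) is an assumption based on this sample output.

```python
from datetime import datetime

def parse_showbackup(text):
    """Parse `ocrconfig -showbackup` lines into (node, timestamp, path) tuples."""
    backups = []
    for line in text.splitlines():
        parts = line.split()
        # Backup lines look like: node  YYYY/MM/DD  HH:MM:SS  path  0
        if len(parts) >= 4 and parts[3].endswith(".ocr"):
            try:
                when = datetime.strptime(parts[1] + " " + parts[2],
                                         "%Y/%m/%d %H:%M:%S")
            except ValueError:
                continue
            backups.append((parts[0], when, parts[3]))
    return backups

sample = """\
ol6-121-rac1     2016/08/17 15:32:15     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr     0
ol6-121-rac1     2016/08/17 11:32:12     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr     0
ol6-121-rac1     2016/08/17 07:32:10     /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr     0
PROT-25: Manual backups for the Oracle Cluster Registry are not available
"""
backups = sorted(parse_showbackup(sample), key=lambda b: b[1])
latest = backups[-1]
gaps = [later[1] - earlier[1] for earlier, later in zip(backups, backups[1:])]
print("latest:", latest[2])
print("gaps:", [str(g) for g in gaps])
```

A gap much longer than four hours between the newest backups would be a hint that CRSD on the master node has not been running.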

Viewing the contents of an OCR file

The ocrdump command converts the contents of an OCR file to text.

[oracle@ol6-121-rac1 ~]$ sudo /u01/app/12.1.0.2/grid/bin/ocrdump testbackup01.txt -backupfile /home/oracle/backup01.ocr

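OCR logical backup

Oracle recommends backing up the OCR file before any cluster configuration change. This can be done as a manual logical backup with the command below, which must be run as root.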

[oracle@ol6-121-rac1 ~]$ ocrconfig -export ocr_bak1.ocrdmp
PROT-20: Insufficient permission to proceed. Require privileged user

[oracle@ol6-121-rac1 ~]$ sudo -s /u01/app/12.1.0.2/grid/bin/ocrconfig -export ocr_bak1.ocrdmp

[oracle@ol6-121-rac1 ~]$ ls -lh *ocr*
-rw-------. 1 root   root      97K Aug 17 17:41 ocr_bak1.ocrdmp

Simulating recovery from OCR corruption

Location of the OCR file

Use the ocrcheck command.

[oracle@ol6-121-rac1 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :       1548
         Available space (kbytes) :     408020
         ID                       :  144424447
         Device/File Name         :      +DATA
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

         Cluster registry integrity check succeeded

         Logical corruption check bypassed due to non-privileged user
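
For monitoring, the same output can be reduced to a few values. The following sketch extracts the space figures and the integrity verdict from `ocrcheck` output; the field labels are taken from the run above, while the function and key names are assumptions made for illustration.

```python
import re

def parse_ocrcheck(text):
    """Extract space figures and the integrity verdict from `ocrcheck` output."""
    info = {}
    for key, label in (("total_kb", "Total space (kbytes)"),
                       ("used_kb", "Used space (kbytes)"),
                       ("free_kb", "Available space (kbytes)")):
        m = re.search(re.escape(label) + r"\s*:\s*(\d+)", text)
        if m:
            info[key] = int(m.group(1))
    info["integrity_ok"] = "Cluster registry integrity check succeeded" in text
    # The logical check is skipped for non-privileged users, as in the run above.
    info["logical_check_skipped"] = "Logical corruption check bypassed" in text
    return info

sample = """\
Status of Oracle Cluster Registry is as follows :
         Version                  :          4
         Total space (kbytes)     :     409568
         Used space (kbytes)      :       1548
         Available space (kbytes) :     408020
         Cluster registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user
"""
info = parse_ocrcheck(sample)
print(info)
print("used %:", round(100 * info["used_kb"] / info["total_kb"], 2))
```

Because the logical corruption check is bypassed for a non-privileged user, a run as root is needed before the result can be treated as a full health check.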

Log in to ASMCMD. An OCRFILE folder can usually be found under the folder named after the cluster, and the OCR file used by the system is in that folder.

ASMCMD> pwd
+DATA/ol6-121-scan/OCRFILE
ASMCMD> ls -l
Type     Redund  Striped  Time             Sys  Name
OCRFILE  UNPROT  COARSE   AUG 18 10:00:00  Y    REGISTRY.255.919941339

Stopping CRS

[oracle@ol6-121-rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs

[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs

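Simulating OCR file corruption

Because files cannot be edited in ASMCMD, an OCR backup from another cluster was copied to this cluster and then restored with ocrconfig.

Restoring the file directly fails, because ASM is not running and the target OCR file is stored on an ASM disk.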

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /home/oracle/backup00.ocr
PROT-35: The configured OCR locations are not accessible

[root@ol6-121-rac1 ~]# ps -ef | grep asm
root      2415  6471  0 11:47 pts/1    00:00:00 grep asm

The approach is to start only cssd and then start ASM from sqlplus, while confirming that CRS itself is not running.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs -excl -cssonly
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'ol6-121-rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'ol6-121-rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.diskmon' on 'ol6-121-rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'ol6-121-rac1' succeeded

[oracle@ol6-121-rac1 trace]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 14:03:38 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  2933400 bytes
Variable Size            1112751464 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
ASM diskgroups volume enabled

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
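
The four status lines can be turned into a simple precondition check before the restore. The sketch below maps the message numbers that appear in the two `crsctl check crs` runs in this article to an online/offline flag per daemon. The short daemon names (ohasd, crsd, cssd, evmd) are shorthand chosen here, and the table covers only the messages seen in this article.

```python
import re

# Message numbers as they appear in the two `crsctl check crs` runs in this article.
STATUS_BY_CODE = {
    "CRS-4638": ("ohasd", True),   # Oracle High Availability Services is online
    "CRS-4537": ("crsd", True),    # Cluster Ready Services is online
    "CRS-4535": ("crsd", False),   # Cannot communicate with Cluster Ready Services
    "CRS-4529": ("cssd", True),    # Cluster Synchronization Services is online
    "CRS-4533": ("evmd", True),    # Event Manager is online
    "CRS-4534": ("evmd", False),   # Cannot communicate with Event Manager
}

def check_crs_status(output):
    """Return {daemon: online?} for the known messages found in the output."""
    status = {}
    for code in re.findall(r"^(CRS-\d+):", output, flags=re.MULTILINE):
        if code in STATUS_BY_CODE:
            daemon, online = STATUS_BY_CODE[code]
            status[daemon] = online
    return status

sample = """\
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
"""
status = check_crs_status(sample)
print(status)
# State wanted before the restore in this walkthrough: cssd up, crsd down.
print("ready for restore:", status.get("cssd") and not status.get("crsd"))
```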

The ocrconfig -restore command can then be run.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /home/oracle/backup00.ocr
[root@ol6-121-rac1 ~]#

ASMCMD> ls -l
Type     Redund  Striped  Time             Sys  Name
OCRFILE  UNPROT  COARSE   AUG 18 14:00:00  Y    REGISTRY.255.919941339

Trying to start CRS

After the OCR has been modified, try restarting CRS. In the end, CRS does not start.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.crf' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ol6-121-rac1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ol6-121-rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.cssd' on 'ol6-121-rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ol6-121-rac1'
CRS-2677: Stop of 'ora.gipcd' on 'ol6-121-rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ol6-121-rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@ol6-121-rac1 ~]#

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -init -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details      
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       ol6-121-rac1             Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.crf
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ctssd
      1        ONLINE  ONLINE       ol6-121-rac1             ACTIVE:0,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.evmd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.gipcd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.gpnpd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.mdnsd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.storage
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
--------------------------------------------------------------------------------
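
In a long listing the failed resource is easy to miss. The sketch below scans `crsctl stat res -init -t` output for rows whose Target and State differ. It assumes the layout shown above, where a resource name line starting with `ora.` is followed by rows that begin with an instance number.

```python
def target_state_mismatches(output):
    """Return (resource, target, state) for rows where Target and State differ."""
    mismatches = []
    current = None
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("ora."):
            current = parts[0]          # resource name line
        elif current and parts[0].isdigit() and len(parts) >= 3:
            target, state = parts[1], parts[2]
            if target != state:
                mismatches.append((current, target, state))
    return mismatches

sample = """\
ora.asm
      1        ONLINE  ONLINE       ol6-121-rac1             Started,STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
"""
print(target_state_mismatches(sample))
```

Here only ora.crsd is reported, which matches the observation that CRSD is the component that fails to come up.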

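Errors appear in the crsd.trc log. At a glance, the first error is that there is no grid user, because GI in the test environment was not installed with a grid user.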

2016-08-18 14:22:50.431205 :  CRSSEC:3000985344: {1:24681:2} Exception: ACL entry creation failed for: owner:grid:rwx
    CLSB:3000985344: Oracle Clusterware infrastructure error in CRSD (OS PID 2742): Fatal signal 6 has occurred in program crsd thread 3000985344; nested signal count is 1
Incident 81 created, dump file: /u01/app/oracle/diag/crs/ol6-121-rac1/crs/incident/incdir_81/crsd_i81.trc
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []

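At this point ASM is still running, so the database can be started with sqlplus and run as a single node.
The first startup, however, reports that the spfile does not exist. This is probably an spfile change caused by the wrong OCR file.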

[oracle@ol6-121-rac1 trace]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 14:53:10 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA/test/spfiletest.ora'
ORA-17503: ksfdopn:2 Failed to open file +DATA/test/spfiletest.ora
ORA-15056: additional error message
ORA-17503: ksfdopn:2 Failed to open file +DATA/test/spfiletest.ora
ORA-15173: entry 'spfiletest.ora' does not exist in directory 'test'
ORA-06512: at line 4
SQL> exit
Disconnected

Adding an alias for the spfile lets the database find it at startup.

ASMCMD [+DATA/test/parameterfile] > mkalias '+DATA/TEST/PARAMETERFILE/spfile.294.920047363' '+DATA/TEST/spfiletest.ora'
ASMCMD [+DATA/test/parameterfile] > cd ..
ASMCMD [+DATA/test] > ls
CONTROLFILE/
DATAFILE/
ONLINELOG/
PARAMETERFILE/
PASSWORD/
TEMPFILE/
spfiletest.ora

The database now starts normally.

[oracle@ol6-121-rac1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Thu Aug 18 15:12:32 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area  838860800 bytes
Fixed Size                  2929936 bytes
Variable Size             616565488 bytes
Database Buffers          213909504 bytes
Redo Buffers                5455872 bytes
Database mounted.
Database opened.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME    STATUS
---------------- ------------
test1            OPEN

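The instance on the second node can be started the same way, and transactions remain consistent between the instances.

Recovering the OCR file

  • Locate the backup files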
    [oracle@ol6-121-rac1 ~]$ ocrconfig -showbackup

ol6-121-rac1 2016/08/18 07:32:31 /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup00.ocr 0

ol6-121-rac1 2016/08/18 03:32:26 /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup01.ocr 0

ol6-121-rac1 2016/08/17 23:32:22 /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/backup02.ocr 0

ol6-121-rac1 2016/08/17 03:32:08 /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/day.ocr 0

ol6-121-rac1 2016/08/16 23:32:05 /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr 0
PROT-25: Manual backups for the Oracle Cluster Registry are not available


  • Confirm that CRS is stopped

[oracle@ol6-121-rac1 ~]$ ps -ef | grep crs
oracle 2669 3366 0 15:46 pts/1 00:00:00 grep crs
[oracle@ol6-121-rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online


  • Restore the OCR file

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -restore /u01/app/12.1.0.2/grid/cdata/ol6-121-scan/week.ocr


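  • Start CRS on node 2. Use crsctl start cluster, because crsctl start crs reports an error. During this time the instance still accepts logins and inserts.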

[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crsd' on 'ol6-121-rac2'
CRS-2676: Start of 'ora.crsd' on 'ol6-121-rac2' succeeded
[root@ol6-121-rac2 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.asm
               ONLINE  ONLINE       ol6-121-rac2             Started,STABLE
ora.net1.network
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.ons
               ONLINE  ONLINE       ol6-121-rac2             STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       ol6-121-rac2             169.254.42.254 192.1
                                                             68.1.102,STABLE
ora.cvu
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.mgmtdb
      1        ONLINE  OFFLINE      ol6-121-rac2             Instance Shutdown,ST
                                                             ARTING
ora.oc4j
      1        ONLINE  OFFLINE      ol6-121-rac2             STARTING
ora.ol6-121-rac1.vip
      1        ONLINE  INTERMEDIATE ol6-121-rac2             FAILED OVER,STABLE
ora.ol6-121-rac2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.test.db
      1        ONLINE  OFFLINE                               STABLE
      2        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
--------------------------------------------------------------------------------


  • Start the cluster on node 1 in the same way

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster
CRS-2672: Attempting to start 'ora.crsd' on 'ol6-121-rac1'
CRS-2676: Start of 'ora.crsd' on 'ol6-121-rac1' succeeded

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.asm
               ONLINE  ONLINE       ol6-121-rac1             Started,STABLE
               ONLINE  ONLINE       ol6-121-rac2             Started,STABLE
ora.net1.network
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.ons
               ONLINE  ONLINE       ol6-121-rac1             STABLE
               ONLINE  ONLINE       ol6-121-rac2             STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.MGMTLSNR
      1        ONLINE  ONLINE       ol6-121-rac2             169.254.42.254 192.1
                                                             68.1.102,STABLE
ora.cvu
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.mgmtdb
      1        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
ora.oc4j
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ol6-121-rac1.vip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.ol6-121-rac2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       ol6-121-rac1             STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       ol6-121-rac2             STABLE
ora.test.db
      1        ONLINE  OFFLINE                               Instance Shutdown,ST
                                                             ABLE
      2        ONLINE  ONLINE       ol6-121-rac2             Open,STABLE
--------------------------------------------------------------------------------


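  • The database did not start automatically, so srvctl was used to start the instance. The cluster command could not start it, but sqlplus alone could still start it, and remote logins through the listener worked.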

[oracle@ol6-121-rac1 ~]$ srvctl start instance -db test -node ol6-121-rac1
PRCR-1013 : Failed to start resource ora.test.db
PRCR-1064 : Failed to start resource ora.test.db on node ol6-121-rac1
CRS-2662: Resource 'ora.test.db' is disabled on server 'ol6-121-rac1'

[oracle@ol6-121-rac1 ~]$ srvctl enable database -db test
PRCC-1010 : test was already enabled
PRCR-1002 : Resource ora.test.db is already enabled


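  • Shut down the cluster software on node 1. The error was probably caused by starting the database directly with sqlplus, which disabled the database's automatic start.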

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stat res -p

NAME=ora.test.db
TYPE=ora.database.type
ACL=owner:oracle:rwx,pgrp:oinstall:r--,other::r--,group:dba:r-x,group:oper:r-x,user:oracle:r-x
ACTIONS=startoption,group:"oinstall",user:"oracle",group:"dba",group:"oper"
ACTION_SCRIPT=
ACTION_START_OPTION=
ACTION_TIMEOUT=600
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/oraagent%CRS_EXE_SUFFIX%
AUTO_START=restore

Reference: [Handling Oracle 11gR2 RAC database resources that do not start automatically](http://blog.itpub.net/23135684/viewspace-759569)

[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable database -d test

PRCC-1010 : test was already enabled
PRCR-1002 : Resource ora.test.db is already enabled
[root@ol6-121-rac1 ~]#
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable instance -d test -i test1
[root@ol6-121-rac1 ~]# /u01/app/12.1.0.2/grid/bin/srvctl enable instance -d test -i test2

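  • Shut down the cluster services on node 1, then reboot the server to see whether the database starts automatically.
    In this test, both the GI services and the database started normally after the reboot.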
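
To confirm the auto-start setting without reading the whole attribute dump, the `crsctl stat res -p` output can be grouped by resource. The sketch below is an illustration under the assumption that each resource block starts with a `NAME=` line followed by `KEY=VALUE` lines, as in the excerpt above. As far as the Clusterware documentation describes it, `AUTO_START=restore` means the resource is returned to the state it had when the server stopped, which would fit the behaviour seen in this test.

```python
def resource_attributes(output):
    """Group NAME=... KEY=VALUE lines from `crsctl stat res -p` by resource."""
    resources = {}
    current = None
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue                    # skip blank or non key=value lines
        if key == "NAME":
            current = resources.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return resources

sample = """\
NAME=ora.test.db
TYPE=ora.database.type
ACL=owner:oracle:rwx,pgrp:oinstall:r--,other::r--,group:dba:r-x,group:oper:r-x,user:oracle:r-x
ACTION_TIMEOUT=600
AUTO_START=restore
"""
attrs = resource_attributes(sample)
print(attrs["ora.test.db"]["AUTO_START"])
```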


Permanent link to this article: http://www.linuxidc.com/Linux/2017-03/142082.htm

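Copyright notice: original article on this site, published by 星锅 on 2022-01-22, 23,941 characters in total.
Reposting: unless otherwise stated, articles on this site are released under the CC-4.0 license; please credit the source when reposting.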