共计 11567 个字符,预计需要花费 29 分钟才能阅读完成。
测试环境出现问题
今天,收到项目组通知说发生了死锁,让我查一下死锁出现的原因。
首先 ,登录数据库,查看trace 日志所在路径
- SYS@cams>show parameter dump;
- NAME TYPE VALUE
- ———————————— ———– ——————————
- background_core_dump string partial
- background_dump_dest string /u01/app/Oracle/diag/rdbms/cams/cams/trace
- core_dump_dest string /u01/app/oracle/diag/rdbms/cams/cams/cdump
- max_dump_file_size string unlimited
- shadow_core_dump string PARTIAL
- user_dump_dest string /u01/app/oracle/diag/rdbms/cams/cams/trace
查看 alert 日志,搜索00060
- [oracle@db trace]$ cd /u01/app/oracle/diag/rdbms/cams/cams/trace/
- [oracle@db trace]$ vi alert_cams.log
找到 报错 trace 日志如下:
- ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12850.trc.
- Tue Sep 12 13:28:31 2017
- ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12854.trc.
- Tue Sep 12 13:28:36 2017
- ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12822.trc.
- Tue Sep 12 13:31:12 2017
打开 其中一个 trace 文件,部分内容如下:
- Trace file /u01/app/oracle/diag/rdbms/cams/cams/trace/cams_ora_12850.trc
- Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production
- With the Partitioning, OLAP, Data Mining and Real Application Testing options
- ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
- System name: Linux
- Node name: db
- Release: 2.6.32-431.el6.x86_64
- Version: #1 SMP Sun Nov 10 22:19:54 EST 2013
- Machine: x86_64
- VM name: VMWare Version: 6
- Instance name: cams
- Redo thread mounted by this instance: 1
- Oracle process number: 293
- Unix process pid: 12850, image: oracle@db
- *** 2017-09-12 13:28:28.401
- *** SESSION ID:(1139.47765) 2017-09-12 13:28:28.401
- *** CLIENT ID:() 2017-09-12 13:28:28.401
- *** SERVICE NAME:(SYS$USERS) 2017-09-12 13:28:28.401
- *** MODULE NAME:(JDBC Thin Client) 2017-09-12 13:28:28.401
- *** ACTION NAME:() 2017-09-12 13:28:28.401
- *** 2017-09-12 13:28:28.401
- DEADLOCK DETECTED ( ORA-00060 )
- [Transaction Deadlock]
- 这里指出死锁不是 Oracle 自身的错误,是用户的设计造成的问题
- The following deadlock is not an ORACLE error. It is a
- deadlock due to user error in the design of an application
- or from issuing incorrect ad-hoc SQL. The following
- information may aid in determining the deadlock:
- 这个死锁图表明两个 session 各持有一份资源,等待对方持有的资源
- Deadlock graph:
- ———Blocker(s)–––––––– ———Waiter(s)–––––––––
- Resource Name process session holds waits process session holds waits
- TX-00160018-00000adc 293 1139 X 384 65 X
- TX-0004000f-000064d6 384 65 X 293 1139 X
- session 1139: DID 0001-0125-000074A5 session 65: DID 0001-0180-000037EC
- session 65: DID 0001-0180-000037EC session 1139: DID 0001-0125-000074A5
- Rows waited on:
- Session 1139: obj – rowid = 00021C7B – AAAhx7AAHAAARMrAAB
- (dictionary objn – 138363, file – 7, block – 70443, slot – 1)
- Session 65: obj – rowid = 00021C7B – AAAhx7AAHAAARMtAAD
- (dictionary objn – 138363, file – 7, block – 70445, slot – 3)
- —– Information for the OTHER waiting sessions —–
- Session 65:
- sid: 65 ser: 27869 audsid: 3026187 user: 111/CAMS_CORE
- flags: (0x45) USR/– flags_idl: (0x1) BSY/–/–/–/–/–
- flags2: (0x40009) –/–/INC
- pid: 384 O/S info: user: oracle, term: UNKNOWN, ospid: 12854
- image: oracle@db
- client details:
- O/S info: user: cams, term: unknown, ospid: 1234
- machine: yy program: JDBC Thin Client
- application name: JDBC Thin Client, hash value=2546894660
- current SQL:
- 导致死锁的 SQL 语句 1
- select
- SEQ_NO, ACO_AC,AC_DTE,CCY,BAL,FRZ_AMT,
- CRE_TLR, CRE_DTE,
- CRE_BR, CRE_BK, UPD_TLR, UPD_DTE, UPD_BR, UPD_BK,TSL_AMT,VPS_BAL
- FROM CD_ACBALANCE
- WHERE ACO_AC = :1
- FOR UPDATE
- —– End of information for the OTHER waiting sessions —–
- Information for THIS session:
- —– Current SQL Statement for this session (sql_id=cfy88pmyts0fn) —–
- 导致死锁的 SQL 语句 2
- select
- SEQ_NO, ACO_AC,AC_DTE,CCY,BAL,FRZ_AMT,
- CRE_TLR, CRE_DTE,
- CRE_BR, CRE_BK, UPD_TLR, UPD_DTE, UPD_BR, UPD_BK,TSL_AMT,VPS_BAL
- FROM CD_ACBALANCE
- WHERE ACO_AC = :1
- FOR UPDATE
- ===================================================
这里可以明确的是 select for update 语句造成了死锁。但是考虑到环境为测试环境,没开启归档,不能通过 logminer 挖掘故障期间的 SQL 事物,而且开发人员已经通过 kill session 的方式处理了死锁,所以需要在别的环境中重现该问题,并进行分析。
重现死锁问题
首先连上数据库,启用 HR 用户
- [oracle@prod ~]$ sqlplus / as sysdba
- SQL*Plus: Release 11.2.0.4.0 Production on Wed Sep 13 03:14:17 2017
- Copyright (c) 1982, 2013, Oracle. All rights reserved.
- Connected to an idle instance.
- SYS@PROD>startup;
- ORACLE instance started.
- Total System Global Area 1202556928 bytes
- Fixed Size 2252704 bytes
- Variable Size 771752032 bytes
- Database Buffers 419430400 bytes
- Redo Buffers 9121792 bytes
- Database mounted.
- Database opened.
- SYS@PROD>conn hr/hr
- ERROR:
- ORA-28000: the account is locked
- Warning: You are no longer connected to ORACLE.
- @>conn / as sysdba
- Connected.
- SYS@PROD>alter user hr identified by hr account unlock;
- User altered.
- SYS@PROD>conn hr/hr
- Connected.
选中 COUNTRIES 表作为测试对象
- HR@PROD>desc countries;
- Name Null? Type
- —————————————– ——– —————————-
- COUNTRY_ID NOT NULL CHAR(2)
- COUNTRY_NAME VARCHAR2(40)
- REGION_ID NUMBER
- HR@PROD>select count(*) from countries;
- COUNT(*)
- ———-
- 25
- HR@PROD>select count(distinct(COUNTRY_ID)) from countries;
- COUNT(DISTINCT(COUNTRY_ID))
- —————————
- 25
经过验证,COUNTRIES表中的 COUNTRY_ID 字段是唯一的。
用 COUNTRIES 表新建一个 test 表,在 test 表中做测试
- HR@PROD>create table test as select * from countries;
- Table created.
- HR@PROD>desc test
- Name Null? Type
- —————————————– ——– —————————-
- COUNTRY_ID NOT NULL CHAR(2)
- COUNTRY_NAME VARCHAR2(40)
- REGION_ID NUMBER
- HR@PROD>select country_id,country_name,region_id from countries;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AR Argentina 2
- AU Australia 3
- BE Belgium 1
- BR Brazil 2
- CA Canada 2
- CH Switzerland 1
- CN China 3
- DE Germany 1
- DK Denmark 1
- EG Egypt 4
- FR France 1
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- IL Israel 4
- IN India 3
- IT Italy 1
- JP Japan 3
- KW Kuwait 4
- ML Malaysia 3
- MX Mexico 2
- NG Nigeria 4
- NL Netherlands 1
- SG Singapore 3
- UK United Kingdom 1
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- US United States of America 2
- ZM Zambia 4
- ZW Zimbabwe 4
- 25 rows selected.
情形1:
两个 select for update 语句想要同时锁定一条语句。
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AU Australia 3
Sessions2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
这时,session2会一直处于锁等待状态,而不会出现死锁。当把第一个事物 commit 或者 rollback 之后,第二个事物会继续执行。
Session1:
- HR@PROD>rollback;
- Rollback complete.
Session2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AU Australia
显然,两个 select for update 语句想要同时锁定一条语句,并不会出现死锁,而会出现锁等待的现象。
情形2:
第一个 select for update 事物锁定 A row 后,想要再锁定 B row;第二个select for update 事物锁定 B row 后,想要锁定A row。
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AU Australia 3
Session 2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- UK United Kingdom 1
如果 Session1 想要继续持有 Session2 正在持有的row:
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
这时,会出现锁等待的现场,和上一场景类似。如果 Session2 也想持有 Session1 正在持有的row:
Session2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
这时,Session2还处于等待状态,但是 Session1 出现死锁:
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
- select country_id,country_name,region_id from test where country_id=‘UK’ for update
- *
- ERROR at line 1:
- ORA-00060: deadlock detected while waiting for resource
这时,死锁的现象已经重现,而且,场景 2 从头到尾只使用了一个 select for update 语句,只是换了参数而已。
定位死锁语句
通过检查数据库表,能够检查出是哪一条语句被死锁,产生死锁的机器是哪一台。
1)用 dba 用户执行以下语句
- select c.owner, c.object_name, c.object_type, b.sid,
- b.serial#, b.lockwait, b.status, b.oSUSEr, b.machine, b.process, b.program
- from v$locked_object a ,
- v$session b,
- dba_objects c
- where b.sid = a.session_id
- and a.object_id = c.object_id;
如果有输出的结果,则说明有死锁,且能看到死锁的机器是哪一台。字段说明:
OWNER:死锁语句所用的数据库用户,这里是 HR 用户。
OBJECT_NAME:产生死锁的对象,这里是 TEST 表。
OBJECT_TYPE:产生死锁的对象类型,这里是TABLE。
SID:SESSION标识,常用于连接 其它 列。
SERIAL#:SID有可能会重复,当两个 session 的SID重复时,SERIAL#用来区别session
LOCKWAIT:可以通过这个字段查询出当前正在等待的锁的相关信息,如果有内容表示被死锁或者有锁等待事件。
STATUS:用来判断 session 状态。ACTIVE:正执行 SQL 语句 。INACTIVE: 等待操作。KILLED:被标注为杀死。
OSUSER:客户端操作系统用户名。
MACHINE:客户端操作系统的机器名。
PROCESS:客户端进程的ID。
PROGRAM:客户端执行的应用程序。
2)用 dba 用户执行以下语句,可以查看到被死锁的语句。
- select sql_text from v$sql where (address,hash_value) in
- (select sql_address,sql_hash_value from v$session where sid in
- (select session_id from v$locked_object));
或者
- select sql_text from v$sql where (address,hash_value) in
- (select sql_address,sql_hash_value from v$session where lockwait is not null);
这里查出来出现死锁的语句和之前的测试结果一致。
3)分析 trace 日志文件
死锁发生时,可以再 trace 日志文件中找到如下语句:
- ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/prod/PROD/trace/PROD_ora_4852.trc.
打开 trace 文件,可以找到死锁的语句:
- —– Information for the OTHER waiting sessions —–
- Session 31:
- sid: 31 ser: 25 audsid: 110280 user: 84/HR
- flags: (0x45) USR/– flags_idl: (0x1) BSY/–/–/–/–/–
- flags2: (0x40009) –/–/INC
- pid: 23 O/S info: user: oracle, term: UNKNOWN, ospid: 4853
- image: oracle@prod (TNS V1-V3)
- client details:
- O/S info: user: oracle, term: pts/1, ospid: 3149
- machine: prod program: sqlplus@prod (TNS V1-V3)
- application name: SQL*Plus, hash value=3669949024
- current SQL:
- select country_id,country_name,region_id from test where country_id=‘AU’ for update
- —– End of information for the OTHER waiting sessions —–
- Information for THIS session:
- —– Current SQL Statement for this session (sql_id=6n5kjs2twrwwq) —–
- select country_id,country_name,region_id from test where country_id=‘UK’ for update
- ===================================================
解决死锁问题
方法一(杀进程)
1)查找死锁的进程:
- select sid, serial#, username, command, lockwait, osuser from v$session where lockwait is not null;
2)kill掉这个死锁的进程:
- alter system kill session ‘sid, serial#’;
这里执行语句为:alter system kill session ‘29,69’;
原先造成死锁的进程被killed。
3)如果还不能解决,使用杀系统进程的方式处理:
这里为了测试,再次重现了死锁,并使用杀进程方式进行处理。
- select p.spid from v$session s,v$process p where s.sid=XXX and s.paddr=p.addr;
其中,XXX使用第一步查出来的 SID 参数替代
这里查出来的进程号为:4257
- [oracle@prod ~]$ ps -ef | grep 4257
- oracle 4257 3149 0 07:08 ? 00:00:00 oraclePROD (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
- oracle 4321 4000 0 07:21 pts/4 00:00:00 grep 4257
- [oracle@prod ~]$ kill -9 4257
- [oracle@prod ~]$ ps -ef | grep 4257
- oracle 4327 4000 0 07:21 pts/4 00:00:00 grep 4257
杀完进程之后,造成死锁的进程被杀死
查不到死锁进程
方法二(重启库)
- SYS@PROD>shutdown immediate;
- Database closed.
- Database dismounted.
- ORACLE instance shut down.
- SYS@PROD>startup;
- ORACLE instance started.
- Total System Global Area 1202556928 bytes
- Fixed Size 2252704 bytes
- Variable Size 771752032 bytes
- Database Buffers 419430400 bytes
- Redo Buffers 9121792 bytes
- Database mounted.
- Database opened.
查看死锁进程:
方法三(commit || rollback)
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AU Australia 3
Session2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- UK United Kingdom 1
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
- Waiting……
Session2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- Waiting……
Session1:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘UK’ for update;
- select country_id,country_name,region_id from test where country_id=‘UK’ for update
- *
- ERROR at line 1:
- ORA-00060: deadlock detected while waiting for resource
这里 Session1 出现死锁,只要执行 commit 或者 rollback 就可以解除死锁,只不过事务中第一个 SQL 执行成功,第二个 SQL 执行失败。
Session1:
- HR@PROD>commit;
- Commit complete.
Session2:
- HR@PROD>select country_id,country_name,region_id from test where country_id=‘AU’ for update;
- CO COUNTRY_NAME REGION_ID
- — —————————————- ———-
- AU Australia 3
此时,死锁状态解除:
问题总结
最后,通过与开发人员交流,得出的结论是出现该问题的原因不是程序设计的问题,而是在开发环境中,有人通过 debug 模式连到服务器上进行代码调试,有人通过客户端的形式访问服务器上的应用,当两者同时调试某一功能时(通过 select for update 的语句进行一张表中的数据访问),在特殊的情况下,出现了死锁的情况。
当然,在开发环境中比较容易出现这类情况,如果需要避免,就需要使用该开发环境的开发人员与测试人员做好沟通了。对于容易出现锁表的功能,可以要求测试人员在测试环境中测试,除特殊情况,尽量减少在开发环境中测试的次数,以免锁表影响开发测试进度。
更多 Oracle 相关信息见Oracle 专题页面 http://www.linuxidc.com/topicnews.aspx?tid=12
本文永久更新链接地址:http://www.linuxidc.com/Linux/2017-10/147436.htm