RAC Database Outage Caused by a NIC Failure
Customer environment: RHEL 5.7, Oracle 11.2.0.4.0 RAC
Symptoms: applications could not connect to the database. On node 1, neither the public NIC nor the private NIC responded to ping; on node 2 both responded. The VIPs and the SCAN IP were all unreachable as well.
Root cause: the NIC controller in node 1's server failed, taking down all of that server's NICs and breaking the interconnect heartbeat between the two nodes. Unluckily, node 1 won the voting disk, so node 2 was evicted and its clusterware shut down, while the clusterware on node 1 kept "running" on a server whose network was already dead. The net result: applications could not connect to the database at all.
Node 1's public IP and private IP are on different subnets, yet both were unreachable, and only this one server lost its network, so the switch could essentially be ruled out. Both NICs failing at once pointed at the one component they share: since multiple NICs are driven by a single NIC controller, the controller was the most likely culprit.
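The elimination argument above can be sketched as a small decision function. This is an illustration only: the `diagnose` helper and its inputs are hypothetical, standing in for the ping results actually observed.

```shell
#!/bin/sh
# Sketch of the triage logic (hypothetical helper, not a real tool):
# both interfaces dead on one host while the peer's are healthy, and
# the two IPs on different subnets, points below the per-NIC level --
# at the shared NIC controller -- rather than at the switch.
diagnose() {
    # Args: node1_pub node1_priv node2_pub node2_priv (each "up"/"down")
    if [ "$1" = down ] && [ "$2" = down ] \
       && [ "$3" = up ] && [ "$4" = up ]; then
        echo "suspect: node 1 NIC controller (shared by both NICs)"
    else
        echo "suspect: check switch/cabling per interface"
    fi
}

# What was observed in this case: node 1 fully dark, node 2 fine.
diagnose down down up up
```

With the observed inputs the function points at the shared controller; any mixed pattern would instead send you back to per-link checks.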
Analysis: the diagnostic steps and selected fault information follow. This was a remote support engagement, so node 1 could only be operated by a colleague in the machine room, and some of node 1's logs are therefore missing.
Connecting remotely to node 2 showed that its clusterware was down.
[root@pressdb4 ~]$ ps -ef | grep smon
root     10434     1  0  2014 ?        03:35:06 /u01/app/grid/product/11.2.0/grid_home1/bin/osysmond.bin
root     13647 13476  0 19:27 pts/1    00:00:00 grep smon
[root@pressdb4 bin]# ps -ef | grep has
root      9819     1  0  2014 ?        02:00:33 /u01/app/grid/product/11.2.0/grid_home1/bin/ohasd.bin reboot
root      9947     1  0  2014 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
root     12170 32728  0 19:25 pts/1    00:00:00 grep has
[root@pressdb4 bin]# ps -ef | grep ora
root      9491  9475  0  2014 ?        00:05:09 hald-addon-storage: polling /dev/hda
root     10297     1  0  2014 ?        04:40:46 /u01/app/grid/product/11.2.0/grid_home1/jdk/jre/bin/java -Xms64m -Xmx256m -classpath
At this point the heartbeat network between the two nodes had not recovered and the voting disk was still held by node 1, so the clusterware on node 2 could not start.
[root@pressdb4 bin]# ./crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
The clusterware alert log on node 2 contained the following fault information:
2015-01-28 06:16:47.621: [cssd(8857)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/product/11.2.0/grid_home1/log/pressdb4/cssd/ocssd.log
2015-01-28 06:16:47.621: [cssd(8857)]CRS-1603:CSSD on node pressdb4 shutdown by user.
2015-01-28 06:16:52.840: [ohasd(9819)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'pressdb4'.
2015-01-28 06:16:52.880: [ohasd(9819)]CRS-2878:Failed to restart resource 'ora.cssd'
2015-01-28 06:16:54.312: [cssd(10733)]CRS-1713:CSSD daemon is started in clustered mode
2015-01-28 06:17:00.114: [cssd(10733)]CRS-1707:Lease acquisition for node pressdb4 number 2 completed
2015-01-28 06:17:01.397: [cssd(10733)]CRS-1605:CSSD voting file is online: /dev/asm-disk1; details in /u01/app/grid/product/11.2.0/grid_home1/log/pressdb4/cssd/ocssd.log.
The CSS logs on node 1 and node 2 both showed the heartbeat network failing:
2015-01-28 02:54:18.763: [    CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 50% heartbeat fatal, removal in 14.510 seconds
2015-01-28 02:54:25.776: [    CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 75% heartbeat fatal, removal in 7.500 seconds
2015-01-28 02:54:30.789: [    CSSD][1096313152]clssnmPollingThread: node pressdb3/4 (1) at 90% heartbeat fatal, removal in 2.490 seconds, seedhbimpd 1
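The countdowns in these lines follow directly from the CSS misscount. Assuming the 11.2 Linux default of 30 seconds (an assumption; check `crsctl get css misscount` on the actual cluster), "removal in N seconds" is roughly misscount multiplied by the fraction of heartbeats not yet missed:

```shell
#!/bin/sh
# Sketch: reproduce the ocssd.log countdowns from the misscount.
# MISSCOUNT=30 is the assumed 11.2 Linux default, not a value read
# from this cluster. Integer math, so 7.5 prints as 7.
MISSCOUNT=30
for pct in 50 75 90; do
    remaining=$(( MISSCOUNT * (100 - pct) / 100 ))
    echo "${pct}% heartbeats missed -> removal in ~${remaining}s"
done
```

This lines up with the log: ~15 s remaining at 50%, ~7.5 s at 75%, ~2.5 s at 90%, matching the 14.510/7.500/2.490 figures above.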
With the private network down, the heartbeat between the two nodes was lost, and each node tried to evict the other.
Node 2 evicting node 1:
2015-01-28 02:54:33.276: [    CSSD][1096313152]clssnmMarkNodeForRemoval: node 1, pressdb3 marked for removal
2015-01-28 02:54:33.276: [    CSSD][1096313152]clssnmDiscHelper: pressdb3, node(1) connection failed, endp (0x268fd57), probe((nil)), ninf->endp 0x268fd57
Node 1 evicting node 2:
2015-01-28 02:55:34.751: [    CSSD][1106127168]clssnmMarkNodeForRemoval: node 2, pressdb4 marked for removal
2015-01-28 02:55:34.751: [    CSSD][1106127168]clssnmDiscHelper: pressdb4, node(2) connection failed, endp (0x6e8), probe((nil)), ninf->endp 0x6e8
In a two-node RAC, when the interconnect is lost Oracle cannot tell which node is actually at fault; the split is broken in favor of the node with the lower node number, which in practice means the second node is evicted. So node 2 was, unluckily, the one thrown out.
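That tie-break can be sketched as follows. This is a hedged illustration of the commonly documented CSS rule for equal-sized sub-clusters, not output from this system; node numbers here match the lease message earlier (pressdb4 is node 2, so pressdb3 is node 1).

```shell
#!/bin/sh
# Sketch of the 2-node split-brain tie-break (assumed rule: when the
# split produces two equal sub-clusters, the one containing the
# lowest node number keeps the voting disk; the other is evicted).
survivor_of() {
    # survivor_of NODE_A NODE_B -> prints the surviving node number
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

winner=$(survivor_of 1 2)
echo "surviving node number: $winner"   # node 1 = pressdb3
```

Hence node 1 survived despite being the node whose hardware had actually failed, which is exactly what made this incident painful.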
Confirmation: a colleague in the machine room verified that node 1's services had started normally.
[root@pressdb3 bin]# ./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    pressdb3
ora....ER.lsnr ora....er.type ONLINE    ONLINE    pressdb3
ora....N1.lsnr ora....er.type ONLINE    ONLINE    pressdb3
ora.OCR.dg     ora....up.type ONLINE    ONLINE    pressdb3
ora.asm        ora.asm.type   ONLINE    ONLINE    pressdb3
ora.cvu        ora.cvu.type   ONLINE    ONLINE    pressdb3
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    pressdb3
ora.oc4j       ora.oc4j.type  ONLINE    OFFLINE
ora.ons        ora.ons.type   ONLINE    ONLINE    pressdb3
ora.pressdb.db ora....se.type ONLINE    OFFLINE
ora....db3.vip ora....t1.type ONLINE    ONLINE    pressdb3
ora....SM2.asm application    ONLINE    ONLINE    pressdb3
ora....B4.lsnr application    ONLINE    ONLINE    pressdb3
ora....db4.gsd application    OFFLINE   OFFLINE
ora....db4.ons application    ONLINE    ONLINE    pressdb3
ora....db4.vip ora....t1.type ONLINE    ONLINE    pressdb3
ora....ry.acfs ora....fs.type ONLINE    ONLINE    pressdb3
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    pressdb3
So with the heartbeat network down, node 2 could not start normally, while the applications could not reach the database on node 1 either.
Resolution: shut down the clusterware on node 1 to release the resources it held; node 2 could then start normally.
Stopping the clusterware on node 1:
[root@pressdb3 bin]# ./crsctl stop crs
The clusterware on node 2 then started normally:
[root@pressdb4 bin]# ./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    pressdb4
ora....ER.lsnr ora....er.type ONLINE    ONLINE    pressdb4
ora....N1.lsnr ora....er.type ONLINE    ONLINE    pressdb4
ora.OCR.dg     ora....up.type ONLINE    ONLINE    pressdb4
ora.asm        ora.asm.type   ONLINE    ONLINE    pressdb4
ora.cvu        ora.cvu.type   ONLINE    ONLINE    pressdb4
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    pressdb4
ora.oc4j       ora.oc4j.type  ONLINE    OFFLINE
ora.ons        ora.ons.type   ONLINE    ONLINE    pressdb4
ora.pressdb.db ora....se.type ONLINE    OFFLINE
ora....db3.vip ora....t1.type ONLINE    ONLINE    pressdb4
ora....SM2.asm application    ONLINE    ONLINE    pressdb4
ora....B4.lsnr application    ONLINE    ONLINE    pressdb4
ora....db4.gsd application    OFFLINE   OFFLINE
ora....db4.ons application    ONLINE    ONLINE    pressdb4
ora....db4.vip ora....t1.type ONLINE    ONLINE    pressdb4
ora....ry.acfs ora....fs.type ONLINE    ONLINE    pressdb4
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    pressdb4
Since node 2's server had no network problem, applications could access the database again at this point. As for node 1, once the server was repaired, its clusterware could simply be started in the normal way.