An entire RAC cluster brought down by one node's failure because of a directly cabled private interconnect
Recently a customer's RAC environment suffered a serious outage that disrupted the business for more than an hour. Over the phone the customer described it like this: the server hosting RAC node 2 rebooted itself, after which node 1 stopped working, applications could no longer connect to the database, and node 1's IP did not answer ping. Looking at the console through the KVM, they saw node 1's server showing a blank screen; it had hung. By that point node 2's operating system was already up, but its clusterware (CRS) would not start (the customer reported that node 2 could not mount its ASM disk groups).

When we arrived on site we found that node 1's CRS and database had left no log entries at all, and the host's system log had a gap with nothing recorded between 10:29 and 11:38. The customer explained that when they saw node 1's server hang, they simply held down the power button to force it off. On power-up the server went into an automatic self-check; after half an hour it still had not come up, so they forced it off again, and it took another half hour or more before the server finally booted. During the whole time node 1 was down, node 2 was unable to start its clusterware; only after node 1's operating system came back up did node 2's cluster stack start successfully.
The customer environment was SUSE Linux with Oracle 10.2.0.5.
Checking node 2's CSS log, we found that node 1 had already gone down at 10:29:
[CSSD]2013-01-14 10:29:51.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 50% heartbeat fatal, eviction in 29.970 seconds seedhbimpd 0
[CSSD]2013-01-14 10:29:51.609 [1216416064] >TRACE: clssnmPollingThread: node db3 (1) is impending reconfig, flag 1039, misstime 30030
[CSSD]2013-01-14 10:29:51.610 [1216416064] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[CSSD]2013-01-14 10:29:52.614 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 50% heartbeat fatal, eviction in 28.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:06.614 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 75% heartbeat fatal, eviction in 14.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:07.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 75% heartbeat fatal, eviction in 13.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:17.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 3.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:18.613 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 2.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:19.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 1.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:20.613 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 0.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:21.577 [1216416064] >TRACE: clssnmDiscHelper: db3, node(1) connection failed, con (0x78f860), probe((nil))
[CSSD]2013-01-14 10:30:21.577 [1216416064] >TRACE: clssnmDiscHelper: node 1 clean up, con (0x78f860), init state 5, cur state 5
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmDoSyncUpdate: Initiating sync 9
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmSetupAckWait: Ack message type (11)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[CSSD]2013-01-14 10:30:21.578 [1107310912] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmSetMinMaxVersion: min product/protocol (10.2/1.4)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmSetMinMaxVersion: max product/protocol (10.2/1.4)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmDoSyncUpdate: Terminating node 1, db3, misstime(60000) state(5), seedhbimpd(1)
[CSSD]2013-01-14 10:30:21.578 [1115703616] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)
So node 1 had already failed at 10:29, exactly the opposite of the customer's account. Now look at node 2's next log entries:
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmSetupAckWait: Ack message type (13)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmCheckDskInfo: Checking disk info...
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmCheckDskSleepTime: Node 1, db3, is alive, DHB (1358130621, 1976175180) more than disk timeout of 57000 after the last NHB (1358130562, 1976115440)
[CSSD]2013-01-14 10:30:21.578 [2191697312] >TRACE: clssgmQueueGrockEvent: groupName(crs_version) count(2) master(0) event(2), incarn 8, mbrc 2, to member 1, events 0x0, state 0x0
[CSSD]2013-01-14 10:30:21.578 [1124096320] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmCheckDskInfo: My cohort: 2
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmCheckDskInfo: Surviving cohort: 1
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE: clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, db4, is smaller than cohort of 1 nodes led by node 1, db3, based on map type 2
After losing the network heartbeat, node 2 checked the voting disk and concluded that the loss of contact with node 1 was a problem on its own side. Its one-node cohort (led by node 2, db4) lost the arbitration against the cohort led by node 1 (db3), so to avoid split-brain node 2 killed itself and its server rebooted.
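To make the clssnmCheckDskInfo arbitration above easier to follow, here is a minimal Python sketch of the rule the log describes, purely illustrative and not Oracle's actual code: the larger cohort survives, and on a tie, as here with one node on each side, the cohort containing the lowest-numbered node wins, so the cohort led by db3 (node 1) survives and db4 (node 2) aborts.

def surviving_cohort(cohort_a, cohort_b):
    """Pick the cohort that survives a split-brain check.

    Each cohort is a list of node numbers that can still talk to each
    other. The larger cohort wins; on a tie the cohort containing the
    lowest-numbered node wins (a simplified model of what the CSS log
    above reports).
    """
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

# After the network heartbeat is lost, each node only "sees" itself:
# node 2's cohort is [2], node 1's cohort is [1].
print(surviving_cohort([2], [1]))   # -> [1]: node 2 must abort to avoid split-brain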
At 10:35 node 2's server came back up, but node 1's server had not yet booted, and node 2's clusterware could not start:
[CSSD]2013-01-14 10:35:59.336 [62072224] >TRACE: clssnmReadNodeInfo: added node 1 (db3) to cluster
[CSSD]2013-01-14 10:35:59.343 [62072224] >TRACE: clssnmReadNodeInfo: added node 2 (db4) to cluster
[CSSD]2013-01-14 10:35:59.344 [62072224] >TRACE: clssnmInitNMInfo: Initialized with unique 1358130959
[CSSD]2013-01-14 10:35:59.347 [62072224] >TRACE: clssNMInitialize: Initializing with OCR id (1797159872)
[CSSD]2013-01-14 10:36:00.355 [1107310912] >TRACE: clssnm_skgxninit: Compatible vendor clusterware not in use
[CSSD]2013-01-14 10:36:00.359 [62072224] >TRACE: clssnmNMInitialize: misscount set to (60)
[CSSD]2013-01-14 10:36:00.360 [62072224] >TRACE: clssnmStartNM: reboottime set to (3) sec
[CSSD]2013-01-14 10:36:00.360 [62072224] >TRACE: clssnmNMInitialize: Network heartbeat thresholds are: impending reconfig 30000 ms, reconfig start (misscount) 60000 ms
[CSSD]2013-01-14 10:36:03.376 [1107310912] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965066) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.384 [1115703616] >TRACE: clssnmvReadDskHeartbeat: read ALL for Joining
[CSSD]2013-01-14 10:36:03.384 [1115703616] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965066) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.404 [1124096320] >TRACE: clssnmvReadDskHeartbeat: read ALL for Joining
[CSSD]2013-01-14 10:36:03.404 [1124096320] >TRACE: clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965086) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.451 [1208023360] >TRACE: clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
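The thresholds logged here (impending reconfig at 30000 ms, reconfig start at misscount 60000 ms) match the countdown node 2 printed before the eviction: CSS starts warning once the time without a network heartbeat crosses 50%, 75% and 90% of misscount and evicts at 100%. A quick illustrative calculation, assuming the misscount of 60 seconds shown in the log:

MISSCOUNT_S = 60  # from "misscount set to (60)" in the CSS log above

# Remaining time when each warning threshold is first crossed;
# the warnings then repeat roughly once per second until eviction.
for pct in (50, 75, 90):
    elapsed = MISSCOUNT_S * pct / 100
    print(f"{pct}% of misscount: {elapsed:.0f}s without heartbeat, "
          f"{MISSCOUNT_S - elapsed:.0f}s left before eviction")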
Node 2's CRS log shows:
2013-01-14 10:35:56.811: [ default][2561460576][ENTER]0 Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle. All rights reserved
2013-01-14 10:35:56.827: [ default][2561460576]0CRS Daemon Starting
2013-01-14 10:35:56.828: [ CRSMAIN][2561460576]0Checking the OCR device
2013-01-14 10:35:56.840: [ CRSMAIN][2561460576]0Connecting to the CSS Daemon
2013-01-14 10:35:57.063: [ COMMCRS][1082132800]clsc_connect: (0xb7f660) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_db4_))
2013-01-14 10:35:57.063: [ CSSCLNT][2561460576]clsssInitNative: connect failed, rc 9
2013-01-14 10:35:57.064: [ CRSRTI][2561460576]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
Node 2 kept logging "CSS is not ready" until node 1 came up. After node 1 booted at 11:36, node 2's clusterware and database started successfully, followed by node 1's clusterware and database. After analysis and a discussion with the customer, it turned out that their RAC private interconnect was a direct cable between the two servers, with no switch in between. So after node 2 rebooted (while node 1 was powered off), node 2 found the interconnect NIC unusable when it tried to start the clusterware, and the cluster stack could not come up. Only when node 1's operating system booted did the interconnect NIC work again, and only then could node 2 start its clusterware. Oracle explicitly advises against building RAC with a directly cabled private interconnect; see the MOS note RAC: Frequently Asked Questions [ID 220970.1].
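With a crossover cable, the link state of the local interconnect NIC depends entirely on the peer host: power off the other server and the cable goes dead, which is why node 2's cluster stack had nothing to bring up. A minimal sketch of how one might watch for that condition from the OS, assuming a hypothetical interconnect interface name of eth1:

import time

IFACE = "eth1"  # hypothetical name of the private-interconnect NIC

def has_link(iface: str) -> bool:
    """Return True if the NIC reports carrier (link up) via sysfs."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        # carrier is not readable while the interface is administratively down
        return False

if __name__ == "__main__":
    while True:
        state = "up" if has_link(IFACE) else "DOWN (peer powered off or cable fault?)"
        print(f"{time.strftime('%H:%M:%S')} interconnect {IFACE}: link {state}")
        time.sleep(5)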
Sometimes the customer's account of a problem is not accurate; you still have to analyze and judge for yourself.
So what made node 1 hang? Node 1's operating system log contains the following entries:
Jan 14 10:29:22 DB3 syslog-ng[5942]: Changing permissions on special file /dev/xconsole
Jan 14 10:29:22 DB3 syslog-ng[5942]: Changing permissions on special file /dev/tty10
Jan 14 10:29:22 DB3 kernel: printk: 8 messages suppressed.
Jan 14 10:29:22 DB3 kernel: The following is only an harmless informational message.
Jan 14 10:29:22 DB3 kernel: Unless you get a _continuous_flood_ of these messages it means
Jan 14 10:29:22 DB3 kernel: everything is working fine. Allocations from irqs cannot be
Jan 14 10:29:22 DB3 kernel: perfectly reliable and the kernel is designed to handle that.
Jan 14 10:29:22 DB3 kernel: oracle: page allocation failure. order:0, mode:0x20
Jan 14 10:29:22 DB3 kernel:
Jan 14 10:29:22 DB3 kernel: Call Trace: <IRQ> <ffffffff80168162>{__alloc_pages+796}
Jan 14 10:29:22 DB3 kernel: klogd 1.4.1, ---------- state change ----------
Jan 14 10:29:22 DB3 kernel: <ffffffff801837bc>{kmem_getpages+106} <ffffffff80184bbe>{fallback_alloc+304}
Jan 14 10:29:22 DB3 kernel: <ffffffff801850c3>{__kmalloc+179} <ffffffff8028f362>{__alloc_skb+93}
Jan 14 10:29:22 DB3 kernel: <ffffffff881a9266>{:netxen_nic:netxen_alloc_rx_skb+39}
Jan 14 10:29:22 DB3 kernel: <ffffffff881ab594>{:netxen_nic:netxen_process_rcv_ring+1604}
Jan 14 10:29:22 DB3 kernel: <ffffffff8010f013>{do_gettimeofday+92} <ffffffff881aa9e9>{:netxen_nic:netxen_process_cmd_ring+46}
Jan 14 10:29:22 DB3 kernel: <ffffffff881a5eb8>{:netxen_nic:netxen_nic_poll+64} <ffffffff80295144>{net_rx_action+165}
Jan 14 10:29:22 DB3 kernel: <ffffffff8013a76c>{__do_softirq+85} <ffffffff8010c216>{call_softirq+30}
Jan 14 10:29:22 DB3 kernel: <ffffffff8010d1a4>{do_softirq+44} <ffffffff8010d56e>{do_IRQ+64}
Jan 14 10:29:22 DB3 kernel: <ffffffff8010b352>{ret_from_intr+0} <EOI>
Jan 14 10:29:22 DB3 kernel: Mem-info:
These are the only log entries recorded on January 14 before the hang. From them we can infer that node 1's host hung because Oracle could not get memory pages allocated.
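Since the trigger looks like memory exhaustion (an order-0 atomic allocation failing in the NIC's receive path), it is worth keeping an eye on free memory on hosts like this. A minimal sketch that reads /proc/meminfo and warns when free memory drops below a threshold; the 100 MB value is an arbitrary example, not something taken from this case:

THRESHOLD_KB = 100 * 1024  # arbitrary example threshold: warn below ~100 MB free

def meminfo() -> dict:
    """Parse /proc/meminfo into a {field: kB} dictionary."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are reported in kB
    return info

if __name__ == "__main__":
    free_kb = meminfo().get("MemFree", 0)
    print(f"MemFree: {free_kb} kB")
    if free_kb < THRESHOLD_KB:
        print("WARNING: free memory is low; atomic allocations "
              "(e.g. in the NIC receive path) may start to fail")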
To recap the sequence of events: at 10:29, Oracle on node 1 failed to allocate memory pages and node 1's operating system hung. Node 2 detected that node 1 had gone away but did not win the voting-disk arbitration; believing the loss of contact was its own fault, it rebooted itself. Because the two nodes' interconnect was a direct cable, node 2's private network stayed down for as long as node 1's operating system was down, so node 2's clusterware could not start until node 1's operating system finally came back up.
The article says: "After analysis and a discussion with the customer, it turned out that their RAC private interconnect was a direct cable between the two servers, with no switch in between. So after node 2 rebooted (while node 1 was powered off)..."
2013-01-17 23:09 Could you explain what a directly connected private NIC means?
@Lost
2013-01-18 21:47 They ran a single network cable straight between the two servers' private NICs, without going through a switch. With that setup, when one server shuts down abnormally, CRS reboots the other node, which is exactly what happened to this customer; and as long as the downed server stays off, the other node's CRS cannot start. Oracle no longer supports this direct-connect configuration; there are articles about it on MOS.