I have a two-node cluster (OES 2.2 on SLES 10.3 running in VMware, iSCSI connections from the ESX hosts to the SAN, with the VMs and the SBD partition on the SAN). When one node fails, a poison pill message shows up on the other node, but then that second node dies soon after as well. Here's the messages log from last night's meltdown.
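
(For reference, this is roughly what I run on each node to sanity-check the SBD and cluster state; these are the NCS commands from memory, so the exact flags may be slightly off and should be checked against the sbdutil and cluster man pages:

  sbdutil -f        # locate the SBD partition this node is using
  cluster view      # cluster membership and epoch as seen from this node
  cluster status    # state of the clustered resources/pools

Nothing from these jumps out at me when I run them manually between incidents.)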


Mar 16 21:50:05 node1 /usr/sbin/cron[29451]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 21:50:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 21:52:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 21:54:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 21:55:05 node1 /usr/sbin/cron[29617]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 21:56:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 21:58:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:00:06 node1 syslog-ng[2969]: STATS: dropped 0
Mar 16 22:00:15 node1 /usr/sbin/cron[29798]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:00:58 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:01:28 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:28 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:28 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:28 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:29 node1 /usr/sbin/namcd[5514]: formPwdStream: Write error to socket, errno 32.
Mar 16 22:01:30 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:32 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:33 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:34 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:34 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:34 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:35 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:36 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:37 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:01:40 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:04:41 node1 /usr/sbin/namcd[5514]: write_errToSock: Write error to socket, errno 32.
Mar 16 22:05:19 node1 /usr/sbin/cron[30164]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:10:12 node1 /usr/sbin/cron[30389]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:15:24 node1 /usr/sbin/cron[30610]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:18:13 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:20:02 node1 /usr/sbin/cron[30786]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:20:13 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:22:14 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:22:27 node1 run-crons[30666]: dnschk returned 1
Mar 16 22:24:15 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:25:52 node1 /usr/sbin/cron[30964]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:26:15 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:28:17 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:30:04 node1 /usr/sbin/cron[31104]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:30:17 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:32:18 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:34:18 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:34:20 node1 kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node [node2].
Mar 16 22:34:20 node1 kernel: Node [node2] may have failed or experiencing other problems.
Mar 16 22:34:20 node1 kernel: To ensure cluster stability, this node has sent a poison pill to node [node2].
Mar 16 22:34:20 node1 kernel: Tie breaker.
Mar 16 22:34:20 node1 kernel: Epoch are equal, different bitmask, same number of nodes.
Mar 16 22:34:20 node1 kernel: This node in the same partition as the old master.
Mar 16 22:34:20 node1 kernel: Other node's partition is not the same.
Mar 16 22:34:21 node1 kernel: MM_RemirrorPartition: cluster.sbd, 1
Mar 16 22:34:21 node1 kernel: NWRAID1: Stop Remirror (start) count=0 state=4 wait=0
Mar 16 22:34:21 node1 kernel: NWRAID1: Stop Remirror (end)
Mar 16 22:34:21 node1 kernel: NWRAID1: Remirror enabled
Mar 16 22:34:21 node1 kernel: device-mapper: ioctl: Target type does not support messages
Mar 16 22:34:21 node1 kernel: device-mapper: ioctl: Target type does not support messages
Mar 16 22:34:21 node1 kernel: CLUSTER-<WARNING>-<10290>: CRM:CRMSendNodeCmd: CRMAllocSendDesc returned NULL descriptor
Mar 16 22:35:46 node1 /usr/sbin/cron[31296]: (root) CMD (/usr/local/bin/becheck.sh)
Mar 16 22:36:18 node1 vmware-guestd: nds_nss_read_reply: AF_UNIX read() - no data
Mar 16 22:38:00 node1 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)

***

The becheck cron job is the last normal entry before the wipeout, and this (node1) is the master node. All manual cluster actions seem to behave as expected. There are several clustered pools, but it looks like they never fail over when this happens. Does this look like a cluster problem, or more of a problem with the SAN connection?
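
To try to separate the two next time it happens, my plan is to watch the iSCSI sessions and the SBD device from both nodes while the problem is building up. This is just my intended checklist, not something I've captured yet (the dd device name is only an example; I'd substitute the actual SBD partition):

  iscsiadm -m session                          # are the iSCSI sessions still logged in?
  multipath -ll                                # any failed paths, if multipathing is in use
  dd if=/dev/sdc1 of=/dev/null bs=4k count=1   # can this node still read a block from the SBD LUN?

My thinking is that if SBD reads start hanging or failing on both nodes at roughly the same time, that would point at the SAN/iSCSI side rather than at NCS itself.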