Hi folks,

We have a two-node GroupWise 8 SP1 cluster running on OES2 SP1 (SLES10 SP2), attached to an EMC SAN over Fibre Channel using EMC PowerPath.

When the resource (called POA_DIRETOR_SERVER) tries to load, the other node sends a poison pill and the server reboots. I found that all nodes had rebooted during the night.

About every three months, one of the resources (a pool) becomes corrupted. Here is the latest log:

NSSLOG ==> [Error] zlssMSAP.c[1899]
Oct 21 20:17:43 srv-corp-120 kernel: Oct 21, 2009 7:17:43 pm NSS<ZLSS>-4.11b-xxxx:
Oct 21 20:17:43 srv-corp-120 kernel: MSAP: Pool "POA_DIRETOR" ownership lost, pool may have been corrupted
Oct 21 20:17:43 srv-corp-120 kernel: by being activated from two servers at the same time.
...
Oct 22 09:47:21 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:49:20 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:49:39 srv-corp-120 sshd[16218]: Accepted keyboard-interactive/pam for root from 10.100.207.6 port 59479 ssh2
Oct 22 09:50:00 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:50:06 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:51:12 srv-corp-120 sshd[22740]: Accepted keyboard-interactive/pam for root from 172.22.0.101 port 1149 ssh2
Oct 22 09:51:15 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:51:15 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:52:24 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:52:56 srv-corp-120 smdrd[19377]: Received Leave Event for POA_DIRETOR_SERVER
Oct 22 09:52:56 srv-corp-120 smdrd[19377]: Target name POA_DIRETOR_SERVER could not be de-advertised from SLP
Oct 22 09:53:44 srv-corp-120 kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node [srv-corp-121].
Oct 22 09:53:44 srv-corp-120 kernel: Node [srv-corp-121] may have failed or experiencing other problems.
Oct 22 09:53:44 srv-corp-120 kernel: To ensure cluster stability, this node has sent a poison pill to node [srv-corp-121].
Oct 22 09:53:44 srv-corp-120 kernel: Epoch for this node is higher than for some other node.
Oct 22 09:53:44 srv-corp-120 kernel: Other node is slow to update epoch and bitmask (slow or dead).
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: syslog-ng version 1.6.8 starting
Oct 22 09:58:53 srv-corp-120 ifup: lo
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: Changing permissions on special file /dev/xconsole
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: Changing permissions on special file /dev/tty10
Oct 22 09:58:53 srv-corp-120 dbus-daemon: nds_nss_GetGroupsbyMember: failed to init socket, status = -1
Oct 22 09:58:53 srv-corp-120 dbus-daemon: nds_nss_GetGroupsbyMember: failed to init socket, status

To correct the problem, I rebuilt the pool with the ravsui command.

Does anybody know how I can prevent this sort of thing from happening again?

Thanks.