I've also posted this in the OES-Install forum.


Just a little background info: I've worked with Novell products since 1989, and only Novell products. I know every version of NetWare starting at 2.15, every version of OES and every version of Cluster Services. I've also worked with VMware for several years. I'm not boasting, just giving all the relevant information.

I have a customer with a 3-node OES2sp2 Linux cluster in a VMware ESX v4.1 environment. Two days ago I updated all 3 nodes to OES2sp3 using the .ISO image (install CDs). Everything went fine, and after the restarts everything looked fine. About 4 hours after the update, one node said it couldn't communicate with another and sent it a poison pill. Of the 2 remaining nodes, one said it couldn't communicate with the 3rd node and sent it a poison pill too. After a restart of the whole cluster this still happens, anywhere between 1 and 4 hours after the restart. It happens every 2 - 4 hours...

I've updated the VMware Tools and that didn't help. (The update broke the boot with "waiting for /dev/sda1 to appear"; I had to use the rescue disk, edit the /etc/sysconfig/kernel file and rebuild the initrd with mkinitrd.) For the first 2 days time was in sync; now time sync seems to have a problem. Sometimes it's in sync, sometimes it isn't. In other words, a NIGHTMARE!

Nothing else has changed in the environment: nothing on the network, nothing in the SAN, nothing in ESX, no massive influx of users or data. eDir looks fine, except for the time sync...

In /var/log/messages I've noticed several points of interest. I've listed them below.

During boot, while the vmxnet, vmxnet3 & NSS modules are loading, I get this message (both before & after the VM-Tools update): "Loading module compiled for kernel version into kernel version"???

/var/log/messages also catches the cluster poison pill, as follows:
"kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node srv01"
"kernel: Node [srv01] may have failed or experiencing other problems."
"kernel: To ensure cluster stability, this node has sent a poison pill to node [srv01]."
"kernel: Node appears to be dead while holding its lock."
"Kernel: MM_RemirrorPartition: swd-cluster.sbd, 1"
"kernel: device-mapper: ioctl: Target type does not support messages"
"kernel: device-mapper: ioctl: Target type does not support messages"

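For anyone who wants to line these events up against the time sync problems, here is a minimal sketch that pulls the poison pill lines out of a copy of the log. It assumes the usual syslog prefix ("Mon DD HH:MM:SS host kernel: ...") in front of the message text quoted above; that prefix is an assumption, since the excerpt doesn't show it. The function name find_poison_pills is made up for illustration, and the sketch is meant to run under Python 3 on a workstation, not on the SLES10 nodes themselves.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host kernel: message".
# The message text comes from the excerpt above; the prefix is an assumption.
PILL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"kernel:.*sent a poison pill to node \[(?P<target>[^\]]+)\]"
)


def find_poison_pills(lines):
    """Return (timestamp, sending host, target node) for each poison pill line."""
    events = []
    for line in lines:
        match = PILL_RE.search(line)
        if match:
            events.append(
                (match.group("stamp"), match.group("host"), match.group("target"))
            )
    return events
```

Feeding it the lines of a copied messages file from each node gives a short list of who sent a pill to whom and when, which can then be compared with the timestamps of the time sync errors and CPU spikes.
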
Right before the nodes freeze and/or get a poison pill, the CPU utilization goes through the roof...

Now I know that the OES2sp3 update only updated OES components and didn't change anything on the OS (SLES10sp3) except the kernel, which went from to At least that seems to be the case. If this is correct & sp3 updates the kernel, then shouldn't Novell have shipped new SLES drivers for the VMware vmxnet, vmxnet3 & mptspi, amongst other new drivers? No, we are not using the VMware kernel.

For the past 24 hours the Master of [root] (eDir) has had a huge spike in its CPU usage... Massive. This problem and the messages in the logs, plus other issues (just now, the only node that was still running in the cluster died... It had a huge spike in CPU utilization, 100%, and bye...), tell me that sp3 doesn't have the necessary drivers for the kernel update...

I really hope that Novell pulls sp3... or fixes it... It would've been nice to have known this in advance. A colleague of mine has a customer with a 2-node OES2 cluster and it looks like he is having similar problems...

Any ideas or help would be greatly appreciated...

Thanks in advance,