Hello guys,

I have a 2-node cluster, both nodes running NW6.5 with Cluster Services 1.7 and patched to SP4. The servers use iSCSI to connect to their SAN via a dedicated gigabit NIC (actually, two NICs configured for fault tolerance).

The setup has been running without a glitch for close to a year now. Suddenly, one of the servers, which runs GroupWise 6.5.5, started to abend. It seems to be taking a poison pill due to communication problems, possibly with the SBD partition on the SAN. The best I can figure out is that the NICs are reporting nonzero "Receive failed, packet length mismatch" and "Receive failed, Checksum error" counts.

Once I reboot the server, it runs fine for a while (say a day or two) and then bang, it's down again.

I don't know how to trace the problem to a faulty NIC, a switch, or something else. I don't think it's the switch or the iSCSI SAN server, as the 2nd node is running fine, although when I check the same counts on the iSCSI SAN server they are VERY high.
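
One way to narrow this down would be to write the error counters down twice, some hours apart, and see which NIC's counts are actually growing. The sketch below is only an illustration of that comparison: the NIC labels, counter names, and numbers are made up, and it assumes the values are copied by hand from each NIC's statistics screen.

```python
def counter_deltas(before, after):
    """Return {nic: {counter: increase}} for counters that grew between two snapshots."""
    deltas = {}
    for nic, counters in after.items():
        earlier = before.get(nic, {})
        grown = {
            name: value - earlier.get(name, 0)
            for name, value in counters.items()
            if value - earlier.get(name, 0) > 0
        }
        if grown:
            deltas[nic] = grown
    return deltas


# Hypothetical values, typed in by hand from each NIC's statistics screen.
before = {
    "nic1": {"packet length mismatch": 12, "checksum error": 3},
    "nic2": {"packet length mismatch": 0, "checksum error": 0},
}
after = {
    "nic1": {"packet length mismatch": 57, "checksum error": 9},
    "nic2": {"packet length mismatch": 0, "checksum error": 0},
}

for nic, grown in counter_deltas(before, after).items():
    print(nic, grown)
```

A NIC whose counters stay flat drops out of the result, so whatever remains is the place to start swapping cables, ports, or cards.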

I'm using 3COM 3C200-T NICs, and the driver that loads reports itself as Broadcom B57 version 8.51, May 5, 2005.

Any suggestions would be appreciated.



ABEND.log extract
Novell Open Enterprise Server, NetWare 6.5
PVER: 6.50.04

Server RESOURCESII halted Friday, May 5, 2006 2:06:27.323 pm
Abend 1 on P00: Server-5.70.04-0: CLUSTER: Node castout, fatal SAN read error

Running process: SBD Write Node Tick Thread Process
Thread Owned by NLM: SBD.NLM
Stack pointer: 889BADE0
OS Stack limit: 889B7000
CPU 0 (Thread 893BD5A0) is in a NO SLEEP state
Scheduling priority: 67371008
Wait state: 3030070 Yielded CPU
Additional Information:
The NetWare OS detected a problem with the system while executing a process owned by CLSTRLIB.NLM. It may be the source of the problem or there may have been a memory corruption.

Stack Walk
Stack Contents
893C2EC0 8926EA15 VLL.NLM|VipNSShutdown+429
893C2EC4 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2EDC 8926EDF5 VLL.NLM|VllProviderPostEvent+D1
893C2EE0 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2F00 893BA8C4 SBD.NLM|SbdCheckIO+7D8
893C2F04 00000001
893C2F08 00000004
893C2F0C 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2FB0 893B99DA SBD.NLM|SBD.NLM (Code)+29DA
893C2FB4 8939A024 2A444253 00000001 00000116 4556494C SBD*........LIVE
893C2FB8 8939908C 00000000 00000000 00000002 00000001 ................
893C2FBC 00000006
Emulated 5000 and found no RET instruction Function may never return.

Network Card Details
Loaded from [C:\NWSERVER\DRIVERS\] on May 5, 2006 2:26:23 pm
(Address Space = OS)
Broadcom NetXtreme Gigabit Ethernet Driver
Version 8.51.01 May 5, 2005
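
A side note on reading the stack contents in the abend log above: the groups of four hex words appear to be ASCII text stored as little-endian 32-bit words, which is why `20657441 73696F50 50206E6F 206C6C69` is shown next to "Ate Poison Pill". A minimal sketch that decodes such words (the function name is my own, not part of any NetWare tool):

```python
import struct


def decode_dwords(hex_words):
    """Decode space-separated 32-bit hex words as little-endian ASCII text."""
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in hex_words.split())
    # Show non-printable bytes as '.', the way the dump's text column does.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


print(decode_dwords("20657441 73696F50 50206E6F 206C6C69"))
print(decode_dwords("2A444253 00000001 00000116 4556494C"))
```

Run against the two word groups copied from the log, this reproduces the text column of the dump, including the `SBD*` and `LIVE` markers.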