Hello guys and HELP!

I have a 2-node cluster with one node running GroupWise and the other basic file/print services. I posted about this problem earlier and have included that post at the bottom of this one. Our setup ran fine for a good 9-10 months, but about a month ago we started to have ABEND problems.

Basically, it seems that NIC communication problems are causing SBD read errors on the iSCSI partition on the SAN.

Configuration:
I have a 2-node cluster, both nodes running NWSB6.5 with Cluster Services 1.7 and patched to SP4. The servers use iSCSI to connect to their SAN via a dedicated gigabit NIC (actually, two NICs using fault tolerance).
Subnet A*, 10.0.0.0, iSCSI connectivity (2 NICs, Fault Tolerant/Duplexed by Novell)
Subnet B, 90.0.0.0, Main inner-company subnet (2 NICs, Fault Tolerant/Duplexed by Novell)
Subnet C, 192.168.1.0, Dedicated backup subnet

*Subnets B and C are using the same switches, but subnet A has a dedicated switch.

Errors: (reported by MONITOR.NLM)
Received Failed, packet checksum failure
Received Failed, packet length mismatch failure

Here are my questions:
1) Could there be another reason for getting SBD read errors other than NIC/hardware?
2) Why is there the SAME EXACT count across all NICs, across all subnets? Is that by design?
3) Would checksum/mismatch errors be significant enough to cause an SBD read failure? Should I look at other sources?

Thank you for taking the time to read this long post,

Soroush



Previous newsgroup post and logs

I have a 2-node cluster, both nodes running NW6.5 with Cluster Services 1.7 and patched to SP4. The servers use iSCSI to connect to their SAN via a dedicated gigabit NIC (actually, two NICs using fault tolerance).

The setup has been running without a glitch for close to a year now. Suddenly, one of the servers, the one running GroupWise 6.5.5, started to ABEND. It seems that it's taking a poison pill due to some communication problems, possibly with the SAN SBD partition. The best that I can figure out is that the NICs are reporting some "Receive failed, packet length mismatch" and "Receive failed, Checksum error" counts.

Once I reboot the server, it runs fine for a while (say a day or two) and then bang, it's down again.

I don't know how to trace the problem to a faulty NIC, a switch, or something else. I don't think it's the switch or the iSCSI SAN server, as the 2nd node is running fine. However, when I check the same error counts on the iSCSI SAN server, they are VERY high.

I'm using 3COM 3C200-T NICs, and the driver being loaded shows up as Broadcom B57 version 8.51, May 5, 2005.

Any suggestions would be appreciated.

Thanks,

Soroush



ABEND.log extract
Novell Open Enterprise Server, NetWare 6.5
PVER: 6.50.04

Server RESOURCESII halted Friday, May 5, 2006 2:06:27.323 pm
Abend 1 on P00: Server-5.70.04-0: CLUSTER: Node castout, fatal SAN read error

Running process: SBD Write Node Tick Thread Process
Thread Owned by NLM: SBD.NLM
Stack pointer: 889BADE0
OS Stack limit: 889B7000
CPU 0 (Thread 893BD5A0) is in a NO SLEEP state
Scheduling priority: 67371008
Wait state: 3030070 Yielded CPU
Additional Information:
The NetWare OS detected a problem with the system while executing a process owned by CLSTRLIB.NLM. It may be the source of the problem or there may have been a memory corruption.

Stack Walk
Current EIP: 897300DB CLSTRLIB.NLM|NWCLSTR_Abend+3F
Stack Contents
893C2EC0 8926EA15 VLL.NLM|VipNSShutdown+429
893C2EC4 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2EDC 8926EDF5 VLL.NLM|VllProviderPostEvent+D1
893C2EE0 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2F00 893BA8C4 SBD.NLM|SbdCheckIO+7D8
893C2F04 00000001
893C2F08 00000004
893C2F0C 893BD666 20657441 73696F50 50206E6F 206C6C69 Ate Poison Pill
893C2FB0 893B99DA SBD.NLM|SBD.NLM (Code)+29DA
893C2FB4 8939A024 2A444253 00000001 00000116 4556494C SBD*........LIVE
893C2FB8 8939908C 00000000 00000000 00000002 00000001 ................
893C2FBC 00000006
Emulated 5000 and found no RET instruction Function may never return.
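
A side note on reading the stack contents above: the hex words next to the "Ate Poison Pill" text appear to be the same string stored as 32-bit little-endian values, which is why the bytes look reversed. Below is a minimal Python sketch of that decoding, assuming x86 little-endian dwords; the function name decode_dwords is made up for illustration and is not part of any NetWare tool.

```python
import struct

def decode_dwords(hex_words):
    """Decode 32-bit hex words, as printed in the stack dump, into ASCII text.

    Assumption: each word is a little-endian dword (x86), so the bytes are
    reversed relative to how the dump prints them.
    """
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in hex_words)
    # Show printable ASCII as-is and everything else as '.', like the dump does.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# Words copied from the "Stack Contents" lines of the ABEND log.
words = ["20657441", "73696F50", "50206E6F", "206C6C69"]
print(decode_dwords(words))
```

The same function applied to the words 2A444253 00000001 00000116 4556494C gives the "SBD*........LIVE" text shown in the dump, so the ASCII column looks like a plain rendering of the stack memory rather than a separate message.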

Network Card Details (driver has since been updated to 8.60)
B57.LAN
Loaded from [C:\NWSERVER\DRIVERS\] on May 5, 2006 2:26:23 pm
(Address Space = OS)
Broadcom NetXtreme Gigabit Ethernet Driver
Version 8.51.01 May 5, 2005