Hi All,

I'm stuck with my new nine-node cluster. Regularly a few random nodes, but
not all, crash at the same time with the error "This node in the Minority
partition and the node in Majority partition is Alive. Running Process:
VLL_EventThread Cod".

All nine nodes are the same: HP ProLiant DL380G4 with the latest BIOS
version and NetWare 6.5 SP4a with all post-SP4a patches and NCS 1.7. All
nodes have the latest QL2X00.HAM v6.90.02 HBA driver. The HBA firmware is
the HP-recommended version for the QLA2340 card (1.45). Each node has two HBAs
with SecurePath connected to a SAN. The other servers connected to this
SAN (but not in this cluster) don't have any issues with the SAN, so
it's probably not a problem with the SAN. All nine nodes have the same
latest Q57.LAN driver, v8.65 (I know, I know, it's a Broadcom). All nodes
are connected to the same Gigabit switch. Port speed and duplex are set
to "Auto" on all nodes and on the Gigabit switch (it's not possible to set
1000FD in the NIC driver). It's a new cluster and at this time it's not servicing
any users. It doesn't have any third-party applications or any other
applications at all. Only pure NW6.5 SP4 + hotfixes + NCS 1.7 + the latest
HBA and LAN drivers.
Minimum Service Processes = 300
Maximum Service Processes = 750

Previously the tolerance and slave watchdog timeouts were set to 8, the
default. Then we increased these parameters to 24, but it didn't help. Now
they're set to 60. The cluster has been stable for 24 hours, but who knows.

What else can I tune? Where do I need to dig? I've already racked my brains over this.