I'm running NetWare 6.5 SP6 in a 6-node cluster.

I have a badly behaved disk device that can really clog up. When that
happens, the Concurrent Disk Requests number goes over 1000. If it stays
there for long, or grows past a certain point, the split-brain detector
kicks the node out due to an inability to see the cluster partition.

The cluster partition is on a good LUN. The I/O to the bad LUN seems to be
crowding out I/O to good LUNs. It looks to me like when the I/O queue gets
long enough, the cluster node can miss its storage heartbeat and get
kicked out.
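
To make the failure mode I suspect concrete, here is a toy model in Python.
It assumes all LUNs share a single FIFO queue, which is only my guess and not
a description of how NetWare actually schedules I/O. The service times and
the tolerance value are made-up numbers; only the backlog of 1000 comes from
what I see on the server.

```python
from collections import deque


def heartbeat_wait_ms(slow_backlog, slow_ms, fast_ms, shared_queue):
    """Return the time until one heartbeat write to the good LUN completes.

    shared_queue=True  -> one FIFO for every LUN, so the heartbeat waits
                          behind the whole backlog for the bad LUN.
    shared_queue=False -> per-LUN queues, so the heartbeat only pays its
                          own service time.
    """
    if shared_queue:
        queue = deque([slow_ms] * slow_backlog)
        queue.append(fast_ms)  # heartbeat is queued last
    else:
        queue = deque([fast_ms])

    elapsed = 0
    while queue:
        elapsed += queue.popleft()  # requests are serviced one at a time
    return elapsed


BACKLOG = 1000       # Concurrent Disk Requests, as observed
SLOW_MS = 20         # hypothetical service time per request on the bad LUN
FAST_MS = 5          # hypothetical service time on the good LUN
TOLERANCE_MS = 8000  # hypothetical heartbeat tolerance

for shared in (True, False):
    wait = heartbeat_wait_ms(BACKLOG, SLOW_MS, FAST_MS, shared)
    label = "shared queue" if shared else "per-LUN queues"
    status = "MISSED" if wait > TOLERANCE_MS else "ok"
    print(f"{label}: heartbeat completes after {wait} ms ({status})")
```

If the model is anywhere near right, what I'm after is some way to get the
second behaviour, or at least to cap how many requests the bad LUN can have
outstanding at once.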

I'm working with my vendor to fix the bad performance I'm having, but I'd
really like to isolate the fault if at all possible. This sort of thing
doesn't happen on our Windows servers as far as I can tell, but I know
different operating systems handle I/O queueing differently. I don't have
any high-volume Linux servers pointing at this bad device, so I don't know
how that'd behave either.

Back in the TFS days there was a Maximum Concurrent Disk Cache Writes SET
parameter, but IIRC that isn't effective with NSS volumes like I have. Is
there any way to reduce the I/O starvation going on?

Novell, it does a network good