I need some help troubleshooting a problem I'm having with one node of a 10-node cluster. All nodes are NetWare 6.5 SP5 on HP BL20p G3 and G4 blades, attached to an EMC CLARiiON through Cisco MDS 9509 switches.

On 3/12/08, just after 11:00pm, something happened and 7 nodes of the cluster got a poison pill. Nothing was noted in the SP event logs or on the switches, other than the nodes disconnecting after they received the poison pill.

On 3/19/08, again just after 11:00pm, one node got a poison pill.
The same thing happened on 3/21, and again on 3/24. It is always just a few minutes after 11:00pm: 11:03, 11:06, 11:05.

There are a couple of LUN clones running at this time, but they kick off at 9:00pm, not 11:00. Everything was fine before 3/12. The LAN guys say nothing has changed and the SAN guy says nothing has changed, so maybe something is failing?

The servers are dual-pathed to the SAN, running PowerPath 3.00.06 and ql2x00.ham 6.90.13. Both are the latest release. 4 of the servers are running n65nss5a; the other 6 are running n65nss5b.

Other than the obvious "Upgrade all the servers to SP7", what else should I be looking at? Also, when I reboot any of these servers, they hang for a minute or two at "Scanning for devices and partitions". They never did this on the Compaq SAN. Is this normal with EMC?