Facts: NW6.5sp3, BE9.1, LSI MegaRAID 320-1 RAID5 controller.

I migrated our main production server to 6.5sp3 on new hardware back in
June, and it ran beautifully until two nights ago; even BE has been
behaving itself. Then at 7.49pm local time this appeared:

SERVER-5.70-1534[nmID=B0013] Device "[V370-A1-D0:0] MegaRAID LD0 RAID5
75335R rev:1L37" deactivated by driver due to device failure.

I restarted the server when I came in the next morning; it came up fine
and ran as normal all day. I installed LSI's GAM monitoring software,
and at the end of the working day updated the RAID controller HAM
(Mega4_xx.ham) and DDI to LSI's latest (v7.01.07). Last night at 7.48pm
local time the same thing happened again.

GAM reported:
E-28 2 <ip> Tue Oct 18 19:21:47 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 2
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:21:47 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 3
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 4
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 5
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 4 lun: 0 6
Read retries exhausted
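
To make the pattern in those events easier to see, here is a rough Python
sketch that tallies them per target and lists the distinct timestamps. The
regex is only my guess at the line layout from the five entries above (I am
assuming the number after "lun: 0" is a sequence counter), and the function
names are made up for illustration; none of this is anything LSI documents.

```python
import re
from collections import Counter

# The five GAM events exactly as pasted above.
GAM_LOG = """\
E-28 2 <ip> Tue Oct 18 19:21:47 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 2
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:21:47 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 3
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 3 lun: 0 4
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 4 lun: 0 5
Read retries exhausted
E-28 2 <ip> Tue Oct 18 19:47:44 2005 ctl: 0 chn: 0 tgt: 4 lun: 0 6
Read retries exhausted
"""

# Assumed layout (inferred from the sample only): time, year, the
# ctl/chn/tgt/lun fields, a trailing number, then the message text.
# The message may sit on the same line or wrap onto the next one.
EVENT_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) \d{4} "
    r"ctl: (?P<ctl>\d+) chn: (?P<chn>\d+) tgt: (?P<tgt>\d+) lun: (?P<lun>\d+) "
    r"(?P<seq>\d+)\s+(?P<msg>[^\n]+)"
)


def tally_by_target(log_text):
    """Count events per (channel, target) pair."""
    counts = Counter()
    for m in EVENT_RE.finditer(log_text):
        counts[(int(m["chn"]), int(m["tgt"]))] += 1
    return counts


def event_times(log_text):
    """Return the distinct event timestamps in ascending order."""
    return sorted({m["time"] for m in EVENT_RE.finditer(log_text)})


if __name__ == "__main__":
    for (chn, tgt), n in sorted(tally_by_target(GAM_LOG).items()):
        print(f"chn {chn} tgt {tgt}: {n} event(s)")
    print("times:", ", ".join(event_times(GAM_LOG)))
```

Fed a longer GAM export, the same tally should show whether the read
errors stay on the same targets from night to night.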

This morning the monitor screen was showing 100% utilisation. I restarted,
and again it came up fine and has been running normally so far today.

To me this looks like a failing RAID controller, but since both failures
happened at almost the same time of evening (when BE would have been
running), I wanted to check whether there is any other cause I should
investigate first. Obviously swapping the RAID controller on your main
production server isn't something you want to do just for laughs.

Thanks as always

Phil