I am troubleshooting a difficult intermittent problem with an OES 11 server. The server is OES 11 SP2 with all available patches applied as of mid-December, and it is running kernel 3.0.101-0.40. It is virtualized on VMware ESXi 5.5 Build 2302651. The problem has also occurred intermittently on older versions, such as OES 11 SP1, and on older versions of VMware, so the versions don't seem to make a difference.

Here is what happens:
1) For about 4 minutes, users are disconnected from the NSS volumes and cannot access the server at all.
2) It resolves itself.
3) There is no pattern to when it happens. It could be midday or late afternoon. It doesn't coincide with any heavy usage periods such as start of day.

The ONLY thing I can find in log files is the following:

Dec 30 11:41:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;SOFT;1;CRITICAL - load average: 46.38, 18.10, 7.91
Dec 30 11:42:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;SOFT;2;CRITICAL - load average: 51.59, 24.77, 10.85
Dec 30 11:43:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;SOFT;3;CRITICAL - load average: 54.02, 30.35, 13.64
Dec 30 11:44:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 54.66, 34.85, 16.23
Dec 30 11:44:17 fs01 nagios: SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 54.66, 34.85, 16.23

In this particular case, the users were able to reconnect at 11:45. The server was not rebooted. Whatever was happening just cleared up and things started working again.
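
To line these alerts up against other logs, the load figures can be pulled out of syslog programmatically. The following is only a minimal sketch: the regular expression is written against the exact line format shown above, the year is a placeholder because syslog lines carry none, and month parsing assumes an English locale. It avoids newer syntax so it has a chance of running on an older system Python as well.

```python
import re
from datetime import datetime

# Matches the Nagios "Current Load" SERVICE ALERT lines in the format shown above.
ALERT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) nagios: "
    r"SERVICE ALERT: [^;]*;Current Load;(?P<state>\w+);(?P<kind>\w+);(?P<attempt>\d+);"
    r".*load average: (?P<l1>[\d.]+), (?P<l5>[\d.]+), (?P<l15>[\d.]+)"
)


def parse_load_alerts(lines, year):
    """Return [(datetime, load1, load5, load15), ...] for matching lines.

    `year` must be supplied by the caller; syslog timestamps do not include it.
    Lines that are not "Current Load" SERVICE ALERTs are skipped.
    """
    rows = []
    for line in lines:
        m = ALERT_RE.match(line)
        if m is None:
            continue
        when = datetime.strptime(
            "%d %s" % (year, m.group("stamp")), "%Y %b %d %H:%M:%S"
        )
        rows.append(
            (when, float(m.group("l1")), float(m.group("l5")), float(m.group("l15")))
        )
    return rows


ASSUMED_YEAR = 2014  # placeholder only; adjust to the year of the log file

sample = [
    "Dec 30 11:41:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;"
    "CRITICAL;SOFT;1;CRITICAL - load average: 46.38, 18.10, 7.91",
    "Dec 30 11:44:17 fs01 nagios: SERVICE ALERT: localhost;Current Load;"
    "CRITICAL;HARD;4;CRITICAL - load average: 54.66, 34.85, 16.23",
]

for when, l1, l5, l15 in parse_load_alerts(sample, ASSUMED_YEAR):
    print("%s load1=%.2f load5=%.2f load15=%.2f" % (when, l1, l5, l15))
```

The resulting timestamps can then be compared with NSS, NCP, and VMware logs for the same minutes. Feeding it the real log file (for example, the lines of /var/log/messages, if that is where syslog writes on this server) would list every past spike.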

My attempts to work with Novell Support on this issue have been extremely frustrating: they blame VMware, close the ticket, and will not address the actual issue. The reason they blame VMware is that late at night the SAN takes snapshots and replicas, and this seems to generate some errors that are logged during that process. When they see these errors, they immediately blame VMware and wash their hands of it without considering the actual circumstances of this unrelated situation.

Here are some facts around the events:
1) There are NO events logged on the VMware side of things during these outages.
2) There are many other servers in the VMware environment. A lot of them (GroupWise, for example) are MUCH busier systems with heavier disk I/O than this one.
3) There are no events logged on the SAN itself.
4) I can't find anything out of the ordinary logged in other service log files related to NSS or otherwise.
5) By the time I become aware of the issue, it has resolved itself.
6) It is very disruptive to the user environment due to their user profiles being stored on the NSS volumes.

I'm basically at a loss because I can't find anything to use to troubleshoot it. The intermittent nature also makes it difficult to get a packet trace. Sometimes it goes for weeks without happening; other times we see it three times in two days.
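
Since the outage is over before anyone can look at the server, one option is a small watcher that records evidence while the load is high. On Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable tasks, so a list of D-state tasks during a spike may show what is blocked. The sketch below is an assumption-laden illustration, not a tested tool: the threshold, polling interval, and log path are hypothetical values, and it reads only the standard /proc/loadavg and /proc/<pid>/task/<tid>/stat files.

```python
import os
import time


def parse_load1(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc stat file.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (tid, comm) for every thread currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    _, comm, state = parse_stat(fh.read())
            except (IOError, OSError, ValueError):
                continue  # thread exited or stat was unreadable
            if state == "D":
                found.append((int(tid), comm))
    return found


def watch(threshold=10.0, interval=5, log_path="/var/tmp/load-spike.log"):
    """Poll the load forever; append D-state tasks to log_path during spikes.

    All three defaults are hypothetical and should be tuned for the server.
    """
    while True:
        with open("/proc/loadavg") as fh:
            load1 = parse_load1(fh.read())
        if load1 >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            with open(log_path, "a") as out:
                out.write("%s load1=%.2f\n" % (stamp, load1))
                for tid, comm in blocked_tasks():
                    out.write("  D-state tid=%d comm=%s\n" % (tid, comm))
        time.sleep(interval)

# To use it, call watch() from a root shell or an init script and leave it running.
```

If the same kernel threads or daemons show up in state D during every outage, that would at least give support something concrete to look at.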

I appreciate any thoughts.