We are running four OES 11 SP2 (patched to January 2015) virtual servers (12 GB RAM, 6 procs each) on ESX 5.5. We only use this cluster for NSS shares with CIFS and AFP enabled. We have an intermittent issue where the NSS shares on certain cluster members become unreachable. What we have observed is the following:

1. Cannot SSH to the server: we get a password prompt, but nothing is ever returned after that
2. eDir shows -625 errors for the server
3. Cannot log in to the server via vSphere
4. No one can map drives to resources hosted on this node
5. Nagios complains that all swap is being used
6. Eventually users cannot authenticate: "ERROR: CODIR: treeLoginUser: Failed to connect to local DS Agent for user: USERNAME, context: 1908998149, error: -625"
7. Other CIFS errors appear: "ERROR: AUTH: SEV maintenance: Retrieval of SEV list has failed for the user: DIFFERENT USER, context: 1908998146, error: -625"
8. Then we get looping CIFS errors of "CRITICAL: AUTH: Failed to connect to eDirectory. ServerIP:,Error: -625" and "CRITICAL: AUTH: Error in connecting to Local DS Agent: -625"
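
Since the swap alert is the one symptom we can watch for before the node locks up, something like the following could be run from cron to record which processes are holding swap. This is only a rough sketch, not something we have in production: it assumes Python is available on the nodes and that the kernel exposes VmSwap in /proc/<pid>/status, and the function names are made up for illustration.

```python
#!/usr/bin/env python
"""Print total swap in use and the processes holding the most swap.

Sketch only: assumes a Linux /proc filesystem whose per-process status
files include a VmSwap line. All function names here are hypothetical.
"""
import os


def parse_kb_fields(text):
    """Parse 'Key:   123 kB' lines into a dict of {key: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the exact form "<key>: <digits> kB".
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_text(path):
    """Return the file contents, or '' if it vanished or is unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except (IOError, OSError):
        return ""


def swap_used_kb(meminfo_text):
    """Swap in use, computed as SwapTotal - SwapFree from /proc/meminfo text."""
    info = parse_kb_fields(meminfo_text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def top_swap_users(proc_root="/proc", limit=5):
    """Return up to `limit` tuples (swap_kB, pid, name), largest first."""
    usage = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        status_text = read_text(os.path.join(proc_root, entry, "status"))
        swap_kb = parse_kb_fields(status_text).get("VmSwap", 0)
        if not swap_kb:
            continue
        name = ""
        for line in status_text.splitlines():
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
                break
        usage.append((swap_kb, int(entry), name))
    usage.sort(reverse=True)
    return usage[:limit]


if __name__ == "__main__":
    print("swap used: %d kB" % swap_used_kb(read_text("/proc/meminfo")))
    for swap_kb, pid, name in top_swap_users():
        print("%8d kB  pid %-6d %s" % (swap_kb, pid, name))
```

Appending the output to a log file every few minutes should at least show whether one process is growing in the run-up to a hang.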

These errors continue until we reboot the server via vSphere.

Clustering doesn't migrate the volumes until the reboot. If you do manage to SSH in (very slow), you cannot migrate the resources from the command line or from iManager; the resources go comatose.

Any ideas?