NCS was set up and running on a two-node cluster with heartbeats on a separate network. It worked fine for approximately two weeks, but I have been having problems since yesterday. The servers are patched and have the latest version of NCS.

Yesterday, we observed that a number of services on the master had stopped (iFolder, NCP access, etc.). These are set up as resources on the cluster. It appeared that the server had crashed, as it was not responding to any commands, and a forced shutdown had to be done. The only CPU-intensive work at the time was 6 PBS jobs running on it. It is a dual quad-core Xeon machine.
- After booting again, the master went into a continuous reboot cycle.
- Surprisingly, the resources did not fail over to the slave node. This might be because the master had not completely crashed; I could still ping it from the slave.
- As I wasn't sure how to stop the master from rebooting, I tried shutting down the slave node, with the idea of doing a clean reboot of both nodes.
- There were messages in the slave server's logs saying that it was sending a poison pill to the master node.
- After that, the slave node also entered a reboot cycle.

I can only boot with NCS turned off. I cannot access the Cluster Manager from iManager because NCS is not running. I do have access through ConsoleOne, but that is of no use.

sbdutil -v shows the following:

Code:
Cluster (SBD) partition on /dev/evms/.nodes/ericluster.sbd
Signature  # HeartBeat State eState  Epoch   SbdLock Bitmask
SBD*       0 00403241  LIVE  PILL        5      UNLK 00000003
SBD*       1 00403500  LEFT              6      UNLK 00000002
SBD*       2 00000001                    0      UNLK 00000000
SBD*       3 00000001                    0      UNLK 00000000
SBD*       4 00000001                    0      UNLK 00000000
SBD*       5 00000001                    0      UNLK 00000000
SBD*       6 00000001                    0      UNLK 00000000
SBD*       7 00000001                    0      UNLK 00000000
SBD*       8 00000001                    0      UNLK 00000000
SBD*       9 00000001                    0      UNLK 00000000
SBD*      10 00000001                    0      UNLK 00000000
SBD*      11 00000001                    0      UNLK 00000000
SBD*      12 00000001                    0      UNLK 00000000
SBD*      13 00000001                    0      UNLK 00000000
SBD*      14 00000001                    0      UNLK 00000000
SBD*      15 00000001                    0      UNLK 00000000
SBD*      16 00000001                    0      UNLK 00000000
SBD*      17 00000001                    0      UNLK 00000000
SBD*      18 00000001                    0      UNLK 00000000
SBD*      19 00000001                    0      UNLK 00000000
SBD*      20 00000001                    0      UNLK 00000000
SBD*      21 00000001                    0      UNLK 00000000
SBD*      22 00000001                    0      UNLK 00000000
SBD*      23 00000001                    0      UNLK 00000000
SBD*      24 00000001                    0      UNLK 00000000
SBD*      25 00000001                    0      UNLK 00000000
SBD*      26 00000001                    0      UNLK 00000000
SBD*      27 00000001                    0      UNLK 00000000
SBD*      28 00000001                    0      UNLK 00000000
SBD*      29 00000001                    0      UNLK 00000000
SBD*      30 00000001                    0      UNLK 00000000
SBD*      31 00000001                    0      UNLK 00000000
-----------------------------------------------------------------
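To make the table above easier to scan, here is a rough Python sketch that picks out only the slots reporting a state. It is not part of NCS; the function name `parse_sbd_rows` and the field names are my own, and it assumes that when a row has only one value between HeartBeat and Epoch, that value belongs to the State column (as it appears to for slot 1 above).

```python
def parse_sbd_rows(text):
    """Parse rows of `sbdutil -v` output as pasted above (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows start with the signature "SBD*" followed by a slot number.
        if len(tokens) < 6 or not tokens[0].startswith("SBD"):
            continue
        if not tokens[1].isdigit():
            continue
        # First three tokens and last three tokens are always present;
        # whatever is left in the middle is State and/or eState.
        middle = tokens[3:-3]
        rows.append({
            "slot": int(tokens[1]),
            "heartbeat": tokens[2],
            # Assumption: a single middle value is State, not eState.
            "state": middle[0] if len(middle) >= 1 else "",
            "estate": middle[1] if len(middle) >= 2 else "",
            "epoch": int(tokens[-3]),
            "lock": tokens[-2],
            "bitmask": int(tokens[-1], 16),
        })
    return rows


sample = """\
SBD*       0 00403241  LIVE  PILL        5      UNLK 00000003
SBD*       1 00403500  LEFT              6      UNLK 00000002
SBD*       2 00000001                    0      UNLK 00000000
"""

# Show only the slots that report a state.
for row in parse_sbd_rows(sample):
    if row["state"]:
        print(row["slot"], row["state"], row["estate"] or "-", row["epoch"])
```

On my output only slots 0 and 1 report anything, which matches the two nodes in the cluster; slot 0 is the one carrying the PILL flag.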

Starting NCS manually after the system has booted also triggers a reboot.

Any suggestions as to how I can recover the systems?

Thanks,

Vimal