Hello everybody,

We have a SLES 10 SP3 + OES2 cluster with 2 nodes, both running as virtual machines.
Both nodes connect to the SAN over iSCSI.
Cluster protocol settings:
Heartbeat: 1
Tolerance: 8
Master Watchdog: 1
Slave Watchdog: 8
Max Retransmits: 30

I take a snapshot of one node (let's say node1) without the VM's memory and delete it afterwards. At some point while the snapshot is being created or deleted, node1 dies because it gets a poison pill from node2 (according to node2's logs). Apparently node2 cannot reach node1 during the snapshot process.
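
To illustrate what I suspect is happening, here is a rough Python sketch. It assumes the heartbeat and tolerance values above are in seconds, and that a node is cast out once it has been silent for longer than the tolerance. The names `ClusterProtocol` and `would_be_cast_out` are made up for illustration and are not part of any cluster tool.

```python
from dataclasses import dataclass


@dataclass
class ClusterProtocol:
    """Hypothetical container for the settings listed above (assumed seconds)."""
    heartbeat: int = 1
    tolerance: int = 8


def missed_heartbeats(stall_seconds: float, proto: ClusterProtocol) -> int:
    """Number of heartbeats a node fails to send while it is frozen."""
    return int(stall_seconds // proto.heartbeat)


def would_be_cast_out(stall_seconds: float, proto: ClusterProtocol) -> bool:
    """Assumption: a node silent for longer than the tolerance gets a poison pill."""
    return stall_seconds > proto.tolerance


if __name__ == "__main__":
    proto = ClusterProtocol()
    # Try a few hypothetical stall lengths caused by the snapshot.
    for stall in (3, 8, 12):
        print(stall, missed_heartbeats(stall, proto), would_be_cast_out(stall, proto))
```

If this model is roughly right, any freeze of the VM longer than the tolerance during the snapshot would be enough to get node1 cast out, and a larger tolerance would only help if it exceeds the longest stall.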

Now I have 2 questions:

1. Should I change the protocol settings (heartbeat, tolerance, ...) to give the snapshot a little more time? If so, which parameters should I change?

2. After node1 receives the poison pill, it just sits there at 100% CPU and is completely unresponsive (it does not even answer pings). How can I configure the nodes so that they reboot automatically after receiving a poison pill?