I'm searching for the optimal NCS configuration for a scarily
complicated setup: three nodes that share NSS volumes and an SBD
partition via iSCSI (NIC 1, VLAN 1). All nodes also have access to a
dedicated management network (NIC 2, VLAN 2) and to the production
network (NIC 3, VLAN 3). Everything works nicely until there is a
hiccup in one of the networks, which happens far more often than an
actual problem with one of the OES11 servers. So I'm looking for a
configuration that gives maximum tolerance for problems with the
network infrastructure.
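To make the layout easier to follow, here is a tiny conceptual model of the setup (the VLAN roles are from my configuration; the logic in `nodes_can_communicate` is my assumption about what *should* count as a working cluster path, not documented NCS behavior):

```python
# Conceptual model of the three-network setup described above.
# The behavior encoded here is an assumption for discussion, not NCS logic.

NETWORKS = {
    "vlan1": "iSCSI: shared NSS volumes and SBD partition (NIC 1)",
    "vlan2": "dedicated management network (NIC 2)",
    "vlan3": "production network (NIC 3)",
}

def nodes_can_communicate(failed_networks):
    """Nodes can still reach each other if at least one network is up."""
    return any(n not in failed_networks for n in NETWORKS)

def sbd_reachable(failed_networks):
    """The SBD partition is only reachable over the iSCSI network."""
    return "vlan1" not in failed_networks

# A failed production switch leaves two working paths between the nodes:
assert nodes_can_communicate({"vlan3"})
# An iSCSI outage takes away the SBD partition even though the LANs are fine:
assert not sbd_reachable({"vlan1"})
```

The point of the sketch: most single-network failures still leave the nodes able to talk to each other, which is why fencing in those cases feels unnecessary.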
My questions:
- What are best-practice values for the NCS Heartbeat Tolerance and
Slave Watchdog timeouts? For the time being I have set these to 16 s.
Will there be any problems with values in the range of minutes, other
than cluster resources being unavailable for a longer time after a
real failure?
- Is it possible to specify these values separately for each network?
Take the case where the switch of the production network has failed
but the one for the management network still works: the NCS hosts can
still communicate, and I would prefer them to do nothing and simply
wait for the failed network to come back. Instead, they send poison
pills once the Heartbeat Tolerance has expired.
- Finally, I'm searching for a solution for the case where the iSCSI
SAN fails. The last time this happened, for about 20 s, all nodes went
down, which of course interrupted all services. Is the time the NCS
hosts will wait for an unreachable SBD partition something I can
configure?
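To make the timing question concrete, here is the arithmetic I'm worried about (the 16 s tolerance and the 20 s SAN outage are my numbers from above; the assumption that an outage longer than the configured tolerance triggers fencing is mine, for illustration, not a documented NCS guarantee):

```python
# Rough timing model: a node gets fenced (poison pill / self-abend) when an
# outage on the path NCS is watching lasts longer than the configured
# tolerance. The exact threshold semantics are an assumption.

def survives_outage(outage_seconds, tolerance_seconds):
    """True if the outage ends before the tolerance window expires."""
    return outage_seconds < tolerance_seconds

# My current setting vs. the incident described above:
assert not survives_outage(outage_seconds=20, tolerance_seconds=16)
# A tolerance in the range of minutes would have ridden out the same outage:
assert survives_outage(outage_seconds=20, tolerance_seconds=120)
```

This is why I'm asking whether the SBD wait time is tunable at all: if it is, even a modest increase would have covered that 20 s SAN outage.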