This is especially likely to happen with a two-node DAG that uses a File Share Witness (FSW). When network latency increases to the point that the cluster heartbeat threshold is reached, the node furthest from the FSW goes offline. The node on the same LAN as the FSW stays online because it maintains quorum.
Two cluster properties control how heartbeats between subnets are used to judge cluster health:
- CrossSubnetDelay specifies the heartbeat interval (in milliseconds) between subnets. The default is 1000 milliseconds (1 second).
- CrossSubnetThreshold specifies how many heartbeats can be missed between subnets before cluster failover (or failure) occurs. The default is 5 heartbeats.
If WAN latency causes unexpected cluster failovers or failures, set CrossSubnetDelay to its maximum value of 4000 milliseconds (4 seconds) and CrossSubnetThreshold to its maximum value of 10. With these settings the cluster will not fail over or fail until 10 heartbeats, sent 4 seconds apart, have been missed (40 seconds).
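The arithmetic behind those figures is simply the heartbeat interval multiplied by the number of missed heartbeats tolerated. Here is a minimal sketch of that calculation; Get-HeartbeatToleranceSeconds is a hypothetical helper written for illustration, not a cmdlet from the failover clustering module, and it does not touch the cluster.

```powershell
# Hypothetical helper (not part of any Windows module): converts a heartbeat
# interval and a missed-heartbeat threshold into a tolerance window in seconds.
function Get-HeartbeatToleranceSeconds {
    param(
        [int]$DelayMs,    # heartbeat interval in milliseconds (CrossSubnetDelay)
        [int]$Threshold   # missed heartbeats tolerated (CrossSubnetThreshold)
    )
    return ($DelayMs * $Threshold) / 1000
}

# Defaults described above: 1000 ms interval, 5 missed heartbeats
$defaultTolerance = Get-HeartbeatToleranceSeconds -DelayMs 1000 -Threshold 5

# Maximums described above: 4000 ms interval, 10 missed heartbeats
$maximumTolerance = Get-HeartbeatToleranceSeconds -DelayMs 4000 -Threshold 10

"Default tolerance: $defaultTolerance s, maximum tolerance: $maximumTolerance s"
```

In other words, the defaults give the cluster about 5 seconds of missed cross-subnet heartbeats before it reacts, while the maximum values stretch that window to 40 seconds.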
To apply these settings from PowerShell, do the following:
- On one of the DAG members, open Windows PowerShell Modules in Administrative Tools. This launches PowerShell and imports all the Windows PowerShell modules for installed features, including the new Windows Failover Cluster module.
- Run the following one-liner to configure the maximum values:
$cluster = Get-Cluster; $cluster.CrossSubnetThreshold = 10; $cluster.CrossSubnetDelay = 4000
- Check your settings using the following cmdlet:
Get-Cluster | fl *
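If you prefer to see the steps spelled out rather than as a one-liner, the same change can be sketched as a short script. This assumes an elevated PowerShell session on a DAG member and that the failover clustering module is available under the name FailoverClusters; the "before" line and the narrowed property list are additions for readability, not requirements.

```powershell
# Assumption: run in an elevated session on one DAG member.
# Loads the failover clustering cmdlets if they are not already imported.
Import-Module FailoverClusters

# Grab the local cluster object and show the current heartbeat settings
$cluster = Get-Cluster
"Before: delay=$($cluster.CrossSubnetDelay) ms, threshold=$($cluster.CrossSubnetThreshold)"

# Apply the maximum values discussed above
$cluster.CrossSubnetThreshold = 10
$cluster.CrossSubnetDelay = 4000

# Show only the heartbeat-related properties instead of the full property list
Get-Cluster | Format-List Name, CrossSubnetDelay, CrossSubnetThreshold
```

Limiting Format-List to a few named properties makes it easier to confirm the two values at a glance than scanning the full output of fl *.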
Since cluster properties are instantly replicated between all nodes in the cluster, this only needs to be configured from one node in the DAG. The changes go into effect immediately and there is no need to restart services or the server.
Note that you can configure the same properties using cluster.exe, but I'm using PowerShell here because cluster.exe is deprecated in Windows Server 2008 R2.