Since moving our entire infrastructure from NetWare to OES, I've been trying to figure out a very strange issue. We are a K-12 school system with a data center at the central offices and a single server at each school. The data center hosts the servers that run all the centralized services, including the Master replicas for [root], the DAs for SLP, etc.

We noticed that sometimes a school's server will drop out of the SLP tables. When that happens, we also cannot SSH to the school's server from the central servers, although we can SSH to it from any other machine on the same subnet as the central servers. The central servers can ping it, but no UDP/TCP traffic passes between them (no web traffic, etc.). We finally realized it only happens when the ISP that serves that school has a brief blip or outage of some sort. The only way we have found to get the servers talking again is to reboot the central servers.
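
To document the symptom (ping answers, but TCP does not) from both a central server and another machine on the same subnet, a small probe like the following may help. This is only a sketch: the function names `probe_tcp` and `probe_host` are made up for illustration, and the port list (22 for SSH, 80 for web, 427 for SLP over TCP) is an assumption about which services matter here. It does not test UDP.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets are likely being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else (no route to host, name resolution failure, ...).
        return f"error: {exc}"


def probe_host(host, ports, timeout=3.0):
    """Probe several TCP ports on one host; returns {port: result}."""
    return {port: probe_tcp(host, port, timeout) for port in ports}
```

Calling `probe_host("school-server.example", [22, 80, 427])` (hypothetical hostname) from a central server and again from a neighboring machine would show whether the central server gets "timeout" where the neighbor gets "open", which is the pattern described above.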

We have tried:
  • Restarting almost all services on central servers
  • Stopping and restarting networking
  • Switching to runlevel 1 and then back
  • Stopping firewall and/or shutting it off completely
  • Checking hosts.deny files
  • Disabling virus scanner (McAfee Enterprise)
  • Changing VMware network drivers from E1000 (known issue with vSphere 5.1) to VMXNET3

All servers run on HP ProLiant hardware (blades at the central office, towers at the schools) running vSphere 5.1. Links between the central office and the schools are all 100 Mbps fiber via two different ISPs (depending on the school's location). A restart of the central office server fixes it every time. I am out of ideas now and at a loss as to why this happens when a WAN link goes down. We thought it was isolated to one ISP, but it turned out that this particular ISP is simply not as reliable as the other. We had an extended power outage over the weekend at a school served by the other ISP and saw the same issue with that server after power was restored.

Any ideas or suggestions on where else we should be looking? We also run ZCM 11 at each school and the central office. When this happens, the ZCM servers never lose connectivity with each other or with the OES servers on each end. The ZCM servers run on the same vSphere hosts as the OES servers.