I'd like to use a few idle cycles of your brain power to troubleshoot my DNS servers.

The Problem:
Novell DNS on my two OES 2.2 DNS servers has been crashing at least once a day for the last week. It seems to happen more often between 8 and 11am, but it has been observed at all hours of the day.
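
To check whether the 8 to 11am clustering is real, the crash times could be tallied by hour of day. A minimal Python sketch, assuming the crash times have been noted by hand as "YYYY-MM-DD HH:MM" strings; the function name and the sample timestamps are made up for illustration and are not real observations:

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count crash timestamps (strings) per hour of day, sorted by hour."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return dict(sorted(hours.items()))


# Hypothetical timestamps for illustration only, not real crash data.
sample = [
    "2000-01-03 08:15",
    "2000-01-03 10:40",
    "2000-01-04 08:55",
    "2000-01-05 22:05",
]

if __name__ == "__main__":
    for hour, count in crashes_by_hour(sample).items():
        print(f"{hour:02d}:00  {count}")
```

If most counts land in the morning buckets, that points toward load at login time rather than something periodic such as the zone refresh.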

The Environment:
1. 2 physical OES 2.2 servers, 64-bit, 8GB RAM and multi-core processors. Neither server is a Master for any replica, but both have an entire set of replicas for the tree.
2. These 2 servers run eDirectory, DNS, and OpenSLP. One also runs radiusd. Those are their only functions.
3. There are two other DNS servers in the organization -- Windows 2008 and 2008 R2 -- which hold secondary copies of the same zones as the two main DNS servers. The Windows servers do control one zone set -- service records. They pull data from the Novell servers every ten minutes or so.
4. eDirectory refresh rate on Novell DNS was 5 minutes. I increased it to 20 to see if that stopped the crashing.
5. One subnet uses dynamic DNS, but only servers and a few clients are currently on this subnet.
6. One zone is handled primarily by an AD server at a school (the vocational school).
7. All other zones are controlled by Novell services.
8. Client DHCP subnets rank the 4 major DNS servers differently for balance. Only the 2 Novell servers can do recursive lookups to the outside Internet and the Windows servers go through them for unknown addresses.
9. 2 Standalone Windows 2003 servers for outside DNS queries (since we NAT). Not in this chain.

The Symptoms:
1. Problem is happening on both servers. Each server uses its own nds instance for LUM.
2. Sometimes novell-named will simply die and rcnovell-named will indicate death.
3. Sometimes our Orion monitor will show a degradation of DNS user experience (i.e. DNS is getting slow). Sometimes the server recovers from this and returns to normal utilization, and sometimes it is dead. Putting the debug level to 10 and looking at the named.run log indicates that during the slow times when it recovers, DNS records may not be refreshing from eDir. At these times, rcnovell-named status may indicate that all is well and all zones are fine.
4. On at least one of the servers namcd (NOT named) is a hog, sometimes using > 90% of a processor core before dropping to below 1%. LDAP is very responsive on both servers but LUM sometimes has trouble staying active for reasons that are not clear to me. There does NOT seem to be a DIRECT correlation with this -- I can id my user and the appropriate Novell system users even when the DNS server has problems.
5. Other times, rcnovell-named will either hang or give a connection refused message. In these cases, if the process has been running a while, it is necessary to do a kill -9 on the named process before running the rc script to bring it back up.
6. named never takes up more than 10% of processor utilization and is usually much lower than that.
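
Since the slow periods and the outright deaths look different from the server side, it may help to log query response times from a client's point of view and line them up against named.run afterwards. A minimal Python sketch, assuming plain UDP queries for an A record; the function names are made up for illustration, and the server address and lookup name in the usage note are placeholders, not values from this network:

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def probe(server, name, timeout=2.0):
    """Return the response time in seconds, or None on timeout or error."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and refused connections
            return None
        elapsed = time.monotonic() - start
    # Only count a reply whose ID matches the query that was sent.
    if data[:2] != packet[:2]:
        return None
    return elapsed


def format_result(server, elapsed, stamp):
    """Format one log line; elapsed is None when the probe failed."""
    status = "FAIL" if elapsed is None else f"{elapsed * 1000:.1f}ms"
    return f"{stamp} {server} {status}"
```

Run from cron every minute or so, something like `print(format_result("192.0.2.10", probe("192.0.2.10", "example.com"), time.strftime("%Y-%m-%d %H:%M:%S")))` appended to a file gives a timeline of slow and failed lookups per server.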

The Questions:
1. Is there some more effective logging I can be doing other than ramping up the debug on the named.run?
2. Are there any specific areas of server health I should be looking at to try to figure out why novell-named is crashing and otherwise going comatose on me?

Johnnie Odom
Network Services
School District of Escambia County