I have a strange problem that I'm truly baffled by. Here is the situation.

Have a customer moving off NetWare 6.5 to OES Linux 11 SP1. Customer is
not real big, has about 3 or 4 NetWare servers still and about 5 OES
Linux servers. Servers are spread out across a wide area network
(MPLS). I was recently switching the SLP DAs off the NetWare servers
and making several of the OES Linux servers DAs. Pretty much all
standard stuff I've done many times for other customers. During this, I
noticed a fair amount of -625, transport failure, errors being reported
during a sync status check. I thought that was odd, so I started
investigating. This turned out to be an all day ordeal.

For some reason, one particular site on the MPLS network seems to
constantly be reporting -625 and sometimes -626 (all referrals failed)
errors to various other servers throughout the tree. My first
inclination is of course network issues and/or name resolution problems.
So that is where I started investigating. But as far as I can tell,
everything seems fine. I can get a Limber process to complete properly
and things seem fine for a while and then the 625/626 errors return.
This of course makes partition sync problematic. Not to mention, the
users travel between sites, so I need good replication. The tree is
very small, no more than a few thousand objects.

My experience tells me this is a network issue. But I've run tests
and made sure TCP 524 is open between all sites. DNS short names
resolve properly on all servers, and even the SLP backup files on the OES
DAs show all proper registrations. Yet I'm constantly seeing these
625/626 errors.
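
For anyone who wants to repeat the name resolution and TCP 524 checks
from several servers, they could be scripted roughly like this. This is
only a sketch, not the tests actually run here: the function name
check_host and the 3 second timeout are assumptions made for
illustration.

```python
import socket

NCP_PORT = 524  # NCP over TCP, the port mentioned above


def check_host(name, port=NCP_PORT, timeout=3.0):
    """Resolve a short name and attempt one TCP connect.

    Returns (ip_or_None, connected, error_text).
    """
    try:
        ip = socket.gethostbyname(name)
    except OSError as exc:  # covers socket.gaierror and socket.herror
        return None, False, "resolve failed: %s" % exc
    try:
        # A bare connect only proves the port is reachable right now.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True, ""
    except OSError as exc:
        return ip, False, "connect failed: %s" % exc
```

Calling check_host() with each server's short name, from each site,
gives a quick pass/fail matrix. A single successful connect does not
rule out intermittent loss or reordering on the WAN, so it would have to
be run repeatedly to say anything about errors that come and go.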

We patched all the servers up to the latest (as of Friday), no

So then I started watching DSTrace and watching the connection requests,
and I saw this:

request DSAResolveName by context 482a0013 succeeded
ConnRequest: received bad seq 10, expected 11, server conn 20,
Socket Recv(2): err: failed, invalid response (-708).
Socket Recv(2a): rxSize: 16 frags->len: 16.
ConnRequest: TRANSPORT FAILURE! Breaking remote connection 20
tcp: (seq# 11).
Total Addresses in bad address cache: 2
ncp request, verb: 104 by context 482a0013 failed, transport failure
ConnClose: TCP - open conns: 47, open sockets: 47

That seems to be the crux of the problem, but I don't know what it means
exactly. What is "Invalid Response (-708)"? What does that mean?
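
To see whether the bad sequence numbers cluster on particular
connections, a saved DSTrace log could be tallied with something like
the sketch below. The regular expression is based only on the one
"received bad seq" line quoted above, so it is an assumption that other
DSTrace output uses the same wording; count_bad_seq is a made-up helper
name.

```python
import re
from collections import Counter

# Pattern modeled on the single trace line quoted in this post.
BAD_SEQ = re.compile(r"received bad seq (\d+), expected (\d+), server conn (\d+)")


def count_bad_seq(lines):
    """Count 'received bad seq' lines per server connection number."""
    counts = Counter()
    for line in lines:
        match = BAD_SEQ.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


sample = [
    "request DSAResolveName by context 482a0013 succeeded",
    "ConnRequest: received bad seq 10, expected 11, server conn 20,",
    "Socket Recv(2): err: failed, invalid response (-708).",
]
print(count_bad_seq(sample))  # Counter({20: 1})
```

If nearly all hits land on connections to servers at the one problem
site, that would point at the path to that site rather than at eDir
itself.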

There are no firewalls between any of the sites and none of the servers
have the local SuSE firewall running.

We'll open an SR on Monday, but my gut keeps telling me some sort of
network issue, yet I can find no evidence of that other than eDir

Any ideas?



