Here are some symptoms and observations from our cluster; I'm hoping someone can give me an idea of where to look and what might be wrong. The problems are not new and have been an issue since early on.

Environment:
OES1 SP2 - Linux
14-node cluster, about 70 cluster resources
4 GB RAM, Dell 1950 servers
Hosts file, print, and GroupWise services
GroupWise is on the last 4 servers
File systems are on the SAN, using NSS
GroupWise uses ReiserFS

Behavior of cluster resources and needing to reboot cluster:
We've had to reboot the cluster several times since installation due to stability issues.
Last time, we went 270 days without a reboot.
What eventually happened was that iPrint would not load.
Then GroupWise resources started going comatose or not responding.
The problem kept getting worse, so we had to reboot the whole cluster.
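
For reference, when things start going south, this is roughly what we run on a node to check on things (the resource and node names below are made up for the example):

    cluster view          # membership as this node sees it, and which node is the master
    cluster resources     # list the resources the cluster knows about
    cluster status        # each resource, its state (Running/Comatose/Offline) and where it is

and then, for a comatose resource, something like:

    cluster offline HS1_SERVER
    cluster online HS1_SERVER node13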

Behavior of cluster on reboot:
We have found that if the 1st server comes up as the Master IP node, the cluster is not stable: GroupWise resources will not load, and things just go comatose.
We have found it works best if the 13th or 14th server comes up as the Master IP node (maybe the 10th server would be okay as master as well?). Sometimes we have to reboot a 2nd time to get all cluster resources back up properly. Shouldn't any of the 14 servers be able to come up as the Master IP node?
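
In case it helps, the only way I know to check which node ended up as the master after a reboot is to run this on any node:

    cluster view

As I recall, the output lists the member nodes and flags which one is the master. I don't know of a way to force a particular node to become master other than controlling the order the servers come up in, which is why we stage the reboots the way we do.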

On last cluster reboot:
Most cluster resources came up comatose on this last reboot, which was done for Dell hardware maintenance.

I was not part of the group that rebooted the cluster this last time, but the usual procedures for rebooting the cluster were followed. The cluster was down and offline for several hours during the hardware maintenance.

Errors and problems with cluster resources not loading, Error #17:
I have noticed that an error #17 is generated whenever a cluster resource has problems loading. I have followed the Novell TID on error #17, which says that error #17 is caused by resources trying to use the same load number.
Using the NCPCON procedures, I was able to determine which server had the resource name in its cache, and I can reboot that server to clear its cache. I've not tried the cluster leave and cluster join procedure, but I think that effectively happens during a cluster server reboot anyway.
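
For what it's worth, my understanding of the leave/join procedure I haven't tried is simply this, run on the affected node (resources on that node should migrate or go offline while it is out of the cluster):

    cluster leave
    cluster join

I've been rebooting the node instead, which I assume amounts to much the same thing as far as clearing that cache goes.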

Examining load scripts:

I've closely examined the load and unload scripts in the two places on the servers where they are stored (/var/opt/novell/ncs/... or something like that; I don't have the specifics since I'm not in front of the servers).
I do not see any discrepancy when comparing them with the resource load scripts in iManager.
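
The comparison I did was basically just listing the script files on each node and eyeballing a few of them against what iManager shows, along these lines (path and file naming from memory, resource name made up):

    ls -l /var/opt/novell/ncs/*.load /var/opt/novell/ncs/*.unload
    cat /var/opt/novell/ncs/HS1_SERVER.load

and, to compare the same script between two nodes:

    ssh node14 cat /var/opt/novell/ncs/HS1_SERVER.load | diff /var/opt/novell/ncs/HS1_SERVER.load -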

Last 4 cluster servers do have suspect load scripts:
I did notice that on one or two of the last 4 servers there does seem to be a discrepancy in the load scripts. It appears to me that when the cluster was set up, some cluster resources were temporarily created and loaded on one or two of the last cluster servers. Those servers still hold load scripts for file-service cluster resources with numbers that are incorrect or that conflict with other cluster resources. However, since these last 4 servers only host GroupWise, would these leftover load scripts cause issues for cluster resources trying to load on the other cluster servers?

Example:
Highschool #1 has a load script on server #14 using number 222.
But Highschool #1 should never load on server #14 and thus should never attempt to run this load script. Am I wrong in my thinking? In production, a middle school resource uses number 222 (see the sketch below for what I mean).
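
To show what I mean by "number 222", the suspect script on server #14 is the usual NSS-style load script, roughly like the following (typed from memory, so the pool/volume/resource names and the IP address are made up; the 222 is the number I'm referring to):

    #!/bin/bash
    . /opt/novell/ncs/lib/ncsfuncs
    exit_on_error nss /poolact=HS1_POOL
    exit_on_error ncpcon mount HS1_VOL=222
    exit_on_error add_secondary_ipaddress 10.1.1.51
    exit_on_error ncpcon bind --ncpservername=CLUSTER-HS1-SERVER --ipaddress=10.1.1.51
    exit 0

The middle school resource that really is in production uses that same 222 in its own load script, which is where I suspect the conflict comes from.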

Questions:
1. Okay, I've heard a few times that we should upgrade to OES2. We have a future project that will deal with this.
2. However, is what I describe "normal" behavior for an OES1 SP2 cluster?
3. What does an error #17 mean? Are there other things besides load script conflicts that can cause it?
4. With server #11 or #14 holding the suspect load scripts, is this potentially causing us a problem?
5. Should I just be able to delete the load.in and load.out script files? The files have an * at the end and don't seem to be normal files (see the listing below). I'd like to clean up the erroneous load and unload scripts.
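
The listing I'm talking about looks something like this (names made up; I assume the trailing * is just ls marking the files as executable, but I wanted to ask before deleting anything):

    ls -F /var/opt/novell/ncs/
    HS1_SERVER.load*  HS1_SERVER.unload*  ...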

Any insight or suggestions you can provide would be greatly appreciated.

Thanks,
Lin