I have seen a couple of posts where ndsd crashes and ncp volumes are no longer available but I have not seen a post that came to a solution...
Hopefully someone can help me out...

I am going to apologize for the lengthy post ahead of time! :)

We have a 3-node cluster consisting of three OES2 SP2 on SLES10 SP3 servers. One of the cluster resources (ncs_data_home) holds all of our student home directories. Consistently, ndsd crashes on whichever node hosts that resource. We then have to restart ndsd.

The ncs_data_home cluster resource has over 30,000 home directories under a users folder.

We started out with 4GB of memory in each cluster node and found that ndsd would crash at least once every 2 days on the node that serviced home directories.

Increasing the memory to 8GB in each cluster node only reduced the frequency: ndsd now crashes roughly once every 3-5 days.


I have tried NSS tuning, as per the tuning guide in the Novell documentation.

I have tried edir tuning by increasing the database cache configuration to a hard limit that roughly matches the size of the dib set.

All to no avail.

We can cause ndsd to crash by logging in with an admin-level account, or an account that has RF rights to the directory that contains the home directories, and using Windows Explorer to display the home directory listing.

Sometimes I manage to bring up all the directories in Windows. It can take up to 25 minutes for the Explorer window to show all the directories.

When I do this, the ndsd and ndpapp processes show high utilization, as does memory. Monitoring the ndpapp process shows zombie statuses.
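For anyone who wants to capture the same symptom in a log rather than by eye, here is a rough Python 3 sketch that reads `ps` output and reports memory use and zombie state for the two processes. It assumes the process names are ndsd and ndpapp (taking the "ndapp" spelling above to mean the same process) and that the procps `ps -eo pid,stat,rss,comm` column layout is available; the function names are made up for illustration.

```python
import subprocess


def parse_ps(text, names):
    """Parse `ps -eo pid,stat,rss,comm` output for the given command names."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue
        pid, stat, rss, comm = parts
        # Zombies are listed as e.g. "ndpapp <defunct>", so keep the first word.
        name = comm.split()[0]
        if name in names:
            rows.append({
                "pid": int(pid),
                "comm": name,
                "rss_kb": int(rss),
                "zombie": stat.startswith("Z"),  # STAT beginning with Z = zombie
            })
    return rows


def snapshot(names=("ndsd", "ndpapp")):
    """Run ps once and return the matching rows (assumes procps ps is installed)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,rss,comm"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    return parse_ps(out, names)
```

Calling `snapshot()` every few seconds while the Explorer listing is loading, and writing the rows to a file with a timestamp, should give a record of when memory climbs and when the zombie entries first appear.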

Some other background info
-the cluster nodes are IBM HS21 blades with two quad-core 2.33 GHz Xeon procs connected to an IBM DS4700 using 4Gbps FC adapters (no multipathing).
-even though there are 30000+ home directories, the file space usage is under 50GB total with a total file count of under 400,000, since not all students utilize their student computer account.
-we have not LUM enabled our students accounts.
-we use novell storage manager to manage student home directory quotas.
-cluster nodes hold replicas of partitions that hold cluster, server and volume and user objects.
-cluster nodes are lum enabled.
-we were experiencing the same problems before we updated our servers from OES2 sp1 on SLES10.2.

I did a test of migrating a copy of the student home directories to a single OES2 sp2 on SLES10.3 server and found ndsd still crashes.
I performed a test whereby I gradually increased the number of student home directories under the \\servername\home\users directory and timed how long it took Windows Explorer to display the directory listing.

1000-2000 directories = a few seconds
5,000 directories = 30 seconds or so
10,000 directories = 5 minutes
15,000 directories = ~15 minutes
20,000+ directories = ~20+ minutes to complete, or I get a "not accessible" error message (which says I do not have permissions to use the network resource / the specified network name is no longer available), or ndsd crashes.
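The timings above were taken by hand in Windows Explorer. A scripted version of the same idea might look like the Python 3 sketch below: it grows a directory in steps and times one full listing at each step. The helper names and the `userNNNNN` naming are made up for illustration, and the demo runs against a local temporary directory. To test a server, the parent path would need to point at a mounted test volume, never at production data.

```python
import os
import tempfile
import time


def time_listing(path):
    """Return (entry_count, seconds) for one full listing of path."""
    start = time.perf_counter()
    with os.scandir(path) as entries:
        count = sum(1 for _ in entries)
    return count, time.perf_counter() - start


def make_test_dirs(parent, total):
    """Create empty subdirectories until parent holds `total` entries."""
    existing = len(os.listdir(parent))
    for i in range(existing, total):
        os.mkdir(os.path.join(parent, "user%05d" % i))


if __name__ == "__main__":
    # Local scratch directory only; step sizes are arbitrary examples.
    with tempfile.TemporaryDirectory() as parent:
        for total in (1000, 2000, 5000):
            make_test_dirs(parent, total)
            count, seconds = time_listing(parent)
            print("%6d directories listed in %.3f s" % (count, seconds))
```

This only measures raw enumeration. Explorer also requests per-entry information while it builds the view, so its times will likely be higher, but a scripted run makes it easier to see at which directory count the listing time stops growing linearly.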

This is on a test server that matches a running cluster node, has 6GB of memory, and has no users accessing it other than me.

I have ruled out the cluster being the problem. I am leaning towards eDirectory having a problem keeping track of the trustee info.

Anyway, hopefully someone can lend some insight before I have to log a call with Novell.