OES2/Linux
SLE-10-x86_64-SP3 + "online updates"
Linux mail-poa2 2.6.16.60-0.62.1-smp #1 SMP Mon Apr 12 18:53:46 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
novell-nss-32bit-4.12.2969-0.6.1
novell-nss-admin-session-openwbem-provider-0.2.6-15
novell-nss-4.12.2969-0.6.1

Following an apparent NSS error, the volume was inaccessible and the server hung, causing ASR to kick in and reboot the box.

from /var/log/messages:
Code:
Jun 25 10:06:49 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:06:49 mail-poa2 kernel: err=20300 beastHash.c[632]
...
Jun 25 10:10:04 mail-poa2 kernel: err=20300 beastHash.c[632]
...
Jun 25 10:10:17 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:10:17 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:10:17 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:10:17 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:10:18 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:10:18 mail-poa2 kernel: err=20300 beastHash.c[632]
Jun 25 10:41:10 mail-poa2 syslog-ng[4243]: syslog-ng version 1.6.8 starting
...
The server was restarted by HP's ASR function; users had reported problems accessing the post office, which resides on the NSS volume, beginning around 10 a.m.

Google, TID, and forum searches don't seem to turn up any documentation on this error, and there are no additional logs in /var/opt/novell/log/nss/.
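
For reference, this is roughly how we went looking for the error on the box itself (standard SLES/OES2 log locations; rotated copies of messages may be compressed, so use zgrep/bzgrep on those as appropriate):

Code:
# how far back does the error go in the current syslog?
grep 'err=20300' /var/log/messages

# NSS-specific log directory -- empty in our case
ls -la /var/opt/novell/log/nss/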

Physical storage is a fibre-attached SAN. There do not appear to be any errors reported on the SAN, and the other hosts attached to it did not have any issues.
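
For what it's worth, this is the sort of host-side check we did for storage trouble, and nothing turned up. The driver names in the grep (qla/lpfc) are just the usual QLogic/Emulex prefixes, so adjust for whatever HBA you have, and multipath -ll only applies if you run device-mapper multipathing:

Code:
# any SCSI/FC or I/O errors logged on the host around the time of the hang?
dmesg | grep -iE 'scsi|i/o error'
grep -iE 'scsi|i/o error|qla|lpfc' /var/log/messages

# if multipathing is in use, confirm all paths are still active
multipath -ll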

When the server did come back up, we had corruption in several of the GroupWise databases, which we recovered by restoring ngwguard.db and running GWCheck to repair some user and message databases.
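
In case anyone hits the same thing, the recovery went roughly as follows. The post office path is just an example (ours sits under /media/nss on the NSS volume), we restored ngwguard.db from backup (the ngwguard.fbk fallback copy the POA maintains is another option), and the GWCheck structure/contents fixes were run against the databases named in the POA log:

Code:
# stop the GroupWise agents before touching the post office databases
rcgrpwise stop

# example post office path on the NSS volume -- substitute your own
cd /media/nss/GWVOL/po

# keep the damaged guardian database, then restore a good copy
cp -p ngwguard.db ngwguard.db.damaged
cp -p /path/to/backup/ngwguard.db .

# run GWCheck against the user*.db and msg*.db files named in the POA log,
# then bring the agents back up
rcgrpwise start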

It turns out that the GroupWise POA log began reporting dozens of database errors a few minutes before the first "err=20300 beastHash.c[632]" entry showed up in /var/log/messages:

Code:
10:02:09 352 Possibly damaged blob in database = userj50.db
10:02:29 352 Possibly damaged blob in database = userjg0.db
10:02:47 552 Performing database maintenance: userj50.db
10:02:47 552 Database rebuild caused by error: [C022]
10:02:47 880 Performing database maintenance: userj50.db
10:02:47 880 Database rebuild caused by error: [C022]
10:02:50 352 Possibly damaged blob in database = userjj0.db
10:02:56 552 Error: Database maintenance in progress [C057]
10:04:07 880 Performing database maintenance: msg211.db
10:04:07 880 Database rebuild caused by error: [C04F]
10:04:07 184 Performing database maintenance: msg211.db
10:04:07 184 Database rebuild caused by error: [C04F]
10:04:07 576 Performing database maintenance: msg211.db
10:04:07 576 Database rebuild caused by error: [C04F]
10:04:07 784 Performing database maintenance: msg59.db
10:04:07 784 Database rebuild caused by error: [C04F]
10:04:07 552 Performing database maintenance: userjj0.db
10:04:07 552 Database rebuild caused by error: [C04F]
10:04:07 912 Performing database maintenance: msg211.db
10:04:07 912 Database rebuild caused by error: [C04F]
10:04:07 064 Performing database maintenance: msg211.db
10:04:07 064 Database rebuild caused by error: [C04F]
10:04:07 528 Performing database maintenance: msg211.db
10:04:07 528 Database rebuild caused by error: [C04F]
10:04:09 352 Possibly damaged blob in database = userfr0.db
10:04:17 184 Performing database maintenance: msg59.db
10:04:17 184 Database rebuild caused by error: [C04F]
...
So the question is: does this error indicate a problem with NSS, a problem with the underlying storage subsystem, or something else? Did a GroupWise repair function overwhelm NSS, or did some NSS/disk issue that never got logged force GroupWise to begin the database rebuilds? Obviously we want to avoid a recurrence of the issue.
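
If it helps with suggestions: the only NSS-level health check I'm aware of is a pool verify, which we're considering running in a maintenance window. The syntax below is from memory and the pool name is made up, so please correct me if this is wrong for OES2:

Code:
# read-only consistency check of the NSS pool (ravsui is the standalone
# verify/rebuild tool on OES Linux, as I understand it; the pool needs to
# be deactivated first)
ravsui verify GWPOOL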

Thanks in advance,
Mike