We have now encountered NSS corruption during a RAID5 drive failure on 2
identical servers in the last couple of months, and are wondering if
anyone else has experienced this, or has any ideas on what the culprit
may be.
We're running NW6.5sp3 with the CPR server.exe on identical Dell
PowerEdge 2850 servers w/ 6GB RAM, PERC-4dc controllers, and (6) 147GB
disks in each array.

During each event, one of the 6 drives in our internal RAID array failed.
On one server, we started experiencing NSS 20012 block errors while the
drive was still in a failed state. On the other server, we started
experiencing these errors after the drive had been replaced and the array
had successfully rebuilt onto the new drive. The RAID controller was
unaware of ANY errors on any drives, and the ttylogs from the PERC
controller showed nothing unusual.

After several of these errors, the pool deactivated. Attempts to run a
verify on the pool caused multiple abends and required
a power-off. A rebuild on the pool produced a warning that about 30MB
worth of files would be lost (continue Y/N?), at which point we answered
no while we assessed what our options might be.

A second rebuild on the pool seemed to run fine with no warnings or
errors, and we restarted the server and activated the volumes.

At this point, we don't have any confidence in our RAID system, and are
trying to figure out what we can do to prevent this from occurring
again.

TIA for any insight as to what could be the cause for this.
-Phil



I've included some additional info below:

The systems were patched and installed in July 05.
The system BIOS is A02
The PERC BIOS is 1.1 (updated from perc4-FWP4351S-A08.exe)
The PERC firmware is 351s rev A08
The Pedge3.ham is ver 7.02.03

We will be loading system BIOS ver A04,
a slightly newer PERC firmware, 351S rev A09 (10/17/05),
and a newer pedge3.ham, 7.02.06 (09/30/05).

Based on my discussions with Dell, none of these updates appears to
address the problem we're seeing, and none of us thinks they will
resolve the issue.


Snippets of NSS.log errors:

24-oct-2005 23:03:42 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnPool.c[2614]
Oct 24, 2005 11:03:42 pm NSS<COMN>-3.22-xxxx:
Pool SYS: System data error 20012(beastTree.c[506]). Block
134986293(file block -134986293)(ZID 1)

24-oct-2005 23:03:42 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnVol.c[8587]
Oct 24, 2005 11:03:42 pm NSS<COMN>-3.22-xxxx:
Volume DATA: System data error 20012(beastTree.c[506]). Block
134986293(file block -134986293)(ZID 1)

24-oct-2005 23:03:46 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnPool.c[2614]
Oct 24, 2005 11:03:46 pm NSS<COMN>-3.22-xxxx:
Pool SYS: System data error 20012(beastTree.c[506]). Block
135024699(file block -135024699)(ZID 1)

24-oct-2005 23:03:46 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnVol.c[8587]
Oct 24, 2005 11:03:46 pm NSS<COMN>-3.22-xxxx:
Volume DATA: System data error 20012(beastTree.c[506]). Block
135024699(file block -135024699)(ZID 1)

24-oct-2005 23:05:42 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnPool.c[2614]
Oct 24, 2005 11:05:43 pm NSS<COMN>-3.22-xxxx:
Pool SYS: System data error 20012(beastTree.c[506]). Block
22687319(file block -22687319)(ZID 1)

24-oct-2005 23:05:42 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnVol.c[8587]
Oct 24, 2005 11:05:43 pm NSS<COMN>-3.22-xxxx:
Volume DATA: System data error 20012(beastTree.c[506]). Block
22687319(file block -22687319)(ZID 1)

24-oct-2005 23:13:16 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnPool.c[2614]
Oct 24, 2005 11:13:16 pm NSS<COMN>-3.22-xxxx:
Pool SYS: System data error 20012(nameTree.c[45]). Block
333916(file block -333916)(ZID 6)

24-oct-2005 23:13:16 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnVol.c[8587]
Oct 24, 2005 11:13:16 pm NSS<COMN>-3.22-xxxx:
Volume DATA: System data error 20012(nameTree.c[45]). Block
333916(file block -333916)(ZID 6)

24-oct-2005 23:13:16 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnVol.c[8587]
Oct 24, 2005 11:13:17 pm NSS<COMN>-3.22-xxxx:
Volume DATA: System data error 20012(nameTree.c[45]). Block
333916(file block -333916)(ZID 6)

...

25-oct-2005 07:13:24 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:24 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 89425449, status=20206.

25-oct-2005 07:13:24 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:24 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 89425449, status=20206.

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] userTree.c[1960]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1461:
Unable to adjust user space (count will now be wrong).
Error=20206

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] userTree.c[1960]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1461:
Unable to adjust user space (count will now be wrong).
Error=20206

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 134171637, status=20206.

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 134171637, status=20206.

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 89403651, status=20206.

25-oct-2005 07:13:26 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zlssLogicalVolume.c[4722]
Oct 25, 2005 7:13:26 am NSS<ZLSS>-3.22-1449:
<<< BAD MESSAGE >>>

25-oct-2005 07:13:26 lvl="LOG_INFO" [ZLSS.NSS - v3.22-0]
[MSAP] comnLog.c[187]
Pool "SYS" - MSAP deactivate.
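
In case it helps anyone compare notes against their own logs, here is a rough Python sketch for boiling an NSS.log down to the distinct block numbers reported per error code. It assumes entries are separated by blank lines and worded like the snippets above. The function name summarize_nss_log and the two regular expressions are my own, not part of any Novell tool, and other NSS versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modeled only on the two message shapes quoted in this post (assumption).
SYSTEM_DATA_ERROR = re.compile(
    r"(?:Pool|Volume) \w+: System data error (\d+)\(\w+\.c\[\d+\]\)\. Block\s+(\d+)"
)
VOLUME_DATA_ERROR = re.compile(r"Error reading VolumeData Block (\d+), status=(\d+)")


def summarize_nss_log(text):
    """Return {error_code: sorted list of distinct block numbers}."""
    blocks_by_code = defaultdict(set)
    # Entries are separated by blank lines; a message may wrap across lines,
    # so each entry is flattened to a single line before matching.
    for entry in re.split(r"\n\s*\n", text):
        flat = " ".join(entry.split())
        match = SYSTEM_DATA_ERROR.search(flat)
        if match:
            blocks_by_code[int(match.group(1))].add(int(match.group(2)))
            continue
        match = VOLUME_DATA_ERROR.search(flat)
        if match:
            blocks_by_code[int(match.group(2))].add(int(match.group(1)))
    return {code: sorted(blocks) for code, blocks in blocks_by_code.items()}


# Two entries copied from the snippets above, used as sample input.
sample = """24-oct-2005 23:03:42 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] comnPool.c[2614]
Oct 24, 2005 11:03:42 pm NSS<COMN>-3.22-xxxx:
Pool SYS: System data error 20012(beastTree.c[506]). Block
134986293(file block -134986293)(ZID 1)

25-oct-2005 07:13:24 lvl="LOG_ERR" [NSS.NLM - v3.22-0]
[Error] zfsVolumeData.c[212]
Oct 25, 2005 7:13:24 am NSS<ZLSS>-3.22-1449:
Error reading VolumeData Block 89425449, status=20206.
"""

if __name__ == "__main__":
    for code, blocks in sorted(summarize_nss_log(sample).items()):
        print(code, blocks)
```

Because the same block is often logged once for the pool and again for the volume, collecting block numbers into a set per error code keeps the summary to the distinct blocks involved.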


--
pgraybeal@faplawfirm.com