Greetings all,

I browsed many websites before posting here; maybe someone could lend us some advice on this:
We have a two-node SLES 10 SP2 / OES2 SP1 cluster, fully patched, running on an infrastructure of three ESX 3.5 servers (all three hosts physically identical and also fully patched).

We have problems with the NSS volumes used by the cluster, which are configured under VMware as virtual disks (SCSI controller in physical bus sharing mode, so any ESX server can host the VMs and access the disk images).
Randomly, these three volumes deactivate; see an example log below:
Code:
Nov 12 00:19:37 SRVGW02 kernel: sd 1:0:2:0: reservation conflict
Nov 12 00:19:37 SRVGW02 kernel: sd 1:0:2:0: SCSI error: return code = 0x00000018
Nov 12 00:19:37 SRVGW02 kernel: end_request: I/O error, dev sdd, sector 2200
Nov 12 00:19:37 SRVGW02 kernel:  BIO_UPTODATE not set.  Got an IO error
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zlssMSAP.c[538]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-xxxx:
Nov 12 00:19:37 SRVGW02 kernel:      MSAP: Pool "GWBACKUP" read error 20204.
Nov 12 00:19:37 SRVGW02 kernel:
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [MSAP] comnLog.c[201]
Nov 12 00:19:37 SRVGW02 kernel:      Pool "GWBACKUP" - Read error(20204).
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] comnPool.c[2575]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<COMN>-4.11b-xxxx:
Nov 12 00:19:37 SRVGW02 kernel:      Pool GWBACKUP: System data I/O error 20204(zlssMSAP.c[1812]).
Nov 12 00:19:37 SRVGW02 kernel:    Block 0(file block 0)(ZID 0)
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zfsVolumeData.c[216]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading VolumeData Block 36, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zfsVolumeData.c[216]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading VolumeData Block 36, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zfsVolumeData.c[216]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading VolumeData Block 11796227, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zlssLogicalVolume.c[4989]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading PoolData Block 11796230, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zfsVolumeData.c[216]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading VolumeData Block 11796227, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zlssLogicalVolume.c[4989]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading PoolData Block 11796230, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zfsVolumeData.c[216]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading VolumeData Block 11796227, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [Error] zlssLogicalVolume.c[4989]
Nov 12 00:19:37 SRVGW02 kernel:      Nov 12, 2009   1:19:37 am  NSS<ZLSS>-4.11b-1449:
Nov 12 00:19:37 SRVGW02 kernel:      Error reading PoolData Block 11796230, status=20206.
Nov 12 00:19:37 SRVGW02 kernel: NSSLOG ==> [MSAP] comnLog.c[201]
Nov 12 00:19:37 SRVGW02 kernel:      Pool "GWBACKUP" - MSAP deactivate.
Nov 12 00:19:37 SRVGW02 kernel: NSS: Clearing the NSS Linux cache for a pool
Nov 12 00:19:37 SRVGW02 kernel: MM_RemirrorPartition: GWBACKUP, 0
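The first line of that log seems to be the important clue: "return code = 0x00000018" is the raw SCSI result word, and its low byte (0x18) is the SAM status RESERVATION CONFLICT, meaning another initiator held a SCSI reservation on the shared disk when the read was attempted, after which MSAP deactivates the pool. A minimal decode sketch (field layout as used by the 2.6 kernel SCSI midlayer; purely illustrative, not from our servers):
Code:
# Decode the "return code" word printed by the Linux SCSI midlayer.
# Layout (2.6 kernels): driver_byte | host_byte | msg_byte | status_byte.
SAM_STATUSES = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

def decode_scsi_result(result):
    return {
        "driver_byte": (result >> 24) & 0xFF,
        "host_byte":   (result >> 16) & 0xFF,
        "msg_byte":    (result >> 8) & 0xFF,
        "status_byte": result & 0xFF,
        "status_name": SAM_STATUSES.get(result & 0xFF, "unknown"),
    }

print(decode_scsi_result(0x00000018))
# -> status_byte 0x18, i.e. RESERVATION CONFLICT: the read on sdd fails
#    and NSS/MSAP deactivates the pool.
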
We have one volume hosting GroupWise (1 MTA, 1 PO), another hosting GroupWise archives, and the last hosting files.
It seems that the problem appears when there is heavy I/O against those volumes (e.g. running a backup, users copying files).

We recently configured an additional volume, this time with raw device mapping; it is a big 2 TB volume hosting geographical data, so it is used very intensively, and it never deactivates ...

I think this is a VMware problem and that we should migrate all the virtual disks to raw device mappings, but we have approx. 1 TB of data to move and limited space available on the SAN. So, any help or tip would be appreciated.

For now we have set up monitoring on the cluster resources, so they come back online after these failures, but I fear for data integrity.
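
Alongside the NCS resource monitoring, we also tail the syslog for the two patterns from the log above so we get warned as soon as a conflict happens. A rough sketch of that watcher (hypothetical, not part of NCS; the log path and the alert action are just placeholders for our setup):
Code:
#!/usr/bin/env python
# Rough log-watch sketch: follow /var/log/messages and flag the two
# lines that precede a pool deactivation.
import re
import time

LOG = "/var/log/messages"              # assumed syslog target on SLES 10
PATTERNS = [
    re.compile(r"reservation conflict"),
    re.compile(r"MSAP deactivate"),
]

def follow(path):
    """Yield new lines appended to the file (like 'tail -f')."""
    with open(path) as fh:
        fh.seek(0, 2)                  # start at end of file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(LOG):
    if any(p.search(line) for p in PATTERNS):
        # Replace this print with whatever alerting fits (mail, SNMP, ...)
        print("ALERT: " + line.strip())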