Hi,

I couldn't find the cluster forum, so I thought I'd try here. I have a 5-node cluster running OES 2 on SUSE 10, connected to an iSCSI target on NetWare 6.5 SP7.

We installed SP7 on the iSCSI server last night and had some trouble getting the cluster volumes to come online.

This morning I found that some of the nodes had rebooted during the night. Now one of our cluster volumes (called GWPD1) won't load. Whenever it tries to load, all the other nodes send a poison pill to the node it's loading on.

When the first node it's loading on fails, it tries the other node it's configured to use, and that one receives a poison pill as well.

Do I need to do a volume or pool rebuild? If so, do I run it from one of the SUSE nodes, or from the NetWare iSCSI target machine?

Thanks so much!
Bryan

This is what the log says on one of the nodes (r9-lcn03) that sends the poison pill:
Oct 18 10:10:57 r9-lcn03 kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node [r9-lcn04].
Oct 18 10:10:57 r9-lcn03 kernel: Node [r9-lcn04] may have failed or experiencing other problems.
Oct 18 10:10:57 r9-lcn03 kernel: To ensure cluster stability, this node has sent a poison pill to node [r9-lcn04].
Oct 18 10:10:57 r9-lcn03 kernel: Epoch for this node is higher than for some other node.
Oct 18 10:10:57 r9-lcn03 kernel: Other node is slow to update epoch and bitmask (slow or dead).


This is in the log on the node (r9-lcn04) the cluster volume is loading on:
Oct 18 16:06:36 r9-lcn04 kernel: NSSLOG ==> [MSAP] comnLog.c[201]
Oct 18 16:06:36 r9-lcn04 kernel: Pool "GWPD1" - MSAP activate.
Oct 18 16:06:36 r9-lcn04 kernel: Server(ac1f20e0-d90f-1000-88-4c-0013211e0b9e) Cluster(feb1d200-fbe5-11da-9f-0e-001321b12732)
Oct 18 16:06:36 r9-lcn04 kernel: NSSLOG ==> [MSAP] comnLog.c[201]
Oct 18 16:06:36 r9-lcn04 kernel: Pool "GWPD1" - Watching pool.
Oct 18 16:06:54 r9-lcn04 kernel: NSSLOG ==> [Ownership] zPool.c[1142]
Oct 18 16:06:54 r9-lcn04 kernel: Pool "GWPD1" is owned by repair.c[1944]
Oct 18 16:06:55 r9-lcn04 kernel: NSS Free Tree(3) is corrupt.
Oct 18 16:06:55 r9-lcn04 kernel: ASSERTION FAILED
Oct 18 16:06:55 r9-lcn04 kernel: File is: /usr/src/packages/BUILD/nss/modules-build/linuxmpk/mpkmisc.c
Oct 18 16:06:55 r9-lcn04 kernel: Line Number is: 358
Oct 18 16:06:55 r9-lcn04 kernel: ------------[ cut here ]------------
Oct 18 16:06:55 r9-lcn04 kernel: kernel BUG at /usr/src/packages/BUILD/nss/modules-build/linuxmpk/nwcore.c:204!
Oct 18 16:06:55 r9-lcn04 kernel: invalid opcode: 0000 [#1]
Oct 18 16:06:55 r9-lcn04 kernel: SMP
Oct 18 16:06:55 r9-lcn04 kernel: last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:02:02.0/subsystem_device
Oct 18 16:06:55 r9-lcn04 kernel: Modules linked in: af_packet hangcheck_timer cma cmsg crm cvb css vipx sbd gipc vll sbdlib clstrlib sg crc32c libcrc32c zapi nebdrv nsslsa nssmanage nsszlss nsscomn ndpmod nss nsslibrary nsslnxlib linuxmpk libnss admindrv nwraid nls_utf8 cpufreq_ondemand cpufreq_userspace cpufreq_powersave acpi_cpufreq freq_table adminfs adminfsdrv iscsi_tcp libiscsi scsi_transport_iscsi ipv6 button battery ac apparmor aamatch_pcre loop dm_mod hw_random uhci_hcd ehci_hcd ide_cd i6300esb usbcore cdrom shpchp pci_hotplug tg3 reiserfs ata_piix ahci libata edd fan thermal processor cciss piix sd_mod scsi_mod ide_disk ide_core
Oct 18 16:06:55 r9-lcn04 kernel: CPU: 1
Oct 18 16:06:55 r9-lcn04 kernel: EIP: 0060:[<f8cc9434>] Tainted: PF VLI
Oct 18 16:06:55 r9-lcn04 kernel: EFLAGS: 00010296 (2.6.16.54-0.2.8-smp #1)
Oct 18 16:06:55 r9-lcn04 kernel: EIP is at assert_wait+0x0/0x9 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: eax: 00000017 ebx: f2adc288 ecx: ffffff00 edx: 00000296
Oct 18 16:06:55 r9-lcn04 kernel: esi: 0000004d edi: f2adc290 ebp: f2adc288 esp: e5d5da88
Oct 18 16:06:55 r9-lcn04 kernel: ds: 007b es: 007b ss: 0068
Oct 18 16:06:55 r9-lcn04 kernel: Process MPK Thread (pid: 7890, threadinfo=e5d5c000 task=f2f369d0)
Oct 18 16:06:55 r9-lcn04 kernel: Stack: <0>f8cc1d34 f8cc9a38 00000166 f2adc290 f9ef5c08 f9ccb5cc 0000004f f2adc288
Oct 18 16:06:55 r9-lcn04 kernel: 0000006a 007d31f2 f2adc000 e5d5dbbc f9ef600f f9ee65d4 f2adc028 f2adc000
Oct 18 16:06:55 r9-lcn04 kernel: f9efd3fe 0069004d 0004006a 00000001 007d31eb 00000002 007d31f2 00000001
Oct 18 16:06:55 r9-lcn04 kernel: Call Trace:
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cc1d34>] Abend+0x4c/0x50 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ef5c08>] FT_ValidateNode+0xdf/0x225 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ef600f>] insertExtent+0x2c1/0x39d [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ee65d4>] ZFS_BlockSignalHandler+0x0/0x11 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9efd3fe>] ZFSMAL_ReadBlk+0x263/0x367 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9efae99>] insertXNodeGetChild+0x103/0x137 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f949c585>] BST_doUnpackIndex+0xa6/0x591 [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9efb134>] ZFSPOOL_FreeExtent+0x1f4/0x218 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9efb606>] zfsFreeExtent+0x75/0x10f [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f94969d9>] BST_new+0x96/0xcc [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9edc8b6>] directFileMapTrunc+0x141/0x269 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ede32c>] ZFSVOL_VOL_truncateFile+0x1e9/0x20f [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cbcbd2>] kCurrentThread+0xe/0x54 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<f949a731>] BEASTHASH_LookupByZid+0x2ec/0x388 [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f950fb89>] ROOT_BST_TruncateFile+0x1e/0x2a [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ea27db>] ZFSPOOL_PlayPurgeLog+0x6ec/0x15e6 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9e8c4a0>] ZFSVOL_VOL_GetBeastFromVolume+0x1c0/0x1dd [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9100000>] pclDisplayHelpLine+0x692/0x85e [nsslibrary]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9104208>] LB_StackFree+0x8/0xc [nsslibrary]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ef002b>] ZFSVOL_LoadSystemBeasts+0x1b8/0x1d1 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ea3740>] ZFSVOL_PlayPurgeLog+0x6b/0x77 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ef005e>] ZFSVOL_Activate+0x1a/0x1e [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ead54e>] VerifyAndLoadLVSystemBeasts+0x6f/0xeb [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9eae114>] ZVP_playLogicalVolumePurgeLog+0x22e/0x30f [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f949a9e6>] COMN_LookupByZid+0x31/0x1ee [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cc1f9f>] mpkAlloc+0x56/0x62 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9eaf3a8>] ZVP_playAllPurgeLogsVisitNode+0x210/0x3d8 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f90ff46d>] NSSMPK_LockNss+0x3c/0x5f [nsslibrary]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9eaf6e4>] ZVP_playAllPurgeLogs+0x174/0x1a5 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cbcbd2>] kCurrentThread+0xe/0x54 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9eb0d90>] ZVP_BeastTreeWalkPool+0x97/0x173 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ece699>] ZVP_ADSet+0xdc/0x115 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ed2861>] ZVP_MapPoolBlocks+0x3f1/0x565 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ed2a30>] ZVP_VerifyPool+0x5b/0x19d [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ecea5e>] ZVP_PoolStartup+0x195/0x226 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ec6428>] RAV_ProcessNew+0x6e/0x126 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ed2bfc>] ZVP_Main+0x8a/0xb6 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<f9ea45a1>] ZLSSPOOL_VOL_VolumeMaintenance+0x93/0x149 [nsszlss]
Oct 18 16:06:55 r9-lcn04 kernel: [<c0161595>] __fput+0x13d/0x167
Oct 18 16:06:55 r9-lcn04 kernel: [<c01784c1>] mntput_no_expire+0x11/0x73
Oct 18 16:06:55 r9-lcn04 kernel: [<f90ff46d>] NSSMPK_LockNss+0x3c/0x5f [nsslibrary]
Oct 18 16:06:55 r9-lcn04 kernel: [<f950ef14>] VP_PoolRAVThread+0xc0/0x185 [nsscomn]
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cc0c20>] threadStartFunc+0x7c/0x90 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<f8cc0ba4>] threadStartFunc+0x0/0x90 [linuxmpk]
Oct 18 16:06:55 r9-lcn04 kernel: [<c0102005>] kernel_thread_helper+0x5/0xb
Oct 18 16:06:55 r9-lcn04 kernel: Code: 24 08 89 04 d5 60 a1 cd f8 83 c1 08 81 f9 00 01 00 00 0f 84 af fe ff ff e9 e7 fe ff ff f0 ff 05 60 a9 cd f8 0f 8e ba 02 00 00 c3 <0f> 0b cc 00 1c a1 cc f8 c3 83 ec 0c e8 2e e8 5c c7 f0 ff 0d 60
Oct 18 16:06:55 r9-lcn04 kernel: <0>Fatal exception: panic in 5 seconds