Odd NFS Hangs
Hi there,
We run two boxes with SLED 10 SP2 and two new boxes with SLED 11 against an
NFS server running AIX 5.3.
The SLED 10 boxes have been running fine for more than a year, but the SLED 11 boxes stall completely at odd times. Users cannot do anything anymore, probably because of NFS paths in their environment. root can still log in on the console and start programs like top, but a command like df hangs after listing the local filesystems, at the point where it should list the first NFS-mounted one. This is the SysRq "Show Blocked State" output for the hung df:
Sep 30 15:05:54 lx4 kernel: SysRq : Show Blocked State
Sep 30 15:05:54 lx4 kernel: task PC stack pid father
Sep 30 15:05:54 lx4 kernel: df D 0000000000000007 0 4260 4221
Sep 30 15:05:54 lx4 kernel: ffff88063cc63b28 0000000000000082 ffff88033cdd6180 ffff88063cc63ab8
Sep 30 15:05:54 lx4 kernel: ffffffff80a29000 ffffffff80a33680 ffffffff80a30470 ffffffff80a33680
Sep 30 15:05:54 lx4 kernel: ffffffff80a29000 ffffffff80a33680 ffffffff80a33680 ffffffff80a33680
Sep 30 15:05:54 lx4 kernel: Call Trace:
Sep 30 15:05:54 lx4 kernel: [<ffffffffa01d47e9>] rpc_wait_bit_killable+0x2d/0x31 [sunrpc]
Sep 30 15:05:54 lx4 kernel: [<ffffffff8049c44f>] __wait_on_bit+0x41/0x70
Sep 30 15:05:54 lx4 kernel: [<ffffffff8049c4e9>] out_of_line_wait_on_bit+0x6b/0x77
Sep 30 15:05:54 lx4 kernel: [<ffffffffa01d4f1c>] __rpc_execute+0xe1/0x22d [sunrpc]
Sep 30 15:05:54 lx4 kernel: [<ffffffffa01cf482>] rpc_run_task+0x4f/0x57 [sunrpc]
Sep 30 15:05:54 lx4 kernel: [<ffffffffa01cf571>] rpc_call_sync+0x3d/0x5a [sunrpc]
Sep 30 15:05:54 lx4 kernel: [<ffffffffa0273792>] nfs3_rpc_wrapper+0x19/0x50 [nfs]
Sep 30 15:05:54 lx4 kernel: [<ffffffffa0273941>] nfs3_proc_statfs+0x63/0x87 [nfs]
Sep 30 15:05:54 lx4 kernel: [<ffffffffa0268498>] nfs_statfs+0x61/0x137 [nfs]
Sep 30 15:05:54 lx4 kernel: [<ffffffff802b0e9d>] vfs_statfs+0x5b/0x76
Sep 30 15:05:54 lx4 kernel: [<ffffffff802b10ab>] sys_statfs+0x3e/0x93
Sep 30 15:05:54 lx4 kernel: [<ffffffff8020bfbb>] system_call_fastpath+0x16/0x1b
Sep 30 15:05:54 lx4 kernel: [<00007f87a7c22627>] 0x7f87a7c22627
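Since the trace shows df blocked in statfs on an NFS mount, one way to see which mount is stuck without hanging another shell is to run the statfs call in a child process with a timeout. This is only a minimal sketch, assuming Python 3.7+ is available on the box. The names nfs_mounts and responds are made up for illustration. It relies on the wait being killable, as rpc_wait_bit_killable suggests; if the child cannot be killed, it will simply be left behind.

```python
import multiprocessing
import os


def nfs_mounts(mounts_text):
    """Return mount points of NFS filesystems from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        # Note: spaces in mount points appear escaped as \040 and are not decoded here.
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points


def _probe(path):
    # Same kind of call df makes; this is what blocks when the server does not answer.
    os.statvfs(path)


def responds(path, timeout=5.0):
    """Return True if statvfs on path completes within timeout seconds."""
    proc = multiprocessing.Process(target=_probe, args=(path,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.kill()      # SIGKILL; assumed to work because the wait is "killable"
        proc.join(1.0)   # do not wait forever if the kill has no effect
        return False
    return proc.exitcode == 0


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            for mount_point in nfs_mounts(f.read()):
                state = "ok" if responds(mount_point) else "NOT RESPONDING"
                print(mount_point, state)
```

Run as root from the console during a stall, this should print one line per NFS mount, which would at least show whether all mounts from the AIX server hang or only some of them.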
Usually top shows at least one running process taking 100% CPU time, and the load increases over time.
The only thing that helps is pressing the power button or using SysRq B.
We already tried switching off TCP segmentation offload, NIS, automount etc., but the
problem remains.
A Google search for 'rpc_wait_bit_killable' turns up some results in other forums, but from quickly scanning them I didn't find anything usable.
I am seriously thinking about installing SLED 10 SP2 on the new boxes if I find no solution.
Any ideas appreciated.
Oliver