The server has been running for over 1.5 years with no problems. Last week we started receiving the following error, and the workstations freeze:
lockd: cannot monitor statd: server rpc.statd not responding, timed out
Server:
OS: Ubuntu 10.04.4
Kernel: Linux 2.6.32-51-server
nfs-common 1:1.2.0-4ubuntu4.2
nfs-kernel-server 1:1.2.0-4ubuntu4.2
Exports:
/home x.x.x.0/255.255.0.0(rw,no_root_squash,insecure,async,wdelay,no_subtree_check)
/public x.x.x.0/255.255.0.0(rw,no_root_squash,insecure,async,wdelay,no_subtree_check)
Workstations: Ubuntu 10.04.x
server:/home /home nfs defaults 0 0
server:/public /mnt/public nfs defaults 0 0
I ran rpcinfo -p from both the workstations and the server; both return OK.
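For anyone repeating the rpcinfo check across many machines, it can be scripted. The following is only a minimal sketch: the function names missing_services and query are hypothetical, and it assumes the usual `rpcinfo -p` column layout with the program number in the first column. The program numbers 100024 (status, i.e. rpc.statd) and 100021 (nlockmgr) are the standard ONC RPC assignments.

```python
import subprocess

# Standard ONC RPC program numbers:
# 100024 = status (rpc.statd), 100021 = nlockmgr (lockd).
REQUIRED = {100024: "status", 100021: "nlockmgr"}


def missing_services(rpcinfo_output):
    """Return names of required RPC programs absent from `rpcinfo -p` output."""
    seen = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric program number; the header does not.
        if fields and fields[0].isdigit():
            seen.add(int(fields[0]))
    return sorted(name for prog, name in REQUIRED.items() if prog not in seen)


def query(host):
    """Run `rpcinfo -p host` and return its text output (hypothetical helper)."""
    result = subprocess.run(
        ["rpcinfo", "-p", host], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling missing_services(query("server")) returns an empty list when both programs are registered. Note that registration with the portmapper only shows that the service was started; it does not prove that rpc.statd is answering requests at the moment lockd calls it.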
While lockd is frozen, the server remains fully accessible, i.e. ssh, top and df all respond as expected. However, the workstations cannot switch between desktops and become unresponsive, and Chrome stops functioning.
On the server, ps -aux | grep lockd shows that the lockd process is in state D (uninterruptible sleep). After a couple of minutes lockd returns to S and R, and the workstations are functional again.
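To catch the D state without watching ps by hand, the check can be scripted. This is a minimal sketch, not a tested monitoring tool: blocked_processes and snapshot are hypothetical names, and it assumes a procps-style ps that accepts `-eo pid,stat,comm`.

```python
import subprocess


def blocked_processes(ps_output, name="lockd"):
    """Return (pid, state) pairs for processes called `name` in state D."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        # Strip brackets in case a kernel thread is shown as [lockd].
        if comm.strip().strip("[]") == name and stat.startswith("D"):
            hits.append((int(pid), stat))
    return hits


def snapshot():
    """Return the current process table as text (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running blocked_processes(snapshot()) in a loop with a timestamp would show when lockd enters and leaves uninterruptible sleep, which can then be lined up against the kernel log.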
After enabling nlm_debug, I can see that the lockd process does get stuck.
In the log below, lockd is stuck for a minute, from 02:03:21 to 02:04:21.
This repeats each time lockd gets stuck, and I found that rebooting the "offending" workstation returns all systems to normal.
Oct 2 02:04:21 fs1 kernel: [647001.312596] lockd: request from 172.x.x.x, port=960
Oct 2 02:04:21 fs1 kernel: [647001.312603] lockd: LOCK called
Oct 2 02:03:21 fs1 kernel: [646941.418685] lockd: nlmsvc_lookup_host(host='roi-lnx', vers=4, proto=tcp)
Oct 2 02:03:21 fs1 kernel: [646941.418687] lockd: get host roi-lnx
Oct 2 02:03:21 fs1 kernel: [646941.418688] lockd: nlm_lookup_host found host roi-lnx (172.16.16.76)
Oct 2 02:03:21 fs1 kernel: [646941.418689] lockd: nsm_monitor(roi-lnx)
Oct 2 02:04:21 fs1 kernel: [647001.312552] statd: server rpc.statd not responding, timed out
Oct 2 02:04:21 fs1 kernel: [647001.312565] lockd: NSM upcall RPC failed, status=-5
Oct 2 02:04:21 fs1 kernel: [647001.312570] lockd: cannot monitor roi-lnx
Oct 2 02:04:21 fs1 kernel: [647001.312572] lockd: release host roi-lnx
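The length of each stall can be measured from the kernel timestamps in brackets rather than the one-second syslog clock. The sketch below is an illustration only: stall_seconds is a hypothetical name, and it assumes that each stall starts at an nsm_monitor line and ends at the next "rpc.statd not responding" line, as in the excerpt above. The two sample lines are copied from that excerpt.

```python
import re

# Kernel log lines carry an uptime stamp in brackets, e.g. "[646941.418689]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def stall_seconds(lines):
    """Return the gap in seconds between each nsm_monitor call and the next statd timeout."""
    stalls = []
    started = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue
        stamp, msg = float(match.group(1)), match.group(2)
        if msg.startswith("lockd: nsm_monitor("):
            started = stamp
        elif msg.startswith("statd: server rpc.statd not responding") and started is not None:
            stalls.append(round(stamp - started, 3))
            started = None
    return stalls


sample = [
    "Oct 2 02:03:21 fs1 kernel: [646941.418689] lockd: nsm_monitor(roi-lnx)",
    "Oct 2 02:04:21 fs1 kernel: [647001.312552] statd: server rpc.statd not responding, timed out",
]
print(stall_seconds(sample))
```

For the two lines above the gap comes out at just under 60 seconds, which matches the one-minute freeze seen on the workstations.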
This looks like a bug in lockd.
I have spent days looking through Google, and there are a couple of similar cases but no fixes.
Please let me know if you have any suggestions to resolve this issue.
Thanks,
Laurence