The server has been running for over 1.5 years with no problems. Last week it started logging the following errors, and the workstations freeze: lockd: cannot monitor statd: server rpc.statd not responding, timed out

Server:
OS: Ubuntu 10.04.4
Kernel: Linux 2.6.32-51-server
nfs-common 1:1.2.0-4ubuntu4.2
nfs-kernel-server 1:1.2.0-4ubuntu4.2

Exports:
/home x.x.x.0/255.255.0.0(rw,no_root_squash,insecure,async,wdelay,no_subtree_check)
/public x.x.x.0/255.255.0.0(rw,no_root_squash,insecure,async,wdelay,no_subtree_check)

Workstations:
OS: Ubuntu 10.04.x

fstab entries:
server:/home /home nfs defaults 0 0
server:/public /mnt/public nfs defaults 0 0

I ran rpcinfo -p from both the workstations and the server; both return OK.
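For anyone repeating the rpcinfo -p check, here is a minimal sketch of what "OK" can be taken to mean: that the status (rpc.statd, program 100024) and nlockmgr (lockd, program 100021) services are registered with the portmapper. The function name and the sample output are made up for illustration. Note that registration alone does not show that rpc.statd is actually answering, which matches what happened here.

```python
# RPC program numbers for the two services involved in NFS locking.
REQUIRED = {100024: "status", 100021: "nlockmgr"}

def missing_services(rpcinfo_output):
    """Return the names of required RPC services absent from `rpcinfo -p` output."""
    seen = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric program number; the header row does not.
        if fields and fields[0].isdigit():
            seen.add(int(fields[0]))
    return sorted(name for prog, name in REQUIRED.items() if prog not in seen)

# Hypothetical sample, not taken from the server in question.
sample_rpcinfo = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100024    1   udp  32768  status
    100021    4   tcp  32769  nlockmgr
    100003    3   tcp   2049  nfs
"""

print(missing_services(sample_rpcinfo))
```

An empty list means both services are registered; anything else names the service to look at first.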

While lockd is frozen, the server remains fully accessible, i.e. ssh, top and df all return as expected. However, the workstations cannot switch between desktops and become unresponsive, and Chrome stops functioning.

On the server, ps -aux | grep lockd shows the lockd process in state D (uninterruptible sleep). After a couple of minutes lockd returns to S and R, and the workstations are functional again.
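To catch the D state without watching ps by hand, something like the sketch below could be run periodically. It assumes the input comes from ps -eo pid,stat,comm (three columns with a header row); the function name and the sample rows are invented for illustration.

```python
def uninterruptible(ps_output, name="lockd"):
    """Return (pid, stat) pairs for processes called `name` whose state starts with 'D'."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        if comm.strip() == name and stat.startswith("D"):
            hits.append((int(pid), stat))
    return hits

# Hypothetical sample in the layout of `ps -eo pid,stat,comm`.
sample_ps = """\
  PID STAT COMMAND
 1432 D    lockd
 1433 S    nfsd
 2001 Ss   rpc.statd
"""

print(uninterruptible(sample_ps))
```

Logging the time of each hit would give the start and end of every freeze, which can then be lined up with the kernel log.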

After enabling nlm_debug I can see that the lockd process does indeed get stuck.

In the log below, lockd is stuck for a minute, from 02:03:21 to 02:04:21.

This pattern repeats every time lockd gets stuck, and I found that rebooting the "offending" workstation returns all systems to normal.

Oct  2 02:04:21 fs1 kernel: [647001.312596] lockd: request from 172.x.x.x, port=960
Oct  2 02:04:21 fs1 kernel: [647001.312603] lockd: LOCK          called
Oct  2 02:03:21 fs1 kernel: [646941.418685] lockd: nlmsvc_lookup_host(host='roi-lnx', vers=4, proto=tcp)
Oct  2 02:03:21 fs1 kernel: [646941.418687] lockd: get host roi-lnx
Oct  2 02:03:21 fs1 kernel: [646941.418688] lockd: nlm_lookup_host found host roi-lnx (172.16.16.76)
Oct  2 02:03:21 fs1 kernel: [646941.418689] lockd: nsm_monitor(roi-lnx)
Oct  2 02:04:21 fs1 kernel: [647001.312552] statd: server rpc.statd not responding, timed out
Oct  2 02:04:21 fs1 kernel: [647001.312565] lockd: NSM upcall RPC failed, status=-5
Oct  2 02:04:21 fs1 kernel: [647001.312570] lockd: cannot monitor roi-lnx
Oct  2 02:04:21 fs1 kernel: [647001.312572] lockd: release host roi-lnx
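
The "offending" workstation can be read off a log like this one by pairing each nsm_monitor(host) line with the following "cannot monitor host" line and measuring the gap between their kernel timestamps. The sketch below does that under the assumption that the lines have the format shown above; the function name is made up, and the sample lines are copied from the log.

```python
import re
from collections import Counter

# Matches the kernel log lines shown above: "[<uptime seconds>] lockd: <message>"
LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] lockd: (.*)")

def failed_monitors(log_lines):
    """Return (host, seconds_waited) for every nsm_monitor call that ended in 'cannot monitor'."""
    events = []
    for line in log_lines:
        m = LINE.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2).strip()))
    events.sort()  # order by kernel timestamp, not by position in the file

    started = {}   # host -> timestamp of the pending nsm_monitor call
    failures = []
    for ts, msg in events:
        begin = re.match(r"nsm_monitor\((.+)\)$", msg)
        fail = re.match(r"cannot monitor (\S+)$", msg)
        if begin:
            started[begin.group(1)] = ts
        elif fail and fail.group(1) in started:
            host = fail.group(1)
            failures.append((host, round(ts - started.pop(host), 3)))
    return failures

sample_log = [
    "Oct  2 02:03:21 fs1 kernel: [646941.418689] lockd: nsm_monitor(roi-lnx)",
    "Oct  2 02:04:21 fs1 kernel: [647001.312565] lockd: NSM upcall RPC failed, status=-5",
    "Oct  2 02:04:21 fs1 kernel: [647001.312570] lockd: cannot monitor roi-lnx",
]

failures = failed_monitors(sample_log)
print(failures)                                # one entry for roi-lnx, roughly 59.9 s
print(Counter(host for host, _ in failures))   # number of failures per client
```

Run over a full day of logs, the per-client count should show whether the stalls are tied to one workstation or spread across many.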

This looks like a bug in lockd.

I have spent days looking through Google, and there are a couple of similar cases but no fixes.

Please let me know if you have any suggestions to resolve this issue.

Thanks, Laurence

2 Answers

1

In a similar environment, with an Ubuntu 10.04.4 NFS server serving approx. 50 Ubuntu/Mac OS X clients (mostly 12.04.3), I had the same problem. The clients only worked when the home directories were mounted with the nolock option (which one shouldn't do).

After two weeks of debugging everything possible in the network, I realized, after finding this on serverfault, that the only change had been the addition of two new clients (12.04.3) running kernel 3.8.0-29-generic. Since taking these two out of the network (yesterday, in fact), statd and lockd have been stable again on the server.

I will report what happens today, once all clients are in full operation again.

Are there any new clients in your network?

0

I had a similar experience in a 4-node cluster where all of the nodes run kernel 3.2.0-38-generic under Ubuntu 12.04.5. The NFS package versions are:

dpkg -la | grep nfs
ii  libnfsidmap2                       0.25-1ubuntu2               NFS idmapping     library
ii  nfs-common                         1:1.2.5-3ubuntu3.2                                  NFS support files common to client and server
ii  nfs-kernel-server                  1:1.2.5-3ubuntu3.2                                  support for NFS kernel server

It turned out that one problematic node was constantly "attacking the NFS server". Once the problematic node was taken out of the system, the hangs stopped occurring.