19

I'm using Ubuntu 20.04, with i3 as the window manager but the gnome subsystem installed.

I have a problem that at some random interval (hours or days) after a reboot, any attempt to open thunar or nautilus, or any attempt by an application (such as Firefox) to open a file dialog either takes a minute or more to open, or just times out forever and requires me to kill the application.

How can I find out which application, file, extension or driver is behind this delay?

I've tried the following:

  • Running as root - the applications are fast as normal.
  • Disable tracker (as per here)
  • libgoa seems up to date (as per this)
  • Tried the pulseaudio hack here
  • Running nautilus with strace; there's a 30-second gap with the below lines, but I'm not sure how to interpret it (fd=24 is some file descriptor?)
  • journalctl shows the below when (e.g.) nautilus finally works

I have several machines with very similar setups, only one has this behaviour. I can't figure out what application or library is the culprit.

Output of strace -t -o /tmp/nautilus nautilus:

20:15:32 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
20:15:32 poll([{fd=24, events=POLLIN}], 1, 0) = 1 ([{fd=24, revents=POLLIN}])
20:15:32 read(24, "\1\0\0\0\0\0\0\0", 16) = 8
20:15:32 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
20:15:32 futex(0x56241fa9c880, FUTEX_WAKE_PRIVATE, 2147483647) = 0
20:15:32 close(24)                      = 0
20:15:32 eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = 24
20:15:32 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
20:15:32 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
20:15:32 futex(0x56241ef26dc0, FUTEX_WAKE_PRIVATE, 1) = 1
20:15:32 futex(0x56241ef26910, FUTEX_WAKE_PRIVATE, 1) = 1
20:15:32 futex(0x56241ef1c038, FUTEX_WAKE_PRIVATE, 1) = 1
20:15:32 poll([{fd=24, events=POLLIN}], 1, 25000) = 1 ([{fd=24, revents=POLLIN}])
20:15:32 read(24, "\1\0\0\0\0\0\0\0", 16) = 8
20:15:32 poll([{fd=24, events=POLLIN}], 1, 25000) = 0 (Timeout)
20:15:57 write(24, "\1\0\0\0\0\0\0\0", 8) = 8
20:15:57 futex(0x56241fa9c880, FUTEX_WAKE_PRIVATE, 2147483647) = 0
20:15:57 close(24)                      = 0

From the journal:

Apr 20 20:32:35 aubrey dbus-daemon[1180]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.134190' (uid=1000 pid=1450204 comm="nautilus " label="unconfined")
Apr 20 20:32:35 aubrey systemd[1]: Starting Hostname Service...
Apr 20 20:32:35 aubrey dbus-daemon[1180]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 20 20:32:35 aubrey systemd[1]: Started Hostname Service.
Apr 20 20:32:35 aubrey nautilus[1450204]: Called "net usershare info" but it failed: Failed to execute child process “net” (No such file or directory)
Apr 20 20:32:45 aubrey ddclient[1450366]: WARNING:  file /var/cache/ddclient/ddclient.cache, line 3: Invalid Value for keyword 'ip' =

Edit:

The above trace is just the output of strace -t -o /tmp/nautilus nautilus - a very verbose log, but with tail -f I'm able to inspect it at the point where it hangs, which is where I see an entry such as 20:15:32 poll([{fd=7, events=POLLIN}], 1, 25000) = 0 (Timeout) followed by 30+ seconds of nothing (then it starts and is very verbose again).

The fd reported by lsof, if I'm reading it right, looks like:

nautilus 1719429 coljac 7u a_inode 0,13 0 17491 [eventfd]

The top entries from strace -c nautilus are as follows (yes, poll is number one):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 22.70    0.011333           2      3988           poll
 16.55    0.008265           1      5018      4490 stat
  9.31    0.004647           2      1966       131 futex
  7.73    0.003859           1      2119      1385 recvmsg
  7.33    0.003660           1      3612      1441 openat
  4.76    0.002375           2      1038       436 access
  4.29    0.002140           1      1424           read
  3.60    0.001799           1      1356       105 readlink
  3.50    0.001748           0      1868           mmap
  3.30    0.001646           8       188           getdents64
  3.14    0.001568           0      2232           close
  2.38    0.001188           0      1758           fstat
  2.14    0.001067           0      1150           munmap
  1.99    0.000993           2       337           writev
  1.67    0.000833          92         9           clone
  1.57    0.000782           2       367           write
  1.47    0.000733           3       190           mprotect

Output from lsof | wc -l: 1340334

Edit 2:

Currently, thunar won't open at all - it fails with the message Failed to register: Timeout was reached. The output from strace -c thunar is as follows:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.53    0.002814           3       882       792 openat
 28.21    0.001571           4       354           mmap
  4.79    0.000267           2        96           read
  3.75    0.000209           2        90           fstat
  3.54    0.000197           2        96           close
  3.48    0.000194           1       109           mprotect
  1.87    0.000104           5        18           poll
  1.45    0.000081           0       165         1 futex
  1.31    0.000073           2        28           write
...

If I run strace -t thunar I also see this in the output before it hangs:

futex(0x564445b9ff70, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=7, events=POLLIN}], 1, 25000) = 1 ([{fd=7, revents=POLLIN}])
read(7, "\1\0\0\0\0\0\0\0", 16)         = 8
poll([{fd=7, events=POLLIN}], 1, 25000

lsof -p <pid> returns the following:

thunar  2036483 coljac    7u  a_inode               0,13        0    17491 [eventfd]

Some more outout from strace thunar, which fails to open, is:

20:29:32 read(7, "\1\0\0\0\0\0\0\0", 16) = 8
20:29:32 poll([{fd=7, events=POLLIN}], 1, 25000) = 0 (Timeout)
20:29:58 write(7, "\1\0\0\0\0\0\0\0", 8) = 8
20:29:58 futex(0x5566ad906ba0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
20:29:58 close(7)                       = 0
20:29:58 write(6, "\1\0\0\0\0\0\0\0", 8) = 8
20:29:58 futex(0x5566ad8e1f70, FUTEX_WAKE_PRIVATE, 1) = 1
20:29:58 futex(0x5566ad8e1b90, FUTEX_WAKE_PRIVATE, 1) = 1
20:29:58 futex(0x5566ad8d9058, FUTEX_WAKE_PRIVATE, 1) = 1
20:29:58 openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/charset.alias", O_RDONLY) = -1 ENOENT (No such file or directory)
20:29:58 write(2, "Failed to register: Timeout was "..., 40) = 40
20:29:58 poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
20:29:58 read(4, "\1\0\0\0\0\0\0\0", 16) = 8
20:29:58 futex(0x7f0dccfb72a0, FUTEX_WAKE_PRIVATE, 1) = 1
20:29:58 poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)

Further strace outputs are here on pastebin.

  • Also as a new user (also slow): here
  • As root (not slow): here
Coljac
  • 585
  • Could you share the strace output log, I could find above only strace c resume. – user.dz Jun 08 '21 at 11:39
  • 4
    I think I might have the same problem. Starting my programs with dbus-run-sessions seems to fix it. But that does not fix the underlying cause it only prevents the file open dialog from speaking with the slow program it seems. And you will run out of D-Bus sessions after a month or so. I had this problem for years and your question motivated me to take another look at google for this and found this. The proposed pkill gvfsd-trash works for me. Thank you!!! – mxmlnkn Jun 27 '21 at 07:54
  • 1
    And a link to the underlying bug in gvfsd: https://gitlab.gnome.org/GNOME/gvfs/-/issues/485 – mxmlnkn Jun 27 '21 at 08:05
  • 1
    Wow @mxmlnkn - that's a great find. The 25 second delay is precisely what I am seeing as well, so this seems likely the same problem. Once I can reproduce it again and confirm this solution I will ask you to expand this comment into an answer. – Coljac Jun 28 '21 at 23:22
  • 1
    @mxmlnkn I was just able to reproduce the issue, and pkill gvfsd-trash worked. If you would like to add an answer I can mark this question as answered. – Coljac Jul 08 '21 at 03:31

5 Answers5

30

I had the same problem for years because I have uptimes of at least a month quite often. After a while, all programs using the system-supplied file open dialog will seemingly hang or wait for many minutes.

You can run the following command to kill gvfsd-trash, after which all programs hanging and waiting for the file dialog should immediately resume working.

pkill gvfsd-trash

Another thing to note is that gvfsd-trash gets restarted after a while, probably by caja or any other file browser or by the system file open dialog itself. This is not the first problem I had with gvfsd-trash and I begin to bear a grudge, however, I don't want to uninstall gvfsd, which I need for mounting USB thumb drives and my smartphone over MTP. So, I opted for the brute-force solution and made only the gvfsd-trash binary inaccessible, e.g., by renaming it:

sudo mv /usr/libexec/gvfsd-trash{,.bak}

It might be in another location on different systems, try asking your packet manager, e.g., by calling:

dpkg -L gvfs-daemons | grep trash

Beware of the implications. Without gvfsd-trash you won't be able to access the special trash:/// URI path via your file browser! Personally, I never used that anyway. You can also access deleted files manually in the .Trash folders, which exist for each mount point and in your home, e.g.:

  • ~/.local/share/Trash
  • /media/mounted-external-harddrive/.Trash

It looks like the problem is a combination of gvfsd-trash and D-Bus. There seems to be a fitting issue issue in the gvfs bugtracker, which has already been fixed but has not been rolled out to my system, yet.

For years, I used a different solution, namely starting each program in a new DBus session by prefixing the most important programs like Firefox and text editors with dbus-launch. However, that comes with its own shares of problems because each session will spawn at least five gvfsd processes and possible others and the D-Bus session will not close by closing the program opened in it and the number of D-Bus sessions in total is limited, so after a while, you won't be able to start programs.

mxmlnkn
  • 1,370
  • 2
    This fixed the problem for me - it was very difficult to find this out, thanks for the answer. – Coljac Jul 10 '21 at 03:27
  • 1
    Same here, spent ages trying to figure out Nautilus being slow to open and this resolved it. – Mikesname Jul 13 '21 at 14:51
  • Thanks! You saved my month long headache. The problem of dolphin leaking memory and all gtk file managers start slowly almost killed my linux desktop life. Any links to the issue on bug tracker? – xgdgsc Aug 16 '21 at 02:59
  • 1
    @xgdgsc I fixed the link. For some reason, Github didn't like the standard markdown syntax as it looks good to me. Weird. – mxmlnkn Aug 16 '21 at 11:53
  • It would deserve at least two up votes: one for the working answer and another for the elegant way to rename the file. – sdlins Dec 19 '21 at 14:33
  • 2
    @sdlins :) Thanks. Those are bash brace expansions. They frequently save me typing work in the shell. – mxmlnkn Dec 19 '21 at 15:14
  • This is crazy (it works). Strangely for me, the problem started one day when I actually emptied the trash. I never used the trash folder but randomly decided to empty it. I had been suffering from this issue for a week now! Thanks! – kgx Dec 31 '21 at 13:09
  • Thanks so much. This fixed it for me too; I wasn't getting file dialog popups in Chrome or Edge (in KDE) due to this. – lingfish Feb 12 '22 at 05:36
  • How did you figure out that the problem was gvfsd-trash? – dan-gph Dec 07 '22 at 14:34
  • 1
    @dan-gph It took me a while but after googling every month or so for a while, I found an answer in the Linux Mint forum, see also my comment under the question. – mxmlnkn Dec 07 '22 at 15:26
  • @mxmlnkn Thanks. – dan-gph Dec 14 '22 at 02:23
  • Fixed it for me as well. The funny thing is, that I never had issues with it, even though I have an uptime of 688 days. I only recently turned the swap off and reenabled it afterwards and since then I had the issue. I regularely init 1 and then init 5 - that would've also fixed it. Thanks @mxmlnkn for sharing the solution with us. – Marc Dix Sep 09 '23 at 05:18
3

Diagnosis

I guess you are showing an excerpt from you strace log, pointing to where the lag occurs. It would be useful if you posted the full strace command you used.

Actions that may lead to the culprit:

  1. While an straced nautilus is running, in another terminal use

    $ pidof nautilus
    $ lsof -p <pidno>
    

    where <pidno> is the PID returned by the previous command. That lets you inspect which file descriptors (answer to your question: Yes) the strace log refers to (source). You might want to pipe the output with less since it is typically long.

  2. Option -c can further help identifying bottleneck processes (source), yours is probably poll. From the man strace page:

    -c
    --summary-only
                Count time, calls, and errors for each system call and report a summary on program exit, suppressing the  regular
                output.  This attempts to show system time (CPU time spent running in the kernel) independent of wall clock time.
                If -c is used with -f, only aggregate totals for all traced processes are kept.
    

    This is an example of what the output looks like

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     18,29    0,009128           4      2058           mmap
     15,19    0,007580           7       989         1 poll
     11,96    0,005967           2      2465       318 openat
      6,76    0,003374           1      2187           close
    ...
    
  3. Check other instances of delayed nautilus and thunar launching, to confirm a similar pattern.

  4. Apply the above, and any further diagnostic commands, to a comparison with what a root instance (which you mention it works fine..., I guess you meant ... at times where with a normal user the delays already appeared) produces.

  5. Consider the problem might be related to availability of file descriptors, or the like, see this.

  6. Please post the complete output of

    $ bash -c 'time strace -fc nautilus'
    $ bash -c 'time strace -fc thunar'
    $ bash -c 'strace -fc time nautilus'
    $ bash -c 'strace -fc time thunar'
    
  7. Please mention whether you moved your home dir, if you had a different Ubuntu version and updated, etc., as it may be relevant. Also, try creating a temporary user to see if the problem shows up. This would be an additional proof to comparing with root.

    EDIT a: The comparison between what you get with users <you>, tester and root, for bash -c 'time strace -fc nautilus', shows the following essential difference in terms of what could be spending the ~30sec at launching:

    <you>:

    (org.gnome.Nautilus:2641915): Tracker-WARNING **: 18:30:21.999: Falling back 
    to bus backend, the direct backend failed to initialize: Could not find datab
    ase file:'/home/coljac/.cache/tracker/meta.db'.
    

    tester:

    (org.gnome.Nautilus:2976790): dbind-WARNING **: 20:45:38.814: Couldn't register with accessibility bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
    strace: Process 2976919 attached
    strace: Process 2976920 attached
    

    ** (org.gnome.Nautilus:2976790): WARNING **: 20:46:04.102: Unable to get contents of the bookmarks file: Error opening file /home/tester/.gtk-bookmarks: No such file or directory

    root:

    ** (org.gnome.Nautilus:2980894): WARNING **: 21:00:42.191: Unable to get contents of the bookmarks file: Error opening file /root/.gtk-bookmarks: No such file or directory
    

    ** (org.gnome.Nautilus:2980894): WARNING **: 21:00:42.191: Unable to get contents of the bookmarks file: Error opening file /root/.gtk-bookmarks: No such file or directory strace: Process 2980907 attached Nautilus-Share-Message: 21:00:42.241: Called "net usershare info" but it failed: Failed to execute child process “net” (No such file or directory)

    So, tester vs. root shows a warning that could help with diagnosis. The strange thing is that the same warning does not show for <you>.

Solution

I would wait until you post more info after the diagnosis suggested. In the meantime, I give a (nuke?) workaround. I am assuming your "several machines with very similar setups" are a sort of clones (at least in the beginning) of each other. If so, and you don't find a solution (after the diagnosis), and this bothers you enough, you might try cloning again from one of your working systems, with clonezilla, e.g.

Related

  1. https://www.linuxquestions.org/questions/linux-server-73/strace-question-poll-taking-a-long-time-on-an-open-command-939714/
  2. Nautilus does not start: .cache/tracker/meta.db not found
  • Thanks for the helpful answer. I've updated the question with some more information, following your suggestions. Regarding the nuclear option, I have indeed tried just reinstalling the OS from scratch, and eventually the problem came back; I can only assume it's some piece of software that I need to or habitually use is causing the problem. – Coljac Jun 03 '21 at 12:27
  • @Coljac - Based on your feedback, I have a few comments (which I may transfer later to the answer to make everything clearer).
    1. with your strace -c, poll is number one, but still with a fraction of a second. It seems all the delay is not counted in strace -c, I am not familiar with such a case.
    2. I am surprised reinstalling OS the problem comes back. Do you have at least one of the "similar systems" exactly the same as the failing one, in terms of hardware? Or can you put together a system like that? If so, I would exchange hard disks, just to discard a hardware failure (unlikely).
    – sancho.s ReinstateMonicaCellio Jun 03 '21 at 13:01
  • Seems this is a tricky one. I added the outout of strace -c thunar to the question; thunar won't even open at the moment, it just times out. There are a lot of openat syscalls with errors. I don't know what this means, is it significant? strace for thunar and other apps that hang show the same fd information as for nautilus, i.e. [eventfd]. – Coljac Jun 03 '21 at 23:18
  • @Coljac - Please:
    1. The clear all possible doubts, post in the question each exact command you use, followed by the exact output you get (e.g., instead of "show the same fd information as for nautilus" posted in the comment). Often times (although not guaranteed this is such a case), this hides subtleties that are key for a solution.
    2. Comment on point 4, and the possible use of twin hardware to test.
    – sancho.s ReinstateMonicaCellio Jun 04 '21 at 09:15
  • I have edited the question to be clearer. 2) The strace outputs are a bit too long to diff between user and root, however there's no timeout associated with the poll([{fd=12, events=POLLIN}], 1, 25000) = 1 lines. The other thing of interest is that as user, there are attempts to open /usr/local/cuda-10.1/lib64/tls/x86_64/x86_64/libthunarx-3.so.0 (same for nautilus), the root user doesn't do that.
  • – Coljac Jun 04 '21 at 11:29
  • Regarding the cloning - I don't have the same hardware, but similar set up in terms of OS, window manager, etc. However, since this has persisted across clean reformats of this machine, I'm seeking here the ability to diagnose it in case it returns, rather than making it go away for a time. I don't have anything as clean as an image to restore. – Coljac Jun 04 '21 at 11:29
  • https://unix.stackexchange.com/questions/410877/why-do-time-and-strace-c-differ – sancho.s ReinstateMonicaCellio Jun 04 '21 at 14:58
  • https://www.pearsonhighered.com/assets/samplechapter/0/1/3/1/0131486829.pdf – sancho.s ReinstateMonicaCellio Jun 04 '21 at 16:17
  • @Coljac - Please see EDIT, item 6. – sancho.s ReinstateMonicaCellio Jun 05 '21 at 09:24
  • 1
    I put the resulting outputs here on pastebin; I'll edit them into the (already long) question if it proves fruitful. https://pastebin.com/GqiDf3tY – Coljac Jun 06 '21 at 09:00
  • I added a pastebin link to the question with your desired outputs. – Coljac Jun 06 '21 at 23:17
  • 1
    With your added info, please.
    1. See the two EDIT points.
    2. Apply the commands also for root level and/or a new temporary user, and compare the outputs. Do so for all diagnostic commands and post the output from now on. Those extra couple of minutes could be essential to solve the problem.
    – sancho.s ReinstateMonicaCellio Jun 07 '21 at 08:14
  • Thanks again for all your efforts. 1) I haven't moved home dir; experiments with a new user also reproduce the issue. 2) I amended the question to put the same strace outouts for a brand new user (slow) and for root (fast). – Coljac Jun 07 '21 at 11:09
  • Just in case you are still having the issue, have you tried booting via terminal w/ --verbose flag? That should tell you where it is when it starts stalling, no? Maybe you all thought of this already; that comment chain intimidates me too much to read it all. : ) – Nate T Jun 08 '21 at 06:01
  • nvm not a thing with execute cmd... – Nate T Jun 08 '21 at 06:11