
I'm working on a project to find all the .tar installation files on my system using the command:

time find / -type f \( -name "*.tar" -o -name "*.tar.*" \) 2>/dev/null | wc

The first time it runs I get:

real    1m10.767s

The second time it runs I get:

real    0m9.847s

I would like to always get the second-run performance of under 10 seconds and forgo the initial 1 minute 10 seconds. What is the best way of avoiding the one-minute penalty the first time find is used?


Notes

  • Your initial find may be faster than mine because I have one Ubuntu 16.04 installation plus two Windows 10 installations, for a total of 2 million files.
  • OTOH your initial find may be slower, as I have Ubuntu 16.04 and one of the Windows 10 installations on a Samsung Pro 960 NVMe SSD rated at 3,000 MB/s, whereas hard drives are rated at about 140 MB/s and good SSDs at about 400 MB/s.
  • If you want to replicate tests but have no .tar files on your system, replace tar with bashrc in the section: -name "*.tar" -o -name "*.tar.*".
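For instance, the substituted command could look like this (a sketch only; it searches $HOME instead of / so it finishes quickly without root, and the count will differ per system):

```shell
# Count bashrc-style files under $HOME instead of tarballs under /
find "$HOME" -type f \( -name "*bashrc" -o -name "*bashrc.*" \) 2>/dev/null | wc -l
```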

TL;DR

Drop RAM caches that speed up find disk access

You can repeat first/second performance tests by calling this little script before the first find:

#!/bin/bash
# Drop the kernel's disk caches (must be run as root)
if [[ $(id -u) -ne 0 ]] ; then echo "Please run as root" ; exit 1 ; fi
sync; echo 1 > /proc/sys/vm/drop_caches   # free page cache
sync; echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
sync; echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries and inodes
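If you want to measure how much page cache a find run adds (about 500 MB in the GIF below), you can compare the Cached line of /proc/meminfo before and after. A minimal sketch, using /usr/share as a stand-in for / so it needs no root:

```shell
#!/bin/bash
# Report the kernel page-cache size (in kB) from /proc/meminfo
cached_kb() { awk '/^Cached:/ {print $2}' /proc/meminfo; }

before=$(cached_kb)
find /usr/share -type f > /dev/null 2>&1   # any directory-walking command
after=$(cached_kb)
echo "Cached before: ${before} kB, after: ${after} kB"
```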

GIF showing how much RAM disk caching consumes

The find command run across / consumes about 500 MB of cache buffers, as the .gif below shows when they are dropped:

drop_caches.gif

^^^--- Notice the memory line immediately below the terminal window shows a drop from 4.74 GiB to 4.24 GiB. It actually drops to 4.11 GiB after the Peek screen recorder saves the file and closes. On my system find's disk caching is using about 5% of RAM.

  • Are you trying to re-invent the locate command? – PerlDuck Apr 22 '18 at 12:44
  • @PerlDuck Actually I am, because locate doesn't allow searching for two patterns at once :) – WinEunuuchs2Unix Apr 22 '18 at 12:45
  • @PerlDuck The other problem is locate doesn't index the /tmp directory where some users might choose to download their .tar.XXX files. The locate commands that work best though are: locate .tar. (0.5 seconds) followed by locate -r "\.tar$" (0.8 seconds). The total of 1.3 seconds is much faster than RAM-cached find, which is 9.8 seconds. – WinEunuuchs2Unix Apr 22 '18 at 12:58
  • Sure. Because locate operates on a "database" (I don't know the exact format, but let's think of it as a CSV file which is grep'ed). This DB (/var/lib/mlocate/mlocate.db) is updated once a day (via /etc/cron.daily/mlocate). – PerlDuck Apr 22 '18 at 13:05
  • My manpage for locate says locate [OPTION]... PATTERN.... The ellipsis … after PATTERN indicates that you can supply any number of PATTERNs. Also, in /etc/updatedb.conf you can configure the search paths and add /tmp. – PerlDuck Apr 22 '18 at 13:10
  • @PerlDuck I'm a big fan of locate. I have my sudo crontab -e updating every 15 minutes with the line: */15 * * * * /usr/bin/updatedb. – WinEunuuchs2Unix Apr 22 '18 at 13:17
  • @PerlDuck Technically you don't actually add /tmp, you remove it from the pruned directories list. But it's still not something all users want to do. Also there are many who hate the locate command and prefer the find command. It is this audience I'm catering to. – WinEunuuchs2Unix Apr 22 '18 at 13:18
  • Hmmm... It's 7 minutes 32 seconds versus 9.8 seconds on my computer. I do not think there is a way to jump to the second, shorter time, because the first time is needed such that stuff is left in the memory cache for the second time. – Doug Smythies Apr 22 '18 at 15:34
  • Sorry, but I don't get what you want to achieve. You like locate and even configured it to update every 15 minutes, you know how to configure it to search /tmp, and you know that it accepts multiple search patterns. So why don't you stick with it? – PerlDuck Apr 22 '18 at 15:47
  • @PerlDuck For the final project I will use locate and find and leave it up to the user to comment out one line or the other. I'll take your vote to have the find command commented out and the locate command un-commented. PS if the script is called with a valid .tar file name, no search is performed to select from all downloaded .tar files. – WinEunuuchs2Unix Apr 22 '18 at 15:51
  • The only thing I can think of is (overkill but): 1) do find / -type f > /tmp/filelist-cached at boot, 2) put the results in a SQLite DB, 3) setup inotifywait to monitor all your filesystem activity and update the SQLite DB accordingly, and 4) have a custom script that selects in the DB. This will speed up finding files (how often is that required?) but slow down every. thing. else. Your choice. – PerlDuck Apr 22 '18 at 16:00
  • @PerlDuck This was initially a self-answered question and much to my horror the answer didn't work. I quickly deleted my answer and then spent a few hours testing different things and revised the deleted answer many times. I've "un-deleted" the answer just now. – WinEunuuchs2Unix Apr 22 '18 at 16:10
  • @DougSmythies Your 7 min 32 seconds must be on a spinning-platter drive? People don't reboot and run a find command first thing. They sign onto websites, search stuff and maybe see a find command to try out after their second cup of coffee. So the seven-minute background job would be complete and your find will be primed for lightning access. That is my theory anyway :) – WinEunuuchs2Unix Apr 22 '18 at 16:12
  • I never saw your answer. This seems to work fine for me, and gets the same answer and takes 1.5 seconds: time locate --regex "\.tar\." "\.tar$" | wc -l – Doug Smythies Apr 22 '18 at 16:13
  • @DougSmythies As commented to PerlDuck 3 comments up I deleted my answer then un-deleted it a few hours later after revisions. Thanks for the correct locate command syntax. My results are 1.061 seconds. I already love the locate command and have been promoting it over the find command: https://askubuntu.com/a/1020944/307523 but there are many who hate using locate so I'm trying to love find like that group. But I have to get over these speed-bumps that are slowing down my love... – WinEunuuchs2Unix Apr 22 '18 at 16:18
  • @DougSmythies @PerlDuck Would you write an answer to solve the problem with locate? I think it'd be good to have both approaches represented here – the whole discussion above has a lot to commend it. – dessert Apr 22 '18 at 20:59
  • @PerlDuck see comment above – dessert Apr 22 '18 at 21:00

1 Answer


Challenging project

The following sections describe things that should work but don't. In the end, the only "sure-fire" way of making this work was with this bash script:

#!/bin/bash
# NAME: find-cache
# DESC: cache find command search files to RAM
# NOTE: Written for: https://askubuntu.com/questions/1027186/improve-initial-use-of-find-performance-time?noredirect=1#comment1669639_1027186

for i in {1..10}; do
    echo "========================" >> /tmp/find-cache.log
    printf "find-cache.log # $i: " >> /tmp/find-cache.log
    date >> /tmp/find-cache.log
    echo "Free RAM at start:" >> /tmp/find-cache.log
    free -h | head -n2 >> /tmp/find-cache.log
    printf "Count of all files: " >> /tmp/find-cache.log
    SECONDS=0                       # Bash environment variable counting elapsed seconds
    find /* 2>/dev/null | wc -l >> /tmp/find-cache.log
    duration=$SECONDS               # Save elapsed seconds
    echo "$(($duration / 60)) minutes and $(($duration % 60)) seconds for find." >> /tmp/find-cache.log
    echo "Free RAM after find:" >> /tmp/find-cache.log
    free -h | head -n2 >> /tmp/find-cache.log
    echo "Sleeping 15 seconds..." >> /tmp/find-cache.log
    sleep 15
done

Copy above text to a script file named: find-cache. Put the script name in Startup Applications. Use the instructions in the next section but substitute the command name /usr/bin/find... with /<path-to-script>/find-cache.

Don't forget to mark the script as executable using:

chmod a+x /<path-to-script>/find-cache

<path-to-script> should be a directory in your $PATH environment variable, such as /usr/local/bin or, preferably, /home/<your-user-name>/bin. To double-check, use echo $PATH to reveal the variable's contents.
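A quick sketch for that double-check, listing each $PATH entry on its own line and testing for $HOME/bin (substitute whatever directory you chose):

```shell
# Show each $PATH entry on its own line
echo "$PATH" | tr ':' '\n'

# Exact-match test: is $HOME/bin one of the entries?
case ":$PATH:" in
    *":$HOME/bin:"*) echo "$HOME/bin is in PATH" ;;
    *)               echo "$HOME/bin is NOT in PATH" ;;
esac
```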

Every time I login I usually startup conky and firefox. You probably do other things. To fine-tune settings for your system check the log file:

$ cat /tmp/find-cache.log
========================
find-cache.log # 1: Sun Apr 22 09:48:40 MDT 2018
Free RAM at start:
              total        used        free      shared  buff/cache   available
Mem:           7.4G        431M        5.9G        628M        1.1G        6.1G
Count of all files: 1906881
0 minutes and 59 seconds for find.
Free RAM after find:
              total        used        free      shared  buff/cache   available
Mem:           7.4G        1.1G        3.0G        599M        3.3G        5.3G
Sleeping 15 seconds...
========================
find-cache.log # 2: Sun Apr 22 09:49:54 MDT 2018
Free RAM at start:
              total        used        free      shared  buff/cache   available
Mem:           7.4G        1.2G        2.9G        599M        3.3G        5.3G
Count of all files: 1903097
0 minutes and 9 seconds for find.
Free RAM after find:
              total        used        free      shared  buff/cache   available
Mem:           7.4G        1.1G        3.0G        599M        3.3G        5.3G
Sleeping 15 seconds...
(... SNIP ...)

Note: between the 1st and 2nd iterations free RAM drops 3 GB, but firefox was restoring 12 tabs at the same time.

What's going on? For whatever reason when find is run just once in a startup bash job, or a cron reboot bash job, the Linux Kernel thinks: "They probably don't want to keep the page cache so I'll empty it to save RAM". However when the find command is run 10 times as in this script the Linux Kernel thinks: "Whoaa they really like this stuff in the page cache, I better not clear it out".

At least that is my best guess. Regardless of the reason, this approach works as tested many times.
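You can reproduce this warm-cache effect on any directory tree without root or a reboot. A sketch (absolute numbers will differ from mine; /usr/share stands in for /):

```shell
#!/bin/bash
# Time the same find twice: the second run should be served
# largely from the kernel's dentry/inode and page caches.
for run in 1 2; do
    SECONDS=0                       # reset bash's elapsed-seconds counter
    find /usr/share -type f 2>/dev/null | wc -l > /dev/null
    echo "run $run took ${SECONDS} seconds"
done
```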


What should work but doesn't work

Below are two attempts at making this project work. I've left them here so others don't waste time repeating them. If you think you can fix them by all means refine them, post an answer and I'll gleefully up-vote.

Use Startup Applications

Tap and release the Windows / Super key (its icon varies by keyboard) to bring up the Dash.

In the search field type startup and you'll see the Startup Applications icon appear. Click the icon. When the window opens click Add on the right. Fill in the new Startup Program fields as follows:

  • Fill in the name as Cache Find to RAM.
  • Fill in the command as sleep 30 && find /* 2>/dev/null | wc.
  • Add a comment such as "Initial run of Find command to cache disk to ram".
  • Click the Add button on the bottom.

Now reboot and check performance of find command.

Credits: Windows Key icons copied from Super User post.


Cron at reboot

You can use cron to call the find command at boot time to cache the slow disk to fast RAM. Run the command crontab -e and add the following line at the bottom:

@reboot /bin/sleep 30 && /usr/bin/find /* 2>/dev/null | wc -l
  • @reboot tells cron to run this command at every boot / reboot.
  • /bin/sleep 30 has the find command wait 30 seconds before running so the boot runs as fast as possible. Increase this to 45 or 60 depending on your boot speed, time to login and the startup applications you run.
  • /usr/bin/find /* 2>/dev/null | wc -l calls the find command searching all files (/*). Any error messages are hidden by 2>/dev/null. The number of files is counted using | wc -l. On my system it is about 2 million due to one Ubuntu installation and two Windows 10 installations.
  • After adding the line use Ctrl+O followed by Enter to save the file.
  • After saving the file use Ctrl+X to exit the nano editor used by cron. If you chose a different editor than nano use the appropriate commands to save and exit.

As always the acronym YMMV (Your Mileage May Vary) applies.

After reboot I did these tests to prove it does not work:

rick@alien:~$ time find / -type f \( -name "*.tar" -o -name "*.tar.*" \) 2>/dev/null | wc
     26      26    1278

real    1m10.022s
user    0m7.246s
sys     0m12.840s
───────────────────────────────────────────────────────────────────────────────────────────
rick@alien:~$ time find / -type f \( -name "*.tar" -o -name "*.tar.*" \) 2>/dev/null | wc
     26      26    1278

real    0m8.954s
user    0m2.476s
sys     0m3.709s

  • @αғsнιη Thanks for pointing out that interesting link. However the Q&A is about caching file contents, which would take 180 GB of RAM on my system. I only need to cache 2 million file names, which appears to take only 1/2 GB of RAM. The linked answer is about increasing swappiness to 90/95. Indeed on my system it is set at 60, but it's a moot point because swap is never used. I don't have 10-million-line C projects floating around that use 112 MB of headers from /usr/src/linux-headers-4.14.30-041430/. Although this is a good link for C programmers, I just have humble bash scripts. – WinEunuuchs2Unix Apr 22 '18 at 17:31