
I want to gzip all *.vtu files, at every depth below a given directory, in bash. I have such files at depths 1 and 2 below ./. I managed to do so with

$ gzip -v $(find . -name "*.vtu")

I could also use find ... -exec, and other combinations (see below).
Is there a way to do this with gzip alone (-r was my candidate)?

I expected

$ gzip -r -v "*.vtu"

to work, with the pattern left unexpanded by the shell and expanded by gzip instead (in a way that produces my intended result), but I get gzip: ...: No such file or directory with every combination I tried. What I found is the following:

  1. With shopt -s globstar (from here), the command gzip -v **/*.vtu seems to do exactly what I want.
  2. If shopt | grep globstar gives globstar off, the command above does not work. In this case, I can use gzip -v */*.vtu, but it only works with files at depth=1. Likewise with gzip -v */*/*.vtu at depth=2.

In any case, I could not find out what the -r flag actually does or what it is useful for.

Related:

  1. gzip all files with specific extensions
  2. https://stackoverflow.com/questions/10363921/how-to-gzip-all-files-in-all-sub-directories-in-bash
muru

3 Answers


No, gzip can't do this. -r just means "descend into subdirectories"; there is no option for "descend into subdirectories and then look for files matching this glob". The expansion of the *.vtu glob happens before gzip is launched, and it is handled by the shell, not gzip, so gzip is given a specific list of files: those matching *.vtu in the current directory.

So yes, globstar is your best bet. As for the use of -r, that is explained in man gzip:

-r --recursive
       Travel the directory structure recursively.  If any of the file
       names  specified on the command line are directories, gzip will
       descend into the directory and compress all the files it  finds
       there (or decompress them in the case of gunzip ).

So gzip -r foo means "descend into foo if foo is a directory and gzip any files in it". If the arguments include both files and directories, for example because the directory you ran gzip in holds both file.vtu and my.vtu/ and the shell expands *.vtu to both, then the contents of my.vtu are compressed as well. Without -r, you would get my.vtu is a directory -- ignored.
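To see the difference for yourself, here is a minimal sketch that works in a throwaway directory. It assumes GNU gzip and mktemp are available; the names file.vtu, my.vtu and inner.dat are made up for the demonstration.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real gets compressed.
demo=$(mktemp -d)
cd "$demo" || exit 1

# One regular file and one directory, both matching *.vtu.
echo data > file.vtu
mkdir my.vtu
echo data > my.vtu/inner.dat

# Without -r: file.vtu is compressed and my.vtu is skipped.
# gzip reports the skipped directory with a non-zero exit status,
# which is ignored here.
gzip -v *.vtu || true

# With -r: the glob now expands to my.vtu only (file.vtu has become
# file.vtu.gz), and gzip descends into it and compresses every file
# inside, whatever its name.
gzip -r -v *.vtu

# Show the resulting tree.
find . | sort
```

Note that inner.dat gets compressed even though it does not match *.vtu: the glob selected the directory, and -r did the rest.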

Other options include:

  • find . -name "*.vtu" -exec gzip {} + to compress all matching files.
  • gzip **/*.vtu with globstar set.
  • find . -name "*.vtu" | xargs gzip (only as long as your names are sane and contain no whitespace, quotes or newlines)
  • find . -name "*.vtu" -print0 | xargs -0 gzip (safe for any file names, including ones with spaces or newlines)
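A minimal sketch of the globstar and find variants on a small nested tree, assuming bash 4 or later (for globstar) and GNU find. The directory layout and file names are invented for the example, and -type f is an extra guard I added so that directories named *.vtu are not handed to gzip.

```shell
#!/bin/bash
# Build a scratch tree with matching files at depths 0, 1 and 2.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p a/b
for f in top.vtu a/one.vtu a/b/two.vtu a/keep.txt; do
    echo data > "$f"
done

# With globstar set, ** matches zero or more directory levels,
# so the shell itself produces the full list of matching files.
shopt -s globstar
printf 'globstar sees: %s\n' **/*.vtu

# The find equivalent: find interprets the quoted pattern itself
# and passes the matches to gzip in batches.
find . -type f -name '*.vtu' -exec gzip -v {} +

# Show what is left: the .vtu files are now .vtu.gz, keep.txt is untouched.
find . -type f | sort
```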
terdon
  • I find the way -r works is very weird. I posted the details I found in an answer. – sancho.s ReinstateMonicaCellio Mar 21 '24 at 15:34
  • @sancho.sReinstateMonicaCellio please don't. That isn't answering the question, so it shouldn't really be posted as an answer. – terdon Mar 21 '24 at 16:43
  • Besides this specific example (which I will delete), I think it is quite reasonable to add as answers some comments that are pertinent, but which are not per se answers but are too long to fit in a comment and/or require some "complex" formatting to be usefully readable. I have seen quite a few of them across SE, and I find it useful. I will shortly remove this comment as well, to remove the clutter. – sancho.s ReinstateMonicaCellio Mar 21 '24 at 23:42
  • Shouldn't examples with find put pattern in single quote to prevent shell expansion if there are files with matching name at current dir? – Cthulhu Mar 22 '24 at 16:17
  • @Cthulhu nah, I mean sure, you can and it might even make more sense syntactically, but globs aren't expanded in double quotes either (variables are, but not globs). Try echo "/et*" for example, and you will get /et* as output and not /etc. – terdon Mar 22 '24 at 16:35

After the answer by terdon, and upon tinkering a bit, I came to the conclusion that the way -r works is the following:

  1. If what is matched is a file (only in the present directory) do gzip.
  2. If what is matched is a directory, enter that directory, and down there execute gzip -r *.

For me, this is extremely weird (and therefore I would have never imagined this is how it works). For instance, if in ./ I have

foo
foo.vtk
test.vtk/
test.vtk/another.vtk/
test.vtk/another.vtk/cake.vtk
test.vtk/another.vtk/dow.txt
test.vtk/cake.vtk
test.vtk/dow.txt
test.vtk/this/
test.vtk/this/cake.vtk
test.vtk/this/dow.txt

the command gzip -r -v *.vtk would gzip every file except ./foo: all files (not only *.vtk) in every top-level directory matching *.vtk, and in all of its subdirectories whatever their names.
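The tree above can be rebuilt in a scratch directory to check this. The sketch below assumes GNU gzip and mktemp; the printf line is only there to show the argument list that gzip receives once the shell has expanded *.vtk.

```shell
#!/bin/bash
# Recreate the example tree in a throwaway directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p test.vtk/another.vtk test.vtk/this
for f in foo foo.vtk \
         test.vtk/cake.vtk test.vtk/dow.txt \
         test.vtk/another.vtk/cake.vtk test.vtk/another.vtk/dow.txt \
         test.vtk/this/cake.vtk test.vtk/this/dow.txt; do
    echo data > "$f"
done

# What the command line looks like after the shell has replaced *.vtk
# with the matching names in the current directory.
printf 'arg: %s\n' gzip -r -v *.vtk

# The actual run.
gzip -r -v *.vtk

# List whatever is still uncompressed; per the description above,
# that should be ./foo only.
find . -type f ! -name '*.gz'
```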

  • Since this seems absolutely expected and intuitive for me, I really don't understand why it is confusing. Perhaps you should post another question? (this isn't answering your question, really). I suspect your main confusion is coming from the fact that you think you are passing *.vtk to gzip, but you are not. You are passing the result of the shell expanding *.vtk to the list of matching file names. So gzip never sees *.vtk, it just sees 'foo.vtk' 'test.vtk', so what you actually ran is gzip -r -v 'foo.vtk' 'test.vtk'. Is that more intuitive now? – terdon Mar 21 '24 at 16:31
  • You might want to compare with grep, specifically the difference between grep -r foo a*, which will search for foo in all files in the current dir whose name begins with a and in all files in all subdirectories whose name begins with a, and grep -r foo --include="a*" a*, which will do the same, except it will only search files in subdirectories whose name starts with a. This is how pretty much all utilities behave since, with very few exceptions (e.g. find -name, which expands its own globs), globs are expanded by the shell, not the command. – terdon Mar 21 '24 at 16:35
  • @terdon - This is not an answer, but it is very long as a comment. So if it might be useful for others, I think it is worth posting. – sancho.s ReinstateMonicaCellio Mar 21 '24 at 16:52
  • I am not saying it isn't useful, but it is not an answer so no, please don't post it here. Also, since you don't explain why you find the behavior weird and since the behavior is actually normal and common and how pretty much all commands work, it isn't really clear to me what you are trying to say. – terdon Mar 21 '24 at 17:00
  • @sancho.sReinstateMonicaCellio the confusion is around the way the shell is handling the glob pattern, not the command. Things will be very confusing (as they are now) until you grok this. Try: alias echoargs='printf "%s\n"'. Then echoargs -r -v *.vtk. Imagine the echoargs is your command, instead of gzip. That is the argument list your command is seeing, literally -r -v foo.vtk test.vtk. It doesn't see your pattern *.vtk. So it can't use your pattern in subdirs, because it never saw the pattern in the first place. – jrw32982 Mar 21 '24 at 20:54
  • Contrast that with the way a shell works in Windows (e.g. cmd.exe). In Windows, the command actually sees the whole command line, as it was entered, so it would get to see and interpret the *.vtk pattern. But in Linux, the shell sees and interprets the command line before the command sees the resulting argument list. The command never sees the command as it was entered. That's why you have to quote the (regex) pattern to grep if it has any shell metacharacters, so that the shell won't interpret the pattern and it will get passed as an argument so the command sees it. – jrw32982 Mar 21 '24 at 20:58
  • @terdon And gzip doesn't have a --include option the way grep does. Also, I think you did understand why the OP found it confusing (because you tried to accurately enlighten that confusion). As an aside, rather than post his comments as an answer, where would you suggest that OP put his comments? Perhaps in the original question? (see my related question here: https://unix.meta.stackexchange.com/q/7209/42620). – jrw32982 Mar 21 '24 at 21:05
  • @terdon - 1) I know that in the command I provided, *.vtu is expanded by the shell, prior to command execution, and I agree 100% with you that it is not a good example. What might be a good example of how I expected it to work is passing "*.vtu" instead of *.vtu, so that it is not expanded by the shell but processed by gzip, and then produces my intended result. I am not sure right now about similar examples of other commands... do you have any? Or is there none? – sancho.s ReinstateMonicaCellio Mar 21 '24 at 23:35
  • @terdon - 2) Thinking over all what was written, I again concur in that the behavior is as expected. It would be somewhat long to explain in writing why the confusion, and it wouldn't add anything. I will first read and process all other comments, and then remove this answer. 3) Besides this specific example, I think it is quite reasonable to add as answers some comments that are pertinent, but which are not per se answers but are too long to fit in a comment and/or require some "complex" formatting to be usefully readable. I have seen quite a few of them across SE, and I find it useful. – sancho.s ReinstateMonicaCellio Mar 21 '24 at 23:41
  • @jrw32982 I would edit the extra points into the question here, yes. Not a perfect solution though, granted. And yes, Sancho, sometimes it can be useful but only if what is posted is at least part of the answer. This feels more like part of the question to me. – terdon Mar 22 '24 at 14:22
  • Sancho, I think you probably understand it now, but if you run gzip -r -v "*.vtu", you will see that it produces something like gzip: *.vtu: No such file or directory because there is no file named literally *.vtu. What you're missing is an option for gzip, something like the --include option for grep. Perhaps you were expecting that all commands with a similar -r option should have a --include option too, the way grep does. That's where @terdon's answer's command patterns come in handy (i.e. build your own command to do what you want). – jrw32982 Mar 22 '24 at 20:17
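The quoting point made in these comments can be checked with a short sketch. It assumes bash and GNU gzip; one.vtu and two.vtu are invented names, and printf stands in for "whatever command receives the arguments".

```shell
#!/bin/bash
# Scratch directory with two files matching the pattern.
demo=$(mktemp -d)
cd "$demo" || exit 1
echo data > one.vtu
echo data > two.vtu

# Unquoted: the shell substitutes the matching names before the
# command runs, so the command receives one argument per file.
printf 'unquoted: %s\n' *.vtu

# Quoted: the command receives the literal string *.vtu as one argument.
printf 'quoted:   %s\n' "*.vtu"

# gzip does not interpret globs, so it treats that string as a file
# name, finds no such file, and exits with an error.
if ! gzip -r -v "*.vtu"; then
    echo "gzip found no file literally named *.vtu; nothing was compressed"
fi
```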

Not an exact answer to your question, but you can use xargs for that, which allows you to run multiple gzip processes in parallel, like

find -name '*.vtk' -print0 | xargs -r0n1 -P$(nproc) gzip
  • look for files
    • matching *.vtk, quoted so it is not expanded by the shell
    • print file names separated by NUL bytes (to have an unambiguous separator)
  • give the list of files to xargs
    • do not run if the list is empty (-r) because gzip would then use stdin
    • use NUL as separator (-0)
    • use one file name per gzip invocation (-n1)
    • run as many processes in parallel (-P) as the output of the nproc command says we have CPUs
    • run the gzip command for each input
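For anyone who wants to try the pipeline safely first, here is a sketch that runs it against a scratch directory. It assumes GNU find, GNU xargs (for -r and -P) and nproc from coreutils; the directory and file names, including the spaces in them, are made up to show that the NUL separator copes with awkward names.

```shell
#!/bin/bash
# Scratch tree with spaces in both directory and file names.
demo=$(mktemp -d)
mkdir -p "$demo/sub dir"
for i in 1 2 3 4; do
    echo "data $i" > "$demo/sub dir/file $i.vtk"
done

# Same pipeline as above, pointed at the scratch tree:
# NUL-separated names, one file per gzip process, one process per CPU.
find "$demo" -type f -name '*.vtk' -print0 |
    xargs -r0 -n1 -P"$(nproc)" gzip

# Every .vtk file should now have been replaced by a .vtk.gz file.
find "$demo" -type f | sort
```

Whether the parallelism pays off depends on file sizes and disk speed; for many small files the process start-up cost can dominate.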