When grep or sed are used with the option --extended-regexp and the pattern {1,9999} is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]
- The relative performance of
grep -E,egrepandsed -Eis almost equal, so only the test that were made withgrep -Eare provided.
Test 1
$ time grep -E '[0-9]{1,99}' < /dev/null
real 0m0.002s
Test 2
$ time grep -E '[0-9]{1,9999}' < /dev/null
> real 0m0.494s
Test 3
$ time grep -E '[0123456789]{1,9999}' < /dev/null
> real 21m43.947s
Test 4
$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null
real 0m0.002s
What is the reason of this significant difference of the performance?
[0-9]+as well) – steeldriver Sep 29 '17 at 15:49grepitself, it's shell related parse for{...}. – αғsнιη Sep 29 '17 at 16:09time grep -E '[0-9]{1,99}' </dev/nullvs.time grep -E '[0-9]{1,9999}' </dev/null. Even with no input, the second command is slow (on 16.04). As expected, omitting-Eand escaping{and}behaves the same and replacing-Ewith-Pisn't slow (PCRE is a different engine). Most interesting is how much faster[0-9]is than.,x, and even[0123456789]. With any of those and{1,9999},grepconsumes a huge amount of RAM; I haven't dared let it run for more than ~10min. – Eliah Kagan Sep 29 '17 at 16:13{}are''quoted; the shell passes them unchanged togrep. Anyway,{1,9999}would be a very fast and simple brace expansion. The shell would just expand it to1 9999. – Eliah Kagan Sep 29 '17 at 16:13grephas to build a much deeper parse tree for an explicit repetition range such as[0-9]{n,m}than for a generic quantifier such as[0-9]+or[0-9]*or even[0-9]{1,}- have a look at the source code, in particularlib/regcomp.c– steeldriver Sep 29 '17 at 16:14grep -E 'x{1,9999}' </dev/null(and various other stuff) so much slower thangrep -E '[0-9]{1,9999}' </dev/null? It seemsgrepis designed to optimize for certain cases and it's backfiring badly here.grepapparently builds a really deep parse tree here--for some{1,9999}repetitions more than others!--but at least from a design perspective, it can't need to do so, because using-P(for PCRE) eliminates the wait. – Eliah Kagan Sep 29 '17 at 16:26sedconfirm it doesn'tgreponly performance, so I doubt it's can be shell related performance. – αғsнιη Sep 29 '17 at 16:31psandtopto verifygrepwas passed the expected arguments and that it, notbash, consumes lots of RAM and CPU. I expectgrepandsedboth use the POSIX regex functions implemented in libc for BRE/ERE matching; I shouldn't really have talked aboutgrepdesign specifically, except insofar as thegrepdevelopers chose to use that library. – Eliah Kagan Sep 29 '17 at 16:39time grep ... < /dev/null, so that people don't conflate the actual problem with the data fed togrepand other extraneous things. – muru Sep 29 '17 at 19:00grep, but alsosedat least! – dessert Sep 29 '17 at 19:01< /dev/null? I see it is working, but for me is unclear what portion of data is redirected to the command, what actually happens? – pa4080 Oct 02 '17 at 10:28ldd $(which grep) $(which awk), no need to guess. – pbhj Nov 09 '17 at 22:00lddongrepandawk, notice that both binaries link tolibc.so.6, and consider that to "confirm" that POSIX regex functions from<regex.h>were used?lddtells you what shared libraries a binary links to, not what it uses from each of them, nor even that it uses anything from them. – Eliah Kagan Nov 09 '17 at 22:10