When grep
or sed
are used with the option --extended-regexp
and the pattern {1,9999}
is a part of the regexp that is used, the performance of these commands become low. To be more clear, below are applied few tests.[1] [2]
- The relative performance of
grep -E
,egrep
andsed -E
is almost equal, so only the test that were made withgrep -E
are provided.
Test 1
$ time grep -E '[0-9]{1,99}' < /dev/null
real 0m0.002s
Test 2
$ time grep -E '[0-9]{1,9999}' < /dev/null
> real 0m0.494s
Test 3
$ time grep -E '[0123456789]{1,9999}' < /dev/null > real 21m43.947s
Test 4
$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null
real 0m0.002s
What is the reason of this significant difference of the performance?
[0-9]+
as well) – steeldriver Sep 29 '17 at 15:49grep
itself, it's shell related parse for{...}
. – αғsнιη Sep 29 '17 at 16:09time grep -E '[0-9]{1,99}' </dev/null
vs.time grep -E '[0-9]{1,9999}' </dev/null
. Even with no input, the second command is slow (on 16.04). As expected, omitting-E
and escaping{
and}
behaves the same and replacing-E
with-P
isn't slow (PCRE is a different engine). Most interesting is how much faster[0-9]
is than.
,x
, and even[0123456789]
. With any of those and{1,9999}
,grep
consumes a huge amount of RAM; I haven't dared let it run for more than ~10min. – Eliah Kagan Sep 29 '17 at 16:13{
}
are'
'
quoted; the shell passes them unchanged togrep
. Anyway,{1,9999}
would be a very fast and simple brace expansion. The shell would just expand it to1 9999
. – Eliah Kagan Sep 29 '17 at 16:13grep
has to build a much deeper parse tree for an explicit repetition range such as[0-9]{n,m}
than for a generic quantifier such as[0-9]+
or[0-9]*
or even[0-9]{1,}
- have a look at the source code, in particularlib/regcomp.c
– steeldriver Sep 29 '17 at 16:14grep -E 'x{1,9999}' </dev/null
(and various other stuff) so much slower thangrep -E '[0-9]{1,9999}' </dev/null
? It seemsgrep
is designed to optimize for certain cases and it's backfiring badly here.grep
apparently builds a really deep parse tree here--for some{1,9999}
repetitions more than others!--but at least from a design perspective, it can't need to do so, because using-P
(for PCRE) eliminates the wait. – Eliah Kagan Sep 29 '17 at 16:26sed
confirm it doesn'tgrep
only performance, so I doubt it's can be shell related performance. – αғsнιη Sep 29 '17 at 16:31ps
andtop
to verifygrep
was passed the expected arguments and that it, notbash
, consumes lots of RAM and CPU. I expectgrep
andsed
both use the POSIX regex functions implemented in libc for BRE/ERE matching; I shouldn't really have talked aboutgrep
design specifically, except insofar as thegrep
developers chose to use that library. – Eliah Kagan Sep 29 '17 at 16:39time grep ... < /dev/null
, so that people don't conflate the actual problem with the data fed togrep
and other extraneous things. – muru Sep 29 '17 at 19:00grep
, but alsosed
at least! – dessert Sep 29 '17 at 19:01< /dev/null
? I see it is working, but for me is unclear what portion of data is redirected to the command, what actually happens? – pa4080 Oct 02 '17 at 10:28ldd $(which grep) $(which awk)
, no need to guess. – pbhj Nov 09 '17 at 22:00ldd
ongrep
andawk
, notice that both binaries link tolibc.so.6
, and consider that to "confirm" that POSIX regex functions from<regex.h>
were used?ldd
tells you what shared libraries a binary links to, not what it uses from each of them, nor even that it uses anything from them. – Eliah Kagan Nov 09 '17 at 22:10