
When grep or sed is used with the option --extended-regexp and the pattern contains a bounded repetition such as {1,9999}, the command becomes very slow. To make this clearer, a few tests are shown below.[1] [2]

  • The relative performance of grep -E, egrep and sed -E is almost identical, so only the tests made with grep -E are shown.

Test 1

$ time grep -E '[0-9]{1,99}' < /dev/null

real    0m0.002s

Test 2

$ time grep -E '[0-9]{1,9999}' < /dev/null

real    0m0.494s

Test 3

$ time grep -E '[0123456789]{1,9999}' < /dev/null

real    21m43.947s

Test 4

$ time grep -E '[0123456789]+' < /dev/null
$ time grep -E '[0123456789]*' < /dev/null
$ time grep -E '[0123456789]{1,}' < /dev/null
$ time grep -P '[0123456789]{1,9999}' < /dev/null

real    0m0.002s       
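The tests above can be condensed into one loop. This is only a sketch; the exact timings vary with the machine and the grep version:

```shell
#!/bin/bash
# Time ERE compilation for growing repetition bounds.
# Input is /dev/null, so grep reads nothing and exits immediately
# after building the regex: the timing is essentially pure
# pattern-compilation cost, not matching.
for n in 99 999 9999; do
    echo "--- [0-9]{1,$n} ---"
    time grep -E "[0-9]{1,$n}" < /dev/null
done
```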

What is the reason for this significant difference in performance?

pa4080

  • That's an interesting observation - my guess is you'd need to dig deep into grep's internals to find exactly how it's building the parse tree (it would be interesting to compare [0-9]+ as well) – steeldriver Sep 29 '17 at 15:49
  • I think it's not grep itself; it's shell-related parsing of {...}. – αғsнιη Sep 29 '17 at 16:09
  • The input doesn't matter. As @steeldriver suggests, the slowdown precedes matching. A simpler test is time grep -E '[0-9]{1,99}' </dev/null vs. time grep -E '[0-9]{1,9999}' </dev/null. Even with no input, the second command is slow (on 16.04). As expected, omitting -E and escaping { and } behaves the same and replacing -E with -P isn't slow (PCRE is a different engine). Most interesting is how much faster [0-9] is than ., x, and even [0123456789]. With any of those and {1,9999}, grep consumes a huge amount of RAM; I haven't dared let it run for more than ~10min. – Eliah Kagan Sep 29 '17 at 16:13
  • @αғsнιη No, the { } are ' ' quoted; the shell passes them unchanged to grep. Anyway, {1,9999} would be a very fast and simple brace expansion. The shell would just expand it to 1 9999. – Eliah Kagan Sep 29 '17 at 16:13
  • I guess the one-line answer is that grep has to build a much deeper parse tree for an explicit repetition range such as [0-9]{n,m} than for a generic quantifier such as [0-9]+ or [0-9]* or even [0-9]{1,} - have a look at the source code, in particular lib/regcomp.c – steeldriver Sep 29 '17 at 16:14
  • @steeldriver That makes sense, but why is grep -E 'x{1,9999}' </dev/null (and various other stuff) so much slower than grep -E '[0-9]{1,9999}' </dev/null? It seems grep is designed to optimize for certain cases and it's backfiring badly here. grep apparently builds a really deep parse tree here--for some {1,9999} repetitions more than others!--but at least from a design perspective, it can't need to do so, because using -P (for PCRE) eliminates the wait. – Eliah Kagan Sep 29 '17 at 16:26
  • @EliahKagan But the OP's third update with sed confirms it isn't a grep-only performance issue, so I still suspect it could be shell-related. – αғsнιη Sep 29 '17 at 16:31
  • @αғsнιη I don't know quite what you mean, but this definitely has nothing to do with the shell. During a long-running command, I used ps and top to verify grep was passed the expected arguments and that it, not bash, consumes lots of RAM and CPU. I expect grep and sed both use the POSIX regex functions implemented in libc for BRE/ERE matching; I shouldn't really have talked about grep design specifically, except insofar as the grep developers chose to use that library. – Eliah Kagan Sep 29 '17 at 16:39
  • I suggest that you replace the tests with time grep ... < /dev/null, so that people don't conflate the actual problem with the data fed to grep and other extraneous things. – muru Sep 29 '17 at 19:00
  • … and change the title, as it's not only grep, but also sed at least! – dessert Sep 29 '17 at 19:01
  • Let's continue the discussion in this chatroom: https://chat.stackexchange.com/rooms/66406/room-for-dessert-and-eliah-kagan – dessert Sep 29 '17 at 19:28
  • Hi @muru, where can I read an explanation of < /dev/null? I see that it works, but it's unclear to me what data is redirected to the command; what actually happens? – pa4080 Oct 02 '17 at 10:28
  • @pa4080 that's the point: there's no data. grep reads nothing, so timing is almost entirely grep's startup – muru Oct 02 '17 at 10:43
  • @EliahKagan you can confirm what libraries are used with ldd $(which grep) $(which awk), no need to guess. – pbhj Nov 09 '17 at 22:00
  • @pbhj I don't follow. Are you proposing that I run ldd on grep and awk, notice that both binaries link to libc.so.6, and consider that to "confirm" that POSIX regex functions from <regex.h> were used? ldd tells you what shared libraries a binary links to, not what it uses from each of them, nor even that it uses anything from them. – Eliah Kagan Nov 09 '17 at 22:10

1 Answer


Note that it's not the matching that takes time, but the building of the RE. You'll find that it uses quite a lot of RAM as well:

$ valgrind grep -Eo '[0-9]{1,9999}' < /dev/null
==6518== HEAP SUMMARY:
==6518==     in use at exit: 1,603,530,656 bytes in 60,013 blocks
==6518==   total heap usage: 123,613 allocs, 63,600 frees, 1,612,381,621 bytes allocated
$ valgrind grep -Eo '[0-9]{1,99}' < /dev/null
==6578==     in use at exit: 242,028 bytes in 613 blocks
==6578==   total heap usage: 1,459 allocs, 846 frees, 362,387 bytes allocated
$ valgrind grep -Eo '[0-9]{1,999}' < /dev/null
==6594== HEAP SUMMARY:
==6594==     in use at exit: 16,429,496 bytes in 6,013 blocks
==6594==   total heap usage: 12,586 allocs, 6,573 frees, 17,378,572 bytes allocated

The number of allocations seems roughly proportional to the number of iterations, but the amount of memory allocated grows much faster than linearly, roughly quadratically in the upper bound.
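The growth is also visible without valgrind. A sketch using GNU time's -v flag, assuming GNU time is installed as /usr/bin/time (it is not the shell's time keyword):

```shell
#!/bin/bash
# Peak memory (max RSS) for growing repetition bounds.
# Assumes GNU time at /usr/bin/time; its -v output includes a
# "Maximum resident set size (kbytes)" line.
for n in 99 999 9999; do
    printf '{1,%s}: ' "$n"
    /usr/bin/time -v grep -E "[0-9]{1,$n}" < /dev/null 2>&1 |
        grep -i 'maximum resident'
done
```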

That's down to how the GNU regex engine is implemented. If you build GNU grep with CPPFLAGS=-DDEBUG ./configure && make and run those commands, you'll see the effect in action. Going deeper than that would mean going through a lot of DFA theory and diving into the gnulib regex implementation.

Here, you can use PCREs instead, which don't seem to have the same problem: grep -Po '[0-9]{1,65535}' (65,535 is the maximum bound, though you can always nest quantifiers like [0-9](?:[0-9]{0,10000}){100} for 1 to 1,000,001 repetitions) takes no more time or memory than grep -Po '[0-9]{1,2}'.
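The nested-quantifier workaround is easy to sanity-check. In the sketch below (assuming grep is built with PCRE support, i.e. -P works), the pattern matches runs of 1 to 1,000,001 digits while keeping every individual bound well under PCRE's 65,535 limit:

```shell
#!/bin/bash
# One mandatory digit, then 100 groups of up to 10,000 digits each:
# 1 to 1,000,001 repetitions total, no single bound above 10,000.
pat='[0-9](?:[0-9]{0,10000}){100}'

# A 1-digit line and a 20,000-digit line should both match.
{ echo 7; head -c 20000 /dev/zero | tr '\0' '5'; echo; } |
    grep -Pc "$pat"    # prints 2 (both lines match)
```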