8

I would like to know if there is a way to combine a series of grep statements where the effect is to "and" the expressions rather than "or" the matching expressions.

Demo below:

./script  
     From one grep statement, I want output like this
a b c

     not like this
a
c
a b
a b c
a b c d

Here is a look at the script.

 #!/bin/bash
 string="a
 b
 c
 d
 a b
 a b c
 a b c d"

 echo -e "\t From one grep statement I want output like this"
 echo "$string" |
 grep a |grep c |grep -v d #Correct output but pipes three grep statements

 echo -e "\n\tNot like this"
 echo "$string" |
 grep -e'a' -e'c' -e-v'd' #One grep statement but matching expressions are "or" versus "and"
Keith Reynolds
  • You could simplify it and run (if using Bash and GNU grep): grep -E 'a|c|d' <<<"$string". This gets you the output on a horizontal line. However, specifying -o (only matching) for grep, makes the output vertical. –  Aug 17 '13 at 00:34
  • @mike thank you for the redirect, but your grep statement still gives output more like the latter half – Keith Reynolds Aug 17 '13 at 16:10
  • @Mik He wants and not or. – Sparhawk Aug 30 '13 at 04:22
  • @Sparhawk Thank you. Yes, I want the effect of "anding" matching expressions rather than "oring" matching expressions. – Keith Reynolds Aug 26 '15 at 17:56

3 Answers

8

You cannot transform the filter grep a | grep c | grep -v d into a single simple grep. There are only complicated and inefficient ways: the result is slow and the meaning of the expression is obscured.

Single command combination of the three greps

If you just want to run a single command you can use awk which works with regular expressions too and can combine them with logical operators. Here is the equivalent of your filter:

awk '/a/ && /c/ && $0 !~ /d/'
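As a quick check, the filter can be run against the sample data from the question. This is only a minimal sketch: the function name and_filter is made up for illustration, and the input is the question's test string.

```shell
#!/bin/bash
# Sample data from the question, one entry per line.
string="a
b
c
d
a b
a b c
a b c d"

# and_filter is a hypothetical helper name, not part of the answer.
# It keeps lines that contain a AND c AND do NOT contain d.
and_filter() {
    awk '/a/ && /c/ && $0 !~ /d/'
}

printf '%s\n' "$string" | and_filter   # prints: a b c
```

Only "a b c" passes: the other lines either lack an a or a c, or contain a d.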

I think in most cases there is no reason for simplifying a pipe to a single command, except when the combination results in a relatively simple grep expression which could be faster (see results below).

Unix-like systems are designed to use pipes and to connect various utilities together. Pipe communication is not the most efficient mechanism possible, but in most cases it is sufficient. Because most new computers now have multiple CPU cores, you can "naturally" exploit CPU parallelism just by using a pipe: each command in the pipeline runs as a separate process.

Your original filter works very well and I think that in many cases the awk solution would be a little bit slower even on a single core.

Performance comparison

Using a simple program I generated a random test file with 200 000 000 lines, each consisting of 4 characters chosen at random from a, b, c and d. The file is 1 GB in size. During the tests it was completely loaded in the cache, so no disk operations affected the performance measurement. The tests were run on a dual-core Intel CPU.
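The generator program itself is not shown in the answer. A file of the same shape could be produced with something like the sketch below; the function name make_testfile, the awk approach, the fixed seed and the small line count are all assumptions chosen for illustration.

```shell
#!/bin/bash
# make_testfile is a hypothetical stand-in for the unshown generator.
# It writes N lines of 4 random characters from a, b, c, d to stdout.
make_testfile() {
    awk -v n="$1" 'BEGIN {
        srand(42)                       # fixed seed, an arbitrary choice
        split("a b c d", chars, " ")
        for (i = 0; i < n; i++) {
            line = ""
            for (j = 0; j < 4; j++)
                line = line chars[int(rand() * 4) + 1]
            print line
        }
    }'
}

# The answer used 200 000 000 lines; 1000 keeps this sketch quick.
testfile=$(mktemp)
make_testfile 1000 > "$testfile"
wc -l < "$testfile"
```

Each line is 4 characters plus a newline, so 200 000 000 lines come to the 1 GB mentioned above.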

Single grep

$ time ( grep -E '^[^d]*a[^d]*c[^d]*$|^[^d]*c[^d]*a[^d]*$' testfile >/dev/null )
real    3m2.752s
user    3m2.411s
sys     0m0.252s

Single awk

$ time ( awk '/a/ && /c/ && $0 !~ /d/' testfile >/dev/null )
real    0m54.088s
user    0m53.755s
sys     0m0.304s

The original three greps piped

$ time ( grep a testfile | grep c | grep -v d >/dev/null )
real    0m28.794s
user    0m52.715s
sys     0m1.072s

Hybrid - positive greps combined, negative piped

$ time ( grep -E 'a.*c|c.*a' testfile | grep -v d >/dev/null )
real    0m15.838s
user    0m24.998s
sys     0m0.676s

Here you see that the single grep is very slow because of the complex expression. The original pipe of three greps is pretty fast thanks to good parallelization. Without parallelization, on a single core, the original pipe runs just slightly faster than awk, which as a single process is not parallelized. Awk and grep probably use the same regular expression code, and the logic of the two solutions is similar.

The clear winner is the hybrid, combining the two positive greps and leaving the negative one in the pipe. It seems that the regular expression with | has no performance penalty.

  • One reason for simplifying pipes into a single command is for efficiency. If we start with a large file, then pipes force it to be read multiple times. Alternatively, it might be possible to read through the file once, with a single grep command. – Sparhawk Aug 30 '13 at 04:25
  • @Sparhawk: This is a misunderstanding of pipes. The source file (though in this example there is no file, only a variable) is read only once, and the result of the first grep is passed to the next one directly through system calls and RAM. In fact the first grep can generate much smaller output for the second grep. – pabouk - Ukraine stay strong Aug 30 '13 at 05:24
  • @Sparhawk: Thank you for the solution with a single grep. I did not think of such a complicated transformation :) I have slightly updated my reply. – pabouk - Ukraine stay strong Aug 30 '13 at 05:31
  • Nice work. You are correct about the pipes. I forgot that they didn't work sequentially, reading from stdin each time, but instead work concurrently. You've clearly shown that this is efficient from your test. (Having said that, I wonder if my one-liner is less efficient because the piped version can stop grepping when it encounters the first match. The awk test suggests that if grep was capable of AND, then the one-liner might be about the same efficiency as the piped version. Also, it's interesting that piped grep parallelises, while awk cannot.) – Sparhawk Aug 30 '13 at 10:47
  • Just out of interest, if you still have the files, can you please test time ( grep -E 'a.*c|c.*a' testfile | grep -v 'd' >/dev/null )? Cheers. – Sparhawk Aug 30 '13 at 10:49
  • @Sparhawk: Good idea with testing the hybrid solution. It is clearly the fastest one! ...it's interesting that piped grep parallelises, while awk cannot... The explanation is simple: The parallelisation is caused by pipe (the components of pipe can run in parallel). The awk example runs a single program (which has no internal parallelisation) and there is no pipe. In such cases you can use for example GNU parallel for parallelisation. – pabouk - Ukraine stay strong Aug 30 '13 at 12:53
2

The problem is that -e works as an or, not as an and. You can do it in one line, but it's pretty convoluted. The not part is the most complicated.

To simplify the a and c parts (assuming order is unknown):

grep -E 'a.*c|c.*a'

or

grep -e 'a.*c' -e 'c.*a'

Hence, you could do

grep -E 'a.*c|c.*a' | grep -v 'd'

For a single grep statement, you'll have to make sure there are no ds before, after or between the a and c:

grep -E '^[^d]*a[^d]*c[^d]*$|^[^d]*c[^d]*a[^d]*$'
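To see that the variants above agree, they can be run side by side on the sample data from the question. This is a small sketch; the function names pipeline, hybrid and single are invented here purely for illustration.

```shell
#!/bin/bash
# Sample data from the question, one entry per line.
string="a
b
c
d
a b
a b c
a b c d"

# Hypothetical wrapper names for the three forms discussed above.
pipeline() { grep a | grep c | grep -v d; }
hybrid()   { grep -E 'a.*c|c.*a' | grep -v 'd'; }
single()   { grep -E '^[^d]*a[^d]*c[^d]*$|^[^d]*c[^d]*a[^d]*$'; }

# Run each variant on the same input and label its output.
for f in pipeline hybrid single; do
    printf '%s: %s\n' "$f" "$(printf '%s\n' "$string" | "$f")"
done
```

All three print "a b c" for this input; they differ only in readability and speed.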
Sparhawk
-1

You can use the -x switch, which according to the grep man page will "select only those matches that exactly match the whole line."

In your example, try: grep -x "a b c"
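A short sketch of what -x does with the question's sample data follows. The function name exact_match is hypothetical, and the extra "a c" input is added here only to show where whole-line matching stops being equivalent to the a-and-c-but-not-d filter.

```shell
#!/bin/bash
# Sample data from the question, one entry per line.
string="a
b
c
d
a b
a b c
a b c d"

# exact_match is a made-up helper: it keeps only lines equal to "a b c".
exact_match() { grep -x "a b c"; }

printf '%s\n' "$string" | exact_match      # prints: a b c

# "a c" contains a and c and no d, but it is not the exact line, so -x rejects it.
printf '%s\n' "a c" | exact_match || echo "no match for: a c"
```

This only helps when the complete line is known in advance.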

  • That would work if I knew what the whole line will look like, but wouldn't for something like: tail -f /var/log/auth.log |grep sshd |grep "Failed password for" |egrep -v "certain users|ipaddress" – Keith Reynolds Aug 17 '13 at 15:58