
To be precise

Some text
begin
Some text goes here.
end
Some more text

and I want to extract the entire block that starts at "begin" and runs through "end".

With awk we can do this with awk '/begin/,/end/' text.
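For reference, the awk range pattern can be tried on the sample text as sketched below. The temporary file and the variable names are assumptions for illustration, and the ^ and $ anchors are an addition so that only lines consisting of exactly "begin" or "end" delimit the block.

```shell
# Build the sample file from the question in a temporary location (hypothetical path).
sample=$(mktemp)
printf '%s\n' 'Some text' 'begin' 'Some text goes here.' 'end' 'Some more text' > "$sample"

# Range pattern: print from a line that is exactly "begin"
# through the next line that is exactly "end".
block=$(awk '/^begin$/,/^end$/' "$sample")
printf '%s\n' "$block"

rm -f "$sample"
```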

How can I do this with grep?

αғsнιη
Iker

2 Answers


Updated 18-Nov-2016: grep's behavior has changed. When -P is combined with -z, grep no longer supports the ^ and $ anchors (seen on Ubuntu 16.04 with kernel 4.4.0-21-generic). (wrong (non-)fix)

$ grep -Pzo "begin(.|\n)*\nend" file
begin
Some text goes here.  
end
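A minimal sketch for reproducing this, assuming GNU grep built with PCRE support (-P). The temporary file and the variable names are hypothetical. Because -z also makes grep end each output record with a NUL byte, the sketch pipes the result through tr to turn that NUL into a newline.

```shell
# Recreate the sample file in a temporary location (hypothetical path).
sample=$(mktemp)
printf '%s\n' 'Some text' 'begin' 'Some text goes here.' 'end' 'Some more text' > "$sample"

# -z: the whole file is one record, so the pattern can span lines.
# -o: print only the matched part. The output record ends with NUL,
# which tr converts to a newline.
matched=$(grep -Pzo 'begin(.|\n)*\nend' "$sample" | tr '\0' '\n')
printf '%s\n' "$matched"

rm -f "$sample"
```

Single quotes are used here so that the shell passes the pattern to grep unchanged.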

Note: for the other commands below, just replace the '^' and '$' anchors with the newline character '\n'.
______________________________

With grep command:

grep -Pzo "^begin\$(.|\n)*^end$" file

If you don't want to include the patterns "begin" and "end" in the result, use grep's lookbehind and lookahead support.

grep -Pzo "(?<=^begin$\n)(.|\n)*(?=\n^end$)" file

You can also use \K instead of the lookbehind assertion.

grep -Pzo "^begin$\n\K(.|\n)*(?=\n^end$)" file

\K discards everything matched before it, so the preceding pattern is still matched but is left out of the output.
\n is used to avoid printing empty lines in the output.
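The \K variant can be sketched as follows, assuming GNU grep with -P support. To sidestep the ^ and $ restriction described in the update, this sketch uses \n in place of the anchors; the file and variable names are hypothetical.

```shell
# Recreate the sample file in a temporary location (hypothetical path).
sample=$(mktemp)
printf '%s\n' 'Some text' 'begin' 'Some text goes here.' 'end' 'Some more text' > "$sample"

# "begin\n" must match, but \K drops it from the reported match.
# The lookahead (?=\nend) requires "\nend" to follow without consuming it,
# so neither delimiter nor the adjacent newlines are printed.
inner=$(grep -Pzo 'begin\n\K(.|\n)*(?=\nend)' "$sample" | tr '\0' '\n')
printf '%s\n' "$inner"

rm -f "$sample"
```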

Or, as @AvinashRaj suggests, there are simpler grep commands, as follows:

grep -Pzo "(?s)^begin$.*?^end$" file

grep -Pzo "^begin$[\s\S]*?^end$" file

(?s) tells grep to allow the dot to match newline characters.
[\s\S] matches any character that is either whitespace or non-whitespace.
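The lazy quantifier (.*?) in these commands matters once a file contains more than one block. The sketch below, assuming GNU grep with -P support, compares lazy and greedy matching on a hypothetical two-block file and uses \n instead of the ^ and $ anchors.

```shell
# Hypothetical sample with two begin/end blocks separated by a line "x".
sample=$(mktemp)
printf '%s\n' 'begin' 'a' 'end' 'x' 'begin' 'b' 'end' > "$sample"

# Lazy: .*? stops at the first "\nend", so each block is a separate match.
lazy=$(grep -Pzo '(?s)begin\n.*?\nend' "$sample" | tr '\0' '\n')

# Greedy: .* runs to the last "\nend", so the line "x" between blocks is included.
greedy=$(grep -Pzo '(?s)begin\n.*\nend' "$sample" | tr '\0' '\n')

printf 'lazy:\n%s\n\ngreedy:\n%s\n' "$lazy" "$greedy"

rm -f "$sample"
```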

And the versions that leave "begin" and "end" out of the output are as follows:

grep -Pzo "^begin$\n\K[\s\S]*?(?=\n^end$)" file # or grep -Pzo "(?<=^begin$\n)[\s\S]*?(?=\n^end$)"

grep -Pzo "(?s)(?<=^begin$\n).*?(?=\n^end$)" file

See the full test of all commands here (outdated, since grep's behavior with the -P parameter has changed).

Note:

^ marks the beginning of a line and $ marks the end of a line. They are added around "begin" and "end" so that these match only when they are alone on a line.
In two commands I escaped $ because it is also used for "Command Substitution" ($(command)), which allows the output of a command to replace the command itself.
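The quoting point can be checked without running grep at all. In this small sketch (variable names are hypothetical), the double-quoted pattern needs \$ before the parenthesis, while the single-quoted form needs no escaping, and both yield the same string.

```shell
# Double quotes: \$ keeps the shell from treating the following "(" as
# the start of a command substitution.
double_quoted="^begin\$(.|\n)*^end$"

# Single quotes: the shell leaves every character alone.
single_quoted='^begin$(.|\n)*^end$'

# Both lines print the same pattern text.
printf '%s\n' "$double_quoted" "$single_quoted"
```

Using single quotes for grep patterns avoids having to track which $ signs the shell might interpret.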

From man grep:

-o, --only-matching
      Print only the matched (non-empty) parts of a matching line,
      with each such part on a separate output line.

-P, --perl-regexp
      Interpret PATTERN as a Perl compatible regular expression (PCRE)

-z, --null-data
      Treat the input as a set of lines, each terminated by a zero byte
      (the ASCII NUL character) instead of a newline. Like the -Z or --null
      option, this option can be used with commands like sort -z to process
      arbitrary file names.
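The effect of -z on what counts as a "line" can be seen with a short sketch, assuming a grep that supports -z (such as GNU grep). The input here is made up for illustration and contains no NUL byte, so with -z it is treated as one single record.

```shell
# Without -z there are two newline-terminated lines, and both contain "o".
lines_matched=$(printf 'one\ntwo\n' | grep -c 'o')

# With -z the input is one NUL-terminated record, so the count is per record.
records_matched=$(printf 'one\ntwo\n' | grep -zc 'o')

printf 'without -z: %s, with -z: %s\n' "$lines_matched" "$records_matched"
```

This is why a pattern containing \n can match across what would otherwise be separate lines.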

αғsнιη
  • Change your grep to grep -Pzo "(?<=begin\n)(.|\n)*(?=\nend)" file so that it does not print the \n character at the end of the begin line. – Avinash Raj Nov 19 '14 at 12:18
  • Use the DOTALL modifier to make the dot match newline characters as well: grep -Pzo "(?s)begin.*?end" file – Avinash Raj Nov 19 '14 at 12:19
  • Or simply, grep -Pzo "begin[\s\S]*?end" file – Avinash Raj Nov 19 '14 at 12:20
  • @AvinashRaj thank you, I added that to avoid the \n, but you can post your other solution as your own answer ;) – αғsнιη Nov 19 '14 at 12:38
  • Why? add it to yours. I have more reps :-) – Avinash Raj Nov 19 '14 at 12:42
  • You might want to use grep -Pzo "begin(.|\n)*\nend" file instead to make sure that end only matches at the beginning of a line and not in things like bend. – terdon Nov 19 '14 at 13:18
  • @terdon Can I use ^end instead? or even better ^end$? – αғsнιη Nov 19 '14 at 13:20
  • Huh, yes you can . I had thought that the ^ would only match the beginning of the file when using -z but apparently not. – terdon Nov 19 '14 at 13:23
  • @terdon yes that is but in upper case of -Z I think. – αғsнιη Nov 19 '14 at 13:24
  • The man page says: "-z: Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline." so I would expect the ^ and $ to match just before and just after a \0 instead. Apparently, they're hard coded to match \n. – terdon Nov 19 '14 at 13:27
  • @KasiyA it would be grep -Pzo "(?s)(?<=^|\n)begin.*?\nend" file – Avinash Raj Nov 19 '14 at 15:52
  • The solution doesn't work. It produces an error: grep: ein nicht geschütztes ^ oder $ wird mit -Pz nicht unterstützt The translation of the error is something like: grep: an unprotected ^ or $ is not supported with -Pz – musbach Nov 15 '16 at 08:01
  • I guess grep's behavior has changed. I just tested and musbach is right, the ^ and $ don't work with -Pz. It should work as expected if you replace ^ and $ with \n though. – terdon Nov 15 '16 at 08:45
  • @terdon http://paste.ubuntu.com/9096940/ – αғsнιη Nov 15 '16 at 09:23
  • Yes, I know, that's in your answer. I'm sure it worked when you posted this, but try it again today. The behavior of grep seems to have changed. – terdon Nov 15 '16 at 09:24
  • @terdon you are right. This works: grep -Pzo "begin\n(.|\n)*\nend\n" file. If I put a \n before begin (grep -Pzo "\nbegin\n(.|\n)*\nend\n" file), I get a blank line and then the correct output. I guess that \n produces a linefeed, but it looks strange to me. @KasiyA I am on Ubuntu 16.04. What OS are you on? – musbach Nov 16 '16 at 20:24
  • @musbach yes, \n is the newline character. You get an extra newline because with \nbegin you are including the newline character at the end of the previous line, so that's printed as a blank line. – terdon Nov 16 '16 at 20:55
  • At that time I was on 14.04. Right now I'm away from my Ubuntu 16.04 machine and can't test it; once I'm back I will double check. But grep's behavior has certainly changed, as Mr. terdon confirmed, @musbach – αғsнιη Nov 17 '16 at 03:45
  • Yes, the answer should be corrected or flagged as wrong. – musbach Nov 17 '16 at 07:39
  • I also added that it works on 14.04 but does not work on 10.4 or on 16.04 (see below). Why it works only on 14.04 is very strange. – musbach Nov 18 '16 at 20:00
  • @musbach there's no way (and no reason) to flag an answer as wrong. You've left a comment explaining it, that's all that's needed. The answer was correct when posted, after all. – terdon Nov 18 '16 at 22:10

If your grep doesn't support Perl syntax (-P), you can try joining the lines, matching the pattern, and then splitting the lines again, as below:

$ tr '\n' , < foo.txt | grep -o "begin.*end" | tr , '\n'
begin
Some text goes here.
end
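One caveat with the comma as the joining character is that any comma already in the text is turned into a newline by the final tr. A variation on the same idea, sketched here under the assumption that the input never contains the control character \001, joins on that byte instead. The sample file and variable names are hypothetical.

```shell
# Hypothetical sample whose text contains commas.
sample=$(mktemp)
printf '%s\n' 'Some text, with a comma' 'begin' 'Some text goes here, too.' 'end' 'Some more text' > "$sample"

# Join lines on \001, match on the single joined line, then split again.
# Existing commas survive because they are no longer used as the separator.
joined=$(tr '\n' '\001' < "$sample" | grep -o 'begin.*end' | tr '\001' '\n')
printf '%s\n' "$joined"

rm -f "$sample"
```

Note that .* is greedy here, so with several blocks in one file the match runs from the first "begin" to the last "end".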
kenorb