We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.
0. First let's create a simple file to test our commands:
$ printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file
Top text
Sender <example@email.com>
<html>
The inner text 1
</html>
Middle text
<HTML>
The inner text 2
</HTML>
Bottom text
1. We can crop everything between the tags <html> and </html>, including the tags themselves, in this way:
$ sed -n -e '/<html>/,/<\/html>/p' example.file
<html>
The inner text 1
</html>
The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script that is added is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.
The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we should use some additional command(s) to tell sed what to print.
This additional command is the print command p, added to the end of the script. If sed is started without the -n option, the p command will print every matching line twice.
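To see how -n and p interact in isolation, here is a minimal sketch on a two-line input. The input text and the variable names are only illustrative.

```shell
# Without -n, sed prints each line automatically and 'p' prints it again.
doubled=$(printf 'one\ntwo\n' | sed 'p')

# With -n, automatic printing is off, so only 'p' produces output.
single=$(printf 'one\ntwo\n' | sed -n 'p')

printf 'without -n:\n%s\n\nwith -n:\n%s\n' "$doubled" "$single"
```

The first variant shows every line twice, the second shows every line once.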
Finally, with the two comma-separated patterns, /<html>/,/<\/html>/, we specify a range of lines. Note that we are using \ to escape the special character /, which plays the role of delimiter here.
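If the escaping gets hard to read, an address can also use another delimiter, introduced by a backslash. The sketch below compares both spellings on an illustrative input; the choice of # as delimiter and the variable names are my own.

```shell
# Illustrative input; any text containing an <html>...</html> block works.
sample=$(printf 'before\n<html>\ninner\n</html>\nafter\n')

# Default delimiter: the slash in </html> has to be escaped.
with_slash=$(printf '%s\n' "$sample" | sed -n '/<html>/,/<\/html>/p')

# Custom delimiter '#', introduced by a backslash: no escaping needed.
with_hash=$(printf '%s\n' "$sample" | sed -n '\#<html>#,\#</html>#p')

printf '%s\n' "$with_hash"
```

Both commands should select the same three lines.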
2. If we want to crop everything between the tags <html> and </html>, without printing the tags themselves, we should add some additional commands:
$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file
The inner text 1
The curly braces, { and }, are used to group the commands.
The command d deletes each line that matches the expression html> and immediately starts the next cycle, so p is never reached for those lines.
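One caveat: the loose pattern html> also deletes content lines that merely contain that string. A possible tightening, shown here as a sketch on a made-up input, is to anchor the pattern so it only matches lines consisting of the tag alone. It assumes each tag sits on its own line without surrounding whitespace.

```shell
# Made-up input: the middle line contains "html>" but is not a tag line.
sample=$(printf '<html>\nuses the html> marker\n</html>\n')

# Loose pattern: the content line is deleted as well.
loose=$(printf '%s\n' "$sample" | sed -n '/<html>/,/<\/html>/{ /html>/d; p }')

# Anchored pattern: only lines that are exactly <html> or </html> are deleted.
strict=$(printf '%s\n' "$sample" | sed -n '/<html>/,/<\/html>/{ /^<\/*html>$/d; p }')

printf 'loose:  [%s]\nstrict: [%s]\n' "$loose" "$strict"
```

With the anchored pattern the content line survives, while the loose pattern drops it.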
3. However, our example.file also has upper case <HTML> tags, so we should make the pattern match case-insensitively. We can do that by adding the flag I after the closing delimiter of each regular expression:
$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file
The inner text 1
The inner text 2
- The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
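For a sed that lacks the I modifier, a common workaround is to spell out both cases with bracket expressions. This is only a sketch on an illustrative input, not part of the commands above.

```shell
# Illustrative input with upper case tags.
sample=$(printf 'top\n<HTML>\ninner\n</HTML>\nbottom\n')

# Each bracket expression matches one letter in either case,
# so no case-insensitivity flag is needed.
portable=$(printf '%s\n' "$sample" |
  sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/p')

printf '%s\n' "$portable"
```

It is more verbose, but it relies only on basic regular expression syntax.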
4. If we want to remove all HTML tags between the <html> tags, we can add an additional command that finds and 'deletes' the strings that begin with < and end with >:
sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file
The command s substitutes the strings that match the expression /<[^>]*>/ with an empty string //, following the form s/<old>/<new>/.
The pattern flag g applies the replacement to all matches of the regexp on a line, not just the first.
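The effect of the g flag is easiest to see on a single line with several tags. The sample line and the variable names below are only illustrative.

```shell
line='<b>bold</b> and <i>italic</i>'

# Without g, only the first tag on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/<[^>]*>//')

# With g, every tag on the line is removed.
all_tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'first only: %s\nall tags:   %s\n' "$first_only" "$all_tags"
```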
Probably we would want to omit the delete command in this case:
sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
5. To make the changes in place and keep a backup copy of the original file, we can use the option -i (here with the suffix .bak). Alternatively, we can create a new file from sed's output by redirecting it with > to the new file:
sed -n '/<html>/I,/<\/html>/I p' example.file -i.bak
sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
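Before running an in-place edit on real data, it can be worth trying it in a scratch directory. The sketch below assumes GNU sed; the file name demo.file and the temporary directory are hypothetical.

```shell
# Work in a throwaway directory so that no real file is touched.
workdir=$(mktemp -d)
printf 'top\n<html>\ninner\n</html>\nbottom\n' > "$workdir/demo.file"

# Edit in place; the original content is kept as demo.file.bak.
sed -n -i.bak '/<html>/,/<\/html>/p' "$workdir/demo.file"

edited=$(cat "$workdir/demo.file")
backup=$(cat "$workdir/demo.file.bak")
printf 'edited:\n%s\n\nbackup:\n%s\n' "$edited" "$backup"

rm -r "$workdir"
```

Because -n is combined with -i, only the lines printed by p end up in the edited file, while the .bak copy keeps the full original.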