Sed is dumping the entire file

Question

I'm trying to parse the contents of an HTML file to scrape a download directory. I've reduced the problem to an MWE that reproduces my issue:

sed -e 's|\(href\)|\1|' index.html

This prints the entirety of index.html. I originally thought the problem was my expression, but this very basic expression proves otherwise.

The same happens if I remove -e or if I add g at the end.

It's been a while since I've used sed. Am I doing something wrong here? Is sed getting confused by the characters in an HTML file?

For what you are looking for I suppose grep is the command to go with. — Ravexina, Mar 21 '19 at 20:32
@Ravexina grep prints the entire line, I am looking for a small portion of a line. — Brydon Gibson, Mar 21 '19 at 20:36
@zx485 Changing to / (or ,) does not change the behaviour — Brydon Gibson, Mar 21 '19 at 20:36
Use grep -o so grep prints only the matched (non-empty) parts of a matching line. — Ravexina, Mar 21 '19 at 20:52

score 2 · Accepted Answer · answered Mar 21 '19 at 21:07

You should use grep to find text in a file; sed is better suited to text substitutions.

If you want to list the hypertext links, you can simply grep the file like this:

grep -Po '(?<=href=")[^"]*' index.html
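
Since a comment asks what the options do, here is a minimal sketch of the command above run against a made-up file. The file contents and the example.com URLs are assumptions for illustration, and -P needs a grep built with PCRE support (GNU grep normally has it).

```shell
# Build a small sample file in a temporary directory (hypothetical content).
tmpdir=$(mktemp -d)
cat > "$tmpdir/index.html" <<'EOF'
<html><body>
<a href="https://example.com/a.zip">A</a> <a href="https://example.com/b.zip">B</a>
<p>No link on this line.</p>
</body></html>
EOF

# -o prints only the matched text, one match per output line.
# -P enables Perl-compatible regexes, needed here for the lookbehind (?<=href=").
# [^"]* then matches everything up to, but not including, the closing quote.
links=$(grep -Po '(?<=href=")[^"]*' "$tmpdir/index.html")
printf '%s\n' "$links"

rm -r "$tmpdir"
```

Because the lookbehind is not part of the match, the href=" prefix is left out of the output, and two links on the same line come out as two separate lines.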

cmak.fr

This is the solution I'm using, however it may benefit from explaining what -o does - would make the answer more complete. – Brydon Gibson Mar 22 '19 at 15:49

pa4080 · Answer 2 · 2019-03-21T21:34:46.603

What you've described sounds like the normal behaviour of sed when used with the substitute command. I suppose you are looking for something like this:

sed -nr 's/^.*href="(http.*)".*$/\1/p' index.html

Where:

/ is used as delimiter in this case (you can use | or #, etc.).
The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so with this option we need some additional command to tell sed what to print.
This additional command is the print flag p, added to the end of the script. If sed were started without -n, the p flag would print each matching line twice.
The option -r enables the extended regular expressions. Without this option our command can be:
```
sed -n 's/^.*href="\(http.*\)".*$/\1/p' index.html
```
The command s means substitute: s#<string-or-regexp>#<replacement>#, where # stands for whichever delimiter you choose.
^ matches the beginning of the line; $ matches the end of the line.
Within the <string-or-regexp>, the text matched by the capture group (http.*) is available in the <replacement> as \1.

Example of usage:

$ cat index.html 
<!DOCTYPE html>
<html><head><title>Page Title</title></head><body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <a href="https://www.w3schools.com">Visit W3Schools</a>
</body></html>

$ sed -nr 's/^.*href="(http.*)".*$/\1/p' index.html 
https://www.w3schools.com
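
To see the difference between sed's default printing and -n with the p flag side by side, here is a rough sketch on a two-line file. The file contents are made up, and the second command uses [^"]* instead of .* so the capture stops at the first closing quote; that is my variation, not the command from the answer.

```shell
# Two-line sample file (hypothetical content) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<p>intro</p>' '<a href="https://example.com">x</a>' > "$tmpdir/index.html"

# Without -n, sed prints every line after the script has run, so replacing
# "href" with itself reproduces the file unchanged.
all=$(sed -e 's|\(href\)|\1|' "$tmpdir/index.html")

# With -n plus the p flag, only lines where the substitution succeeded are printed.
only=$(sed -n 's/^.*href="\(http[^"]*\)".*$/\1/p' "$tmpdir/index.html")

printf 'default:\n%s\n\nwith -n and p:\n%s\n' "$all" "$only"
rm -r "$tmpdir"
```

The first command echoes both lines back, which is the behaviour from the question; the second prints only https://example.com.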

score 1 · Answer 3 · answered Mar 21 '19 at 20:42

This may be overly cumbersome, but I think it would work for you, as long as your href contents contain no spaces.

grep "href" index.html |tr ' ' '\n'|grep "^href" |cut -f2 -d'='

The first grep singles out only lines that contain the href. The tr converts spaces to newlines. The second grep grabs just the href section you were interested in. Finally, the cut grabs everything after the "href=".
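
As a quick sketch of what the pipeline yields, here it is on a made-up one-link file (the contents and URL are assumptions). It also runs a variant that cuts on the double quote instead of the equals sign, which isolates just the URL.

```shell
# Sample file (hypothetical content) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<p>intro</p>' '<a href="https://example.com/file.zip">file</a>' > "$tmpdir/index.html"

# Stage by stage: keep lines mentioning href, put each space-separated word on
# its own line, keep the words that start with href, then cut on a delimiter.
by_equals=$(grep "href" "$tmpdir/index.html" | tr ' ' '\n' | grep "^href" | cut -f2 -d'=')
by_quote=$(grep "href" "$tmpdir/index.html" | tr ' ' '\n' | grep "^href" | cut -f2 -d'"')

printf 'cut on =: %s\ncut on ": %s\n' "$by_equals" "$by_quote"
rm -r "$tmpdir"
```

Cutting on = leaves the quotes and whatever trails the attribute in the same word, while cutting on " returns only the text between the first pair of quotes.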

S. Nixon

I am actually looking for what's after the href, so I'm looking for href="[I want this content]" – Brydon Gibson Mar 21 '19 at 20:47
And that's what you get, except it's wrapped in quotes. If you use cut -f2 -d'"' instead, it should give just the contents of the href. Again, much more cumbersome than what others proposed, but I was not aware of the grep -o option. – S. Nixon Mar 22 '19 at 14:26
