-1

I'm trying to parse the contents of an HTML file to scrape a download directory. I've reduced the problem to an MWE that reproduces my issue:

sed -e 's|\(href\)|\1|' index.html

Prints the entirety of index.html. I was originally thinking that it was an issue with my expression, but this very basic expression proves that wrong.

The same happens if I remove -e or if I add g at the end.

It's been a while since I've done sed, am I doing something wrong here? Is sed getting confused with the characters in an html file?

pa4080
  • 29,831

3 Answers

2

You should use grep to find text in a file; sed is better suited to text substitutions.

If you want to list the hypertext links, you can simply grep the file like this:

grep -Po '(?<=href=")[^"]*' index.html
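To illustrate what the two options do, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P is not available in every grep), and the sample file and its URLs are made up for the example. -P enables Perl-style regular expressions, which is what permits the lookbehind (?<=href="), and -o prints only the matched text instead of the whole line.

```shell
# Hypothetical sample file, created in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<p>No link here.</p>
<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>
EOF

# Without -o, grep prints each whole line that contains a match.
whole_lines=$(grep -P '(?<=href=")[^"]*' "$tmp/sample.html")

# With -o, only the matched text is printed, one match per output line.
# The lookbehind requires href=" before the match but does not include it.
links=$(grep -Po '(?<=href=")[^"]*' "$tmp/sample.html")

printf '%s\n' "$links"
```

Because -o reports every match separately, two links on the same line come out as two lines of output.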
cmak.fr
  • 8,696
  • This is the solution I'm using, however it may benefit from explaining what -o does - would make the answer more complete. – Brydon Gibson Mar 22 '19 at 15:49
2

What you've described sounds like the normal behaviour of sed when used with the substitute command: by default sed prints every line, whether or not the substitution matched. I suppose you are looking for something like this:

sed -nr 's/^.*href="(http.*)".*$/\1/p' index.html 

Where:

  • / is used as delimiter in this case (you can use | or #, etc.).

  • The option -n (--quiet, --silent) suppresses the automatic printing of the pattern space, so along with this option we need some additional command(s) to tell sed what to print.

  • This additional command is the print flag p, added to the end of the script. It prints only the lines where the substitution succeeded. If sed is not started with -n, the p flag prints each matching line a second time, in addition to the automatic printing.

  • The option -r enables extended regular expressions. Without this option our command can be:

    sed -n 's/^.*href="\(http.*\)".*$/\1/p' index.html
    
  • The command s means substitute: #<string-or-regexp>#<replacement>#.

  • ^ matches the beginning of the line and $ matches the end of the line.

  • Within the <string-or-regexp>, the capture group (http.*) is available in the <replacement> as \1.

Example of usage:

$ cat index.html 
<!DOCTYPE html>
<html><head><title>Page Title</title></head><body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <a href="https://www.w3schools.com">Visit W3Schools</a>
</body></html>

$ sed -nr 's/^.*href="(http.*)".*$/\1/p' index.html 
https://www.w3schools.com
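To see why the command from the question printed the whole file, the following sketch compares it with the -n ... p form. The sample file is hypothetical and only serves the comparison.

```shell
# Hypothetical two-line sample file in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<h1>Heading</h1>
<a href="https://example.com">Link</a>
EOF

# Replacing "href" with itself changes nothing, and sed prints every
# line by default, so the output is identical to the input.
unchanged=$(sed -e 's|\(href\)|\1|' "$tmp/sample.html")

# -n turns off the default printing; the p flag then prints only the
# lines on which the substitution actually matched.
only_link=$(sed -n 's/^.*href="\(http.*\)".*$/\1/p' "$tmp/sample.html")

printf '%s\n' "$only_link"
```

Note that .* is greedy, so on a line with several links this expression would return only part of the line, not each link separately.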


pa4080
  • 29,831
1

This may be overly cumbersome, but I think it would work for you, as long as your href contents contain no spaces.

grep "href" index.html |tr ' ' '\n'|grep "^href" |cut -f2 -d'='

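The first grep keeps only the lines that contain href. The tr converts spaces to newlines, so each attribute ends up on its own line. The second grep keeps just the lines that start with href, which is the section you were interested in. Finally, cut splits on = and takes the second field, which is the text after "href=" (up to the next =, if the value contains one).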
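The stages can be inspected one at a time, as in this rough sketch. The sample file, its URL and the variable names are invented for illustration, and the last line uses the quote delimiter suggested in the comments.

```shell
# Hypothetical sample file in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<p>No link here.</p>
<a href="https://example.com/page" class="x">Link</a>
EOF

# Stage 1: keep only lines containing href.
step1=$(grep "href" "$tmp/sample.html")
# Stage 2: put each space-separated token on its own line.
step2=$(printf '%s\n' "$step1" | tr ' ' '\n')
# Stage 3: keep the token that starts with href.
step3=$(printf '%s\n' "$step2" | grep "^href")
# Stage 4: take the second field when splitting on '='.
step4=$(printf '%s\n' "$step3" | cut -f2 -d'=')
# Variant: split on the double quote to drop the surrounding quotes.
unquoted=$(printf '%s\n' "$step3" | cut -f2 -d'"')

printf '%s\n' "$step4" "$unquoted"
```

Here step4 still carries the surrounding double quotes, while the quote-delimited variant returns the bare URL.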

S. Nixon
  • 402
  • I am actually looking for what's after the href, so I'm looking for href="[I want this content]" – Brydon Gibson Mar 21 '19 at 20:47
  • 2
    And that's what you get, except it's wrapped in quotes. If you use cut -f2 -d'"' instead, it should give the contents of the href line. Again, much more cumbersome than what others proposed, but I was not aware of the grep -o option. – S. Nixon Mar 22 '19 at 14:26