
I would like a script that downloads a web page with curl and pipes it to w3m, which strips out everything except the text and the links.

Is it possible to specify more than one content type for w3m's -T option, and if so, how?

To clarify my question a bit more, here's an example:

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html

which returns only the text of Ask Ubuntu's questions page, but without any links. If w3m cannot do this, is there any other tool capable of scraping text and links simultaneously?

S.R.

3 Answers


You can use lynx -dump. It will include a number like [16] before each link, and then a list of URLs at the end of the document.

For pipe usage, you can use lynx -dump -force_html -stdin. However, that will not handle relative links correctly because it doesn't know the original URL.
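
For example, assuming the URL and user-agent string from the question (a sketch; relative links in the result will stay unresolved, as noted above):

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | lynx -dump -force_html -stdin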

So the best way is to run lynx -dump http://.../ directly, without a separate curl.
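
A minimal sketch using the URL from the question; -listonly is an extra lynx option not mentioned above, which restricts the dump to just the link list:

lynx -dump https://askubuntu.com/questions
# or, to print only the numbered link list:
lynx -dump -listonly https://askubuntu.com/questions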

jpa

Well, after extensive research on my own, I guess there is no such tool...

However, for what it's worth, I did discover hxnormalize, which made writing a particular script I needed a relatively simple matter.
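
For illustration, a rough sketch of the kind of pipeline hxnormalize enables; hxwls (a link lister from the same html-xml-utils package) is an addition of mine and not part of the script mentioned above:

curl --user-agent "Mozilla/4.0" -s https://askubuntu.com/questions -o page.html
w3m -dump -T text/html < page.html    # plain text of the page
hxnormalize -x page.html | hxwls      # list of link URLs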

S.R.

I think -o display_link_number=1 does what you want, as in:

$ w3m -dump -o display_link_number=1 http://example.org
Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

[1]More information...

References:

[1] https://www.iana.org/domains/example
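
The same option should also work with the curl pipeline from the question (a sketch; as with lynx's -stdin mode, relative links are not resolved because w3m cannot know the base URL when reading from a pipe):

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html -o display_link_number=1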

graywolf