
I would like a script that downloads a web page with curl and pipes it to w3m, which strips out everything except the text and the links.

Is it possible to specify more than one content type for w3m's -T option, and if so, how?

To clarify my question a bit more, here's an example:

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html

which returns only the text of Ask Ubuntu's questions page, but without any links. If w3m cannot do this, is there any other tool capable of scraping text and links simultaneously?

S.R.

3 Answers


You can use lynx -dump. It will include a number like [16] before each link, and then a list of URLs at the end of the document.

For pipe usage, you can use lynx -dump -force_html -stdin. However, that will not handle relative links correctly because it doesn't know the original URL.
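
For example, assuming the URL and user-agent string from the question (a sketch; relative links in the result will stay unresolved, as noted above):

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | lynx -dump -force_html -stdin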

So the best way is to run lynx -dump http://.../ directly, without a separate curl.
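
A minimal sketch using the URL from the question; -listonly is an extra lynx option not mentioned above, which restricts the dump to just the link list:

lynx -dump https://askubuntu.com/questions
# or, to print only the numbered link list:
lynx -dump -listonly https://askubuntu.com/questions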

jpa

Well, after extensive research on my own, I guess there is no such tool...

However, for what it's worth, I did discover hxnormalize, which made writing a particular script I needed a relatively simple matter.
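
For illustration, a rough sketch of the kind of pipeline hxnormalize enables; hxwls (a link lister from the same html-xml-utils package) is an addition of mine and not part of the script mentioned above:

curl --user-agent "Mozilla/4.0" -s https://askubuntu.com/questions -o page.html
w3m -dump -T text/html < page.html    # plain text of the page
hxnormalize -x page.html | hxwls      # list of link URLs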

S.R.

I think -o display_link_number=1 does what you want, as in:

$ w3m -dump -o display_link_number=1 http://example.org
Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

[1]More information...

References:

[1] https://www.iana.org/domains/example
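
The same option should also work with the curl pipeline from the question (a sketch; as with lynx's -stdin mode, relative links are not resolved because w3m cannot know the base URL when reading from a pipe):

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html -o display_link_number=1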

graywolf