
Using a web browser (IE or Chrome) I can save a web page (.html) with Ctrl-S, inspect it with any text editor, and see data in a table format. I want to extract one of those numbers, but for many, many web pages, too many to do manually. So I'd like to use WGET to fetch those web pages one after another, and write another program to parse the .html and retrieve the number I want. But the .html file saved by WGET from the same URL as the browser does not contain the data table. Why not? It is as if the server detects that the request is coming from WGET and not from a web browser, and supplies a skeleton web page lacking the data table. How can I get the exact same web page with WGET? - Thx!

MORE INFO:

An example of the URL I'm trying to fetch is: http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US where the string ICENX is a mutual fund ticker symbol, which I will be changing to any of a number of different ticker symbols. This downloads a table of data when viewed in a browser, but the data table is missing if fetched with WGET.
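The sort of batch fetch I have in mind is roughly this (just a sketch; the ticker list and output file names are placeholders):

# fetch the same page for several ticker symbols (placeholder list)
for ticker in ICENX VFINX FMAGX; do
    wget -O "$ticker.html" \
        "http://performance.morningstar.com/fund/performance-return.action?t=$ticker&region=usa&culture=en-US"
done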

user239598

4 Answers


As roadmr noted, the table on this page is generated by javascript. wget doesn't support javascript; it just dumps the page as received from the server (i.e. before any javascript code runs), so the table is missing.

You need a headless browser that supports javascript, such as phantomjs:

$ phantomjs save_page.js http://example.com > page.html

with save_page.js:

var system = require('system');
var page = require('webpage').create();

// open the URL given on the command line, print the page's HTML
// once it has loaded (i.e. after javascript has run), then exit
page.open(system.args[1], function()
{
    console.log(page.content);
    phantom.exit();
});

Then, if you just want to extract some text, the easiest way might be to render the page with w3m:

$ w3m -dump page.html

and/or modify the phantomjs script to just dump what you're interested in.
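For example, putting the two steps together for the Morningstar page from the question might look something like this (a rough sketch; 'total return' is only a placeholder pattern, adjust it to the label of the row you actually want):

$ phantomjs save_page.js "http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US" > page.html
$ w3m -dump page.html | grep -i 'total return'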

lemonsqueeze
  • This also doesn't work, for example http://www.cotrino.com/lifespan/ – mrgloom Jan 12 '18 at 09:19
  • JS generated links won't work with that – QkiZ Jul 28 '18 at 19:21
  • 2018: PhantomJS project is suspended until further notice :( – 1rq3fea324wre Sep 13 '18 at 01:44
  • This solution is only for downloading pages from specified urls. How do you pipe wget's site crawling mechanism with it? Also, what would the script look like with headless chrome? – Phil Mar 06 '19 at 09:15
  • It's not working. The error is TypeError: Attempting to change the setter of an unconfigurable property. – Prvt_Yadav Jul 13 '20 at 09:06

You can download a full website using wget --mirror.

Example:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

The above is the command line you want to execute when you want to download a full website and make it available for local viewing.

Options:

  • --mirror turns on options suitable for mirroring.

  • -p downloads all files that are necessary to properly display a given HTML page.

  • --convert-links converts the links in the document for local viewing after the download.

  • -P ./LOCAL-DIR saves all the files and directories to the specified directory.

For more info about wget options, read this article: Overview About all wget Commands with Examples, or check wget's man page.

  • This won't work with javascript rendered content. For that you'll need to use phantomjs as answered by lemonsqueeze. – Mattias Feb 15 '18 at 12:40
  • This cmd will walk through all sub-urls too, which will download resources that are not needed to render the given webpage. – 1rq3fea324wre Sep 13 '18 at 01:49

Instead of --recursive, which will just go ahead and "spider" every single link in your URL, use --page-requisites. This should behave exactly like the options you describe in graphical browsers.

       This option causes Wget to download all the files that are
       necessary to properly display a given HTML page.  This includes
       such things as inlined images, sounds, and referenced stylesheets.

       Ordinarily, when downloading a single HTML page, any requisite
       documents that may be needed to display it properly are not
       downloaded.  Using -r together with -l can help, but since Wget
       does not ordinarily distinguish between external and inlined
       documents, one is generally left with "leaf documents" that are
       missing their requisites.

For more information, do man wget and look for the --page-requisites option (use "/" to search while reading a man page).
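A minimal sketch of such an invocation (the URL is a placeholder; --convert-links is optional but handy if you want to view the result locally):

$ wget --page-requisites --convert-links http://example.com/some/page.html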

roadmr

If the server's answer differs depending on the asking source, it is mostly because of the HTTP_USER_AGENT variable (just a text string) that is sent along with the request, informing the server about the requesting client's technology.


  1. You can check your browser's user agent here: http://whatsmyuseragent.com

  2. According to the wget manual, the --user-agent=AGENT parameter should do the job; see the sketch below.
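A minimal sketch, assuming you want wget to identify itself as a desktop browser (the user-agent string below is only an example; copy your own browser's string from the site in point 1):

$ wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
       "http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US"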


If this does not help, JavaScript processing may be needed to get the same page a browser gets, or perhaps an appropriate request with GET parameters will make the server prepare an answer that doesn't require JavaScript to fill in the page.

Esamo