
Using a web browser (IE or Chrome) I can save a web page (.html) with Ctrl-S, inspect it with any text editor, and see data in a table format. I want to extract one of those numbers, but for many, many web pages, too many to do manually. So I'd like to use WGET to fetch those web pages one after another, and write another program to parse the .html and retrieve the number I want. But the .html file saved by WGET from the same URL as the browser does not contain the data table. Why not? It is as if the server detects that the request is coming from WGET and not from a web browser, and supplies a skeleton web page lacking the data table. How can I get the exact same web page with WGET? - Thx!

MORE INFO:

An example of the URL I'm trying to fetch is: http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US where the string ICENX is a mutual fund ticker symbol, which I will be changing to any of a number of different ticker symbols. This downloads a table of data when viewed in a browser, but the data table is missing if fetched with WGET.
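The sort of batch fetch I have in mind is roughly this (just a sketch; the ticker list and output file names are placeholders):

# fetch the same page for several ticker symbols (placeholder list)
for ticker in ICENX VFINX FMAGX; do
    wget -O "$ticker.html" \
        "http://performance.morningstar.com/fund/performance-return.action?t=$ticker&region=usa&culture=en-US"
done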

user239598

4 Answers


As roadmr noted, the table on this page is generated by javascript. wget doesn't support javascript; it just dumps the page as received from the server (i.e. before any javascript code runs), so the table is missing.

You need a headless browser that supports javascript, such as phantomjs:

$ phantomjs save_page.js http://example.com > page.html

with save_page.js:

var system = require('system');
var page = require('webpage').create();

// open the URL given on the command line, print the page's HTML
// once it has loaded (i.e. after javascript has run), then exit
page.open(system.args[1], function()
{
    console.log(page.content);
    phantom.exit();
});

Then, if you just want to extract some text, the easiest way might be to render the page with w3m:

$ w3m -dump page.html

and/or modify the phantomjs script to just dump what you're interested in.
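For example, putting the two steps together for the Morningstar page from the question might look something like this (a rough sketch; 'total return' is only a placeholder pattern, adjust it to the label of the row you actually want):

$ phantomjs save_page.js "http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US" > page.html
$ w3m -dump page.html | grep -i 'total return'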

lemonsqueeze
  • This also doesn't work, for example http://www.cotrino.com/lifespan/ – mrgloom Jan 12 '18 at 09:19
  • JS generated links won't work with that – QkiZ Jul 28 '18 at 19:21
  • 2018: PhantomJS project is suspended until further notice :( – 1rq3fea324wre Sep 13 '18 at 01:44
  • This solution is only for downloading pages from specified urls. How do you pipe wget's site crawling mechanism with it? Also, what would the script look like with headless chrome? – Phil Mar 06 '19 at 09:15
  • It's not working. The error is TypeError: Attempting to change the setter of an unconfigurable property. – Prvt_Yadav Jul 13 '20 at 09:06

You can download a full website using wget --mirror.

Example:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

The above is the command line you want to execute when you want to download a full website and make it available for local viewing.

Options:

  • --mirror turns on options suitable for mirroring.

  • -p downloads all files that are necessary to properly display a given HTML page.

  • --convert-links converts the links in the document for local viewing after the download.

  • -P ./LOCAL-DIR saves all the files and directories to the specified directory.

For more info about wget options, read this article: Overview About all wget Commands with Examples, or check wget's man page.

  • This won't work with javascript rendered content. For that you'll need to use phantomjs as answered by lemonsqueeze. – Mattias Feb 15 '18 at 12:40
  • This cmd will walk through all sub-urls too, which will download resources that are not needed to render the given webpage. – 1rq3fea324wre Sep 13 '18 at 01:49

Instead of --recursive, which will just go ahead and "spider" every single link in your URL, use --page-requisites. This should behave exactly like the options you describe in graphical browsers.

       This option causes Wget to download all the files that are
       necessary to properly display a given HTML page.  This includes
       such things as inlined images, sounds, and referenced stylesheets.

       Ordinarily, when downloading a single HTML page, any requisite
       documents that may be needed to display it properly are not
       downloaded.  Using -r together with -l can help, but since Wget
       does not ordinarily distinguish between external and inlined
       documents, one is generally left with "leaf documents" that are
       missing their requisites.

For more information, do man wget and look for the --page-requisites option (use "/" to search while reading a man page).
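A minimal sketch of such an invocation (the URL is a placeholder; --convert-links is optional but handy if you want to view the result locally):

$ wget --page-requisites --convert-links http://example.com/some/page.html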

roadmr

If the server's answer differs depending on the asking source, it is mostly because of the HTTP_USER_AGENT variable (just a text string) that is sent along with the request, informing the server about the requesting client's technology.


  1. You can check your browser's user agent here: http://whatsmyuseragent.com

  2. According to the wget manual, the --user-agent=AGENT parameter should do the job; see the sketch below.
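A minimal sketch, assuming you want wget to identify itself as a desktop browser (the user-agent string below is only an example; copy your own browser's string from the site in point 1):

$ wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
       "http://performance.morningstar.com/fund/performance-return.action?t=ICENX&region=usa&culture=en-US"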


If this does not help, JavaScript processing may be needed to get the same page a browser gets, or perhaps an appropriate request with GET parameters will make the server prepare an answer that doesn't require JavaScript to fill in the page.

Esamo