In Google Chrome, when we open the Developer Tools, right-click an HTML element, and choose Copy → Copy element, we can copy the HTML content of a webpage. Below is an example of the procedure I've described:

Copying HTML content with Google Chrome

My problem is that when I use wget to download the webpage, I get the source code of the page, including its script URLs and inline scripts.

I'd like to use the command line to download the final HTML result of a page, just like Google Chrome does in my example. Getting the HTML content that is actually displayed on the page would help me automate the extraction of information from webpages.

Is it possible to download the HTML of a page (not the source code) using wget or other command line tools?

raylight
  • 5
    wget does not process JavaScript to generate the final rendering of a page. When I have needed to get a complete copy of a site in its current state — more or less — it has been necessary to use a browser extension that executes only after onreadystatechange completes. –  Mar 04 '21 at 06:31
  • 1
    It's possible with Python and Selenium, but I guess that is more a question for Stack Overflow ... However, there are many resources for that online and probably this has been answered before. You will find a lot about PhantomJS, but that project is discontinued. You should use Chromium or Firefox as the driver. – pLumo Mar 04 '21 at 07:52
  • 5
    Just to be clear on some concepts here: the difference you are seeing is not between "the HTML" and "the source code" but between the initial response sent to your browser, and the processed result after some client-side scripts have run; and there is no "final" HTML of the page, because the JavaScript could change the content completely every second (or every click), and no version would be any more "real" than the others. That doesn't make your question completely meaningless, but it's worth understanding that there's no guarantee you can get the "real" version of any page like this. – IMSoP Mar 04 '21 at 17:15
  • Duplicate: https://askubuntu.com/q/411540/396228 – sondra.kinsey Mar 04 '21 at 20:43
  • @IMSoP I suppose the OP's intent is to have exactly that dynamic behavior in a local copy. – Peter - Reinstate Monica Mar 04 '21 at 21:15
  • 2
    @sondra.kinsey Actually, the question is different. I said many times in my question that I didn't want to get the source code of the page as it is when on the "View Source Code" of the browser. The solutions on the link won't give me the HTML that I needed in this case. – raylight Mar 04 '21 at 21:22
  • 5
    @raylight Just to be really clear: there is no difference between "the HTML" and "the source code"; they mean exactly the same thing. What you are looking for is exactly the same as the other question, just worded differently: you want to get the HTML structure that you see in the browser after the browser has run some additional code which modifies that HTML. – IMSoP Mar 04 '21 at 21:56
  • @Raffa why does this site need a separate tag for HTML DOM? – muru Mar 05 '21 at 02:07
  • Cool, I wanna know how to get the CSS and JavaScript files too, but since this is Ubuntu, not sure how to do it. –  Mar 05 '21 at 08:20

1 Answer

35

Since you have Google Chrome installed, you can get the web page's rendered HTML structure (the DOM after scripts have run) by running this in the terminal:

google-chrome --headless --dump-dom 'URL' > ~/file.html

Replace URL with the URL of the web page you want. The HTML DOM of the page will be saved to a file named file.html in your home directory.
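For scripting, the command above can be wrapped in a small shell function. This is only a sketch: the function name dump_dom and the CHROME_BIN override variable are hypothetical names chosen for illustration, and it assumes the google-chrome binary is on your PATH (on some systems it may be called chromium or chromium-browser instead).

```shell
#!/bin/sh
# Hypothetical helper (not part of Chrome): save the rendered DOM of a URL.
# CHROME_BIN is an assumed override for systems where the browser binary
# has a different name; it defaults to google-chrome as in the answer.
dump_dom() {
    url=${1:-}
    out=${2:-"$HOME/file.html"}

    # Refuse to run without a URL instead of writing an empty file.
    if [ -z "$url" ]; then
        echo "usage: dump_dom URL [OUTPUT_FILE]" >&2
        return 2
    fi

    # Same flags as the one-liner: headless mode, print the DOM to stdout.
    "${CHROME_BIN:-google-chrome}" --headless --dump-dom "$url" > "$out"
}
```

After sourcing the file, it could be called as dump_dom 'URL' ~/page.html, which writes the DOM to ~/page.html; with no second argument it falls back to ~/file.html.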

Raffa
  • Will this also wait for document readyStateComplete? – Nzall Mar 05 '21 at 14:32
  • 3
    @Nzall Yes, Google Chrome supports the HTML DOM readyState property, and --dump-dom will do that. – Raffa Mar 05 '21 at 14:42
  • 4
    I had to add --virtual-time-budget=10000 --timeout=10000 --run-all-compositor-stages-before-draw --disable-gpu --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36" to get everything rendered properly. – neu242 Mar 22 '22 at 12:05
  • 1
    Indeed, providing a user agent is often necessary, but --user-agent="Mozilla/5.0 (X11; Linux x86_64)" did the trick. Thanks! – ckujau Jul 17 '22 at 17:40
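The two flags from the comments above that deal with rendering time and the user agent can be combined with --dump-dom in one helper. This is a sketch under assumptions: dump_dom_rendered and CHROME_BIN are hypothetical names, the default values are simply the ones quoted in the comments, and whether the extra flags are needed at all depends on the page and on your Chrome version.

```shell
#!/bin/sh
# Hypothetical helper: print the rendered DOM of a URL to stdout, giving
# the page's scripts a virtual time budget and sending a custom user agent.
# Arguments: URL [BUDGET_MS] [USER_AGENT]
dump_dom_rendered() {
    url=${1:-}
    budget_ms=${2:-10000}                        # value quoted in the comments
    ua=${3:-'Mozilla/5.0 (X11; Linux x86_64)'}   # short UA quoted in the comments

    if [ -z "$url" ]; then
        echo "usage: dump_dom_rendered URL [BUDGET_MS] [USER_AGENT]" >&2
        return 2
    fi

    # CHROME_BIN is an assumed override; it defaults to google-chrome.
    "${CHROME_BIN:-google-chrome}" --headless \
        --virtual-time-budget="$budget_ms" \
        --user-agent="$ua" \
        --dump-dom "$url"
}
```

A possible call is dump_dom_rendered 'URL' 5000 > ~/file.html, which redirects the dumped DOM into a file just like the one-liner in the answer.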