99

I want to download a whole website (with sub-sites). Is there any tool for that?

UAdapter
  • 17,587
  • 1
    what exactly are you trying to achieve? the title and the content of your question are not related, and the content is not specific. – RolandiXor Jan 07 '11 at 14:26
  • 1
    N.B., only following links (e.g., using --convert-links in wget) will not reveal sites that are only revealed by submitting a form, among other things. – Steven Jan 07 '11 at 17:40

8 Answers

172

Try example 10 from here:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
  • --mirror : turn on options suitable for mirroring.

  • -p : download all files that are necessary to properly display a given HTML page.

  • --convert-links : after the download, convert the links in the document for local viewing.

  • -P ./LOCAL-DIR : save all the files and directories to the specified directory.
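
With a concrete (placeholder) URL and target directory filled in, the full invocation would look something like this; example.com and ./example-mirror are just stand-ins:

wget --mirror -p --convert-links -P ./example-mirror https://example.com/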
dadexix86
  • 6,616
shellholic
  • 5,682
  • Is there any way to download only certain pages (for instance, several parts of an article that is spread over several html documents)? – don.joey Feb 18 '13 at 09:16
  • @Private Yes, although it's probably easier to use python or something to get the pages (depending on the layout/url). If the url of the pages differs by a constantly growing number or you have a list of the pages, you could probably use wget in a bash script. – Vreality Apr 10 '13 at 22:00
  • 2
    You might consider using the --wait=seconds argument if you want to be more friendly to the site; it will wait the specified number of seconds between retrievals. – belacqua Feb 07 '14 at 21:27
  • The above works, but for Joomla the parameterized URLs create files that are not linked locally. The one that worked for me is wget -m -k -K -E http://your.domain.com, from here: https://vaasa.hacklab.fi/2013/11/28/howto-make-a-static-copy-of-joomla-site-with-wget/ – M.Hefny Apr 19 '17 at 18:36
  • 2
    Also --no-parent to "never ascend to the parent directory" taken from here. – Daniel Dec 09 '17 at 01:45
  • 2
    Gotta love stackoverflow's oneliner accepted answers <3 – maaw Mar 20 '18 at 12:34
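
For the case raised in the comments above, where page URLs differ only by an increasing number, a small shell loop around wget is one rough sketch; the URL pattern, page range, and delay are all made-up placeholders:

# fetch numbered pages (hypothetical URL pattern), pausing between requests
for i in $(seq 1 5); do
    wget "https://example.com/article/page${i}.html"
    sleep 2   # be friendly to the server
done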
43

HTTrack for Linux: copying websites in offline mode

httrack is the tool you are looking for.

HTTrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure.
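
A basic command-line invocation might look like the following sketch; the URL, output directory, and filter are placeholders (-O sets the output path, and the "+*.example.com/*" filter keeps the crawl on the same site):

httrack "https://example.com/" -O ./example-mirror "+*.example.com/*" -v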

Sid
  • 10,533
11

With wget you can download an entire website; you should use the -r switch for a recursive download. For example,

wget -r http://www.google.com
muru
  • 197,895
7

WebHTTrack Website Copier is a handy tool for downloading a whole website onto your hard disk for offline browsing. Launch the Ubuntu Software Center and type "webhttrack website copier" (without the quotes) into the search box. Select and install it from the Software Center onto your system. Start WebHTTrack from either the launcher or the start menu; from there you can begin enjoying this great tool for your site downloads.
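
If you prefer the terminal, the same tool can presumably also be installed with apt; the package name webhttrack is an assumption here:

sudo apt-get install webhttrack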

frizeR
  • 169
3

I don't know about sub-domains, i.e., sub-sites, but wget can be used to grab a complete site. Take a look at this Super User question. It says that you can use -D domain1.com,domain2.com to download different domains in a single script. I think you can use that option to download sub-domains, i.e. -D site1.somesite.com,site2.somesite.com
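
As a rough sketch (the subdomain names are placeholders), that could be combined with recursive retrieval along these lines; note that wget also needs -H (--span-hosts) before it will follow links to hosts other than the one you start from:

wget -r -H -D site1.somesite.com,site2.somesite.com http://site1.somesite.com/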

binW
  • 13,034
2

I use Burp; the spider tool is much more intelligent than wget and can be configured to avoid sections if necessary. Burp Suite itself is a powerful set of tools to aid in testing, but the spider tool is very effective.
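
Burp Suite is a Java application (as a comment below also notes), so on Linux it is typically launched from its jar; the exact jar filename varies by edition and version, so this one is only a placeholder:

java -jar burpsuite_community.jar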

Rory Alsop
  • 2,789
  • 1
    Isn't Burp Windows only? The closed-source licence agreement for Burp is also quite heavy. Not to mention the price tag of $299.00. – Kat Amsterdam May 08 '12 at 18:00
  • from the licence: WARNING: BURP SUITE FREE EDITION IS DESIGNED TO TEST FOR SECURITY FLAWS AND CAN DO DAMAGE TO TARGET SYSTEMS DUE TO THE NATURE OF ITS FUNCTIONALITY. TESTING FOR SECURITY FLAWS INHERENTLY INVOLVES INTERACTING WITH TARGETS IN NON-STANDARD WAYS WHICH CAN CAUSE PROBLEMS IN SOME VULNERABLE TARGETS. YOU MUST TAKE DUE CARE WHEN USING THE SOFTWARE, YOU MUST READ ALL DOCUMENTATION BEFORE USE, YOU SHOULD BACK UP TARGET SYSTEMS BEFORE USE AND YOU SHOULD NOT USE THE SOFTWARE ON PRODUCTION SYSTEMS OR OTHER SYSTEMS FOR WHICH THE RISK OF DAMAGE IS NOT ACCEPTED BY YOU. – Kat Amsterdam May 08 '12 at 18:00
  • For what it does, the price tag is amazingly cheap - I would recommend buying it for a wide range of security testing. And it is very easy to configure it to test exactly as you want - safer than AppScan in some instances:-) – Rory Alsop May 08 '12 at 18:22
  • 1
    @KatAmsterdam Regarding specifically the compatibility question: According to Wikipedia, Burp Suite is a Java application, so it should run fine on Ubuntu. – Eliah Kagan Apr 10 '13 at 22:21
  • Kat - it runs just fine on various flavours of Linux. The warning on the licence is the same as any tool you can use for security assessments. – Rory Alsop Aug 24 '16 at 15:42
2

You can download an entire website with the following command:

wget -r -l 0 website

Example:

wget -r -l 0 http://google.com
  • 1
    Can you please explain how this command works? What it does? – Kaz Wolfe Jun 18 '16 at 17:21
  • @KazWolfe -r turns on recursive retrieving with a default depth of 5 levels. -l specifies the maximum recursion depth, which in this example is 0, meaning unlimited. – Gilfoyle Oct 30 '19 at 18:21
1

If speed is a concern (and the server's wellbeing is not), you can try puf, which works like wget but can download several pages in parallel. It is, however, not a finished product, not maintained and horribly undocumented. Still, to download a web site with lots and lots of smallish files, this might be a good option.

loevborg
  • 7,282