
When using wget in a script to download some files from Google Docs, the name of the file is not preserved. For example:

wget 'http://spreadsheets.google.com/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xls' 

saves the file as pub?key=pyj6tScZqmEfbZyl0qjbiRQ instead of indicatorhivestimatedprevalence15-49.xls, which is what I get if I click on the link in a browser. Is there any way to enforce this "browser-like" behaviour in wget?

3 Answers

wget --content-disposition 'http://spreadsheets.google.com/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xls'

will do the trick for you.

It's still not fully implemented and seems to bug out a bit sometimes, so it's not the default option in wget. Use it at your own risk.
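
For anyone curious what the option does: it tells wget to take the filename from the server's Content-Disposition response header rather than from the URL. Below is a rough sketch of pulling that name out by hand. filename_from_headers is a made-up helper name, and the sketch only handles the simple filename= form, not the encoded filename*= variant.

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the filename suggested by the last Content-Disposition header found.
filename_from_headers() {
  grep -i '^content-disposition:' |
    sed -n 's/.*filename="\{0,1\}\([^";]*\).*/\1/p' |
    tail -n 1 |
    tr -d '\r'   # HTTP header lines end in CRLF; drop the CR
}
```

It could be fed with something like curl -sIL "$url" | filename_from_headers, assuming the server also sends the header in response to a HEAD request, which not every server does.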

Bruno Pereira
  • I know...! Nice eh? ;) – Bruno Pereira Nov 10 '11 at 00:27
  • I'm not really much of a web programmer, so I would have never thought of looking for the phrase "content disposition". You saved me having to manually look at the HTTP headers, discover the content-disposition header and deal with it. – Chinmay Kanchi Nov 10 '11 at 01:03
  • WOW + amazing. THX u roc good idea. – Kangarooo Nov 10 '11 at 03:50
  • @BrunoPereira, I am also trying to download google spreadsheet file. But I could not find the link for the file. Could you please say how to get the link for a google spreadsheet file so that I can use wget in the same way as Chinmay Kanchi. Thanks in advance. – user22180 Oct 15 '14 at 09:29
  • @ChinmayKanchi I've called myself a web programmer for the last 15 years, but when it comes to this I always try to use a more meaningful name in code. – tishma Oct 20 '16 at 06:00
  • Btw this also works well when downloading lots of files via a list of URLs using the -i parameter. – Michael Feb 01 '18 at 10:27
  • Since it's annoying to add this verbose option every time, you can make it the default by configuring it in your .wgetrc file! Just edit or create ~/.wgetrc and add a line that contains content-disposition=on.

    Here are some other useful things to put in your ~/.wgetrc: compression=auto, continue=on, header = User-Agent: ...., server_response = on, etc.

    – Chris Oct 09 '20 at 22:36
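
The .wgetrc change from the last comment could be scripted roughly as follows. This is only a sketch: enable_content_disposition is an invented function name, and it uses the underscore spelling of the setting, which wget treats the same as the hyphenated one.

```shell
# Hypothetical helper: turn on content_disposition in a wgetrc file.
# Takes the file path as an optional argument (default: ~/.wgetrc).
# If the file already sets the option (even to "off"), it is left alone.
enable_content_disposition() {
  rc="${1:-$HOME/.wgetrc}"
  touch "$rc"
  if ! grep -Eiq '^[[:space:]]*content[-_]?disposition[[:space:]]*=' "$rc"; then
    printf 'content_disposition = on\n' >> "$rc"
  fi
}
```

Calling it twice adds the line only once, so it should be safe to run from a setup script.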

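You can also use curl to download the file and keep its original filename:

curl -OJL "${your_url}"
  • -O (--remote-name): save to a file named after the remote file instead of writing to stdout
  • -J (--remote-header-name): with -O, use the filename from the Content-Disposition header
  • -L (--location): follow redirects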

See the curl command-line options.
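
If there are several URLs to fetch, the same three flags can be applied in a loop. The sketch below assumes a plain text file with one URL per line. download_all is an illustrative name, not part of curl.

```shell
# Hypothetical helper: download every URL listed in a file (one per line),
# letting the server pick each filename. Blank lines and lines starting
# with '#' are skipped.
download_all() {
  list="$1"
  while IFS= read -r url || [ -n "$url" ]; do
    case "$url" in
      ''|'#'*) continue ;;
    esac
    # Quoting "$url" keeps '?' and '&' from being touched by the shell.
    curl --remote-name --remote-header-name --location "$url"
  done < "$list"
}
```

Usage would be download_all urls.txt, where urls.txt is whatever list file you have.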

Noam Manos

The Google Docs link is really telling a script on the server to run, parsing that into the file you want. The file, to the best of my knowledge, never exists on the server in .xls form, but is generated at runtime when you ask for it. Thus, there isn't anything for wget to get.

In order to download the file, you would need to use the Google API: http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#DownloadingDocs/.

Ethan
  • Yes, the server is asking a script to run, which creates the .xls file on the fly. However, a full-blown browser has no problem with this. So it's obviously possible to do without the Docs API. – Chinmay Kanchi Nov 10 '11 at 00:16
  • My thought had been that the script run in the browser would use the API, so to do it without the browser, one would have to recreate the script. Interesting that wget has a flag for it. – Ethan Nov 11 '11 at 01:39