
Is there a way to update a website that I copied for offline viewing in the terminal? I downloaded everquest.allakhazam.com and was curious because it is updated regularly. I don't want to go through the whole download process multiple times, because it takes a while.

Also, I am very inexperienced with Linux of any sort, and am not super experienced in the terminal. So please be gentle. XD

Thanks in advance!

3 Answers


wget -N http://www.yoururl.com/ (where www.yoururl.com is the URL you want to revisit) should do the trick nicely. The -N switch asks the server for the last-modified date. If the local file is newer, the remote file will not be re-fetched. If the remote file is more recent, wget will fetch it as normal. Note that you'll want to run wget in the same directory in which you originally ran it.

A note on limitations quoted from man wget:

    If a file is downloaded more than once in the same directory,
    Wget's behavior depends on a few options, including -nc.  In
    certain cases, the local file will be clobbered, or overwritten,
    upon repeated download.  In other cases it will be preserved.

    When running Wget without -N, -nc, -r, or -p, downloading the same
    file in the same directory will result in the original copy of file
    being preserved and the second copy being named file.1.  If that
    file is downloaded yet again, the third copy will be named file.2,
    and so on.  (This is also the behavior with -nd, even if -r or -p
    are in effect.)  When -nc is specified, this behavior is
    suppressed, and Wget will refuse to download newer copies of file.
    Therefore, "no-clobber" is actually a misnomer in this
    mode---it's not clobbering that's prevented (as the numeric
    suffixes were already preventing clobbering), but rather the
    multiple version saving that's prevented.

Depending on your situation, you may also need the -r (recursive) and -l (level depth) switches. For more information on the many switches and options available, see man wget.
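As a sketch, a combined update run for the site in question might look like the following. The depth of 5 and the -p switch are assumptions; match whatever switches you used for the original download:

```shell
# Run this in the same directory as the original download.
# -N   : only re-fetch files the server reports as newer
# -r   : recurse into linked pages
# -l 5 : limit recursion depth (an assumed value; match your original run)
# -p   : also fetch the images/CSS needed to display each page
URL="http://everquest.allakhazam.com/"
wget -N -r -l 5 -p "$URL"
```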

If wget doesn't work for you:

An alternative to wget is httrack, which is also capable of mirroring a website as well as updating the mirror later.

httrack is available by first enabling the Universe repository and then installing it either via the Software Center or from the command line with sudo apt-get update && sudo apt-get install httrack
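A rough sketch of how that could look (the project directory name is my own choice; httrack documents --update as a shortcut for refreshing an existing mirror without confirmation):

```shell
# Initial mirror into a project directory of your choosing:
httrack "http://everquest.allakhazam.com/" -O ~/mirrors/everquest

# Later, refresh the existing mirror from inside that directory:
cd ~/mirrors/everquest && httrack --update
```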

Sources (wget):

https://superuser.com/questions/283481/how-do-i-properly-set-wget-to-download-only-new-files

man wget

http://www.editcorp.com/Personal/Lars_Appel/wget/wget_5.html

Source (httrack):

http://www.linuxcertif.com/man/1/httrack/

Elder Geek

From here I got to use wget -N site.com. However, it sounds like you first need to download the website (you can use wget -S site.com to see the server's response headers, including the last-modified date). Then -N checks the last modification date and, if the remote version is more recent than the 'old' version, updates the file.

Peyto

wget does support this via the --timestamping option (aka -N). It sets the modification time of the downloaded file(s) to the value of the Last-Modified HTTP header.

When you try to download the file(s) again, it sends an If-Modified-Since header, to which the server may respond with 304 Not Modified.

If you try this with http://www.jasny.net, you'll see:

$ wget --timestamping http://www.jasny.net
--2017-04-06 22:56:37--  http://www.jasny.net/
Resolving www.jasny.net (www.jasny.net)... 151.101.36.133
Connecting to www.jasny.net (www.jasny.net)|151.101.36.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18433 (18K) [text/html]
Saving to: ‘index.html’

index.html

2017-04-06 22:56:37 (1,15 MB/s) - ‘index.html’ saved [18433/18433]

Then, the second time:

$ wget --timestamping http://www.jasny.net
--2017-04-06 22:56:38--  http://www.jasny.net/
Resolving www.jasny.net (www.jasny.net)... 151.101.36.133
Connecting to www.jasny.net (www.jasny.net)|151.101.36.133|:80... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘index.html’ not modified on server. Omitting download.

Unfortunately, everquest.allakhazam.com doesn't send a Last-Modified header, so using --timestamping won't work. The server also doesn't honor the If-Modified-Since header.
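You can check for yourself whether a server sends Last-Modified by requesting only the headers, e.g. with wget's --spider mode (a sketch; note that wget prints server responses to stderr, hence the 2>&1 redirect):

```shell
# Print the response headers without downloading the body.
# If no "Last-Modified:" line appears, --timestamping cannot help.
wget --server-response --spider http://everquest.allakhazam.com/ 2>&1 | grep -i 'last-modified'
```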

Without the server supporting this, there is no option other than to download the whole website again each time.