
I am using this command:

wget -nd -e robots=off --wait 0.25 -r -A.pdf http://yourWebsite.net/

but I can't get PDFs from the website.

For example, I have a root domain name:

www.example.com

and this site has PDFs, DOCs, HTML files, etc. I want to download all the PDFs by giving only the root domain name, not the exact address of the download page.

PEDY

2 Answers


The following command should work:

wget -r -A "*.pdf" "http://yourWebsite.net/"

See man wget for more info.
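For reference, a slightly expanded variant of the same idea (the extra options are optional, and the URL is the placeholder from the question, not a real site):

wget -r -np -nd -A "*.pdf" -e robots=off --wait=1 http://yourWebsite.net/

Here -np stops wget from climbing into the parent directory, -nd puts all files in the current directory instead of recreating the site's directory tree, -e robots=off ignores robots.txt, and --wait=1 pauses one second between requests to be gentler on the server.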

Radu Rădeanu
  • @Rădeanu It doesn't work: it gets the HTML page (index.html) and then the process stops. – PEDY May 18 '14 at 13:22
  • 1
    @PEDY the PDFs files must be linked by the index.html file, directly or indirectly, for wget to be able to find them. If they are just on the server, served by some script or dynamic php thing, wget will not be able to find them. The same problem happen if you want your PDF files searched by Google or similar thing; we used to have hidden pages with all the files statically linked to allow this... – Rmano May 18 '14 at 15:49
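One rough way to check what wget can actually discover is a spider run, which crawls links without keeping the files; the URL is again the placeholder from the question, and the grep is just a quick filter on wget's log output:

wget --spider -r -l 2 http://yourWebsite.net/ 2>&1 | grep -o 'http[^ ]*\.pdf'

If no .pdf URLs show up here, the files are not reachable through static links, and recursive wget will not fetch them.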

In case the above doesn't work, try this (replace the URL):

lynx -listonly -dump http://www.philipkdickfans.com/resources/journals/pkd-otaku/ | grep pdf | awk '/^[ ]*[1-9][0-9]*\./{sub("^ [^.]*.[ ]*","",$0); print;}' | xargs -L1 -I {} wget {} 

You might need to install lynx:

sudo apt install lynx
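If your lynx build supports the -nonumbers option, the pipeline above can be shortened a bit, since the link list then comes out as bare URLs and the awk step is no longer needed (same placeholder URL as above):

lynx -listonly -nonumbers -dump http://www.philipkdickfans.com/resources/journals/pkd-otaku/ | grep -i '\.pdf$' | xargs -n 1 wget

Note that this only grabs PDFs linked from that single page; unlike wget -r it does not follow links into subpages.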