
Is there a way in Ubuntu to find all the directories in a website?

I have a website, and I want to check the internal links (directories) of that website.

Something like this:

(screenshot of the desired directory listing)

...

The problem with this website is that when I enter something like ubuntu.com/cloud, it doesn't show the subdirectories.

  • @pa4080 Because the two questions are not identical, the answers to the duplicate link do not really answer this question. – karel Apr 05 '18 at 11:34
  • @karel, I've retracted my close vote. – pa4080 Apr 05 '18 at 12:11
  • Thank you very much. Now all I have to worry about is leech close voters and eventually I will delete my two comments when they become obsolete. – karel Apr 05 '18 at 12:27
  • I can't retract mine from my phone if I'm one of the leeches... – WinEunuuchs2Unix Apr 05 '18 at 14:56
  • Huh. It used to be a convention to put a link to a sitemap in the footer of sites. I guess that's dying. askubuntu.com has a link to it in its robots.txt, but it 404s. ubuntu.com doesn't even have a robots.txt. – JoL Apr 05 '18 at 18:02
  • @pa4080 cast his close vote, Karel dissented, pa4080 retracted, and I came back in tonight to retract my own only to discover DavidFoerster has cast the very same close vote. Are we keeping this open or closing it? I'm confused now... FTR I do my own Stack Exchange web-scraping using bash: https://askubuntu.com/questions/900319/code-version-control-between-local-files-and-au-answers/900609#900609 As I only scrape my own AU answers and compare them to local disk, answer-scraping hits are fewer than 100. – WinEunuuchs2Unix Apr 06 '18 at 01:06
  • Voting by nature is subjective. It is why it takes five close votes to close a question in this land of computers where everything is theoretically black and white with no shades of grey. There will be no repercussions from meta nor any other superpowers should you decide to close this question. And if this website was such a place where intimidation was the rule of law I'd be the first to leave. – WinEunuuchs2Unix Apr 07 '18 at 03:43

2 Answers


Open the terminal and type:

sudo apt install lynx  
lynx -dump -listonly -nonumbers "https://www.ubuntu.com/" | uniq -u 

The following command improves on the previous one by redirecting the output to a text file named links.txt:

lynx -dump "https://www.ubuntu.com/" | awk '/http/{print $2}' | uniq -u > links.txt
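Since the question is about directories rather than individual links, a follow-up filter can reduce that link list to unique first-level paths. This is only a sketch, assuming the links.txt file produced above and the www.ubuntu.com host:

# Keep only links on the same host, then trim each one down to its first path segment
grep '^https://www.ubuntu.com/' links.txt \
  | sed -E 's|^https://www\.ubuntu\.com/([^/?#]+).*|/\1|' \
  | sort -u

For example, a link like https://www.ubuntu.com/cloud/foo would be reduced to /cloud.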
karel
  • I trust this won't get you banned like the comment by *pa4080* below the question when referencing the link: https://askubuntu.com/questions/991447/how-do-i-create-a-cli-web-spider-that-uses-keywords-and-filters-content – WinEunuuchs2Unix Apr 05 '18 at 11:03
  • 1
    I'm not spidering a multiple page website, only returning the links on one single webpage. Assuming the extremely improbable case of a poorly designed single webpage that contains 100,000 links (how would it be possible to load such a page to begin with) I suppose that lynx would try to return all the links until it finished executing the command or until the terminal froze. – karel Apr 05 '18 at 13:04
  • 1
    Actually I've run very complicated backup scripts from the terminal involving hundreds of gigabytes of files transfered, and nothing froze until the backup command completed successfully. So I think that lynx would perform successfully even in this extreme case. – karel Apr 05 '18 at 13:09

See this answer from superuser.com:

wget --spider -r --no-parent http://some.served.dir.ca/
ls -l some.served.dir.ca
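Depending on the wget version, --spider may not leave anything on disk, so the ls step can come up empty. A common variant, sketched here with the same placeholder domain and an assumed urls.txt output file, is to capture the crawl log and pull the visited URLs out of it:

# Crawl two levels deep without downloading, then extract every URL wget reports visiting
wget --spider -r -l2 --no-parent http://some.served.dir.ca/ 2>&1 \
  | grep -oE 'https?://[^ ]+' \
  | sort -u > urls.txt

The -l2 depth limit is the same one suggested in the comment below.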

There are also free websites that will do this for you and convert the output to XML format. I suggest you look into one of those as well to see which method is more suitable for your needs.

Edit: The OP has included a new screenshot.

pa4080
DWD
  • 1
    You can add also depth of the recursion - for example to the second level: wget --spider -r -l2. – pa4080 Apr 05 '18 at 11:02