
I have an HTML source file from which I need to extract the links. The number of links varies from file to file, and the links are formatted as follows, enclosed in single quotes:

../xxx/yyy/ccc/bbbb/nameoffile.extension

I need to get the text between the single quotes, replace the leading .. with http://, and write the result to a file.

I'm a newbie looking for a way to automate this process in the terminal.

These are HTML source files and the links appear throughout each file. I need them written to a file, one link per line, so I can pass that file to my existing xargs/curl download step.

A sample file looks roughly like this:

<head>
<body>
<html>

blabla
</>
blibli afg fgfdg sdfg <b> blo blo href= '../xxx/yyy/ccc/bbbb/nameoffile1.extension' target blibli bloblo href= '../xxx/yyy/ccc/bbbb/nameoffile2.extension'  blibli

bloblo href= '../xxx/yyy/ccc/bbbb/nameoffile3.extension'

…

The result I'm looking for is a file containing this:

http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile1.extension
http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile2.extension
http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile3.extension

Could someone kindly help me find a solution?

Here is a source file as close to the real one as possible:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML>
    <HEAD>
    <TITLE>Inter num num - nil</TITLE>
    <link rel="stylesheet" type="text/css" href="style.css" />
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    </HEAD>
    <BODY><table width=1200 align=center class=tabForm><tr><td align=left width=largeur_2 valign=top><img src=Img/logo.gif><br /></td><td align=center valign=center width=largeur_6><h1><font color='#CB150A'>Test d'épreuve</font></h1></td><td align=right valign=top width=largeur_2 class=dataLabel>Reçu le 11/03/2018 à 17:49<br /></td></tr>
    <tr><td width=1200 colspan=3 align=center><b><font color='#CB150A' size=+1>Client : zzz - Référence : 232323  - Désignation : Fiche d'accueil </font></b></color></td></tr>

    </table><BR/><table width=1200 align=center class=tabForm><tr><td class=dataLabelBig width=1200>M numnum ,<BR/><BR/>
    Job citée ci-dessus.<BR/>
    ci-joints toutes les informations nécessaires.
    <BR/><BR/>
    Sandy Jan<BR/>
    test@test.com</font></td></tr></table><br /><table width=1200 align=center class=tabForm><tr><td colspan=2  width=1200 class=dataLabel>Documents nécessaires à votre réponse</td></tr><tr><td colspan=2 width=1200 class=dataLabel><u><b>Job :</b></u> Suivi Travaux - <u><b>Article :</b></u> 232323  - Fiche d'accueil</td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/path.html' target=_blank><img src=Img/pdf.png border=0> Fiche.html</a></td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf' target=_blank><img src=Img/pdf.png border=0> text.pdf</a></td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc' target=_blank><img src=Img/pdf.png border=0> Fiched'accueil.doc</a></td></tr></table><br /><table width=1200 align=center class=tabForm><tr><td colspan=2 class=dataLabelRed width=1200 >Notre commentaire</td></tr></tr><td colspan=2 class=dataLabel>mise a jour - Attention<br />
Impression <br /><br /></td></tr></table><br /><table width=1200 align=center class=tabForm><form method=post name=formvolume action=?&dossier=111734&coo=135&auth=b182f10b82ba&key=2e7c69213b28d7de6655&action=submit&type=volume enctype=multipart/form-data ><tr><td width=1200 align=left colspan=2 class=dataLabel><h3><img src=Img/h3Arrow.gif border=0>&nbsp;Remise de job  :</h3><br /></td></tr><tr><td align=left valign=top width=120 class=dataLabelRed>Votre commentaire</td><td width=1080 align=left class=dataLabel><textarea cols=200 rows=5 name=comment ></textarea></td></tr><tr><td align=left width=120 class=dataLabelRed>Votre fichier</td><td width=1080 align=left><input type=file name=fichier size=82></td></tr><tr><td align=center colspan=2 width=1200><br /><input type=button class=button value="&nbsp;&nbsp;Remettre votre réponse&nbsp;&nbsp;"  onClick="javascript: var ok=confirm('Etes vous certain de vouloir effectuer cette action ?');if(ok==true){ document.formvolume.submit();}else {return false}" ></form></td></tr><table></table></br><table width=1200 align=center class=tabForm><form method=post name=formvolume_complement action=?&dossier=111734&coo=135&auth=b182f10b82ba&key=2e7c69213b28d7de6655&action=submit_complement&type=volume enctype=multipart/form-data ><tr><td width=1200 align=left colspan=2 class=dataLabel><h3><img src=Img/h3Arrow.gif border=0>&nbsp;Demande de complément, votre réponse  :</h3><br /></td></tr><tr><tr><td align=left valign=top width=120 class=dataLabelRed>Votre commentaire</td><td width=1080 align=left class=dataLabel><textarea cols=200 rows=5 name=comment ></textarea></td></tr><td align=left width=120 class=dataLabelRed>Votre fichier</td><td width=1080 align=left><input type=file name=fichier size=82></td></tr><tr><td align=center colspan=2 width=1200><br /><input type=button class=button value="&nbsp;  Remettre votre réponse &nbsp;"  onClick="javascript: var ok =confirm('Etes v ?');if(ok==true){ document.formvolume_complement.submit();}else 
{return false}" ></form></td></tr><table></table></BODY></HTML></BODY>
</HTML>
grimdex
  • Please don't cross-post: https://unix.stackexchange.com/q/454820/72456 – αғsнιη Jul 12 '18 at 04:21
  • Do you have a link to a real web page address that can be used for testing? – WinEunuuchs2Unix Jul 12 '18 at 11:49
  • The sample is most helpful! We can see color codes are enclosed in single quotes too. So we need to expand the project scope to only print text between single quotes that have valid web page syntax. Also the file name can contain %20 which needs to be converted to a space. Plus a couple other things I couldn't see on my phone. – WinEunuuchs2Unix Jul 12 '18 at 17:42
  • @WinEunuuchs2Unix For exactly these reasons we shouldn't use sed and the like for parsing HTML, just as Amith KK suggests in his answer. – PerlDuck Jul 12 '18 at 18:09
  • You could also use xidel to do this in bash. – starbeamrainbowlabs Jul 12 '18 at 19:33
  • @PerlDuck I used sed to convert HTML to Text in my websync project: https://askubuntu.com/questions/900319/code-version-control-between-local-files-and-au-answers which extracts source code in Ask Ubuntu Answers and compares it to files on local disks. I've just posted an answer with the sed code and faster bash builtin equivalent code. – WinEunuuchs2Unix Jul 12 '18 at 23:44
  • @WinEunuuchs2Unix I did not say that it is impossible, but rather that it is discouraged, as HTML varies a lot and something written in sed would fare much worse than an actual HTML processor when the structure of the input HTML varies. – Amith KK Jul 13 '18 at 06:04
  • @AmithKK Yes I read that bash is discouraged for converting HTML to text. However at the time it was the tool at hand and it works for extracting what is posted in Ask Ubuntu to my local drive. If HTML changes on Stack Exchange I can quickly revise a bash script. I like the absolute control / flexibility over HTML conversion process. The majority would prefer a third-party app and let someone else do the conversion. – WinEunuuchs2Unix Jul 13 '18 at 11:03

4 Answers


Utilities like sed and awk are not made for parsing structured data such as HTML, so a much more viable solution is to use Python.

First, make sure Python 3 and BeautifulSoup are installed:

sudo apt-get install python3 python3-bs4 python3-lxml

Now create a new file (for instance test.py) and paste in this short script written for the purpose:

#!/usr/bin/env python3
import sys
from bs4 import BeautifulSoup

DOMAIN = 'z.z.com/'

if len(sys.argv) < 2 or not sys.argv[1].endswith('.html'):
    print("Argument not provided or not an .html file", file=sys.stderr)
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='latin-1') as f:
    webpage = f.read()

soup = BeautifulSoup(webpage, "lxml")
for a in soup.findAll('a', href=True):
    print(a['href'].replace("../","http://"+DOMAIN))

A Python 2 version, on request:

#!/usr/bin/env python2
import sys
from bs4 import BeautifulSoup

DOMAIN = 'z.z.com/'

if len(sys.argv) < 2 or not sys.argv[1].endswith('.html'):
    print >> sys.stderr, "Argument not provided or not an .html file"
    sys.exit(1)

with open(sys.argv[1], 'rb') as f:
    webpage = f.read().decode("latin-1")

soup = BeautifulSoup(webpage, "html.parser")
for a in soup.findAll('a', href=True):
    print(a['href'].replace("../","http://"+DOMAIN))

Modify the DOMAIN variable to match your actual domain, save the script in the current directory, and run it as follows:

./test.py yourfile.html > outputfile

For reference, this is the output produced by the script when run on the sample provided in the question:

http://z.z.com/path/path/path/path/path.html
http://z.z.com/path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf
http://z.z.com/path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc
David Foerster
Amith KK

Another solution that uses a proper HTML parser is the following Perl script (say get-links.pl):

#!/usr/bin/env perl

use strict;
use warnings;
use File::Spec;
use WWW::Mechanize;

my $filename = shift or die "Must supply a *.html file\n";
my $absolute_filename = File::Spec->rel2abs($filename);

my $mech = WWW::Mechanize->new();
$mech->get( "file://$absolute_filename" );
my @links = $mech->links();
foreach my $link ( @links ) {
    my $new_link = $link->url;

    if ( $new_link =~ s(^\.\./)(http://z.z.com/) ) {
        print "$new_link\n";
    }
}

You may need to install the WWW::Mechanize module first, because it is not a core module (i.e. it isn't installed by default together with Perl). To do so, run:

sudo apt install libwww-mechanize-perl

The script reads the given file after converting its name to an absolute path, because we want to build a proper URI like file:///path/to/source.html.

After extracting the links (my @links = $mech->links();) it examines each link's URL and if it starts with ../ then that part is replaced by http://z.z.com/ and printed.

Usage:

./get-links.pl source.html

Output:

http://z.z.com/path/path/path/path/path.html
http://z.z.com/path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf
http://z.z.com/path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc

As @AmithKK already said in his answer: parsing HTML (or XML) is best done with a proper parser, because tools like sed and their kind may fail when the source contains other elements that look like a link but aren't.

David Foerster
PerlDuck

To extract the data between single quotes from the file test.html, replace the two leading dots (..) in each URL with http://, and save the result to newfile.txt, do:

sed -ne 's/^.*'\''\([^'\'']*\)'\''.*$/\1/p' test.html | sed -e 's/^\.\./http:\/\//' > newfile.txt

Or try without sed:

grep -Eo "'[^'() ]+'" test.html | tr -d \'\" | perl -pe 's/^\.\./http:\/\//' > newfile.txt

And this variant works for the sample file the author added to the question:

grep -Eo "'[^|'() ]+'" test.html | grep -E "^'\.\." | tr -d \'\" | perl -pe 's/^\.\./http:\/\/mysite.mydomain.com/' > newfile.txt
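Because the sample shows that color codes (e.g. '#CB150A') are also enclosed in single quotes, it can be safer to anchor the match on the href= attribute instead of on bare quotes. A hedged variant, assuming the href= '…' spacing from the question's sample and the placeholder domain from the comments:

```shell
# Match only single-quoted href values, then strip the wrapper and
# rewrite the leading '..' (the domain is a placeholder).
grep -oE "href= ?'[^']+'" test.html \
  | sed -E "s|^href= ?'||; s|'\$||; s|^\.\.|http://mysite.mydomain.com|" > newfile.txt
```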
Bob
  • Doesn't work. I've tried a lot of sed variations so far without success. awk piped to multiple seds gave some positive results but is still far off. Three weeks pulling my hair out over this, so today I decided to ask the worldwide community. Hopefully a solution turns up soon. – grimdex Jul 12 '18 at 10:19
  • 1
    @Wiking cat is not necessary there. – fugitive Jul 12 '18 at 11:36
  • Code I've been trying so far:

    awk -F "'" '{ for (f=1; f<=(NF-1)/2; f++) print $(f*2) }' /file.html | LC_ALL=C sed '/^#/ d' | LC_ALL=C sed 's,.,,2' | LC_ALL=C sed 's,.,http://thesite.com,g' | xargs -n 1 curl -O

    – grimdex Jul 12 '18 at 12:26
  • Here is the result:

    http:/B150A http:/B150A

    – grimdex Jul 12 '18 at 12:57
  • Yes, the exact domain:

    http://mysite.mydomain.com

    – grimdex Jul 12 '18 at 12:59
  • file updated in the question above – grimdex Jul 12 '18 at 13:04

Convert HTML to Text

As mentioned in the comments, you need to convert HTML entities to text. For this there is a one-liner which should cover all the bases:

sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'
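Wrapped in shell quoting, the one-liner can be sanity-checked on a sample line like this (the input string is a made-up example):

```shell
# Pipe a line containing HTML entities through the substitutions;
# the &#39; case uses the '"'"' trick to embed an apostrophe.
echo "Fiche&nbsp;d&#39;accueil &amp; Cie" \
  | sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g'
# prints: Fiche d'accueil & Cie
```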

If you are converting hundreds of thousands of lines, bash builtin commands are many times faster:

#-------------------------------------------------------------------------------
LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
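To see how the function is called, here is a condensed, runnable sketch of it (entity list shortened, ldquo/rdquo omitted): it sets the global LineOut, which the caller then reads.

```shell
#!/usr/bin/env bash
# Condensed sketch of HTMLtoText using only bash parameter expansion.
HTMLtoText () {
    LineOut=$1
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut=${LineOut//&quot;/\"}
    LineOut=${LineOut//&#39;/\'}
}

HTMLtoText "Fish &amp; Chips&nbsp;&#39;menu&#39;"
echo "$LineOut"     # Fish & Chips 'menu'
```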

Check if the remote file exists

To test whether a URL actually exists, use a derivative of this function:

function validate_url(){
  if wget -S --spider "$1" 2>&1 | grep -q 'HTTP/1.1 200 OK'; then echo "true"; fi
}

Putting it all together

A final script still needs to be written, based on sample data derived from a valid web page with valid filenames.
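In the meantime, here is a minimal, self-contained sketch of how the extraction step might combine with the ideas above. Assumptions are flagged: the href= '../…' spacing comes from the question's sample, and http://mysite.mydomain.com is the placeholder domain from the comment thread.

```shell
#!/usr/bin/env bash
# Minimal sketch: pull single-quoted href values, strip the quotes,
# and rewrite the leading '..' with a placeholder domain.
DOMAIN="http://mysite.mydomain.com"

# Tiny stand-in for the real page, mimicking the question's sample:
cat > test.html <<'EOF'
blabla href= '../xxx/yyy/ccc/bbbb/nameoffile1.extension' target blibli
<font color='#CB150A'>not a link</font>
EOF

grep -oE "href= ?'[^']+'" test.html |
while IFS= read -r Line; do
    Line=${Line#href=}        # drop the attribute name
    Line=${Line# }            # drop the optional space after '='
    Line=${Line//\'/}         # strip the single quotes
    case $Line in
        ../*) echo "${DOMAIN}${Line#..}" ;;   # rewrite the leading '..'
    esac
done > newfile.txt

cat newfile.txt
```

The resulting newfile.txt has one URL per line, ready for the xargs/curl step the asker already has.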