I have an HTML source file from which I need to extract the links. The number of links varies from file to file. The links are enclosed in single quotes and are formatted like this:
../xxx/yyy/ccc/bbbb/nameoffile.extension
I need to get the text between the single quotes, replace the leading .. with http:// plus the host name (http://z.z.com in the expected result below), and write the result to a file.
I'm a newbie and am looking for a way to automate this process in the terminal.
The inputs are HTML source files and the links can appear anywhere in them. I need the links written to a file, one link per line, so I can pass that file to my existing xargs curl command for download.
A sample file looks roughly like this:
<head>
<body>
<html>
blabla
</>
blibli afg fgfdg sdfg <b> blo blo href= '../xxx/yyy/ccc/bbbb/nameoffile1.extension' target blibli bloblo href= '../xxx/yyy/ccc/bbbb/nameoffile2.extension' blibli
bloblo href= '../xxx/yyy/ccc/bbbb/nameoffile3.extension'
…
The result I am looking for is a file containing this:
http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile1.extension
http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile2.extension
http://z.z.com/xxx/yyy/ccc/bbbb/nameoffile3.extension
Can someone be kind enough to help me find a solution, please?
Here is a source file, as close as possible to the real one:
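For illustration, the transformation described above could be sketched roughly as follows with Python's standard-library HTML parser. This is only a minimal sketch under some assumptions: the host http://z.z.com is taken from the expected result, the links sit in href attributes of real tags (as in the full source file, not the loose href= text of the short sample), and the names BASE, LinkCollector, extract_links and convert are made up for this example.

```python
from html.parser import HTMLParser

# Assumption: the host comes from the expected result shown in the question.
BASE = "http://z.z.com"


class LinkCollector(HTMLParser):
    """Collect href values that start with '../' and rewrite them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        for name, value in attrs:
            if name == "href" and value and value.startswith("../"):
                # Drop the leading '..' and keep the '/' that follows it.
                self.links.append(BASE + value[2:])


def extract_links(html_text):
    """Return the rewritten links in the order they appear."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


def convert(src_path, dst_path, encoding="iso-8859-1"):
    """Read one HTML file and write its links, one per line."""
    with open(src_path, encoding=encoding) as src:
        links = extract_links(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        for link in links:
            dst.write(link + "\n")
```

Calling convert("source.html", "links.txt") (both file names are placeholders) would produce a file that can be fed to xargs curl. Links such as href="style.css" are skipped because they do not start with ../, and percent-encoded characters like %27 are left untouched.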
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML>
<HEAD>
<TITLE>Inter num num - nil</TITLE>
<link rel="stylesheet" type="text/css" href="style.css" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</HEAD>
<BODY><table width=1200 align=center class=tabForm><tr><td align=left width=largeur_2 valign=top><img src=Img/logo.gif><br /></td><td align=center valign=center width=largeur_6><h1><font color='#CB150A'>Test d'épreuve</font></h1></td><td align=right valign=top width=largeur_2 class=dataLabel>Reçu le 11/03/2018 à 17:49<br /></td></tr>
<tr><td width=1200 colspan=3 align=center><b><font color='#CB150A' size=+1>Client : zzz - Référence : 232323 - Désignation : Fiche d'accueil </font></b></color></td></tr>
</table><BR/><table width=1200 align=center class=tabForm><tr><td class=dataLabelBig width=1200>M numnum ,<BR/><BR/>
Job citée ci-dessus.<BR/>
ci-joints toutes les informations nécessaires.
<BR/><BR/>
Sandy Jan<BR/>
test@test.com</font></td></tr></table><br /><table width=1200 align=center class=tabForm><tr><td colspan=2 width=1200 class=dataLabel>Documents nécessaires à votre réponse</td></tr><tr><td colspan=2 width=1200 class=dataLabel><u><b>Job :</b></u> Suivi Travaux - <u><b>Article :</b></u> 232323 - Fiche d'accueil</td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/path.html' target=_blank><img src=Img/pdf.png border=0> Fiche.html</a></td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/pathd%27accueil%20traitant-20160621163240.pdf' target=_blank><img src=Img/pdf.png border=0> text.pdf</a></td></tr><tr><td colspan=2 width=1200 class=dataLabel><a href='../path/path/path/path/pathla%20S%E9curit%E9%20%281%29.doc' target=_blank><img src=Img/pdf.png border=0> Fiched'accueil.doc</a></td></tr></table><br /><table width=1200 align=center class=tabForm><tr><td colspan=2 class=dataLabelRed width=1200 >Notre commentaire</td></tr></tr><td colspan=2 class=dataLabel>mise a jour - Attention<br />
Impression <br /><br /></td></tr></table><br /><table width=1200 align=center class=tabForm><form method=post name=formvolume action=?&dossier=111734&coo=135&auth=b182f10b82ba&key=2e7c69213b28d7de6655&action=submit&type=volume enctype=multipart/form-data ><tr><td width=1200 align=left colspan=2 class=dataLabel><h3><img src=Img/h3Arrow.gif border=0> Remise de job :</h3><br /></td></tr><tr><td align=left valign=top width=120 class=dataLabelRed>Votre commentaire</td><td width=1080 align=left class=dataLabel><textarea cols=200 rows=5 name=comment ></textarea></td></tr><tr><td align=left width=120 class=dataLabelRed>Votre fichier</td><td width=1080 align=left><input type=file name=fichier size=82></td></tr><tr><td align=center colspan=2 width=1200><br /><input type=button class=button value=" Remettre votre réponse " onClick="javascript: var ok=confirm('Etes vous certain de vouloir effectuer cette action ?');if(ok==true){ document.formvolume.submit();}else {return false}" ></form></td></tr><table></table></br><table width=1200 align=center class=tabForm><form method=post name=formvolume_complement action=?&dossier=111734&coo=135&auth=b182f10b82ba&key=2e7c69213b28d7de6655&action=submit_complement&type=volume enctype=multipart/form-data ><tr><td width=1200 align=left colspan=2 class=dataLabel><h3><img src=Img/h3Arrow.gif border=0> Demande de complément, votre réponse :</h3><br /></td></tr><tr><tr><td align=left valign=top width=120 class=dataLabelRed>Votre commentaire</td><td width=1080 align=left class=dataLabel><textarea cols=200 rows=5 name=comment ></textarea></td></tr><td align=left width=120 class=dataLabelRed>Votre fichier</td><td width=1080 align=left><input type=file name=fichier size=82></td></tr><tr><td align=center colspan=2 width=1200><br /><input type=button class=button value=" Remettre votre réponse " onClick="javascript: var ok =confirm('Etes v ?');if(ok==true){ document.formvolume_complement.submit();}else {return false}" 
></form></td></tr><table></table></BODY></HTML></BODY>
</HTML>
[...] sed and the like for parsing HTML, just as Amith KK suggests in his answer. – PerlDuck Jul 12 '18 at 18:09
[...] sed to convert HTML to Text in my websync project: https://askubuntu.com/questions/900319/code-version-control-between-local-files-and-au-answers which extracts source code in Ask Ubuntu Answers and compares it to files on local disks. I've just posted an answer with the sed code and faster bash builtin equivalent code. – WinEunuuchs2Unix Jul 12 '18 at 23:44