
I have a list of drug IDs. I need to look up each drug ID in the ChEMBL database (ebi.ac.uk/chembl), get information on each drug, including its structure and other details, and list the results in a table.

I think one way to do this is to write a command that appends each drug ID to the end of the URL and extracts the information for that drug. For example, this is the list of drug IDs:

CHEMBL3126679
CHEMBL3126678
CHEMBL478673
CHEMBL2386960
CHEMBL2326937
CHEMBL1258156
CHEMBL393858

and this is the URL that contains the information for one drug:

https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL3126679

The last part should be changed every time.

What command can I use to achieve this?

Zanna
  • "search toolbar of a website" - do you have any specific site in mind? Such search queries are usually sent to the web server as a GET or POST request to a specific path. You'd have to examine your specific site to find out where and how search queries are sent, then you could replicate that behaviour and do the request manually with a tool like wget or curl. You might have to further parse the response to extract all useful information. Anyway, that whole process is always specific to a single site, so your question is too broad as it is. Please clarify. – Byte Commander Apr 24 '18 at 07:35
  • Thank you for your answer. My website is https://www.ebi.ac.uk/chembl/, which is a database. I have a list of drug IDs and I want to search for the IDs one by one and extract their information. I can do it manually, but for a thousand drugs it is very difficult. – Tahereh S Apr 24 '18 at 07:39
  • You may try ChromeDriver or ChromiumDriver. See example in my answer. It has powerful Python-binding. – N0rbert Apr 24 '18 at 08:08
  • Looks like a dupe of https://askubuntu.com/questions/541307/for-loop-syntax-bash-script. – muru Apr 24 '18 at 09:57
  • @muru I think the main problem isn't looping over a list of strings but what command to issue in each iteration (curl, wget, ...) and how to pick the data from the response. curl https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL3126679 works, but the result contains a lot of JS code so I don't think it is that easy. I haven't analysed it in detail, though. – PerlDuck Apr 24 '18 at 10:00

1 Answer


You can use a shell loop to process the IDs, curl or wget to fetch each page, and a tool like pup to process the HTML. For example, if the IDs are in a file named foo, one per line, you can run:

# foo holds one ChEMBL ID per line
while IFS= read -r id
do
    curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/$id" |
      pup 'tr:parent-of(td:contains("Canonical SMILES")) td:nth-child(2) text{}'
done < foo

Here, I have used the pup command to:

  1. look for a table cell containing Canonical SMILES - td:contains("...")
  2. get the parent row of that - tr:parent-of(...)
  3. and print the second cell in that row: td:nth-child(2) text{}

I get output like:

CC1=CN([C@H]2C[C@H](N=[N+]=[N-])[C@H](O2)C(=O)N)C(=O)NC1=O


NC(=O)[C@H]1O[C@H](C[C@@H]1N=[N+]=[N-])N2C=CC(=O)NC2=O


NC1=NC(=O)N(Cc2cn(nn2)[C@H]3C[C@H](O)[C@@H](CO)O3)C=C1


CC1(C)CC[C@@]2([C@H](O)C[C@]3(C)C(=CC[C@@H]4[C@@]5(C)CCC(N)C ...
Download SMILES



COC(=O)c1nn(c2cccc(F)c2)c3c4ccccc4S(=O)(=O)N(C)c13


COC(=O)[C@H](C)NP(=O)(OC[C@H]1O[C@@H](N2C=CC(=O)NC2=O)[C@](C ...
Download SMILES



CCO[C@]1(CO)O[C@H]([C@H](O)[C@@H]1O)N2C=CC(=NC2=O)N
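Since the question asks for a table, the raw output above can be paired with the IDs. What follows is only a minimal sketch, not a tested recipe for the site: fetch_smiles and make_table are hypothetical helper names, and the sketch assumes that the first non-blank line of the cell text is the SMILES string, as in the output shown. Values that the page itself truncates with "..." stay truncated.

```shell
# Hypothetical helpers; a sketch under the assumptions stated above.

# Print the raw text of the "Canonical SMILES" cell for one compound ID.
fetch_smiles() {
    # </dev/null keeps curl away from the ID list on the loop's stdin
    curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/$1" </dev/null |
      pup 'tr:parent-of(td:contains("Canonical SMILES")) td:nth-child(2) text{}'
}

# Read IDs from stdin, one per line, and print tab-separated ID/SMILES rows.
make_table() {
    printf 'ID\tSMILES\n'
    while IFS= read -r id || [ -n "$id" ]; do
        id=$(printf '%s' "$id" | tr -d '\r')   # drop Windows carriage returns
        [ -z "$id" ] && continue               # skip blank lines
        # keep the first non-blank line of the cell, whitespace trimmed
        smiles=$(fetch_smiles "$id" | awk 'NF { $1 = $1; print; exit }')
        printf '%s\t%s\n' "$id" "$smiles"
    done
}
```

With the IDs in foo, running make_table < foo > table.tsv would write the table to table.tsv, which a spreadsheet program can open as tab-separated text.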

I'll leave it to you to examine the HTML and figure out the other filters.
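As a starting point for other fields, the same selector can be built from a row label. This is a sketch under an assumption: that other fields on the page sit in the same two-cell row layout as Canonical SMILES, which needs to be checked against the actual HTML. cell_selector and fetch_field are hypothetical helper names, and any label other than "Canonical SMILES" has to be read off the page.

```shell
# Hypothetical helpers; they assume each field is a row whose first cell
# holds the label and whose second cell holds the value.

# Print the pup selector for the value cell of the row with the given label.
cell_selector() {
    printf 'tr:parent-of(td:contains("%s")) td:nth-child(2) text{}' "$1"
}

# fetch_field ID LABEL: print the value cell text for one compound.
fetch_field() {
    curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/$1" </dev/null |
      pup "$(cell_selector "$2")"
}
```

For example, fetch_field CHEMBL3126679 "Canonical SMILES" runs the same pipeline as the loop above for a single ID.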

muru
  • thank you very much. I am trying to test it. but where is the output written? – Tahereh S Apr 25 '18 at 06:29
  • I can not get the output. I didn't get any error but there is no output! – Tahereh S Apr 25 '18 at 13:54
  • @TaherehS It works fine for me. What exactly did you do? You installed pup, right? In this answer, the word foo is the name of the file that contains the drug IDs. You get no output if the file exists but is empty. You get no output and the command never ends if you put the redirection operator the wrong way round. Could you give us a clue about what might have gone wrong in your case? – Zanna Apr 26 '18 at 07:58
  • @Zanna I put the name of my file instead of foo and executed the script. I get no output. should I define a file to put the output in it? like this: > output.txt? isn't it necessary to define the 'id' at first part of script? – Tahereh S Apr 27 '18 at 00:41
  • now I get 'EOF' as my output! – Tahereh S Apr 27 '18 at 03:48
  • The variable $id expands to a line of the file each time. Is your file in the current directory? You don't need to add a redirection for the output (unless you want to send it to a file) and redirecting the output will not work, unless you get some output in the terminal when you don't redirect it. Maybe you could show us the exact command you run and the exact output, and the ls -l output for the directory you are running it in, and the contents of the file you are passing to the script in place of foo and paste all those things on http://paste.ubuntu.com and give the link here? – Zanna Apr 27 '18 at 06:29
  • @ Zanna now I performed the script for only one link without using the text file that contains the ids. this is the command: curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL486822" | pup 'tr:parent-of(td:contains("CanonicalSMILES")) td:nth-child(2) text{}' I get this error : Selector parsing error: selector within ':parent-of' may not contain a pseudo class – Tahereh S Apr 27 '18 at 10:18
  • @TaherehS oh. Looks like you copied the search term incorrectly. There needs to be a space between Canonical and SMILES. I assume that's why you got no output, because there's no CanonicalSMILES. Please copy and paste muru's script exactly and only replace the filename (sorry I didn't see your message earlier - it's the opposite problem with @ and Zanna - if you include a space it doesn't work). – Zanna Apr 30 '18 at 08:18
  • @Zanna, I corrected the point you said. but the problem still exists. – Tahereh S Apr 30 '18 at 15:25
  • I could not reproduce your selector parsing error @TaherehS (btw there shouldn't be a semicolon after the URL either), please post the exact & complete commands you are using – Zanna Apr 30 '18 at 17:06
  • @Zanna this is the command that I use: curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL486822" | pup 'tr:parent-of(td:contains("Canonical SMILES")) td:nth-child(2) text{}' there is no semicolon after the URL in my command – Tahereh S May 01 '18 at 09:50
  • @Zanna there is no semicolon in my command when I run the command, but when I copy and paste it here this semicolon appears. – Tahereh S May 01 '18 at 10:43
  • @TaherehS you should enclose your commands in backticks (``) to make it render as code. This command works for me: curl -sL "https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL486822" | pup 'tr:parent-of(td:contains("Canonical SMILES")) td:nth-child(2) text{}' – Zanna May 01 '18 at 10:48
  • @Zanna OK. I have another question: what should I do if I also want to extract information from a link on each webpage? For example, in addition to the information on a drug's first webpage, that page contains a link to another page that I want to extract some information from. Is it possible? – Tahereh S May 03 '18 at 00:55
  • @TaherehS very likely :) but if you are stuck trying to do something a bit different from what you asked here, you should post a whole new question. You can link to this question to provide context. – Zanna May 03 '18 at 04:17