The QED score (Quantitative estimate of druglikeness) score for compounds was published by Bickerton et al 2012. To calculate this, a series of desirability functions (d) are derived, each
corresponding to a different molecular descriptor. These are combined to give the QED by taking the geometric mean of the individual functions. The individual functions used are 8 widely-used molecular properties that are thought to be important for determining druglikeness:
1) molecular weight (MW)
2) octanol-water partition coefficient (ALOGP)
3) number of hydrogen bond donors (HBD)
4) number of hydrogen bond acceptors
(HBA)
5) molecular polar surface area (PSA)
6) number of rotatable bonds
(ROTB)
7) the number of aromatic rings (AROM)
8) number of structural alerts (ALERTS).
Thursday, 23 June 2016
RNASeq gene expression
I'm trying to get my hands on some RNAseq gene expression data, and have to remind myself what do RPKM and FPKM mean again?
RPKM: Reads per kilobase of transcript, per million mapped reads. This is a measure of a transcript's expression level, normalised for the length of the transcript, and for the total amount of reads in the data set.
FPKM: Fragments per kilobase of transcript, per million mapped reads. In RNAseq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. That is, if your data is paired-end RNAseq data, and you see two reads from the same read-pair, then these are only counted as one fragment when calculating the FPKM (but counted as two reads when calculating RPKM).
How do you get RPKM or FPKM values?
1. First you need to map your raw reads to the genome assembly eg. using TopHat. If you already have a gff/gtf file of known genes, you can use the -G option to map to the known genes.
2. You can then calculate the RPKM or FPKM using Cufflinks.
RNAseq data files
Just to remind myself, there are also different types of RNAseq data files:
bam files: have the reads mapped to an assembly
BigWig files: useful for displaying RNAseq data, as they are in an indexed binary format.
Notes
There is some discussion on the FPKM values calculated by Cufflinks here. Apparently TopHat used to calculate RPKM, but it's now deprecated and recommended to use Cufflinks (see here).
RPKM: Reads per kilobase of transcript, per million mapped reads. This is a measure of a transcript's expression level, normalised for the length of the transcript, and for the total amount of reads in the data set.
FPKM: Fragments per kilobase of transcript, per million mapped reads. In RNAseq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. That is, if your data is paired-end RNAseq data, and you see two reads from the same read-pair, then these are only counted as one fragment when calculating the FPKM (but counted as two reads when calculating RPKM).
How do you get RPKM or FPKM values?
1. First you need to map your raw reads to the genome assembly eg. using TopHat. If you already have a gff/gtf file of known genes, you can use the -G option to map to the known genes.
2. You can then calculate the RPKM or FPKM using Cufflinks.
RNAseq data files
Just to remind myself, there are also different types of RNAseq data files:
bam files: have the reads mapped to an assembly
BigWig files: useful for displaying RNAseq data, as they are in an indexed binary format.
Notes
There is some discussion on the FPKM values calculated by Cufflinks here. Apparently TopHat used to calculate RPKM, but it's now deprecated and recommended to use Cufflinks (see here).
Classes of anthelmintic drugs
The MeSH browser can be used to study the MeSH (medical subject headings) classification system. Drugs and compounds in the MeSH pharmalogical classification can be browsed here. I'm interested in anthelmintic drugs. For these the MeSH classes are:
Anthelmintics
- Antinematodal Agents
- Filaricides
- Antiplatyhelmintic Agents
- Anticestodal Agents
- Schistosomicides
But anthelmintic drugs also go by other names..
Other names to look for are: anti-helminthic, antischistosomal, antifilarial, antitrematodal, filaricidal, fasciolicidal, fasciolicide, microfilaricidal, nematicidal, nematocidal, taeniacide.
Anthelmintics
- Antinematodal Agents
- Filaricides
- Antiplatyhelmintic Agents
- Anticestodal Agents
- Schistosomicides
But anthelmintic drugs also go by other names..
Other names to look for are: anti-helminthic, antischistosomal, antifilarial, antitrematodal, filaricidal, fasciolicidal, fasciolicide, microfilaricidal, nematicidal, nematocidal, taeniacide.
Friday, 17 June 2016
Scripts for gene statistics based on a gff
I've written some scripts to calculate gene statistics based on a gff file, here's how they work (to remind myself)..
(1) Given a protein fasta file, find the mean and median protein length:
First you get the longest transcript for each gene:
This needs a list of the transcript names:
eg.
% grep mRNA augustus3.c.gff3 | grep 'Parent=g' | tr ':' ' ' | tr ';' ' ' | cut -f9 | tr "=" ' ' | awk '{ print $2"\t"$4 }' > augustus3.transcripts
16,767 transcripts
Then:
[script uses python2 because of biopython version]
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/take_longest_spliceform_only.py aug.c3.aa augustus3.transcripts augustus3.longest.pep
12,405 transcripts
Then:
[script uses python2 because of biopython version]
% python2 ~alc/Documents/git/Python/calc_mean_median_protein_len.py augustus3.longest.pep
Mean length=510.306812, median length=336.000000
(2) Given the gff file, calculate the intron statistics:
First make a gff file with the longest transcript:
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/make_gff_for_spliceforms.py augustus3.longest.pep augustus3.c.gff3 augustus3.c.longest.gff3
Then calculate the total length of intron DNA, number of introns, mean and median intron length: % python3 ~alc/Documents/git/Python/calc_total_intron_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total intron bases=156226973
Number of introns=62497
Median length of intron regions=1610.000000 bp
Mean length of intron regions=2501.492744 bp
(3) Given the gff file, calculate the coding exon statistics (total amount of bases in coding exons, number of coding exons, median and mean length of coding exons, mean and median coding exons per gene):
% python3 ~alc/Documents/git/Python/calc_total_exon_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total coding exon (CDS) bases=19024352
Number of CDS regions=74902
Median length of CDS regions=165.000000 bp
Mean length of CDS regions=254.041868 bp
Mean number of CDS per gene=6.038049
Median number of CDS per gene=4.000000
(1) Given a protein fasta file, find the mean and median protein length:
First you get the longest transcript for each gene:
This needs a list of the transcript names:
eg.
% grep mRNA augustus3.c.gff3 | grep 'Parent=g' | tr ':' ' ' | tr ';' ' ' | cut -f9 | tr "=" ' ' | awk '{ print $2"\t"$4 }' > augustus3.transcripts
16,767 transcripts
Then:
[script uses python2 because of biopython version]
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/take_longest_spliceform_only.py aug.c3.aa augustus3.transcripts augustus3.longest.pep
12,405 transcripts
Then:
[script uses python2 because of biopython version]
% python2 ~alc/Documents/git/Python/calc_mean_median_protein_len.py augustus3.longest.pep
Mean length=510.306812, median length=336.000000
(2) Given the gff file, calculate the intron statistics:
First make a gff file with the longest transcript:
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/make_gff_for_spliceforms.py augustus3.longest.pep augustus3.c.gff3 augustus3.c.longest.gff3
Then calculate the total length of intron DNA, number of introns, mean and median intron length: % python3 ~alc/Documents/git/Python/calc_total_intron_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total intron bases=156226973
Number of introns=62497
Median length of intron regions=1610.000000 bp
Mean length of intron regions=2501.492744 bp
(3) Given the gff file, calculate the coding exon statistics (total amount of bases in coding exons, number of coding exons, median and mean length of coding exons, mean and median coding exons per gene):
% python3 ~alc/Documents/git/Python/calc_total_exon_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total coding exon (CDS) bases=19024352
Number of CDS regions=74902
Median length of CDS regions=165.000000 bp
Mean length of CDS regions=254.041868 bp
Mean number of CDS per gene=6.038049
Median number of CDS per gene=4.000000
Monday, 6 June 2016
A list of known anthelmintic drugs
I wanted to make a list of known anthelmintic (anti-helminth) drugs. It sounded so simple, but took a while. In the end I compiled info from various sources:
WHO ATC codes
- The WHO list of anthelmintic drugs for veterinary use (and here too), and list WHO list of anthelmintic drugs for human use
- the WHO also has some nice documents describing nomenclature for drugs here, where they list drugs in different classes, including anthelmintics
Wikipedia
- Anthelminitic drugs listed in wikipedia
KEGG Drugs
- KEGG drugs (although I don't have a subscription, so just could search for one name at a time, to confirm those I'd found)
ChEMBL
- a ChEMBL keyword search for 'anthelmintic' (by our collaborator Prudence Mutowo at ChEMBL - thanks Prudence)
Merck Manuals
- the Merck Veterinary Manual (online) proved to be a treasure trove
- the Merck Medical Manual (online)
DrugBank
- all the anthelmintics in DrugBank : these can be found by searching for those with MeSH term 'anthelmintic' : here (see DrugBank category browser)
PubChem
- 215 PubChem terms manually annotated with MeSH term 'antelmintics' here
(how I got that link: looked up one anthelmintic drug eg. carbendazole, then under Pharmalogical Properties, there are the MeSH terms such as 'anthelminitics', and a link to 'all PubChem compounds annotated with this term', so click on that ===> (later) actually I asked PubChem and they said I can just search for "anthelmintics"[PharmAction] AND "chembl"[SourceName] to get all PubChem anthelmintics that have a ChEMBL id.)
- You can find a list of antifilarial compounds in PubChem using this link
- In the end I found I got the most compounds in PubChem by searching for "anthelmintics OR anthelmintics OR nematocide OR nematicide".
WormBook
- a wormbook chapter on 'Anthelmintic drugs'
ChEBI
- a ChEBI search for 'anthelmintic'
Textbooks
- The book 'Approaches to design and synthesis of antiparasitic drugs' edited by Anand & Sharma, which has a table of anthelmintic drugs.
Gave me a list of about 160 anthelmintic compounds, many no longer in use but useful to have such a list..
Some things I learnt along the way:
Searching ChEMBL
- Sometimes a compound is in ChEMBL but has not been named. If you can get the SMILES string from another source (eg. from wikipedia or PubChem), you can search for it in ChEMBL using the SMILES string. To do this, go to the ChEMBL homepage, and paste the SMILES into the 'List Search' box on the right hand side after clicking on the SMILES Search radio button.
- When I search ChEMBL using the SMILES, sometimes it has matches that have the same structure if stereochemistry (around chiral centres) is ignored, but a different structure when stereochemistry is considered! I find the PubChem search with SMILES (PubChem structure search here) does respect stereochemistry, so sometimes it's best to search PubChem with the SMILES, and see if the PubChem match has a ChEMBL link in its entry (rather than searching ChEMBL directly).
Searching PubChem
- You can search PubChem using a SMILES here.
- Often there are several PubChem entries for a drug but they have different numbers of HCl or water molecules in them, for example for bisbenzimide.
- To search PubChem for a long IUPAC name such as "N-(4-bromophenyl)-2,6-dihydroxybenzamide", you need to include quotes on either end, or the search doesn't work.
Searching ChEBI
- You can do a search for the term 'anthelmintic' on the ChEBI homepage, but that only finds about 10 records.
- A far better way is to go to the ChEBI homepage, click on 'Advanced Search', scroll down the bottom and in 'Ontology term' select 'has role' and type 'anthelmintic'. This picks up compounds annotated as having a role 'antiplatyhelmintic', 'antinematodal' or 'schistosomicide' too.
Convering a picture of a molecule to a SMILES
- In some cases I couldn't find the molecule in PubChem or ChEMBL, but could see a structure in a paper. In this case I had to draw in the molecule using the molecule drawing tool at PubChem, and then this gave me the SMILES, which I could use to search ChEMBL.
Converting IUPAC names to SMILES
- In some cases, I found a list of IUPAC names in a textbook, and wanted to know if they are in PubChem. However, I've heard that the IUPAC names in PubChem are not always correct. I then heard I can convert IUPAC names to SMILES using the OPSIN software by Daniel Lowe and colleagues. I can then search PubChem using the SMILES on their structure search page. Great!
SMILES in PubChem
- PubChem gives the 'isomeric SMILES' and the 'canonical SMILES' for a molecule, but the one to use is the 'isomeric SMILES' as this gives the stereochemistry (unless there is no 'isomeric SMILES' given, in which case use the 'canonical SMILES').
Metal-containing compounds
- ChEMBL does not give SMILES for metal-containing compounds because SMILES does not have a good way to represent the coordinate bonds nor a recognised way to
describe stereo around a metal centre. This can make it awkward to find the structure though, eg.
for 'gallium nitrate', I found it in ChEMBL by searching for the name, but here is no PubChem link in ChEMBL and the ChEMBL entry has no SMILES but has has formula GaN3O9 and synonym ganite. I then searched for 'ganite' in PubChem and found 3 entries, only one is GaN3O, and this gives a SMILES.
- Another interesting example is motexafin gadolinium: this has two PubChem entries with the same SMILES:
158385: CCC1=C(C2=CC3=NC(=CN=C4C=C(C(=
CC4=NC=C5C(=C(C(=N5)C=C1[N-]2) CCCO)C)OCCOCCOCCOC) OCCOCCOCCOC)C(=C3CCCO)C)CC.CC( =O)[O-].CC(=O)[O-].[Gd+3]
393405: CCC1=C(C2=CC3=NC(=CN=C4C=C(C(= CC4=NC=C5C(=C(C(=N5)C=C1[N-]2) CCCO)C)OCCOCCOCCOC) OCCOCCOCCOC)C(=C3CCCO)C)CC.CC( =O)[O-].CC(=O)[O-].[Gd+3]
WHO ATC codes
- The WHO list of anthelmintic drugs for veterinary use (and here too), and list WHO list of anthelmintic drugs for human use
- the WHO also has some nice documents describing nomenclature for drugs here, where they list drugs in different classes, including anthelmintics
Wikipedia
- Anthelminitic drugs listed in wikipedia
KEGG Drugs
- KEGG drugs (although I don't have a subscription, so just could search for one name at a time, to confirm those I'd found)
ChEMBL
- a ChEMBL keyword search for 'anthelmintic' (by our collaborator Prudence Mutowo at ChEMBL - thanks Prudence)
Merck Manuals
- the Merck Veterinary Manual (online) proved to be a treasure trove
- the Merck Medical Manual (online)
DrugBank
- all the anthelmintics in DrugBank : these can be found by searching for those with MeSH term 'anthelmintic' : here (see DrugBank category browser)
PubChem
- 215 PubChem terms manually annotated with MeSH term 'antelmintics' here
(how I got that link: looked up one anthelmintic drug eg. carbendazole, then under Pharmalogical Properties, there are the MeSH terms such as 'anthelminitics', and a link to 'all PubChem compounds annotated with this term', so click on that ===> (later) actually I asked PubChem and they said I can just search for "anthelmintics"[PharmAction] AND "chembl"[SourceName] to get all PubChem anthelmintics that have a ChEMBL id.)
- You can find a list of antifilarial compounds in PubChem using this link
- In the end I found I got the most compounds in PubChem by searching for "anthelmintics OR anthelmintics OR nematocide OR nematicide".
WormBook
- a wormbook chapter on 'Anthelmintic drugs'
ChEBI
- a ChEBI search for 'anthelmintic'
Textbooks
- The book 'Approaches to design and synthesis of antiparasitic drugs' edited by Anand & Sharma, which has a table of anthelmintic drugs.
Gave me a list of about 160 anthelmintic compounds, many no longer in use but useful to have such a list..
Some things I learnt along the way:
Searching ChEMBL
- Sometimes a compound is in ChEMBL but has not been named. If you can get the SMILES string from another source (eg. from wikipedia or PubChem), you can search for it in ChEMBL using the SMILES string. To do this, go to the ChEMBL homepage, and paste the SMILES into the 'List Search' box on the right hand side after clicking on the SMILES Search radio button.
- When I search ChEMBL using the SMILES, sometimes it has matches that have the same structure if stereochemistry (around chiral centres) is ignored, but a different structure when stereochemistry is considered! I find the PubChem search with SMILES (PubChem structure search here) does respect stereochemistry, so sometimes it's best to search PubChem with the SMILES, and see if the PubChem match has a ChEMBL link in its entry (rather than searching ChEMBL directly).
Searching PubChem
- You can search PubChem using a SMILES here.
- Often there are several PubChem entries for a drug but they have different numbers of HCl or water molecules in them, for example for bisbenzimide.
- To search PubChem for a long IUPAC name such as "N-(4-bromophenyl)-2,6-dihydroxybenzamide", you need to include quotes on either end, or the search doesn't work.
Searching ChEBI
- You can do a search for the term 'anthelmintic' on the ChEBI homepage, but that only finds about 10 records.
- A far better way is to go to the ChEBI homepage, click on 'Advanced Search', scroll down the bottom and in 'Ontology term' select 'has role' and type 'anthelmintic'. This picks up compounds annotated as having a role 'antiplatyhelmintic', 'antinematodal' or 'schistosomicide' too.
Convering a picture of a molecule to a SMILES
- In some cases I couldn't find the molecule in PubChem or ChEMBL, but could see a structure in a paper. In this case I had to draw in the molecule using the molecule drawing tool at PubChem, and then this gave me the SMILES, which I could use to search ChEMBL.
Converting IUPAC names to SMILES
- In some cases, I found a list of IUPAC names in a textbook, and wanted to know if they are in PubChem. However, I've heard that the IUPAC names in PubChem are not always correct. I then heard I can convert IUPAC names to SMILES using the OPSIN software by Daniel Lowe and colleagues. I can then search PubChem using the SMILES on their structure search page. Great!
SMILES in PubChem
- PubChem gives the 'isomeric SMILES' and the 'canonical SMILES' for a molecule, but the one to use is the 'isomeric SMILES' as this gives the stereochemistry (unless there is no 'isomeric SMILES' given, in which case use the 'canonical SMILES').
Metal-containing compounds
- ChEMBL does not give SMILES for metal-containing compounds because SMILES does not have a good way to represent the coordinate bonds nor a recognised way to
describe stereo around a metal centre. This can make it awkward to find the structure though, eg.
for 'gallium nitrate', I found it in ChEMBL by searching for the name, but here is no PubChem link in ChEMBL and the ChEMBL entry has no SMILES but has has formula GaN3O9 and synonym ganite. I then searched for 'ganite' in PubChem and found 3 entries, only one is GaN3O, and this gives a SMILES.
- Another interesting example is motexafin gadolinium: this has two PubChem entries with the same SMILES:
158385: CCC1=C(C2=CC3=NC(=CN=C4C=C(C(=
393405: CCC1=C(C2=CC3=NC(=CN=C4C=C(C(=
However, these are different molecules, if you look at the pictures in PubChem you'll see that the two structures in PubChem have different coordinate bonds.
- Noel O'Blog tells me that SMILES for metal-containing compounds don't really work properly and so this is why we get these problems.
Chemical classes of compounds
- To find out the chemical class of a compound, it's useful to use ChEBI, which has classified compounds using a chemical ontology. If your compound isn't in ChEBI, you can use the SMILES to search for similar compounds: go to 'Advanced Search', then click on the 'File Open' symbol in the toolbar around the square at the left, then paste in your SMILES, and click OK, then click 'Find compounds which resemble this structure' on the right, and click 'Go'.
- Noel O'Blog tells me that SMILES for metal-containing compounds don't really work properly and so this is why we get these problems.
Chemical classes of compounds
- To find out the chemical class of a compound, it's useful to use ChEBI, which has classified compounds using a chemical ontology. If your compound isn't in ChEBI, you can use the SMILES to search for similar compounds: go to 'Advanced Search', then click on the 'File Open' symbol in the toolbar around the square at the left, then paste in your SMILES, and click OK, then click 'Find compounds which resemble this structure' on the right, and click 'Go'.
Thanks!
Thanks to Noel O'Blog for helping me when I got stuck with the cheminformatics. Thanks also to the ChEMBL helpdesk for answering my questions.
Friday, 3 June 2016
Matrix multiplication in R
It's quite easy to multiply matrices in R. For example if you have a row matrix:
> a <- matrix(c(0,0,0.5,0.5),1,4)
> a
[,1] [,2] [,3] [,4]
[1,] 0 0 0.5 0.5
and a 4x4 square matrix:
> P <- matrix(c(0,1/3,1/3,0,1/3,0,0,1,1/3,0,0,0,1/3,2/3,2/3,0),4,4)
> P
[,1] [,2] [,3] [,4]
[1,] 0.0000000 0.3333333 0.3333333 0.3333333
[2,] 0.3333333 0.0000000 0.0000000 0.6666667
[3,] 0.3333333 0.0000000 0.0000000 0.6666667
[4,] 0.0000000 1.0000000 0.0000000 0.0000000
Now multiply them:
> a %*% P
[,1] [,2] [,3] [,4]
[1,] 0.1666667 0.5 0 0.3333333
Handy for Markov chains!
> a <- matrix(c(0,0,0.5,0.5),1,4)
> a
[,1] [,2] [,3] [,4]
[1,] 0 0 0.5 0.5
and a 4x4 square matrix:
> P <- matrix(c(0,1/3,1/3,0,1/3,0,0,1,1/3,0,0,0,1/3,2/3,2/3,0),4,4)
> P
[,1] [,2] [,3] [,4]
[1,] 0.0000000 0.3333333 0.3333333 0.3333333
[2,] 0.3333333 0.0000000 0.0000000 0.6666667
[3,] 0.3333333 0.0000000 0.0000000 0.6666667
[4,] 0.0000000 1.0000000 0.0000000 0.0000000
Now multiply them:
> a %*% P
[,1] [,2] [,3] [,4]
[1,] 0.1666667 0.5 0 0.3333333
Handy for Markov chains!