I had a list of C. elegans genes (WBGene identifiers) and wanted to know which C. elegans genes have known phenotypes. On the WormBase page, they know say that WormMart is retired, and to use WormMine instead. So I did this:
To get phenotypes for a list of C. elegans genes:
1. Went to WormMine
2. Clicked on 'Lists' and uploaded my list of worm genes (separated by commas), and saved the list as 'wormsinglecopygenes2'
3. Then clicked on 'Phenotypes', and chose 'Genes->Phenotypes', and selected the box 'Constrain to be in the list wormsinglecopygenes2', and then pressed 'Go'.
4. Voila! It gave me a nice output spreadsheet.
To get all C. elegans genes with a particular phenotype:
1. Went to WormMine
2. Clicked on 'Templates', and then clicked on 'Phenotype-->Genes'.
3. Set 'phenotype == lethal'.
4. This found 1343 rows. I clicked 'Download'.
Note 12-Aug-2016: I notice that when I use WormMine I get far fewer genes with
a particular phenotype than when I search WormBase directly. For
example, for the phenotype 'molt defect' ('WBPhenotype:0000638'), when I search using
WBPhenotype:0000638 on www.wormbase.org, and download all the results,
the results file has 242 distinct genes. In contrast, when I go to
WormMine, and choose the template Phenotype->Genes and enter 'molt
defect', it gives me a list of just 19 genes. It seems to be missing
many genes, eg. WBGene00000039 is found using the website, but not
wormmine, and when I checked the wormbase page for that gene, it has an
RNAi phenotype observed of 'molt defect' for WBGene00000039, so surely
should be listed under WormMine too. I emailed WormBase and heard that RNAi information is not yet used in WormMine however, and that I can get a full file of phenotypes for C. elegans genes from the ftp site: ftp://ftp.wormbase.org/pub/wormbase/releases/WS253/ONTOLOGY/phenotype_association.WS253.wb
(Thanks Michael Paulini for help!)
To get all GO terms for all C. elegans genes:
1. Went to WormMine.
2. Clicked on 'QueryBuilder', then selected 'gene' from the dropdown list.
3. In the browser at the left, for 'Gene'->'WormBase Gene ID', clicked on 'Show'.
4. In the browser at the left, clicked on 'GO Annotations' -> 'Ontology Term' -> 'Ontology Annotations' -> and for 'Identifier' clicked on 'Show'.
5. Clicked 'Show Results' at the bottom right. This gave 274,273 rows of results. Then clicked 'Download'.
Note 7-Jul-2016:
- the website for WormMine now seems to be here.
Friday, 19 February 2016
Thursday, 18 February 2016
Calculating distances between drug molecules
Cluster a set of compounds by structural similarity
Given a set of drug molecules from Chembl, I need to calculate pairwise distances between the drugs, for clustering. Noel O'Blog helped me do it using RDKit by Greg Landrum, by writing the Python script below. Given Chembl ids. for the input molecules, it calculates pairwise similarities between drugs based on calculating an ECFP4 fingerprint for each drug molecule, and then calculating the distances between the fingerprints for each pair of drugs. The pairwise similarities given in the output file are scaled from 0-1, so can be converted to distances using 1 - similarity. The input file contains duplicates that are distinguished only by undefined/defined stereocentres; these are likely the same molecule, so the script removes duplicates that are distinguished only by stereochemistry.
Thanks Noel!
import collections
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, AllChem
smiles_lookup = dict( (y, x) for (x, y) in (z.split() for z in open("C:\Tools\Topliss\chembl20\chembl_20.ism") if len(z.split())==2))
def readdata(filename):
"""Return dictionary from chemblid to SMILES"""
it = open(filename)
header = next(it)
data = collections.defaultdict(set)
for line in it:
broken = line.rstrip().split()
if broken[2] in data: # Check for duplicates
assert False
data[broken[2]] = broken[1]
return data
def getmol(smi):
"""Keep a single largest component if disconnected"""
mol = Chem.MolFromSmiles(smi)
if "." in smi:
frags = list(Chem.GetMolFrags(mol, asMols=True))
frags.sort(key=lambda x:x.GetNumHeavyAtoms(), reverse=True)
mol = frags[0]
return mol
# The file contains duplicates that are distinguished only by undefined/defined stereocenters
# These are likely the same molecule. Let's remove duplicates that are distinguished only
# by stereochemistry
def mergeStereoisomers(chemblids):
smis = [data[chemblid] for chemblid in chemblids_filtered]
mols = [getmol(smi) for smi in smis]
normalised = {}
for i, mol in enumerate(mols):
smi = Chem.MolToSmiles(mol)
normalised[smi] = chemblids[i] # Stores only one
return normalised.values()
if __name__ == "__main__":
# ECFP setup
fingerprinter = lambda m: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=16384)
srcs = ["SMILES4CMPDS_smiles.tsv"]
for src in srcs:
filename = src
print "Looking at %s" % filename
data = readdata(filename)
chemblids = data.keys()
print "Distinct chemblids in original TSV", len(chemblids)
# filter out those molecules without structures in ChEMBL
# ...this step doesn't apply here
chemblids_filtered = chemblids
print "Chembl Ids associated with SMILES", len(chemblids_filtered)
chemblids_final = mergeStereoisomers(chemblids_filtered)
print "Chembl Ids after removing stereoisomers", len(chemblids_final)
smis = [data[chemblid] for chemblid in chemblids_final]
mols = [getmol(smi) for smi in smis]
fps = [fingerprinter(mol) for mol in mols]
N = len(chemblids_final)
with open("%s_pairwise_sim.txt" % src, "w") as f:
# header
f.write("\t".join(chemblids_final) + "\n")
for i in range(N):
row = []
fpA = fps[i]
for j in range(N):
if i==j:
sim = 1.0
else:
fpB = fps[j]
sim = DataStructs.FingerprintSimilarity(fpA, fpB)
row.append("%.2f" % sim)
f.write(chemblids_final[i] + "\t" + "\t".join(row) + "\n")
Find any compounds in a new set ('set 2') that are not very similar to those in my original set ('set 1')
A second problem I had was to see if any compounds that I had found from a PubChem search ('set 2') were very different from a list I had already curated from various sources ('set 1'). Noel O'Blog came to the rescue, helping me to use RDKit to find molecules that are structurally different, defined by calculating similarities of <=0.4, when similarities were calculated using the Tanimoto coefficient of ECFP4 fingerprints as implemented in the RDKit toolkit, http://www.rdkit.org). Here's the Python script, which takes 'set 2' in input file 'pubchem_219_compounds', and 'set 1' in input file 'tmp.csv':
import csv
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, AllChem
def readPCCompounds():
return [x.rstrip().split("\t") for x in open("pubchem_219_compounds")]
def isDigit(char):
return char >= '0' and char <= '9'
def readX20():
ans = []
reader = csv.reader(open("tmp.csv"))
# Skip header
for N in range(7):
reader.next()
for row in reader:
if row[0] == "": break
if row[6][0]=='-': continue
if row[6].startswith("!"):
broken = row[6][1:].split("!")
if len(broken) == 2:
smiles = broken
else:
smiles = [broken[i] for i in range(1, len(broken), 2)]
for x in smiles:
ans.append( [row[0], x] )
else:
ans.append( [row[0], row[6]] )
return ans
def getmol(smi):
"""Keep a single largest component if disconnected"""
mol = Chem.MolFromSmiles(smi)
if "." in smi:
frags = list(Chem.GetMolFrags(mol, asMols=True))
frags.sort(key=lambda x:x.GetNumHeavyAtoms(), reverse=True)
mol = frags[0]
return mol
def getFP(smi):
mol = getmol(smi)
return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=16384)
if __name__ == "__main__":
pubchems = readPCCompounds()
knowns = readX20()
pubchemfps = [getFP(smi) for name, smi in pubchems]
knownfps = [getFP(smi) for name, smi in knowns]
# for each PubChem entry...
for pubchem, pubchemfp in zip(pubchems, pubchemfps):
print "Considering PubChem entry %s..." % pubchem[0],
# ...compare to each known...
results = []
for known, knownfp in zip(knowns, knownfps):
sim = DataStructs.FingerprintSimilarity(pubchemfp, knownfp)
if sim > 0.4:
results.append( (sim, known[0]) )
if results:
results.sort(reverse=True)
print ", ".join("%s (%.2f)" % (x[1], x[0]) for x in results)
else:
print "nothing found with sim > 0.4"
Given a set of drug molecules from Chembl, I need to calculate pairwise distances between the drugs, for clustering. Noel O'Blog helped me do it using RDKit by Greg Landrum, by writing the Python script below. Given Chembl ids. for the input molecules, it calculates pairwise similarities between drugs based on calculating an ECFP4 fingerprint for each drug molecule, and then calculating the distances between the fingerprints for each pair of drugs. The pairwise similarities given in the output file are scaled from 0-1, so can be converted to distances using 1 - similarity. The input file contains duplicates that are distinguished only by undefined/defined stereocentres; these are likely the same molecule, so the script removes duplicates that are distinguished only by stereochemistry.
Thanks Noel!
import collections
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, AllChem
smiles_lookup = dict( (y, x) for (x, y) in (z.split() for z in open("C:\Tools\Topliss\chembl20\chembl_20.ism") if len(z.split())==2))
def readdata(filename):
"""Return dictionary from chemblid to SMILES"""
it = open(filename)
header = next(it)
data = collections.defaultdict(set)
for line in it:
broken = line.rstrip().split()
if broken[2] in data: # Check for duplicates
assert False
data[broken[2]] = broken[1]
return data
def getmol(smi):
"""Keep a single largest component if disconnected"""
mol = Chem.MolFromSmiles(smi)
if "." in smi:
frags = list(Chem.GetMolFrags(mol, asMols=True))
frags.sort(key=lambda x:x.GetNumHeavyAtoms(), reverse=True)
mol = frags[0]
return mol
# The file contains duplicates that are distinguished only by undefined/defined stereocenters
# These are likely the same molecule. Let's remove duplicates that are distinguished only
# by stereochemistry
def mergeStereoisomers(chemblids):
smis = [data[chemblid] for chemblid in chemblids_filtered]
mols = [getmol(smi) for smi in smis]
normalised = {}
for i, mol in enumerate(mols):
smi = Chem.MolToSmiles(mol)
normalised[smi] = chemblids[i] # Stores only one
return normalised.values()
if __name__ == "__main__":
# ECFP setup
fingerprinter = lambda m: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=16384)
srcs = ["SMILES4CMPDS_smiles.tsv"]
for src in srcs:
filename = src
print "Looking at %s" % filename
data = readdata(filename)
chemblids = data.keys()
print "Distinct chemblids in original TSV", len(chemblids)
# filter out those molecules without structures in ChEMBL
# ...this step doesn't apply here
chemblids_filtered = chemblids
print "Chembl Ids associated with SMILES", len(chemblids_filtered)
chemblids_final = mergeStereoisomers(chemblids_filtered)
print "Chembl Ids after removing stereoisomers", len(chemblids_final)
smis = [data[chemblid] for chemblid in chemblids_final]
mols = [getmol(smi) for smi in smis]
fps = [fingerprinter(mol) for mol in mols]
N = len(chemblids_final)
with open("%s_pairwise_sim.txt" % src, "w") as f:
# header
f.write("\t".join(chemblids_final) + "\n")
for i in range(N):
row = []
fpA = fps[i]
for j in range(N):
if i==j:
sim = 1.0
else:
fpB = fps[j]
sim = DataStructs.FingerprintSimilarity(fpA, fpB)
row.append("%.2f" % sim)
f.write(chemblids_final[i] + "\t" + "\t".join(row) + "\n")
Find any compounds in a new set ('set 2') that are not very similar to those in my original set ('set 1')
A second problem I had was to see if any compounds that I had found from a PubChem search ('set 2') were very different from a list I had already curated from various sources ('set 1'). Noel O'Blog came to the rescue, helping me to use RDKit to find molecules that are structurally different, defined by calculating similarities of <=0.4, when similarities were calculated using the Tanimoto coefficient of ECFP4 fingerprints as implemented in the RDKit toolkit, http://www.rdkit.org). Here's the Python script, which takes 'set 2' in input file 'pubchem_219_compounds', and 'set 1' in input file 'tmp.csv':
import csv
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, AllChem
def readPCCompounds():
return [x.rstrip().split("\t") for x in open("pubchem_219_compounds")]
def isDigit(char):
return char >= '0' and char <= '9'
def readX20():
ans = []
reader = csv.reader(open("tmp.csv"))
# Skip header
for N in range(7):
reader.next()
for row in reader:
if row[0] == "": break
if row[6][0]=='-': continue
if row[6].startswith("!"):
broken = row[6][1:].split("!")
if len(broken) == 2:
smiles = broken
else:
smiles = [broken[i] for i in range(1, len(broken), 2)]
for x in smiles:
ans.append( [row[0], x] )
else:
ans.append( [row[0], row[6]] )
return ans
def getmol(smi):
"""Keep a single largest component if disconnected"""
mol = Chem.MolFromSmiles(smi)
if "." in smi:
frags = list(Chem.GetMolFrags(mol, asMols=True))
frags.sort(key=lambda x:x.GetNumHeavyAtoms(), reverse=True)
mol = frags[0]
return mol
def getFP(smi):
mol = getmol(smi)
return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=16384)
if __name__ == "__main__":
pubchems = readPCCompounds()
knowns = readX20()
pubchemfps = [getFP(smi) for name, smi in pubchems]
knownfps = [getFP(smi) for name, smi in knowns]
# for each PubChem entry...
for pubchem, pubchemfp in zip(pubchems, pubchemfps):
print "Considering PubChem entry %s..." % pubchem[0],
# ...compare to each known...
results = []
for known, knownfp in zip(knowns, knownfps):
sim = DataStructs.FingerprintSimilarity(pubchemfp, knownfp)
if sim > 0.4:
results.append( (sim, known[0]) )
if results:
results.sort(reverse=True)
print ", ".join("%s (%.2f)" % (x[1], x[0]) for x in results)
else:
print "nothing found with sim > 0.4"