Wednesday 12 June 2019

Clustering compounds using DataWarrior

I wanted to cluster some chemical compounds (which I had stored in SMILES format), and had read about the free cheminformatics software DataWarrior, so decided to give it a try!

After downloading and installing DataWarrior on my Mac, I had a read through the DataWarrior user manual.

Reading in my data
I wanted to read in a tab-delimited text file in which the first column contains the compound name and the second column contains the SMILES string.
I made a little file with some test data in this format:
Nemadectin    C[C@@H]\1C/C(=C/C[C@@H]2C[C@@H](C[C@@]3(O2)C[C@@H]([C@@H]([C@H](O3)/C(=C/C(C)C)/C)C)O)OC(=O)[C@@H]4C=C([C@H]([C@@H]5[C@]4(/C(=C/C=C1)/CO5)O)O)C)/C
Hexachloroethane    C(C(Cl)(Cl)Cl)(Cl)(Cl)Cl
Tetrachloroethylene    C(=C(Cl)Cl)(Cl)Cl
N-Butyl chloride    CCCCCl
Fumazone    C(C(CBr)Br)Cl
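
If the compounds are already held in a script rather than a spreadsheet, a file in this two-column layout could be written with a few lines of Python. This is only a minimal sketch: the function name write_compound_table and the file name test_compounds.txt are my own placeholders, and I have left out Nemadectin to keep the listing short.

```python
import csv

# Names and SMILES copied from the test data above (Nemadectin omitted for brevity).
compounds = [
    ("Hexachloroethane", "C(C(Cl)(Cl)Cl)(Cl)(Cl)Cl"),
    ("Tetrachloroethylene", "C(=C(Cl)Cl)(Cl)Cl"),
    ("N-Butyl chloride", "CCCCCl"),
    ("Fumazone", "C(C(CBr)Br)Cl"),
]


def write_compound_table(rows, path):
    """Write (name, SMILES) pairs as tab-delimited text, one compound per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder output file name; change it to wherever you keep your data.
    write_compound_table(compounds, "test_compounds.txt")
```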

I then opened this file using the File->Open menu in DataWarrior, and voila, my data loaded!

Clustering compounds
I next wanted to cluster my compounds by similarity. To do this, I selected all my compounds in the 'Table' window (at the top left) in DataWarrior, and then chose 'Cluster compounds/reactions' in the 'Chemistry' menu.

This didn't produce any plot, but I noticed that two extra columns had been added to the 'Table' window, 'Cluster No' and 'Is Representative'. These give the cluster number for each compound, and say whether that compound is a good representative of its cluster.

Similarity analysis
Another type of analysis you can perform in DataWarrior is a 'similarity analysis'. To do this, I went to 'Chemistry'->'Analyse similarity/activity cliffs'. This brought up a new window called a 'Similarity chart', which shows my molecules, with similar compounds connected by a line. Also, when you click on the dot representing a compound, other molecules that are similar are coloured according to their level of similarity.

For example, if I click on 'ivermectin', I see the dot representing ivermectin becomes dark green, and other compounds that are similar are now coloured green:


In another nice example, I see a trio of compounds at the bottom that are coloured green and joined by lines, plus another greenish compound somewhere on the left, which is somewhat similar but is not joined to them by a line:

Note that if you move your mouse over a dot (compound) in the 'Similarity chart' window, the compound will appear in the window at the bottom left of the screen:

Note that it's also possible to hover over, or click on, a compound in the 'Table' of compounds at the top left of DataWarrior, and that compound will be highlighted in the 'Similarity chart' window, e.g. for euphorbol:

Note that if you want to export the connected components found by DataWarrior in this Similarity Analysis, you can go to File->Save special, and choose 'Textfile'. For each compound, the text file gives its number of neighbours, and the node numbers of those neighbours in the Similarity Analysis plot. You can then use that information to figure out the members of each connected component.
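
As a rough sketch of that last step, here is one way the connected components could be worked out in Python once the neighbour lists have been read in. I haven't shown the exact column layout of the exported text file, so this starts from a plain dictionary mapping each node number to the node numbers of its neighbours; the function name connected_components and the example numbers are made up for illustration, and you would need to parse the real file into this form yourself.

```python
from collections import deque


def connected_components(neighbours):
    """Group node numbers into connected components using breadth-first search.

    `neighbours` maps each node number to a list of the node numbers it is
    linked to in the similarity plot.
    """
    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            # Visit every neighbour that has not been reached yet.
            for other in neighbours.get(node, ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    return components


# Hypothetical neighbour lists (not real DataWarrior output):
# a trio, a compound with no neighbours, and a pair.
example = {1: [2, 3], 2: [1], 3: [1], 4: [], 5: [6], 6: [5]}
print(connected_components(example))  # [[1, 2, 3], [4], [5, 6]]
```

A compound with no neighbours ends up in a component of its own, which makes it easy to spot the singletons.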
