Monday 23 March 2020

Comparing SMILES strings using OpenBabel

I have three different SMILES strings that should be for the same compound, from different sources.

It's a little difficult to be sure they are definitely the same thing, because the compound has stereochemistry and it's hard to be sure all those wedge bonds are the same just by looking at them.

In this case, when I paste the SMILES into CDK Depict, they look a little bit different from each other, but I think that's just due to rotation about a single bond.
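One looks like this: [image: CDK Depict drawing of the first structure]

and the other looks like this: [image: CDK Depict drawing of the second structure]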

If you are beginning to squint and roll your head sideways, then perhaps we need an easier way to compare them...

So here's a way to do it using OpenBabel (which, happily, I find I installed on my Mac a while ago; see my blog post on that):

- First, write a file with the three SMILES strings, one per line, each followed by a label saying where it came from. I called mine 'digitoxin.smi', as they should all be the compound digitoxin:
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C    myspreadsheet
C[C@H]1O[C@@H](O[C@H]2[C@@H](O)C[C@H](O[C@H]3[C@@H](O)C[C@H](O[C@H]4CC[C@]5(C)[C@H]6CC[C@]7(C)[C@@H](C8=CC(=O)OC8)CC[C@]7(O)[C@@H]6CC[C@@H]5C4)O[C@@H]3C)O[C@@H]2C)C[C@H](O)[C@@H]1O    chembl
O1CC(=CC1=O)[C@@H]2[C@@]3([C@@]([C@H]4[C@@H]([C@@]5([C@@H](C[C@H](CC5)O[C@@H]6O[C@@H]([C@H]([C@H](C6)O)O[C@@H]7O[C@@H]([C@H]([C@H](C7)O)O[C@@H]8O[C@@H]([C@H]([C@H](C8)O)O)C)C)C)CC4)C)CC3)(CC2)O)C    sigmaspreadsheet


You can see all the '@' symbols in the SMILES strings, which convey the stereochemistry at chiral atoms. Stereochemistry in SMILES strings can also be conveyed by '/' and '\' symbols, which describe the geometry around double bonds.

Then use OpenBabel to convert these to InChIKeys:
% obabel digitoxin.smi -o inchikey
This gives me:
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
WDJUZGPOPHTGOT-XUDUSOBPSA-N
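
With three keys it's easy to compare them by eye, but with a longer file it might help to group the labels by key. Here is a minimal Python sketch of that idea. The function name `group_names_by_key` is my own invention, and it assumes that obabel prints exactly one key per input line, in the same order as the file; if a SMILES string failed to convert, the counts would not match and the function raises an error rather than guessing.

```python
from collections import defaultdict


def group_names_by_key(names, keys):
    """Map each InChIKey to the list of source labels that produced it.

    Assumes names[i] and keys[i] refer to the same input line.
    """
    if len(names) != len(keys):
        raise ValueError("expected one InChIKey per input SMILES")
    groups = defaultdict(list)
    for name, key in zip(names, keys):
        groups[key.strip()].append(name)
    return dict(groups)


# Labels from digitoxin.smi and the keys obabel printed for them
names = ["myspreadsheet", "chembl", "sigmaspreadsheet"]
keys = [
    "WDJUZGPOPHTGOT-XUDUSOBPSA-N",
    "WDJUZGPOPHTGOT-XUDUSOBPSA-N",
    "WDJUZGPOPHTGOT-XUDUSOBPSA-N",
]

groups = group_names_by_key(names, keys)
for key, members in groups.items():
    print(key, members)
```

If all the sources agree, there is a single group containing every label; more than one group means at least one source disagrees.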

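Apparently, the first block (the 14 letters before the first '-') is a hash of how the atoms are connected, and the second block (between the first and second '-') is a hash of the remaining layers, which include the stereochemistry. So two compounds with the same connectivity but different stereochemistry would share the first block and differ in the second. Here all three keys are identical in both blocks, so it looks like the three compounds are the same, hurray!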
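
If two keys did differ, the block structure gives a rough idea of where. Below is a small Python sketch of that check. The function names and the wording of the returned messages are my own, and it assumes the usual key layout of 14 letters, then 10 letters, then 1 letter, separated by '-'. Because the blocks are hashes, this can only say whether a block is the same or different, not what the difference is.

```python
def split_inchikey(key):
    """Split an InChIKey into its three '-'-separated blocks."""
    blocks = key.strip().split("-")
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return blocks


def compare_inchikeys(key_a, key_b):
    """Say roughly where two InChIKeys differ, if they do."""
    a = split_inchikey(key_a)
    b = split_inchikey(key_b)
    if a[0] != b[0]:
        # First block: hash of the connectivity
        return "different connectivity"
    if a[1] != b[1]:
        # Second block: hash of the other layers (stereochemistry
        # among them), plus the standard/version letters
        return "same connectivity, other layers differ"
    if a[2] != b[2]:
        # Final letter: protonation
        return "same structure, different protonation"
    return "identical"


digitoxin = "WDJUZGPOPHTGOT-XUDUSOBPSA-N"
print(compare_inchikeys(digitoxin, digitoxin))  # prints "identical"
```

A result of "same connectivity, other layers differ" would be the cue to go back and look hard at those wedge bonds.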

Thanks!
Thanks to Noel O'Blog for help.