Inclusion of Atom Types in Force Fields to produce mixed CXSMILES, CXSMARTS, CurlySMILES

Sulstice
2 min readJun 25, 2022

--

It’s time we combined fields of cheminformatics and forcefields together based off the same chemical graph theory.

The CGenFF atom type language actually has the bond order, aromaticity, and ring system installed in the letters where their local (can extend into long range) chemical environment is captured in a 1-to-1 dictionary. Most that know SMILES, can already probably tell there is familiarity. If we look at the example of CG2D2 and try to write out the SMILES it would be in our mind and then converted to what it looks like as a molecule:

C(H)(H) ----> H-C-H

The problem with SMILES is that can’t capture more acute information about that particular atom.

C(C)(C)(C)

In SMILES, all these carbons are labeled the same even if they are a different chemical environment. So in my mind, I actually have less information now with my SMILES string. Their’s a lot of benefits of connecting SMILES to the CGenFF Atom Types and you really should read the paper and make your case as to why not but for now I think this is the best direction for my research.

So how do we do it? Super easy, since we have a previous blog post on getting CGenFF with SMILES before. We will use that exact same code since the rank ordering is preserved.

hetereo_atom_types = [i.split()[2] for i in self.cgenff_molecule.atoms if i.split()[2][0] != 'H']

curly_smiles = ''

for i, character in enumerate(self.smiles):

if character.isalpha():

curly_smiles += '%s{%s}' % (character, hetereo_atom_types.pop(0))
else:
curly_smiles += character

return curly_smiles

That is the meat of the algorithm for generating CurlySMILES, and pretty much everything for CX. The rank ordering is preserved so all we have to is ignore grammar and hydrogens and the atoms link up. We convert to interoperable tools:

cx_smiles = global_chem_molecule.get_cgenff_cxsmiles()
cx_smarts = global_chem_molecule.get_cgenff_cxsmarts()
curly_smiles = global_chem_molecule.get_curly_smiles()
print ('CXSMILES: %s\n' % cx_smiles)
print ('CXSMARTS: %s\n' % cx_smarts)
print ('Curly : %s\n' % curly_smiles)

And here our results:

CXSMILES: O=C(O)C(F)(F)C(F)(F)C(F)(F)F |atomProp:0.atom_type.OG2D1:0.atom_idx.7:1.atom_type.CG2O2:1.atom_idx.5:2.atom_type.OG311:2.atom_idx.6:3.atom_type.CG312:3.atom_idx.3:4.atom_type.FGA2:4.atom_idx.4:5.atom_type.FGA2:5.atom_idx.8:6.atom_type.CG312:6.atom_idx.1:7.atom_type.FGA2:7.atom_idx.0:8.atom_type.FGA2:8.atom_idx.2:9.atom_type.CG302:9.atom_idx.9:10.atom_type.FGA3:10.atom_idx.10:11.atom_type.FGA3:11.atom_idx.11:12.atom_type.FGA3:12.atom_idx.12|  CXSMARTS: [#9]-[#6](-[#9])(-[#6](-[#9])(-[#6](-[#8])=[#8])-[#9])-[#6](-[#9])(-[#9])-[#9] |atomProp:0.atom_type.FGA2:0.atom_idx.0:1.atom_type.CG312:1.atom_idx.1:2.atom_type.FGA2:2.atom_idx.2:3.atom_type.CG312:3.atom_idx.3:4.atom_type.FGA2:4.atom_idx.4:5.atom_type.CG2O2:5.atom_idx.5:6.atom_type.OG311:6.atom_idx.6:7.atom_type.OG2D1:7.atom_idx.7:8.atom_type.FGA2:8.atom_idx.8:9.atom_type.CG302:9.atom_idx.9:10.atom_type.FGA3:10.atom_idx.10:11.atom_type.FGA3:11.atom_idx.11:12.atom_type.FGA3:12.atom_idx.12|  Curly   : F{FGA2}C{CG312}(F{FGA2})(C{CG312}(F{FGA2})(C{CG2O2}(O{OG311})=O{OG2D1})F{FGA2})C{CG302}(F{FGA3})(F{FGA3})F{FGA3}

CurlySMILES is my favourite because it feels the most intuitive to read. Now if I want to look at parameters between them I can call the CGenFF molecule and have a look, possible automation down the road here.

Lastly, here’s the demo for you guys:

https://colab.research.google.com/drive/1hW0K6V5zPDFdZvFkYarbOr9wRoof2n4s?usp=sharing

--

--

Sulstice
Sulstice

No responses yet