Lecture 004 — Manipulating SMILES strings with Virtual Atoms and Bonds to create Combinatorial Libraries.

Sulstice
Nov 4, 2022


Okay, let’s start pushing some boundaries of theory in SMILES. Here’s the Colab notebook. Do you remember how, in organic chemistry, we used an “R” group as a placeholder for a substituent?

Well, the way to represent that in SMILES, and programmatically in Python, is this:

molecule = 'C(-[*:1])(=O)[H]'
r_groups = ['[H]', 'C', 'ERROR',]

The -[*:1] means a single bond to a virtual atom *, which acts as our placeholder or “R” group; the :1 is a label that identifies this attachment point. If you copy/paste that string into the online ChemDraw you would get something like this:

Isn’t that fun? We can write a quick substring replacement program that generates one candidate string per entry in our list of R groups (the 'ERROR' entry is there on purpose as an invalid case):

molecules = [ molecule.replace('-[*:1]', r) for r in r_groups ]
print (molecules)
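
One caveat with plain string replacement: str.replace returns the input unchanged when the placeholder text is not found, so a typo in the token goes unnoticed. Below is a minimal sketch of a guard against that. The function name attach_r_group is a hypothetical helper for illustration, not part of any library, and it only edits text without checking the chemistry.

```python
def attach_r_group(scaffold, token, r_group):
    """Replace a placeholder token in a SMILES string, failing loudly if it is absent.

    Hypothetical helper for illustration; it only edits text and does not
    check that the result is chemically valid.
    """
    if token not in scaffold:
        raise ValueError('placeholder %r not found in %r' % (token, scaffold))
    return scaffold.replace(token, r_group)


molecule = 'C(-[*:1])(=O)[H]'
r_groups = ['[H]', 'C', 'ERROR']

# One candidate string per R group; 'ERROR' is kept as a deliberately bad entry.
molecules = [attach_r_group(molecule, '-[*:1]', r) for r in r_groups]
print(molecules)
```

Asking for a token that is not in the scaffold, for example '~[*:1]' here, raises a ValueError instead of silently returning the original string.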

That’s pretty easy, right? Now imagine you had much larger lists to mine, say 1,000,000 R groups, and 50 drugs with several different sites to replace. The number of combinations grows rapidly and leads to a data explosion. So how do you start to mine these databases, and where do you look?

Well, first you need to know whether the replacement you made is actually viable. As in computer science and software engineering, you need to test and validate your output, and SMILES strings are no exception.
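
To get a feel for how quickly the numbers grow, here is a rough sketch that enumerates every assignment of R groups to attachment points with itertools.product. The two-site scaffold, the site tokens, and the short R-group list are assumptions made up for this illustration, as is the choice of 3 sites per drug in the final count.

```python
from itertools import product

# Hypothetical scaffold with two labeled attachment points (an assumption,
# not taken from the lecture) and a small R-group list.
scaffold = 'C(-[*:1])(=O)-[*:2]'
sites = ['-[*:1]', '-[*:2]']
r_groups = ['[H]', 'C', 'N']


def enumerate_library(scaffold, sites, r_groups):
    """Yield one SMILES-like string per assignment of R groups to sites."""
    for choice in product(r_groups, repeat=len(sites)):
        smiles = scaffold
        for site, r_group in zip(sites, choice):
            smiles = smiles.replace(site, r_group)
        yield smiles


library = list(enumerate_library(scaffold, sites, r_groups))
print(len(library))   # equals len(r_groups) ** len(sites)
print(library[:3])

# The same arithmetic at the scale mentioned above, assuming 3 sites per drug.
n_drugs, n_r_groups, n_sites = 50, 1_000_000, 3
print(n_drugs * n_r_groups ** n_sites)
```

Three R groups on two sites already give 9 strings. With the assumed 3 sites per drug, the count at the larger scale is 50 × (10^6)^3 = 5 × 10^19, far more than you could ever store or inspect one by one.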

So a good start is to go back and use the RDKit parser as your initial validation point.

from rdkit import Chem

valid_molecules = []
failure_molecules = []
for molecule in molecules:
    # MolFromSmiles returns None (it does not raise) when parsing fails
    rdkit_molecule = Chem.MolFromSmiles(molecule)
    if rdkit_molecule is None:
        failure_molecules.append(molecule)
    else:
        # Store the canonical SMILES of every molecule that parsed
        valid_molecules.append(Chem.MolToSmiles(rdkit_molecule))

print('Valid Molecules: %s' % valid_molecules)
print('Failure Molecules: %s' % failure_molecules)

And that gives us:

Valid Molecules: ['C=O', 'CC=O']
Failure Molecules: ['C(ERROR)(=O)[H]']

So you can see we have written a very simple validator for SMILES strings that just checks whether they parse in standard software.

The last concept for this lecture is the idea of a virtual bond, ~. A virtual bond is a placeholder for the bond type, such as a single, double, or triple bond. It is a wildcard, just like the virtual atom *.

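# The scaffold now uses a virtual bond (~) in front of the virtual atom
molecule = 'C(~[*:1])(=O)[H]'
r_groups = ['[H]', 'C']
# Replace only the virtual atom so the virtual bond stays in place
molecules = [ molecule.replace('[*:1]', r) for r in r_groups ]
print (molecules)
>>>['C(~[H])(=O)[H]', 'C(~C)(=O)[H]']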

And if we copy/paste that into ChemDraw:

The bond is labeled “Any” because it can be replaced by things like a single bond, a double bond, or a triple bond.
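
As a small illustration of that idea, the sketch below swaps the virtual bond for each concrete bond symbol ('-' single, '=' double, '#' triple). The two-carbon template 'C~C' is a hypothetical example chosen for this sketch, not one from the lecture.

```python
# Bond symbols: '-' single, '=' double, '#' triple.
BOND_SYMBOLS = {'single': '-', 'double': '=', 'triple': '#'}

# Hypothetical two-carbon template (an assumption, not from the lecture).
template = 'C~C'

# Replace the virtual bond with each concrete bond type in turn.
variants = {name: template.replace('~', symbol)
            for name, symbol in BOND_SYMBOLS.items()}
print(variants)
```

Not every bond order makes chemical sense on every scaffold, so the resulting strings should still go through a parser check like the one shown earlier.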

Practice Problems.

  1. Using the RDKit package, can you reproduce the substring replacement above with the more robust ReplaceSubstructs method and canonical SMILES? Here is an example of what that pseudo-Python code could look like. The quoted placeholders are for you to fill in.

from rdkit import Chem

molecule = 'C(-[*:1])(=O)[H]'
r_groups = ['[H]', 'C']
replacements = []

for r_group in r_groups:
    # ReplaceSubstructs returns a tuple of molecules
    modified_molecule = Chem.ReplaceSubstructs(
        Chem.MolFromSmiles(molecule),
        Chem.MolFromSmiles('virtual atom and bond'),
        Chem.MolFromSmiles('replacement')
    )
    # Convert to Canonical SMILES
    canonical_smiles = Chem.MolToSmiles()

    # Populate the Replacement List

print (replacements)
