Lecture 003 — Designing String Formation Algorithms & Introduction to Canonical SMILES with RDKit

4 min read · Oct 30, 2022

Howdy! If you remember, in Lecture 002 we briefly mentioned canonicalism in SMILES. Remember, canonical means unique: many different strings can describe the same molecule, and the canonical form is the one string we agree to use for it.

For something like propanol:

CCCO

Or perhaps:

OCCC

How do you decide which one is ranked higher? Canonicalism was defined more rigorously in the second SMILES paper, which uses graph theory to rank the atoms of a molecule so that the resulting notation is unique to it. We will get into that in more advanced lectures, since it will be more fun to code your own graph algorithms for chemical data later on.

Since the other half of this class requires a working knowledge of data structures, chances are you are already familiar with some programming language, most commonly Python. A lot of modern cheminformatics and scientific toolkits are built with Python, although I hear Julia is on the rise. For now I will stay in Python, but future classes might switch.

RDKit is an open-source cheminformatics toolkit with functionality that is useful for designing algorithms and playing with data.

Feel free to follow along with the notebook. We are going to install RDKit with pip in a Google Colab notebook:

pip install rdkit-pypi

The code to test for canonicalism is very simple. Let's look at our propanol example:

from rdkit import Chem

molecule = Chem.MolFromSmiles('OCCC')
unique_smiles = Chem.MolToSmiles(molecule, canonical=True)
print(unique_smiles)

So we load a molecule from its SMILES into a variable. We can then transform the molecule back into a canonical SMILES with the keyword argument canonical.

Out we get:

CCCO
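
As a quick check that both spellings really are the same molecule, here is a minimal sketch that canonicalizes each string and compares the results. The helper names canonical and same_molecule are my own for illustration; they are not RDKit functions.

```python
from rdkit import Chem


def canonical(smiles):
    """Return RDKit's canonical SMILES for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when it cannot parse the input
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, canonical=True)


def same_molecule(smiles_a, smiles_b):
    """Hypothetical helper: True if both strings canonicalize identically."""
    return canonical(smiles_a) == canonical(smiles_b)


# 1-propanol written two ways
print(same_molecule('CCCO', 'OCCC'))
# 1-propanol vs. 2-propanol, which are different molecules
print(same_molecule('CCCO', 'CC(C)O'))
```

Comparing canonical strings like this is the usual reason to canonicalize in the first place: it turns "are these the same molecule?" into a plain string comparison.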

Interesting, huh? So how can we use this to our advantage, and what makes this concept powerful? Let's take a more complex SMILES string, the amino acid alanine.

If you draw it in ChemDraw, the generated SMILES looks something like this:

NC(C)C(O)=O

and its canonical form:

CC(N)C(=O)O

Notice the switch of the oxygens and the placement of the nitrogen. Let's take a more complex case: a tripeptide made entirely of alanine. For now, stereochemistry is ignored.

The ChemDraw SMILES is:

NC(C)C(NC(C)C(NC(C)C(O)=O)=O)=O

And its canonical form:

CC(N)C(=O)NC(C)C(=O)NC(C)C(=O)O

Well, as it turns out, the non-canonical form is more intuitive for me to read. The oxygens are pushed off to the side, while the main carbon chain and its substituents are placed in parentheses. Notice the (C). So is it always better to have canonicalism?

Let's look at the pattern as the chain grows to 3, 5, and 6 alanines:

NC(C)C(NC(C)C(NC(C)C(O)=O)=O)=O
NC(C)C(NC(C)C(NC(C)C(NC(C)C(NC(C)C(O)=O)=O)=O)=O)=O
NC(C)C(NC(C)C(NC(C)C(NC(C)C(NC(C)C(NC(C)C(O)=O)=O)=O)=O)=O)=O
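
The nesting in these strings can be generated with plain string repetition. This is a minimal sketch, and poly_alanine is a name I made up for illustration; it only reproduces the non-canonical pattern shown above and does not validate the chemistry.

```python
def poly_alanine(n):
    """Build the nested, non-canonical SMILES for an n-residue alanine peptide."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each residue opens with 'NC(C)C(' and is closed later by ')=O';
    # the innermost 'O' is the C-terminal hydroxyl oxygen.
    return "NC(C)C(" * n + "O" + ")=O" * n


for n in (1, 3, 5, 6):
    print(n, poly_alanine(n))
```

With n = 1 this gives the single alanine from earlier, and n = 3, 5, and 6 give the three strings listed above.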

So for me the non-canonical form has a pattern that is pretty easy to read compared to its canonical form, but that could be subjective:

CC(N)C(=O)NC(C)C(=O)NC(C)C(=O)O
CC(N)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)O
CC(N)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)O

My first paper, when I was a younger homie in Texas, was about designing an automation algorithm on non-canonical SMILES to build artificial peptide libraries from amino acids and non-amino acids. I used to call the combinatorial positions “slots” as an homage to a StarCraft II custom gambling game.

Let's take an all-valine tripeptide to make it more obvious where the slots are:

NC(C(C)C)C(NC(C(C)C)C(NC(C(C)C)C(O)=O)=O)=O

Let me now rewrite it in a different form:

NC([:1])C(NC([:2])C(NC([:3])C(O)=O)=O)=O
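
To make the slot idea concrete, here is a minimal sketch that fills the slots by plain string substitution and enumerates a tiny library. The names TEMPLATE, SIDE_CHAINS, and fill_slots are hypothetical, and the side-chain table only holds the two residues used in this lecture; this is an illustration, not the algorithm from the paper.

```python
from itertools import product

# Slot form of a tripeptide, copied from the lecture text.
TEMPLATE = "NC([:1])C(NC([:2])C(NC([:3])C(O)=O)=O)=O"

# Side chains as SMILES fragments (assumed table, illustration only).
SIDE_CHAINS = {"ala": "C", "val": "C(C)C"}


def fill_slots(template, residues):
    """Replace slot k, written '[:k]', with the side chain of the k-th residue."""
    smiles = template
    for position, name in enumerate(residues, start=1):
        smiles = smiles.replace(f"[:{position}]", SIDE_CHAINS[name])
    return smiles


# Every combination of the two side chains over the three slots.
library = [fill_slots(TEMPLATE, combo) for combo in product(SIDE_CHAINS, repeat=3)]

print(fill_slots(TEMPLATE, ["val", "val", "val"]))
print(len(library))
```

Filling all three slots with valine reproduces the all-valine string above, and filling them with alanine reproduces the ChemDraw tripeptide from earlier.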

From here the pattern might start to seem a little obvious. A canonical version would be:

C[:1](N)C(=O)NC(C(=O)NC(C(=O)O)[:2])[:3]

This wouldn't be as intuitive for me to read or to build an algorithm around. We will pause here, since this leads to a more advanced concept, virtual atoms and bonds, which we will get into in Lecture 004. Please complete the homework assignment before moving on to Lecture 004. You can comment your answers below and I will grade them.

Homework Assignment

1. Generate SMILES strings for 3 cyclic tripeptides of cysteine of varying lengths, each in 5 non-canonical forms and 1 canonical form. A good way would be to use the online version of ChemDraw and use the stamp button. Hint: To generate multiple non-canonical forms, try playing with some parameters in RDKit for MolToSmiles.

2. Convert your 3 cyclic peptides from Question 1 into a “slot” form like the one in the lecture:

XXX-[:1]-XXX-[:2]-XXX-[:3]-XXX

3. Bonus: Perform Question 1 for the amino acid Proline instead of cysteine.
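
For the hint in Question 1, two MolToSmiles parameters worth trying are rootedAtAtom and doRandom. Here is a minimal sketch on propanol rather than the homework molecules; the randomized strings will differ from run to run.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CCCO')
canonical_smiles = Chem.MolToSmiles(mol)

# Start the output string from each atom in turn.
rooted = [Chem.MolToSmiles(mol, rootedAtAtom=i) for i in range(mol.GetNumAtoms())]

# Or let RDKit pick a random atom ordering on each call.
randomized = [Chem.MolToSmiles(mol, doRandom=True) for _ in range(5)]

for smiles in rooted + randomized:
    # Every variant should parse back to the same canonical form.
    assert Chem.MolToSmiles(Chem.MolFromSmiles(smiles)) == canonical_smiles
    print(smiles)
```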

Written by Sulstice
