Lecture 001 — How to Write SMILES: Introduction to Converting Skeletal Diagrams to 1-D data string representation.

4 min readOct 2, 2022

This tutorial is for the researcher who has passed organic chemistry I and II, and data structures.

SMILES was developed as a 1-D language for the organic chemist to write compounds effectively into the computer while retaining as much information as possible. The language was built to be intuitive for someone to casually write and read. This allowed for databases to expand rapidly where manipulation of a 1-D string to edit and then expand into any new chemical group.

In today’s lecture we will be handling:

Functional Groups: Alkanes, Branched Alkanes, Alcohols, Simple Rings
Software Language: Python

Propane

In your Ochem I, you remember a common alkane, propane, where the prop meaning 3, and ane meaning single bond.

Here’s a rough sketch. Well to write it into a “word” of our language. To start off easy, we will stick to traditional english paradigms and go from left to right, starting with carbon 1.

1 2 3
C C C

In our language, it is assumed we denote a single bond with nothing. So if the letters are next to each other it is a single bond like this and the hydrogens are implied except where there is a specific charge:

CCC

If you wanted to do butane, we just need to add one letter C at the end:

CCCC

Expanding the string into all the alkanes all you have to do is add one letter at a time, perhaps a new alkane we haven’t seen before, can you draw that sketal diagram?

CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

One of the advantages of doing this, is rather than have a human count this, we can do it programmatically, let’s write it in python:

molecule = 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC'print ('Carbon Count: %s' % len(molecule))

Start to see the advantage right?

Propanol

So now let’s go back a bit and look at this, propanol. prop (3 carbons), an (single bond), ol (alcohol). To manipulate the string, it’s pretty simple. We add the letter “O” that represents the alcohol at the end of the string.

CCCO

Not too much work huh? This allows for easy expansion of adding different functional groups or exchanging them out.

Let’s say we had a list of alkanes and we wanted to convert them to alcohols:

alkanes = [
    'CCC',
    'CCCC',
    'CCCCC',
    'CCCCCC',
]

We could iterate through the list and append a O at the end of the string:

alcohols = [ alkane + 'O' for alkane in alkanes ]
print (alcohols)

And now we have our alcohols:

alcohols = [
    'CCCO',
    'CCCCO',
    'CCCCCO',
    'CCCCCCO',
]

Tetrahydrofuran (THF)

For a more complex example where we have a ring, THF. THF is a common solvent that we use in the lab and if you remember it’s skeletal drawing:

So I know the numbering is not exactly from what you remembered from previous lectures but I will get to that later. So let’s start off where I labeled the first carbon:

CCOCC

In our language, we enclose the ring by saying where the start and the end is with a number. Everything in between is in the ring:

C1---C1  # Ring   
--COC--  # Items in the Ring
C1COCC1  # Put it together

Isopentane (Branched Alkanes)

For something a little more complex let’s try you have a branched alkane:

So now we have something where there is a branch of a methyl group, we denote branches with a parenthesis () this acts a point to disconnect. Everything within that parenthesis is the branch. For isopentane:

C(C)CCC

If we wanted to expand the branch into a methanol instead, we can write it like:

C(CO)CCC

As a computer scientist this is where some tricky string manipulation can start to play where functional groups can be added with ease or branches can as well. Using this grammar try out the problems and for the next lecture we will go to aromatics and double bonds, triple bonds.

Practice Problems

And now we have represented a ring system in our 1-D language. Let’s pause there for now and try some practice problems:

) Write the SMILES to 1,5 — Pentandiol
) Write a program that takes in a list of branched alcohols and convert them into their truncated alkane:

Test Input:

branched_alcohols = [
    'CC(CO)C',
    'CC(CO)CC(CO)',
    'CC(CC(CO)C)CC',
    'CC(C(CO)O)CC,
]

Test Output:

CC(C)C
CC(C)CC(C)
CC(CC(C)C)CC
CC(CC)CC

Lecture 001 — How to Write SMILES: Introduction to Converting Skeletal Diagrams to 1-D data string representation.

Written by Sulstice

No responses yet