Validating SMILES with RDKit, PySMILES, MolVS, and PartialSMILES

Mar 22, 2022

I wanted to see how the SMILES I wrote in GlobalChem hold up under different validation algorithms, using a common set of molecules as the basis for comparison. The four packages I chose, partly for ease of implementation, are RDKit, PartialSMILES, PySMILES, and MolVS. There are a couple of others implemented in JavaScript: ChemWriter and MetaMolecule’s SMIDGE. I wrote a function as part of the GlobalChemExtensions package that wraps the Python modules behind a generic set of parameters for SMILES validation testing; it’s also in the Google Colab demonstration. You can pass in essentially any list of SMILES and swap the flags to test different modules.

from global_chem.global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions

gc = GlobalChem()
smiles_list = gc.get_all_smiles()

# Swap the flags to test a different validation module.
successes, failures = GlobalChemExtensions.verify_smiles(
    smiles_list,
    rdkit=True,
    partial_smiles=False,
    return_failures=True,
    pysmiles=False,
    molvs=False
)

print(failures)
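
To make the idea concrete, here is a minimal sketch of the kind of pass/fail split a function like this could perform. It is not the actual verify_smiles implementation: split_by_validator and brackets_balanced are hypothetical names made up for illustration, and brackets_balanced is a toy stand-in that only checks bracket nesting. The one real library call is RDKit’s Chem.MolFromSmiles, which returns None when a SMILES cannot be parsed; it is imported lazily so the rest of the sketch runs without RDKit installed.

```python
from typing import Callable, Iterable, List, Tuple

Validator = Callable[[str], bool]


def rdkit_is_valid(smiles: str) -> bool:
    """True if RDKit can parse the SMILES (requires RDKit to be installed)."""
    from rdkit import Chem  # lazy import: only needed when this validator is used
    return Chem.MolFromSmiles(smiles) is not None


def brackets_balanced(smiles: str) -> bool:
    """Toy stand-in validator: checks that () and [] are properly nested."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def split_by_validator(
    smiles_list: Iterable[str], validator: Validator
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split SMILES into (successes, failures)."""
    successes, failures = [], []
    for smiles in smiles_list:
        try:
            ok = validator(smiles)
        except Exception:
            # Some parsers raise instead of returning a falsy value.
            ok = False
        (successes if ok else failures).append(smiles)
    return successes, failures


successes, failures = split_by_validator(
    ["CCO", "C(C", "[CH2]C=C", "C]"], brackets_balanced
)
print(failures)
```

With RDKit installed, passing rdkit_is_valid instead of brackets_balanced would apply the same split using a real parser.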

I’ll show the results first because that’s a little more fun:

When constructing the initial SMILES, I used RDKit as the baseline so that every SMILES had been passed through at least one parser. I use RDKit pretty much every day, far more than the other software, so for me the bare minimum for any set of SMILES is that it should be RDKit-compatible.

MolVS

That being said, let’s look at SMILES that pass in RDKit but fail in the other algorithms. Starting with MolVS, here is what causes it to fail:

C/C=N/\/NC(OC(C)(C)C)=O 
C[Si](C(C)(C)C)C
CC([Si](C1=CC=CC=C1)C2=CC=CC=C2)(C)C
[Al][C-]#[NH+] 
O#[C+]O
[C-]#O

There are definitely some odd SMILES out there, with atoms in unusual valence states, that can cause these parsers to reject them as invalid.

C/C=N/\/NC(OC(C)(C)C)=O

This failure might be on my side, from the way I wrote the string in Python, perhaps because the backslash in /\/ acts as an escape character.
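
A quick way to rule the escaping question in or out is to write the string both as a raw string and with the backslash escaped explicitly, then compare. The sketch below does that and also flags runs of adjacent directional bond symbols; directional_runs is a hypothetical helper written for this post, not part of any of the packages tested. As far as I can tell, the OpenSMILES grammar expects at most one bond symbol between two atoms, so a run like /\/ is a plausible reason for a stricter parser to complain.

```python
import re

# Two spellings of the same SMILES: a raw string keeps the backslash
# literally, and "\\" is an explicitly escaped single backslash.
raw = r"C/C=N/\/NC(OC(C)(C)C)=O"
escaped = "C/C=N/\\/NC(OC(C)(C)C)=O"


def directional_runs(smiles: str) -> list:
    """Hypothetical helper: find runs of two or more adjacent / or \\ symbols."""
    return re.findall(r"[/\\]{2,}", smiles)


print(raw == escaped)         # both spellings hold the same characters
print(raw.count("\\"))        # number of backslashes in the string
print(directional_runs(raw))  # adjacent directional bond symbols, if any
```

If the two spellings compare equal, the backslash reached the parser intact, and the more likely culprit is the doubled bond symbols rather than Python’s string escaping.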

PySMILES

PySMILES validated the majority of the entries and came as close to RDKit as any of the packages, with one major difference:

C&1&1&1&1
C&1&1&1 
c1ccccc1C&1&1

These are polymer SMILES, a notation I have seen before used to capture diamond. It’s described in the OpenSMILES documentation, so it’s still a valid entry in my book.

PartialSMILES

For PartialSMILES, there was a significant drop in the number of entries that passed. Investigating further, these were some of the failures:

N1=NN=C[N]1 
[CH2]C=C 
[CH]C=C 
[CH2]CCCC

A big component of the percentage loss was radicals causing an improper valence. A couple of other weird compounds also seemed to send it into a loop; granted, I had never seen these before either:

O=I=O
C&1&1&1

As I browse through the set of PartialSMILES failures, a lot of the valence complaints are warranted, and it would be useful to know which dataset each one comes from. It’s also useful for deciding whether an entry needs adjustment. One of the lists that was challenging for me was the interstellar molecules: they had a lot of strange SMILES and valence shells that were tricky to capture.
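
Tracing failures back to their source list can be sketched in a few lines. Everything here is an assumption for illustration: failures_by_source is a hypothetical helper, and the dataset names and groupings in the example are made up for the demo (they are not GlobalChem’s actual lists), although the SMILES themselves come from the failures shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def failures_by_source(
    datasets: Dict[str, List[str]], failures: Iterable[str]
) -> Dict[str, List[str]]:
    """Hypothetical helper: group failed SMILES by the dataset they came from."""
    failed = set(failures)
    report = defaultdict(list)
    for name, smiles_list in datasets.items():
        for smiles in smiles_list:
            if smiles in failed:
                report[name].append(smiles)
    return dict(report)


# Made-up dataset names and groupings, for illustration only.
example_datasets = {
    "example_radicals": ["[CH2]C=C", "CCO", "[CH]C=C"],
    "example_polymers": ["C&1&1&1", "c1ccccc1"],
    "example_clean": ["CCN"],
}
example_failures = ["[CH2]C=C", "[CH]C=C", "C&1&1&1"]

report = failures_by_source(example_datasets, example_failures)
for name, smiles_list in report.items():
    print(name, len(smiles_list), smiles_list)
```

Datasets with no failures are left out of the report, so whatever remains is a short list of the places that may need adjustment.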

C=1(C=CC=CC1)

Funnily enough, I didn’t realize I could write benzene with C=1 and still have it render happily. I don’t think this is the right way to write the SMILES string, even if it parses, just because it looks ugly.
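
The reason it parses is that SMILES allows a bond symbol in front of a ring-closure digit, so =1 opens ring 1 with a double bond and the later 1 closes it. The sketch below is a rough, stdlib-only check that ring-closure labels outside of brackets come in pairs; ring_closures_paired is a hypothetical helper, it is nowhere near a full SMILES parser, and it says nothing about valence or aromaticity.

```python
def ring_closures_paired(smiles: str) -> bool:
    """Rough check that each ring-closure label outside brackets appears in pairs."""
    open_labels = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            label = None
            if ch.isdigit():
                label = ch
            elif ch == "%":
                two = smiles[i + 1 : i + 3]
                if len(two) == 2 and two.isdigit():
                    label = two  # two-digit ring closure such as %12
                    i += 2
            if label is not None:
                # First sighting opens the ring, second sighting closes it.
                open_labels.symmetric_difference_update({label})
        i += 1
    return not open_labels


for smiles in ["C=1(C=CC=CC1)", "c1ccccc1", "C1CCCCC"]:
    print(smiles, ring_closures_paired(smiles))
```

Digits inside brackets are skipped because there they denote isotopes, hydrogen counts, or charges rather than ring closures.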


Written by Sulstice
