Applying Levenshtein distance on IUPAC/Preferred names in GlobalChem for Natural Language Processing.

Aug 7, 2022

I have a lot of emotions when I write these blogs: I dance, I sing in a language that I am comfortable with. Music helps me process stuff, so I’ll just put the songs that really help me understand myself. I need it loud. Might start doing this as therapy for my brain.

Playlist

1. Вечера- Rauf & Faik ( cover by Arusik Petrosyan )
2. The Limba & Andro - X.O

Theory

I’ve been wanting to look at the differences between IUPAC names for a long time. The benefit of IUPAC names, when they are really long, is that once the functional groups are expanded the language itself fragments the molecule into smaller pieces:

(4R,4aR,7S,7aR,12bS)-3-methyl-2,4,4a,7,7a,13-hexahydro-1H-4,12-methanobenzofuro[3,2-e]isoquinoline-7,9-diol

Let’s take morphine, the name above, as an example. If we remove all the grammar (the numbers, brackets, and stereo labels) and stick to alpha characters:

methyl - hexahydro - methanobenzofuro - isoquinoline - diol

Immediately we have an idea of the chemical diversity in the name, and we can start to assume what types of functional groups are actually in here. That is the essence of chemical natural language processing. Let’s look at a similar molecule, heroin:

[(4R,4aR,7S,7aR,12bS)-9-acetyloxy-3-methyl-2,4,4a,7,7a,13-hexahydro-1H-4,12-methanobenzofuro[3,2-e]isoquinolin-7-yl] acetate

And now remove the grammar:

acetyloxy - methyl - hexahydro - methanobenzofuro - isoquinolin - acetate

We can see that some of the functional groups match, so morphine and heroin come out as pretty similar. Let’s take fentanyl:

N-phenyl-N-[1-(2-phenylethyl)piperidin-4-yl]propanamide

And now remove the grammar:

phenyl - phenylethyl - piperidin - propanamide

Fentanyl and morphine look very different to me in terms of the chemical functional groups inside the molecule. This means we can probably use long IUPAC names to do fuzzy chemical group matching.
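The grammar-stripping step used on the three names above can be sketched in a few lines of Python. This is only a minimal sketch, not GlobalChem's own tokenizer: the helper name `name_fragments` and the rule "keep alphabetic runs of three or more letters" are assumptions made for illustration, picked because they reproduce the fragment lists shown above.

```python
import re

def name_fragments(iupac_name, min_length=3):
    """Return the alphabetic fragments of an IUPAC name, in order.

    Assumption: numbers, brackets, stereo labels (4R, 7aR, 1H) and short
    connectors such as "yl" or "e" are dropped by discarding alphabetic
    runs shorter than min_length letters.
    """
    runs = re.findall(r"[A-Za-z]+", iupac_name)
    return [run.lower() for run in runs if len(run) >= min_length]

morphine = ("(4R,4aR,7S,7aR,12bS)-3-methyl-2,4,4a,7,7a,13-hexahydro-1H-4,12-"
            "methanobenzofuro[3,2-e]isoquinoline-7,9-diol")
heroin = ("[(4R,4aR,7S,7aR,12bS)-9-acetyloxy-3-methyl-2,4,4a,7,7a,13-hexahydro-"
          "1H-4,12-methanobenzofuro[3,2-e]isoquinolin-7-yl] acetate")
fentanyl = "N-phenyl-N-[1-(2-phenylethyl)piperidin-4-yl]propanamide"

for label, name in [("morphine", morphine), ("heroin", heroin), ("fentanyl", fentanyl)]:
    print(label, ":", " - ".join(name_fragments(name)))

# Fragments that match exactly between two names.
shared = set(name_fragments(morphine)) & set(name_fragments(heroin))
print(sorted(shared))  # ['hexahydro', 'methanobenzofuro', 'methyl']

# Morphine and fentanyl share no fragment at all.
print(set(name_fragments(morphine)) & set(name_fragments(fentanyl)))  # set()
```

Exact matching already separates heroin from fentanyl, but it misses near-identical fragments such as isoquinoline and isoquinolin, which is where a fuzzy comparison becomes useful.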

Demo

Google Colaboratory

colab.research.google.com

Algorithm

Since IUPAC names have different string lengths, we need a stable algorithm that can handle this at large scale. Levenshtein distance is simple: it only counts the single-character edits (insertions, deletions, substitutions) needed to turn one string into another.

We can use this to come up with a distance between two IUPAC names if we remove the grammar and only stick to functional group fragments. Perhaps we can get functional group similarity.
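To make the edit counting concrete, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic programming table. It is a textbook version written for illustration; I am not claiming this is how GlobalChem computes it internally.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein("isoquinoline", "isoquinolin"))         # 1
print(levenshtein("methanobenzofuro", "methylbenzoate"))  # 7
```

The strings do not need to be the same length, which is why this works on fragments as different as a four-letter suffix and a sixteen-letter ring name.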

Implementation

The implementation is relatively simple for now. GlobalChem’s purpose is to be a reference index, and if a word isn’t found perhaps we could build the word back up or reconstruct a SMILES string that gives you the diversity (that actually should be the feature, now that I think about it):

from global_chem import GlobalChem

gc = GlobalChem()
definition = gc.get_smiles_by_iupac(
    'methanobenzofuro',
    distance_tolerance=7,  # maximum number of edits allowed for a match
    return_partial_definitions=True
)
print(definition)

For the first test I took a group I am unfamiliar with. I know this dictionary well and I don’t remember writing “methano”, so chances are this name is not going to be in my dataset, but what I can do is find something relatively close to it.

The distance_tolerance number is how many edits away from the query a matching word is allowed to be, and we can tune it to reach as far as we want. Here is the definition that comes back, and where that definition comes from:

[{'methylbenzoate': 'c1ccc(C(=O)OC)cc1', 'network_path': 'global_chem.medicinal_chemistry.scaffolds.common_r_group_replacements', 'levenshtein_distance': 7}]
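The lookup behind a result like this can be sketched as "compare the query against every name in the index and keep whatever falls within the tolerance". The code below is a hedged illustration only: `closest_entries` and `TOY_INDEX` are hypothetical names, the toy index is three entries typed in by hand rather than GlobalChem's data, and the real `get_smiles_by_iupac` may work differently.

```python
def levenshtein(a, b):
    """Single-character edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),  # substitute
                               previous[j] + 1,                       # delete
                               current[j - 1] + 1))                   # insert
        previous = current
    return previous[-1]

# Hypothetical toy index (NOT GlobalChem's data): name -> SMILES.
TOY_INDEX = {
    "methylbenzoate": "c1ccc(C(=O)OC)cc1",
    "benzene": "c1ccccc1",
    "phenol": "Oc1ccccc1",
}

def closest_entries(query, index, distance_tolerance):
    """Return every index entry within distance_tolerance edits of query,
    closest first."""
    hits = []
    for name, smiles in index.items():
        distance = levenshtein(query, name)
        if distance <= distance_tolerance:
            hits.append({"name": name, "smiles": smiles,
                         "levenshtein_distance": distance})
    return sorted(hits, key=lambda hit: hit["levenshtein_distance"])

print(closest_entries("methanobenzofuro", TOY_INDEX, distance_tolerance=7))
print(closest_entries("methanobenzofuro", TOY_INDEX, distance_tolerance=6))  # []
```

With a tolerance of 7 the only hit is methylbenzoate at distance 7; drop the tolerance to 6 and nothing comes back, which shows how sensitive the result is to that one number.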

Knowing the history behind the word is what makes it valuable over time. This is human nature. We value things we repeat. And that’s how I can keep up with trends of data.

So let’s try something more intense: let’s see if we can do a partial reconstruction of the SMILES from the IUPAC name of morphine.

gc = GlobalChem()
definition = gc.get_smiles_by_iupac(
    '(4R,4aR,7S,7aR,12bS)-3-methyl-2,4,4a,7,7a,13-hexahydro-1H-4,12-methanobenzofuro[3,2-e]isoquinoline-7,9-diol',
    distance_tolerance=2,
    return_partial_definitions=False,
    reconstruct_smiles=True,
)
print(definition)

And let’s see how well it got the SMILES back from a more complex name with a distance tolerance of 1:

C12=C(C=NC=C2)C=CC=C1.C12=C(C=NC=C2)C=CC=C1.[CH2]C.[CH3].c1cncc2ccccc12.C12=CC=CC=C1C=NC=C2.C.CC.[CH3]

So we must not have all the partial fragments in there for something like diol or hexahydro, suggesting we might need to increase our distance tolerance.
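One way to see which fragments are falling through is to run the reconstruction by hand on a tiny index and collect the fragments that find no match. This is a rough sketch under loud assumptions: `reconstruct` and `TOY_INDEX` are hypothetical, the index holds three hand-typed entries rather than GlobalChem's data, and keeping only the single closest match per fragment is a simplification chosen here, not necessarily what `reconstruct_smiles=True` does.

```python
import re

def levenshtein(a, b):
    """Single-character edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]

# Hypothetical toy index (NOT GlobalChem's data): fragment name -> SMILES.
TOY_INDEX = {
    "methyl": "[CH3]",
    "ethyl": "[CH2]C",
    "isoquinoline": "c1cncc2ccccc12",
}

def reconstruct(iupac_name, index, distance_tolerance):
    """Join the SMILES of the closest index entry for each name fragment.

    Returns (dot-separated SMILES, list of fragments with no match in range).
    """
    # Assumption: a fragment is an alphabetic run of three or more letters.
    fragments = [run.lower() for run in re.findall(r"[A-Za-z]+", iupac_name)
                 if len(run) >= 3]
    pieces, unmatched = [], []
    for fragment in fragments:
        best = min(index, key=lambda name: levenshtein(fragment, name))
        if levenshtein(fragment, best) <= distance_tolerance:
            pieces.append(index[best])
        else:
            unmatched.append(fragment)
    return ".".join(pieces), unmatched

morphine = ("(4R,4aR,7S,7aR,12bS)-3-methyl-2,4,4a,7,7a,13-hexahydro-1H-4,12-"
            "methanobenzofuro[3,2-e]isoquinoline-7,9-diol")
smiles, unmatched = reconstruct(morphine, TOY_INDEX, distance_tolerance=1)
print(smiles)     # [CH3].c1cncc2ccccc12
print(unmatched)  # ['hexahydro', 'methanobenzofuro', 'diol']
```

Printing the unmatched list makes the gap explicit: either the tolerance is too tight for those fragments, or the index simply has nothing close to them.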

I also think we need a common partial fragment list; maybe if I do enough of these I can write a good annotated index for it. Alright, tired for now. Have fun with it.

Happy Cheminformatics!

Written by Sulstice
