A common case is that we have a list of IUPAC names and want to convert them to a list of SMILES for query processing. I know of two ways to do it. Take a list like this one:
1,1,1,2-tetrachloroethane
1,1,1-trichloroethane
1,1,1-trifluoropropan-2-ol
1,1,2,2-tetrachloroethane
1,1,2-trichloro-1,2,2-trifluoroethane
1,1,2-trichloroethane
1,1-dichloroethane
1,1-dichloroethene
We are going to use two different pieces of software: CIRpy and STOUT. The first is a lookup based on whatever data is available; the second is a recurrent neural network (RNN) that was trained on a SMILES/IUPAC dataset and is now distributed as a machine-learned model.
Way #1: Brute Force Lookup
import cirpy

iupac_molecules = [
    '1,1,1,2-tetrachloroethane',
    '1,1,1-trichloroethane',
]

# Ask the resolver for the SMILES representation of each name
for molecule in iupac_molecules:
    smiles = cirpy.resolve(molecule, 'smiles')
    print(smiles)
This uses the CACTUS Chemical Identifier Resolver, a lookup service run by the NIH.
CIRpy is a Python interface to that resolver and has been in use for a long time, I think since around when I started in cheminformatics.
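Because a lookup can only return what the service already knows, some names in a long list will come back empty. Here is a minimal sketch of how the misses could be tracked. The helper names `names_to_smiles` and `unresolved` are made up for this post, and the resolver is passed in as a function so the same loop works with any backend. As far as I know, `cirpy.resolve` returns `None` when nothing is found, which is what the sketch assumes.

```python
from typing import Callable, Dict, Iterable, List, Optional


def names_to_smiles(
    names: Iterable[str],
    resolver: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Map each name to a SMILES string, or None when the lookup fails.

    `resolver` is any function that takes a name and returns SMILES or None
    (hypothetical interface; for CIRpy it could be
    `lambda name: cirpy.resolve(name, 'smiles')`).
    """
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except Exception:
            # Treat network errors or service outages as a miss
            results[name] = None
    return results


def unresolved(results: Dict[str, Optional[str]]) -> List[str]:
    """Return the names that did not resolve to a SMILES string."""
    return [name for name, smiles in results.items() if smiles is None]
```

Calling `unresolved(names_to_smiles(iupac_molecules, resolver))` then gives the list of names worth retrying with another method.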
Way #2: Machine Learning Recurrent Neural Networks
Although I love the tried and true method, I have been yearning for something more robust.
What we need is something that understands the grammar and punctuation of IUPAC nomenclature, because names become harder to write as the chemicals get bigger. At the same time, an IUPAC name carries a wealth of information about the functional group space.
Morphine, for example, has the IUPAC name:
(4R,4aR,7S,7aR,12bS)-3-Methyl-2,3,4,4a,7,7a-hexahydro-1H-4,12-methano[1]benzofuro[3,2-e]isoquinoline-7,9-diol
You can identify the core functional groups from within the name. I believe that for large sets of chemical space this method is more plausible, and that it will prove useful for large language model correlations down the road.
from STOUT import translate_reverse

iupac_molecules = [
    '1,1,1,2-tetrachloroethane',
    '1,1,1-trichloroethane',
]

# translate_reverse goes from an IUPAC name to a predicted SMILES string
for molecule in iupac_molecules:
    smiles = translate_reverse(molecule)
    print(smiles)
It would be interesting to compare how the two approaches differ over large chemical spaces.
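One way such a comparison could be sketched: collect the output of each method into a dictionary keyed by name, then split the names into those where both methods agree and those where they do not. The function `compare_smiles` and its `canonicalize` parameter are hypothetical names for this post. The default compares raw strings, which is too strict because one molecule can be written as many different SMILES. In practice you would want to pass in a real canonicalizer, for example one built on RDKit (an assumption, it is not used anywhere above).

```python
from typing import Callable, Dict, List, Optional, Tuple


def compare_smiles(
    lookup: Dict[str, Optional[str]],
    predicted: Dict[str, Optional[str]],
    canonicalize: Callable[[str], str] = lambda s: s,
) -> Tuple[List[str], List[str]]:
    """Split names into (agree, disagree) between two name -> SMILES maps.

    A name counts as agreeing only if both methods returned a SMILES and
    the canonicalized strings are identical. The default canonicalizer is
    the identity function, so it compares the raw strings.
    """
    agree: List[str] = []
    disagree: List[str] = []
    for name in lookup:
        a = lookup.get(name)
        b = predicted.get(name)
        if a is not None and b is not None and canonicalize(a) == canonicalize(b):
            agree.append(name)
        else:
            # Covers both mismatches and names that one method could not handle
            disagree.append(name)
    return agree, disagree
```

The `disagree` list is the interesting part: it mixes names the lookup service does not know with names where the model predicted a different structure, so it is worth inspecting by hand.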