Ah, the pain of cheminformatics: taking a picture of molecules from a paper and converting them into a list of SMILES.
Researchers have spent decades trying to interpret hand-drawn structures and convert them for data entry. IBM’s tracing algorithm showed how the lines and characters of a 2D image of a molecule could be mapped. However, on large sets of images of molecules it began to fail and became more error-prone.
Manual entry is still cumbersome, so with the evolution of neural networks and efficient decision trees, this became a hot area for ML to solve, building on some early work in optical recognition of chemical structures.
One common tool is DECIMER, which uses a convolutional neural network to predict the SMILES from an image; these can then be converted into IUPAC names.
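For anyone who would rather script this than use the web interface, here is a minimal sketch of batch-converting images to SMILES. It assumes the DECIMER Python package is installed (`pip install decimer`) and that each image has already been cropped to a single structure. The helper names `load_decimer_predictor` and `images_to_smiles`, and the file names in the usage comment, are made up for illustration. The predictor is passed in as an argument so the loop can be tried out without loading the model.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def load_decimer_predictor() -> Callable[[str], str]:
    """Return DECIMER's predict_SMILES function.

    Imported lazily because the package pulls in a large model on first use.
    Assumes `pip install decimer` has been run.
    """
    from DECIMER import predict_SMILES
    return predict_SMILES


def images_to_smiles(
    image_paths: Iterable[str],
    predict: Callable[[str], str],
) -> Dict[str, str]:
    """Run the predictor on each image and return {file name: SMILES}."""
    results: Dict[str, str] = {}
    for path in image_paths:
        # One structure per image is assumed here.
        results[Path(path).name] = predict(str(path))
    return results


# Hypothetical usage (file names are placeholders):
# predict = load_decimer_predictor()
# smiles_by_image = images_to_smiles(["ring_01.png", "ring_02.png"], predict)
```

Keeping the file name next to each prediction makes it easier to go back to the source image when a SMILES string looks off.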
Let’s take an image from the Rings in Drugs paper (the most common rings to pass FDA phase III trials), since we want all of the common rings in our own database:
Let’s go ahead and take a screenshot of the table, convert it to a PDF, and upload it to the DECIMER engine:
The result looks something like this. For some structures, such as the steroid-based ones, it performed reasonably well.
As we go a little further into the dataset, it seems to miss some stereochemistry:
And if there are other things in the image, it tends to take a guess; here the f score is read as a “Si”.
When automating the chemical data capture process, watch out for these mistakes. A validation component probably needs to be added somewhere in the pipeline. Hope this helps some folks getting into computer vision for chemistry.
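As a rough idea of what such a validation step could look like, here is a small standard-library sketch that flags the two problems seen above: an unexpected element (like the stray “Si”) and missing stereochemistry markers. The `EXPECTED_ELEMENTS` set and the function names are assumptions for illustration, and the regex is only a simple pre-filter, not a full SMILES parser. A toolkit such as RDKit (where `Chem.MolFromSmiles` returns `None` for strings it cannot parse) would be the more thorough check.

```python
import re
from typing import List, Set

# Elements we expect in drug-like rings; an assumption, adjust for your data.
EXPECTED_ELEMENTS = {"C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "B", "H"}

# Either a bracket atom like [Si] or [C@@H], or an unbracketed organic-subset atom.
ATOM_PATTERN = re.compile(
    r"\[(?P<bracket>[^\]]+)\]|(?P<organic>Br|Cl|[BCNOPSFI]|[bcnops])"
)
# Inside brackets: optional isotope digits, then the element symbol.
BRACKET_ELEMENT = re.compile(r"\d*([A-Z][a-z]?|se|as|[bcnops])")


def elements_in(smiles: str) -> Set[str]:
    """Collect the element symbols that appear in a SMILES string."""
    elements: Set[str] = set()
    for match in ATOM_PATTERN.finditer(smiles):
        if match.group("bracket"):
            symbol = BRACKET_ELEMENT.match(match.group("bracket"))
            if symbol:
                elements.add(symbol.group(1).capitalize())
        else:
            # Aromatic atoms are lower case in SMILES, so normalise them.
            elements.add(match.group("organic").capitalize())
    return elements


def review_flags(smiles: str, expect_stereo: bool = False) -> List[str]:
    """Return a list of reasons a predicted SMILES deserves a manual look."""
    flags: List[str] = []
    unexpected = elements_in(smiles) - EXPECTED_ELEMENTS
    if unexpected:
        flags.append("unexpected elements: " + ", ".join(sorted(unexpected)))
    has_stereo = any(marker in smiles for marker in ("@", "/", "\\"))
    if expect_stereo and not has_stereo:
        flags.append("no stereochemistry markers found")
    if smiles.count("(") != smiles.count(")") or smiles.count("[") != smiles.count("]"):
        flags.append("unbalanced brackets")
    return flags
```

Anything that comes back with a non-empty list goes into a pile for manual review against the original image rather than straight into the database.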
Happy Cheminformatics!