Exploring bridging cheminformatic languages between machines and humans using AI — Part 1
Recently, I wanted to add a feature to MolPDF so that, alongside the SMILES strings and 2D images it already produces, it can also provide IUPAC names for the “minable” PDFs. IUPAC nomenclature is the language most commonly used by wet lab chemists, whereas SMILES is a 1D string representation of a molecule aimed at machines. Bridging the gap between the two has become increasingly relevant, and I figured it would be useful in the context of what MolPDF was designed to do. So let us do some digging into the current state of things…
We’ve got the NIH Cactus Resolver, which can only process one request at a time. My personal chemical libraries are massive, and I can’t imagine what a big pharma company’s look like, so that makes its usability pretty low for large-scale cheminformatics.
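For reference, a single lookup against the resolver can be sketched roughly as follows. The URL pattern is the resolver's public `/chemical/structure/<identifier>/iupac_name` form; the helper names `cactus_url` and `smiles_to_iupac` are my own for illustration, and the error behavior described in the comments is what I would expect from `urllib`, not something guaranteed by the service.

```python
from urllib.parse import quote
from urllib.request import urlopen

CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cactus_url(smiles: str) -> str:
    """Build the resolver URL for one SMILES string (hypothetical helper)."""
    # SMILES uses characters such as '#', '=' and '[' that must be
    # percent-encoded before they can go into a URL path.
    return f"{CACTUS_BASE}/{quote(smiles, safe='')}/iupac_name"


def smiles_to_iupac(smiles: str, timeout: float = 10.0) -> str:
    """Fetch the name for one molecule: one HTTP round trip per call.

    A failed lookup is expected to surface as urllib.error.URLError
    (HTTPError is a subclass of it).
    """
    with urlopen(cactus_url(smiles), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling `smiles_to_iupac("CCO")` costs one network round trip, so looping this over a large library means one request per molecule, which is exactly the bottleneck described above.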
Then there is OpenEye, the famous cheminformatics company. I haven’t used their tools before, but they claim to produce IUPAC names via a simple Python call. Unfortunately, it costs $$$, or you have to be in academia to get a free license (provided what you produce is in the public domain). And if you’re like me, writing cheminformatics code on a Saturday morning for fun, you don’t really fit either category. So what to do….
I figured: I have SMILES, so how would I take a SMILES string and convert it into IUPAC? I could whip out my old orgo textbook from undergrad and start programming the rules, but that seems inefficient, and I don’t think I could do it alone (I would need a team). I had the realization, though, that SMILES and IUPAC are man-made, constructed languages. We built them. So let’s take that concept and look at how translation is handled between other man-made languages, perhaps English and French.
Next, I went googling for AIs that were built for translating between French and English. RNNs were the most popular framework for building language AIs because they process words sequentially and can capture the context of each word, depending on the number of layers in your network (this article really helped me understand the core of an RNN and how it works). So there must exist a sophisticated RNN for English-French translation, and perhaps one that I can easily adapt for cheminformatics.
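To make the translation analogy concrete, here is a minimal sketch of how SMILES/name pairs could be turned into integer sequences for a character-level sequence-to-sequence model. The three toy pairs, the tab/newline start and end markers, and the helper names `build_vocab`, `encode`, and `decode` are all assumptions for illustration; this is not the preprocessing from my repository.

```python
# Start/end markers for the target sequence (a common convention in
# character-level seq2seq examples; assumed here, not required).
START, END = "\t", "\n"

# Toy training pairs: (SMILES, name).
pairs = [
    ("CCO", "ethanol"),
    ("CC(=O)O", "acetic acid"),
    ("c1ccccc1", "benzene"),
]


def build_vocab(texts):
    """Map every distinct character to an integer index (sorted for reproducibility)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into a list of integer token ids."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Invert encode(): turn token ids back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


sources = [smiles for smiles, _ in pairs]
targets = [START + name + END for _, name in pairs]

src_vocab = build_vocab(sources)  # the "machine" language: SMILES characters
tgt_vocab = build_vocab(targets)  # the "human" language: name characters

encoded_pairs = [
    (encode(s, src_vocab), encode(t, tgt_vocab))
    for s, t in zip(sources, targets)
]
```

Working at the character level sidesteps the question of what a "word" is in SMILES, at the cost of longer sequences for the network to handle.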
But there’s a snag……
Even if I have an RNN that I can use I still need data. How do I produce it? Where would I find it? Well after days of searching (yes days…..) I found a dataset on the EPA website that actually fits my purpose but it’s a little hairy. It has a lot of data. I organized and started cleaning that data and have been pushing it to Kaggle. Check it out and help me organize this (~700k rows) ❤
So I have data! But it isn’t all IUPAC names; there are some common names in there. So I need to come up with a sophisticated way to clean it.
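I don’t have that sophisticated method yet, but a crude first pass could look something like the sketch below. It is purely a heuristic I am assuming for illustration: a name is flagged as systematic-looking if it contains a locant such as "2-" or "1,3-", or ends in one of a handful of systematic suffixes. The suffix list and the function names `looks_systematic` and `split_names` are hypothetical and deliberately incomplete.

```python
import re

# Locants such as "2-" or "1,3-" are common in systematic names
# and rare in trade or common names.
LOCANT = re.compile(r"\d+(?:,\d+)*-")

# A small, deliberately incomplete suffix list (assumption; extend as needed).
SUFFIXES = (
    "ane", "ene", "yne", "ol", "one",
    "oic acid", "oate", "amine", "amide", "nitrile",
)


def looks_systematic(name: str) -> bool:
    """Return True if the name superficially resembles a systematic name."""
    lowered = name.strip().lower()
    return bool(LOCANT.search(lowered)) or lowered.endswith(SUFFIXES)


def split_names(names):
    """Partition names into (systematic-looking, everything else)."""
    systematic, other = [], []
    for name in names:
        (systematic if looks_systematic(name) else other).append(name)
    return systematic, other


systematic, other = split_names(
    ["2-methylpropan-1-ol", "propan-2-one", "aspirin", "caffeine"]
)
```

A filter like this will let some common names through (for example "acetone" ends in "one"), so at best it is a way to triage rows for a closer look, not a finished cleaning step.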
So what does the code look like? In its current state, not too pretty (work in progress!). GitHub Link
Below I’ve included the code to browse through as well. I’ve got the inputs as a raw Excel file, which I pass through the RNN model. The training data is a blend of common names and IUPAC names, but I’m curious to see what the AI will predict when I give it another random SMILES. Do you think it would output the correct IUPAC?