Retrosynthesis Artificial Intelligence

Sulstice
4 min read · Mar 27, 2024


Retrosynthesis is the art of taking a complex molecule and working backwards to figure out how to make it: you fragment the molecule into building blocks and then determine the reactions that join those blocks together.

Take aspirin, for example, where the carbonyl carbon of the ester is a possible disconnection point. Breaking the bond there gives two components, both of which can be bought or made easily.
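To make the aspirin disconnection concrete, here is a minimal string-level sketch. It assumes aspirin is written as the SMILES `CC(=O)Oc1ccccc1C(=O)O` with the acetyl group first, and the helper `disconnect_ester` is a hypothetical name made up for illustration. It is plain string slicing, not real cheminformatics; a proper toolkit would break the bond on the molecular graph.

```python
from typing import Tuple

# Aspirin as SMILES; the acetyl group happens to be written first in this string.
ASPIRIN = "CC(=O)Oc1ccccc1C(=O)O"


def disconnect_ester(smiles: str, acyl: str = "CC(=O)") -> Tuple[str, str]:
    """Split a SMILES string right after a leading acyl group.

    Hypothetical helper: this only works because the acyl group is written
    first in the string. It is not a general bond-breaking routine.
    """
    if not smiles.startswith(acyl + "O"):
        raise ValueError("expected the acyl group followed by the ester oxygen")
    acyl_fragment = smiles[: len(acyl)]  # acetyl piece, carbonyl carbon left open
    phenol_fragment = smiles[len(acyl):]  # the other piece keeps the oxygen
    return acyl_fragment, phenol_fragment


acyl_fragment, phenol_fragment = disconnect_ester(ASPIRIN)
print(acyl_fragment)    # CC(=O)
print(phenol_fragment)  # Oc1ccccc1C(=O)O
```

The second fragment is salicylic acid, and the open acetyl fragment stands in for a purchasable acetylating reagent such as acetic anhydride.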

In the modern world, with the rise of artificial intelligence and enough data captured from electronic lab notebooks, we have sufficient data-mining tools to propose retrosynthetic routes, which (theoretically) help an organic chemist make molecules faster.

For a rough overview of the scene: in the academic world it started off with two forms, Non-Transformer and Transformer models. Here are a couple of published sources from RetroTrae.

A Non-Transformer model uses a neural network (NN) to process the string, make a guess at its features, and then pass those into a decision. Take the case of benzene, where the hidden layers break the string into fragments. Each node in a hidden layer becomes a string fragment that is either valid or not:

Input Layer      Hidden Layer   Hidden Layer
C1=CC=CC=C1 ----> C1=CC=CC ----> C1=CC=C
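As a rough illustration of the "valid or not" idea, the sketch below runs a very simplified validity check over the three fragments in the diagram. The function `looks_like_valid_smiles` is a hypothetical helper, and its rules (ring-closure digits pair up, parentheses balance, no trailing bond symbol) are an assumption that ignores most of real SMILES grammar. A trained network would learn this kind of signal rather than apply hand-written rules.

```python
from collections import Counter


def looks_like_valid_smiles(fragment: str) -> bool:
    """Very rough check, not a real SMILES parser.

    A fragment passes if every ring-closure digit appears an even number of
    times, parentheses balance, and it does not end on a bond or open branch.
    """
    if not fragment or fragment[-1] in "=#-(":
        return False
    digit_counts = Counter(ch for ch in fragment if ch.isdigit())
    if any(count % 2 for count in digit_counts.values()):
        return False  # a ring was opened but never closed
    depth = 0
    for ch in fragment:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # branch closed before it was opened
    return depth == 0


fragments = ["C1=CC=CC=C1", "C1=CC=CC", "C1=CC=C"]
validity = {fragment: looks_like_valid_smiles(fragment) for fragment in fragments}
for fragment, is_valid in validity.items():
    print(fragment, is_valid)
```

Only the full benzene string passes; both truncated fragments leave ring bond `1` open, so they are flagged as not valid on their own.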

A Transformer model takes the SMILES string of benzene and breaks it down into tokens, each with its own word embedding. The difference is that, at every layer, each token is contextualized with the rest of the string.

C1=CC=CC=C1

001: C1=CC=CC=C1
002: C1-------C1
003: C1=CC----C1
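The tokenize-then-contextualize step can be sketched in plain Python. Everything here is an assumption for illustration: the regular expression covers only a small part of SMILES, the embeddings are random numbers rather than learned weights, and the attention uses the embeddings directly as queries and keys instead of the learned projections a real Transformer has.

```python
import math
import random
import re
from typing import Dict, List

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, common
# single-letter atoms, bond/branch symbols, and ring-closure digits.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#\-+\\/()%.@:]|\d")


def tokenize(smiles: str) -> List[str]:
    return SMILES_TOKEN.findall(smiles)


def embed(tokens: List[str], dim: int = 4, seed: int = 0) -> List[List[float]]:
    """Give each distinct token a random vector plus a small position signal."""
    rng = random.Random(seed)
    table: Dict[str, List[float]] = {}
    vectors = []
    for position, token in enumerate(tokens):
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vectors.append(
            [value + 0.1 * math.sin(position + i) for i, value in enumerate(table[token])]
        )
    return vectors


def self_attention(vectors: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product attention weights: one softmax row per token."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        weights.append([value / total for value in exps])
    return weights


tokens = tokenize("C1=CC=CC=C1")
weights = self_attention(embed(tokens))
print(tokens)  # ['C', '1', '=', 'C', 'C', '=', 'C', 'C', '=', 'C', '1']
print(len(weights), "rows, each summing to 1")
```

Each row of `weights` says how much one token looks at every other token in the string, which is the sense in which a token such as the closing `1` gets tied back to the opening `C1`.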

This is a vector-based method that depends less on hand-designing the neural network. It also opens the door to architectures like ChatGPT, which utilize word embeddings in the same way.

The blend of these two approaches has now given rise to different commercial software on the market, much of which seems to be trained on similar data with different models implemented on top.

Synthia

Their case studies have been primarily around alkaloids and natural products, and they seem to have come up with some effective routes for long, multi-step syntheses. Their model comes from Merck and wasn’t available for a while. Early versions were called Chematica.

Chemical Abstracts Service

It’s the original, and it was labeled as such in their press release. I assume that, after years of research and being a staple of the field, their platform depends on whichever research model they utilize.

Molecule One

These guys use the same data as CAS; however, they use a graph-based attention network called MEGAN. It’s a bold claim, and I haven’t fully tested their software yet to evaluate the performance.

Pending.ai

Pending.ai uses a combination of 3 neural networks as a multi-agent system. One component is a Monte Carlo tree search that looks for the best candidates based on a series of restricted transformations. Policy networks then guide whether the reactions are feasible or not, treating the transformations as “rules” like in a game, where a move is either legal or it isn’t. Their model is trained on Reaxys, which they claim is essentially “all organic chemistry data”.
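To give a feel for how a tree search and a policy fit together, here is a toy Monte Carlo tree search. It is a sketch under loud assumptions and not Pending.ai's implementation: molecule names stand in for structures, the `RULES` table and its prior numbers are hypothetical stand-ins for what a trained policy network would output, and `PURCHASABLE` is an invented building-block list.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

State = Tuple[str, ...]  # molecules that still have to be made

# Hypothetical toy data: target -> [(precursors, policy prior)].
RULES: Dict[str, List[Tuple[State, float]]] = {
    "aspirin": [(("salicylic acid", "acetic anhydride"), 0.8),
                (("exotic intermediate",), 0.2)],
    "salicylic acid": [(("phenol", "carbon dioxide"), 1.0)],
}
PURCHASABLE = {"acetic anhydride", "phenol", "carbon dioxide"}


def apply_rules(state: State) -> List[Tuple[State, float]]:
    """Expand the first open molecule with every legal rule."""
    first, rest = state[0], state[1:]
    children = []
    for precursors, prior in RULES.get(first, []):
        still_open = tuple(m for m in precursors if m not in PURCHASABLE)
        children.append((still_open + rest, prior))
    return children


class Node:
    def __init__(self, state: State, prior: float = 1.0,
                 parent: Optional["Node"] = None) -> None:
        self.state, self.prior, self.parent = state, prior, parent
        self.children: List["Node"] = []
        self.visits, self.value = 0, 0.0

    def score(self, c: float = 1.4) -> float:
        """Mean reward plus a prior-weighted exploration bonus."""
        mean = self.value / self.visits if self.visits else 0.0
        bonus = c * self.prior * math.sqrt(self.parent.visits + 1) / (1 + self.visits)
        return mean + bonus


def rollout(state: State, rng: random.Random, max_depth: int = 10) -> float:
    """Random playout: 1.0 if everything becomes purchasable, else 0.0."""
    for _ in range(max_depth):
        if not state:
            return 1.0
        options = apply_rules(state)
        if not options:
            return 0.0  # dead end: no rule applies and it cannot be bought
        state = rng.choice(options)[0]
    return 1.0 if not state else 0.0


def search(target: str, iterations: int = 200, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node((target,))
    for _ in range(iterations):
        node = root
        while node.children:  # selection
            node = max(node.children, key=Node.score)
        if node.state:  # expansion
            node.children = [Node(s, p, node) for s, p in apply_rules(node.state)]
            if node.children:
                node = max(node.children, key=Node.score)
        reward = rollout(node.state, rng)  # simulation
        while node is not None:  # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


def best_route(root: Node) -> List[State]:
    route, node = [], root
    while node.children:
        node = max(node.children, key=lambda child: child.visits)
        route.append(node.state)
    return route


root = search("aspirin")
route = best_route(root)
print(route)
```

An empty tuple at the end of the route means nothing is left to make, so every remaining precursor can be bought. The low-prior branch through the "exotic intermediate" is still explored a few times but never pays off, which is the role the rules and the policy play in pruning the search.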

Spaya

Spaya, like Pending.ai, is trained on the Reaxys data, with no clear indication of which machine learning model is implemented. I believe it would be inherited from Reaxys and use a Monte Carlo Tree Search.

Reaxys

This is the classic product for retrosynthesis because they house the data themselves, on top of which they implement their own Monte Carlo Tree Search.

Chemical.Ai

This is a China-based company that didn’t just train on Reaxys or CAS data but on a more versatile set. Their machine learning model is not known exactly; my guess is that it is a Monte Carlo tree search based on a different series of data.

That’s all the software! I hope this helps some future organic chemists determine which ones to use and purchase for their own synthetic workflows.

Happy Cheminformatics!
