Lecture 003 - Approaching Machine Learning as an Organic Chemist Part 1

Feb 11, 2023

I am now ready to see what has happened to machine learning now that the field has had time to mature. I went through about 100 repositories in the cheminformatics space, and the list is here:

Sulstice's list / cheminformatics

github.com

The one codebase that spoke to me was REINVENT.

I chose this software because the input is relatively simple (the model input is a plain SMILES list) and the dependencies are clearly listed. It was built for biological molecules, but I want to play with it and see what I can decipher.

From reading the documentation, I know the model is built to fragment molecules and swap pieces so that they fit a target with a particular score. It uses molecular descriptors as probability distributions to guide the learning of a molecule and to generate new fragments that fall within that category:

LogP: The partition coefficient, which tells you whether a molecule prefers a non-polar or a polar environment and therefore how soluble it is.
Molecular Weight: The molecular weight of the molecule in g/mol, if you remember all your rules.
Rotatable Bonds: This corresponds to the flexibility of the molecule. The more flexible the molecule, the more conformers it has, which increases entropy and makes it harder for it to be a drug, which is ideally rigid.
Number of Aromatic Rings: The number of aromatic rings in the molecule, where pi-stacking and other interactions can dictate binding in some cases as well as support rigidity.
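
To make the four descriptors concrete, here is a minimal sketch that computes them with RDKit for perfluorobutanoic acid. The helper name `describe` and the dictionary keys are my own choices for illustration, and this is not how REINVENT itself computes or uses these values. Note that RDKit's LogP is the Crippen estimate, not a measured value.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors


def describe(smiles):
    """Return the four descriptors listed above for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse: {smiles}")
    return {
        "logp": Descriptors.MolLogP(mol),      # Crippen estimate of LogP
        "mol_weight": Descriptors.MolWt(mol),  # average molecular weight, g/mol
        "rotatable_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
    }


# Perfluorobutanoic acid, C4HF7O2
pfba = describe("C(=O)(C(C(C(F)(F)F)(F)F)(F)F)O")
for name, value in pfba.items():
    print(name, value)
```

A PFAS chain has no aromatic rings at all, so that descriptor will not help separate these compounds from one another; LogP and molecular weight carry most of the signal here.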

We are going to analyze the perfluoroalkyl substances (PFAS) first because the chemical list is small and because the hydrophobic fluorinated chain makes these compounds powerful “oleophobic” (oil-repellent) coatings.

Figure 1: Perfluorobutanoic Acid From PubChem

However, the list in Global-Chem is relatively small. We can start to tweak that with non-canonical SMILES. In the original paper, Olivecrona et al. trained on canonical SMILES, but I have seen that RNNs trained on randomized SMILES, as in Arus-Pous et al., generate better results because the different representations of the same bond-order information give the model more flexibility.

pip install global-chem
pip install rdkit-pypi
pip install torch
pip install tqdm
pip install scikit-learn
pip install pexpect

Let’s give it a shot and generate around 27,000 SMILES covering the different representations of PFAS.

Generate Randomized SMILES

# Imports
# -------

from rdkit import Chem
from global_chem import GlobalChem

if __name__ == '__main__':

    gc = GlobalChem()
    gc.build_global_chem_network()
    smiles_list = list(gc.get_node_smiles('emerging_perfluoroalkyls').values())

    # The context manager flushes and closes the file when we are done
    with open('mol.smi', 'w') as output_file:

        for molecule in smiles_list:

            rdkit_molecule = Chem.MolFromSmiles(molecule)
            if rdkit_molecule is None:
                continue  # skip entries that RDKit cannot parse

            # Write 1000 randomized (non-canonical) SMILES per molecule
            for i in range(1000):
                smiles = Chem.MolToSmiles(rdkit_molecule, doRandom=True)
                output_file.write(smiles + '\n')
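
A quick check that may be worth running on the result (my suggestion, not a step the REINVENT documentation asks for): a small molecule only has so many distinct SMILES spellings, so 1000 random draws per molecule may contain repeats. The sketch below counts total versus distinct lines; `summarize_smiles_file` is a hypothetical helper name, and it assumes one SMILES per line as written above.

```python
import os
from collections import Counter


def summarize_smiles_file(path):
    """Return (total, distinct) counts of non-empty lines in a SMILES file."""
    with open(path) as handle:
        counts = Counter(line.strip() for line in handle if line.strip())
    return sum(counts.values()), len(counts)


# Only report if the file from the previous step is present
if os.path.exists("mol.smi"):
    total, distinct = summarize_smiles_file("mol.smi")
    print(f"{total} lines, {distinct} distinct SMILES")
```

If the distinct count is much lower than the total, the effective training set is smaller than the roughly 27,000 lines suggest.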

According to the documentation, we can now train a Prior network with our SMILES input. The Agent network will then be used to select new fragments from the Prior and go through a decision process with the probability distributions of the descriptors as weights. The input is the key here.

To run the code:

python data_structs.py mol.smi

This prepares the data structures for training the model by reading the SMILES and building the vocabulary. It also throws out any SMILES that RDKit cannot process.

Since our element scope is very limited, there isn’t much diversity in the characters. I count 7 from perfluorooctanoic acid, including hydrogen, which is implicit in the SMILES string:

smiles = 'C(=O)(C(C(C(C(C(C(C(F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)O'
grammar = ['C', '(', ')', 'F', '=', 'O', 'H']
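
The character count can be checked with a small tokenizer. This is a simplified sketch of the idea of building a vocabulary, not REINVENT's own `data_structs.py` logic; the regular expression and the names `tokenize` and `build_vocabulary` are assumptions of mine. Bracket atoms and the two-letter halogens are kept as single tokens, and everything else is read one character at a time.

```python
import re

# Bracket atoms like [nH] and two-letter halogens stay whole;
# any other character is its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split one SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(smiles_list):
    """Collect the sorted set of distinct tokens across all SMILES."""
    vocabulary = set()
    for smiles in smiles_list:
        vocabulary.update(tokenize(smiles))
    return sorted(vocabulary)


pfoa = "C(=O)(C(C(C(C(C(C(C(F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)O"
print(build_vocabulary([pfoa]))
```

The acid hydrogen never shows up as a token because SMILES leaves it implicit; an 'H' would only appear inside a bracket atom. A vocabulary this small is part of why the generated compounds stay so close to the input.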

Now we train the Prior network, which loads the vocabulary and starts learning the probability distributions over the SMILES characters, that is, how the atoms and bonds connect.

The Agent can be tailored to a particular receptor for a protein/drug target. Perhaps I shouldn’t be using it for materials science, but I would like to get some more chemical diversity into these compounds. After the Prior network is trained, we start generating new compounds with the dopamine receptor data. Let’s see what we get over the course of 1,000 steps. We chose to keep the regular activity model so that all elements are included, sulphur among them, because perfluorobutanesulfonic acid is a commonly used chemical and is in our original input list.

python main.py --scoring-function activity_model --num-steps 1000 --batch-size 100 --num-processes 12

I had to tweak the batch size from the original documentation down to 100 because I didn’t have SMILES on the scale of millions.
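
For readers wondering how the score steers the Agent: as I understand the Olivecrona et al. paper, the Agent is pulled toward an "augmented" likelihood, which is the Prior's log-likelihood of a sequence plus its score scaled by a constant sigma, and the loss is the squared gap between that target and the Agent's own log-likelihood. The sketch below only illustrates the arithmetic. The function names, the sigma value, and the example numbers are placeholders of mine, not values taken from the REINVENT code or from this run.

```python
def augmented_log_likelihood(prior_log_p, score, sigma):
    """Target log-likelihood: the Prior's opinion plus the scaled score."""
    return prior_log_p + sigma * score


def agent_loss(agent_log_p, prior_log_p, score, sigma):
    """Squared difference between the augmented target and the Agent."""
    target = augmented_log_likelihood(prior_log_p, score, sigma)
    return (target - agent_log_p) ** 2


SIGMA = 20.0  # hypothetical scaling constant, chosen for illustration

target = augmented_log_likelihood(prior_log_p=-3.0, score=0.25, sigma=SIGMA)
loss = agent_loss(agent_log_p=-1.0, prior_log_p=-3.0, score=0.25, sigma=SIGMA)
print(target, loss)
```

A sequence with a score of zero has a target equal to the Prior's log-likelihood, so the Agent is only pushed away from the Prior where the scoring function has something to say.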

The output generated by the Agent was:

Score     Prior log P     SMILES
0.24    -3.31        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.26        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.18        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.53        O=C(O)C(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)F
0.24    -2.89        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.78        CN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.21        O=C(O)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.37        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -3.11        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24    -4.54        O=S(=O)(O)C(F)(F)C(F)(F)OC(F)(C(F)(F)F)C(F)(F)OC(F)C(F)(F)F
0.24    -7.81        CCN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.24   -20.57        SCC(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)F
0.23   -14.37        O=S(=O)(O)CCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.23    -8.17        O=C(F)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F)C(F)(F)F
0.23   -11.66        O=C(O)C(F)(OC(F)(C(F)(F)F)C(F)(F)OC(F)(F)F
0.23   -14.61        OOS(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.23    -6.84        O=S(=O)(O)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.23    -6.39        CN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.23   -10.87        O=C(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.23   -16.36        O=C(F)C(F)(OC(F)(F)F)C(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F
0.23   -22.54        CN(CO)(=O)C(F)(F)C(F)(F)F
0.23    -9.79        O=S(=O)(O)C(F)(F)C(F)(F)F
0.22    -3.48        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -3.40        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -3.90        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -9.98        O=C(O)C(F)(OC(F)(F)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F)C(F)(F)F)C(F)(F)F
0.22    -3.81        CCN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22   -11.03        CCNN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -2.99        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -3.34        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -4.69        O=S(=O)(O)CCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.22    -3.70        O=C(O)C(F)(OC(F)(C(F)(F)F)C(F)(F)OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F
0.22   -11.91        O=C(O)C(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F
0.22    -5.35        CN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -3.21        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -4.40        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -4.29        O=C(O)C(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)F
0.18    -3.15        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -3.22        O=C(F)C(F)(OC(F)(F)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F)C(F)(F)F
0.18    -3.02        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -6.28        O=C(O)C(F)(F)C(F)OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F
0.18    -3.44        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -3.16        O=C(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.18    -3.36        O=C(O)C(F)(F)C(F)OC(F)(F)C(F)(F)C(F)(F)OC(F)(F)F
0.00   -13.68        O=C(F)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F
0.00    -3.73        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.00    -7.57        CN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.00    -8.78        CCN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
0.00    -6.71        O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)F
0.00    -8.45        O=C(O)C(F)(F)C(F)(F)F
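
Not every row above is a well-formed SMILES. A cheap first filter, before handing anything to RDKit, is to check that the branch parentheses balance. This is a sketch with a hypothetical helper name, `parentheses_balanced`, and it only catches one kind of error; a string that passes can still be chemically invalid.

```python
def parentheses_balanced(smiles):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for character in smiles:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:  # a branch was closed before it was opened
                return False
    return depth == 0


# Three rows copied from the Agent output above
rows = [
    "O=C(O)C(F)(F)C(F)(F)C(F)(F)F",
    "O=C(F)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F)C(F)(F)F",
    "O=C(O)C(F)(OC(F)(C(F)(F)F)C(F)(F)OC(F)(F)F",
]
results = [parentheses_balanced(smiles) for smiles in rows]
for ok, smiles in zip(results, rows):
    print(ok, smiles)
```

The second row closes a branch that was never opened and the third leaves one open, so both would be rejected.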

This is interesting. As an organic chemist, I read the SMILES and came up with a selection of interesting connections or replacements to understand what the machine is doing.

Organic Chemist Selected List

O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
O=C(O)C(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)F
O=S(=O)(O)C(F)(F)C(F)(F)OC(F)(C(F)(F)F)C(F)(F)OC(F)C(F)(F)F
SCC(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)F
CCN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
O=C(O)C(F)(F)C(F)OC(F)(F)C(F)(F)C(F)(F)OC(F)(F)F

I then sent it over to a co-worker who knows polymer science better than I do. The list was not good. The model inserted oxygens as ethers, which I thought was an interesting move, but it destroys the coating effect.
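
The ether insertions can be flagged automatically with a substructure search. The sketch below uses RDKit with a SMARTS pattern I picked for the purpose, an oxygen with two connections sitting between two carbons; the pattern and the helper name `has_ether` are my assumptions, and the pattern would also match esters, which do not occur in this list.

```python
from rdkit import Chem

# A two-connected oxygen bonded to a carbon on each side
ETHER_PATTERN = Chem.MolFromSmarts("[#6][OX2][#6]")


def has_ether(smiles):
    """Return True if the molecule contains a C-O-C linkage."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse: {smiles}")
    return mol.HasSubstructMatch(ETHER_PATTERN)


# Three entries from the selected list above
selected = [
    "O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
    "O=C(O)C(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)OC(F)(F)F",
    "CCN(CC(=O)O)S(=O)(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
]
flags = [has_ether(smiles) for smiles in selected]
for flag, smiles in zip(flags, selected):
    print(flag, smiles)
```

The acid and sulfonic acid oxygens are not flagged because each is bonded to only one carbon or to sulfur, so the check isolates the backbone ethers my co-worker objected to.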

The compounds could be tricky to synthesize as well. So I want to revisit this model, and one thing I want to try is mixing the initial input with chemical lists that are similar in structure but offer different diversity in the connections. I think which new structures appear depends on the input I give. It doesn’t have to be large, just key functional groups relevant to the problem. We want to maintain the coating but avoid toxicity.

In the second lecture, I will use the word “oleophobic” as a keyword to explore chemical space lexically, find more diverse compounds, and figure out how to test them.

Written by Sulstice
