Protonating your SMILES strings over a range of pH for Small Molecules and Proteins

Sulstice
2 min readMar 13, 2022

--

Recently, a co-worker challenged me draw a molecule and predict the pKa using just reasoning.

Let’s look at a simple example of acetic acid to butyric acid where we subsequently add a carbon can we predict the protonation of the carboxylic acid hydrogen? This raised an interesting data connection for SMILES. From our SMILES strings, depending on the environment, can we predict the different protonation states that could occur rather than quantitative pka values.

Dimorphite_dl was an open source library released for doing just that, although some protonation states that can occur need some additional filtering. Unfortunately, it was available for PyPi distribution, and it looked outdated by 4 years so some quick sunday edits and now here we go: https://github.com/Sulstice/dimorphite_dl

pip install dimorphite_dl

And then to check a SMILES over a range of a minimum and maximum pH which will alter whether that charge state is extant with a neutral (7), acidic (<7), basic (>7) environment per the general definition. max_variants means how many different charge states would we want to limit our output too. label_states, whether we wanted to label as Neutral , Positive , Negative , and the last argument, pka_precision, I think it’s if the pka’s fall without a certain value they get excluded from the output. Another sanity check for charge states.

from dimorphite_dl import DimorphiteDL

dimorphite_dl = DimorphiteDL(
min_ph=4.5,
max_ph=8.0,
max_variants=128,
label_states=False,
pka_precision=1.0
)
print(dimorphite_dl.protonate('CC(=O)O'))

And the output:

['CC(=O)[O-]', 'CC(=O)O']

Which makes sense. So let’s take this one step deeper. First we build an extension into the GlobalChemExtensions

from global_chem_extensions.global_chem_extensions import GlobalChemExtensions

GlobalChemExtensions().find_protonation_states(
['CC(=O)O'],
min_ph=4.0,
max_ph=8.1,
pka_precision=1.0,
max_variants=128,
label_states=False,
)

Let’s give a protein a shot and see what we get:

`
from global_chem_extensions.global_chem_extensions import GlobalChemExtensions

amino_acid_test = ['RSTEFGHIKLADPQ']
smiles = GlobalChemExtensions().amino_acids_to_smiles(amino_acid_test)
print (smiles)
GlobalChemExtensions().find_protonation_states(
smiles,
min_ph=4.0,
max_ph=8.1,
pka_precision=1.0,
max_variants=128,
label_states=False,
)

Whew that’s a lot of data. Well let’s stop there for now. Because mining this will be nuts. Happy hunting!

Google Colab Link: https://colab.research.google.com/drive/1tE1Sm3ey9LsvSyprG6ZP43TmzeyIpsjJ?usp=sharing

--

--

Sulstice
Sulstice

No responses yet