Recently, a co-worker challenged me draw a molecule and predict the pKa using just reasoning.
Let’s look at a simple example of acetic acid to butyric acid where we subsequently add a carbon can we predict the protonation of the carboxylic acid hydrogen? This raised an interesting data connection for SMILES. From our SMILES strings, depending on the environment, can we predict the different protonation states that could occur rather than quantitative pka values.
Dimorphite_dl was an open source library released for doing just that, although some protonation states that can occur need some additional filtering. Unfortunately, it was available for PyPi distribution, and it looked outdated by 4 years so some quick sunday edits and now here we go: https://github.com/Sulstice/dimorphite_dl
pip install dimorphite_dl
And then to check a SMILES over a range of a minimum and maximum pH which will alter whether that charge state is extant with a neutral (7), acidic (<7), basic (>7) environment per the general definition. max_variants
means how many different charge states would we want to limit our output too. label_states
, whether we wanted to label as Neutral
, Positive
, Negative
, and the last argument, pka_precision
, I think it’s if the pka’s fall without a certain value they get excluded from the output. Another sanity check for charge states.
from dimorphite_dl import DimorphiteDL
dimorphite_dl = DimorphiteDL(
min_ph=4.5,
max_ph=8.0,
max_variants=128,
label_states=False,
pka_precision=1.0
)
print(dimorphite_dl.protonate('CC(=O)O'))
And the output:
['CC(=O)[O-]', 'CC(=O)O']
Which makes sense. So let’s take this one step deeper. First we build an extension into the GlobalChemExtensions
from global_chem_extensions.global_chem_extensions import GlobalChemExtensions
GlobalChemExtensions().find_protonation_states(
['CC(=O)O'],
min_ph=4.0,
max_ph=8.1,
pka_precision=1.0,
max_variants=128,
label_states=False,
)
Let’s give a protein a shot and see what we get:
`
from global_chem_extensions.global_chem_extensions import GlobalChemExtensions
amino_acid_test = ['RSTEFGHIKLADPQ']
smiles = GlobalChemExtensions().amino_acids_to_smiles(amino_acid_test)
print (smiles)
GlobalChemExtensions().find_protonation_states(
smiles,
min_ph=4.0,
max_ph=8.1,
pka_precision=1.0,
max_variants=128,
label_states=False,
)
Whew that’s a lot of data. Well let’s stop there for now. Because mining this will be nuts. Happy hunting!
Google Colab Link: https://colab.research.google.com/drive/1tE1Sm3ey9LsvSyprG6ZP43TmzeyIpsjJ?usp=sharing