When evaluating ligand poses against proteins we often look at the overall shape of the ligand as it fits into the binding pocket. Is it structurally similar to another compound we have seen before?
The Bostrom’s paper talks about a new criteria when screening for ligands as they relate to their different protein classes. Each ligand is screened with it’s protein class and a similarity score is determined between the screening and the target. The criteria for the algorithm is as follows:
1.) 80 < molecular weight < 750
2.) 10 < number of nonhydrogen atoms < 70
3.) Must not contain atoms of types other than
H, C, O, N, F, P, S, Cl, Br, or I
4.) must contain at least one
non-carbon/non-hydrogen atom
5.) must not contain two or more phosphorus atoms
6.) must not have more than 10 rotatable bonds
7.) must not be a nucleic acid
8.) must not be composed only from
non-lead-like PDB-HET-groupsb
9.) must not be covalently bound
10.) must not have protein contacts from the crystal
packing environment in less than 3 Å distance
11.) must have contacts with protein in less than
7Å distance
The criteria is relatively simple but could be tricky to implement in Python. The function is available as part of Global-Chem.
!pip install global-chem[bioinformatics]
if you would like to skip the coding aspect and head right into the functionality:
from global_chem_extensions import GlobalChemExtensions
bioinformatics = GlobalChemExtensions().bioinformatics()
gc_protein = bioinformatics.initialize_globalchem_protein(
pdb_id='5tc0',
)
criteria_met = gc_protein.determine_bostrom_ligand_criteria(verbose=True)
ligand_smiles = gc_protein.ligand_smiles
print ("Ligand: %s, Criteria Met: %s" % (ligand_smiles, criteria_met))
For the implementation we need to first have a robust way to load PDBs into the python object as well as talking to other informatic platforms:
!pip install biopandas
!pip install pypdb
!pip install rdkit-pypi
We will first separate the protein and the ligand from each other. This gives some room to play with the ligand before we move to more complex criteria.
biopandas_pdb = PandasPdb().fetch_pdb('5tc0')
protein_dataframe = biopandas_pdb.df['ATOM']
ligand_dataframe = biopandas_pdb.df['HETATM']
RDKit
gives us the flexibility of analyzing small ligand molecular descriptors with ease so it is in our interest to convert the ligand dataframe into a SMILES format. The chemical descriptors can be accessed from the ligands chain ID and fetched from the RCSB protein data bank website and with their python package pypdb
chain_id = self.ligand_dataframe['residue_name'].to_list()[0]
chem_desc = pypdb.get_info(chain_id, url_root = 'https://data.rcsb.org/rest/v1/core/chemcomp/')
chem_desc = pypdb.to_dict(chem_desc)
ligand_name = chem_desc['chem_comp']['name']
ligand_descriptors = chem_desc['pdbx_chem_comp_descriptor']
for software in ligand_descriptors:
if software['program'] == 'OpenEye OEToolkits' and software['type'] == 'SMILES':
ligand_smiles = software['descriptor']
We then extract meta information to access the SMILES. I tried different SMILES strings from the platform and the most robust was coming from the OpenEye Toolkit when they submit data to the RCSB.
rdkit_molecule = Chem.MolFromSmiles(ligand_smiles)
Criteria 1: 80 < molecular weight < 750
molecular_weight = Descriptors.ExactMolWt(rdkit_molecule)
if 750 > molecular_weight > 80:
continue
else:
print ("Failed Check 1")
Criteria 2: 10 < number of non-hydrogen atoms < 70
non_hydrogen_atoms = Chem.rdchem.Mol.GetNumHeavyAtoms(rdkit_molecule)
if 70 > non_hydrogen_atoms > 10:
continue
else:
print ("Failed Check 2")
Criteria 3: Must not contain atoms of types other than H, C, O, N, F, P, S, Cl, Br, or I
element_boundaries = ['H', 'C', 'O', 'N', 'F', 'P', 'S', 'Cl', 'Br', 'I']
atom_symbols = []
atoms = Chem.rdchem.Mol.GetAtoms(rdkit_molecule)
for atom in atoms:
symbol = atom.GetSymbol()
atom_symbols.append(symbol)
if symbol not in element_boundaries:
if verbose:
print ("Failed Check 3")
Criteria 4: must contain at least one non-carbon/non-hydrogen atom
if 'H' in atom_symbols:
atom_symbols = list(filter(lambda x: x != 'H', atom_symbols))
if 'C' in atom_symbols:
atom_symbols = list(filter(lambda x: x != 'C', atom_symbols))
if len(atom_symbols) == 0:
print ("Failed Check 4")
Criteria 5: must not contain two or more phosphorus atoms
phosphorous_atoms = list(filter(lambda x: x == 'P', atom_symbols))
if len(phosphorous_atoms) > 1:
print ("Failed Check 5")
Criteria 6: must not have more than 10 rotatable bonds
rotatable_bonds = Descriptors.NumRotatableBonds(rdkit_molecule)
if rotatable_bonds > 10:
print ("Failed Check 6")
Criteria 7: must not be a nucleic acid
nucleic_acid_smiles = 'NC1OC(COP(O)(O)=O)C(O)C1'
nucleic_acid_molecule = Chem.MolFromSmiles(nucleic_acid_smiles)
substructures = rdkit_molecule.GetSubstructMatches(nucleic_acid_molecule)
if substructures:
print ("Failed Check 7")
Criteria 8: must not be composed only from non-lead-like PDB-HET-groupsb
I don’t know how to do this one so if you know then please feel free to add some code or let me know in the comments.
Criteria 9: must not be covalently bound
if 'COVALENT' in biopandas_pdb.pdb_text or 'COVALNT' in biopandas_pdb.pdb_text:
print ("Failed Check 9")
The covalent flagged is marked inside the PDB file and can have either the E
or not which I found odd. It allows for ease of filtering for covalent binding inhibitors.
Criteria 10: must not have protein contacts from the crystal
packing environment in less than 3 Å distance
I don’t know how to do this as well so if you know then please feel free to add some code or let me know in the comments.
Criteria 11: must have contacts with protein in less than 7Å distance
for index, atom_row in ligand_dataframe.iterrows():
# Check to see if there is water
if atom_row['residue_name'] == 'HOH':
continue
x_coord = atom_row['x_coord']
y_coord = atom_row['y_coord']
z_coord = atom_row['z_coord']
reference_point = (x_coord, y_coord, z_coord)
distances = biopandas_pdb.distance(xyz=reference_point, records=('ATOM',))
atoms_within_seven_angstroms = biopandas_pdb.df['ATOM'][distances < 7.0]
if len(atoms_within_seven_angstroms) > 0:
continue
else:
print ("Failed Check 11")
Criteria 8 and 10 still need completion but the workflow that is built is robust enough in handling larger amounts of PDB data sets for screening.
Passed Check 1 Molecular Weight: 433.08784608800005
Passed Check 2 Non Hydrogen Atoms: 29
Passed Check 3 All Atoms Are Within Element Boundaries
Passed Check 4 Non-Hydrogen & Non-Carbon Atoms Present: 11
Passed Check 5 Less than Two Phosphorous atoms Present
Passed Check 6 Less than Ten Rotatable Bonds Present
Passed Check 7 No Nucleic Acid Template Found
Passed Check 9 Not a Covalent Inhibitor
Check 11 Has Contacts within 7 Angstroms
Ligand: CS(=O)(=O)N1CCOC(C1)c2csc(n2)c3ccccc3NC(=O)c4[nH]ccn4, Criteria Met: True
Passed Check 1 Molecular Weight: 358.0688674079999
Passed Check 2 Non Hydrogen Atoms: 26
Passed Check 3 All Atoms Are Within Element Boundaries
Passed Check 4 Non-Hydrogen & Non-Carbon Atoms Present: 8
Passed Check 5 Less than Two Phosphorous atoms Present
Passed Check 6 Less than Ten Rotatable Bonds Present
Passed Check 7 No Nucleic Acid Template Found
Failed Check 9
Passed Check 1 Molecular Weight: 358.0688674079999
Passed Check 2 Non Hydrogen Atoms: 26
Passed Check 3 All Atoms Are Within Element Boundaries
Passed Check 4 Non-Hydrogen & Non-Carbon Atoms Present: 8
Passed Check 5 Less than Two Phosphorous atoms Present
Passed Check 6 Less than Ten Rotatable Bonds Present
Passed Check 7 No Nucleic Acid Template Found
Failed Check 9
Ligand: Cc1cc(c(c2c1C(=O)Oc3c(c(cc(c3O2)C(=O)O)OC)C)C=O)O, Criteria Met: False