Understanding drug-likeness filters with RDKit and exploring the WITHDRAWN database.

Sulstice
5 min readJan 12, 2020

As I delve into the world of cheminformatics, one of the things I came across recently was other filters that expand on Lipinski’s rule of 5 (the only one taught in my education). The list is continuous but the main ones I have found in the community are:

Lipinski Rule of 5
Ghose Filter
Veber Filter
Rule of 3 Filter
REOS Filter
Drug-like Filter (QED)

So how well do these rules work? Let’s do an analysis of a dataset from the database WITHDRAWN. WITHDRAWN is a database of compounds of FDA approved drugs that were withdrawn from the market. What I expect to find in the most ideal situation is that none of the drugs will pass the filtering but never is it that simple in science.

So let's see how we can put this code together for our analysis. First, I downloaded the database withdrawn in SDF format. Using the RDKit SD parser will make it easier to load molecules into a variable in memory.

from rdkit import Chem

if __name__ == '__main__':
molecules = Chem.SDMolSupplier('withdrawn_database.sdf')

Let’s now set up the rules for our analysis:

results = {
"Lipinski Rule of 5": 0,
"Ghose Filter": 0,
"Veber Filter": 0,
"Rule of 3 Filter": 0,
"REOS Filter": 0,
"Drug-like Filter": 0,
"Passes All Filters": 0,
}

For each molecule that passes a filter, let's log it and if a molecule passes all the filters then we will iterate it appropriately so.

So just what are these rules Lipinski, Ghose, etc.? Well in summary here you go:

Lipinski:
Moleculer Weight <= 500
LogP <= 5
H-Bond Donor Count <= 5
H-Bond Acceptor Count <= 10
Ghose:
Molecular weight between 160 and 480
LogP between -0.4 and +5.6
Atom count between 20 and 70
Molar refractivity between 40 and 130
Veber:
Rotatable bonds <= 10
Topological polar surface area <= 140
REOS:
Molecular weight between 200 and 500
LogP between -5.0 and +5.0
H-bond donor count between 0 and 5
H-bond acceptor count between 0 and 10
Formal charge between -2 and +2
Rotatable bond count between 0 and 8
Heavy atom count between 15 and 50
Rule of 3:
Molecular weight <= 300
LogP <= 3
H-bond donor <= 3
H-bond acceptor count <= 3
Rotatable bond count <= 3
Drug-Like (QED):
mass < 400
ring count > 0
rotatable bond count < 5
h-bond donor count <= 5
h-bond acceptor count <= 10
logP < 5

Okay we have all the conditionals, now lets set up how exactly we are going to achieve in getting these descriptors? By using the RDKit Library! If you are not familiar with RDKit the please reference this and play around!

With the help of RDKit we can fetch all properties using the code below:

molecular_weight = Descriptors.ExactMolWt(molecule)
logp = Descriptors.MolLogP(molecule)
h_bond_donor = Descriptors.NumHDonors(molecule)
h_bond_acceptors = Descriptors.NumHAcceptors(molecule)
rotatable_bonds = Descriptors.NumRotatableBonds(molecule)
number_of_atoms = Chem.rdchem.Mol.GetNumAtoms(molecule)
molar_refractivity = Chem.Crippen.MolMR(molecule)
topological_surface_area_mapping = Chem.QED.properties(molecule).PSA
formal_charge = Chem.rdmolops.GetFormalCharge(molecule)
heavy_atoms = Chem.rdchem.Mol.GetNumHeavyAtoms(molecule)
num_of_rings = Chem.rdMolDescriptors.CalcNumRings(molecule)

Whew, that’s a lot of variable declaration.

Next step, let’s put all those rules into if conditional statements and if a molecule passes those conditions then append that to the results! Simple as that, here we go:

# Lipinski
if molecular_weight <= 500 and logp <= 5 and h_bond_donor <= 5 and h_bond_acceptors <= 10and rotatable_bonds <= 5:
lipinski = True
results["Lipinski Rule of 5"] += 1

# Ghose Filter
if molecular_weight >= 160 and molecular_weight <= 480 and logp >= -0.4 and logp <= 5.6 and number_of_atoms >= 20 and number_of_atoms <= 70 and molar_refractivity >= 40 and molar_refractivity <= 130:
ghose_filter = True
results["Ghose Filter"] += 1

# Veber Filter
if rotatable_bonds <= 10 and topological_surface_area_mapping <= 140:
veber_filter = True
results["Veber Filter"] += 1

# Rule of 3
if molecular_weight <= 300 and logp <= 3 and h_bond_donor <= 3 and h_bond_acceptors <= 3 and rotatable_bonds <= 3:
rule_of_3 = True
results["Rule of 3 Filter"] += 1

# REOS Filter
if molecular_weight >= 200 and molecular_weight <= 500 and logp >= int(0 - 5) and logp <= 5 and h_bond_donor >= 0 and h_bond_donor <= 5 and h_bond_acceptors >= 0 and h_bond_acceptors <= 10 and formal_charge >= int(0-2) and formal_charge <= 2 and rotatable_bonds >= 0 and rotatable_bonds <= 8 and heavy_atoms >= 15 and heavy_atoms <= 50:
reos_filter = True
results["REOS Filter"] += 1

#Drug Like Filter
if molecular_weight < 400 and num_of_rings > 0 and rotatable_bonds < 5 and h_bond_donor <= 5 and h_bond_acceptors <= 10 and logp < 5:
drug_like_filter = True
results["Drug-like Filter"] += 1

Okay, so very verbose code and long if statements but they definitely get the job done.

If you would like the full script then please visit here, after doing some quick analysis on the withdrawn database we can achieve the results of:

{
‘Lipinski Rule of 5’: 314,
‘Ghose Filter’: 281,
‘Veber Filter’: 534,
‘Rule of 3 Filter’: 90,
‘REOS Filter’: 379,
‘Drug-like Filter’: 270,
‘Passes All Filters’: 4
}

Interestingly enough Rule of 3 Filter is the most stringent with barely with only 14% passing where are Lipinski rule of 5 allowed for 50% of the molecules! But if we apply all the filters we can narrow it down to 0.6% were passing. A future blog post may go a little deeper into why those particular 4 molecules passed all the filters but pulled from the market.

So the question I ask is are these filters good enough, are they even useful in discovering a new drug? Are their rules of thumb that should include as well ADMET scoring?

Another interesting run was on the BindingDB, a database of approved known therapeutics (not all FDA approved and clinically ready).

{ 
‘Lipinski Rule of 5’: 198273,
‘Ghose Filter’: 384881,
‘Veber Filter’: 615721,
‘Rule of 3 Filter’: 20099,
‘REOS Filter’: 412177,
‘Drug-like Filter’: 170907,
‘Passes All Filters’: 2469
}

Update 2022

I’ve decided to revisit this, since I also need it I made it part of a python package extensions that I am adding. What you can do is pass in your own smiles and apply the different filters to see how they work for your dataset.

Packages:

python -m pip install global-chem[cheminformatics]

Code:

from global_chem_extensions import GlobalChemExtensionssmiles_list = []
cheminformatics = GlobalChemExtensions().cheminformatics()
filtered_smiles = cheminformatics.filter_smiles_by_criteria(
smiles_list,
lipinski_rule_of_5=True,
ghose=False,
veber=False,
rule_of_3=False,
reos=False,
drug_like=False,
pass_all_filters=False
)

print (len('Filtered SMILES: %s' % len(filtered_smiles)))

--

--