Probability Distribution of Chemical Fingerprint Similarity Scores with RDKit and Plotly.

6 min read · Jan 22, 2023

In cheminformatics, we search through vast amounts of chemical space to select and design compounds for different applications. Fingerprinting was developed as a way of condensing a chemical graph into a bit vector, allowing cheminformaticians to search through molecules for similarities by comparing their bits. Since then, a vast number of ways to generate the bit vector, as well as algorithms to determine similarity, have been developed. In this lecture, I will go over the similarity algorithms that are implemented today and a small application: using them to design and select ring fragments.

First, we are going to take RingsInDrugs, a list of the most popular ring compounds that passed FDA Phase III trials. We then want to go through a list of possible kinase inhibitors and select which fragments could be used for further testing based on our fingerprint analysis.

from global_chem import GlobalChem

gc = GlobalChem()
gc.build_global_chem_network()

rings_in_drugs = list(gc.get_node_smiles('rings_in_drugs').values())

So let’s convert them to fingerprints with the function below, which uses the Morgan fingerprint. Again, you can find more in a previous blog here. This time we write our function with a bigger bit vector to capture more information and a larger radius of 2.

from rdkit import Chem
from rdkit.Chem import AllChem

def convert_to_fingerprint(smiles):
    # MolFromSmiles returns None (rather than raising) when it cannot parse the SMILES
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is None:
        return None
    # Morgan fingerprint with radius 2, folded into a 1024-bit vector
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(
        molecule,
        2,
        nBits=1024,
    )
    return fingerprint

And now let’s generate our reference fingerprints for RingsInDrugs and filter out any None values in case they come up.

ref_fps = [ convert_to_fingerprint(value) for value in rings_in_drugs ]
ref_fps = [ fp for fp in ref_fps if fp is not None ]

Alright, so now we have our reference fingerprints, let’s load up our kinase inhibitors:

inhibitor_values = list(gc.get_node_smiles('privileged_kinase_inhibitors').values())

Next, we want to go through and evaluate how the similarity scores differ between algorithms, and come up with an idea of which ring systems inhibitors tend to trend towards.

from rdkit import DataStructs

for value in inhibitor_values:

    criteria = 0.50
    fp = convert_to_fingerprint(value)
    if fp is None:
        # Skip inhibitors whose SMILES could not be parsed
        continue

    # Score this inhibitor against every reference ring in one call
    tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fp, ref_fps)

    if all(x > criteria for x in tanimoto_scores):
        print ('Tanimoto Accepted: %s' % value)

Tanimoto similarity is one of the most common metrics in cheminformatics, but there are others that are also widely used. Notice the parameter called criteria, which I use as the passing score; typical values usually lie in the 0.85 range. For our code, I relaxed the score down to 0.5. Tanimoto scores range from 0 to 1. We use the BulkTanimotoSimilarity function, which evaluates one fingerprint against all of our reference fingerprints at once. Running the loop prints:

Tanimoto Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
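To make the score and the acceptance test concrete, here is a minimal pure-Python sketch. It assumes each fingerprint can be treated as the set of its on-bit positions; the tanimoto helper and the bit positions are made up for illustration and are not RDKit calls or real Morgan fingerprints.

```python
def tanimoto(on_a, on_b):
    """Tanimoto score from two sets of on-bit positions: |A & B| / |A | B|."""
    union = len(on_a | on_b)
    return len(on_a & on_b) / union if union else 0.0

# Hypothetical on-bit positions, standing in for real fingerprints
query = {1, 4, 7, 9}
references = [{1, 4, 7, 9, 12}, {1, 4, 20}, {7, 9, 30, 31}]

# Same shape as the bulk call: one query scored against every reference
scores = [tanimoto(query, ref) for ref in references]

criteria = 0.50
# The loop above accepts a compound only if it clears the threshold
# against every reference fingerprint
accepted_by_all = all(score > criteria for score in scores)
# For comparison: the score of the single closest reference
best_match = max(scores)

print(scores, accepted_by_all, best_match)
```

In this toy case the query is a close match to one reference but is still rejected, because the all() test needs every reference to score above the threshold. That strictness may be one reason so few inhibitors pass.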

We could make a very loose inference that the biphenyl with a tetrazole attached is a useful ring system for both categories. Let’s expand to more of the similarity methods implemented in RDKit.

for value in inhibitor_values:
    
    criteria = 0.50
    fp = convert_to_fingerprint(value)
    if fp is None:
        continue
    
    tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fp, ref_fps)
    dice_scores = DataStructs.BulkDiceSimilarity(fp, ref_fps)
    kulczynski_scores = DataStructs.BulkKulczynskiSimilarity(fp, ref_fps)
    mcconnaughey_scores = DataStructs.BulkMcConnaugheySimilarity(fp, ref_fps)
    onbit_scores = DataStructs.BulkOnBitSimilarity(fp, ref_fps)
    rogot_goldberg_scores = DataStructs.BulkRogotGoldbergSimilarity(fp, ref_fps)
    russel_scores = DataStructs.BulkRusselSimilarity(fp, ref_fps)
    sokal_scores = DataStructs.BulkSokalSimilarity(fp, ref_fps)

    if all(x > criteria for x in tanimoto_scores):
        print ('Tanimoto Accepted: %s' % value)
    if all(x > criteria for x in dice_scores):
        print ('Dice Accepted: %s' % value)
    if all(x > criteria for x in kulczynski_scores):
        print ('Kulczynski Accepted: %s' % value)
    if all(x > criteria for x in mcconnaughey_scores):
        print ('Mcconnaughey Accepted: %s' % value)
    if all(x > criteria for x in onbit_scores):
        print ('On Bit Accepted: %s' % value)
    if all(x > criteria for x in rogot_goldberg_scores):
        print ('Rogot Goldberg: %s' % value)
    if all(x > criteria for x in russel_scores):
        print ('Russel: %s' % value)
    if all(x > criteria for x in sokal_scores):
        print ('Sokal: %s' % value)

Which returns:

Rogot Goldberg: C12=CC=CC=C1C=CN2
Kulczynski Accepted: C1(C2=CC=CC=C2)=CC=CC=C1
Rogot Goldberg: C1(C2=CC=CC=C2)=CC=CC=C1
Rogot Goldberg: C12=CC=CC=C1NC=N2
...
Dice Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
Tanimoto Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
Kulczynski Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
Mcconnaughey Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
On Bit Accepted: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
Rogot Goldberg: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1
Sokal: C1(C2=CC=CC=C2C3=NN=NN3)=CC=CC=C1

Across the board, every metric except Russel seems to favour that ring system as valuable. Interestingly, the Rogot Goldberg scoring mechanism seems to have the most relaxed criteria compared to the others, and it surfaces other possible new ring systems that should perhaps be looked at.

So let’s look at the distribution of the scores each metric gives across the whole set:

We can start to understand where the numbers like to lie in terms of scoring. The brown line is Rogot Goldberg, and its criteria seem to be much less stringent than something like Russel. We would have to go deeper into the algorithms to find their functional group differences. Dice and Kulczynski seem to be very similar. For the full plotting code, see below:
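A partial look at why the metrics behave so differently can be sketched from bit counts alone. The formulas below are my reading of the usual definitions (Dice, Russell-Rao, Rogot-Goldberg), which I believe match what RDKit computes, so treat them as an assumption rather than a reference. The helper names and the on-bit positions are hypothetical, and both fingerprints are assumed to have at least one bit set.

```python
N_BITS = 1024  # same fingerprint length as used in this post

def counts(on_a, on_b):
    """On bits in A, on bits in B, and on bits shared by both."""
    return len(on_a), len(on_b), len(on_a & on_b)

def tanimoto(on_a, on_b):
    a, b, c = counts(on_a, on_b)
    return c / (a + b - c)

def dice(on_a, on_b):
    a, b, c = counts(on_a, on_b)
    return 2 * c / (a + b)

def russel(on_a, on_b, n_bits=N_BITS):
    # Shared on bits divided by the full length of the bit vector
    _, _, c = counts(on_a, on_b)
    return c / n_bits

def rogot_goldberg(on_a, on_b, n_bits=N_BITS):
    # Rewards bits that are on in both AND bits that are off in both
    a, b, c = counts(on_a, on_b)
    both_off = n_bits - a - b + c
    return c / (a + b) + both_off / (2 * n_bits - a - b)

# Two sparse, made-up fingerprints: 20 on bits each, 10 in common
fp_a = set(range(0, 20))
fp_b = set(range(10, 30))

results = {
    "Tanimoto": tanimoto(fp_a, fp_b),
    "Dice": dice(fp_a, fp_b),
    "Russel": russel(fp_a, fp_b),
    "Rogot Goldberg": rogot_goldberg(fp_a, fp_b),
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Under these assumptions, a sparse 1024-bit fingerprint gets close to 0.5 from the "off in both" term of Rogot Goldberg before any shared on bits are counted, while Russel divides by all 1024 bits and stays near zero. That would be consistent with Rogot Goldberg being the most relaxed metric at a 0.5 threshold and Russel never accepting anything.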

from rdkit import DataStructs
from global_chem import GlobalChem

gc = GlobalChem()
gc.build_global_chem_network()

# convert_to_fingerprint is the function defined earlier in this post
rings_in_drugs = list(gc.get_node_smiles('rings_in_drugs').values())
ref_fps = [ convert_to_fingerprint(value) for value in rings_in_drugs ]
ref_fps = [ fp for fp in ref_fps if fp is not None ]

inhibitor_values = list(gc.get_node_smiles('privileged_kinase_inhibitors').values())

values_accepted = []

all_dice = []
all_tanimoto = []
all_kulczynski = []
all_mcconnaughey = []
all_onbit = []
all_rogot_goldberg = []
all_russel = []
all_sokal = []

for value in inhibitor_values:
    
    criteria = 0.50
    fp = convert_to_fingerprint(value)
    if fp is None:
        continue
    
    dice_scores = DataStructs.BulkDiceSimilarity(fp, ref_fps)
    tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fp, ref_fps)
    kulczynski_scores = DataStructs.BulkKulczynskiSimilarity(fp, ref_fps)
    mcconnaughey_scores = DataStructs.BulkMcConnaugheySimilarity(fp, ref_fps)
    onbit_scores = DataStructs.BulkOnBitSimilarity(fp, ref_fps)
    rogot_goldberg_scores = DataStructs.BulkRogotGoldbergSimilarity(fp, ref_fps)
    russel_scores = DataStructs.BulkRusselSimilarity(fp, ref_fps)
    sokal_scores = DataStructs.BulkSokalSimilarity(fp, ref_fps)
    
    all_dice.append(dice_scores)
    all_tanimoto.append(tanimoto_scores)
    all_kulczynski.append(kulczynski_scores)
    all_mcconnaughey.append(mcconnaughey_scores)
    all_onbit.append(onbit_scores)
    all_rogot_goldberg.append(rogot_goldberg_scores)
    all_russel.append(russel_scores)
    all_sokal.append(sokal_scores)
    
    if all(x > criteria for x in dice_scores):
        print ('Dice Accepted: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in tanimoto_scores):
        print ('Tanimoto Accepted: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in kulczynski_scores):
        print ('Kulczynski Accepted: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in mcconnaughey_scores):
        print ('Mcconnaughey Accepted: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in onbit_scores):
        print ('On Bit Accepted: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in rogot_goldberg_scores):
        print ('Rogot Goldberg: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in russel_scores):
        print ('Russel: %s' % value)
        values_accepted.append(value)
    if all(x > criteria for x in sokal_scores):
        print ('Sokal: %s' % value)
        values_accepted.append(value)
        
all_dice = sum(all_dice, [])
all_tanimoto = sum(all_tanimoto, [])
all_kulczynski = sum(all_kulczynski, [])
all_mcconnaughey = sum(all_mcconnaughey, [])
all_onbit = sum(all_onbit, [])
all_rogot_goldberg = sum(all_rogot_goldberg, [])
all_russel = sum(all_russel, [])
all_sokal = sum(all_sokal, [])

import plotly.figure_factory as ff

fig = ff.create_distplot([
    all_dice,
    all_tanimoto,
    all_kulczynski,
    all_mcconnaughey,
    all_onbit,
    all_rogot_goldberg,
    all_russel,
    all_sokal
], [
    'Dice',
    'Tanimoto',
    'Kulczynski',
    'Mcconnaughey',
    'Onbit',
    'Rogot Goldberg',
    'Russel',
    'Sokal'
], bin_size=.5, show_hist=False, show_rug=True)


fig.update_layout(legend=dict(itemsizing='constant'))
fig.update_layout(legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,  
        font = dict(family = "Arial", size = 10),
        bordercolor="LightSteelBlue",
        borderwidth=2,
    ),
    legend_title = dict(font = dict(family = "Arial", size = 20))
)

fig.update_xaxes(
                 ticks="outside",
                 tickwidth=3,
                 tickcolor='black',
                 tickfont=dict(family='Arial', color='black', size=20),
                 title_font=dict(size=46, family='Arial'),
                 title_text='Similarity Score',
                 ticklen=15,
                 range=[0, 1]
)

fig.update_yaxes(
                 ticks="outside", 
                 tickwidth=3,
                 tickcolor='black', 
                 title_text='Probability Density',
                 tickfont=dict(family='Arial', color='black', size=50),
                 title_font=dict(size=46, family='Arial'),
                 ticklen=15,
                 range=[0, 10]
)    

fig.update_layout(
    title_text="Total Distribution",
    title_font=dict(size=44, family='Arial'),
    template='simple_white',
    xaxis_tickformat = 'i',
    bargap=0.2,
    height=600,
    width=1000
)
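A small aside on the flattening step in the script above: sum(list_of_lists, []) works, but it copies the growing list on every addition. A minimal alternative sketch uses itertools.chain and prints a quick numeric summary as a sanity check on the plot. The score values here are made up and only stand in for the per-inhibitor lists collected in the loop.

```python
from itertools import chain
from statistics import mean

# Hypothetical per-inhibitor score lists (one inner list per inhibitor)
all_tanimoto = [[0.10, 0.25, 0.55], [0.05, 0.40, 0.60]]

# Flatten in a single pass instead of repeated list concatenation
flat_tanimoto = list(chain.from_iterable(all_tanimoto))

# Same result as the sum(..., []) approach used in the script
flat_with_sum = sum(all_tanimoto, [])

summary = {
    "min": min(flat_tanimoto),
    "mean": mean(flat_tanimoto),
    "max": max(flat_tanimoto),
}
print(summary)
```

The same pattern would apply to each of the other metric lists before they are passed to create_distplot.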


Written by Sulstice