Principles Of Chemical Diversity

Sulstice
4 min readMar 5, 2023

--

Full Code

Philosophy

When navigating through chemical space, we need to create sets of rules to filter data with boundaries that pertains to a specific problem. Going in blind leaves us to brute force search chemical space rather than have intuition and experience guide us.

Our goal is to encode intuition into the computer by passing through a SMILES through a series of filters based on a user chosen question and visualize it. For example, we pass a molecular list using parallel categories and the idea is that we ask question, how many molecules contain a ring? With two answers: ‘Yes’ and ‘No’. The density of the line will tell us how many roughly of this data set has rings in it.

Figure 1: Sketch Diagram of Category Filtering

To design something like this visually, we use a technique called parallel coordinate diagrams:

And come up with a list of questions that intuitive for a chemist to ask on a series of chemical data:

how many molecules have rings in them?
how many molecules are aromatic?
What is the Molecular Weight Distribution?
What is the distribution of LogP Values?
what is the functional group diversity?

Code

First let’s get our python dependencies.

!pip install global-chem
!pip install plotly
!pip install rdkit-pypi

And now the imports:

import pandas as pd
import plotly.graph_objects as go
import rdkit.Chem.Descriptors as Descriptors

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from global_chem import GlobalChem

We are going to analyze the list of SMILES from the popular book PihKal: A Chemical Love Story with it’s list built into GlobalChem

gc = GlobalChem()
gc.build_global_chem_network()
pihkal_molecules = list(gc.get_node_smiles('pihkal').values())

For functional group diversity we would like functional groups to relate the data too. I think it would be interesting to see how many of these chemicals made would pass Phase 2 Drug Trials. We will convert them all to fingerprints and keep a reference for the names.

def convert_to_fingerprint(smiles):
try:
molecule = Chem.MolFromSmiles(smiles)
fingerprint = AllChem.GetMorganFingerprintAsBitVect(
molecule,
2,
nBits=1024,
)
return fingerprint
except:
return None

reference_names = list(gc.get_node_smiles('phase_2_hetereocyclic_rings').keys())
reference_smiles = list(gc.get_node_smiles('phase_2_hetereocyclic_rings').values())
reference_fingerprints = [ convert_to_fingerprint(value) for value in reference_smiles ]

Now we need to iterate through data and determine values for our questions. First setup the dataframe, we would like a key index on the text of the graph.

methanol: 1
acid: 2
thio: 3
halogen: 4

By determining the functional group number, we can then install that text in the graph later on. Let’s get our data pipeline ready:

molecule_dataframe = pd.DataFrame()

key = 0
logp_values = []
number_of_rings_values = []
molecular_weight_values = []
aromatic_values = []
functional_group_values = []

functional_group_keys = {}

Now let’s iterate through values and answer our questions:

for molecule in pihkal_molecules:

rdkit_molecule = Chem.MolFromSmiles(molecule)

# Molecular Descriptors

aromatic_pattern = Chem.MolFromSmarts('a:a:a')

logp = Descriptors.MolLogP(rdkit_molecule)
molecular_weight = Descriptors.ExactMolWt(rdkit_molecule)
num_of_rings = Chem.rdMolDescriptors.CalcNumRings(rdkit_molecule)
is_aromatic = rdkit_molecule.HasSubstructMatch(aromatic_pattern)

is_aromatic_value = 0

if is_aromatic:
is_aromatic_value = 1

logp_values.append(logp)
number_of_rings_values.append(num_of_rings)
molecular_weight_values.append(molecular_weight)
aromatic_values.append(is_aromatic_value)

# Functional Group Diversity

fingerprint = convert_to_fingerprint(molecule)
tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fingerprint, reference_fingerprints)
functional_group = reference_names[tanimoto_scores.index(max(tanimoto_scores))]

if functional_group not in functional_group_keys:

functional_group_keys[functional_group] = key
functional_group_values.append(key)
key += 1
else:
functional_group_values.append(functional_group_keys[functional_group])

One thing to note is the aromatic pattern matching which i abstracted from this forum. Now let’s build our dataframe and have a look at what it looks like:


molecule_dataframe['logp'] = logp_values
molecule_dataframe['num_of_rings'] = number_of_rings_values
molecule_dataframe['molecular_weight'] = molecular_weight_values
molecule_dataframe['aromatic'] = aromatic_values
molecule_dataframe['functional_group'] = functional_group_values

print (molecule_dataframe)

And the output:

       logp  num_of_rings  molecular_weight  aromatic  functional_group
0 1.9922 1 239.152144 1 0
1 1.7698 1 237.136493 1 1
2 2.3154 1 241.113650 1 2
3 2.7055 1 255.129300 1 2
4 3.0940 1 269.144950 1 2
.. ... ... ... ... ...
174 2.3839 1 253.167794 1 1
175 2.7071 1 255.129300 1 1
176 2.7071 1 255.129300 1 1
177 3.0972 1 269.144950 1 1
178 3.0972 1 269.144950 1 1

[179 rows x 5 columns]

And now let’s plot. To design our plot, we need to think about what we want. I want the LogP values to be representative of a colors and molecular weight, aromatic, number of rings, molecular weight, and functional groups in that order.

fig = go.Figure(data=
go.Parcoords(
line = dict(color = molecule_dataframe['logp'],
colorscale = [[0,'purple'],[0.5,'lightseagreen'],[1,'gold']],
showscale = True,
cmin = min(logp_values),
cmax = max(logp_values)),
dimensions = list([
dict(tickvals = [0, 1],
ticktext = ['No', 'Yes'],
label = 'Aromatic', values = molecule_dataframe['aromatic']),
dict(range = [min(number_of_rings_values), max(number_of_rings_values)],
tickvals = [0, 1, 2, 3],
label = 'Number of Rings', values = molecule_dataframe['num_of_rings']),
dict(range = [min(molecular_weight_values), max(molecular_weight_values)],
label = 'Molecular Weight', values = molecule_dataframe['molecular_weight']),

dict(tickvals = list(functional_group_keys.values()),
ticktext = list(functional_group_keys.keys()),
label = 'Functional Group', values = molecule_dataframe['functional_group']),
])
)
)

fig.update_layout(
height=1000,
width=2000,
grid= dict(columns=1, rows=1)
)

fig.show()

And here we go:

Figure 2: Category Filtering of a List of Molecules

Inference

For this chemical dataset I can start to infer some things. All compounds are aromatic, most compounds have 1 to 2 rings, majority are around 230 to 250 M.W, and a lot of them have similarity towards a methylated indole group and surprisingly a similarity to tosuflaxcin.

Figure 3: Skeletal Diagram of Tosuflaxcin

This raises some new questions that we answer and easily modify the code to gain more insight.

--

--

Sulstice
Sulstice

No responses yet