Full Code
Philosophy
When navigating through chemical space, we need to create sets of rules to filter data with boundaries that pertains to a specific problem. Going in blind leaves us to brute force search chemical space rather than have intuition and experience guide us.
Our goal is to encode intuition into the computer by passing through a SMILES through a series of filters based on a user chosen question and visualize it. For example, we pass a molecular list using parallel categories and the idea is that we ask question, how many molecules contain a ring? With two answers: ‘Yes’ and ‘No’. The density of the line will tell us how many roughly of this data set has rings in it.
To design something like this visually, we use a technique called parallel coordinate diagrams:
And come up with a list of questions that intuitive for a chemist to ask on a series of chemical data:
how many molecules have rings in them?
how many molecules are aromatic?
What is the Molecular Weight Distribution?
What is the distribution of LogP Values?
what is the functional group diversity?
Code
First let’s get our python dependencies.
!pip install global-chem
!pip install plotly
!pip install rdkit-pypi
And now the imports:
import pandas as pd
import plotly.graph_objects as go
import rdkit.Chem.Descriptors as Descriptors
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from global_chem import GlobalChem
We are going to analyze the list of SMILES from the popular book PihKal: A Chemical Love Story with it’s list built into GlobalChem
gc = GlobalChem()
gc.build_global_chem_network()
pihkal_molecules = list(gc.get_node_smiles('pihkal').values())
For functional group diversity we would like functional groups to relate the data too. I think it would be interesting to see how many of these chemicals made would pass Phase 2 Drug Trials. We will convert them all to fingerprints and keep a reference for the names.
def convert_to_fingerprint(smiles):
try:
molecule = Chem.MolFromSmiles(smiles)
fingerprint = AllChem.GetMorganFingerprintAsBitVect(
molecule,
2,
nBits=1024,
)
return fingerprint
except:
return None
reference_names = list(gc.get_node_smiles('phase_2_hetereocyclic_rings').keys())
reference_smiles = list(gc.get_node_smiles('phase_2_hetereocyclic_rings').values())
reference_fingerprints = [ convert_to_fingerprint(value) for value in reference_smiles ]
Now we need to iterate through data and determine values for our questions. First setup the dataframe, we would like a key index on the text of the graph.
methanol: 1
acid: 2
thio: 3
halogen: 4
By determining the functional group number, we can then install that text in the graph later on. Let’s get our data pipeline ready:
molecule_dataframe = pd.DataFrame()
key = 0
logp_values = []
number_of_rings_values = []
molecular_weight_values = []
aromatic_values = []
functional_group_values = []
functional_group_keys = {}
Now let’s iterate through values and answer our questions:
for molecule in pihkal_molecules:
rdkit_molecule = Chem.MolFromSmiles(molecule)
# Molecular Descriptors
aromatic_pattern = Chem.MolFromSmarts('a:a:a')
logp = Descriptors.MolLogP(rdkit_molecule)
molecular_weight = Descriptors.ExactMolWt(rdkit_molecule)
num_of_rings = Chem.rdMolDescriptors.CalcNumRings(rdkit_molecule)
is_aromatic = rdkit_molecule.HasSubstructMatch(aromatic_pattern)
is_aromatic_value = 0
if is_aromatic:
is_aromatic_value = 1
logp_values.append(logp)
number_of_rings_values.append(num_of_rings)
molecular_weight_values.append(molecular_weight)
aromatic_values.append(is_aromatic_value)
# Functional Group Diversity
fingerprint = convert_to_fingerprint(molecule)
tanimoto_scores = DataStructs.BulkTanimotoSimilarity(fingerprint, reference_fingerprints)
functional_group = reference_names[tanimoto_scores.index(max(tanimoto_scores))]
if functional_group not in functional_group_keys:
functional_group_keys[functional_group] = key
functional_group_values.append(key)
key += 1
else:
functional_group_values.append(functional_group_keys[functional_group])
One thing to note is the aromatic pattern matching which i abstracted from this forum. Now let’s build our dataframe and have a look at what it looks like:
molecule_dataframe['logp'] = logp_values
molecule_dataframe['num_of_rings'] = number_of_rings_values
molecule_dataframe['molecular_weight'] = molecular_weight_values
molecule_dataframe['aromatic'] = aromatic_values
molecule_dataframe['functional_group'] = functional_group_values
print (molecule_dataframe)
And the output:
logp num_of_rings molecular_weight aromatic functional_group
0 1.9922 1 239.152144 1 0
1 1.7698 1 237.136493 1 1
2 2.3154 1 241.113650 1 2
3 2.7055 1 255.129300 1 2
4 3.0940 1 269.144950 1 2
.. ... ... ... ... ...
174 2.3839 1 253.167794 1 1
175 2.7071 1 255.129300 1 1
176 2.7071 1 255.129300 1 1
177 3.0972 1 269.144950 1 1
178 3.0972 1 269.144950 1 1
[179 rows x 5 columns]
And now let’s plot. To design our plot, we need to think about what we want. I want the LogP values to be representative of a colors and molecular weight, aromatic, number of rings, molecular weight, and functional groups in that order.
fig = go.Figure(data=
go.Parcoords(
line = dict(color = molecule_dataframe['logp'],
colorscale = [[0,'purple'],[0.5,'lightseagreen'],[1,'gold']],
showscale = True,
cmin = min(logp_values),
cmax = max(logp_values)),
dimensions = list([
dict(tickvals = [0, 1],
ticktext = ['No', 'Yes'],
label = 'Aromatic', values = molecule_dataframe['aromatic']),
dict(range = [min(number_of_rings_values), max(number_of_rings_values)],
tickvals = [0, 1, 2, 3],
label = 'Number of Rings', values = molecule_dataframe['num_of_rings']),
dict(range = [min(molecular_weight_values), max(molecular_weight_values)],
label = 'Molecular Weight', values = molecule_dataframe['molecular_weight']),
dict(tickvals = list(functional_group_keys.values()),
ticktext = list(functional_group_keys.keys()),
label = 'Functional Group', values = molecule_dataframe['functional_group']),
])
)
)
fig.update_layout(
height=1000,
width=2000,
grid= dict(columns=1, rows=1)
)
fig.show()
And here we go:
Inference
For this chemical dataset I can start to infer some things. All compounds are aromatic, most compounds have 1 to 2 rings, majority are around 230 to 250 M.W, and a lot of them have similarity towards a methylated indole group and surprisingly a similarity to tosuflaxcin.
This raises some new questions that we answer and easily modify the code to gain more insight.