Using Principal Component Analysis to Distinguish Aromatic and Non-Aromatic Compounds and Identify Common Scaffolds for Diverse Communities using SMILES and GlobalChem.

Sulstice
3 min read · May 13, 2022


Now that I have a wide enough data set to look at all the common groups relevant to a subsection of a community, we can start to elucidate the common scaffolds shared between diverse communities.

So let’s begin, here is the demo so you can follow along:

!pip install -q global-chem[cheminformatics] --upgrade

Load the cheminformatics package up:

from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions
gc = GlobalChem()
cheminformatics = GlobalChemExtensions().cheminformatics()

And run the PCA. I go into more detail about the code in my documentation here. We will start with some base configurations:

gc.build_global_chem_network()
smiles_list = gc.get_all_smiles()
successes, failures = cheminformatics.verify_smiles(
    smiles_list,
    partial_smiles=True,
    return_failures=True
)
mol_ids = cheminformatics.node_pca_analysis(
    successes,
    morgan_radius=1,
    bit_representation=512,
    number_of_clusters=3,
    number_of_components=0.95,
    random_state=0,
    return_mol_ids=True,
)

We will be using the partial SMILES validation since I have found it to be pretty useful.

Okay, so we have 2096 compounds, and we set the fingerprint to 512 bits because we want to keep the feature space of the chemical space small. I just need to see the most important features for this blog post. Let’s set the number of clusters to 3 because that could be a magic number.
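One setting worth a quick note is number_of_components = 0.95. I am assuming here that it follows the scikit-learn PCA convention, where a fraction between 0 and 1 means "keep the fewest components that together explain at least that share of the variance". I have not confirmed that GlobalChem passes it through this way, so treat the following as a sketch of the idea rather than what the library does internally. The function name and the variance values are made up for illustration.

```python
from itertools import accumulate


def components_for_variance(variances, target=0.95):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches `target` (hypothetical helper)."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("variances must sum to a positive number")
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical per-component variances, for illustration only.
example_variances = [6.0, 3.0, 0.6, 0.3, 0.1]
print(components_for_variance(example_variances))  # prints 3
```

With these made-up numbers the first two components cover 90% of the variance and the third pushes it to 96%, so three components are kept.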

The resulting plot shows a distinct boundary between two sets of data. The yellow and red clusters overlap a little, so let’s set number_of_clusters to two.

If we go through the compounds a little:

The code is somewhat aware that aromaticity is a significant pattern and probably a main way to cluster compounds. If you wanted to build a machine learning model to distinguish aromatic from non-aromatic systems, this could be something worthwhile to explore.
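If you wanted labels to check the clusters against, a rough starting point is the SMILES convention that aromatic atoms are written in lowercase. The sketch below is a string heuristic only, not part of GlobalChem, and the function name is hypothetical. It will miss aromatic rings written in Kekulé form (for example C1=CC=CC=C1), so a real toolkit would be needed for anything serious.

```python
import re

# Aromatic atoms that may appear outside brackets in SMILES.
AROMATIC_ORGANIC = set("bcnops")
# Bracket atoms such as [nH], [Na+], [se] or [13cH].
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")


def looks_aromatic(smiles: str) -> bool:
    """Heuristic: True if the SMILES string contains a lowercase
    (aromatic) atom symbol. Kekule forms are NOT detected."""
    # Inside brackets, skip the isotope digits and check whether the
    # element symbol starts lowercase ([nH] yes, [Na+] no).
    for body in BRACKET_ATOM.findall(smiles):
        symbol = body.lstrip("0123456789")
        if symbol[:1].islower():
            return True
    # Outside brackets, the only other lowercase letters are the
    # second letters of Cl and Br, which are not in the aromatic set.
    outside = BRACKET_ATOM.sub("", smiles)
    return any(ch in AROMATIC_ORGANIC for ch in outside)


for smi in ["c1ccccc1", "C1CCOC1", "CC(=O)O", "ClCCl", "c1cc[nH]c1"]:
    print(smi, looks_aromatic(smi))
```

Benzene and pyrrole come back as aromatic here, while tetrahydrofuran, acetic acid and 1,2-dichloroethane do not.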

I expected some of these functional groups. A tetrahydrofuran ring makes sense since we see it in ligand fragments, in synthesis as a common solvent, etc. The carboxylic acid doesn’t surprise me either. I guess what really shocked me was this class of compounds:

What exactly is the framework here? I am not aware of where this functional group class really comes in, but it is apparently valuable enough to be in this dataset, and I have forgotten why.

This was an interesting night time venture. Happy Cheminformatics!
