I have just realized I am my own best user. I had forgotten why I was interested in organic chemistry in the first place. PiHKAL is such an awesome book because of how sharp and whimsical its science is. I've always loved that head-in-the-clouds kind of organic chemistry, and it is the root of my knowledge, so I've long wanted to look at this book in more detail.
Here’s the code I will be using so we can get that over with. I also have it available as a demo in my software package, GlobalChem.
Install it:
!pip install -q global-chem[cheminformatics] --upgrade
Load the cheminformatics package up:
from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions

gc = GlobalChem()
cheminformatics = GlobalChemExtensions().cheminformatics()
And run the PCA analysis. I cover the code in more detail in my documentation. We will start with some base configurations:
gc.build_global_chem_network()
smiles_list = list(gc.get_node_smiles('pihkal').values())
mol_ids = cheminformatics.node_pca_analysis(
    smiles_list,
    morgan_radius=1,
    bit_representation=512,
    number_of_clusters=5,
    number_of_components=0.95,
    random_state=0,
    return_mol_ids=True,
)
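For intuition, here is a rough sketch of what a fingerprint-plus-PCA pipeline like this can look like when written directly against RDKit and scikit-learn. This is not GlobalChem's actual implementation: the helper name `fingerprint_pca_clusters`, the choice to cluster on the PCA coordinates with k-means, and the hand-written example SMILES are all my assumptions for illustration. The parameters mirror the call above (radius 1, 512 bits, 95% explained variance).

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def fingerprint_pca_clusters(smiles_list, radius=1, n_bits=512,
                             n_clusters=5, variance=0.95, random_state=0):
    """Hypothetical helper: Morgan bit vectors -> PCA -> k-means labels."""
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"could not parse SMILES: {smiles}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        bits = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, bits)
        rows.append(bits)
    matrix = np.array(rows, dtype=float)

    # A float n_components keeps just enough principal components
    # to explain that fraction of the total variance.
    pca = PCA(n_components=variance, random_state=random_state)
    coords = pca.fit_transform(matrix)

    # Assumption: cluster in the reduced PCA space with k-means.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(coords)
    return coords, labels, pca


# Hand-written illustrative SMILES (not pulled from the GlobalChem node).
example_smiles = [
    "COc1cc(CCN)cc(OC)c1OC",      # mescaline
    "CNC(C)Cc1ccc2OCOc2c1",       # MDMA
    "CC(N)Cc1ccc2OCOc2c1",        # MDA
    "COc1cc(CCN)c(OC)cc1Br",      # 2C-B
    "COc1cc(CCN)c(OC)cc1I",       # 2C-I
    "COc1cc(CC(C)N)c(OC)cc1C",    # DOM
    "COc1cc(CC(C)N)c(OC)cc1Br",   # DOB
    "COc1cc(CCN)c(OC)cc1SC",      # 2C-T
]
coords, labels, pca = fingerprint_pca_clusters(example_smiles, n_clusters=3)
print(coords.shape, labels)
```

With only a handful of molecules the clusters mean very little; the point is just to show where each parameter enters the pipeline.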
What do we get:
I set it to 5 clusters. There are around 200 compounds in this dataset, so roughly 40 per cluster seems reasonable. While I was writing out the list, I also noticed functional groups of particular interest. A popular one is the methylenedioxy protecting group found in various serotonin-based drugs. It looks like an owl.
Chances are this is pretty important and the PCA will highlight it. Let's find out. It was pretty easy to find, and one thing I learned is that the position para to the ring junction usually carries a nitrogen. A couple of interesting fragments I found were a cyclopropyl group, which could help binding affinity since protein binding pockets are often hydrophobic and so is that functional group, and a terminal alkene, which could be a good Michael acceptor for covalent warhead binding. Interestingly, I also saw a thiomethyl group in the meta position; I don't actually know what that will do.
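If you would rather search for these fragments than hunt for them by eye, a substructure query is one option. The sketch below uses RDKit SMARTS matching; the patterns, the `find_fragments` helper, and the example SMILES are my own illustrative guesses at how to encode the groups mentioned above, not something shipped with GlobalChem.

```python
from rdkit import Chem

# Assumed SMARTS encodings for the fragments discussed in the text.
FRAGMENT_SMARTS = {
    "methylenedioxy": "c1ccc2OCOc2c1",   # benzodioxole ring system
    "cyclopropyl": "C1CC1",              # three-membered carbon ring
    "terminal_alkene": "[CX3]=[CH2]",    # C=CH2 at the end of a chain
    "aryl_thiomethyl": "c[SX2][CH3]",    # S-methyl attached to an aromatic carbon
}
PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in FRAGMENT_SMARTS.items()}


def find_fragments(smiles):
    """Hypothetical helper: names of the fragments present in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return [name for name, patt in PATTERNS.items() if mol.HasSubstructMatch(patt)]


# Hand-written illustrative SMILES.
examples = {
    "MDMA": "CNC(C)Cc1ccc2OCOc2c1",
    "CPM": "COc1cc(CCN)cc(OC)c1OCC1CC1",
    "AL": "COc1cc(CCN)cc(OC)c1OCC=C",
    "2C-T": "COc1cc(CCN)c(OC)cc1SC",
    "mescaline": "COc1cc(CCN)cc(OC)c1OC",
}
hits = {name: find_fragments(smi) for name, smi in examples.items()}
for name, found in hits.items():
    print(name, found)
```

Running the same helper over `smiles_list` would give a quick count of how often each fragment shows up in the node.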
I found these more substituted anisoles and their halogen variations to be reasonable fragment conjunctions. These two functional groups correlated highly with each other, as indicated by the large vector along the y-axis, but I reckon the data varied a lot because of the different functional group combinations. Lots of chemical space to explore. You can see that chemical diversity increases as you move up the y-axis.
Towards the bottom left it got a little more odd: we can see more complex ring-junction structures and, surprisingly, a selenium-based drug. I have seen selenium as an emerging atom type in therapeutics:
Looking at the chemical structures on the left-hand side, I actually feel like this could be reduced to 3 clusters because there seems to be a lot of functional group overlap.
As I scroll around, 3 clusters feels way more natural, and if I were going to do some machine learning on this node I would start with these parameters:
gc = GlobalChem()
gc.build_global_chem_network(print_output=False, debugger=False)
smiles_list = list(gc.get_node_smiles('pihkal').values())
mol_ids = cheminformatics.node_pca_analysis(
    smiles_list,
    morgan_radius=1,
    bit_representation=512,
    number_of_clusters=3,
    number_of_components=0.95,
    random_state=0,
    return_mol_ids=True,
)
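I picked 3 over 5 by eye, but if you want a number to back up a choice like that, a silhouette score is one common option. This is not part of the analysis above: the sketch below is a swapped-in technique from scikit-learn, the helper `silhouette_by_cluster_count` is hypothetical, and the data is a synthetic stand-in (three tight 2D blobs), not the PiHKAL fingerprints.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_cluster_count(points, candidates=(3, 5), random_state=0):
    """Hypothetical helper: silhouette score for each candidate cluster count."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(points)
        scores[k] = float(silhouette_score(points, labels))
    return scores


# Synthetic stand-in data: three tight, well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])

scores = silhouette_by_cluster_count(points)
print(scores)
```

A higher score means tighter, better separated clusters, so on real data you would pass in the PCA coordinates and compare the candidates the same way.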
Happy Cheminformatics!