Understanding what is inside of Cannabis Sativa using Principal Component Analysis with Global-Chem

3 min read · Jul 25, 2022

Skip to Code:

https://colab.research.google.com/drive/1E6roBxG4XeSHW_50jYy25A08aJKkBdPu?usp=sharing

Skip to Live Demo:

https://sulstice.github.io/CannabisSativa/

As most of you know, we wrote out about 400 compounds that make up, in a general sense, a popular plant: Cannabis Sativa. The source for this material is Turner’s work, which goes back nearly 6,000 years.

https://pubs.acs.org/doi/abs/10.1021/np50008a001

The book’s table of contents actually clusters the molecules for us, giving each group a name that you most often see on the packaging of cannabis products.

This is great for us, because I now have a general classification layer for cannabis. We converted all the molecules listed in this resource to SMILES, SMARTS, and binary representations.

Now, I’ve talked about principal component analysis a lot in my previous blog posts and the theory is discussed in the documentation of the Global-Chem package:

Principal Component Analysis SMILES

SMILES data if we would like to identify major features that collect groups of molecules that might not be as obvious…

sulstice.gitbook.io

So let’s move on to the hyperparameters. I wanted to keep the number of classifications small because, as you can see, “vitamins” has only 1 relevant compound inside of cannabis, so after playing around I chose 6. Before we look at the results, here is what I expected. The Cannabinoids share a scaffold, so they should cluster together on one side. Terpenes usually share a general trait but cover a wide chemical space. Fatty Acids, with their carboxylic tail, should sit close together in their own cluster. Sugars, with hydroxyls all around the molecule, should be another one. Hydrocarbons should group because of their long alkyl chains. And anything else will just fall under an umbrella cluster.

So if we run the code:

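# The imports and the `cheminformatics` handle are not shown in the original
# post; they are assumed here to follow the Global-Chem documentation.
from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions

gc = GlobalChem()
cheminformatics = GlobalChemExtensions().cheminformatics()

# Build the network and collect the SMILES stored on the cannabis node.
gc.build_global_chem_network(print_output=False, debugger=False)
smiles_list = list(
    gc.get_node_smiles('constituents_of_cannabis_sativa').values()
)

# Split the SMILES into those that validate and those that fail.
successes, failures = cheminformatics.verify_smiles(
    smiles_list,
    rdkit=False,
    partial_smiles=True,
    return_failures=True,
)

# Run the PCA analysis on the valid SMILES with 6 clusters.
mol_ids = cheminformatics.node_pca_analysis(
    successes,
    morgan_radius=1,
    bit_representation=1024,
    number_of_clusters=6,
    number_of_components=0.95,
    random_state=0,
    save_file=False,
    return_mol_ids=True,
)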
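To make the parameters a little less abstract, here is a minimal sketch of the general idea behind `number_of_components = 0.95` and `number_of_clusters = 6`: keep just enough principal components to explain 95% of the variance, then group the projected points with k-means. This is not the Global-Chem implementation. The helper names `pca_keep_variance` and `kmeans` are made up for illustration, and the random bit vectors are a stand-in for real fingerprints.

```python
import numpy as np

def pca_keep_variance(X, variance=0.95):
    """Project X onto the fewest principal components explaining `variance`."""
    Xc = X - X.mean(axis=0)                        # centre every bit column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)          # variance ratio per component
    k = int(np.searchsorted(np.cumsum(explained), variance)) + 1
    k = min(k, len(S))
    return Xc @ Vt[:k].T, explained[:k]

def kmeans(points, n_clusters, random_state=0, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = np.random.default_rng(random_state)
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centre
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy stand-in for fingerprints: 60 "molecules" with 64 random bits each (assumption).
toy_rng = np.random.default_rng(0)
fingerprints = toy_rng.integers(0, 2, size=(60, 64)).astype(float)

scores, ratios = pca_keep_variance(fingerprints, variance=0.95)
labels = kmeans(scores, n_clusters=6, random_state=0)
print(scores.shape, round(float(ratios.sum()), 3))
```

Fixing `random_state` matters here for the same reason it does in the call above: the k-means starting centres are picked at random, so the seed keeps the cluster assignments repeatable between runs.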

Now let’s have a look at the classifications and how the machine performed. Keep in mind that I played with the bit representation length, switching between 512 and 1024, with different Morgan radii. This allows for fine “tuning”. The plot was made with a popular package called Bokeh.
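One reason the bit length matters: a fixed-length fingerprint folds many substructure identifiers into a limited number of positions, and shorter vectors collide more often. The sketch below is a toy model of that folding using a simple modulo, not RDKit's actual hashing, and the `feature_ids` are hypothetical integers standing in for hashed atom environments.

```python
def fold(feature_ids, n_bits):
    """Return the set of on-bit positions after folding ids into n_bits."""
    return {fid % n_bits for fid in feature_ids}

# Hypothetical, distinct 32-bit identifiers (not real Morgan environments).
feature_ids = [(i * 2654435761) % (2 ** 32) for i in range(1, 201)]

for n_bits in (512, 1024):
    on_bits = fold(feature_ids, n_bits)
    collisions = len(feature_ids) - len(on_bits)
    print(f"{n_bits} bits: {len(on_bits)} on, {collisions} features lost to collisions")
```

Because 512 divides 1024, any two identifiers that collide at 1024 bits also collide at 512, so the shorter vector can never keep more distinct features than the longer one. A larger Morgan radius produces more distinct environments per molecule, which makes this trade-off more pronounced.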

Well, it’s pretty obvious we have some clear clustering along PC1 and PC2 if we hover over the points.

Looking around the graph, we can get an idea of how the machine clustered molecules together under the umbrella of cannabis, which gives us an avenue to classify groups of molecules. And if I look a little deeper, I can perhaps infer how well the machine did and whether the clusters make sense.

I believe tweaking these hyperparameters to explore new cannabis chemical space will be fun, especially in the cannabinoid and terpene sections.
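One simple way to check whether the clusters make sense is to cross-tabulate each cluster against the group names from the source's table of contents and see which group dominates. The sketch below uses made-up `(group, cluster)` pairs purely for illustration; real values would come from the known group of each molecule and its cluster assignment, and the exact structure returned in `mol_ids` is not assumed here.

```python
from collections import Counter, defaultdict

# Hypothetical (known_group, cluster_id) pairs, invented for this example.
assignments = [
    ("cannabinoids", 0), ("cannabinoids", 0), ("cannabinoids", 3),
    ("terpenes", 1), ("terpenes", 1), ("terpenes", 3),
    ("fatty acids", 2), ("fatty acids", 2),
    ("sugars", 4), ("hydrocarbons", 5),
]

# Count how many molecules of each known group landed in each cluster.
per_cluster = defaultdict(Counter)
for group, cluster in assignments:
    per_cluster[cluster][group] += 1

for cluster in sorted(per_cluster):
    counts = per_cluster[cluster]
    total = sum(counts.values())
    top_group, top_count = counts.most_common(1)[0]
    print(f"cluster {cluster}: mostly {top_group} ({top_count}/{total})")
```

A cluster dominated by a single group suggests the fingerprint captured that scaffold well, while a mixed cluster is a hint that the radius or bit length may be worth revisiting.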

Written by Sulstice
