Skip to Code:
https://colab.research.google.com/drive/1E6roBxG4XeSHW_50jYy25A08aJKkBdPu?usp=sharing
Skip to Live Demo:
https://sulstice.github.io/CannabisSativa/
As most of you know, we wrote up about 400 compounds that constitute a popular plant, Cannabis sativa. The source of the material is Turner’s work, which covers a record going back nearly 6,000 years.
https://pubs.acs.org/doi/abs/10.1021/np50008a001
The book’s table of contents actually clusters the molecules for us, giving each group a name that you most often see on the packaging of cannabis products.
This is great for us, because we now have a general classification layer for cannabis. We converted all the molecules listed in this resource to SMILES/SMARTS/binary representations.
Now, I’ve talked about principal component analysis a lot in my previous blog posts, and the theory is discussed in the documentation of the Global-Chem package.
So let’s move on to the hyperparameters. I wanted to minimize the number of classifications because, as you can see, “vitamins” has only 1 relevant compound inside of cannabis, so after playing around I chose 6. Before we look at the results, here is what I expected: the shared scaffold of the cannabinoids would cluster them together on one side. Terpenes usually share a general trait but span a wide chemical space. Fatty acids, with their carboxylic tail, would sit close together and form their own cluster. Sugars would be another one, with hydroxyls surrounding the molecule. Hydrocarbons would group together because of their long alkyl chains. And anything else would just fall under an umbrella cluster.
So if we run the code:
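from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions

# Build the Global-Chem network and get the cheminformatics extension
gc = GlobalChem()
gc.build_global_chem_network(print_output=False, debugger=False)
cheminformatics = GlobalChemExtensions().cheminformatics()

# Collect every SMILES string stored under the cannabis node
smiles_list = list(
    gc.get_node_smiles('constituents_of_cannabis_sativa').values()
)

# Separate the SMILES that validate from those that do not
successes, failures = cheminformatics.verify_smiles(
    smiles_list,
    rdkit=False,
    partial_smiles=True,
    return_failures=True,
)

# Run the PCA and cluster the verified molecules into 6 groups
mol_ids = cheminformatics.node_pca_analysis(
    successes,
    morgan_radius=1,
    bit_representation=1024,
    number_of_clusters=6,
    number_of_components=0.95,
    random_state=0,
    save_file=False,
    return_mol_ids=True,
)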
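A quick note on number_of_components = 0.95. Assuming Global-Chem forwards this value to a scikit-learn style PCA (an assumption on my part, so check the package source), a fraction between 0 and 1 means "keep the smallest number of components whose cumulative explained variance exceeds that fraction". The sketch below shows the selection rule in plain Python; the helper name and the eigenvalues are made up for illustration.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative explained
    variance ratio exceeds `threshold` (a fraction strictly between 0 and 1)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total > threshold:
            return count
    # Rounding can keep the final ratio just under 1.0; fall back to everything.
    return len(ordered)


# Made-up eigenvalues (variance captured by each principal component).
toy_eigenvalues = [6.0, 2.0, 1.0, 0.7, 0.3]
n_keep = components_for_variance(toy_eigenvalues, 0.95)
print(n_keep)
```

With these toy numbers the first three components explain 90% of the variance, so a fourth is needed to pass 95%.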
Now have a look at the classifications and how the machine performed. Keep in mind I played with the bit representation length, switching between 512 and 1024, and with different Morgan radii. These settings allow for fine “tuning”. The plot was made with a popular package called Bokeh.
Well, it’s pretty obvious we have some clear clustering along PC1 and PC2, and we can hover over the points to inspect individual molecules.
Looking around the graph, we can get an idea of how the machine clustered molecules together under the umbrella of cannabis, giving us an avenue to classify clusters of molecules. And if I look a little deeper, I can perhaps infer how well the machine did and whether the result makes sense.
I believe tweaking these hyperparameters to explore new cannabis chemical space will be fun, especially in the cannabinoid and terpene sections.
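That tuning amounts to a small grid search over bit length and Morgan radius. Here is a minimal sketch that builds one set of keyword arguments per combination, reusing the parameter names from the node_pca_analysis call above. The bit lengths are the ones I mentioned; the radius values and the helper name build_parameter_grid are illustrative assumptions, not something the package provides.

```python
from itertools import product

# Bit lengths come from the post; the radius values are illustrative assumptions.
BIT_LENGTHS = (512, 1024)
MORGAN_RADII = (1, 2, 3)


def build_parameter_grid(bit_lengths=BIT_LENGTHS, radii=MORGAN_RADII):
    """Return one keyword-argument dict per (radius, bit length) combination."""
    return [
        {
            "morgan_radius": radius,
            "bit_representation": bits,
            "number_of_clusters": 6,
            "number_of_components": 0.95,
            "random_state": 0,
            "save_file": False,
            "return_mol_ids": True,
        }
        for radius, bits in product(radii, bit_lengths)
    ]


parameter_grid = build_parameter_grid()
for params in parameter_grid:
    print(params["morgan_radius"], params["bit_representation"])
    # With Global-Chem installed, each setting could then be run as:
    # mol_ids = cheminformatics.node_pca_analysis(verified_smiles, **params)
    # where verified_smiles is the list returned by verify_smiles.
```

Keeping random_state fixed across the grid means any change in the clusters comes from the fingerprint settings rather than from the clustering seed.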
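One way to put a number on how well the machine did is to compare each molecule's cluster with the group name it carries in the table of contents. The sketch below computes cluster purity, the fraction of molecules that fall in the majority class of their cluster. The input format is an assumption (the structure of mol_ids is not shown here), and the labels are toy data, not results from the analysis.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Return (purity, per-cluster class counts), where purity is the fraction
    of molecules that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must not be empty")
    classes_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        classes_by_cluster[cluster][label] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in classes_by_cluster.values()
    )
    return majority_total / len(cluster_ids), dict(classes_by_cluster)


# Toy data for illustration only.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_classes = [
    "cannabinoid", "cannabinoid", "terpene",
    "terpene", "terpene",
    "fatty acid", "fatty acid", "sugar",
]
purity, breakdown = cluster_purity(toy_clusters, toy_classes)
print(f"purity = {purity:.2f}")
for cluster, counts in sorted(breakdown.items()):
    print(cluster, dict(counts))
```

A purity of 1.0 would mean every cluster contains a single group; the per-cluster breakdown shows which groups the machine mixed together.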