Welcome to advanced cheminformatics, where you will learn more tricks of the trade for doing fun things with large datasets. This course assumes a proficient understanding of both chemistry and computer science, and that you have completed the undergraduate course.
Here is your first 1,000,000 molecules to play around with. Let's start analyzing what to do with a big dataset and begin hunting for compounds of interest for a particular problem. The dataset is available in the main repo as a zip file:
https://github.com/Sulstice/Cheminformatics-Teaching-Material/tree/main/advanced_lecture_series
Our data is just a list of SMILES arbitrarily extracted from the Enamine database, and it looks like this:
CC1=CC(N2CCC(C#N)(C3CCC3)CC2)=NC=C1C#N
CN(CC1=NN=C2C=CC=CN12)C(=O)C1=CC=CC=C1Br
CC1=CC(C(NCC(C)N(C)C2CC2)C2CC2)=CC=C1F
CN1N=NC2=CC(CSCC3=CC=C(Br)C=C3)=CC=C21
CCC(C)(C)C(=O)N(C)C1=CC=CC(F)=C1OC
CCCC(C)CCNC(=O)C(=O)N[C@@H](C)CN
CCCC(C)CCNC(=O)C(=O)N[C@H](C)CN
N#CC1CN(C(=O)N[C@H]2CCNC2)CCN1C1CC1
N#CC1CN(C(=O)N[C@@H]2CCNC2)CCN1C1CC1
N#CC1CN(C(=O)NC2CCNC2)CCN1C1CC1
CCCOCCN(C)CCCCOC
COCC1(CO)CCN(CC2=CC=CC(C(=O)O)=C2Cl)C1
CC1=C(C)N(C)C(CN2CCN(CC3(N)CC3)CC2)=N1
CC1(C)CN(C2CCN(C(=O)OC3COC3)C2)C1
CC1(NC(=O)C2CN(C3=NN=CS3)C2)CSC1
CCC(CC)C(C)C(=O)OCC1=CC(=O)NC=C1
Our goal today is to make a Structure Data File (SDF) from each SMILES and add hydrogens. The workflow looks like this:
The first thing we will do today is pip install the necessary packages, summarized briefly below. It will be your job to explore them in more detail and perhaps find places for efficiency gains.
pip install modin # Parallelized DataFrame CPU Reading
pip install rdkit-pypi # Open Source Cheminformatics
pip install pandas # Standard DataFrame Package
pip install numpy # Efficient Data Structuring for Python Math too
pip install glances # Python Monitoring Tool for Workstations
For our purposes we will also be using concurrent.futures, a Python standard-library package used for parallelizing tasks over a set of CPUs.
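To see how concurrent.futures works in isolation, here is a minimal sketch: it parallelizes a toy function (the function and worker count here are just illustrative, not part of our pipeline) over a pool of processes.

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # A trivial stand-in for any CPU-bound task
    return x * x

if __name__ == '__main__':
    # The with-block waits for all workers to finish before continuing
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(square, range(8)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `if __name__ == '__main__':` guard matters: on some platforms the executor re-imports the main module in each worker, and the guard prevents an infinite spawn loop.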
So let's go ahead and import our packages in the standard pythonic way, set our SDF output directory, and point to the SMILES data file:
# Python Internals
# ----------------
import os
import sys
import time
from concurrent.futures import ProcessPoolExecutor
# Science
# -------
import numpy as np
import pandas as pd
from rdkit import Chem
import modin.pandas as mpd
# Constants
# ---------
cores = 8
start_time = time.time()
sdf_directory = 'sdf/'
smiles_file = 'lecture_001.smiles'
Okay, so first we want to load our data into memory. We are loading 1,000,000 SMILES, which is a little heavy; on 1 CPU it can take a while, which disrupts an efficient workflow. This is where we use modin.
print('Loading DataFrame...')
master_df = mpd.read_csv(
    smiles_file,
    sep='\t',
    header=None,
    usecols=[0],
    names=['smiles'],
)
print('DataFrame Loaded!')
This parallelizes the read across the available CPUs on your machine and loads the file into memory. As you get to larger and larger datasets this package becomes incredibly useful. However, it sometimes lacks functionality of the original pandas, so after the file is read into memory we can convert it:
master_df = master_df._to_pandas()
print(master_df.head(5))
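Before committing to the full million-row run, it can be worth a quick sanity check that RDKit actually parses your SMILES. This sketch uses a tiny hypothetical frame standing in for master_df (the three example strings are mine, not from the dataset); RDKit returns None for a SMILES it cannot parse rather than raising.

```python
import pandas as pd
from rdkit import Chem

# Hypothetical small frame standing in for master_df
master_df = pd.DataFrame({'smiles': ['CCO', 'c1ccccc1', 'not_a_smiles']})

# Count how many rows RDKit can actually parse
valid = master_df['smiles'].apply(lambda s: Chem.MolFromSmiles(s) is not None)
print(f'{valid.sum()} / {len(master_df)} SMILES parsed')  # 2 / 3
```

Running this on a small head of the real DataFrame gives you an early estimate of how many rows the conversion step will silently skip.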
Which gives us our DataFrame. Next we want to write a function that converts a SMILES into an SDF through RDKit. We also want to store the SMILES string in the SDF and add hydrogens:
def smiles_to_sdf(row):
    try:
        molecule = Chem.MolFromSmiles(row['smiles'])
        molecule_with_hs = Chem.AddHs(molecule)
        molecule_with_hs.SetProp('smiles', row['smiles'])
        sdwriter = Chem.SDWriter(sdf_directory + '/' + str(row.name) + '.sdf')
        sdwriter.write(molecule_with_hs)
        sdwriter.close()
    except Exception:
        pass
Here we convert each SMILES only if RDKit accepts it, using a try and except block that captures any RDKit exception (Chem.MolFromSmiles returns None on an invalid SMILES, so the following Chem.AddHs call raises and that row is skipped). We add hydrogens to the molecule, set the property, Prop, with the SMILES, and finally write the file to disk, closing the writer so the file is flushed.
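To make the steps of the function concrete, here is the same pipeline run on a single molecule; ethanol and the output filename are my stand-ins, not part of the dataset.

```python
from rdkit import Chem

smiles = 'CCO'  # ethanol, a stand-in for one row of the dataset
molecule = Chem.MolFromSmiles(smiles)
molecule_with_hs = Chem.AddHs(molecule)
molecule_with_hs.SetProp('smiles', smiles)

print(molecule.GetNumAtoms())          # 3 heavy atoms
print(molecule_with_hs.GetNumAtoms())  # 9 atoms once hydrogens are explicit

writer = Chem.SDWriter('ethanol.sdf')
writer.write(molecule_with_hs)
writer.close()
```

Opening ethanol.sdf in a text editor shows the molblock plus a `> <smiles>` data field, which is how SetProp survives the round trip to disk.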
Now doing this on 1 CPU can be cumbersome and will take a long time because it runs iteratively. Here we can use the ProcessPoolExecutor to parallelize writing the files to disk:
cores = 8
# The with-block waits for all submitted tasks to finish before timing
with ProcessPoolExecutor(max_workers=cores) as executor:
    for index, row in master_df.iterrows():
        executor.submit(smiles_to_sdf, row)
end_time = time.time()
print('Elapsed: %.1f seconds' % (end_time - start_time))
This submits each row as a task; the tasks are distributed over the 8 cores, and each core picks up tasks and runs them until the job completes.
Awesome if you got this far! Now, what you might see in your glances tool is the iowait time, which is time spent waiting for file writes to reach the disk. This is mostly bounded by the write speed of your drive: an HDD will be slower than an SSD. Internally I use NVMe SSDs when I have a lot of files to write, but that is something to think about for the future.