Rotation 2 begins for this wide-eyed cheminformatician in training! (The first one still isn't quite over… I'll talk about that in another blog.)
I was tasked with screening a chemical database; the exact end goal I'm still figuring out. I had my pick of the litter: which database do I choose? The classics like ZINC and ChEMBL, or one of the harder ones, the Enamine REAL database? Of course, I went for the latter. The challenge is more fun.
Okay, we are looking at 1.36 billion compounds split into 20 .smiles part files, each containing roughly ~68 million compounds. This
pandas.read_csv(massive_amount)
ain't gonna cut it. For fun's sake, I did use pandas just to see how long it would take… 10 minutes just to transfer half of part 1 (34 million compounds) into memory. Yeah… not gonna work.
What about Modin? Everyone is talking about it… "make your pandas faster with one line," as they claim. It parallelizes on the CPU: the file read is divided into chunks and spread across several processes.
import modin.pandas as pd
pd.read_csv(massive_amount)
Took roughly ~40 seconds for 34 million compounds, and 80 seconds to write it back out to a CSV. The claim holds, but it's still not fast enough…
Sul! Sul! Sul! yell the random coding gods in my head: use CUDA… do you have any GPUs available on your workstation? Well, let's check with a quick nvidia-smi:
>>>
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile             |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                      |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                      |
+-------------------------------+----------------------+----------------------+
Cool. I'm a Python type of dude, so how exactly do I use these GPUs? Well, lo and behold: cuDF. Oh awesome, load my dataframes onto the GPU for faster processing. Just need to install it into my conda environment…
gpu_df = cudf.read_csv(infile, usecols=['smiles'], header=0)
>>> Out of Memory
Well, shit. So it turns out a GPU doesn't have that much memory to load stuff onto, especially big files. After many trials, I found the limit to be somewhere around 15 million compounds. That won't cut it.
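(Side note: if you want to see the ceiling you're working against before cuDF blows up, cupy can report the free and total memory on each device. A quick sketch of that check, just a sanity check I'd suggest, nothing from my actual pipeline:

# Sanity check: how much memory does each GPU actually have free?
import cupy

for device_id in range(cupy.cuda.runtime.getDeviceCount()):
    free_bytes, total_bytes = cupy.cuda.Device(device_id).mem_info
    print(f"GPU {device_id}: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")

An RTX 2080-class card tops out around 8 to 11 GB, which is why tens of millions of full rows simply don't fit.)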
So, what if we make a CPU-GPU dual loader? Maybe Modin can do things cuDF can't, and vice versa. Turns out that was actually the solution. The original Enamine REAL database comes with a lot of stuff that I don't want:
smiles,idnumber,reagent1,reagent2,reagent3,reagent4,reaction,MW,HAC,sLogP,HBA,HBD,RotBonds,FSP3,TPSA,QED,PAINS,BRENK,NIH,ZINC,LILLY,lead-like,350/3_lead-like,fragments,strict_fragments,PPI_modulators,natural_product-like,Type,InChiKey
I'm just hunting for the SMILES. The GPU doesn't have enough memory to load all of this in, and even if it did, it would be pretty slow. So we need to pre-process this data, and maybe, just maybe, if we give the GPU a single column of strings it will be able to handle it. Turns out Modin is the right tool for that preprocessing. Let me show you.
Imports first:
# Imports
# -------
import cupy
import cudf
import os
import time
from rdkit import Chem

# Preliminary Checks
# ------------------
os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd  # Needs to happen after setting the Modin engine
Read the Modin docs to understand the import structure. Trust me, it is worth it.
Next, set the GPU we are going to use. If you have multiple, you can split your load across devices; cupy lets us do that (see the sketch after this block).
# GPU Configurations
# ------------------
cupy.cuda.Device(0).use()
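(For the curious, here is a rough sketch of what splitting across both cards could look like, using cupy's device context manager. The file names are made-up placeholders, and for heavier multi-GPU work something like dask-cudf is the more robust route; this is just the simple version of the idea.

# Hypothetical split: pin one CSV file to each GPU.
import cupy
import cudf

parts = {0: "part_01_smiles.csv", 1: "part_02_smiles.csv"}  # placeholder paths

gpu_frames = {}
for device_id, path in parts.items():
    with cupy.cuda.Device(device_id):  # allocations below land on this GPU
        gpu_frames[device_id] = cudf.read_csv(path, usecols=['smiles'])

For a single card, the plain .use() call above is all we need.)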
Alright, let's preprocess the file with Modin first. All we will do is extract the smiles column.
import modin.pandas as pd
df = pd.read_csv(
infile,
sep='\t',
header=0,
usecols=['smiles'],
low_memory=True,
verbose=True
)
Pass low_memory as True to act as a bit of a safety net so nothing crashes, and with usecols we only read in the smiles column.
NVIDIA GPUs actually process data faster when it's a single column with tons of rows. Modin and cuDF don't talk to each other directly, even though both mirror the pandas API. So the next best thing, I found out, is to write your Modin dataframe out to a CSV:
df.to_csv(outfile)
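Since the REAL database ships as 20 of these part files, the same extract-and-write step can be looped over all of them to build up a folder of slim, one-column CSVs. Here is a sketch of that loop; the directory and file names are placeholders I made up, so point them at wherever your parts actually live:

# Preprocess every REAL part file down to a one-column smiles CSV.
import glob
import os

os.environ["MODIN_ENGINE"] = "ray"
import modin.pandas as pd

raw_dir = "real_db/raw"          # placeholder: where the .smiles parts sit
out_dir = "real_db/smiles_only"  # placeholder: where the slim CSVs go
os.makedirs(out_dir, exist_ok=True)

for infile in sorted(glob.glob(os.path.join(raw_dir, "*.smiles"))):
    part_name = os.path.splitext(os.path.basename(infile))[0]
    outfile = os.path.join(out_dir, f"{part_name}_smiles.csv")

    df = pd.read_csv(infile, sep='\t', header=0, usecols=['smiles'], low_memory=True)
    df.to_csv(outfile, index=False)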
Here's how long it took to process 68 million compounds:
'Loading File': 42.22884821891785 seconds
'Creating CSV': 80.5738205909729 seconds
Roughly 2 minutes to load it in and write it out to a CSV. Now, with our newly preprocessed file, can one of the GPUs handle the data?
gpu_df = cudf.read_csv(infile, usecols=['smiles'], header=0)
No out of memory error, loading time:
'GPU Loading Time': 0.6069104671478271 seconds
Aha! Now we are getting some real speed. I can use Modin to build up a store of preprocessed CSV files, and I can use cuDF to load huge chunks of data into memory in a time that's actually tolerable.
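(If you want to reproduce timings like the ones quoted above, the time module imported earlier is all you need. A minimal sketch, with infile as a placeholder path to one of the preprocessed CSVs:

# Time the GPU load of a preprocessed one-column CSV.
import time

import cudf

infile = "real_db/smiles_only/part_01_smiles.csv"  # placeholder path

start = time.time()
gpu_df = cudf.read_csv(infile, usecols=['smiles'], header=0)
print(f"'GPU Loading Time': {time.time() - start} seconds")

Nothing fancy, just time.time() bookends around the read.)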
Combining Modin and cuDF seems to be the solution for me: loading in 1.36 billion compounds takes me roughly 15 seconds if I split the loads. More on that later!
Give it a shot.