Exploring Multi-Processing for Exhaustive Hydrogen Bonding Analysis using MDAnalysis and Futures
So here I am, a first-year PhD student, and getting started has been a daunting process. But if you’re like me and have similar requirements, then this blog is for you!
Requirements:
- You’ve got a pretty powerful machine with a lot of nodes, GPUs, etc.
- You’re a Python developer (Python 3)
- You want to analyze some data, maybe hydrogen bonds of some trajectory, or anything really (the multi-processing code will work with any function)
Note: to keep the initial code cleaner, utility functions like initiate_universe are included at the bottom for folks who want them :).
So let’s initiate the universe:
# Main Entry
# ----------
import os

current_directory = os.getcwd()

# Load Topology
structure_path = os.path.join(current_directory, './data/*.psf')

# Load Trajectory
trajectory_path = os.path.join(current_directory, './data/*.xtc')

# Universe Setup
# --------------
universe = initiate_universe(structure_path, trajectory_path)
Alright, that was pretty simple. Now let’s get our MDAnalysis import.
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
Cool, so here’s what I have discovered; let’s step through it piece by piece.
I’m going to set up a list of HBA instances for the different H-bonds within my system between specific atoms. I happened to have about 12 different H-bond families I wanted to analyze between a ton of different chains in my system.
For example, I wanted to look at
O --- H --- O
N --- H --- O
O --- H --- N
N --- H --- N
And the list goes on, because there are different oxygen and nitrogen numbers along my molecules, so the list is extensive. So what I did is set up a Python list of HBA instances.
Note: you can define your own H-bond criteria, but see the figure below to get the idea of the cutoffs.
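To make the cutoffs concrete, here’s a tiny standalone check (not MDAnalysis code, just an illustration): a donor–hydrogen–acceptor triple counts as an H-bond when the donor–acceptor distance is at most d_a_cutoff (in Å) and the D–H–A angle is at least d_h_a_angle_cutoff (in degrees).

```python
# Illustrative only -- MDAnalysis applies these cutoffs internally.
def is_hbond(d_a_distance, d_h_a_angle, d_a_cutoff=3.5, d_h_a_angle_cutoff=150):
    """True if a donor-hydrogen-acceptor triple passes both geometric cutoffs."""
    return d_a_distance <= d_a_cutoff and d_h_a_angle >= d_h_a_angle_cutoff

print(is_hbond(2.8, 165))  # close and nearly linear -> True
print(is_hbond(3.9, 165))  # too far apart -> False
print(is_hbond(2.8, 120))  # too bent -> False
```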
tasks = []

tasks.append(
    HBA(
        universe=universe,
        donors_sel='segid A and name O1',
        hydrogens_sel='segid A and name HO3',
        acceptors_sel='segid B and name N',
        d_a_cutoff=3.5,
        d_h_a_angle_cutoff=150
    )
)
# etc.
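Rather than hand-writing an append for every H-bond family, you could generate the selection strings combinatorially. A hypothetical sketch (the segids and atom names here are made up; swap in the ones from your own system):

```python
from itertools import product

# Hypothetical selections -- replace with the atoms in your own system.
donor_hydrogen_pairs = [('name O1', 'name HO3'), ('name N1', 'name HN1')]
acceptors = ['name O', 'name N']
segid_pairs = [('A', 'B'), ('B', 'A')]

def build_task_kwargs():
    """Build the keyword arguments for each HBA instance instead of ~500 appends."""
    task_kwargs = []
    for (donor, hydrogen), acceptor, (seg_d, seg_a) in product(
            donor_hydrogen_pairs, acceptors, segid_pairs):
        task_kwargs.append(dict(
            donors_sel='segid %s and %s' % (seg_d, donor),
            hydrogens_sel='segid %s and %s' % (seg_d, hydrogen),
            acceptors_sel='segid %s and %s' % (seg_a, acceptor),
            d_a_cutoff=3.5,
            d_h_a_angle_cutoff=150,
        ))
    return task_kwargs

kwargs_list = build_task_kwargs()
print(len(kwargs_list))  # 2 * 2 * 2 = 8 combinations
# tasks = [HBA(universe=universe, **kw) for kw in kwargs_list]
```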
So all in all I had about ~500 tasks to complete. But this hbond analysis takes so long. I want my results now, and if I have this massive machine, that should be possible, right?
Well, I looked into a package called concurrent.futures. This is interesting: in there it has something called a ProcessPoolExecutor. Long story short, I can start several processes and control their output in Python using this class.
Read the docs on this one because even I’m not too sure about the consequences so take it with a grain of salt.
Import it
from concurrent.futures import ProcessPoolExecutor
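If ProcessPoolExecutor is new to you, here’s a minimal standalone demo of the submit pattern before we wire it up to the H-bond tasks (square is just a toy work function):

```python
from concurrent.futures import ProcessPoolExecutor
import os

def square(x):
    """Toy work function; each call may land in a different worker process."""
    print('worker PID: %s' % os.getpid())
    return x * x

if __name__ == '__main__':  # required on platforms that spawn worker processes
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(square, i) for i in range(5)]
        results = [f.result() for f in futures]  # blocks until each task finishes
    print(results)  # [0, 1, 4, 9, 16]
```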
I’ve had trouble keeping the results in memory in a variable, so what we’ll be doing is setting up a DataFrame to capture the data and write it to a CSV file when each process completes.
import pandas as pd

skeleton_dataframe = pd.DataFrame(
    [['NA'] * 6],
    columns=['Frame', 'Donor Index', 'Hydrogen Index',
             'Acceptor Index', 'Distance', 'Angle']
)
Alright, here we go; this is going to be a little complex. We have a list of tasks, a skeleton dataframe, and our imports. First, let us define a function called task:
def task(task_num):
    print('Executing Thread PID: %s' % os.getpid())
    print('Task Number: %s' % task_num)
    task = tasks[task_num]
    task.run(verbose=True)
    print('Number of Hbonds Found: %s' % str(len(task.hbonds)))
    if len(task.hbonds) >= 1:
        df = pd.DataFrame(
            task.hbonds, columns=skeleton_dataframe.columns
        )
        df.to_csv('hbond_data/%s.csv' % str(task_num))
This function will take the argument task_num because I just want to index into the list tasks.
def task(task_num):
That’s how we are going to tell each worker process which task to run when we execute the code.
The first print statements:
print('Executing Thread PID: %s' % os.getpid())
print('Task Number: %s' % task_num)
I want to see where the run is at currently and also the PID (in case something goes wrong).
task = tasks[task_num]
task.run(verbose=True)
HBA has the method run(), and I’m passing in the verbose argument to see the progress bar. I’ll show it running in just a sec :).
Now this next part is the most important. I haven’t been able to store any of the data in memory, and if a task finds no data the code falls over, so the if statement took me forever to figure out.
print('Number of Hbonds Found: %s' % str(len(task.hbonds)))
if len(task.hbonds) >= 1:
    df = pd.DataFrame(
        task.hbonds, columns=skeleton_dataframe.columns
    )
    df.to_csv('hbond_data/%s.csv' % str(task_num))
Cool, so we have the function ready. How do we execute it?
executor = ProcessPoolExecutor(max_workers=4)

for i in range(0, len(tasks)):
    executor.submit(task, i)
So we first initialize the executor with how many cores we want to allocate to running this analysis. For now, I just want 4 workers.
Next, we iterate through our tasks and submit them. We should see each worker print its PID and task number as it starts.
And now we are cooking!
All my h_bonds are going to be analyzed, and I can walk away and go to class.
But wait, how do I read all that data back in? That must be expensive. Don’t worry, I have that too!
# Read in data for hbonds
# -----------------------
import glob

hbond_data = []
all_files = glob.glob(current_directory + "/hbond_data/*.csv")

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    hbond_data.append(df)
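From there you can stitch the per-task frames into one big DataFrame. A sketch (combine_hbond_frames is a helper name I made up for this post):

```python
import pandas as pd

def combine_hbond_frames(frames):
    """Concatenate the per-task DataFrames into one, renumbering the rows."""
    return pd.concat(frames, ignore_index=True)

# all_hbonds = combine_hbond_frames(hbond_data)
```

ignore_index=True matters here: each CSV restarts its row index at 0, so without it the combined frame would have duplicate index labels.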
Again I want to thank Paween (Post Doc) who listens to all my wild ideas and teaches me a ton of theory.
Utility Functions:
import MDAnalysis as mda

def initiate_universe(topology_file_path, trajectory_coordinates_file_path):
    '''
    Arguments:
        topology_file_path (String): path to the topology (.psf) file.
        trajectory_coordinates_file_path (String): path to the trajectory file.
    Returns:
        universe (MDAnalysis Object): initiation of the universe
    '''
    universe = mda.Universe(topology_file_path, trajectory_coordinates_file_path)
    return universe
Environment:
Python: 3.6.9 | MDAnalysis: 1.0.0 | futures: 3.1.1