Exploring Multi-Processing for Exhaustive Hydrogen Bonding Analysis using MDAnalysis and Futures
So here I am, a first-year PhD student, and getting started has been a daunting process. But if you’re like me and have similar requirements, then this blog is for you!
Requirements:
- You’ve got a pretty powerful machine with a lot of nodes, GPUs, etc.
- You’re a Python developer (Python 3)
- You want to analyze some data, maybe hydrogen bonds of some trajectory, or anything really (the multi-processing code will work with any function)
Note: to keep the initial code cleaner, utility functions like initiate_universe are included at the bottom for folks who want them :).
So let’s initiate the universe:
# Main Entry
# ----------
import os

current_directory = os.getcwd()

# Load Topology
structure_path = os.path.join(current_directory, './data/*.psf')

# Load Trajectory
trajectory_path = os.path.join(current_directory, './data/*.xtc')

# Universe Setup
# --------------
universe = initiate_universe(structure_path, trajectory_path)
Alright, that was pretty simple. Now let’s get our MDAnalysis import.
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
Cool, so here’s what I have discovered; let’s step through it piece by piece.
I’m going to set up a list of HBA instances for the different H-bonds within my system between specific atoms. I happened to have about 12 different H-bond families I wanted to analyze between a ton of different chains in my system.
For example, I wanted to look at
O --- H --- O
N --- H --- O
O --- H --- N
N --- H --- N
And the list goes on, because there are different oxygen and nitrogen numbers along my molecules, so the list is extensive. So what I did is set up a Python list of HBA instances.
Note: you can define your own H-bond criteria, but see the figure below to get the idea of the cutoffs.
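To make the cutoffs concrete, here’s a tiny standalone check (not MDAnalysis code, just an illustration): a donor–hydrogen–acceptor triple counts as an H-bond when the donor–acceptor distance is at most d_a_cutoff (in Å) and the D–H–A angle is at least d_h_a_angle_cutoff (in degrees).

```python
# Illustrative only -- MDAnalysis applies these cutoffs internally.
def is_hbond(d_a_distance, d_h_a_angle, d_a_cutoff=3.5, d_h_a_angle_cutoff=150):
    """True if a donor-hydrogen-acceptor triple passes both geometric cutoffs."""
    return d_a_distance <= d_a_cutoff and d_h_a_angle >= d_h_a_angle_cutoff

print(is_hbond(2.8, 165))  # close and nearly linear -> True
print(is_hbond(3.9, 165))  # too far apart -> False
print(is_hbond(2.8, 120))  # too bent -> False
```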
tasks = []

tasks.append(
    HBA(
        universe=universe,
        donors_sel='segid A and name O1',
        hydrogens_sel='segid A and name HO3',
        acceptors_sel='segid B and name N',
        d_a_cutoff=3.5,
        d_h_a_angle_cutoff=150
    )
)
# etc.
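Rather than hand-writing an append for every H-bond family, you could generate the selection strings combinatorially. A hypothetical sketch (the segids and atom names here are made up; swap in the ones from your own system):

```python
from itertools import product

# Hypothetical selections -- replace with the atoms in your own system.
donor_hydrogen_pairs = [('name O1', 'name HO3'), ('name N1', 'name HN1')]
acceptors = ['name O', 'name N']
segid_pairs = [('A', 'B'), ('B', 'A')]

def build_task_kwargs():
    """Build the keyword arguments for each HBA instance instead of ~500 appends."""
    task_kwargs = []
    for (donor, hydrogen), acceptor, (seg_d, seg_a) in product(
            donor_hydrogen_pairs, acceptors, segid_pairs):
        task_kwargs.append(dict(
            donors_sel='segid %s and %s' % (seg_d, donor),
            hydrogens_sel='segid %s and %s' % (seg_d, hydrogen),
            acceptors_sel='segid %s and %s' % (seg_a, acceptor),
            d_a_cutoff=3.5,
            d_h_a_angle_cutoff=150,
        ))
    return task_kwargs

kwargs_list = build_task_kwargs()
print(len(kwargs_list))  # 2 * 2 * 2 = 8 combinations
# tasks = [HBA(universe=universe, **kw) for kw in kwargs_list]
```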
So all in all I had about ~500 tasks to complete. But this hbond analysis takes so long. I want my results now, and if I have this massive machine, that should be possible, right?
Well, I looked into a package called concurrent.futures. This is interesting: in there it has something called a ProcessPoolExecutor. Long story short, I can start several processes and control their output in Python using this class.
Read the docs on this one because even I’m not too sure about the consequences so take it with a grain of salt.
Import it
from concurrent.futures import ProcessPoolExecutor
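If ProcessPoolExecutor is new to you, here’s a minimal standalone demo of the submit pattern before we wire it up to the H-bond tasks (square is just a toy work function):

```python
from concurrent.futures import ProcessPoolExecutor
import os

def square(x):
    """Toy work function; each call may land in a different worker process."""
    print('worker PID: %s' % os.getpid())
    return x * x

if __name__ == '__main__':  # required on platforms that spawn worker processes
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(square, i) for i in range(5)]
        results = [f.result() for f in futures]  # blocks until each task finishes
    print(results)  # [0, 1, 4, 9, 16]
```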
I’ve had trouble keeping the results in memory in a variable, so what we’ll be doing is setting up a DataFrame to capture the data and write it to a CSV file when each process completes.
import pandas as pd

skeleton_dataframe = pd.DataFrame(
    [['NA'] * 6],
    columns=['Frame', 'Donor Index', 'Hydrogen Index',
             'Acceptor Index', 'Distance', 'Angle']
)
Alright, here we go; this is going to be a little complex. We have a list of tasks, a skeleton dataframe, and our imports. First, let us define a function called task:
def task(task_num):
    print('Executing Thread PID: %s' % os.getpid())
    print('Task Number: %s' % task_num)
    task = tasks[task_num]
    task.run(verbose=True)
    print('Number of Hbonds Found: %s' % str(len(task.hbonds)))
    if len(task.hbonds) >= 1:
        df = pd.DataFrame(
            task.hbonds, columns=skeleton_dataframe.columns
        )
        df.to_csv('hbond_data/%s.csv' % str(task_num))
This function will take the argument task_num because I just want to index into the list tasks.
def task(task_num):
That’s how we are going to tell each worker process which task to run when we execute the code.
The first print statements:
print('Executing Thread PID: %s' % os.getpid())
print('Task Number: %s' % task_num)
I want to see where the run is at currently and also the PID (in case something goes wrong).
task = tasks[task_num]
task.run(verbose=True)
HBA has the method run(), and I’m passing in the verbose argument to see the progress bar. I’ll show it running in just a sec :).
Now this next part is the most important. I haven’t been able to store any of the data in memory, and if a task finds no data the code falls over, so the if statement took me forever to figure out.
print('Number of Hbonds Found: %s' % str(len(task.hbonds)))
if len(task.hbonds) >= 1:
    df = pd.DataFrame(
        task.hbonds, columns=skeleton_dataframe.columns
    )
    df.to_csv('hbond_data/%s.csv' % str(task_num))
Cool, so we have the function ready. How do we execute it?
executor = ProcessPoolExecutor(max_workers=4)

for i in range(0, len(tasks)):
    executor.submit(task, i)
So we first initialize the executor with how many cores we want to allocate to running this analysis. For now, I just want 4 workers.
Next, we iterate through our tasks and submit them. We should see each worker print its PID and task number as it starts.
And now we are cooking!
All my h_bonds are going to be analyzed, and I can walk away and go to class.
But wait, how do I read all that data back in? That must be expensive. Don’t worry, I have that too!
# Read in data for hbonds
# -----------------------
import glob

hbond_data = []
all_files = glob.glob(current_directory + "/hbond_data/*.csv")

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    hbond_data.append(df)
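From there you can stitch the per-task frames into one big DataFrame. A sketch (combine_hbond_frames is a helper name I made up for this post):

```python
import pandas as pd

def combine_hbond_frames(frames):
    """Concatenate the per-task DataFrames into one, renumbering the rows."""
    return pd.concat(frames, ignore_index=True)

# all_hbonds = combine_hbond_frames(hbond_data)
```

ignore_index=True matters here: each CSV restarts its row index at 0, so without it the combined frame would have duplicate index labels.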
Again I want to thank Paween (Post Doc) who listens to all my wild ideas and teaches me a ton of theory.
Utility Functions:
import MDAnalysis as mda

def initiate_universe(topology_file_path, trajectory_coordinates_file_path):
    '''
    Arguments:
        topology_file_path (String): path to the topology (.psf) file.
        trajectory_coordinates_file_path (String): path to the trajectory file.
    Returns:
        universe (MDAnalysis Object): initiation of the universe
    '''
    universe = mda.Universe(topology_file_path, trajectory_coordinates_file_path)
    return universe
Environment:
Python: 3.6.9 | MDAnalysis: 1.0.0 | futures: 3.1.1