Lecture 005 — Understanding Chemical Data Formats and Writing Your Own Mol2 Parser Part 1.

9 min readNov 12, 2022

In cheminformatics you handle a lot of data and data can come in a variety of forms because of the way it was recorded. In the old days, they used to input the the information by a Walter Reed typewriter and you could type codeine like this into the computer:

Luckily, evolution has occurred and we have gotten better at recording chemical data into the computer. One such format that is very common in the cheminformatic world is Mol2. Reason being is that Mol2 contains the geometry of a molecule. When you look at a SMILES string let’s say of benzene:

C1=CC=CC=C1

We can’t exactly infer it’s geometry. So when we want to model the compound in different 3D computational studies it becomes more challenging because a requirement is the X, Y, Z coordinates of a molecule. There’s new file and different formats that also achieve the same thing but Mol2 is one of the most ubiquitous.

Let’s get an example going first of what a simple molecule like benzene looks like represented as a Mol2. A good spot to download some mol2 files to play with is the Pitt Quantum Repository and let’s get benzene.

To download benzene switch from simple to detailed view and scroll down the bottom until you see download mol2.

Alrighty, let’s have a look.

@<TRIPOS>MOLECULE
benzene
 12 12 0 0 0
SMALL
MULLIKEN_CHARGES

@<TRIPOS>ATOM
      1 H          -2.4786   -0.1269    0.0001 H       1  UNL1        0.1481
      2 C          -1.3922   -0.0712    0.0000 C.ar    1  UNL1       -0.1481
      3 C          -0.7579    1.1700    0.0000 C.ar    1  UNL1       -0.1481
      4 H          -1.3491    2.0832   -0.0000 H       1  UNL1        0.1481
      5 C           0.6344    1.2412   -0.0000 C.ar    1  UNL1       -0.1481
      6 H           1.1295    2.2099   -0.0001 H       1  UNL1        0.1481
      7 C           1.3922    0.0713    0.0000 C.ar    1  UNL1       -0.1481
      8 H           2.4787    0.1267    0.0000 H       1  UNL1        0.1481
      9 C           0.7577   -1.1701   -0.0000 C.ar    1  UNL1       -0.1481
     10 H           1.3493   -2.0830    0.0001 H       1  UNL1        0.1481
     11 C          -0.6344   -1.2413   -0.0000 C.ar    1  UNL1       -0.1481
     12 H          -1.1295   -2.2100   -0.0001 H       1  UNL1        0.1481
@<TRIPOS>BOND
     1    12    11    1
     2     6     5    1
     3     3     4    1
     4     3     5   ar
     5     3     2   ar
     6     5     7   ar
     7     2    11   ar
     8     2     1    1
     9     7     8    1
    10     7     9   ar
    11     9    11   ar
    12     9    10    1

Let’s have a look at what this molecule looks like. I use the software Chimera:

So we can see based on visual inspection that this mol2 file looks correct. We expect benzene to be planar. Any distortions and we might find that this particular file is not good data to work with. Be wary of that in chemistry a lot.

Now as you can see there is a lot of information in this file format so we are going to take it slowly over the course of a couple of lectures.

The first thing you notice will be the TRIPOS posted everywhere which is in reference to the Tripos Software Suite like SYBYL.

@<TRIPOS>MOLECULE
@<TRIPOS>ATOM
@<TRIPOS>BOND

Each section of the file contains different features of the molecule.

Molecule: Overall metadata about the molecule
Atom: Information about each atom in the molecule
Bond: How the atoms are connected through each other with bonds.

So before analyzing different parts of the file let’s have some fun and get into creating a robust parser to handle this data. First, let’s create the python structure for the object Mol2Parser and then tag it with a version and have one parameter be the mol2 file. We will have a print statement to verify we are parsing the mol2 correctly.

class Mol2Parser(object):

    __version__ = '0.0.1'

    def __init__(self, mol2_file):
        
        self.mol2_file = open(mol2_file, 'r')
        self.mol2_lines = self.mol2_file.readlines()
        
        # Confirm

        _ = [ print(line.strip('\n')) for line in self.mol2_lines ]

if __name__ == '__main__':
    
    mol2_parser = Mol2Parser('benzene.mol2')

So now let’s extend this into parsing it by section. We are going to use pandas dataframes to store the data.

pip install pandas

Let’s create the framework for how we think we are going to load our data. Since we know the different sections we can abstract out each method as a parser function. What I mean by that is reading in the atoms is its own dataframe, bond information, etc.

import pandas as pd

class Mol2Parser(object):

    __version__ = '0.0.1'

    def __init__(self, mol2_file):

        self.mol2_file = open(mol2_file, 'r').readlines()
        self.mol2_lines = self.mol2_file

        # Confirm

        # _ = [ print(line.strip('\n')) for line in self.mol2_lines ]

        # Initialize Dataframes

        self.atom_dataframe = None
        self.molecule_dataframe = None
        self.bond_dataframe = None

    def load_data(self):

        self.atom_dataframe = self.read_atom_information()
        self.molecule_dataframe = self.read_molecule_information()
        self.bond_dataframe = self.read_bond_information()

    def read_atom_information(self):

        pass

    def read_molecule_information(self):

        pass
    
    def read_bond_information(self):

        pass

if __name__ == '__main__':

    mol2_parser = Mol2Parser('benzene.mol2')
    mol2_parser.load_data()

So now, we have a framework where we can work with parsing each section of the file into different dataframes under the same object. You can do this with any chemical file format really since a lot of them follow this same architecture.

Let’s start with reading in the atom information:

self.atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')

We only want the data sandwiched between the ATOM and the BOND information. My preference in python are these one liners but you can expand it out if you like into a more verbose data transformation. Next we want to pass that into our read_atom_information function.

    def load_data(self):

        atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')

        self.atom_dataframe = self.read_atom_information(atom_lines)
        self.molecule_dataframe = self.read_molecule_information()
        self.bond_dataframe = self.read_bond_information()

Good so far? so let’s load the data into the frame, each column refers to different metadata on the atom. Each row is a different atom within the molecule. In benzene we expect there to be a total of 12 atoms because of 6 carbons and 6 hydrogens. Let’s also add a verbose flag since sometimes we want to see the output while debugging:

import pandas as pd

class Mol2Parser(object):

    __version__ = '0.0.1'

    def __init__(self, mol2_file, verbose=False):

        self.mol2_file = open(mol2_file, 'r').readlines()
        self.mol2_lines = self.mol2_file

        # Confirm

        # _ = [ print(line.strip('\n')) for line in self.mol2_lines ]

        # Initialize Dataframes

        self.atom_dataframe = None
        self.molecule_dataframe = None
        self.bond_dataframe = None

        # Micelleneanous Flags

        self.verbose = True

    def load_data(self):

        atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')

        self.atom_dataframe = self.read_atom_information(atom_lines)
        self.molecule_dataframe = self.read_molecule_information()
        self.bond_dataframe = self.read_bond_information()


    def read_atom_information(self, atom_lines):

        atom_indexes = [ line.split()[0] for line in atom_lines ]

        if self.verbose:
            print ('Atom Indexes: %s' % atom_indexes)

    def read_molecule_information(self):

        pass

    def read_bond_information(self):

        pass

if __name__ == '__main__':

    mol2_parser = Mol2Parser('benzene.mol2', verbose=True)
    mol2_parser.load_data()

The Atom index is the rank order of the numbering of the atoms in a molecule. Do you remember in Organic Chemistry I when you had to name a molecule by the longest carbon chain and you had to write some numbers down.

It’s the same concept here when adding to a file.

Atom Indexes: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']

Moving on to the next columns, we have the

Atom name
The x coordinate
The y coordinate
The z coordinate
The atom type (we can get to that later)
The id number of the molecule of the atom
The name of the molecule containing the atom
The charge associated with the atom
Finally an internal metadata column used by the SYBYL software suite. We won’t need that.

Let’s see what this looks like in the code:


    def read_atom_information(self, atom_lines):

        atom_indexes = [ line.split()[0] for line in atom_lines ]
        atom_names = [ line.split()[1] for line in atom_lines ]
        x_coordinates = [ line.split()[2] for line in atom_lines ]
        y_coordinates = [ line.split()[3] for line in atom_lines ]
        z_coordinates = [ line.split()[4] for line in atom_lines ]
        atom_types = [ line.split()[5] for line in atom_lines ]
        molecule_ids = [ line.split()[6] for line in atom_lines ]
        molecule_names = [ line.split()[7] for line in atom_lines ]
        charges = [ line.split()[8] for line in atom_lines ]

        if self.verbose:
            print ('Atom Indexes: %s' % atom_indexes)
            print ('Atom Names: %s' % atom_names)
            print ('X Coordinates: %s' % x_coordinates)
            print ('Y Coordinates: %s' % y_coordinates)
            print ('Z Coordinates: %s' % z_coordinates)
            print ('Atom Types: %s' % atom_types)
            print ('Molecule IDs: %s' % molecule_ids)
            print ('Molecule Names: %s' % molecule_names)
            print ('Charges: %s' % charges)

check the output:

Atom Indexes: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
Atom Names: ['H', 'C', 'C', 'H', 'C', 'H', 'C', 'H', 'C', 'H', 'C', 'H']
X Coordinates: ['-2.4786', '-1.3922', '-0.7579', '-1.3491', '0.6344', '1.1295', '1.3922', '2.4787', '0.7577', '1.3493', '-0.6344', '-1.1295']
Y Coordinates: ['-0.1269', '-0.0712', '1.1700', '2.0832', '1.2412', '2.2099', '0.0713', '0.1267', '-1.1701', '-2.0830', '-1.2413', '-2.2100']
Z Coordinates: ['0.0001', '0.0000', '0.0000', '-0.0000', '-0.0000', '-0.0001', '0.0000', '0.0000', '-0.0000', '0.0001', '-0.0000', '-0.0001']
Atom Types: ['H', 'C.ar', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H']
Molecule IDs: ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
Molecule Names: ['UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1']
Charges: ['0.1481', '-0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481']

Our Final task is to load it into the dataframe and wrap it up for Part 1, here is the full code:

 import pandas as pd

class Mol2Parser(object):

    __version__ = '0.0.1'

    def __init__(self, mol2_file, verbose=False):

        self.mol2_file = open(mol2_file, 'r').readlines()
        self.mol2_lines = self.mol2_file

        # Confirm

        # _ = [ print(line.strip('\n')) for line in self.mol2_lines ]

        # Initialize Dataframes

        self.atom_dataframe = None
        self.molecule_dataframe = None
        self.bond_dataframe = None

        # Micelleneanous Flags

        self.verbose = verbose

    def load_data(self):

        atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')

        self.atom_dataframe = self.read_atom_information(atom_lines)
        self.molecule_dataframe = self.read_molecule_information()
        self.bond_dataframe = self.read_bond_information()


    def read_atom_information(self, atom_lines):

        atom_indexes = [ line.split()[0] for line in atom_lines ]
        atom_names = [ line.split()[1] for line in atom_lines ]
        x_coordinates = [ line.split()[2] for line in atom_lines ]
        y_coordinates = [ line.split()[3] for line in atom_lines ]
        z_coordinates = [ line.split()[4] for line in atom_lines ]
        atom_types = [ line.split()[5] for line in atom_lines ]
        molecule_ids = [ line.split()[6] for line in atom_lines ]
        molecule_names = [ line.split()[7] for line in atom_lines ]
        charges = [ line.split()[8] for line in atom_lines ]

        if self.verbose:
            print ('Atom Indexes: %s' % atom_indexes)
            print ('Atom Names: %s' % atom_names)
            print ('X Coordinates: %s' % x_coordinates)
            print ('Y Coordinates: %s' % y_coordinates)
            print ('Z Coordinates: %s' % z_coordinates)
            print ('Atom Types: %s' % atom_types)
            print ('Molecule IDs: %s' % molecule_ids)
            print ('Molecule Names: %s' % molecule_names)
            print ('Charges: %s' % charges)

        self.atom_dataframe = pd.DataFrame()

        self.atom_dataframe['Atom Indexes'] = atom_indexes
        self.atom_dataframe['Atom Names'] = atom_names
        self.atom_dataframe['X Coordinates'] = x_coordinates
        self.atom_dataframe['Y Coordinates'] = y_coordinates
        self.atom_dataframe['Z Coordinatess'] = z_coordinates
        self.atom_dataframe['Atom Types'] = atom_types
        self.atom_dataframe['Molecule IDs'] = molecule_ids
        self.atom_dataframe['Molecule Names'] = molecule_names
        self.atom_dataframe['Charges'] = charges

        print (self.atom_dataframe)

    def read_molecule_information(self):

        pass

    def read_bond_information(self):

        pass

if __name__ == '__main__':

    mol2_parser = Mol2Parser('benzene.mol2', verbose=False)
    mol2_parser.load_data()

And the final output:

   Atom Indexes Atom Names X Coordinates Y Coordinates Z Coordinatess Atom Types Molecule IDs Molecule Names  Charges
0             1          H       -2.4786       -0.1269         0.0001          H            1           UNL1   0.1481
1             2          C       -1.3922       -0.0712         0.0000       C.ar            1           UNL1  -0.1481
2             3          C       -0.7579        1.1700         0.0000       C.ar            1           UNL1  -0.1481
3             4          H       -1.3491        2.0832        -0.0000          H            1           UNL1   0.1481
4             5          C        0.6344        1.2412        -0.0000       C.ar            1           UNL1  -0.1481
5             6          H        1.1295        2.2099        -0.0001          H            1           UNL1   0.1481
6             7          C        1.3922        0.0713         0.0000       C.ar            1           UNL1  -0.1481
7             8          H        2.4787        0.1267         0.0000          H            1           UNL1   0.1481
8             9          C        0.7577       -1.1701        -0.0000       C.ar            1           UNL1  -0.1481
9            10          H        1.3493       -2.0830         0.0001          H            1           UNL1   0.1481
10           11          C       -0.6344       -1.2413        -0.0000       C.ar            1           UNL1  -0.1481
11           12          H       -1.1295       -2.2100        -0.0001          H            1           UNL1   0.1481

Cool huh? Now we understand the mol2 file format a little better and what is contained. We also have our own little parser to do some analysis.

Practice Problems

1.) Write out the bond parsing information and use this blog as a guide for how to read in the bond information.

Lecture 005 — Understanding Chemical Data Formats and Writing Your Own Mol2 Parser Part 1.

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Sulstice

No responses yet