In cheminformatics you handle a lot of data and data can come in a variety of forms because of the way it was recorded. In the old days, they used to input the the information by a Walter Reed typewriter and you could type codeine like this into the computer:
Luckily, evolution has occurred and we have gotten better at recording chemical data into the computer. One such format that is very common in the cheminformatic world is Mol2. Reason being is that Mol2 contains the geometry of a molecule. When you look at a SMILES string let’s say of benzene:
C1=CC=CC=C1
We can’t exactly infer it’s geometry. So when we want to model the compound in different 3D computational studies it becomes more challenging because a requirement is the X, Y, Z coordinates of a molecule. There’s new file and different formats that also achieve the same thing but Mol2 is one of the most ubiquitous.
Let’s get an example going first of what a simple molecule like benzene looks like represented as a Mol2. A good spot to download some mol2 files to play with is the Pitt Quantum Repository and let’s get benzene.
- To download benzene switch from simple to detailed view and scroll down the bottom until you see download mol2.
Alrighty, let’s have a look.
@<TRIPOS>MOLECULE
benzene
12 12 0 0 0
SMALL
MULLIKEN_CHARGES
@<TRIPOS>ATOM
1 H -2.4786 -0.1269 0.0001 H 1 UNL1 0.1481
2 C -1.3922 -0.0712 0.0000 C.ar 1 UNL1 -0.1481
3 C -0.7579 1.1700 0.0000 C.ar 1 UNL1 -0.1481
4 H -1.3491 2.0832 -0.0000 H 1 UNL1 0.1481
5 C 0.6344 1.2412 -0.0000 C.ar 1 UNL1 -0.1481
6 H 1.1295 2.2099 -0.0001 H 1 UNL1 0.1481
7 C 1.3922 0.0713 0.0000 C.ar 1 UNL1 -0.1481
8 H 2.4787 0.1267 0.0000 H 1 UNL1 0.1481
9 C 0.7577 -1.1701 -0.0000 C.ar 1 UNL1 -0.1481
10 H 1.3493 -2.0830 0.0001 H 1 UNL1 0.1481
11 C -0.6344 -1.2413 -0.0000 C.ar 1 UNL1 -0.1481
12 H -1.1295 -2.2100 -0.0001 H 1 UNL1 0.1481
@<TRIPOS>BOND
1 12 11 1
2 6 5 1
3 3 4 1
4 3 5 ar
5 3 2 ar
6 5 7 ar
7 2 11 ar
8 2 1 1
9 7 8 1
10 7 9 ar
11 9 11 ar
12 9 10 1
Let’s have a look at what this molecule looks like. I use the software Chimera:
So we can see based on visual inspection that this mol2
file looks correct. We expect benzene to be planar. Any distortions and we might find that this particular file is not good data to work with. Be wary of that in chemistry a lot.
Now as you can see there is a lot of information in this file format so we are going to take it slowly over the course of a couple of lectures.
The first thing you notice will be the TRIPOS
posted everywhere which is in reference to the Tripos Software Suite like SYBYL.
@<TRIPOS>MOLECULE
@<TRIPOS>ATOM
@<TRIPOS>BOND
Each section of the file contains different features of the molecule.
- Molecule: Overall metadata about the molecule
- Atom: Information about each atom in the molecule
- Bond: How the atoms are connected through each other with bonds.
So before analyzing different parts of the file let’s have some fun and get into creating a robust parser to handle this data. First, let’s create the python structure for the object Mol2Parser
and then tag it with a version and have one parameter be the mol2 file. We will have a print statement to verify we are parsing the mol2 correctly.
class Mol2Parser(object):
__version__ = '0.0.1'
def __init__(self, mol2_file):
self.mol2_file = open(mol2_file, 'r')
self.mol2_lines = self.mol2_file.readlines()
# Confirm
_ = [ print(line.strip('\n')) for line in self.mol2_lines ]
if __name__ == '__main__':
mol2_parser = Mol2Parser('benzene.mol2')
So now let’s extend this into parsing it by section. We are going to use pandas dataframes to store the data.
pip install pandas
Let’s create the framework for how we think we are going to load our data. Since we know the different sections we can abstract out each method as a parser function. What I mean by that is reading in the atoms is its own dataframe, bond information, etc.
import pandas as pd
class Mol2Parser(object):
__version__ = '0.0.1'
def __init__(self, mol2_file):
self.mol2_file = open(mol2_file, 'r').readlines()
self.mol2_lines = self.mol2_file
# Confirm
# _ = [ print(line.strip('\n')) for line in self.mol2_lines ]
# Initialize Dataframes
self.atom_dataframe = None
self.molecule_dataframe = None
self.bond_dataframe = None
def load_data(self):
self.atom_dataframe = self.read_atom_information()
self.molecule_dataframe = self.read_molecule_information()
self.bond_dataframe = self.read_bond_information()
def read_atom_information(self):
pass
def read_molecule_information(self):
pass
def read_bond_information(self):
pass
if __name__ == '__main__':
mol2_parser = Mol2Parser('benzene.mol2')
mol2_parser.load_data()
So now, we have a framework where we can work with parsing each section of the file into different dataframes under the same object. You can do this with any chemical file format really since a lot of them follow this same architecture.
Let’s start with reading in the atom information:
self.atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')
We only want the data sandwiched between the ATOM
and the BOND
information. My preference in python are these one liners but you can expand it out if you like into a more verbose data transformation. Next we want to pass that into our read_atom_information
function.
def load_data(self):
atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')
self.atom_dataframe = self.read_atom_information(atom_lines)
self.molecule_dataframe = self.read_molecule_information()
self.bond_dataframe = self.read_bond_information()
Good so far? so let’s load the data into the frame, each column refers to different metadata on the atom. Each row is a different atom within the molecule. In benzene we expect there to be a total of 12 atoms because of 6 carbons and 6 hydrogens. Let’s also add a verbose flag since sometimes we want to see the output while debugging:
import pandas as pd
class Mol2Parser(object):
__version__ = '0.0.1'
def __init__(self, mol2_file, verbose=False):
self.mol2_file = open(mol2_file, 'r').readlines()
self.mol2_lines = self.mol2_file
# Confirm
# _ = [ print(line.strip('\n')) for line in self.mol2_lines ]
# Initialize Dataframes
self.atom_dataframe = None
self.molecule_dataframe = None
self.bond_dataframe = None
# Micelleneanous Flags
self.verbose = True
def load_data(self):
atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')
self.atom_dataframe = self.read_atom_information(atom_lines)
self.molecule_dataframe = self.read_molecule_information()
self.bond_dataframe = self.read_bond_information()
def read_atom_information(self, atom_lines):
atom_indexes = [ line.split()[0] for line in atom_lines ]
if self.verbose:
print ('Atom Indexes: %s' % atom_indexes)
def read_molecule_information(self):
pass
def read_bond_information(self):
pass
if __name__ == '__main__':
mol2_parser = Mol2Parser('benzene.mol2', verbose=True)
mol2_parser.load_data()
The Atom index is the rank order of the numbering of the atoms in a molecule. Do you remember in Organic Chemistry I when you had to name a molecule by the longest carbon chain and you had to write some numbers down.
It’s the same concept here when adding to a file.
Atom Indexes: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
Moving on to the next columns, we have the
- Atom name
- The x coordinate
- The y coordinate
- The z coordinate
- The atom type (we can get to that later)
- The id number of the molecule of the atom
- The name of the molecule containing the atom
- The charge associated with the atom
- Finally an internal metadata column used by the SYBYL software suite. We won’t need that.
Let’s see what this looks like in the code:
def read_atom_information(self, atom_lines):
atom_indexes = [ line.split()[0] for line in atom_lines ]
atom_names = [ line.split()[1] for line in atom_lines ]
x_coordinates = [ line.split()[2] for line in atom_lines ]
y_coordinates = [ line.split()[3] for line in atom_lines ]
z_coordinates = [ line.split()[4] for line in atom_lines ]
atom_types = [ line.split()[5] for line in atom_lines ]
molecule_ids = [ line.split()[6] for line in atom_lines ]
molecule_names = [ line.split()[7] for line in atom_lines ]
charges = [ line.split()[8] for line in atom_lines ]
if self.verbose:
print ('Atom Indexes: %s' % atom_indexes)
print ('Atom Names: %s' % atom_names)
print ('X Coordinates: %s' % x_coordinates)
print ('Y Coordinates: %s' % y_coordinates)
print ('Z Coordinates: %s' % z_coordinates)
print ('Atom Types: %s' % atom_types)
print ('Molecule IDs: %s' % molecule_ids)
print ('Molecule Names: %s' % molecule_names)
print ('Charges: %s' % charges)
check the output:
Atom Indexes: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
Atom Names: ['H', 'C', 'C', 'H', 'C', 'H', 'C', 'H', 'C', 'H', 'C', 'H']
X Coordinates: ['-2.4786', '-1.3922', '-0.7579', '-1.3491', '0.6344', '1.1295', '1.3922', '2.4787', '0.7577', '1.3493', '-0.6344', '-1.1295']
Y Coordinates: ['-0.1269', '-0.0712', '1.1700', '2.0832', '1.2412', '2.2099', '0.0713', '0.1267', '-1.1701', '-2.0830', '-1.2413', '-2.2100']
Z Coordinates: ['0.0001', '0.0000', '0.0000', '-0.0000', '-0.0000', '-0.0001', '0.0000', '0.0000', '-0.0000', '0.0001', '-0.0000', '-0.0001']
Atom Types: ['H', 'C.ar', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H', 'C.ar', 'H']
Molecule IDs: ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
Molecule Names: ['UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1', 'UNL1']
Charges: ['0.1481', '-0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481', '-0.1481', '0.1481']
Our Final task is to load it into the dataframe and wrap it up for Part 1, here is the full code:
import pandas as pd
class Mol2Parser(object):
__version__ = '0.0.1'
def __init__(self, mol2_file, verbose=False):
self.mol2_file = open(mol2_file, 'r').readlines()
self.mol2_lines = self.mol2_file
# Confirm
# _ = [ print(line.strip('\n')) for line in self.mol2_lines ]
# Initialize Dataframes
self.atom_dataframe = None
self.molecule_dataframe = None
self.bond_dataframe = None
# Micelleneanous Flags
self.verbose = verbose
def load_data(self):
atom_lines = ''.join(self.mol2_lines).split('@<TRIPOS>ATOM')[1].split('@<TRIPOS>BOND')[0].strip().split('\n')
self.atom_dataframe = self.read_atom_information(atom_lines)
self.molecule_dataframe = self.read_molecule_information()
self.bond_dataframe = self.read_bond_information()
def read_atom_information(self, atom_lines):
atom_indexes = [ line.split()[0] for line in atom_lines ]
atom_names = [ line.split()[1] for line in atom_lines ]
x_coordinates = [ line.split()[2] for line in atom_lines ]
y_coordinates = [ line.split()[3] for line in atom_lines ]
z_coordinates = [ line.split()[4] for line in atom_lines ]
atom_types = [ line.split()[5] for line in atom_lines ]
molecule_ids = [ line.split()[6] for line in atom_lines ]
molecule_names = [ line.split()[7] for line in atom_lines ]
charges = [ line.split()[8] for line in atom_lines ]
if self.verbose:
print ('Atom Indexes: %s' % atom_indexes)
print ('Atom Names: %s' % atom_names)
print ('X Coordinates: %s' % x_coordinates)
print ('Y Coordinates: %s' % y_coordinates)
print ('Z Coordinates: %s' % z_coordinates)
print ('Atom Types: %s' % atom_types)
print ('Molecule IDs: %s' % molecule_ids)
print ('Molecule Names: %s' % molecule_names)
print ('Charges: %s' % charges)
self.atom_dataframe = pd.DataFrame()
self.atom_dataframe['Atom Indexes'] = atom_indexes
self.atom_dataframe['Atom Names'] = atom_names
self.atom_dataframe['X Coordinates'] = x_coordinates
self.atom_dataframe['Y Coordinates'] = y_coordinates
self.atom_dataframe['Z Coordinatess'] = z_coordinates
self.atom_dataframe['Atom Types'] = atom_types
self.atom_dataframe['Molecule IDs'] = molecule_ids
self.atom_dataframe['Molecule Names'] = molecule_names
self.atom_dataframe['Charges'] = charges
print (self.atom_dataframe)
def read_molecule_information(self):
pass
def read_bond_information(self):
pass
if __name__ == '__main__':
mol2_parser = Mol2Parser('benzene.mol2', verbose=False)
mol2_parser.load_data()
And the final output:
Atom Indexes Atom Names X Coordinates Y Coordinates Z Coordinatess Atom Types Molecule IDs Molecule Names Charges
0 1 H -2.4786 -0.1269 0.0001 H 1 UNL1 0.1481
1 2 C -1.3922 -0.0712 0.0000 C.ar 1 UNL1 -0.1481
2 3 C -0.7579 1.1700 0.0000 C.ar 1 UNL1 -0.1481
3 4 H -1.3491 2.0832 -0.0000 H 1 UNL1 0.1481
4 5 C 0.6344 1.2412 -0.0000 C.ar 1 UNL1 -0.1481
5 6 H 1.1295 2.2099 -0.0001 H 1 UNL1 0.1481
6 7 C 1.3922 0.0713 0.0000 C.ar 1 UNL1 -0.1481
7 8 H 2.4787 0.1267 0.0000 H 1 UNL1 0.1481
8 9 C 0.7577 -1.1701 -0.0000 C.ar 1 UNL1 -0.1481
9 10 H 1.3493 -2.0830 0.0001 H 1 UNL1 0.1481
10 11 C -0.6344 -1.2413 -0.0000 C.ar 1 UNL1 -0.1481
11 12 H -1.1295 -2.2100 -0.0001 H 1 UNL1 0.1481
Cool huh? Now we understand the mol2 file format a little better and what is contained. We also have our own little parser to do some analysis.
Practice Problems
1.) Write out the bond parsing information and use this blog as a guide for how to read in the bond information.