Preparing Chemical Data in 5 minutes with Cocktail Shaker, RDKit, and Pandas.
Tonight, I find myself in need of a library of length-1 peptides, each represented by every possible 1D string and labeled with its amino acid. What is it for? Find out in the upcoming blog posts!
A visual representation:
Okay, so how do I generate this data? It turns out I wrote a package called Cocktail Shaker that can do this, but I might have to hack it a bit.
First, let's install the packages we need: cocktail-shaker and pandas.
pip install cocktail-shaker # latest version
pip install pandas # latest version
From Cocktail Shaker we are going to need two classes, PeptideBuilder and Cocktail, plus the pandas module.
from peptide_builder import PeptideBuilder
from functional_group_enumerator import Cocktail
import pandas as pd
The first thing is to make a list of strings containing the side chains of the amino acids:
natural_amino_acids = [
    "C", "CCCNC(N)=N", "CCC(N)=O", "CC(O)=O",
    "CS", "CCC(O)=O", "CCC(O)=O", "CCC(N)=O",
    "[H]", "CC1=CNC=N1", "C(CC)([H])C", "CC(C)C",
    "CCCCN", "CCSC", "CC1=CC=CC=C1", "CO",
    "C(C)([H])O", "CCC1=CNC2=C1C=CC=C2",
    "CC1=CC=C(O)C=C1", "C(C)C"
]
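Because each side-chain string doubles as the label later on, a quick sanity check of the list can be worthwhile before generating anything. The sketch below is an addition and not part of Cocktail Shaker; it only uses the standard library to count repeated strings in the list exactly as written above.

```python
from collections import Counter

# Side-chain strings copied verbatim from the list above.
natural_amino_acids = [
    "C", "CCCNC(N)=N", "CCC(N)=O", "CC(O)=O",
    "CS", "CCC(O)=O", "CCC(O)=O", "CCC(N)=O",
    "[H]", "CC1=CNC=N1", "C(CC)([H])C", "CC(C)C",
    "CCCCN", "CCSC", "CC1=CC=CC=C1", "CO",
    "C(C)([H])O", "CCC1=CNC2=C1C=CC=C2",
    "CC1=CC=C(O)C=C1", "C(C)C"
]

# Count how often each string occurs and keep the ones listed more than once.
counts = Counter(natural_amino_acids)
duplicates = sorted(s for s, n in counts.items() if n > 1)

print(len(natural_amino_acids), "entries,", len(counts), "distinct")
print("repeated:", duplicates)
```

Any string reported as repeated will be enumerated twice by the loop further down, so its label ends up with duplicated rows unless the list is deduplicated first.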
Next, declare a root dataframe where we will store all of the prepared data. It is initialized with two columns.
root_dataframe = pd.DataFrame(columns=["smiles", "amino_acid"])
The smiles column holds the 1D representation of the chemical compound, and the amino_acid column holds the string representation of the attached side chain. The amino_acid column acts as our “label” for the data.
To prepare the data we iterate through each natural amino acid and build the strings separately (I haven’t built a feature like that into Cocktail Shaker just yet). Here is the full for loop:
for amino_acid in natural_amino_acids:
    peptide_molecule = PeptideBuilder(length_of_peptide=1)
    cocktail = Cocktail(peptide_backbone=peptide_molecule,
                        ligand_library=[amino_acid],
                        enable_isomers=False)
    cocktail.shake()
    molecules = cocktail.enumerate(dimensionality='1D',
                                   enumeration_complexity='high')
    dataframe = pd.DataFrame(molecules, columns=["smiles"])
    dataframe["amino_acid"] = amino_acid
    root_dataframe = pd.concat([root_dataframe, dataframe])
The first line in the for loop builds a 1D representation of a length-1 peptide backbone, prepared for the RDKit replace function that Cocktail Shaker uses.
peptide_molecule = PeptideBuilder(length_of_peptide=1)
# NC([*:1])C(NCC(O)=O)=O
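To make the role of the `[*:1]` slot concrete, here is a deliberately naive sketch that fills it by plain string substitution. This is not how Cocktail Shaker does it (the library relies on RDKit, as noted above); the helper name `install_side_chain` is hypothetical and exists only for illustration.

```python
# Backbone string from the comment above; "[*:1]" marks the open slot.
BACKBONE = "NC([*:1])C(NCC(O)=O)=O"
SLOT = "[*:1]"


def install_side_chain(backbone_smiles: str, side_chain: str) -> str:
    """Naively drop a side-chain string into the first open slot (hypothetical helper)."""
    if SLOT not in backbone_smiles:
        raise ValueError("backbone has no open slot")
    # Replace only the first occurrence of the slot marker.
    return backbone_smiles.replace(SLOT, side_chain, 1)


alanine_peptide = install_side_chain(BACKBONE, "C")   # methyl side chain
serine_peptide = install_side_chain(BACKBONE, "CO")   # hydroxymethyl side chain

print(alanine_peptide)
print(serine_peptide)
```

Text substitution does no valence checking and can break when the side chain and the backbone reuse the same ring-closure digits, which is one reason to leave the real work to a cheminformatics toolkit.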
Next, we wrap our peptide backbone in the Cocktail object and pass in only a single amino acid.
cocktail = Cocktail(peptide_backbone=peptide_molecule,
                    ligand_library=[amino_acid],
                    enable_isomers=False)
Note: Previously, I was playing around with the enable_isomers parameter, and it is painfully slow. So if you are trying to retain the peptide's 3D chemistry within SMILES, be ready to wait!
Next, “shake” the cocktail to install the amino acid onto the backbone.
cocktail.shake()
Next, we want every representation of each molecule, because eventually these will be used to produce a variety of 2D coordinates. It is well known in the convolutional neural network community that showing a network varied portrayals of the same data helps its training and accuracy.
Super simple!
molecules = cocktail.enumerate(dimensionality='1D', enumeration_complexity='high')
The Cocktail object retains the combinations you generated before. Here we only need dimensionality='1D' representations, and we want as many different representations as possible, so we set enumeration_complexity='high'.
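For readers curious how one molecule can yield many 1D strings, the sketch below uses RDKit directly. It is not Cocktail Shaker's implementation of `enumerate`; it swaps in RDKit's `MolToSmiles` with `doRandom=True` as a standalone way to get randomized SMILES. The helper name `random_smiles` and the number of attempts are hypothetical choices.

```python
from rdkit import Chem


def random_smiles(smiles: str, attempts: int = 50) -> set:
    """Collect distinct randomized SMILES strings for one molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    # Each call may start from a different atom and walk the graph differently.
    return {
        Chem.MolToSmiles(mol, canonical=False, doRandom=True)
        for _ in range(attempts)
    }


# Backbone with a methyl side chain installed, written out by hand.
alanine_peptide = "NC(C)C(NCC(O)=O)=O"
variants = random_smiles(alanine_peptide)

# Every variant should canonicalize back to the same molecule.
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(alanine_peptide))
same_molecule = all(
    Chem.MolToSmiles(Chem.MolFromSmiles(v)) == canonical for v in variants
)

print(len(variants), "distinct strings; same molecule:", same_molecule)
```

The number of distinct strings varies from run to run because the sampling is random, so no fixed count is shown here.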
Next, save the results to a dataframe, label them with the amino acid, and join them into the master dataframe!
dataframe = pd.DataFrame(molecules, columns=["smiles"])
dataframe["amino_acid"] = amino_acid
root_dataframe = pd.concat([root_dataframe, dataframe])
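An alternative to concatenating inside the loop is to collect the per-amino-acid frames in a list and call `pd.concat` once at the end, which avoids repeatedly copying the growing dataframe. The sketch below assumes nothing about Cocktail Shaker: the `enumerated` dictionary is hypothetical placeholder data standing in for the output of `cocktail.enumerate`.

```python
import pandas as pd

# Hypothetical placeholder results: side-chain label -> strings for that label.
enumerated = {
    "C": ["NC(C)C(NCC(O)=O)=O", "CC(N)C(=O)NCC(O)=O"],
    "CO": ["NC(CO)C(NCC(O)=O)=O"],
}

frames = []
for amino_acid, molecules in enumerated.items():
    frame = pd.DataFrame(molecules, columns=["smiles"])
    frame["amino_acid"] = amino_acid  # broadcast the label to every row
    frames.append(frame)

# One concat at the end; ignore_index gives a clean 0..n-1 index.
root_dataframe = pd.concat(frames, ignore_index=True)
print(root_dataframe)
```

Without `ignore_index=True`, each frame keeps its own 0-based index, so the combined dataframe would contain repeated index values.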
An estimated 7498 unique SMILES strings across the amino acids were produced within a couple of minutes.
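A count like this can be checked with pandas once the master dataframe exists. The sketch below uses a small hypothetical dataframe with one repeated row, not the real data, to show `nunique` for the total and a `groupby` for the per-label breakdown.

```python
import pandas as pd

# Hypothetical stand-in for the master dataframe, including one repeated row.
root_dataframe = pd.DataFrame({
    "smiles": [
        "NC(C)C(NCC(O)=O)=O",
        "CC(N)C(=O)NCC(O)=O",
        "NC(C)C(NCC(O)=O)=O",
        "NC(CO)C(NCC(O)=O)=O",
    ],
    "amino_acid": ["C", "C", "C", "CO"],
})

# Distinct strings overall, and distinct strings per label.
total_unique = root_dataframe["smiles"].nunique()
per_label = root_dataframe.groupby("amino_acid")["smiles"].nunique()

print(total_unique)
print(per_label)
```

Note that this counts distinct strings, not distinct molecules: two different strings for the same compound are counted separately, which is the point of the enumeration.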
Full Code:
from peptide_builder import PeptideBuilder
from functional_group_enumerator import Cocktail
import pandas as pd
if __name__ == '__main__':
    # Same side-chain list as above (elided here)
    natural_amino_acids = ["C",... "C(C)C"]
    root_dataframe = pd.DataFrame(columns=["smiles", "amino_acid"])
    for amino_acid in natural_amino_acids:
        peptide_molecule = PeptideBuilder(length_of_peptide=1)
        cocktail = Cocktail(peptide_backbone=peptide_molecule,
                            ligand_library=[amino_acid],
                            enable_isomers=False)
        # Install the side chain, then enumerate its 1D representations
        cocktail.shake()
        molecules = cocktail.enumerate(dimensionality='1D', enumeration_complexity='high')
        # Label the strings and append them to the master dataframe
        dataframe = pd.DataFrame(molecules, columns=["smiles"])
        dataframe["amino_acid"] = amino_acid
        root_dataframe = pd.concat([root_dataframe, dataframe])
    # Write everything to disk once the loop has finished
    root_dataframe.to_hdf('data.h5', key='s', mode='w')