PDFs are very nice to look at; that’s a given. They are the main vehicle for scientific publication, but we’ve run into an issue recently: there are so many publications that the PDFs are becoming very difficult to sift through for data.
Because of this difficulty, a new field of PDF mining has sprung up. From what I’ve found, though, we should just make PDFs more robust in the first place so that they are “mineable”.
So I started MolPDF, a way to render your chemical libraries into PDF and back again. MolPDF takes a list of SMILES, renders each one to a 2D image, and places the images onto a canvas for a PDF.
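To give a feel for the "images onto a canvas" step, here is a minimal sketch of how a grid layout could be computed. This is not MolPDF's actual layout code: the `Placement` type, the `grid_placements` function, the A4 page size, the margin, and the cell size are all assumptions I picked for illustration.

```python
from typing import Iterator, NamedTuple


class Placement(NamedTuple):
    page: int   # zero-based page index
    x: float    # left edge of the cell, in points
    y: float    # bottom edge of the cell, in points


def grid_placements(
    count: int,
    page_width: float = 595.0,   # A4 width in points (assumed page size)
    page_height: float = 842.0,  # A4 height in points (assumed page size)
    margin: float = 36.0,        # hypothetical margin
    cell: float = 120.0,         # hypothetical square cell per molecule image
) -> Iterator[Placement]:
    """Yield one cell per image, filling rows left to right, top to bottom."""
    columns = int((page_width - 2 * margin) // cell)
    rows = int((page_height - 2 * margin) // cell)
    if columns < 1 or rows < 1:
        raise ValueError("cell does not fit on the page")
    per_page = columns * rows
    for index in range(count):
        page, slot = divmod(index, per_page)
        row, column = divmod(slot, columns)
        x = margin + column * cell
        # PDF coordinates start at the bottom-left corner,
        # so rows are counted down from the top of the page.
        y = page_height - margin - (row + 1) * cell
        yield Placement(page, x, y)


placements = list(grid_placements(count=30))
print(placements[0])   # first cell on the first page
print(placements[24])  # first cell on the second page
```

With these numbers a page holds 4 columns by 6 rows, so the 25th molecule spills onto a second page.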
The Code
First, let’s take a look at what imports we need:
# Global Chem Imports
# -------------------
from global_chem import GlobalChem
# MolPDF Imports
# --------------
from molpdf import MolPDF, MolPDFParser
I’m going to need the class GlobalChem from global_chem, as well as the two classes MolPDF and MolPDFParser from molpdf.
Next let’s retrieve our list of functional group SMILES:
if __name__ == '__main__':

    # Initialize Global Chem
    global_chem = GlobalChem()

    # Retrieve all Functional Groups
    molecules = list(global_chem.functional_groups_smiles.values())
If we print molecules, we now have our “chemical library”:
['CC(F)(F)F', 'C1(C2=CC=CC=C2)=CC=CC=C1', 'C1(CC=C2)=C2C=CC=C1', '[NH]1CCCC1', 'CC#CC', 'CCC(CC)CO', 'CC=C=C(C)C', 'C/N=N/C', 'CC(N(C)C)=O', 'C/C(C)=N/C', 'C/C(N(C)C)=N/C', 'CC(=O)OC(=O)C', 'C(=O)Br', 'C(=O)Cl', 'C(=O)F', 'C(=O)I', 'CC=O', 'C(=O)N', '*N', 'C12=CC=CC=C1C=C3C(C=CC=C3)=C2', 'C([N-][N+]#N)', 'C1=CC=CC=C1', 'C1=CC=C(C=C1)S', 'C1CCCCC1C1CCCCC1', 'Br', 'CCC=C', 'CCC#C', 'O=C=O', 'C(=O)O', 'Cl', 'COCCl', 'C1=CC=C1', 'C1CCC1', 'C1CCCCCC1', 'C1CCCCC1', 'C1=CCCC=C1', 'C1=CCC=CC1', 'C=1CCCCC=1', 'C1CCCC1', 'C1=CCC=C1', 'C1CC1', 'C1=CC1', '[2H][CH2]C', 'COC', 'CCOCC', 'CC(C)OC(C)C', 'C&1&1&1&1', 'C=[N+]=[N-]', '[NH4+].[NH4+].[O-]S(=O)(=O)[S-]', 'N', 'CC', 'CCS', 'CCO', 'C=C', 'COC', 'C(=O)OC', 'F', 'C=O', 'C1OC=CC=1', 'C&1&1&1', 'C#N', '[OH-]', 'NO', 'C1=CC=CC(CCC2)=C12', 'CC(=O)C', 'C', 'CS', 'CC(OC)=O', 'CN1CCCC1', 'CC(C)(C)OC', 'C12=CC=CC=C1C=CC=C2', '[N+](=O)[O-]', 'C[N+]([O-])=O', 'C12=CC=CC1=CC=C2', 'N1CC2CCCC2CC1', 'OC1CCCCC1', 'C=1(C=CC=CC1)', 'c1ccccc1C&1&1', 'O', 'N', 'CC(C)=O', 'CCC=O', 'CC=C', 'CC#C', 'N1CCCCC1', 'O=N1CCCCC1', 'NC', 'C12(CCCCC1)CCCCC2', 'S(=O)(=O)', 'C[N+](C)(C)C', 'S', 'OS(=O)(=S)O', 'CN(C)C', 'C1(C=CC=C2)=C2C(C=CC=C3)=C3C4=C1C=CC=C4']
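Looking at the printed list, a couple of strings show up more than once ('COC' and 'N', for example). If you would rather render each string only once, a small order-preserving de-duplication could look like the sketch below. The helper name `unique_in_order` is my own and is not part of either package.

```python
def unique_in_order(smiles):
    """Drop exact duplicate strings, keeping the first occurrence of each."""
    # dict keys keep insertion order, so this preserves the original ordering.
    return list(dict.fromkeys(smiles))


# A small sample taken from the list above
sample = ['COC', 'CCOCC', 'N', 'COC', 'N', 'CC(=O)C', 'CC(C)=O']
unique = unique_in_order(sample)
print(unique)
```

Note that this only compares strings. 'CC(=O)C' and 'CC(C)=O' both describe acetone, but they are written differently, so both survive; catching that would need canonical SMILES from a cheminformatics toolkit.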
Let’s now initialize our document and set a title:
document = MolPDF(name='functional_groups.pdf')
document.add_title('Functional Groups Global Chem')
document.add_spacer() # add a little space
And now let’s generate the document with our array of molecules:
# Generate the document
document.generate(smiles=molecules, include_failed_smiles=True)
The include_failed_smiles parameter controls what happens to SMILES that fail to render into a 2D image. They are labeled as failed in the data, and if you would like to include them in the generated PDF, then by all means set it to True :).
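To illustrate the idea behind a flag like this, here is a rough sketch of splitting a library into rendered and failed SMILES. It is not how MolPDF does it internally: `partition_smiles`, `select_for_pdf`, and `toy_can_render` are hypothetical names, and the toy rule (reject anything containing '&') is only a placeholder, not a chemistry check.

```python
from typing import Callable, Iterable, List, Tuple


def partition_smiles(
    smiles: Iterable[str],
    can_render: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split SMILES into (rendered, failed) according to the given check."""
    rendered: List[str] = []
    failed: List[str] = []
    for entry in smiles:
        (rendered if can_render(entry) else failed).append(entry)
    return rendered, failed


def select_for_pdf(
    smiles: Iterable[str],
    can_render: Callable[[str], bool],
    include_failed_smiles: bool = False,
) -> List[str]:
    """Return the entries that would go into the document."""
    rendered, failed = partition_smiles(smiles, can_render)
    return rendered + failed if include_failed_smiles else rendered


def toy_can_render(entry: str) -> bool:
    # Placeholder rule for illustration only. A real check would ask a
    # cheminformatics toolkit whether it can parse and draw the string.
    return '&' not in entry


library = ['CCO', 'C&1&1&1', 'C#N']
rendered, failed = partition_smiles(library, toy_can_render)
print(rendered, failed)
print(select_for_pdf(library, toy_can_render, include_failed_smiles=True))
```

Passing the check in as a function keeps the bookkeeping separate from the chemistry, so the placeholder can be swapped for a real parser without touching the rest.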
Our end result:
And our failing SMILES:
So that gets chemical libraries into a PDF, but how do we make it “mineable”? Simple! We pass metadata into the PDF and store variables and other pertinent information there.
MolPDF stores the SMILES data in the Document.properties, which then makes it easy to mine. So I wrote the other half, MolPDFParser:
# Read PDF
document = MolPDFParser('functional_groups.pdf')
molecules = document.extract_smiles()
And our output:
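To make the "store it as metadata, read it back" idea concrete, here is a toy round trip that uses a plain dict as a stand-in for the PDF's document properties. The key name `smiles` and the JSON encoding are my assumptions for this sketch; the format MolPDF actually writes may differ.

```python
import json

SMILES_KEY = "smiles"  # hypothetical property name, not MolPDF's real key


def pack_smiles(smiles):
    """Encode a list of SMILES as a single string-valued property."""
    return {SMILES_KEY: json.dumps(list(smiles))}


def unpack_smiles(properties):
    """Decode the SMILES list from the properties, or return [] if absent."""
    raw = properties.get(SMILES_KEY)
    return json.loads(raw) if raw is not None else []


# A few entries from the functional groups library above
library = ['C/N=N/C', 'C([N-][N+]#N)', '[NH4+].[NH4+].[O-]S(=O)(=O)[S-]']

properties = pack_smiles(library)      # what a writer would attach to the PDF
restored = unpack_smiles(properties)   # what a parser would read back
print(restored == library)
```

I went with JSON here because SMILES are full of brackets, slashes, dots, and other punctuation, and JSON takes care of the quoting so no custom delimiter is needed.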
If you have any thoughts on either package, or design and feature ideas, please email me or find me on GitHub!
global-chem package: https://github.com/Sulstice/global-chem
molpdf package: https://github.com/Sulstice/molpdf
Full Code
# Global Chem Imports
# -------------------
from global_chem import GlobalChem
# MolPDF Imports
# --------------
from molpdf import MolPDF, MolPDFParser
if __name__ == '__main__':

    # Initialize Global Chem
    global_chem = GlobalChem()

    # Retrieve all Functional Groups
    molecules = list(global_chem.functional_groups_smiles.values())

    document = MolPDF(name='functional_groups.pdf')
    document.add_title('Functional Groups Global Chem')
    document.add_spacer()

    # Generate the document
    document.generate(smiles=molecules, include_failed_smiles=True)

    # Read PDF
    document = MolPDFParser('functional_groups.pdf')
    molecules = document.extract_smiles()