SMILES to PDF and back again: can we improve publication mining?

Jun 19, 2020

PDFs are very nice to look at; that’s a given. The PDF is the main vehicle for scientific publication, but we’ve run into an issue recently: there are so many publications that the PDFs are becoming very difficult to sift through for data.

Because of this difficulty, a new field of PDF mining has sprung up. My view is that we should make PDFs robust enough to be “mineable” in the first place.

So I started MolPDF, a way to render your chemical libraries into PDF and back again. MolPDF takes a list of SMILES, renders them as 2D images, and places them onto a canvas for a PDF.

Example PDF of Global Chem’s Functional Groups

The Code

First, let’s take a look at what imports we need:

# Global Chem Imports
# -------------------
from global_chem import GlobalChem

# MolPDF Imports
# --------------
from molpdf import MolPDF, MolPDFParser

I’m going to need the GlobalChem class from global_chem, as well as the two classes MolPDF and MolPDFParser from molpdf.

Next, let’s retrieve our list of functional group SMILES:

if __name__ == '__main__':

    # Initialize Global Chem
    global_chem = GlobalChem()

    # Retrieve all Functional Groups
    molecules = list(global_chem.functional_groups_smiles.values())

If we print molecules, we now have our “chemical library”:

['CC(F)(F)F', 'C1(C2=CC=CC=C2)=CC=CC=C1', 'C1(CC=C2)=C2C=CC=C1', '[NH]1CCCC1', 'CC#CC', 'CCC(CC)CO', 'CC=C=C(C)C', 'C/N=N/C', 'CC(N(C)C)=O', 'C/C(C)=N/C', 'C/C(N(C)C)=N/C', 'CC(=O)OC(=O)C', 'C(=O)Br', 'C(=O)Cl', 'C(=O)F', 'C(=O)I', 'CC=O', 'C(=O)N', '*N', 'C12=CC=CC=C1C=C3C(C=CC=C3)=C2', 'C([N-][N+]#N)', 'C1=CC=CC=C1', 'C1=CC=C(C=C1)S', 'C1CCCCC1C1CCCCC1', 'Br', 'CCC=C', 'CCC#C', 'O=C=O', 'C(=O)O', 'Cl', 'COCCl', 'C1=CC=C1', 'C1CCC1', 'C1CCCCCC1', 'C1CCCCC1', 'C1=CCCC=C1', 'C1=CCC=CC1', 'C=1CCCCC=1', 'C1CCCC1', 'C1=CCC=C1', 'C1CC1', 'C1=CC1', '[2H][CH2]C', 'COC', 'CCOCC', 'CC(C)OC(C)C', 'C&1&1&1&1', 'C=[N+]=[N-]', '[NH4+].[NH4+].[O-]S(=O)(=O)[S-]', 'N', 'CC', 'CCS', 'CCO', 'C=C', 'COC', 'C(=O)OC', 'F', 'C=O', 'C1OC=CC=1', 'C&1&1&1', 'C#N', '[OH-]', 'NO', 'C1=CC=CC(CCC2)=C12', 'CC(=O)C', 'C', 'CS', 'CC(OC)=O', 'CN1CCCC1', 'CC(C)(C)OC', 'C12=CC=CC=C1C=CC=C2', '[N+](=O)[O-]', 'C[N+]([O-])=O', 'C12=CC=CC1=CC=C2', 'N1CC2CCCC2CC1', 'OC1CCCCC1', 'C=1(C=CC=CC1)', 'c1ccccc1C&1&1', 'O', 'N', 'CC(C)=O', 'CCC=O', 'CC=C', 'CC#C', 'N1CCCCC1', 'O=N1CCCCC1', 'NC', 'C12(CCCCC1)CCCCC2', 'S(=O)(=O)', 'C[N+](C)(C)C', 'S', 'OS(=O)(=S)O', 'CN(C)C', 'C1(C=CC=C2)=C2C(C=CC=C3)=C3C4=C1C=CC=C4']
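Because the list comes from dictionary values, the same SMILES string can show up under more than one name; in the output above, 'COC' and 'N' each appear twice. Here is a minimal sketch for spotting such repeats before rendering, using only the standard library. The helper name find_duplicates is made up for this example and is not part of either package.

```python
from collections import Counter


def find_duplicates(smiles_list):
    """Return the SMILES strings that occur more than once, with their counts."""
    counts = Counter(smiles_list)
    return {smiles: n for smiles, n in counts.items() if n > 1}


# A small slice of the library printed above
sample = ['COC', 'CCOCC', 'N', 'CC', 'COC', 'N', 'CC(C)=O', 'CC(=O)C']
duplicates = find_duplicates(sample)
print(duplicates)
```

This compares strings only: 'CC(C)=O' and 'CC(=O)C' both describe acetone, but they count as different entries unless the SMILES are canonicalized first.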

Let’s now initialize our document and set a title:

document = MolPDF(name='functional_groups.pdf')
document.add_title('Functional Groups Global Chem')
document.add_spacer() # add a little space

And now let’s generate the document with our array of molecules:

# Generate the document
document.generate(smiles=molecules, include_failed_smiles=True)

The include_failed_smiles parameter controls what happens to SMILES that fail to render into a 2D image. They are labeled as failed in the data, and if you would like to include them in the generated PDF, then by all means set it to True :).

Our end result:
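To make the idea concrete, here is a rough sketch of how a list could be split into SMILES that render and SMILES that fail. This is not MolPDF’s internal code. The split_by_renderability helper and the toy_can_render check are hypothetical; in a real pipeline the check might wrap a cheminformatics toolkit (for example, RDKit’s Chem.MolFromSmiles returns None for a string it cannot parse).

```python
def split_by_renderability(smiles_list, can_render):
    """Partition SMILES into (rendered, failed) using a caller-supplied check."""
    rendered, failed = [], []
    for smiles in smiles_list:
        try:
            ok = bool(can_render(smiles))
        except Exception:
            # Treat any error raised by the check as a failure to render
            ok = False
        (rendered if ok else failed).append(smiles)
    return rendered, failed


def toy_can_render(smiles):
    """Stand-in check (assumption, for illustration only): reject '&' notation."""
    return '&' not in smiles


rendered, failed = split_by_renderability(
    ['CC(F)(F)F', 'C&1&1&1&1', 'C1=CC=CC=C1', 'c1ccccc1C&1&1'],
    toy_can_render,
)
print('rendered:', rendered)
print('failed:', failed)
```

Keeping the failed list around, rather than silently dropping it, is what makes an option like include_failed_smiles possible: the failures can still be written into the document.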

And our failing SMILES:

So that gets chemical libraries into a PDF, but how do we make this “mineable”? Simple! We pass metadata into the PDF and store variables and other pertinent information there.

MolPDF stores the SMILES data in the Document.properties, which makes it easy to mine. So I wrote the other half, MolPDFParser:
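As a rough illustration of that idea, the sketch below packs a list of SMILES into a single metadata string and unpacks it again. It does not touch a real PDF, and the JSON encoding and the 'SMILES' key are assumptions made for this example, not MolPDF’s actual storage format.

```python
import json


def pack_smiles(smiles_list):
    """Serialize SMILES into a metadata-style dict of strings (hypothetical layout)."""
    return {'SMILES': json.dumps(smiles_list)}


def unpack_smiles(metadata):
    """Recover the SMILES list from a dict produced by pack_smiles."""
    return json.loads(metadata.get('SMILES', '[]'))


metadata = pack_smiles(['C/N=N/C', 'F/C=C\\F', 'C#N'])
recovered = unpack_smiles(metadata)
print(recovered)
```

JSON is used here because SMILES can contain characters such as a backslash or a period, which make ad hoc delimiters fragile.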

# Read PDF
document = MolPDFParser('functional_groups.pdf')
molecules = document.extract_smiles()

And our output:
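The extracted list should match the library that went in. A quick way to confirm the round trip is to compare the two lists. The sketch below assumes extract_smiles() returns a list of strings, which is not spelled out here; round_trip_report is a hypothetical helper, and the two lists are hard-coded stand-ins so the snippet runs without a PDF.

```python
from collections import Counter


def round_trip_report(original, extracted):
    """Compare two SMILES lists as multisets and report what is missing or extra."""
    before, after = Counter(original), Counter(extracted)
    return {
        'missing': sorted((before - after).elements()),
        'extra': sorted((after - before).elements()),
    }


original = ['CC', 'CCO', 'COC', 'COC']
extracted = ['COC', 'CCO', 'CC']  # stand-in for document.extract_smiles()
report = round_trip_report(original, extracted)
print(report)
```

Comparing as multisets rather than sets means a SMILES that was written twice but read back once still shows up as missing.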

If you have any thoughts on either package, or design or feature ideas, please email me or find me on GitHub!

global-chem package: https://github.com/Sulstice/global-chem

molpdf package: https://github.com/Sulstice/molpdf

Full Code

# Global Chem Imports
# -------------------
from global_chem import GlobalChem

# MolPDF Imports
# --------------
from molpdf import MolPDF, MolPDFParser

if __name__ == '__main__':

    # Initialize Global Chem
    global_chem = GlobalChem()

    # Retrieve all Functional Groups
    molecules = list(global_chem.functional_groups_smiles.values())

    document = MolPDF(name='functional_groups.pdf')
    document.add_title('Functional Groups Global Chem')
    document.add_spacer()

    # Generate the document
    document.generate(smiles=molecules, include_failed_smiles=True)

    # Read PDF
    document = MolPDFParser('functional_groups.pdf')
    molecules = document.extract_smiles()

Written by Sulstice
