Converting your Knowledge Graph TSV/CSV to a Resource Description Framework (RDF) For Interoperability

Sulstice
3 min read · Apr 19, 2023

Global-Chem is a record of common chemical names and their respective SMILES. It is a pure Python knowledge graph and can export to CSV/TSV.

Knowledge graphs have been around for a long time, and one very common legacy format is the Resource Description Framework (RDF). Because RDF has been around for so long, a lot of manufacturing firms have graph data built on it. A big movement that is occurring is the exchange of Open Manufacturing Protocols between industries.

Chemical data can connect RDF graphs to one another through the raw materials node. That makes it worth the work to make our graph more interoperable with others. Our imports:

  • rdflib: generates the RDF. There are other libraries, but I found this one the easiest to implement.
  • pandas: reads the CSV/TSV for easy data manipulation.

import rdflib
import pandas as pd

Now we read the TSV into a dataframe. The predicate is the category that the node is assigned to. For example, RingsInDrugs holds the most common ring systems in drugs, and it belongs to the Medicinal Chemistry category in our graph.

df = pd.read_csv(
    'global_chem.tsv',
    delimiter='\t',
    header=None,
    names=['name', 'smiles', 'node', 'predicate', 'path']
)
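
To make the column layout concrete, here is a minimal sketch that reads a made-up three-row sample from memory instead of from `global_chem.tsv`. The rows, the category labels, and the `placeholder.path` values are assumptions for illustration only, not real Global-Chem output. It also shows that an empty category field comes back as NaN.

```python
import io

import pandas as pd

# Hypothetical sample in the same five-column layout as the article's TSV.
# The category and path values are placeholders, not real Global-Chem data.
sample_tsv = (
    "benzene\tc1ccccc1\trings_in_drugs\tmedicinal_chemistry\tplaceholder.path\n"
    "tetrahydrofuran\tC1CCOC1\trings_in_drugs\tmedicinal_chemistry\tplaceholder.path\n"
    "ethanol\tCCO\tcommon_organic_solvents\t\tplaceholder.path\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    delimiter="\t",
    header=None,
    names=["name", "smiles", "node", "predicate", "path"],
)

# The empty category field on the last row is read as NaN.
missing = sample_df["predicate"].isna()
print(sample_df[["name", "node", "predicate"]])
print(missing.tolist())
```

Rows with a NaN category are the ones that need a fallback label before they can be turned into a URI.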

First we create the graph where each chemical name is connected to a node.

Depiction:

benzene ------- RingsInDrugs
novichok-5 ------ NerveToxicAgents

Here’s the code:

# Create the graph object which holds the triples
graph = rdflib.Graph()

for _, row in df.iterrows():
    s = rdflib.URIRef(f'#/{row["name"]}')
    p = rdflib.URIRef("#connectsTo")
    o = rdflib.URIRef(f'#/{row["node"]}')
    graph.add((s, p, o))
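
Some chemical names contain spaces, brackets, or commas, which are not valid in a URI reference, and rdflib may log a warning when it sees them. If strict URI validity matters for the systems you exchange data with, one option is to percent-encode each label first. The sketch below uses only the standard library; `to_fragment` is a hypothetical helper name, not part of Global-Chem or rdflib.

```python
from urllib.parse import quote


def to_fragment(label: str) -> str:
    """Hypothetical helper: build a '#/...' reference with unsafe characters escaped."""
    # safe="" also escapes '/', so a label can never add extra path segments.
    return "#/" + quote(label, safe="")


print(to_fragment("carbon monosulfide"))  # '#/carbon%20monosulfide'
print(to_fragment("benzene"))             # '#/benzene'
```

The result can be passed to `rdflib.URIRef` in place of the raw f-string. The trade-off is that the encoded names are harder to read in the serialized file.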

Then connect each node to its category (the predicate column).

Depiction:

benzene ------- RingsInDrugs ------- Medicinal Chemistry
novichok-5 ------ NerveToxicAgents ------ War

Code:

for _, row in df.iterrows():
    predicate = row['predicate']

    # Rows without a category are read as NaN; give them a fallback label
    if pd.isna(predicate):
        predicate = 'miscellaneous'

    s = rdflib.URIRef(f'#/{row["node"]}')
    p = rdflib.URIRef("#connectsTo")
    o = rdflib.URIRef(f'#/{predicate}')
    graph.add((s, p, o))

Then connect every category to the root node:

Depiction:

benzene ------- RingsInDrugs ------- Medicinal Chemistry ------ Global-Chem
novichok-5 ------ NerveToxicAgents ------ War ------ Global-Chem

Code:

for _, row in df.iterrows():
    predicate = row['predicate']

    # Same fallback label as in the previous loop so the two levels line up
    if pd.isna(predicate):
        predicate = 'miscellaneous'

    s = rdflib.URIRef(f'#/{predicate}')
    p = rdflib.URIRef("#connectsTo")
    o = rdflib.URIRef('#/global-chem')
    graph.add((s, p, o))

And then we generate the RDF file in XML format.

# The output is RDF/XML, so use an .rdf extension rather than .ttl (Turtle)
graph.serialize(destination='graph.rdf', format='application/rdf+xml')

And our final product now looks like this (the listing is cut off after the first few entries):

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:ns1="#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="#/carbon monosulfide">
    <ns1:connectsTo rdf:resource="#/interstellar_space"/>
  </rdf:Description>
  <rdf:Description rdf:about="#/(6,6′-dimethoxy-[1,1′-biphenyl]-2,2′-diyl)bis(bis(3,5-di-tertbutylphenyl)phosphine)">
    <ns1:connectsTo rdf:resource="#/nickel_ligands"/>
  </rdf:Description>
  <rdf:Description rdf:about="#/hemicellulose">
    <ns1:connectsTo rdf:resource="#/constituents_of_cannabis_sativa"/>
  </rdf:Description>
  <rdf:Description rdf:about="#/tetrahydrofuran">
    <ns1:connectsTo rdf:resource="#/common_r_group_replacements"/>
    <ns1:connectsTo rdf:resource="#/rings_in_drugs"/>
    <ns1:connectsTo rdf:resource="#/common_organic_solvents"/>

I hope to expand on this soon, but it is very cool to start connecting the pieces together. I hope this helps others convert their graphs into a standard format.
