Chemical knowledge graphs are a staple of cheminformatics: they serve as large data sources for chemical analysis and for finding molecular compounds of interest. I created Global-Chem, an open source chemical knowledge graph that contains only the most common chemicals of interest, as selected by the community.
The data was initially organized in a category tree and overall looks like Figure 1:
One of the things I wanted to do was move away from a relational database model to a graph model. The reason is that chemicals are pretty finicky: their relationships change frequently as our understanding evolves, which is evident in my own software.
- Relational databases deal with pre-defined, fixed relationships, and a lot of them. If you wanted to model real life, a vending machine would be one table, and all the items for sale inside it would be tied to that machine through a predefined relationship.
- Graph databases deal with relationships between individual nodes, with no defined table structure. For the vending machine, each item would be its own node, and its relationship to the machine would be defined individually.
Take a look at Figure 2, which shows the hierarchy of each model. The relational database has predefined relationships, which maximize efficiency.
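To make the comparison concrete, here is a minimal sketch of the vending machine in both styles. It is only an illustration: the table names, the relationship labels `SELLS` and `PAIRS_WITH`, and the items are all made up, and the graph side uses a plain list of edges rather than a real graph database.

```python
import sqlite3

# Relational model: tables and foreign keys are fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machine (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE item ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "machine_id INTEGER REFERENCES machine(id))"
)
conn.execute("INSERT INTO machine VALUES (1, 'vending machine')")
conn.executemany(
    "INSERT INTO item (name, machine_id) VALUES (?, ?)",
    [("chips", 1), ("soda", 1)],
)
relational_items = [
    row[0]
    for row in conn.execute(
        "SELECT item.name FROM item "
        "JOIN machine ON item.machine_id = machine.id "
        "WHERE machine.name = ? ORDER BY item.name",
        ("vending machine",),
    )
]
conn.close()

# Graph model: every thing is a node and every relationship is stored
# individually as an edge (source, relation, target).
nodes = {"vending machine", "chips", "soda"}
edges = [
    ("vending machine", "SELLS", "chips"),
    ("vending machine", "SELLS", "soda"),
]
# A new kind of relationship is just one more edge, with no schema change.
edges.append(("chips", "PAIRS_WITH", "soda"))

graph_items = sorted(
    target
    for source, relation, target in edges
    if source == "vending machine" and relation == "SELLS"
)

print(relational_items)
print(graph_items)
```

Both queries return the same items. The difference is that the `PAIRS_WITH` relationship would have required a new table or column in the relational version.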
The first step in generating a chemical knowledge graph was to define each point as a node in my graph: each academic resource, and each category it belonged to, became a node. The relationships between them, which are the edges, remained unweighted.
Relationships between nodes in Global-Chem are stored as a parent/child network. The data structure of the animals node looks like this:
'animals': {'children': ['snakes'],
'name': 'animals',
'node_value': <global_chem.global_chem.Node object at 0x7ff3a4ff6650>,
'parents': ['']},
Here the child node of animals is the snakes category. Animals is the classifier under which these chemicals exist. We stick to general terminology.
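A parent/child dictionary like this can be walked in either direction. The sketch below assumes a small network in the same shape as the entry above, with the `node_value` objects left out. The `example_resource` node and both helper functions are hypothetical and not part of Global-Chem.

```python
# Illustrative network shaped like the entry above ('node_value' omitted).
# 'example_resource' is a made-up leaf node.
network = {
    "animals": {"name": "animals", "children": ["snakes"], "parents": [""]},
    "snakes": {
        "name": "snakes",
        "children": ["example_resource"],
        "parents": ["animals"],
    },
    "example_resource": {
        "name": "example_resource",
        "children": [],
        "parents": ["snakes"],
    },
}


def descendants(network, name):
    """Return every node reachable from `name` by following 'children'."""
    found = []
    stack = list(network[name]["children"])
    while stack:
        child = stack.pop()
        if child not in found:
            found.append(child)
            stack.extend(network.get(child, {}).get("children", []))
    return found


def path_to_root(network, name):
    """Follow the first parent of each node until no parent is left.

    Assumes the network is a tree, so there are no cycles.
    """
    path = [name]
    while True:
        # An empty string marks "no parent", as in the animals entry.
        parents = [p for p in network[path[-1]]["parents"] if p]
        if not parents:
            return path
        path.append(parents[0])


print(descendants(network, "animals"))
print(path_to_root(network, "example_resource"))
```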
Connecting To NetworkX
My chemical knowledge graph data structure would only be useful if it was compatible with other platforms, but which standard is the best one to stick to? I started with NetworkX, a popular Python package, and connected my data structure to theirs. This is easily done by initializing the graph, then adding all the nodes and each edge with its unweighted relationship. The result of the port can be seen in Figure 3, where the Global-Chem network is visualized with pyvis:
from global_chem import GlobalChem
from global_chem_extensions import GlobalChemExtensions
from pyvis.network import Network

# Build the Global-Chem parent/child network.
gc = GlobalChem()
gc.build_global_chem_network()
network = gc.network

# Convert the network into a NetworkX graph.
cheminformatics = GlobalChemExtensions().cheminformatics()
networkx_graph = cheminformatics.convert_to_networkx(network)

# Render the NetworkX graph as an interactive HTML page with pyvis.
net = Network(notebook=True)
net.from_nx(networkx_graph)
net.save_graph('example.html')
This shows that our graph is interoperable with NetworkX.
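For readers who want to port their own parent/child dictionary, the steps described above (initialize the graph, add the nodes, add each unweighted edge) might look roughly like this. This is a sketch and not the actual `convert_to_networkx` implementation. The helper names and the sample network are hypothetical, and `networkx` is imported lazily because it is a third-party package.

```python
# Illustrative network; 'example_resource' is a made-up leaf node.
network = {
    "animals": {"children": ["snakes"], "parents": [""]},
    "snakes": {"children": ["example_resource"], "parents": ["animals"]},
    "example_resource": {"children": [], "parents": ["snakes"]},
}


def network_to_edges(network):
    """Return (parent, child) pairs from a parent/child dictionary."""
    edges = []
    for name, entry in network.items():
        for child in entry["children"]:
            edges.append((name, child))
    return edges


def to_networkx(network):
    """Build a directed NetworkX graph with unweighted edges."""
    import networkx as nx  # third-party: pip install networkx

    graph = nx.DiGraph()
    graph.add_nodes_from(network)  # one node per dictionary key
    graph.add_edges_from(network_to_edges(network))  # no weight attribute
    return graph


print(network_to_edges(network))
```

Calling `to_networkx(network)` returns a graph that can be passed to `net.from_nx(...)` in the same way as the Global-Chem graph above.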
Connecting To Neo4j
The idea behind connecting to Neo4j was to make our graph database available to cloud-based infrastructure pipelines hosted on platforms like Amazon Web Services or Microsoft’s Azure. This comes back to machine learning operations, where we can create model pipelines stemming from this data. Neo4j acts as a great adapter for interoperability, so I chose it; its implementation can be found in greater detail here.
Neo4j loads the database and categorizes it according to the relationships we have in Global-Chem, using the file structure:
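As a rough idea of what pushing a parent/child network into Neo4j could look like, here is a minimal sketch using the official `neo4j` Python driver. It is not the Global-Chem implementation: the `Category` label, the `HAS_CHILD` relationship type, the helper names, and the sample network are all assumptions made for illustration.

```python
# Cypher that creates (or reuses) both nodes and the edge between them.
# 'Category' and 'HAS_CHILD' are hypothetical names.
MERGE_CHILD = (
    "MERGE (p:Category {name: $parent}) "
    "MERGE (c:Category {name: $child}) "
    "MERGE (p)-[:HAS_CHILD]->(c)"
)

# Illustrative network in the parent/child shape used by Global-Chem.
network = {
    "animals": {"children": ["snakes"], "parents": [""]},
    "snakes": {"children": [], "parents": ["animals"]},
}


def cypher_parameters(network):
    """Return one parameter dictionary per parent/child relationship."""
    return [
        {"parent": name, "child": child}
        for name, entry in network.items()
        for child in entry["children"]
    ]


def load_into_neo4j(uri, user, password, network):
    """Run one MERGE statement per relationship against a Neo4j server."""
    from neo4j import GraphDatabase  # third-party: pip install neo4j

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for parameters in cypher_parameters(network):
                session.run(MERGE_CHILD, parameters)
    finally:
        driver.close()


print(cypher_parameters(network))
```

Using `MERGE` rather than `CREATE` means the load can be re-run without duplicating nodes or relationships.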
One of the major problems I have is mapping the same name to the molecule. I decided to try out Neo4j Bloom, which includes semantic querying based on how your data is organized:
This allows me to find things like how many names are unique, and to clean my data. It also allows others to find problems in my data more quickly.
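The same kind of check can be approximated outside of Bloom. The sketch below is a plain Python stand-in, not a Bloom query. It assumes the data can be flattened into (category, name, SMILES) rows, and the rows shown are made up. It flags names that map to more than one SMILES string, for example when one category stores benzene in Kekulé form and another in aromatic form.

```python
from collections import defaultdict

# Made-up rows: (category, name, SMILES). Category names are hypothetical.
entries = [
    ("category_a", "benzene", "C1=CC=CC=C1"),
    ("category_b", "benzene", "c1ccccc1"),
    ("category_a", "toluene", "Cc1ccccc1"),
]


def conflicting_names(entries):
    """Return names that appear with more than one distinct SMILES string."""
    smiles_by_name = defaultdict(set)
    for _category, name, smiles in entries:
        smiles_by_name[name].add(smiles)
    return {
        name: sorted(smiles)
        for name, smiles in smiles_by_name.items()
        if len(smiles) > 1
    }


unique_names = {name for _category, name, _smiles in entries}

print(len(unique_names))
print(conflicting_names(entries))
```

This compares SMILES as raw strings, so it flags notation differences as well as genuine conflicts. Canonicalizing the SMILES first would separate the two cases.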
Interoperability helps me maintain the structure of my own knowledge graph as I adhere to the standards of others. I hope my pieces of code help others create their own graphs and figure new things out.
Happy Cheminformatics!