It’s time to start embedding our common chemical language from GlobalChem into ChatGPT so folks can query the data effectively. Our goal is to extend our Discord application to pull data from GlobalChem.
So let’s get into it. First, let’s handle our imports. We will be using LangChain as the main connector tool.
pip install langchain
pip install python-dotenv
pip install pandas
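Depending on your LangChain version, the OpenAI wrapper may also need the openai client installed:
pip install openai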
Let’s import the packages and submodules we will need
import os
import pandas as pd
from dotenv import load_dotenv
from langchain.agents import create_pandas_dataframe_agent
from langchain.memory import ConversationBufferWindowMemory
from langchain import OpenAI, ConversationChain, LLMChain, PromptTemplate
For our integration we are going to need an OpenAI API key, which is fairly simple to set up. Create a .env file, store your key in it, and load the environment.
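The .env file only needs a single line (placeholder value shown):
OPENAI_API_KEY=sk-your-key-here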
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
Next we want to load GlobalChem into a dataframe:
globalchem_dataframe = pd.read_csv(
    'https://raw.githubusercontent.com/Global-Chem/global-chem/development/global_chem/global_chem_outputs/global_chem.tsv',
    sep='\t',
    header=None,
    names=['name', 'smiles', 'node', 'predicate', 'path']
)
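If you want to confirm the load worked, a quick peek doesn’t hurt:
print(globalchem_dataframe.shape)
print(globalchem_dataframe['node'].unique()[:5])  # first few node labels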
And now we want to create an agent that is only responsible for the GlobalChem data.
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY),
    globalchem_dataframe,
    verbose=True
)
We are using the OpenAI object, where temperature controls the diversity of the text the agent generates. A higher temperature means more varied wording, and vice versa for a lower one. We set ours to 0 because we don’t want diversity; we want answers pulled directly from the data.
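For comparison, a higher-temperature wrapper would produce more varied phrasing; both lines below are illustrative only and the variable names are not part of the bot:
factual_llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)    # deterministic, sticks to the data
creative_llm = OpenAI(temperature=0.8, openai_api_key=OPENAI_API_KEY) # more varied wording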
To test it we can do something pretty simple:
agent.run('return a list of names of the node rings_in_drugs')
Here we can see the agent is now pulling data from our GlobalChem repository. Next we want to add ChatGPT from OpenAI. The ConversationBufferWindowMemory sets how many interactions between the agent and the user are kept in memory.
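To see the windowing in isolation, here is a purely illustrative check with a tiny window of k=2 (the variable name and sample exchanges are just for the demo); only the last two exchanges survive:
demo_memory = ConversationBufferWindowMemory(k=2)
demo_memory.save_context({'input': 'hi'}, {'output': 'hello'})
demo_memory.save_context({'input': 'what is benzene?'}, {'output': 'c1ccccc1'})
demo_memory.save_context({'input': 'thanks'}, {'output': 'you are welcome'})
print(demo_memory.load_memory_variables({}))  # only the last two exchanges remain
For the bot we keep a much larger window (k=100). Now for the prompt and the chain; the template string below is a minimal placeholder, with slots for the conversation history and the user’s message: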
# Placeholder prompt -- adapt the wording to your bot; the memory fills in
# {history} and the user's message fills in {human_input}
template = """Assistant is a chemistry helper for the GlobalChem Discord bot.

{history}
Human: {human_input}
Assistant:"""

prompt = PromptTemplate(
    template=template,
    input_variables=['history', 'human_input']
)

chatgpt_chain = LLMChain(
    llm=OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY),
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=100),
)
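A quick manual test of the chain (the question here is arbitrary):
reply = chatgpt_chain.predict(human_input='What functional groups are in aspirin?')
print(reply)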
The goal is for the user to issue commands, with different commands querying different functionality. This is an excerpt of our bot code that parses the commands and delegates the functionality.
# text_message and langchain_keywords are defined earlier in the handler
# Global-Chem Integration
if 'dataframe' in message.content.lower():
    question = text_message.split(':')[1]  # everything after the colon is the question
    output = agent.run(question)
    await message.channel.send(str(output))

# Language Chaining Chemicals
if any(word in text_message for word in langchain_keywords):
    output = chatgpt_chain.predict(
        human_input=message.content.lower()
    )
    await message.channel.send(output)
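For context, the excerpt above lives inside the bot’s message handler; a minimal discord.py sketch (the keyword list and token variable name are placeholders) looks roughly like this:
import discord

intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

# Hypothetical trigger words for the ChatGPT chain; use whatever fits your bot
langchain_keywords = ['chem', 'smiles', 'compound']

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    text_message = message.content.lower()
    # ... the command handling from the excerpt above goes here ...

client.run(os.getenv('DISCORD_BOT_TOKEN'))  # hypothetical env variable for your bot token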
Our final result is something like this, and this is where prompt engineering really comes in. Depending on how you prompt the agent, it can have some mishaps.
Prompt Engineering
Prompt engineering is communicating with the chat system you are integrating with to get the answer you want. It’s like being on a customer service call and pressing a series of numbers until you reach a representative.
Prompt: can you give me the names of the node of rings_in_drugs?
Answer: rings_in_drugs
Prompt: in the dataframe: can you give me a list of the names of the node rings_in_drugs?
Answer: The names of the nodes in the rings_in_drugs column are: ['rings_in_drugs']
Prompt: in the dataframe: can you give me the names of the node of rings_in_drugs?
Answer: rings_in_drugs
Prompt: in the dataframe: return a list of names of the node rings_in_drugs
Answer: ['alpha-ethylmescaline', '4-allyloxy-3,5-dimethoxyphenethylamine', ...]
It took me a couple of iterations to get the prompt right, and lo and behold, the data I wanted came back.
The next step is to integrate the tools component of LangChain to combine agents together. Stay tuned.
All the code is available here: