During this PhD, I get to explore the deep history of cheminformatics. I came across this paper. After reading the first paragraph, I fell in love with cheminformatics all over again, and it reignited my motivation to continue. I hope you enjoy my perception of it as much as I did.
Morgan was assigned to capture and register the chemical structures from drawings at CAS and add them to a database so they could be referenced later.
This was back in 1965. That's a tall order…
Well, to do that, he first decided he needed a rank ordering system: a way to list the atoms sequentially. For example, for acetone:
C {
\ 001: 'C'
C = O -----> 002: 'C'
/ 003: 'O'
C 004: 'C'
}
He chose to implement an old method, the search tree (something you will learn in an undergraduate Data Structures class), and to reference the atoms as numbers while capturing only lines and points. So let's take the paper's example of isobutane:
It can be numbered under two schemes, a number resonance if you will.
Both trees are shown on the right. The level 2 tree is more efficient; it is the one we get if we start the search at the middle carbon of isobutane rather than at one of the attached methyl groups.
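To make the effect of the starting atom concrete, here is a minimal sketch that uses a plain breadth-first search as a stand-in for the paper's search tree. The atom indexing and the function name `bfs_levels` are my own assumptions for illustration, not Morgan's notation.

```python
# Heavy-atom graph of isobutane (hydrogens omitted).
# Hypothetical indexing: atom 0 is the central carbon, atoms 1-3 are the methyl carbons.
ISOBUTANE = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}


def bfs_levels(graph, start):
    """Group atoms by their tree level (bond distance) from the start atom."""
    seen = {start}
    levels = []
    frontier = [start]
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for atom in frontier:
            for neighbor in graph[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return levels


from_center = bfs_levels(ISOBUTANE, 0)  # start at the middle carbon
from_methyl = bfs_levels(ISOBUTANE, 1)  # start at a methyl carbon

print(len(from_center), from_center)
print(len(from_methyl), from_methyl)
```

Starting from the middle carbon gives a tree with two levels, while starting from a methyl carbon gives three, which is the sense in which the shallower tree is the more efficient one.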
We then generate the number haystack (a reduced version):
Six of the number sequences start with B, which means these sequences are going to be the most efficient; this is called the invariant rule. So, we've got our rank ordering. Now what?
Well, Morgan decided the information would be stored in a series of 5 lists, with the rank order being the key.
The modification list will be handled later (Morgan himself deferred it as well). The attachment list holds the bond connections: for each atom, the rank order of the atom it is connected to, with the first line being blank. Next is the ring closure list, which identifies the weak points of a ring of least complexity and specifies the ring closure, or perhaps conjunction. The node value is the lexical representation of the atom, most commonly its English symbol from the periodic table, and the last list is the line value, which essentially dictates the bond order.
Awesome, now we have a rank ordering system, but how do we search through our compounds? As organic chemists, we are mostly interested in neighboring compounds and an atom's environment, because these dictate the partial charge, which in turn tells you about reactivity. To do that, Morgan needed a numbering scheme to identify unique environments.
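As a rough sketch of how the five lists could fit together, here is acetone stored with the rank order as the key. The field names, the use of integers for bond order, and the assumption that rank 1 is a methyl carbon (read off the acetone listing above) are all mine; the paper's actual record layout is not reproduced here.

```python
# Hypothetical record for acetone, keyed by rank order.
# Assumed ranking: 1 = methyl C, 2 = central C, 3 = O, 4 = methyl C.
record = {
    # Rank of the atom each atom is attached to; the first line is blank.
    "attachment": {1: None, 2: 1, 3: 2, 4: 2},
    # Extra bonds that close rings, as (rank_a, rank_b, bond_order); acetone has none.
    "ring_closure": [],
    # Element symbol of each atom.
    "node_value": {1: "C", 2: "C", 3: "O", 4: "C"},
    # Bond order of the bond to the attached atom (1 = single, 2 = double).
    "line_value": {1: None, 2: 1, 3: 2, 4: 1},
    # Left empty here, since the modification list is handled later.
    "modification": [],
}


def bonds(rec):
    """Rebuild the bond list as (lower_rank, higher_rank, bond_order) tuples."""
    out = []
    for rank in sorted(rec["attachment"]):
        parent = rec["attachment"][rank]
        if parent is not None:
            out.append((parent, rank, rec["line_value"][rank]))
    out.extend(rec["ring_closure"])
    return out


acetone_bonds = bonds(record)
print(acetone_bonds)
```

Because every atom after the first points back to a lower-ranked atom, the attachment list alone describes a tree; only ring closures need the separate list.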
So for another example, we took this hydronaphthalene mercapto hydroxyl thing.
Essentially, what you can do is start with a radius of 0 around the atom. This means the value of an atom is simply how many bonds it has to atoms that are not hydrogens. So the carbon bonded to the hydroxyl has a value of 3, because it's connected to 3 non-hydrogen atoms (it also carries 1 hydrogen, which is not drawn and not counted). As you expand the radius, meaning you take into account your neighboring atoms' numbers, the sum for our carbon-hydroxyl is now actually 3 + 1 + 2 = 6. At a radius of 2 it is 9 + 5 + 3 = 17.
Now you have a registration process and a way to search for atoms, their rank orders, and what they are bonded to. Put it all together:
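The expanding-radius sum can be sketched in a few lines. This assumes the classic form of the update, where each atom's new value is the sum of its neighbors' previous values, and it leaves out any stopping rule. The molecule is 2-methylbutane, a small stand-in I picked for illustration rather than the structure discussed above, and the atom indexing and the function name `morgan_values` are hypothetical.

```python
# Heavy-atom graph of 2-methylbutane (hydrogens omitted), hypothetical indexing:
# 0 = CH3, 1 = CH, 2 = CH3 (branch), 3 = CH2, 4 = CH3
GRAPH = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}


def morgan_values(graph, radius):
    """Return one connectivity value per atom after `radius` rounds of summing."""
    # Radius 0: the number of non-hydrogen neighbors of each atom.
    values = {atom: len(neighbors) for atom, neighbors in graph.items()}
    for _ in range(radius):
        # Each round replaces an atom's value with the sum of its neighbors' values.
        values = {
            atom: sum(values[n] for n in neighbors)
            for atom, neighbors in graph.items()
        }
    return [values[atom] for atom in sorted(graph)]


for r in range(3):
    print(r, morgan_values(GRAPH, r))
```

At radius 0 the two terminal methyls on the branched end and the one on the chain end all look the same; by radius 1 the chain-end methyl already has a different value, which is how the sums separate atoms in different environments.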