Overview

This work is dedicated to my father who showed me the beauty of organic chemistry.

Back when I worked at a scientific institution, one of my recurring tasks was creating screening databases for drug discovery. The part of the task that involves computer science consisted of taking two databases of molecular structures of organic compounds, each usually representing a particular chemical class, fusing the molecules according to a generic chemical reaction, and storing the result in another database. The output database was then processed further using heuristics that are irrelevant to the subject of this article.

A molecule is represented in digital form as a weighted graph in which every vertex and every edge carries a weight: vertices represent atoms, and edges represent bonds. The task boils down to removing some vertices from the source graphs and creating new edges between the two graphs, following the chemical reaction.
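To make the graph view concrete, here is a minimal Python sketch. The `Molecule` class, the `fuse` function, and the atom numbering are all hypothetical and chosen for illustration; they are not taken from any of the tools mentioned in this article. Hydrogens are left implicit, and the example joins methylamine and acetic acid into an amide by dropping the acid's hydroxyl oxygen.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Graph with labels: atoms are vertices, bonds are edges (illustrative layout)."""
    atoms: dict = field(default_factory=dict)   # atom id -> element symbol
    bonds: dict = field(default_factory=dict)   # frozenset({a, b}) -> bond order

    def add_atom(self, atom_id, element):
        self.atoms[atom_id] = element

    def add_bond(self, a, b, order=1):
        self.bonds[frozenset((a, b))] = order

    def remove_atom(self, atom_id):
        # Removing a vertex also removes every edge incident to it.
        del self.atoms[atom_id]
        self.bonds = {k: v for k, v in self.bonds.items() if atom_id not in k}


def fuse(left, right, drop_left, drop_right, new_bond):
    """Merge two graphs, drop the leaving atoms, and add one new edge."""
    product = Molecule()
    # Prefix atom ids with "L"/"R" so ids from the two sources cannot clash.
    for tag, mol in (("L", left), ("R", right)):
        for atom_id, element in mol.atoms.items():
            product.add_atom((tag, atom_id), element)
        for pair, order in mol.bonds.items():
            a, b = tuple(pair)
            product.add_bond((tag, a), (tag, b), order)
    for atom_id in drop_left:
        product.remove_atom(("L", atom_id))
    for atom_id in drop_right:
        product.remove_atom(("R", atom_id))
    a, b = new_bond
    product.add_bond(("L", a), ("R", b))
    return product


# Methylamine: C1-N2 (hydrogens implicit).
amine = Molecule()
amine.add_atom(1, "C")
amine.add_atom(2, "N")
amine.add_bond(1, 2)

# Acetic acid: C1-C2(=O3)-O4.
acid = Molecule()
for atom_id, element in ((1, "C"), (2, "C"), (3, "O"), (4, "O")):
    acid.add_atom(atom_id, element)
acid.add_bond(1, 2)
acid.add_bond(2, 3, order=2)
acid.add_bond(2, 4)

# Drop the hydroxyl oxygen and bind the nitrogen to the carbonyl carbon.
amide = fuse(amine, acid, drop_left=[], drop_right=[4], new_bond=(2, 2))
```

The sketch deals only with connectivity. It says nothing about the 2D coordinates of the atoms, which is where the layout problems discussed in this article come from.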

[Figure: reaction between an amine and an acid]

There are both commercial and free tools that offer this functionality. All of them do a good job of producing the Cartesian product of the sets of source molecules. However, virtually all of them perform badly at laying out the resulting molecule without inner intersections. In most cases I had to prepare the source molecules by hand, taking into account the peculiarities of the tool's layout engine, which in almost all cases just rotated the source molecules so that the bonds that were to be merged were parallel.

[Figure: inner intersection]

At some point I got extremely bored with the manual part of the process and started looking for a way to automate it as much as possible. The first observation was that in the great majority of cases the resulting molecule (graph) can be obtained by joining fragments (subgraphs) of the source molecules with a single bond (edge). The steps of the process of creating a screening database are as follows.

1. Remove unnecessary atoms from the source molecules to obtain the fragments that are to be joined into the resulting molecule.
2. Mark the atoms of the fragments, one atom per fragment, that will form the future bond as follows. Add a special atom, the R-atom, to each fragment and bind it to the atom in question, thus forming an R-bond. The direction of the R-bond determines the relative orientation of each fragment in the resulting molecule.
3. Bind the fragments as follows. Rotate the fragments so that the R-bonds are horizontal and the R-atoms are next to each other. Create a bond between the atoms adjacent to the R-atoms, and remove the R-atoms from the resulting molecule.
4. Analyze the resulting molecule to see if it has any irregularities in the form of inner intersections. If there are any, change the layout of the source fragments and repeat from step 2. In most cases it is sufficient to either flip one of the fragments or change the orientation of the R-bond to make the irregularities vanish.
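The last two steps of the list above (binding the fragments and checking for intersections) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of any tool mentioned here. The fragment layout (a dict with `coords`, `bonds`, `anchor`, and `r` keys) and the function names are hypothetical. Only proper crossings of two bonds are counted, so overlapping atoms and collinear bonds go unnoticed, and the only remedy tried is flipping the right-hand fragment.

```python
import math
from itertools import combinations


def rotate_to(coords, anchor, r_atom, target_angle):
    """Move the anchor to the origin and turn the R-bond to point at target_angle."""
    ax, ay = coords[anchor]
    rx, ry = coords[r_atom]
    turn = target_angle - math.atan2(ry - ay, rx - ax)
    c, s = math.cos(turn), math.sin(turn)
    return {k: ((x - ax) * c - (y - ay) * s, (x - ax) * s + (y - ay) * c)
            for k, (x, y) in coords.items()}


def join(left, right, flip_right=False):
    """Lay the R-bonds head to head, bond the anchors, drop the R-atoms."""
    lc = rotate_to(left["coords"], left["anchor"], left["r"], 0.0)
    rc = rotate_to(right["coords"], right["anchor"], right["r"], math.pi)
    if flip_right:
        # Mirror across the horizontal axis; the R-bond stays on that axis.
        rc = {k: (x, -y) for k, (x, y) in rc.items()}
    shift = lc[left["r"]][0]  # the new bond gets the length of the left R-bond
    coords = {("L", k): p for k, p in lc.items() if k != left["r"]}
    coords.update({("R", k): (x + shift, y)
                   for k, (x, y) in rc.items() if k != right["r"]})
    bonds = [(("L", a), ("L", b)) for a, b in left["bonds"] if left["r"] not in (a, b)]
    bonds += [(("R", a), ("R", b)) for a, b in right["bonds"] if right["r"] not in (a, b)]
    bonds.append((("L", left["anchor"]), ("R", right["anchor"])))
    return coords, bonds


def _orient(p, q, r):
    # Sign of the cross product: which side of line p-q the point r lies on.
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def count_crossings(coords, bonds):
    """Count pairs of bonds that share no atom and properly cross each other."""
    total = 0
    for (a, b), (c, d) in combinations(bonds, 2):
        if {a, b} & {c, d}:
            continue
        p, q, r, s = coords[a], coords[b], coords[c], coords[d]
        if _orient(p, q, r) * _orient(p, q, s) < 0 and _orient(r, s, p) * _orient(r, s, q) < 0:
            total += 1
    return total


def join_without_crossings(left, right):
    """Try the plain layout first, then the flipped one; keep the first clean result."""
    for flip in (False, True):
        coords, bonds = join(left, right, flip_right=flip)
        if count_crossings(coords, bonds) == 0:
            break
    return coords, bonds  # may still contain crossings if neither layout is clean


# Toy fragments: the bond B-C of the left fragment reaches over the join.
left = {"coords": {"A": (0, 0), "R": (1, 0), "B": (0.5, 0.5), "C": (1.5, 0.5)},
        "bonds": [("A", "R"), ("A", "B"), ("B", "C")], "anchor": "A", "r": "R"}
right = {"coords": {"a": (0, 0), "R": (-1, 0), "b": (0, 1)},
         "bonds": [("a", "R"), ("a", "b")], "anchor": "a", "r": "R"}

plain = count_crossings(*join(left, right))
flipped = count_crossings(*join(left, right, flip_right=True))
coords, bonds = join_without_crossings(left, right)
```

In this toy case the plain layout has one crossing and the flipped one has none, so `join_without_crossings` returns the flipped layout. Changing the orientation of the R-bond, the other remedy named in the list, would need another pass over the fragment coordinates and is not covered by the sketch.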

[Figure: fusion process]

Almost all of the available tools automate step 3 above, which means that the rest of the steps are to be performed by hand. Steps 1 and 2 were automated using the Replace add-in for MDL's ISIS/Base desktop chemical structures DBMS. The first solution to automate step 4 solves the problem locally around the newly created bond and can be viewed as a first approximation. It was implemented in the Fuse application.

The Fuse application helped me a lot in my day-to-day work. However, as my productivity increased, so did the number and variety of databases to be processed, and with them the frequency of cases where the simple set of transformations used in the application was not enough. So I started looking for a generalization of the layout engine that would be able to transform as many molecule fragments as necessary to minimize inner intersections, and not just the small fragment that includes the R-bond. The result is implemented in the Clean2D layout engine.

Oddly enough, the reverse of the process described above was also a common task: given a set of molecules, figure out which compounds they are made of, assuming that all of them are the result of the same chemical reaction. The Split add-in solves the problem using substructure (subgraph) search and replace techniques.
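As a rough illustration of the reverse task, the Python sketch below cuts a molecule at an amide bond and returns the two resulting fragments. It is a stand-in only: the pattern is hard-coded instead of being found by a general subgraph search, the function names and data layout are hypothetical, and nothing here is taken from the Split add-in. The sketch also stops at the cut and does not restore the atoms that were lost in the reaction, such as the acid's hydroxyl group.

```python
from collections import deque


def components(atoms, bonds):
    """Connected components of the graph, as a list of sets of atom ids."""
    neighbours = {a: set() for a in atoms}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, parts = set(), []
    for start in atoms:
        if start in seen:
            continue
        part, queue = set(), deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            part.add(current)
            for nxt in neighbours[current] - seen:
                seen.add(nxt)
                queue.append(nxt)
        parts.append(part)
    return parts


def has_carbonyl(atoms, bonds, carbon):
    """True if the given atom carries a double-bonded oxygen."""
    for (a, b), order in bonds.items():
        if order == 2 and carbon in (a, b):
            other = b if a == carbon else a
            if atoms[other] == "O":
                return True
    return False


def split_amide(atoms, bonds):
    """Cut the first single C-N bond whose carbon is a carbonyl carbon."""
    for (a, b), order in bonds.items():
        for c, n in ((a, b), (b, a)):
            if (order == 1 and atoms[c] == "C" and atoms[n] == "N"
                    and has_carbonyl(atoms, bonds, c)):
                remaining = {k: v for k, v in bonds.items() if k != (a, b)}
                return components(atoms, remaining)
    return [set(atoms)]  # no amide bond found: the molecule stays whole


# N-methylacetamide, hydrogens implicit: C1-N2-C4(=O5)-C3.
atoms = {1: "C", 2: "N", 3: "C", 4: "C", 5: "O"}
bonds = {(1, 2): 1, (2, 4): 1, (3, 4): 1, (4, 5): 2}
fragments = split_amide(atoms, bonds)
```

Here `fragments` holds the amine part (atoms 1 and 2) and the acid part (atoms 3, 4, and 5). A general solution would take the pattern as input and search for it as a subgraph, which is what makes the approach reusable across reactions.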

Molecule Processing Toolkit