2D Structure Depiction in MOE
Dr. Alex M. Clark
Chemical Computing Group Inc.
MOE (Molecular Operating Environment) has introduced new functionality
for generating and displaying 2D structure diagrams using layout and rendering
conventions which are familiar to chemists. Cartesian (X,Y) structure
coordinates can be generated from molecular data structures regardless of
whether the input structure was originally represented as 2D, 3D or as a minimal
connection table. The success rate for aesthetically ideal structure layout is
high, and many difficult classes of organic molecules can be depicted with
The utility of 2D diagrams is relevant to many users of MOE. Chemical
features are usually significantly easier for the user to perceive than are
equivalent 3D representations. Such diagrams can be generated quickly for
individual examples or entire databases, and easily viewed within MOE or
communicated between other applications. As of MOE 2005.06, 2D structure
diagrams are used in key areas of the user interface, and can also be used as a
back-end computational tool by applying to databases or controlling directly
with SVL (the Scientific Vector Language) scripts or programs.
|3D Model||2D Diagram|
The subject of 2D structure depiction has been reviewed in the literature
[Helson], and a detailed description of the algorithm used by MOE is in progress
Structure depiction in MOE is achieved by classifying the molecular structure
graph into various segments, and generating sets of locally "possibilities" for
each segment. Once generated, these are combined together by random sampling,
and the best combination is selected, and used as the basis for the final
Consider the following structure:
The molecular connection graph reveals a number of opportunities for
identifying groups of atoms which can be treated differently with regard to the
manner in which the discrete possibility constraints are assigned. The primary
assignment classifications are rings, chains, atoms and pairs of atoms:
Hydrogen atoms are not part of the layout algorithm, and terminal atoms are
implied by their neighbors. Ring blocks are identified by the results of the
graph_dfs2 function, which provides an index identifier for each
distinct set of connected rings (including spiro-systems). Chain blocks are
discovered by preparing a distance matrix (using graph_distance) out of
suitable atoms, and matching longest chains of 4 or more atoms using
graph_shortestpath. Atoms with stereo-active double bonds are matched
with pairs. The graph_maxmatch function is used to obtain a further
pairing set consisting of the atoms which are still leftover; those which are
not combined into pairs remain isolated.
Atoms which are assigned to lone or paired partitions are assigned a set of
internal coordinates by cross-referencing to a pattern file. The atom itself,
and the bonds to its neighbors, are used to obtain a match, for example:
A variety of local preferences can be captured, such as the 120° or
90° C-C-C angles, which depend on the extent of substitution. Possible
permutations (such as mirror symmetry) can be expressed, as can less-preferred
local geometries, such as the cis representation of CS(=O)(=O)C.
Atoms which are assigned to pair constraints are combined by taking the outer
product of the possible internal coordinates of the two constituents. A
weighting bias is introduced when both atoms are di-substituted, in order to
favor a more linear layout. In the case of stereochemically restricted alkenes,
only combinations which encode the correct E or Z isomer are
Chains of atoms are detected and assigned to distinct blocks in order to
produce linear sequences, which are generally the aesthetically preferred
outcome. The following example has a single chain consisting of 14 non-terminal
The chain itself is encoded with internal coordinates of alternating
±120°, which produces the familiar zig-zag style. Substituents are
placed in the remaining trigonal interstice. Only atoms with 2 or 3 neighbors
are included in the explicit chain logic.
The following structure contains 3 explicitly detected chains:
The interior chain consists of 12 atoms. Two other chains of 4 non-terminal,
non-ring atoms are also present.
Each fused ring block is treated separately, and a significant amount of
effort is expended during the constraints assignment phase. While depiction of
some types of ring blocks can be a difficult task, a ring block layout has no
degrees of freedom (except its mirror image), and so it is quite valid to
express all of the constituent atoms in a single block constraint.
The individual rings are detected using a variation of the "smallest set of
smallest rings" concept, as provided by the graph_scycle_list function.
There are 5 main treatments which are applied to ring blocks, depending on their
Trivial catenations of small rings;
Systems which can be reduced down to a core which matches a entry in the
ring template database;
Planar ring systems which can be embedded using distance geometry
followed by refinement;
Non-planar ring systems which can be embedded using 3D embedding followed
by rotation and flattening;
Macrocyclic systems with a dominant large ring.
Terminal ring peeling is an effective way to reduce the effort required to
depict ring blocks. The graph formed by the adjacency of rings within the block,
as determined by the graph_cycleneighbors function, reveals nodes as
being terminal if they are connected to the ring block by just one point. In the
following example, the ring adjacency graph is a tree (contains no cycles of its
own), and so it can be reduced to a single polygon:
Reconstruction of such a ring is a trivial exercise in trigonometry. Other
ring systems present a limit to the trivial approach, such as the following
example, whose ring graph is not a tree:
In this case, one ring can be safely peeled away, but the remaining core
cannot be further decomposed by peeling. It is at this point that the reduced
core is matched against a precompiled set of 300 templates.
Ring templates are convenient for performance reasons, but more importantly
they are an effective way of capturing stylistic conventions which have evolved
for the purpose of visually describing non-planar molecules on in a flat
drawing, such as the following examples:
Template matching is done by disregarding atom and bond type, and labelling
the substrate and template graphs with the graph_walkclass function.
Symmetrical ties are broken by successive perturbation. Structures with graph
symmetry, but not necessarily positional symmetry, are permuted using the
graph_automorphism class of functions, in order to optimize the
positioning of substituents and heteroatoms.
The combination of ring peeling and template matching is sufficient to handle
a very large fraction of organic chemistry, but it is not comprehensive. Ring
system cores which do not match a template are subjected to a fully algorithmic
The ring system is cleaved at certain ring adjacency articulators, if they
exist. For each such chunk, a distance matrix is prepared, in which the
optimal distance between any two atoms is set to be the graph shortest path
distance, which is an adequate initial approximation. Distance geometry
embedding (using mp_DistanceEmbed) is used to project the coordinates,
where X and Y are the major axes, and Z is discarded.
A simplistic forcefield is then constructed, using only desired distance
terms which are composed of ideal bond lengths, ideal non-bonded distances to
represent angles for ideal polygons within individual rings, and cross-ring
terms for hexagons. The terms are expressed as squares of distances, and so are
easily differentiable, and the nearest local minimum is discovered using the
truncated Newton function (opt_TN). The function differs from typical
forcefields because it only includes intra-ring terms, which do not
buckle or distort the ring system unnecessarily. The method is very general for
planar ring systems:
For ring systems which do not have a reasonable planar embedding, a 3D
variation of the above method is used. Once the embedding is complete, the
(X,Y,Z) coordinates are rotated so as to maximize the spread of the X and Y
axes, minimize atom and bond overlap along the Z axis, and align substituents
close to the edge of the 2D space. Once this is done, the structure is flattened
onto the page. The method is especially effective for heavily encaged
The final method for embedding ring blocks is a special case for macrocycles,
which is used when the ring block contains one predominant ring with between 12
and 100 edges, and any number of smaller rings attached or embedded.
The macrocycle itself is mapped onto the edge of a "honeycomb" template, and
the smaller rings are projected by distance geometry then annealed. The main
ring is permuted in order to maximize exterior substituents and interior
heteroatoms. Constraints are formulated for the entire ring system, where the
macrocyclic atoms are desired to be connected by angles of 120°, which
enforces a much more aesthetically pleasing drawing style than a large regular
polygon. Constraints are also added between macrocyclic edges and embedded small
ring edges, cause the macrocycle to buckle if necessary to avoid edges
overlapping. The following examples illustrate macrocycle ring embedding:
Once the atoms within the ring system have been positioned, the immediate
substituents of the ring block are assigned relative positions, then the ring
block is expressed in terms of internal coordinates for each of its
constituents. The entire ring block is allowed one degree of freedom for its
mirror image, and any ring atom which has two or more non-ring substituents
includes permutational combinations for their respective positions.
Once each of the non-terminal heavy atoms have been assigned to a constraint
block (whether single- or multi-atom), consisting of internal coordinates
(relative angles and distances), some with weightings to encourage of discourage
selection, the sampling can begin.
100 iterations are conducted, in a block-vectorized loop. For each iteration,
each constraint has one of its sets of internal coordinates selected at random,
with probability proportional to its weighting. The selection is converted into
Cartesian coordinates, and the resulting arrangement of atoms scored according
to its disperseness. The candidate which has the maximally distributed
set of atoms is selected as the optimal result.
The following animated example shows the sampling of constraints which leads
to a variety of structures, to each of which is assigned a congestion score. The
lowest score is considered to be the best solution:
The sampled solution to the constraints is then subjected to a loop of
refinement of individual constraints, which are varied in order to examine the
global impact of the local change. When localized modification fails to produce
further improvement, the loop terminates.
The raw result from the sampling process needs several additional steps
before the process is complete. Because the internal coordinates do not define
an overall rotation, each of the connected components must be rotated in order
to best satisfy an aesthetic function. The fitness function is composed of the
maximum number of bonds aligned to 15° increments, and a favoring of width
Separate connected components must also be positioned relative to one
another. The components are ranked in descending order of size; multi-atom
blocks are stacked top-to-bottom, then single-atom groups are positioned in
remaining gaps. For single atom ions, an attempt is made to locate a site of
opposite charge in order to position nearby.
If the original structure featured explicit hydrogen atoms, these are placed
within remaining interstices. Hydrogen atoms are generally left out of the
main part of the depiction algorithm, since they are generally not drawn as
Occasions exist when the solution to the sampling algorithm is not an
acceptable structure, e.g. atoms overlap, or non-ring bonds cross. Apart from
ring embedding failure, there are two main reasons why this occurs: either the
degrees of freedom in the available constraints was too low and no acceptable
solutions exist, or the degrees of freedom is very high and contains little
degeneracy, in which case the 100 random samples were not sufficient.
Most of the failure cases are caused by insufficient degrees of freedom,
since the decision to use few discrete possibilities is one of the main reasons
why the algorithm works well. Therefore mechanisms exist for addressing such
cases. If the resulting structure contains overlapping atoms or intersecting
bonds, the paths between such offending atoms are identified, and the
constraints are loosened. Additional angular choices are added, increasing the
number of possible solutions, and the process is re-submitted to the sampling
algorithm. In most cases, a solution is found which is adequate.
Should the layout still fail, the algorithm falls back to a reactionary
method of brute-force stretching and rotating of bonds, making as many
modifications as necessary in order to resolve the conflict.
The following examples have been produced by the depiction algorithm
described in this article:
The depiction algorithm, and 2D graphical display of molecular diagrams, have
been introduced to a number of panels in the MOE user interface, such as the
Database Browser and specific 2D Molecule viewer. Molecular structures can also
be depicted in 2D when saving to file or copying to clipboard, and whole
databases can be depicted with a single command.
Structure depiction is also available to the SVL programmer [FuncRef], via a
[x, y] = DepictionCoordinates [mol]
The function operates with minimal input information, and returns the
placement of all the atoms in the molecular structure, in Cartesian coordinates.
For example, to replace the current model in the MOE main window with the
[x, y] = DepictionCoordinates [mol_Extract Atoms]
aSetPos [Atoms, [x, y, 0]]
In conclusion, MOE 2005 introduces a robust and effective method for
computing diagram structures, which is integrated into the user interface and
conveniently available to the SVL programmer. The new functionality can be used
within MOE to assist visualization of structures, as well as an automated or
bulk calculation which can be applied to data that is exported to other
applications, such as SD files or web pages. Custom interfaces can easily make
use of 2D diagrams, by combining with recently introduced functionality for
displaying vector graphics in panels and printing.
The layout algorithm is implemented entirely in SVL, and occupies ca.
3000 lines of code, in addition to the general purpose functions available to
SVL programmers which allow convenient and efficient manipulation of vectors,
graphs, geometry and nonlinear optimization. Users of MOE are able to examine
the full source code, under the terms of the MOE user license.
H.E. Helson, Reviews in Computational Chemistry, 13,
A.M. Clark, P. Labute, M. Santavy, Journal of Chemical
Informating and Modeling, 46, 1107-1123 (2006).
SVL function reference, available to MOE users.