2D Structure Depiction in MOEDr. Alex M. ClarkOctober 2005 Chemical Computing Group Inc.
IntroductionMOE (Molecular Operating Environment) has introduced new functionality for generating and displaying 2D structure diagrams using layout and rendering conventions which are familiar to chemists. Cartesian (X,Y) structure coordinates can be generated from molecular data structures regardless of whether the input structure was originally represented as 2D, 3D or as a minimal connection table. The success rate for aesthetically ideal structure layout is high, and many difficult classes of organic molecules can be depicted with confidence. The utility of 2D diagrams is relevant to many users of MOE. Chemical features are usually significantly easier for the user to perceive than are equivalent 3D representations. Such diagrams can be generated quickly for individual examples or entire databases, and easily viewed within MOE or communicated between other applications. As of MOE 2005.06, 2D structure diagrams are used in key areas of the user interface, and can also be used as a backend computational tool by applying to databases or controlling directly with SVL (the Scientific Vector Language) scripts or programs.
The subject of 2D structure depiction has been reviewed in the literature [Helson], and a detailed description of the algorithm used by MOE is in progress [Depict]. MethodologyStructure depiction in MOE is achieved by classifying the molecular structure graph into various segments, and generating sets of locally "possibilities" for each segment. Once generated, these are combined together by random sampling, and the best combination is selected, and used as the basis for the final output. Graph PartitioningConsider the following structure: The molecular connection graph reveals a number of opportunities for identifying groups of atoms which can be treated differently with regard to the manner in which the discrete possibility constraints are assigned. The primary assignment classifications are rings, chains, atoms and pairs of atoms: Hydrogen atoms are not part of the layout algorithm, and terminal atoms are implied by their neighbors. Ring blocks are identified by the results of the graph_dfs2 function, which provides an index identifier for each distinct set of connected rings (including spirosystems). Chain blocks are discovered by preparing a distance matrix (using graph_distance) out of suitable atoms, and matching longest chains of 4 or more atoms using graph_shortestpath. Atoms with stereoactive double bonds are matched with pairs. The graph_maxmatch function is used to obtain a further pairing set consisting of the atoms which are still leftover; those which are not combined into pairs remain isolated. Lone AtomsAtoms which are assigned to lone or paired partitions are assigned a set of internal coordinates by crossreferencing to a pattern file. The atom itself, and the bonds to its neighbors, are used to obtain a match, for example: A variety of local preferences can be captured, such as the 120° or 90° CCC angles, which depend on the extent of substitution. Possible permutations (such as mirror symmetry) can be expressed, as can lesspreferred local geometries, such as the cis representation of CS(=O)(=O)C. Paired AtomsAtoms which are assigned to pair constraints are combined by taking the outer product of the possible internal coordinates of the two constituents. A weighting bias is introduced when both atoms are disubstituted, in order to favor a more linear layout. In the case of stereochemically restricted alkenes, only combinations which encode the correct E or Z isomer are retained. Chain SequencesChains of atoms are detected and assigned to distinct blocks in order to produce linear sequences, which are generally the aesthetically preferred outcome. The following example has a single chain consisting of 14 nonterminal heavy atoms: The chain itself is encoded with internal coordinates of alternating ±120°, which produces the familiar zigzag style. Substituents are placed in the remaining trigonal interstice. Only atoms with 2 or 3 neighbors are included in the explicit chain logic. The following structure contains 3 explicitly detected chains: The interior chain consists of 12 atoms. Two other chains of 4 nonterminal, nonring atoms are also present. Ring BlocksEach fused ring block is treated separately, and a significant amount of effort is expended during the constraints assignment phase. While depiction of some types of ring blocks can be a difficult task, a ring block layout has no degrees of freedom (except its mirror image), and so it is quite valid to express all of the constituent atoms in a single block constraint. The individual rings are detected using a variation of the "smallest set of smallest rings" concept, as provided by the graph_scycle_list function. There are 5 main treatments which are applied to ring blocks, depending on their topology:
Terminal ring peeling is an effective way to reduce the effort required to depict ring blocks. The graph formed by the adjacency of rings within the block, as determined by the graph_cycleneighbors function, reveals nodes as being terminal if they are connected to the ring block by just one point. In the following example, the ring adjacency graph is a tree (contains no cycles of its own), and so it can be reduced to a single polygon: Reconstruction of such a ring is a trivial exercise in trigonometry. Other ring systems present a limit to the trivial approach, such as the following example, whose ring graph is not a tree: In this case, one ring can be safely peeled away, but the remaining core cannot be further decomposed by peeling. It is at this point that the reduced core is matched against a precompiled set of 300 templates. Ring templates are convenient for performance reasons, but more importantly they are an effective way of capturing stylistic conventions which have evolved for the purpose of visually describing nonplanar molecules on in a flat drawing, such as the following examples: Template matching is done by disregarding atom and bond type, and labelling the substrate and template graphs with the graph_walkclass function. Symmetrical ties are broken by successive perturbation. Structures with graph symmetry, but not necessarily positional symmetry, are permuted using the graph_automorphism class of functions, in order to optimize the positioning of substituents and heteroatoms. The combination of ring peeling and template matching is sufficient to handle a very large fraction of organic chemistry, but it is not comprehensive. Ring system cores which do not match a template are subjected to a fully algorithmic treatment. The ring system is cleaved at certain ring adjacency articulators, if they exist. For each such chunk, a distance matrix is prepared, in which the optimal distance between any two atoms is set to be the graph shortest path distance, which is an adequate initial approximation. Distance geometry embedding (using mp_DistanceEmbed) is used to project the coordinates, where X and Y are the major axes, and Z is discarded. A simplistic forcefield is then constructed, using only desired distance terms which are composed of ideal bond lengths, ideal nonbonded distances to represent angles for ideal polygons within individual rings, and crossring terms for hexagons. The terms are expressed as squares of distances, and so are easily differentiable, and the nearest local minimum is discovered using the truncated Newton function (opt_TN). The function differs from typical forcefields because it only includes intraring terms, which do not buckle or distort the ring system unnecessarily. The method is very general for planar ring systems: For ring systems which do not have a reasonable planar embedding, a 3D variation of the above method is used. Once the embedding is complete, the (X,Y,Z) coordinates are rotated so as to maximize the spread of the X and Y axes, minimize atom and bond overlap along the Z axis, and align substituents close to the edge of the 2D space. Once this is done, the structure is flattened onto the page. The method is especially effective for heavily encaged structures: The final method for embedding ring blocks is a special case for macrocycles, which is used when the ring block contains one predominant ring with between 12 and 100 edges, and any number of smaller rings attached or embedded. The macrocycle itself is mapped onto the edge of a "honeycomb" template, and the smaller rings are projected by distance geometry then annealed. The main ring is permuted in order to maximize exterior substituents and interior heteroatoms. Constraints are formulated for the entire ring system, where the macrocyclic atoms are desired to be connected by angles of 120°, which enforces a much more aesthetically pleasing drawing style than a large regular polygon. Constraints are also added between macrocyclic edges and embedded small ring edges, cause the macrocycle to buckle if necessary to avoid edges overlapping. The following examples illustrate macrocycle ring embedding: Once the atoms within the ring system have been positioned, the immediate substituents of the ring block are assigned relative positions, then the ring block is expressed in terms of internal coordinates for each of its constituents. The entire ring block is allowed one degree of freedom for its mirror image, and any ring atom which has two or more nonring substituents includes permutational combinations for their respective positions. SamplingOnce each of the nonterminal heavy atoms have been assigned to a constraint block (whether single or multiatom), consisting of internal coordinates (relative angles and distances), some with weightings to encourage of discourage selection, the sampling can begin. 100 iterations are conducted, in a blockvectorized loop. For each iteration, each constraint has one of its sets of internal coordinates selected at random, with probability proportional to its weighting. The selection is converted into Cartesian coordinates, and the resulting arrangement of atoms scored according to its disperseness. The candidate which has the maximally distributed set of atoms is selected as the optimal result. The following animated example shows the sampling of constraints which leads to a variety of structures, to each of which is assigned a congestion score. The lowest score is considered to be the best solution: The sampled solution to the constraints is then subjected to a loop of refinement of individual constraints, which are varied in order to examine the global impact of the local change. When localized modification fails to produce further improvement, the loop terminates. PostprocessingThe raw result from the sampling process needs several additional steps before the process is complete. Because the internal coordinates do not define an overall rotation, each of the connected components must be rotated in order to best satisfy an aesthetic function. The fitness function is composed of the maximum number of bonds aligned to 15° increments, and a favoring of width vs. height. Separate connected components must also be positioned relative to one another. The components are ranked in descending order of size; multiatom blocks are stacked toptobottom, then singleatom groups are positioned in remaining gaps. For single atom ions, an attempt is made to locate a site of opposite charge in order to position nearby. If the original structure featured explicit hydrogen atoms, these are placed within remaining interstices. Hydrogen atoms are generally left out of the main part of the depiction algorithm, since they are generally not drawn as explicit atoms. RetrofixingOccasions exist when the solution to the sampling algorithm is not an acceptable structure, e.g. atoms overlap, or nonring bonds cross. Apart from ring embedding failure, there are two main reasons why this occurs: either the degrees of freedom in the available constraints was too low and no acceptable solutions exist, or the degrees of freedom is very high and contains little degeneracy, in which case the 100 random samples were not sufficient. Most of the failure cases are caused by insufficient degrees of freedom, since the decision to use few discrete possibilities is one of the main reasons why the algorithm works well. Therefore mechanisms exist for addressing such cases. If the resulting structure contains overlapping atoms or intersecting bonds, the paths between such offending atoms are identified, and the constraints are loosened. Additional angular choices are added, increasing the number of possible solutions, and the process is resubmitted to the sampling algorithm. In most cases, a solution is found which is adequate. Should the layout still fail, the algorithm falls back to a reactionary method of bruteforce stretching and rotating of bonds, making as many modifications as necessary in order to resolve the conflict. GalleryThe following examples have been produced by the depiction algorithm described in this article:
UsageThe depiction algorithm, and 2D graphical display of molecular diagrams, have been introduced to a number of panels in the MOE user interface, such as the Database Browser and specific 2D Molecule viewer. Molecular structures can also be depicted in 2D when saving to file or copying to clipboard, and whole databases can be depicted with a single command. Structure depiction is also available to the SVL programmer [FuncRef], via a simple entrypoint: [x, y] = DepictionCoordinates [mol] The function operates with minimal input information, and returns the placement of all the atoms in the molecular structure, in Cartesian coordinates. For example, to replace the current model in the MOE main window with the diagram coordinates: [x, y] = DepictionCoordinates [mol_Extract Atoms[]] aSetPos [Atoms[], [x, y, 0]]
In conclusion, MOE 2005 introduces a robust and effective method for computing diagram structures, which is integrated into the user interface and conveniently available to the SVL programmer. The new functionality can be used within MOE to assist visualization of structures, as well as an automated or bulk calculation which can be applied to data that is exported to other applications, such as SD files or web pages. Custom interfaces can easily make use of 2D diagrams, by combining with recently introduced functionality for displaying vector graphics in panels and printing. The layout algorithm is implemented entirely in SVL, and occupies ca. 3000 lines of code, in addition to the general purpose functions available to SVL programmers which allow convenient and efficient manipulation of vectors, graphs, geometry and nonlinear optimization. Users of MOE are able to examine the full source code, under the terms of the MOE user license. References
