Journal Articles

Quasar-Cluster: A Different View Of Molecular Clustering

P. Labute
Chemical Computing Group Inc.
1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7



The problem of Molecular Clustering is simply stated: Given a (large) collection of molecules, partition them into subsets in such a way that similar molecules belong to the same partition. The statement of the problem is deceptively simple. The devil, however, is in the details of the boldfaced words:

  • Partition. The method by which the partitioning is conducted is not at all obvious. Many methods have been proposed; some are based on graph theoretic ideas and some on spatial partitioning. The methods differ in algorithmic complexity from O(n) to O(n3).
  • Similar. What does it mean for two compounds to be similar? There are a number of approaches such as counting common features (e.g., fingerprints) or geometric distances between selected molecular descriptors.

Partitioning methods such as Jarvis-Patrick or Ward examine all inter-compound similarities (or distances) and derive lists of similar compounds that ultimately become the partitions. Such methods have an O(n2) algorithmic complexity and become less attractive for very large collections of compounds. Geometric binning methods use molecular descriptors and partition the descriptor space into "bins". Compounds are then assigned to bins depending on the value ranges of their descriptors. Such methods have an O(n) algorithmic complexity and are attractive for large collections.

In this document, we examine an O(n) clustering method that uses molecular descriptors to describe the compounds. The chief difference between the method to follow and geometric binning is that the binning is performed on the probability distributions of the descriptors rather than an arbitrary subdivision of space.

To motivate the method, consider the following two "problems":

  1. Suppose that a collection of students have written a test. Each student will have a "descriptor", namely his or her performance on the test (e.g., 73%, 56%, 95%). If we want to select a representative subset from this collection, we can sample students according to the histogram of test results. On the other hand, if we want to sample a diverse set of students, we can sort the test results and divide the collection into percentiles (e.g., top-third, middle-third, bottom-third) and select students from each group. This method is called nonparametric ranking.
  2. Suppose that a QSAR model of Molar Refractivity is to be estimated from a training set. Suppose further that the performance of the model is to be equally good on molecules with differing molecular weights. If the training set contains a molecular weight bias then one may choose a subset that removes the bias (or introduce row weights to remove the bias). Again, a diverse subset is required and nonparametric ranking is one way to make this selection.

The method to follow generalizes this notion of nonparametric ranking to higher dimensions. The implementation of the method in MOE 1998.03, called QuaSAR-Cluster, also is capable of generating importance weights and a principal component analysis.


Suppose that we are given m molecules each of which is described by an n-vector of real numbers xi=(xi1,...,xin), consisting of the descriptors for molecule i. Further, suppose that each molecule has an associated importance weight, wi, a non-negative real number. These weights can be thought of as the relative probability that the associated molecule will be encountered and are usually all equal to 1; however in some applications unequal weights are used. Let W denote the sum of all the weights.

Principal Component Analysis. Dimension reduction through principal component analysis (PCA) can be interpreted in the following manner. Let X denote a random n-vector and let Z denote a random p-vector with mean 0 and covariance matrix equal to the identity matrix. If we assume that X=RZ+x0 for some n by p linear transform R and some n-vector x0 then PCA is the estimation of the Z vectors from a sample of X vectors.

If we integrate both sides of the supposed affine transform we obtain

which shows that x0 is the mean of the distribution of the X vectors. Turning to the covariance of the X vectors, we observe that

These equations suggest the following method for estimating the Z vectors. We use the sampled data to approximate both E(X) and Cov(X):

Now, the sample covariance matrix, S, is symmetric semi-definite; hence all of its eigenvalues are real and non-negative. We can therefore diagonalize S so that

where Q is orthogonal and D is diagonal-sorted in descending order from top left to bottom right. Let p be the number of non-zero diagonal values in D (the square roots of the eigenvalues of S). We can estimate R as the first p columns of QTD and say that the X vectors have p principal components. In practice, we restrict the selection of p further by limiting the condition of the estimated RTR matrix; that is, we choose p so that the largest eigenvalue divided by the smallest eigenvalue is less than some specified threshold.

Cluster Determination. The assignment of cluster codes to molecules proceeds in p dimensional Euclidean space after estimating the p principal components of the molecular descriptors. Each of the p axes is divided into k intervals labeled "a", "b", "c" etc. Each of the p coordinates of a molecule is assigned a letter code depending on the axis intervals into which the coordinates fall. A cluster code is the concatenation of p letters. More precisely, if f(z) is the probability density function for a random variable Z we determine b0,...,bk with b0 equal to minus infinity and bk equal to plus infinity such that

This corresponds to a percentile-type subdivision of space. If a coordinate z is such that z is in the interval (bi-1,bi] then z is given the i-th label.

Let z1,...,zm be m samples of a continuous random variable Z. We can estimate f by accumulating a histogram of observed samples values on a set of B bins (b0,b1],...,(bB-1, bB] defined by B+1 numbers bj<bj+1, with b0 equal to minus infinity and bB equal to plus infinity. The usual procedure for counting the number of observations among m samples in bin j>0 is

which has unfortunate sensitivity to the selection of bin boundaries since observations close to the boundary between two bins are treated as if they were in the middle of one of the bins. To reduce the sensitivity to the bin boundaries we replace the delta function observation density with a normal random variable with mean zi with variance s2. We now have


Weight Assignment. Suppose that a cluster analysis has assigned a code to each of the m molecules. Suppose further that mi is the number of molecules that have same cluster code as molecule i. QuaSAR-Cluster will output a weight of 1/mi for molecule i. For example, if a total of three out of the m molecules are assigned a particular cluster code then each of the three molecules will be given a weight of 1/3.

Algorithm Summary. The foregoing considerations are integrated into the QuaSAR-Cluster computational procedure:

  1. Calculate the sample average vector x0 and covariance matrix S.
  2. Diagonalize S and select the p principal components.
  3. Form the PCA transform Q so that Z=Q(x- x0) has identity covariance and mean 0.
  4. For each i write the p-vector zi=Q(xi- x0) to the output.
  5. Estimate p probability density functions from the zi.
  6. For each density function compute the percentile-type intervals.
  7. For each i output a cluster label based on the p coordinates of the zi.
  8. For each i determine mi the number of molecules with the same cluster code.
  9. For each i output the weight 1/mi.

The procedure has the following parameters (explained in the foregoing equations):

  • maxcomp : the maximum number of components (the maximum value of p).
  • maxcond : the maximum condition of the QTQ transform.
  • swidth : the normal histogram smoothing width s.
  • ncodes : the number of subdivisions per axis, k.


The methods described in the preceeding section have been implement in the Molecular Operating Environment (MOE) version 1998.03 in the application QuaSAR-Cluster. QuaSAR-Cluster is written entirely in the SVL programming language (built into MOE) in about 500 lines. The purpose of QuaSAR-Cluster is three-fold:

  1. Clustering. When designing experiments it is often necessary that a "diverse" collection of molecules be used (in order to avoid bias). QuaSAR-Cluster can assign codes to a collection of molecules, based upon a table of molecular descriptors, so that "similar" molecules in the collection can be quickly identified.
  2. Dimension Reduction. Often, a table of molecular descriptors contains columns that are correlated and differ in relative scale. This can be a problem for distance-based diversity analysis since an Euclidean metric cannot be used directly (without overemphasizing some subset of the descriptors). QuaSAR-Cluster is capable of calculating a new table of descriptors, often smaller, that are uncorrelated and normalized (mean 0 and variance 1).
  3. Assigning Weights. When a moderately large collection of molecules is used in a QSAR study, it is important that bias not be introduced due to the particular selection of molecules. For example, near duplicate molecules can bias the fit. QuaSAR-Cluster can assign weights to molecules so that, for instance, the two near duplicate molecules will be assigned a weight of 0.5.

When the calculation is invoked, the following window appears:

  • Source Database. This field indicates the name of the database that contains the fields to be analyzed. If the toggle box beneath this field is checked then only the database entries (rows) that are selected in the Molecular Database Viewer will be used in the calculation.
  • Weight Field. The name of the field that will be used to weight individual entries of the database. These weights should be non-negative; negative weights will be set to zero for the calculation. If the Weight Field is set to (NONE), then all entries will have a weight of 1.
  • Fields. All of the numeric fields in the database are presented in a list. If no fields are selected, then all the numeric fields (except the Weight Field) will be used in the calculation. If fields are selected in the list, then the selected fields will be used in the calculation and the unselected fields will be ignored.
  • Output Database. This field contains the filename of the database into which the results of the calculation will be written. If this filename is different from the Source Database, a new database will be created. If the Output Database is empty or equal to the Source Database, the results will be written to the Source Database. If the Output Database is different from the Source Database, all database fields except those used in the analysis will be copied to the Output Database.
  • Principal Components. If checked, then the calculated principal components will be written in fields prefixed with the text in Field Prefix. For example, the default Field Prefix is "PCA"; if there are four principal components then these values will be written in fields PCA1, PCA2, PCA3 and PCA4.
  • Cluster Code. If checked then the calculated cluster code will be written to the character-type field named in the Cluster Field.
  • Cluster Weight. If checked then a new field with name New Weight Field will be created and will contain the inverse of the number of members in the cluster of each entry. For example, if a particular entry is assigned a code that is identical to three other entries then a value of 0.25 will be written.
  • Smoothing. This field controls the width of the smoothing distribution used in accumulating the distributions during clustering. It is the standard deviation of a normal distribution (the s value in the theory section). This field is not used if Cluster Code is not checked.
  • Code Count. This field controls the number of equiprobable subdivisions of each principal component axis; that is, the number of letter codes used per axis. Smaller values lead to fewer clusters.
  • Component Limit. This sets a limit on the number of principal components selected. A value of zero means that no limit is set; otherwise, at most the specified number of largest principal components will be used.
  • Condition Limit. This sets the maximum condition (see theory section) number of the principal component transform.

When the OK button is pressed the calculation will begin (and may take a few moments depending on the size of the database). Note that if the Output Database is different from the Source Database then the results will not be displayed unless a Molecular Database Viewer is opened on the output database.


A different view of Molecular Custering was presented that is based upon non-parametric ranking in high dimensions. The algorithmic complexity is O(n) making it suitable for large collections. Several applications are apparent:

  • Bias Removal in QSAR data sets.
  • Diverse Subset Selection.
  • Diverse Reagent Selection.

Because the method uses subdivision of space (with a statistical method of subdivision) each of the clusters has some definite meaning. This is in contrast to methods the subdivide descriptor space without regard to the meaning of the subdivisions.