The problem of Molecular Clustering is simply stated: Given a (large) collection of molecules, partition them into subsets in such a way that similar molecules belong to the same partition. The statement of the problem is deceptively simple. The devil, however, is in the details of the boldfaced words:
Partitioning methods such as Jarvis-Patrick or Ward examine all inter-compound similarities (or distances) and derive lists of similar compounds that ultimately become the partitions. Such methods have an O(n2) algorithmic complexity and become less attractive for very large collections of compounds. Geometric binning methods use molecular descriptors and partition the descriptor space into "bins". Compounds are then assigned to bins depending on the value ranges of their descriptors. Such methods have an O(n) algorithmic complexity and are attractive for large collections.
In this document, we examine an O(n) clustering method that uses molecular descriptors to describe the compounds. The chief difference between the method to follow and geometric binning is that the binning is performed on the probability distributions of the descriptors rather than an arbitrary subdivision of space.
To motivate the method, consider the following two "problems":
The method to follow generalizes this notion of nonparametric ranking to higher dimensions. The implementation of the method in MOE 1998.03, called QuaSAR-Cluster, also is capable of generating importance weights and a principal component analysis.
Suppose that we are given m molecules each of which is described by an n-vector of real numbers xi=(xi1,...,xin), consisting of the descriptors for molecule i. Further, suppose that each molecule has an associated importance weight, wi, a non-negative real number. These weights can be thought of as the relative probability that the associated molecule will be encountered and are usually all equal to 1; however in some applications unequal weights are used. Let W denote the sum of all the weights.
Principal Component Analysis. Dimension reduction through principal component analysis (PCA) can be interpreted in the following manner. Let X denote a random n-vector and let Z denote a random p-vector with mean 0 and covariance matrix equal to the identity matrix. If we assume that X=RZ+x0 for some n by p linear transform R and some n-vector x0 then PCA is the estimation of the Z vectors from a sample of X vectors.
If we integrate both sides of the supposed affine transform we obtain
![]()
which shows that x0 is the mean of the distribution of the X vectors. Turning to the covariance of the X vectors, we observe that
![]()
These equations suggest the following method for estimating the Z vectors. We use the sampled data to approximate both E(X) and Cov(X):
![]()
Now, the sample covariance matrix, S, is symmetric semi-definite; hence all of its eigenvalues are real and non-negative. We can therefore diagonalize S so that
![]()
where Q is orthogonal and D is diagonal-sorted in descending order from top left to bottom right. Let p be the number of non-zero diagonal values in D (the square roots of the eigenvalues of S). We can estimate R as the first p columns of QTD and say that the X vectors have p principal components. In practice, we restrict the selection of p further by limiting the condition of the estimated RTR matrix; that is, we choose p so that the largest eigenvalue divided by the smallest eigenvalue is less than some specified threshold.
Cluster Determination. The assignment of cluster codes to molecules proceeds in p dimensional Euclidean space after estimating the p principal components of the molecular descriptors. Each of the p axes is divided into k intervals labeled "a", "b", "c" etc. Each of the p coordinates of a molecule is assigned a letter code depending on the axis intervals into which the coordinates fall. A cluster code is the concatenation of p letters. More precisely, if f(z) is the probability density function for a random variable Z we determine b0,...,bk with b0 equal to minus infinity and bk equal to plus infinity such that

This corresponds to a percentile-type subdivision of space. If a coordinate z is such that z is in the interval (bi-1,bi] then z is given the i-th label.
Let z1,...,zm be m samples of a continuous random variable Z. We can estimate f by accumulating a histogram of observed samples values on a set of B bins (b0,b1],...,(bB-1, bB] defined by B+1 numbers bj<bj+1, with b0 equal to minus infinity and bB equal to plus infinity. The usual procedure for counting the number of observations among m samples in bin j>0 is

which has unfortunate sensitivity to the selection of bin boundaries since observations close to the boundary between two bins are treated as if they were in the middle of one of the bins. To reduce the sensitivity to the bin boundaries we replace the delta function observation density with a normal random variable with mean zi with variance s2. We now have
.
Weight Assignment. Suppose that a cluster analysis has assigned a code to each of the m molecules. Suppose further that mi is the number of molecules that have same cluster code as molecule i. QuaSAR-Cluster will output a weight of 1/mi for molecule i. For example, if a total of three out of the m molecules are assigned a particular cluster code then each of the three molecules will be given a weight of 1/3.
Algorithm Summary. The foregoing considerations are integrated into the QuaSAR-Cluster computational procedure:
The procedure has the following parameters (explained in the foregoing equations):
The methods described in the preceeding section have been implement in the Molecular Operating Environment (MOE) version 1998.03 in the application QuaSAR-Cluster. QuaSAR-Cluster is written entirely in the SVL programming language (built into MOE) in about 500 lines. The purpose of QuaSAR-Cluster is three-fold:
When the calculation is invoked, the following window appears:

When the OK button is pressed the calculation will begin (and may take a few moments depending on the size of the database). Note that if the Output Database is different from the Source Database then the results will not be displayed unless a Molecular Database Viewer is opened on the output database.
A different view of Molecular Custering was presented that is based upon non-parametric ranking in high dimensions. The algorithmic complexity is O(n) making it suitable for large collections. Several applications are apparent:
Because the method uses subdivision of space (with a statistical method of subdivision) each of the clusters has some definite meaning. This is in contrast to methods the subdivide descriptor space without regard to the meaning of the subdivisions.