## Quasar-Cluster: A Different View Of Molecular ClusteringP. LabuteChemical Computing Group Inc.
1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7
## INTRODUCTIONThe problem of Molecular Clustering is simply stated: **Partition**. The method by which the partitioning is conducted is not at all obvious. Many methods have been proposed; some are based on graph theoretic ideas and some on spatial partitioning. The methods differ in algorithmic complexity from O(n) to O(n^{3}).**Similar**. What does it mean for two compounds to be similar? There are a number of approaches such as counting common features (e.g., fingerprints) or geometric distances between selected molecular descriptors.
Partitioning methods such as Jarvis-Patrick or Ward examine all
inter-compound similarities (or distances) and derive lists of
similar compounds that ultimately become the partitions. Such
methods have an O(n In this document, we examine an O(n) clustering method that uses
molecular descriptors to describe the compounds. The chief
difference between the method to follow and geometric binning is
that the binning is performed on the To motivate the method, consider the following two "problems": - Suppose that a collection of students have written a test. Each
student will have a "descriptor", namely his or her performance on
the test (e.g., 73%, 56%, 95%). If we want to select a
*representative*subset from this collection, we can sample students according to the histogram of test results. On the other hand, if we want to sample a*diverse*set of students, we can sort the test results and divide the collection into percentiles (e.g., top-third, middle-third, bottom-third) and select students from each group. This method is called nonparametric ranking. - Suppose that a QSAR model of Molar Refractivity is to be
estimated from a training set. Suppose further that the performance
of the model is to be equally good on molecules with differing
molecular weights. If the training set contains a molecular weight
bias then one may choose a subset that removes the bias (or
introduce row weights to remove the bias). Again, a
*diverse*subset is required and nonparametric ranking is one way to make this selection.
The method to follow generalizes this notion of nonparametric ranking to higher dimensions. The implementation of the method in MOE 1998.03, called QuaSAR-Cluster, also is capable of generating importance weights and a principal component analysis. ## METHODSSuppose that we are given x_{i1},...,x),
consisting of the descriptors for molecule _{in}i. Further,
suppose that each molecule has an associated importance weight,
w, a non-negative real number. These weights can
be thought of as the relative probability that the associated
molecule will be encountered and are usually all equal to 1;
however in some applications unequal weights are used. Let _{i}W
denote the sum of all the weights.
If we integrate both sides of the supposed affine transform we obtain which shows that These equations suggest the following method for estimating the
Now, the sample covariance matrix, where X
vectors have p principal components. In practice, we
restrict the selection of p further by limiting the
condition of the estimated R matrix;
that is, we choose ^{T}Rp so that the largest eigenvalue divided
by the smallest eigenvalue is less than some specified
threshold.
b_{0} equal to minus infinity and
b equal to plus infinity such that_{k}This corresponds to a percentile-type subdivision of space. If a
coordinate z
is given the i-th label.Let m
samples of a continuous random variable Z. We can estimate
f by accumulating a histogram of observed samples values on
a set of B bins
(b_{0},b_{1}],...,(b_{B-1},
b] defined by _{B}B+1 numbers
b<_{j}b_{j+1}, with
b_{0} equal to minus infinity and
b equal to plus infinity. The usual procedure
for counting the number of observations among _{B}m samples in
bin j>0 iswhich has unfortunate sensitivity to the selection of bin
boundaries since observations close to the boundary between two
bins are treated as if they were in the middle of one of the bins.
To reduce the sensitivity to the bin boundaries we replace the
delta function observation density with a normal random variable
with mean s^{2}.
We now have.
i. QuaSAR-Cluster will output a
weight of 1/m for molecule _{i}i. For
example, if a total of three out of the m molecules are
assigned a particular cluster code then each of the three molecules
will be given a weight of 1/3.
- Calculate the sample average vector
*x*_{0}and covariance matrix*S*. - Diagonalize
*S*and select the*p*principal components. - Form the PCA transform
*Q*so that*Z*=*Q*(*x*-*x*_{0}) has identity covariance and mean 0. - For each
*i*write the*p-*vector*z*=Q(_{i}*x*-_{i}*x*_{0}) to the output. - Estimate
*p*probability density functions from the*z*._{i} - For each density function compute the percentile-type intervals.
- For each
*i*output a cluster label based on the*p*coordinates of the*z*._{i} - For each
*i*determine*m*the number of molecules with the same cluster code._{i} - For each
*i*output the weight 1/*m*._{i}
The procedure has the following parameters (explained in the foregoing equations): `maxcomp`: the maximum number of components (the maximum value of*p*).`maxcond`: the maximum condition of the*Q*transform.^{T}Q`swidth`: the normal histogram smoothing width*s*.`ncodes`: the number of subdivisions per axis,*k*.
## IMPLEMENTATIONThe methods described in the preceeding section have been implement in the Molecular Operating Environment (MOE) version 1998.03 in the application QuaSAR-Cluster. QuaSAR-Cluster is written entirely in the SVL programming language (built into MOE) in about 500 lines. The purpose of QuaSAR-Cluster is three-fold: *Clustering*. When designing experiments it is often necessary that a "diverse" collection of molecules be used (in order to avoid bias). QuaSAR-Cluster can assign codes to a collection of molecules, based upon a table of molecular descriptors, so that "similar" molecules in the collection can be quickly identified.*Dimension Reduction*. Often, a table of molecular descriptors contains columns that are correlated and differ in relative scale. This can be a problem for distance-based diversity analysis since an Euclidean metric cannot be used directly (without overemphasizing some subset of the descriptors). QuaSAR-Cluster is capable of calculating a new table of descriptors, often smaller, that are uncorrelated and normalized (mean 0 and variance 1).*Assigning Weights*. When a moderately large collection of molecules is used in a QSAR study, it is important that bias not be introduced due to the particular selection of molecules. For example, near duplicate molecules can bias the fit. QuaSAR-Cluster can assign weights to molecules so that, for instance, the two near duplicate molecules will be assigned a weight of 0.5.
When the calculation is invoked, the following window appears: **Source Database**. This field indicates the name of the database that contains the fields to be analyzed. If the toggle box beneath this field is checked then only the database entries (rows) that are selected in the Molecular Database Viewer will be used in the calculation.**Weight Field**. The name of the field that will be used to weight individual entries of the database. These weights should be non-negative; negative weights will be set to zero for the calculation. If the Weight Field is set to (NONE), then all entries will have a weight of 1.**Fields**. All of the numeric fields in the database are presented in a list. If no fields are selected, then all the numeric fields (except the Weight Field) will be used in the calculation. If fields are selected in the list, then the selected fields will be used in the calculation and the unselected fields will be ignored.**Output Database**. This field contains the filename of the database into which the results of the calculation will be written. If this filename is different from the Source Database, a new database will be created. If the Output Database is empty or equal to the Source Database, the results will be written to the Source Database. If the Output Database is different from the Source Database, all database fields except those used in the analysis will be copied to the Output Database.**Principal Components**. If checked, then the calculated principal components will be written in fields prefixed with the text in**Field Prefix**. For example, the default Field Prefix is "PCA"; if there are four principal components then these values will be written in fields PCA1, PCA2, PCA3 and PCA4.**Cluster Code**. If checked then the calculated cluster code will be written to the character-type field named in the**Cluster Field**.**Cluster Weight**. If checked then a new field with name**New Weight Field**will be created and will contain the inverse of the number of members in the cluster of each entry. For example, if a particular entry is assigned a code that is identical to three other entries then a value of 0.25 will be written.**Smoothing**. This field controls the width of the smoothing distribution used in accumulating the distributions during clustering. It is the standard deviation of a normal distribution (the*s*value in the theory section). This field is not used if Cluster Code is not checked.**Code Count**. This field controls the number of equiprobable subdivisions of each principal component axis; that is, the number of letter codes used per axis. Smaller values lead to fewer clusters.**Component Limit**. This sets a limit on the number of principal components selected. A value of zero means that no limit is set; otherwise, at most the specified number of largest principal components will be used.**Condition Limit**. This sets the maximum condition (see theory section) number of the principal component transform.
When the OK button is pressed the calculation will begin (and may take a few moments depending on the size of the database). Note that if the Output Database is different from the Source Database then the results will not be displayed unless a Molecular Database Viewer is opened on the output database. ## SUMMARYA different view of Molecular Custering was presented that is based upon non-parametric ranking in high dimensions. The algorithmic complexity is O(n) making it suitable for large collections. Several applications are apparent: - Bias Removal in QSAR data sets.
- Diverse Subset Selection.
- Diverse Reagent Selection.
Because the method uses subdivision of space (with a statistical method of subdivision) each of the clusters has some definite meaning. This is in contrast to methods the subdivide descriptor space without regard to the meaning of the subdivisions. |