
Programming with SVL Vectors


Martin Santavy

Chemical Computing Group Inc.

In this article we show vector programming techniques using Coulomb's law as an example. We will exploit SVL's fundamental features - element-wise operators and unit-extended arguments - to vectorize the scalar code implementation of the basic formula. Further, we will discuss and use more advanced programming techniques of lamination and packeting of vector data.

Vectors are the Key to SVL Performance

SVL is a vector language. It is the vector nature of the language and its instructions that allows SVL to perform well.

SVL programs are not compiled directly into machine code; they are first compiled into a machine-independent byte-code, which is then interpreted at run-time by an interpreter. To overcome the overhead cost of the interpreter, the instructions are vectorized, i.e. they access and manipulate whole vectors of data. The larger the vector involved, the smaller the cost of the interpreter per item of data.

For maximum efficiency, SVL programs should always manipulate large chunks of vector data and minimize their access to scalar items.

Consider the formula for matrix multiplication:

Cij = sum{k = 1..n} Aik * Bkj

Scalar code must loop through all row and column indices of the matrix to access its elements. It also requires an additional loop to calculate the sum:

    for i = 1, n loop
        for j = 1, n loop
            sum = 0;
            for k = 1, n loop
                sum = sum + A(i)(k) * B(k)(j);
            endloop
            C(i)(j) = sum;
        endloop
    endloop

Vector code, on the other hand, uses generalized operations of addition and multiplication that operate in element-wise fashion on a whole matrix. Combined with the SVL mechanism of unit extension, the vector code comprises a single expression:

    C = app add (A * [B]);

which typically executes at least an order of magnitude faster than the scalar code.
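For readers outside SVL, the same contrast can be sketched in NumPy, whose broadcasting plays a role similar to SVL's unit extension (a hedged analogue, not SVL semantics):

```python
import numpy as np

def matmul_scalar(A, B):
    """Triple-loop matrix product, analogous to the scalar SVL code."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_vector(A, B):
    """Vectorized form: one expression over whole arrays.

    A[:, :, None] * B[None, :, :] broadcasts to an n x n x n array of
    products A[i][k] * B[k][j]; summing over k yields C.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    return (A[:, :, None] * B[None, :, :]).sum(axis=1)
```

As with the SVL one-liner, the vectorized form pushes the per-element work out of the interpreted loop and into bulk array operations.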

This illustrates that the vector approach makes SVL programs both simpler and faster. Most problems, however, are not as straightforward as matrix multiplication and require a little more thought to be vectorized properly. We'll show this on a more elaborate example in the following section.


Coding Coulomb's Law

In this section, we calculate electrostatic interaction between n charged points. The n-point formula is a generalization of the formula for 2-point interaction. Function force1 implements the formula in a straightforward way, using only FORTRAN-like scalar code.

(In the next section we will see how this code, rewritten in force2, can benefit from SVL's vector nature.)

The electrostatic interaction between two charged points is described by Coulomb's law. Given two charged points, Pi and Pj , the force fij of electrostatic repulsion of Pi and Pj is:

fij = qi * qj / dij^2
dij^2 = xij^2 + yij^2 + zij^2

where:

xij = xj - xi , yij = yj - yi , zij = zj - zi

Vectors [ xi, yi, zi ] and [ xj, yj, zj ] represent the Cartesian coordinates of points Pi and Pj. Scalars qi and qj represent their charges.

The vector of force given by Coulomb's law is then:

Fij = [ Fxij, Fyij, Fzij ]

where:

Fxij = xij / dij * fij = xij * qi * qj / dij^3
Fyij = yij / dij * fij = yij * qi * qj / dij^3
Fzij = zij / dij * fij = zij * qi * qj / dij^3

The following piece of SVL code evaluates the expression above:

        // Calculate 2-point interaction
    local x_ij = x_i - x_j;
    local y_ij = y_i - y_j;
    local z_ij = z_i - z_j;
    local d3_ij = cube sqrt
        (sqr(x_ij) + sqr(y_ij) + sqr(z_ij));
    Fx_ij = x_ij * q_i * q_j / d3_ij;
    Fy_ij = y_ij * q_i * q_j / d3_ij;
    Fz_ij = z_ij * q_i * q_j / d3_ij;

(This code will form the body of the inner loop of function force1.)
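The same 2-point expression can be transliterated into plain Python for experimentation (a hedged sketch; cube and sqr are written out, and unit scaling is assumed):

```python
import math

def force_pair(q_i, p_i, q_j, p_j):
    """Force exerted on point i by point j under Coulomb's law.

    p_i and p_j are (x, y, z) tuples; the COULOMB_SCALE factor is
    omitted (taken as 1) for illustration.
    """
    x_ij = p_i[0] - p_j[0]
    y_ij = p_i[1] - p_j[1]
    z_ij = p_i[2] - p_j[2]
    d3_ij = math.sqrt(x_ij**2 + y_ij**2 + z_ij**2) ** 3
    s = q_i * q_j / d3_ij
    return (x_ij * s, y_ij * s, z_ij * s)
```

For two unit charges one unit apart on the x-axis, the force on the first charge points away from the second, as expected for repulsion.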

Instead of just two points, let us now consider n charged points. The force on one charged point is the sum of its pairwise interactions with the remaining n-1 points. The x-, y- and z-components of the force on point Pi are given by:

Fxi = sum{j = 1..n} xij * qi * qj / dij^3
Fyi = sum{j = 1..n} yij * qi * qj / dij^3
Fzi = sum{j = 1..n} zij * qi * qj / dij^3

Strictly speaking, the above sum must exclude the case i = j. However, since:

xii = yii = zii = 0

we can simply set dii to any non-zero value (to avoid division by 0) and keep the case i = j in the sum.

This can be done by requiring all points to have a positive minimum distance:

    d3_ij = maxE [REAL_MIN, d3_ij];
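The claim that the i = j term can safely stay in the sum is easy to check numerically. In the following hedged NumPy sketch (with np.finfo(float).tiny standing in for REAL_MIN), the clamped self-term contributes exactly zero, so the clamped sum equals the strict sum that excludes j = i:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])   # 1-D positions of three unit charges
q = np.array([1.0, 1.0, 1.0])
i = 0

# Clamped sum: keep the j = i term, but clamp its zero distance.
x_ij = x[i] - x                              # includes the zero self-displacement
d3 = np.abs(x_ij) ** 3
d3 = np.maximum(np.finfo(float).tiny, d3)    # analogous to maxE [REAL_MIN, d3]
F_with_self = np.sum(x_ij * q[i] * q / d3)   # self-term is 0 / tiny = 0

# Strict sum: exclude j = i outright.
mask = np.arange(len(x)) != i
x_im = x[i] - x[mask]
F_strict = np.sum(x_im * q[i] * q[mask] / np.abs(x_im) ** 3)
```

Because the self-displacement is exactly zero, dividing it by the clamped distance yields zero, and the two sums agree.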

Let us assume that the charges and coordinates of the n charged points are stored in vectors q, x, y, z.

The following SVL function incorporates the above SVL code to calculate and return three vectors, Fx, Fy, Fz, each containing the x-, y- and z- components of the electrostatic forces.

The outer loop calculates Fxi, Fyi, Fzi for each point Pi. The inner loop calculates and accumulates the force contributions of the remaining n-1 points on point Pi.

function force1 [q, [x,y,z]]
    local n = length q; // assume equal lengths of q,x,y,z
    local i, j, Fx, Fy, Fz, x_ij, y_ij, z_ij, d3_ij;
    for i = 1, n loop
        Fx(i) = Fy(i) = Fz(i) = 0;
        for j = 1, n loop
            x_ij = x(i) - x(j);
            y_ij = y(i) - y(j);
            z_ij = z(i) - z(j);
            d3_ij = cube sqrt
                (sqr(x_ij) + sqr(y_ij) + sqr(z_ij));
            d3_ij = maxE [ REAL_MIN, d3_ij ];
            Fx(i) = Fx(i) + x_ij * q(i) * q(j) / d3_ij;
            Fy(i) = Fy(i) + y_ij * q(i) * q(j) / d3_ij;
            Fz(i) = Fz(i) + z_ij * q(i) * q(j) / d3_ij;
        endloop
    endloop
    return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction
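A plain-Python transliteration of force1 may help readers without an SVL interpreter follow along (a hedged sketch; sys.float_info.min stands in for REAL_MIN, and COULOMB_SCALE is taken as 1):

```python
import math
import sys

def force1(q, x, y, z):
    """Scalar-code Coulomb forces; returns (Fx, Fy, Fz) as lists."""
    n = len(q)
    Fx, Fy, Fz = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        for j in range(n):
            x_ij = x[i] - x[j]
            y_ij = y[i] - y[j]
            z_ij = z[i] - z[j]
            d3 = math.sqrt(x_ij**2 + y_ij**2 + z_ij**2) ** 3
            d3 = max(sys.float_info.min, d3)  # stands in for maxE [REAL_MIN, d3]
            Fx[i] += x_ij * q[i] * q[j] / d3
            Fy[i] += y_ij * q[i] * q[j] / d3
            Fz[i] += z_ij * q[i] * q[j] / d3
    return Fx, Fy, Fz
```

As in the SVL version, the i = j term survives the clamp as an exact zero and does not perturb the sums.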


The constant COULOMB_SCALE represents an appropriate scaling factor.


Vectorizing Coulomb's Law

This section demonstrates how to modify the n-point formula to take advantage of SVL's vector nature.

Not considering minor changes in the syntax, function force1 (see section above) is essentially identical to a function one might write in FORTRAN or C to calculate the required values. This type of coding, however, puts SVL at a disadvantage. As SVL code, the inner loop of the function calculates the sum in an inefficient manner. Each loop iteration calculates one term in the sum by accessing one element of vectors x, y, z and q. This is (relatively) slow.

The j-th iteration of the loop works with the j-th element of each vector. It is, however, far preferable to work with each vector as a whole. We will replace each access of the j-th element, e.g. "x(j)", by an access of the whole vector, i.e. "x". Instead of a sequence of scalar terms of the sum, each produced in one iteration of the inner loop, we create a whole vector of terms, which can be summed by a single add operation.

Line:

        x_ij = x(i) - x(j);

becomes:

        x_ij = x(i) - x;

and line:

        Fx(i) = add (x_ij * q(i) * q(j) / d3_ij);

becomes:

        Fx(i) = add (x_ij * q(i) * q / d3_ij);

A similar transformation applies to the lines manipulating the y- and z-components of the data. Notice that, except for the missing index j, the expressions look identical.

As an additional measure of efficiency, instead of first calculating distances:

        d3_ij = maxE [ ... ];

and then using them as a part of the expression:

        q(i) * q / d3_ij

in the consequent formulas in the program, we will directly calculate:

        qq_d3_ij = q(i) * q / maxE [ ... ];

and use that value instead.

Also, a slight improvement is obtained by giving the result vectors the correct size right from the beginning, instead of having their length increase item by item with each iteration:

        Fx = Fy = Fz = zero q;  // pre-allocate result

The following function uses these simple steps to vectorize function force1 from the previous section:

function force2 [q, [x,y,z]]
    local n = length q; // assume equal lengths of q,x,y,z
    local i, Fx, Fy, Fz, x_ij, y_ij, z_ij, qq_d3_ij;
    Fx = Fy = Fz = zero q;  // pre-allocate result
    for i = 1, n loop
        x_ij = x(i) - x;
        y_ij = y(i) - y;
        z_ij = z(i) - z;
        qq_d3_ij = q(i) * q / maxE [
            REAL_MIN,
            cube sqrt (sqr(x_ij) + sqr(y_ij) + sqr(z_ij))
        ];
        Fx(i) = add (x_ij * qq_d3_ij);
        Fy(i) = add (y_ij * qq_d3_ij);
        Fz(i) = add (z_ij * qq_d3_ij);
    endloop
    return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction
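For comparison outside SVL, the same transformation can be sketched in NumPy (a hedged analogue, not CCG code; np.maximum and np.finfo(float).tiny stand in for maxE and REAL_MIN, and COULOMB_SCALE is taken as 1):

```python
import numpy as np

def force2(q, x, y, z):
    """Coulomb forces with a vectorized inner loop.

    q, x, y, z are equal-length 1-D arrays; the whole j-sum for each
    point i is computed by array operations instead of a scalar loop.
    """
    n = len(q)
    Fx, Fy, Fz = np.zeros(n), np.zeros(n), np.zeros(n)
    for i in range(n):
        x_ij = x[i] - x
        y_ij = y[i] - y
        z_ij = z[i] - z
        d3 = np.sqrt(x_ij**2 + y_ij**2 + z_ij**2) ** 3
        qq_d3 = q[i] * q / np.maximum(np.finfo(float).tiny, d3)
        Fx[i] = np.sum(x_ij * qq_d3)
        Fy[i] = np.sum(y_ij * qq_d3)
        Fz[i] = np.sum(z_ij * qq_d3)
    return Fx, Fy, Fz
```

As in the SVL version, the only structural change from the scalar code is that the inner loop has disappeared into whole-vector expressions.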


Notice how little the code has changed (except for the replacement of variable d3_ij by variable qq_d3_ij ). This is fairly typical of well written SVL vector code: it is syntactically similar, or even identical, to its scalar equivalent.

The difference in speed, however, is substantial. While the scalar code will always incur a heavy overhead of the interpreter, the vector code will compete in speed with code written in C or FORTRAN.


Further Vectorization Techniques:
Laminating and Packeting Vector Code

In this section, function force2 is vectorized further in functions force3 and force4 using more sophisticated programming techniques.

A naive attempt at further vectorization would try to mimic the vectorization technique seen in the preceding section and replace the loop by a vector in which each item corresponds to one iteration of the loop.

This approach, however, does not usually meet with success for several reasons:

  1. All operations have to be nested an extra level. The added level of nesting makes many expressions much harder to read and debug.
  2. The extra level of complexity makes it more difficult for SVL to optimize the stream of vector instructions, which may result in slower, rather than faster execution of the program.
  3. The generated intermediate vector expression may become too large for the machine to handle, even for moderately sized problems.

Instead, we'll use a vectorization technique we call "lamination," which is analogous to a carpenter's technique of gluing several sheets of wood together to form a single piece that can be cut and processed as a single unit.

We, too, will "laminate" multiple items into a single unit vector, which will be "processed" by SVL functions and operations in the same way a single scalar item would be. This will allow us to use the code as is, with only minor modifications.

Let us assume that the scalar index, i, from function force2 will now represent a whole vector of indices. We will replace each scalar value that represents an i-th element, such as "x(i)", by a unit vector that laminates elements of the given indices, i.e. "[x[i]]".

Line:

        qq_d3_ij = q(i) * q / maxE [ ...

becomes:

        qq_d3_ij = [q[i]] * q / maxE [ ...

and lines:

        x_ij = x(i) - x;
        Fx(i) = ...

become:

        x_ij = [x[i]] - x;
        Fx[i] = ...

A similar change is applied to lines manipulating the y- and z-components.

Notice that we did not put extra brackets around Fx[i]. This is because Fx[i] does not interact with any other variables in the expression, so it is safe to leave it "as is."

This scalar-to-laminated-vector transformation addresses the concerns raised in points 1 and 2 above: by not modifying the basic formulas we not only preserve the readability and correctness of the original program, but we also keep essentially the same stream of vector instructions to be executed.

The problem of "memory explosion," raised in point 3, however, needs to be addressed separately. To prevent the intermediate expressions from becoming too large, we have to limit the amount of laminated items and repeat the calculation until all data are processed. We call this technique "packeting," a term inspired by the methods of moving voice traffic used in modern day telephone networks (or methods of moving cargo and mail by boats in former days).

We will keep the loop structure of the previous function. Instead of using a scalar index, however, we use a whole vector (packet) of indices. Given a problem of size n, each scalar index is involved in access and manipulation of n points. The size of the packet of indices must therefore be inversely proportional to n: for example, to limit the amount of data processed in each iteration to 100000 items, the packet size must be no greater than 100000 / n.
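The packet-size arithmetic can be sketched in Python (a hedged illustration; the 100000-item budget and the ceil rounding follow the text):

```python
import math

def packets(n, budget=100000):
    """Split indices 0..n-1 into packets of at most ceil(budget / n)
    indices each, so each packet touches roughly budget pairs."""
    size = max(1, math.ceil(budget / n))
    idx = list(range(n))
    return [idx[k:k + size] for k in range(0, n, size)]
```

For n = 756 (the "2ZNA" example below), each packet holds at most ceil(100000 / 756) = 133 indices, and six packets cover all the points.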

The loop index, i, will change from a single number:

    for i = 1, n loop ...

to a vector (packet) of indices:

    packets =  split [igen n, ceil (100000 / n)];
    for i in packets loop ...

The following function applies this simple transformation to the code of function force2 from the previous section:

function force3 [q, [x,y,z]]
    local n = length q; // assume equal lengths of q,x,y,z
    const PACKET = 100000;
    local i, packets, Fx, Fy, Fz, x_ij, y_ij, z_ij, qq_d3_ij;
    Fx = Fy = Fz = zero q;  // pre-allocate result
    packets =  split [igen n, ceil (PACKET / n)];
    for i in packets loop
        x_ij = [x[i]] - x;
        y_ij = [y[i]] - y;
        z_ij = [z[i]] - z;
        qq_d3_ij = [q[i]] * q / maxE [
            REAL_MIN,
            cube sqrt (sqr(x_ij) + sqr(y_ij) + sqr(z_ij))
        ];
        Fx[i] = add (x_ij * qq_d3_ij);
        Fy[i] = add (y_ij * qq_d3_ij);
        Fz[i] = add (z_ij * qq_d3_ij);
    endloop
    return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction


The performance of function force3, when compared with the simpler code of force2, depends on the problem size.

For very small problem sizes, the vectors processed by force2 are short. It is, generally speaking, beneficial for us to package these vectors in larger chunks, as is done in force3.

On bigger problems, however, the vectors used by force2 are already long enough; the advantage of having larger chunks of them disappears. The added complexity of the data structures involved may actually cause the code of force3 to run slightly slower.

Function force3 serves as an example to demonstrate the techniques of lamination and packeting. On our chosen problem, these techniques, although instructive, are not used as the primary source of vectorization and do not offer substantial benefits. On other problems, however, they often do lead to definite and significant gains in speed.

Let us also mention that the function above can be vectorized even further by "bundling" the x-,y- and z-components of the argument and the result in one 3-element vector:

function force4 [q, r]
    local n = length q; // assume equal lengths of q,x,y,z
    const PACKET = 100000;
    local packets =  split [igen n, ceil (PACKET / n)];
    local i, F = zero r, r_ij;
    for i in packets loop
        r_ij = app nest apt get [r, [i]] - r;
        F = apt put [F, [i], app add (
            r_ij * [[q[i]] * q / maxE [
                REAL_MIN,
                cube sqrt add sqr r_ij
            ]]
        )];
    endloop
    return COULOMB_SCALE * F;
endfunction


This, however, does not improve the speed: the increase in speed has already been realized by replacing the inner loops by long, flat vectors in function force2 and (for smaller-size problems) laminating and packeting them together in function force3.

Furthermore, the extra level of nesting requires all SVL top-level functions (such as get or put) to be applied one level deeper, which makes the source code harder to read and debug.
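In NumPy terms, bundling corresponds to keeping the coordinates in a single (n, 3) array. The following hedged sketch illustrates that idea with full broadcasting; note that, unlike force4, it materializes all n x n pairs at once and therefore exhibits exactly the memory growth that packeting is designed to avoid:

```python
import numpy as np

def force_bundled(q, r):
    """Coulomb forces with bundled (n, 3) coordinates (COULOMB_SCALE = 1).

    Computes all pairwise displacements in one (n, n, 3) array; fine for
    small n, but memory grows quadratically with the number of points.
    """
    r = np.asarray(r, dtype=float)
    q = np.asarray(q, dtype=float)
    r_ij = r[:, None, :] - r[None, :, :]            # (n, n, 3) displacements
    d3 = np.sqrt((r_ij ** 2).sum(axis=2)) ** 3      # (n, n) cubed distances
    qq_d3 = q[:, None] * q[None, :] / np.maximum(np.finfo(float).tiny, d3)
    return (r_ij * qq_d3[:, :, None]).sum(axis=1)   # (n, 3) forces
```

A packeted version would slice the first axis into index packets, trading a short loop for bounded intermediate storage, just as force3 and force4 do in SVL.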


Performance of SVL Code

As we pointed out in the introduction, the key to SVL performance lies in its vector nature. SVL functions that are to be fast need to arrange their operands as large, flat vectors of data.

To illustrate this point, we measured the performance of the functions force1, force2 and force3 from the previous sections.

We calculated the Coulombic forces with the three methods above, as well as with the potential energy function built into MOE. (The function was hand-coded in C. To make the comparisons fair, we turned off the evaluation of all irrelevant terms in the potential function and disabled the use of any cutoff distance.)

We used five different data samples:

  • 5-atom "COOH" molecule
  • 21-atom aspirin molecule
  • Polystyrene molecule of 50 atoms
  • PDB's "1CRN" molecule of 327 atoms
  • PDB's "2ZNA" molecule of 756 atoms

and we used two different platforms: SUN's SPARC 10 and SGI's OCTANE.

As expected, we found that for the smallest molecules the SVL code incurs significant overhead of the interpreter. However, when the difference is measured only in milliseconds, the issue of speed becomes mostly irrelevant.

On medium-sized problems we can immediately see the speed improvements of the vector code. The performance penalty, when compared to the C-coded implementation, may be viewed as a reasonable trade-off for the convenience and the flexibility of SVL.

It is on the large size problems that the vector code really shines. The difference in speed between SVL and C code is minimal. In fact, when calculating the Coulombic forces in the molecule of "2ZNA" (with n = 756) on SGI's OCTANE, the speed of our SVL implementation was identical to that of the built-in C routine.

The "2ZNA" molecule also pointed out the importance of the "packeting" technique used in the force3 routine. Without it, SUN's SPARC 10 would not find enough free space in its 32 MB of memory to complete the task.

The following tables summarize our results:

Sun's SPARC 10

molecule      # of atoms   force1       force2        force3        built-in
COOH                   5   0.0097 sec   0.0032 sec    0.0030 sec    0.00077 sec
Aspirin               21   0.25 sec     0.0097 sec    0.0097 sec    0.0023 sec
Polystyrene           50   1.4 sec      0.040 sec     0.033 sec     0.0087 sec
1CRN                 327   64 sec       0.62 sec      0.91 sec      0.41 sec
2ZNA                 756   340 sec      3.3 sec       5.4 sec       2.1 sec

SGI's OCTANE

molecule      # of atoms   force1       force2        force3        built-in
COOH                   5   0.0047 sec   0.00084 sec   0.00078 sec   0.00018 sec
Aspirin               21   0.077 sec    0.0037 sec    0.0026 sec    0.00057 sec
Polystyrene           50   0.41 sec     0.0095 sec    0.0067 sec    0.0022 sec
1CRN                 327   18 sec       0.12 sec      0.23 sec      0.088 sec
2ZNA                 756   99 sec       0.47 sec      1.3 sec       0.47 sec

Relative speeds (time relative to the built-in C routine)

# of atoms   SPARC 10          SPARC 10          OCTANE            OCTANE
             force1/built-in   force2/built-in   force1/built-in   force2/built-in
         5   12.6              4.2               26.1              4.7
        21   109               4.2               135               6.5
        50   160               4.6               186               3.0
       327   156               1.5               205               1.4
       756   162               1.6               211               1.0

Conclusion

As we saw in the above examples, it is essential for computation-intensive programs to take advantage of SVL's ability to operate with vectors. The structure of the language makes the task easy to accomplish: the source code of a well-designed vectorized function is often almost identical to that of its scalar equivalent.

The key to vectorization is the proper use of the SVL operators and their ability to apply an operation to all elements of their arguments; the built-in unit extension of the arguments simplifies much of the code. The basic technique of vectorization is the replacement of scalar expressions in inner loops by expressions on whole vectors of terms. The more advanced techniques of vectorization are lamination and packeting.

These techniques are not difficult to master and will reward the SVL programmer with high-quality, high-performance code.


Martin Santavy is the author of much of MOE's SVL software at Chemical Computing Group Inc., 1010 Sherbrooke Street W, Suite 910, Montreal, Quebec, Canada H3A 2R7. He can be reached by telephone at (514) 393-1055.