Programming with SVL Vectors
Martin Santavy
Chemical Computing Group Inc.
In this article we show vector programming techniques using
Coulomb's law as an example. We will exploit SVL's fundamental
features  elementwise operators and unitextended arguments  to
vectorize the scalar code implementation of the basic formula.
Further, we will discuss and use more advanced programming
techniques of lamination and packeting of vector data.
Contents
SVL is a vector language. It is the vector nature of the
language and its instructions that allows SVL to perform well.
SVL programs are not compiled directly into machine code; they
are first compiled into a machineindependent bytecode, which is
then interpreted at runtime by an interpreter. To overcome the
overhead cost of the interpreter, the instructions are vectorized,
i.e. they access and manipulate whole vectors of data. The larger
the vector involved, the smaller the cost of the interpreter per
item of data.
For maximum efficiency, SVL programs should always manipulate
large chunks of vector data and minimize their access to scalar
items.
Consider the formula for matrix multiplication:
C_{ij} = sum_{{k = 1..n}} A_{ij}
* B_{ij}
Scalar code must loop through all row and column indices of the
matrix to access its elements. It also requires an additional loop
to calculate the sum:
for i = 1, n loop
for j = 1, n loop
sum = 0;
for k = 1, n loop
sum = sum + A(i)(k) * B(k)(j);
endloop
C(i)(j) = sum;
endloop
endloop
Vector code, on the other hand, uses generalized operations of
addition and multiplication that operate in elementwise fashion on
a whole matrix. Combined with the SVL mechanism of unit extension,
the vector code comprises a single expression:
C = app add (A * [B]);
which typically executes at least an order of magnitude faster
than the scalar code.
This illustrates that the vector approach makes SVL programs
both simpler and faster. Most problems, however, are not as
straightforward as matrix multiplication and require a little more
thought to be vectorized properly. We'll show this on a more
elaborate example in the following section.
In this section, we calculate electrostatic interaction between
n charged points. The npoint formula is a generalization of the
formula for 2point interaction. Function
force1 implements the formula in a
straightforward way, using only FORTRANlike scalar code.
(In the next section we will see how this code, rewritten in
force2, can benefit from SVL's vector
nature.)
The electrostatic interaction between two charged points is
described by Coulomb's law. Given two charged points,
P_{i} and P_{j} ,
the force f_{ij} of electrostatic
repulsion of P_{i} and
P_{j} is:
f_{ij} = q_{i} * q_{j} /
d^{2}_{ij}
d^{2}_{ij} = x^{2}_{ij}
+ y^{2}_{ij} + z^{2}_{ij}
where:
x_{ij} = x_{j}  x_{i}
, y_{ij} = y_{j} 
y_{i} , z_{ij} =
z_{j}  z_{i}
Vectors [ x_{i}, y_{i}, z_{i} ]
and [ x_{j}, y_{j}, z_{j} ]
represent the Cartesian coordinates of points
P_{i} and P_{j}. Scalars
q_{i} and q_{j} represent
their charges.
The vector of force given by Coulomb's law is then:
F_{ij} = [ Fx_{ij}, Fy_{ij},
Fz_{ij} ]
where:
Fx_{ij} = x_{ij} / d_{ij} *
f_{ij} = x_{ij} * q_{i} *
q_{j} / d^{3}_{ij}
Fy_{ij} = y_{ij} / d_{ij} *
f_{ij} = y_{ij} * q_{i} *
q_{j} / d^{3}_{ij}
Fz_{ij} = z_{ij} / d_{ij} *
f_{ij} = z_{ij} * q_{i} *
q_{j} / d^{3}_{ij}
The following piece of SVL code evaluates the expression
above:
// Calculate 2point interaction
local x_ij = x_i  x_j;
local y_ij = y_i  y_j;
local z_ij = z_i  z_j;
local d3_ij = cube sqrt
(sqr(x_ij) + sqr(y_ij) + sqr(z_ij));
Fx_ij = x_ij * q_i * q_j / d3_ij;
Fy_ij = y_ij * q_i * q_j / d3_ij;
Fz_ij = z_ij * q_i * q_j / d3_ij;
(This code will form the body of the inner loop of function
force1.)
Instead of just two points, let us now consider n charged
points. The force on one charged point is the sum of its pairwise
interactions with the remaining n1 points. The x, y
and zcomponents of the force on point P_{i}
are given by:
Fx_{i} = sum_{{j = 1..n}}
x_{ij} * q_{i} * q_{j} /
d^{3}_{ij}
Fy_{i} = sum_{{j = 1..n}}
y_{ij} * q_{i} * q_{j} /
d^{3}_{ij}
Fz_{i} = sum_{{j = 1..n}}
z_{ij} * q_{i} * q_{j} /
d^{3}_{ij}
Strictly speaking, the above sum must exclude the case i =
j. However, since:
x_{ii} = y_{ii} = z_{ii} = 0
we can simply set d_{ii} to any nonzero
value (to avoid division by 0) and keep the case
i = j in the sum.
This can be done by requiring all points to have a positive
minimum distance:
d3_ij = maxE [REAL_MIN, d3_ij];
Let us assume that the charges and coordinates of the n
charged points are stored in vectors q, x, y, z.
The following SVL function incorporates the above SVL code to
calculate and return three vectors, Fx, Fy, Fz, each containing the
x, y and z components of the electrostatic forces.
The outer loop calculates Fx_{i}, Fy_{i},
Fz_{i} for each point P_{i}.
The inner loop calculates and adds the force contributions from
remaining n1 points on point
P_{i}.
function force1 [q, [x,y,z]]
local n = length q; // assume equal lengths of q,x,y,z
local i, j, Fx, Fy, Fz, x_ij, y_ij, z_ij, d3_ij;
for i = 1, n loop
Fx(i) = Fy(i) = Fz(i) = 0;
for j = 1, n loop
x_ij = x(i)  x(j);
y_ij = y(i)  y(j);
z_ij = z(i)  z(j);
d3_ij = cube sqrt
(sqr(x_ij) + sqr(y_ij) + sqr(z_ij));
d3_ij = maxE [ REAL_MIN, d3_ij ];
Fx(i) = Fx(i) + x_ij * q(i) * q(j) / d3_ij;
Fy(i) = Fy(i) + y_ij * q(i) * q(j) / d3_ij;
Fz(i) = Fz(i) + z_ij * q(i) * q(j) / d3_ij;
endloop
endloop
return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction
[Click to compare: 2point interaction 
force2  force3 
force4 ]
The constant COULOMB_SCALE represents an appropriate
scaling factor.
Vectorizing Coulomb's Law
This section demonstrates how to modify the npoint formula to
take advantage of SVL's vector nature.
Not considering minor changes in the syntax, function force1 (see section above) is essentially identical
to a function one might write in FORTRAN or C to calculate the
required values. This type of coding, however, puts SVL at a
disadvantage. As SVL code, the inner loop of the function
calculates the sum in an inefficient manner. Each loop iteration
calculates one term in the sum by accessing one element of vectors
x, y, z and q. This is (relatively) slow.
The jth iteration of the loop works with the
jth element of each vector. It is, however, far more
preferable to work with each vector as a whole. We will replace
each access of the jth element, e.g.
"x(j)", by access of the whole vector, i.e.
"x". Instead of a sequence of scalar terms of
the sum, each produced in one iteration of the inner loop, we
create a whole vector of terms, which can be summed by a single
add operation.
Line:
x_ij = x(i)  x(j);
becomes:
x_ij = x(i)  x;
and line:
Fx(i) = add (x_ij * q(i) * q(j) / d3_ij);
becomes:
Fx(i) = add (x_ij * q(i) * q / d3_ij);
Similar transformation applies to lines manipulating the y and
zcomponents of the data. Notice that except for the missing 1 j,
the expressions look identical.
As an additional measure of efficiency, instead of first
calculating distances:
d3_ij = maxE [ ... ];
and then using them as a part of the expression:
q(i) * q / d3_ij
in the consequent formulas in the program, we will directly
calculate:
qq_d3_ij = q(i) * q / maxE [ ... ];
and use that value instead.
Also, a slight improvement is obtained by giving the result
vectors the correct size right from the beginning, instead of
having their length increase item by item with each iteration:
Fx = Fy = Fz = zero q; // preallocate result
The following function uses these simple steps to vectorize
function force1 from the previous section:
function force2 [q, [x,y,z]]
local n = length q; // assume equal lengths of q,x,y,z
local i, Fx, Fy, Fz, x_ij, y_ij, z_ij, qq_d3_ij;
Fx = Fy = Fz = zero q; // preallocate result
for i = 1, n loop
x_ij = x(i)  x;
y_ij = y(i)  y;
z_ij = z(i)  z;
qq_d3_ij = q(i) * q / maxE [
REAL_MIN,
cube sqrt (sqr(x_ij) + sqr(y_ij) + sqr(z_ij))
];
Fx(i) = add (x_ij * qq_d3_ij);
Fy(i) = add (y_ij * qq_d3_ij);
Fz(i) = add (z_ij * qq_d3_ij);
endloop
return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction
[Click to compare: 2point interaction 
force1  force3 
force4 ]
Notice how little the code has changed (except for the
replacement of variable d3_ij by variable
qq_d3_ij ). This is fairly typical of well written SVL
vector code: it is syntactically similar, or even identical, to its
scalar equivalent.
The difference in speed, however, is substantial. While the
scalar code will always incur a heavy overhead of the interpreter,
the vector code will compete in speed with code written in C or
FORTRAN.
Further Vectorization Techniques:
Laminating and Packeting Vector Code
In this section, function force2 is
vectorized further in functions force3 and
force4 using more sophisticated programming
techniques.
A naive attempt at further vectorization would try to mimic the
vectorization technique seen in the preceding section and replace
the loop by a vector in which each item corresponds to one
iteration of the loop.
This approach, however, does not usually meet with success for
several reasons:
 All operations have to be nested an extra level. The added
level of nesting makes many expressions much harder to read and
debug.
 The extra level of complexity makes it more difficult for SVL
to optimize the stream of vector instructions, which may result in
slower, rather than faster execution of the program.
 The generated intermediate vector expression may become too
large for the machine to handle, even for moderately sized
problems.
Instead, we'll use a vectorization technique we call
"lamination," which is analogous to a carpenter's technique of
glueing several sheets of wood together to form a single piece that
can be cut and processed as a single unit.
We, too, will "laminate" multiple items into a single unit
vector, which will be "processed" by SVL functions and operations
in the same way a single scalar item would be. This will allow us
to use the code as is, with only minor modifications.
Let us assume that the scalar 1, i, from function force2 will
now represent a whole vector of indices. We will replace each
scalar value that represents an ith element, such as
"x(i)", by a unit vector that laminates
elements of the given indices, i.e.
"[x[i]]".
Line:
qq_d3_ij = q(i) * q / maxE [ ...
becomes:
qq_d3_ij = [q[i]] * q / maxE [ ...
and lines:
x_ij = x(i)  x;
Fx(i) = ...
become:
x_ij = [x[i]]  x;
Fx[i] = ...
A similar change is applied to lines manipulating the y and
zcomponents.
Notice that we did not put extra brackets around
Fx[i]. This is because
Fx[i] does not interact with any other
variables in the expression, so it is safe to leave it "as is."
This scalartolaminatedvector transformation addresses the
concerns raised in points 1 and 2 above: by not modifying
the basic formulas we not only preserve the readability and
correctness of the original program, but we also keep essentially
the same stream of vector instructions to be executed.
The problem of "memory explosion," raised in point 3, however,
needs to be addressed separately. To prevent the intermediate
expressions from becoming too large, we have to limit the amount of
laminated items and repeat the calculation until all data are
processed. We call this technique "packeting," a term inspired by
the methods of moving voice traffic used in modern day telephone
networks (or methods of moving cargo and mail by boats in former
days).
We will keep the loop structure of the previous function.
Instead of using a scalar 1, however, we use a whole vector
(packet) of indices. Given a problem of size n, each scalar 1 is
involved in access and manipulation of n points. The size of the
packet of indices must therefore be inversely proportional to n:
for example, to limit the amount of data processed in each
iteration to 100000, the packet size must be no greater than
100000 / n.
The loop 1, i, will change from a single number:
for i = 1, n loop ...
to a vector (packet) of indices:
packets = split [igen n, ceil (100000 / n)];
for i in packets loop ...
The following function applies this simple transformation to the
code of function force2 from the previous section:
function force3 [q, [x,y,z]]
local n = length q; // assume equal lengths of q,x,y,z
const PACKET = 100000;
local i, packets, Fx, Fy, Fz, x_ij, y_ij, z_ij, qq_d3_ij;
Fx = Fy = Fz = zero q; // preallocate result
packets = split [igen n, ceil (PACKET / n)];
for i in packets loop
x_ij = [x[i]]  x;
y_ij = [y[i]]  y;
z_ij = [z[i]]  z;
qq_d3_ij = [q[i]] * q / maxE [
REAL_MIN,
cube sqrt (sqr(x_ij) + sqr(y_ij) + sqr(z_ij))
];
Fx[i] = add (x_ij * qq_d3_ij);
Fy[i] = add (y_ij * qq_d3_ij);
Fz[i] = add (z_ij * qq_d3_ij);
endloop
return COULOMB_SCALE * [Fx, Fy, Fz];
endfunction
[Click to compare: 2point interaction 
force1  force2 
force4 ]
The performance of function force3, when compared with the
simpler code of force2, depends on the problem size.
For very small problem sizes, the vectors processed by force2
are short. It is, generally speaking, beneficial for us to package
these vectors in larger chunks, as is done in force3.
On bigger problems, however, the vectors used by force2 are
already long enough; the advantage of having larger chunks of them
disappears. The added complexity of the data structures involved
may actually cause the code of force3 to run slightly slower.
Function force3 serves as an example to demonstrate the
techniques of lamination and packeting. On our chosen problem,
these techniques, although instructive, are not used as the primary
source of vectorization and do not offer substantial benefits. On
other problems, however, they often do lead to definite and
significant gains in speed.
Let us also mention that the function above can be vectorized
even further by "bundling" the x,y and zcomponents of the
argument and the result in one 3element vector:
function force4 [q, r]
local n = length q; // assume equal lengths of q,x,y,z
const PACKET = 100000;
local packets = split [igen n, ceil (PACKET / n)];
local i, F = zero r, r_ij;
for i in packets loop
r_ij = app nest apt get [r, [i]]  r;
F = apt put [F, [i], app add (
r_ij * [[q[i]] * 1 / maxE [
REAL_MIN,
cube sqrt add sqr r_ij
]]
)];
endloop
return COULOMB_SCALE * F;
endfunction
[Click to compare: 2point interaction 
force1  force2 
force3 ]
This, however, does not improve the speed: the increase in speed
has already been realized by replacing the inner loops by long,
flat vectors in function force2 and (for smallersize problems)
laminating and packeting them together in function force3.
Furthermore, the extra level of nesting requires all SVL
toplevel functions (such as get or put) to be
applied one level deeper, which makes the source code harder to
read and debug.
Performance of SVL Code
As we pointed out in the introduction, the key to SVL
performance lies in its vector nature. SVL functions that are to be
fast need to arrange their operand as large, flat vectors of
data.
To illustrate this point, we measured the performance of the
functions force1, force2 and force3 from the
previous sections.
We calculated the Coulombic forces with the three methods above,
as well as with the potential energy function built into MOE. (The
function was handcoded in C. To make the comparisons fair, we
turned off the evaluation of all irrelevant terms in the potential
function and disabled the use of any cutoff distance.)
We used five different data samples:
 5atom "COOH" molecule
 21atom aspirin molecule
 Polystyrene molecule of 50 atoms
 PDB's "1CRN" molecule of 327 atoms
 PDB's "2ZNA" molecule of 756 atoms
and we used two different platforms: SUN's SPARC 10 and SGI's
OCTANE.
As expected, we found that for the smallest molecules the SVL
code incurs significant overhead of the interpreter. However, when
the difference is measured only in milliseconds, the issue of speed
becomes mostly irrelevant.
On mediumsized problems we can immediately see the speed
improvements of the vector code. The performance penalty, when
compared to the Ccoded implementation, may be viewed as a
reasonable tradeoff for the convenience and the flexibility of
SVL.
It is on the large size problems that the vector code really
shines. The difference in speed between SVL and C code is minimal.
In fact, when calculating the Coulombic forces in the molecule of
"2ZNA" (with n = 756) on SGI's OCTANE, the speed of our SVL
implementation was identical to that of the builtin C
routine.
The "2ZNA" molecule also pointed out the importance of the
"packeting" technique used in the force3 routine. Without it, SUN's
SPARC 10 would not find enough free space in its 32 MB of memory to
complete the task.
The following tables summarize our results:
Sun's SPARC 10
molecule 
#of atoms 
force1 
force2 
force3 
builtin 
COOH 
5 
0.0097 sec 
0.0032 sec 
0.0030 sec 
0.00077 sec 
Aspirin 
21 
0.25 sec 
0.0097 sec 
0.0097 sec 
0.0023 sec 
Polystyrene 
50 
1.4 sec 
0.040 sec 
0.033 sec 
0.0087 sec 
1CRN 
327 
64 sec 
0.62 sec 
0.91 sec 
0.41 sec 
2ZNA 
756 
340 sec 
3.3 sec 
5.4 sec 
2.1 sec 
Sgi's OCTANE
molecule 
#of atoms 
force1 
force2 
force3 
builtin 
COOH 
5 
0.0047 sec 
0.00084 sec 
0.00078 sec 
0.00018 sec 
Aspirin 
21 
0.077 sec 
0.0037 sec 
0.0026 sec 
0.00057 sec 
Polystyrene 
50 
0.41 sec 
0.0095 sec 
0.0067 sec 
0.0022 sec 
1CRN 
327 
18 sec 
0.12 sec 
0.23 sec 
0.088 sec 
2ZNA 
756 
99 sec 
0.47 sec 
1.3 sec 
0.47 sec 
Relative speeds
#of atoms 
SPARC 10
force1 / builtin 
SPARC 10
force2 / builtin 
OCTANE
force1 / builtin 
OCTANE
force2 / builtin 
5 
12.6 
4.2 
26.1 
4.7 
21 
109 
4.2 
135 
6.5 
50 
160 
4.6 
186 
3.0 
327 
156 
1.5 
205 
1.4 
756 
162 
1.6 
211 
1.0 
As we saw in the above examples, it is essential for
computationintensive programs to take advantage of SVL's ability
to operate with vectors. The structure of the language makes the
task easy to accomplish: the source code of a welldesigned
vectorized function is often almost identical to that of its scalar
equivalent.
The key to vectorization is the proper use of the SVL operators
and their ability to apply the operation to all elements of their
arguments; the builtin unit extension of the arguments simplifies
much of the code. The basic technique of vectorization is the
replacement of scalar expressions in inner loops by expressions on
whole vectors of terms. The more advanced techniques of
vectorization lamination and packeting.
These techniques are not difficult to master and will reward the
SVL programmer with highquality, highperformance code.
Martin Santavy is the author of much of MOE's SVL software at
Chemical Computing Group Inc. 1010 Sherbrooke Street W, Suite 910,
Montreal, Quebec, Canada H3A 2R7. His Email address is
and he can be reached by telephone at (514) 3931055.
