Contents
Introduction | The Data Parallel Trend | Existing LanguagesSVL is a new high-performance data-parallel programming language built into the MOE Molecular Operating Environment. SVL is an embedded language; that is, its compiler and run-time environment are an integral part of MOE. SVL serves as the command, macro, scripting, and high-performance computing language of MOE.
The tight integration of SVL and the molecular computing services of MOE result in a unique, high-performance and flexible molecular operating environment that minimizes the time taken to convert ideas into results.
SVL was designedSVL was designed to be the vehicle for the exploitation of both sequential and parallel computing platforms. The future of industrial scientific computing lies in the efficient use of the entire spectrum of computer platforms from low end Pentium Pro platforms through shared memory multiprocessors to high-end vector pipeline and massively parallel architectures.
In this document, we will outline the ideas behind SVL and data-parallelism as a programming paradigm. We present some of the reasoning behind the decision to design and implement a new programming language. Finally, we attempt to give the flavor of SVL (although a thorough description is beyond the scope of this document).
A language is called collection-oriented if aggregate data structures (e.g., sets, sequences, arrays, vectors and lists) as well operations for manipulating them "as a whole" (summing, sorting, permuting and applying a function to each member) are primitives in the language.
Collection-oriented languages discourage explicit inner loops. For this reason they are, in most cases, ideally suited for implementation on massively parallel machines. The parallelism inherent in the primitive operations removes the need for sophisticated compiler analysis to uncover available parallelism.
Indeed, the designers of high-level languages for massively parallel machines have turned to collection-oriented languages because of the difficulty in automatically exploiting parallelism. Languages such as C*, *LISP, CM-LISP (Connection Machine), AL and Apply (Warp), Parallel Pascal (MPP) and the array extensions to Fortran and High Performance Fortran (Cray) all exploit collection-oriented primitives and constructs.
SVL is a collection oriented language; more precisely, it is a data-parallel language. A data-parallel language has the following important characteristics:
Imperative Style. Imperative languages (e.g., C and Fortran) can be compiled more efficiently and are more familiar to the wider programming community. Functional languages (e.g., Haskell and SETL) currently exhibit poor array performance, inefficient use of memory and poorly support of random numbers.
Explicit Parallelism. The programmer and compiler work together to produce efficient parallel (and sequential) code. It is unreasonable to expect a compiler to extract parallelism that has been hidden inside sequential control and data structures.
Processor-Primitive View. A critical problem in generating code for distributed memory machines is determining how to distribute the data among the individual memories available. Data parallel languages encourage the expression of algorithms in the fundamental unit of parallelism; that is, the operations performed by the processors.
SIMD focus. Each processor executes the same instructions in a synchronous way; that is, language is Single Instruction stream, Multiple Data stream (SIMD).
Global Data Access. Each processor has access to the memories of any other processor. Processor interaction is through expressions of the language rather than explicit message passing.
The trend towards data-parallelism is due to a number of key observations and wide-spread experiences including
the difficulty in programming parallel computers with the message passing paradigm;
the success of vector architectures (e.g., Cray) and shared memory multi-processors;
the natural mapping of scientific calculations onto the primitive operations of data parallel languages;
that sequential processor engineering assumptions (e.g., localization of memory access) are not as likely to be violated in the data-parallel paradigm.
The result has been a wealth of research (e.g., at CMU and MIT) into data parallel languages. Many language groups have shown that the paradigm is extremely successful in that it allows for efficient implementations on a wide variety of computer architectures, even sequential processors.
Data parallel languages in general, and SVL in particular, have five important strengths:
Versatility. Data parallelism is the most natural paradigm for a large proportion of scientific and engineering computing. In biology, chemistry, chemical engineering, geology, earth and space science, physics, astronomy, computer science and other fields, experience has shown that the source of parallelism is essentially always data parallelism.
Practicality. It is easy to convert Fortran or C paradigms into the SVL paradigm. The imperative style of data-parallel languages eliminates the radical paradigm normally associated with parallel programming.
Programmability. SVL programs are easier to write than programs using lower level message passing constructs. The synchronous execution model means that there is only one thread of control; therefore, deadlock and race conditions are eliminated. Global data access eliminates message passing considerations from the programming process even when the underlying architecture has distributed memory.
Portability. Software portability is of critical importance. Very often, portability and maintainability are more important than raw performance due to the short life-span of high-performance computing architectures. SVL programs are more portable than programs written in a language closer to the underlying architecture.
Reasonable Performance. Data-parallel languages exhibit high speedups on a wide variety of computer architectures. Naturally, hand-coded programs using lower level parallelism can be faster. The situation is, however, analogous to assembly language. A proficient assembly language programmer can always write programs that run faster than C or Fortran programs yet very little programming is actually done in assembly language. Other considerations, such as portability, maintainability and programmer productivity in general outweigh the speed advantages of assembly language. As parallel computing enters the mainstream, extracting microsecond speedups will become less important.
The decision to design a new programming language was not taken lightly. There are, perhaps, a dozen credible programming languages that, on the surface, appear to satisfy the requirements. However, after careful study, the conclusion was reached that a new language was necessary. In this section, some of the reasoning leading to this conclusion will be described.
Existing candidate languages for scientific computing can be divided into two broad classes. (Please note, however, that this classification is not strict in the sense that most languages contain features from many classes.)
Imperative languages are, loosely speaking, the traditional languages and the more recent object oriented languages. Examples of languages in this class are C, Fortran, Pascal, Modula, C++, Ada, SML, and APL.
Imperative languages like C and Fortran can deliver high performance and can be used for large scale problems. Unfortunately, they are not interactive, parallelize poorly, and are not concise. The goals of High Performance Fortran are a step in the right direction; however, the result is Fortran with a few vector constructs and not a fundamental change. Object oriented languages like C++ and SML can at times be concise; however, they still suffer from the drawbacks of the other imperative languages.
It was not the imperative nature of these languages that made them poor candidates for embedding into MOE. Rather, it was the fact that they were low level languages; that is, paradigms in which the programmer must be concerned with things like memory allocation and management, as well as all looping and iteration control. Furthermore, this low level focus makes programs written in these languages difficult to parallelize (e.g., the efforts toward automatic vectorization of Fortran programs have led to the conclusion that a new language, High Performance Fortran, was required).
Functional languages are modern languages (in some cases, still experimental) that attempt to provide powerful new features and constructs by sacrificing the imperative notion of state. In functional languages, side effects are banned (e.g. global data structures or COMMON blocks). A program becomes, in effect, one large mathematical expression or function. A consequence of this is that there are no traditional looping constructs and recursion must be used (although compilers can convert some recursive algorithms into loops). In exchange for these sacrifices, functional languages offer powerful notational syntax, elegant and concise programs and a strong mathematical flavor. Examples of languages in this class are Haskell, Lisp, NESL, SETL, and ML.
A recent survey (Zorn 1995) of the use of nonprocedural languages for numerical computing revealed that functional programs can be elegant, concise and very expressive for mathematical algorithms; however, existing functional language implementations have a number of severe problems that make them (currently) unsuitable for industrial application:
Poor array performance Although most functional languages are experimental and have immature implementations, the paradigm forces certain inefficiencies: lack of in-place updating results in the fact that array manipulations are slower by a factor of 50 to 10,000 and required memory garbage collection produces high levels of fragmentation and virtual memory usage.
Poor support for large scale problems Functional languages generally require vast amounts of memory even for moderately sized problems; for example, Gaussian elimination with Haskell uses well over 16Mb for a 20 by 20 matrix.
Poor support for random numbers The elimination of side effects results in a difficulty handling random numbers. This is unfortunate, since random algorithms are extremely important in simulations and parallel algorithms.
The latter two problems are inherent to the functional paradigm; however, the first is not. Indeed, languages such as NESL attempt to solve the first problem; however, they are still in the experimental stage and are therefore more suited to language research than industrial application.
SVL is both a compiled and interactive language. In this way SVL can serve as a command language and as a forum for the development of complex scientific applications. The interactive component of the language consists of the evaluation of expressions that make up the core of applications programs. Flow control constructs and function definitions form the remainder of the language.
The unit of data in SVL is the vector. Vectors are manipulated with functions that operate on a vector and return a result vector. Syntactic shorthands are provided for common functions such as addition and multiplication. A statement of data manipulation is an expression. Expressions and flow constructs can be grouped into functions.
SVL is a very rich language. A full description is beyond the scope of this document; however, the flavor of SVL will be described below.
An SVL vector is a quantity very much like the array that is found in many programming languages. The general ordered vector x has the form
Each vector x(i) is called an element of x and n is the length of x. Vectors can nest to arbitrary depths since each of the elements of x is, itself, a vector. If n=0 then x is a null vector while if n=1 then then x is called a unit.
All vectors are constructed from the set of atomic unit vectors. This set includes all
Atomic unit vectors have the unique property that they are equal in value to their first element. It is this property that eliminates the need to distinguish between scalars and vectors.
SVL supports the traditional infix syntax for mathematical operators; for example,
are all legal SVL expressions. Operators proceed element-wise, pairing an element from one operand with the corresponding element in the other operand. In the case where a unit vector operates with another vector, the unit vector automatically extends to the length. This principle of unit extension provides scalar-vector multiplication, for example. Unit extension is not limited to atomic units. To form an outer product one uses unit extension as follows:
The vector on the right is a unit; it consists of a single element, a 3-vector. It is extended to each element of the vector on the left. The resulting vector is
SVL provides a rich set of built-in functions such as trigonometric functions, random number generators, sampling, sorting and permutation functions as well as mathematical reduction functions. For example, the add function sums all of the elements in a vector. The dot product of two vectors x and y is computed with add(x*y). while the 2-norm of a vector is computed with
which almost reads as "the square root of the sum of the squares". The power of the unit extension principle and the natural syntax are extremely expressive. For example, if u = [x,y,z] is a vector containing three length n-vectors of coordinates, then the norm of each of the n 3-vectors [x(i),y(i),z(i)] can be computed in parallel with the code fragment sqrt add sqr u. Note the similarity with the code fragment for the 2-norm of a simple n-vector.
The familiar Velocity Verlet integration step (used in molecular dynamics simulations) can be coded with
which operates on all atoms at the same time. Note that this is practically the theoretical description of the algorithm. The above fragment of SVL code, together with a few lines of code for initialization and flow control form a complete dynamics simulation program.
SVL is an integral part of MOE. It does not consist of a separate compiler and execution model but rather a tightly coupled command and control paradigm.
Specific SVL functions are provided to modify and examine the molecular data structures used for visualization and calculation. SVL can control such things as how molecules are rendered, the geometry and properties of atoms, and configuration of the potential energy model.
SVL is the key to the programmability of the MOE interface in that menu commands trigger the execution of SVL expressions. Any program written in SVL can be run by making an attachment to the MOE menus. Interaction and dialog with users is supported with an automatic graphical interface constructor. Thus any SVL program can have a graphical user interface.
SVL is interactive and, indeed, is the command language of MOE. All functions available in programs are available at the MOE command line. New commands can be tested interactively and, when ready, introduced to the menus. The high level of integration of SVL and the molecular modeling services of MOE result in an extremely flexible and efficient environment for computational chemistry.
With traditional programming languages and script-based tool control, many chemists are discouraged from implementing new techniques. In many cases, a simple idea can only be tested after days or months of programming. With MOE and SVL, new ideas and methods can be tried out in a matter of hours.
MOE is built on the embedded language architecture; that is, MOE contains a compiler, linker and run-time execution environment for SVL, the Scientific Vector Language. With SVL, programs can be developed and run from within MOE.
We now present an example of what an SVL program looks like. The context of this example is Conformational Search; more specifically, a modified version of the Random Incremental Pulse Search (RIPS) as described in
Given that a molecule has been loaded into the MOE system, the RIPS methodology proceeds as follows. (Each step will be described below.)
The program will be developed in the following sections and a program listing will appear at the bottom of the page. From the outline described above the methodology requires the services of the internal molecular objects, the forcefield, energy minimization and structural comparison and molecular database services. We structure the program as follows.
The #set line is optional and sets the title of the program module. This title will appear the graphical module management window of MOE. The global keyword indicates that the function to be defined will be visible at the MOE command line. (The local keyword indicates that the function to be defined is private to the program module.)
The semantics of the function definition is as follows. RIPS accepts one argument, called options. This argument may take on many values depending on what the caller passes in; however, the intent is that the caller will use a calling sequence of the form
Each attempt to generate a new conformation consists of a conformation perturbation followed by energy minimization. The described methodology uses a uniform random Cartesian coordinate perturbation.
Notes:
Energy minimization is performed with a call to the Molecular Mechanics services of (a piece of SVL code defined in another module). In MOE, these services are accessed with the MOE function (that must be declared as external).
We will now fill in some of the RIPS function body (although we will modify it shortly). The first version will perform a specific number of perturbations and energy minimizations.
Notes:
In order to produce unique conformations two things will have to be done. Firstly, all of the unique conformations generated will have to be kept somewhere. Secondly, as each new conformation is generated it will have to be compared to the known ones.
We modify the RIPS function to keep all of the unique conformers in a single vector as follows.
Saving the conformations is accomplished by creating a MOE molecular database file (which can be visualized with the Database Viewer). We encapsulate the writing of the conformations in the following function.
Putting all of the code fragments together, we end up with the following SVL program (roughly 100 lines).