## The A* Search And Applications To Sequence AlignmentK. Kelly, P. LabuteChemical Computing Group Inc.1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7. ## INTRODUCTIONIn this article we present an augmentation of the traditional dynamic programming method of sequence alignment capable of correctly scoring gap initiation and extension penalties when aligning two pairs of alignments. The algorithm is based upon the A* search. We present the A* search algorithm along with a proof of correctness and a description of its application to multiple sequence alignment. ## THE A* SEARCH ALGORITHMThe A* Search algorithm is a procedure to find the best path in a directed acyclic graph with associated edge scores. It is a best-first search algorithm that makes use of a heuristic to evaluate the score of unsearched portions of the graph. It has been extensively studied in the context of tree searches; however, all of its graph search properties are not generally known. Therefore, in this section, we will outline the algorithm and offer proofs of its correctness and properties. Let We distinguish two distinct nodes in For each node The A* search algorithm assumes the existence of a
We now describe the A* search algorithm in detail. The algorithm
constructs subgraph, - [
**Initialization**] Initially,*G*contains*r*and no edges. Set*C =*{},*O*= {*r*},*a*(*r*) = 1. Fix a constant*t*> 0 (this controls the termination criteria). - [
**Selection**] Let*m*be a node in*O*with maximal heuristic score*a*(*m*)*b*(*m*) over all other nodes in*O*. Set*C=C+*{*m*} and*O*=*O*{*m*}. - [
**Termination**] If*f*is in*C*and*a*(*m*)*b*(*m*) <*ta*(*f*) then terminate the algorithm. - [
**Expansion**] For each edge (*m*,*n*,*s*) in*G** (all edges originating from*m*) add*n*and the edge (*m*,*n*,*s*) to*G*. If*n*is in*O*set*a*(*n*)=max{*a*(*n*),*sa*(*m*)}; otherwise if*n*is not in*C*then set*a*(*n*)=*sa*(*m*) and*O*=*O*+{*n*}. Go to Step 2.
The construction in the Expansion step of the algorithm
guarantees that The proof of correctness will require a number of lemmas each of which is interesting on its own and provide insight into the behavior of the A* algorithm.
This lemma states that along an optimal path, the monotone
heuristic property guarantees that the upper bound on the score of
an optimal complete path
The preceding lemma states, in particular, that when the
algorithm terminates,
## APPLICATION TO MULTIPLE SEQUENCE ALIGNMENTIn comparison of protein sequences, a matrix defining relatedness between the different amino acids is used along with a scheme for penalizing the introduction of gaps. It is recognized that non-proportional gap penalizing functions - such as linear functions of the form where C is a constant "gap opening" penalty, and E is a proportional gap extension penalty, produce much better alignments than a simple proportional function (Gotoh, 1990). Given such a scoring scheme, the optimal alignment between two sequences can be calculated by the application of the dynamic programming principle. (Needleman and Wunsh, 1970). This is easily seen if we represent the space of possible alignments between two sequences of lengths M and N as a graph of nodes (i,j) (0 <= i <= M, 0 <= j <= M) where the source node is (0,0) and the goal node is (M,N). The successors of a given node (i,j) are the nodes (i+1,j), (i+1,j+1) and (i,j+1) (if they exist). In the case of only two sequences, we can derive the optimal score and path to a node (I,J) if we have already computed and stored at each of the predecessor nodes the optimal scores from each of their predecessor nodes. In principle we can similarly work in a K-dimensional space to solve the K-way alignment problem. However, the computational complexity of the recursion step grows too rapidly - faster than K factorial (Altschul, 1989). In practice, therefore, 3-way alignments tend to be the limit. Consequently the multiple alignment problem is generally attacked through iterative use of two-way alignment of alignments. However, when either of the alignment groups contains internal gaps, then the simple recursion described above is no longer applicable. This can happen whenever an internal gap in one group is matched against a new or pre-existing gap in the other group, and is followed then by a gap being matched against a character. In this case, the information required to assign the gap penalty is no longer guaranteed to be represented in the any of the predecessor nodes. This means that the problem of aligning two groups of alignments
cannot be reduced to the MxN matrix description, and one must
instead deal with the (very much) larger state space populated by
the nodes (I,J) where each of I and J are bit-vectors of length M
and N, where the successors of given node are (I<<1,
J<<1), (I<<1 + 1, J<<1), and (I<<1 + 1,
J<<1 + 1). Let [ We have observed that in matrix of partial path scores (constructed by the dynamic programming scheme), if we take the gap penalty whenever it might be applicable, then we have an admissible and monotone heuristic for an A* search. Each cell (i,j) of the matrix serves as a heuristic upper bound for all nodes (I,J) where [I] == i and [J] == j. For this to be true it must be that that actual cost from a goal node to a node X, plus the edge cost from X to a predecessor node Y is always less than or equal to the H(Y), the heuristic overestimate for the source node (0,0) to node X. This turns out to be true if we take the gap constant penalty only those edges that close gaps. |