-

Hardness of MSA with Selective Column Scoring?

0 Department of Computer Science, University of Verona , Italy

Multiple Sequence Alignment (MSA for short) is a well known problem in the eld of computational biology. In order to evaluate the quality of a solution, many di erent scoring functions have been introduced, the most widely used being the Sum-of-pairs score (SP-score). It is known that computing the best MSA under the SP-score measure is NP-hard. In this paper, we introduce a variant of the Column score (de ned in Thompson et al. 1999), which we refer to as Selective Column score: Given a symbol a 2 , the score of the i-th column is one if and only if all symbols of the same column are a, and otherwise zero. The acolumn score of an alignment is then the number of columns made of only character a. We show that nding the optimal MSA under the Selective Column Score is NP-hard for all alphabets of size j j 2 by reducing from MIN-2-SAT.

Multiple Sequence Alignment Column score NP-completeness

Alignments are fundamental methods for the comparison of biological sequences. They are extensively employed for assessing similarity among sequences as well as in subprocedures of more complex operations, like the computation of approximate overlaps in the context of sequencing.

De nition 1 (Multiple alignment). Let s1; : : : ; sm; be m strings over some alphabet . Let - 62 be a gap symbol and 0 = [ f-g: A multiple alignment S~ of s1; : : : ; sm of length ` is a tuple of strings s~1; : : : ; s~m 2 0 such that ? Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) 1 For a string s and integer i, we denote by s[i] the ith character of s.

It is useful to visualize the alignment as a matrix, where rows are the strings s~i: Therefore, for each j = 1; : : : ; ` we refer to the sequence s~1[j]; : : : ; s~m[j] as the jth column of the alingment.

Many proposals exist regarding the choice of the most appropriate objective function that should be optimized when computing an alignment of a set of strings [1, 11, 12].

In this paper, we focus on the so called column score analyzed in [12], where the alignment is assessed by counting the number of columns consisting of a single character.

De nition 2 (Column Score). Let s1; : : : ; sm be a sequence of strings from some alphabet : Given an alignment S~ = s~1; : : : ; s~m of length ` of s1; : : : ; sm; and a character a 2 ; the a-column score of S~; denoted csa(S~) is the number of columns made of only character a; i.e.,

csa(S~) = jfj 2 f1; : : : ; `g j 8i = 1; : : : ; m; s~i[j] = agj:

The column score of S~; denoted cs(S~) is the number of columns made of a single character a; i.e., cs(S~) = j [a2 fj 2 f1; : : : ; `g j 8i = 1; : : : ; m; s~i[j] = agj = X csa(S~): a2 In particular we prove the NP-completeness of the following problem. Selective Column Score Alignment (SCS-Align) Input: Strings s1; : : : ; sm; over some alphabet ; an integer M maxi jsij; a non-negative integer ; and a selected character a 2 : Question: Is there an aligment S~ = s~1; : : : ; s~m of s~1; : : : ; s~m such that the length of S~ is at most M and the a-column score of S~ is at least ?

In contrast with other objective functions for multiple sequence alignment, to the best of our knowledge, before our result, nothing was known about the tractability of computing the best possible alignment under the (selective) acolumn score.

Main Result. In this paper, we show that the Selective Column Score Alignment problem (and several of its variants) is NP-complete.

Related work. It is not hard to see that without the upper bound M on the alignment length, computing an alignment of maximum possible column score is equivalent to nding the longest common subsequence of the input strings, which is known to be an NP-hard problem [9]. However, in contrast to our SCSAlign, the Longest Common Subsequence problem is easily solvable in polynomial time when the output is restricted to subsequences made of a single given character|note that a a-column score alignment asks for a subsequence of this type. Another important issue/di erence with respect to modelling column score alignment as an instance of LCS is that an alignment obtained starting from a longest common subsequence might have an uncontrollable number of gaps per row. We show that the hardness result on SCS holds for an equivalent variant of the problem where we impose a bound of 1 on the number of gaps per row. In this respect, the length constraint on the solutions of SCS-Align have some similarity with the gapped common subsequences considered in [4, 5]. In these papers, however, the authors consider only the special case of two input strings and do not discuss complexity issues of the general instances.

The most commonly considered objective function for optimizing multiple sequence alignments is the so called Sum of Pairs score (SP-score) introduced by Carrillo and Lipman [3]. The SP-score generalizes the edit distance based score for pairwise alignments (equivalently multiple alignments of only two strings). An optimal pairwise alignment can be computed in quadratic time by the Needleman and Wunsh dynamic programming approach [10]. In contrast, computing an optimal SP-score based multiple sequence alignment is known to be NP-hard [13].

Alternative scoring measures, based on the distance to a consensus string, are investigated in Li et al. [8]. In its simplest form, given an alignment a consensus string can be obtained by selecting for each column a majority symbol. The score of the alignment is the sum of the Hamming distance between each input string and the consensus sequence. The authors of [8] investigate several variants of this approach showing that they are in general NP-hard. They also consider a sort of inverse procedure, where a median string of the input string is computed that minimizes the sum of gap-bounded edit distance of each string from the median. Once this string has been found the alignment can be constructed by optimally aligning each input string to the median.

Both the above approaches (SP-score and distance from consensus) are based on pairwise comparisons of strings. More precisely, in the SP-score case each possible pairwise comparison of the input strings is computed and the corresponding scores are added up to obtain the nal score. In the case of distance to consensus, a median string is computed and the alignment score is de ned as the sum of pairwise alignment scores of the input strings to the median. The column score is a basic model of scoring functions that are obtained by evaluating the set of elements in a column, rather than composing evaluations of single pairs of elements in a column.

Organization of the paper. In section 2, we prove the NP-completeness of our basic version of the column score, Selective Column Score Alignment, by reducing from MIN-2-SAT. In section 3, we show how the reduction can be extended to variants where there is no selected character, and the bound on the length of the alignment is substituted with a very sharp bound on the number of gaps per row of the alignment. Section 4 concludes the paper with some open questions. The NP-completeness of Selective Column Score Alignment In this section, we establish the complexity of the SCS-Align problem de ned in the introduction. We provide a reduction from the NP-complete problem MIN-2-SAT [7].

Min-2-Sat Input: A 2-CNF formula and an integer k Question: Is there an assignment to the variables of k clauses? that satis es at most The Reduction. Let I = ( ; k) be an instance of MIN-2-Sat. Let x1; : : : ; xn be the variables of and C1; : : : ; Cm be the clauses of : Then, = C1^C2^ ^Cm; where for each i = 1; : : : ; m; Ci = (l1i _ l2i), with lji 2 fx1; : : : xn; x1; : : : ; xng.

From I; we now build an instance J = (fs0; : : : ; sn+1g; M; ; a) for SCSAlign, where each si is a binary string, M is a bound on the alignment length, a is a character in the alphabet = f0; 1g; and is the bound on the number of columns of the alignments that are expected to contain only the selected character a:

We set M = 8mn+3m+2; a = 0 and build n+2 binary strings: s0; s1; : : : ; sn; sn+1 where js0j = jsn+1j = M and jsij = M 1 for each i = 1; : : : ; n: More precisely we de ne:

s0 = 1(01)2mn(101)m( 10 )2mn1; and for each i = 1; : : : ; n; where, for each j = 1; : : : ; m;

si = 0(00)(i 1)2m( 10 )2m(00)(n i)2mci( 1 ) : : : ci(m)(00)(i 1)2m( 10 )2m(00)(n i)2m; Finally, we de ne and set 8100 if xi 2 Cj > c(j) = < i

010 if xi 2 Cj >:000 else sn+1 = 0M ; = 2mn + m k: ( 1 ) ( 2 ) ( 3 ) ( 4 ) Example 1: Let = (x1 _ x2) ^ (x2 _ x3) be a 2-CNF formula with m = 2 and n = 3. The strings s0; : : : ; s4 obtained by the above process are then: with s1; s2; s3 respectively representing the variables x1; x2; x3.

Remark 1. Let S~ = s~0; s~1; : : : ; s~n+1 be an alignment of length ` M of the strings s0; : : : ; sn+1: The following three facts are easy implied by the structure of the instance produced by our reduction. 1. Since js0j = jsn+1j = M we have that it must be ` = M; hence s~0 = s0 and s~n+1 = sn+1: Moreover, for each i = 1; : : : ; n; because of jsij = M 1 it follows that s~i contains exactly one gap. We will use gap(i) to denote the position of the single gap in s~i; i.e., the index j = 1; : : : ; M; such that s~i[j] = -: 2. Following from Fact 1. and since s~n+1 is made of only zeroes there exist no columns with only 1, and the only columns that can contain only 0 are those corresponding to the positions of the zeroes in s0: In other words, for the instances produced by our reduction, we have cs0(S~) = cs(S~): 3. For each i = 1; : : : ; n and j = 1; : : : ; M; we say that s~i blocks position j if s~0[j] = s0[j] = 0 and s~i[j] = 1: A column j is made of only zeroes if and only if there is no i = 1; : : : ; n such that s~i blocks position j:

For each i = 1; : : : ; n; let siT = si- be the string obtained by adding a gap at the end of si. Let siF = -si be the string obtained by adding a gap at the beginning of si: Then, we have siT = 0(00)(i 1)2m( 10 )2m(00)(n i)2mci(;1T) : : : ci(;mT)(00)(i 1)2m( 10 )2m(00)(n i)2m-; siF = -(00)(i 1)2m(01)2m(00)(n i)2mci(;1F) : : : ci(;mF)(00)(i 1)2m(01)2m(00)(n i)2m0; where, for each j = 1; : : : ; m; ci(;jT) is equal to ci(j) (de ned in ( 3 )) and ci(;jF) = 8010 if xi 2 Cj > <

001 if xi 2 Cj >:000 else

Let us de ne the head of s0 to be the pre x s0h = s0[1] : : : s0[4mn + 1] = 1(01)2mn:

Let us de ne the core of s0 to be the substring s0p = s0[4mn + 2] : : : s0[4mn + 3m + 1] = (101)m:

Finally, let us de ne the tail of s0 to be the su x st0 = s0[4mn + 3m + 2] : : : s0[8mn + 3m + 2] = ( 10 )2mn1:

By a direct comparison with the de nition of s0 in ( 1 ), we have the following proposition, stating useful properties of strings siF ; siT : Proposition 1. For each i = 1; : : : ; n; the string siT blocks { exactly 2m positions of the head of s0; { no position of the tail of s0; { the position of the jth zero in the core of s0 if and only if xi 2 Cj ; i.e., if the assignment xi = T rue satis es clause Cj :

Analogously, we have that the string siF blocks

{ exactly 2m positions of the tail of s0; { no position of the head of s0; { the position of the jth zero in the core of s0 if and only if xi 2 Cj ; i.e., if the assignment xi = F alse satis es clause Cj : Moreover, for each i 6= i0 the strings siT ; siF ; siT0 ; siF0 block distinct positions in the head and tail of s0.

The following lemma proves that our reductions maps yes instances of MIN2-Sat to yes-instances of SCS-Align.

Lemma 1. If there exists an assignment A for that satis es at most k clauses then there exists an alignment S~ = s~0; : : : ; s~n+1 of s0; : : : ; sn+1 such that the length of S~ is M and there are at least = 2mn + m k columns made only of zeroes.

Proof. Given an assignment A that satis es at most k clauses of ; we de ne the alignment S~ as follows: for each i = 0; : : : ; n + 1 we let s~i = 8s >< iT

si >:siF if i = 0 or i = n + 1 if A sets xi = T rue if A sets xi = F alse ( 5 )

By the de nition of siF and siT and Proposition 1, we have that strings s~1; : : : ; s~n block exactly 2mn of the positions in the head and the tail of s0: Moreover, the position of the jth zero in the core of s0 is blocked by some string s~i if and only if the truth value of xi in A satis es Cj : Since A satis es at most k clauses, we have that at most k positions of the core of s0 are blocked by some s~i: It total there are 4mn + m positions in s0 corresponding to a zero character and at most 2mn + k are blocked by the strings of the alignment. Hence, there are at least 2mn + m k positions which are not blocked, equivalently 2mn + m k columns made of only zeros. Example 2: Using the same CNF formula as Example 1 with the assignment A = fx1 = T rue; x2 = F alse; x3 = F alseg, by Lemma 1 we get this alignment: s~0 = s0 : 1(01)4(01)4(01)4101101( 10 )4( 10 )4( 10 )41 s~1 = s1T : 0( 10 )4(00)4(00)4100000( 10 )4(00)4(00)4s~2 = s2F : -(00)4(01)4(00)4010001(00)4(01)4(00)40 s~3 = s3F : -(00)4(00)4(01)4000001(00)4(00)4(01)40 s~4 = s4 : 0(00)4(00)4(00)4000000(00)4(00)4(00)40 We can see that the strings s1T ; s2F ; s3F block a total of 13 columns, and knowing that the assignment A satis es only one clause we have k = 1, therefore it holds 13 2mn + k = 13.

Lemma 2. Let be the maximum column score achievable by an alignment of length M of the strings s0; : : : ; sn+1: Then, there exists an alignment S~ of s0; : : : ; sn+1; such that

1. the length of S~ is M;

2. cs(S~) = ; 3. for each i = 1; : : : ; n; we have gap(i) 2 f1; M g; where gap(i) denotes the position of the gap in s~i; i.e., the index j 2 f1; : : : ; M g such that s~i[j] = -: Proof. Let S~ be an alignment that, among those satisfying conditions 1. and 2., has the minimum number of strings s~i violating condition 3. And assume, by contradiction, that such number is greater than 0.

Then there exists i 2 f1; 2; : : : ; ng such that the position of the gap in s~i is not in f1; M g: We are going to show that we can modify s~i moving the gap in position 1 or M and obtain another alignment satisfying condition 1. and 2. The existence of such an alignment contradicts the minimality of S~ with respect to condition 3, implying that in fact in S~ for all i = 1; : : : ; n we must have gap(i) 2 f1; M g as desired.

Let ni be the number of positions that s~i blocks altogether in the tail and the head of s0: We have ni = >82m > >>>2m > <

4m >>>2m + d 2 e

x > > >:2m + d 4m 2x+1 e if gap(i) < 4m(i 1) + 2; if gap(i) 4mi + 2 + 4mn + 3m; if 4mi + 2 gap(i) (i 1)4m + 1 + 4mn + 3m; if gap(i) = (i 1)4m + 1 + x; 1 x 4m; if gap(i) = 4m(i 1) + 1 + 4mn + 3m + x; 1 x 4m:

Note that these numbers refer to positions that can be blocked only by s~i: Hence, unblocking any of such positions implies a net increase in the column score of the resulting alignment.

We apply the following three cases: Case 1. gap(i) 4mi + 1: In this case s~i blocks 2m positions in the tail of s0 and possibly some positions in the head of s~0: If we modify s~i by moving the gap to the rst position (equivalently, we set s~i = siF ) we reduce the number of positions blocked in the tail and head of s~0 to 2m: Moreover, the characters of the new s~i which are aligned to the core of s~0 do not change. Then, the new alignment has not decreased the number of columns made of only zeros, i.e., it satis es conditions 1 and 2 and has one more string satisfying condition 3. Case 2. gap(i) 4mn + 3m + 4m(i 1) + 2: In this case s~i blocks 2m positions in the head of s0 and possibly some positions in the tail of s~0: If we modify s~i by moving the gap to the last position (equivalently, we set s~i = siT ) we reduce the number of positions blocked altogether in the tail and head of s~0 to only 2m: Moreover, the characters of the new s~i which are aligned to the core of s~0 do not change. Then, the new alignment has not decreased the number columns made of only zeros, i.e., it satis es conditions 1 and 2 and has one more string satisfying condition 3.

Case 3. 4mi + 2 gap(i) 4m(i 1) + 1 + 4mn + 3m: In this case, s~i blocks 2m positions in the head of s0 and 2m positions in the tail of s0. It is possible that the position of the gaps allows to align the central zeros of s~i (those belonging to the substrings ci(j)) to the zeroes of the core of s0: If we modify s~i by moving the gap to the rst position (equivalently, we set s~i = siF ) we reduce the number of positions blocked altogether in the tail and head of s~0 to 2m: Even if in the resulting new alignment the new s~i blocked all the positions in the core of s0; which are m, the net gain in the number of columns made of only zeroes would be m: Then, the new alignment has strictly increased the number of columns made of only zeros, i.e., it satis es conditions 1 and 2 and has one more string satisfying condition 3.

We can see that the strings s1T ; s2F ; s3F block a total of 13 columns, and knowing that the assignment A satis es only one clause we have k = 1, therefore it holds 13 2mn + k = 13.

Lemma 3. If there exists an alignment S~ of s0; : : : ; sn+1 that has length M and such that cs(S~) ; then there exists an assignment A for that satis es at most k clauses.

Proof. By Lemma 2, we can assume w.l.o.g., that in S~ for each i = 1; : : : ; n, we have gap(i) 2 f1; M g: Equivalently, for each i = 1; : : : ; n; we have that s~i 2 fsiT ; siF g:

We now create the assignment A by setting for each i = 1; : : : ; n; xi = (T rue

if s~i = siT F alse if s~i = siF ( 6 )

By the de nition of siF and siT and Proposition 1, we have that strings s~1; : : : ; s~n block exactly 2mn of the positions in the head and the tail of s0: Hence, exactly 2mn columns corresponding to the non-blocked positions in the head and tail of s0 are made of only zeroes. Since cs(S~) = 2mn + m k; we have that at least m k columns corresponding to positions in the core of s0 must be made of only zeroes. Equivalently, at most k positions in the core of s0 are blocked by the strings s~1; : : : ; s~n:

By Proposition 1, we have that the position of the jth zero in the core of s0 is blocked by some string s~i if and only if the truth value of xi in A satis es Cj : Since at most k positions of the core of s0 are blocked by some s~i; it follows that A satis es at most k clauses.

As a result of Lemmas 1 and 3, we have the following.

Theorem 1. The problem Selective Column Alignment is NP-complete already over an alphabet of size 2: 3

Variants and Extensions In this section we extend the NP-completeness of the SCS-Align to show the NP-completeness of other variants or reformulations of the problem that are also of interest. We start by the simple observation that the problem remains NP-complete even if we consider as objective function the column score, rather than the a-column score with respect to a speci c character a.

Bounded Column Score Alignment (BCS-Align) Input: Strings s1; : : : ; sm over some alphabet ; an integer M maxi jsij; and a non-negative integer : Question: Is there an aligment S~ = s~1; : : : ; s~m of s1; : : : ; sm such that the length of S~ is at most M and the column score of S~ is at least ?

In fact, due to the presence of the string sn+1 being made of only zeroes as observed in point 2 of Remark 1, the proof in the previous section also shows the result in the following corollary. It is enough to observe that sn+1 forces only columns entirely made of 0 to be accepted.

Corollary 1. The Bounded Column Score Alignment problem is NPcomplete

We can also reformulate the problem by interpreting the bound on the length of the alignment as a bound on the number of gaps. In fact, the following formulation is also easily shown to be NP-complete gap-Bounded Column Score Alignment (gap-CS-Align) Input: Strings s1; : : : ; sm over some alphabet ; an integer bound g on the number of allowed gaps per string, and a non-negative integer bound on the number of uniform columns to be found.

Question: Is there an aligment S~ = s~1; : : : ; s~m of s~1; : : : ; s~m such that for each i 2 [m]; s~i contains at most g gaps, and the column score of S~ is at least ? Corollary 2. The gap-Bounded Column Score Alignment problem is NPcomplete Proof. We use exactly the same reduction employed for the NP-completeness of SCS-Align. We show that the particular case of gap-CS-Align with g = 1 is NP-complete. The simple observation needed is that in any alignment S~ = s~0; : : : ; s~n+1 of the strings s0; : : : ; sn+1 no gap can be present in s~0; (nor in s~n+1); otherwise, at least 2 gaps should be present in each s~i (i = 1; : : : ; n). Hence, we have that Remark 1 (and in particular point 1.) holds also on the basis of the bound on the number of gaps. 4

Final Remarks We proved that computing the best multiple sequence alignment with respect to the column score is NP-hard even in a very restricted model where: strings are binary, we only try to optimize the number of columns of the alignment made of only zeroes, and we bound the number of gaps per row to 1.

It would be interesting to better understand the relationships of this problem to parameterized variants of the Longest Common Supersequence (LCS) problem [4, 5, 2, 6], since LCS is the natural model of column score based multiple sequence alignment in the absence of any bounds on the number of gaps or restrictions on the uniform columns. Another natural open problem regards the limits of approximability of our problems.

1. Berkemer , S.J. , Honer zu Siederdissen, C. , Stadler , P.F. : Compositional properties of alignments . Mathematics in Computer Science ( 2020 )

2. Blin , G. , Bulteau , L. , Jiang , M. , Tejada , P.J. , Vialette , S. : Hardness of longest common subsequence for sequences with bounded run-lengths . In: Karkkainen, J., Stoye , J . (eds.) Combinatorial Pattern Matching. pp. 138 { 148 . Springer Berlin Heidelberg, Berlin, Heidelberg ( 2012 )

3. Carrillo , H. , Lipman , D. : The multiple sequence alignment problem in biology . SIAM Journal on Applied Mathematics 48 ( 5 ), 1073 { 1082 ( 1988 )

4. Iliopoulos , C.S. , Kubica , M. , Rahman , M.S. , Walen , T. : Algorithms for computing the longest parameterized common subsequence . In: Ma, B. , Zhang , K. (eds.) Combinatorial Pattern Matching. pp. 265 { 273 . Springer Berlin Heidelberg, Berlin, Heidelberg ( 2007 )

5. Iliopoulos , C.S. , Sohel

Rahman

, M. : Algorithms for computing variants of the longest common subsequence problem . Theoretical Computer Science 395 ( 2 ), 255 { 267 ( 2008 )

6. Keller, O., Kopelowitz , T. , Lewenstein , M. : On the longest common parameterized subsequence . Theoretical Computer Science 410 ( 51 ), 5347 { 5353 ( 2009 ), combinatorial Pattern Matching

7. Kohli , R. , Krishnamurti , R. , Mirchandani , P.: The minimum satis ability problem . SIAM Journal on Discrete Mathematics 7 ( 2 ), 275 { 283 ( 1994 )

8. Li , M. , Ma , B. , Wang , L. : Finding similar regions in many sequences . Journal of Computer and System Sciences 65 ( 1 ), 73 { 96 ( 2002 )

9. Maier , D. : The complexity of some problems on subsequences and supersequences . J. ACM 25 ( 2 ), 322 ? 336 ( 1978 )

10. Needleman , S.B. , Wunsch , C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins . Journal of Molecular Biology 48 ( 3 ), 443 { 453 ( 1970 )

11. Notredame , C. : Recent progress in multiple sequence alignment: a survey . Pharmacogenomics 3 ( 1 ), 131 {144 ( January 2002 )

12. Thompson , J.D. , Plewniak , F. , Poch , O.: A comprehensive comparison of multiple sequence alignment programs . Nucleic Acids Research 27 ( 13 ), 2682 { 2690 (07 1999 )

13. Wang , L. , Jiang , T. : On the complexity of multiple sequence alignment . Journal of Computational Biology 1 ( 4 ), 337 { 348 ( 1994 )