Fully Polynomial-Time Approximation Scheme for a Problem of Finding a Subsequence Alexander Kel’manov1,2 , Sergey Khamidullin1 , and Semyon Romanchenko1,2 1 Sobolev Institute of Mathematics, 4 Koptyug Ave., 630090 Novosibirsk, Russia 2 Novosibirsk State University, 2 Pirogova St., 630090 Novosibirsk, Russia {kelm,kham,rsm}@math.nsc.ru Abstract. We consider a strongly NP-hard Euclidean problem of finding a sub- sequence in a finite sequence under the criterion of the minimum sum of squared distances from the elements of sought subsequence to its geometric center (cen- troid). It is assumed that the sought subsequence contains a given number of elements. In addition, sought subsequence has to satisfy the following condition: the difference between the indexes of each previous and next points is bounded with given lower and upper constants. We present an approximation algorithm for the problem and prove that it is a fully polynomial-time approximation scheme when the space dimension is bounded by a constant. Keywords: Euclidean space, sequence, minimum sum of squared distances, NP-hardness, FPTAS. Introduction In this paper we study one strongly NP-hard problem of searching a subsequence in a finite sequence of points from Euclidean space. Our aim is to provide an approximation algorithm for this problem. This work was motivated by poor study of the problem. The problem is also interest- ing because of its importance for applications, in particular, for mathematical problems of time series analysis, approximation problems and also for applications dealing with data mining problems (see, for example, [1–4] and references therein). The paper is organized as follows. In the next section the formal definition of the problem under study is given; an example of application (origin) of the problem is also presented. In Section 3, we provide a review of the previous results and announce the obtained algorithmic result. Basic definitions and statements that provide necessary elements to prove the properties of the proposed algorithm are presented in Section 4. Finally, in Section 5 we construct an approximation algorithm for solving the problem under study and prove that our algorithm is a fully polynomial-time approximation scheme (FPTAS) when the space dimension is fixed. Copyright ⃝ c by the paper’s authors. Copying permitted for private and academic purposes. In: A. Kononov et al. (eds.): DOOR 2016, Vladivostok, Russia, published at http://ceur-ws.org FPTAS for One Problem of Finding a Subsequence 517 1 Problem formulation and its origin Everywhere below R denotes the set of real numbers, ∥ · ∥ denotes the Euclidean norm in Rq . Problem under consideration has the following formulation (see [5], [6], [7]). Problem 1 (Finding a subsequence in a sequence). Given a sequence Y = (y1 , . . . , yN ) of points from Rq and positive integer numbers Tmin , Tmax and M > 1. Find a subset M = {n1 , . . . , nM } ⊆ N = {1, . . . , N } of indexes of the se- quence Y elements such that ∑ F (M) = ∥yj − y(M)∥2 → min, (1) j∈M ∑ i∈M yi is a geometric center (centroid) of the subsequence {yi ∈ 1 where y(M) = |M| Y | i ∈ M} subject to constraints 1 ≤ Tmin ≤ nm − nm−1 ≤ Tmax ≤ N, m = 2, . . . , M, (2) on the elements of the tuple (n1 , . . . , nM ). Problem 1 has the following interpretation (see [5], [6]). There is a time series con- taining N measurements y1 , . . . , nN of q numerical characteristics of some objects. Each measurement result in the time series has an error, and no correspondence is known between the elements of the time series and the objects. Some of this objects have identical characteristics (or one can say that in the time series there are several mea- surements of one significant object). Other objects are distinguished and have different characteristics (or one can say that in the time series there are some measurements that are treated as ”trash” which could be presented in this time series). The number of measurements for identical objects is known. In addition, it is known that the time interval between every two consequent results of measuring characteristics of the identi- cal objects is bounded from above and below with some constants Tmax and Tmin . The characteristics of identical objects in contrast to the characteristics of other objects have basic information value. It is required to find the subsequence of measures which corresponds to the identical objects using the criterion of minimum sum of squared distances and to estimate the characteristics of these objects (taking into account the measuring errors in the data). 2 Previous and obtained results Problem 1 is among poorly studied discrete optimization problems. A special case of this problem when Tmin = 1 and Tmax = N is equivalent [5] to the strongly NP-hard problem of searching points subset. In this case at the input there is a set of points instead of a sequence and time-related constraints (2) are absent. First, let us recall the results obtained for the searching subset problem because it is a simple particular case of Problem 1. In general, the searching subset problem is strongly NP-hard [8]. But in the case of fixed dimension q of the space this problem could be solved [9] in O(N q+1 ) time. 518 A.V. Kel’manov, S.A. Khamidullin, S.M. Romanchenko Moreover, by this time for the searching subset problem the following algorithms have been presented. In [10] a 2-approximation polynomial algorithm is proposed, it’s time complexity is O(qN 2 ). A polynomial-time approximation scheme (PTAS) is sub- stantiated in [11]. This scheme allows to solve the searching subset problem for arbitrary relative error ε in O(qN 2/ε+1 (9/ε)3/ε ) time. For the case of fixed dimension q of the space and integer-valued points coordinates an exact pseudo-polynomial algorithm is constructed [12]. The running time of this algorithm is O(N (M D)q ), where D is max- imal absolute input point coordinate value. It is established [13], that the searching subset problem has no FPTAS, unless P=NP. In cited paper, an FPTAS was proposed for the case of fixed space dimension. This scheme allows to solve the problem with arbitrary relative error ε in O(N 2 (M/ε)q ) time. By this time for the considered Problem 1 the following results were obtained. First, we should note that as far as the Problem 1 is generalization of strongly NP-hard subset problem, there are neither exact polynomial-time, nor pseudo-polynomial-time algorithms or FPTAS schemes, unless P=NP. The case when Tmin and Tmax are parameters of Problem 1 is analyzed in [5]. In this work the authors showed that this problem is strongly NP-hard for any Tmin < Tmax . In the trivial case when Tmin = Tmax this problem can be solved through a polynomial time. In [6] a 2-approximation polynomial-time algorithm is proposed; the running time of the algorithm is O(N 2 (M N + q)). In the case of Problem 1 with integer components of the sequence elements and fixed dimension q of the space in [7] an exact pseudo- polynomial-time algorithm is substantiated. This algorithm finds an optimal solution of Problem 1 in O(N 3 (M D)q ) time. Among the highest interest is the question of approximability of Problem 1. In particular, the question of FPTAS construction for the special case of Problem 1 (sub- class of problem) has a great importance because in general case such a scheme does not exist. In the present work such scheme is presented for the case of fixed space dimension. The main result of this work is an approximation algorithm which allows to find a (1 + ε)–approximate √ solution for arbitrary relative error ε > 0 in O(N 2 (M (Tmax − 2q q Tmin + 1) + q)( ε + 1) ) time. If the dimension q of the space is fixed then the time complexity of our algorithm is equal to O(M N (1/ε)q/2 ), and it implements an 3 FPTAS. 3 Algorithm foundations In order to substantiate our algorithm we’ll need a few basic assertions and auxiliary problem with exact polynomial-time algorithm for this problem. Lemma 1. For an arbitrary point x ∈ Rq and a finite set Z ⊂ Rq it is true that ∑ ∑ ∥z − x∥2 = ∥z − z∥2 + |Z| · ∥x − z∥2 , z∈Z z∈Z where z is a centroid of set Z. FPTAS for One Problem of Finding a Subsequence 519 Lemma 2. In assumptions of Lemma 1, if some point u ∈ Rq is closest (by distance) to the centroid z of set Z among all points from this set, then ∑ ∑ ∥z − u∥2 ≤ 2 ∥z − z∥2 . z∈Z z∈Z Both lemmas are well-known. Their proofs are presented in many publications (for example, in [12], [13]). Lemma 3. Let ∑ S(M, x) = ∥yn − x∥2 , x ∈ Rq , M ⊆ N , (3) n∈M where yn ∈ Y, and M = {n1 , . . . , nM } satisfies the restrictions (2). Then for any fixed M the constrained minimum of S(M, x) over x is reached at the point x = y(M) and is equal to F (M). This assertion could be easily verified by differentiation of S over x and also follows from Lemma 1. In addition to Lemma 3 for an arbitrary fixed point x ∈ Rq the restriction of the function S(M, x) to M we denote as S x (M) and argument of its minimum we denote as Mx . Lemma 4. Let M∗ be the optimal solution of Problem 1, and y(M∗ ) be the centroid of the set {yi |i ∈ M∗ }. Then for any point x ∈ Rq the following inequality holds F (Mx ) ≤ F (M∗ ) + M ∥x − y(M∗ )∥2 . (4) ∑ Proof. Let y(Mx ) = |M1 x | i∈Mx yi be a centroid of {yi | i ∈ Mx }. Then from defini- tions (1), (3) and Lemma 3 we obtain ∑ ∑ F (Mx ) = ∥yi − y(Mx )∥2 ≤ ∥yi − x∥2 = S x (Mx ). (5) i∈Mx i∈Mx In addition, from the definition of the set Mx we have S x (Mx ) ≤ S x (M∗ ). (6) Further, Lemma 1 applied up to the point x and the set {yi | i ∈ M∗ } implies ∑ ∑ S x (M∗ ) = ∥yi − x∥2 = ∥yi − y(M∗ )∥2 + |M∗ | · ∥x − y(M∗ )∥2 . (7) i∈M∗ i∈M∗ Finally, combining (5)–(7) we obtain F (Mx ) ≤ S x (Mx ) ≤ S x (M∗ ) ∑ = ∥yi − y(M∗ )∥2 + |M∗ | · ∥x − y(M∗ )∥2 i∈M∗ = F (M∗ ) + M ∥x − y(M∗ )∥2 . ⊔ ⊓ 520 A.V. Kel’manov, S.A. Khamidullin, S.M. Romanchenko Lemma 4 shows that quality of feasible solution Mx obtained by some point x ∈ R could be estimated via distance from this point to (unknown) optimal centroid q y(M∗ ). The closer considered point x up to the optimal centroid, the less absolute approximation error of an obtained solution. Lemma 5. Let M∗ be the optimal solution of Problem 1, and let t = arg min ∥y − y(M∗ )∥ y∈{yi | i∈M∗ } be the point of the optimal set {yi |i ∈ M∗ } closest to its centroid, then 1 ∥t − y(M∗ )∥2 ≤ F (Mt ), (8) M where Mt supplies minimum to S t (M) over M ⊆ N under constraints (2) on elements of M. Proof. From the definition of the point t it follows that ∥t − y(M∗ )∥2 ≤ ∥yi − y(M∗ )∥2 for any i ∈ M∗ . Summing up both sides of this inequality for all i ∈ M∗ we obtain ∑ M ∥t − y(M∗ )∥2 ≤ ∥yi − y(M∗ )∥2 . (9) i∈M∗ From the fact that Mt is a feasible solution of Problem 1 and M∗ is the optimal solution we have the following bound F (M∗ ) ≤ F (Mt ). (10) Combining (9) and (10) we obtain final estimation ∑ M ∥t − y(M∗ )∥2 ≤ ∥yi − y(M∗ )∥2 = F (M∗ ) ≤ F (Mt ). i∈M∗ ⊔ ⊓ Lemma 6. Assume that conditions of Lemma 5 are held. Then for Mx to be a (1+ε)– approximate solution of the Problem 1 for fixed ε > 0, it is enough to take such point x that satisfies the inequality ε ∥x − y(M∗ )∥2 ≤ F (Mt ). (11) 4M ∑ i∈Mt yi be a centroid of the set {yi | i ∈ M }. From the 1 Proof. Let y(Mt ) = |M t| t fact that M = arg minM S (M) and definitions (1), (3) we have t t F (Mt ) ≤ S t (Mt ). (12) FPTAS for One Problem of Finding a Subsequence 521 In addition, from the optimality of the set Mt we have inequality S t (Mt ) ≤ S t (M∗ ). (13) Further, since the set {yi | i ∈ M∗ } and the point t satisfy the conditions of Lemma 2, the following estimate holds ∑ ∑ ∥yi − t∥2 ≤ 2 ∥yi − y(M∗ )∥2 i∈M∗ i∈M∗ and, therefore, ∑ ∑ S t (M∗ ) = ∥yi − t∥2 ≤ 2 ∥yi − y(M∗ )∥2 = 2F (M∗ ). (14) i∈M∗ i∈M∗ Combining (11)–(14) we obtain ε ε t ε t ε ∥x − y(M∗ )∥2 ≤ F (Mt ) ≤ S (Mt ) ≤ S (M∗ ) ≤ F (M∗ ). (15) 2M 2M 2M M Finally, applying (15) up to the right side of inequality (4) we obtain inequality F (Mx ) ≤ (1 + ε)F (M∗ ), that completes the proof of Lemma 6. ⊔ ⊓ Computational base of the proposed algorithm is an exact polynomial-time algo- rithm for solving the auxiliary problem. In this auxiliary problem for any point x ∈ Rq one needs to find such a set Mx that supplies minimum to S x (M) while elements from M are under constraints (2). This auxiliary problem can be formulated as follows: Problem 2. Given a sequence Y = (y1 , . . . , yN ) of points from Rq , point x ∈ Rq , positive integer numbers Tmin , Tmax and M > 1. Find a subset M = {n1 , . . . , nM } ⊆ N of indexes of the sequence elements such that ∑ S x (M) = ∥yi − x∥2 → min, i∈M while elements of the tuple (n1 , . . . , nM ) satisfy the constraints (2). A dynamic programming scheme is presented in the next lemma and its corollary. This scheme allows to find the optimal solution Mx of Problem 2. The presented scheme is based on results from [6], [14] and given here for completeness. Lemma 7. For any positive integer M > 1, such that (M − 1)Tmin ≤ N − 1, and for arbitrary point x ∈ Rq the optimum Smin x = minM S x (M) of Problem 2 could be found as x x Smin = min SM (n), (16) n∈ωM 522 A.V. Kel’manov, S.A. Khamidullin, S.M. Romanchenko x where the values of the functions SM (n), n ∈ ωM , are calculated using the following recurrence formulas { ∥yn − x∥2 , if n ∈ ω1 , m = 1; x Sm (n) = ∥yn − x∥2 + max Sm−1 x (j), if n ∈ ωm , m = 2, . . . , M, (17) − j∈γm−1 (n) where { } ωm = n | 1 + (m − 1)Tmin ≤ n ≤ N − (M − m)Tmin , m = 1, . . . , M, − { } γm−1 (n) = j | max{1 + (m − 2)Tmin , n − Tmax } ≤ j ≤ n − Tmin , n ∈ ωm , m = 2, . . . , M. Corollary 1. Elements nx1 , . . . , nxM of the optimal tuple Mx could be found by the formulas: nxM = arg min SM x (n), (18) n∈ωM nxm−1 = arg min − x Sm (n), m = M, M − 1, . . . , 2. (19) n∈γm (nx m) Let us show the algorithm implementing above scheme in a step-by-step description. A l g o r i t h m A1 . Input: sequence Y, point x, numbers Tmin , Tmax and M . Step 1. Compute the values ∥yn − x∥2 for each n ∈ N . Step 2. Using formulas (17), calculate the values Sm x (n) for each n ∈ ωm while m = 1, . . . , M . x Step 3. Find the minimum Smin of the objective function S x using (16) and the optimal tuple M = (n1 , . . . , nM ) by formulas (18), (19). x x x Output: tuple Mx = (nx1 , . . . , nxM ). In [6], it was proved that the algorithm A1 finds an optimal solution of Problem 2 in O(N (M (Tmax − Tmin + 1) + q)) time. In this expression, the value Tmax − Tmin + 1 is at most N . Therefore, the running time of the algorithm is estimated as O(N (M N + q)). 4 Approximation algorithm Proposed approximation algorithm for Problem 1 can be schematically described as follows. For all points from the input sequence we define a special box (cube with cen- ter in this point) such that at least one of these boxes contains unknown centroid of the optimal subsequence. By the given value of the relative error we construct mul- tidimensional grid with uniform step by all coordinates. For all nodes of the grid we solve the auxiliary problem using dynamic programming scheme, and obtain a feasible solution of Problem 1. Finally, among all obtained feasible solutions we select one with minimal value of the objective function of Problem 1. For an arbitrary point z ∈ Rq and positive numbers h and H we define a set of points D(z, h, H) = {d | d = z + h(j1 , . . . , jq ), ji ∈ Z, |h · ji | ≤ H, i = 1, . . . , q}. FPTAS for One Problem of Finding a Subsequence 523 Note that points from this set are placed in the nodes of the uniform multidimensional rectangular grid with size 2H and step h between nodes. The center of this grid is at the point z. For the number of nodes of this grid the following bound holds H |D(z, h, H)| ≤ (2 + 1)q . h Herewith, for any point x ∈ Rq , such that ∥z − √ x∥ ≤ H, the distance up to the closest h q node of the grid D(z, h, H) does not exceed 2 . Lemma 5 (right-hand side of inequality (8)), in fact, defines the size of the cube which guaranteed contains the unknown centroid of the optimal subsequence. So the size of the grid can be defined as follows: √ 1 H(y) = F (My ), y ∈ Y. (20) M Moreover, Lemma 6 defines the condition on the value of the grid spacing, wherein there is a node closed enough to the centroid of the optimal subsequence (in sense of guaranteed error ε). Therefore, define the grid spacing as follows: √ 2ε h(y, ε) = F (My ), y ∈ Y, ε > 0. (21) qM Let us present the algorithm for solving Problem 1. A l g o r i t h m A. Input: sequence Y of points, numbers Tmin , Tmax , M and ε. For all points y ∈ Y execute the steps 1–5. Step 1. Using Algorithm A1 find the optimal solution My of Problem 2 with x = y. Step 2. Compute F (My ), h and H by the formulas (1), (21) (20). Step 3. If F (My ) = 0, then set My is declared as a result of the algorithm; Exit. Otherwise go to the next step. Step 4. Build the grid D(y, h, H). Step 5. For all points d from the grid D(y, h, H) using Algorithm A1 find the optimal solution Md of Problem 2 (with x = d) and compute the value F (Md ). Step 6. In the set of solutions {Md | d ∈ D(y, h, H), y ∈ Y} as a result select the ∗ tuple Md supplying the minimal value for the objective function F (Md ). ∗ Output: tuple Md . Theorem 1. For an arbitrary ε > 0 Algorithm A√finds a (1 + ε)–approximate solution of Problem 1 in O(N 2 (M (Tmax − Tmin + 1) + q)( 2q q ε + 1) ) time. Proof. Let t = arg miny∈{yi | i∈M∗ } ∥y−y(M∗ )∥ be the point from the set {yi | i ∈ M∗ }, that is closest to the centroid of this set. If for this point on step 3 we have equality F (Mt ) = 0, then the set Mt is an optimal solution of Problem 1, because the value of the objective function F is always nonnegative. Now we consider the second case, when F (Mt ) > 0. 524 A.V. Kel’manov, S.A. Khamidullin, S.M. Romanchenko In accordance with Lemma 5, there is an inequality (8) for the point t. This inequal- ity and definition (20) of the H(·) implies ∥t − y(M∗ )∥ ≤ H. In other words, centroid y(M∗ ) of the optimal subsequence is inside area bounded by the grid D(t, h, H). Take the point d∗ = arg min ∥d − y(M∗ )∥ d∈D(t,h,H) distance from y(M∗ ) to the from the grid, that is closest to the optimal centroid. Since √ ∗ h q closest node d from the grid D(t, h, H) does not exceed 2 , we obtain the estimate h2 q ε ∥d∗ − y(M∗ )∥2 ≤ F (Mt ). 4 2M ∗ Thus, the point d∗ satisfies the conditions of Lemma 6. Therefore, the set Md is a (1 + ε)–approximate solution of Problem 1. Note, that either on step 3 optimal solution will be found or the point d∗ and the ∗ set Md will be considered during operation of the algorithm in step 5. Therefore, at the end the algorithm is guaranteed to obtain at least (1 + ε)–approximate solution. Let us now estimate the time complexity of the algorithm. On step 1 to solve auxiliary problem O(N (M (Tmax −Tmin +1)+q)) operations is required. Step 2 requires O(qN ) operations, and step 3 could be performed in O(1) time. To generate the grid D(y, h, H) on step 4 it needs to perform O(q · |D(y, h, H)|) operations. The cost of constructing the sets Md on step 5 and computing the values F (Md ) is equal to (as on step 1) O(N (M (Tmax − Tmin + 1) + q)). As a result, for each of N points from the sequence Y performing steps 1–5 required O(N (M (Tmax − Tmin + 1) + q) · |D(y, h,∑ H)|) operations. Finally, on step 6 to choose the best solution it needs to perform O( y∈Y |D(y, h, H)|) operations. It remains to note that the size of the grid D(y, h, H) is bounded by √ H 2q |D(y, h, H)| ≤ (2 + 1) ≤ ( q + 1)q . h ε √ the total time complexity of the entire algorithm is O(N (M (Tmax − Tmin + 2 Therefore, 1) + q)( 2q q ε + 1) ). ⊔ ⊓ Let us show that if the dimension q of the space is fixed, then presented algorithm implements an FPTAS. Indeed, if ε ∈ (0, 2q], then one of the factors in final complexity can be estimated as follows (√ )q (√ )q ( ) q2 (( ) q ) 2q 2q 3q q 1 1 2 +1 ≤2 q = 2 2 q2 =O . ε ε ε ε Therefore, as the value Tmax − Tmin + 1 does not exceed N , the time complexity of the algorithm is O(M N 3 (1/ε)q/2 ). Thus, the proposed algorithm implements an FPTAS. FPTAS for One Problem of Finding a Subsequence 525 5 Conclusion In this work we have constructed an approximation algorithm for one problem of finding a subsequence in the given sequence of points from the Euclidean space. The proposed algorithm implements a fully polynomial-time approximation scheme in the case when the dimension of the space is fixed (is not a parameter of the input). Acknowledgments. This work was supported by the RFBR, projects 15-01-00462, 16-31-00186 and 16-07-00168. References 1. Fu, Tak-chung: A Review on Time Series Data Mining. Engineering Applications of Arti- ficial Intelligence. 24(1), 164–181 (2011) 2. Kuenzer, C., Dech, S., Wagner, W.: Remote Sensing Time Series. Remote Sensing and Digital Image Processing. Vol. 22. Springer International Publishing Switzerland (2015) 3. T. Warren Liao: Clustering of Time Series Data — a Survey. Pattern Recognition. 38(11), 1857–1874 (2005) 4. Aggarwal C. C. Data Mining: The Textbook. Springer International Publishing (2015) 5. Kel’manov A.V., Pyatkin A.V.: On Complexity of Some Problems of Cluster Analysis of Vector Sequences. J. Appl. Indust. Math. 7(3) 363–369 (2013) 6. Kel’manov A.V., Romanchenko S.M., Khamidullin S.A.: Approximation algorithms for some intractable problems of choosing a vector subsequence. J. Appl. Indust. Math. 6(4) 443–450 (2012) 7. Kel’manov A.V., Romanchenko S.M., Khamidullin S.A.: Exact pseudopolynomial algo- rithms for some NP-hard problems of searching a vectors subsequence. Zhurnal Vychisli- tel’noi Matematiki i Matematicheskoi Fiziki (in Russian). 53(1) 143–153 (2013) 8. Kel’manov A.V., Pyatkin A.V.: NP-Completeness of Some Problems of Choosing a Vector Subset. J. Appl. Indust. Math. 5(3) 352–357 (2011) 9. Aggarwal A., Imai H., Katoh N., Suri S.: Finding k points with minimum diameter and related problems. J. Algorithms. Vol. 12. P. 38–56 (1991) 10. Kel’manov A.V., Romanchenko S.M.: An Approximation Algorithm for Solving a Problem of Search for a Vector Subset. J. Appl. Indust. Math. 6(1), 90–96 (2012) 11. Shenmaier V.V.: An approximation scheme for a problem of search for a vector subset. J. Appl. Indust. Math. 6(3), 381–386 (2012) 12. Kel’manov A.V., Romanchenko S.M.: Pseudopolynomial Algorithms for Certain Compu- tationally Hard Vector Subset and Cluster Analysis Problems. Automation and Remote Control. 73(2), 349–354 (2012) 13. Kel’manov A.V., Romanchenko S.M.: An FPTAS for a Vector Subset Search Problem. J. Appl. Indust. Math. 8(3), 329–336 (2014) 14. Kel’manov A.V., Khamidullin S.A.: Posterior Detection of a Given Number of Identical Subsequences in a Quasi-periodic Sequence. Comput. Math. Math. Phys. 41(5) 762-774 (2001)