=Paper=
{{Paper
|id=Vol-1146/paper8
|storemode=property
|title=On Representations of Ternary Order Relations in Numeric Strings
|pdfUrl=https://ceur-ws.org/Vol-1146/paper8.pdf
|volume=Vol-1146
|dblpUrl=https://dblp.org/rec/conf/icabd/KimANPS14
}}
==On Representations of Ternary Order Relations in Numeric Strings==
On Representations of Ternary Order Relations in Numeric Strings Jinil Kim1,⇤ , Amihood Amir2,3 , Joong Chae Na4 , Kunsoo Park1,⇤ , Jeong Seop Sim5 1 Dep. of Computer Science and Engineering, Seoul National University, Korea 2 Department of Computer Science, Bar-Ilan University, Israel 3 Johns Hopkins University, USA 4 Dep. of Computer Science and Engineering, Sejong University, Korea 5 Dep. of Computer and Information Engineering, Inha University, Korea ⇤ {jikim | kpark}@theory.snu.ac.kr Abstract Order-preserving matching is a string matching problem of two numeric strings where the relative orders of consecutive substrings are matched instead of the characters themselves. The order relation between two characters is a ternary relation (>, <, =) rather than a binary relation (>, <), but it was not sufficiently studied in previous works [5, 7, 1]. In this paper, we extend the representations of order relations by Kim et al. [5] to ternary order relations, and prove the equiva- lence of those representations. The extended prefix representation takes log m + 1 bits per character, while the nearest neighbor representation takes 2 log m bits per character. With our extensions, the time complexities of order-preserving matching in binary order relations can be achieved in ternary order relations as well. 1 Introduction Order-preserving matching is a string matching problem of two numeric strings where the relative orders of substrings are matched instead of the characters them- selves. It has many practical applications such as stock price analysis and musical melody matching. The study on this field was introduced by Kubica et al. [7] and Kim et al. [5] where Kubica et al. [7] defined order relations by order isomor- phism of two strings, while Kim et al. [5] defined them explicitly by the sequence Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. In: Costas S. Iliopoulos, Alessio Langiu (eds.): Proceedings of the 2nd International Conference on Algorithms for Big Data, Palermo, Italy, 7-9 April 2014, published at http://ceur-ws.org/ 46 On Representations of Ternary Order Relations in Numeric Strings of rank values (which they called the natural representation). Recently, variants of order-preserving matching have been studied such as order-preserving suffix trees [2], approximate matching with k mismatches [3], and a Boyer-Moore type algorithm [1]. An order relation between two characters is a ternary relation (>, <, =) rather than a binary relation (>, <), but the representations of order relations were not sufficiently studied for ternary order relations. Kim et al. [5] considered only binary order relations by assuming that all the characters in a string are distinct, while Kubica et al. [7] considered ternary order relations but their match condition on equal characters was faulty and it was fixed later by Cho et al. [1]. As there was no integrated study on the relationship between the representations of order relations, previous works [1, 2, 3] individually handled the cumbersome details of ternary order relations on their own works. The number of order relations on a sequence of n elements is n! in binary order relations, but it is the ordered Bell number [4] in ternary order relations. The Pn 1 ordered Bell number is the solution of the recurrence f (n) = 1 + j=1 nj f (n j), which is exponentially greater than n!. The representations of order relations should be extended for ternary order relations, which might incur some space overhead on the representations. In this paper, we extend the representations of order relations by Kim et al. [5] to ternary order relations, and prove the equivalence of those representations. With the extended prefix representation, order-preserving matching can be done in O(n log m) time, and the representation of order relations takes (log m + 1)-bit per character. With the nearest neighbor representation, the matching can be done in O(n + m log m), but the representation takes (2 log m)-bit per character. The nearest neighbor representation is suitable for the single pattern matching while the extended prefix representation is space-efficient and can be useful in order- preserving multiple pattern matching [5] and order-preserving suffix trees [2]. 2 Problem Formulation Let ⌃ denote the set of numbers such that a comparison of two numbers can be done in constant time, and let ⌃⇤ denote the set of strings over the alphabet ⌃. For a string x 2 ⌃⇤ , let |x| denote the length of x. A string x is described by a sequence of characters (x[1], x[2], ..., x[|x|]). Let a substring x[i..j] be (x[i], x[i + 1], ..., x[j]) and a prefix xi be x[1..i]. For a character c 2 ⌃, let rankx (c) = 1 + |{i : x[i] < c for 1 i |x|}|, and let existx (c) be 1 if c exists in x, and 0 otherwise. Let ex-rankx (c) = (rankx (c), existx (c)). For any boolean condition cond, let (cond) be 1 if cond is true, 0 otherwise. Order Isomorphism [7] For two strings x and y of length n, x and y are order- isomorphic if 8i, j 2 [1..n], x[i] x[j] , y[i] y[j]. Order isomorphism implicitly deals with ternary order relations because each of ternary order relations (>, <, =) can be checked by variants of the proposition in Definition 2 (changing i and j, taking the contrapositive, or both). For example, 47 On Representations of Ternary Order Relations in Numeric Strings if x[i] > x[j], then y[i] > y[j] by the contrapositive of x[i] x[j] ( y[i] y[j]. If x[i] = x[j], then y[i] = y[j] by x[i] x[j] ) y[i] y[j] and x[j] x[i] ) y[j] y[i]. The definition of order isomorphism looks simple, but it is somewhat compli- cated to handle in practice. The number of the order relations involved in checking order isomorphism of two strings of length n is O(n2 ), hence it has an inherent quadratic term if the definition is used directly for order-preserving matching. Natural Representation [5] For a string x of length n, the nat- ural representation of the order relations is defined as N at(x) = (rankx (x[1]), rankx (x[2]), ..., rankx (x[n])). In the natural representation, ternary order relations are explicitly stated in terms of ranks. For example, x[i] > x[j] if and only if N at(x)[i] > N at(x)[j], and x[i] = x[j] if and only if N at(x)[i] = N at(x)[j]. That is, the order relations of two strings coincide if and only if N at(x) = N at(y). The comparison of two natural representations takes O(n) time if the natural representations are given. These two definitions are equivalent because the natural representations of two strings are identical if and only if they are order-isomorphic. We adopt the natural representation throughout this paper because the definition itself and subsequent analysis are more intuitive. Order-Preserving Matching [5] Given a text T [1..n] 2 ⌃⇤ and a pattern P [1..m] 2 ⌃⇤ , P matches T at position i if N at(P ) = N at(T [i m + 1..i]). Order- preserving matching is the problem of finding all positions of T matched with P . Equivalent Representation For any two representations R1 (·) and R2 (·), R1 is equivalent to R2 if R1 (x) = R1 (y) , R2 (x) = R2 (y) for any two strings x, y. For KMP-based algorithms [6], the length of matches is incrementally increased when the next character of the text matches that of the pattern. Such a match operation is formalized by the match condition as follows. Match Condition A match condition of a representation R(·) is a boolean func- tion M atch(x, y, R(x), t + 1) such that N at(xt+1 ) = N at(yt+1 ) holds if and only if N at(xt ) = N at(yt ) and M atch(x, y, R(x), t + 1) where x, y 2 ⌃⇤ , |x| = |y| and t 2 [1..|x| 1]. 3 Prefix Representation Prefix Representation [5] For a string x, the prefix representation of the order relations is defined as P re(x) = (rankx1 (x[1]), rankx2 (x[2]), ..., rankx|x| (x[|x|])). The prefix representation has an ambiguity between di↵erent strings in ternary order relations [1]. For example, when x = (10, 30, 20), and y = (10, 20, 20), the prefix representations of both x and y are (1, 2, 2). The ambiguity is resolved in the extended prefix representation. 48 On Representations of Ternary Order Relations in Numeric Strings i 1 2 3 4 5 6 7 8 x 30 10 50 20 30 20 25 20 N at(x6 ) 4 ✓ ◆ 1 ✓ ◆ 6 ✓ ◆ 2 ✓ ◆ 4 ✓ ◆ 2 ✓ ◆ ✓ ◆ 1 1 3 2 3 2 4 Ex-pre(x7 ) ✓ 0 ◆ ✓ 0 ◆ ✓0 ◆ ✓0 ◆ ✓1◆ ✓1◆ ✓0◆ 1 1 2 2 1 4 6 N N (x7 ) 1 1 1 1 1 4 5 N at(x7 ) 5 ✓ ◆ 1 ✓ ◆ 7 ✓ ◆ ✓ ◆ 2 5 ✓ ◆ 2 ✓ ◆ 4 ✓ ◆ ✓ ◆ 1 1 3 2 3 2 4 2 Ex-pre(x8 ) ✓ 0 ◆ ✓ 0 ◆ ✓0 ◆ ✓0 ◆ ✓1◆ ✓1◆ ✓0◆ ✓1 ◆ 1 1 2 2 1 4 6 6 N N (x8 ) 1 1 1 1 1 4 5 6 N at(x8 ) 6 1 8 2 6 2 5 2 Figure 1: Example of Lemma 3.1 and Lemma 4.1 Extended Prefix Representation For a string x, the extended prefix representation is defined as Ex-pre(x) = (ex-rankx1 (x[1]), ex-rank(x2 [2]), ..., ex-rank(x|x| [|x|])). For any t 2 [1..|x| 1], N at(xt+1 ) can be computed from N at(xt ) and Ex-pre(xt+1 ). Lemma 3.1 (Representation Conversion) Given N at(xt ) and Ex-pre(xt+1 ), ⇢ a + ((a > c) _ (a = c ^ d = 0)) for 1 i t N at(xt+1 )[i] = c for i = t + 1 ✓ ◆ c where a = N at(xt )[i] and d = Ex-pre(xt+1 )[t + 1]. An example of Lemma 3.1 is shown in Figure 1 for x = (30, 10, 50, 20, 30, 20, 25, 20). Let’s consider when ✓ ◆ t + ✓ ◆ 1 = 7. For i = 1, we have N at(x6 )[1] = a = 4 and Ex-pre(x7 )[7] = dc = 40 . Since a = c and d = 0, ✓ ◆ c we have N at(x7 )[1] = a + 1 = 5. For i = 2, d is the same as for i = 1, and we have N at(x6 )[2] = a = 1. Since a < c, N at(x7 )[2] = a = 1. Let’s consider ✓ ◆ the ✓ ◆ next c 2 step when t + 1 = 8. For i = 4, we have N at(x7 )[4] = a = 2 and d = 1 . As a = c and d = 1, N at(x8 )[4] = a = 2. Theorem 3.2 Ex-pre(·) is equivalent to N at(·). Theorem 3.3 (Match Condition) Given x, y and t, the condition Ex-pre(xt+1 )[t + 1] = Ex-pre(yt+1 )[t + 1] is a match condition of Ex-pre(·). 4 Nearest Neighbor Representation Let LM axx [i] = j if x[j] = max{x[k] : x[k] x[i] for 1 k i 1}, or 1 if no such j. Let LM inx [i] = j if x[j] = min{x[k] : x[k] x[i] for 1 k i 1}, or 49 On Representations of Ternary Order Relations in Numeric Strings 1 if no such j. If there are multiple j’s for LM axx [i] or LM inx [i], we choose the rightmost one. In Figure 1, LM axx [8] = 6 since x[6] is the rightmost one among the maximum values which are less than or equal to x[8] in x[1..7]. Similarly, LM inx [8] = 6. Nearest Neighbor Representation [5, 7, 1] For a string x, the nearest neigh- bor representation can be defined as ✓ ◆✓ ◆ ✓ ◆ LM axx [1] LM axx [2] LM axx [|x|] N N (x) = LM inx [1] LM inx [2] ··· LM inx [|x|] For convenience, let x[ 1] = 1, x[1] = 1, N at(x)[ 1] = 0 and N at(x)[1] = |x| + 1 for any string x. Then, N at(x)[LM axx [i]] N at(x)[i] N at(x)[LM inx [i]] holds for any i 2 [1..|x|] even when LM axx [i] = 1 or LM inx [i] = 1. Lemma 4.1 (Representation Conversion) Given N at(xt ) and N N (xt+1 ), ⇢ a + ((a > f ) _ (a = f ^ e 6= f )) for 1 i t N at(xt+1 )[i] = f for i = t + 1 ✓ ◆ c where a = N at(xt )[i], d = N N (xt+1 )[t + 1], e = N at(xt )[c] and f = N at(xt )[d]. An example of Lemma 4.1 is shown in Figure 1 for x = (30, 10, 50, 20, 30, 20, 25, 20). Let us consider ✓ ◆ when ✓ ◆ t + 1 = 7. For i = 1, we have N at(x6 )[1] = a = 4, N N (x7 )[7] = dc = 65 , e = 2 and f = 4. Since a = f ✓ ◆ c and e 6= f , we get N at(x7 )[1] = a + 1 = 5. For i = 2, d , e and f are the same as for i = 1, and we have a = 1. Since a < e, we have N at(x7 )[2] = a = 1. For i = 3, we have a = 6 and a > f , and thus N at(x7 )[3] = a + 1 = 7. For i = 4, we have a = 2. Since a = e and e 6= f , we get N at(x7 )[4] = a = 2. For i = 7, we get N at(x7 )[7] = f = 4 since i = t + 1.✓ Consider ◆ ✓ ◆ the next step when t + 1 = 8. For c 2 i = 4, we have a = 2, N N (x8 )[8] = d = 1 , e = 2 and f = 2. Since a = e = f , we get N at(x8 )[4] = a = 2. Theorem 4.2 N N (·) is equivalent to N at(·). A naive match condition of the nearest neighbor representation is N N (xt+1 )[t + 1] = N N (yt+1 )[t + 1] as that of the extended prefix representation in Theorem 3.3, which requires computing the nearest neighbor representations of both x and y. Kubica et al. [7] proposed an efficient match condition for ternary order relations which can be checked in constant time when the nearest neighbor representation of x is given, but it was faulty. Cho et al. [1] presented a modified match condition in ternary order relations, which can produce an O(n + m log m) algorithm as in binary order relations. 50 On Representations of Ternary Order Relations in Numeric Strings Theorem 4.3 (Match Condition of Nearest Neighbor Representation [1]) Given x, y and t, the condition (y[c] ✓ ◆ < y[t + 1] < y[d]) _ (y[t + 1] = y[c] = y[d]) is c a match condition of N N (·) where d = N N (xt+1 )[t + 1]. We can generalize order-preserving matching algorithms in binary order rela- tions [5] to ternary order relations using the representations above. For single pattern matching, we can obtain an O(n log m) algorithm with the extended prefix representation, and an O(n + m log m) algorithm with the nearest neighbor repre- sentation, both of which are consistent with the results in [5, 7, 1]. For multiple pattern matching, we can obtain an O((n + m) log m) algorithm in ternary order relations using the extended prefix representation. Acknowledgements The work of Jinil Kim and Kunsoo Park was supported by Next-Generation Informa- tion Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0029924). Amihood Amir was partly supported by NSF grant CCR-09-04581, ISF grant 347/09, and BSF grant 2008217. Joong Chae Na was supported by the IT R&D program of MKE/KEIT [10038768, The Development of Supercomputing System for the Genome Analysis], and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2011-0007860). Jeong Seop Sim was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2012R1A2A2A01014892), and by the Industrial Strategic technology development program (10041971, Development of Power Efficient High-Performance Multime- dia Contents Service Technology using Context-Adapting Distributed Transcoding) funded by the Ministry of Knowledge Economy (MKE, Korea). References [1] S. Cho, J. C. Na, K. Park, and J. S. Sim. Fast order-preserving pattern match- ing. In COCOA, 2013. [2] M. Crochemore, C. S. Iliopoulos, T. Kociumaka, M. Kubica, A. Langiu, S. P. Pissis, J. Radoszewski, W. Rytter, and T. Walen. Order-preserving incomplete suffix trees and order-preserving indexes. In SPIRE, pages 84–95, 2013. [3] P. Gawrychowski and P. Uznanski. Order-preserving pattern matching with k mismatches. CoRR, abs/1309.6453, 2013. [4] O. A. Gross. Preferential arrangements. The American Mathematical Monthly, 69:4–8, 1962. [5] J. Kim, P. Eades, R. Fleischer, S.-H. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, and T. Tokuyama. Order-preserving matching. To appear in Theor. Comput. Sci. 51 On Representations of Ternary Order Relations in Numeric Strings [6] D. E. Knuth, J. H. M. Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. [7] M. Kubica, T. Kulczynski, J. Radoszewski, W. Rytter, and T. Walen. A linear time algorithm for consecutive permutation pattern matching. Inf. Process. Lett., 113(12):430–433, 2013. 52