On String Attractors

Nicola Prezza
Department of Computer Science, University of Pisa, Italy
nicola.prezza@di.unipi.it

Abstract. String attractors are combinatorial objects recently introduced as a universal framework for studying dictionary compressors. In this brief communication we summarize possible future developments and open problems related to the subject.

Keywords: String attractors · compressed computation

1 Introduction

Let S be a string of length n. Informally, a string attractor [5] for S is a set Γ ⊆ [1..n] with the following property: any substring of S has at least one occurrence in S crossing at least one position in Γ. The following definition formalizes a generalized version of this concept.

Definition 1 (String k-attractor [5]). A k-attractor of a string S ∈ Σ^n is a set of positions Γ ⊆ [1..n] such that every substring S[i..j] with i ≤ j < i + k has at least one occurrence S[i′..j′] = S[i..j] with j′′ ∈ [i′..j′] for some j′′ ∈ Γ. When k = n, we simply call Γ an attractor of S.

String attractors were originally introduced as a unifying framework for known dictionary compressors: Straight-Line Programs [7] (context-free grammars generating the string), Collage Systems [6], Macro schemes [10] (sets of substring equations having the string as unique solution; this includes Lempel-Ziv 77 [8]), the run-length Burrows-Wheeler transform [2] (a string permutation whose number of equal-letter runs decreases as the string's repetitiveness increases), and the compact directed acyclic word graph [1, 3] (the minimization of the suffix tree). As shown in [5], any of the above compressed representations induces a string attractor of the same asymptotic size. The converse also holds for a subset of these compressors: given a string attractor of size γ, one can build a compressed representation of size O(γ log(n/γ)).
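To make Definition 1 concrete, the following sketch (not from the paper; it uses 0-based positions, whereas the paper indexes from 1) verifies the k-attractor property by brute force. It is far from the optimal linear-time verification of [4] and is meant only as an executable restatement of the definition.

```python
def is_k_attractor(s, gamma, k):
    """Naive check of Definition 1: every substring of s of length <= k
    must have at least one occurrence crossing a position in gamma.
    Positions in gamma are 0-based (the paper uses 1-based)."""
    n = len(s)
    gamma = set(gamma)
    for length in range(1, k + 1):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            # Search for an occurrence of sub crossing an attractor position.
            if not any(s[j:j + length] == sub and gamma & set(range(j, j + length))
                       for j in range(n - length + 1)):
                return False
    return True
```

For example, {1, 2} is an attractor of "abab" (every substring has an occurrence crossing position 1 or 2), while {0} is not, since no occurrence of "b" crosses position 0.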
These reductions imply the following interesting facts: (1) we can asymptotically relate the output sizes of different compressors, and (2) we can design universal compressed data structures (i.e., working on top of any of the above compressors). With respect to point (2), it can be shown that one can perform optimal-time random access within O(γ polylog n) space [5]. Similarly, fast text indexing can be achieved within the same space [9].

Computing a minimum k-attractor is NP-complete for general alphabets and k ≥ 3, and APX-complete for constant k ≥ 3 [5]. If the alphabet and k are small, however, the problem admits polynomial-time solutions [4]. Verification and approximation, on the other hand, are much easier problems: we can decide whether Γ is a proper k-attractor in optimal O(n) time and space [4] and, with a simple reduction to set cover, we can compute an O(log k) approximation of the smallest k-attractor in polynomial time [5].

In the following sections we list some interesting open problems and promising future developments related to these new combinatorial objects. We divide the exposition into four sections: complexity and approximability problems, combinatorial problems (enumeration of attractors and relations with the string's complexity), generalizations (to trees, graphs, and grids), and algorithmic problems (universal compressed data structures and fast algorithms on attractors).

2 Complexity and Approximability

The first open problem that we discuss is the complexity of 2-attractor:

Problem 1. Is the (decision version of) 2-attractor solvable in polynomial time?

The reason why Problem 1 is so interesting is that k-attractor is NP-complete for k ≥ 3, and one can easily verify that 1-attractor can be solved in linear time with a naive algorithm. The only remaining case, k = 2, is still open.
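The naive linear-time algorithm for 1-attractor mentioned above can be sketched as follows (my own formulation, 0-based positions): for k = 1 the substrings to be covered are exactly the distinct characters of the string, so taking one arbitrary occurrence per character is sufficient, and one position per distinct character is clearly necessary.

```python
def minimum_1_attractor(s):
    """A minimum 1-attractor in O(n) time: one arbitrary occurrence
    (here, the first) of each distinct character of s.
    Positions are 0-based (the paper uses 1-based)."""
    first_occ = {}
    for i, c in enumerate(s):
        if c not in first_occ:
            first_occ[c] = i
    return set(first_occ.values())
```

The returned set always has size equal to the number of distinct characters, which is exactly the size of a smallest 1-attractor.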
Interestingly, if we restrict our attention to substrings of length exactly 2 (rather than length at most 2), then the corresponding problem, dubbed sharp-2-attractor, can be solved in polynomial time [4].

A related problem is that of improving the dependency on the alphabet size in the original NP-completeness proof of [5] (which used polynomial alphabets):

Problem 2. Is k-attractor still NP-complete for alphabets of size σ = n^{o(1)}?

Note that k-attractor can be solved in polynomial time when σ^k log σ^k ∈ O(log n) [4], so the NP-completeness proof must place some lower-bound restriction on σ (in particular, note that 3-attractor is not NP-complete in the binary case).

The next problems are concerned with (in-)approximability. To start with, all polynomial-time approximation algorithms we know of achieve ratio O(log k) (including reductions from dictionary compressors).

Problem 3. Can we compute an o(log k)-approximation of the smallest k-attractor in polynomial time?

The n-attractor problem is deeply related to set cover [5], for which it is known that no o(log n)-approximation can be computed in polynomial time. A natural question is therefore whether this inapproximability result transfers to the minimum attractor problem. Note that the linear-time algorithm Lempel-Ziv 77 achieves an O(log n) approximation; such an inapproximability result would therefore imply that LZ77 is optimal among polynomial-time dictionary compressors. The best inapproximability result we know so far is that k-attractor cannot be approximated within a factor smaller than 11809/11808 in polynomial time, so we get the additional open problem:

Problem 4. Can we approximate k-attractor within factor 11809/11808 + ε in polynomial time, for any ε ≥ 0?

3 Combinatorics

String attractors raise (quite naturally) several questions of a combinatorial nature. One of the most intriguing open problems in this case is that of representing the string in attractor-space.
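As a side note on the approximation questions just discussed, the O(log k) ratio is obtained by a reduction to set cover. A naive executable sketch of the resulting greedy algorithm (my own, 0-based positions, in the slow cubic-time flavor mentioned again in Section 5) repeatedly picks the position crossed by occurrences of the largest number of still-uncovered distinct substrings:

```python
from collections import defaultdict

def greedy_k_attractor(s, k):
    """Greedy set-cover approximation of a smallest k-attractor:
    universe = distinct substrings of length <= k; position p covers
    substring t if some occurrence of t crosses p. Naive and slow,
    for illustration only. Positions are 0-based."""
    n = len(s)
    covers = defaultdict(set)  # covers[p] = substrings covered by position p
    universe = set()
    for length in range(1, k + 1):
        for j in range(n - length + 1):
            t = s[j:j + length]
            universe.add(t)
            for p in range(j, j + length):
                covers[p].add(t)
    gamma, uncovered = set(), set(universe)
    while uncovered:
        # Pick the position covering the most uncovered substrings.
        best = max(range(n), key=lambda p: len(covers[p] & uncovered))
        gamma.add(best)
        uncovered -= covers[best]
    return gamma
```

Note the tie-breaking is arbitrary (Python's max returns the first maximal position), so the output is one of possibly many greedy solutions.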
In what follows, let γ denote the size of a smallest attractor for the string. All random access data structures provided in [5], as well as the reductions to compressed formats, achieve O(γ log(n/γ)) space (in words of Θ(log n) bits). We are left with the question:

Problem 5. Can we encode a string within o(γ log(n/γ) log n) bits of space? Is this space below the Kolmogorov complexity for some infinite family of strings?

A related question, important for understanding the power of these objects (as well as for designing smaller data structures), is:

Problem 6. Does there exist an infinite family of strings such that γ = o(b), where b is the size of the smallest bidirectional macro scheme [10]?

Macro schemes are the most general and powerful dictionary compressors known to date. Since fast random access can be supported within O(b log(n/b)) space, a positive answer to the above question would imply smaller random access data structures (of size o(b log(n/b)) for some strings).

A similar problem is that of the relation between γ and the number r of equal-letter runs in the Burrows-Wheeler transform:

Problem 7. Does it hold that r ∈ O(γ polylog n)?

Note that the above relation holds for all other known dictionary compressors [5] (except the CDAWG). A positive answer to Problem 7 would immediately prove new relations between dictionary compressors (e.g., is r ∈ O(z polylog n), z being the size of the LZ77 parse?).

A natural combinatorial question, related to the previous ones, is:

Problem 8. How many strings in Σ^n have a smallest attractor of size γ?

Let f(γ, n, σ) be the size of this class of strings (i.e., strings of length n, over an alphabet of size σ, and with smallest attractor of size γ). Note that the class with γ = 1 is precisely that of unary strings, therefore f(1, n, σ) = σ. Now, a simple information-theoretic argument shows that any encoding for this class will use, in the worst case, at least log_2(f(γ, n, σ)) bits.
This implies that no string can be represented within o(log_2(f(γ, n, σ))) space (see Problem 5).

4 Generalizations

It is natural to try generalizing string attractors to more complex objects such as trees, graphs, and grids. Here we give a tentative formulation for (labeled) graphs:

Definition 2 (Graph attractor). A graph attractor is a set Γ of edges such that every connected subgraph has an isomorphic occurrence crossing at least one of the edges in Γ.

Note that this formulation coincides with Definition 1 in the case of undirected labeled path graphs. Note also that the verification problem on graphs (is Γ a "graph attractor"?) now becomes much harder due to the NP-completeness of subgraph isomorphism. Indeed, it is easy to see that the verification problem itself becomes NP-complete: given an (unlabeled, undirected) graph G = ⟨V, E⟩, then G contains a subgraph isomorphic to a given (disjoint) graph H if and only if Γ = E is a graph attractor for the graph G ∪ H. Nevertheless, Definition 2 could be useful to approach the hard problem of graph compression, for which only a few techniques exist:

Problem 9. Can we use Definition 2 to compress a graph G (say, within space O(|Γ| polylog |G|))?

Similarly, a natural generalization of string attractors could be useful to compress repetitive sets of multi-dimensional points (e.g., two-dimensional grids). In this case, the generalization is more straightforward:

Definition 3 (Grid attractor). An attractor for a grid on [1..n] × [1..n] is a set Γ of points such that every sub-grid has an occurrence crossing at least one of the points in Γ.

It is actually not hard to generalize the random access data structure of [5, Thm 5.3] to support access on grids within O(|Γ| polylog n) space, where Γ is a grid attractor as defined in Definition 3. The same idea can be generalized to arbitrary dimension d ≥ 2. We leave this extension as future work.
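To make Definition 3 concrete, here is a brute-force verifier for tiny labeled grids (my own sketch, not from the paper; it interprets "sub-grid" as a contiguous rectangular block, uses 0-based (row, column) points, and is far too slow for anything beyond illustration):

```python
def is_grid_attractor(grid, gamma):
    """Naive check of Definition 3 on a labeled grid (a list of equal-length
    rows): every contiguous rectangular sub-grid must have an occurrence
    whose area contains a point of gamma. Points are 0-based (r, c) pairs."""
    rows, cols = len(grid), len(grid[0])
    gamma = set(gamma)

    def block(r, c, h, w):
        # The h x w sub-grid with top-left corner (r, c), as a hashable tuple.
        return tuple(tuple(grid[r + dr][c + dc] for dc in range(w))
                     for dr in range(h))

    for h in range(1, rows + 1):
        for w in range(1, cols + 1):
            for r in range(rows - h + 1):
                for c in range(cols - w + 1):
                    target = block(r, c, h, w)
                    # Search all occurrences of this h x w block for one
                    # whose area contains an attractor point.
                    if not any(block(r2, c2, h, w) == target and
                               any((r2 + dr, c2 + dc) in gamma
                                   for dr in range(h) for dc in range(w))
                               for r2 in range(rows - h + 1)
                               for c2 in range(cols - w + 1)):
                        return False
    return True
```

For instance, a single point suffices for the unary 2 × 2 grid, while the grid with rows "ab" and "ba" needs more points, since the row "ba" and the column ('b', 'a') each occur only once.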
5 Algorithms and Data Structures

To conclude this communication, we consider problems related to algorithms and data structures on string attractors. A natural research direction is to explore which queries can be supported on these combinatorial objects. As seen above, random access [5], as well as text indexing [9], can be supported in O(γ polylog n) space. While in the former case the provided upper bound matches the lower bound, the problem of optimality remains open for text indexing:

Problem 10. Can we locate patterns in O(m + occ) time within O(γ polylog n) space, where m is the pattern length and occ is the number of occurrences?

Similarly, no structures have been proposed yet for queries such as RMQ (range minimum), rank, and select. An interesting research direction could also be that of devising dynamic data structures based on string attractors (e.g., strings supporting insert, delete, and access operations).

As far as verification and optimization are concerned, it is still not known whether we can check that a given set Γ is a valid string attractor in optimal O(n/log_σ n + γ) time and space (this problem was addressed in [4]). The same holds for exact algorithms for computing a minimum attractor: the fixed-parameter tractable algorithm described in [4, Thm 24] runs in O(n) + exp(O(σ^k log σ^k)) time. Is it possible to reduce the dependency on k and σ?

We conclude the section with a problem of practical importance. The greedy approximation algorithm for k-attractor discussed in [5] (i.e., iteratively pick the text position that covers the largest number of uncovered substrings) achieves an O(log k) approximation factor, but a naive implementation runs in O(n^3) time. Implementing this approximation efficiently (say, in O(n log n) time) would be extremely valuable for building more space-efficient data structures in practice.

References

1. Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht.
Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578–595, 1987.
2. Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
3. Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching (CPM), pages 116–129. Springer, 1997.
4. Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String Attractors: Verification and Optimization. In Yossi Azar, Hannah Bast, and Grzegorz Herman, editors, 26th Annual European Symposium on Algorithms (ESA 2018), volume 112 of Leibniz International Proceedings in Informatics (LIPIcs), pages 52:1–52:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
5. Dominik Kempa and Nicola Prezza. At the Roots of Dictionary Compression: String Attractors. In Annual Symposium on Theory of Computing (STOC), pages 827–840. ACM, 2018.
6. T. Kida, T. Matsumoto, Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Collage system: A unifying framework for compressed pattern matching. Theoretical Computer Science, 298(1):253–272, 2003.
7. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000.
8. A. Lempel and J. Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.
9. Gonzalo Navarro and Nicola Prezza. Universal Compressed Text Indexing. Theoretical Computer Science, 2018.
10. James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the ACM, 29(4):928–951, 1982.