Improved Compressed String Dictionaries∗ Nieves R. Brisaboa Ana Cerdeira-Pena brisaboa@udc.es acerdeira@udc.es Universidade da Coruña, Universidade da Coruña, Centro de investigación CITIC, Databases Lab. Centro de investigación CITIC, Databases Lab. A Coruña, Spain A Coruña, Spain Guillermo de Bernardo Gonzalo Navarro gdebernardo@udc.es gnavarro@dcc.uchile.cl Universidade da Coruña, IMFD, DCC, University of Chile Centro de investigación CITIC, Databases Lab. Santiago, Chile A Coruña, Spain ABSTRACT wide space/time tradeoff varying the block size. Grossi and Otta- We propose a new family of data structures to store and query viano proposed path-decomposed tries (PDT) [3], a compact trie string dictionaries in main memory. We combine hierarchical Front- representation that achieves good compression and very consis- coding with ideas from longest-common-prefix computation in suf- tent query times. fix arrays. We focus on two application domains where string dic- tionaries are extensively used: URL collections, used in Web graphs, 2 OUR PROPOSAL and collection of URIs and literals used in RDF datasets. We test our Our techniques follow the same ideas of differential compression proposals in real-world dictionaries, showing that our data struc- based on Front-coding proposed in previous work [4], but we de- tures yield relevant space-time tradeoffs, achieving very good com- sign a new arrangement that aims at improving the space-time pression and competitive query times. tradeoff of previous solutions based on sampling and linear search. Our representation starts from a collection of strings, that are CCS CONCEPTS lexicographically sorted. We add two markers at the limits of the • Information systems → Data compression; Dictionaries. collection, and initialize two arrays llcp and rlcp, that will be set to 0 for those markers. Then, we select the middle point m of our KEYWORDS array, set the value llcp[m] to the longest common prefix (lcp) be- compression, data structures, string dictionaries tween the string at the middle and the left limit, and rlcp[m] to the lcp between the middle and the rightmost string. Then, we re- 1 INTRODUCTION cursively repeat the process in each half of the collection, until all lcps have been computed. We remove from each string the prefix String dictionaries, that map strings to unique identifiers, are widely corresponding to the maximum of its two associated lcps. Figure 1 used in applications that must work with large collections of strings. displays the resulting elements for a sample collection of strings. For instance, in Web graphs or RDF datasets, a usual technique Our representation consists of the arrays llcp and rlcp, and the tails is to associate consecutive integers to each node label, and then of the strings after removing the prefixes. build a representation of the graph structure that uses those inte- ger identifiers (ids) instead of the original strings. Then, queries are solved by translating query strings to ids, executing the query on ids and mapping the results back to strings. The two basic oper- ations needed are lookup(s) , that receives a string and returns the //&3 string id, and access(i) that returns the string from the id. 5/&3 In this work we focus on the compact representation of URL dic- tionaries, used in Web graphs, and URIs and literals dictionaries, FOLQJ a FOLPDWH FORWKHV FDOO FDPS FHLO FODPS FOHYHU FORYHU FOHDU FOHDQ FHQWV FHLOLQJ FHQW FODP used in RDF datasets. We combine well-known techniques such as Front-coding or Re-pair compression with adapted binary-search algorithms to obtain a family of compressed string dictionaries Figure 1: Conceptual dictionary structure. with good performance in those applications. Several solutions have been proposed for this problem. Martinez- Prieto et al. [4] developed, among other proposals, a solution based Using those structures, to look up a string, we could binary on a fixed partition of the strings and compression of each block search our collection and compare the query string with the string with Front-coding and other techniques; their solution provides a tail at the midpoint at each step. We have instead devised an im- ∗ This extended abstract summarizes work previously published in CIKM 2019 [1] proved binary search procedure that, keeping track of lcp values, "Copyright ©2020 for this paper by its authors. Use permitted under Creative Com- is able to save some string comparisons. To recover the string for mons License Attribution 4.0 International (CC BY 4.0)." an id, we reverse the procedure used in lookups. We recover the Brisaboa et al. 14 string from the end, starting at the given position and moving to RPFC RPHTFC the left/right “parent” to keep decoding new characters, until we 12 RPDAC HASHRPDAC reach a position with lcp 0 and hence the string decoding is com- PDT lookup time (µsec./query) RP IBiS -nt plete. The complete pseudocode for both algorithms can be found 10 RP IBiS -L-nt RP+DAC IBiS -nt in the full article. 8 IBiS RP+DAC -L-nt RP+DAC-VLS We call our basic proposal IBiS, but we have proposed and tested IBiS RP+DAC-VLS IBiS -nt -L-nt RP+DAC+DAC-VLS a number of implementation variants for the same conceptual rep- 6 IBiS IBiS RP+DAC+DAC-VLS -nt -L-nt resentation, varying the data structure and compression techniques 4 applied to llcp, rlcp and the string tails S: IBiS stores the arrays as sequences of fixed-size integers, and the string tails are stored 2 5 10 15 20 25 30 35 in a single string Str . We add a bitmap B that marks the begin- 10 ning of each string tail in Str . IBiSRP applies Re-Pair compres- RPFC 9 RPHTFC RPDAC sion to Str and encodes the resulting integer sequence as an ar- 8 HASHRPDAC ray. IBiSRP +DAC is similar to IBiSRP , but uses DACs [2] to store access time (µsec./query) PDT 7 RP IBiS -nt llcp and rlcp. IBiSRP +DAC −V LS is also similar to IBiSRP but uses RP 6 IBiS -L-nt RP+DAC IBiS -nt RP+DAC 5 IBiS -L-nt another variant of DACs to store the array resulting from Re-Pair IBiS RP+DAC-VLS -nt compression. IBiSRP +DAC +DAC −V LS combines the two previous 4 RP+DAC-VLS IBiS -L-nt RP+DAC+DAC-VLS IBiS -nt 3 RP+DAC+DAC-VLS approaches, using DACs to llcp, rlcp and the Re-Pair-compressed 2 IBiS -L-nt strings. 1 We also tested other variants that provide interesting tradeoffs. 0 5 10 15 20 25 30 35 First, if we consider end-of-string markers in Str to locate the end space (% of original) of each string we can speed up some comparisons, but we have to store an extra byte per string; we can omit those to save significant Figure 2: Space and query times on Web graph UK space, at the cost of slightly more complex searches. We also build variants that use a single lcp array (llcp or rlcp) instead of two; in evaluation with Web graphs and RDF data suggests that our tech- those variants, we avoid storing one of the arrays, but our string niques improve the compression obtained by state-of-the-art so- tails are longer, and search operations may not save as many string lutions. Moreover, the variants tested provide several interesting comparisons, so a space-time tradeoff is obtained. space/time tradeoffs for compression, lookup and access perfor- mance. Our proposal is so far limited to a static scenario, and works 3 EXPERIMENTAL EVALUATION well in datasets with relatively long strings on average. We plan to explore the possibilities to adapt our solutions to other types of We tested our proposal with Web graph and RDF datasets. We string dictionaries, as well as to the dynamic dictionary problem. summarize here the full results, that can be found on the full pa- per. We compare our results with previous solutions [4] based on REFERENCES Front-coding (RPFC and RPHTFC) and binary search (RPDAC and [1] Nieves R. Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, and Gonzalo HASHRPDAC), and with path-decomposed tries (PDT). Navarro. 2019. Improved Compressed String Dictionaries. In Proc. 28th ACM Inter- Our preliminary results suggest a number of trends among our national Conference on Information and Knowledge Management (Beijing, China) (CIKM’19). 29–38. https://doi.org/10.1145/3357384.3357972 variants. For instance, among single-lcp implementations, those us- [2] Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2013. DACs: Bringing ing only llcp (−L) are consistently the most efficient, so we will Direct Access to Variable-length Codes. Information Processing and Management show detailed results for those. Variants with no end-of-string mark- 49, 1 (Jan. 2013), 392–404. https://doi.org/10.1016/j.ipm.2012.08.003 [3] Roberto Grossi and Giuseppe Ottaviano. 2015. Fast Compressed Tries Through ers (−nt) also obtain the best performance in general, since query Path Decompositions. Journal of Experimental Algorithmics 19, Article 3.4 (Jan. times are slightly affected but compression improves significantly, 2015), 11 pages. https://doi.org/10.1145/2656332 [4] Miguel A. Martínez-Prieto, Nieves R. Brisaboa, Rodrigo Cánovas, Fran- so we will restrict our explanation to them. cisco Claude, and Gonzalo Navarro. 2016. Practical Compressed Figure 2 shows the space/time tradeoff on the Web graph dataset String Dictionaries. Information Systems 56, C (Mar. 2016), 73–108. UK. Results in other datasets follow similar trends: our techniques https://doi.org/10.1016/j.is.2015.08.008 achieve the best compression and are competitive in query times with most alternatives. Techniques like PDT or RPDAC are faster, but significantly larger in most cases; previous solutions based on Front-coding provide a wide tradeoff, but our variants can pro- vide similar tradeoffs and achieve better compression for similar query times. Additionally, our variants provide different tradeoffs for lookup and access that could be useful in specific applications. 4 CONCLUSIONS Our new family of compressed data structures can efficiently com- press string dictionaries and provide competitive query times. Our