Improved Compressed String Dictionaries∗
                            Nieves R. Brisaboa                                                                   Ana Cerdeira-Pena
                          brisaboa@udc.es                                                                 acerdeira@udc.es
                       Universidade da Coruña,                                                         Universidade da Coruña,
            Centro de investigación CITIC, Databases Lab.                                   Centro de investigación CITIC, Databases Lab.
                          A Coruña, Spain                                                                 A Coruña, Spain

                        Guillermo de Bernardo                                                                     Gonzalo Navarro
                       gdebernardo@udc.es                                                                  gnavarro@dcc.uchile.cl
                      Universidade da Coruña,                                                           IMFD, DCC, University of Chile
           Centro de investigación CITIC, Databases Lab.                                                       Santiago, Chile
                         A Coruña, Spain

ABSTRACT                                                                            wide space/time tradeoﬀ varying the block size. Grossi and Otta-
We propose a new family of data structures to store and query                       viano proposed path-decomposed tries (PDT) [3], a compact trie
string dictionaries in main memory. We combine hierarchical Front-                  representation that achieves good compression and very consis-
coding with ideas from longest-common-preﬁx computation in suf-                     tent query times.
ﬁx arrays. We focus on two application domains where string dic-
tionaries are extensively used: URL collections, used in Web graphs,                2 OUR PROPOSAL
and collection of URIs and literals used in RDF datasets. We test our               Our techniques follow the same ideas of diﬀerential compression
proposals in real-world dictionaries, showing that our data struc-                  based on Front-coding proposed in previous work [4], but we de-
tures yield relevant space-time tradeoﬀs, achieving very good com-                  sign a new arrangement that aims at improving the space-time
pression and competitive query times.                                               tradeoﬀ of previous solutions based on sampling and linear search.
                                                                                       Our representation starts from a collection of strings, that are
CCS CONCEPTS                                                                        lexicographically sorted. We add two markers at the limits of the
• Information systems → Data compression; Dictionaries.                             collection, and initialize two arrays llcp and rlcp, that will be set
                                                                                    to 0 for those markers. Then, we select the middle point m of our
KEYWORDS                                                                            array, set the value llcp[m] to the longest common preﬁx (lcp) be-
compression, data structures, string dictionaries                                   tween the string at the middle and the left limit, and rlcp[m] to
                                                                                    the lcp between the middle and the rightmost string. Then, we re-
1    INTRODUCTION                                                                   cursively repeat the process in each half of the collection, until all
                                                                                    lcps have been computed. We remove from each string the preﬁx
String dictionaries, that map strings to unique identiﬁers, are widely
                                                                                    corresponding to the maximum of its two associated lcps. Figure 1
used in applications that must work with large collections of strings.
                                                                                    displays the resulting elements for a sample collection of strings.
For instance, in Web graphs or RDF datasets, a usual technique
                                                                                    Our representation consists of the arrays llcp and rlcp, and the tails
is to associate consecutive integers to each node label, and then
                                                                                    of the strings after removing the preﬁxes.
build a representation of the graph structure that uses those inte-
ger identiﬁers (ids) instead of the original strings. Then, queries
are solved by translating query strings to ids, executing the query
on ids and mapping the results back to strings. The two basic oper-
ations needed are lookup(s) , that receives a string and returns the                 //&3

string id, and access(i) that returns the string from the id.                        5/&3


   In this work we focus on the compact representation of URL dic-
tionaries, used in Web graphs, and URIs and literals dictionaries,
                                                                                                                                                                                                   FOLQJ


                                                                                                                                                                                                                              a
                                                                                                                                                                                         FOLPDWH


                                                                                                                                                                                                           FORWKHV
                                                                                                 FDOO


                                                                                                          FDPS


                                                                                                                 FHLO


                                                                                                                                                        FODPS


                                                                                                                                                                                FOHYHU


                                                                                                                                                                                                                     FORYHU
                                                                                                                                                                        FOHDU
                                                                                                                                                                FOHDQ
                                                                                                                                         FHQWV
                                                                                                                        FHLOLQJ


                                                                                                                                  FHQW


                                                                                                                                                 FODP


used in RDF datasets. We combine well-known techniques such as
Front-coding or Re-pair compression with adapted binary-search
algorithms to obtain a family of compressed string dictionaries
                                                                                                Figure 1: Conceptual dictionary structure.
with good performance in those applications.
   Several solutions have been proposed for this problem. Martinez-
Prieto et al. [4] developed, among other proposals, a solution based                   Using those structures, to look up a string, we could binary
on a ﬁxed partition of the strings and compression of each block                    search our collection and compare the query string with the string
with Front-coding and other techniques; their solution provides a                   tail at the midpoint at each step. We have instead devised an im-
∗ This extended abstract summarizes work previously published in CIKM 2019 [1]      proved binary search procedure that, keeping track of lcp values,
"Copyright ©2020 for this paper by its authors. Use permitted under Creative Com-   is able to save some string comparisons. To recover the string for
mons License Attribution 4.0 International (CC BY 4.0)."                            an id, we reverse the procedure used in lookups. We recover the
                                                                                                                                                                                            Brisaboa et al.


                                                                                                             14
string from the end, starting at the given position and moving to                                                                                                               RPFC
                                                                                                                                                                             RPHTFC
the left/right “parent” to keep decoding new characters, until we                                            12                                                               RPDAC
                                                                                                                                                                      HASHRPDAC
reach a position with lcp 0 and hence the string decoding is com-                                                                                                                   PDT


                                                                                 lookup time (µsec./query)
                                                                                                                                                                                RP
                                                                                                                                                                           IBiS -nt
plete. The complete pseudocode for both algorithms can be found                                              10
                                                                                                                                                                             RP
                                                                                                                                                                         IBiS -L-nt
                                                                                                                                                                          RP+DAC
                                                                                                                                                                      IBiS            -nt
in the full article.                                                                                         8
                                                                                                                                                                  IBiS
                                                                                                                                                                        RP+DAC
                                                                                                                                                                                    -L-nt
                                                                                                                                                                      RP+DAC-VLS
   We call our basic proposal IBiS, but we have proposed and tested                                                                                             IBiS
                                                                                                                                                                  RP+DAC-VLS
                                                                                                                                                               IBiS
                                                                                                                                                                                      -nt
                                                                                                                                                                                    -L-nt
                                                                                                                                                                RP+DAC+DAC-VLS
a number of implementation variants for the same conceptual rep-                                             6
                                                                                                                                                         IBiS
                                                                                                                                                        IBiS
                                                                                                                                                             RP+DAC+DAC-VLS
                                                                                                                                                                                      -nt
                                                                                                                                                                                    -L-nt

resentation, varying the data structure and compression techniques
                                                                                                             4
applied to llcp, rlcp and the string tails S: IBiS stores the arrays
as sequences of ﬁxed-size integers, and the string tails are stored                                          2
                                                                                                                  5   10   15            20             25                    30               35
in a single string Str . We add a bitmap B that marks the begin-                                             10

ning of each string tail in Str . IBiSRP applies Re-Pair compres-
                                                                                                                                                                                RPFC
                                                                                                             9                                                               RPHTFC
                                                                                                                                                                              RPDAC
sion to Str and encodes the resulting integer sequence as an ar-                                             8                                                        HASHRPDAC

ray. IBiSRP +DAC is similar to IBiSRP , but uses DACs [2] to store


                                                                                 access time (µsec./query)
                                                                                                                                                                                    PDT
                                                                                                             7                                                                  RP
                                                                                                                                                                             IBiS     -nt

llcp and rlcp. IBiSRP +DAC −V LS is also similar to IBiSRP but uses
                                                                                                                                                                               RP
                                                                                                             6                                                           IBiS       -L-nt
                                                                                                                                                                          RP+DAC
                                                                                                                                                                    IBiS      -nt
                                                                                                                                                                      RP+DAC
                                                                                                             5                                                    IBiS      -L-nt
another variant of DACs to store the array resulting from Re-Pair                                                                                               IBiS
                                                                                                                                                                      RP+DAC-VLS
                                                                                                                                                                                      -nt

compression. IBiSRP +DAC +DAC −V LS combines the two previous
                                                                                                             4                                                    RP+DAC-VLS
                                                                                                                                                               IBiS                 -L-nt
                                                                                                                                                                RP+DAC+DAC-VLS
                                                                                                                                                         IBiS                         -nt
                                                                                                             3                                               RP+DAC+DAC-VLS
approaches, using DACs to llcp, rlcp and the Re-Pair-compressed                                              2
                                                                                                                                                        IBiS                        -L-nt


strings.                                                                                                     1

   We also tested other variants that provide interesting tradeoﬀs.                                          0
                                                                                                                  5   10   15            20             25                    30               35
First, if we consider end-of-string markers in Str to locate the end                                                            space (% of original)

of each string we can speed up some comparisons, but we have to
store an extra byte per string; we can omit those to save signiﬁcant           Figure 2: Space and query times on Web graph UK
space, at the cost of slightly more complex searches. We also build
variants that use a single lcp array (llcp or rlcp) instead of two; in
                                                                         evaluation with Web graphs and RDF data suggests that our tech-
those variants, we avoid storing one of the arrays, but our string
                                                                         niques improve the compression obtained by state-of-the-art so-
tails are longer, and search operations may not save as many string
                                                                         lutions. Moreover, the variants tested provide several interesting
comparisons, so a space-time tradeoﬀ is obtained.
                                                                         space/time tradeoﬀs for compression, lookup and access perfor-
                                                                         mance. Our proposal is so far limited to a static scenario, and works
3   EXPERIMENTAL EVALUATION                                              well in datasets with relatively long strings on average. We plan to
                                                                         explore the possibilities to adapt our solutions to other types of
We tested our proposal with Web graph and RDF datasets. We
                                                                         string dictionaries, as well as to the dynamic dictionary problem.
summarize here the full results, that can be found on the full pa-
per. We compare our results with previous solutions [4] based on         REFERENCES
Front-coding (RPFC and RPHTFC) and binary search (RPDAC and              [1] Nieves R. Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, and Gonzalo
HASHRPDAC), and with path-decomposed tries (PDT).                            Navarro. 2019. Improved Compressed String Dictionaries. In Proc. 28th ACM Inter-
   Our preliminary results suggest a number of trends among our              national Conference on Information and Knowledge Management (Beijing, China)
                                                                             (CIKM’19). 29–38. https://doi.org/10.1145/3357384.3357972
variants. For instance, among single-lcp implementations, those us-      [2] Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2013. DACs: Bringing
ing only llcp (−L) are consistently the most eﬃcient, so we will             Direct Access to Variable-length Codes. Information Processing and Management
show detailed results for those. Variants with no end-of-string mark-        49, 1 (Jan. 2013), 392–404. https://doi.org/10.1016/j.ipm.2012.08.003
                                                                         [3] Roberto Grossi and Giuseppe Ottaviano. 2015. Fast Compressed Tries Through
ers (−nt) also obtain the best performance in general, since query           Path Decompositions. Journal of Experimental Algorithmics 19, Article 3.4 (Jan.
times are slightly aﬀected but compression improves signiﬁcantly,            2015), 11 pages. https://doi.org/10.1145/2656332
                                                                         [4] Miguel A. Martínez-Prieto, Nieves R. Brisaboa, Rodrigo Cánovas, Fran-
so we will restrict our explanation to them.                                 cisco Claude, and Gonzalo Navarro. 2016.                   Practical Compressed
   Figure 2 shows the space/time tradeoﬀ on the Web graph dataset            String Dictionaries.         Information Systems 56, C (Mar. 2016), 73–108.
UK. Results in other datasets follow similar trends: our techniques          https://doi.org/10.1016/j.is.2015.08.008
achieve the best compression and are competitive in query times
with most alternatives. Techniques like PDT or RPDAC are faster,
but signiﬁcantly larger in most cases; previous solutions based on
Front-coding provide a wide tradeoﬀ, but our variants can pro-
vide similar tradeoﬀs and achieve better compression for similar
query times. Additionally, our variants provide diﬀerent tradeoﬀs
for lookup and access that could be useful in speciﬁc applications.


4   CONCLUSIONS
Our new family of compressed data structures can eﬃciently com-
press string dictionaries and provide competitive query times. Our