=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_25
|storemode=property
|title=Improved Compressed String Dictionaries
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_25.pdf
|volume=Vol-2621
|authors=Nieves R. Brisaboa,Ana Cerdeira-Pena,Guillermo De Bernardo,Gonzalo Navarro
|dblpUrl=https://dblp.org/rec/conf/circle/BrisaboaCBN20
}}
==Improved Compressed String Dictionaries==
Improved Compressed String Dictionaries∗
Nieves R. Brisaboa Ana Cerdeira-Pena
brisaboa@udc.es acerdeira@udc.es
Universidade da Coruña, Universidade da Coruña,
Centro de investigación CITIC, Databases Lab. Centro de investigación CITIC, Databases Lab.
A Coruña, Spain A Coruña, Spain
Guillermo de Bernardo Gonzalo Navarro
gdebernardo@udc.es gnavarro@dcc.uchile.cl
Universidade da Coruña, IMFD, DCC, University of Chile
Centro de investigación CITIC, Databases Lab. Santiago, Chile
A Coruña, Spain
ABSTRACT wide space/time tradeoff varying the block size. Grossi and Otta-
We propose a new family of data structures to store and query viano proposed path-decomposed tries (PDT) [3], a compact trie
string dictionaries in main memory. We combine hierarchical Front- representation that achieves good compression and very consis-
coding with ideas from longest-common-prefix computation in suf- tent query times.
fix arrays. We focus on two application domains where string dic-
tionaries are extensively used: URL collections, used in Web graphs, 2 OUR PROPOSAL
and collection of URIs and literals used in RDF datasets. We test our Our techniques follow the same ideas of differential compression
proposals in real-world dictionaries, showing that our data struc- based on Front-coding proposed in previous work [4], but we de-
tures yield relevant space-time tradeoffs, achieving very good com- sign a new arrangement that aims at improving the space-time
pression and competitive query times. tradeoff of previous solutions based on sampling and linear search.
Our representation starts from a collection of strings, that are
CCS CONCEPTS lexicographically sorted. We add two markers at the limits of the
• Information systems → Data compression; Dictionaries. collection, and initialize two arrays llcp and rlcp, that will be set
to 0 for those markers. Then, we select the middle point m of our
KEYWORDS array, set the value llcp[m] to the longest common prefix (lcp) be-
compression, data structures, string dictionaries tween the string at the middle and the left limit, and rlcp[m] to
the lcp between the middle and the rightmost string. Then, we re-
1 INTRODUCTION cursively repeat the process in each half of the collection, until all
lcps have been computed. We remove from each string the prefix
String dictionaries, that map strings to unique identifiers, are widely
corresponding to the maximum of its two associated lcps. Figure 1
used in applications that must work with large collections of strings.
displays the resulting elements for a sample collection of strings.
For instance, in Web graphs or RDF datasets, a usual technique
Our representation consists of the arrays llcp and rlcp, and the tails
is to associate consecutive integers to each node label, and then
of the strings after removing the prefixes.
build a representation of the graph structure that uses those inte-
ger identifiers (ids) instead of the original strings. Then, queries
are solved by translating query strings to ids, executing the query
on ids and mapping the results back to strings. The two basic oper-
ations needed are lookup(s) , that receives a string and returns the //&3
string id, and access(i) that returns the string from the id. 5/&3
In this work we focus on the compact representation of URL dic-
tionaries, used in Web graphs, and URIs and literals dictionaries,
FOLQJ
a
FOLPDWH
FORWKHV
FDOO
FDPS
FHLO
FODPS
FOHYHU
FORYHU
FOHDU
FOHDQ
FHQWV
FHLOLQJ
FHQW
FODP
used in RDF datasets. We combine well-known techniques such as
Front-coding or Re-pair compression with adapted binary-search
algorithms to obtain a family of compressed string dictionaries
Figure 1: Conceptual dictionary structure.
with good performance in those applications.
Several solutions have been proposed for this problem. Martinez-
Prieto et al. [4] developed, among other proposals, a solution based Using those structures, to look up a string, we could binary
on a fixed partition of the strings and compression of each block search our collection and compare the query string with the string
with Front-coding and other techniques; their solution provides a tail at the midpoint at each step. We have instead devised an im-
∗ This extended abstract summarizes work previously published in CIKM 2019 [1] proved binary search procedure that, keeping track of lcp values,
"Copyright ©2020 for this paper by its authors. Use permitted under Creative Com- is able to save some string comparisons. To recover the string for
mons License Attribution 4.0 International (CC BY 4.0)." an id, we reverse the procedure used in lookups. We recover the
Brisaboa et al.
14
string from the end, starting at the given position and moving to RPFC
RPHTFC
the left/right “parent” to keep decoding new characters, until we 12 RPDAC
HASHRPDAC
reach a position with lcp 0 and hence the string decoding is com- PDT
lookup time (µsec./query)
RP
IBiS -nt
plete. The complete pseudocode for both algorithms can be found 10
RP
IBiS -L-nt
RP+DAC
IBiS -nt
in the full article. 8
IBiS
RP+DAC
-L-nt
RP+DAC-VLS
We call our basic proposal IBiS, but we have proposed and tested IBiS
RP+DAC-VLS
IBiS
-nt
-L-nt
RP+DAC+DAC-VLS
a number of implementation variants for the same conceptual rep- 6
IBiS
IBiS
RP+DAC+DAC-VLS
-nt
-L-nt
resentation, varying the data structure and compression techniques
4
applied to llcp, rlcp and the string tails S: IBiS stores the arrays
as sequences of fixed-size integers, and the string tails are stored 2
5 10 15 20 25 30 35
in a single string Str . We add a bitmap B that marks the begin- 10
ning of each string tail in Str . IBiSRP applies Re-Pair compres-
RPFC
9 RPHTFC
RPDAC
sion to Str and encodes the resulting integer sequence as an ar- 8 HASHRPDAC
ray. IBiSRP +DAC is similar to IBiSRP , but uses DACs [2] to store
access time (µsec./query)
PDT
7 RP
IBiS -nt
llcp and rlcp. IBiSRP +DAC −V LS is also similar to IBiSRP but uses
RP
6 IBiS -L-nt
RP+DAC
IBiS -nt
RP+DAC
5 IBiS -L-nt
another variant of DACs to store the array resulting from Re-Pair IBiS
RP+DAC-VLS
-nt
compression. IBiSRP +DAC +DAC −V LS combines the two previous
4 RP+DAC-VLS
IBiS -L-nt
RP+DAC+DAC-VLS
IBiS -nt
3 RP+DAC+DAC-VLS
approaches, using DACs to llcp, rlcp and the Re-Pair-compressed 2
IBiS -L-nt
strings. 1
We also tested other variants that provide interesting tradeoffs. 0
5 10 15 20 25 30 35
First, if we consider end-of-string markers in Str to locate the end space (% of original)
of each string we can speed up some comparisons, but we have to
store an extra byte per string; we can omit those to save significant Figure 2: Space and query times on Web graph UK
space, at the cost of slightly more complex searches. We also build
variants that use a single lcp array (llcp or rlcp) instead of two; in
evaluation with Web graphs and RDF data suggests that our tech-
those variants, we avoid storing one of the arrays, but our string
niques improve the compression obtained by state-of-the-art so-
tails are longer, and search operations may not save as many string
lutions. Moreover, the variants tested provide several interesting
comparisons, so a space-time tradeoff is obtained.
space/time tradeoffs for compression, lookup and access perfor-
mance. Our proposal is so far limited to a static scenario, and works
3 EXPERIMENTAL EVALUATION well in datasets with relatively long strings on average. We plan to
explore the possibilities to adapt our solutions to other types of
We tested our proposal with Web graph and RDF datasets. We
string dictionaries, as well as to the dynamic dictionary problem.
summarize here the full results, that can be found on the full pa-
per. We compare our results with previous solutions [4] based on REFERENCES
Front-coding (RPFC and RPHTFC) and binary search (RPDAC and [1] Nieves R. Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, and Gonzalo
HASHRPDAC), and with path-decomposed tries (PDT). Navarro. 2019. Improved Compressed String Dictionaries. In Proc. 28th ACM Inter-
Our preliminary results suggest a number of trends among our national Conference on Information and Knowledge Management (Beijing, China)
(CIKM’19). 29–38. https://doi.org/10.1145/3357384.3357972
variants. For instance, among single-lcp implementations, those us- [2] Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2013. DACs: Bringing
ing only llcp (−L) are consistently the most efficient, so we will Direct Access to Variable-length Codes. Information Processing and Management
show detailed results for those. Variants with no end-of-string mark- 49, 1 (Jan. 2013), 392–404. https://doi.org/10.1016/j.ipm.2012.08.003
[3] Roberto Grossi and Giuseppe Ottaviano. 2015. Fast Compressed Tries Through
ers (−nt) also obtain the best performance in general, since query Path Decompositions. Journal of Experimental Algorithmics 19, Article 3.4 (Jan.
times are slightly affected but compression improves significantly, 2015), 11 pages. https://doi.org/10.1145/2656332
[4] Miguel A. Martínez-Prieto, Nieves R. Brisaboa, Rodrigo Cánovas, Fran-
so we will restrict our explanation to them. cisco Claude, and Gonzalo Navarro. 2016. Practical Compressed
Figure 2 shows the space/time tradeoff on the Web graph dataset String Dictionaries. Information Systems 56, C (Mar. 2016), 73–108.
UK. Results in other datasets follow similar trends: our techniques https://doi.org/10.1016/j.is.2015.08.008
achieve the best compression and are competitive in query times
with most alternatives. Techniques like PDT or RPDAC are faster,
but significantly larger in most cases; previous solutions based on
Front-coding provide a wide tradeoff, but our variants can pro-
vide similar tradeoffs and achieve better compression for similar
query times. Additionally, our variants provide different tradeoffs
for lookup and access that could be useful in specific applications.
4 CONCLUSIONS
Our new family of compressed data structures can efficiently com-
press string dictionaries and provide competitive query times. Our