=Paper=
{{Paper
|id=Vol-1146/paper6
|storemode=property
|title=Block Graphs in Practice
|pdfUrl=https://ceur-ws.org/Vol-1146/paper6.pdf
|volume=Vol-1146
|dblpUrl=https://dblp.org/rec/conf/icabd/GagieHP14
}}
==Block Graphs in Practice==
Block Graphs in Practice

Travis Gagie (1,*), Christopher Hoobin (2), Simon J. Puglisi (1)

(1) Department of Computer Science, University of Helsinki, Finland
(2) School of CSIT, RMIT University, Australia
(*) Travis.Gagie@cs.helsinki.fi

Abstract

Motivated by the rapidly increasing size of genomic databases, code repositories and versioned texts, several compression schemes have been proposed that work well on highly-repetitive strings and also support fast random access: e.g., LZ-End, RLZ, GDC, augmented SLPs, and block graphs. Block graphs have good worst-case bounds but it has been an open question whether they are practical. We describe an implementation of block graphs that, for several standard datasets, provides better compression and faster random access than competing schemes.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: Costas S. Iliopoulos, Alessio Langiu (eds.): Proceedings of the 2nd International Conference on Algorithms for Big Data, Palermo, Italy, 7-9 April 2014, published at http://ceur-ws.org/

1 Introduction

Advances in DNA sequencing technology have led to massive genomic databases, the open-source movement has led to massive code repositories, and the popularity of wikis has led to massive versioned textual databases. Fortunately, all these datasets tend to be highly repetitive and, thus, highly compressible. Compressing them is only useful, however, if we can still access them quickly afterwards. Although many papers have been published about compression schemes with fast random access (see [6] for a recent survey), most have been about schemes such as Huffman coding, LZ78, CSAs or BWT-based coding. Only relatively recently have researchers started proposing schemes with random access and LZ77-like compression, which is better suited to highly-repetitive strings.

One approach uses variants of LZ77 itself. Kreft and Navarro's LZ-End [7] is practical but lacks good worst-case bounds, both for the compression and for the random-access time. Kuruppu, Puglisi and Zobel's Relative Lempel-Ziv (RLZ) [8, 9] and Deorowicz and Grabowski's Genome Differential Compressor (GDC) [4] are also practical but are not general-purpose: we can apply them only when we have a good reference sequence, or can construct one. Even then, Deorowicz, Danek and Grabowski [3] have observed that when compressing genomes, "the key to high compression is to look for similarities across the whole collection, not just against one reference sequence".

Another approach uses straight-line programs (SLPs). An SLP for a string s is a context-free grammar in Chomsky normal form that generates s and only s. Rytter [13] and Charikar et al. [2] showed that if s has length n and its LZ77 parse consists of z phrases, then we can build an SLP for s with O(z log n) rules and height O(log n). If we store the resulting SLP together with the size of each non-terminal's expansion, which takes a total of O(z log n) space, then we can extract any substring of length ℓ in O(log n + ℓ) time; a toy sketch of this extraction idea appears below. This extraction time nearly matches a lower bound by Verbin and Yu [14]. Bille et al. [1] showed how we can store any SLP with r rules, regardless of the height, in O(r) space with the same time bound for extraction. These data structures are not practical, however.
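To make the size-augmented SLP idea above concrete, here is a minimal Python sketch of our own (not taken from any of the cited implementations; the rule representation and names are illustrative only). Each non-terminal caches the length of its expansion, so extraction can descend from the root and decide at each rule which child covers the queried interval.

```python
# Minimal sketch of substring extraction from a straight-line program (SLP).
# A non-terminal X -> (left, right) expands to exp(left) + exp(right);
# terminals are single characters. The 'size' cache of expansion lengths is
# what allows extraction in time roughly proportional to height + length.

def expansion_size(symbol, rules, size):
    """Memoized length of the string a symbol expands to."""
    if symbol not in rules:          # terminal character
        return 1
    if symbol not in size:
        left, right = rules[symbol]
        size[symbol] = (expansion_size(left, rules, size)
                        + expansion_size(right, rules, size))
    return size[symbol]

def extract(symbol, i, length, rules, size, out):
    """Append exp(symbol)[i : i + length] to out."""
    if length <= 0:
        return
    if symbol not in rules:          # terminal: exp(symbol) == symbol
        out.append(symbol)
        return
    left, right = rules[symbol]
    left_len = expansion_size(left, rules, size)
    if i < left_len:                 # part (or all) of the query is on the left
        take = min(length, left_len - i)
        extract(left, i, take, rules, size, out)
        length -= take
        i = left_len
    if length > 0:                   # remainder is on the right
        extract(right, i - left_len, length, rules, size, out)

# Example: a hypothetical SLP generating "abaab".
rules = {"S": ("A", "B"), "A": ("a", "b"), "B": ("C", "b"), "C": ("a", "a")}
size, out = {}, []
extract("S", 1, 3, rules, size, out)
print("".join(out))                  # -> "baa"
```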
The most recent and practical SLP-based scheme of which we are aware is Maruyama, Tabei, Sakamoto and Sadakane's Fully-Online LCA (FOLCA) [10] but, apart from needing fewer resources for construction, even this is less practical than LZ-End.

In a previous paper [5] we introduced a third approach, called block graphs, and showed that they also use O(z log n) space and O(log n + ℓ) extraction time. We did not implement these data structures for that paper, however, so it has been an open question whether they are practical. In this paper we describe an implementation that, for several standard datasets, provides better compression and faster random access than LZ-End; thus, block graphs are competitive in both theory and practice. In Section 2 we review the definition of the block graph for a string. In Section 3 we describe some details of our implementation. Finally, in Section 4 we report on our experiments.

2 Block Graphs

The block graph of a string s[1..n] is a directed acyclic graph (DAG) in which each node has in-degree at most 2 and out-degree at most 3. For simplicity, assume n is a power of 2. Each node of the block graph corresponds to a substring, called that node's block: the root corresponds to the whole of s and, if they exist, the children of a node v correspond to the first half of v's block, the middle half of v's block (which overlaps the first and last halves), and the last half of v's block. In general, v shares its left and right children with its left and right siblings, respectively: v's left child's block is both the last half of v's left sibling's block and the first half of v's block, and v's right child's block is both the last half of v's block and the first half of v's right sibling's block. If n is not a power of 2, then we pad s so that it is, build the block graph, and prune redundant nodes. Figure 1, which is copied from our earlier paper, shows the block graph for the eighth Fibonacci string, abaababaabaababaababa, truncated at depth 3.

[Figure 1: The block graph for the eighth Fibonacci string, abaababaabaababaababa, truncated at depth 3.]

We mark as an internal node each node whose block is the first occurrence of that substring (shown in Figure 1 as ovals). We mark as a leaf each node whose block is not a first occurrence and whose parent is an internal node (shown in Figure 1 as rectangles). We remove leaves' children (and their descendants) and replace them by pointers, as explained in our earlier paper:

Suppose a leaf u at depth d had a child ⟨i..j⟩. ... Consider the first occurrence s[i'..j'] in s of the substring s[i..j]. Notice that s[i'..j'] is completely contained within some block at depth d (this is one reason why we use overlapping blocks) and, since s[i'..j'] is the first occurrence of that substring in s, that block is associated with an internal node v. We replace the pointer from u to ⟨i..j⟩ by a pointer to v and the offset of i' in v's block.

At some depth, we store all internal nodes' blocks instead of their children; this depth is 3 in Figure 1. In our previous paper we proved that

1. the block graph for s has O(z log n) nodes and, thus, takes O(z log n) space;
2. we can extract any substring of length ℓ in O(log n + ℓ) time;
3. if we store all of the nodes at depth d but remove all nodes above that depth, we change the space usage to O(z(log n - d) + 2^d) and decrease the extraction time to O(log n - d + ℓ).

For more details, we refer readers to our previous paper [5] and the proofs and discussion it contains.
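As a concrete illustration of the overlapping-block structure above, the following Python sketch of ours (0-indexed, unlike the 1-indexed blocks of the paper) enumerates the blocks level by level and marks a block internal exactly when it is the first occurrence of its substring, reproducing the shape shown in Figure 1:

```python
# Sketch: enumerate the overlapping blocks of a block graph level by level,
# marking a block internal iff it is the first occurrence of its substring.
# Assumes len(s) is a power of 2 (the paper pads and prunes otherwise) and
# d <= log2(len(s)) - 1, so blocks have length at least 2.

def blocks_at_depth(s, d):
    """Blocks at depth d have length b = n / 2^d and start every b/2
    positions, so adjacent blocks overlap by half a block; a level has
    at most 2^(d+1) - 1 blocks."""
    n = len(s)
    b = n >> d
    return [(i, i + b) for i in range(0, n - b + 1, b // 2)]

def is_internal(s, i, j):
    """Internal iff s[i:j] first occurs in s at position i."""
    return s.find(s[i:j]) == i

s = "abaababaabaababa"   # first 16 characters of the 8th Fibonacci string
for d in range(1, 4):
    level = [(i, j, "int" if is_internal(s, i, j) else "leaf")
             for i, j in blocks_at_depth(s, d)]
    print(d, level)
```

At depth 1, for example, this prints the three blocks (0, 8), (4, 12) and (8, 16), matching the 1-indexed nodes 1..8, 5..12 and 9..16 of Figure 1.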
3 Implementation

We now describe an implementation of block graphs which is efficient in practice. The main idea is to represent the shape of the graph (the internal nodes and their pointers) using bitvectors and operations from succinct data structures, and to carefully allocate space for the leaf nodes depending on their distance from the root.

Below we make use of two familiar operations on bitvectors: rank and select. Given a bitvector B, a position i, and a type of bit b (either 0 or 1), rank_b(B, i) returns the number of occurrences of b before position i in B, and select_b(B, i) returns the position of the ith b in B. Efficient data structures supporting these operations have been extensively studied (see, e.g., [11, 12]).

Recall that each level of the block graph consists of a number of nodes, either internal nodes or leaves. Let B_d be a bitvector which indicates whether the ith node (from the left) at depth d is a leaf, B_d[i] = 0, or an internal node, B_d[i] = 1. We define another bitvector R_d, where R_d[i] = 1 if and only if B_d[i] = 1 and B_d[i+1] = 1, for i < n - 1; that is, we mark a 1 bit for each instance of two adjacent internal nodes in B_d, and otherwise R_d[i] = 0. Let L_d be an array that holds the leaf nodes at depth d; the structure of a leaf node is discussed below. Finally, let T be the concatenation of the textual representations (i.e., the corresponding substrings) of all internal nodes at the truncated depth, d'. As adjacent text blocks share 2^{⌈log n⌉-d'-1} characters, we concatenate only the last half of each new adjacent block to T; non-adjacent blocks are fully concatenated. We utilize an R_{d'} bitvector at this level too; here we mark R_{d'}[i] = 1 if the ith node at the truncated depth is a text block adjacent to another text block.

Navigating the block graph

The main operation is to traverse from an internal node to one of its three children. Say we are currently at the jth internal node at depth d of the block graph; that is, we are at B_d[i], where i = select_1(B_d, j). Each internal node has three children. If these children were independent then the position of the left child of the current node would simply be three times the node's position on its level, that is, 3j = 3 · rank_1(B_d, i). However, in a block graph, adjacent internal nodes share exactly one child, so we correct for this by subtracting the number of adjacent internal nodes at this depth prior to the current node, which is given by rank_1(R_d, i). To find the position in B_{d+1} corresponding to the left child of a node we therefore compute

leftchild(B_d, i) = 3 · rank_1(B_d, i) - rank_1(R_d, i).

Given the address of the left child it is easy to find the center or right child by adding 1 or 2, respectively. If B_d[i] = 0 then we are at a leaf node, and its leaf information is at L_d[rank_0(B_d, i)]. Once we reach the truncated depth d' we access the text of an internal node by computing its offset in T as

T[(rank_1(B_{d'}, i) · 2^{⌈log n⌉-d'}) - (rank_1(R_{d'}, i) · 2^{⌈log n⌉-d'-1})].
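The following Python sketch of ours mirrors this navigation logic with naive rank/select over plain lists; a real implementation would use the succinct structures of [11, 12], which answer rank and select in constant time. The indexing conventions (0-based positions, rank counting bits strictly before i) and the toy level are our own choices:

```python
# Sketch of block-graph navigation over plain Python lists standing in for
# succinct bitvectors. rank/select are naive O(n) here; succinct structures
# answer them in O(1) using o(n) extra bits.

def rank1(B, i):
    """Number of 1 bits strictly before position i."""
    return sum(B[:i])

def rank0(B, i):
    return i - rank1(B, i)

def select1(B, j):
    """Position of the jth 1 bit (j is 1-indexed)."""
    seen = 0
    for pos, bit in enumerate(B):
        seen += bit
        if seen == j:
            return pos
    raise ValueError("fewer than j ones")

def left_child(B, R, i):
    """Index on the next level of the left child of internal node B[i]:
    three children per internal node before i, minus one for every pair of
    adjacent internal nodes before i, since adjacent nodes share a child."""
    return 3 * rank1(B, i) - rank1(R, i)

def text_offset(B, R, i, b):
    """Offset in T of the block of internal node B[i] at the truncated depth,
    where b is the block length: each earlier text block contributed b
    characters, except that each adjacency saved b/2 shared characters."""
    return rank1(B, i) * b - rank1(R, i) * (b // 2)

# Toy level: nodes 0, 1 and 3 are internal, node 2 is a leaf.
B = [1, 1, 0, 1]
R = [1, 0, 0, 0]                 # R[i] = 1 iff B[i] = B[i+1] = 1
i = select1(B, 2)                # position of the 2nd internal node -> 1
print(left_child(B, R, i))       # 3*1 - 1 = 2
print(text_offset(B, R, i, 8))   # 1*8 - 1*4 = 4
```

With constant-time rank and select, each step of a root-to-leaf traversal costs O(1), which is what yields the stated extraction bounds.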
Leaf nodes

In a block graph, leaves point to internal nodes. For each leaf we store two values: the position of the destination node on its level, and an offset into the destination node's block pointing to the beginning of the leaf's block. Note that we do not need to store the depth of the destination node: it is, by definition, on the level above the leaf, and we keep track of the depth during each step of a traversal. To improve compression we store leaf positions and offsets in two separate arrays. At depth d there are no more than 2^{d+1} - 1 possible nodes, so we can store each position in log(2^{d+1} - 1) bits. Given that the length of a block at depth d is b = 2^{⌈log n⌉-d} and leaf nodes point to an internal node on the level above, we can store each offset in log(2^{⌈log n⌉-d+1}) bits.
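A small sketch of this bookkeeping (ours; the function and parameter names are illustrative), computing the fixed bit widths used to pack leaf positions and offsets at a given depth:

```python
# Sketch: fixed bit widths for packing leaf pointers at depth d of a block
# graph over a string of length n. A leaf stores (position, offset): the
# destination internal node's index and the offset of the leaf's block
# inside that node's block.
from math import ceil, log2

def leaf_field_widths(n, d):
    levels = ceil(log2(n))                        # depth of a full (padded) graph
    position_bits = ceil(log2(2 ** (d + 1) - 1))  # <= 2^(d+1) - 1 nodes at depth d
    offset_bits = levels - d + 1                  # parent blocks have length
                                                  # 2^(levels - d + 1)
    return position_bits, offset_bits

print(leaf_field_widths(n=2 ** 20, d=10))         # -> (11, 11)
```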
Table 1: Size in MB of repetitive corpus files encoded with ASCII, gzip, xz, LZ-End and block graphs truncated so that the smallest blocks are 4, 8, 16 and 32 bytes (Bg4 to Bg32).

Collection         ASCII    GZIP     XZ    LZ-End     Bg4     Bg8    Bg16    Bg32
world leaders         49    8.28   0.51     4.52     6.62    5.83    5.72    6.37
Escherichia Coli     112   31.53   5.18    49.10    49.70   49.57   45.33   46.91
influenza            154   10.63   1.59    21.50    33.16   32.97   33.32   37.89
coreutils            205   49.92   3.70    35.88    42.80   33.19   30.43   33.00
kernel               257   69.39   2.07    19.34    21.21   15.69   13.84   14.05
para                 429  116.07   6.09    57.41    72.39   72.13   67.84   70.66
cere                 461  120.08   5.07    41.34    57.68   57.54   54.59   57.96
einstein.en.txt      467  163.66   0.33     2.24     3.52    3.07    3.01    3.19

4 Experiments

We have developed an implementation of block graphs (available at http://www.github.com/choobin/block-graph) and tested it on texts from the Pizza-Chili Repetitive Corpus (http://pizzachili.dcc.uchile.cl/repcorpus.html), a standard testbed for data structures designed for repetitive strings. We compared the compression achieved by block graphs to that of the LZ-End data structure of Kreft and Navarro [7], and to the general-purpose compressors gzip and xz; the results are shown in Table 1. gzip and xz were run with their highest compression setting, -9, while LZ-End was executed with its default settings. Throughout our experiments we tested block graphs that were truncated such that the smallest blocks were 4, 8, 16 and 32 bytes. Note that gzip and xz provide compression only, not random access, and are included as reference points for achievable compression. We did not test extraction from bookmarks because Kreft and Navarro's data structure does not support it (nor does any other implemented data structure).

We then compared how quickly block graphs and LZ-End support extracting substrings of various lengths; the results are shown in Figure 2. The mean extraction speed with LZ-End never exceeded 9 million characters per second while, for sufficiently long extractions from the kernel file, the mean extraction speed with the block graph was over 480 million characters per second. Each run of extractions was performed across 10,000 randomly-generated queries. Experiments were conducted on an Intel Core i7-2600 3.4 GHz processor with 8GB of main memory, running Linux 3.3.4; code was compiled with GCC version 4.7.0 targeting x86_64 with full optimizations. Caches were dropped between runs with sync && echo 1 > /proc/sys/vm/drop_caches.

Although xz achieves much better compression, block graphs achieve better compression than gzip except on the Escherichia Coli and influenza files. Most importantly, our experiments show that block graphs generally achieve compression comparable to that achieved by LZ-End while supporting significantly faster substring extraction.

[Figure 2: eight panels, one per collection: world_leaders (45M), Escherichia_Coli (108M), influenza (148M), coreutils (196M), kernel (247M), para (410M), cere (440M) and einstein.en.txt (446M). Each panel plots extraction speed (Mchars/s) against log(extract length) for block graphs truncated at 32, 16, 8 and 4 bytes and for LZ-End.]

Figure 2: Extraction speeds in millions of characters per second versus the binary logarithm of the length of the extracted substring. Each data point is averaged over 10,000 random substring extractions.

References

[1] P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, and O. Weimann. Random access to grammar-compressed strings. In Proceedings of the 22nd Symposium on Discrete Algorithms (SODA), pages 373–389, 2011.

[2] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.

[3] S. Deorowicz, A. Danek, and S. Grabowski. Genome compression: a novel approach for large collections. Bioinformatics, 29(20):2572–2578, 2013.

[4] S. Deorowicz and S. Grabowski. Robust relative compression of genomes with random access. Bioinformatics, 27(21):2979–2986, 2011.

[5] T. Gagie, P. Gawrychowski, and S. J. Puglisi. Faster approximate pattern matching in compressed repetitive texts. In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC), pages 653–662, 2011.

[6] R. Grossi. Random access to high-order entropy compressed text. In A. Brodnik, A. López-Ortiz, V. Raman, and A. Viola, editors, Space-Efficient Data Structures, Streams, and Algorithms, pages 199–215. Springer-Verlag, 2013.

[7] S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, to appear.

[8] S. Kuruppu, S. J. Puglisi, and J. Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pages 201–206, 2010.

[9] S. Kuruppu, S. J. Puglisi, and J. Zobel. Optimized relative Lempel-Ziv compression of genomes. In Proceedings of the 34th Australasian Computer Science Conference (ACSC), pages 91–98, 2011.
[10] S. Maruyama, Y. Tabei, H. Sakamoto, and K. Sadakane. Fully-online grammar compression. In Proceedings of the 20th Symposium on String Processing and Information Retrieval (SPIRE), pages 218–229, 2013.

[11] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), 2007.

[12] R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007.

[13] W. Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science, 302(1–3):211–222, 2003.

[14] E. Verbin and W. Yu. Data structure lower bounds on random access to grammar-compressed strings. In Proceedings of the 24th Symposium on Combinatorial Pattern Matching (CPM), pages 247–258, 2013.