<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Compressing Collections of Substring Samples</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Golnaz Badkobeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Giuliani</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zsuzsanna Lipták</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon J. Puglisi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Goldsmiths, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given a string T = T[1..n] of length n, and integers ℓ and d, such that n &gt; ℓ ≥ 2d &gt; 0, we consider the problem of compressing the string S formed by concatenating the substrings of T of length ℓ starting at positions i ≡ 1 (mod d). In particular, we provide an upper bound of (2n − ℓ)/d + 2z + (ℓ − d) on the size of the Lempel-Ziv (LZ77) parsing of S, where z is the size of the parsing of T. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of S. If T is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.</p>
      </abstract>
      <kwd-group>
        <kwd>Lempel-Ziv</kwd>
        <kwd>LZ77</kwd>
        <kwd>data compression</kwd>
        <kwd>strings</kwd>
        <kwd>repetitiveness</kwd>
        <kwd>combinatorics on words</kwd>
        <kwd>parsing</kwd>
        <kwd>short reads</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ( &lt; 0.01 for Illumina short reads, for example). Secondly, in short read sequencing, coverage
is not completely uniform, fluctuating across the genome for a variety of reasons (see, e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>The extraordinarily wide adoption of high-throughput sequencing in medical and
evolutionary biology over the last decade has made short read data sets abundant. These data sets are also
very large. For example, a typical human sequencing experiment might run at 20x coverage on
an underlying genome of size n = 3 · 10⁹ nucleotides. The resulting read set in FASTQ format (a
standard file format, which stores one ASCII-encoded short read sequence per line—akin to our
string S) is then 60 gigabytes in size. The de facto standard in most labs (and large institutions
such as, e.g., the NCBI) is to compress such files with the gzip all-purpose file compressor, which
usually leads to a factor four reduction in size¹, or 15GB in our human sequencing example. A
gzip'd large read set is thus, alas, still relatively large.</p>
      <p>
        With the rapid growth in, and the need to store, short read data sets, specialized compressors
that exploit properties inherent to such data sets will become paramount. Several read-set-specific
compressors have now been developed (see, e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]). None are yet in wide
use. However, to our knowledge, no careful analysis of the compressibility of short read data
sets—even in an idealized setting such as that described above—has been undertaken. This
article addresses that need. In particular, we derive a non-trivial upper bound on the size of the
LZ parsing [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] of S, the concatenation of sampled substrings, in terms of n, ℓ, d and the size
of the LZ parsing of T. We also show a different upper bound that holds regardless of the order
in which the samples are concatenated to form S.
      </p>
      <p>In the next section we set out notation and the basic concepts and results used throughout. In
Sections 3 and 4 we establish the above-mentioned bounds, and in Section 5 we gauge the
tightness of these bounds experimentally. We then sketch several directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>
        Throughout we consider a string T = T[1..n] = T[1]T[2] . . . T[n] of |T| = n symbols
drawn from an ordered alphabet Σ of size |Σ| = σ. For i = 1, . . . , n, we write T[i..n] to denote
the suffix of T of length n − i + 1, that is, T[i..n] = T[i]T[i + 1] . . . T[n]. We will often
refer to suffix T[i..n] simply as “suffix i” or the “ith suffix”. We write T[1..j] = T[1] . . . T[j]
to denote the prefix of T of length j. We write T[i..j] to represent the substring (or factor)
T[i]T[i + 1] . . . T[j] of T that starts at position i and ends at position j. If i &gt; j, then
T[i..j] = ε, where ε is the empty string. Let lcp(i, j) denote the length of the longest common
prefix of suffix i and suffix j.
      </p>
      <p>For example, in the string T = zzzzzipzip, lcp(2, 5) = 1 = |z|, and lcp(5, 8) = 3 = |zip|. For
technical reasons we define lcp(i, 0) = lcp(0, i) = 0 for all i.</p>
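      <p>The lcp function can be sketched directly from this definition (a minimal illustration with 1-based positions, matching the text; the function name lcp is ours):</p>

```python
def lcp(t, i, j):
    """Length of the longest common prefix of suffix i and suffix j of t
    (1-based positions; by convention lcp(i, 0) = lcp(0, i) = 0)."""
    if i == 0 or j == 0:
        return 0
    a, b = t[i - 1:], t[j - 1:]  # the two suffixes
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l

t = "zzzzzipzip"
print(lcp(t, 2, 5))  # 1, common prefix "z"
print(lcp(t, 5, 8))  # 3, common prefix "zip"
```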
      <p>Definition 1 (LZ Factorization). The LZ factorization of T is a factorization T = f_1 f_2 . . . f_z
of T into z phrases such that each phrase f_i (a substring of T) is either</p>
      <p>1. a letter that does not occur in f_1 · · · f_{i−1}, or</p>
      <p>¹This can be loosely explained as follows: gzip, a sliding-window dictionary compressor whose window is too small to
capture any large dispersed repeated substrings present in the file, essentially reduces the space used by each
DNA letter from the 8 bits used in the plain ASCII encoding to the 2 bits that a flat minimal binary code for four
letters would use.</p>
      <p>2. the longest substring that occurs at least twice in f_1 · · · f_i.</p>
      <p>Thus, the LZ factorization of our example string T = zzzzzipzip produces the following
z = 5 factors: f_1 = z, f_2 = zzzz, f_3 = i, f_4 = p, f_5 = zip.</p>
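      <p>Definition 1 translates into a simple factorizer; the sketch below is quadratic-time and for illustration only (the name lz_factorize is ours; practical implementations use suffix structures instead of a naive scan):</p>

```python
def lz_factorize(t):
    """Greedy LZ factorization (Definition 1): each phrase is either a fresh
    letter or the longest prefix of the remaining suffix that already occurs
    earlier; sources may overlap the phrase itself."""
    phrases = []
    i, n = 0, len(t)
    while i < n:
        best = 0
        for j in range(i):  # every earlier candidate source start
            l = 0
            while i + l < n and t[j + l] == t[i + l]:  # overlap allowed
                l += 1
            best = max(best, l)
        if best == 0:
            phrases.append(t[i])  # fresh letter
            i += 1
        else:
            phrases.append(t[i:i + best])
            i += best
    return phrases

print(lz_factorize("zzzzzipzip"))  # ['z', 'zzzz', 'i', 'p', 'zip']
```

Note how the source of the second phrase (position 1, source at position 1 less) overlaps the phrase itself, as the text allows.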
      <p>We denote with ℓ_i = |f_i| the length of phrase f_i, and with p_i the starting position of phrase
f_i in T, i.e., p_i = 1 + ∑_{j&lt;i} ℓ_j. Finally, we denote with q_i the position of the first occurrence of
substring f_i in T, and we call the substring T[q_i..q_i + ℓ_i − 1] the source of f_i. Note that phrases
and sources are allowed to overlap, as is the case with f_2 in our example.</p>
      <p>
        The following is a known upper bound on the number of phrases in the parsing.
Theorem 1 ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Theorem 5.20). Let T be a string of length n over Σ, with |Σ| = σ. Then
z ≤ z_max, where
z_max = (n − n/(log_σ n − 1)) / (log_σ n − log_σ log_σ n − 1/(log_σ n − 1)).
      </p>
      <p>
        We will return to the above bound later in the paper, but for now we note that it is
asymptotically tight. In particular, an appropriate prefix of the de Bruijn sequence contains at least
n/⌈log_σ n⌉ LZ phrases [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We close this section with another well-known property of the LZ parsing that we make use of in
the proof of our main theorem.</p>
      <p>Fact 1. Let T be a string, and 1 &lt; i ≤ j ≤ |T|. If s = T[i..j] has an occurrence before i, i.e., if
there exists an i′ &lt; i s.t. T[i′..i′ + |s| − 1] = s, then the interval [i, j] can contain at most one
starting position of a phrase.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Upper Bound on the Number of LZ Phrases of the Concatenation of Samples</title>
      <p>In this section, we show an upper bound on the number of LZ phrases of the string S formed by
concatenating samples from string T.</p>
      <p>Formally, given a string T of length n and two integer parameters ℓ, d, such that ℓ ≥ 2d,
let S = s_1 s_2 · · · s_m, where m = ⌊(n − ℓ)/d⌋ + 1, and, for i ∈ [1..m],
s_i = T[(i − 1) · d + 1..(i − 1) · d + ℓ].</p>
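      <p>The sampling process can be sketched as follows (a toy illustration; the parameter names ell and d mirror ℓ and d in the text):</p>

```python
def samples(t, ell, d):
    """Return s_1, ..., s_m with s_i = T[(i-1)d+1 .. (i-1)d+ell] (1-based)
    and m = floor((n - ell)/d) + 1."""
    n = len(t)
    assert n > ell >= 2 * d > 0  # the standing assumption of the paper
    m = (n - ell) // d + 1
    return [t[(i - 1) * d : (i - 1) * d + ell] for i in range(1, m + 1)]

# For T = zzzzzipzip (n = 10), ell = 4, d = 2 we get m = 4 samples:
print(samples("zzzzzipzip", 4, 2))  # ['zzzz', 'zzzi', 'zipz', 'pzip']
```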
      <p>Let us write, for i ≥ 1, s_i = u_i v_i, where |u_i| = ℓ − d and |v_i| = d. Then T and S can be
written as follows, where |w| &lt; d:</p>
      <p>T = u_1 v_1 v_2 · · · v_m w,   (1)</p>
      <p>S = u_1 v_1 u_2 v_2 u_3 v_3 · · · u_m v_m.   (2)</p>
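      <p>Factorizations (1) and (2) can be checked mechanically (an illustrative sketch; the helper names u, v, w are ours and mirror the symbols above):</p>

```python
def decompose(t, ell, d):
    """Split each sample s_i = u_i v_i with |u_i| = ell - d, |v_i| = d,
    then rebuild T via factorization (1) and S via factorization (2)."""
    n = len(t)
    m = (n - ell) // d + 1
    u = [t[(i - 1) * d : (i - 1) * d + ell - d] for i in range(1, m + 1)]
    v = [t[(i - 1) * d + ell - d : (i - 1) * d + ell] for i in range(1, m + 1)]
    w = t[(ell - d) + m * d :]                   # leftover tail, |w| < d
    t_rebuilt = u[0] + "".join(v) + w            # factorization (1)
    s = "".join(u[i] + v[i] for i in range(m))   # factorization (2)
    return t_rebuilt, s, w
```

For T = zzzzzipzip with ell = 4, d = 2, t_rebuilt equals T again, confirming that u_1 followed by the v_i's and w tiles the whole string.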
      <p>We will now count the number of phrases z_S of the LZ parsing of string S. Note that this is
equivalent to counting the number of starting positions of phrases.</p>
      <p>Lemma 1. Let b_i and e_i denote the beginning resp. ending position of the u_i's in the factorization
of S given in (2). Then, for i &gt; 1, the interval [b_i, e_i] can contain at most one starting position of a
phrase.</p>
      <p>Proof. Since two consecutive samples s_{i−1} and s_i overlap by ℓ − d characters, it follows that,
for all i &gt; 1, the (ℓ − d)-length suffix of s_{i−1} equals u_i, the (ℓ − d)-length prefix of s_i. Therefore,
S contains the square u_i u_i for every i &gt; 1, and b_i is the start of the second occurrence of u_i in
this square. By Fact 1, therefore, [b_i, e_i] can contain at most one starting position.</p>
      <p>Next we will count the number of starting positions in the v_i's. For this, we consider phrases
f of the LZ parsing of T, and count how many starting positions these induce in S in the worst
case. We first need the definition of the projection of a substring of T on S (see also Figure 1).</p>
      <sec id="sec-4-1">
        <title>Definition 2.</title>
        <p>Let 1 ≤ i ≤ j ≤ n. We define the projection of substring T[i, j] on S as a
collection of substrings of S as follows. For each v_k, let b_k and e_k denote the starting resp. ending
positions of v_k in T w.r.t. the factorization given in (1), and let p and q be the indices such that
i ∈ [b_p, e_p] and j ∈ [b_q, e_q]. Then</p>
        <p>π_S([i, j]) = ⋃_{k=p}^{q} [max(i, b_k) + (k − 1)(ℓ − d), min(j, e_k) + (k − 1)(ℓ − d)]   if ℓ − d &lt; i;</p>
        <p>π_S([i, j]) = [i, j]   if i, j ≤ ℓ − d;</p>
        <p>π_S([i, j]) = [i, ℓ − d] ∪ π_S([ℓ − d + 1, j])   if i ≤ ℓ − d &lt; j.</p>
        <p>Informally, the projection on S of a substring s of T is the collection of substrings of S
coinciding with regions of s, namely the substrings of S which s consists of, w.r.t. the factorization
given in (1). The first case (see Fig. 1) gives the general situation, when s starts after the prefix
u_1 of T, which is covered only by the first sample s_1. The second case is when s lies fully
within u_1, while the third case is when s starts within u_1 and ends after it.</p>
        <p>(Figure 1: The projection π_S([i, j]) of T[i, j] on S: the substrings induced by [i, j] in T, all
contained in the corresponding v_k's in S.)</p>
      </sec>
      <sec id="sec-4-2">
        <title>Definition 3.</title>
        <p>Let f be a phrase of the LZ parsing of T, with starting position p and length ℓ_f.
Define c(f) as the number of starting positions in π_S([p, p + ℓ_f − 1]).</p>
        <p>Lemma 2. Let ℓ ≥ 2d, and let f be a phrase of the LZ factorization of T with starting position
p &gt; ℓ − d. Then c(f) ≤ |f|/d + 2.</p>
        <p>Proof. Let ℓ_f = |f|, a = ⌈ℓ_f/d⌉, and write f = T[p..p + ℓ_f − 1] = v′ v_{k+1} · · · v_{k+a−1} v″,
where v′ and v″ are a proper suffix of v_k respectively a proper prefix of v_{k+a}, both possibly empty.
We will show that each of these substrings is charged with at most one phrase starting position. From
Fact 1, the claim then follows.</p>
        <p>By construction of S, each of the substrings v′, v_{k+1}, . . . , v_{k+a−1}, v″ appears contiguously
in S. The number of these substrings is either a or a + 1. Additionally, a u-substring separates
the projections in S of each pair of contiguous aforementioned substrings of T.</p>
        <p>Consider now the source f′ of f, occurring in T at some position p′ &lt; p. Then T[p..p + ℓ_f − 1] =
T[p′..p′ + ℓ_f − 1] = w′ v_{k′+1} · · · v_{k′+a′−1} w″, where w′ and w″ are a proper suffix of v_{k′}
respectively a proper prefix of v_{k′+a′}, both possibly empty. As for f, the projection of f′ is also split
in S in such a way that the projection of each pair of mentioned substrings of f′ is separated by a
u-substring in S.</p>
        <p>Notice that, since ℓ ≥ 2d, each v_i in T has an occurrence in S also as the suffix of u_{i+1},
occurring immediately before v_{i+1} in S. This means that, even if the projections of f and f′ in S
are split asynchronously, each of the substrings v′, v_{k+1}, · · · , v_{k+a−1}, v″ in the projection of
f has already occurred in S, within the concatenation of a suffix of some u_{j+1} intersecting the
projection of f′ and the contiguous v_{j+1}.</p>
        <p>There are at most a + 1 factors v′, v_{k+1}, . . . , v_{k+a−1}, v″, and each of them has a previous
occurrence in S. Therefore, by Fact 1, there are at most a + 1 phrase starting positions for
f. Since a + 1 = ⌈ℓ_f/d⌉ + 1 ≤ ℓ_f/d + 2 = |f|/d + 2, the claim follows. See Fig. 2 for an
illustration.</p>
        <p>(Figure 2: The projections on S of a phrase f = T[p, p + ℓ_f − 1] of T and of its source, for
ℓ ≥ 2d.)</p>
        <p>Lemma 3. With respect to the factorization of S given in (2), the number of phrase starting
positions in the v_i's is at most n/d + 2z.</p>
        <p>Proof. The sum of the lengths of all phrases f_1, . . . , f_z in the LZ factorization of T is the
length n of the string T. Therefore, we can bound the total contribution to z_S as follows:
∑_{i=1}^{z} c(f_i) ≤ ∑_{i=1}^{z} (|f_i|/d + 2) = n/d + 2z.</p>
        <p>Theorem 2. Let z_S be the number of phrases of the LZ parsing of S, and z the number of
phrases of the LZ parsing of T. Then z_S ≤ (2n − ℓ)/d + 2z + ℓ − d.</p>
        <p>Proof. We obtain the maximum number of phrase starting positions in S by summing up the
contributions of the (ℓ − d)-length prefix u_1 of S, of the v_i's (Lemma 3), and of the u_i's for
i ≥ 2 (Lemma 1):
z_S ≤ (ℓ − d) + (n/d + 2z) + (m − 1) ≤ (ℓ − d) + n/d + 2z + (n − ℓ)/d = (2n − ℓ)/d + 2z + ℓ − d,
using m − 1 = ⌊(n − ℓ)/d⌋ ≤ (n − ℓ)/d.</p>
        <p>Theorem 2 essentially says that the number of LZ phrases of S is at most twice the number
of samples plus twice the number of LZ phrases of T. Note that the important parameter which
determines the number of samples is d, while ℓ influences only the number of samples in
which a given position occurs (i.e. ℓ/d is the so-called coverage).</p>
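        <p>Theorem 2 can be sanity-checked empirically with the naive factorizer of Section 2 (a sketch under the paper's assumptions n &gt; ℓ ≥ 2d; quadratic time, small inputs only):</p>

```python
def lz_count(t):
    """Number of phrases in the greedy LZ factorization of Definition 1."""
    z = i = 0
    n = len(t)
    while i < n:
        best = 0
        for j in range(i):  # naive longest-previous-match scan
            l = 0
            while i + l < n and t[j + l] == t[i + l]:
                l += 1
            best = max(best, l)
        i += best if best else 1  # fresh letter advances by one
        z += 1
    return z

def theorem2_check(t, ell, d):
    """Compare z_S against the bound (2n - ell)/d + 2z + (ell - d)."""
    n = len(t)
    m = (n - ell) // d + 1
    s = "".join(t[(i - 1) * d : (i - 1) * d + ell] for i in range(1, m + 1))
    z, z_s = lz_count(t), lz_count(s)
    bound = (2 * n - ell) / d + 2 * z + (ell - d)
    return z_s, bound

z_s, bound = theorem2_check("zzzzzipzip" * 3, ell=6, d=2)
assert z_s <= bound
```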
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Upper Bound for Arbitrary Concatenation Order</title>
      <p>This section examines the effect that the ordering of the samples in the concatenation has on
the number of phrases.</p>
      <p>As before, we are given a string T of length n and two integer parameters ℓ and d, such that
ℓ ≥ 2d. Let m = ⌊(n − ℓ)/d⌋ + 1, and, for i ∈ [1..m], s_i = T[(i − 1) · d + 1..(i − 1) · d + ℓ].</p>
      <p>We will show that, regardless of the order in which the samples s_i of T are concatenated,
the number of phrases in the LZ factorization of that concatenation is at most n + 2m. We
make this argument precise below.</p>
      <p>Theorem 3. Let S′ be the concatenation of the samples of T in any order. Then the number of
phrases is z′ ≤ n + 2m.</p>
      <p>Proof. Fix i. We will show that the number of phrase starting positions in sample s_i, where it
appears in S′, is at most 2 + |δ_i|, where δ_i is a specific substring of s_i. Let k = max{i′ &lt; i |
s_{i′} appears before s_i in S′}, and let α_i be the longest suffix of s_k which is also a prefix of s_i
(i.e., α_i is the maximum overlap). Note that α_i may be empty. Similarly, let k′ = min{i′ &gt; i |
s_{i′} appears before s_i in S′}, and let β_i be the longest prefix of s_{k′} which is also a suffix of s_i,
again possibly empty. If |α_i| + |β_i| ≥ ℓ = |s_i|, then set δ_i = ε. Otherwise, s_i can be written
as s_i = α_i δ_i β_i, for some δ_i.</p>
      <p>This δ_i is a substring of T that has not so far been covered by any sample in S′, and this is
because of the definition of k and k′. We call δ_i the new part of s_i; in particular, if δ_i = ε, then
the new part of s_i is empty.</p>
      <p>By Fact 1, if α_i is non-empty, then at most one phrase starting position is contained within
α_i. Similarly, if β_i is non-empty, at most one starting position is contained within β_i. The
number of starting positions within the new part δ_i can be trivially upper bounded by |δ_i|.</p>
      <p>(Figure 3: The decomposition s_i = α_i δ_i β_i of a sample s_i within S′.)</p>
      <p>Summing over all samples s_i, we thus get
z′ ≤ ∑_{i=1}^{m} (2 + |δ_i|) = 2m + ∑_{i=1}^{m} |δ_i| ≤ 2m + n,
with the last inequality using the fact that every position of T occurs in at most one new part δ_i.</p>
      <p>Theorem 3 essentially says that the number of LZ phrases of S′, i.e. of the concatenation of
the samples in an arbitrary order, is at most the length of T plus twice the number of samples.</p>
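      <p>Theorem 3 can likewise be checked on randomly ordered concatenations (an illustrative sketch; the seed is fixed only for reproducibility):</p>

```python
import random

def lz_count(t):
    """Number of phrases in the greedy LZ factorization of Definition 1."""
    z = i = 0
    n = len(t)
    while i < n:
        best = 0
        for j in range(i):
            l = 0
            while i + l < n and t[j + l] == t[i + l]:
                l += 1
            best = max(best, l)
        i += best if best else 1
        z += 1
    return z

def theorem3_check(t, ell, d, seed=0):
    """z' of a randomly ordered concatenation vs the bound n + 2m."""
    n = len(t)
    m = (n - ell) // d + 1
    samples = [t[(i - 1) * d : (i - 1) * d + ell] for i in range(1, m + 1)]
    random.Random(seed).shuffle(samples)  # arbitrary concatenation order
    z_prime = lz_count("".join(samples))
    return z_prime, n + 2 * m

z_prime, bound = theorem3_check("zzzzzipzip" * 3, ell=6, d=2)
assert z_prime <= bound
```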
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Results</title>
      <p>In order to gauge the tightness of our bounds, we computed the number of phrases in the
conventional LZ factorization (defined in Section 2) of the concatenation of the samples taken
from each of the texts in Table 1. We did this for a range of ℓ and d parameters, and concatenated
the samples both in string order and after a random shuffling.
5.1. Data
Our test data, which contains files of varying repetitiveness, is shown in Table 1.</p>
      <p>
        By nature, viruses contain very little recurrent genetic heritage, and so a viral genome
represents a real-world non-repetitive string. We used a genome of SARS-CoV2 taken from
the COVID-19 Data Portal [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] of length 29 836, which is factorized into 4 373 LZ phrases. We also
performed experiments on the extreme cases between which real-life genomes lie: Fibonacci words
and random words.
      </p>
      <sec id="sec-6-1">
        <title>Data</title>
        <p>
          Table 1 lists our test data. SARS-CoV2: taken from the COVID-19 Data Portal [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]; 50 SARS-CoV2: concatenation of 50 virus genomes from the COVID-19 Data Portal [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]; Fibonacci word: Fibonacci word of order 22; Random: word over an alphabet of size 4 built
with the Python function random.choices().
        </p>
      </sec>
      <sec id="sec-6-5">
        <p>Let F_0 = b and F_1 = a; the Fibonacci word of order k + 1 is the binary word F_{k+1} = F_k F_{k−1}. In
other words, the Fibonacci word of order k + 1 is the concatenation of the Fibonacci words of
the two previous orders.</p>
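        <p>The recurrence can be coded directly (a minimal sketch; the function name fibonacci_word is ours):</p>

```python
def fibonacci_word(order):
    """F_0 = "b", F_1 = "a", F_{k+1} = F_k F_{k-1}."""
    if order == 0:
        return "b"
    prev, cur = "b", "a"  # F_0, F_1
    for _ in range(order - 1):
        prev, cur = cur, cur + prev
    return cur

print(fibonacci_word(3))        # aba
print(len(fibonacci_word(22)))  # 28657, the length used in our experiments
```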
        <p>By construction, Fibonacci words have a very high degree of repetition, resulting in very few
phrases with respect to their length. Random words, on the other hand, can be considered
not repetitive at all. To perform a comparison with the virus genome, we chose a
random word of the same length as the genome, and the Fibonacci word of length 28 657, the
closest to the length of the genome. The strings are factorized into 4 575 respectively 22 phrases.</p>
        <p>We also performed experiments with larger data. In particular, we counted the number of
phrases of the concatenation of 50 SARS-CoV2 genomes, which, due to the high similarity of
the individual genomes, constitutes a highly repetitive dataset.
5.2. Experiments
We are interested in simulating the coverage provided by the most widely used technologies in genome
sequencing. The coverage is given by the number of samples that cover a given position in the
text. In the context of genome sequencing, a reasonable coverage is given by large ℓ and small
d, in order to have the coverage as large as possible with reasonable sample lengths.</p>
        <p>We perform the experiments on each of the aforementioned texts with d =
1, 2, 4, 8, 16, 32, 64, and ℓ from 50 to 1 000, in increments of 10. We counted the number
of phrases for ℓ &gt; d.</p>
        <p>
          Figure 4, on top, displays the number of phrases for concatenations of samples in string
order of a Fibonacci word (top-left corner), and of a random string of a similar length (top-right
corner). Below, we show the counts for an arbitrarily ordered concatenation of the samples
of the same data. In all figures, the number of phrases for some selected d's is shown in
distinct colours, in gray for all other d's. The corresponding coloured line shows our bounds
for each ℓ and d pair. Finally, the coloured round points show the existing bound given by
Kärkkäinen [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], see Theorem 1.
        </p>
        <p>The analogous comparison is shown in Figure 5 between a single SARS-CoV2 genome and
the concatenation of 50 genomes.</p>
        <p>The overlapping of the bound lines in all plots reflects the small impact of the sample length
ℓ on the bound for fixed d.</p>
        <p>When the samples are concatenated in string order (at the top of both Figures 4 and 5), the
number of phrases in the original string has a big influence on the bound. We can see that the
bound lines in the Fibonacci word's plot are closer to the actual counts than in the other plots,
where the number of phrases of the original strings is higher. On the other hand, because of the
different nature of the bound for the samples in string order versus random order, we can see
that the latter bound has a similar trend for the three strings of similar length. In this case, the
length and the number of samples of the original string determine the bound, rather than its
number of phrases.</p>
        <p>
          Comparing the bound given in this paper (lines) to the one of Kärkkäinen [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] reported in
Theorem 1 (round points): when repetitive strings are considered, our bound is not worse
than the existing one for the string-order concatenation, while it gets looser for the
concatenation in arbitrary order. See the plots for Fibonacci words and the concatenation of 50
genomes of SARS-CoV-2. On the other hand, for non-repetitive strings, namely random strings
and the single viral genome in the plots, the existing bound is often better than ours.
        </p>
      </sec>
      <sec id="sec-6-6">
        <p>(Table 2: for each dataset (Fibonacci word, Fibonacci word shuffled, SARS-CoV2, SARS-CoV2
shuffled) and for each d starting at d = 1, the maximum ℓ, the number of phrases, and our bound.)</p>
        <p>In Table 2 we show, for fixed , the maximum number of phrases in , varying . For
strings  that are already very repetitive (Fibonacci word), the longer the string, the higher the
number of phrases. On the other hand, for strings that are not that repetitive (single SARS-CoV2
genome), introducing long repetitions may decrease the number of phrase starting positions
in contrast to having a shorter string with very short repetitions. The concatenation of 50
SARS-CoV2 genomes lies in the middle.</p>
        <p>(Figure 4: (a) Fibonacci word of length 28 657; (b) Random word of length 29 835.)</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Concluding Remarks</title>
      <p>This paper has explored the compressibility of collections of substrings sampled from a string.
In particular, given a string T = T[1..n] of length n with an LZ parsing of size z, and integers
ℓ and d, such that n &gt; ℓ ≥ 2d &gt; 0, we have shown that the size of the LZ parsing of the string
S formed by concatenating the substrings of T of length ℓ starting at positions i ≡ 1 (mod d)
is bounded by (2n − ℓ)/d + 2z + (ℓ − d). We also proved a bound for the case of an arbitrarily
chosen order of the samples in the concatenation.</p>
      <p>(Figure 5: (a) SARS-CoV2 genome of length 29 835; (b) concatenation of 50 SARS-CoV2
genomes, of total length 1 490 134.)</p>
      <p>
        There are several avenues future work could take. The most immediate perhaps is the question
of whether the bounds given in Theorem 2 and Theorem 3 can be improved. Another direction
is to derive bounds for other compression measures, such as the size of the smallest context-free
grammar for  [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the number of runs in its Burrows-Wheeler transform [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the size of
its smallest string attractor [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], or other measures of compressibility discussed in Navarro’s
recent survey [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Along the lines of our original motivation from DNA sequencing, how are the bounds affected
by the introduction of some probability of error to the sample substrings, such as the errors
introduced in collections of strings produced by short-read sequencing technologies?</p>
      <p>Finally, note that we have assumed d ≤ ℓ/2, which corresponds to the interesting case in
practice. However, it may be interesting from a theoretical perspective to also examine the case
where d &gt; ℓ/2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ekblom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smeds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ellegren</surname>
          </string-name>
          ,
          <article-title>Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria</article-title>
          ,
          <source>BMC Genomics 15</source>
          (
          <year>2014</year>
          )
          <fpage>467</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Al Yami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>LFastqC: A lossless non-reference-based FASTQ compressor</article-title>
          ,
          <source>PLoS One</source>
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <article-title>e0224806</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tatwawadi</surname>
          </string-name>
          , I. Ochoa,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hernaez</surname>
          </string-name>
          , T. Weissman,
          <article-title>SPRING: a next-generation compressor for FASTQ data</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>35</volume>
          (
          <year>2019</year>
          )
          <fpage>2674</fpage>
          -
          <lpage>2676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deorowicz</surname>
          </string-name>
          ,
          <article-title>FQSqueezer: k-mer-based compression of sequencing data</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hoobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Boucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Puglisi</surname>
          </string-name>
          ,
          <article-title>Fast and efficient compression of high-throughput sequencing reads</article-title>
          ,
          <source>in: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lempel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ziv</surname>
          </string-name>
          ,
          <article-title>On the complexity of finite sequences</article-title>
          ,
          <source>IEEE Trans. Inf. Theory</source>
          <volume>22</volume>
          (
          <year>1976</year>
          )
          <fpage>75</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ziv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lempel</surname>
          </string-name>
          ,
          <article-title>A universal algorithm for sequential data compression</article-title>
          ,
          <source>IEEE Trans. Inf. Theory</source>
          <volume>23</volume>
          (
          <year>1977</year>
          )
          <fpage>337</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kärkkäinen</surname>
          </string-name>
          ,
          <article-title>Repetition-Based Text Indexes</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Helsinki, Faculty of Science, Department of Computer Science,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Buso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Felix</surname>
          </string-name>
          , et al.,
          <article-title>The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>49</volume>
          (
          <year>2021</year>
          )
          <fpage>W619</fpage>
          -
          <lpage>W623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Panigrahy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sahai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelat</surname>
          </string-name>
          ,
          <article-title>The smallest grammar problem</article-title>
          ,
          <source>IEEE Trans. Inf. Theory</source>
          <volume>51</volume>
          (
          <year>2005</year>
          )
          <fpage>2554</fpage>
          -
          <lpage>2576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Manzini</surname>
          </string-name>
          ,
          <article-title>An analysis of the Burrows-Wheeler transform</article-title>
          ,
          <source>J. ACM</source>
          <volume>48</volume>
          (
          <year>2001</year>
          )
          <fpage>407</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kempa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prezza</surname>
          </string-name>
          ,
          <article-title>At the roots of dictionary compression: string attractors</article-title>
          ,
          <source>in: Proc. 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC)</source>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>827</fpage>
          -
          <lpage>840</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Navarro</surname>
          </string-name>
          ,
          <article-title>Indexing highly repetitive string collections, part I: repetitiveness measures</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>29:1</fpage>
          -
          <lpage>29:31</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>