=Paper= {{Paper |id=Vol-3284/1532 |storemode=property |title=On Compressing Collections of Substring Samples |pdfUrl=https://ceur-ws.org/Vol-3284/1532.pdf |volume=Vol-3284 |authors=Golnaz Badkobeh,Sara Giuliani,Zsuzsanna LiptΓ k,Simon Puglisi |dblpUrl=https://dblp.org/rec/conf/ictcs/BadkobehGLP22 }} ==On Compressing Collections of Substring Samples== https://ceur-ws.org/Vol-3284/1532.pdf
On Compressing Collections of Substring Samples
Golnaz Badkobeh1 , Sara Giuliani2,* , Zsuzsanna LiptΓ‘k2 and Simon J. Puglisi3
1
  Goldsmiths, University of London, UK
2
  University of Verona, Italy
4
  University of Helsinki, Finland


                                         Abstract
                                         Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers π‘š and 𝑑, such that 𝑛 > π‘š β‰₯ 2𝑑 > 0, we consider
                                         the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length π‘š starting
                                         at positions 𝑖 ≑ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 βˆ’ π‘š)/𝑑 + 2𝑧 + (π‘š βˆ’ 𝑑) on
                                         the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that
                                         a related bound holds regardless of the order in which the substrings are concatenated in the formation
                                         of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an
                                         idealized model of short read DNA sequencing.

                                         Keywords
                                         Lempel-Ziv, LZ77, data compression, strings, repetitiveness, combinatorics on words, parsing, short
                                         reads




1. Introduction
We consider the problem of compressing a set of substrings sampled from a string. In particular,
given a string 𝑋 of length 𝑛 and two integer parameters 𝑑 and π‘š, π‘š β‰₯ 2𝑑, we call a sample of
𝑋 a substring of 𝑋 of length π‘š starting at any position 1 + 𝑖 Β· 𝑑 in 𝑋, where 𝑖 β‰₯ 0. We refer
to the samples as 𝑆𝑖 = 𝑋[(𝑖 βˆ’ 1) Β· 𝑑 + 1..(𝑖 βˆ’ 1) Β· 𝑑 + π‘š], for 𝑖 β‰₯ 1, and to the concatenation
as 𝑆 = 𝑆1 𝑆2 Β· Β· Β· π‘†π‘Ÿ , where π‘Ÿ = ⌊(𝑛 βˆ’ π‘š)/π‘‘βŒ‹ + 1. Note that if π‘›βˆ’π‘šπ‘‘ is not an integer, then the
final few characters of 𝑋 will not be part of any sample.
   If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to
an idealized model of short read DNA sequencing. In the language of genome sequencing, π‘š is
the read length and π‘š/𝑑 is the so-called coverage (the number of samples that cover a given
position in 𝑋, on average). Our assumption that π‘š β‰₯ 2𝑑 corresponds to a coverage of at least
2, which is the relevant case for DNA sequencing. 𝑆 represents a file of short read sequences
β€” the typical output of a sequencing experiment. Our sampling process idealizes short read
sequencing in at least two ways. Firstly, short read sequencing may produce strings of slightly
different length and may introduce errors β€” insertions, deletions, and substitutions of letters β€”
to the sampled strings, albeit with fairly low probability for present-day short read technology


Proceedings of the 23rd Italian Conference on Theoretical Computer Science, Rome, Italy, September 7-9, 2022
*
 Corresponding author.
" g.badkobeh@gold.ac.uk (G. Badkobeh); sara.giuliani_01@univr.it (S. Giuliani); zsuzsanna.liptak@univr.it
(Zs. LiptΓ‘k); simon.puglisi@helsinki.fi (Simon J. Puglisi)
                                       Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
(𝑝 < 0.01 for Illumina short reads, for example). Secondly, in short read sequencing, coverage
is not completely uniform, fluctuating across the genome for a variety of reasons (see, e.g. [1]).
   The extraordinarily wide adoption of high-throughput sequencing in medical and evolution-
ary biology over the last decade has made short read data sets abundant. These data sets are also
very large. For example, a typical human sequencing experiment might run at 20x coverage on
a underlying genome of size 𝑛 = 3 Β· 109 nucleotides. The resulting read set in FASTQ format (a
standard file format, which stores one ASCII-encoded short read sequence per lineβ€”akin to our
𝑆 string) is then 60 gigabytes in size. The de facto standard in most labs (and large institutions
such as, e.g., the NCBI) is to compress such files with the gzip all-purpose file compressor, which
usually leads to a factor four reduction in size1 , or 15GB in our human sequencing example. A
gzip’d large read set is thus, alas, still relatively large.
   With the rapid growth in, and the need to store, short read data sets, specialized compressors
that exploit properties inherent to such data sets will become paramount. Several read set
specific compressors have now been developed (see, e.g., [2, 3, 4, 5]). None are yet in wide
use. However, to our knowledge, no careful analysis of the compressibility of short read data
setsβ€”even in an idealized setting such as that described aboveβ€”has been undertaken. This
article addresses that need. In particular, we derive a non-trivial upper bound on the size of the
LZ parsing [6, 7] of 𝑆, the concatenation of sampled substrings, in terms of 𝑛, π‘š, 𝑑 and the size
of the LZ parsing of 𝑋. We also show a different upper bound that holds regardless of the order
in which the samples are concatenated to form 𝑆.
   In the next section we set notation and basic concepts and results used throughout. In
Sections 3 and 4 we establish the above mentioned bounds, and in Section 5 we gauge the
tightness of these bounds experimentally. We then sketch several directions for future work.


2. Preliminaries
Throughout we consider a string 𝑋 = 𝑋[1..𝑛] = 𝑋[1]𝑋[2] . . . 𝑋[𝑛] of |𝑋| = 𝑛 symbols
drawn from an ordered alphabet Ξ£ of size |Ξ£| = 𝜎. For 𝑖 = 1, . . . , 𝑛, we write 𝑋[𝑖..𝑛] to denote
the suffix of 𝑋 of length 𝑛 βˆ’ 𝑖 + 1, that is, 𝑋[𝑖..𝑛] = 𝑋[𝑖]𝑋[𝑖 + 1] . . . 𝑋[𝑛]. We will often
refer to suffix 𝑋[𝑖..𝑛] simply as β€œsuffix 𝑖” or the β€œπ‘–th suffix”. We write 𝑋[1..𝑖] = 𝑋[1] . . . 𝑋[𝑖]
to denote the prefix of 𝑋 of length 𝑖. We write 𝑋[𝑖..𝑗] to represent the substring (or factor)
𝑋[𝑖]𝑋[𝑖 + 1] . . . 𝑋[𝑗] of 𝑋 that starts at position 𝑖 and ends at position 𝑗. If 𝑖 > 𝑗, then
𝑋[𝑖..𝑗] = πœ–, where πœ– is the empty string. Let lcp(𝑖, 𝑗) denote the length of the longest common
prefix of suffix 𝑖 and suffix 𝑗.
   For example, in the string 𝑋 = zzzzzipzip, lcp(2, 5) = 1 = |z|, and lcp(5, 8) = 3 = |zip|. For
technical reasons we define lcp(𝑖, 0) = lcp(0, 𝑖) = 0 for all 𝑖.
Definition 1 (LZ Factorization). The LZ factorization of 𝑋 is a factorization 𝑋 = 𝑓1 𝑓2 . . . 𝑓𝑧𝑋
of 𝑋 into 𝑧𝑋 phrases such that each phrase 𝑓𝑖 (a substring of 𝑋) is either
       1. a letter that does not occur in 𝑓1 Β· Β· Β· π‘“π‘–βˆ’1 , or
1
    This can be loosely interpreted as gzip, a sliding-window dictionary compressor with window size too small to
    capture any large dispersed repeated substrings present in the file, essentially reducing the space used by each
    DNA letter from the 8 bits used in the plain ASCII encoding to the 2 bits that a flat minimal binary code for four
    letters would use.
   2. the longest substring that occurs at least twice in 𝑓1 Β· Β· Β· 𝑓𝑖 .

   Thus, the LZ factorization of our example string 𝑋 = zzzzzipzip produces the following
𝑧𝑋 = 5 factors, 𝑓1 = z, 𝑓2 = zzzz, 𝑓3 = i, 𝑓4 = p, 𝑓5 = zip.
   We denote with ℓ𝑖 = βˆ‘οΈ€|𝑓𝑖 | the length of phrase 𝑓𝑖 , and with 𝑠𝑖 the starting position of phrase
𝑓𝑖 in 𝑋, i.e., 𝑠𝑖 = 1 + 𝑗<𝑖 ℓ𝑗 . Finally, we denote with 𝑝𝑖 the position of the first occurrence of
substring 𝑓𝑖 in 𝑋 and we call substring 𝑋[𝑝𝑖 ..𝑝𝑖 + ℓ𝑖 βˆ’ 1] the source of 𝑓𝑖 . Note that phrases
and sources are allowed to overlap, as is the case with 𝑓2 in our example.
   The following is a known upper bound on the number of phrases in the parsing.

Theorem 1 ([8], Theorem 5.20). Let 𝑋 be a string of length 𝑛 over Ξ£, with |Ξ£| = 𝜎. Then
𝑧𝑋 ≀ 𝑍, where
                                    𝑛 βˆ’ (𝜎/(𝜎 βˆ’ 1))
                      𝑍=                                       .
                           log𝜎 𝑛 βˆ’ log𝜎 log𝜎 𝑛 βˆ’ (1/(𝜎 βˆ’ 1))
   We will return to the above bound later in the paper, but for now we note that it is asymp-
totically tight. In particular, an appropriate prefix of the de Bruijn sequence contains at least
𝑛/⌈log𝜎 π‘›βŒ‰ LZ phrases [8].
   We close this section with another well-known property of the LZ parsing we make use of in
the proof of our main theorem.

Fact 1. Let 𝑇 be a string, and 1 < 𝑖 ≀ 𝑗 ≀ |𝑇 |. If 𝑑 = 𝑇 [𝑖..𝑗] has an occurrence before 𝑖, i.e., if
there exists an 𝑖′ < 𝑖 s.t. 𝑇 [𝑖′ ..𝑖′ + |𝑑| βˆ’ 1] = 𝑑, then the interval [𝑖, 𝑗] can contain at most one
starting position of a phrase.


3. Upper Bound on the Number of LZ Phrases of the
   Concatenation of Samples
In this section, we show an upper bound on the number of LZ phrases of string 𝑆 formed by
concatenating samples from string 𝑋.
   Formally, given a string 𝑋 of length 𝑛 and two integer parameters π‘š, 𝑑, such that π‘š β‰₯ 2𝑑,
let 𝑆 = 𝑆1 𝑆2 Β· Β· Β· π‘†π‘Ÿ , where π‘Ÿ = ⌊(𝑛 βˆ’ π‘š)/π‘‘βŒ‹ + 1, and, for 𝑖 ∈ [1..π‘Ÿ], 𝑆𝑖 = 𝑋[(𝑖 βˆ’ 1) Β· 𝑑 +
1..(𝑖 βˆ’ 1) Β· 𝑑 + π‘š].
   Let us write, for 𝑖 β‰₯ 1, 𝑆𝑖 = 𝑣𝑖 π‘₯𝑖 , where |𝑣𝑖 | = π‘š βˆ’ 𝑑 and |π‘₯𝑖 | = 𝑑. Then 𝑋 and 𝑆 can be
written as follows, where |𝑒| < 𝑑:

                                   𝑋 = 𝑣1 π‘₯1 π‘₯2 Β· Β· Β· π‘₯π‘Ÿ 𝑒,                                            (1)
                                   𝑆 = 𝑣1 π‘₯1 𝑣2 π‘₯2 𝑣3 π‘₯3 Β· Β· Β· π‘£π‘Ÿ π‘₯π‘Ÿ .                                 (2)

  We will now count the number of phrases 𝑧𝑆 of the LZ parsing of string 𝑆. Note that this is
equivalent to counting the number of starting positions of phrases.

Lemma 1. Let 𝑏𝑖 and 𝑒𝑖 denote the beginning resp. ending position of the 𝑣𝑖 ’s in the factorization
of 𝑆 given in (2). Then, for 𝑖 > 1, the interval [𝑏𝑖 , 𝑒𝑖 ] can contain at most one starting position of a
phrase.
Proof. Since two consecutive samples 𝑆𝑖 and 𝑆𝑖+1 overlap by π‘š βˆ’ 𝑑 characters, it follows that,
for all 𝑖 > 1, 𝑆𝑖 = 𝑦𝑖 𝑣𝑖+1 , where 𝑦𝑖 is the 𝑑-length prefix of 𝑆𝑖 . Therefore, 𝑆 contains the square
𝑣𝑖 𝑣𝑖 for every 𝑖 > 1, and 𝑏𝑖 is the start of the second occurrence of 𝑣𝑖 in this square. By Fact 1,
therefore, [𝑏𝑖 , 𝑒𝑖 ] can contain at most one starting position.

  Next we will count the number of starting positions in the π‘₯𝑖 ’s. For this, we consider phrases
𝑓 of the LZ parsing of 𝑋, and count how many starting positions these induce in 𝑆 in the worst
case. We first need the definition of the projection of a substring of 𝑋 on 𝑆 (see also Figure 1).

Definition 2. Let 1 ≀ 𝑏 ≀ 𝑒 ≀ 𝑛. We define the projection of substring 𝑋[𝑏, 𝑒] on 𝑆 as a
collection of substrings in 𝑆 as follows: For π‘₯πœ„ , let π‘πœ„ and π‘’πœ„ denote the starting resp. ending positions
of π‘₯πœ„ in 𝑆 w.r.t. the factorization given in (1). Then
                         ⎧ ⋃︀
                         βŽͺ
                         βŽͺ
                         βŽͺ  ( π‘—πœ„=𝑖 [πœ„π‘š βˆ’ 𝑑, πœ„π‘š]) βˆͺ [(𝑖 βˆ’ 2) Β· (π‘š βˆ’ 𝑑) + 𝑏, (𝑖 βˆ’ 1)π‘š]βˆͺ
                         βŽͺ
                         ⎨[π‘šπ‘— + π‘š βˆ’ 𝑑 + 1, 𝑗(π‘š βˆ’ 𝑑) + 𝑒]                 if π‘š βˆ’ 𝑑 < 𝑏
      𝑃 π‘Ÿπ‘œπ‘—π‘† ([𝑏, 𝑒]) =
                         βŽͺ[𝑏, 𝑒]
                         βŽͺ
                         βŽͺ                              if 𝑏, 𝑒 ≀ π‘š βˆ’ 𝑑
                         βŽͺ
                         ⎩[𝑏, π‘š βˆ’ 𝑑] βˆͺ 𝑃 π‘Ÿπ‘œπ‘— [π‘š βˆ’ 𝑑 + 1, 𝑒] if 𝑏 ≀ π‘š βˆ’ 𝑑 < 𝑒.
                                               𝑆

   Informally, a projection in 𝑆 of a substring π‘ˆ of 𝑋 is the collections of substrings of 𝑆
coinciding with regions of π‘ˆ , namely substrings of 𝑆 which π‘ˆ consists of, w.r.t. the factorization
given in (1). The first case (see Fig. 1) gives the general situation, when π‘ˆ starts after the prefix
𝑣1 of 𝑋, which is covered only by the first sample 𝑆1 . The second case is when π‘ˆ lies fully
within 𝑣1 , while the third case is when π‘ˆ starts within 𝑣1 and ends after it.
                              𝑋[𝑏, 𝑒] =
𝑋 =            𝑣1       π‘₯1    ...     π‘₯π‘–βˆ’1 π‘₯𝑖      ...     π‘₯πœ„        ...     π‘₯𝑗 π‘₯𝑗+1        ...    π‘₯π‘Ÿ


               𝑃 π‘Ÿπ‘œπ‘—π‘† ([𝑏, 𝑒]) =
𝑆 =      ...        π‘£π‘–βˆ’1       π‘₯π‘–βˆ’1   𝑣𝑖        π‘₯𝑖 . . .        π‘£πœ„         π‘₯πœ„ . . .    𝑣𝑗         π‘₯𝑗    𝑣𝑗+1   π‘₯𝑗+1 . . .
Figure 1: The projection of the substring 𝑋[𝑏, 𝑒] on the string 𝑆 is shown. The number of substrings of
𝑋 contained in the collection of substrings produced by the projection is the number of π‘₯𝑖 ’s intersected
by 𝑋[𝑏, 𝑒] in 𝑋, and they are all contained in the corresponding π‘₯𝑖 ’s in 𝑆.



Definition 3. Let 𝑓 be a phrase of the LZ parsing of 𝑋, with starting position 𝑠 and length β„“.
Define 𝑔(𝑓 ) as the number of starting positions in 𝑃 π‘Ÿπ‘œπ‘—π‘† ([𝑠, 𝑠 + β„“ βˆ’ 1]).

Lemma 2. Let π‘š β‰₯ 2𝑑. Let 𝑓 be a phrase of the LZ factorization of 𝑋, with starting position
𝑠 > π‘š βˆ’ 𝑑. Then 𝑔(𝑓 ) ≀ |𝑓𝑑| + 2.

Proof. Let β„“ = |𝑓 |, π‘˜ = ⌈ 𝑑ℓ βŒ‰ and 𝑓 = 𝑋[𝑠..𝑠 + β„“ βˆ’ 1] = π‘₯β€² π‘₯𝑖 π‘₯𝑖+1 Β· Β· Β· π‘₯𝑗 π‘₯β€²β€² , where π‘₯β€² and π‘₯β€²β€² are
a proper suffix of π‘₯π‘–βˆ’1 respectively a proper prefix of π‘₯𝑗+1 , both possibly empty. We will show
that each of these substrings is charged with at most one phrase starting position. From Fact 1,
the claim follows.
    By construction of 𝑆, each of the substrings π‘₯β€² , π‘₯𝑖 , π‘₯𝑖+1 , . . . , π‘₯𝑗 , π‘₯β€²β€² will appear contiguously
in 𝑆. The number of these substrings is either π‘˜ or π‘˜ + 1. Additionally, a 𝑣 substring separates
the projections in 𝑆 of each pair of contiguous aforementioned substrings of 𝑋.
    Consider now the source 𝑓 β€² of 𝑓 occurring in 𝑋 in some position 𝑠′ < 𝑠. Then 𝑋[𝑠..𝑠+β„“βˆ’1] =
𝑋[𝑠′ ..𝑠′ + β„“ βˆ’ 1] = 𝑒′ π‘₯𝑖′ π‘₯𝑖′ +1 Β· Β· Β· π‘₯𝑗 β€² 𝑒′′ , where 𝑒′ and 𝑒′′ are a proper suffix of π‘₯𝑖′ βˆ’1 respectively
a proper prefix of π‘₯𝑗 β€² +1 , both possibly empty. As for 𝑓 , the projection of 𝑓 β€² is also split in 𝑆 in
such a way that the projection of each pair of mentioned substrings of 𝑓 β€² is separated by a 𝑣
substring in 𝑆.
    Notice that, since π‘š β‰₯ 2𝑑, each π‘₯𝑏 in 𝑋 has an occurrence in 𝑆 also as suffix of 𝑣𝑏′ +1 ,
occurring immediately after π‘₯𝑏+1 in 𝑆. This means that, even if the projections of 𝑓 and 𝑓 β€² in 𝑆
will be split asynchronously, each of the substrings π‘₯β€² , π‘₯𝑖 , π‘₯𝑖+1 , Β· Β· Β· , π‘₯𝑗 , π‘₯β€²β€² in the projection of
𝑓 has already occurred as the concatenation of a suffix of some 𝑣𝑏′ intersecting the projection
of 𝑓 β€² and the contiguous π‘₯𝑏′ +1 .
    There are at most π‘˜ + 1 factors π‘₯β€² , π‘₯𝑖 , π‘₯𝑖+1 , . . . , π‘₯𝑗 , π‘₯β€²β€² , and each of them has a previous
occurrence in 𝑆. Therefore, by Fact 1, there will be at most π‘˜ + 1 phrase starting positions for
𝑓 . See Fig. 2 for an illustration.



                        𝑓 β€² = 𝑋[𝑏, 𝑒]                             𝑓

𝑋=           𝑣1        π‘₯1     . . . π‘₯π‘—βˆ’1 π‘₯𝑗 π‘₯𝑗+1 π‘₯𝑗+2 . . . π‘₯π‘–βˆ’1 π‘₯𝑖 π‘₯𝑖+1 π‘₯𝑖+2                 ...   π‘₯π‘Ÿ 𝑒


                        𝑃 π‘Ÿπ‘œπ‘—π‘† ([𝑏, 𝑒])
                                                      π‘₯π‘—βˆ’1
𝑆=           𝑣1        π‘₯1     ...     π‘₯π‘—βˆ’1       𝑣𝑗          π‘₯𝑗    𝑣𝑗+1     π‘₯𝑗+1       𝑣𝑗+2     π‘₯𝑗+2 . . .

                                             |𝑆𝑗 | β‰₯ 2|π‘₯𝑗 |
Figure 2: The original string 𝑋 on top, and the concatenation 𝑆 of the π‘Ÿ samples of 𝑋 below are shown.
In particular, a phrase 𝑓 and its corresponding source 𝑓 β€² in 𝑋 are represented with distinct colors for
each of the π‘₯𝑖 segments of 𝑋 intersected by the phrase. Finally, the projection 𝑃 π‘Ÿπ‘œπ‘—π‘† (𝑓 β€² ) of the source
is shown. It is clear from the figure that, in 𝑆 β€² , the samples intersecting some string in the projection of
𝑓 β€² fully contain at least one substring of the projection of 𝑓 .


Lemma 3. With respect to the factorization of 𝑆 given in (2), the number of phrase starting
positions in π‘₯𝑖 ’s is at most 𝑛𝑑 + 2𝑧𝑋 .
Proof. The sum of the lengths of all phrases 𝑓1 , . . . , 𝑓𝑧𝑋 in the LZ factorization of 𝑋 is the
length 𝑛 of the string 𝑋. Therefore, we can bound the total contribution to 𝑧𝑆 of 𝑋 as follows:
                                𝑧𝑋             𝑧𝑋
                                βˆ‘οΈ             βˆ‘οΈ   |𝑓𝑗 |       𝑛
                                      𝑔(𝑓𝑗 ) ≀    (       + 2) = + 2𝑧𝑋 .
                                                     𝑑          𝑑
                                𝑗=1              𝑗=1
Theorem 2. Let 𝑧𝑋 be the number of phrases of the LZ parsing of 𝑋, and 𝑧𝑆 the number of
phrases of the LZ parsing of 𝑆. Then 𝑧𝑆 ≀ 2π‘›βˆ’π‘š
                                            𝑑  + 2𝑧𝑋 + π‘š βˆ’ 𝑑.
Proof. We show the maximum number of phrase starting positions in 𝑆 summing up the
contribution of the (π‘š βˆ’ 𝑑)-length prefix of 𝑆, and of the remaining π‘₯𝑖 ’s (Lemma 3) and 𝑣𝑖 ’s
(Lemma 1) substrings.


        𝑧𝑆 = number of starting positions in 𝑣1 + number of starting positions in π‘₯𝑖 ’s
              + number of starting positions in 𝑣𝑖 ’s, for 𝑖 β‰₯ 2
                        𝑛            π‘›βˆ’π‘š        2𝑛 βˆ’ π‘š
            ≀ (π‘š βˆ’ 𝑑) + + 2𝑧𝑋 +              =              + 2𝑧𝑋 + π‘š βˆ’ 𝑑.
                        𝑑               𝑑            𝑑


   Theorem 2 essentially says that the number of LZ phrases of 𝑆 is at most twice the number
of samples plus twice the number of LZ phrases of 𝑋. Note that the important parameter which
determines the number of samples is 𝑑, while π‘š influences only the number of samples in
which a given position occurs (i.e. π‘š/𝑑 is the so-called coverage).


4. Upper Bound for Arbitrary Concatenation Order
This section examines the effect that the ordering of the samples in the concatenation has on
the number of phrases.
  As before, we are given a string 𝑋 of length 𝑛 and two integer parameters π‘š and 𝑑, such that
π‘š β‰₯ 2𝑑. Let π‘Ÿ = ⌊(𝑛 βˆ’ π‘š)/π‘‘βŒ‹ + 1, and, for 𝑖 ∈ [1..π‘Ÿ], 𝑆𝑖 = 𝑋[(𝑖 βˆ’ 1) Β· 𝑑 + 1..(𝑖 βˆ’ 1) Β· 𝑑 + π‘š].
  We will show that, regardless of the order in which the samples 𝑆𝑖 of 𝑋 are concatenated,
the number of phrases in the LZ factorization of that concatenation is at most 𝑛 + 2(π‘›βˆ’π‘š)
                                                                                      𝑑    . We
make this argument precise below.
Theorem 3. Let 𝑆 β€² be the concatenation of the samples of 𝑋 in any order, then the number of
phrases is 𝑧𝑆 β€² ≀ 𝑛 + 2(π‘›βˆ’π‘š)
                         𝑑   .
Proof. Fix 𝑖. We will show that the number of phrase starting positions in sample 𝑆𝑖 where it
appears in 𝑆 β€² is at most 2 + |𝑒𝑖 |, where 𝑒𝑖 is a specific substring of 𝑆𝑖 . Let π‘˜ = max{𝑖′ < 𝑖 |
𝑆𝑖′ appears before 𝑆𝑖 in 𝑆 β€² }, and let 𝑀𝑖 be the longest suffix of π‘†π‘˜ which is also a prefix of 𝑆𝑖
(i.e., 𝑀𝑖 is the maximum overlap). Note that 𝑀𝑖 may be empty. Similarly, let π‘˜ β€² = min{𝑖′ > 𝑖 |
𝑆𝑖′ appears before 𝑆𝑖 in 𝑆 β€² }, and let 𝑀𝑖′ be the longest prefix of π‘†π‘˜β€² which is also a suffix of 𝑆𝑖 ,
again possibly empty. If |𝑀𝑖 | + |𝑀𝑖′ | β‰₯ π‘š = |𝑆𝑖 |, then set 𝑒𝑖 = πœ–. Otherwise, 𝑆𝑖 can be written
as 𝑆𝑖 = 𝑀𝑖 𝑒𝑖 𝑀𝑖′ , for some 𝑒𝑖 .
    This 𝑒𝑖 is a substring of 𝑋 that has not so far been covered by any sample in 𝑆 β€² , and this is
because of the definition of π‘˜ and π‘˜ β€² . We call 𝑒𝑖 the new part of 𝑆𝑖 ; in particular, if 𝑒𝑖 = πœ–, then
the new part of 𝑆𝑖 is empty.
    By Fact 1, if 𝑀𝑖 is non-empty, then at most one phrase starting position is contained within
𝑀𝑖 . Similarly, if 𝑀𝑖′ is non-empty, at most one starting position is contained within 𝑀𝑖′ . The
number of starting positions within the new part 𝑒𝑖 can be trivially upper bounded by |𝑒𝑖 |.
𝑋 =                                                           π‘₯π‘–βˆ’π‘‘   π‘₯π‘˜                     π‘₯𝑖         π‘₯π‘˜ β€²          π‘₯𝑖+𝑑 ...

                π‘†π‘–βˆ’π‘‘ =                   ...
                          π‘†π‘˜ =
                                                             ...
                                                      𝑆𝑖 =
                                                                          ...
                                                                     π‘†π‘˜β€² =
                                                                                      ...
                                                                                  𝑆𝑖+𝑑 =

                         π‘†π‘˜ β€²                                         π‘†π‘˜                                      𝑆𝑖
 β€²        ...                                          ...                                       ...            𝑒𝑖              ...
𝑆 =
Figure 3: The contribution of the samples of 𝑋 to 𝑧𝑆 β€² , where 𝑆 β€² is the concatenation of the samples in
an arbitrary order. The colored substrings are the prefixes and suffixes with multiple occurrences in 𝑆 β€²
of some sample. Each sample consists of a prefix and a suffix (possibly empty) that has already occurred
in 𝑆 β€² , and a substring in between them (i.e., labelled 𝑒𝑖 , in white) that is not assumed to necessarily
occur previously in 𝑆 β€² .


     Summing over all samples 𝑆𝑖 , we thus get
                                                π‘Ÿ
                                               βˆ‘οΈ                           π‘Ÿ
                                                                           βˆ‘οΈ
                                𝑧   𝑆′   ≀           (2 + |𝑒𝑖 |) = 2π‘Ÿ +           |𝑒𝑖 | = 2π‘Ÿ + 𝑛,
                                               𝑖=1                          𝑖=1

with the last equality using the fact that every position of 𝑋 occurs exactly once in some 𝑒𝑖 .


  Theorem 3 essentially says that the number of LZ phrases of 𝑆 β€² , i.e. of the concatenation of
the samples in an arbitrary order, is at most the length of 𝑋 plus twice the number of samples.


5. Experimental Results
In order to gauge the tightness of our bounds, we computed the number of phrases in the
conventional LZ factorization (defined in Section 2) of the concatenation of the samples taken
from each of the texts in Table 1. We did this for a range of 𝑑 and π‘š parameters, and concatenated
the samples both in string order and after a random shuffling.

5.1. Data
Our test data, which contains files of varying repetitiveness is shown in Table 1.
  By nature, viruses contain very little recurrent genetic heritage, and so a viral genome
represents a real-world non-repetitive string. We used a genome of SARS-CoV2 taken from
the COVID-19 Data Portal [9] of length 29 836, that is factorized in 4 373 LZ phrases. We also
performed experiments on extreme cases between which real-life genomes lie: Fibonacci words
and random words.
     Data              Description                                         𝑛       𝑧     𝑛/𝑧
     SARS-CoV2         Taken from the COVID-19 Data Portal [9]        29 835    4373      6.8
                         Concatenation of 50 virus genomes
     50 SARS-CoV2                                             [9]   1 490 134   5421     275
                       taken from the COVID-19 Data Portal
     Fibonacci word         Fibonacci word of order 22                28 657      22   1302.6
                       Word over an alphabet of size 4 built with
     Random                                                           29 835    4575      6.5
                         python function random.choices()

Table 1
Data used in the experiments. The table shows the length (𝑛) and the number of phrases (𝑧) of each
text used for the experiments.


   Let 𝑠0 = b, 𝑠1 = a, the Fibonacci word of order 𝑖 + 1 is the binary word 𝑠𝑖+1 = 𝑠𝑖 π‘ π‘–βˆ’1 . In
other words, the Fibonacci word of order 𝑖 + 1 is the concatenation of the Fibonacci words of
the two previous orders.
   By construction, Fibonacci words have a very high degree of repetition, resulting in very few
phrases with respect to their length. Random words, on the other hand, could be considered,
instead, not repetitive at all. To perform a comparison with the virus genome, we chose a
random word of the same length of the genome, and the Fibonacci word of length 28 657, the
closest to the length of the genome. The strings are factorized in 4 575 respectively 22 phrases.
   We also performed experiments with larger data. In particular, we counted the number of
phrases of the concatenation of 50 SARS-CoV2 genomes, which, due to the high similarity of
the individual genomes, constitutes a highly repetitive dataset.

5.2. Experiments
We are interested in simulating the coverage provided by the most used technologies in genome
sequencing. The coverage is given by the number of samples that cover a given position in the
text. In the context of genome sequencing, a reasonable coverage is given by large π‘š and small
𝑑, in order to have the coverage as large as possible with reasonable sample lengths.
   We perform the experiments on each of the aforementioned texts with 𝑑 =
1, 2, 4, 8, 16, 32, 64, and π‘š from 50 to 1 000, in increments of 10. We counted the number
of phrases for π‘š > 𝑑.
   Figure 4, on top, displays the number of phrases for concatenations of samples in string
order of a Fibonacci word (top-left corner), and of a random string of a similar length (top-right
corner). Below, we show the counts for an arbitrary ordered concatenation of the samples
of the same data. In all figures, the number of phrases for some selected π‘šβ€™s are shown in
distinct colours, in gray for all other π‘šβ€™s. The corresponding coloured line shows our bounds
for each π‘š and 𝑑 pair. Finally, the colored rounded points show the existing bound given by
KΓ€rkkΓ€inen [8], see Theorem 1.
   The analogous comparison is shown in Figure 5 between a single SARS-CoV2 genome and
the concatenation of 50 genomes.
   The overlapping of the bound lines in all plots reflects the small impact of the sampling length
π‘š in the bound for fixed 𝑑.
   When the samples are concatenated in string order (on the top of both Figure 4 and 5), the
number of phrases in the original string has a big influence on the bound. We can see that the
bound lines in the Fibonacci word’s plot are closer to the actual counts than in the other plots,
where the number of phrases of the original strings is higher. On the other hand, because of the
different nature of the bound for the samples in string order versus random order, we can see
that the latter bound has a similar trend for the three strings of similar length. In this case, the
length and the number of samples of the original string determine the bound rather than its
number of phrases.
   Comparing the bound given in this paper (lines) to the one reported in Theorem [8] by
KΓ€rkkΓ€inen (rounded points), when repetitive strings are considered, our bound is not worse
than the existing one for the string order concatenation, while it gets more loose with the
concatenation in arbitrary order. See the plots for Fibonacci words and the concatenation of 50
genomes of SARS-CoV-2. On the other hand, for non repetitive strings, namely random strings
and the single viral genome in the plots, the existing bound is often better than ours.

  Data           𝑑=1       𝑑=2        𝑑=4         𝑑=8        𝑑 = 16       𝑑 = 32          𝑑 = 64
  Fibonacci
  max π‘š           1000       980        990         970         960          960             940
  phrases         1336     1321       1308         1021        1007          910             603
  bound          57 357   29 189     15 111        8049      4510.1       2733.1          1800.8
   Fibonacci
    shuffled
  max π‘š           1000       990        990         990          630         970            980
  phrases        25 164   13 179       6802        3505         1802         942            521
  bound          83 971   56 324   42 490.5   35 573.75     32 160.4    30 387.4   29 521.906 25
  SARS-CoV2
  max π‘š              80      150         50          70           50          90             210
  phrases        44 438   27 823     14 880        8945         6759        5575            4973
  bound          68 417   38 655   23 697.5   16 258.25    12 506.38   10 665.93         9821.09
  SARS-CoV2
    shuffled
  max π‘š             200      260        350         540          780         680             760
  phrases        45 725   27 898     17 538      11 620        7969         6142            5238
  bound          89 108   59 412     44 574      37 185       33 468   31 658.25       30 744.63

Table 2
A summary of relevant results for the Fibonacci word and the single SARS-CoV2 genome. For each
sampling frequency 𝑑, we report the value of π‘š for which the maximum number of phrases in the
concatenation of the samples is produced in both string order and after the random shuffling. We
further show the counts and the value of our bound for the mentioned π‘š.

   In Table 2 we show, for fixed 𝑑, the maximum number of phrases in 𝑆, varying π‘š. For
strings 𝑋 that are already very repetitive (Fibonacci word), the longer the string, the higher the
number of phrases. On the other hand, for strings that are not that repetitive (single SARS-CoV2
genome), introducing long repetitions may decrease the number of phrase starting positions
in contrast to having a shorter string with very short repetitions. The concatenation of 50
SARS-CoV2 genomes lies in the middle.

         (a) Fibonacci word of length 28 657              (b) Random word of length 29 835




Figure 4: The number of phrases in log scale for concatenation of samples of a Fibonacci word (left
column) and a random word (right column), both in string order (on top) and in an arbitrary chosen
order (on bottom). The counts are shown for each concatenation of π‘š-length samples at every 𝑑
position. Colored markers indicate the number of phrases for π‘š = 50, 100, 200, 400, 1000, while we
use grey points to mark all others π‘šβ€™s. The bounds shown in Theorem 2 and 3 for the concatenation
in string order respectively random order are shown with colored and shaped lines accordingly to the
corresponding π‘š, 𝑑 pair, while the coloured rounded points indicate the bound given by KΓ€rkkΓ€inen [8],
see Theorem 1.



6. Concluding Remarks
This paper has explored the compressibility of collections of substrings sampled from a string.
In particular, given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛 with an LZ parsing size of 𝑧𝑋 , and integers
π‘š and 𝑑, such that 𝑛 > π‘š β‰₯ 2𝑑 > 0, we have shown that the size of the LZ parsing of the string
             (a) SARS-CoV2 genome of             (b) Concatenation of 50 SARS-CoV2 genome, of total
                   length 29 835                     length 1 490 134




Figure 5: The number of phrases in log scale for concatenation of samples of a single SARS-CoV2 genome
(left column) and of a concatenation of 50 SARS-CoV2 genomes (right column), both in string order
(on top) and in any arbitrary chosen order (on bottom). The counts are shown for each concatenation
of π‘š-length samples at every 𝑑 position. Colored markers indicate the number of phrases for π‘š =
50, 100, 200, 400, 1000, while we use grey points to mark all others π‘šβ€™s. The bounds shown in Theorem 2
and 3 for the concatenation in string order respectively random order are shown with colored and
shaped lines accordingly to the corresponding π‘š, 𝑑 pair, while the coloured rounded points indicate the
bound given by KΓ€rkkΓ€inen [8], see Theorem 1.


𝑆 formed by concatenating the substrings of 𝑋 of length π‘š starting at positions 𝑖 ≑ 1 (mod 𝑑),
is bounded by (2𝑛 βˆ’ π‘š)/𝑑 + 2𝑧 + (π‘š βˆ’ 𝑑). We also proved a bound for the case of any arbitrary
chosen order of the samples in the concatenation.
   There are several avenues future work could take. The most immediate perhaps is the question
of whether the bounds given in Theorem 2 and Theorem 3 can be improved. Another direction
is to derive bounds for other compression measures, such as the size of the smallest context-free
grammar for 𝑆 [10], the number of runs in its Burrows-Wheeler transform [11], the size of
its smallest string attractor [12], or other measures of compressibility discussed in Navarro’s
recent survey [13].
   Along the lines of our original motivation from DNA sequencing, how are the bounds affected
by the introduction of some probability of error to the sample substrings, such as those errors
introduced in collections of strings produced by short-read sequencing technologies?
   Finally, note that we have assumed 𝑑 ≀ π‘š/2, which corresponds to the interesting case in
practice. However, it may be interesting from a theoretical perspective to also examine the case
where 𝑑 > π‘š/2.


References
 [1] R. Ekblom, L. Smeds, H. Ellegren, Patterns of sequencing coverage bias revealed by
     ultra-deep sequencing of vertebrate mitochondria, BMC Genomics 15 (2014) 467.
 [2] S. Al Yami, C.-H. Huang, LFastqC: A lossless non-reference-based FASTQ compressor,
     PLoS One 14 (2019) e0224806.
 [3] S. Chandak, K. Tatwawadi, I. Ochoa, M. Hernaez, T. Weissman, SPRING: a next-generation
     compressor for FASTQ data, Bioinformatics 35 (2019) 2674–2676.
 [4] S. Deorowicz, FQSqueezer: k-mer-based compression of sequencing data, Scientific
     Reports 10 (2020) 1–9.
 [5] C. Hoobin, T. Kind, C. Boucher, S. J. Puglisi, Fast and efficient compression of high-
     throughput sequencing reads, in: Proceedings of the 6th ACM Conference on Bioinfor-
     matics, Computational Biology and Health Informatics, 2015, pp. 325–334.
 [6] A. Lempel, J. Ziv, On the complexity of finite sequences, IEEE Trans. Inf. Theory 22 (1976)
     75–81.
 [7] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inf.
     Theory 23 (1977) 337–343.
 [8] J. KΓ€rkkΓ€inen, Repetition-Based Text Indexes, Ph.D. thesis, University of Helsinki, Faculty
     of Science, Department of Computer Science, 1999.
 [9] P. W. Harrison, R. Lopez, N. Rahman, S. G. Allen, R. Aslam, N. Buso, C. Cummins, Y. Fathy,
     E. Felix, et al., The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19
     research through rapid open access data sharing, Nucleic Acids Research 49 (2021) W619–
     W623.
[10] M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, A. Shelat, The
     smallest grammar problem, IEEE Trans. Inf. Theory 51 (2005) 2554–2576.
[11] G. Manzini, An analysis of the Burrows-Wheeler transform, J. ACM 48 (2001) 407–430.
[12] D. Kempa, N. Prezza, At the roots of dictionary compression: string attractors, in: Proc.
     50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), ACM, 2018, pp.
     827–840.
[13] G. Navarro, Indexing highly repetitive string collections, part I: repetitiveness measures,
     ACM Comput. Surv. 54 (2021) 29:1–29:31.