<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On Compressing Collections of Substring Samples</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Golnaz</forename><surname>Badkobeh</surname></persName>
							<email>g.badkobeh@gold.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Goldsmiths</orgName>
								<orgName type="institution">University of London</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sara</forename><surname>Giuliani</surname></persName>
							<email>sara.giuliani_01@univr.it</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Verona</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zsuzsanna</forename><surname>Lipták</surname></persName>
							<email>zsuzsanna.liptak@univr.it</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Verona</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simon</forename><forename type="middle">J</forename><surname>Puglisi</surname></persName>
							<email>simon.puglisi@helsinki.fi</email>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">University of Helsinki</orgName>
								<address>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On Compressing Collections of Substring Samples</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">990F9E8B51AAEB1066E51A2620CFDFF6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:30+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Lempel-Ziv</term>
					<term>LZ77</term>
					<term>data compression</term>
					<term>strings</term>
					<term>repetitiveness</term>
					<term>combinatorics on words</term>
					<term>parsing</term>
					<term>short reads</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers 𝑚 and 𝑑, such that 𝑛 &gt; 𝑚 ≥ 2𝑑 &gt; 0, we consider the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑) on the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We consider the problem of compressing a set of substrings sampled from a string. In particular, given a string 𝑋 of length 𝑛 and two integer parameters 𝑑 and 𝑚, 𝑚 ≥ 2𝑑, we call a sample of 𝑋 a substring of 𝑋 of length 𝑚 starting at any position 1 + 𝑖 • 𝑑 in 𝑋, where 𝑖 ≥ 0. We refer to the samples as 𝑆 𝑖 = 𝑋[(𝑖 − 1) • 𝑑 + 1..(𝑖 − 1) • 𝑑 + 𝑚], for 𝑖 ≥ 1, and to the concatenation as 𝑆 = 𝑆 1 𝑆 2 • • • 𝑆 𝑟 , where 𝑟 = ⌊(𝑛 − 𝑚)/𝑑⌋ + 1. Note that if 𝑛−𝑚 𝑑 is not an integer, then the final few characters of 𝑋 will not be part of any sample.</p><p>If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing. In the language of genome sequencing, 𝑚 is the read length and 𝑚/𝑑 is the so-called coverage (the number of samples that cover a given position in 𝑋, on average). Our assumption that 𝑚 ≥ 2𝑑 corresponds to a coverage of at least 2, which is the relevant case for DNA sequencing. 𝑆 represents a file of short read sequences -the typical output of a sequencing experiment. Our sampling process idealizes short read sequencing in at least two ways. Firstly, short read sequencing may produce strings of slightly different length and may introduce errors -insertions, deletions, and substitutions of lettersto the sampled strings, albeit with fairly low probability for present-day short read technology (𝑝 &lt; 0.01 for Illumina short reads, for example). Secondly, in short read sequencing, coverage is not completely uniform, fluctuating across the genome for a variety of reasons (see, e.g. <ref type="bibr" target="#b0">[1]</ref>).</p><p>The extraordinarily wide adoption of high-throughput sequencing in medical and evolutionary biology over the last decade has made short read data sets abundant. These data sets are also very large. For example, a typical human sequencing experiment might run at 20x coverage on a underlying genome of size 𝑛 = 3 • 10 9 nucleotides. The resulting read set in FASTQ format (a standard file format, which stores one ASCII-encoded short read sequence per line-akin to our 𝑆 string) is then 60 gigabytes in size. The de facto standard in most labs (and large institutions such as, e.g., the NCBI) is to compress such files with the gzip all-purpose file compressor, which usually leads to a factor four reduction in size <ref type="foot" target="#foot_0">1</ref> , or 15GB in our human sequencing example. A gzip'd large read set is thus, alas, still relatively large.</p><p>With the rapid growth in, and the need to store, short read data sets, specialized compressors that exploit properties inherent to such data sets will become paramount. Several read set specific compressors have now been developed (see, e.g., <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>). None are yet in wide use. However, to our knowledge, no careful analysis of the compressibility of short read data sets-even in an idealized setting such as that described above-has been undertaken. This article addresses that need. In particular, we derive a non-trivial upper bound on the size of the LZ parsing <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> of 𝑆, the concatenation of sampled substrings, in terms of 𝑛, 𝑚, 𝑑 and the size of the LZ parsing of 𝑋. We also show a different upper bound that holds regardless of the order in which the samples are concatenated to form 𝑆.</p><p>In the next section we set notation and basic concepts and results used throughout. In Sections 3 and 4 we establish the above mentioned bounds, and in Section 5 we gauge the tightness of these bounds experimentally. We then sketch several directions for future work.  Thus, the LZ factorization of our example string 𝑋 = zzzzzipzip produces the following 𝑧 𝑋 = 5 factors, 𝑓 1 = z, 𝑓 2 = zzzz, 𝑓 3 = i, 𝑓 4 = p, 𝑓 5 = zip.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head><p>We denote with ℓ 𝑖 = |𝑓 𝑖 | the length of phrase 𝑓 𝑖 , and with 𝑠 𝑖 the starting position of phrase 𝑓 𝑖 in 𝑋, i.e., 𝑠 𝑖 = 1 + ∑︀ 𝑗&lt;𝑖 ℓ 𝑗 . Finally, we denote with 𝑝 𝑖 the position of the first occurrence of substring 𝑓 𝑖 in 𝑋 and we call substring 𝑋[𝑝 𝑖 ..𝑝 𝑖 + ℓ 𝑖 − 1] the source of 𝑓 𝑖 . Note that phrases and sources are allowed to overlap, as is the case with 𝑓 2 in our example.</p><p>The following is a known upper bound on the number of phrases in the parsing. </p><formula xml:id="formula_0">𝑍 = 𝑛 − (𝜎/(𝜎 − 1)) log 𝜎 𝑛 − log 𝜎 log 𝜎 𝑛 − (1/(𝜎 − 1)</formula><p>) .</p><p>We will return to the above bound later in the paper, but for now we note that it is asymptotically tight. In particular, an appropriate prefix of the de Bruijn sequence contains at least 𝑛/⌈log 𝜎 𝑛⌉ LZ phrases <ref type="bibr" target="#b7">[8]</ref>.</p><p>We close this section with another well-known property of the LZ parsing we make use of in the proof of our main theorem. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Upper Bound on the Number of LZ Phrases of the Concatenation of Samples</head><p>In this section, we show an upper bound on the number of LZ phrases of string 𝑆 formed by concatenating samples from string 𝑋. Formally, given a string 𝑋 of length 𝑛 and two integer parameters 𝑚, 𝑑, such that 𝑚 ≥ 2𝑑,</p><formula xml:id="formula_1">let 𝑆 = 𝑆 1 𝑆 2 • • • 𝑆 𝑟 , where 𝑟 = ⌊(𝑛 − 𝑚)/𝑑⌋ + 1, and, for 𝑖 ∈ [1..𝑟], 𝑆 𝑖 = 𝑋[(𝑖 − 1) • 𝑑 + 1..(𝑖 − 1) • 𝑑 + 𝑚].</formula><p>Let us write, for 𝑖 ≥ 1, 𝑆 𝑖 = 𝑣 𝑖 𝑥 𝑖 , where |𝑣 𝑖 | = 𝑚 − 𝑑 and |𝑥 𝑖 | = 𝑑. Then 𝑋 and 𝑆 can be written as follows, where |𝑢| &lt; 𝑑:</p><formula xml:id="formula_2">𝑋 = 𝑣 1 𝑥 1 𝑥 2 • • • 𝑥 𝑟 𝑢,<label>(1)</label></formula><formula xml:id="formula_3">𝑆 = 𝑣 1 𝑥 1 𝑣 2 𝑥 2 𝑣 3 𝑥 3 • • • 𝑣 𝑟 𝑥 𝑟 .<label>(2)</label></formula><p>We will now count the number of phrases 𝑧 𝑆 of the LZ parsing of string 𝑆. Note that this is equivalent to counting the number of starting positions of phrases. Lemma 1. Let 𝑏 𝑖 and 𝑒 𝑖 denote the beginning resp. ending position of the 𝑣 𝑖 's in the factorization of 𝑆 given in <ref type="bibr" target="#b1">(2)</ref>. Then, for 𝑖 &gt; 1, the interval [𝑏 𝑖 , 𝑒 𝑖 ] can contain at most one starting position of a phrase.</p><p>Proof. Since two consecutive samples 𝑆 𝑖 and 𝑆 𝑖+1 overlap by 𝑚 − 𝑑 characters, it follows that, for all 𝑖 &gt; 1, 𝑆 𝑖 = 𝑦 𝑖 𝑣 𝑖+1 , where 𝑦 𝑖 is the 𝑑-length prefix of 𝑆 𝑖 . Therefore, 𝑆 contains the square 𝑣 𝑖 𝑣 𝑖 for every 𝑖 &gt; 1, and 𝑏 𝑖 is the start of the second occurrence of 𝑣 𝑖 in this square. By Fact 1, therefore, [𝑏 𝑖 , 𝑒 𝑖 ] can contain at most one starting position.</p><p>Next we will count the number of starting positions in the 𝑥 𝑖 's. For this, we consider phrases 𝑓 of the LZ parsing of 𝑋, and count how many starting positions these induce in 𝑆 in the worst case. We first need the definition of the projection of a substring of 𝑋 on 𝑆 (see also Figure <ref type="figure" target="#fig_1">1</ref>). Definition 2. Let 1 ≤ 𝑏 ≤ 𝑒 ≤ 𝑛. We define the projection of substring 𝑋[𝑏, 𝑒] on 𝑆 as a collection of substrings in 𝑆 as follows: For 𝑥 𝜄 , let 𝑏 𝜄 and 𝑒 𝜄 denote the starting resp. ending positions of 𝑥 𝜄 in 𝑆 w.r.t. the factorization given in <ref type="bibr" target="#b0">(1)</ref>.</p><formula xml:id="formula_4">Then 𝑃 𝑟𝑜𝑗 𝑆 ([𝑏, 𝑒]) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩ ( ⋃︀ 𝑗 𝜄=𝑖 [𝜄𝑚 − 𝑑, 𝜄𝑚]) ∪ [(𝑖 − 2) • (𝑚 − 𝑑) + 𝑏, (𝑖 − 1)𝑚]∪ [𝑚𝑗 + 𝑚 − 𝑑 + 1, 𝑗(𝑚 − 𝑑) + 𝑒] if 𝑚 − 𝑑 &lt; 𝑏 [𝑏, 𝑒] if 𝑏, 𝑒 ≤ 𝑚 − 𝑑 [𝑏, 𝑚 − 𝑑] ∪ 𝑃 𝑟𝑜𝑗 𝑆 [𝑚 − 𝑑 + 1, 𝑒] if 𝑏 ≤ 𝑚 − 𝑑 &lt; 𝑒.</formula><p>Informally, a projection in 𝑆 of a substring 𝑈 of 𝑋 is the collections of substrings of 𝑆 coinciding with regions of 𝑈 , namely substrings of 𝑆 which 𝑈 consists of, w.r.t. the factorization given in <ref type="bibr" target="#b0">(1)</ref>. The first case (see Fig. <ref type="figure" target="#fig_1">1</ref>) gives the general situation, when 𝑈 starts after the prefix 𝑣 1 of 𝑋, which is covered only by the first sample 𝑆 1 . The second case is when 𝑈 lies fully within 𝑣 1 , while the third case is when 𝑈 starts within 𝑣 1 and ends after it.  </p><formula xml:id="formula_5">𝑋 = 𝑣 1 𝑥 1 𝑥 𝑖−1 𝑥 𝑖 𝑥 𝜄 . . . 𝑥 𝑗 𝑥 𝑗+1 . . . 𝑥 𝑟 . . . 𝑆 = . . . . . . . . . . . . . . . 𝑋[𝑏, 𝑒] = 𝑃 𝑟𝑜𝑗 𝑆 ([𝑏, 𝑒]) = 𝑣 𝑖−1 𝑥 𝑖 𝑥 𝑖−1 𝑣 𝜄 𝑥 𝜄 𝑣 𝑗 𝑥 𝑗 𝑣 𝑗+1 𝑥 𝑗+1 𝑣 𝑖</formula><formula xml:id="formula_6">Proof. Let ℓ = |𝑓 |, 𝑘 = ⌈ ℓ 𝑑 ⌉ and 𝑓 = 𝑋[𝑠..𝑠 + ℓ − 1] = 𝑥 ′ 𝑥 𝑖 𝑥 𝑖+1 • • • 𝑥 𝑗 𝑥 ′′</formula><p>, where 𝑥 ′ and 𝑥 ′′ are a proper suffix of 𝑥 𝑖−1 respectively a proper prefix of 𝑥 𝑗+1 , both possibly empty. We will show that each of these substrings is charged with at most one phrase starting position. From Fact 1, the claim follows.</p><p>By construction of 𝑆, each of the substrings 𝑥 ′ , 𝑥 𝑖 , 𝑥 𝑖+1 , . . . , 𝑥 𝑗 , 𝑥 ′′ will appear contiguously in 𝑆. The number of these substrings is either 𝑘 or 𝑘 + 1. Additionally, a 𝑣 substring separates the projections in 𝑆 of each pair of contiguous aforementioned substrings of 𝑋.</p><p>Consider now the source 𝑓 ′ of 𝑓 occurring in 𝑋 in some position</p><formula xml:id="formula_7">𝑠 ′ &lt; 𝑠. Then 𝑋[𝑠..𝑠+ℓ−1] = 𝑋[𝑠 ′ ..𝑠 ′ + ℓ − 1] = 𝑢 ′ 𝑥 𝑖 ′ 𝑥 𝑖 ′ +1 • • • 𝑥 𝑗 ′ 𝑢 ′′ ,</formula><p>where 𝑢 ′ and 𝑢 ′′ are a proper suffix of 𝑥 𝑖 ′ −1 respectively a proper prefix of 𝑥 𝑗 ′ +1 , both possibly empty. As for 𝑓 , the projection of 𝑓 ′ is also split in 𝑆 in such a way that the projection of each pair of mentioned substrings of 𝑓 ′ is separated by a 𝑣 substring in 𝑆.</p><p>Notice that, since 𝑚 ≥ 2𝑑, each 𝑥 𝑏 in 𝑋 has an occurrence in 𝑆 also as suffix of 𝑣 𝑏 ′ +1 , occurring immediately after 𝑥 𝑏+1 in 𝑆. This means that, even if the projections of 𝑓 and 𝑓 ′ in 𝑆 will be split asynchronously, each of the substrings 𝑥 ′ , 𝑥 𝑖 , 𝑥 𝑖+1 , • • • , 𝑥 𝑗 , 𝑥 ′′ in the projection of 𝑓 has already occurred as the concatenation of a suffix of some 𝑣 𝑏 ′ intersecting the projection of 𝑓 ′ and the contiguous 𝑥 𝑏 ′ +1 .</p><p>There are at most 𝑘 + 1 factors 𝑥 ′ , 𝑥 𝑖 , 𝑥 𝑖+1 , . . . , 𝑥 𝑗 , 𝑥 ′′ , and each of them has a previous occurrence in 𝑆. Therefore, by Fact 1, there will be at most 𝑘 + 1 phrase starting positions for 𝑓 . See Fig. <ref type="figure" target="#fig_3">2</ref> for an illustration. In particular, a phrase 𝑓 and its corresponding source 𝑓 ′ in 𝑋 are represented with distinct colors for each of the 𝑥 𝑖 segments of 𝑋 intersected by the phrase. Finally, the projection 𝑃 𝑟𝑜𝑗 𝑆 (𝑓 ′ ) of the source is shown. It is clear from the figure that, in 𝑆 ′ , the samples intersecting some string in the projection of 𝑓 ′ fully contain at least one substring of the projection of 𝑓 .</p><formula xml:id="formula_8">𝑋 = 𝑣 1 𝑥 1 𝑓 ′ = 𝑋[𝑏, 𝑒] 𝑥 𝑗−1 𝑥 𝑗 𝑥 𝑗+1 𝑥 𝑗+2 . . . 𝑓 𝑥 𝑖−1 𝑥 𝑖 𝑥 𝑖+1 𝑥 𝑖+2 . . . 𝑥 𝑟 𝑢 . . . 𝑆 = 𝑣 1 𝑥 1 𝑥 𝑗−1 𝑃 𝑟𝑜𝑗 𝑆 ([𝑏, 𝑒]) . . . 𝑣 𝑗 𝑥 𝑗 𝑣 𝑗+1 𝑥 𝑗+1 𝑣 𝑗+2 𝑥 𝑗+2 . . . 𝑥 𝑗−1 |𝑆 𝑗 | ≥ 2|𝑥 𝑗 |</formula><p>Lemma 3. With respect to the factorization of 𝑆 given in <ref type="bibr" target="#b1">(2)</ref>, the number of phrase starting positions in 𝑥 𝑖 's is at most 𝑛 𝑑 + 2𝑧 𝑋 . Proof. The sum of the lengths of all phrases 𝑓 1 , . . . , 𝑓 𝑧 𝑋 in the LZ factorization of 𝑋 is the length 𝑛 of the string 𝑋. Therefore, we can bound the total contribution to 𝑧 𝑆 of 𝑋 as follows:</p><formula xml:id="formula_9">𝑧 𝑋 ∑︁ 𝑗=1 𝑔(𝑓 𝑗 ) ≤ 𝑧 𝑋 ∑︁ 𝑗=1 ( |𝑓 𝑗 | 𝑑 + 2) = 𝑛 𝑑 + 2𝑧 𝑋 .</formula><p>Theorem 2. Let 𝑧 𝑋 be the number of phrases of the LZ parsing of 𝑋, and 𝑧 𝑆 the number of phrases of the LZ parsing of 𝑆. Then 𝑧 𝑆 ≤ 2𝑛−𝑚 𝑑 + 2𝑧 𝑋 + 𝑚 − 𝑑.</p><p>Proof. We show the maximum number of phrase starting positions in 𝑆 summing up the contribution of the (𝑚 − 𝑑)-length prefix of 𝑆, and of the remaining 𝑥 𝑖 's (Lemma 3) and 𝑣 𝑖 's (Lemma 1) substrings.</p><p>𝑧 𝑆 = number of starting positions in 𝑣 1 + number of starting positions in 𝑥 𝑖 's + number of starting positions in 𝑣 𝑖 's, for 𝑖 ≥ 2</p><formula xml:id="formula_10">≤ (𝑚 − 𝑑) + 𝑛 𝑑 + 2𝑧 𝑋 + 𝑛 − 𝑚 𝑑 = 2𝑛 − 𝑚 𝑑 + 2𝑧 𝑋 + 𝑚 − 𝑑.</formula><p>Theorem 2 essentially says that the number of LZ phrases of 𝑆 is at most twice the number of samples plus twice the number of LZ phrases of 𝑋. Note that the important parameter which determines the number of samples is 𝑑, while 𝑚 influences only the number of samples in which a given position occurs (i.e. 𝑚/𝑑 is the so-called coverage).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Upper Bound for Arbitrary Concatenation Order</head><p>This section examines the effect that the ordering of the samples in the concatenation has on the number of phrases.</p><p>As before, we are given a string 𝑋 of length 𝑛 and two integer parameters 𝑚 and 𝑑, such that 𝑚 ≥ 2𝑑. Let 𝑟 = ⌊(𝑛 − 𝑚)/𝑑⌋ + 1, and, for 𝑖 ∈</p><formula xml:id="formula_11">[1..𝑟], 𝑆 𝑖 = 𝑋[(𝑖 − 1) • 𝑑 + 1..(𝑖 − 1) • 𝑑 + 𝑚].</formula><p>We will show that, regardless of the order in which the samples 𝑆 𝑖 of 𝑋 are concatenated, the number of phrases in the LZ factorization of that concatenation is at most 𝑛 + 2(𝑛−𝑚)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝑑</head><p>. We make this argument precise below. Theorem 3. Let 𝑆 ′ be the concatenation of the samples of 𝑋 in any order, then the number of phrases is 𝑧 𝑆 ′ ≤ 𝑛 + 2(𝑛−𝑚) 𝑑 .</p><p>Proof. Fix 𝑖. We will show that the number of phrase starting positions in sample 𝑆 𝑖 where it appears in 𝑆 ′ is at most 2 + |𝑢 𝑖 |, where 𝑢 𝑖 is a specific substring of 𝑆 𝑖 . Let 𝑘 = max{𝑖 ′ &lt; 𝑖 | 𝑆 𝑖 ′ appears before 𝑆 𝑖 in 𝑆 ′ }, and let 𝑤 𝑖 be the longest suffix of 𝑆 𝑘 which is also a prefix of 𝑆 𝑖 (i.e., 𝑤 𝑖 is the maximum overlap). Note that 𝑤 𝑖 may be empty. Similarly, let 𝑘 ′ = min{𝑖 ′ &gt; 𝑖 | 𝑆 𝑖 ′ appears before 𝑆 𝑖 in 𝑆 ′ }, and let 𝑤 ′ 𝑖 be the longest prefix of 𝑆 𝑘 ′ which is also a suffix of 𝑆 𝑖 , again possibly empty.</p><formula xml:id="formula_12">If |𝑤 𝑖 | + |𝑤 ′ 𝑖 | ≥ 𝑚 = |𝑆 𝑖 |, then set 𝑢 𝑖 = 𝜖.</formula><p>Otherwise, 𝑆 𝑖 can be written as 𝑆 𝑖 = 𝑤 𝑖 𝑢 𝑖 𝑤 ′ 𝑖 , for some 𝑢 𝑖 . This 𝑢 𝑖 is a substring of 𝑋 that has not so far been covered by any sample in 𝑆 ′ , and this is because of the definition of 𝑘 and 𝑘 ′ . We call 𝑢 𝑖 the new part of 𝑆 𝑖 ; in particular, if 𝑢 𝑖 = 𝜖, then the new part of 𝑆 𝑖 is empty.</p><p>By Fact Summing over all samples 𝑆 𝑖 , we thus get</p><formula xml:id="formula_13">𝑧 𝑆 ′ ≤ 𝑟 ∑︁ 𝑖=1 (2 + |𝑢 𝑖 |) = 2𝑟 + 𝑟 ∑︁ 𝑖=1 |𝑢 𝑖 | = 2𝑟 + 𝑛,</formula><p>with the last equality using the fact that every position of 𝑋 occurs exactly once in some 𝑢 𝑖 .</p><p>Theorem 3 essentially says that the number of LZ phrases of 𝑆 ′ , i.e. of the concatenation of the samples in an arbitrary order, is at most the length of 𝑋 plus twice the number of samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Results</head><p>In order to gauge the tightness of our bounds, we computed the number of phrases in the conventional LZ factorization (defined in Section 2) of the concatenation of the samples taken from each of the texts in Table <ref type="table" target="#tab_3">1</ref>. We did this for a range of 𝑑 and 𝑚 parameters, and concatenated the samples both in string order and after a random shuffling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Data</head><p>Our test data, which contains files of varying repetitiveness is shown in Table <ref type="table" target="#tab_3">1</ref>.</p><p>By nature, viruses contain very little recurrent genetic heritage, and so a viral genome represents a real-world non-repetitive string. We used a genome of SARS-CoV2 taken from the COVID-19 Data Portal <ref type="bibr" target="#b8">[9]</ref> of length 29 836, that is factorized in 4 373 LZ phrases. We also performed experiments on extreme cases between which real-life genomes lie: Fibonacci words and random words. Let 𝑠 0 = b, 𝑠 1 = a, the Fibonacci word of order 𝑖 + 1 is the binary word 𝑠 𝑖+1 = 𝑠 𝑖 𝑠 𝑖−1 . In other words, the Fibonacci word of order 𝑖 + 1 is the concatenation of the Fibonacci words of the two previous orders.</p><p>By construction, Fibonacci words have a very high degree of repetition, resulting in very few phrases with respect to their length. Random words, on the other hand, could be considered, instead, not repetitive at all. To perform a comparison with the virus genome, we chose a random word of the same length of the genome, and the Fibonacci word of length 28 657, the closest to the length of the genome. The strings are factorized in 4 575 respectively 22 phrases.</p><p>We also performed experiments with larger data. In particular, we counted the number of phrases of the concatenation of 50 SARS-CoV2 genomes, which, due to the high similarity of the individual genomes, constitutes a highly repetitive dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experiments</head><p>We are interested in simulating the coverage provided by the most used technologies in genome sequencing. The coverage is given by the number of samples that cover a given position in the text. In the context of genome sequencing, a reasonable coverage is given by large 𝑚 and small 𝑑, in order to have the coverage as large as possible with reasonable sample lengths.</p><p>We perform the experiments on each of the aforementioned texts with 𝑑 = 1, 2, 4, 8, 16, 32, 64, and 𝑚 from 50 to 1 000, in increments of 10. We counted the number of phrases for 𝑚 &gt; 𝑑.</p><p>Figure <ref type="figure" target="#fig_5">4</ref>, on top, displays the number of phrases for concatenations of samples in string order of a Fibonacci word (top-left corner), and of a random string of a similar length (top-right corner). Below, we show the counts for an arbitrary ordered concatenation of the samples of the same data. In all figures, the number of phrases for some selected 𝑚's are shown in distinct colours, in gray for all other 𝑚's. The corresponding coloured line shows our bounds for each 𝑚 and 𝑑 pair. Finally, the colored rounded points show the existing bound given by Kärkkäinen <ref type="bibr" target="#b7">[8]</ref>, see Theorem 1.</p><p>The analogous comparison is shown in Figure <ref type="figure" target="#fig_6">5</ref> between a single SARS-CoV2 genome and the concatenation of 50 genomes.</p><p>The overlapping of the bound lines in all plots reflects the small impact of the sampling length 𝑚 in the bound for fixed 𝑑.</p><p>When the samples are concatenated in string order (on the top of both Figure <ref type="figure" target="#fig_6">4 and 5</ref>), the number of phrases in the original string has a big influence on the bound. We can see that the bound lines in the Fibonacci word's plot are closer to the actual counts than in the other plots, where the number of phrases of the original strings is higher. On the other hand, because of the different nature of the bound for the samples in string order versus random order, we can see that the latter bound has a similar trend for the three strings of similar length. In this case, the length and the number of samples of the original string determine the bound rather than its number of phrases.</p><p>Comparing the bound given in this paper (lines) to the one reported in Theorem <ref type="bibr" target="#b7">[8]</ref> by Kärkkäinen (rounded points), when repetitive strings are considered, our bound is not worse than the existing one for the string order concatenation, while it gets more loose with the concatenation in arbitrary order. See the plots for Fibonacci words and the concatenation of 50 genomes of SARS-CoV-2. On the other hand, for non repetitive strings, namely random strings and the single viral genome in the plots, the existing bound is often better than ours. A summary of relevant results for the Fibonacci word and the single SARS-CoV2 genome. For each sampling frequency 𝑑, we report the value of 𝑚 for which the maximum number of phrases in the concatenation of the samples is produced in both string order and after the random shuffling. We further show the counts and the value of our bound for the mentioned 𝑚.</p><p>In Table <ref type="table" target="#tab_4">2</ref> we show, for fixed 𝑑, the maximum number of phrases in 𝑆, varying 𝑚. For strings 𝑋 that are already very repetitive (Fibonacci word), the longer the string, the higher the number of phrases. On the other hand, for strings that are not that repetitive (single SARS-CoV2 genome), introducing long repetitions may decrease the number of phrase starting positions in contrast to having a shorter string with very short repetitions. The concatenation of 50 SARS-CoV2 genomes lies in the middle. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Concluding Remarks</head><p>This paper has explored the compressibility of collections of substrings sampled from a string. In particular, given a string 𝑋 = 𝑋[1.</p><p>.𝑛] of length 𝑛 with an LZ parsing size of 𝑧 𝑋 , and integers 𝑚 and 𝑑, such that 𝑛 &gt; 𝑚 ≥ 2𝑑 &gt; 0, we have shown that the size of the LZ parsing of the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑), is bounded by (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑). We also proved a bound for the case of any arbitrary chosen order of the samples in the concatenation. There are several avenues future work could take. The most immediate perhaps is the question of whether the bounds given in Theorem 2 and Theorem 3 can be improved. Another direction is to derive bounds for other compression measures, such as the size of the smallest context-free</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Throughout we consider a string 𝑋 = 𝑋[1..𝑛] = 𝑋<ref type="bibr" target="#b0">[1]</ref>𝑋<ref type="bibr" target="#b1">[2]</ref> . . . 𝑋[𝑛] of |𝑋| = 𝑛 symbols drawn from an ordered alphabet Σ of size |Σ| = 𝜎. For 𝑖 = 1, . . . , 𝑛, we write 𝑋[𝑖..𝑛] to denote the suffix of 𝑋 of length 𝑛 − 𝑖 + 1, that is, 𝑋[𝑖..𝑛] = 𝑋[𝑖]𝑋[𝑖 + 1] . . . 𝑋[𝑛]. We will often refer to suffix 𝑋[𝑖..𝑛] simply as "suffix 𝑖" or the "𝑖th suffix". We write 𝑋[1..𝑖] = 𝑋[1] . . . 𝑋[𝑖] to denote the prefix of 𝑋 of length 𝑖. We write 𝑋[𝑖..𝑗] to represent the substring (or factor) 𝑋[𝑖]𝑋[𝑖 + 1] . . . 𝑋[𝑗] of 𝑋 that starts at position 𝑖 and ends at position 𝑗. If 𝑖 &gt; 𝑗, then 𝑋[𝑖..𝑗] = 𝜖, where 𝜖 is the empty string. Let lcp(𝑖, 𝑗) denote the length of the longest common prefix of suffix 𝑖 and suffix 𝑗. For example, in the string 𝑋 = zzzzzipzip, lcp(2, 5) = 1 = |z|, and lcp(5, 8) = 3 = |zip|. For technical reasons we define lcp(𝑖, 0) = lcp(0, 𝑖) = 0 for all 𝑖.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fact 1 .</head><label>1</label><figDesc>Let 𝑇 be a string, and 1 &lt; 𝑖 ≤ 𝑗 ≤ |𝑇 |. If 𝑡 = 𝑇 [𝑖..𝑗] has an occurrence before 𝑖, i.e., if there exists an 𝑖 ′ &lt; 𝑖 s.t. 𝑇 [𝑖 ′ ..𝑖 ′ + |𝑡| − 1] = 𝑡, then the interval [𝑖, 𝑗] can contain at most one starting position of a phrase.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The projection of the substring 𝑋[𝑏, 𝑒] on the string 𝑆 is shown. The number of substrings of 𝑋 contained in the collection of substrings produced by the projection is the number of 𝑥 𝑖 's intersected by 𝑋[𝑏, 𝑒] in 𝑋, and they are all contained in the corresponding 𝑥 𝑖 's in 𝑆.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Lemma 2 .</head><label>2</label><figDesc>Let 𝑚 ≥ 2𝑑. Let 𝑓 be a phrase of the LZ factorization of 𝑋, with starting position 𝑠 &gt; 𝑚 − 𝑑. Then 𝑔(𝑓 ) ≤ |𝑓 | 𝑑 + 2.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The original string 𝑋 on top, and the concatenation 𝑆 of the 𝑟 samples of 𝑋 below are shown.In particular, a phrase 𝑓 and its corresponding source 𝑓 ′ in 𝑋 are represented with distinct colors for each of the 𝑥 𝑖 segments of 𝑋 intersected by the phrase. Finally, the projection 𝑃 𝑟𝑜𝑗 𝑆 (𝑓 ′ ) of the source is shown. It is clear from the figure that, in 𝑆 ′ , the samples intersecting some string in the projection of 𝑓 ′ fully contain at least one substring of the projection of 𝑓 .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4:The number of phrases in log scale for concatenation of samples of a Fibonacci word (left column) and a random word (right column), both in string order (on top) and in an arbitrary chosen order (on bottom). The counts are shown for each concatenation of 𝑚-length samples at every 𝑑 position. Colored markers indicate the number of phrases for 𝑚 = 50, 100, 200, 400, 1000, while we use grey points to mark all others 𝑚's. The bounds shown in Theorem 2 and 3 for the concatenation in string order respectively random order are shown with colored and shaped lines accordingly to the corresponding 𝑚, 𝑑 pair, while the coloured rounded points indicate the bound given by Kärkkäinen<ref type="bibr" target="#b7">[8]</ref>, see Theorem 1.</figDesc><graphic coords="10,89.63,304.07,216.68,162.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The number of phrases in log scale for concatenation of samples of a single SARS-CoV2 genome (left column) and of a concatenation of 50 SARS-CoV2 genomes (right column), both in string order (on top) and in any arbitrary chosen order (on bottom). The counts are shown for each concatenation of 𝑚-length samples at every 𝑑 position. Colored markers indicate the number of phrases for 𝑚 = 50, 100, 200, 400, 1000, while we use grey points to mark all others 𝑚's. The bounds shown in Theorem 2 and 3 for the concatenation in string order respectively random order are shown with colored and shaped lines accordingly to the corresponding 𝑚, 𝑑 pair, while the coloured rounded points indicate the bound given by Kärkkäinen [8], see Theorem 1.</figDesc><graphic coords="11,89.63,290.54,216.68,162.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Definition 1 (LZ Factorization). The LZ factorization of 𝑋 is a factorization 𝑋 = 𝑓 1 𝑓 2 . . . 𝑓 𝑧 𝑋 of 𝑋 into 𝑧 𝑋 phrases such that each phrase 𝑓 𝑖 (a substring of 𝑋) is either 2. the longest substring that occurs at least twice in 𝑓 1 • • • 𝑓 𝑖 .</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Theorem 1 ([8], Theorem 5.20). Let 𝑋 be a string of length 𝑛 over Σ, with |Σ| = 𝜎. Then 𝑧 𝑋 ≤ 𝑍, where</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>1, if 𝑤 𝑖 is non-empty, then at most one phrase starting position is contained within 𝑤 𝑖 . Similarly, if 𝑤 ′ 𝑖 is non-empty, at most one starting position is contained within 𝑤 ′ 𝑖 . The number of starting positions within the new part 𝑢 𝑖 can be trivially upper bounded by |𝑢 𝑖 |. The contribution of the samples of 𝑋 to 𝑧 𝑆 ′ , where 𝑆 ′ is the concatenation of the samples in an arbitrary order. The colored substrings are the prefixes and suffixes with multiple occurrences in 𝑆 ′ of some sample. Each sample consists of a prefix and a suffix (possibly empty) that has already occurred in 𝑆 ′ , and a substring in between them (i.e., labelled 𝑢 𝑖 , in white) that is not assumed to necessarily occur previously in 𝑆 ′ .</figDesc><table><row><cell>𝑋 =</cell><cell></cell><cell>𝑥 𝑖−𝑡</cell><cell>𝑥 𝑘</cell><cell>𝑥 𝑖</cell><cell>𝑥 𝑘 ′</cell><cell>𝑥 𝑖+𝑡</cell><cell>...</cell></row><row><cell></cell><cell>𝑆 𝑖−𝑡 =</cell><cell>...</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>𝑆 𝑘 =</cell><cell>...</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell>𝑆 𝑖 =</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell>...</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell>𝑆 𝑘 ′ =</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell>...</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell>𝑆 𝑖+𝑡 =</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>𝑆 𝑘 ′</cell><cell></cell><cell>𝑆 𝑘</cell><cell></cell><cell>𝑆 𝑖</cell><cell></cell></row><row><cell>𝑆 ′ =</cell><cell>...</cell><cell>...</cell><cell></cell><cell>...</cell><cell>𝑢 𝑖</cell><cell></cell><cell>...</cell></row><row><cell cols="2">Figure 3:</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1</head><label>1</label><figDesc>Data used in the experiments. The table shows the length (𝑛) and the number of phrases (𝑧) of each text used for the experiments.</figDesc><table><row><cell>Data</cell><cell>Description</cell><cell></cell><cell>𝑛</cell><cell>𝑧</cell><cell>𝑛/𝑧</cell></row><row><cell>SARS-CoV2</cell><cell cols="2">Taken from the COVID-19 Data Portal [9]</cell><cell cols="2">29 835 4373</cell><cell>6.8</cell></row><row><cell>50 SARS-CoV2</cell><cell>Concatenation of 50 virus genomes taken from the COVID-19 Data Portal</cell><cell cols="3">[9] 1 490 134 5421</cell><cell>275</cell></row><row><cell>Fibonacci word</cell><cell>Fibonacci word of order 22</cell><cell></cell><cell>28 657</cell><cell cols="2">22 1302.6</cell></row><row><cell>Random</cell><cell cols="2">Word over an alphabet of size 4 built with python function random.choices()</cell><cell cols="2">29 835 4575</cell><cell>6.5</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell>Data</cell><cell>𝑑 = 1</cell><cell>𝑑 = 2</cell><cell>𝑑 = 4</cell><cell>𝑑 = 8</cell><cell>𝑑 = 16</cell><cell>𝑑 = 32</cell><cell>𝑑 = 64</cell></row><row><cell>Fibonacci</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>max 𝑚</cell><cell>1000</cell><cell>980</cell><cell>990</cell><cell>970</cell><cell>960</cell><cell>960</cell><cell>940</cell></row><row><cell>phrases</cell><cell>1336</cell><cell>1321</cell><cell>1308</cell><cell>1021</cell><cell>1007</cell><cell>910</cell><cell>603</cell></row><row><cell>bound</cell><cell cols="2">57 357 29 189</cell><cell>15 111</cell><cell>8049</cell><cell>4510.1</cell><cell>2733.1</cell><cell>1800.8</cell></row><row><cell>Fibonacci</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>shuffled</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>max 𝑚</cell><cell>1000</cell><cell>990</cell><cell>990</cell><cell>990</cell><cell>630</cell><cell>970</cell><cell>980</cell></row><row><cell>phrases</cell><cell cols="2">25 164 13 179</cell><cell>6802</cell><cell>3505</cell><cell>1802</cell><cell>942</cell><cell>521</cell></row><row><cell>bound</cell><cell cols="4">83 971 56 324 42 490.5 35 573.75</cell><cell>32 160.4</cell><cell cols="2">30 387.4 29 521.906 25</cell></row><row><cell>SARS-CoV2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>max 𝑚</cell><cell>80</cell><cell>150</cell><cell>50</cell><cell>70</cell><cell>50</cell><cell>90</cell><cell>210</cell></row><row><cell>phrases</cell><cell cols="2">44 438 27 823</cell><cell>14 880</cell><cell>8945</cell><cell>6759</cell><cell>5575</cell><cell>4973</cell></row><row><cell>bound</cell><cell cols="6">68 417 38 655 23 697.5 16 258.25 12 506.38 10 665.93</cell><cell>9821.09</cell></row><row><cell>SARS-CoV2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>shuffled</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>max 𝑚</cell><cell>200</cell><cell>260</cell><cell>350</cell><cell>540</cell><cell>780</cell><cell>680</cell><cell>760</cell></row><row><cell>phrases</cell><cell cols="2">45 725 27 898</cell><cell>17 538</cell><cell>11 620</cell><cell>7969</cell><cell>6142</cell><cell>5238</cell></row><row><cell>bound</cell><cell cols="2">89 108 59 412</cell><cell>44 574</cell><cell>37 185</cell><cell cols="2">33 468 31 658.25</cell><cell>30 744.63</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">This can be loosely interpreted as gzip, a sliding-window dictionary compressor with window size too small to capture any large dispersed repeated substrings present in the file, essentially reducing the space used by each DNA letter from the 8 bits used in the plain ASCII encoding to the</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">bits that a flat minimal binary code for four letters would use.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>grammar for 𝑆 <ref type="bibr" target="#b9">[10]</ref>, the number of runs in its Burrows-Wheeler transform <ref type="bibr" target="#b10">[11]</ref>, the size of its smallest string attractor <ref type="bibr" target="#b11">[12]</ref>, or other measures of compressibility discussed in Navarro's recent survey <ref type="bibr" target="#b12">[13]</ref>.</p><p>Along the lines of our original motivation from DNA sequencing, how are the bounds affected by the introduction of some probability of error to the sample substrings, such as those errors introduced in collections of strings produced by short-read sequencing technologies?</p><p>Finally, note that we have assumed 𝑑 ≤ 𝑚/2, which corresponds to the interesting case in practice. However, it may be interesting from a theoretical perspective to also examine the case where 𝑑 &gt; 𝑚/2.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ekblom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Smeds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ellegren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Genomics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">467</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">LFastqC: A lossless non-reference-based FASTQ compressor</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Al</forename><surname>Yami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS One</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">e0224806</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SPRING: a next-generation compressor for FASTQ data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chandak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tatwawadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ochoa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hernaez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Weissman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="2674" to="2676" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">FQSqueezer: k-mer-based compression of sequencing data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deorowicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Fast and efficient compression of highthroughput sequencing reads</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hoobin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Boucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Puglisi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics</title>
				<meeting>the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="325" to="334" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">On the complexity of finite sequences</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lempel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ziv</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Inf. Theory</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="75" to="81" />
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A universal algorithm for sequential data compression</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ziv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lempel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Inf. Theory</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="337" to="343" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Kärkkäinen</surname></persName>
		</author>
		<title level="m">Repetition-Based Text Indexes</title>
				<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>University of Helsinki, Faculty of Science, Department of Computer Science</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Harrison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Allen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aslam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Buso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cummins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Felix</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucleic Acids Research</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="W619" to="W623" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The smallest grammar problem</title>
		<author>
			<persName><forename type="first">M</forename><surname>Charikar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Panigrahy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Prabhakaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sahai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shelat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Inf. Theory</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="page" from="2554" to="2576" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An analysis of the Burrows-Wheeler transform</title>
		<author>
			<persName><forename type="first">G</forename><surname>Manzini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. ACM</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="407" to="430" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">At the roots of dictionary compression: string attractors</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kempa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Prezza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC)</title>
				<meeting>50th Annual ACM SIGACT Symposium on Theory of Computing (STOC)</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="827" to="840" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Indexing highly repetitive string collections, part I: repetitiveness measures</title>
		<author>
			<persName><forename type="first">G</forename><surname>Navarro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page">31</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
