<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Burrows-Wheeler Transform of an Elastic-Degenerate String</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lapo Cioni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Guerrini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Rosone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Degenerate strings (DS) and elastic degenerate strings (EDS) are a way to represent, in a compact form, strings that contain a high degree of similarity. They can be particularly useful in some fields, such as text processing or the study of DNA mutations in computational biology, where it is necessary to eficiently manage several variations of a sequence. In practice, a degenerate string is a string whose symbols, called degenerate, can have several alternatives (hence a degenerate symbol is a set). In the literature difer ent constraints have been imposed on degenerate string symbols. For example, the symbol can only be i) a set of letters of the alphabet, ii) a set of strings of the same length, or iii) a set of strings of variable length (including the empty string). We consider the latter in its most general form, which is known as elastic degenerate strings. Our contribution is the introduction of the Burrows-Wheeler transform of an elastic-degenerate string (EDS-BWT). We show that EDS-BWT is reversible and that it can be used to solve the pattern matching problem, i.e., the problem of finding a standard string pattern within an EDS, by adapting the inner properties of the classical Burrows-Wheeler transform. Finally, we implemented the EDS-BWT encoding/decoding and the prototype edsBWTSearch to experimentally compare our pattern matching approach to other existing tools managing elastic degenerate strings.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Burrows-Wheeler Transform</kwd>
        <kwd>elastic degenerate strings</kwd>
        <kwd>backward search</kwd>
        <kwd>pattern matching problem</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Burrows–Wheeler transform (BWT) was introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as a method for compressing
a single string, and then, it was shown to be efe ctive in many other areas, with applications
spanning beyond its original purpose [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For instance, it has been successfully used for compact
text indexing [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ] and for bioinformatics applications, e.g., for sequence alignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
phylogenetic analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], genome assembly [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as well as for sequencing data compression [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Roughly, the BWT performs a permutation of the letters of an input string  . First, the
cyclic rotations of  are sorted in lexicographic order, and then the bwt( ) is obtained by
concatenating the last letters of the (sorted) cyclic rotations of  . The BWT can also be define d
by sorting the sufixes of  $ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where $ is an end-marker symbol that does not appear in 
itself and it is lexicographically smaller than any of the symbols in  .
      </p>
      <p>
        Shifting the focus from a string to a collection of strings, the BWT of a string collection can be
defined analogously. That is, either by sorting the cyclic rotations of all strings in the collection 1,
as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], or by sorting their sufixes, as in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In the latter case, a distinct end-marker symbol
is appended to each string, giving an ordered collection of strings. The BWT and its extension
to a collection of strings allow to define a map from symbols occurring in the transformed
string  to the string  of lexicographically sorted symbols of . This mapping is known as
LF-mapping, and it allows both the reversibility of the transform and the search for patterns
(called backward search). The LF-mapping and the backward search are the key machinery in
the FM-index [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Degenerate strings (DS) and elastic degenerate strings (EDS) have been introduced as a way
to eficiently represent sequences that contain a high degree of similarity. For instance, in
bioinformatics, they are used to represent pangenomes, which are collections of closely-related
genomic sequences that one needs to analyze together [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        A degenerate string (also known as indeterminate string) over an alphabet Σ is a sequence
 = 1 · · ·  where each  is a subset of Σ. It represents any string that can be obtained by
selecting one letter from each subset from left to right. In 2017, Iliopoulos et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] defined
a more general notion of degenerate strings: an elastic-degenerate string (EDS) over Σ is a
 of non-empty subsets  of strings over Σ, where each  is called
sequence  = 1 · · ·
degenerate string [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
degenerate symbol. If instead each  contains strings of the same length,  is called general
Here is an example of an EDS, which we use throughout this paper:
      </p>
      <p>{︃
 =</p>
      <p>ATTGCT
}︃{︃   }︃{︃</p>
      <p>CTACGGACT
}︃{︃ A }︃{︃</p>
      <p>CTGT
}︃
(1)
We remark that Eq. (1) is a compact way to represent all the strings that are obtained by taking
an element from each set and concatenating them in order.</p>
      <p>
        Elastic-degenerate strings and their variants have been much studied in the literature in
recent years, mainly for the pattern matching problem, which consists of finding the occurrences
of a pattern in an ED string [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19 ref20">17, 16, 18, 19, 20</xref>
        ]. For instance, the pattern    occurs in ,
across the first three degenerate symbols. A variety of methods and data structures have been
used for the pattern matching problem on very similar strings. The following partial list gives a
few examples [
        <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24 ref25 ref26 ref27">21, 22, 23, 24, 25, 26, 27</xref>
        ] (see also references therein).
1.1. Our contribution
In this paper, we introduce the Burrows-Wheeler transform of an elastic degenerate string
(EDS-BWT), which applies the EBWT to a sequence of ordered collections of strings of any
length, and show that this transform is reversible. The core of our method is a mix between the
classical BWT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the EBWT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], together with a mapping between strings belonging to
consecutive degenerate symbols. We also present an algorithm for solving the pattern matching
problem on an elastic degenerate string. Specifically, we are able to return the number of starting
positions of pattern occurrences and, for each pattern occurrence, the index of the degenerate
symbol and the string position at which the occurrence starts. For instance, searching pattern
      </p>
      <sec id="sec-1-1">
        <title>1In this case, one needs to use the -order defined in [11].</title>
        <p>=    in the string  in Eq. (1), we return that there is at least one occurrence that starts
in the first degenerate symbol at the last letter of the only string in it.</p>
        <p>
          To the best of our knowledge, no other work does it, although some authors introduced the
BWT for degenerate strings [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] and for closely-related sequences [
          <xref ref-type="bibr" rid="ref28 ref29 ref30 ref31 ref32 ref5">28, 29, 5, 30, 31, 32</xref>
          ] (see
also references therein).
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Notation</title>
      <p>Let Σ = {1, 2, . . . ,  } be a finite ordered alphabet Σ with 1 &lt; 2 &lt; . . . &lt;  , where &lt;
denotes the standard lexicographic order, and let  be the empty string. Let  be a string (or
word) of length  on Σ and [] its -th letter (or symbol). A substring [, ] of  coincides with
[] · [ + 1] · · · [], where · is the concatenation operator. For any 1 ≤  ≤ , the substring
[1, ] is called a prefix of  and [, ] a sufix of .</p>
      <p>
        The Burrows-Wheeler Transform (BWT) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The BWT is a well-known reversible
transformation defined on a string  that permutes its letters. By appending an end-marker symbol $ to 
and by sorting all the sufixes of $ in lexicographic order, the output of the BWT is a string
bwt() of length  + 1 obtained by concatenating the letters (circularly) preceding each sufix
in the list of sorted sufixes. More precisely, for each , bwt()[] is the letter preceding the -th
lexicographically smallest sufix of the string $, except for the sufix $, where the preceding
letter is set to be $.
      </p>
      <p>
        The BWT of a string collection [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. The BWT extended to a string collection , known as
EBWT, is a reversible transformation that produces a string ebwt() that is a permutation of
all the symbols of all strings in .
      </p>
      <p>Let  = {1, 2, . . . , } be a collection of  strings on the alphabet Σ. We append to each
string  ∈  a diferent end-marker symbol $, not belonging to Σ and lexicographically
smaller than any other symbol in Σ, by setting $ &lt; $ for each  &lt; 2. That is, if  has length
, we define [ + 1] = $. In the following, we will always use  to refer to a string of
length  + 1 terminating with the end-marker symbol $. Finally, let  = ∑︀
=1  +  be the
number of letters of all strings in  (including their end-marker symbol).</p>
      <p>Note that, after appending a distinct end-marker symbol to each string in , the collection 
becomes an ordered collection. Exclusively for implementation purposes, a unique end-marker
symbol $ is used for all strings in , even if each $ implicitly carries the index of the string to
which it was appended.</p>
      <p>
        LF-mapping. Let  be the string bwt() and  the string obtained by sorting all the symbols
of  lexicographically. The reversibility of the BWT (as well as of the EBWT) is based on the
following two properties stated in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
• For all  = 1, . . . ,  + 1, the symbol  [] circularly follows the symbol [] in the string
;
• For each symbol  ∈ Σ, the -th occurrence of  in  corresponds to the -th occurrence
of  in  .
2We note that in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the EBWT is defined without appending end-marker symbols to the strings: cyclic rotations of
the strings in  are sorted by means of an order relation, called -order, on infinite strings.
From the second property, it follows that given a position  in  such that [] = , the position
in  corresponding to that occurrence of  is given by [[]] + rank(, []), where  is
an array storing for any  ∈ Σ the total number of symbols in  that are smaller than , and
rank(, []) is the number of occurrences of [] in the prefix [1, ]. This mapping gives a
correspondence between symbol occurrences in  and symbol occurrences in  , and is known
as LF-mapping [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ].
      </p>
      <p>Bit vectors. A string of zeros and ones is called bitvector. Given a bitvector  of length , the
operators rank and select are defined as follows, for any 1 ≤  ≤  and  ∈ {0, 1}:
rank(, ) = |{ | 1 ≤  ≤  and [] = }|
select(, ) = , with [] =  and rank(, ) = , if such  exists.</p>
      <p>
        In other words, rank(, ) gives the number of occurrences of the bit  in the prefix [1, ],
while select(, ) returns the index of the -th occurrence of  in  (if it exists). A bitvector 
can be preprocessed in order to support rank and select queries in constant time [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ].
      </p>
      <p>
        Elastic Degenerate string. An elastic degenerate string (see [
        <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
        ]) (or ED string) is a sequence
of  finite nonempty sets of strings (including  ) 1 · · ·  of combined total length  . Each
 is called degenerate symbol.
      </p>
      <p>Given an elastic degenerate string  = 1 · · · , for each  = 1, . . . , , we denote the
strings in  by ,1, . . . , ,ℓ , where || = ℓ ≥ 1.</p>
      <p>Essentially, an elastic degenerate string represents all possible strings that can be constructed
by taking an element from each degenerate symbol and concatenating them. For example,
{, }{, }{ } represent the strings   ,   ,  ,  .</p>
    </sec>
    <sec id="sec-3">
      <title>3. The BWT of an elastic degenerate string</title>
      <p>In this section we define EDS-BWT, which is the BWT of an elastic degenerate string
 = 1 · · · , where  = {,1, . . . , ,ℓ }, || = ℓ ≥ 1 and ℓ = ∑︀=1 ℓ. We
need to append to each , an end-marker symbol $ ∈/ Σ, with  =  + ∑︀−=11 ℓ. In
this way, denoting , $ by , we obtain a single ordered collection (of strings)  =
{1, 2, . . . , ℓ1 , ℓ1+1, . . . , ℓ1+ℓ2 , . . . , ℓ}. Roughly, we are appending an end-marker
symbol at the end of each string of each degenerate symbol, proceeding left-to-right among the
symbols and giving an arbitrary order to the strings within the degenerate symbols. Note that
when , =  , the resulting string is $, with  as above.</p>
      <p>Definition 3.1. Given an elastic degenerate string  = 1 · · ·  the Burrows-Wheeler
transform of  (called EDS-BWT) is defined as the pair edsbwt() = (ebwt(), ℒ) where
 is the ordered collection  = {1, . . . , ℓ} and ℒ is defined as follows:
For  = 1, . . . , ℓ, let  be the position of $ in ebwt(), and let the associated string  belong
to the degenerate symbol , for some 1 ≤  ≤ .</p>
      <p>Then, if  &gt; 1, we define ℒ() to be the interval of positions [, ] in ebwt() such that
− 1 = {, . . . , }. Otherwise, ℒ() circularly gives the interval of positions [, ] such
that  = {, . . . , }.</p>
      <p>The ED string of our running example Eq. (1) gives the ordered collection
each  &gt; 1. Also, the next remark follows:</p>
      <p>= {   $1,  $2,  $3, $4,   $5, $6,  $7,   $8}.
together with its index  can be returned.</p>
      <p>
        Remark 3.2. As in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for the computation of the EBWT, we can use the same end-marker
for all strings, so that the size of the alphabet increases by just one. However, each end-marker
has a diferent (implicit) index determined by the order of the strings in the collection . In this
way, during the construction of ebwt(), a list containing the position  in ebwt() of each $
      </p>
      <sec id="sec-3-1">
        <title>Note that, in the definition of</title>
        <p>ℒ,  = (ℓ1 + · · ·
+ ℓ− 2) + 1 and  = ℓ1 + · · ·
+ ℓ− 1 for
Remark 3.3. Let  and ′ be two diferent positions in ebwt() such that ebwt[] = $ and
ebwt[′] = $′ . If the associated strings  and ′ belong to the same degenerate symbol ,
then ℒ() = ℒ(′) by definition.
row F
1 $1
2 $2
3 $3
4 $4
5 $5
6 $6
7 $7
8 $8
9 A
10 A
11 A
12 A
13 A
14 A
15 A
16 C
17 C
18 C
19 C
20 C
21 C
22 G
23 G
24 G
25 G
26 T
27 T
28 T
29 T
30 T
31 T
32 T
33 T
34 T

T
A
A
A
T
A
$7
T
T
T
$4
$6
T
G
$1
A
G
A
$2
$5
$8
G
T
C
T
C
C
G
C
$3
C
T
C
A
Reversibility To show that the EDS-BWT is reversible, we describe how LF-mapping works
for this new transform. Note that LF-mapping for the EDS-BWT is diferent from that for the
classical EBWT. Indeed, in the EBWT, the strings are independent from each other. Conversely,
in our EDS-BWT, the strings in  belonging to consecutive degenerate symbols need to be
“linked” to reconstruct the elastic degenerate string .</p>
        <p>
          Table 1 shows the transform of our running example. For instance, edsbwt()[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] = $5,
and its associated string 5 belongs to 3, so ℒ(20) = [
          <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
          ], which is an interval of three
positions. Thus the previous degenerate symbol 2 is a set of three strings. Specifically, it is
2 = { ,  ,  }.
        </p>
        <p>The following proposition states the properties of the LF-mapping for an ED string. The
proof follows immediately from the original LF-mapping properties and the definition of ℒ.
Proposition 3.1. Let  be the collection of strings associated to an ED string  and let
edsbwt() = (ebwt(), ℒ). Let  = ebwt() and  be the sequence of the lexicographically
sorted letters of . The following conditions hold:
• For all  = 1, . . . ,  , the letter [] is circularly followed by the letter  [] in its associated
string ;
• For each letter  ∈ Σ, its occurrences in  appear in the same order as in  , i.e. the  -th
occurrence of  in  corresponds to the  -th occurrence of  in  .
• For all  = 1, . . . ,  such that [] = $ for some , the letter  [] is preceded (in the
ED string) by all the symbols in [, ], where ℒ() = [, ]. If any end-marker symbol
appears in [, ], then  [] is preceded (in the ED string) by the empty string.</p>
        <p>Proposition 3.1 allows us to define the following permutation  that maps each position
of  to  and allows us to link the first symbol of a string in a degenerate symbol to the last
symbol of any string within the previous degenerate symbol.</p>
        <p>[] =
{︃[[]] + (, []) if [] ̸= $
ℒ()
if [] = $
(2)
The  of our running example is shown in Table 2.</p>
        <p>
          By using this modified LF-mapping, we are able to reconstruct the ED string, as follows.
We begin from the first end-marker symbol, which is $1, in position 15. Using ℒ we can
ifnd the positions [
          <xref ref-type="bibr" rid="ref8 ref8">8, 8</xref>
          ] of the last letters in the final degenerate symbol, which in this case
is [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] =  . Note that, in this example, we have a single string in both the first and last
degenerate symbols. In the general case, by Definition 3.3, ℒ applied to any end-marker
symbol in 1 gives the interval containing the positions of the last letters of all strings in .
Then, using the LF-mapping described in Eq. (2), we find that  is preceded by [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] = ,
and continue in this way until we have reconstructed the string 8 =   . This is correct
since 5 = {  }. This step stops at position 21, which is such that [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] = $8. Thus, we
use ℒ to obtain the positions corresponding to the last letters in the previous degenerate
symbol 4, ℒ(21) = [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. Since we have two positions now, we reconstruct two diferent
strings using the same strategy as before; the first is , ending at position 12, while the second
terminates immediately, since (7) = $7, indicating that 7 is the empty string. This is correct
since 4 = {,  }.
        </p>
        <p>
          We do not need to compute both ℒ(7) and ℒ(12), since by Definition 3.3 they give
the same result of [
          <xref ref-type="bibr" rid="ref5 ref5">5, 5</xref>
          ]. The reconstruction continues by alternating between the parallel
reconstruction of all the strings in a degenerate symbol and the linking to the strings in the
previous degenerate symbol. This process ends when it reaches the position of first end-marker,
since that was the starting position.
        </p>
        <p>
          Implementation of ℒ using bitvectors. Let  = 1 · · ·  be an ED string. We define
the bitvector () associated to  as the concatenation of (1), . . . , (), where the
bitvector () of length || has all zeroes except for its first bit, which is one. For instance,
the bitvector of our running example is () = 1 100 1 10 1. An analogous definition of a
bitvector to represent the underlying structure of a degenerate string appears in [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ].
        </p>
        <p>The following proposition shows how, using rank and select on (), we can compute
ℒ() in constant time, provided that the index  such that [] = $ is given.
Proposition 3.2. Let  = 1 · · ·  be an elastic degenerate string, and let  = 1, . . . , ℓ the
ordered collection of strings contained in the degenerate symbols and let 1 ≤  ≤ ℓ. For  &gt; 1,
let ,  be the indexes such that, if  ∈ , then − 1 = {, . . . , }. If  = 1, let ,  such
that  = {, . . . , }. Then  and  can be computed in constant time using just  and ().
Specifically, for  &gt; 1, thus rank()(, 1) &gt; 1, then
 = select()(rank()(, 1) − 1, 1),
 = select()(rank()(, 1), 1) − 1.</p>
        <p>For  = 1, thus rank()(, 1) = 1, then</p>
        <p>= select()(rank()(ℓ, 1), 1),  = ℓ.</p>
        <p>Proof. Since () is the concatenation of the ()’s, and since each () contains exactly
one 1, then rank()(, 1) gives the  such that  ∈ , for each . Now observe that,
for  &gt; 1,  and  are the indexes of the first and last string of − 1, so ()[] = 1 and
()[ + 1] = 1. In particular, ()[] is the ( − 1)-th 1, while  is the position preceding
the -th 1. On the other hand, for  = 1,  and  are the indexes of the first and last string of .</p>
        <p>
          We can now compute  and  using select. Finally, rank and select can be performed in
constant time on a bitvector, for example using the sdsl-lite library [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
        <p>Note that we need the index  for computing ℒ(). The mapping from  to  can be
obtained during the construction of the ebwt() (see Definition 3.2). In our implementation, to
compute  in constant time, we use a bitvector  such that [] = 1 if and only if ebwt()[] = $
for some 1 ≤  ≤ ℓ, and an associated array  of length ℓ such that [] = , if rank(, 1) = ,
for each 1 ≤  ≤ ℓ. Hence  = [rank(, 1)].</p>
        <p>We conclude this section by encapsulating the previous observations in Algorithm 1. For
simplicity, when describing the algorithms in the following, we just write $ ← [] to mean
that we obtain the index  from position , if such  exists, and assign 0 to  otherwise.
Algorithm 1: link(, ())
 ←
 ←
1 $ ← [];
2 if  = 0 then
3 return ∅
4 if $ ∈/ 1 then
5  ← select()(rank()(, 1) − 1, 1);
6  ← select()(rank()(, 1), 1) − 1;
7 else
8 select()(rank()(ℓ, 1), 1);
9 (());
10 return [, ];</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Exact pattern matching problem</title>
      <p>In this section we show how the EDS-BWT can be used to resolve the exact pattern matching
problem for an elastic degenerate string  = 1 · · · .</p>
      <p>In the classical problem, a string  ∈ Σ* contains the pattern  ∈ Σ* if and only if  is a
substring of  . This concept is generalized to (elastic) degenerate strings in the natural way.
Definition 4.1. Let  ∈ Σ* , and let  = 1 · · ·  be an ED string over Σ, with  =
{,1, . . . , ,ℓ } for each . We say that  contains the pattern  if there exists a collection
of strings , ∈ , . . . , +,+ ∈ +, such that  is a pattern of , · · · +,+ , for
some ,  such that 0 ≤  ≤  − 1 and 1 ≤  ≤  − .</p>
      <p>In this case, we say that  contains an occurrence of  at position (, , ) if  begins at
position  in , .</p>
      <p>Note that more than one occurrence of  can start (and end) at the same starting (and
ending) position, since they can select diferent strings in the degenerate symbols. For example
the pattern  in  = {, ,     C}{,  ,  }{, ,  }{T} can be obtained in two
diferent ways starting at position (1, 3, 5) and ending at position (4, 1, 1) (in bold), taking
either the red or the blue strings in the degenerate symbols 2 and 3. Conversely, there are
no occurrences of  , because  and  belong to the same degenerate symbol.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors showed how to search a pattern  =  [1, ] backwards by the output of
the classical BWT. A backward search algorithm first searches for the  []-interval (i.e. the
interval in  of the symbols associated to sufixes starting with  []), then for the ( [ −
1] [])-interval (i.e. the interval in  of the symbols associated to sufixes starting with
 [ − 1] []), and so on, until the whole pattern  [1, ] is found, if there is one. Specifically,
given an interval [, ] such that [, ] are all the letters followed by  [, ] in the original
string, then the letters followed by  [ − 1, ] are in the interval [′, ′], with
′ = [] + ( − 1, ) + 1,
′ = [] + (, ).
      </p>
      <p>
        We adapt the backward search algorithm defined in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] in order to apply it to edsbwt() and
solve the pattern matching problem.
Algorithm 2: edsBWTSearch(edsbwt(), , (),  [1, ])
1  =  [],  =  − 1;
2  = [] + 1,  = [ + 1];
3  = {[, ]} ;
4 while  ̸= ∅ and  &gt; 0 do
      </p>
      <p>// Compute ℒ on every interval
5 for [, ] ∈  do
6 for  ∈ [, ] do
7  ←  (, link(, ())) ;
 =  − 1 ;
 ← ′;
17 if  ̸= ∅ then
18 return "Found";
19 else
20</p>
      <p>return "Not found";</p>
      <p>Algorithm 2 (edsBWTSearch) outlines the steps of the algorithm. edsBWTSearch loops
over the symbols of  , starting from the last, and for each iteration updates a list Intervals of
the intervals of  which have a positive match for the current symbol. The update happens in
two steps:
• the first step calls link (Algorithm 1). The algorithm finds all the end-marker symbols
contained in the current intervals, and compute ℒ for each of those positions, adding
new intervals to the list. We note the following:
– the newly added intervals are checked again, since they could also contain an
end-marker symbol, meaning that its corresponding string is the empty string;
– the algorithm actually uses a non-circular version of link, because the pattern
matching problem is not circular. Thus, when an end-marker symbol belonging to
the first degenerate symbol is considered, the output of link is the empty interval
(diferently from lines 8-9 in Algorithm 1);
– the Merge called in line 7 (and again in line 14) is a modified version of the union
operation. It adds the interval  obtained from link to the list of intervals, but
ifrst checks if  is consecutive to or included in other intervals in Intervals. If this
happens, a new interval is instead created by combining the two into one;
• the second step applies the modified backward search to each interval in the list Intervals.</p>
      <p>Note that Merge is applied also in this step.</p>
      <p>Since link and the  -mapping are called for each interval, then the Merge operation
allows for a faster search by reducing the number of intervals (see Section 5 for an upper bound
on the number of intervals at each iteration).</p>
      <p>
        Table 3 shows the backward search of pattern   on our running example. The search
begins from interval [
        <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
        ], which is the interval of positions corresponding to letter  [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] = 
(marked in blue in the table). For readability, we will not color every interval in the table at
every step, but we will just mark some “branchings” that visually show the link and update
parts. Nonetheless, we fully describe the search in the following.
      </p>
      <p>
        Since there are three end-marker symbols in [
        <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
        ], link is recursively called on each
of them, giving four additional intervals [
        <xref ref-type="bibr" rid="ref1 ref1">1, 1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ] (in orange in the table), [
        <xref ref-type="bibr" rid="ref5 ref5">5, 5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Note
that [
        <xref ref-type="bibr" rid="ref5 ref5">5, 5</xref>
        ] is obtained from the recursive call on [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], since [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] = $7, which corresponds
to the empty string in 4 of Eq. (1). The intervals are thus merged into one: [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]. Thus
 = [[
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ], [
        <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
        ]].
      </p>
      <p>
        The next step is to update each interval, searching for  [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] = . Interval [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ] gives [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ]
and interval [
        <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
        ] gives [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. Thus  = [[
        <xref ref-type="bibr" rid="ref14 ref9">9, 14</xref>
        ]].
      </p>
      <p>
        Again, each end-marker symbol in the intervals is checked, giving [
        <xref ref-type="bibr" rid="ref1 ref1">1, 1</xref>
        ] (in green) and [
        <xref ref-type="bibr" rid="ref5 ref5">5, 5</xref>
        ].
Finally,  [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] =  is searched, obtaining five occurrences of  , starting at positions 1, 5, 9,
10 and 13 of .
      </p>
      <p>The positions in the original ED string can be recovered using the LF-mapping on each of
the resulting positions until reaching an end-marker symbol. This gives the string index, the
degenerate symbol index and the position in the string, which are (1, 1, 6), (2, 1, 2), (2, 2, 1),
(3, 1, 2), (3, 1, 9).</p>
      <p>
        Experiments. In order to evaluate our backward search applied to edsbwt(), we
implemented3 a prototype in C++ that builds ℒ and then solves the pattern matching problem on
an EDS. Since the EBWT is a well-known structure in string algorithms, we used existing tools
to build the ebwt(). Note that the eficient construction of the EBWT has been the subject of
extensive research (see for instance [
        <xref ref-type="bibr" rid="ref12 ref35 ref36 ref37 ref38 ref39">12, 35, 36, 37, 38, 39</xref>
        ]), that is beyond the goal of this paper.
Therefore, the time needed to build the ebwt() depends on the tool used, and the resources
available. Moreover, since our pattern matching strategy is of-line (it builds edsbwt()), the
cost to pre-process , which however took a couple of seconds for the dataset with the longest
EDS in our experiments, can be amortized for all the pattern searches.
      </p>
      <p>
        In order to show the efectiveness of our pattern matching method, we have considered
two existing tools, edsm [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and eds_search [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], that solve the elastic-degenerate string
matching problem in an on-line manner by taking as input an ED string on a biologic alphabet.
In particular, edsm takes an ED string and a solid pattern  , it builds the sufix tree of  and
scan the ED string left-to-right to return the indices of the degenerate symbols at which each
occurrence of  ends; while eds_search, given as input an ED string and a solid pattern  ,
performs an on-line backward pattern matching by combining the traditional algorithms BNDM
and SHIFT-AND, and returns the number of positions in which an occurrence of  starts.
      </p>
      <p>
        Note that there exist other tools for the pattern matching on closely-related sequences,
but they encode the similar strings with data structures diferent from ED strings, see for
instance [
        <xref ref-type="bibr" rid="ref28 ref29 ref31 ref40">31, 29, 28, 40</xref>
        ].
      </p>
      <p>The experiments were conducted on a DELL PowerEdge R630 machine, 24-core, with Intel(R)
Xeon(R) CPU E5-2620 v3 at 2.40 GHz, with 128 GB of shared memory. The system is Ubuntu
14.04.2 LTS.</p>
      <p>
        We tested our tool on the synthetic elastic degenerate strings used in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which are 5
randomly generated strings of lengths 100000, 200000, 400000, 800000 and 1600000. We
searched 40 patterns of lengths 8, 16, 32 and 644 (ten patterns each). The patterns were obtained
by selecting random parts of the ED string and extracting a substring of the desired length, so
that each pattern would occur at least once. This is important, since the absence of a pattern
stops prematurely the execution of the algorithm, giving distorted timings.
      </p>
      <p>Table 4 shows that the performance for all datasets of edsBWTSearch is comparable to the
one of edsm. The tool eds_search is always the fastest, but it does not return the positions
of the pattern occurrences. Excluding the pre-processing time needed to build edsbwt(),
edsBWTSearch is always faster than edsm; however, the pre-processing time for the largest
dataset is 4.72 seconds that, if added to the edsBWTSearch time of 2.69 seconds, gives 7.41
seconds. Finally, we note that edsBWTSearch is implemented in semi-external memory and
stores on disk part of the index edsbwt(). In this way, edsBWTSearch is the tool that uses
the least RAM on the largest dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion, discussion and further work</title>
      <p>In this work, we introduced a new transform EDS-BWT inspired by the BWT of a string and the
EBWT of a string collection. The EDS-BWT permutes the letters of the strings in  associated
with an ED string by exploiting the () and also returns a function ℒ that allows to
link the strings of a degenerate symbol to the strings of the previous one. The introduced
transform works over any alphabet and it does not concatenate strings in , but keeps track of
links between strings of consecutive degenerate symbols. Moreover, it allows, like the BWT on
a string, to build an index on which to perform pattern searches.</p>
      <p>We observe that the time and space usage for computing EDS-BWT depends mainly on the
construction of ebwt(), because the construction of ℒ can be achieved by simply reading
 and producing the bit vector (). Since there are several tools for building ebwt(), one
can choose the implementation that best suits their own resources.</p>
      <p>The pattern matching algorithm involves  iterations, where  is the length of the pattern
 . It is easy to see that the first iteration may obtain at most  + 1 intervals, because, as for
the classical backward search, edsBWTSearch produces a single interval for the LF-mapping
of letter  [], while ℒ can return at most one interval for each degenerate symbol (see
Definition 3.3). In the same fashion, at most  intervals can be added at each subsequent</p>
      <sec id="sec-5-1">
        <title>4eds_search does not support patterns of length greater than 32.</title>
        <p>¯

100000
200000
400000
800000
1600000
iteration, one for each degenerate symbol of the EDS. Therefore the worst case is to add 
intervals at each iteration, for a total of  + 1. However, this is a theoretical upper bound that
do not take into account the interval merging. Indeed, in the above case, the  intervals of each
iteration would be merged into one, giving just  intervals, which is far from the worst case.
Accounting for merging, the worst case occurs when all the intervals are non consecutive. Thus,
the maximum amount of intervals added at each iteration is ⌈︀ 2 ⌉︀ , one every two degenerate
symbols. Since we have  iterations, this sums up to ⌈︀  ⌉︀  + 1 total maximum intervals.
2</p>
        <p>During our experiments, we observed the number of intervals to be strictly decreasing after
the first iteration. We believe this not to be a coincidence and plan to investigate the matter
in subsequent work. Finally, we note that it is possible to reduce the number of intervals by
building a partial index for patterns of small lengths.</p>
        <p>Another future direction of work is to show that, like the EBWT, the EDS-BWT is dynamic,
meaning that we can add/remove a string to a degenerate symbol without needing to rebuild
the entire transform, but suitably adding/removing the symbols and updating ℒ.</p>
        <p>Moreover, we aim to show that EDS-BWT also allows us to search for multiple patterns at
the same time.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Work partially supported by the MUR PRIN 2022YRB97K PINC, by INdAM - GNCS Project, codice
CUP_E53C23001670001, “Compressione, indicizzazione, analisi e confronto di dati biologici”,
and by PNRR - M4C2 - Investimento 1.5, Ecosistema dell’Innovazione ECS00000017 - “THE
Tuscany Health Ecosystem” - Spoke 6 “Precision medicine &amp; personalized healthcare”, funded
by the European Commission under the NextGeneration EU programme.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Burrows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wheeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Block</given-names>
            <surname>Sorting Lossless Data Compression Algorithm</surname>
          </string-name>
          ,
          <source>Technical Report 124</source>
          , Digital Equipment Corporation,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>The Burrows-Wheeler Transform between Data Compression and Combinatorics on Words</article-title>
          , in: CiE, volume
          <volume>7921</volume>
          LNCS
          <string-name>
            <surname>of</surname>
            <given-names>LNCS</given-names>
          </string-name>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          , G. Manzini,
          <article-title>Opportunistic data structures with applications</article-title>
          , in: FOCS, IEEE Computer Society,
          <year>2000</year>
          , pp.
          <fpage>390</fpage>
          -
          <lpage>398</lpage>
          . doi:
          <volume>10</volume>
          .1109/SFCS.
          <year>2000</year>
          .
          <volume>892127</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mäkinen</surname>
          </string-name>
          , G. Navarro,
          <article-title>Succinct sufix arrays based on run-length encoding</article-title>
          ,
          <source>Nordic J. of Computing</source>
          <volume>12</volume>
          (
          <year>2005</year>
          )
          <fpage>40</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mäkinen</surname>
          </string-name>
          , G. Navarro,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sirén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Välimäki</surname>
          </string-name>
          ,
          <article-title>Storage and retrieval of highly repetitive sequence collections</article-title>
          ,
          <source>J. Comput. Biol</source>
          .
          <volume>17</volume>
          (
          <year>2010</year>
          )
          <fpage>281</fpage>
          -
          <lpage>308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gagie</surname>
          </string-name>
          , G. Navarro,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prezza</surname>
          </string-name>
          ,
          <article-title>Fully functional sufix trees and optimal text searching in bwt-runs bounded space</article-title>
          ,
          <source>J. ACM</source>
          <volume>67</volume>
          (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .1145/3375890.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          ,
          <article-title>Fast and accurate long-read alignment with Burrows-Wheeler transform</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>26</volume>
          (
          <year>2010</year>
          )
          <fpage>589</fpage>
          -
          <lpage>595</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/btp698.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Guerrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Conte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grossi</surname>
          </string-name>
          , G. Liti,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosone</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Tattini,</surname>
          </string-name>
          <article-title>phyBWT2: phylogeny reconstruction via eBWT positional clustering</article-title>
          ,
          <source>Algorithms Mol. Biol</source>
          .
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <article-title>11</article-title>
          . doi:
          <volume>10</volume>
          . 1186/S13015-023-00232-4.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Simpson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Durbin</surname>
          </string-name>
          ,
          <article-title>Eficient construction of an assembly string graph using the FM-index</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>26</volume>
          (
          <year>2010</year>
          )
          <fpage>367</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Guerrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Louza</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Rosone, Parallel lossy compression for large fastq files</article-title>
          ,
          <source>in: Biomedical Engineering Systems and Technologies</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mantaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          , G. Rosone,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>An extension of the Burrows-Wheeler Transform</article-title>
          ,
          <source>Theor. Comput. Sci</source>
          .
          <volume>387</volume>
          (
          <year>2007</year>
          )
          <fpage>298</fpage>
          -
          <lpage>312</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.tcs.
          <year>2007</year>
          .
          <volume>07</volume>
          .014.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>M. J. Bauer</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          <string-name>
            <surname>Cox</surname>
          </string-name>
          , G. Rosone,
          <article-title>Lightweight algorithms for constructing and inverting the BWT of string collections</article-title>
          ,
          <source>Theor. Comput. Sci</source>
          .
          <volume>483</volume>
          (
          <year>2013</year>
          )
          <fpage>134</fpage>
          -
          <lpage>148</lpage>
          . Source code: https://github.com/BEETL/BEETL.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Manzini,</surname>
          </string-name>
          <article-title>An experimental study of a compressed index</article-title>
          ,
          <source>Information Sciences 135</source>
          (
          <year>2001</year>
          )
          <fpage>13</fpage>
          -
          <lpage>28</lpage>
          . doi:
          <volume>10</volume>
          .1016/S0020-
          <volume>0255</volume>
          (
          <issue>01</issue>
          )
          <fpage>00098</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>T. C. P.-G. Consortium</surname>
          </string-name>
          ,
          <article-title>Computational pan-genomics: status, promises and challenges</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>19</volume>
          (
          <year>2016</year>
          )
          <fpage>118</fpage>
          -
          <lpage>135</lpage>
          . doi:
          <volume>10</volume>
          .1093/bib/bbw089.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          ,
          <article-title>Eficient pattern matching in elastic-degenerate texts</article-title>
          ,
          <source>in: 11th International Conference on Language and Automata Theory and Applications (LATA)</source>
          , volume
          <volume>10168</volume>
          of
          <string-name>
            <surname>Springer</surname>
            <given-names>LNCS</given-names>
          </string-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alzamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A. K.</given-names>
            <surname>Ayad</surname>
          </string-name>
          , G. Bernardini,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          , G. Rosone, Comparing degenerate strings,
          <source>Fundam. Informaticae</source>
          <volume>175</volume>
          (
          <year>2020</year>
          )
          <fpage>41</fpage>
          -
          <lpage>58</lpage>
          . doi:
          <volume>10</volume>
          .3233/FI-2020-
          <year>1947</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Grossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          , C. Liu,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Retha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vayani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Versari</surname>
          </string-name>
          , et al.,
          <article-title>On-line pattern matching on similar texts</article-title>
          ,
          <source>in: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching (CPM)</source>
          , volume
          <volume>78</volume>
          ,
          <string-name>
            <surname>Schloss</surname>
            <given-names>DagstuhlLeibniz</given-names>
          </string-name>
          <source>-Zentrum für Informatik GmbH</source>
          ,
          <year>2017</year>
          , p.
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          ,
          <article-title>Eficient pattern matching in elastic-degenerate strings</article-title>
          ,
          <source>Information and Computation</source>
          <volume>279</volume>
          (
          <year>2021</year>
          )
          <article-title>104616</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.ic.
          <year>2020</year>
          .
          <volume>104616</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gawrychowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          , G. Rosone,
          <article-title>Elastic-degenerate string matching via fast matrix multiplication</article-title>
          ,
          <source>SIAM Journal on Computing</source>
          <volume>51</volume>
          (
          <year>2022</year>
          )
          <fpage>549</fpage>
          -
          <lpage>576</lpage>
          . doi:
          <volume>10</volume>
          .1137/20M1368033.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Procházka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cvacho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Krčál</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holub</surname>
          </string-name>
          ,
          <article-title>Backward pattern matching on elasticdegenerate strings</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>4</volume>
          (
          <year>2023</year>
          )
          <fpage>442</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Soldano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Viari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Champesme</surname>
          </string-name>
          ,
          <article-title>Searching for flexible repeated patterns using a non-transitive similarity relation</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>16</volume>
          (
          <year>1995</year>
          )
          <fpage>233</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Soldano</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Carpentier, Incremental inference of relational motifs with a degenerate alphabet</article-title>
          ,
          <source>in: CPM</source>
          , volume
          <volume>3537</volume>
          of
          <string-name>
            <surname>Springer</surname>
            <given-names>LNCS</given-names>
          </string-name>
          ,
          <year>2005</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pisanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Soldano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pothier</surname>
          </string-name>
          ,
          <article-title>A relational extension of the notion of motifs: Application to the common 3d protein substructures searching problem</article-title>
          ,
          <source>Journal of Computational Biology</source>
          <volume>16</volume>
          (
          <year>2009</year>
          )
          <fpage>1635</fpage>
          -
          <lpage>1660</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Abrahamson</surname>
          </string-name>
          ,
          <article-title>Generalized string matching</article-title>
          ,
          <source>SIAM Journal of Computing</source>
          <volume>16</volume>
          (
          <year>1987</year>
          )
          <fpage>1039</fpage>
          -
          <lpage>1051</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kociumaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Radoszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rytter</surname>
          </string-name>
          , T. Walen,
          <article-title>Covering problems for partial words and for indeterminate strings</article-title>
          ,
          <source>Theoretical Computer Science</source>
          <volume>698</volume>
          (
          <year>2017</year>
          )
          <fpage>25</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Iliopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Radoszewski</surname>
          </string-name>
          ,
          <article-title>Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties</article-title>
          ,
          <source>in: 27th Annual Symposium on Combinatorial Pattern Matching (CPM)</source>
          , volume
          <volume>54</volume>
          of LIPIcs,
          <year>2016</year>
          , pp.
          <volume>8</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          :
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Daykin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Groult</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guesnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lecroq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Léonard</surname>
          </string-name>
          , L. Mouchard, É. Prieur,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <article-title>Eficient pattern matching in degenerate strings with the Burrows-Wheeler transform</article-title>
          ,
          <source>Information Processing Letters</source>
          <volume>147</volume>
          (
          <year>2019</year>
          )
          <fpage>82</fpage>
          -
          <lpage>87</lpage>
          . doi:
          <volume>10</volume>
          .1016/ j.ipl.
          <year>2019</year>
          .
          <volume>03</volume>
          .003.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Na</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          , T. Lecroq,
          <string-name>
            <given-names>M.</given-names>
            <surname>Léonard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mouchard</surname>
          </string-name>
          , K. Park,
          <article-title>FM-index of alignment: A compressed index for similar strings</article-title>
          ,
          <source>Theoretical Computer Science</source>
          <volume>638</volume>
          (
          <year>2016</year>
          )
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maciuca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. del Ojo</given-names>
            <surname>Elias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McVean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <article-title>A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference</article-title>
          , in: Algorithms in Bioinformatics, Springer International Publishing, Cham,
          <year>2016</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. K.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yiu</surname>
          </string-name>
          ,
          <article-title>Indexing similar DNA sequences</article-title>
          ,
          <source>in: Algorithmic Aspects in Information and Management</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2010</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Popic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batzoglou</surname>
          </string-name>
          ,
          <article-title>Short read alignment with populations of genomes</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>29</volume>
          (
          <year>2013</year>
          )
          <fpage>i361</fpage>
          -
          <lpage>i370</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/btt215.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Büchler</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Ohlebusch,</surname>
          </string-name>
          <article-title>An improved encoding of genetic variation in a Burrows-Wheeler transform</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1413</fpage>
          -
          <lpage>1419</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/ btz782.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Beller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Petri, From theory to practice: Plug and play with succinct data structures</article-title>
          ,
          <source>in: 13th International Symposium on Experimental Algorithms</source>
          ,
          <source>(SEA</source>
          <year>2014</year>
          ),
          <year>2014</year>
          , pp.
          <fpage>326</fpage>
          -
          <lpage>337</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -07959-2_
          <fpage>28</fpage>
          , source code: https: //github.com/simongog/sdsl-lite.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. L.</given-names>
            <surname>Gørtz</surname>
          </string-name>
          , T. Stordalen,
          <article-title>Rank and select on degenerate strings</article-title>
          ,
          <source>in: 2024 Data Compression Conference (DCC)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>283</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Egidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Louza</surname>
          </string-name>
          , G. Manzini,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Telles</surname>
          </string-name>
          ,
          <article-title>External memory BWT and LCP computation for sequence collections with applications</article-title>
          ,
          <source>Algorithms Mol. Biol</source>
          .
          <volume>14</volume>
          (
          <year>2019</year>
          ) 6:
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          :
          <fpage>15</fpage>
          . doi:
          <volume>10</volume>
          .1186/s13015-019-0140-0.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Della</given-names>
            <surname>Vedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pirola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Previtali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <article-title>Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array</article-title>
          ,
          <source>Journal of computational biology 26</source>
          (
          <year>2019</year>
          )
          <fpage>948</fpage>
          -
          <lpage>961</lpage>
          . doi:
          <volume>10</volume>
          .1089/cmb.
          <year>2018</year>
          .
          <volume>0230</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Louza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Telles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prezza</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Rosone, gsufsort: constructing sufix arrays, LCP arrays and BWTs for string collections</article-title>
          ,
          <source>Algorithms Mol. Biol</source>
          .
          <volume>15</volume>
          (
          <year>2020</year>
          )
          <article-title>18</article-title>
          . doi:
          <volume>10</volume>
          .1186/s13015-020-00177-y.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>N.</given-names>
            <surname>Prezza</surname>
          </string-name>
          , G. Rosone,
          <article-title>Space-eficient construction of compressed sufix trees</article-title>
          ,
          <source>Theoretical Computer Science</source>
          <volume>852</volume>
          (
          <year>2021</year>
          )
          <fpage>138</fpage>
          -
          <lpage>156</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.tcs.
          <year>2020</year>
          .
          <volume>11</volume>
          .024.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>D.</given-names>
            <surname>Díaz-Domínguez</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Navarro, Eficient Construction of the BWT for Repetitive Text Using String Compression</article-title>
          ,
          <source>in: CPM</source>
          <year>2022</year>
          , volume
          <volume>223</volume>
          of LIPIcs,
          <year>2022</year>
          , pp.
          <volume>29</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          :
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sirén</surname>
          </string-name>
          ,
          <article-title>Indexing variation graphs</article-title>
          ,
          <source>in: Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>27</lpage>
          . doi:
          <volume>10</volume>
          .1137/1.9781611974768.2.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>