<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Theoretical Computer Science, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Comparison Between Similarity Measures Based on Minimal Absent Words: An Experimental Approach⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppa Castiglione</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabrina Mantaci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore L. Pizzuto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Restivo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Matematica e Informatica, Università degli studi di Palermo</institution>
          ,
          <addr-line>Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>In this paper we make some experimental considerations on the sets (, ),  ()△ (),  () ∪  () involving minimal absent words of two words  and . This study is motivated by the computation of distances based on these sets. It is well-known that sequence comparison finds many applications in comparative genomics for the study of evolutions, for building phylogenies, for comparing virus genomes. Besides the traditional methods based on alignment, that consider only local mutations in biological sequences, recently many alignment-free methods have been introduced, in order to consider also global mutations (see [1] for a survey). Some of them compare two sequences by counting their factors frequencies since, intuitively, the more similar two sequences are, the greater it is the number of the factors they share. Other methods use data compression considerations, based on the intuition that the more similar two sequences are, the more efective their joint compression is than their independent compression. A third class of method generalizes the definition of sequences alignment, where the basic edit operation on characters are integrated with edit operations on blocks of characters. In the context of alignment free methods, in recent years a new class of methods consider the concept of minimal absent word, based on the idea that the negative information well represents the sequence itself, hence two sequences can be compared by comparing the relative sets of minimal absent words. The advantages of this approach are that the set of minimal absent words uniquely characterizes the sequence (cf. [2]), the number of minimal absent words of</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Minimal absent words</kwd>
        <kwd>alignment free distances</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        a sequence of length  is linear in  (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), they can be computed in linear time [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As a
consequence, it is possible to compare two sequences in time proportional to their lengths.
      </p>
      <p>
        An experimental study of diferent distance measures based on minimal absent words to
analyze similarity/dissimilarity of sequences has been carried out in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In [6] Chairungsee and Crochemore introduced a measure of similarity between two sequences
 and  making use of a length-weighted index on the symmetric diference  ()△ () of
the sets of minimal absent words  () and  () of  and , respectively. In the same paper,
the authors propose to evaluate the length-weighted index on a sample set, i.e. the subset of
 ()△ () of words of limited length ℓ. Further developments and an extension of the ideas
of [6] can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] a new similarity measure between sequences, based on minimal absent words, has been
introduced with the aim to deepen a theoretical comparison with the measures in [6] and [8].
The flaw of the distance in [ 6] is that the set  ()△ () could contain words that are absent
both in  and in , although they are minimal only for one of them. In our opinion, if the aim is to
distinguish  and  it is not appropriate to consider such words. Hence, we propose to evaluate
the length-weighted index on the sample set (, ) = ( () ∩  ()) ∪ ( () ∩  ()),
where  () (resp.  ()) denotes the set of factors of  (resp. ). The set (, ) contains
words that are minimal absent in one of the two words ( or ), but that are factors of the other
one. In our proposal, only the words of (, ) really contribute to distinguish  and .
      </p>
      <p>
        Independently in [
        <xref ref-type="bibr" rid="ref7">9, 10</xref>
        ] a similar idea has been used for comparing a set of words  , called a
target, against a set of words , called a reference by defining a  -specific word as a factor  of a
word in  that is not a factor of any word of  and such that any proper factor of  is a factor
of some word of . An algorithm for computing target specific words, whose construction is
based on a generalization of sufix automata, is also proposed. Finally, in [
        <xref ref-type="bibr" rid="ref8">11</xref>
        ] a generalization
of  ()△ () for multiple strings is given.
      </p>
      <p>From the algebraic point of view, the set (, ) is the base of the ideal generated by
 ()△ (), hence (, ) contains only those words of  ()△ () that do not have a
proper factor in the same set. For this reason, in general, (, ) has far fewer elements than
 ()△ () and (, ) contains words among the shortest of  ()△ (). This choice,
from a practical point of view, has a potential advantage in terms of computation time. Although
we do not yet have an algorithm for generating the set (, ) without considering all the
words in  () ∪  (), we are confident that a more direct approach for this calculation can
be introduced.</p>
      <p>The experiments shown in this paper aim to provide measurements on how smaller the set
(, ) is, compared to  ()△ (), and how shorter the words in (, ) are, compared to
the ones in  ()△ ().</p>
      <p>The paper is organized as follows: in Section 2 we give some notations and recall the definition
of minimal absent word. In Section 3 we recall the similarity measures based on absent words.
In Section 4 we comment on some experiments that aim to evaluate the amount of data needed
to compute the two distances, that are highlighted in some graphs and tables.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Definitions and notations</title>
      <p>Let Σ be a finite alphabet and Σ* the set of the words over Σ. If  ∈ Σ* , || denotes its length. If
 ⊂ Σ* , || denote its cardinality, i.e. the number of its elements, whereas () = ∑︀∈ ||
is the total length of . A set  ⊆ Σ* is said to be a (two-sided) ideal of Σ* if for  ∈  and
 ∈ Σ* , then ,  ∈ , i.e.  = Σ* Σ* . The base of the ideal  is the minimal set  (with
respect to the set inclusion) such that  = Σ* Σ* . Let  be a word of Σ* , we say that  is a
factor of  if there exist ,  ∈ Σ* such that  = . In what follows we denote by  () the
set of factors of . A word  occurs in  if it is a factor of .</p>
      <p>A word  is an absent word for  if it does not occur in . An absent word is a minimal absent
word (or MAW) for a word  if all its proper factors occur in . We denote by  () the set of
minimal absent words of . For instance if  = , then  () = {, , , }.</p>
      <p>A language  ⊆ Σ* is called factorial if it contains all the factors of its own words, whereas it is
called antifactorial if no word in the language is a proper factor of another word in the language.
In particular, for any word  ∈ Σ* ,  () is a factorial language and  () is antifactorial.</p>
      <p>
        Remark that the complement of  () (i.e. the set of the words that are not factors of ) is an
ideal of Σ* and  () is its base. This allows to establish a duality between the sets  () and
 () given by the relations (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]):
 () = Σ* ∖ Σ*  ()Σ* ,
      </p>
      <p>() = Σ () ∩  ()Σ ∩ (Σ* ∖  ()).</p>
      <p>This last relation comes from the fact that if  ∈ Σ* , the word  = 1 · · · , with  ∈ Σ is a
MAW for  if  ∈/  () and 1 · · · − 1, 2 · · ·  ∈  ().</p>
    </sec>
    <sec id="sec-3">
      <title>3. Similarity measures based on sets of minimal absent words</title>
      <p>The idea to measure similarity by minimal absent words is based on the intuition that two
words,  and , are as more distant as bigger is the set of the non common absent words and
as shorter are the words in it. This idea was first formalized in a paper by Chairungsee and
Crochemore [6] where the notion of length weighted index of a set is used in order to define a
dissimilarity measure of two sequences. The length weighted index is defined as the measure
that associates to a set  ⊆ Σ* the quantity  () = ∑︀∈ |1|2 .</p>
      <p>This measure is used in [6] in order to define the distance function dist between two words
 and , by taking the set  =  ()△ (), where △ denotes the symmetric diference
operator between two sets. Therefore the distance is defined as:
dist(, ) =  ( ()△ ()) =
∑︁</p>
      <p>1
∈()△() ||2
We remark that dist(, ) is not substantially afected by long minimal absent words. This is why
in [6] the authors propose to ignore from  ()△ () those words with length longer than a
ifxed threshold ℓ, and define a distance distℓ as the length weighted index over ℓ()△ℓ(),
where ℓ() (ℓ(), resp.) denotes the set of MAWs of  (, resp.) with length smaller than or
equal to ℓ.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] a diferent distance also based on the measure  is considered, but applied to a subset of
 ()△ () that better captures the diference between two words. Moreover, by considering
this subset, the requirement of having words with limited length is undirectely satisfied. This
subset of  ()△ () is in fact made of those factors of  that are minimal absent words for
 and viceversa. In other terms, we want the comparison of the two sequences  and  not to
be influenced by those minimal absent words of  that are absent (but not minimal) also for .
This idea is formally described as follows. For all ,  ∈ Σ* we define
      </p>
      <p>(, ) = ( () ∩  ()) ∪ ( () ∩  ()).</p>
      <p>
        The following theorem summarizes some algebraic properties of (, ) also in relation with
 ()△ () proved in [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] (Lemma 4.1 and Theorem 4.3). Note that, in general,  ()△ ()
is not antifactorial and Σ* ( ()△ ())Σ* is an ideal.
      </p>
      <p>Theorem 1. For all ,  ∈ Σ*
1. (, ) = ∅ if and only if  = .
2. (, ) ⊆  ()△ ().
3. (, ) is antifactorial.</p>
      <p>4. (, ) is the base of the ideal Σ* ( ()△ ())Σ* .</p>
      <p>Point 4 of Theorem 1 states that considering (, ) is equivalent to ignore, in  ()△ (),
those words that have a proper factor in the same set. Therefore one can define a distance based
on the length weighted index applied to (, ):
 (, ) =  ((, )) =
∑︁</p>
      <p>1
∈(,) ||2
We remark that as in the case of distℓ, the distance  takes into consideration elements
among the shortest of  ()△ () because they are elements of the base of the ideal
Σ* ( ()△ ())Σ* .</p>
      <p>Example 1. Let  =  and  =  words over Σ = {, , , }. Then,
 () = {, , , , , , , , , , , , , , }
 () = {, , , , , , , , },
 () ∪  () = {, , , , , , , , , , , , , , , , , },
 ()△ () = {, , , , , , , , , , , },
(, ) = {, , }.</p>
      <p>Remark that the word , for instance, is absent both in  and in  (although not minimal in ) so,
in some way, it represents a common property of the two words, and it should not be considered
as a contribution to the distance. The same holds for the words , , , , , , , and
. On the other hand, the word , for instance, is a minimal absent word in , but occurs in
 and therefore discriminates the two words. Viceversa, the word  is minimal absent in  but
occurs in  i.e. it also contributes to their dissimilarity. In Example 1, the cardinality of the set
(, ) is much smaller than the one of  ()△ (), and the words in (, ) are among the
smallest in  ()△ (). Finally:</p>
      <p>
        7 3 1 453 1 3
dist(, ) = 1 + 4 + 9 + 16 = 144 ≈ 3.1  (, ) = 1 + 2 = 2 = 1.5
4. Experimental results on the (, ) set
In the previous section we have observed that the set (, ) is the base of the ideal
Σ* ( ()△ ())Σ* and then it is likely to have a smaller cardinality and that involves the
words among the shortest. Actually, in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], due to computational reasons, the distance distℓ is
considered instead of the distance dist, but the authors do not give any motivation on how they
choose the goode value of ℓ. Moreover, some experiments that will appear in [
        <xref ref-type="bibr" rid="ref9">12</xref>
        ] show that
the  and the distℓ distances behave in a similar way on biological datasets with respect to the
generated taxonomies.
      </p>
      <p>Having an idea about the quantities involved could be interesting for the computation of 
and dist, whose computational complexity depends on the computation of the sets (, ) and
 ()△ (), respectively. Therefore it is worth to see how much smaller |(, )| is, w.r.t.
| ()△ ()| and | () ∪  ()|.</p>
      <p>It is also interesting to consider and compare the total lengths  of the three sets.</p>
      <p>With these motivations here we present some experimental results. Our first experiments on
this topic is performed by exploring sets (, ),  ()△ (),  ()∪ () on a 41 mammals
mitochondrial DNA (or mtDNA) benchmark dataset
(https://github.com/NaserAnjum21/CDMAWS/tree/master/Data). The sequences in this dataset are approximately 17000 bases long.
30000
25000
s20000
t
eeeonub
m
l
fe15000
r
m
n10000
5000
dmaw
scmaw
union
()∪ (( )) ∪  (,r)e,srpeescpteivcetilvye,lwy,itwhit,h ∈,  ∈Σ,.Σ,.</p>
      <p>This table shows, for each dataset  , ()△( )
This table shows, for each dataset ΣΣ,,,, tthheemmoossttrerepprerseesnetnetdedlenlegnthgtohf MofAMWAs Winsin((,, )),  ()△ (),
size and the sequences lengths).
(,  ),  ()△ ( ) and  () ∪  ( )
diFsitgriubruet1iosnumofmMaAriWzeslethnegtrhessuinltsthceonthcererneisnegtst.hAedfteristthriabtuwtioenwoafnMdeArWed loenngwthhsaitnwtohuedthlhreaepspeetns
ifwthheersea m ecoerxrpesepriomnednsttsowheurmeapne’rsfaonrmdedtoongorrainllda’osmmsttDrNinAgs.,Tihneoerxdpeerrtiomiennftesrofnroontheexrpperaiimrseonfts
cosmpebciineastgoirviaelspi mroiplaerrtciuersvoefs.the sets. Then, in order to compare the results with those biological
strinAgsn,awtueraplrqoudeusctieodnais8t5o0a0s-klownghartahnadpopmenlys igfetnhersaatmede setxrpinergismoentas a4r-eleptteerrfosramlpehdaobnertadnadtoamset,
(, o)f the asnetds. (T)△h (e )n, in ordtoerthtoe cvoamlupeasreofthtweroespualrtasmweitehrsth:otsheeoanlpbhiaobloegticsaizlestarnindgtsh,ewdeaptraos-et
wdourdcesdlean1g7th00.0-long randomly generated strings on a 4-letters alphabet dataset, whose results
aIrne odrisdpelraytoedruinn Fthigeuerxep2e.rWimeenarte, winetegreensteerdatteodsstoumdyetrhaensdeonmsitdivatitayseotfs tthoewseotrskon(.,Fo)raenadch
alph(ab)e△tsiz(eΣ) t=o 2th, e4,v8alauneds foofrt weaocphawraomrdetseorsf: ltehnegathlph=ab8e5t0s0ize, a1n7d00th0,e3d4a0t0a0se6t8w0o00rd,s13le6n0g0t0h,.a
dataIsneto rder toof rruanndtohmeesxtprienrgi mshenast,bweeengepnroedrautceedds.o(mi.ee. wraenditoemratdivaetalysedtosutoblwedorthkeoanl.pFhoarbeetacshize
Σ,
anadlpthhaebesteqsiuzeenΣce=s l2e,n4g,t8hsa)n.d for each words-length  = 8500, 17000, 34000 68000, 136000, a
dTathaesnetforΣe,achofpraainrd, o m∈  strings has been produced. (i.e. we iteratively doubled the alphabet
, we have computed the distribution of the MAWs lengths in
Σ,
summarized in the histograms in Figures 3, 4, 5 where two random sequences , 
diferent sets are shown. We observe that:
histograms in Figures 3, 4, ∈5 whΣe,re, twweohraavnedcoommpseuqteude nthceesdi, stribution of the</p>
      <p>Then for each pair ,  MAWs lengths
∈  (with diferent values
Σ,
in (, ),  ()  () and  ()  (). The results of some of these experiments are
of |Σ| and  ) are co△nsidered and the c∪orresponding distributions of the MAWs lengths in the
∈ Σ,
. The results of these experiments are summarized in the
(with diferent values of
Σ and ) are considered and the corresponding distributions of the
| |
MA•WTshleenvgatlhusesinf othrethdeifetrhenretesestestasraersehdoiwstnr.ibWueteodbosenrvaebtehlalts:haped curve and the values are
nonzero in a small interval.</p>
      <p>• The values for the three sets are distributed on a bell shaped curve and the values are
• The maximum for  (,  ) approximates log
nonzero in a small interval.</p>
      <p>|Σ|</p>
      <p>|| . This observation is coherent to a result
 () ∪  ( )</p>
      <p>
        (see Table 1).
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], stating that for a randomly generated word  with a memoryless source and
• The maximum for (, ) approximates log  . This observation is coherent to a
identical symbol probability, the maximal lengt|Σh| o|f a minimal absent word is ( log ||) .
      </p>
      <p>
        |
result in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], stating that for a randomly generated word  with a memoryless source|Σa|nd
This value appears to be always one unity less than the maximum for  ()△ ( )
identical symbol probability, the maximal length of a minimal absent word is (log
and
 ).
|Σ| | |
• The curve for  (,  )
is much lower than the curves for  ()△ ( )
and  () ∪  ( )
      </p>
      <p>This value appears to be always one unity less than the maximum for  ()△ () and
 () ∪  () (see Table 1).
• The curve for (, ) is much lower than the curves for  ()△ () and  () ∪  ().</p>
      <p>This intuitively means that the number of the words in (, ) is much smaller than the
ones in  ()△ ().
• The curves for  ()△ () and  () ∪  () are very close, i.e. the  ()△ ()
involve most of the MAWs of  and .</p>
      <p>Figures 3, 4 and 5 show the distributions of MAWs lengths in (, ),  ()△ () and
 () ∪  () for two random strings on diferent alphabeth sizes |Σ| and diferent string
lengths . It is easy to see, in all the dispalyes cases, how higher are the curves of length
distributions of  ()△ () and  () ∩  () compared to the one of (, ). In particular:
3500,00
3000,00
2500,00
sn
t
lfre2000,00
m
e
eo
eb1500,00
m
u
n1000,00
500,00
0,00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25</p>
      <p>
        length
• For  = 8500 and |Σ| = 2 (cf. Figure 3) the curve for (, ) has its maximum in
correspondence with length 13 (i.e. MAWs of length 13 are the most frequent in (, ))
and the frequence is around 1500, whereas the maximum for both  ()△ () and
 () ∪  () is in correspondence of length 14 with a frequence around 3000 (note that
log2 8500 = 13, 053). Nonzero values for (, ) are in the interval [
        <xref ref-type="bibr" rid="ref7">10, 18</xref>
        ] whereas for
both  ()△ () and  () ∪  () are in [
        <xref ref-type="bibr" rid="ref7">10, 24</xref>
        ].
• For  = 8500 and |Σ| = 4 (cf. Figure 4) the curve for (, ) has its maximum in
correspondence with length 7 and the frequence is around 5000, whereas the maximum for
both  ()△ () and  () ∪  () is in correspondence of length 8 with a frequence
around 11000 (note that log4 17000 = 7, 027). Nonzero values for (, ) are in the
interval [
        <xref ref-type="bibr" rid="ref5">5, 9</xref>
        ] whereas for both  ()△ () and  () ∪  () are in [
        <xref ref-type="bibr" rid="ref5 ref9">5, 12</xref>
        ].
• For  = 136000 and |Σ| = 8 (Figure 5) the curve for (, ) has its maximum in
correspondence with length 6 and the frequence is around 120000, whereas the maximum
for both  ()△ () and  ()∪ () is in correspondence of length 7 with a frequence
around 500000 (note that log8 136000 = 5, 684). Nonzero values for (, ) are in the
interval [
        <xref ref-type="bibr" rid="ref5">5, 8</xref>
        ] whereas for both  ()△ () and  () ∪  () are in [
        <xref ref-type="bibr" rid="ref5 ref7">5, 10</xref>
        ].
      </p>
      <p>The experiment, repeated on diferent sample sequences, gives similar curves and equal
maximum frequence. The curves are similar also for sample sequences taken from biological datasets
with lengths comparable to the random sequences here considered (see, for instance, Figure 1).</p>
      <p>For the investigation about cardinalities, in another experiment, for all of the pairs
,  ∈ Σ, we computed the ratios |(, )|/| ()△ ()|, |(, )|/| () ∪  ()|,
((, ))/( ()△ ()) and ((, ))/( () ∪  ()). Tables 2 and 3 show the
average of these values and the corresponding standard deviation. One can note that:
• As the cardinality of the alphabet grows, |(, )|/| ()△ ()| and
|(, )|/| () ∪  ()| decrease. This is also true w.r.t. the total lengths.
• The ratios relating to the total lengths are smaller than the corresponding ratios relating
to the cardinalities. This shows that the words in (, ) are among the smallest of the
words in  ()△ ().</p>
      <p>14000
12000
dmaw
scmaw
union
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25</p>
      <p>length
view, compared to the distance based on the symmetric diference. In fact, computing a distance
based onFi(g,u )re 5: DcisoturilbdutbioensmoofMreAeWficsielenntgtthhsainn co(m, p)u, tin(g)t△hed(is)taanndceo(n)∪ ()(△() )for 8-lettesrisnce
we wouladlphhaavbeetaansdmlaenllgetrhs1e3t6,0p0r0ovided that one can get directly the words of the set  (,  )
without explicitely producing all the words in  (),  ( ),  (),  ( ) .
TabTlaeb2le 2
ForFoeraecahchppaairirooff wwoorrddss,,  ∈∈  Σ,Σ,, t,hethraetiroastio1s(,1(),  )= =||()(△|(,)|△(),(|())||)| andan2d(,2(),  =) =((()((△()(,△(),( ))())))) , ,
respre.sph.ahvaevebbeeennccoommputedd.. AAfterftewr waradrsd,sth,ethaevearvageeravgaeluveaslues1  = 1=,∈,Σ∈, (Σ,1((1,(, )) andand 2 =2 =
  , ∈,Σ,∈(Σ2(,,())2(,r,esp))., hreasvpe.,bheaevne cboemenpcuotmedpaunteddraenpdorrteepdorintetderinmtseromfpseorfcpenertcaegnetaingecoinlucmolnusm3nasn3d 4,
respanedct4iv,reelysp,ewcittivheltyh,ewritehlatthiveeresltaatinvdeasrtdanddeavrdiadtieovniat(sio.dn.)(si.dn.)pianrpeanrtehnetshiess.isS.inSicneceththeessaannddaarrdd ddeevviiaattiioonn is
eveirsyewvehreyrwe hveerreyvsemryasllm,tahlli,stmhiseamnesatnhsatthdaattdaataarearcelucslutesrteerdedtigtihgthltylyaraorouunnddtthheemmeeaann..
TaTbalbel3e 3
FoFroeraecahchppaairir ooff wwoorrddss,,  ∈∈  Σ,Σ,, t,hethreatiroastios3(3,(,) )==||()(|∪(),|∪((,))()|||)| andand4(4,(,) )==(((()((∪)(∪,((,))))()))) , ,
resrpes.,ph., ahvaevebbeeeennccoommppuutteedd.. AAfterftewr waradrsdtshethaevearvaegreavgaeluveaslues3  = 3=,∈,∈Σ , Σ,
  ((3(3(,,)))) anadnd 4 =4 =
  ,∈,Σ,∈Σ4,(, ())4(,r,es)p)., hreasvpe. bheaevne bcoeemnpcuotmedpuanteddraenpdorrteepdoirntetderinmtseormfpseorfcepnetracgenetiangceoilnumconlusm3nasn3d 4,
(
resapnedc4ti,vrelsyp,ewctiitvheltyh,weirtehlathtievereslatatinvedastradndeavrdiadtieovniat(sio.dn.)(si.dn.)pianrepnartehnetshise.siSs.inScinectehtehesasnanddaardrdddeevviiaattion is
eveisryevwehryewrehveereryvesmryaslml, athlli,sthmiseamnesatnhsatthdatatdaataarearcelucslutesrteerdedtigtihgthlytlyaraoruounnddt htheemmeeaann..
9:186 450 (2016) 1–8.
[6] S. Chairungsee, M. Crochemore, Using minimal absent words to build phylogeny, Theor.</p>
      <p>
        Comput. Sci. 450 (2012) 109–116.
[
        <xref ref-type="bibr" rid="ref6">7</xref>
        ] G. Castiglione, S. Mantaci, A. Restivo, Some investigations on similarity measures based
on absent words, Fundam. Informaticae 171 (2020) 97–112.
[8] A. Ehrenfeucht, D. Haussler, A new distance metric on strings computable in linear time,
      </p>
      <p>
        Discrete Applied Mathematics 20 (1988) 191–203.
[9] M. Béal, M. Crochemore, Fast detection of specific fragments against a set of sequences,
volume 13911 of Lecture Notes in Computer Science, Springer, 2023, pp. 51–60.
[
        <xref ref-type="bibr" rid="ref7">10</xref>
        ] P. Bonizzoni, C. D. Felice, Y. Pirola, R. Rizzi, R. Zaccagnino, R. Zizza, Can formal languages
help pangenomics to represent and analyze multiple genomes?, in: V. Diekert, M. V. Volkov
(Eds.), Developments in Language Theory - 26th International Conference, DLT 2022,
Tampa, FL, USA, May 9-13, 2022, Proceedings, volume 13257 of Lecture Notes in Computer
Science, Springer, 2022, pp. 3–12.
[
        <xref ref-type="bibr" rid="ref8">11</xref>
        ] K. Okabe, T. Mieno, Y. Nakashima, S. Inenaga, H. Bannai, Linear-time computation of
generalized minimal absent words for multiple strings, volume 14240 of Lecture Notes in
Computer Science, Springer, 2023, pp. 331–344.
[
        <xref ref-type="bibr" rid="ref9">12</xref>
        ] S. L. Pizzuto, Similarity measures based on minimal absent words, Master’s thesis, DAMI,
University of Palermo, in preparation.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mantaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>Distance measures for biological sequences: Some recent approaches</article-title>
          ,
          <source>Int. J. Approx. Reason</source>
          .
          <volume>47</volume>
          (
          <year>2008</year>
          )
          <fpage>109</fpage>
          -
          <lpage>124</lpage>
          . URL: https://doi.org/10.1016/ j.ijar.
          <year>2007</year>
          .
          <volume>03</volume>
          .011. doi:
          <volume>10</volume>
          .1016/J.IJAR.
          <year>2007</year>
          .
          <volume>03</volume>
          .011.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mignosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>Forbidden factors and fragment assembly, RAIRO Theor</article-title>
          .
          <source>Informatics Appl</source>
          .
          <volume>35</volume>
          (
          <year>2001</year>
          )
          <fpage>565</fpage>
          -
          <lpage>577</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mignosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          , Automata and forbidden words,
          <source>Inf. Process. Lett</source>
          .
          <volume>67</volume>
          (
          <year>1998</year>
          )
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Charalampopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          , G. Fici,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mercas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          ,
          <article-title>Alignment-free sequence comparison using absent words</article-title>
          ,
          <source>Inf. Comput</source>
          .
          <volume>262</volume>
          (
          <year>2018</year>
          )
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alatabbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <article-title>Absent words and the (dis)similarity analysis of dna sequences: An experimental study</article-title>
          ,
          <source>BMC Research Notes</source>
          .
          <volume>9</volume>
          :
          <fpage>186</fpage>
          <lpage>450</lpage>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7] InG.coCnacsltuigsiloionn,eth,eSs.eMexapnetaricmi,eAn.tsRaersetiaviom, eSdotmoereimnvaersktitghaattiothnesgorneastimnuilmareirtiycamldeiafesruenrecse boafsed theodnataabdsiemnetnwsioorndisn, tFhuensedasmets.,
          <source>Imnfaokremtahteicdaiesta1n7c1e</source>
          (
          <article-title>20i2n0te)r9e7st-in11g2,f.rom a computational point of [8v]iewA,</article-title>
          .cEohmrpeanrfeeductohtt,hDe.dHistaaunscselebra, sAednoenwthdeisstyamncmeetric metrdiicferoenncset.riInngfsaccto,
          <string-name>
            <surname>mcopmuptaubtilnegina</surname>
            <given-names>dliinsteaanrcteime</given-names>
          </string-name>
          , basDedisocnrete
          <article-title>(A,pp)liceodulMd abtehmemoraetiecficsie2n0t(</article-title>
          <year>t1h9a8n8</year>
          )
          <fpage>co1m91p</fpage>
          -
          <lpage>u2t0in3g</lpage>
          .
          <article-title>the distance on  ()△ () since [w9]e Mwo.uBlédahl,aMve</article-title>
          . Carsomcahlelemrosreet,, pFraosvtiddeetdetchtiaotnonofe scpaencgificetfrdaigremcetlnytsthaegwaionrsdtsaosfetthoefsseetq
          <article-title>u(en,ce)s, without explicitely producing all the words in  (</article-title>
          ),  (),  (),  (). volume
          <volume>13911</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Felice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pirola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zaccagnino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zizza</surname>
          </string-name>
          ,
          <article-title>Can formal languages Rehfeelprepnancgeesnomics to represent and analyze multiple genomes?</article-title>
          , in: V.
          <string-name>
            <surname>Diekert</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          <string-name>
            <surname>Volkov</surname>
          </string-name>
          (Eds.),
          <source>Developments in Language Theory - 26th International Conference, DLT [1]20S2.2M</source>
          ,
          <string-name>
            <surname>Tanamtacpia</surname>
          </string-name>
          ,,AF.LR,eUstSivAo,,
          <source>MMa.ySc9io-1rt3i</source>
          ,
          <year>n2o0</year>
          ,
          <year>2D2i</year>
          ,stParnocceeemdeiansgusr,evsoflourmbieol1o3g2i5ca7l osfeqLueecntucrees:NSootmesein CormecpeunttearpSpcrioeancchee,sS,pIrnitn. gJ.eAr,
          <year>p2p0r2o2x</year>
          ,.
          <source>pRpe.a3so-n1.24. 7</source>
          (
          <year>2008</year>
          )
          <fpage>109</fpage>
          -
          <lpage>124</lpage>
          . URL: https://doi.org/10.1016/ j.ijar.
          <year>2007</year>
          .
          <year>03M</year>
          .
          <year>01ie1n</year>
          .od,
          <source>oiY: 1.0N.a1k0a1s6h/imJ.aI</source>
          ,JSA.RIn.
          <year>e2n0a0g7a</year>
          .,
          <year>0H3</year>
          ..
          <article-title>B0a1n1n.ai, Linear-time computation of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Okabe</surname>
          </string-name>
          , T. [2]geFn.eMraiglinzeodsi,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          , Fmorublitdipdleensftarcintogrss, avnodlufmraeg m14e2n4t0aossfeLmecbtluy,
          <article-title>reRNAoIRteOsin minimal absent words for Theor</article-title>
          .
          <source>Informatics Appl</source>
          .
          <volume>35</volume>
          (
          <year>2001</year>
          )
          <fpage>565</fpage>
          -
          <lpage>577</lpage>
          . Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>344</lpage>
          . [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mignosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          , Automata and forbidden words, Inf. Process. Lett.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Pizzuto</surname>
          </string-name>
          ,
          <article-title>Similarity measures based on minimal absent words, Master's thesis</article-title>
          , DAMI,
          <volume>67</volume>
          (
          <year>1998</year>
          )
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          . [4]UnPi.vCehrsairtaylampopoulos,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          , G. Fici,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mercas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Pissis</surname>
          </string-name>
          ,
          <article-title>Alignment-free of Palermo, in preparation. sequence comparison using absent words</article-title>
          ,
          <source>Inf. Comput</source>
          .
          <volume>262</volume>
          (
          <year>2018</year>
          )
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          . [5]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alatabbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crochemore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <article-title>Absent words and the (dis)similarity analysis of dna sequences: An experimental study</article-title>
          ,
          <source>BMC Research Notes.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>