
Italian Conference on Theoretical Computer Science, September 11–13, 2024

A Comparison Between Similarity Measures Based on Minimal Absent Words: An Experimental Approach

Giuseppa Castiglione

Sabrina Mantaci

Salvatore L. Pizzuto

Antonio Restivo

Dipartimento di Matematica e Informatica, Università degli Studi di Palermo, Palermo, Italy

In this paper we make some experimental considerations on the sets D(x, y), M(x) △ M(y) and M(x) ∪ M(y) involving the minimal absent words of two words x and y. This study is motivated by the computation of distances based on these sets.

Keywords: minimal absent words, alignment-free distances

1. Introduction

It is well known that sequence comparison finds many applications in comparative genomics: studying evolution, building phylogenies, and comparing virus genomes. Besides the traditional alignment-based methods, which consider only local mutations in biological sequences, many alignment-free methods have recently been introduced in order to account for global mutations as well (see [1] for a survey). Some of them compare two sequences by counting the frequencies of their factors since, intuitively, the more similar two sequences are, the greater the number of factors they share. Other methods rely on data compression, based on the intuition that the more similar two sequences are, the more effective their joint compression is compared with their independent compression. A third class of methods generalizes the definition of sequence alignment by complementing the basic edit operations on characters with edit operations on blocks of characters. Among alignment-free methods, a new class has emerged in recent years that relies on the concept of minimal absent word, based on the idea that negative information represents the sequence itself well; hence two sequences can be compared by comparing their respective sets of minimal absent words. The advantages of this approach are that the set of minimal absent words uniquely characterizes the sequence (cf. [2]), that the number of minimal absent words of a sequence of length n is linear in n (cf. [3]), and that they can be computed in linear time [4]. As a consequence, it is possible to compare two sequences in time proportional to their lengths.

An experimental study of different distance measures based on minimal absent words, used to analyze the similarity/dissimilarity of sequences, has been carried out in [5].

In [6] Chairungsee and Crochemore introduced a measure of similarity between two sequences x and y making use of a length-weighted index on the symmetric difference M(x) △ M(y) of the sets of minimal absent words M(x) and M(y) of x and y, respectively. In the same paper, the authors propose to evaluate the length-weighted index on a sample set, i.e. the subset of M(x) △ M(y) consisting of the words of length at most ℓ. Further developments and an extension of the ideas of [6] can be found in [4].

In [7] a new similarity measure between sequences, based on minimal absent words, has been introduced with the aim of deepening a theoretical comparison with the measures in [6] and [8]. The flaw of the distance in [6] is that the set M(x) △ M(y) could contain words that are absent both in x and in y, although they are minimal only for one of them. In our opinion, if the aim is to distinguish x and y, it is not appropriate to consider such words. Hence, we propose to evaluate the length-weighted index on the sample set D(x, y) = (M(x) ∩ F(y)) ∪ (M(y) ∩ F(x)), where F(x) (resp. F(y)) denotes the set of factors of x (resp. y). The set D(x, y) contains words that are minimal absent in one of the two words (x or y), but that are factors of the other one. In our proposal, only the words of D(x, y) really contribute to distinguishing x and y.

Independently, in [9, 10] a similar idea has been used for comparing a set of words T, called a target, against a set of words R, called a reference, by defining a T-specific word as a factor w of a word in T that is not a factor of any word of R and such that any proper factor of w is a factor of some word of R. An algorithm for computing target-specific words, whose construction is based on a generalization of suffix automata, is also proposed. Finally, in [11] a generalization of M(x) △ M(y) to multiple strings is given.

From the algebraic point of view, the set D(x, y) is the base of the ideal generated by M(x) △ M(y), hence D(x, y) contains only those words of M(x) △ M(y) that do not have a proper factor in the same set. For this reason, in general, D(x, y) has far fewer elements than M(x) △ M(y), and D(x, y) contains words among the shortest of M(x) △ M(y). From a practical point of view, this choice has a potential advantage in terms of computation time. Although we do not yet have an algorithm for generating the set D(x, y) without considering all the words in M(x) ∪ M(y), we are confident that a more direct approach to this calculation can be introduced.

The experiments shown in this paper aim to measure how much smaller the set D(x, y) is compared to M(x) △ M(y), and how much shorter the words in D(x, y) are compared to the ones in M(x) △ M(y).

The paper is organized as follows: in Section 2 we give some notation and recall the definition of minimal absent word. In Section 3 we recall the similarity measures based on absent words. In Section 4 we comment on some experiments, illustrated by graphs and tables, that aim to evaluate the amount of data needed to compute the two distances.

2. Definitions and notations

Let Σ be a finite alphabet and Σ* the set of the words over Σ. If w ∈ Σ*, |w| denotes its length. If S ⊂ Σ*, |S| denotes its cardinality, i.e. the number of its elements, whereas ‖S‖ = ∑_{w ∈ S} |w| is the total length of S. A set I ⊆ Σ* is said to be a (two-sided) ideal of Σ* if for w ∈ I and u ∈ Σ*, we have uw, wu ∈ I, i.e. I = Σ* I Σ*. The base of the ideal I is the minimal set B (with respect to set inclusion) such that I = Σ* B Σ*. Let w be a word of Σ*; we say that u is a factor of w if there exist p, s ∈ Σ* such that w = pus. In what follows we denote by F(w) the set of factors of w. A word u occurs in w if it is a factor of w.

A word u is an absent word for w if it does not occur in w. An absent word u is a minimal absent word (or MAW) for a word w if all its proper factors occur in w. We denote by M(w) the set of minimal absent words of w. For instance if w = , then M(w) = {, , , }.

A language L ⊆ Σ* is called factorial if it contains all the factors of its own words, whereas it is called antifactorial if no word in the language is a proper factor of another word in the language. In particular, for any word w ∈ Σ*, F(w) is a factorial language and M(w) is antifactorial.

Remark that the complement of F(w) (i.e. the set of the words that are not factors of w) is an ideal of Σ* and M(w) is its base. This makes it possible to establish a duality between the sets F(w) and M(w), given by the relations (cf. [3]):

F(w) = Σ* ∖ Σ* M(w) Σ*,

M(w) = Σ F(w) ∩ F(w) Σ ∩ (Σ* ∖ F(w)).

This last relation comes from the fact that, if w ∈ Σ*, the word u = a_1 ··· a_k, with a_i ∈ Σ, is a MAW for w if u ∉ F(w) and a_1 ··· a_{k−1}, a_2 ··· a_k ∈ F(w).
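As a minimal sketch of the definitions above, the following Python snippet computes M(w) by brute force, directly from the relation M(w) = Σ F(w) ∩ F(w) Σ ∩ (Σ* ∖ F(w)). The function names `factors` and `maws` and the sample word `abaab` are illustrative assumptions, not taken from the paper; the quadratic enumeration of factors is only meant for short words and is not the linear-time construction of [4].

```python
def factors(w):
    """All factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def maws(w, alphabet):
    """Minimal absent words of w over the given alphabet (brute force)."""
    f = factors(w)
    left = {a + u for a in alphabet for u in f}    # Sigma F(w)
    right = {u + a for u in f for a in alphabet}   # F(w) Sigma
    return (left & right) - f                      # keep only absent words


if __name__ == "__main__":
    # Hypothetical sample word, chosen only for illustration.
    print(sorted(maws("abaab", "ab"), key=lambda u: (len(u), u)))
    # prints ['bb', 'aaa', 'bab', 'aaba']
```

Including the empty word among the factors is what makes a letter that never occurs in w a MAW of length 1.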

3. Similarity measures based on sets of minimal absent words

The idea of measuring similarity through minimal absent words is based on the intuition that two words x and y are the more distant the bigger the set of their non-common absent words is, and the shorter the words in it are. This idea was first formalized by Chairungsee and Crochemore [6], where the notion of length weighted index of a set is used in order to define a dissimilarity measure between two sequences. The length weighted index is the measure that associates with a set S ⊆ Σ* the quantity LWI(S) = ∑_{w ∈ S} 1/|w|².

This measure is used in [6] to define the distance function dist between two words x and y, by taking the set S = M(x) △ M(y), where △ denotes the symmetric difference operator between two sets. Therefore the distance is defined as:

dist(x, y) = LWI(M(x) △ M(y)) = ∑_{w ∈ M(x) △ M(y)} 1/|w|²

We remark that dist(x, y) is not substantially affected by long minimal absent words. This is why in [6] the authors propose to discard from M(x) △ M(y) the words longer than a fixed threshold ℓ, and define a distance dist_ℓ as the length weighted index over M_ℓ(x) △ M_ℓ(y), where M_ℓ(x) (M_ℓ(y), resp.) denotes the set of MAWs of x (y, resp.) with length smaller than or equal to ℓ.
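The two distances just recalled can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`factors`, `maws`, `lwi`, `dist`), the `max_len` parameter standing for the threshold ℓ, and the two sample words are hypothetical, and MAWs are obtained by brute force rather than by a linear-time algorithm. Exact rational arithmetic is used so that the sums of terms 1/|w|² are not affected by rounding.

```python
from fractions import Fraction


def factors(w):
    """All factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def maws(w, alphabet):
    """Minimal absent words of w (brute force, for short words only)."""
    f = factors(w)
    left = {a + u for a in alphabet for u in f}
    right = {u + a for u in f for a in alphabet}
    return (left & right) - f


def lwi(words):
    """Length weighted index: sum of 1/|w|^2 over the words of the set."""
    return sum((Fraction(1, len(u) ** 2) for u in words), Fraction(0))


def dist(x, y, alphabet, max_len=None):
    """dist(x, y); when max_len is given, the thresholded variant dist_l."""
    mx, my = maws(x, alphabet), maws(y, alphabet)
    if max_len is not None:
        mx = {u for u in mx if len(u) <= max_len}
        my = {u for u in my if len(u) <= max_len}
    return lwi(mx ^ my)   # ^ is the symmetric difference of two sets


if __name__ == "__main__":
    x, y = "abaab", "abab"               # hypothetical sample words
    print(dist(x, y, "ab"))              # 43/72
    print(dist(x, y, "ab", max_len=3))   # 17/36
```

For these sample words the symmetric difference is {aa, aaa, bab, aaba, baba}, and the threshold ℓ = 3 removes the two words of length 4.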

In [7] a different distance, also based on the length weighted index, is considered, but applied to a subset of M(x) △ M(y) that better captures the difference between two words. Moreover, by considering this subset, the requirement of having words of limited length is indirectly satisfied. This subset of M(x) △ M(y) is in fact made of those factors of x that are minimal absent words for y, and vice versa. In other terms, we want the comparison of the two sequences x and y not to be influenced by those minimal absent words of x that are also absent (but not minimal) for y. This idea is formally described as follows. For all x, y ∈ Σ* we define

D(x, y) = (M(x) ∩ F(y)) ∪ (M(y) ∩ F(x)).

The following theorem summarizes some algebraic properties of D(x, y), also in relation to M(x) △ M(y), proved in [7] (Lemma 4.1 and Theorem 4.3). Note that, in general, M(x) △ M(y) is not antifactorial and Σ* (M(x) △ M(y)) Σ* is an ideal.

Theorem 1. For all x, y ∈ Σ*:

1. D(x, y) = ∅ if and only if x = y.

2. D(x, y) ⊆ M(x) △ M(y).

3. D(x, y) is antifactorial.

4. D(x, y) is the base of the ideal Σ* (M(x) △ M(y)) Σ*.

Point 4 of Theorem 1 states that considering D(x, y) is equivalent to ignoring, in M(x) △ M(y), those words that have a proper factor in the same set. Therefore one can define a distance dist_D based on the length weighted index applied to D(x, y):

dist_D(x, y) = LWI(D(x, y)) = ∑_{w ∈ D(x, y)} 1/|w|²

We remark that, as in the case of dist_ℓ, the distance dist_D takes into consideration elements among the shortest of M(x) △ M(y), because they are elements of the base of the ideal Σ* (M(x) △ M(y)) Σ*.
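The set D(x, y), points 2 and 4 of Theorem 1, and the distance built on D(x, y) can be checked on small inputs with the sketch below. All names (`d_set`, `base`, `lwi`) and the sample words are assumptions made for illustration; the code enumerates factors by brute force and says nothing about how D(x, y) might be generated directly.

```python
from fractions import Fraction


def factors(w):
    """All factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def maws(w, alphabet):
    """Minimal absent words of w (brute force, for short words only)."""
    f = factors(w)
    left = {a + u for a in alphabet for u in f}
    right = {u + a for u in f for a in alphabet}
    return (left & right) - f


def d_set(x, y, alphabet):
    """D(x, y): MAWs of one word that are factors of the other."""
    return (maws(x, alphabet) & factors(y)) | (maws(y, alphabet) & factors(x))


def base(words):
    """Words of the set that have no proper factor in the same set."""
    return {u for u in words if not any(v != u and v in u for v in words)}


def lwi(words):
    """Length weighted index: sum of 1/|w|^2 over the words of the set."""
    return sum((Fraction(1, len(u) ** 2) for u in words), Fraction(0))


if __name__ == "__main__":
    x, y, sigma = "abaab", "abab", "ab"   # hypothetical sample words
    sym = maws(x, sigma) ^ maws(y, sigma)
    d = d_set(x, y, sigma)
    print(sorted(d))                  # ['aa', 'bab']
    print(d <= sym, d == base(sym))   # True True
    print(lwi(d), lwi(sym))           # 13/36 43/72
```

On this pair, three of the five words of the symmetric difference (aaa, aaba, baba) contain a shorter word of the same set and therefore do not belong to D(x, y).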

Example 1. Let = and = words over Σ = {, , , }. Then, () = {, , , , , , , , , , , , , , } () = {, , , , , , , , }, () ∪ () = {, , , , , , , , , , , , , , , , , }, ()△ () = {, , , , , , , , , , , }, (, ) = {, , }.

Remark that the word , for instance, is absent both in and in (although not minimal in ) so, in some way, it represents a common property of the two words, and it should not be considered as a contribution to the distance. The same holds for the words , , , , , , , and . On the other hand, the word , for instance, is a minimal absent word in , but occurs in and therefore discriminates the two words. Viceversa, the word is minimal absent in but occurs in i.e. it also contributes to their dissimilarity. In Example 1, the cardinality of the set (, ) is much smaller than the one of ()△ (), and the words in (, ) are among the smallest in ()△ (). Finally:

dist(x, y) = 1 + 7/4 + 3/9 + 1/16 = 453/144 ≈ 3.1

dist_D(x, y) = 1 + 1/2 = 3/2 = 1.5

4. Experimental results on the D(x, y) set

In the previous section we observed that the set D(x, y) is the base of the ideal Σ* (M(x) △ M(y)) Σ*; it is therefore likely to have a smaller cardinality and to consist of words among the shortest. Actually, in [4], for computational reasons, the distance dist_ℓ is considered instead of the distance dist, but the authors do not give any motivation for how they choose a good value of ℓ. Moreover, some experiments that will appear in [12] show that the distance dist_D based on D(x, y) and the distance dist_ℓ behave in a similar way on biological datasets with respect to the generated taxonomies.

Having an idea of the quantities involved could be useful for the computation of dist_D and dist, whose computational complexity depends on the computation of the sets D(x, y) and M(x) △ M(y), respectively. Therefore it is worth seeing how much smaller |D(x, y)| is with respect to |M(x) △ M(y)| and |M(x) ∪ M(y)|.

It is also interesting to consider and compare the total lengths of the three sets.

With these motivations, we present here some experimental results. Our first experiment on this topic explores the sets D(x, y), M(x) △ M(y) and M(x) ∪ M(y) on a benchmark dataset of 41 mammalian mitochondrial DNA (mtDNA) sequences (https://github.com/NaserAnjum21/CDMAWS/tree/master/Data). The sequences in this dataset are approximately 17000 bases long.

[Figure: number of elements (y-axis) by MAW length (x-axis); series: dmaw, scmaw, union.]

Figure 1 summarizes the results concerning the distribution of MAW lengths in the three sets D(x, y), M(x) △ M(y) and M(x) ∪ M(y), where x and y correspond to the human and gorilla mtDNA. The experiments on other pairs of species give similar curves.

A natural question is what happens if the same experiments are performed on a dataset of random words, in order to infer combinatorial properties of the sets from the experiments. To compare the results with those on biological strings, we produced a dataset of randomly generated strings of length 17000 over a 4-letter alphabet, whose results are displayed in Figure 2. We are interested in studying the sensitivity of the sets D(x, y) and M(x) △ M(y) to the values of two parameters: the alphabet size and the length of the words in the dataset.

In order to run the experiment, we generated some random datasets to work on. For each alphabet size |Σ| = 2, 4, 8 and for each word length n = 8500, 17000, 34000, 68000, 136000, a dataset R_{Σ,n} of random strings has been produced (i.e. we iteratively doubled the alphabet size and the sequence lengths). Then, for each pair x, y ∈ R_{Σ,n}, we computed the distribution of the MAW lengths in D(x, y), M(x) △ M(y) and M(x) ∪ M(y). The results of some of these experiments are summarized in the histograms in Figures 3, 4 and 5, where two random sequences x, y ∈ R_{Σ,n} (with different values of |Σ| and n) are considered and the corresponding distributions of the MAW lengths in the three sets are shown. We observe that:

• The values for the three sets are distributed on a bell-shaped curve and are nonzero only in a small interval.

• The maximum for D(x, y) approximates log_{|Σ|} |x|. This observation is coherent with a result in [2], stating that for a word x randomly generated by a memoryless source with identical symbol probabilities, the maximal length of a minimal absent word is O(log_{|Σ|} |x|). This value appears to be always one unit less than the maximum for M(x) △ M(y) and M(x) ∪ M(y) (see Table 1).

• The curve for D(x, y) is much lower than the curves for M(x) △ M(y) and M(x) ∪ M(y). This intuitively means that the number of words in D(x, y) is much smaller than the number of words in M(x) △ M(y).

• The curves for M(x) △ M(y) and M(x) ∪ M(y) are very close, i.e. M(x) △ M(y) involves most of the MAWs of x and y.

Table 1
This table shows, for each dataset R_{Σ,n}, the most represented length of MAWs in D(x, y), M(x) △ M(y) and M(x) ∪ M(y), respectively, with x, y ∈ R_{Σ,n}.
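A small-scale version of this experiment can be sketched as follows. Everything here is an assumption made for illustration: the function names, the seed, the string length 2000 (far below the lengths used in the paper), and above all the cut-off `max_len`, which restricts the computation to MAWs of bounded length so that a simple k-mer enumeration suffices. The sketch does not reproduce the authors' code or their figures.

```python
import random
from collections import Counter


def kmers(w, k):
    """Set of the factors of w having length exactly k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def maws_up_to(w, alphabet, max_len):
    """MAWs of w of length at most max_len (assumed cut-off)."""
    result = set()
    prev = {""}                        # factors of length k - 1
    for k in range(1, max_len + 1):
        cur = kmers(w, k)
        for u in prev:                 # u is the longest proper prefix
            for a in alphabet:
                cand = u + a
                # absent word whose longest proper suffix also occurs
                if cand not in cur and cand[1:] in prev:
                    result.add(cand)
        prev = cur
    return result


def length_histograms(x, y, alphabet, max_len):
    """Length distributions in D(x, y), M(x) ^ M(y) and M(x) | M(y)."""
    mx = maws_up_to(x, alphabet, max_len)
    my = maws_up_to(y, alphabet, max_len)
    d = {u for u in mx if u in y} | {u for u in my if u in x}

    def hist(words):
        return Counter(len(u) for u in words)

    return hist(d), hist(mx ^ my), hist(mx | my)


if __name__ == "__main__":
    rng = random.Random(0)             # arbitrary fixed seed
    sigma, n = "acgt", 2000            # toy parameters
    x = "".join(rng.choices(sigma, k=n))
    y = "".join(rng.choices(sigma, k=n))
    h_d, h_sym, h_union = length_histograms(x, y, sigma, max_len=12)
    for k in sorted(h_union):
        print(k, h_d[k], h_sym[k], h_union[k])
```

Since D(x, y) ⊆ M(x) △ M(y) ⊆ M(x) ∪ M(y), each printed row is non-decreasing from left to right; the position of the peaks can then be compared with log_{|Σ|} n.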

Figures 3, 4 and 5 show the distributions of MAW lengths in D(x, y), M(x) △ M(y) and M(x) ∪ M(y) for two random strings over different alphabet sizes |Σ| and with different string lengths n. It is easy to see, in all the displayed cases, how much higher the curves of the length distributions of M(x) △ M(y) and M(x) ∪ M(y) are compared to the one of D(x, y). In particular:

[Figure: number of elements (y-axis) by MAW length (x-axis, 1 to 25).]

• For n = 8500 and |Σ| = 2 (cf. Figure 3) the curve for D(x, y) has its maximum at length 13 (i.e. MAWs of length 13 are the most frequent in D(x, y)) with a frequency around 1500, whereas the maximum for both M(x) △ M(y) and M(x) ∪ M(y) is at length 14 with a frequency around 3000 (note that log_2 8500 = 13.053). Nonzero values for D(x, y) are in the interval [10, 18], whereas for both M(x) △ M(y) and M(x) ∪ M(y) they are in [10, 24].

• For n = 8500 and |Σ| = 4 (cf. Figure 4) the curve for D(x, y) has its maximum at length 7 with a frequency around 5000, whereas the maximum for both M(x) △ M(y) and M(x) ∪ M(y) is at length 8 with a frequency around 11000 (note that log_4 17000 = 7.027). Nonzero values for D(x, y) are in the interval [5, 9], whereas for both M(x) △ M(y) and M(x) ∪ M(y) they are in [5, 12].

• For n = 136000 and |Σ| = 8 (cf. Figure 5) the curve for D(x, y) has its maximum at length 6 with a frequency around 120000, whereas the maximum for both M(x) △ M(y) and M(x) ∪ M(y) is at length 7 with a frequency around 500000 (note that log_8 136000 = 5.684). Nonzero values for D(x, y) are in the interval [5, 8], whereas for both M(x) △ M(y) and M(x) ∪ M(y) they are in [5, 10].
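The logarithms quoted in the three remarks above can be recomputed with a few lines of Python. The helper name `log_base` is an assumption introduced here; the (alphabet size, length) pairs are the ones appearing inside the logarithms as printed.

```python
import math


def log_base(n, sigma):
    """Logarithm of n in base sigma (sigma = alphabet size)."""
    return math.log(n) / math.log(sigma)


if __name__ == "__main__":
    # (alphabet size, string length) as they appear in the logarithms above
    for sigma, n in [(2, 8500), (4, 17000), (8, 136000)]:
        print(f"log_{sigma} {n} = {log_base(n, sigma):.3f}")
```

The printed values can be set against the observed peaks of D(x, y) at lengths 13, 7 and 6.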

The experiment, repeated on different sample sequences, gives similar curves and the same maximum frequency. The curves are also similar for sample sequences taken from biological datasets with lengths comparable to those of the random sequences considered here (see, for instance, Figure 1).

For the investigation of cardinalities, in another experiment, for all the pairs x, y ∈ R_{Σ,n} we computed the ratios |D(x, y)| / |M(x) △ M(y)|, |D(x, y)| / |M(x) ∪ M(y)|, ‖D(x, y)‖ / ‖M(x) △ M(y)‖ and ‖D(x, y)‖ / ‖M(x) ∪ M(y)‖. Tables 2 and 3 show the average of these values and the corresponding standard deviation. One can note that:

• As the cardinality of the alphabet grows, |D(x, y)| / |M(x) △ M(y)| and |D(x, y)| / |M(x) ∪ M(y)| decrease. This is also true for the total lengths.

• The ratios relating to the total lengths are smaller than the corresponding ratios relating to the cardinalities. This shows that the words in D(x, y) are among the shortest of the words in M(x) △ M(y).
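The four ratios and their averages can be sketched on a toy dataset as follows. The names `ratios`, `r1` to `r4`, the dataset size, the string length and the cut-off `max_len` on MAW lengths are all assumptions chosen to keep the example fast; the sketch also assumes that the two MAW sets differ, so that no denominator is zero. It does not reproduce the figures of Tables 2 and 3.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def kmers(w, k):
    """Set of the factors of w having length exactly k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def maws_up_to(w, alphabet, max_len):
    """MAWs of w of length at most max_len (assumed cut-off)."""
    result, prev = set(), {""}
    for k in range(1, max_len + 1):
        cur = kmers(w, k)
        for u in prev:
            for a in alphabet:
                cand = u + a
                if cand not in cur and cand[1:] in prev:
                    result.add(cand)
        prev = cur
    return result


def total_length(words):
    """Sum of the lengths of the words of the set."""
    return sum(len(u) for u in words)


def ratios(x, y, alphabet, max_len):
    """Cardinality and total-length ratios of D against M^M and M|M."""
    mx = maws_up_to(x, alphabet, max_len)
    my = maws_up_to(y, alphabet, max_len)
    d = {u for u in mx if u in y} | {u for u in my if u in x}
    sym, union = mx ^ my, mx | my
    return (len(d) / len(sym), total_length(d) / total_length(sym),
            len(d) / len(union), total_length(d) / total_length(union))


if __name__ == "__main__":
    rng = random.Random(0)                 # arbitrary fixed seed
    sigma, n, size = "acgt", 1000, 6       # toy parameters
    data = ["".join(rng.choices(sigma, k=n)) for _ in range(size)]
    rows = [ratios(x, y, sigma, 12) for x, y in combinations(data, 2)]
    for name, col in zip(("r1", "r2", "r3", "r4"), zip(*rows)):
        print(f"{name}: mean {100 * mean(col):.2f}% "
              f"(s.d. {100 * pstdev(col):.2f})")
```

Here `r1` and `r2` are taken against the symmetric difference and `r3` and `r4` against the union, in each case first for cardinalities and then for total lengths.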

[Figure: histogram over MAW lengths (x-axis, 0 to 25); series: dmaw, scmaw, union.]

In conclusion, these experiments aim to highlight that the great numerical difference in the size of the data in these sets makes the distance dist_D based on D(x, y) interesting from a computational point of view, compared to the distance based on the symmetric difference. In fact, computing a distance based on D(x, y) could be more efficient than computing the distance on M(x) △ M(y), since we would have a smaller set, provided that one can directly obtain the words of the set D(x, y) without explicitly producing all the words in M(x), M(y), F(x), F(y).

Figure 5: Distribution of MAW lengths in D(x, y), M(x) △ M(y) and M(x) ∪ M(y) for an 8-letter alphabet and length 136000.

Table 2
For each pair of words x, y ∈ R_{Σ,n}, the ratios r_1(x, y) = |D(x, y)| / |M(x) △ M(y)| and r_2(x, y) = ‖D(x, y)‖ / ‖M(x) △ M(y)‖, resp., have been computed. Afterwards, the average values of r_1(x, y) and r_2(x, y) over all pairs x, y ∈ R_{Σ,n} have been computed and reported as percentages in columns 3 and 4, respectively, with the relative standard deviation (s.d.) in parentheses. Since the standard deviation is everywhere very small, the data are clustered tightly around the mean.

Table 3
For each pair of words x, y ∈ R_{Σ,n}, the ratios r_3(x, y) = |D(x, y)| / |M(x) ∪ M(y)| and r_4(x, y) = ‖D(x, y)‖ / ‖M(x) ∪ M(y)‖, resp., have been computed. Afterwards, the average values of r_3(x, y) and r_4(x, y) over all pairs x, y ∈ R_{Σ,n} have been computed and reported as percentages in columns 3 and 4, respectively, with the relative standard deviation (s.d.) in parentheses. Since the standard deviation is everywhere very small, the data are clustered tightly around the mean.

References

[1] S. Mantaci, A. Restivo, M. Sciortino, Distance measures for biological sequences: Some recent approaches, Int. J. Approx. Reason. 47 (2008) 109–124. URL: https://doi.org/10.1016/j.ijar.2007.03.011. doi:10.1016/J.IJAR.2007.03.011.

[2] F. Mignosi, A. Restivo, M. Sciortino, Forbidden factors and fragment assembly, RAIRO Theor. Informatics Appl. 35 (2001) 565–577.

[3] M. Crochemore, F. Mignosi, A. Restivo, Automata and forbidden words, Inf. Process. Lett. 67 (1998) 111–117.

[4] P. Charalampopoulos, M. Crochemore, G. Fici, R. Mercas, S. P. Pissis, Alignment-free sequence comparison using absent words, Inf. Comput. 262 (2018) 57–68.

[5] M. S. Rahman, A. Alatabbi, M. Crochemore, M. S. Rahman, Absent words and the (dis)similarity analysis of DNA sequences: An experimental study, BMC Research Notes 9:186 (2016) 1–8.

[6] S. Chairungsee, M. Crochemore, Using minimal absent words to build phylogeny, Theor. Comput. Sci. 450 (2012) 109–116.

[7] G. Castiglione, S. Mantaci, A. Restivo, Some investigations on similarity measures based on absent words, Fundam. Informaticae 171 (2020) 97–112.

[8] A. Ehrenfeucht, D. Haussler, A new distance metric on strings computable in linear time, Discrete Applied Mathematics 20 (1988) 191–203.

[9] M. Béal, M. Crochemore, Fast detection of specific fragments against a set of sequences, volume 13911 of Lecture Notes in Computer Science, Springer, 2023, pp. 51–60.

[10] P. Bonizzoni, C. D. Felice, Y. Pirola, R. Rizzi, R. Zaccagnino, R. Zizza, Can formal languages help pangenomics to represent and analyze multiple genomes?, in: V. Diekert, M. V. Volkov (Eds.), Developments in Language Theory - 26th International Conference, DLT 2022, Tampa, FL, USA, May 9-13, 2022, Proceedings, volume 13257 of Lecture Notes in Computer Science, Springer, 2022, pp. 3–12.

[11] K. Okabe, T. Mieno, Y. Nakashima, S. Inenaga, H. Bannai, Linear-time computation of generalized minimal absent words for multiple strings, volume 14240 of Lecture Notes in Computer Science, Springer, 2023, pp. 331–344.

[12] S. L. Pizzuto, Similarity measures based on minimal absent words, Master's thesis, DAMI, University of Palermo, in preparation.