<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Novel XBWT-based Distance Measures for Labeled Trees</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danilo G. Dolce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabrina Mantaci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Romana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Rosone</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marinella Sciortino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Palermo</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Comparing labeled trees has applications in various domains, particularly in the study of cancer phylogenies. In this paper, we address the problem of comparing fully labeled unordered trees, focusing on their structural and label similarities. We define a novel class of distance measures by exploiting the eXtended Burrows-Wheeler Transform (XBWT), an extension to labeled trees of the well-known Burrows-Wheeler Transform. The XBWT, introduced in [Ferragina et al., FOCS 2005], produces a linearization of the tree that is both compressible and eficiently searchable. We extend the definition of this linearization to pairs of trees, and we produce a partition based on the prefixes of a given length  ≥ 1 of the parent-to-root paths of the nodes. We define, for any  ≥ 1, the distance measure  by applying the Jaccard distance to each element of this partition. We prove that all the measures  are pseudometrics, i.e. they assume non-negative values, are zero when applied to identical trees, are symmetric, and satisfy the triangle inequality. These measures become metrics when all the labels within each tree are distinct. Furthermore, we show that these distances are sensitive to some operations on trees, such as the removal and insertion of subtrees, swapping of subtrees, and label swapping. Finally, to show the efectiveness of our approach, we have analyzed experimentally the behavior of the distances when operations on trees are applied to a randomly generated fully labeled tree. Here, we present the results obtained in the case  = 1.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Burrows-Wheeler Transform</kwd>
        <kwd>XBWT</kwd>
        <kwd>labeled tree</kwd>
        <kwd>distance measure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Trees are fundamental and well-studied combinatorial structures in computer science since
their structure can encode, in a natural way, hierarchical relations in many domains.</p>
      <p>
        Comparing trees is a key problem that arises in various fields like computational biology,
structured text databases, image analysis, automated theorem proving, and compiler
optimization, where it can be crucial to have efective distance measures able to capture diferent types
of operations or transformations on the trees used to model data. A survey on the methods for
comparing labeled trees based on simple local operations of deleting, inserting, and relabeling
nodes can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In Bioinformatics, the trees are also used to model cancer phylogenies.
In this context comparing trees typically means analyzing and comparing the genetic mutations
and progression of cancer cells over time. In fact, according to the clonal theory of cancer
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], each node in these trees is labeled to represent a distinct genetic mutation or a group of
mutations, and the edges represent the evolutionary pathways that these mutations have taken.
The comparison between these trees aims to understand the evolutionary history of the cancer,
identify common pathways of mutation, and potentially uncover patterns that could lead to
better treatment strategies [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Several distance measures have been recently introduced
with the aim to compare the topology of the trees and the mutations involved (see [
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10</xref>
        ]
and references therein). Other recent approaches are based on metaheuristics [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Motivated by the application in the study of cancer phylogeny, in this paper we focus our
attention on the comparison between fully labeled trees, i.e. assuming that each node, whether
internal or a leaf, has a label over a finite alphabet. However, the methodology we present in
this paper can also be extended to the case of multi-labeled trees.</p>
      <p>
        Here we address the problem of comparing unordered labeled trees by exploiting a
linearization of the trees produced by the eXtended Burrows-Wheeler Transform (XBWT) [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. The
XBWT is an elegant transformation that extends to trees the functionalities of the well-known
Burrows-Wheeler Transform (denoted as BWT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), which is instead defined on strings and
has been introduced in the context of data compression. The XBWT is applied to a labeled tree
and emits, in addition to a permutation of the labels of the tree (tree linearization), a sequence
of bits that makes the transformation reversible. This output is compressible and eficiently
searchable. This is particularly useful for applications in Bioinformatics and XML document
processing.
      </p>
      <p>In this paper, we are not interested in the aspects related to compression and indexing. For
this reason, here we will not use the full output of the XBWT, but only the tree linearization
it produces. More in detail, we extend to a pair of trees the linearization computed by XBWT
with the aim of measuring how the labels of the nodes of the two trees are mixed in the output
of this transformation. We therefore define a novel class of distances  between two trees
through a partition of the linearization induced by the lexicographically sorted prefixes of
length  ≥ 1 of the parent-to-root paths of the nodes of the tree, eventually extended to length
 with an appropriate special character. To define these measures, the Jaccard distance is used
in each element of such partition. We prove that such measures are pseudometrics. Note that
the measures  become metrics when all the labels within each tree are distinct.</p>
      <p>Furthermore, we study the sensitivity of the  measures with respect to some operations
on trees, such as the insertion or deletion of a subtree, subtree swapping, and the exchange of
labels between two nodes.</p>
      <p>Finally, to show the robustness and efectiveness of our method, we have also analyzed
the behavior of the  measures on simulated datasets obtained by applying a sequence of
operations on randomly generated trees. Here we present the results obtained for the case
 = 1 and when trees with distinct labels are considered.</p>
      <p>
        In this paper, we do not focus on implementation aspects, but rather on methodological issues.
The problems related to space and time complexity will be addressed in a subsequent full version
of the paper. To our knowledge, this approach is innovative compared to the measures used in
the literature for labeled tree comparison. The idea of comparing two combinatorial structures
by measuring how they mix within a common structure has been used in the context of string
comparison through the use of an extension to a collection of sequences of the Burrows-Wheeler
Transform [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] but with diferent output partitioning strategies [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Such an extension
has been largely applied for comparing biological sequences (see [18] and references therein).
Moreover, it has been recently used in [19] to reconstruct the phylogenetic tree of a collection
of biological sequences. We believe that the methodology introduced in this paper can also
provide new insights in the context of string comparison.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>
        Let Σ = {1, 2, . . . ,  } be a finite ordered alphabet with 1 &lt; 2 &lt; . . . &lt;  , where &lt;
denotes the standard lexicographic order. We denote by Σ* the set of words over Σ. Given
a finite word  ∈ Σ* , let  be the length of , denoted ||. We also denote by [] the -th
letter in  for any 1 ≤  ≤  , therefore  = [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] . . . [], and we denote by [, ] the
substring [][ + 1] · · · []. A prefix is a substring of the form [1, ] for some , and a sufix
one of the form [, ||]. Given two strings  and , we denote by (, ) the length of the
longest common prefix (LCP) of  and , i.e., (, ) = max{ | [1, ] = [1, ]}.
      </p>
      <p>A rooted tree  = (, ), with  set of nodes and  ⊆  ×  set of edges, is a directed
connected acyclic graph where: 1. all the nodes have one in-edge, except the root that has no
in-edges; 2. all nodes have zero (leaves) or more (internal nodes) out-edges. The size of a tree
 , denoted by | |, is equal to the number | | of its nodes. Given a tree  , we denote by ( )
the set of its leaves. If there is an edge from node  to node , then  is the parent of  and  is
the child of . A path in the tree is a sequence of nodes 1, 2, . . .  where  is parent of − 1
for all 1 ≤  ≤  − 1. If  &lt;  then  is ancestor of  , and  is descendent of . The depth of
a node is the length of the path from its parent to the root. The depth of the root is 0. For a
given  &gt; 0, we denote by  ≤  the subtree of  from the root up to the nodes at depth . Two
nodes that are children of the same node are called siblings. The tree  is an ordered tree if a
left-to-right order among siblings in  is given, otherwise it is unordered. A tree  = (, )
is labeled over the alphabet Σ if it is defined a labeling function ℓ :  → Σ that associates a
letter from Σ to each node of  . For each node x of a labeled tree, we denote by  (x) the string
obtained as the concatenation of the labels in the path from the parent node of x to the root of
the tree. Let us denote by  ( ) the multiset of all the strings  (x) for every node x ∈  . If 
is an ordered tree,  is populated with the parent-to-root string paths of the tree nodes visited
in pre-order.</p>
      <p>The eXtended Burrows-Wheeler Transform (XBWT) of an ordered labeled tree  , denoted
by xbw( ), linearizes the tree with a string, denoted as  ( ), obtained as a concatenation of
the labels of all nodes, ordered according to the lexicographical sorting of their parent-to-root
string paths in  ( ).</p>
      <p>In this paper, we will consider labeled trees where either all the nodes have diferent labels,
or nodes with equal labels are allowed, but only if they are not siblings and appear in diferent
root-to-leaf paths. Moreover, for a technical reason, we add a child to each leaf, labeled with a
special symbol $ ̸∈ Σ. We exclude these nodes to count the size | |. Observe that for each tree
 extended in this way, the set ( ) consists solely of nodes labeled with $. We denote by Σ
the set of all such trees.</p>
      <p>Let  and  be two sets. The Jaccard distance  between  and  is defined from the
Jaccard coeficient  of similarity for  and  , as follows:
 (,  ) = | ∩  | ,  (,  ) = 1 −  (,  ) = | ∪  | − |  ∩  | .</p>
      <p>| ∪  | | ∪  |
We further assume that whenever  =  = ∅,  (,  ) = 1, and  (,  ) = 0. Observe
that the measure  is a metric. Obviously, 0 ≤  (,  ) ≤ 1 and  (,  ) = 0 if  =  ,
and  (,  ) = 1 if  ∩  = ∅ and  ∪  ̸= ∅.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Linearization of Pairs of Trees via XBWT</title>
      <p>In this section, we aim to define a linearization for pairs of ordered trees using an XBWT-based
approach. Specifically, we define xbw(1, 2) as the string over the alphabet Σ × { 1, 2} defined
in the following. Note that such a linearization can be defined for every pair of trees. However,
motivated by practical applications, we assume here that each tree contains distinct labels or
that it may contain repeated labels but only in diferent root-to-leaf paths and not for nodes
with the same parent. Let us denote by  (1, 2) the multiset of all the strings  (x) for every
node x in the trees 1 and 2. Let us denote by  (1, 2) the string obtained by concatenating
the labels of all the nodes of 1 and 2 ordered according to the lexicographical sorting of the
paths in  (1, 2).</p>
      <p>The output xbw(1, 2) is a sequence of pairs (ℓ(x), ), where ℓ(x) ∈  and  is the flag
assuming value 1 or 2 if x comes from 1 or 2, respectively. However, for simplicity of
exposition, in the figures and tables, we replace the flags with the full names of the trees
considered.</p>
      <p>
        We enrich the definition of XBWT of the two trees 1 and 2 with the LCP array, defined as
 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] = 0 and  [] = ( [],  [ − 1]) for any 1 &lt;  &lt; |1|+|2|+|(1)|+|(2)|.
Example 1. Let 1 and 2 be the pair of trees depicted in Fig. 1a and in Fig. 1b, respectively. The
output of xbw(1, 2) is represented in the table showed in Fig. 1c. We remark that, by construction,
the first two lines in the table contain the roots of 1 and 2, respectively. The last column in the
table contains the values of the LCP array.
      </p>
      <p>Note that the parent-to-root string paths of two sibling nodes are equal. If the trees are
ordered, the relative order of such paths is determined by the left-to-right order of the nodes.
However, the distance measures we will introduce in the next section consider the labels of
such nodes as a set, thus their relative order is not relevant. Then, they could be applied to
unordered trees as well.</p>
    </sec>
    <sec id="sec-4">
      <title>4. A new Class of Distance Measures between Trees</title>
      <p>In this section, we introduce a new measure for comparing two unordered trees, 1 and 2,
and we will prove that this measure is a pseudometric. To compute this measure, for a given
D
H
$
G
B
$</p>
      <p>G
$
H
$
(a) Tree 1
(b) Tree 2</p>
      <p>F
A
$
C
F
$</p>
      <p>B
$
P
$
(c) xbw(1, 2) and LCP array
consists of the two columns  and  . The LCP array is contained in the last column of the table.
Σ# = ⋃︀
the context.</p>
      <p>′∈[0,]
 &gt; 0 we map each word in  of length &lt;  into a new word over the right-padded alphabet
{︁Σ′ · { #}− ′ }︁. To improve readability, we omit  whenever it is clear from
In defining the measure, we refer to a partition of  (1, 2) according to the values of the
LCP array, defined as follows.
|(1)| + |(2)|] obtained as union of the following intervals:
Definition 1. LCP-based partition of order . Given a positive integer , let us consider the
list of strings</p>
      <p>()(1, 2) obtained by extending all the strings in  (1, 2) to the length  with
the # fill character, where # is smaller than all the characters in Σ ∪ {$}. Let us denote by 
the LCP array for</p>
      <p>()(1, 2). We denote by  the partition of the interval [1, |1| + |2| +
• ⋃︀
1≤ ≤ &lt; {[, ] | [] &lt; , [ + 1] &lt; , [] ≥ , ∀ ∈ [ + 1, ]};
∑︁
[,]∈1
∑︁
[,]∈2
• [, |1| + |2| + |(1)| + |(2)|] | [] &lt; , [] ≥
|2| + |(1)| + |(2)|].
, ∀ ∈ [ + 1, |1| +</p>
      <p>Note that each interval [, ] in the partition  can be uniquely associated with a word in
Σ, which is a -length prefix of the longest common prefix of the strings, possibly extended to
length  with the padding character #, in the list  (1, 2) contained in the interval [, ].</p>
      <p>We denote by  ([, ]) such a word, and the indexes  and  are denoted by  ( [, ]) and
( [, ]), respectively. For each pair of trees 1 and 2, we denote by Λ(1, 2) = { [, ] |
[, ] ∈ }, that is the set of -length words from Σ# that appear as prefix of some word
in ()(1, 2). Moreover, for  ∈ {1, 2}, we denote by ([, ]) = { ∈ Σ ∪ {$} | (, ) ∈
xbw(1, 2)[, ]}, that is the set of labels in  [, ] coming from the tree . Analogously, for
each tree and word  ∈ Σ#, we define the set Γ()() = { ∈ Σ ∪ {$} |  ·  ∈ ()( )}, that
is the set of letters  in xbw( )[, ] such that  [, ] = .</p>
      <p>Definition 2.</p>
      <p>The distance measure  is defined as:</p>
      <p>∑︁
[,]∈</p>
      <p>Example 2. Let us consider the two trees 1 and 2 depicted in Figure 2. The partition of  (1, 2)
induced by  (with  = 1 and  = 2) is shown in Table 1. By applying the Jaccard distance 
to each element of the partition,</p>
      <p>1. The statement follows from the definition.
2. From the definition, we have that ( ,  ) = 0 since  (1[, ], 2[, ]) = 0 for all
[, ] ∈ .</p>
      <p>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1
2
1
2
1
2
1
2
1
2
1
2
2
1
1
2</p>
      <p>B
C
X
$
from the tree on the left 1 (a) by swapping the subtrees rooted in  and  , respectively.</p>
      <sec id="sec-4-1">
        <title>Index Flag</title>
        <p>1
2
0
0
0
1
1
1
0
2
0
2
1
3
0
2
0
2
(1)
#
#
A
A
A
A
BA
BA
CA
CA
CBA
CBA
XCA</p>
      </sec>
      <sec id="sec-4-2">
        <title>XCBA</title>
        <p>YCA</p>
      </sec>
      <sec id="sec-4-3">
        <title>YCBA</title>
        <sec id="sec-4-3-1">
          <title>The table shows how the partition  relative to the two trees depicted in Fig. 2 can vary as  changes.</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Specifically, the partitions 1 and 2 are illustrated. In the columns</title>
          <p>
            the letters of the prefixes which define the partitions 1 and 2 respectively. Partition 2 contains the
same intervals as 1, except for two intervals obtained as a refinement of the interval [
            <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
            ] from 1.
This refinement is indicated with a dashed line. The table also shows the output  of XBWT applied
(1) and 
(2) we highlight in bold
to the two trees and LCP arrays.
          </p>
          <p>3. The statement follows from the fact that the Jaccard distance is a metric and then the
symmetric property holds.
4. Since the Jaccard distance is a metric and the triangle inequality holds, for each  ∈ Σ
holds that
one can always find a ′ &gt;  such that ′ (12) &gt; 0, i.e. ′ makes 1 and 2 distinguishable.
This fact is shown in Fig. 2, where the trees (a) and (b) are indistinguishable for 1, i.e. their 1
distance is 0, but they become distinguishable when 2 distance is applied.</p>
          <p>By using a similar argument as in the proof of Proposition 1, it is possible to prove the
following corollary.
and only if 1 = 2.
proposition.</p>
          <p>Corollary 1. Let  be the set of all unordered trees whose nodes are labeled by distinct symbols.
Then  is a metric over  for each  ≥ 1, i.e. for each pair of trees 1, 2 ∈  , (1, 2) = 0 if</p>
          <p>What we have stated in the previous remark can be more generally formalized in the following
1 + ∑︀∈[1,] |Σ|.</p>
          <p>Proposition 2. Given two trees 1 and 2 in Σ, (1, 2) ≤ +1(1, 2), for each  ≥ 1.</p>
          <p>Note that if we consider the set Σ of all labeled trees, the  measures can be used to define
a class of dissimilarity measures normalized with respect to the number of words in Σ#, i.e.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Sensitivity to Operations on Trees</title>
      <p>In this section, we evaluate how the distance  changes when a swap of subtrees, label
exchanges, insertion, or removal of nodes are applied to a tree. Other operations will be
considered in the full version of the paper.</p>
      <p>Proposition 3. Let 1 and 2 be two unordered trees such that 2 is obtained from 1, by
swapping two disjoint subtrees v1 and v2 , rooted in the nodes v1 and v2, respectively. Then,
1(1, 2) ≤ 2, and (1, 2) ≤ 2(| v≤1− 2| + | v≤2− 2|) + 2 for all  &gt; 1, where | v≤1− 1|
and | v≤2− 1| denote the number of nodes at depth at most  − 1 in the subtrees v1 and v2 ,
respectively.</p>
      <p>Proof. Let v1 and v2 be the two nodes in 1 that are the roots of the subtrees swapped
to obtain 2, and let  be the LCP-based partition of order  for 1 and 2. Let us
denote by  (v1) and  (v2) the parent-to–root string path of v1 and v2, respectively, both
possibly padded up to length  by using the character #. The upper bound for the
distance (1, 2) is obtained when all the labels in 1 are distinct (and therefore in 2
as well) and if each of the two nodes is the only child of their respective parent. In
this case, for all  &gt; 0,  ((1)[ ( (v1)), ( (v1))], (2)[ ( (v1)), ( (v1))]) = 1 and
 ((1)[ ( (v2)), ( (v2))], (2)[ ( (v2)), ( (v2))]) = 1. Moreover, for all  &gt; 1, each
node z in the subtrees v1 and v2 at depth at most  − 2 from the root increases by 2 the
distance . All the other intervals in the partition  give a zero contribution to the distance.
Then, the thesis follows.</p>
      <p>The following proposition provides an evaluation of the distance between two trees when
one is obtained from the other by removing or inserting an entire subtree.</p>
      <p>Proposition 4. Let 1 and 2 be two unordered trees such that 2 is obtained by removing from
1 a subtree x ̸= 1 with root x. Then, (1, 2) ≤ | x| + 1.</p>
      <p>Proof. Let x be the subtree rooted in x removed from 1 to obtain 2. Let  be the LCP-based
partition of order  for 1 and 2. Let us denote by  (x) the parent-to–root string path of x
possibly padded up to length  by using the character #.</p>
      <p>The upper bound for the distance (1, 2) is obtained when x has no sibling nodes and
 (x) does not have any common -length prefix with other parent-to-root string paths. In this
case, the parent of x and each node of x, leaves excluded, increases by 1 the distance .</p>
      <p>The following proposition considers the operation of swapping the labels of two nodes. Note
that the label swap does not involve any descendants of the nodes we are considering.
Proposition 5. Let 1 and 2 be two unordered trees such that 2 is obtained from 1, by swapping
the label of the nodes v1 and v2. Let us denote by v1 and v2 the subtrees rooted in the nodes
v1 and v2, respectively. Then, 1(1, 2) ≤ 4, and (1, 2) ≤ 2(| v≤1− 1| + | v≤2− 1|) + 2 for
all  &gt; 1, where | v≤1− 1| and | v≤2− 1| denote the number of nodes at depth at most  − 1 in the
subtrees v1 and v2 , respectively.</p>
      <p>Proof. Let v1 and v2 be the two nodes in 1 whose labels are swapped in 1 to obtain 2. This
means that in 2 we can find the subtrees v′ 1 and v′ 2 obtained by swapping the roots of the
subtrees v1 and v2 in 1 respectively. Let  be the LCP-based partition of order  for 1 and
2. Let us denote by  (v1) and  (v2) the parent–to–root string path of v1 and v2, respectively,
both possibly padded up to length  by using the character #. In order to obtain the upper bound
for the distance (1, 2) we assume that all the labels in 1 are distinct (and therefore in 2 as
rooted in  and  , removing the subtree rooted in , and swapping the labels  and  .
zero contribution to the distance. Then, the thesis follows.
well), each of the two nodes is the only child of their respective parent and the subtrees v1 and
v2 have distinct labels. In this case,  ((1)[ ( (v1)), ( (v1))], (2)[ ( (v1)), ( (v1))]) =
1, and symmetrically  ((1)[ ( (v2)), ( (v2))], (2)[ ( (v2)), ( (v2))]) = 1. For  =
1, observe that  ((1)[ (ℓ(v1)), (ℓ(v1))], (2)[ (ℓ(v1)), (ℓ(v1))]) = 1, and equivalently
 ((1)[ (ℓ(v2)), (ℓ(v2))], (2)[ (ℓ(v2)), (ℓ(v2))]) = 1. On the other hand, for all  &gt; 1,
each node z in the subtrees v1 and v2 (equivalently in v′ 1 and v′ 2 ) at depth at most  − 2
from the root increases by 2 the distance . All the other intervals in the partition  give a
Flag  
The tables show the phases of computation of 2(1, 2) = 6, 2(1, 3) = 4, 2(1, 4) = 12, where
1, 2, 3, 4 are depicted in Figure 3.</p>
      <p>In Fig. 3 and Table 2 is displayed a worst–case example for each of the cases described in
Propositions 3, 4, and 5, showing that the three upper–bounds are tight.</p>
      <p>1 1
2 3
3 1
4 3
5 1
6 3
7 1
8 3
9 1
10 3
11 1
12 1
13 1
14 3
15 1
16 1
17 1
18 3</p>
      <p>A 
A 
B A
B A
C A
C A
X BA
$ BA
Y CA
Y CA
$ DXBA
$ EXBA
$ FYCA
$ FYCA
D XBA
E XBA
F YCA
F YCA</p>
      <p>Num4ber of subtre6es removed8
(a) Subtree removals
10
2</p>
      <p>Num4ber of sub6trees swap8ped
(b) Subtree swap
10
12
10
d1 8
e
c
n
ta 6
s
i
D
4
2
0
10 Number of subtrees swapped 40
20 30</p>
      <p>50
(c) Subtree swap with repetitions</p>
      <p>We have analyzed the behavior of the  measures on simulated data, by evaluating the
distance values after the application of perturbations, which consist of subtree removal and
subtree swapping operations, on randomly generated trees. We conclude this section by showing
in Fig. 4a the results obtained by considering the values of the 1 measure after applying a
maximum of 10 subtree removals, chosen randomly, on randomly generated trees having distinct
labels and such that each internal node has at most three children. We also show the behavior
of the 1 measure when the operation applied on a randomly generated tree is the subtree
swapping. In Fig. 4b, we consider the case where each subtree can be swapped at most once,
and in Fig. 4c, successive swapping operations on the same subtree are allowed.</p>
      <p>A more extensive description of the experiments will be presented in a subsequent extended
version of the paper.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Further Work</title>
      <p>In this paper, we introduced a new class of distances for unordered fully labeled trees. Theoretical
results, as well as preliminary experimental analyses on simulated data, show that these measures
can efectively capture some operations on trees such as removal and insertion of subtrees,
subtree swapping, and label swapping. These distances are defined using an LCP-based partition
of a linearization of trees, defined by a generalization of the XBWT. We have proven that the
measures in this class are pseudometrics and become metrics when trees having distinct labels
are considered. In the general case in which repeated labels are allowed, we have observed
that for any given collection of trees, there exists an integer  for which  is a metric over the
dataset. It would be interesting to experimentally determine for any given dataset of trees the
smallest value of  for which  is a metric.</p>
      <p>We have focused on combinatorial aspects related to the extension of XBWT to compare
pairs of trees. The algorithmic issues related to the eficiency of computing these measures, as
well as the use of this transformation for finding common subtrees, will be explored in the full
paper.</p>
      <p>Our preliminary experimental evaluation shows that our method is able to capture structural
diferences and similarities between unordered trees, with significant possible implications for
computational biology, XML data processing, and hierarchical clustering. We intend to evaluate
the behavior of the  measures concerning a more comprehensive set of tree operations, as
well as to test these measures on real datasets for the study of cancer phylogenies. To this
end, we plan to extend the methodology introduced in this paper to the more general case
of multi-labeled trees, by using the Jaccard distance defined on multisets, and compare our
approach with others existing in the literature.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Sabrina Mantaci is partially supported by INdAM-GNCS Project 2024- CUP E53C23001670001
(“Proprietà combinatorie e distanze basate su parole da evitare”). Giuseppe Romana, Giovanna
Rosone, and Marinella Sciortino are partially supported by MUR PRIN project no. 2022YRB97K
- “PINC” (Pangenome INformatiCs: From Theory to Applications), and partially funded by
the INdAM-GNCS Project CUP E53C23001670001 (“Compressione, indicizzazione, analisi e
confronto di dati biologici”). Giovanna Rosone is partially supported by PAN-HUB T4-AN-07
(“Hub multidisciplinare e interregionale di ricerca e sperimentazione clinica per il contrasto alle
pandemie e all’antibiotico resistenza”), CUP I53C22001300001. Sabrina Mantaci and Marinella
Sciortino are partially supported by the project “ACoMPA – Algorithmic and Combinatorial
Methods for Pangenome Analysis” (CUP B73C24001050001) funded by the NextGeneration EU
programme PNRR ECS00000017 Tuscany Health Ecosystem (Spoke 6).
[18] V. Guerrini, F. A. Louza, G. Rosone, Metagenomic analysis through the
extended Burrows-Wheeler transform, BMC Bioinform. 21-S (2020) 299. doi:10.1186/
S12859-020-03628-W.
[19] V. Guerrini, A. Conte, R. Grossi, G. Liti, G. Rosone, L. Tattini, phyBWT2: phylogeny
reconstruction via eBWT positional clustering, Algorithms Mol. Biol. 18 (2023) 11. doi:10.
1186/S13015-023-00232-4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bille</surname>
          </string-name>
          ,
          <article-title>A survey on tree edit distance and related problems</article-title>
          ,
          <source>Theoretical Computer Science</source>
          <volume>337</volume>
          (
          <year>2005</year>
          )
          <fpage>217</fpage>
          -
          <lpage>239</lpage>
          . doi:https://doi.org/10.1016/j.tcs.
          <year>2004</year>
          .
          <volume>12</volume>
          .030.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Nowell</surname>
          </string-name>
          ,
          <article-title>The clonal evolution of tumor cell populations: Acquired genetic lability permits stepwise selection of variant sublines and underlies tumor progression</article-title>
          .,
          <source>Science</source>
          <volume>194</volume>
          (
          <year>1976</year>
          )
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          . doi:
          <volume>10</volume>
          .1126/science.959840.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Beerenwinkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Greenman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lagergren</surname>
          </string-name>
          ,
          <article-title>Computational cancer biology: An evolutionary perspective</article-title>
          ,
          <source>PLOS Computational Biology</source>
          <volume>12</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1371/ journal.pcbi.
          <volume>1004717</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Llabrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rosselló</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Valiente</surname>
          </string-name>
          ,
          <article-title>A generalized Robinson-Foulds distance for clonal trees, mutation trees, and phylogenetic trees and networks</article-title>
          ,
          <source>in: BCB '20: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics</source>
          , Virtual Event, USA, September
          <volume>21</volume>
          -
          <issue>24</issue>
          ,
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <volume>13</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          :
          <fpage>10</fpage>
          . doi:
          <volume>10</volume>
          .1145/3388440. 3412479.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>The evolution of tumour phylogenetics: principles and practice</article-title>
          ,
          <source>Nature Reviews Genetics</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>213</fpage>
          -
          <lpage>229</lpage>
          . doi:
          <volume>10</volume>
          .1038/nrg.
          <year>2016</year>
          .
          <volume>170</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Della</given-names>
            <surname>Vedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <article-title>A Rearrangement Distance for Fully-Labelled Trees</article-title>
          ,
          <source>in: 30th Annual Symposium on Combinatorial Pattern Matching (CPM</source>
          <year>2019</year>
          ), volume
          <volume>128</volume>
          of Leibniz International Proceedings in Informatics (LIPIcs),
          <source>Schloss Dagstuhl - Leibniz-Zentrum für Informatik</source>
          , Dagstuhl, Germany,
          <year>2019</year>
          , pp.
          <volume>28</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          :
          <fpage>15</fpage>
          . doi:
          <volume>10</volume>
          .4230/LIPIcs.CPM.
          <year>2019</year>
          .
          <volume>28</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>DiNardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritz</surname>
          </string-name>
          , L. Oesper,
          <article-title>Distance measures for tumor evolutionary trees</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>2090</fpage>
          -
          <lpage>2097</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/btz869.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ciccolella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Denti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Previtali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Della</surname>
          </string-name>
          <string-name>
            <surname>Vedova</surname>
          </string-name>
          ,
          <article-title>Tripletbased similarity score for fully multilabeled trees with poly-occurring labels</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>37</volume>
          (
          <year>2020</year>
          )
          <fpage>178</fpage>
          -
          <lpage>184</lpage>
          . doi:
          <volume>10</volume>
          .1093/bioinformatics/btaa676.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gawrychowski</surname>
          </string-name>
          ,
          <article-title>On Two Measures of Distance Between Fully-Labelled Trees</article-title>
          , in: I. L.
          <string-name>
            <surname>Gørtz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          Weimann (Eds.),
          <source>31st Annual Symposium on Combinatorial Pattern Matching (CPM</source>
          <year>2020</year>
          ), volume
          <volume>161</volume>
          of Leibniz International Proceedings in Informatics (LIPIcs),
          <source>Schloss Dagstuhl - Leibniz-Zentrum für Informatik</source>
          ,
          <year>2020</year>
          , pp.
          <volume>6</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          :
          <fpage>16</fpage>
          . doi:
          <volume>10</volume>
          .4230/LIPIcs.CPM.
          <year>2020</year>
          .
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malikic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Sahinalp</surname>
          </string-name>
          ,
          <article-title>A multi-labeled tree dissimilarity measure for comparing “clonal trees” of tumor progression</article-title>
          ,
          <source>Algorithms for Molecular Biology</source>
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <article-title>17</article-title>
          . doi:
          <volume>10</volume>
          .1186/s13015-019-0152-9.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ciccolella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Della</given-names>
            <surname>Vedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Filipović</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Soto Gomez, Three metaheuristic approaches for tumor phylogeny inference: An experimental comparison</article-title>
          ,
          <source>Algorithms</source>
          <volume>16</volume>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .3390/a16070333.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luccio</surname>
          </string-name>
          , G. Manzini,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <article-title>Structuring labeled trees for optimal succinctness, and beyond</article-title>
          ,
          <source>in: 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05)</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>184</fpage>
          -
          <lpage>193</lpage>
          . doi:
          <volume>10</volume>
          .1109/SFCS.
          <year>2005</year>
          .
          <volume>69</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luccio</surname>
          </string-name>
          , G. Manzini,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <article-title>Compressing and indexing labeled trees, with applications</article-title>
          ,
          <source>J. ACM</source>
          <volume>57</volume>
          (
          <year>2009</year>
          ). doi:
          <volume>10</volume>
          .1145/1613676.1613680.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Burrows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Wheeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Block</given-names>
            <surname>Sorting data Compression Algorithm</surname>
          </string-name>
          ,
          <source>Technical Report, DIGITAL System Research Center</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mantaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          , G. Rosone,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>An extension of the Burrows-Wheeler Transform</article-title>
          ,
          <source>Theoret. Comput. Sci</source>
          .
          <volume>387</volume>
          (
          <year>2007</year>
          )
          <fpage>298</fpage>
          -
          <lpage>312</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.tcs.
          <year>2007</year>
          .
          <volume>07</volume>
          . 014.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mantaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          , G. Rosone,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>A new combinatorial approach to sequence comparison</article-title>
          ,
          <source>Theory Comput. Syst</source>
          .
          <volume>42</volume>
          (
          <year>2008</year>
          )
          <fpage>411</fpage>
          -
          <lpage>429</lpage>
          . doi:
          <volume>10</volume>
          .1007/ s00224-007-9078-6.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mantaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Restivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <article-title>Distance measures for biological sequences: Some recent approaches</article-title>
          ,
          <source>Int. J. Approx. Reason</source>
          .
          <volume>47</volume>
          (
          <year>2008</year>
          )
          <fpage>109</fpage>
          -
          <lpage>124</lpage>
          . doi:
          <volume>10</volume>
          .1016/J.IJAR.
          <year>2007</year>
          .
          <volume>03</volume>
          .011.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>