<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>x Trees for Text Clustering</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics Moscow</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>In this paper an extension of tf -idf weighting on annotated su x tree (AST) structure is described. The new weighting scheme can be used for computing similarity between texts, which can further serve as in input to clustering algorithm. We present preliminary tests of using AST for computing similarity of Russian texts and show slight improvement in comparison to the baseline cosine similarity after applying spectral clustering algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The text clustering applications exploit two major di erent clustering approaches:
either a text is represented as a vector of features, and distance-based algorithms
(such as k-Means) are used, or the similarity between texts are computed and the
similarity-based algorithms (such k-Medoids or Normalized cuts) are used. While
the former requires to extract features from texts, the latter requires the de
nition of similarity measure. The other algorithms, such as Su x Tree Clustering
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explore internal features for the text collection and nd clusters
straightforward. We will concentrate of the preliminary step of applying distance-based
algorithms, i.e. computing similarity between texts. According to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there are
several approaches to computing text similarity: there are string-based (which
include character-based and term-based), corpus-based and knowledge-based
approaches. This project belongs to the characters-based approach, which means,
we do not take corpora data or semantics into account. We describe a new
similarity measure, which is based on the notion of an annotated su x tree and
present an example of using this measure.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Annotated su x tree</title>
      <p>
        The su x tree is a data structure used for storing of and searching for strings
of characters and their fragments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An annotated su x tree (AST) is a su x
tree whose nodes are annotated by the frequencies of the strings fragments.
An algorithm for the construction and the usage of AST for spam- ltering is
described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Some other applications are described in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>De nition of AST</title>
        <p>According to the annotated su x tree model [4{6], a text document is a set of
words or word n-grams, which we will address as strings. An annotated su x
tree is a data structure used for computing and storing all fragments of the
strings and their frequencies (see Fig. 1 for an example of the AST for the string
\mining"). It is a rooted tree in which:
{ Every node corresponds to one character
{ Every node is labeled by the frequency of the text fragment encoded by the
path from the root to the node.
2.1
.</p>
        <p>The AST has two important proprieties: the frequency of a parent node is
equal to:
1. the sum of the frequencies of children nodes;
2. the sum of the frequencies of underlying leaves.</p>
        <p>According to these properties we can calculate the frequency of the root: it
is equal to the sum of the frequencies on the rst level of the AST. For example,
the frequency of the root of the AST in Fig. 1 is equal to 2+2+1+1 = 6.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Similarity measure</title>
        <p>To estimate the similarity between two texts we nd the common subtree of
the corresponding ASTs. We do the depth- rst search for the common chains of
nodes that start from the root of the both ASTs. After the common subtree is
constructed we need to annotate and score it.</p>
        <p>We annotate the common subtree in the following way. A new node of a
common subtree is annotated by two numbers:the minimum and the maximum
frequencies of the corresponding nodes of original ASTs.</p>
        <p>Let us provide an example of common subtree construction and annotation.
Given two ASTs:
{ for two strings \mining" and \dining" in Fig. 2
{ for one string \dinner" in Fig. 3
we construct the common subtree for them. There are three common chains,
which start from the roots: \D I N", \I N", \N". All the nodes of the chain \D
I N" have frequencies equal to unity in both ASTs, so in the common subtree
the minimum and the maximum frequencies of all three nodes coincide and are
equal to unity. The chain \I N" occurs once in the AST in Fig. 3 and four times
in the AST in Fig. 2, hence the minimum frequencies are equal to unity and
the maximum frequencies are equal to four. The node \N" is annotated by four
in the AST in Fig. 2 and by two in the AST in Fig. 3. So its minimum and
maximum frequencies are equal to two and four, respectively.</p>
        <p>Following the general principle of the AST-based string to text scoring we
suggest to score the common subtree in several steps:
1. weighting each node by computing the mean between two frequencies. At this
step we can use di erent type of means, such as geometric mean or harmonic
mean. For the sake of simplicity we use further the arithmetic mean;
2. scoring every chain of the common subtree;
3. summing up all chain scores and standardizing them by dividing by the
number of chains;</p>
        <p>To score the chain we compute the sum of the node frequencies divided by
their parents frequencies, which is again divided by the length of the chain:
score(chain) =</p>
        <p>P</p>
        <p>fnode
node2chain fparent =
jchainj</p>
        <p>P</p>
        <p>(fmnoidne+fmnoadxe)=2
node2chain (fmpairnent+fmpaarxent)=2
jchainj
For example, the scoring of the common subtree in Fig. 4 is computed as:
score(``D I N'') + score(``I N'') + score(``N'')
3
;
where
score(``D I N'') = ((14++91))==22 + ((11++11))==22 + ((11++11))==22 = 123 + 1 + 1
= 0:718
3</p>
        <p>3
= 0:6538
score(``I N'') = ((41++94))==22 + ((11++44))==22 = 143 + 1</p>
        <p>2 2
and the nal scoring is:</p>
        <p>(2+4)=2
score(``N'') = (4+9)=2 = 6
1 13</p>
        <p>= 0:4615
0:718 + 0:6538 + 0:4615
3
= 0:6111</p>
        <p>At this point the scoring of the common subtree is based on using only the
frequencies of the strings and their fragments. To make the scoring analogous to
computing tf -idf we can introduce the idf -like component to the scoring.</p>
        <p>Let us think about a collection of texts. As a source for idf values we can
construct a global AST from the whole text collection. To construct the global
AST we split every text in strings and exclude from further computations
repeating strings and construct the AST then from these unique strings. This way
we calculate not the frequencies of the strings, but the number of texts where
every string and their fragments occur, that is exactly the df 's of all the possible
fragments of the texts.</p>
        <p>To combine common subtrees and the global tree, we update the chain scoring
step:
score(chain) =</p>
        <p>P</p>
        <p>fnode
node2chain fparent
jchainj
dfnode
dfparent
;
where df are extracted from the global tree.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Construction of AST</title>
        <p>
          It is possible to construct an AST straightforward form a set of tokens using quite
an obvious iterative algorithm, which requires splitting each token in su xes and
adding them consecutively to the AST [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, is is shown in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], that it
is more time- and space-e cient to construct a su x tree, using one of the
well-known algorithms and than annotate the tree with the frequencies.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        We manually created the text collection for further testing of the proposed
algorithm. The collection consists of 50 documents in Russian, every 10 text devoted
to di erent de nitions of the word \jaguar": an animal, a car, a beverage, a lm
or a sewing machine. We supposed, that the clusters we would achieve should
coincide with the prede ned text classes, i.e. we should get ve clusters, every
cluster corresponding to the initial class. We used the Shi-Malik algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
with the default parameters to cluster the similarity matrices. Two approaches
to the similarity matrix construction:
1. the tf -idf transformation and the cosine similarity
2. the AST technique, presented above.
      </p>
      <p>To compare these approaches we computed the number of errors in the
achieved clusters. Given a cluster we found the mode value of the class and
calculated how many documents do not belong to this class. The higher this
number is, the worse is the result of clustering. Using the cosine similarity and
Shi-Malik algorithm for nding ve clusters, we achieved four perfect cluster and
one cluster, that contained six errors. Hence 44 texts were clustered correctly
and 6 were not. Using the AST technique, we got only 2 errors, which means
that 48 texts were clustered correctly. The results of clustering are presented in
Table 1.</p>
      <p>Fig. 5 and Fig. 6 present the heat maps of both similarity measures and
reveal some issues of the suggested AST technique. First of all, when the AST
technique is applied to compute the similarity of a text to itself, the result is
not equal to unity. What is more, for di erent texts it results in di erent values.
Second, all the similarity values are more or less the same, there is no drastic
di erence at all between the inside or outside classes values. These are the issues
to be solved in the future.
We suggest a new text similarity measure, which is based on annotated su x
trees. To estimate the similarity between text A and text B, one should construct
two annotated su x trees, nd the common subtree and score it. The scoring
can be extended by document frequencies, if a text collection is given. The
preliminary experiments show, that although the proposed similarity measure has
some clear advantages in comparison to the baseline cosine similarity, because
of being more robust, some formal aspects should be improved. For example,
currently the similarity of a text to itself is not equal to unity, which a ects the
visualisation. Obviously, more experiments should be conducted to nd other
limitations and possible improvements.</p>
      <p>Acknowledgments The paper was prepared within the framework of the Basic
Research Program at the National Research University Higher School of
Economics (HSE) and supported within the framework of a subsidy by the Russian
Academic Excellence Project \5-100". The authors was partially supported by
Russian Foundation for Basic Research, grants no. 16-29-12982 and 16-01-00583.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dubov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Text Analysis with Enhanced Annotated Su x Trees: Algorithms and Implementation</article-title>
          .
          <source>In Analysis of Images, Social Networks and Texts</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>319</lpage>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gomaa</surname>
            ,
            <given-names>W. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fahmy</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>Survey of text similarity approaches</article-title>
          .
          <source>International Journal of Computer Applications</source>
          ,
          <year>2013</year>
          ,
          <volume>68</volume>
          (
          <issue>13</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gus</surname>
          </string-name>
          eld D.: Algorithms on Strings,
          <source>Trees, and Sequences</source>
          , Cambridge University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chernyak</surname>
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugunova</surname>
            <given-names>O.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Askarova</surname>
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nascimento</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirkin</surname>
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <article-title>Abstracting concepts from text documents by using an ontology</article-title>
          ,
          <source>in Proceedings of the 1st International Workshop on Concept Discovery in Unstructured Data</source>
          .
          <year>2011</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chernyak</surname>
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugunova</surname>
            <given-names>O.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirkin</surname>
            <given-names>B.G.</given-names>
          </string-name>
          :
          <article-title>Annotated su x tree method for measuring degree of string to text belongingness</article-title>
          ,
          <source>Business Informatics</source>
          ,
          <year>2012</year>
          . Vol.
          <volume>21</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>41</lpage>
          (in Russian).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pampapathi</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirkin</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levene</surname>
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A su x tree approach to anti-spam email ltering</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <year>2006</year>
          , Vol.
          <volume>65</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
          </string-name>
          , J.:
          <article-title>Normalized cuts and image segmentation</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          , IEEE Transactions on,
          <year>2000</year>
          ,
          <volume>22</volume>
          (
          <issue>8</issue>
          ),
          <fpage>888</fpage>
          -
          <lpage>905</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zamir</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Web document clustering: A feasibility demonstration</article-title>
          .
          <source>In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          (pp.
          <fpage>46</fpage>
          -
          <lpage>54</lpage>
          ).
          <year>1998</year>
          , ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>