=Paper=
{{Paper
|id=Vol-1625/paper3
|storemode=property
|title=Annotated Suffix Trees for Text Clustering
|pdfUrl=https://ceur-ws.org/Vol-1625/paper3.pdf
|volume=Vol-1625
|authors=Ekaterina Chernyak,Dmitry Ilvovsky
|dblpUrl=https://dblp.org/rec/conf/cla/ChernyakI16
}}
==Annotated Suffix Trees for Text Clustering==
Ekaterina Chernyak and Dmitry Ilvovsky

National Research University Higher School of Economics, Moscow, Russia

echernyak,dilvovsky@hse.ru

Abstract. In this paper an extension of tf-idf weighting to the annotated suffix tree (AST) structure is described. The new weighting scheme can be used for computing the similarity between texts, which can further serve as input to a clustering algorithm. We present preliminary tests of using ASTs for computing the similarity of Russian texts and show a slight improvement over the baseline cosine similarity after applying a spectral clustering algorithm.

Keywords: annotated suffix tree, clustering, similarity measure

1 Introduction

Text clustering applications exploit two major clustering approaches: either a text is represented as a vector of features and distance-based algorithms (such as k-Means) are used, or the similarity between texts is computed and similarity-based algorithms (such as k-Medoids or Normalized Cuts) are used. While the former requires extracting features from texts, the latter requires the definition of a similarity measure. Other algorithms, such as Suffix Tree Clustering [8], exploit internal features of the text collection and find clusters directly. We concentrate on the preliminary step of applying similarity-based algorithms, i.e. computing the similarity between texts. According to [2], there are several approaches to computing text similarity: string-based (which include character-based and term-based), corpus-based and knowledge-based approaches. This project belongs to the character-based approach, which means we take neither corpus data nor semantics into account. We describe a new similarity measure based on the notion of an annotated suffix tree and present an example of using this measure.

2 Annotated suffix tree

The suffix tree is a data structure used for storing and searching strings of characters and their fragments [3]. An annotated suffix tree (AST) is a suffix tree whose nodes are annotated by the frequencies of the string fragments. An algorithm for the construction of ASTs and their usage for spam filtering is described in [6]. Some other applications are described in [4, 5].

2.1 Definition of AST

According to the annotated suffix tree model [4-6], a text document is a set of words or word n-grams, which we will refer to as strings. An annotated suffix tree is a data structure used for computing and storing all fragments of the strings and their frequencies (see Fig. 1 for an example of the AST for the string “mining”). It is a rooted tree in which:

– every node corresponds to one character;
– every node is labeled by the frequency of the text fragment encoded by the path from the root to the node.

Fig. 1. AST for the string “mining”

The AST has two important properties: the frequency of a parent node is equal to

1. the sum of the frequencies of its children;
2. the sum of the frequencies of the underlying leaves.

According to these properties we can calculate the frequency of the root: it is equal to the sum of the frequencies on the first level of the AST. For example, the frequency of the root of the AST in Fig. 1 is equal to 2 + 2 + 1 + 1 = 6.

2.2 Similarity measure

To estimate the similarity between two texts we find the common subtree of the corresponding ASTs: we run a depth-first search for the common chains of nodes that start from the roots of both ASTs.
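Before any scoring, each text has to be turned into an AST. The following minimal Python sketch (an illustration with our own naming, not the authors' implementation) builds an annotated suffix tree by the naive procedure detailed in Section 2.3 below: every suffix of every string is threaded into a trie, and node frequencies are incremented along the way.

```python
# Minimal sketch of annotated suffix tree (AST) construction.
# Illustrative only: class and function names are ours, not the authors' code.

class ASTNode:
    def __init__(self):
        self.freq = 0        # frequency of the fragment encoded by the path to this node
        self.children = {}   # character -> ASTNode

def build_ast(strings):
    """Build an AST by inserting every suffix of every string (naive algorithm)."""
    root = ASTNode()
    for s in strings:
        for i in range(len(s)):      # every suffix s[i:]
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, ASTNode())
                node.freq += 1
    # Property from Section 2.1: the root frequency equals the sum of
    # the frequencies on the first level of the tree.
    root.freq = sum(child.freq for child in root.children.values())
    return root

ast = build_ast(["mining"])
print(ast.freq)  # 6, as computed above (i: 2, n: 2, m: 1, g: 1)
```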
After the common subtree is constructed we need to annotate and score it. We annotate the common subtree in the following way: a new node of the common subtree is annotated by two numbers, the minimum and the maximum frequencies of the corresponding nodes of the original ASTs.

Let us provide an example of common subtree construction and annotation. Given two ASTs:

– for the two strings “mining” and “dining” in Fig. 2,
– for the single string “dinner” in Fig. 3,

we construct their common subtree. There are three common chains that start from the roots: “D I N”, “I N”, “N”. All the nodes of the chain “D I N” have frequencies equal to unity in both ASTs, so in the common subtree the minimum and the maximum frequencies of all three nodes coincide and are equal to unity. The chain “I N” occurs once in the AST in Fig. 3 and four times in the AST in Fig. 2; hence the minimum frequencies are equal to one and the maximum frequencies are equal to four. The node “N” is annotated by four in the AST in Fig. 2 and by two in the AST in Fig. 3, so its minimum and maximum frequencies are equal to two and four, respectively.

Fig. 2. AST for the strings “mining” and “dining”
Fig. 3. AST for the string “dinner”
Fig. 4. Common subtree of the ASTs in Fig. 2 and Fig. 3

Following the general principle of AST-based string-to-text scoring, we suggest scoring the common subtree in several steps:

1. weighting each node by computing the mean of its two frequencies (different kinds of means, such as the geometric or the harmonic mean, could be used here; for the sake of simplicity we use the arithmetic mean below);
2. scoring every chain of the common subtree;
3. summing up all chain scores and normalizing by the number of chains.

To score a chain we compute the sum of the node frequencies divided by their parents' frequencies, which is in turn divided by the length of the chain:

$$\text{score}(chain) = \frac{\sum_{node \in chain} \frac{f^{node}}{f^{parent}}}{|chain|} = \frac{\sum_{node \in chain} \frac{(f^{node}_{\min} + f^{node}_{\max})/2}{(f^{parent}_{\min} + f^{parent}_{\max})/2}}{|chain|}$$

For example, the score of the common subtree in Fig. 4 is computed as

$$\frac{\text{score}(\text{“D I N”}) + \text{score}(\text{“I N”}) + \text{score}(\text{“N”})}{3},$$

where

$$\text{score}(\text{“D I N”}) = \frac{\frac{(1+1)/2}{(4+9)/2} + \frac{(1+1)/2}{(1+1)/2} + \frac{(1+1)/2}{(1+1)/2}}{3} = \frac{\frac{2}{13} + 1 + 1}{3} = 0.718$$

$$\text{score}(\text{“I N”}) = \frac{\frac{(1+4)/2}{(4+9)/2} + \frac{(1+4)/2}{(1+4)/2}}{2} = \frac{\frac{4}{13} + 1}{2} = 0.6538$$

$$\text{score}(\text{“N”}) = \frac{\frac{(2+4)/2}{(4+9)/2}}{1} = \frac{6}{13} = 0.4615$$

and the final score is

$$\frac{0.718 + 0.6538 + 0.4615}{3} = 0.6111.$$

At this point the scoring of the common subtree uses only the frequencies of the strings and their fragments. To make the scoring analogous to tf-idf, we can introduce an idf-like component. Consider a collection of texts. As a source of idf values we can construct a global AST from the whole text collection: we split every text into strings, exclude repeating strings from further computations, and construct the AST from the remaining unique strings. This way we count not the frequencies of the strings but the number of texts in which every string and its fragments occur, which is exactly the df of every possible fragment of the texts. To combine the common subtrees with the global tree, we update the chain scoring step:

$$\text{score}(chain) = \frac{\sum_{node \in chain} \frac{f^{node}}{f^{parent}} \times \frac{df^{node}}{df^{parent}}}{|chain|},$$

where the df values are extracted from the global tree.
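To make the scoring concrete, here is a hedged Python sketch of the similarity computation from Section 2.2, reusing the build_ast() sketch above. It walks the two ASTs in parallel (the depth-first search over shared characters), annotates each common node with the mean of the paired frequencies, and averages the chain scores as in the first formula above. The df extension would additionally multiply each node ratio by the corresponding ratio of df values taken from the global tree; it is omitted for brevity. All names are ours.

```python
# Sketch of the AST similarity: score the common subtree of two ASTs.
# For two trees, (f_min + f_max) / 2 of the paired node frequencies is simply
# their arithmetic mean, which is what we compute below.

def ast_similarity(root_a, root_b):
    """Average chain score over all maximal common chains starting at the roots."""
    chain_scores = []

    def walk(a, b, parent_mean, node_scores):
        common = a.children.keys() & b.children.keys()
        for ch in common:
            ca, cb = a.children[ch], b.children[ch]
            mean = (ca.freq + cb.freq) / 2           # (f_min + f_max) / 2
            walk(ca, cb, mean, node_scores + [mean / parent_mean])
        if not common and node_scores:               # a maximal common chain ends here
            chain_scores.append(sum(node_scores) / len(node_scores))

    root_mean = (root_a.freq + root_b.freq) / 2
    walk(root_a, root_b, root_mean, [])
    return sum(chain_scores) / len(chain_scores) if chain_scores else 0.0

# Toy usage with the ASTs from the example above:
# ast_similarity(build_ast(["mining", "dining"]), build_ast(["dinner"]))
```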
2.3 Construction of AST

It is possible to construct an AST straightforwardly from a set of tokens using a fairly obvious iterative algorithm, which requires splitting each token into suffixes and adding them consecutively to the AST [5] (the procedure sketched in the code above). However, it is shown in [1] that it is more time- and space-efficient to construct a suffix tree using one of the well-known algorithms and then annotate the tree with frequencies.

3 Evaluation

We manually created a text collection for testing the proposed algorithm. The collection consists of 50 documents in Russian, with every 10 texts devoted to a different meaning of the word “jaguar”: an animal, a car, a beverage, a film or a sewing machine. We supposed that the clusters we obtain should coincide with the predefined text classes, i.e. we should get five clusters, each corresponding to one of the initial classes. We used the Shi-Malik algorithm [7] with the default parameters to cluster the similarity matrices, trying two approaches to similarity matrix construction:

1. the tf-idf transformation and the cosine similarity;
2. the AST technique presented above.

A rough sketch of this pipeline is given after the Conclusion.

To compare these approaches we computed the number of errors in the obtained clusters: given a cluster, we found the mode value of the class and counted how many documents do not belong to this class. The higher this number, the worse the clustering result. Using the cosine similarity and the Shi-Malik algorithm for finding five clusters, we obtained four perfect clusters and one cluster that contained six errors; hence 44 texts were clustered correctly and 6 were not. Using the AST technique, we got only 2 errors, which means that 48 texts were clustered correctly. The results of clustering are presented in Table 1.

Fig. 5 and Fig. 6 present the heat maps of both similarity matrices and reveal some issues of the suggested AST technique. First, when the AST technique is applied to compute the similarity of a text to itself, the result is not equal to unity; what is more, it yields different values for different texts. Second, all the similarity values are more or less the same: there is no drastic difference between the within-class and between-class values. These issues are to be solved in the future.

Table 1. Clustering quality

                     Accuracy   # of errors
  cosine similarity    0.88          6
  AST similarity       0.96          2

Fig. 5. Heat map of the cosine similarity matrix
Fig. 6. Heat map of the AST similarity matrix

4 Conclusion

We suggest a new text similarity measure based on annotated suffix trees. To estimate the similarity between text A and text B, one constructs two annotated suffix trees, finds their common subtree and scores it. The scoring can be extended with document frequencies if a text collection is given. The preliminary experiments show that, although the proposed similarity measure has some clear advantages over the baseline cosine similarity by being more robust, some formal aspects should be improved. For example, currently the similarity of a text to itself is not equal to unity, which affects the visualisation. Obviously, more experiments should be conducted to find other limitations and possible improvements.
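As a rough illustration of the evaluation pipeline of Section 3 (not the authors' exact setup), a precomputed similarity matrix can be fed to a spectral clustering implementation; scikit-learn's SpectralClustering with affinity="precomputed" is a close stand-in for the Shi-Malik normalized-cuts algorithm. The texts, labels and the per-pair AST construction in the commented usage are placeholder assumptions; in the paper the texts would first be split into words or word n-grams.

```python
# Rough sketch of the clustering experiment from Section 3.
import numpy as np
from sklearn.cluster import SpectralClustering

def similarity_matrix(texts, similarity):
    """Pairwise similarity matrix for a list of texts under a similarity function."""
    n = len(texts)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = similarity(texts[i], texts[j])
    return sim

def cluster_errors(labels_pred, labels_true):
    """The paper's error count: per cluster, documents outside the cluster's mode class."""
    errors = 0
    for c in set(labels_pred):
        classes = [t for p, t in zip(labels_pred, labels_true) if p == c]
        mode = max(set(classes), key=classes.count)
        errors += len(classes) - classes.count(mode)
    return errors

# Placeholder usage (texts and true_labels are assumed to be given):
# sim = similarity_matrix(texts, lambda a, b: ast_similarity(build_ast([a]), build_ast([b])))
# pred = SpectralClustering(n_clusters=5, affinity="precomputed").fit_predict(sim)
# print(cluster_errors(list(pred), true_labels))
```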
Acknowledgments

The paper was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported within the framework of a subsidy by the Russian Academic Excellence Project “5-100”. The authors were partially supported by the Russian Foundation for Basic Research, grants no. 16-29-12982 and 16-01-00583.

References

1. Dubov, M.: Text analysis with enhanced annotated suffix trees: algorithms and implementation. In: Analysis of Images, Social Networks and Texts, pp. 308-319. Springer International Publishing (2015)
2. Gomaa, W. H., Fahmy, A. A.: A survey of text similarity approaches. International Journal of Computer Applications 68(13) (2013)
3. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press (1997)
4. Chernyak, E. L., Chugunova, O. N., Askarova, J. A., Nascimento, S., Mirkin, B. G.: Abstracting concepts from text documents by using an ontology. In: Proceedings of the 1st International Workshop on Concept Discovery in Unstructured Data, pp. 21-31 (2011)
5. Chernyak, E. L., Chugunova, O. N., Mirkin, B. G.: Annotated suffix tree method for measuring degree of string to text belongingness. Business Informatics 21(3), 31-41 (2012) (in Russian)
6. Pampapathi, R., Mirkin, B., Levene, M.: A suffix tree approach to anti-spam email filtering. Machine Learning 65(1), 309-338 (2006)
7. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888-905 (2000)
8. Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46-54. ACM (1998)