     Annotated suffix tree similarity measure for text
                      summarization

                          Maxim Yakovlev and Ekaterina Chernyak

                   National Research University – Higher School of Economics
                                       Moscow, Russia
                                 myakovlev,echernyak@hse.ru



            Abstract. The paper describes an attempt to improve the TextRank
            algorithm, an algorithm for unsupervised text summarization. It has
            two main stages: in the first stage, a text is represented as a
            weighted directed graph, where nodes stand for single sentences and
            edges, weighted with sentence similarity, connect sequential
            sentences. In the second stage, the PageRank algorithm [1] is
            applied to the graph as is. The nodes that receive the highest
            ranks form the summary of the text.
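            As an illustration (not the authors' code), the second stage can be
            sketched as a power iteration over the weighted sentence graph; the
            function below is a minimal, self-contained sketch that assumes a
            dense similarity matrix `weights`:

```python
def pagerank(weights, d=0.85, iters=50):
    """Power-iteration PageRank over a weighted sentence graph.

    weights[i][j] is the similarity of sentence i to sentence j;
    each node distributes its rank proportionally to its outgoing
    edge weights. d is the usual damping factor."""
    n = len(weights)
    ranks = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if i == j or weights[j][i] == 0:
                    continue
                out = sum(weights[j][k] for k in range(n) if k != j)
                if out:
                    incoming += weights[j][i] / out * ranks[j]
            new.append((1 - d) / n + d * incoming)
        ranks = new
    return ranks
```

            Sentences are then ordered by rank and the top-ranked ones form the
            summary.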
            We focus on the first stage, especially on measuring sentence
            similarity. Mihalcea and Tarau [4] suggest employing the common
            scheme: use the vector space model (VSM), so that every sentence is
            a vector in the space of words or stems, and compute the cosine
            similarity between these vectors.
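            For reference, this baseline can be sketched as cosine similarity
            over bag-of-words vectors (a minimal illustration, not the authors'
            implementation):

```python
from collections import Counter
from math import sqrt

def cosine_sim(s1, s2):
    """Cosine similarity of two sentences in the bag-of-words space."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```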
            Our idea is to replace this scheme with the annotated suffix tree
            (AST) [5] model for sentence representation. The AST model
            overcomes several limitations of the VSM, such as dependence on
            the vocabulary size and the sentence length, and the need for
            stemming or lemmatization. This is achieved by taking all fuzzy
            matches between sentences into account and computing the
            probabilities of matched co-occurrences.
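            To give a flavour of the AST model, the sketch below builds a
            character-level suffix trie annotated with frequencies and scores a
            string against it by averaging the conditional probabilities along
            matched suffixes. This is a simplified illustration in the spirit
            of Pampapathi et al. [5], not the authors' exact algorithm:

```python
def build_ast(string):
    """Build a simplified annotated suffix trie: each node stores the
    frequency of the character path leading to it.
    Node representation: {char: [frequency, children_dict]}."""
    root = {}
    for i in range(len(string)):
        node = root
        for ch in string[i:]:
            entry = node.setdefault(ch, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root

def ast_score(root, query):
    """Score a query against the trie: for every suffix of the query,
    average the conditional probabilities of the matched characters,
    then average over all suffixes."""
    total = sum(e[0] for e in root.values())
    suffix_scores = []
    for i in range(len(query)):
        node, parent_count, probs = root, total, []
        for ch in query[i:]:
            if ch not in node:
                break
            count, children = node[ch]
            probs.append(count / parent_count)
            parent_count, node = count, children
        suffix_scores.append(sum(probs) / len(query[i:]) if probs else 0.0)
    return sum(suffix_scores) / len(suffix_scores) if suffix_scores else 0.0
```

            A symmetric sentence similarity could then be obtained, for
            example, by averaging `ast_score` in both directions.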
            More specifically, we develop an algorithm for common subtree
            construction and annotation. The common subtrees are used to score
            the similarity between two sentences. Using this algorithm allows
            us to achieve a slight improvement over the cosine baseline on our
            own collection of Russian newspaper texts: the AST measure gains
            around 0.05 points of precision over the cosine measure. This is a
            considerable gain for a natural language processing task, given how
            low the baseline precision of the cosine measure is. The low
            precision can be explained by a lack of consistency in the
            constructed collection: the authors of the articles use different
            strategies to highlight the important sentences. The text
            collection is heterogeneous: some articles have 10 or more
            sentences highlighted, others only the first one.
            Unfortunately, there is no other test collection for text
            summarization in Russian. For further experiments we might need to
            exclude some articles, so that the summary size becomes more
            stable. Another issue with our test collection is the selection of
            the sentences that form the summaries. When test collections are
            constructed manually, summaries are chosen according to common
            principles; here we cannot be sure that the sentences were not
            highlighted arbitrarily.
            Although the AST technique is rather slow, this is not a big issue
            for the text summarization problem: summarization is not the kind
            of problem that requires on-line algorithms. Hence precision
            matters more than running time.

In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of
DMNLP, Workshop at ECML/PKDD, Porto, Portugal, 2015.
Copyright © by the paper's authors. Copying only for private and academic purposes.
            There are several directions for future work. First of all, we
            plan to conduct experiments on the standard DUC (Document
            Understanding Conference [2]) collections in English. Second, we
            are going to develop different methods for the construction and
            scoring of common subtrees and compare them to each other.
            Finally, we may use an external and more efficient implementation
            of the AST method, such as the EAST Python library by Mikhail
            Dubov [3], which uses annotated suffix arrays. More details on
            this work can be found in [6].
           Keywords: TextRank, annotated suffix tree


    References
    1. Brin S., Page L. : The anatomy of a large-scale hypertextual Web search engine. In:
       Proceedings of the seventh international conference on World Wide Web 7, 107-117
       (1998)
    2. Document Understanding Conference, http://www-nlpir.nist.gov/
    3. Enhanced Annotated Suffix Tree, https://pypi.python.org/pypi/EAST/0.2.2/
    4. Mihalcea R., Tarau P. : TextRank: bringing order into text. In: Proceedings of the
       Conference on Empirical Methods in Natural Language Processing, 404-411 (2004)
    5. Pampapathi R., Mirkin B., Levene M.: A suffix tree approach to anti-spam
       email filtering. Machine Learning, 65(1), 309-338 (2008)
    6. Yakovlev M., Chernyak E.: Using annotated suffix tree similarity measure
       for text summarization. In: Proceedings of the European Conference on
       Data Analysis, 2015.