     Some Thoughts on Using Annotated Suffix Trees
           for Natural Language Processing

                                       Ekaterina Chernyak

                   National Research University – Higher School of Economics
                                       Moscow, Russia
                                      echernyak@hse.ru



         Abstract. The paper defines an annotated suffix tree (AST), a data
         structure used to calculate and store the frequencies of all the fragments
         of a given string or collection of strings. The AST is associated with
         a string-to-text relevance scoring that takes all fuzzy matches into account.
         We show how the AST and the AST scoring can be used for Natural
         Language Processing tasks.
         Keywords: text representation, annotated suffix tree, text summarization, text categorization


     1    Introduction

  Natural Language Processing tasks require that a text be represented by some
  formal structure so that it can be processed by a computer. The most popular text
  representation is the Vector Space Model (VSM), designed by Salton [1]. The
  idea of the VSM is simple: given a collection of texts, represent every text as a
  vector in a space of terms. A term is a word itself, a lemmatized word, the
  stem of a word, or any other meaningful part of a word. The VSM is widely
  used in all kinds of Natural Language Processing tasks. The few exceptions are
  machine translation and text generation, where word order is important, while
  the VSM loses it completely. For these purposes Ponte and Croft introduced the
  language model [2], which is based on calculating the probability of a sequence
  of n words or characters, a so-called n-gram. There is one more approach to text
  representation, which is based on suffix trees and suffix arrays. Originally the
  suffix tree was developed for fuzzy string matching and indexing [3]. However,
  there have appeared several applications of suffix trees to Natural Language
  Processing. One of them is document clustering, presented in [4]. When some sort
  of probability estimators of the paths in the suffix tree are introduced, it can be
  used as a language model for machine translation [5] and information retrieval
  [6].
      In this paper we concentrate on the so-called annotated suffix
  tree (AST), introduced in [8]. We present the data structure itself and several
  Natural Language Processing tasks where the AST representation is successfully
  used. We do not make any comparisons to other text representation
  models, but show that the AST approach helps to overcome some existing problems.




In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of
DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014.
Copyright © by the paper’s authors. Copying only for private and academic purposes.

  The paper is organized as follows: Section 2 presents
  the definition of the AST and the algorithm for AST construction; Sections
  3 to 7 present existing applications of the AST (almost all developed with the
  author’s contribution); Section 8 lists some future applications; Section 9 suggests
  how to compare the AST scoring to other approaches; Section 10 is devoted to
  the AST scoring implementation; Section 11 concludes.
     The project is being developed by the “Methods of web corpus collection,
  analysis and visualisation” research and study group under the guidance of Prof. B.
  Mirkin (grant 15-05-0041 of the Academic Fund Program).


    2      Annotated suffix tree

    2.1     Definition

  The suffix tree is a data structure used for storing and searching strings of
  characters and their fragments [3]. When the suffix tree representation is used,
  the text is considered as a set of strings, where a string may be any significant
  part of the text, such as a word, a word or character n-gram, or even a whole sentence.
  An annotated suffix tree (AST) is a suffix tree whose nodes (not edges!) are
  annotated by the frequencies of the string fragments.
      An annotated suffix tree (see Figure 1) [7] is a data structure used for computing
  and storing all fragments of the text and their frequencies. It is a rooted
  tree in which:

      – Every node corresponds to one character;
      – Every node is labeled by the frequency of the text fragment encoded by the
        path from the root to the node.



    2.2     AST construction

  Our algorithm for constructing an AST is a modification of the well-known
  algorithm for constructing suffix trees [3]. The algorithm is based on finding
  suffixes and prefixes of a string. Formally, the i-th suffix of a string is the
  substring that starts at the i-th character of the string, and the i-th prefix of the
  string is the substring that ends at the i-th character of the string. The AST
  is built in an iterative way. For each string, its suffixes are added to the AST
  one by one, starting from an empty tree that consists of the root only. To add a suffix to
  the AST, we first check whether there is already a match, that is, a path in the
  AST that encodes / reads the whole suffix or its prefix. If such a match exists, we
  add 1 to all the frequencies in the match and, if the match does not cover the whole
  suffix, append new nodes with frequency 1 to the last node of the match. If there is
  no match, we create a new chain of nodes in the AST from the root, with all
  frequencies set to 1.
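
  For concreteness, here is a minimal Python sketch of this construction (the class
  and function names are ours, not part of any published implementation; the efficient
  implementation discussed in Section 10 is based on suffix arrays instead):

```python
class Node:
    """A node of an annotated suffix tree: one character and its frequency."""
    def __init__(self, char):
        self.char = char
        self.freq = 0
        self.children = {}  # character -> child Node


def build_ast(strings):
    """Build an annotated suffix tree for a collection of strings."""
    root = Node("")  # the root encodes the empty string
    for s in strings:
        for i in range(len(s)):       # add the i-th suffix, one by one
            node = root
            for ch in s[i:]:
                # Follow the matching child if it exists, otherwise branch off
                # a new node; either way the node's frequency grows by 1.
                node = node.children.setdefault(ch, Node(ch))
                node.freq += 1
    # The root is annotated with the sum of the first-level frequencies.
    root.freq = sum(child.freq for child in root.children.values())
    return root
```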




                             Fig. 1. An AST for the string “mining”.


    2.3     AST relevance measure

  To use an AST to score string-to-text relevance, we first build an AST for the
  text. Next we match the string to the AST to estimate the relevance.
     The procedure for computing the string-to-text relevance score:
     Input: a string and the AST for a given text.
     Output: the AST relevance score.

     1. The string is represented by the set of its suffixes;
   2. Every suffix is matched against the AST starting from the root. To estimate the
      match we use the average conditional probability of the next symbol:

      $$score(match(suffix, ast)) = \frac{1}{|suffix|} \sum_{node \in match} \frac{f(node)}{f(parent(node))},$$

      where $f(node)$ is the frequency of the matching node, $f(parent(node))$ is
      the frequency of its parent, and $|suffix|$ is the length of the suffix;
   3. The relevance of the string is evaluated by averaging the scores of all suffixes:

      $$relevance(string, text) = SCORE(string, ast) = \frac{1}{|string|} \sum_{suffix} score(match(suffix, ast)),$$

      where $|string|$ is the length of the string.

     Note that “score” is obtained by applying a scaling function that converts a match
  score into a relevance evaluation. There are three useful scaling functions,
  according to the spam classification experiments in [8]:

      – Identity function: $\varphi(x) = x$
      – Logit function: $\varphi(x) = \log \frac{x}{1-x} = \log x - \log(1-x)$
      – Root function: $\varphi(x) = \sqrt{x}$

    The identity scaling stands for the conditional probability of characters averaged
    over matching fragments (CPAMF).
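
  Continuing the sketch from Section 2.2, the scoring procedure can be written as
  follows (a minimal sketch with the identity scaling as the default; `build_ast`
  and `Node` are the functions from the construction sketch, and all names are ours):

```python
def score_suffix(root, suffix):
    """Average conditional probability of the next character along the match."""
    node, total, parent_freq = root, 0.0, root.freq
    for ch in suffix:
        if ch not in node.children:
            break                          # the match ends here
        node = node.children[ch]
        total += node.freq / parent_freq   # f(node) / f(parent(node))
        parent_freq = node.freq
    return total / len(suffix)


def relevance(string, root, scale=lambda x: x):
    """CPAMF relevance: average the (scaled) scores of all suffixes."""
    return sum(scale(score_suffix(root, string[i:]))
               for i in range(len(string))) / len(string)
```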
      Consider an example to illustrate the described method. Let us construct
  an AST for the string “mining”. This string has six suffixes: “mining”, “ining”,
  “ning”, “ing”, “ng”, and “g”. We start with the first suffix and add it to the
  empty AST as a chain of nodes with frequencies equal to unity. To add the
  next suffix, we need to check whether there is any match, i.e. whether there is
  a path in the AST starting at its root that encodes / reads a prefix of
  “ining”. Since there is no match between existing nodes and the second suffix,
  we add it to the root as a chain of nodes with frequencies equal to unity.
  We repeat this step until a match is found: a prefix of the fourth suffix “ing”
  matches the second suffix “ining”: the first two letters, “in”, coincide. Hence we add
  1 to the frequency of each of these nodes and add a new child node “g” to the
  node “n” (see Figure 1). The next suffix “ng” matches the third suffix, and
  we repeat the same actions: increase the frequencies of the matched nodes and
  add a new child node for the character that does not match. The last suffix does not
  match any path in the AST, so again we add it to the AST’s root, as a single node with
  frequency equal to unity. Now let us calculate the relevance score of the string
  “dining” using the AST in Figure 1. There are six suffixes of the string “dining”:
  “dining”, “ining”, “ning”, “ing”, “ng”, and “g”. Each of them is aligned with an
  AST path starting from the root. The scores of the suffixes are presented in
  Table 1.


                        Table 1. Computing the string “dining” score

                 Suffix     Match      Score
                 “dining”   None       0
                 “ining”    “ining”    (1/1 + 1/1 + 1/2 + 2/2 + 2/6)/5 = 0.76
                 “ning”     “ning”     (1/1 + 1/1 + 1/2 + 2/6)/4 = 0.71
                 “ing”      “ing”      (1/2 + 2/2 + 2/6)/3 = 0.61
                 “ng”       “ng”       (1/2 + 2/6)/2 = 0.41
                 “g”        “g”        (1/6)/1 = 0.16



     We have used the identity scaling function to score all six suffixes of the string
  “dining”. Now, to get the final CPAMF relevance value, we sum and average them:

  $$relevance(dining, mining) = \frac{0 + 0.76 + 0.71 + 0.61 + 0.41 + 0.16}{6} = \frac{2.65}{6} = 0.44$$

     In spite of the fact that “dining” differs from “mining” by just one character,
  the total score, 0.44, is well below unity. This is not only because the trivial suffix
  “dining” contributes 0 to the sum, but also because the conditional probabilities get
  smaller for the shorter suffixes.
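
  As a quick check, the sketches from Sections 2.2 and 2.3 reproduce this value:

```python
ast = build_ast(["mining"])                  # the AST of Figure 1
print(round(relevance("dining", ast), 2))    # 0.44, as computed above
```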


    3     Spam filtering

  The definition of the AST presented above was first introduced by Pampapathi,
  Mirkin and Levene in [7] for spam filtering. The AST was used as a
  representation tool for each class (spam and ham). By introducing a procedure
  for scoring a message against the class AST, they developed a classifier that beats
  the Naive Bayes classifier in a series of experiments on standard datasets. The success
  of ASTs in the domain of email filtering was due to the notion of match permutation
  normalization, which makes it possible to take into account the intentional typos
  introduced by spammers to get past spam filters. Match permutation normalization is in a
  sense analogous to the edit distance [10], which is frequently implemented in spam
  filters [11].


    4     Research paper categorization

  The problem of text categorization is formulated as follows. Given a collection
  of documents and a domain taxonomy, annotate each document with relevant
  taxonomy topics. A taxonomy is a rooted tree such that every node corresponds
  to a (taxonomy) topic of the domain. The taxonomy generalizes the relation
  “is-a” or “is a part of”.
      There are two basic approaches to the problem of text categorization: supervised
  and unsupervised. Supervised approaches give high precision values
  when applied to web document categorization [12], but may fail when applied
  to research paper categorization, since research taxonomies, such as the ACM
  Computing Classification System [13], are seldom revised and the supervised
  techniques may overfit [14]. The unsupervised approaches to text categorization
  are based on an information-retrieval-like idea: given the set of taxonomy topics,
  find those research papers that are relevant to each topic. The question
  for the researcher is then: what kind of relevance model and measure
  should be chosen? In [15] we experimentally compared the cosine relevance function, which
  measures the cosine between tf-idf vectors in the Vector Space Model [1], BM25,
  based on the probabilistic relevance framework [16], and the AST scoring introduced
  above. These three relevance measures were applied to a relatively small dataset
  of 244 articles published in ACM journals and the current version of the ACM
  Computing Classification System. The AST scoring outperforms the cosine and BM25
  measures by being more robust and taking not crisp but fuzzy matches into account.
  The next step in this research direction would be testing the AST scoring
  against the w-shingling procedure [17], which is also a fuzzy matching technique but
  requires text preprocessing, such as stemming or lemmatization. In contrast, there is
  no need for stemming or lemmatization to apply the AST scoring.


     5      Taxonomy refinement
  Taxonomies are widely used to represent, maintain and store domain knowledge;
  see, for example, SNOMED [18] or the ACM CCS [13]. Domain taxonomy construction
  is a difficult task, and a number of researchers have come up with the idea of
  taxonomy refinement. The idea of taxonomy refinement is the following: having
  one taxonomy, or the upper levels of a taxonomy, refine it with topics extracted from
  additional sources such as other taxonomies, web search or Wikipedia. We followed
  this strategy and developed a two-step approach to taxonomy refinement,
  presented in more detail in [21]. We concentrated on taxonomies of probability
  theory and mathematical statistics (PTMS) and numerical mathematics (NM),
  both in Russian. In the first step an expert manually sets the upper layers of the
  taxonomy. In the second step these upper layers are refined with the Wikipedia category
  tree and the articles belonging to this tree, from the same domain. In this study
  the AST scoring is used several times:
      – To clear the Wikipedia data of noise;
      – To assign the remaining Wikipedia categories to the taxonomy topics;
      – To form the intermediate layers of the taxonomy by using Wikipedia subcategories;
      – To use the Wikipedia articles in each of the added category nodes as its leaves.
  The Wikipedia data is rather noisy: there are articles that are stubs or are irrelevant
  to their parental categories (the categories they belong to), and, moreover, there
  are subcategories (of a category) that are irrelevant to the parental categories.
  For example, we found the article “ROC curve” to be irrelevant to the category
  “Regression analysis”, and the category “Accidentally killed” to be irrelevant to the
  category “Randomness”. To decide which articles are irrelevant we exploit the AST
  scoring twice:
      – We score the title of the article against the text of the article to detect stubs;
      – We score the title of the parental category against the text of the article to detect
        irrelevant categories.
  If the value of the scoring function is less than a threshold, we decide that the
  article is irrelevant. Usually we set the threshold at 0.2. To assign the remaining
  Wikipedia categories to the taxonomy topics, we score the taxonomy topics against all
  the articles of a category merged into one text. Next we find the maximum
  value of the scoring function and assign the category to the corresponding
  taxonomy topic. We then score the titles of parental categories against the articles
  of the subcategories, merged into one. If the subcategory-to-category score
  is higher than the subcategory-to-taxonomy-topic score, the subcategory remains on
  the intermediate layer of the refined taxonomy tree, under its parental category.
  Finally, the articles left after noise clearing become leaves in the refined
  taxonomy tree. The quality of the obtained PTMS and NM taxonomies is difficult
  to evaluate computationally, so the design of a user study is an open question.
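
  For illustration, the noise-clearing step can be sketched with the functions from
  Section 2 (the 0.2 threshold is the value mentioned above; the function name is ours):

```python
def is_noise(article_title, category_title, article_text, threshold=0.2):
    """An article is treated as noise if it looks like a stub (its own title
    scores low against its text) or is irrelevant to its parental category."""
    ast = build_ast([article_text])
    is_stub = relevance(article_title, ast) < threshold
    is_irrelevant = relevance(category_title, ast) < threshold
    return is_stub or is_irrelevant
```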


    6   Text summarization
  Automatic text summarization is one of the key tasks in natural language processing.
  There are two main approaches to text summarization, called the abstractive
  and the extractive approach [22].
      According to the abstractive approach, the summary of a text is another, much
  shorter text, generated automatically as a semantic representation of the original.
  According to the extractive approach, the summary of a text is nothing
  else but some important parts of the given text, such as a set of important
  sentences.
      The extractive summarization problem can be formulated in the following
  way. Given a text T that is a sequence of sentences S consisting of words V,
  select a subset of the sentences $S^*$ that are important in T. Therefore we need
  to define:
   – what the importance of a sentence is;
   – how to measure the importance of a sentence.
  Hence we need to introduce a function, importance(s), which measures the importance
  of a sentence s: the higher the importance, the better. The next step is to build the
  summary. Let us rank all the sentences according to their values of importance. Suppose
  we look for a summary that consists of five sentences. Then we take the
  five sentences with the highest values of importance and call them the top-5
  sentences according to importance. Generally, the summary of the text is
  the top-N sentences according to importance, where N is set manually.
      The best results for this statement of the problem are achieved by Mihalcea
  and Tarau [23], where importance(s) is introduced as a PageRank-type function
  [24] without any additional grammatical, syntactic or semantic information.
  The main idea of the suggested TextRank algorithm is to represent a text as a
  directed graph, where nodes stand for sentences and edges connect sequential
  sentences. The edges are weighted with sentence similarity. When PageRank is
  applied to this graph, every node receives a rank that is interpreted as the
  importance of the sentence, so that importance(s) = PageRank(s_node), where
  s_node is the node corresponding to sentence s.
      To measure the similarity of sentences, the authors of the TextRank algorithm
  suggest using the basic VSM (Vector Space Model) scheme. First, every sentence
  is represented as a vector in the space of words or stems. Next, the cosine similarity
  between those vectors is computed. We can use the AST scoring as well for
  scoring the similarity between two sentences. To do this we have to introduce
  the common tree technique.


6.1    Constructing a common subtree of two ASTs

  To estimate the similarity between two sentences we find the common subtree
  of the corresponding ASTs. We do a depth-first search for the common chains
  of nodes that start from the root of both ASTs. After the common subtree
  is constructed we need to annotate and score it. We annotate every node of
  the common subtree with the average of the frequencies of the corresponding nodes in
  the initial ASTs. Consider, for example, the two ASTs for the strings “mining” and
  “dinner” (see Fig. 1 and Fig. 2, respectively). There are two common chains: “IN” and
  “N”; the first consists of two nodes, the second of a single node.
  Both chains form the common subtree. Let us annotate it. The frequency
  of the node “I” is equal to 2 in the first AST and to 1 in the second. Hence, the
  frequency of this node in the common subtree equals (2+1)/2 = 1.5. In the same
  way we annotate the node “N” that follows the node “I” with (2+1)/2 = 1.5
  and the node “N” on the first level with (2+2)/2 = 2. The root is annotated with
  the sum of the frequencies of the first-level nodes, that is, 1.5 + 2 = 3.5.
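
  A sketch of this construction in the same setting as Section 2 (Node is the class
  from the construction sketch; common_subtree and common_ast are our names):

```python
def common_subtree(a, b):
    """Depth-first intersection of two ASTs: keep the chains present in both,
    annotating each node with the average of its two source frequencies."""
    merged = Node(a.char)
    merged.freq = (a.freq + b.freq) / 2
    for ch in a.children.keys() & b.children.keys():
        merged.children[ch] = common_subtree(a.children[ch], b.children[ch])
    return merged


def common_ast(root1, root2):
    """Intersect two ASTs and re-annotate the root with the sum of the
    frequencies of its first-level children, as in the example above."""
    merged = common_subtree(root1, root2)
    merged.freq = sum(child.freq for child in merged.children.values())
    return merged
```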


6.2    Scoring the common subtree

  The score of the subtree is the sum of the scores of its chains of nodes. The score
  of a chain is the average of the conditional probabilities of its nodes,
  where the conditional probability of a node is the frequency of the node divided
  by the frequency of its parent. For example, the conditional probability of the
  node “G:1” on the third level of the AST in Fig. 1 is 1/2. Let us continue with
  the example of “mining” and “dinner”. There are two chains in their common
  subtree: “IN” and “N”. The score of the “IN” chain is (1.5/1.5 + 1.5/3.5)/2 ≈
  0.71, since there are 2 nodes in the chain. The score of the one-node chain “N”,
  annotated above with frequency 2, is 2/3.5 ≈ 0.57. The score of the whole subtree
  is the sum of the two, approximately 1.29.
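
  And a sketch of the chain scoring just described, again with our naming:

```python
def score_common_subtree(root):
    """Sum, over all root-to-leaf chains, of the average conditional
    probability (node frequency over parent frequency) along the chain."""
    def chain_scores(node, parent_freq, probs):
        probs = probs + [node.freq / parent_freq]
        if not node.children:              # a chain ends at a leaf
            yield sum(probs) / len(probs)
        for child in node.children.values():
            yield from chain_scores(child, node.freq, probs)
    return sum(score
               for child in root.children.values()
               for score in chain_scores(child, root.freq, []))

# Usage: score_common_subtree(common_ast(build_ast(["mining"]),
#                                        build_ast(["dinner"])))  # about 1.29
```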
      The collection for our experiments was made of 400 articles from the Russian news
  portal Gazeta.ru. The articles were marked up in a special way, so that
  some of the sentences were highlighted as more important. This highlighting
  was done either by the author of the article or by the editor, on the basis
  of their own judgment. In our experiments we considered those sentences to be the
  summary of the article. We tried to reproduce these summaries using TextRank with
  the cosine similarity measure and with the AST scoring.
      Using the AST scoring allowed us to gain around 0.05 points of precision
  over the cosine baseline on our collection of Russian newspaper texts.
  This is a notable gain for a Natural Language Processing task, taking into account
  that the baseline precision of the cosine measure was very low. The fact that
  the precision is so low can be explained by a lack of consistency in the constructed
  collection: the authors of the articles use different strategies to highlight
  the important sentences. The text collection is heterogeneous: in some articles




                            Fig. 2. An AST for the string “dinner”.




            Fig. 3. Common subtree of the ASTs for strings “mining” and “dinner”.


    there are 10 or more sentences highlighted, in some only the first one. More
    details of this experiment are presented in [25].


    7   Association rule extraction

    Several research groups develop different approaches to the extraction and
    visualization of association rules from text collections [26, 27]. An association rule
    is a rule $X \Rightarrow Y$, where both X and Y are sets of concepts, possibly singletons,
    and the implication means some sort of co-occurrence relation. An association
    rule has two important characteristics, called support and confidence. When the rule is
    extracted from a text collection, the support of the set X, support(X), usually
    stands for the proportion of documents where the concepts of X occur, and the
    confidence of the association rule, confidence($X \Rightarrow Y$), stands for the conditional
    probability of Y given X. The majority of approaches to association rule extraction
    share the following idea: the concepts should be extracted
    from the text collection itself. Using the fuzzy AST scoring we can lift this
    limitation and produce rules over a set of concepts provided by a user. In [28]
    we presented a so-called “conceptual map”, which is a graph of association rules
    $X \Rightarrow Y$. To make the visualization easy we restricted ourselves to single-item
    sets, so that |X| = |Y| = 1. We analyzed a collection of Russian-language
    newspaper articles on business, with the concepts provided by a domain expert.
    We used the AST scoring to score every concept $k_i$ against every text in the
    collection. Next we formed $F(k_i)$, the set of articles to which the concept $k_i$ is
    relevant (i.e., the score is higher than a threshold, usually 0.2). Finally, a rule
    $k_i \Rightarrow k_j$ was produced if the ratio $|F(k_i) \cap F(k_j)| / |F(k_i)|$ was higher than a
    predefined confidence threshold. An example of a conceptual map (translated into
    English) can be found in Fig. 4.
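
    A minimal sketch of this rule extraction, reusing build_ast and relevance from
    Section 2 (the confidence threshold value below is an illustrative assumption, not
    a value reported in [28]):

```python
def conceptual_map(concepts, texts, relevance_threshold=0.2,
                   confidence_threshold=0.5):
    """Extract single-item association rules k_i => k_j over user-provided
    concepts: F(k) is the set of texts relevant to k, and a rule is kept when
    |F(k_i) & F(k_j)| / |F(k_i)| exceeds the confidence threshold."""
    asts = [build_ast([text]) for text in texts]
    F = {k: {i for i, ast in enumerate(asts)
             if relevance(k, ast) > relevance_threshold}
         for k in concepts}
    return [(ki, kj)
            for ki in concepts for kj in concepts
            if ki != kj and F[ki]
            and len(F[ki] & F[kj]) / len(F[ki]) > confidence_threshold]
```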




                                  Fig. 4. A conceptual map


       This conceptual map may serve as a tool for text analysis: it reveals some
    hidden relations between concepts and can be easily visualized as a graph. Of
    course, to estimate the power of conceptual maps we have to conduct a user
    study.


     8     Future work

    In the following subsections we briefly present some Natural Language Processing
    tasks where the AST scoring might be used.


     8.1    Plagiarism detection

    Ordinary suffix trees are widely used for plagiarism detection [29]. The common
    subtree technique can also be used in this case. Suppose we have two texts:
    we construct two individual ASTs and then their common AST. The size of the common
    AST shows how much the texts have in common, while scoring the common AST
    allows us to measure how significant the coinciding parts are. Moreover, the
    common AST can be used for indexing the coinciding parts of the texts. Hence, it
    inherits the advantages of ordinary suffix trees with some additional functionality.


    8.2   Compound splitting

    Splitting compounds, such as German compounds, is necessary for machine
    translation and information retrieval. The splitting is usually conducted according
    to some morphological or probabilistic model [30]. Our hypothesis is
    that scoring prefixes of compound words against an AST constructed from a
    collection of simple words will allow splitting compounds without using additional
    morphological knowledge. The main research question in this direction is the design of
    the collection of simple words.
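
    This hypothesis could be prototyped roughly as follows (a speculative sketch
    reusing relevance from Section 2; the max-min split criterion and the threshold
    are our assumptions, not an evaluated method):

```python
def split_compound(word, simple_words_ast, threshold=0.3):
    """Try every split point and keep the one whose two parts both score
    well against an AST built from a collection of simple words."""
    best_split, best_score = (word,), 0.0
    for i in range(2, len(word) - 1):       # require parts of length >= 2
        left, right = word[:i], word[i:]
        score = min(relevance(left, simple_words_ast),
                    relevance(right, simple_words_ast))
        if score > best_score:
            best_split, best_score = (left, right), score
    return best_split if best_score >= threshold else (word,)
```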


    8.3   Profanity filtering

    Russian profanity is rich and has a complex derivation, usually based on adding
    prefixes (such as “za-”, “pro-”, “vy-”, etc.). New words appear almost every month,
    so it is difficult to maintain a profanity dictionary. Profanity filtering is an
    important part of Russian text or web mining, especially since restrictions on the
    use of profanity were introduced. The task is to find the words in a text that are
    profane and, for example, to replace them with star symbols “***”. Note that
    Russian derivation also includes a variety of endings, so lemmatization or stemming
    should be used. Since the Porter stemmer [31] does not cope with prefixes, it could
    be replaced by some sort of AST scoring.


    9     Comparison to other approaches

    The cosine measure on tf-idf vectors is a traditional baseline in the majority of
    Natural Language Processing tasks and is easily outperformed by any more robust
    and fuzzy similarity or relevance measure, such as w-shingling [17], super shingles
    [32], mega shingles [33] and character n-grams [34]. The main future research
    concentrates on drawing a comparison between these fuzzy measures and the AST
    scoring.


    10     Implementation

    Mikhail Dubov’s implementation of AST construction and scoring is based on
    suffix arrays, which makes it space and time efficient. It is available at
    https://github.com/msdubov/AST-text-analysis and can be used as a console
    utility or as a Python library.

     11    Conclusion
    In this paper the notion of an annotated suffix tree was defined. Annotated suffix
    trees are used by several research groups, and several finished, ongoing or future
    projects were presented in the paper. The annotated suffix tree is a simple but
    powerful tool for scoring different types of relevance or similarity. This paper
    may sound lightweight, so to make it more theoretical we conclude by providing
    some insights on the probabilistic or morphological origins of ASTs. On the one
    hand, we have a strong feeling that it can be proved that the AST, or the common
    AST, is a string kernel, and thus it can be used to generate features for
    text classification / categorization or to measure similarity. On the other hand,
    the AST is a sort of supervised stemmer that can be used to generate
    terms more efficiently than model-based stemmers.


     12    Acknowledgments
    I am deeply grateful to my supervisor Dr. Boris Mirkin for the opportunity to
    learn from him and to work under his guidance for so long, and to my colleagues
    Mikhail Dubov, Maxim Yakovlev, Vera Provotorova and Dmitry Ilvovsky for
    our collaboration.


     References
    1. Salton, G., Buckley, C. Term-weighting approaches in automatic text retrieval.
       Information Processing and Management, Vol. 24, no. 5, pp. 513-523, 1988.
    2. Ponte, J.M., Croft, B.W. A language modeling approach to information retrieval.
       In Proceedings of the 21st annual international ACM SIGIR conference on
       Research and development in information retrieval, pp. 275-281. ACM, 1998.
     3. Gusfield D., Algorithms on Strings, Trees, and Sequences, Cambridge University
        Press, 1997.
     4. Zamir O., Etzioni, O. Web document clustering: A feasibility demonstration. Pro-
        ceedings of the 21st annual international ACM SIGIR conference on Research and
        development in information retrieval, pp. 46-54. ACM, 1998.
    5. Kennington, C.R., Kay, M., Friedrich, A. Suffix Trees as Language Models. In
       LREC, pp. 446-453, 2012.
    6. Huang, J.H., Powers, D. Suffix tree based approach for Chinese information retrieval.
       Intelligent Systems Design and Applications, 2008. ISDA’08. Eighth International
       Conference on, vol. 3, pp. 393-397. IEEE, 2008.
     7. Pampapathi R., Mirkin B., Levene M., A suffix tree approach to anti-spam email
        filtering, Machine Learning, 2006, Vol. 65, no.1, pp. 309-338.
     8. Chernyak E.L., Chugunova O.N., Mirkin B.G., Annotated suffix tree method for
        measuring degree of string to text belongingness, Business Informatics, 2012. Vol.
        21, no.3, pp. 31-41 (in Russian).
     9. Chernyak E.L., Chugunova O.N., Askarova J.A., Nascimento S., Mirkin B.G., Ab-
        stracting concepts from text documents by using an ontology, in Proceedings of the
        1st International Workshop on Concept Discovery in Unstructured Data. 2011, pp.
        21-31.

    10. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and
       reversals. Soviet Physics Doklady, Vol. 10, no. 8, pp. 707-710, 1966.
    11. Tretyakov K., Machine learning techniques in spam filtering. Data Mining Problem-
       oriented Seminar, MTAT, vol. 3, no. 177, pp. 60-79. 2004.
    12. Ceci, M., Malerba, D. Classifying web documents in a hierarchy of categories:
       a comprehensive study. Journal of Intelligent Information Systems, Vol. 28, no. 1,
       pp. 37-78, 2007.
    13. ACM Computing Classification System (ACM CCS), 1998, available at:
       http://www.acm.org/about/class/ccs98-html
    14. Santos, A.P., Rodrigues, F. Multi-label hierarchical text classification using the
       ACM taxonomy. Proceedings of the 14th Portuguese Conference on Artificial
       Intelligence, pp. 553-564, Aveiro, Portugal, 2010.
    15. Chernyak E. L. An approach to the problem of annotation of research publications,
       Proceedings of The Eighth International Conference on Web Search and Data Min-
       ing, pp. 429-434.
    16. Robertson, S., Zaragoza, H. The probabilistic relevance framework: BM25 and
       beyond. Foundations and Trends in Information Retrieval, Vol. 3, no. 4,
       pp. 333-389, 2009.
    17. Manber, Udi. Finding Similar Files in a Large File System. Usenix Winter, vol. 94,
       pp. 1-10. 1994.
    18. SNOMED CT - Systematized Nomenclature of Medicine Clinical Terms,
       www.ihtsdo.org/snomed-ct/, visited 09.25.14.
    19. Van Hage W.R., Katrenko S., Schreiber G., A Method to Combine Linguistic
       Ontology-Mapping Techniques, in Proceedings of 4th International Semantic Web
       Conference, 2005, pp. 34-39.
    20. Grau B.C., Parsia B., Sirin E. Working with Multiple Ontologies on the Semantic
       Web, in Proceedings of the 3d International Semantic Web Conference, 2004, pp.
       620-634.
    21. Chernyak, E.L., Mirkin, B.G. Refining a Taxonomy by Using Annotated Suffix
       Trees and Wikipedia Resources. Annals of Data Science, Vol. 2, no. 1, pp. 61-82,
       2015.
    22. Hahn U., Mani I. The challenges of automatic summarization, Computer, Vol.33,
       no.11, pp. 29-36, 2000
    23. Mihalcea R., Tarau P. TextRank: bringing order into text. In: Proceedings of the
       Conference on Empirical Methods in Natural Language Processing, pp. 404-411,
       2004
    24. Brin S., Page L. The anatomy of a large-scale hypertextual Web search engine.
       Proceedings of the seventh international conference on World Wide Web 7, 107-117,
       1998
    25. Chernyak, E.L., Yakovlev, M.S. Using annotated suffix tree similarity measure for
       text summarization (under revision).
    26. Wong, P.C., Whitney, P., Thomas, J. Visualizing association rules for text
       mining. Information Visualization, 1999. Proceedings. 1999 IEEE Symposium on,
       pp. 120-123. IEEE, 1999.
    27. Mahgoub, H., Rösner, D., Ismail, N., Torkey, F. A text mining technique using
       association rules extraction. International Journal of Computational Intelligence,
       Vol. 4, no. 1, pp. 21-28, 2008.
    28. Morenko, E. N., Chernyak E.L., Mirkin B.G.. Conceptual Maps: Construction Over
       a Text Collection and Analysis. In Analysis of Images, Social Networks and Texts,
       pp. 163-168. Springer International Publishing, 2014.

    29. Monostori, K., Zaslavsky, A., Schmidt, H. Document overlap detection system for
       distributed digital libraries. Proceedings of the fifth ACM conference on Digital
       libraries. ACM, 2000.
     30. Koehn P., Knight K.. Empirical methods for compound splitting. Proceedings of
        the tenth conference on European chapter of the Association for Computational
        Linguistics-Volume 1. Association for Computational Linguistics, 2003.
    31. Porter, M.F. An algorithm for suffix stripping. Program, Vol. 14, no. 3, pp. 130-137,
       1980.
    32. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.C. Collection statistics for
       fast duplicate document detection. ACM Transactions on Information Systems
       (TOIS), Vol. 20, no. 2, pp. 171-191, 2002.
    33. Conrad, J.G., Schriber, C.P. Managing déjà vu: Collection building for the
       identification of nonidentical duplicate documents. Journal of the American Society
       for Information Science and Technology, Vol. 57, no. 7, pp. 921-932, 2006.
    34. Damashek, M. Gauging similarity with n-grams: Language-independent
       categorization of text. Science, Vol. 267, no. 5199, pp. 843-848, 1995.