Peggy Cellier, Thierry Charnois, Andreas Hotho, Stan Matwin, Marie-Francine Moens, Yannick Toussaint (Eds.)

Interactions between Data Mining and Natural Language Processing
2nd International Workshop, DMNLP 2015, Porto, Portugal, September 2015
Proceedings

Volume Editors

Peggy Cellier
INSA Rennes, IRISA
Campus Beaulieu, 35042 Rennes cedex, France
E-mail: peggy.cellier@irisa.fr

Thierry Charnois
Université Paris 13, Sorbonne Paris Cité, LIPN CNRS
Av. J.B. Clément, 93430 Villetaneuse, France
E-mail: Thierry.Charnois@lipn.univ-paris13.fr

Andreas Hotho
University of Würzburg
Am Hubland, 97074 Würzburg, Germany
E-mail: hotho@informatik.uni-wuerzburg.de

Stan Matwin
Faculty of Computer Science, Dalhousie University
6050 University Ave., PO BOX 15000, Halifax, NS B3H 4R2, Canada
E-mail: stan@cs.dal.ca

Marie-Francine Moens
Department of Computer Science, KU Leuven
Celestijnenlaan 200A, B-3001 Heverlee, Belgium
E-mail: sien.moens@cs.kuleuven.be

Yannick Toussaint
INRIA Nancy Grand-Est, LORIA
615 Rue du Jardin Botanique, 54600 Villers-lès-Nancy, France
E-mail: Yannick.Toussaint@loria.fr

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

Preface

Recently, a new field has emerged that takes benefit of both domains: Data Mining (DM) and Natural Language Processing (NLP). Indeed, statistical and machine learning methods hold a predominant position in NLP research(1): advanced methods such as recurrent neural networks, Bayesian networks and kernel-based methods are extensively researched, and "may have been too successful (...) as there is no longer much room for anything else"(2). They have proved their effectiveness for some tasks, but one major drawback is that they do not provide human-readable models. By contrast, symbolic machine learning methods are known to provide more human-readable models that can be an end in themselves (e.g., for stylistics) or improve, by combination, further methods including numerical ones. Research in Data Mining has progressed significantly in the last decades, through the development of advanced algorithms and techniques to extract knowledge from data in different forms. In particular, for two decades Pattern Mining has been one of the most active fields in Knowledge Discovery.

This volume contains the papers presented at the ECML/PKDD 2015 workshop DMNLP'15, held on September 7, 2015 in Porto. DMNLP'15 (Workshop on Interactions between Data Mining and Natural Language Processing) is the second edition of a workshop dedicated to cross-fertilization between Data Mining and Natural Language Processing, i.e., a workshop where NLP brings new challenges to DM, and where DM gives future prospects to NLP. It is well known that texts provide a very challenging context to both NLP and DM, with a huge volume of low-structured, complex, domain-dependent and task-dependent data. The objective of DMNLP is thus to provide a forum to discuss how Data Mining can be interesting for NLP tasks, providing symbolic knowledge, but also how NLP can enhance data mining approaches by providing richer and/or more complex information to mine and by integrating linguistic knowledge directly in the mining process.

The high quality of the program of the workshop was ensured by the much-appreciated work of the authors and the Program Committee members. Finally, we wish to thank the local organization team of ECML/PKDD 2015 and the ECML/PKDD 2015 workshop chairs Bernhard Pfahringer and Luís Torgo.

September 2015
Peggy Cellier, Thierry Charnois, Andreas Hotho, Stan Matwin, Marie-Francine Moens, Yannick Toussaint

(1) D. Hall, D. Jurafsky, and C. D. Manning. Studying the History of Ideas Using Topic Models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 363-371, 2008.
(2) K. Church. A Pendulum Swung Too Far. Linguistic Issues in Language Technology, Vol. 6, CSLI Publications, 2011.

Organization

Program Chairs
Peggy Cellier, INSA Rennes, IRISA, France
Thierry Charnois, Université Paris 13, Sorbonne Paris Cité, LIPN, France
Andreas Hotho, University of Kassel, Germany
Stan Matwin, Dalhousie University, Canada
Marie-Francine Moens, Katholieke Universiteit Leuven, Belgium
Yannick Toussaint, INRIA Nancy Grand-Est, LORIA, France

Program Committee
Martin Atzmueller, University of Kassel, Germany
Delphine Battistelli, MoDyCo-Université Paris Ouest, France
Yves Bestgen, F.R.S-FNRS, Université catholique de Louvain, Belgium
Philipp Cimiano, University of Bielefeld, Germany
Bruno Crémilleux, Université de Caen, France
Béatrice Daille, Laboratoire d'Informatique de Nantes Atlantique, France
Luigi Di Caro, University of Torino, Italy
Pierre Geurts, University of Liège, Belgium
François Jacquenet, Laboratoire Hubert Curien, France
Jiří Kléma, Czech Technical University, Prague, Czech Republic
Yves Lepage, Waseda University, Japan
Amedeo Napoli, LORIA Nancy, France
Adeline Nazarenko, Université de Paris 13, LIPN, France
Claire Nédellec, Institut National de Recherche Agronomique, France
Maria Teresa Pazienza, University of Roma "Tor Vergata", Italy
Pascal Poncelet, LIRMM Montpellier, France
Stephen Poteet, Boeing, USA
Solen Quiniou, Laboratoire d'Informatique de Nantes Atlantique, France
Mathieu Roche, Cirad, TETIS, Montpellier, France
Arnaud Soulet, Université François Rabelais Tours, France
Koichi Takeuchi, Okayama University, Japan
Isabelle Tellier, Lattice, Paris, France
Xifeng Yan, University of California at Santa Barbara, USA
Pierre Zweigenbaum, LIMSI-CNRS, Paris, France

Table of Contents

Preface ..... III
Organization ..... IV

Papers
Annotated suffix tree similarity measure for text summarization ..... 1
  Maxim Yakovlev and Ekaterina Chernyak
Narrative Generation from Extracted Associations ..... 3
  Pierre-Luc Vaudry and Guy Lapalme
Some Thoughts on Using Annotated Suffix Trees for Natural Language Processing ..... 5
  Ekaterina Chernyak
MIL: Automatic Metaphor Identification by Statistical Learning ..... 19
  Yosef Ben Shlomo and Mark Last
A Peculiarity-based Exploration of Syntactical Patterns: a Computational Study of Stylistics ..... 31
  Mohamed-Amine Boukhaled, Francesca Frontini and Jean-Gabriel Ganascia
Mining Comparative Sentences from Social Medias ..... 41
  Fabíola S. F. Pereira
Annotated suffix tree similarity measure for text summarization
Maxim Yakovlev and Ekaterina Chernyak
National Research University – Higher School of Economics, Moscow, Russia
{myakovlev,echernyak}@hse.ru

Abstract. The paper describes an attempt to improve the TextRank algorithm. TextRank is an algorithm for unsupervised text summarisation. It has two main stages: the first stage is representing a text as a weighted directed graph, where nodes stand for single sentences, and edges connect sequential sentences and are weighted with sentence similarity. The second stage is applying the PageRank algorithm [1] as is to the graph. The nodes that get the highest ranks form the summary of the text. We focus on the first stage, especially on measuring sentence similarity. Mihalcea and Tarau [4] suggest employing the common scheme: use the Vector Space Model (VSM), so that every text is a vector in the space of words or stems, and compute the cosine similarity between these vectors. Our idea is to replace this scheme with the annotated suffix tree (AST) [5] model for sentence representation. The AST overcomes several limitations of the VSM, such as being dependent on the size of the vocabulary and the length of sentences, and demanding stemming or lemmatisation. This is achieved by taking all fuzzy matches between sentences into account and computing probabilities of matched co-occurrences. More specifically, we develop an algorithm for common subtree construction and annotation. The common subtrees are used to score the similarity between two sentences. Using this algorithm allows us to achieve slight improvements over the cosine baseline on our own collection of Russian newspaper texts. The AST measure gained around 0.05 points of precision more than the cosine measure. This is a substantial figure for a natural language processing task, taking into account how low the baseline precision of the cosine measure is. The fact that the precision is so low can be explained by some lack of consistency in the constructed collection: the authors of the articles use different strategies to highlight the important sentences. The text collection is heterogeneous: in some articles there are 10 or more sentences highlighted, in some only the first one. Unfortunately, there is no other test collection for text summarisation in Russian. For further experiments we might need to exclude some articles, so that the size of the summary would be more stable. Another issue of our test collection is the selection of sentences that form summaries. When test collections are constructed manually, summaries are chosen according to common principles. But we cannot be sure that the sentences were not highlighted randomly. Although the AST technique is rather slow, this is not a big issue for the text summarisation problem: summarisation is not the kind of problem where on-line algorithms are required, so precision plays a more significant part than time characteristics. There are several directions of future work. First of all, we have to conduct experiments on the standard DUC (Document Understanding Conference [2]) collections in English. Second, we are going to develop different methods for the construction and scoring of common subtrees and compare them to each other. Finally, we may use some external and more efficient implementation of the AST method, such as the EAST Python library by Mikhail Dubov [3], which uses annotated suffix arrays. More details on this work can be found in [6].

Keywords: TextRank, annotated suffix tree

References
1. Brin S., Page L.: The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the Seventh International Conference on World Wide Web, 107-117 (1998)
2. Document Understanding Conference, http://www-nlpir.nist.gov/
3. Enhanced Annotated Suffix Tree, https://pypi.python.org/pypi/EAST/0.2.2/
4. Mihalcea R., Tarau P.: TextRank: bringing order into text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 404-411 (2004)
5. Pampapathi R., Mirkin B., Levene M.: A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1), 309-338 (2006)
6. Yakovlev M., Chernyak E.: Using annotated suffix tree similarity measure for text summarization. In: Proceedings of the European Conference on Data Analysis (2015)

Narrative Generation from Extracted Associations
Pierre-Luc Vaudry and Guy Lapalme
Université de Montréal, Montréal, Canada
{vaudrypl,lapalme}@iro.umontreal.ca

Keywords. Narrative. Natural Language Generation. Association rule discovery. Activity of Daily Living. Data-to-text. Rhetorical relations. Coherence.

In [1], we study how causal relations may be used to improve narrative generation from real-life temporal data. We describe a method for extracting potential causal relations from temporal data and for structuring a generated report. The method is applied to the generation of reports highlighting unusual combinations of events in the Activity of Daily Living (ADL) domain.

Our experiment applies the association rule discovery techniques of [2] for selecting candidate associations based on three properties: frequency, confidence and significance. We assume that temporal proximity and temporal precedence are indicators of potential causality.

The generation of a report from the ADL data for a given period follows a pipeline architecture. The first stage is data interpretation, which consists of finding instances of the previously selected association rules in the input. For each of those, one or more semantic relations are introduced as part of a hypothetical interpretation of the input data. Next, those relations are used to plan the document as a whole in the document planning stage. The output is a rhetorical structure, which is then pruned to keep only the most important events and relations. There follows a microplanning stage that plans the phrases and lexical units expressing the events and rhetorical relations. This produces a lexico-syntactic specification that is realised as natural language text in the last stage: surface realisation.

After analysing the results, the extracted relations seem to be useful to locally link activities with explicit rhetorical relations. However, further work is needed to better exploit them for improving coherence at the global level.

References
1. Vaudry, P.-L., Lapalme, G.: Narrative Generation from Extracted Associations. In: Proceedings of the 15th European Workshop on Natural Language Generation, Brighton, United Kingdom (Sept 2015)
2. Hämäläinen, W., Nykänen, M.: Efficient Discovery of Statistically Significant Association Rules. In: ICDM '08: Proceedings of the Eighth IEEE International Conference on Data Mining, pp. 203-212 (2008)

Some Thoughts on Using Annotated Suffix Trees for Natural Language Processing
Ekaterina Chernyak
National Research University – Higher School of Economics, Moscow, Russia
echernyak@hse.ru

Abstract. The paper defines an annotated suffix tree (AST), a data structure used to calculate and store the frequencies of all the fragments of a given string or a collection of strings. The AST is associated with a string-to-text scoring, which takes all fuzzy matches into account. We show how the AST and the AST scoring can be used for Natural Language Processing tasks.

Keywords: text representation, annotated suffix tree, text summarization, text categorization

1 Introduction

Natural Language Processing tasks require a text to be represented by some sort of formal structure so that it can be processed by a computer. The most popular text representation is the Vector Space Model (VSM), designed by Salton [1]. The idea of the VSM is simple: given a collection of texts, represent every text as a vector in a space of terms. A term is a word itself, a lemmatized word, the stem of a word, or any other meaningful part of the word. The VSM is widely used in all kinds of Natural Language Processing tasks. The few exceptions are machine translation and text generation, where word order is important, while the VSM completely loses it. For these purposes Ponte and Croft introduced the language model [2], which is based on calculating the probability of a sequence of n words or characters, so-called n-grams. There is one more approach to text representation, which is based on suffix trees and suffix arrays. Originally the suffix tree was developed for fuzzy string matching and indexing [3]. However, there appear to be several applications of suffix trees to Natural Language Processing. One of them is document clustering, presented in [4]. When some sort of probability estimators of the paths in the suffix tree are introduced, it can be used as a language model for machine translation [5] and information retrieval [6]. In this paper we are going to concentrate on the so-called annotated suffix tree (AST), introduced in [8]. We will present the data structure itself and several Natural Language Processing tasks where the AST representation is successfully used. We are not going to make any comparisons to other text representation models, but will show that using the AST approach helps to overcome some existing problems. The paper is organized as follows: Section 2 presents the definition of the AST and the algorithm for AST construction, Sections 3 to 7 present existing applications of the AST (almost all developed with the author's contribution), Section 8 lists some future applications, Section 9 suggests how to compare the AST scoring to other approaches, and Section 10 is devoted to the implementation of the AST scoring. Section 11 concludes.

The project is being developed by the "Methods of web corpus collection, analysis and visualisation" research and study group under the guidance of Prof. B. Mirkin (grant 15-05-0041 of the Academic Fund Program).

2 Annotated suffix tree

2.1 Definition

The suffix tree is a data structure used for storing and searching strings of characters and their fragments [3]. When the suffix tree representation is used, the text is considered as a set of strings, where a string may be any significant part of the text, such as a word, a word or character n-gram, or even a whole sentence. An annotated suffix tree (AST) is a suffix tree whose nodes (not edges!) are annotated by the frequencies of the string fragments. An annotated suffix tree (see Fig. 1) [7] is a data structure used for computing and storing all fragments of the text and their frequencies. It is a rooted tree in which:
– every node corresponds to one character;
– every node is labeled by the frequency of the text fragment encoded by the path from the root to the node.

2.2 AST construction

Our algorithm for constructing an AST is a modification of the well-known algorithm for constructing suffix trees [3]. The algorithm is based on finding suffixes and prefixes of a string. Formally, the i-th suffix of the string is the substring that starts at the i-th character of the string; the i-th prefix of the string is the substring that ends at the i-th character of the string. The AST is built in an iterative way. For each string, its suffixes are added to the AST one by one, starting from an empty root node. To add a suffix to the AST, first check whether there is already a match, that is, a path in the AST that encodes / reads the whole suffix or its prefix. If such a match exists, we add 1 to all the frequencies in the match and append new nodes with frequency 1 to the last node in the match if it does not cover the whole suffix. If there is no match, we create a new chain of nodes with frequency 1 in the AST, starting from the root.

Fig. 1. An AST for the string "mining".

2.3 AST relevance measure

To use an AST to score string-to-text relevance, we first build an AST for the text. Next we match the string to the AST to estimate the relevance. The procedure for computing the string-to-text relevance score takes as input a string and the AST for a given text, and outputs the AST score:

1. The string is represented by the set of its suffixes;
2. Every suffix is matched to the AST starting from the root. To estimate the match we use the average conditional probability of the next symbol:

$$\mathrm{score}(\mathrm{match}(\mathit{suffix}, ast)) = \frac{\sum_{node \in match} \frac{f(node)}{f(parent(node))}}{|\mathit{suffix}|},$$

where $f(node)$ is the frequency of the matching node, $f(parent(node))$ is its parent's frequency, and $|\mathit{suffix}|$ is the length of the suffix;
3. The relevance of the string is evaluated by averaging the scores of all suffixes:

$$\mathrm{relevance}(string, text) = \mathrm{SCORE}(string, ast) = \frac{\sum_{suffix} \mathrm{score}(\mathrm{match}(\mathit{suffix}, ast))}{|string|},$$

where $|string|$ is the length of the string.

Note that the score is found by applying a scaling function $\varphi$ to convert a match score into the relevance evaluation. There are three useful scaling functions, according to the experiments in [8] for spam classification:
– Identity function: $\varphi(x) = x$
– Logit function: $\varphi(x) = \log \frac{x}{1-x} = \log x - \log(1-x)$
– Root function: $\varphi(x) = \sqrt{x}$

The identity scaling stands for the conditional probability of characters averaged over matching fragments (CPAMF).
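To make the construction and scoring procedures concrete, here is a minimal Python sketch of both, written for this volume rather than taken from the EAST implementation mentioned in Section 10; the names build_ast, suffix_score and relevance are ours, and the sketch assumes character-level strings and the identity scaling function.

```python
class Node:
    """One character of the AST, annotated with the frequency of its fragment."""
    def __init__(self):
        self.freq = 0
        self.children = {}  # character -> Node

def build_ast(strings):
    """Add every suffix of every string, incrementing frequencies along matches."""
    root = Node()
    for s in strings:
        for i in range(len(s)):              # the i-th suffix starts at character i
            node = root
            for ch in s[i:]:
                child = node.children.setdefault(ch, Node())
                child.freq += 1              # extend the match or create new nodes
                node = child
    return root

def suffix_score(suffix, root, phi=lambda x: x):
    """Average scaled conditional probability f(node)/f(parent) along the match."""
    parent_freq = sum(c.freq for c in root.children.values())  # root frequency
    node, total = root, 0.0
    for ch in suffix:
        if ch not in node.children:
            break
        node = node.children[ch]
        total += phi(node.freq / parent_freq)
        parent_freq = node.freq
    return total / len(suffix)

def relevance(string, root):
    """CPAMF relevance: average the suffix scores over all suffixes of the string."""
    return sum(suffix_score(string[i:], root) for i in range(len(string))) / len(string)

ast = build_ast(["mining"])
print(round(relevance("dining", ast), 2))  # 0.44, as in the worked example below
```

Run on the "mining" / "dining" pair, the sketch reproduces the worked example that follows.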
Consider an example to illustrate the described method. Let us construct an AST for the string "mining". This string has six suffixes: "mining", "ining", "ning", "ing", "ng", and "g". We start with the first suffix and add it to the empty AST as a chain of nodes with frequencies equal to unity. To add the next suffix, we need to check whether there is any match, i.e., whether there is a path in the AST starting at its root that encodes / reads a prefix of "ining". Since there is no match between existing nodes and the second suffix, we add it to the root as a chain of nodes with frequencies equal to unity. We repeat this step until a match is found: a prefix of the fourth suffix "ing" matches the second suffix "ining": the first two letters, "in", coincide. Hence we add 1 to the frequency of each of these nodes and add a new child node "g" to the node "n" (see Fig. 1). The next suffix "ng" matches the third suffix and we repeat the same actions: increase the frequency of the matched nodes and add a new child node that does not match. The last suffix does not match any path in the AST, so again we add it to the AST's root as a single node with its frequency equal to unity.

Now let us calculate the relevance score for the string "dining" using the AST in Fig. 1. There are six suffixes of the string "dining": "dining", "ining", "ning", "ing", "ng", and "g". Each of them is aligned with an AST path starting from the root. The scores of the suffixes are presented in Table 1.

Table 1. Computing the string "dining" score

  Suffix      Match      Score
  "dining"    none       0
  "ining"     "ining"    (2/6 + 2/2 + 1/2 + 1/1 + 1/1)/5 = 0.76
  "ning"      "ning"     (2/6 + 1/2 + 1/1 + 1/1)/4 = 0.71
  "ing"       "ing"      (2/6 + 2/2 + 1/2)/3 = 0.61
  "ng"        "ng"       (2/6 + 1/2)/2 = 0.41
  "g"         "g"        (1/6)/1 = 0.16

We have used the identity scaling function to score all six suffixes of the string "dining". Now, to get the final CPAMF relevance value, we sum and average them:

$$\mathrm{relevance}(dining, mining) = \frac{0 + 0.76 + 0.71 + 0.61 + 0.41 + 0.16}{6} = \frac{2.65}{6} = 0.44$$

In spite of the fact that "dining" differs from "mining" by just one character, the total score, 0.44, is less than unity. This is not only because the trivial suffix "dining" contributes 0 to the sum, but also because the conditional probabilities get smaller for the shorter suffixes.

3 Spam filtering

The definition of the AST presented above was first introduced by Pampapathi, Mirkin and Levene in [7] for spam filtering. The AST was used as a representation tool for every class (spam and ham). By introducing a procedure for scoring the class AST, they developed a classifier that beats the Naive Bayes classifier in a series of experiments on standard datasets. The success of ASTs in the domain of email filtering was due to the notion of match permutation normalization, which allowed them to take into account intentional typos introduced by spammers to get past spam filters. Match permutation normalization is in a sense analogous to the edit distance [10], which is frequently implemented in spam filters [11].

4 Research paper categorization

The problem of text categorization is formulated as follows: given a collection of documents and a domain taxonomy, annotate each document with relevant taxonomy topics. A taxonomy is a rooted tree such that every node corresponds to a (taxonomy) topic of the domain. The taxonomy generalizes the relation "is-a" or "is a part of". There are two basic approaches to the problem of text categorization: supervised and unsupervised. Supervised approaches give high precision values when applied to web document categorization [12], but may fail when applied to research paper categorization, since research taxonomies, such as the ACM Computing Classification System [13], are seldom revised and the supervised techniques may overfit [14]. The unsupervised approaches to text categorization are based on an information-retrieval-like idea: given the set of taxonomy topics, find those research papers that are relevant to every topic. The question for the researcher is then what kind of relevance model and measure to choose. In [15] we experimentally compared the cosine relevance function, which measures the cosine between tf-idf vectors in the Vector Space Model [1], BM25 [16], based on the probabilistic relevance framework, and the AST scoring introduced above. These three relevance measures were applied to a relatively small dataset of 244 articles published in ACM journals and the current version of the ACM Computing Classification System. The AST scoring outperforms the cosine and BM25 measures by being more robust and taking fuzzy rather than crisp matches into account. The next step in this research direction would be testing the AST scoring versus the w-shingling procedure [17], which is also a fuzzy matching technique, but one that requires text preprocessing such as stemming or lemmatization; no stemming or lemmatization is needed to apply the AST scoring.

5 Taxonomy refinement

Taxonomies are widely used to represent, maintain and store domain knowledge; see, for example, SNOMED [18] or ACM CCS [13]. Domain taxonomy construction is a difficult task, and a number of researchers have come up with the idea of taxonomy refinement. The idea is the following: having one taxonomy, or the upper levels of a taxonomy, refine it with topics extracted from additional sources such as other taxonomies, web search or Wikipedia. We followed this strategy and developed a two-step approach to taxonomy refinement, presented in more detail in [21]. We concentrated on taxonomies of probability theory and mathematical statistics (PTMS) and numerical mathematics (NM), both in Russian. In the first step an expert manually sets the upper layers of the taxonomy. In the second step these upper layers are refined with the Wikipedia category tree and the articles, belonging to this tree, from the same domain. In this study the AST scoring is used several times:
– to clear the Wikipedia data from noise;
– to assign the remaining Wikipedia categories to the taxonomy topics;
– to form the intermediate layers of the taxonomy by using Wikipedia subcategories;
– to use the Wikipedia articles in each of the added category nodes as its leaves.

The Wikipedia data is rather noisy: there are articles that are stubs or are irrelevant to their parental categories (the categories they belong to), and, moreover, there are subcategories (of a category) that are irrelevant to their parental categories. For example, we found the article "ROC curve" to be irrelevant to the category "Regression analysis" and the category "Accidentally killed" to be irrelevant to the category "Randomness". To define which articles are irrelevant, we exploit the AST scoring twice:
– we score the title of the article against the text of the article to detect stubs;
– we score the title of the parental category against the text of the article to detect irrelevant categories.

If the value of the scoring function is less than a threshold, we decide that the article is irrelevant. Usually we set the threshold at 0.2. To assign the remaining Wikipedia categories to the taxonomy topics, we score the taxonomy topics against all the articles in the category merged into one text. Next we find the maximum value of the scoring function and assign the category to the corresponding taxonomy topic. Finally, we score the title of parental categories against the articles of the subcategories, merged into one text. If the subcategory-to-category scoring is higher than the subcategory-to-taxonomy-topic scoring, the subcategory remains on the intermediate layer of the refined taxonomy tree under its parental category. Finally, the articles left after clearing from noise become leaves in the refined taxonomy tree. The quality of the achieved PTMS and NM taxonomies is difficult to evaluate computationally, so the design of a user study is an open question.

6 Text summarization

Automatic text summarisation is one of the key tasks in natural language processing. There are two main approaches to text summarisation, called the abstractive and extractive approaches [22]. According to the abstractive approach, the summary of a text is another, much shorter text, generated automatically to capture the semantic content of the original. According to the extractive approach, the summary of a text is nothing but some important parts of the given text, such as a set of important sentences. The extractive summarisation problem can be formulated in the following way. Given a text T that is a sequence of sentences S consisting of words V, select a subset of the sentences S* that are important in T. Therefore we need to define:
– what the importance of a sentence is;
– how to measure the importance of a sentence.

Hence we need to introduce a function, importance(s), which measures the importance of a sentence: the higher the importance, the better. The next step is to build the summary. Let us rank all the sentences according to the values of importance. Suppose we look for a summary that consists of five sentences. Then we take the five sentences with the highest values of importance and call them the top-5 sentences according to importance. Generally, the summary of the text is the top-N sentences according to importance, where N is set manually. The best results for this statement of the problem are achieved by Mihalcea and Tarau [23], where importance(s) is introduced as a PageRank-type function [24] without any kind of additional grammar, syntax or semantic information. The main idea of the suggested TextRank algorithm is to represent a text as a directed graph, where nodes stand for sentences and edges connect sequential sentences. The edges are weighted with sentence similarity. When PageRank is applied to this graph, every node receives its rank, which is to be interpreted as the importance of the sentence, so that $importance(s) = PageRank(s_{node})$, where $s_{node}$ is the node corresponding to sentence s. To measure the similarity of sentences, the authors of the TextRank algorithm suggest using the basic VSM (Vector Space Model) scheme: first every sentence is represented as a vector in the space of words or stems, and then the cosine similarity between those vectors is computed.
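The following sketch, assuming the networkx library, shows how the TextRank stage can be run with a pluggable sentence-similarity function; the cosine baseline is included, and the AST measure developed below can be passed in its place. Note that the sketch links every sentence pair, a common TextRank variant; restricting edges to sequential sentences, as described above, amounts to limiting the inner loop to j = i + 1.

```python
import math
from collections import Counter

import networkx as nx  # assumed dependency for the PageRank stage [24]

def cosine_similarity(s1, s2):
    """Baseline VSM similarity: cosine between bag-of-words vectors."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = math.sqrt(sum(c * c for c in v1.values()))
    norm *= math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def textrank_summary(sentences, top_n=5, similarity=cosine_similarity):
    """Rank sentences by PageRank over a similarity-weighted sentence graph."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = similarity(sentences[i], sentences[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    importance = nx.pagerank(g, weight="weight")
    top = sorted(importance, key=importance.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # summary in original sentence order
```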
We can use the AST scoring as well for scoring the similarity between two sentences. To do this we have to introduce the common tree technique.

6.1 Constructing a common subtree for two ASTs

To estimate the similarity between two sentences, we find the common subtree of the corresponding ASTs. We do a depth-first search for the common chains of nodes that start from the roots of both ASTs. After the common subtree is constructed, we need to annotate and score it. We annotate every node of the common subtree with the averaged frequency of the corresponding nodes in the initial ASTs. Consider for example two ASTs for the strings "mining" and "dinner" (see Fig. 1 and Fig. 2, respectively). There are two common chains: "I N" and "N"; the first one consists of two nodes, the second one of a single node. Both these chains form the common subtree. Let us annotate it. The frequency of the node "I" is equal to 2 in the first AST and to 1 in the second. Hence, the frequency of this node in the common subtree equals (2+1)/2 = 1.5. In the same way we annotate the node "N" that follows the node "I" with (2+1)/2 = 1.5 and the node "N" on the first level with (2+2)/2 = 2. The root is annotated with the sum of the frequencies of the first-level nodes, that is, 1.5 + 2 = 3.5.

Fig. 2. An AST for the string "dinner".
Fig. 3. Common subtree of the ASTs for the strings "mining" and "dinner".

6.2 Scoring the common subtree

The score of the subtree is the sum of the scores of every chain of nodes. The score of a chain is the averaged sum of the conditional probabilities of its nodes, where the conditional probability of a node is the frequency of the node divided by the frequency of its parent. For example, the conditional probability of the node "G:1" on the third level of the AST in Fig. 1 is 1/2. Let us continue with the example of "mining" and "dinner". There are two chains in their common subtree: "I N" and "N". The score of the "I N" chain is (1.5/1.5 + 1.5/3.5)/2 ≈ 0.71, since there are 2 nodes in the chain. The score of the one-node chain "N" is 2/3.5 ≈ 0.57, since that node is annotated with 2 and the root with 3.5. The score of the whole subtree is thus 0.71 + 0.57 ≈ 1.29.

The collection for the experiments was made of 400 articles from the Russian news portal Gazeta.ru. The articles were marked up in a special way, so that some sentences were highlighted as being more important. This highlighting was done either by the author of the article or by the editor, on the basis of their own ideas. In our experiments we considered those sentences to be the summary of the article. We tried to reproduce these summaries using TextRank with the cosine similarity measure and with the AST scoring. Using this algorithm allowed us to gain around 0.05 points of precision over the cosine baseline on our own collection of Russian newspaper texts. This is a substantial figure for a Natural Language Processing task, taking into account that the baseline precision of the cosine measure was very low. The fact that the precision is so low can be explained by some lack of consistency in the constructed collection: the authors of the articles use different strategies to highlight the important sentences. The text collection is heterogeneous: in some articles there are 10 or more sentences highlighted, in some only the first one. More details of this experiment are presented in [25].
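A sketch of the common-subtree construction and scoring just described, reusing the Node class and build_ast from the sketch in Section 2; enumerating chains as root-to-leaf paths is our reading of "every chain of nodes", consistent with the "mining" / "dinner" example.

```python
def common_subtree(a, b):
    """Depth-first intersection of two ASTs; shared nodes get averaged frequencies."""
    merged = Node()
    for ch in a.children.keys() & b.children.keys():
        merged.children[ch] = common_subtree(a.children[ch], b.children[ch])
        merged.children[ch].freq = (a.children[ch].freq + b.children[ch].freq) / 2
    return merged

def subtree_score(root):
    """Sum, over root-to-leaf chains, of the averaged conditional probabilities."""
    root_freq = sum(c.freq for c in root.children.values())
    scores = []

    def walk(node, parent_freq, probs):
        probs = probs + [node.freq / parent_freq]
        if node.children:
            for child in node.children.values():
                walk(child, node.freq, probs)
        else:
            scores.append(sum(probs) / len(probs))  # one chain ends here

    for child in root.children.values():
        walk(child, root_freq, [])
    return sum(scores)

shared = common_subtree(build_ast(["mining"]), build_ast(["dinner"]))
print(round(subtree_score(shared), 2))  # 1.29 for the "mining" / "dinner" example
```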
7 Association rule extraction

Several research groups have developed different approaches to the extraction and visualization of association rules from text collections [26, 27]. An association rule is a rule $X \Rightarrow Y$, where both X and Y are sets of concepts, possibly singletons, and the implication means some sort of co-occurrence relation. An association rule has two important features, called support and confidence. When the rule is extracted from a text collection, the support of the set X, $support(X)$, usually stands for the proportion of the documents where the concepts X occur, and the confidence of the association rule, $confidence(X \Rightarrow Y)$, stands for the conditional probability of Y given X. The majority of approaches to association rule extraction share a common idea: the concepts should be extracted from the text collection. Using the fuzzy AST scoring we can remove this limitation and produce rules over a set of concepts provided by a user. In [28] we presented a so-called "conceptual map", which is a graph of association rules $X \Rightarrow Y$. To keep the visualization easy we restricted ourselves to single-item sets, so that |X| = |Y| = 1. We analyzed a collection of Russian-language newspaper articles on business, and the concepts were provided by a domain expert. We used the AST scoring to score every concept $k_i$ against every text from the collection. Next we formed $F(k_i)$, the set of articles to which the concept $k_i$ is relevant (i.e., the scoring is higher than a threshold, usually 0.2). Finally, there was a rule $k_i \Rightarrow k_j$ if the ratio

$$\frac{|F(k_i) \cap F(k_j)|}{|F(k_i)|}$$

was higher than a predefined confidence threshold. An example of a conceptual map (translated into English) can be found in Fig. 4.

Fig. 4. A conceptual map.

This conceptual map may serve as a tool for text analysis: it reveals some hidden relations between concepts and it can be easily visualized as a graph. Of course, to estimate the power of conceptual maps we have to conduct a user study.
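A sketch of the conceptual-map rule extraction under the thresholds just described; relevance stands for any string-to-text scorer, such as the AST relevance from Section 2 applied to an AST built from the text, and the confidence threshold default of 0.5 is only an illustrative assumption (the text says the threshold is predefined but gives no value).

```python
from itertools import permutations

def conceptual_map(concepts, texts, relevance, rel_threshold=0.2, conf_threshold=0.5):
    """Single-item association rules k_i => k_j over user-provided concepts.

    F(k) is the set of texts to which concept k is relevant; a rule is kept when
    |F(k_i) & F(k_j)| / |F(k_i)| exceeds the confidence threshold."""
    F = {k: {i for i, t in enumerate(texts) if relevance(k, t) > rel_threshold}
         for k in concepts}
    return [(ki, kj) for ki, kj in permutations(concepts, 2)
            if F[ki] and len(F[ki] & F[kj]) / len(F[ki]) > conf_threshold]
```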
8 Future work

In the following sections we briefly present some Natural Language Processing tasks where the AST scoring might be used.

8.1 Plagiarism detection

Ordinary suffix trees are widely used for plagiarism detection [29]. The common subtree technique can also be used in this case. Suppose we have two texts: we construct two individual ASTs and their common AST. The size of the common AST will show how much these texts share in common. Scoring the common AST allows us to measure how significant the coinciding parts are. Without doubt, the common AST can also be used for indexing the coinciding parts of the texts. Hence, it inherits the advantages of ordinary suffix trees with some additional functionality.

8.2 Compound splitting

Splitting compounds, such as German compounds, is necessary for machine translation and information retrieval. The splitting is usually conducted according to some morphological or probabilistic models [30]. We have a hypothesis that scoring prefixes of compound words against an AST constructed from a collection of simple words will allow splitting compounds without using additional morphological knowledge. The main research question in this direction is the design of the collection of simple words.

8.3 Profanity filtering

Russian profanity is rich and has a complex derivation, usually based on adding prefixes (such as "za", "pro", "vy", etc.). New words appear almost every month, so it is difficult to maintain a profanity dictionary. Profanity filtering is an important part of Russian text or web mining, especially since certain legal limitations on using profanity were introduced. The task is to find words in a text that are profane and, for example, to replace them with star symbols "***". Note that Russian derivation also includes a variety of endings, so lemmatization or stemming should be used. Since the Porter stemmer [31] does not cope with prefixes, it could be replaced by some sort of AST scoring.

9 Comparison to other approaches

The cosine measure on tf-idf vectors is a traditional baseline in the majority of Natural Language Processing tasks and is easily outperformed by any sort of more robust and fuzzy similarity or relevance measure, such as w-shingling [17], super shingles [32], mega shingles [33] and character n-grams [34]. The main future research concentrates on drawing a comparison between these fuzzy measures and the AST scoring.

10 Implementation

Mikhail Dubov's implementation of AST construction and scoring is based on suffix arrays, which makes it space and time efficient. It is available at https://github.com/msdubov/AST-text-analysis and can be used as a console utility or as a Python library.

11 Conclusion

In this paper the notion of the annotated suffix tree is defined. Annotated suffix trees are used by several research groups, and several finished, running or future projects are presented in the paper. The annotated suffix tree is a simple but powerful tool for scoring different types of relevance or similarity. This paper may sound lightweight; to make it more theoretical, we conclude by providing some insights on the probabilistic and morphological origins of ASTs. On the one hand, we have a strong feeling that it can be proved that the AST or the common AST is a string kernel, and thus it can be used to generate features for text classification / categorization or to measure similarity. On the other hand, the AST is a sort of supervised stemmer that can be used to generate terms more efficiently than model-based stemmers.

12 Acknowledgments

I am deeply grateful to my supervisor Dr. Boris Mirkin for the opportunity to learn from him and to work under his guidance for so long, and to my colleagues Mikhail Dubov, Maxim Yakovlev, Vera Provotorova and Dmitry Ilvovsky for collaboration.

References
1. Salton G., Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing and Management, Vol. 24, no. 5, pp. 513-523, 1988.
2. Ponte J. M., Croft B. W. A language modeling approach to information retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275-281. ACM, 1998.
3. Gusfield D. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
4. Zamir O., Etzioni O. Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 46-54. ACM, 1998.
5. Kennington C. R., Kay M., Friedrich A. Suffix trees as language models. In: LREC, pp. 446-453, 2012.
6. Huang J. H., Powers D. Suffix tree based approach for Chinese information retrieval. In: Intelligent Systems Design and Applications (ISDA '08), Eighth International Conference on, Vol. 3, pp. 393-397. IEEE, 2008.
7. Pampapathi R., Mirkin B., Levene M. A suffix tree approach to anti-spam email filtering. Machine Learning, Vol. 65, no. 1, pp. 309-338, 2006.
8. Chernyak E. L., Chugunova O. N., Mirkin B. G. Annotated suffix tree method for measuring degree of string to text belongingness. Business Informatics, Vol. 21, no. 3, pp. 31-41, 2012 (in Russian).
9. Chernyak E. L., Chugunova O. N., Askarova J. A., Nascimento S., Mirkin B. G. Abstracting concepts from text documents by using an ontology. In: Proceedings of the 1st International Workshop on Concept Discovery in Unstructured Data, pp. 21-31, 2011.
10. Levenshtein V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, Vol. 10, no. 8, pp. 707-710, 1966.
11. Tretyakov K. Machine learning techniques in spam filtering. In: Data Mining Problem-oriented Seminar, MTAT, Vol. 3, no. 177, pp. 60-79, 2004.
12. Ceci M., Malerba D. Classifying web documents in a hierarchy of categories: a comprehensive study. Journal of Intelligent Information Systems, Vol. 28, no. 1, pp. 37-78, 2007.
13. ACM Computing Classification System (ACM CCS), 1998, available at: http://www.acm.org/about/class/ccs98-html
14. Santos A. P., Rodrigues F. Multi-label hierarchical text classification using the ACM taxonomy. In: Proceedings of the 14th Portuguese Conference on Artificial Intelligence, pp. 553-564, Aveiro, Portugal, 2010.
15. Chernyak E. L. An approach to the problem of annotation of research publications. In: Proceedings of the Eighth International Conference on Web Search and Data Mining, pp. 429-434.
16. Robertson S., Zaragoza H. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, Vol. 3, no. 4, pp. 333-389, 2009.
17. Manber U. Finding similar files in a large file system. In: Usenix Winter, Vol. 94, pp. 1-10, 1994.
18. SNOMED CT – Systematized Nomenclature of Medicine Clinical Terms, www.ihtsdo.org/snomed-ct/, visited 09.25.14.
19. Van Hage W. R., Katrenko S., Schreiber G. A method to combine linguistic ontology-mapping techniques. In: Proceedings of the 4th International Semantic Web Conference, pp. 34-39, 2005.
20. Grau B. C., Parsia B., Sirin E. Working with multiple ontologies on the Semantic Web. In: Proceedings of the 3rd International Semantic Web Conference, pp. 620-634, 2004.
21. Chernyak E. L., Mirkin B. G. Refining a taxonomy by using annotated suffix trees and Wikipedia resources. Annals of Data Science, Vol. 2, no. 1, pp. 61-82, 2015.
22. Hahn U., Mani I. The challenges of automatic summarization. Computer, Vol. 33, no. 11, pp. 29-36, 2000.
23. Mihalcea R., Tarau P. TextRank: bringing order into text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 404-411, 2004.
24. Brin S., Page L. The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the Seventh International Conference on World Wide Web, pp. 107-117, 1998.
25. Chernyak E. L., Yakovlev M. S. Using annotated suffix tree similarity measure for text summarization (under revision).
26. Pak Chung W., Whitney P., Thomas J. Visualizing association rules for text mining. In: Information Visualization (Proceedings of the 1999 IEEE Symposium on), pp. 120-123. IEEE, 1999.
27. Mahgoub H., Rösner D., Ismail N., Torkey F. A text mining technique using association rules extraction. International Journal of Computational Intelligence, Vol. 4, no. 1, pp. 21-28, 2008.
28. Morenko E. N., Chernyak E. L., Mirkin B. G. Conceptual maps: construction over a text collection and analysis. In: Analysis of Images, Social Networks and Texts, pp. 163-168. Springer International Publishing, 2014.
29. Monostori K., Zaslavsky A., Schmidt H. Document overlap detection system for distributed digital libraries. In: Proceedings of the Fifth ACM Conference on Digital Libraries. ACM, 2000.
30. Koehn P., Knight K. Empirical methods for compound splitting. In: Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 2003.
31. Porter M. F. An algorithm for suffix stripping. Program, Vol. 14, no. 3, pp. 130-137, 1980.
32. Chowdhury A., Frieder O., Grossman D., McCabe M. C. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), Vol. 20, no. 2, pp. 171-191, 2002.
33. Conrad J. G., Schriber C. P. Managing déjà vu: collection building for the identification of nonidentical duplicate documents. Journal of the American Society for Information Science and Technology, Vol. 57, no. 7, pp. 921-932, 2006.
34. Damashek M. Gauging similarity with n-grams: language-independent categorization of text. Science, Vol. 267, no. 5199, pp. 843-848, 1995.

MIL: Automatic Metaphor Identification by Statistical Learning
Yosef Ben Shlomo and Mark Last
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
E-mail: bshyossi@gmail.com, mlast@bgu.ac.il

Abstract. Metaphor identification in text is an open problem in natural language processing. In this paper, we present a new supervised learning approach called MIL (Metaphor Identification by Learning) for identifying three major types of metaphoric expressions without using any knowledge resources or handcrafted rules. We derive a set of statistical features from a corpus representing a given domain (e.g., news articles published by Reuters). We also use an annotated set of sentences, which contain candidate expressions labelled as 'metaphoric' or 'literal' by native English speakers. Then we induce a metaphor identification model for each expression type by applying a classification algorithm to the set of annotated expressions. The proposed approach is evaluated on a set of annotated sentences extracted from a corpus of Reuters articles. We show a significant improvement versus a state-of-the-art learning-based algorithm and comparable results to a recently presented rule-based approach.

Keywords: Metaphor Identification, Natural Language Processing, Supervised Learning.

1 Introduction

A metaphor is defined in previous works (such as Krishnakumaran & Zhu, 2007) as the use of terms related to one concept in order to describe a term related to a different concept. For example, in the metaphoric expression "fertile imagination", the word "imagination", which is related to the concept / domain "cognition", is described by the word "fertile", which is usually related to the concept / domain "land / soil". Metaphor identification can be useful in numerous applications that require understanding of natural language, such as machine translation, information extraction, and automatic text summarization. For example, the word "жесткая" in Russian may have a metaphorical meaning of "tough" when applied to the word "политика" (policy) or a literal meaning of "hard" when applied to the word "кровать" (bed).

Metaphoric expressions have a variety of syntactic structures. In this paper, we focus on three major types of syntactic structures discussed in the work of Krishnakumaran and Zhu (2007). In a type 1 expression, a subject noun is related to an object noun by a form of the copula verb "to be" (e.g., "God is a father"). In a type 2 expression, the subject noun is associated with a metaphorically used verb and an object noun (e.g., "The actor painted his relationship with…"). A type 3 expression is an adjective-noun phrase (e.g., "sweet child"). An expression with one of the three structures above may or may not be a metaphor. Therefore, in this paper we treat the problem of metaphor identification as a binary classification problem of labelling a given expression as metaphoric or literal.
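Section 3.1 below notes that candidate expressions can be extracted "using existing natural language processing tools" without naming them. As one possible instantiation, here is a sketch using the spaCy dependency parser (our choice of tool, not the paper's) to pull out candidates of the three expression types; the expected outputs in the comments depend on the parser model.

```python
import spacy  # our choice of parser; any dependency parser would do

nlp = spacy.load("en_core_web_sm")

def candidate_expressions(sentence):
    """Extract candidate expressions of the three syntactic types:
    type 1: subject noun + copula "to be" + object noun ("God is a father"),
    type 2: subject noun + verb + object noun ("The actor painted his relationship"),
    type 3: adjective-noun phrase ("sweet child")."""
    candidates = []
    for tok in nlp(sentence):
        if tok.pos_ == "VERB":
            subj = [c for c in tok.children if c.dep_ == "nsubj"]
            obj = [c for c in tok.children if c.dep_ == "dobj"]
            if subj and obj:
                candidates.append(("type 2", subj[0].text, tok.lemma_, obj[0].text))
        elif tok.lemma_ == "be":
            subj = [c for c in tok.children if c.dep_ == "nsubj"]
            attr = [c for c in tok.children if c.dep_ == "attr"]
            if subj and attr:
                candidates.append(("type 1", subj[0].text, attr[0].text))
        elif tok.dep_ == "amod" and tok.head.pos_ == "NOUN":
            candidates.append(("type 3", tok.text, tok.head.text))
    return candidates

print(candidate_expressions("God is a father."))       # expected: [('type 1', 'God', 'father')]
print(candidate_expressions("a fertile imagination"))  # expected: [('type 3', 'fertile', 'imagination')]
```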
The first step in our method is choosing a corpus representative of a given domain and building an annotated data set of labelled expressions from that corpus. The second step is feature extraction from the domain corpus, which is the main subject of our work. Then we induce a classification model for each expression type, using a set of annotated sentences, and evaluate its accuracy on a hold-out set.

This paper is organized as follows. Section 2 covers the state-of-the-art approaches to automated metaphor detection. Section 3 presents MIL (Metaphor Identification by Learning), a novel algorithm for metaphor detection. The proposed algorithm is empirically evaluated in Section 4. In Section 5, we conclude the paper with some insights and directions for future research.

2 Related Work

The work of (Birke & Sarkar, 2006) focused on classifying the uses of verbs in a sentence as either literal or non-literal (type 2 expressions). They adopted the approach of (Karov & Edelman, 1998), who worked on word sense disambiguation of words within their contexts (i.e., sentences).

Krishnakumaran & Zhu (2007) suggested three algorithms for distinguishing between live and dead metaphors. A dead metaphor is a metaphor that is already assimilated and familiar in the spoken language (e.g., "fell in love"), and a live metaphor is a less familiar metaphor, which is not yet assimilated. The methodology of Krishnakumaran & Zhu (2007) is based on conditional probabilities combined with WordNet (Fellbaum, 1999), which represents the language ontology.

The authors of (Neuman, et al., 2013) present three rule-based algorithms for metaphor identification in three expression types (1, 2, and 3). They suggest identifying metaphors by negating literalness: they define a set of rules for detecting whether a particular expression is literal, and if it is not literal, it is assumed to be a metaphor. Following the work of Turney, et al. (2011), they define as literal an expression comprised of words (e.g., a verb and a noun) that have at least one concrete category in common. Examples of concrete categories include physical objects, like "table", or body parts, like "hair". The ruleset defined by (Neuman, et al., 2013) builds upon multiple knowledge resources (such as Wiktionary and WordNet).

The work of (Shutova, et al., 2010) focuses on semi-supervised learning for identification of type 2 metaphors. The metaphor identification process is based on the principle of clustering by association, i.e., the clustering of words by their associative language neighborhood, using the verb's subject as an anchor. A domain-independent approach to type 2 metaphor identification is presented in (Shutova, et al., 2013). The authors report a high precision of 0.79, but no information about the system's recall and F-measure is provided.

Turney, et al. (2011) provide a solution for the type 3 metaphor classification problem called the Concrete-Abstract algorithm. In contrast to previous works that treated this issue as a sub-problem of the Word Sense Disambiguation problem, Turney, et al. (2011) suggest identifying metaphors by considering the abstractness level of the words in a given expression. They found that in a wide range of type 3 metaphoric expressions, the nouns tend to be abstract while the adjectives tend to be concrete. Their methodology builds upon a unique algorithm for ranking the abstractness level of a given word by comparing it to 20 abstract words and 20 concrete words that are used as paradigms of abstractness and concreteness. The main limitation of this approach is that it uses a single feature (abstractness level) that covers a limited range of expressions. The classification model used by Turney, et al. (2011) is logistic regression.

The authors of (Hovy, et al., 2013) use SVMs with tree kernels for supervised metaphor classification based on a vector representation of the semantic aspects of each word and different tree representations of each sentence. They report a best F1 measure of 0.75 without specifying the distribution of metaphor types in their dataset. Very similar results (F-measure = 0.76 for English) are reported for type 2 and type 3 expressions by (Tsvetkov, et al., 2014), who use three categories of features: (1) abstractness and imageability, (2) word supersenses (extracted from WordNet), and (3) unsupervised vector-space word representations. Using translation dictionaries, the authors apply a trained English model to three other languages (Spanish, Russian, and Farsi). The English dataset used by (Tsvetkov, et al., 2014) included 3,737 annotated sentences from the Wall Street Journal domain.

3 MIL (Metaphor Identification by Learning)

3.1 Overview

The first step in our proposed methodology is choosing a domain corpus, which represents a particular domain (area) of text documents (e.g., Reuters news articles). The next step is the extraction of candidate expressions that satisfy one of the three syntactic structures discussed above (types 1, 2, and 3). This can be done using existing natural language processing tools. We also use the domain corpus to construct a word-context matrix to be used in the feature extraction phase. Then we manually annotate a selected subset of candidate expressions, since in a large corpus it is not feasible to annotate them all. Each selected expression is labeled as 'literal' or 'metaphoric' by native language speakers. Then we perform feature extraction, the largest and most important task in our work. The goal of feature extraction is to compute statistical features that may differentiate between metaphoric and literal expressions. After the feature extraction is complete, a feature selection process must be carried out for choosing the features that are most relevant to the classification task.

The next step is inducing a classification model by running a supervised learning algorithm (e.g., C4.5). The model is built separately for each syntactic type. The resulting classification model can be used for classification of new expressions as literal or metaphoric. The model performance can be evaluated by such measures as precision and recall using cross-validation.

3.2 Domain Corpus Pre-processing

The basic actions on a domain corpus are parsing the corpus into sentences, tokenization, stemming the tokens with the Porter Stemmer algorithm (Porter, 1980), removing stopwords, and then calculating the frequency of each token. Then we build a co-occurrence matrix for words with a frequency of 100 and higher (as in Turney et al., 2011). The co-occurrence matrix is used for calculating the abstractness level of a given word. The idea behind this matrix is that co-occurring words are more likely to share an identical concept. We denote the co-occurrence matrix by F. Then we calculate the Positive Pointwise Mutual Information (PPMI) of F. The purpose of PPMI is to prevent the bias that can be caused by highly frequent words. We calculate PPMI as described in (Turney and Pantel, 2010) and get a new matrix denoted as X.

The X matrix is smoothed with truncated Singular Value Decomposition (SVD) (Deerwester, et al., 1990), which decomposes X into a multiplication of three matrices, $X \approx U_k S_k V_k^T$. The matrix $U_k$ represents the left singular vectors, the matrix $S_k$ is a diagonal matrix of the singular values, and the matrix $V_k^T$ represents the right singular vectors. The idea behind using SVD is that a conceptual relation between two words can be indicated by a third word called a "latent factor" (e.g., the relation between soccer and basketball can be established by the word sport, even if that word does not occur in the text). The SVD calculation has two parameters. The first one is k, which represents the number of latent factors. We manually calibrated k, starting from the value of 1000 as used in the work of (Turney, et al., 2011) and reducing it by 100 at each iteration in order to reduce the running time (more latent factors means a longer running time) without decreasing the quality of the results. We stopped when there was a substantial decrease in the results' accuracy. In this manner, we set the best value of k to 300. The other parameter is p, which adjusts the weights of the latent factors as in (Caron, 2001). We adopted its value from the work of (Turney, et al., 2011) and set it to 0.5. The second matrix we use is the multiplication $U_k S_k^p$. We use this matrix for computing the semantic similarity of two words as the cosine similarity of two matrix rows (Turney and Pantel, 2010). We denote $U_k S_k V_k^T$ as matA and $U_k S_k^p$ as matB.

Finally, we represent each of the corpus documents by calculating the average vector of the frequent words the document contains. The idea behind this is that, in our view, each document represents some concept of its own and can contribute to the identification of the concept transition.
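A compact numpy/scipy sketch of the pre-processing pipeline just described (PPMI followed by truncated SVD); the function names are ours, and a dense matrix is used for clarity, whereas a real co-occurrence matrix of this size would be kept sparse.

```python
import numpy as np
from scipy.sparse.linalg import svds  # truncated SVD

def ppmi(F):
    """Positive pointwise mutual information of a word-word co-occurrence matrix."""
    total = F.sum()
    row = F.sum(axis=1, keepdims=True)
    col = F.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(F * total / (row * col))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # keep positive PMI only

def semantic_matrices(F, k=300, p=0.5):
    """Return matA = U_k S_k V_k^T and matB = U_k S_k^p with k latent factors."""
    X = ppmi(F.astype(float))
    U, S, Vt = svds(X, k=k)          # X ~ U S V^T, keeping the k largest factors
    matA = U @ np.diag(S) @ Vt
    matB = U @ np.diag(S ** p)
    return matA, matB

def cosine(u, v):
    """Cosine similarity of two word vectors, e.g. rows of matB."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```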
3.3 Feature Extraction

The proposed feature set contains four types of features: features that apply to every single word in a candidate expression (two features for type 1 and type 3 expressions, three features for type 2), features that apply to every word pair in a candidate expression (one feature for types 1 and 3, three features for type 2), features that apply to the sentence containing the candidate expression, and features that apply to the candidate expression itself.

The first set of statistical features builds upon the idea of concrete-abstract mapping (Turney et al., 2011). They present a measure of the abstractness level, which can be calculated for each word in a candidate expression. We use this measure to define the following features:

• Abstract Scale. The abstractness level of a word according to the algorithm presented in (Turney et al., 2011) is calculated as:

  \sum_{k=1}^{20} pCor(word, abstract\ paradigm_k) - \sum_{k=1}^{20} pCor(word, concrete\ paradigm_k)   (1)

where word is the semantic vector of the target word, each abstract paradigm_k is the semantic vector of a very abstract word (e.g., sense), each concrete paradigm_k is the vector of a very concrete word (e.g., donut), and pCor is the Pearson correlation. The abstract scale rank is normalized between zero and one. The semantic vectors are taken from a pre-processed matrix based on word co-occurrences. The full list of 20 abstract paradigm words and 20 concrete paradigm words is given in (Turney et al., 2011). This feature is applied to every word in a given expression.

• Abstract Scale Difference. The absolute difference between the abstractness levels of every two words in an expression.

• Abstract Scale Average / Variance. The average / variance of the abstractness levels of all frequent words in the sentence that contains the candidate expression.

We also define a set of statistical features that can indicate a conceptual mapping between different word categories. These features include word-level features, document-level features, and domain-level features. Word-level features are associated with the semantic meaning of a given word. Document-level features are associated with the meaning of the entire document that contains the expression, and domain-level features are associated with the entire domain corpus. The statistical features are defined below:

Word-level features

• Semantic Relation. The semantic relation value between every two words in a given expression. This value is taken from the entry associated with this word pair in the pre-processed matA (the large NxN matrix). Low values indicate a weak semantic relation between two words and thus imply metaphoric behavior (per the conceptual mapping definition), and vice versa.

• Semantic Relation Average / Variance. The average / variance of the semantic relation values between every two words in the sentence that contains the candidate expression (including the expression itself).

• Cosine-similarity. The cosine similarity between every two words in a given expression, computed between the two vectors associated with the given word pair, taken from the pre-processed matB (the low-dimensionality matrix). Low values indicate a weak conceptual relation between two words and thus imply metaphoric behavior.

• Cosine-similarity Average / Variance.
The average / variance of the cosine similarity between every two words in a given expression. Low values indicate a weak conceptual relation between two words and thus imply metaphoric behavior.

• First k Words. For every two words in a given expression, we first find the k most similar words to the noun (if both words are nouns, then the subject noun is selected). We then calculate the average cosine similarity of the second word to those k words and use it as a feature. The similarity between two words is computed as the cosine similarity of their associated vectors in matB. The value k = 30 was chosen based on the quality of the classification results.

Document-level features

• First k Documents. Same as First k Words, but based on document vectors rather than word vectors. Considering a document as topic-oriented, its representation by the average of its word vectors is actually the semantic representation of its topic(s). Here we also set k to 30 after manual calibration based on the quality of the results (F-measure).

• Documents' Jaccard Similarity. This is calculated between the semantically related neighborhoods of every two words in a given expression. A word's neighborhood is defined as a window of ±5 words surrounding the given word in the same sentence. In the previous features, we detect a conceptual mapping between two words by enriching one of the words with its semantically related neighborhood. In this feature, we take this idea one step further and enrich both words with their semantically related neighborhoods. The feature is the Jaccard similarity of these two neighborhoods, calculated as:

  \frac{|SDT(w_1, T) \cap SDT(w_2, T)|}{|SDT(w_1, T) \cup SDT(w_2, T)|}   (2)

where SDT(w, T) is the group of documents that are related to word w according to a predefined threshold T. In our experiments, we manually set T to 0.35 after trying different values.

Domain-level features

• Domain Corpus Frequency. The normalized frequency of a word in the domain corpus. This feature is extracted for every word in the candidate expression. A word's low frequency in a given corpus can indicate metaphoric behavior, since the word is not strongly related to the given domain and thus may be used there as a metaphor.

• Domain Corpus Frequency Difference. The absolute value of the difference between the frequencies of every two words in a given expression.

• Positive Pointwise Mutual Information. This feature (based on Turney and Pantel, 2010) is applied to every two words in a given expression.

3.4 Feature and Model Selection

We used the Wrapper method of (Kohavi & John, 1997) to select the best features for each expression type. This method takes as input a classifier (e.g., a decision tree) and a criterion to maximize (e.g., accuracy), and finds the subset of features that maximizes this criterion. We applied this method to each expression type. The criterion we used was the F-measure. We applied the following classifiers: Logistic Regression, Naïve Bayes, K-Nearest Neighbors (KNN), Voting Feature Intervals (VFI), Random Forest, Random Tree, and J48. We also combined each one of them with the AdaBoost algorithm.

All algorithms were run in Weka (Hall et al., 2009), an open-source implementation of machine learning algorithms. For each domain and expression type, we used the Wrapper method with 10-fold cross-validation to choose the algorithm and the feature set that provided the maximum average F-measure value over the ten splits of the dataset.
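In scikit-learn terms, such a wrapper search can be sketched as follows. This is our analogy to the Weka-based setup, not the authors' code: SequentialFeatureSelector performs a greedy wrapper search in the spirit of Kohavi & John (1997), and the names X_feat, y, and select_features are ours; X_feat is assumed to hold the extracted feature matrix and y the literal/metaphoric labels for one expression type.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import cross_val_score

    def select_features(X_feat, y, n_keep=5):
        """Greedy forward wrapper search maximizing the 10-fold CV F-measure."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        selector = SequentialFeatureSelector(
            clf, n_features_to_select=n_keep, direction="forward",
            scoring="f1", cv=10)
        selector.fit(X_feat, y)
        mask = selector.get_support()               # boolean mask of kept features
        score = cross_val_score(clf, X_feat[:, mask], y,
                                scoring="f1", cv=10).mean()
        return mask, score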
4 Experimental Results

4.1 Corpora

Our results in this paper are based on the Reuters Corpus, which was previously used as a domain corpus in (Neuman et al., 2013). It consists of 342,000 English documents, which include 3.9 million sentences.

We used the same annotated corpus as in (Neuman et al., 2013). The annotated corpus was constructed by extracting from the domain corpus the sentences containing one of five target nouns related to the concepts of "government" and "governance" in both a literal and a metaphorical sense: Father, God, Governance, Government, and Mother. Every selected sentence was parsed with the Stanford Part-of-Speech Tagger (de Marneffe & Manning, 2008). Candidate expressions having one of the three syntactic structures were independently labeled as metaphoric or literal by four human judges, who were given the following definition of a metaphoric expression: Literal is the most direct or specific meaning of a word or expression. Metaphorical is the meaning suggested by the word that goes beyond its literal sense.

The annotators were also given examples of literal and metaphorical expressions of types 1, 2, and 3. Inter-annotator agreement, measured in terms of Cronbach's alpha, was 0.78, 0.80, and 0.82 for types 1, 2, and 3, respectively (Neuman et al., 2013). Finally, an expression was labeled as metaphoric if at least three judges out of four considered it as such; otherwise, it was labeled as literal. Table 1 shows the distribution of expressions by type and label (Literal / Metaphorical) in the set of annotated sentences.

Table 1. Candidate Expressions, Reuters Annotated Set (Neuman et al., 2013)

  Annotators' Decision   Type 1        Type 2        Type 3
  Literal                40 (33.3%)    242 (48.6%)   576 (75.8%)
  Metaphorical           80 (66.7%)    256 (51.4%)   184 (24.2%)
  Total                  120           498           760

4.2 Comparative Results

In this section, we present the comparative results of MIL vs. two state-of-the-art algorithms: the Concrete-Abstract algorithm (Conc-Abs) of Turney et al. (2011) and CCO of Neuman et al. (2013), which so far have reported the best performance, in terms of the F-measure, on all three types of metaphoric expressions. The following performance measures were calculated using 10-fold cross-validation: precision, recall, and F-measure. The comparative results are shown in Tables 2-4 for expressions of types 1, 2, and 3, respectively. The Conc-Abs and CCO results are replicated from (Neuman et al., 2013); they refer to the same domain (Reuters) and the same set of annotated sentences.

Table 2. Comparative results, type 1, Reuters

              MIL     Conc-Abs   CCO
  Precision   86      76.5       83.9
  Recall      92.5    76.5       97.5
  F-measure   89.2    76.5       90.1

Table 3. Comparative results, type 2, Reuters

              MIL     Conc-Abs   CCO
  Precision   65.2    63.9       76.1
  Recall      77      67.2       82
  F-measure   70.6    65.4       78.9

Table 4. Comparative results, type 3, Reuters

              MIL     Conc-Abs   CCO
  Precision   46.8    0          54.4
  Recall      39.7    0          43.5
  F-measure   42.9    0          48.3

The MIL algorithm clearly outperformed the Concrete-Abstract algorithm in terms of F-measure, with an advantage of 12.7%, 5.2%, and 42.9% for expression types 1, 2, and 3, respectively. Moreover, all type 3 expressions were identified by the Concrete-Abstract algorithm as literal, leading to an F-measure of zero.
The most likely reason is that our dataset rarely contains expressions where the noun is abstract (e.g., the noun "thoughts" in the expression "dark thoughts" is an abstract noun); most metaphoric expressions in it contain concrete nouns (e.g., the noun "heart" in the expression "broken heart" is a concrete noun). Since Conc-Abs relies on a single feature, the noun's abstractness level, it cannot detect a metaphor in these cases.

However, CCO outperformed MIL, especially on type 2 and type 3 expressions. Unlike MIL, which is a supervised learning approach, CCO is a rule-based method, requiring for each new language and domain a significant amount of manual expert labor along with multiple high-quality knowledge resources, which are unavailable for most human languages. The F-measure results reported by (Tsvetkov et al., 2014) for type 2 and type 3 expressions (76%) are also better than the results reached by MIL, but their system depends upon a massive knowledge resource, WordNet.

Table 5 shows the list of features and the classifier selected by the Wrapper method of (Kohavi & John, 1997) for each expression type. We can conclude from the selected feature list that First k Documents is a general feature, since it has been selected for all three expression types. The following features have been selected for two expression types out of three: Domain Corpus Frequency (types 1 and 3), Cosine-similarity Variance (types 1 and 3), and Cosine-similarity Average (types 2 and 3). These results imply that for detecting a conceptual mapping between two words in a given expression, it can be useful to consider the semantic neighborhood of each word as well as its frequency in the domain corpus.

Table 5. Features and classifier selected by the Wrapper method, Reuters

  Expression Type   Selected Features                    Selected Classifier
  Type 1            Semantic Relation                    Random Forest
                    First k Documents
                    Domain Corpus Frequency
                    Abstract Scale Average
                    Cosine-similarity Variance
  Type 2            Abstract Scale                       AdaBoost with Naïve Bayes
                    First k Documents
                    First k Words
                    Semantic Relation Average
                    Cosine-similarity Average
  Type 3            Cosine-similarity                    AdaBoost with VFI
                    First k Documents
                    Abstract Scale Difference
                    Documents' Jaccard Similarity
                    Domain Corpus Frequency
                    Domain Corpus Frequency Difference
                    Cosine-similarity Average
                    Cosine-similarity Variance

5 Conclusion

In this paper, we have presented a novel supervised learning approach for automatic metaphor identification in three syntactic structure types. We have extended the single-feature set used by Turney et al. (2011) with a large number of statistical features. We have shown a significant improvement over a learning-based algorithm (Concrete-Abstract). However, MIL was outperformed by a rule-based algorithm (CCO), which applies a set of rules to a candidate expression in order to determine whether it is literal or not. In CCO, the rules are generated separately for each of the three expression types; if a candidate expression satisfies all of them, it is labeled as literal, and otherwise it is labeled as a metaphor. Although CCO outperformed MIL, it has some major disadvantages. One of them is that its rules are based on a relatively large set of linguistic resources, including COCA (Corpus of Contemporary American English, http://www.ngrams.info/), ConceptNet (http://conceptnet5.media.mit.edu/), WordNet (https://wordnet.princeton.edu/), and Wiktionary (https://en.wiktionary.org/wiki/English).
Future research on using statistical features for metaphor detection may include experimentation with additional predictive features, domains, and languages. Transfer learning across different domains may also be explored.

References

Krishnakumaran, S., & Zhu, X. (2007). Hunting Elusive Metaphors Using Lexical Resources. Proceedings of the Workshop on Computational Approaches to Figurative Language (pp. 13-20). Stroudsburg, PA, USA.

Birke, J., & Sarkar, A. (2006). A Clustering Approach for the Nearly Unsupervised Recognition of Nonliteral Language. In Proceedings of EACL-06 (pp. 329–336).

Caron, J. (2001). Experiments with LSA scoring: Optimal rank and basis. Proceedings of the SIAM Computational Information Retrieval Workshop.

de Marneffe, M.-C., & Manning, C. D. (2008). The Stanford typed dependencies representation. Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation (pp. 1-8). Association for Computational Linguistics.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 391-407.

Fellbaum, C. (1999). WordNet: An Electronic Lexical Database. Blackwell Publishing Ltd.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).

Hovy, D., Srivastava, S., Jauhar, S. K., Sachan, M., Goyal, K., Li, H., . . . Hovy, E. (2013). Identifying metaphorical word use with tree kernels. Proceedings of the First Workshop on Metaphor in NLP (pp. 52-57).

Karov, Y., & Edelman, S. (1998). Similarity-based word sense disambiguation. Computational Linguistics, 41-59.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273–324.

Neuman, Y., Assaf, D., Cohen, Y., Last, M., Argamon, S., Howard, N., & Frieder, O. (2013). Metaphor Identification in Large Texts Corpora. PLOS ONE, 1-9.

Shutova, E., Korhonen, A., & Teufel, S. (2013). Statistical Metaphor Processing. Computational Linguistics, 301-353.

Shutova, E., Sun, L., & Korhonen, A. (2010). Metaphor Identification Using Verb and Noun Clustering. COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1002-1010). Beijing.

Tsvetkov, Y., Boytsov, L., Gershman, A., Nyberg, E., & Dyer, C. (2014). Metaphor Detection with Cross-Lingual Model Transfer. ACL-2014 (pp. 248-258).

Turney, P. D., & Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 1-48.

Turney, P., Neuman, Y., Assaf, D., & Cohen, Y. (2011). Literal and Metaphorical Sense Identification through Concrete and Abstract Context. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 680-690).

A Peculiarity-based Exploration of Syntactical Patterns: a Computational Study of Stylistics

Mohamed-Amine Boukhaled, Francesca Frontini, Jean-Gabriel Ganascia

LIP6 (Laboratoire d'Informatique de Paris 6), Université Pierre et Marie Curie and CNRS (UMR7606), ACASA Team, 4, place Jussieu, 75252 Paris Cedex 05 (France)
{mohamed.boukhaled, francesca.frontini, jean-gabriel.ganascia}@lip6.fr

Abstract.
In this contribution, we present a computational stylistic study and comparison of classic French literary texts based on a data-driven approach, where the discovery of interesting linguistic patterns is done without any prior knowledge. We propose an objective measure capable of capturing and extracting meaningful stylistic syntactic patterns from a given author's work. Our hypothesis is that the most relevant syntactic patterns should significantly reflect the author's stylistic choices, and thus should exhibit a peculiar over-representation behavior, controlled by the author's purpose, with respect to a linguistic norm. The results show the method's effectiveness in extracting interesting syntactic patterns from novels and seem particularly promising for the analysis of such texts.

Keywords: Computational Stylistics, Interestingness Measure, Sequential Pattern Mining, Syntactic Style

1 Introduction

Computational stylistics is a subdomain of computational linguistics located at the intersection of several research areas such as natural language processing, literary analysis, and data mining. The goal of computational stylistics is to extract style patterns characterizing a particular type of text using computational and automatic methods (Craig 2004). When investigating the writing style of a particular author, the task is to automatically explore the linguistic forms of that style, which include not only distinguishing features but also the deliberate overuse of certain structures by the author compared to a linguistic norm (Mahlberg 2012). However, the notion of style in the context of computational stylistics is quite broad and is manifested on several linguistic levels: lexicon, syntax, semantics, and pragmatics. Each level has its own markers of style and its own linguistic units that characterize it.

Much work has been done in the literature to analyze stylistic traits on these different linguistic levels (Biber 2006, Biber & Conrad 2009, Ramsay 2011, Frontini et al. 2014; see Siemens & Schreibman 2013 for a discussion and overview). In this contribution, syntactic style is targeted.

In their study, Quiniou et al. (2012) have shown the interest of using sequential data mining methods for the stylistic analysis of large texts. They have shown that relevant and understandable patterns that are characteristic of a specific type of text can be extracted using sequential data mining techniques such as sequential pattern mining.

However, the process of extracting textual patterns is known to produce a large number of patterns, even from a relatively small sample of text. Thus, an interestingness measure must be applied to identify the most important and relevant patterns for characterizing the style of the text in question. In this paper, we present a computational stylistic study of classic texts of French literature based on a data-driven approach where the discovery of interesting linguistic forms is done without any prior knowledge.
Specifically, the proposed method is based on assessing the peculiar over-representation of syntactic patterns, extracted from texts using a sequential data mining technique, with respect to a norm corpus. This method is intended to quantitatively support a textual analysis by focusing on the verification of the degree of importance of each syntactic pattern (syntagmatic segments with potential gaps), and by extracting the syntactic patterns that characterize the syntactical style of a work by a particular author.

2 Approach for extracting relevant syntactic patterns

Our method consists of two steps. First, a sequential pattern mining algorithm is applied to the texts in order to extract recurrent syntactic patterns. Second, a peculiarity-based interestingness measure that evaluates the over-representation (in terms of frequency of occurrence with respect to a norm corpus) is applied to the set of extracted syntactic patterns. Thus, each syntactic pattern is assigned an interestingness value indicating its importance and its relevance for characterizing the text's syntactic style.

In what follows, we present in Section 2.1 the corpus used in our experiments and the protocol for dividing it into two parts: the text to analyze and the text used as the norm. Then, Section 2.2 introduces some elements necessary to understand the process of extracting sequential syntactic patterns. Finally, the formulation and the statistical details of the proposed interestingness measure are presented in Section 2.3.

2.1 Analyzed Corpus

In our study, we used four novels, belonging to the same genre and the same literary time span, written by four famous classic French authors: Balzac's "Eugénie Grandet", Flaubert's "Madame Bovary", Hugo's "Notre Dame de Paris" and Zola's "Le Ventre de Paris". This choice is motivated by our particular interest in studying the style of classical French literature of the 19th century. At the time of the analysis of the syntactic patterns, each text written by one of the four authors is contrasted with the texts written by the three other authors. That is to say, these three texts are considered as the norm corpus against which we evaluate the hypothesis of the over-representation of syntactic patterns in the fourth remaining text, as explained later in this section.

2.2 Extraction of syntactic patterns

In our study we adopt a syntagmatic approach. The text is first segmented into a set of sentences; each sentence is then represented by a sequence of syntactic labels (POS tags)¹ corresponding to the words of the sentence, obtained using TreeTagger (Schmid 1994). This produces a set of syntactic sequences for each text. For example, the sentence "Le silence profond régnait nuit et jour dans la maison." will be represented by the sequence:

  <DET, NOM, ADJ, VER, NOM, KON, NOM, PRP, DET, NOM, SENT>

Then, sequential patterns of a certain length, with their supports (the number of sentences containing the pattern), are extracted from this syntactic sequence database using a sequential pattern extraction algorithm (Viger et al. 2014). A syntactic pattern consists of a sequential syntagmatic segment (with possible gaps) present in the syntactic sequences. It can be considered as a kind of generalization of the notion of n-gram widely used in the field of automatic language processing.
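To make the notion concrete, the following minimal sketch (ours, not the authors' implementation) checks whether a gapped syntagmatic segment occurs in a POS-tag sequence and counts its support over a set of sentences; treating the gap marker "*" as a wildcard over exactly one tag is our reading of the gap notation used in the examples below.

    def matches_at(pattern, seq, start):
        """Does the pattern (POS tags, with "*" as a one-tag gap) occur at start?"""
        if start + len(pattern) > len(seq):
            return False
        return all(p == "*" or p == seq[start + i] for i, p in enumerate(pattern))

    def support(pattern, sentences):
        """Number of sentences (POS-tag sequences) that contain the pattern."""
        return sum(any(matches_at(pattern, s, i) for i in range(len(s)))
                   for s in sentences)

    # The tagged example sentence above:
    sent = ["DET", "NOM", "ADJ", "VER", "NOM", "KON",
            "NOM", "PRP", "DET", "NOM", "SENT"]
    print(support(["KON", "NOM", "*", "DET", "NOM"], [sent]))  # -> 1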
Examples of syntactic patterns present in the sequence of the example above:

• <DET><NOM><ADJ>
• <NOM><ADJ><VER><NOM>
• <KON><NOM><*>²<DET><NOM>

To avoid the effect of statistical fluctuations on the analysis of patterns with low supports, we use a support threshold of 1%. That is to say, we focus only on patterns that are present in at least 1% of the sentences of the analyzed text. However, as sequential pattern mining is known to produce a large quantity of patterns even from relatively small samples of text, an interestingness measure should be applied to these patterns in order to identify the most important ones. This interestingness measure is explained in the next section.

¹ French TreeTagger tagset: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html
² <*> denotes a gap that can be filled with any POS tag

2.3 Evaluation of the relevance of syntactic patterns

Our hypothesis for evaluating the relevance of a syntactic pattern is that the most relevant patterns should significantly reflect the stylistic choices of the author and should thus be characterized by a significantly peculiar quantitative behavior; this peculiar behavior translates into an over-representation of their support in the author's texts.

However, to capture this over-representation one cannot rely only on the absolute frequency of occurrence (support). Indeed, a more frequent use of a syntactic pattern by an author (which translates into a relatively high support) does not necessarily indicate a stylistic choice, since it can very well be a property imposed by the grammar of the language or by syntactic features that are characteristic of the text's genre.

Thus, to assess the over-representation of a pattern, we use an empirical approach based on comparing the support of a syntactic pattern in a text to that found in a norm corpus. A ratio α between these two quantities is calculated as follows:

  \alpha = \frac{\text{frequency of the pattern in the norm corpus}}{\text{frequency of the pattern in the text}}

In our experiments we found empirically that the distribution of the ratio α exhibits a Gaussian behavior. Indeed, the values of the α ratio are normally distributed around a central value (see Fig. 1). This is due to the fact that the frequency of occurrence of a syntactic pattern in a text is highly correlated with its frequency of occurrence in the norm corpus, with a few exceptional special cases or outliers (see Fig. 2). These outliers are the patterns of special interest for our study, because they represent a linguistic deviation that is specific to the author's style compared to what one would expect to see in the norm corpus.

Fig. 1. Gaussian behaviour of the ratio α in Balzac's "Eugénie Grandet"

The configuration described above allows us to use an outlier detection method based on the Gaussian distribution and the z-score to identify such special patterns (Chandola et al. 2009). The over-representation of a pattern in this case results in a larger negative aberrant behavior compared to other patterns. The most over-represented patterns are those associated with the lowest values of the standard z-score. The z-scores are calculated as follows:

  z_i = \frac{\alpha_i - \bar{\alpha}}{\sigma}

where α_i and z_i are respectively the ratio α and the z-score corresponding to the i-th syntactic pattern, and ᾱ and σ are respectively the mean and standard deviation of the ratio α.
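Once supports have been computed for both the analyzed text and the norm corpus, the measure itself is only a few lines; the following sketch (our illustration, with hypothetical dictionaries mapping each pattern to its relative frequency) returns the z-score of every pattern:

    import math

    def peculiarity_zscores(freq_text, freq_norm):
        """z-score of the ratio alpha = freq_norm / freq_text for each pattern."""
        alpha = {pat: freq_norm[pat] / freq_text[pat] for pat in freq_text}
        values = list(alpha.values())
        mean = sum(values) / len(values)
        std = math.sqrt(sum((a - mean) ** 2 for a in values) / len(values))
        # Patterns over-represented in the text have small alpha, hence low z.
        return {pat: (a - mean) / std for pat, a in alpha.items()}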
Fig. 2. Frequencies of syntactic patterns in a text with respect to their frequencies in the norm corpus for the studied novel. Each point in the graph represents a syntactic pattern. The plotted lines are the linear regression lines capturing the expected behaviour of the α ratio.

3 Results and Discussion

In this section, we present some examples of relevant syntactic patterns extracted from our corpus. The patterns extracted with the proposed method appear strongly relevant not only to the style of the authors in our corpus but also to the novels' content and to the literary genre in which each operates.

In Flaubert's Madame Bovary, several extracted patterns well represent the rhythmic rather than functional role of punctuation that is peculiar to Flaubert's style (Mangiapane 2012). For example, pattern (1) captures instances of a comma preceding a conjunction, followed by a parenthetical clause.

Pattern (1): <PUN><KON><PUN>, support = 113. Sample instances of the pattern in the text:
• , et , à
• , mais , avant
• ; et , à

In Zola's Le Ventre de Paris, along the same lines, the syntactic patterns extracted as relevant clearly reflect the use of nested clauses to describe situations or attitudes in the novel, as in pattern (2), or to describe public places and objects on display in long lists, as in pattern (3).

Pattern (2), support = 104. Sample instance of the pattern in the text (bold in the original):
« Florent se heurtait à mille obstacles, à des porteurs qui se chargeaient, à des marchandes qui discutaient de leurs voix rudes ; il glissait sur le lit épais d'épluchures et de trognons qui couvrait la chaussée, il étouffait dans l'odeur puissante des feuilles écrasées. »

Pattern (3), support = 68. Sample instances of the pattern in the text (bold in the original):
• angles , à fenêtres étroites
• très-jolies , des légendes miraculeuses
• écrevisses , des nappes mouvantes

In Balzac's Eugénie Grandet, other, different communicative functions are performed by the syntactic patterns and their textual instances, for example:

Pattern (4), support = 49, is used as a post-introducer of direct speech. This rather formulaic way of specifying (in a parenthetical form) the utterer of a reported speech is common to all the authors, but seems to be strongly preferred by Balzac, while the other authors show a more varied style in introducing dialogues. Sample instances of the pattern in the novel:
• , dit Grandet en
• , reprit Charles en
• , dit Cruchot en

Pattern (5), support = 54, is a pattern used to refer to money, which is typical of the novel's scenario, where money plays a very important role. Sample instances of the pattern in the novel:
• vingt mille francs
• deux mille louis
• sept mille livres

Pattern (6), support = 59, is used to express negative questions:
• n' avait -il pas
• ne disait -on pas
• ne serait -il pas

Pattern (7), support = 44, represents the punctuation extensively used to mimic spoken intonation and even to reproduce performance phenomena such as stuttering.
Sample instances of the pattern in the novel:
• , messieurs , cria
• , madame , répondit
• , mademoiselle , disait

The few analyzed examples indicate that the presented technique is effective in extracting interesting syntactic patterns from a single text, and this seems particularly promising for the analysis of such classic literary texts. On the other hand, this technique, like other similar ones, prompts the question of what is really captured by significant patterns. Some structures may be significant because they are typical of an author's style, its fingerprint, as we may say, borrowing a metaphor often used in attribution studies; or they may be dictated by functional needs, due to the particular topic of the novel or to the conventions of the chosen genre. This is particularly true for syntactic analysis, where the functional constraints on authorial freedom are more evident. Much further work has to be carried out on this issue.

4 Conclusion

In this paper, we have presented an objective interestingness measure to extract meaningful stylistic syntactic patterns from a given author's work. Our hypothesis is that the most relevant syntactic patterns should significantly reflect the author's stylistic choices and thus should exhibit a peculiar over-representation behavior controlled by the author's purpose. To evaluate the effectiveness of the proposed method, we conducted an experiment on a corpus of classic French literature. The analyzed results show its effectiveness in extracting interesting syntactic patterns from this type of text.

Based on the current study, we have identified several future research directions, such as exploring other statistical measures to assess the interestingness of a given syntactic pattern and expanding the analysis to include morpho-syntactic patterns (word forms and lemmas). Finally, we intend to experiment with other languages and text sizes using standard corpora employed in the field of computational stylistics at large.

References

Biber, D., 2006. University language: A corpus-based study of spoken and written registers, John Benjamins Publishing.

Biber, D. & Conrad, S., 2009. Register, genre, and style, Cambridge University Press.

Chandola, V., Banerjee, A. & Kumar, V., 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), p.15.

Craig, H., 2004. Stylistic analysis and authorship studies. A Companion to Digital Humanities, 3, pp.233–334.

Frontini, F., Boukhaled, M.A. & Ganascia, J., Linguistic Pattern Extraction and Analysis for Classic French Plays.

Mahlberg, M., 2012. Corpus stylistics and Dickens's fiction, Routledge.

Mangiapane, S., 2012. Ponctuation et mise en page dans Madame Bovary: les interventions de Flaubert sur le manuscrit du copiste. Flaubert. Revue critique et génétique, (8).

Quiniou, S. et al., 2012. What about sequential data mining techniques to identify linguistic patterns for stylistics? In Computational Linguistics and Intelligent Text Processing. Springer, pp. 166–177.

Ramsay, S., 2011. Reading machines: Toward an algorithmic criticism, University of Illinois Press.

Schmid, H., 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. pp. 44–49.

Siemens, R. & Schreibman, S., 2013. A companion to digital literary studies, John Wiley & Sons.

Viger, P.F. et al., 2014.
SPMF: A Java Open-Source Pattern Mining Library. Journal of Machine Learning Research, 15, pp.3389–3393.

Mining Comparative Sentences from Social Media Text

Fabíola S. F. Pereira
Faculty of Computer Science
Federal University of Uberlândia (UFU)
Uberlândia, Minas Gerais, Brazil
fabfernandes@comp.ufu.br

Abstract. Comparative opinions are a way for users to express their preferences about two or more entities. In this paper we address the problem of mining comparative sentences, focusing on social media. We propose a genetic algorithm able to mine comparative sentences from short texts based on sequential pattern classification. A comparison among classifiers with respect to comparative sentence analysis is also presented. Our results indicate better accuracy for the proposed technique against literature baseline approaches, reaching accuracy levels of 73%.

Keywords: opinion mining, comparative sentences, genetic algorithm, social media mining

1 Introduction

Comparative opinions are a way for users to express their preferences about two or more entities. Mining comparative sentences from texts can be useful in several applications. For instance, a company might be interested in social media rumors about a new product release among consumers. Or, what are the best and worst features of the new product from the consumers' viewpoint? Nowadays, social media are a great source of this kind of information, and mining comparative opinions from them seems to be a very promising direction for unveiling valuable knowledge.

Much research has been done in the field of regular opinion and sentiment classification [3,2]. However, comparative opinions represent a different viewpoint of users and an interesting research area. According to [8], a regular opinion about a certain car X is a statement like "car X is ugly". On the other hand, a comparison is like "car X is much better than car Y", or "car X is larger than car Y". Clearly, these sentences carry rich information from which we can extract knowledge with specific mining techniques.

In [5] the authors proposed a classification technique for mining comparative sentences based on grammatical sequential patterns. In this paper, building on the background of [5], our goal is to stress-test techniques for mining comparative sentences, focusing on Twitter social data analysis. We argue that social media corpora, as a great source of user opinions, must be explored, and that specific mining algorithms are needed.

The main contributions of this paper are: (1) a publicly available dataset crawled from Twitter, for which we manually labeled 1,500 tweets as comparative or non-comparative; (2) the genetic algorithm GA-CSR, applied to the problem of mining comparative sentences; and (3) a set of experiments comparing our approach with state-of-the-art techniques.

This paper is organized as follows: in Section 2 we introduce the problem of mining comparative sentences, highlighting social media texts. In Section 3 we discuss techniques proposed in related work and present our proposal. In Section 4 the experimental results are shown. Finally, Section 5 concludes the paper.
2 The Problem of Mining Comparative Sentences

In the context of our study, comparative opinions are opinions that express a relation based on similarities or differences between two or more entities. According to [6], there are four types of comparisons: non-equal gradable comparison ("XBox is better than Wii-U"), equative ("XBox and Wii-U are equally funny"), superlative ("XBox is the best among all video games") and non-gradable comparison ("XBox and Wii-U have different features"). The first three types are called gradable comparisons and are our focus, because such sentences allow a preference order to be established among the entities being compared.

Definition 1 (Comparative Opinion [6]). A comparative opinion is a sextuple (E1, E2, A, PE, h, t), where E1 and E2 are the entity sets being compared based on their shared aspects A, PE ∈ {E1, E2} is the preferred entity set of the opinion holder h, and t is the time when the comparative opinion is expressed. For a superlative comparison, if one entity set is implicit (not given in the text), we can use a special set U to denote it. For an equative comparison, we can use the special symbol EQUAL as the value for PE.

Example 1. Let us consider the following comparative sentence: "@stephthelamekid tbh wii u games do have better graphics than ps4 and xbox 1 games", posted by user Dinotia_4 on 12/06/2014. The comparative opinion extracted is: ({Wii U games}, {PS4 games, XBox One games}, {graphics}, {Wii U games}, Dinotia_4, 12/06/2014)

One challenge in the problem of comparative sentences is that not all sentences with the POS tags JJR, RBR, JJS and RBS (comparative and superlative POS tags) are comparative. For example, "faster is better." Moreover, some expressions are comparative but can only be identified through context, e.g., "PS4 is expensive, but Wii-U is cheap."

3 Comparative Sentences Mining Techniques

To the best of our knowledge, the most representative technique in the literature that addresses the problem of comparative sentences is [5], which is based on sequential pattern mining. In the following we present two new approaches: a naive approach based on n-gram classification and a genetic algorithm approach. The technique from [5] is also summarized in this section.

3.1 N-grams Classification

The technique of document representation through term vectors is the most common in the sentiment analysis field and can be used as our baseline. In this approach, each sentence in the corpus is a document, the terms are the most relevant words, and we use a TF-IDF matrix [6] to represent them. Such a matrix is then submitted to a classifier that builds a model able to identify whether a given sentence is comparative or not. In this work, only unigrams have been considered. We did three pre-processing steps: (1) stop-word removal, (2) stemming, and (3) extraction of 1000 features based on the information gain index [10]. As we will show in Section 4, the results obtained with this approach were not expressive, even when varying the classification algorithms.

3.2 Sequential Patterns Classification

Sequential pattern classification for mining comparative sentences was proposed in [5]. Sequential pattern mining (SPM) is an important data mining task [1]. A sub-sequence is called a sequential pattern or frequent sequence if it frequently appears in a sequence database, i.e., if its frequency is no less than a user-specified minimum support threshold minsup [4].
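As a minimal illustration of this support notion (our sketch, with toy POS-tag sequences): a pattern here is a not-necessarily-contiguous subsequence, and its support is the fraction of database sequences containing it.

    def is_subsequence(pattern, seq):
        """True if pattern occurs in seq in order, not necessarily contiguously."""
        it = iter(seq)
        return all(any(item == x for x in it) for item in pattern)

    def is_frequent(pattern, database, minsup):
        support = sum(is_subsequence(pattern, s) for s in database) / len(database)
        return support >= minsup

    db = [["NN", "VBZ", "JJR", "IN"], ["NN", "JJR", "NN"], ["DT", "NN", "VBZ"]]
    print(is_frequent(["NN", "JJR"], db, minsup=0.5))  # True: in 2 of 3 sequences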
According to [5], a class sequential rule (CSR) is a rule with a sequential pattern on the left and a class label on the right. Unlike classic sequential pattern mining, which is unsupervised, in this approach sequential rules are mined with fixed classes; the method is thus supervised. For a formal definition, please refer to [5].

Having defined the task of mining class sequential rules, we then deploy the algorithm for our problem. However, a sentence cannot be handled simply as raw words, as we did in the n-grams classification approach (Subsec. 3.1). To find sequential POS-tag patterns in sentences and then build an input dataset of sentences to be classified (supervised learning) as comparative or non-comparative, the following steps are needed:

1. Sentences with pivot keywords. Many words in the English language indicate comparisons, for example beat, exceed, outperform, etc. Moreover, words ending with -est and -er are naturally comparative or superlative adverbs and adjectives. Thus, a set of comparative keywords is considered. The idea is to identify sentences with at least one keyword and use the words within a radius of 3 of each keyword in the sentence as a sequence in our data.

2. Replacing with POS tags. For each sequence of maximum length 7 obtained in the previous phase, replace all words with their corresponding POS tags.

3. Labeling sequences. Label each sequence as comparative or non-comparative. This is the label of the sentence that originated the sequence.

4. Generating CSRs. In this phase we mine sequential patterns. The PrefixSpan algorithm [9] has been used with minimum support 0.1 and minimum confidence 0.6.

5. Building the dataset for the classification task. To translate class sequential rules into input for classification algorithms, the following steps are taken: each CSR is a feature; the classes are comparative and non-comparative; each sentence from the original corpus is a tuple in the dataset. If the sentence matches a given CSR, the feature value is 1; otherwise, 0. Each sentence keeps its class. In this way, we have a well-formed input for a classifier algorithm.

6. Running the classifier. In paper [5] the authors use only the Naive Bayes classifier. In our experiments, we also considered the algorithms SVM, Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF).

Example 2. In order to illustrate steps 1 to 4, let us consider the sentence: "this/DT game/NN is/VBZ significantly/RB more/JJR fun/NN with/IN Kinect/NN than/IN without/IN it/PRP." It has the keyword more, and the generated CSR is: <{NN}{VBZ}{RB}{moreJJR}{NN}{IN}{NN}> → comparative

3.3 Genetic Algorithm

In this paper we propose a genetic algorithm for mining comparative sentences. The idea is to mine the class sequential rules (CSRs) of [5] (Subsec. 3.2); however, we do not use a classifier but a genetic algorithm (GA-CSR) to obtain the rules.

Each chromosome represents a CSR. Chromosomes have a fixed length of 8 genes, where the first 7 are the sequential pattern with itemsets of length 1 (default genes) and the last one is the class, with value comparative or non-comparative (class gene). Each default gene is an itemset and can assume values from the POS-tag domain. Moreover, each default gene carries an additional bit '1' or '0' indicating whether or not it is part of the sequential pattern. An example chromosome coding and its meaning is described in Figure 1, and a sketch of this encoding is given below. In what follows we detail the GA-CSR features.
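A minimal sketch of this encoding (ours, not the authors' code; the POS-tag domain shown is an illustrative subset, not the full tagset used in the experiments):

    import random

    POS_DOMAIN = ["NN", "VBZ", "RB", "JJR", "IN", "DT", "PRP"]
    CLASSES = ["comparative", "non-comparative"]

    def random_chromosome():
        """7 default genes as (POS tag, active-flag bit) pairs, plus a class gene."""
        genes = [(random.choice(POS_DOMAIN), random.randint(0, 1)) for _ in range(7)]
        return genes + [random.choice(CLASSES)]

    def decode(chromosome):
        """Translate a chromosome into the class sequential rule it represents."""
        pattern = [tag for tag, active in chromosome[:7] if active]
        return pattern, chromosome[7]

    pattern, label = decode(random_chromosome())
    print(pattern, "->", label)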
Fig. 1: Chromosome coding

– Fitness. The fitness of the rules in the population is calculated from two terms, Specificity (Sp) and Sensitivity (Se), where Sp = TN/(TN + FP), Se = TP/(TP + FN), and Fitness = Se × Sp. The variables TP, TN, FP and FN correspond to true positives, true negatives, false positives and false negatives, respectively.

– Population creator. The population is randomly generated.

– Selection and crossover. The two best chromosomes are selected by the roulette-wheel selection method, and two-point crossover is applied to them to generate new child chromosomes. The class gene is not involved.

– Mutation. The mutation process changes the value of an attribute to another random value selected from the same domain. It can occur in any gene type and does not affect the flag bit of each gene.

– Insertion and removal operators. The insertion and removal operators control the size of a rule. The insertion operator activates a gene by setting its flag bit, and the removal operator deactivates a gene by resetting the flag bit, with probabilities Pi and Pr, respectively.

– Survivor selection. GA-CSR uses fitness-based selection, where individuals are selected from the set composed of parents and offspring. The top Tp individuals by fitness are selected, where Tp is the population size.

In the end, we have a set of class sequential rules, and only those exceeding the minimum support and confidence are considered for testing and model validation.

4 Experimental Settings

In this section we report our experiments. We aim to compare the best classification accuracies. Section 4.1 describes the datasets used to train and test the models; it also presents our experimental set-up and parameter settings. Finally, we expose our results in terms of the success rate of the classifiers in Section 4.2.

4.1 Datasets and Parameterization

We tested our algorithms over two datasets: Amazon product reviews and Twitter texts. Our goal is to show how mining social media text is different because of its specific features of text length and language. The Amazon product reviews are about mp3 players and were obtained from [7]. The Twitter dataset contains tweets about the PlayStation 4 and XBox video games, which we collected from the Twitter API¹. Both datasets were manually labeled. Table 1 details the dataset features.

Our test set is composed of 9 runs for each dataset. For the n-grams approach, we used 4 classifiers: SVM (SVM-Unigram), NB (NB-Unigram), MLP (MLP-Unigram) and RBF (RBF-Unigram). In the CSR approach we also used 4 classifiers: SVM-CSR, NB-CSR, MLP-CSR and RBF-CSR. Finally, we ran our proposed genetic algorithm GA-CSR. In Figure 2 we present the detailed parameters used for each approach.

¹ https://dev.twitter.com/rest/public

Table 1: Datasets used for tests

                            DB-Amazon     DB-Twitter
  # sentences               1000          1500
  # comparative sentences   97 (9.7%)     199 (13.26%)
  Text dates                2003-2007     Dec 2014
  Topic                     mp3 players   XBox and PS4

(a) Parameters for the n-grams approach

  Unigrams: Train/Test: 10-fold cross-validation; Pre-processing: stop words, stemming, infogain; # Features: 1000

                              MLP-Unigram   RBF-Unigram
  Momentum                    0.8           -
  Learning rate               0.6           -
  # neurons in hidden layer   15            2
  # hidden layers             1             1

(b) Parameters for the CSR approach

  CSR: Train/Test: 10-fold cross-validation; Radius: 3; minsup: 0.1; minconf: 0.6; Sequential pattern algorithm: PrefixSpan

                              MLP-RCPS   RBF-RCPS
  Momentum                    0.8        -
  Learning rate               1.0        -
  # neurons in hidden layer   7          7
  # hidden layers             1          1
(c) Parameters for the GA-CSR approach

  GA-CSR
  Train/Test:            10-fold cross-validation
  Learning rate (Tm):    0.8
  Crossover rate (Tc):   0.8
  Insertion rate (Pi):   0.3
  Removal rate (Pr):     0.3
  minsup:                0.1
  minconf:               0.6
  # generations:         100
  Population size (Tp):  50
  Fitness:               Se × Sp

Fig. 2: Parameterization

4.2 Experimental Results

The first test set was performed over the DB-Amazon dataset (Figure 3). We can observe a poor performance for the n-grams approach. As expected, it is a simple baseline that does not take into account the elaborate features of our mining problem. Varying the classification algorithms does not impact the results, which reach a maximum accuracy of 68.6% for the RBF neural network.

Fig. 3: Experimental results over DB-Amazon

Regarding the CSR approach from [5], the results were similar to the original paper. The difference is that in [5] only the Naive Bayes classifier had been used, whereas in our experiments we also considered other classification algorithms. The neural network RBF-CSR reached the best accuracy of 81.13%. Finally, our proposed genetic algorithm reached 85.23% accuracy, making it the best approach for the DB-Amazon dataset.

The second test set ran over DB-Twitter (Figure 4). The curves maintained the same trend; however, the average accuracy decreased by around 10%. This can be explained by the large amount of noise in Twitter texts. Moreover, grammatical errors in the sentences potentially harm the grammatical-pattern approaches.

Fig. 4: Experimental results over DB-Twitter

5 Conclusion

In this paper we addressed the problem of mining comparative sentences. We carried out an experiment using 1,500 short sentences from Twitter.com, divided into two categories: comparative and non-comparative sentences. The results showed that the highest success rate was obtained with our genetic algorithm approach (73%). As our sample is relatively small, we used 10-fold cross-validation to avoid overfitting and to increase the reliability of the classifiers' success rates. To ensure reproducibility of our results, in conjunction with the publication of this paper we have released the full GA-CSR genetic algorithm code and the Twitter data in the format used by our algorithm².

As future work, once comparative sentences are mined, our focus will be on mining user preferences. We consider comparative sentences a good source of user opinions, enabling the development of models for reasoning about user preferences from social data.

References

1. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering. pp. 3–14. ICDE '95 (1995)
2. Arias, M., Arratia, A., Xuriguera, R.: Forecasting with twitter data. ACM Trans. Intell. Syst. Technol. 5(1), 8:1–8:24 (2014)
3. Ceron, A., Curini, L., Iacus, S.M.: Using sentiment analysis to monitor electoral campaigns: Method matters: evidence from the United States and Italy. Soc. Sci. Comput. Rev. 33(1), 3–20 (2015)
4. Fournier-Viger, P., Wu, C.W., Tseng, V.: Mining maximal sequential patterns without candidate maintenance. In: Advanced Data Mining and Applications, vol. 8346, pp. 169–180 (2013)
5. Jindal, N., Liu, B.: Identifying comparative sentences in text documents. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 244–251. SIGIR '06 (2006)
6. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers (2012)
7. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: Understanding rating dimensions with review text. In: Proceedings of the 7th ACM Conference on Recommender Systems. pp. 165–172. RecSys '13 (2013)
8. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1–135 (2008)
9. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. on Knowl. and Data Eng. 16(11), 1424–1440 (Nov 2004)
10. Sharma, A., Dey, S.: An artificial neural network based approach for sentiment analysis of opinionated text. In: Proc. of the 2012 ACM Research in Applied Computation Symposium. pp. 37–42 (2012)

² http://lsi.facom.ufu.br/~fabiola/comparative-mining