Text Classification Based on Deep Textual Parsing

Boris Galitsky1, Dmitry Ilvovsky2
1 Knowledge Trail Inc., San Jose, USA
bgalitsky@hotmail.com
2 National Research University Higher School of Economics, Moscow, Russia
dilvovsky@hse.ru, echernyak@hse.ru, skuznetsov@hse.ru

Abstract. The problem of classifying text based on its deep parsing structure is addressed. An algorithm is proposed for document classification tasks where counts of words or n-grams are insufficient. A parse tree kernel method at the level of paragraphs, based on anaphora, rhetorical relations and communicative actions linking phrases in a parse thicket, is considered.

Keywords: text genre, style classification, rhetoric structure, discourse, metalanguage

1 Introduction

The problem of genre classification (also referred to as automatic genre identification, AGI) has so far received some attention from researchers. There are two related directions in these studies:
1. Developing an intelligible genre system and collecting a corpus which represents the established genre system. Usually the texts are collected from the Web [8, 11].
2. Developing a machine classifier for classifying texts of different genres [9-14].

In this paper we consider both style and genre classification, without paying much attention to the difference between these notions. Following [1], we refer to "style" as a specific usage of language, and to "genre" as the category of a text, which represents its intention and aim.

There are several applications of genre classification:
1. Estimating how many different kinds of texts there are on the Web. This application can be treated as developing a socio- or psycho-metric tool [8,11,12,13].
2. Using genre classification to improve user-oriented information retrieval: based on the query, the search system should provide documents of the appropriate genre (for example, if the query sounds scientific enough, return scholarly papers; if the query is less formal, return blogs and social media) [9].
3. Recognizing the document type for a document management system (security, document recommendation, and other applications) [28].

Despite the variety of approaches to genre classification, the majority of researchers agree on the following point: the less complicated the text elements used as features for classification, the better the results. For example, [14,29] suggest using character n-grams to perform genre classification on the Brown corpus, BNC, HGC and other corpora. In [12], syntactic patterns, morphological patterns and character n-grams are used to build feature sets and are compared to each other. The latter achieve the highest F-measure, while the former provide poor results. The morphological pattern based classifier does not outperform the character-based one. In [13], common words are used to form feature sets.

To perform text classification in the described domains, we employ discourse information such as anaphora, rhetorical structure and entity synonymy. Relying on syntactic parse trees would provide us with specific expressions and phrasings connected with a style of writing. However, this is still insufficient for a thorough description of the linguistic features inherent to a style of writing. It is hard to identify such features without employing the discourse structure of a document, which needs to include anaphora and rhetorical relations. Furthermore, to systematically learn these discourse features associated with the style of writing, one needs a unified approach to classifying graph structures at the level of paragraphs [16].
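Before turning to these discourse-level features, it is useful to make the low-level baselines concrete. The sketch below builds a character n-gram TF*IDF representation with a nearest-neighbor classifier, in the spirit of the feature sets of [13, 14, 29] and of the baseline used in our evaluation; it is only an illustration of that baseline, not the method proposed in this paper. scikit-learn, the toy snippets and the labels are assumptions introduced for the example.

```python
# Illustrative low-level baseline only (not the approach of this paper):
# character n-gram TF*IDF features with a 1-nearest-neighbor classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented snippets standing in for two genres.
texts = [
    "Tap Settings, then Wi-Fi, to connect the device to a network.",  # instruction-like
    "Open the manual and follow step three to reset the router.",     # instruction-like
    "Yesterday the parliament voted on the new budget proposal.",     # news-like
    "The minister announced the reform during a press conference.",   # news-like
]
labels = ["instruct", "instruct", "news", "news"]

baseline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character 2-4-grams
    KNeighborsClassifier(n_neighbors=1),                   # TF*IDF nearest neighbor
)
baseline.fit(texts, labels)
print(baseline.predict(["Press the power button for five seconds."]))
```

Such a baseline captures surface cues of a genre but, as argued above, it cannot capture relations between phrases that are distributed over several sentences.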
The design of such features for the automated learning of syntactic and discourse structures for classification is still done manually today. To overcome this problem, the tree kernel approach has been proposed [26]. Tree kernels constructed over syntactic parse trees, as well as over discourse trees [17], are one way to conduct this feature engineering. A convolution tree kernel [24] defines a feature space consisting of all subtree types of parse trees and counts the number of common subtrees to express the respective distance in that feature space. The ability of the kernel to generate large feature sets is useful to ensure that we have enough linguistic features to differentiate between the classes, and to quickly model new and not yet well understood linguistic phenomena in learning machines. However, it is often possible to manually design features for linear kernels that produce high accuracy and fast computation, whereas the complexity of tree kernels may prevent their application in real scenarios. SVM [19] can work directly with kernels by replacing the dot product with a particular kernel function. This useful property of kernel methods, which implicitly calculate the dot product in a high-dimensional space over the original representations of objects such as sentences, has made them an effective solution for modeling structured linguistic objects [25].

In this paper we try to show how using more complicated and extensive syntactic information improves the results of genre classification. The goal of this research is to apply learning based on high-level linguistic features to the style and genre classification task, and also to estimate the influence of the corpus annotation quality on classification performance.

2 From style to genre

Moving from a "simple" to a "complex" system of style classes, we start by distinguishing texts between two classes: description (object level) and meta-description (metalanguage or meta-level). We consider the domain of literature documents.

A combination of object-language and metalanguage patterns and description styles can be found in literature. In the literature domain, we attempt to draw a boundary between pure metalanguage (works of literature with a special level of abstraction) and mixed-level text (a typical work of literature). Describing nature, a historical event, or an encounter between people, an author uses the object language. Describing the thoughts, beliefs, desires and knowledge of characters about nature, events and interactions between people, an author may use a metalanguage, if its entities range over the expressions (phrases) of the object language.

An outstanding example of the use of metalanguage in literature is Franz Kafka's novel "The Trial". According to our model, the whole plot is described in metalanguage, and the object-level layer is not present at all. This is unlike a typical work of literature, such as a fairy tale, where both levels are employed and the object level prevails. From the standpoint of logic, the novel is a pure example of the presence of a meta-theory and the absence of an object-level theory. The reader is expected to form the object-level theory herself to resolve the ambiguity in the interpretation of this novel.
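To make the object-level / meta-level distinction tangible, the deliberately naive heuristic below flags a sentence as metalanguage-like when a propositional-attitude or communicative verb introduces an embedded clause (a report about another statement or thought), and as object-level otherwise. The verb list and the examples are invented for illustration; the classifiers evaluated later in the paper rely on tree kernels over extended parse trees, not on such surface cues.

```python
# Naive, purely illustrative heuristic (NOT the method of this paper): a sentence
# is treated as metalanguage-like if an attitude/communicative verb is followed
# by an embedded clause ("... suspected that ...", "... said that ...").
import re

ATTITUDE_VERBS = (
    r"(think|thinks|thought|believe|believes|believed|know|knows|knew|"
    r"say|says|said|claim|claims|claimed|suspect|suspects|suspected|"
    r"assume|assumes|assumed)"
)
META_PATTERN = re.compile(rf"\b{ATTITUDE_VERBS}\s+(that|whether)\b", re.IGNORECASE)

def looks_like_metalanguage(sentence: str) -> bool:
    """True if the sentence contains an attitude verb with an embedded clause."""
    return bool(META_PATTERN.search(sentence))

print(looks_like_metalanguage("K. suspected that someone must have slandered him."))  # True
print(looks_like_metalanguage("The river ran past the old stone bridge."))            # False
```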
For the genre classification we used the system of genres and the corpus from [2,3]. Let us describe the genre system in more detail. Unlike in other disciplines, the authors do not define particular genres in a systematic, exact, crisp way; instead they construct 17 main so-called Functional Text Dimensions (FTDs), which are the basis for a genre description. For example, the dimension A7 corresponds to instructions (tutorials, FAQs, manuals, recipes), and the dimension A11 to personal writing, such as diary-like blogs, personal letters and traditional diaries. A collection of texts picked from the Web is annotated by humans according to these dimensions: the annotator is asked to what extent this or that dimension is present in the text. There are four possible answers: 0 (none or hardly at all), 0.5 (slightly), 1 (somewhat or partly), 2 (strongly or very much so). After the annotation, every text is represented as a vector in the space of 17 functional dimensions, which makes any kind of machine learning applicable. The texts and functional dimensions are bi-clustered, and the resulting clusters are said to represent genres. The resulting system of genres consists of combinations of FTDs. Let us describe some of the genres obtained in [2,3]. There are genres that use only a single dimension: for example, the cluster Cl6 corresponds to the dimension A16, which is aimed at presenting information. But there are also genres that correspond to two or three dimensions: the cluster Cl13 stands for the dimensions A1 + A11, which are opinion blogs, often reporting personal experience and expressing one's emotions; and the cluster Cl14 stands for the dimensions A11 + A19 + A3, which are diary blogs expressing one's emotions and attempting to embellish the description. The clusters often correspond to traditional genres, but are more reliable than traditional genres, since the annotator does not have to choose between several predefined genres. We adopt both the genre system and the corpus from this research.

3 Discourse text structure for the classification task

It turns out that low-level features can be insufficient for style classification in some domains, such as meta-document or design-document text detection. Since important phrases can be distributed over different sentences, one needs a sentence-boundary-independent way of extracting both syntactic and discourse features. Therefore we intend to combine/merge parse trees to make sure we cover all the phrases of interest.

Rhetorical Structure Theory (RST) [5, 20] has been used to describe or understand the structure of texts and to link rhetorical structure to other phenomena, such as anaphora or cohesion. RST is one of the most popular approaches to modeling extra-sentence as well as intra-sentence discourse. RST represents texts by labeled hierarchical structures. Their leaves correspond to contiguous Elementary Discourse Units (EDUs); adjacent ones are connected by rhetorical relations (e.g., Elaboration, Contrast), forming larger discourse units (represented by internal nodes), which in turn are also subject to this relation linking. Discourse units linked by a rhetorical relation are further distinguished based on their relative importance in the text: the nucleus is the central part, whereas the satellite is the peripheral one. Discourse analysis in RST involves two subtasks: discourse segmentation is the task of identifying the EDUs, and discourse parsing is the task of linking the discourse units into a labeled tree.
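As a minimal illustration of the RST representation just described, the sketch below encodes a discourse tree whose leaves are EDUs and whose internal nodes carry a rhetorical relation together with the nucleus/satellite distinction. The data structure, the relation names and the example text are assumptions made for illustration; this is not the output format of the discourse parser [15] used in our experiments, and multinuclear relations are ignored for simplicity.

```python
# Sketch of an RST-style discourse tree: EDU leaves, internal nodes labeled with
# a rhetorical relation and a nucleus/satellite split. Example text is invented.
from dataclasses import dataclass
from typing import Union

@dataclass
class EDU:
    text: str

@dataclass
class DiscourseNode:
    relation: str                           # e.g. "Elaboration", "Contrast"
    nucleus: Union["DiscourseNode", EDU]    # central part
    satellite: Union["DiscourseNode", EDU]  # peripheral part

tree = DiscourseNode(
    relation="Contrast",
    nucleus=DiscourseNode(
        relation="Elaboration",
        nucleus=EDU("The match was not cancelled,"),
        satellite=EDU("which surprised the visiting team,"),
    ),
    satellite=EDU("although it was snowing heavily."),
)

def edus(node) -> list:
    """Discourse segmentation view: the EDUs covered by a node."""
    if isinstance(node, EDU):
        return [node.text]
    return edus(node.nucleus) + edus(node.satellite)

print(edus(tree))
```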
Discourse analysis explores how meanings are built up in a communicative process, which differs between a text in a metalanguage and a text in an object language. Each part of a text has a specific role in conveying the overall message of the text.

4 Learning on extended parse trees

The design of discourse and syntactic features for automated text assessment tasks is still an art nowadays. One way to treat these features systematically is a set of tree kernels built over syntactic parse trees extended by discourse relations. A convolution tree kernel [24, 25] defines a feature space consisting of all subtree types of parse trees and counts the number of common subtrees as the syntactic similarity between two parse trees. Such kernels have found applications in a number of NLP tasks.

To obtain the inter-sentence links, we employed anaphoric relations from Stanford NLP [22, 23]. The rhetorical parser [15] builds a discourse parse tree by applying an optimal parsing algorithm to probabilities obtained from two conditional random fields, for intra-sentence and multi-sentence parsing. We also rely on additional tags to extend the SVM feature space when finding similarities between trees. These additional tags include named entities from Stanford NLP, such as organization and title, and verb types from VerbNet.

For every arc which connects two parse trees, we obtain an extension of these trees, extending branches according to the arc. For a given parse tree, we therefore obtain a set of its extensions, and the elements of the kernel are computed over many extensions instead of just a single tree [17]. The problem here is that we need to find common sub-trees for a much higher number of trees than the number of sentences in the text; however, by subsumption (the sub-tree relation) the number of common sub-trees is substantially reduced. The resulting trees are not proper parse trees for a sentence, but they nevertheless form an adequate feature space for tree kernel learning.

5 Evaluation

5.1 Style dataset

For the literature domain, we collected 160 paragraphs as meta-documents from Kafka's novel "The Trial" as well as his other novels, so that these paragraphs read as metalanguage patterns. As a set of object-level documents we manually selected 200 paragraphs of text in the same domain (scholarly articles about "The Trial"). We split the data into 3 subsets for training/evaluation portions and cross-validation.

Table 1. Evaluation results for literature documents

Method                                                       Precision  Recall  F-measure
Nearest neighbor classifier (TF*IDF based)                   48.5       54.3    51.24
Tree kernel – regular parse trees                            63.3       68.7    65.89
Tree kernel SVM – extended trees for both anaphora and RST   71.5       73.1    72.29

Table 1 shows the evaluation results. The baseline approach shows rather low performance. The performance of the tree kernel based methods improves as the sources of linguistic properties are expanded. For both domains, there is an improvement of a few percent due to the rhetorical relations, compared with the baseline tree kernel SVM which employs parse trees only. The best accuracy is lower than 85%. This can be explained by a few reasons. Meta-documents can contain object-level text, such as design examples. Object-level documents (genuine action-plan documents) can contain some author reflections on the writing process or direct citations (which are written in metalanguage). Hence the boundary between the classes does not strictly separate metalanguage and object language. For better performance, we need a better annotated dataset.
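To make the kernel computation used above concrete, the following is a minimal sketch of the subtree-counting (convolution) kernel in the spirit of Collins and Duffy [24], operating on parse trees written as nested tuples. The example trees and the decay parameter are invented for illustration; the experiments reported here apply this kind of kernel to parse trees extended with anaphoric and rhetorical arcs, which is not reproduced in the sketch.

```python
# Sketch of the convolution (subtree-counting) tree kernel of Collins & Duffy.
# Trees are nested tuples: (label, child, child, ...); leaves are strings.

def nodes(tree):
    """All non-leaf nodes of a nested-tuple tree."""
    result = []
    if isinstance(tree, tuple):
        result.append(tree)
        for child in tree[1:]:
            result.extend(nodes(child))
    return result

def production(node):
    """The production at a node: (label, tuple of child labels)."""
    return (node[0], tuple(c[0] if isinstance(c, tuple) else c for c in node[1:]))

def common_subtrees(n1, n2, lam=0.5):
    """C(n1, n2): decayed count of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + common_subtrees(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.5):
    """K(T1, T2) = sum over all node pairs of C(n1, n2)."""
    return sum(common_subtrees(n1, n2, lam)
               for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ('S', ('NP', ('DT', 'the'), ('NN', 'court')), ('VP', ('VBD', 'adjourned')))
t2 = ('S', ('NP', ('DT', 'the'), ('NN', 'trial')), ('VP', ('VBD', 'adjourned')))
print(tree_kernel(t1, t2))
```

In the extended setting, the nested tuples would be the extensions of parse trees along anaphoric and rhetorical arcs rather than single-sentence parse trees.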
5.2 Genre dataset

As mentioned earlier, we adopted the genre system and the corpora from [2, 3]. The genre system is constructed in the following way. First, the Functional Text Dimensions (FTDs) are defined. The FTDs are genre annotations which reflect judgements as to what extent a text can be interpreted as belonging to a generalized functional category. A genre is a combination of several FTDs; in other words, a genre is a point in the space defined by the FTDs. The corpus was annotated by humans: every annotator was asked to evaluate texts with respect to each FTD on a scale: 0 (none or hardly at all), 0.5 (slightly), 1 (somewhat or partly), 2 (strongly or very much so). See an example of FTDs and annotated texts below.

Table 2. Functional Text Dimensions

A1 Argum: To what extent does the text argue to persuade the reader to support (or renounce) an opinion or a point of view? ('Strongly' for argumentative blogs, editorials or opinion pieces)
A4 Fictive: To what extent is the text's content fictional? ('None' if you judge it to be factual/informative.)
A7 Instruct: To what extent does the text aim at teaching the reader how something works? (For example, a tutorial or an FAQ)
A8 Hardnews: To what extent does the text appear to be an informative report of events recent at the time of writing?
A9 Legal: To what extent does the text lay down a contract or specify a set of regulations?

In [2,3], 17 general dimensions are defined. Among them, the dimensions A1, A3, A4, A5, A6, A7, A8, A9 and A11 form 7 different genres, which we exploit for the classification below:

─ [tells] Instructions for how to use software.
─ [tele] Instructions for how to use hardware.
─ [ted] Emotional speech on a political topic; self-presentation; an attempt to sound convincing.
─ [synd] An article on a political event by a professional journalist.
─ [news] A presentation of a news article in an objective, independent manner.
─ [fict] Novels, stories, verses.
─ [un] UN reports.

Table 3. Main genres used for the evaluation

Genre example                                      A1   A3   A4   A5   A6   A7   A8   A9   A11
ted/eva_zeisel_on_the_playful_search_for_beauty    1    0    0    1    0    0    0    0    2
FictDostoyevskyF_CrimePun_II2_EN.txt               0    1    2    1    0.5  0    0    0    1
NewsGoalcom_MessiTop50_EN.txt                      0.5  1    0    1    1    0    2    0    0.5
syndicate/exchange-rate-disorder                   2    0    0    0    0    0.5  0.5  0    0
un/A_AC252_L13                                     1    0    0    0    0    0.5  0    2    0
TeleHTC_Manual_12_EN.txt                           0    0    0    0    0    2    0    0    0
TelsGoog_Answer_2feb_EN.txt                        0    0    0    0    0    0.5  0    0    1

Table 4. Pairwise classification results

Classes         VC Dim   Recall   Precision   #kernel evaluations   F-measure
Fict vs News    106      98.11    95.55       159841                96.81
Ted vs Synd     787      99.49    98.94       73177349              99.21
Un vs News      697      98.70    94.93       9486134               96.78
Tele vs Tells   360      96.69    90.76       1151517               93.63
Fict vs Ted     139      97.12    93.74       7557291               95.40
Fict vs Synd    192      95.21    94.23       7546911               94.72
Fict vs Un      214      94.90    95.71       4641983               95.30
Fict vs Tele    317      97.25    94.90       6547910               96.06
Fict vs Tells   301      96.51    95.61       8766391               96.06
News vs Ted     514      96.85    93.85       2619549               95.33
News vs Synd    281      97.28    96.19       7490174               96.73
News vs Tele    190      96.31    94.27       5235193               95.28
News vs Tells   231      98.28    96.15       3916727               97.20
Ted vs Un       390      96.45    97.03       5836394               96.74
Ted vs Tele     210      97.28    96.62       1612102               96.95
Ted vs Tells    187      94.52    96.06       7645104               96.81

The values of the quality measures (recall, precision and F-measure) are optimistically high. The highest F-measure is achieved for the classification of Ted against Synd. Both of these genres correspond to texts on political topics; however, the rhetorical structures for these genres are completely different. Hence we are able to learn a very efficient classifier to distinguish between these genres.
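In sketch form, the pairwise evaluation reported in Table 4 can be organized as follows: given a Gram matrix of tree-kernel values between documents and a genre label per document, an SVM with a precomputed kernel is trained and scored for each pair of genres. scikit-learn, the random placeholder Gram matrix and the toy labels below are assumptions introduced for the example; in the actual experiments the matrix entries come from the kernel over extended parse trees.

```python
# Sketch of a pairwise genre evaluation over a precomputed tree-kernel Gram
# matrix (cf. Table 4). The Gram matrix and labels below are toy placeholders.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

def evaluate_pair(gram, labels, genre_a, genre_b, test_fraction=0.25, seed=1):
    """Train/test an SVM with a precomputed kernel on documents of two genres."""
    labels = np.asarray(labels)
    idx = np.where((labels == genre_a) | (labels == genre_b))[0]
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    clf = SVC(kernel="precomputed")
    clf.fit(gram[np.ix_(train, train)], labels[train])   # K(train, train)
    pred = clf.predict(gram[np.ix_(test, train)])        # K(test, train)
    p, r, f, _ = precision_recall_fscore_support(
        labels[test], pred, pos_label=genre_a, average="binary", zero_division=0)
    return p, r, f

# In the real setting gram[i, j] would hold the tree-kernel value between
# documents i and j; here a small random positive semi-definite matrix stands in.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))
gram = X @ X.T
labels = ["fict"] * 4 + ["news"] * 4 + ["ted"] * 4

for a, b in combinations(["fict", "news", "ted"], 2):
    print(a, "vs", b, evaluate_pair(gram, labels, a, b))
```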
Another important point is the superior performance in comparison with the results for the shallow-annotated dataset. Although the classes from that dataset can be roughly mapped onto some of the genres (e.g., meta-level literature texts correspond to the [fict] genre), the distinction is less accurate.

6 Conclusions

We observed that using SVM TK one can differentiate between a broad range of text styles and genres. Each text style and genre has its inherent rhetorical structure, which is leveraged and automatically learned. Since the correlation between text style and text vocabulary is rather low, traditional classification approaches which only take into account keyword statistics can lack accuracy in complex cases.

In this paper we have presented two experiments on style and genre classification. The style experiments were aimed at distinguishing between two types of writing and language usage: description and meta-description. These styles share the same vocabulary, but the rhetorical structure of documents with descriptions and documents with meta-descriptions is fairly different. For the genre classification part we adopted a corpus annotated with 7 different genres and conducted a series of pairwise classifications between genres.

From a mathematical point of view, as part of a future extension of this research, we may conduct one-genre-against-all-other-genres classification, which will allow us to understand how distinctive each genre is. Hence we will obtain a more complete picture of the genre system in general. If every genre is distinctive enough, it means that the whole genre system is well developed and the dimensions are adequate. However, some problems might arise because the corpus is unbalanced: there are different numbers of texts in every genre, and to tackle this problem we will have to balance the corpus.

7 References

1. Lee, D.Y.W.: Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle (2001)
2. Sharoff, S.: In the garden and in the jungle: comparing genres in the BNC and Internet. In: Genres on the Web: Computational Models and Empirical Studies, pp. 149-166. Springer, Berlin/New York (2010)
3. Sharoff, S., Wu, Z., Markert, K.: The Web library of Babel: evaluating genre collections. In: Proceedings of the Seventh Language Resources and Evaluation Conference, LREC 2010, Malta (2010)
4. Forsyth, R., Sharoff, S.: Document dissimilarity within and across languages: a benchmarking study. Literary and Linguistic Computing 29, 6-22 (2014)
5. Mann, W., Matthiessen, C., Thompson, S.: Rhetorical Structure Theory and text analysis. In: Mann, W.C., Thompson, S.A. (eds.) Discourse Description: Diverse Linguistic Analyses of a Fund-Raising Text, pp. 39-78. Amsterdam (1992)
6. Egg, M., Redeker, G.: Underspecified discourse representation. In: Benz, A., Kühnlein, P. (eds.) Constraints in Discourse, pp. 117-138. Benjamins, Amsterdam (2008)
7. Taboada, M.: The genre structure of bulletin board messages. Text Technology 13(2), 55-82 (2004)
8. Biber, D., Kurjian, J.: Towards a taxonomy of web registers and text types: a multidimensional analysis. Language and Computers 59(1), 109-131 (2006)
9. Freund, L., Clarke, C.L.A., Toms, E.G.: Towards genre classification for IR in the workplace. In:
Proceedings of the 1st International Conference on Information Interaction in Context. ACM (2006)
10. Kessler, B., Nunberg, G., Schütze, H.: Automatic detection of text genre. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the Eighth Conference of the European Chapter of the Association for Computational Linguistics. ACL (1997)
11. Santini, M.: Automatic identification of genre in web pages. PhD thesis, University of Brighton (2007)
12. Sarawgi, R., Gajulapalli, K., Choi, Y.: Gender attribution: tracing stylometric evidence beyond topic and genre. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning. ACL (2011)
13. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Text genre detection using common word frequencies. In: Proceedings of the 18th Conference on Computational Linguistics, Volume 2. ACL (2000)
14. Wu, Z., Markert, K., Sharoff, S.: Fine-grained genre classification using structural learning algorithms. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. ACL (2010)
15. Joty, S., Carenini, G., Ng, R., Mehdad, Y.: Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria (2013)
16. Galitsky, B., Ilvovsky, D., Kuznetsov, S.O., Strok, F.: Matching sets of parse trees for answering multi-sentence questions. In: Proceedings of the Recent Advances in Natural Language Processing, RANLP 2013, pp. 285-294. INCOMA Ltd., Shoumen, Bulgaria (2013)
17. Ilvovsky, D.: Going beyond sentences when applying tree kernels. In: Proceedings of the Student Research Workshop, ACL 2014, pp. 56-63 (2014)
18. Galitsky, B., Kuznetsov, S.O.: Learning communicative actions of conflicting human agents. Journal of Experimental & Theoretical Artificial Intelligence 20(4), 277-317 (2008)
19. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
20. Marcu, D.: From discourse structures to text summaries. In: Mani, I., Maybury, M. (eds.) Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, Madrid, pp. 82-88 (1997)
21. Severyn, A., Moschitti, A.: Fast support vector machines for convolution tree kernels. Data Mining and Knowledge Discovery 25, 325-357 (2012)
22. Recasens, M., de Marneffe, M.-C., Potts, C.: The life and death of discourse entities: identifying singleton mentions. In: Proceedings of NAACL 2013 (2013)
23. Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., Jurafsky, D.: Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics 39(4) (2013)
24. Collins, M., Duffy, N.: Convolution kernels for natural language. In: Proceedings of NIPS, pp. 625-632 (2002)
25. Moschitti, A.: Efficient convolution kernels for dependency and constituent syntactic trees. In: Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany (2006)
26. Sharoff, S.: Functional Text Dimensions for annotation of Web corpora. http://corpus.leeds.ac.uk/serge/publications/2015-corpora-submission.pdf (2015)
27. Cumby, C., Roth, D.: On kernel methods for relational learning. In: ICML, pp. 107-114 (2003)
28. Galitsky, B., Ilvovsky, D., Kuznetsov, S.O.:
Text classification into abstract classes based on discourse structure. In: Proceedings of the Recent Advances in Natural Language Processing, RANLP 2015, Hissar, pp. 201-207 (2015)
29. Kanaris, I., Stamatatos, E.: Learning to recognize webpage genres. Information Processing and Management 45(5), 499-512. Elsevier (2009)