Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax

Andrea Cimino, Martijn Wieling•, Felice Dell’Orletta, Simonetta Montemagni, Giulia Venturi

• University of Groningen, The Netherlands
m.b.wieling@rug.nl

Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{name.surname}@ilc.cnr.it

Abstract

English. The paper investigates the impact and role of different feature types for the specific task of Automatic Genre Classification, with the final aim of identifying the most predictive ones. The goal was pursued by carrying out incremental feature selection through Grafting using different sets of linguistic features. The results achieved in discriminating among four traditional textual genres show the key role played by syntactic features, whose impact turned out to vary across genres.

Italiano. L’articolo intende indagare il ruolo svolto da diversi tipi di caratteristiche linguistiche nella classificazione automatica del genere testuale al fine di identificare le più efficaci e rilevanti. A questo scopo è stata messa a punto una metodologia basata su un processo incrementale di selezione realizzato mediante un algoritmo di Grafting usando diversi tipi di caratteristiche. I risultati raggiunti mostrano il ruolo chiave delle caratteristiche sintattiche, il cui impatto varia in modo significativo tra generi diversi.

1 Introduction

Automatic classification of textual genres has always received significant attention from both theoretical and application perspectives. On the one hand, it has been considered relevant by linguists and educators for teaching students the correct way of writing in specific communicative scenarios (Biber, 1995; Lee, 2001). On the other hand, the classification of textual genres is seen as a way to cope with the well-known problem of information overload: exploiting information about document genre can help to develop more accurate Information Retrieval tools. Genre identification has been considered a key factor for reducing irrelevant search engine results, as users would be able to specify the desired textual genre along with the keywords expressing the content they are looking for (Santini, 2004; Lim, 2004; Santini, 2007). In fact, document genre and document content represent orthogonal dimensions of classification (Finn, 2003).

A variety of approaches to Automatic Genre Classification (AGC) have been proposed so far, differing in the genres and in the typology of features considered. Given the widely acknowledged fact that no established classification of genres exists (see e.g. Sharoff (2010) or Biber (2009)), previous studies focused on ‘traditional genres’ such as journalism, handbooks and academic prose, see among others (Kessler, 1997; Stamatatos, 2001; Fang, 2010), and on ‘web genres’, i.e. genres of web pages, see e.g. (Santini, 2004; Lim, 2004; Mehler, 2010).

Despite the great interest in investigating which linguistic features qualify a text genre (Biber, 2009; Fang, 2015), so far little effort has been devoted to using sophisticated NLP techniques, such as syntactic parsing, to capture complex linguistic features for the automatic classification of textual genres. Unlike other application scenarios where the form (the style) of a document is investigated, such as Authorship Attribution (Cranenburgh, 2012), Readability Assessment (Collins, 2014) and Native Language Identification (Tetreault, 2013), the AGC approaches proposed so far mainly focus on word-level linguistic features, in particular the distribution of function words, word frequency, n-gram models of both characters and Parts-Of-Speech (Santini, 2004; Crossley, 2007; Mehler, 2010), or finer-grained Part-Of-Speech tags including morpho-syntactic features such as verb tense (Fang, 2010). Very few studies rely on features extracted from syntactically annotated texts. One exception is Stamatatos (2001), who combines lexical features (i.e. word frequency) with features extracted from the output of a chunk boundary detector (e.g. the distribution of noun, verbal and adjectival phrases, or the average number of words included in verbal phrases). Similar structural features have also been used by Lim (2004), who combined web-specific features (e.g. HTML tags) with lexical information and features aiming at capturing the syntactic structure of a sentence, e.g. the distribution of declarative and imperative sentences, syntactic ambiguities, etc.

In this paper, we tackle the AGC task for traditional genres (namely literary, scientific, educational and journalistic texts) by using different types of linguistic features, i.e. lexical, morpho-syntactic and syntactic. In particular, the following research questions are addressed: i) which are the most effective features to classify a textual genre, and ii) whether and to what extent the features identified as most effective remain the same across different genres. These questions have been addressed by carrying out incremental feature selection with the final aim of identifying the most predictive features. So far, studies on the best set of features to classify textual genres have been carried out mainly on English. In this paper, this issue is investigated for a typologically different language, Italian.

2 Model training and feature ranking

In order to identify and rank the most important features playing a role in genre classification, we used GRAFTING (Perkins, 2003). This approach allows us to train a maximum entropy model and perform incremental feature selection at the same time. Grafting uses a gradient-based heuristic to select the most promising feature (which is added to the set of selected features S), and subsequently performs a full weight optimization over all features in S. This process is repeated until a stopping condition is reached. The stopping condition integrates l1 regularization in the grafting approach: a feature is included (with a non-zero weight) only if the l1 penalty is outweighed by the reduction of the objective function. Consequently, overfitting is prevented by excluding noisy features, or those that change value infrequently. In our case, the l1 penalty was selected by evaluating maximum entropy models (using 10-fold cross-validation) with varying l1 values (range: 1e-11, 1e-10, ..., 0.1, 1).

For selecting the features and estimating their weights, we used TINYEST¹, a grafting-capable maximum entropy parameter estimator for ranking tasks (De Kok, 2011; De Kok, 2013). Even though our task is not a ranking task, the estimator can be used for binary classification by assigning a high score (1) to the correct class and a low score (0) to the incorrect class. A similar approach was followed by Dell’Orletta (2014) for discriminating between easy-to-read and difficult-to-read sentences. As the focus of the present study is on the classification of texts belonging to different traditional genres, we created four separate binary classifiers which were trained to distinguish Literature texts from non-Literature texts (i.e. texts of the three remaining genres), Educational texts from non-Educational texts, etc. A text was assigned the class of the classifier which returned the highest score.

¹ http://github.com/danieldk/tinyest

3 Typology of Features

Various types of features have been proposed in the literature for the automatic classification of text genres. Following Stamatatos (2001) and Lim (2004), we combine token-based and structural features. Token-based features were extracted from the list of the most frequent lemmata in the training corpus and represented in terms of the relative frequency of each lemma in each document. Structural features were extracted from the corpora after they had been morpho-syntactically tagged by the POS tagger described in (Dell’Orletta, 2009) and dependency-parsed by the DeSR parser using a Multi-Layer Perceptron (Attardi, 2009). As shown in Table 1, they range across different levels of linguistic description (lexical, morpho-syntactic and syntactic), for a total of 90 features that have proved to be informative “fingerprints” of the form of a text with respect to e.g. genre, style, authorship or readability.

Typology           Feature
Raw Text           Sentence and token length
Lexical            Rate of words in the Basic Italian Vocabulary, Type/Token ratio
Morpho-syntactic   Part-Of-Speech unigrams, Lexical density, Verbal mood
Syntactic          Dependency type unigrams, Parse tree depth features, Arity of verbal predicates, Distribution of subordinate vs main clauses, Length of dependency links

Table 1: Typology of features automatically extracted from linguistically annotated texts.

4 Experiments

4.1 Experimental Setup

We used an Italian corpus including documents representative of four different genres: educational material (Dell’Orletta, 2011), newspaper articles (Marinelli, 2003), literary texts (Marinelli, 2003) and scientific papers (Dell’Orletta, 2014). The whole corpus was split into a training set (136 documents for the Education genre, 579 for the Journalism genre, 365 for the Literature genre and 317 for the Scientific genre) and a held-out test set (60 documents for each genre).

To assess the influence of including structural features over simply using the most frequent words (lemmata), we used two sets of features (each consisting of about 200 features). The first set (taken as the baseline) corresponds to the relative frequency of the 200 most frequent words (henceforth referred to as the tw200 set).² The second set combines token-based and structural features: in addition to the relative frequency of the 100 most frequent words, it contains the 90 structural features detailed in Table 1 (this set is henceforth referred to as the lingtw set). To guarantee comparability of values, the values of each feature were scaled between 0 and 1 on the basis of the data from the training set. If a (non-scaled) feature value in the held-out test set exceeded the maximum non-scaled value of that feature in the training set, it was set to the maximum value (1).

² In our preliminary analyses, we also assessed the effect of including the most frequent bigrams as features. However, as the performance was similar to only using unigrams, we did not include bigrams as features.

The feature ranking for each genre was obtained using grafting on the full training data set. The performance of the algorithm (i.e. the percentage of correctly classified documents) was evaluated for an increasing number of features (from including only the first (best) feature for each genre to including all features for each genre) against both a 10-fold cross-validation test set and a held-out test set.

The 10-fold cross-validation procedure was performed on the training set (i.e. the feature weights were determined on the basis of 90% of the training data, whereas the performance was evaluated on the remaining 10% of the training data; this procedure was repeated 10 times). As stated before, each document in the test set was assigned the genre whose binary classification model (in this case with the same number of features) yielded the highest score.

The classification accuracy was assessed with respect to the held-out test set for different numbers of features: i) the number of features associated with the best performance on the cross-validation set, and ii) the lowest number of features such that the performance dropped when a new feature was added (i.e. performance kept increasing for each additional feature up to the selected number of features).

4.2 Replication

The results reported below can be replicated by downloading the Docker image italianlp-wieling/dockergenreclassification, which contains all data and scripts necessary for the feature extraction and the grafting procedure, as well as all results. The Docker file including all commands to set up the virtual machine can be found at https://github.com/italianlp-wieling/dockergenreclassification.

5 Results

5.1 Genre Classification Results

Figures 1(a) and 1(b) report the classification results using the lingtw vs tw200 feature sets: it can be clearly observed that the inclusion of structural features is highly beneficial. With only 10 features, the 10-fold cross-validation performance is 83.89% for the lingtw set, whereas it is only 71.51% for the tw200 set. The optimal performance for the lingtw set is reached with 106 features (89.72%), whereas a considerably lower performance is reached using the tw200 set (84.17%) despite the much higher number of features used (179). The performance on the held-out test set turned out to be slightly lower: 79.16% for 10 features using the lingtw set, against 59.16% with the tw200 set. The optimal performance for the lingtw set is reached with 80 features (86.66%), whereas for the tw200 set a lower performance (72.08%) is obtained using 133 features.

[Figure 1, panel (a): Held-out test set. Panel (b): 10-fold cross-validation test set.]

Figure 1: Genre classification results using a held-out test set (a), and a 10-fold cross-validation procedure (b).

5.2 Feature Ranking Results

In order to investigate the typology of linguistic features contributing most to AGC, we focused on the lingtw set. In particular, we carried out an in-depth analysis of the grafting-based feature ranking resulting from the classification of the held-out test set. Ranked features were categorized into five classes: syntactic, morpho-syntactic, lexical, raw-text and token-based features. Figure 2 provides a genre-independent view, reporting the average percentage distribution (across genres) of the different feature types within the first 10, 50 and 80 ranked features. As shown, syntactic features play the most relevant role. They cover 45% and 42% of the first 10 and 50 features respectively, and remain the most predictive ones also when 80 features are considered (representing 39.38% of the set). On the other hand, the share of token-based features increases as a larger number of ranked features is considered (they cover 2.5%, 14.5% and 24.06% of the 10, 50 and 80 feature sets respectively).

Figure 2: Genre-independent average distribution of different feature types in the top 10, 50 and 80 ranked sets.

              First 10 features      First 50 features      First 80 features
Genre         S   P   L   R   W      S   P   L   R   W      S      P      L     R     W
Journalism    30  40  30  0   0      40  38  10  0   12     38.75  28.75  6.25  1.25  25
Literature    40  30  20  0   10     34  28  8   2   28     33.75  26.25  6.25  1.25  32.50
Education     50  20  10  20  0      44  32  6   4   14     42.5   30     5     2.5   20
Science       60  10  20  10  0      50  32  10  4   4      42.5   30     6.25  2.5   18.75

Table 2: Percentage distribution of different typologies of ranked syntactic (S), morpho-syntactic (P), lexical (L), raw-text (R) and token-based (W) features selected via GRAFTING on the held-out test set.

Consider now the distribution of the different types of features across genres reported in Table 2: notable differences can be observed. In particular, Literature and Scientific prose represent two opposite poles. Token-based features (W) are more predictive for literary texts than for the other genres (they represent 10%, 28% and 32.50% of the top 10, 50 and 80 features respectively). On the contrary, syntactic features (S) play a more important role for Scientific prose than for the other genres (covering respectively 60%, 50% and 42.50% of the top 10, 50 and 80 features).

                                     Journalism  Literature  Education  Science
Sentence length                      –           82          6          32
Word length                          56          50          3          3
Type/Token Ratio (forms)             3           95          94         4
Parse tree depth                     11          3           37         94
Maximum length of dependency links   42          24          48         56
Post-verbal subject                  16          22          90         76
Pre-verbal object                    31          21          47         42
Passive subject                      7           53          17         5

Table 3: Different ranking positions of a selection of features across genres. Features which were not selected during ranking have no specified rank in the table.

Let us now focus on the role played by individual features across genres. Table 3 reports the rank positions associated with a selection of features in the classification of the four genres. Raw text features (i.e. sentence and word length) turned out to play a key role in the classification of educational materials (Education) compared with the other genres (e.g. Literature). A feature capturing the lexical richness of texts such as Type/Token Ratio (TTR), which refers to the ratio between the number of lexical types and the number of tokens (considered as single forms) within a text, is similarly ranked for Journalism and Science, while it plays a less relevant role in the classification of educational material and literary texts. Moving to syntax, it should be noted that two features characterizing the overall sentence structure, i.e. the depth of the whole parse tree (calculated in terms of the longest path from the root of the dependency tree to some leaf) and the maximum length of dependency links (calculated in terms of the words occurring between the syntactic head and the dependent), play a key role in the classification of the Literature and Journalism genres. For the latter, it is interesting to contrast the high rank associated with the parse tree depth feature with the irrelevant role played by sentence length (typically taken as a proxy of the underlying grammatical structure): this suggests that syntactic features are more effective in discriminating genres. Other features which turned out to play a relevant role in AGC are concerned with the relative ordering of subject and object with respect to the verbal head: their non-canonical orders, i.e. post-verbal subject and pre-verbal object, play a key role in the classification of the Literature and Journalism genres. On the contrary, the use of passive voice (inferred from the presence of passive subjects) is less relevant for the classification of Literature, whereas it is highly ranked in the characterization of scientific writing and newspaper articles.

6 Conclusion

In this paper we investigated the impact and role of different feature types for Automatic Genre Classification. The goal was pursued by carrying out incremental feature selection through Grafting, as implemented in TinyEst. Two sets of features were taken into account, token-based and structure-based. The results show the key role played by syntactic features, a result which is new with respect to the AGC literature. Another original contribution concerns the role of the different feature types, which turned out to vary across textual genres, suggesting the specialization of features in binary genre classification tasks (e.g. Literature vs. other genres). The features contributing to AGC for Italian are possibly influenced by the language dealt with. Although it is widely acknowledged that linguistic variation across genres is a language universal, the question is whether similar linguistic features are expected to play a similar role across languages. While this might be the case for features such as TTR, use of passive voice, tenses or pronouns, features concerned with the ordering of sentence constituents or the overall sentence structure (e.g. parse tree depth or dependency length) may be distinctive to a specific language or language family. Further directions of research thus include the comparison of results in a multilingual perspective as well as across a wider variety of genres.

Acknowledgements

The reported research started in the framework of a Short Term Mobility program of international exchanges funded by CNR, and continued within the project “Smart News, Social sensing for breaking news”, funded by the Tuscany Region under the FAR-FAS 2014 program.

References

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi and Joseph Turian. 2009. Accurate Dependency Parsing with a Stacked Multilayer Perceptron. Proceedings of Evalita 2009.

Douglas Biber. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press, Cambridge, UK.

Douglas Biber and Susan Conrad. 2009. Genre, Register, Style. Cambridge University Press.

Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. Proceedings of the ACL Workshop on Computational Linguistics for Literature, 59–63.

Kevin Collins-Thompson. 2014. Computational assessment of text readability: a survey of current and future research. Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics, (165–2), 97–135.

Scott A. Crossley and Max Louwerse. 2007. Multi-dimensional register classification using bigrams. International Journal of Corpus Linguistics, (12–4), 453–478.

Daniël de Kok. 2011. Discriminative features in reversible stochastic attribute-value grammars. Proceedings of the EMNLP Workshop on Language Generation and Evaluation, 54–63.

Daniël de Kok. 2013. Reversible Stochastic Attribute-Value Grammars. Rijksuniversiteit Groningen.

Felice Dell’Orletta. 2009. Ensemble system for Part-of-Speech tagging. Proceedings of Evalita’09, Evaluation of NLP and Speech Tools for Italian.

Felice Dell’Orletta, Simonetta Montemagni, Eva Maria Vecchi and Giulia Venturi. 2011. Tecnologie linguistico-computazionali per il monitoraggio della competenza linguistica italiana degli alunni stranieri nella scuola primaria e secondaria. Percorsi migranti: uomini, diritto, lavoro, linguaggi, 319–366.

Felice Dell’Orletta, Simonetta Montemagni and Giulia Venturi. 2014. Assessing document and sentence readability in less resourced languages and across textual genres. International Journal of Applied Linguistics, 165:2, 319–366.

Felice Dell’Orletta, Martijn Wieling, Andrea Cimino, Giulia Venturi and Simonetta Montemagni. 2014. Assessing the Readability of Sentences: Which Corpora and Features? Proceedings of 9th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2014), 163–173.

Alex Chengyu Fang and Jing Cao. 2010. Enhanced Genre Classification through Linguistically Fine-Grained POS Tag. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), 223–232.

Chengyu Alex Fang and Jing Cao. 2015. Text Genres and Registers: The Computation of Linguistic Features. Springer.

Aidan Finn and Nicholas Kushmerick. 2003. Learning to classify documents according to genre. Proceedings of the IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, 1–26.

Brett Kessler, Geoffrey Nunberg and Hinrich Schütze. 1997. Automatic detection of text genre. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of ACL (ACL/EACL’97), 223–232.

David Lee. 2001. Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, (3), 37–72.

Chul Lim, Kong Lee and Gil Kim. 2004. Multiple sets of features for automatic genre classification of web documents. Information processing and management, (41), 1263–1276.

R. Marinelli, L. Biagini, R. Bindi, S. Goggi, M. Monachini, P. Orsolini, E. Picchi, S. Rossi, N. Calzolari and A. Zampolli. 2003. The Italian PAROLE corpus: an overview. Computational Linguistics in Pisa, Special Issue, XVI-XVII, 401–421.

Alexander Mehler, Serge Sharoff and Marina Santini (Eds.). 2010. Genres on the Web. Springer Series - Text, Speech and Language Technology.

Simon Perkins, Kevin Lacker and James Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, (3), 1333–1356.

Marina Santini. 2004. Identification of Genres on the Web: a Multi-Faceted Approach. Proceedings of ECIR 2004 (26th European Conference on IR Research), University of Sunderland (UK), 1–8.

Marina Santini. 2007. Enhanced Genre Classification through Linguistically Fine-Grained POS Tag. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), 223–232.

Serge Sharoff. 2010. In the garden and in the jungle: Comparing genres in the BNC and internet. In Mehler (2010), 149–166.

Efstathios Stamatatos, Nikos Fakotakis and George Kokkinakis. 2001. Automatic text categorization in terms of genre and author. Computational Linguistics, (26), 471–495.

Joel Tetreault, Daniel Blanchard and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications, 48–57.