<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Genre Classification Based on Linguistic Complexity Contours Using a Recurrent Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcus Ströbel</string-name>
          <email>marcus.stroebel@ifaar.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elma Kerz</string-name>
          <email>elma.kerz@ifaar.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Wiechmann</string-name>
          <email>d.wiechmann@uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Qiao</string-name>
          <email>yu.qiao@rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RWTH Aachen University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>56</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>Over the last few years, there has been increased interest in the combined use of natural language processing techniques and machine learning algorithms to automatically classify texts on the basis of a wide range of features. One class of features that has been successfully employed for a wide range of classification tasks, including native language identification, readability assessment and text genre categorization, pertains to the construct of 'linguistic complexity'. This paper presents a novel approach to the use of linguistic complexity features in text categorization: Rather than representing text complexity 'globally' in terms of summary statistics, this approach assesses text complexity 'locally' and captures the progression of complexity within a text as a sequence of complexity scores, generating what is referred to here as 'complexity contours'. We demonstrate the utility of the approach in an automatic text classification task for five genres - academic, newspaper, fiction, magazine and spoken - of the Corpus of Contemporary American English (COCA) [Davies, 2008] using a recurrent neural network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Recent years have witnessed a growing interest in the
combined use of natural language processing (NLP) techniques
and machine learning algorithms to investigate the
formal features of a text, rather than its content. This type
of approach has been successfully applied to a range of
automatic text categorization tasks, including author recognition
and verification [Van Halteren, 2004], native language
identification, e.g. [Malmasi et al., 2017; Crossley and McNamara,
2012; Kyle et al., 2015], readability assessment [François
and Miltsakaki, 2012] and text genre identification [Xu et
al., 2017]. One class of features that has been successfully
employed in text classification research pertains to the
multidimensional construct of ‘linguistic complexity’, which cuts
across multiple levels of linguistic representation. Linguistic
complexity is commonly defined as “the range of forms
that surface in language production and the degree of
sophistication of such forms” [Ortega, 2003, p. 492]. This construct
has been operationalized in terms of a number of measures
that tap into different levels of linguistic analysis (e.g.
lexical features, such as type-token ratio, or syntactic features,
such as complex nominals per clause) and that require
different NLP preprocessing steps, from tokenization to
syntactic parsing (see Section 2). While previous text classification
studies have combined information from multiple measures
so as to cover different levels of analysis, these studies used
as input for their classifiers scores representing the average
complexity of a text. However, the use of such aggregate
scores obscures the considerable degree of variation of
complexity within a text.</p>
      <p>In this paper we present a novel approach to the use of
linguistic complexity features in the area of text classification.
To this end, we employ a computational tool that implements
a sliding-window technique to track the progression of
complexity within a text, allowing for a ‘local’ - rather than a
‘global’ - assessment of the complexity of a text. More precisely,
we demonstrate the utility of the approach in a text genre
classification task. Text genre detection is a typical classificatory
task in computational stylistics that concerns “the
identification of the kind (or functional style) of the text” [Stamatatos
et al., 2000, p. 472]. Although definitions of ‘genre’ remain
elusive, in the broadest sense it can be used to refer to
“language use in a conventionalized communicative setting in
order to give expression to a specific set of communicative goals
of a disciplinary or social institution, which give rise to stable
structural forms by imposing constraints on the use of
lexicogrammatical as well as discoursal resources” [Bhatia, 2004,
p. 23]. In this study, we start from the assumption that these
constraints are reflected in the degree of linguistic complexity
of a text. We then proceed to show that genres are distinguished
not only by their average complexity but also by the
distribution of linguistic complexity within a text.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Our approach: Measuring linguistic complexity using a sliding-window technique</title>
      <p>The distribution of linguistic complexity within a text was
measured using the Complexity Contour Generator (CoCoGen), a
computational tool that implements a sliding-window technique to
generate a series of measurements for a given complexity measure
(CM), allowing for a ‘local assessment’ of complexity within a text
[Ströbel, 2014; Ströbel et al., 2016]. The approach implemented in
CoCoGen stands in contrast to the standard approach, which represents
text complexity as a single score and thus provides only a ‘global
assessment’ of the complexity of a text. A sliding window can be
conceived of as a window of a certain size ws, defined by the number
of sentences it contains. The window is moved across a text
sentence by sentence, computing one value per window for a given CM.
For a text comprising n sentences, there are w = n − ws + 1 windows.
Given the constraint that there has to be at least one window, a
text has to comprise at least as many sentences as the window is
wide (n ≥ ws). To compute the complexity score of a given window m
(w(m)), a measurement function is called for each sentence in the
window and returns a fraction (wnm/wdm). The numerators and
denominators of the fractions from the first to the last sentence in
the window are then added up to form the numerator and denominator
of the resulting complexity score of the window (see Figure 1).</p>
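      <p>The windowed computation described above can be sketched as
follows. This is a minimal illustration, not CoCoGen itself: the
per-sentence (numerator, denominator) pairs are assumed to come from
a measurement function for a given CM (e.g. characters per word for
a word-length measure).</p>
      <preformat>
```python
def complexity_contour(fractions, ws=3):
    """Compute one complexity score per sliding window.

    fractions: one (wn, wd) numerator/denominator pair per sentence,
    as returned by a measurement function for a given CM.
    ws: window size in sentences; the text must satisfy n >= ws.
    """
    n = len(fractions)
    if n < ws:
        raise ValueError("text must comprise at least ws sentences")
    contour = []
    # the window is moved sentence by sentence: w = n - ws + 1 windows
    for m in range(n - ws + 1):
        window = fractions[m:m + ws]
        num = sum(wn for wn, _ in window)  # add up the numerators
        den = sum(wd for _, wd in window)  # add up the denominators
        contour.append(num / den)
    return contour
```
      </preformat>
      <p>For the ten-sentence text of Figure 1 with ws = 3, this yields
the eight scores w0, . . . , w7.</p>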
      <p>Figure 1: Schematic illustration of how complexity measurements
are obtained in CoCoGen for a text comprising ten sentences with a
window size of ws = 3.</p>
      <p>The series of measurements generated by CoCoGen captures the
progression of linguistic complexity within a text for a given CM
and is referred to here as a ‘complexity contour’. As texts vary in
length, their complexity contours cannot be directly compared. To
permit comparisons of such contours, CoCoGen features a scaling
algorithm that divides each text into a user-defined number of
approximately same-sized partitions, termed here ‘scaled windows’
(see Figure 2).</p>
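      <p>One plausible reading of the scaling step, assuming
non-overlapping partitions of roughly equal size (the exact
partitioning scheme in CoCoGen may differ):</p>
      <preformat>
```python
def scaled_windows(fractions, sw=3):
    """Collapse per-sentence (wn, wd) pairs into sw roughly equal-sized
    partitions so that contours of texts of different lengths become
    comparable position by position."""
    n = len(fractions)
    scores = []
    for k in range(sw):
        # sentence span of partition k, boundaries rounded to integers
        start, end = k * n // sw, (k + 1) * n // sw
        part = fractions[start:end]
        scores.append(sum(wn for wn, _ in part) / sum(wd for _, wd in part))
    return scores
```
      </preformat>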
      <p>Figure 2: Illustration of the scaled-window measurements
obtained by CoCoGen for a text comprising ten sentences with the
number of scaled windows set to 3.</p>
      <p>In its current version, CoCoGen supports 32 measures of
linguistic complexity. Importantly, CoCoGen was designed with
extensibility in mind, so that additional complexity
measures can easily be added. It uses an abstract measure class for
this purpose.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Experiment</title>
      <sec id="sec-3-1">
        <title>3.1 Datasets</title>
        <p>The corpus data from the present study come from the
Corpus of Contemporary American English
(COCA) [Davies, 2008]. COCA is a balanced corpus of
American English containing more than 560 million words
of text (20 million words each year 1990-2017) equally
divided among five general genres: spoken, fiction, popular
magazines, newspapers, and academic texts.1 For the</p>
        <p>1The selection of the COCA over the British National
Corpus (BNC) is motivated by two main reasons: (1) the COCA
covers the time span from 1990 to 2012, whereas the BNC covers
the time span from the 1980s to 1993 (i.e. the most recent texts in
the BNC are from the early 1990s, more than twenty years ago),
making the COCA more representative of contemporary English, and
(2) the COCA (450 million words) is more than four times as large
as the BNC (100 million words).</p>
        <p>purposes of this study, we used a balanced subsample of
the corpus comprising 10,500 texts obtained by random
sampling of 2,100 texts from each of these five genres.</p>
        <p>Complexity contours were obtained using CoCoGen with
a window size of 10 sentences over all 10,500 texts. For
each text, we extracted a feature sequence, which consists
of a series of n − 10 + 1 32-dimensional feature
vectors generated at each window position, where n is the
number of sentences in a text. After normalization and
padding of the feature sequences, the data were divided
into a balanced training set of 10,000 feature sequences
(2,000 texts per genre) and a balanced test set of 500 feature
sequences (100 texts per genre). To determine to what extent
the performance of the classification model is driven by the
sequence information, i.e. by the complexity contour, rather
than by average text complexity, we also created a comparison
dataset in which we collapsed each unnormalized feature
sequence to its mean vector, so as to retain only the global
feature information, and then normalized these data.</p>
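        <p>The construction of the comparison dataset can be sketched
as follows; the function and variable names are illustrative, and
numpy stands in for whatever pipeline was actually used.</p>
        <preformat>
```python
import numpy as np

def collapse_to_mean(sequences):
    """Collapse each (length_i, 32) feature sequence to its mean
    vector, keeping only the global (average) complexity per CM and
    discarding the contour information."""
    return np.stack([seq.mean(axis=0) for seq in sequences])

def pad_sequences(sequences, max_len):
    """Zero-pad variable-length feature sequences to a common length
    so they can be batched for the RNN."""
    dim = sequences[0].shape[1]
    out = np.zeros((len(sequences), max_len, dim))
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = seq
    return out
```
        </preformat>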
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Model</title>
        <p>We used a Recurrent Neural Network classifier adopting the
model specification described in [Hafner, 2018]. This model
was used because (1) it is a dynamic RNN model that can
handle sequences of variable length2, (2) it uses Gated Recurrent
Unit (GRU) cells, which have been shown to yield better
performance on smaller datasets [Chung et al., 2014], and (3) it is a
simple model.</p>
        <p>Assume an input sequence X =
(x1, x2, . . . , xl, xl+1, . . . , xn), where each xi is a
32-dimensional vector, l is the length of the sequence, n ∈ Z
is greater than or equal to the length of the longest sequence in
the dataset, and xl+1, . . . , xn are padded 0-vectors. As shown in
Figure 3, this model consists only of GRU cells with 200 hidden
units. To predict the classification, softmax was applied to the
output of a fully-connected layer, in which the output of the last
GRU cell, i.e. the one whose input is xl, is transformed from a
200-dimensional vector to a 5-dimensional vector. In order to make
our comparison to the average-complexity approach as fair as
possible, we reused the above model. However, rather than training
it on sequences, it was provided only with vectors of average
complexities, i.e. the roll-out of the model consists of only one
GRU cell.</p>
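        <p>The crucial detail for variable-length input, following the
idea in [Hafner, 2018], is that the dense layer and softmax are
applied to the GRU output at the last real (non-padded) time step.
A minimal numpy sketch of that selection and classification head
(the GRU itself is elided; all names are illustrative):</p>
        <preformat>
```python
import numpy as np

def last_relevant(outputs, lengths):
    """Pick, for each sequence, the GRU output at its last non-padded
    step.  outputs: (batch, max_len, hidden); lengths: (batch,)."""
    return outputs[np.arange(outputs.shape[0]), lengths - 1]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(outputs, lengths, W, b):
    """Dense layer (hidden -> 5 genres) plus softmax applied to the
    last relevant GRU output (hidden = 200 in the paper's model)."""
    return softmax(last_relevant(outputs, lengths) @ W + b)
```
        </preformat>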
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Training</title>
        <p>As a loss function, cross entropy was used.</p>
        <p>L = − Σ_{i=1}^{5} yi log(ŷi)</p>
        <p>where [y1, y2, . . . , y5] is the label of the sequence and
[ŷ1, ŷ2, . . . , ŷ5] is the prediction of the model.</p>
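        <p>Numerically, with a one-hot label the loss reduces to the
negative log-probability assigned to the true genre:</p>
        <preformat>
```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) over the 5 genre classes; eps
    guards against log(0)."""
    return -np.sum(y * np.log(y_hat + eps))
```
        </preformat>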
        <p>The mini-batch size was set to 100. For optimization,
we compared Nesterov accelerated gradient (NAG), Adadelta
and RMSprop, and finally decided on NAG with a learning
rate of η = 0.01 and momentum γ = 0.9, for which we achieved
the lowest error rate.</p>
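        <p>A single NAG update with the settings above can be written
as follows (a textbook formulation; the gradient is evaluated at the
look-ahead point):</p>
        <preformat>
```python
def nag_step(w, v, grad_fn, lr=0.01, momentum=0.9):
    """One Nesterov accelerated gradient step with eta = 0.01 and
    gamma = 0.9: evaluate the gradient at w - gamma * v, update the
    velocity, then take the step."""
    v_new = momentum * v + lr * grad_fn(w - momentum * v)
    return w - v_new, v_new
```
        </preformat>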
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Results and Discussion</title>
        <p>Before turning to the results of the classification
experiment, we first present the results of the CoCoGen
analysis to illustrate the variation in text complexity both at
the level of individual texts and at the level of text
genres. Note that for this illustration we used the scaled
CoCoGen output with 100 scaled windows. Figure 4
visualizes the progression of complexity within a single
text from the genre of academic writing for two selected
measures of complexity (corrected type-token ratio and
Dependent Clauses per T-Unit).</p>
        <p>Figure 3: Roll-out of the RNN model.</p>
        <p>2The lengths of the feature sequences depend on the number of
sentences of the texts in our corpus.</p>
        <p>Figure 4: Progression of complexity within a single text
from the genre of academic writing for two measures of complexity
(corrected type-token ratio and Dependent Clauses per T-Unit). The
x-axis shows text position in scaled windows (0-100).</p>
        <p>
As shown in Figure 4, complexity is not evenly distributed
within a text but progresses through a sequence of peaks and
troughs for both CMs. Furthermore, Figure 4 suggests that
there is an interaction between the two measures such that
higher complexity in CTTR in the beginning of the text
(windows 1-30) appears to be compensated for by lower
complexity in DCPerTUnit, whereas the reverse is true for the middle
part of the text (windows 40-60).</p>
        <p>To determine to what extent text complexity varies among
the genres and to see if there are tendencies for genre-specific
complexity contours, we aggregated the complexity scores of
all texts from a given genre at each of the 100 scaled window
positions. Figure 6 presents an overview of the resulting
average complexity contours for the five genres compared across
all CMs.</p>
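        <p>The aggregation step amounts to averaging, per genre and CM,
the scaled contours of all texts position by position; a short
numpy sketch (names are illustrative):</p>
        <preformat>
```python
import numpy as np

def average_contour(scaled_contours):
    """Average complexity scores across all texts of one genre at each
    of the 100 scaled-window positions.
    scaled_contours: (num_texts, 100) array for one genre and one CM."""
    return scaled_contours.mean(axis=0)
```
        </preformat>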
        <p>
          Figure 6 shows that academic writing is the most
complex of the five genres investigated in this study
with respect to the majority of CMs (19 out of 32
CMs). It is particularly more complex with
regard to the CMs Complex Nominals per Clause,
Coordinate Phrases per Clause and
Noun Phrase Post-modifications per Clause,
but not with regard to Dependent Clauses per T-Unit
or Verb Phrases per T-Unit. Complexity scores of
these latter two CMs are highest in spoken conversation.
These results are consistent with findings reported in
corpus-based studies demonstrating that academic writing is
characterized by a ‘compressed’ discourse style with phrasal
(non-clausal) modifiers embedded in noun phrases, whereas
spoken discourse is more structurally elaborated with
multiple levels of clausal embedding [Biber and Gray, 2010;
Biber et al., 2011]. Figure 6 furthermore demonstrates that,
while the averaged contours are less ‘wiggly’ than those of
individual texts, they are typically not uniform and often
nonlinear. For example, for some CMs and genres, e.g.
Coordinate Phrases per Clause in academic writing, the
distribution is U-shaped, such that the beginning and end of
a text are much more complex on average than its middle
part. Overall, this pattern of results strongly suggests that
the complexity scores are not randomly distributed across
the texts of a given genre. We now turn to the results of our
classification experiment. Classification results of previous
studies on text genre identification range from relatively
low accuracies of 52% to 80%, with results above 90%
reported in some cases, depending on the number and type
of genres being considered, the size and difficulty of the data, etc.
          <xref ref-type="bibr" rid="ref13 ref20 ref21 ref23 ref28 ref29 ref4 ref7 ref8">(see, e.g., [Kessler et al., 1997; Dewdney et al., 2001;
Dell'Orletta et al., 2013; Passonneau et al., 2014;
Yogatama et al., 2017])</xref>
          . The performance of our RNN
classifiers over 60 training epochs is presented in Figure 5.
Figure 5 indicates that the sequence-based RNN displays
consistently lower error rates than the average-based RNN:
The average-based RNN reached a maximal performance
of 91.2% at epoch 16, with a mean performance of 90%
accuracy in the surrounding epochs (epochs 10-20). After
that, performance starts to decrease, indicating overfitting.
The sequence-based RNN reached a maximal accuracy of
92.8% after 10 epochs and converged on a robust average
performance of 91.5% after around 30 epochs. These results
suggest the utility of the sequence information for the task of
genre identification.
        </p>
        <p>The main goal of the paper was to showcase a novel approach
to the use of linguistic complexity features for purposes of
automatic text classification. Using the task of text genre
classification as a test case, we showed that both individual texts
and text genres are characterized by considerable variation
in within-text complexity, as captured by ‘complexity
contours’, i.e. the series of measurements generated by CoCoGen,
which implements a sliding-window technique. The results of
a 5-class text genre classification experiment demonstrated
that the inclusion of these contours further increased the high
performance (90%) of a GRU-RNN classifier trained on
average text complexity scores. In future studies we intend to
explore the utility of our approach in other text classification
tasks.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Bhatia</source>
          , 2004]
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Bhatia</surname>
          </string-name>
          .
          <article-title>Worlds of written discourse: A genre-based view</article-title>
          . A&amp;
          <string-name>
            <surname>C Black</surname>
          </string-name>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Biber and Gray</source>
          , 2010]
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Biber</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bethany</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>Challenging stereotypes about academic writing: Complexity, elaboration, explicitness</article-title>
          .
          <source>Journal of English for Academic Purposes</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Biber et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Biber</surname>
          </string-name>
          , Bethany Gray, and
          <string-name>
            <given-names>Kornwipa</given-names>
            <surname>Poonpon</surname>
          </string-name>
          .
          <article-title>Should we use characteristics of conversation to measure grammatical complexity in l2 writing development?</article-title>
          <source>Tesol Quarterly</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Chung et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Junyoung</given-names>
            <surname>Chung</surname>
          </string-name>
          , Çağlar Gülçehre, KyungHyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          .
          <source>CoRR, abs/1412.3555</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Crossley and McNamara</source>
          ,
          <year>2012</year>
          ]
          <article-title>Scott A Crossley and Danielle S McNamara. Detecting the first language of second language writers using automated indices of cohesion, lexical sophistication, syntactic complexity and conceptual knowledge. In Approaching language transfer through text classification: Explorations in the detection-based approach</article-title>
          .
          <source>Channel View Publications</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Davies</source>
          , 2008]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Davies</surname>
          </string-name>
          .
          <article-title>The Corpus of Contemporary American English</article-title>
          . BYU, Brigham Young University,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>[Dell'Orletta</surname>
          </string-name>
          et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <article-title>Linguistic profiling of texts across textual genres and readability levels. an exploratory study on Italian fictional prose</article-title>
          .
          <source>In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP</source>
          <year>2013</year>
          , pages
          <fpage>189</fpage>
          -
          <lpage>197</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Dewdney et al.,
          <year>2001</year>
          ]
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Dewdney</surname>
          </string-name>
          , Carol VanEssDykema, and Richard MacMillan.
          <article-title>The form is the substance: Classification of genres in text</article-title>
          .
          <source>In Proceedings of the Workshop on Human Language Technology and Knowledge Management-Volume</source>
          <year>2001</year>
          ,
          <article-title>page 7</article-title>
          . Association for Computational Linguistics,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Ehret and Szmrecsanyi</source>
          , 2016]
          <string-name>
            <given-names>Katharina</given-names>
            <surname>Ehret</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benedikt</given-names>
            <surname>Szmrecsanyi</surname>
          </string-name>
          .
          <article-title>An information-theoretic approach to assess linguistic complexity</article-title>
          .
          <source>Complexity and Isolation</source>
          . Berlin: de Gruyter,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[François and Miltsakaki</source>
          , 2012]
          <article-title>Thomas François and Eleni Miltsakaki. Do NLP and machine learning improve traditional readability formulas</article-title>
          ?
          <source>In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          . Association for Computational Linguistics,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Hafner</source>
          , 2018]
          <string-name>
            <given-names>Danijar</given-names>
            <surname>Hafner</surname>
          </string-name>
          .
          <article-title>Variable Sequence Lengths in TensorFlow</article-title>
          . https://danijar.com/variable-sequence-lengths-in-tensorflow/,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>editors</surname>
          </string-name>
          , Language Complexity: Typology, contact, change, pages
          <fpage>89</fpage>
          -
          <lpage>108</lpage>
          . Benjamins Amsterdam, Philadelphia,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Kessler et al.,
          <year>1997</year>
          ]
          <string-name>
            <given-names>Brett</given-names>
            <surname>Kessler</surname>
          </string-name>
          , Geoffrey Numberg, and
          <article-title>Hinrich Schütze. Automatic detection of text genre</article-title>
          .
          <source>In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          . Association for Computational Linguistics,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Kettunen et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Kimmo</given-names>
            <surname>Kettunen</surname>
          </string-name>
          , Markus Sadeniemi, Tiina Lindh-Knuutila, and
          <string-name>
            <given-names>Timo</given-names>
            <surname>Honkela</surname>
          </string-name>
          .
          <article-title>Analysis of EU languages through text compression</article-title>
          .
          <source>In Advances in Natural Language Processing</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Klein and Manning,
          <year>2003</year>
          ]
          <string-name>
            <given-names>Dan</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Accurate unlexicalized parsing</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Kyle et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Kristopher</given-names>
            <surname>Kyle</surname>
          </string-name>
          , Scott A Crossley, and You Jin Kim
          .
          <article-title>Native language identification and writing proficiency</article-title>
          .
          <source>International Journal of Learner Corpus Research</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>187</fpage>
          -
          <lpage>209</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Li and Vitányi,
          <year>1997</year>
          ]
          <string-name>
            <given-names>Ming</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Vitányi</surname>
          </string-name>
          .
          <source>An Introduction to Kolmogorov Complexity and Its Applications</source>
          . Springer, Heidelberg,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Lu,
          <year>2010</year>
          ]
          <string-name>
            <given-names>Xiaofei</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Automatic analysis of syntactic complexity in second language writing</article-title>
          .
          <source>International Journal of Corpus Linguistics</source>
          ,
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <fpage>474</fpage>
          -
          <lpage>496</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Lu,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Xiaofei</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>The relationship of lexical richness to the quality of ESL learners' oral narratives</article-title>
          .
          <source>The Modern Language Journal</source>
          ,
          <volume>96</volume>
          (
          <issue>2</issue>
          ):
          <fpage>190</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Malmasi et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and
          <string-name>
            <given-names>Yao</given-names>
            <surname>Qian</surname>
          </string-name>
          .
          <article-title>A report on the 2017 native language identification shared task</article-title>
          .
          <source>In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>62</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Manning et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System demonstrations</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Ortega,
          <year>2003</year>
          ]
          <string-name>
            <given-names>Lourdes</given-names>
            <surname>Ortega</surname>
          </string-name>
          .
          <article-title>Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing</article-title>
          .
          <source>Applied Linguistics</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>492</fpage>
          -
          <lpage>518</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Passonneau et al.,
          <year>2014</year>
          ] Rebecca J Passonneau, Nancy Ide, Songqiao Su, and Jesse Stuart.
          <article-title>Biber redux: Reconsidering dimensions of variation in American English</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2014</year>
          ,
          <source>the 25th International Conference on Computational Linguistics: Technical papers</source>
          , pages
          <fpage>565</fpage>
          -
          <lpage>576</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Stamatatos et al.,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Nikos Fakotakis, and
          <string-name>
            <given-names>George</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          .
          <article-title>Automatic text categorization in terms of genre and author</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>471</fpage>
          -
          <lpage>495</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Ströbel et al.,
          <year>2016</year>
          ] Marcus Ströbel, Elma Kerz, Daniel Wiechmann, and
          <string-name>
            <given-names>Stella</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>CoCoGen - Complexity Contour Generator: Automatic assessment of linguistic complexity using a sliding-window technique</article-title>
          .
          <source>In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Ströbel, 2014]
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Ströbel</surname>
          </string-name>
          .
          <article-title>Tracking complexity of L2 academic texts: A sliding-window approach</article-title>
          .
          <source>Master thesis</source>
          . RWTH Aachen University,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Van Halteren,
          <year>2004</year>
          ]
          <string-name>
            <given-names>Hans</given-names>
            <surname>Van Halteren</surname>
          </string-name>
          .
          <article-title>Linguistic profiling for author recognition and verification</article-title>
          .
          <source>In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics</source>
          , page
          <fpage>199</fpage>
          . Association for Computational Linguistics,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Xu et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Zhijuan</given-names>
            <surname>Xu</surname>
          </string-name>
          , Lizhen Liu, Wei Song, and
          <string-name>
            <given-names>Chao</given-names>
            <surname>Du</surname>
          </string-name>
          .
          <article-title>Text genre classification research</article-title>
          .
          <source>In 2017 International Conference on Computer, Information and Telecommunication Systems (CITS)</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>178</lpage>
          . IEEE,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Yogatama et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Dani</given-names>
            <surname>Yogatama</surname>
          </string-name>
          , Chris Dyer, Wang Ling, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>Generative and discriminative text classification with recurrent neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1703.01898</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>