<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Genre Classification Based on Linguistic Complexity Contours Using a Recurrent Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcus Ströbel</string-name>
          <email>marcus.stroebel@ifaar.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elma Kerz</string-name>
          <email>elma.kerz@ifaar.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Wiechmann</string-name>
          <email>d.wiechmann@uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Qiao</string-name>
          <email>yu.qiao@rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RWTH Aachen University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>56</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>Over the last few years, there has been increased interest in the combined use of natural language processing techniques and machine learning algorithms to automatically classify texts on the basis of a wide range of features. One class of features that has been successfully employed for a wide range of classification tasks, including native language identification, readability assessment and text genre categorization, pertains to the construct of 'linguistic complexity'. This paper presents a novel approach to the use of linguistic complexity features in text categorization: Rather than representing text complexity 'globally' in terms of summary statistics, this approach assesses text complexity 'locally' and captures the progression of complexity within a text as a sequence of complexity scores, generating what is referred to here as 'complexity contours'. We demonstrate the utility of the approach in an automatic text classification task for five genres - academic, newspaper, fiction, magazine and spoken - of the Corpus of Contemporary American English (COCA) [Davies, 2008] using a recurrent neural network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Recent years have witnessed a growing interest in the
combined use of natural language processing (NLP) techniques
and machine learning algorithms to investigate the
formal features of a text, rather than its content. This type
of approach has been successfully applied to a range of
automatic text categorization tasks, including author recognition
and verification [Van Halteren, 2004], native language
identification, e.g. [Malmasi et al., 2017; Crossley and McNamara,
2012; Kyle et al., 2015], readability assessment [François
and Miltsakaki, 2012] and text genre identification [Xu et
al., 2017]. One class of features that has been successfully
employed in text classification research pertains to the
multidimensional construct of ‘linguistic complexity’, which cuts
across multiple levels of linguistic representation. Linguistic
complexity is commonly defined as “the range of forms
that surface in language production and the degree of
sophistication of such forms” [Ortega, 2003, p. 492]. This construct
has been operationalized in terms of a number of measures
that tap into different levels of linguistic analysis (e.g.
lexical features, such as type-token ratio, or syntactic features,
such as complex nominals per clause) and that require
different NLP preprocessing steps, from tokenization to
syntactic parsing (see Section 2). While previous text classification
studies have combined information from multiple measures
so as to cover different levels of analysis, these studies used
as input for their classifiers scores representing the average
complexity of a text. However, the use of such aggregate
scores obscures the considerable degree of variation of
complexity within a text.</p>
      <p>In this paper we present a novel approach to the use of
linguistic complexity features in the area of text classification.
To this end, we employ a computational tool that implements
a sliding-window technique to track the progression of
complexity within a text, allowing for a ‘local’ - rather than a
‘global’ - assessment of the complexity of a text. More precisely,
we demonstrate the utility of the approach in a text genre
classification task. Text genre detection is a typical classificatory
task in computational stylistics that concerns “the
identification of the kind (or functional style) of the text” [Stamatatos
et al., 2000, p. 472]. Although definitions of ‘genre’ remain
elusive, in the broadest sense it can be used to refer to
“language use in a conventionalized communicative setting in
order to give expression to a specific set of communicative goals
of a disciplinary or social institution, which give rise to stable
structural forms by imposing constraints on the use of
lexicogrammatical as well as discoursal resources” [Bhatia, 2004,
p. 23]. In this study, we start from the assumption that these
constraints are reflected in the degree of linguistic complexity
of a text. We then proceed to show that genres are distinguished
not only by their average complexity but also by the
distribution of linguistic complexity within a text.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Our approach: Measuring linguistic complexity using a sliding-window technique</title>
      <p>The distribution of linguistic complexity within a text was
measured using the Complexity Contour Generator (CoCoGen), a
computational tool that implements a sliding-window technique to
generate a series of measurements for a given complexity measure
(CM), allowing for a ‘local assessment’ of complexity within a text
[Ströbel, 2014; Ströbel et al., 2016]. The approach implemented in
CoCoGen stands in contrast to the standard approach, which represents
text complexity as a single score and thus provides only a ‘global
assessment’ of the complexity of a text. A sliding window can be
conceived of as a window of a certain size ws, defined by the number
of sentences it contains. The window is moved across a text
sentence by sentence, computing one value per window for a given CM.
For a text comprising n sentences, there are w = n − ws + 1 windows.
Given the constraint that there has to be at least one window, a
text has to comprise at least as many sentences as the window is
wide (n ≥ ws). To compute the complexity score of a given window m
(w(m)), a measurement function is called for each sentence in the
window and returns a fraction (wnm/wdm). The numerators and
denominators of the fractions from the first to the last sentence in
the window are then added up to form the numerator and denominator
of the resulting complexity score of the window (see Figure 1).</p>
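      <p>The windowed computation described above can be sketched as
follows. This is a minimal illustration, not CoCoGen itself: the
per-sentence (numerator, denominator) pairs are assumed to come from
a measurement function for a given CM (e.g. characters per word for
a word-length measure).</p>
      <preformat>
```python
def complexity_contour(fractions, ws=3):
    """Compute one complexity score per sliding window.

    fractions: one (wn, wd) numerator/denominator pair per sentence,
    as returned by a measurement function for a given CM.
    ws: window size in sentences; the text must satisfy n >= ws.
    """
    n = len(fractions)
    if n < ws:
        raise ValueError("text must comprise at least ws sentences")
    contour = []
    # the window is moved sentence by sentence: w = n - ws + 1 windows
    for m in range(n - ws + 1):
        window = fractions[m:m + ws]
        num = sum(wn for wn, _ in window)  # add up the numerators
        den = sum(wd for _, wd in window)  # add up the denominators
        contour.append(num / den)
    return contour
```
      </preformat>
      <p>For the ten-sentence text of Figure 1 with ws = 3, this yields
the eight scores w0, . . . , w7.</p>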
      <p>Figure 1: Schematic illustration of how complexity measurements
are obtained in CoCoGen for a text comprising ten sentences with a
window size of ws = 3.</p>
      <p>The series of measurements generated by CoCoGen captures the
progression of linguistic complexity within a text for a given CM
and is referred to here as a ‘complexity contour’. As texts vary in
length, their complexity contours cannot be directly compared. To
permit comparisons of such contours, CoCoGen features a scaling
algorithm that divides each text into a user-defined number of
approximately same-sized partitions, termed here ‘scaled windows’
(see Figure 2).</p>
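      <p>One plausible reading of the scaling step, assuming
non-overlapping partitions of roughly equal size (the exact
partitioning scheme in CoCoGen may differ):</p>
      <preformat>
```python
def scaled_windows(fractions, sw=3):
    """Collapse per-sentence (wn, wd) pairs into sw roughly equal-sized
    partitions so that contours of texts of different lengths become
    comparable position by position."""
    n = len(fractions)
    scores = []
    for k in range(sw):
        # sentence span of partition k, boundaries rounded to integers
        start, end = k * n // sw, (k + 1) * n // sw
        part = fractions[start:end]
        scores.append(sum(wn for wn, _ in part) / sum(wd for _, wd in part))
    return scores
```
      </preformat>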
      <p>Figure 2: Illustration of the scaled-window measurements
obtained by CoCoGen for a text comprising ten sentences with the
number of scaled windows set to 3.</p>
      <p>In its current version, CoCoGen supports 32 measures of
linguistic complexity. Importantly, CoCoGen was designed with
extensibility in mind, so that additional complexity
measures can easily be added. It uses an abstract measure class for
this purpose.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Experiment</title>
      <sec id="sec-3-1">
        <title>3.1 Datasets</title>
        <p>The corpus data from the present study come from the
Corpus of Contemporary American English
(COCA) [Davies, 2008]. COCA is a balanced corpus of
American English containing more than 560 million words
of text (20 million words each year 1990-2017) equally
divided among five general genres: spoken, fiction, popular
magazines, newspapers, and academic texts.1 For the</p>
        <p>1The selection of the COCA over the British National
Corpus (BNC) is motivated by two main reasons: (1) the COCA
covers the time span from 1990 to 2012, whereas the BNC covers
the time span from the 1980s to 1993 (i.e. the most recent texts in
the BNC are from the early 1990s, more than twenty years ago),
making the COCA more representative of contemporary English, and
(2) the COCA (450 million words) is more than four times as large
as the BNC (100 million words).</p>
        <p>purposes of this study, we used a balanced subsample of
the corpus comprising 10,500 texts obtained by random
sampling of 2,100 texts from each of these five genres.</p>
        <p>Complexity contours were obtained using CoCoGen with
a window size of 10 sentences over all 10,500 texts. For
each text, we extracted a feature sequence, which consists
of a series of n − 10 + 1 32-dimensional feature
vectors generated at each window position, where n is the
number of sentences in a text. After normalization and
padding of the feature sequences, the data were divided
into a balanced training set of 10,000 feature sequences
(2,000 texts per genre) and a balanced test set of 500 feature
sequences (100 texts per genre). To determine to what extent
the performance of the classification model is driven by the
sequence information, i.e. by the complexity contour, rather
than by average text complexity, we also created a comparison
dataset in which we collapsed each unnormalized feature
sequence to its mean vector, so as to retain only the global
feature information, and then normalized these data.</p>
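        <p>The construction of the comparison dataset can be sketched
as follows; the function and variable names are illustrative, and
numpy stands in for whatever pipeline was actually used.</p>
        <preformat>
```python
import numpy as np

def collapse_to_mean(sequences):
    """Collapse each (length_i, 32) feature sequence to its mean
    vector, keeping only the global (average) complexity per CM and
    discarding the contour information."""
    return np.stack([seq.mean(axis=0) for seq in sequences])

def pad_sequences(sequences, max_len):
    """Zero-pad variable-length feature sequences to a common length
    so they can be batched for the RNN."""
    dim = sequences[0].shape[1]
    out = np.zeros((len(sequences), max_len, dim))
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = seq
    return out
```
        </preformat>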
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Model</title>
        <p>We used a Recurrent Neural Network classifier adopting the
model specification described in [Hafner, 2018]. This model
was used because (1) it is a dynamic RNN model that can
handle sequences of variable length2, (2) it uses Gated Recurrent
Unit (GRU) cells, which have been shown to yield better
performance on smaller datasets [Chung et al., 2014], and (3) it is a
simple model.</p>
        <p>Assume an input sequence X =
(x1, x2, . . . , xl, xl+1, . . . , xn), where each xi is a
32-dimensional vector, l is the length of the sequence, n ∈ Z
is greater than or equal to the length of the longest sequence in
the dataset, and xl+1, . . . , xn are padded 0-vectors. As shown in
Figure 3, this model consists only of GRU cells with 200 hidden
units. To predict the classification, softmax was applied to the
output of a fully-connected layer, in which the output of the last
GRU cell, i.e. the one whose input is xl, is transformed from a
200-dimensional vector to a 5-dimensional vector. In order to make
our comparison to the average-complexity approach as fair as
possible, we reused the above model. However, rather than training
it on sequences, it was provided only with vectors of average
complexities, i.e. the roll-out of the model consists of only one
GRU cell.</p>
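        <p>The crucial detail for variable-length input, following the
idea in [Hafner, 2018], is that the dense layer and softmax are
applied to the GRU output at the last real (non-padded) time step.
A minimal numpy sketch of that selection and classification head
(the GRU itself is elided; all names are illustrative):</p>
        <preformat>
```python
import numpy as np

def last_relevant(outputs, lengths):
    """Pick, for each sequence, the GRU output at its last non-padded
    step.  outputs: (batch, max_len, hidden); lengths: (batch,)."""
    return outputs[np.arange(outputs.shape[0]), lengths - 1]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(outputs, lengths, W, b):
    """Dense layer (hidden -> 5 genres) plus softmax applied to the
    last relevant GRU output (hidden = 200 in the paper's model)."""
    return softmax(last_relevant(outputs, lengths) @ W + b)
```
        </preformat>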
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Training</title>
        <p>As a loss function, cross entropy was used.</p>
        <p>L = − Σ_{i=1}^{5} yi log(ŷi)</p>
        <p>where [y1, y2, . . . , y5] is the label of the sequence and
[ŷ1, ŷ2, . . . , ŷ5] is the prediction of the model.</p>
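        <p>Numerically, with a one-hot label the loss reduces to the
negative log-probability assigned to the true genre:</p>
        <preformat>
```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) over the 5 genre classes; eps
    guards against log(0)."""
    return -np.sum(y * np.log(y_hat + eps))
```
        </preformat>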
        <p>The mini-batch size was set to 100. For optimization,
we compared Nesterov accelerated gradient (NAG), Adadelta
and RMSprop, and finally decided on NAG with a learning
rate of η = 0.01 and momentum γ = 0.9, for which we achieved
the lowest error rate.</p>
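        <p>A single NAG update with the settings above can be written
as follows (a textbook formulation; the gradient is evaluated at the
look-ahead point):</p>
        <preformat>
```python
def nag_step(w, v, grad_fn, lr=0.01, momentum=0.9):
    """One Nesterov accelerated gradient step with eta = 0.01 and
    gamma = 0.9: evaluate the gradient at w - gamma * v, update the
    velocity, then take the step."""
    v_new = momentum * v + lr * grad_fn(w - momentum * v)
    return w - v_new, v_new
```
        </preformat>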
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Results and Discussion</title>
        <p>Before turning to the results of the classification
experiment, we first present the results of the CoCoGen
analysis to illustrate the variation in text complexity both at
the level of individual texts and at the level of text
genres. Note that for this illustration we used the scaled
CoCoGen output with 100 scaled windows. Figure 4
visualizes the progression of complexity within a single
text from the genre of academic writing for two selected
measures of complexity (corrected type-token ratio and
Dependent Clauses per T-Unit).</p>
        <p>Figure 3: Roll-out of the RNN model.</p>
        <p>2The lengths of the feature sequences depend on the number of
sentences of the texts in our corpus.</p>
        <p>Figure 4: Progression of complexity within a single text
from the genre of academic writing for two measures of complexity
(corrected type-token ratio and Dependent Clauses per T-Unit). The
x-axis shows text position in scaled windows (0-100).</p>
        <p>
As shown in Figure 4, complexity is not evenly distributed
within a text but progresses through a sequence of peaks and
troughs for both CMs. Furthermore, Figure 4 suggests that
there is an interaction between the two measures such that
higher complexity in CTTR in the beginning of the text
(windows 1-30) appears to be compensated for by lower
complexity in DCPerTUnit, whereas the reverse is true for the middle
part of the text (windows 40-60).</p>
        <p>To determine to what extent text complexity varies among
the genres and to see if there are tendencies for genre-specific
complexity contours, we aggregated the complexity scores of
all texts from a given genre at each of the 100 scaled window
positions. Figure 6 presents an overview of the resulting
average complexity contours for the five genres compared across
all CMs.</p>
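        <p>The aggregation step amounts to averaging, per genre and CM,
the scaled contours of all texts position by position; a short
numpy sketch (names are illustrative):</p>
        <preformat>
```python
import numpy as np

def average_contour(scaled_contours):
    """Average complexity scores across all texts of one genre at each
    of the 100 scaled-window positions.
    scaled_contours: (num_texts, 100) array for one genre and one CM."""
    return scaled_contours.mean(axis=0)
```
        </preformat>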
        <p>
          Figure 6 shows that academic writing is the most
complex of the five genres investigated in this study
with respect to the majority of CMs (19 out of 32
CMs). It is particularly more complex with
regard to the CMs Complex Nominals per Clause,
Coordinate Phrases per Clause and
Noun Phrase Post-modifications per Clause,
but not with regard to Dependent Clauses per T-Unit
or Verb Phrases per T-Unit. Complexity scores of
these latter two CMs are highest in spoken conversation.
These results are consistent with findings reported in
corpus-based studies demonstrating that academic writing is
characterized by a ‘compressed’ discourse style with phrasal
(non-clausal) modifiers embedded in noun phrases, whereas
spoken discourse is more structurally elaborated with
multiple levels of clausal embedding [Biber and Gray, 2010;
Biber et al., 2011]. Figure 6 furthermore demonstrates that,
while the averaged contours are less ‘wiggly’ than those of
individual texts, they are typically not uniform and often
nonlinear. For example, for some CMs and genres, e.g.
Coordinate Phrases per Clause in academic writing, the
distribution is U-shaped, such that the beginning and end of
a text are much more complex on average than its middle
part. Overall, this pattern of results strongly suggests that
the complexity scores are not randomly distributed across
the texts of a given genre. We now turn to the results of our
classification experiment. Classification results of previous
studies on text genre identification range from relatively
low accuracies of 52% to 80%, with results above 90%
reported in some cases, depending on the number and type
of genres being considered, the size and difficulty of the data, etc.
          <xref ref-type="bibr" rid="ref13 ref20 ref21 ref23 ref28 ref29 ref4 ref7 ref8">(see, e.g., [Kessler et al., 1997; Dewdney et al., 2001;
Dell'Orletta et al., 2013; Passonneau et al., 2014;
Yogatama et al., 2017])</xref>
          . The performance of our RNN
classifiers over 60 training epochs is presented in Figure 5.
Figure 5 indicates that the sequence-based RNN displays
consistently lower error rates than the average-based RNN:
The average-based RNN reached a maximal performance
of 91.2% at epoch 16, with a mean performance of 90%
accuracy in the surrounding epochs (epochs 10-20). After
that, performance starts to decrease, indicating overfitting.
The sequence-based RNN reached a maximal accuracy of
92.8% after 10 epochs and converged on a robust average
performance of 91.5% after around 30 epochs. These results
suggest the utility of the sequence information for the task of
genre identification.
        </p>
        <p>The main goal of the paper was to showcase a novel approach
to the use of linguistic complexity features for purposes of
automatic text classification. Using the task of text genre
classification as a test case, we showed that both individual texts
and text genres are characterized by considerable variation
in within-text complexity, as captured by ‘complexity
contours’, i.e. the series of measurements generated by CoCoGen,
which implements a sliding-window technique. The results of
a 5-class text genre classification experiment demonstrated
that the inclusion of these contours further increased the high
performance (90%) of a GRU-RNN classifier trained on
average text complexity scores. In future studies we intend to
explore the utility of our approach in other text classification
tasks.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Bhatia</source>
          , 2004]
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Bhatia</surname>
          </string-name>
          .
          <article-title>Worlds of written discourse: A genre-based view</article-title>
          . A&amp;
          <string-name>
            <surname>C Black</surname>
          </string-name>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Biber and Gray</source>
          , 2010]
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Biber</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bethany</given-names>
            <surname>Gray</surname>
          </string-name>
          .
          <article-title>Challenging stereotypes about academic writing: Complexity, elaboration, explicitness</article-title>
          .
          <source>Journal of English for Academic Purposes</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Biber et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Biber</surname>
          </string-name>
          , Bethany Gray, and
          <string-name>
            <given-names>Kornwipa</given-names>
            <surname>Poonpon</surname>
          </string-name>
          .
          <article-title>Should we use characteristics of conversation to measure grammatical complexity in l2 writing development?</article-title>
          <source>Tesol Quarterly</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>35</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Chung et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Junyoung</given-names>
            <surname>Chung</surname>
          </string-name>
          , Çağlar Gülçehre, KyungHyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          .
          <source>CoRR, abs/1412.3555</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Crossley and McNamara</source>
          ,
          <year>2012</year>
          ]
          <article-title>Scott A Crossley and Danielle S McNamara. Detecting the first language of second language writers using automated indices of cohesion, lexical sophistication, syntactic complexity and conceptual knowledge. In Approaching language transfer through text classification: Explorations in the detection-based approach</article-title>
          .
          <source>Channel View Publications</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Davies</source>
          , 2008]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Davies</surname>
          </string-name>
          .
          <article-title>The Corpus of Contemporary American English</article-title>
          . BYU, Brigham Young University,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>[Dell'Orletta</surname>
          </string-name>
          et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <article-title>Linguistic profiling of texts across textual genres and readability levels. an exploratory study on Italian fictional prose</article-title>
          .
          <source>In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP</source>
          <year>2013</year>
          , pages
          <fpage>189</fpage>
          -
          <lpage>197</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Dewdney et al.,
          <year>2001</year>
          ]
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Dewdney</surname>
          </string-name>
          , Carol VanEssDykema, and Richard MacMillan.
          <article-title>The form is the substance: Classification of genres in text</article-title>
          .
          <source>In Proceedings of the Workshop on Human Language Technology and Knowledge Management-Volume</source>
          <year>2001</year>
          ,
          <article-title>page 7</article-title>
          . Association for Computational Linguistics,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Ehret and Szmrecsanyi</source>
          , 2016]
          <string-name>
            <given-names>Katharina</given-names>
            <surname>Ehret</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benedikt</given-names>
            <surname>Szmrecsanyi</surname>
          </string-name>
          .
          <article-title>An information-theoretic approach to assess linguistic complexity</article-title>
          .
          <source>Complexity and Isolation</source>
          . Berlin: de Gruyter,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[François and Miltsakaki</source>
          , 2012]
          <article-title>Thomas François and Eleni Miltsakaki. Do NLP and machine learning improve traditional readability formulas</article-title>
          ?
          <source>In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          . Association for Computational Linguistics,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Hafner</source>
          , 2018]
          <string-name>
            <given-names>Danijar</given-names>
            <surname>Hafner</surname>
          </string-name>
          .
          <article-title>Variable Sequence Lengths in TensorFlow</article-title>
          . https://danijar.com/variable-sequence-lengths-in-tensorflow/,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>editors</surname>
          </string-name>
          , Language Complexity: Typology, contact, change, pages
          <fpage>89</fpage>
          -
          <lpage>108</lpage>
          . Benjamins Amsterdam, Philadelphia,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Kessler et al.,
          <year>1997</year>
          ]
          <string-name>
            <given-names>Brett</given-names>
            <surname>Kessler</surname>
          </string-name>
          , Geoffrey Numberg, and
          <article-title>Hinrich Schütze. Automatic detection of text genre</article-title>
          .
          <source>In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          . Association for Computational Linguistics,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Kettunen et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Kimmo</given-names>
            <surname>Kettunen</surname>
          </string-name>
          , Markus Sadeniemi, Tiina Lindh-Knuutila, and
          <string-name>
            <given-names>Timo</given-names>
            <surname>Honkela</surname>
          </string-name>
          .
          <article-title>Analysis of EU languages through text compression</article-title>
          .
          <source>In Advances in Natural Language Processing</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Klein and Manning,
          <year>2003</year>
          ]
          <string-name>
            <given-names>Dan</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Accurate unlexicalized parsing</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Kyle et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Kristopher</given-names>
            <surname>Kyle</surname>
          </string-name>
          , Scott A Crossley, and You Jin Kim
          .
          <article-title>Native language identification and writing proficiency</article-title>
          .
          <source>International Journal of Learner Corpus Research</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>187</fpage>
          -
          <lpage>209</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Li and Vitányi,
          <year>1997</year>
          ]
          <string-name>
            <given-names>Ming</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Vitányi</surname>
          </string-name>
          .
          <source>An Introduction to Kolmogorov Complexity and Its Applications</source>
          . Springer, Heidelberg,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Lu,
          <year>2010</year>
          ]
          <string-name>
            <given-names>Xiaofei</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Automatic analysis of syntactic complexity in second language writing</article-title>
          .
          <source>International Journal of Corpus Linguistics</source>
          ,
          <volume>15</volume>
          (
          <issue>4</issue>
          ):
          <fpage>474</fpage>
          -
          <lpage>496</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Lu,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Xiaofei</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>The relationship of lexical richness to the quality of ESL learners' oral narratives</article-title>
          .
          <source>The Modern Language Journal</source>
          ,
          <volume>96</volume>
          (
          <issue>2</issue>
          ):
          <fpage>190</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Malmasi et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , Keelan Evanini, Aoife Cahill, Joel Tetreault, Robert Pugh, Christopher Hamill, Diane Napolitano, and
          <string-name>
            <given-names>Yao</given-names>
            <surname>Qian</surname>
          </string-name>
          .
          <article-title>A report on the 2017 native language identification shared task</article-title>
          .
          <source>In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>62</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Manning et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          , Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System demonstrations</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Ortega,
          <year>2003</year>
          ]
          <string-name>
            <given-names>Lourdes</given-names>
            <surname>Ortega</surname>
          </string-name>
          .
          <article-title>Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing</article-title>
          .
          <source>Applied Linguistics</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>492</fpage>
          -
          <lpage>518</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Passonneau et al.,
          <year>2014</year>
          ] Rebecca J Passonneau, Nancy Ide, Songqiao Su, and Jesse Stuart.
          <article-title>Biber redux: Reconsidering dimensions of variation in American English</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2014</year>
          ,
          <source>the 25th International Conference on Computational Linguistics: Technical papers</source>
          , pages
          <fpage>565</fpage>
          -
          <lpage>576</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Stamatatos et al.,
          <year>2000</year>
          ]
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          , Nikos Fakotakis, and
          <string-name>
            <given-names>George</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          .
          <article-title>Automatic text categorization in terms of genre and author</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>471</fpage>
          -
          <lpage>495</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Ströbel et al.,
          <year>2016</year>
          ] Marcus Ströbel, Elma Kerz, Daniel Wiechmann, and
          <string-name>
            <given-names>Stella</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>CoCoGen - Complexity Contour Generator: Automatic assessment of linguistic complexity using a sliding-window technique</article-title>
          .
          <source>In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Ströbel, 2014]
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Ströbel</surname>
          </string-name>
          .
          <article-title>Tracking complexity of L2 academic texts: A sliding-window approach</article-title>
          .
          <source>Master thesis</source>
          . RWTH Aachen University,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Van Halteren,
          <year>2004</year>
          ]
          <string-name>
            <given-names>Hans</given-names>
            <surname>Van Halteren</surname>
          </string-name>
          .
          <article-title>Linguistic profiling for author recognition and verification</article-title>
          .
          <source>In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics</source>
          , page
          <fpage>199</fpage>
          . Association for Computational Linguistics,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Xu et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Zhijuan</given-names>
            <surname>Xu</surname>
          </string-name>
          , Lizhen Liu, Wei Song, and
          <string-name>
            <given-names>Chao</given-names>
            <surname>Du</surname>
          </string-name>
          .
          <article-title>Text genre classification research</article-title>
          .
          <source>In 2017 International Conference on Computer, Information and Telecommunication Systems (CITS)</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>178</lpage>
          . IEEE,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Yogatama et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Dani</given-names>
            <surname>Yogatama</surname>
          </string-name>
          , Chris Dyer, Wang Ling, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>Generative and discriminative text classification with recurrent neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1703.01898</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>