Identifying Predictive Features for Textual Genre Classification: the Key Role of Syntax

Andrea Cimino, Martijn Wieling•, Felice Dell’Orletta, Simonetta Montemagni, Giulia Venturi

• University of Groningen - The Netherlands
m.b.wieling@rug.nl

Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{name.surname}@ilc.cnr.it

Abstract

English. The paper investigates the impact and role of different feature types for the specific task of Automatic Genre Classification, with the final aim of identifying the most predictive ones. The goal was pursued by carrying out incremental feature selection through Grafting using different sets of linguistic features. The results achieved in discriminating among four traditional textual genres show the key role played by syntactic features, whose impact turned out to vary across genres.

Italiano (English translation). The paper investigates the role played by different types of linguistic features in the automatic classification of textual genre, in order to identify the most effective and relevant ones. To this end, a methodology was developed based on an incremental selection process carried out with a Grafting algorithm using different types of features. The results show the key role of syntactic features, whose impact varies significantly across genres.

1 Introduction

Automatic classification of textual genres has always received significant attention from both theoretical and application perspectives. On the one hand, it has been considered relevant by linguists and educators to teach students the correct way of writing in specific communicative scenarios (Biber, 1995; Lee, 2001). On the other hand, the classification of textual genres is seen as a way to cope with the well-known problem of information overload: the exploitation of information about document genre can help to develop more accurate Information Retrieval tools. Genre identification has been considered a key factor for reducing irrelevant results of search engines, as users would be able to specify the desired textual genre along with the keywords expressing the content they are looking for (Santini, 2004; Lim, 2004; Santini, 2007). In fact, document genre and document content represent orthogonal dimensions of classification (Finn, 2003).

A variety of approaches to Automatic Genre Classification (AGC) have been proposed so far, differing in the genres and the types of features considered. Given the widely acknowledged fact that no established classification of genres exists (see e.g. Sharoff (2010) or Biber (2009)), previous studies focused on ‘traditional genres’ such as journalism, handbooks and academic prose, see among others (Kessler, 1997; Stamatatos, 2001; Fang, 2010), and on ‘web genres’, i.e. genres of web pages, see e.g. (Santini, 2004; Lim, 2004; Mehler, 2010).

Despite the great interest in investigating which linguistic features qualify a text genre (Biber, 2009; Fang, 2015), so far little effort has been devoted to using sophisticated NLP techniques, such as syntactic parsing, to capture complex linguistic features for the automatic classification of textual genres. Unlike other application scenarios where the form (the style) of a document is investigated, such as Authorship Attribution (Cranenburgh, 2012), Readability Assessment (Collins, 2014) and Native Language Identification (Tetreault, 2013), the AGC approaches proposed so far mainly focus on word-level linguistic features, in particular the distribution of function words, word frequency, n-gram models of both characters and Parts-Of-Speech (Santini, 2004; Crossley, 2007; Mehler, 2010), or finer-grained Parts-Of-Speech tags including morpho-syntactic features such as verb tense (Fang, 2010). Very
few studies rely on features extracted from syntactically annotated texts, the exception being Stamatatos (2001), who combines lexical features (i.e. word frequency) with features extracted from the output of a chunk boundary detector (e.g. the distribution of noun, verbal and adjectival phrases, and the average number of words included in verbal phrases). Similar structural features have also been used by Lim (2004), who combined web-specific features (e.g. HTML tags) with lexical information and features aiming at capturing the syntactic structure of a sentence, e.g. the distribution of declarative and imperative sentences, syntactic ambiguities, etc.

In this paper, we tackle the AGC task for traditional genres (namely literary, scientific, educational and journalistic texts) by using different types of linguistic features, i.e. lexical, morpho-syntactic and syntactic. In particular, the following research questions are addressed: i) which are the most effective features to classify a textual genre, and ii) whether and to what extent the features identified as most effective remain the same across different genres. These questions have been addressed by carrying out incremental feature selection with the final aim of identifying the most predictive features. So far, studies focused on the best set of features to classify textual genres have been carried out mainly on English. In this paper, this issue is investigated for a typologically different language, Italian.

2 Model training and feature ranking

In order to identify and rank the most important features playing a role in genre classification, we used Grafting (Perkins, 2003). This approach allows us to train a maximum entropy model while simultaneously performing incremental feature selection. Grafting uses a gradient-based heuristic to select the most promising feature (which is added to the set of selected features S), and subsequently performs a full weight optimization over all features in S. This process is repeated until a stopping condition is reached. The stopping condition integrates L1 regularization in the grafting approach: a feature is only included (with a non-zero weight) if the L1 penalty is outweighed by the reduction of the objective function. Consequently, overfitting is prevented by excluding noisy features, or those that change value infrequently. In our case, the L1 penalty was selected by evaluating maximum entropy models (using 10-fold cross-validation) with varying L1 values (range: 1e-11, 1e-10, ..., 0.1, 1).

For selecting the features and estimating their weights, we used TinyEst¹, a grafting-capable maximum entropy parameter estimator for ranking tasks (De Kok, 2011; De Kok, 2013). Even though our task is not a ranking task, TinyEst can be used for binary classification by assigning a high score (1) to the correct class and a low score (0) to the incorrect class. A similar approach was followed by Dell’Orletta (2014) for discriminating between easy-to-read and difficult-to-read sentences. As the focus of the present study is on the classification of texts belonging to different traditional genres, we created four separate binary classifiers which were trained to distinguish Literature texts from non-Literature texts (i.e. the three remaining genres), Educational texts from non-Educational texts, etc. A text was assigned the class of the classifier which returned the highest score.

    ¹ http://github.com/danieldk/tinyest

3 Typology of Features

Various types of features have been proposed in the literature for the automatic classification of text genres. Following Stamatatos (2001) and Lim (2004), we combine token-based and structural features. Token-based features were extracted from the list of the most frequent lemmata in the training corpus and represented in terms of the relative frequency of each lemma in each document. Structural features were extracted from the considered corpora, which were morpho-syntactically tagged by the POS tagger described in (Dell’Orletta, 2009) and dependency-parsed by the DeSR parser using a Multi-Layer Perceptron (Attardi, 2009). As shown in Table 1, they range across different levels of linguistic description (lexical, morpho-syntactic and syntactic) for a total of 90 features that have proved to be informative “fingerprints” of the form of a text in tasks concerning e.g. genre, style, authorship or readability.

Typology           Feature
Raw Text           Sentence and token length
Lexical            Rate of words in the Basic Italian Vocabulary, Type/Token ratio
Morpho-syntactic   Part-Of-Speech unigrams, Lexical density, Verbal mood
Syntactic          Dependency type unigrams, Parse tree depth features, Arity of verbal predicates,
                   Distribution of subordinate vs main clauses, Length of dependency links

Table 1: Typology of features automatically extracted from linguistically annotated texts.

4 Experiments

4.1 Experimental Setup

We used an Italian corpus including documents representative of four different genres: educational material (Dell’Orletta, 2011), newspaper articles (Marinelli, 2003), literary texts (Marinelli, 2003) and scientific papers (Dell’Orletta, 2014). The whole corpus was split into a training set (136 documents for the Education genre, 579 for the Journalism genre, 365 for the Literature genre and 317 for the Scientific genre), and a held-out test set (60 documents for each genre).

To assess the influence of including structural features over simply using the most frequent words (lemmata), we used two sets of features (each consisting of about 200 features). The first set of features (taken as the baseline) corresponds to the relative frequency of the 200 most frequent words (henceforth referred to as the tw200 set).² The second set combines token-based and structural features: in addition to the relative frequency of the 100 most frequent words, it contains the 90 structural features illustrated above and detailed in Table 1 (this set is henceforth referred to as the lingtw set). To guarantee comparability of values, the values of each feature were scaled between 0 and 1 on the basis of the data from the training set. If a (non-scaled) feature value in the held-out test set exceeded the maximum non-scaled value of that feature in the training set, it was set to the maximum value (1).

    ² In our preliminary analyses, we also assessed the effect of including the most frequent bigrams as features. However, as the performance was similar to only using unigrams, we did not include bigrams as features.

The feature ranking for each genre was obtained using grafting on the full training data set. The performance (i.e. the percentage of correctly classified documents) of the algorithm was evaluated for an increasing number of features (starting from including only the first (best) feature for each genre to including all features for each genre) against both a 10-fold cross-validation test set and a held-out test set.

The 10-fold cross-validation procedure was performed on the basis of the training set (i.e. the feature weights were determined on the basis of 90% of the training data, whereas the performance was evaluated on the remaining 10% of the training data; this procedure was repeated 10 times). As stated before, each document in the test set was assigned the genre whose binary classification model (in this case with the same number of features) resulted in the highest score.

The classification accuracy was assessed with respect to the held-out test set for different numbers of features: i) the number of features associated with the best performance on the cross-validation set, and ii) the lowest number of features such that the performance dropped when a new feature was added (i.e. performance kept increasing for each additional feature up to the selected number of features).

4.2 Replication

Results reported below can be replicated by downloading the Docker image italianlp-wieling/dockergenreclassification, which contains all data and scripts necessary for the feature extraction and the grafting procedure, and also contains all results. The Docker file including all commands to set up the virtual machine can be found at https://github.com/italianlp-wieling/dockergenreclassification.

5 Results

5.1 Genre Classification Results

Figures 1(a) and 1(b) report the classification results using the lingtw vs tw200 feature sets: it can be clearly observed that the inclusion of structural features is highly beneficial. With only 10 features, the 10-fold cross-validation performance is 83.89% for the lingtw set, whereas it is only 71.51% for the tw200 set. The optimal performance for the lingtw set is reached with 106 features (89.72%), whereas a considerably lower performance is reached using the tw200 set (84.17%) despite the much higher number of features used (179). The performance on the held-out test set turned out to be slightly lower: 79.16% for 10 features using the lingtw set, against 59.16% with the tw200 set. The optimal performance for the lingtw set is reached with 80 features (86.66%), whereas for the tw200 set a lower performance (72.08%) is obtained using 133 features.

Figure 1: Genre classification results using (a) a held-out test set and (b) a 10-fold cross-validation procedure.

              First 10 features        First 50 features        First 80 features
Genre         S    P    L    R    W      S    P    L    R    W      S      P      L      R      W
Journalism    30   40   30   0    0      40   38   10   0    12     38.75  28.75  6.25   1.25   25
Literature    40   30   20   0    10     34   28   8    2    28     33.75  26.25  6.25   1.25   32.50
Education     50   20   10   20   0      44   32   6    4    14     42.5   30     5      2.5    20
Science       60   10   20   10   0      50   32   10   4    4      42.5   30     6.25   2.5    18.75

Table 2: Percentage distribution of the different types of ranked features, i.e. syntactic (S), morpho-syntactic (P), lexical (L), raw-text (R) and token-based (W), selected via Grafting on the held-out test set.
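
The evaluation protocol described in Section 4.1 can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions, not the code shipped in the Docker image: the function names are hypothetical, the lower clip to 0 is an assumption (the text only states the upper clip to 1), and ties are resolved by taking the first candidate.

```python
from typing import Dict, List, Sequence, Tuple


def scale_with_training_range(value: float, train_min: float, train_max: float) -> float:
    """Scale a raw feature value to [0, 1] using the training-set range."""
    if train_max == train_min:
        return 0.0  # assumption: a constant feature is mapped to 0
    scaled = (value - train_min) / (train_max - train_min)
    # Test values above the training maximum are set to 1 (as in Section 4.1);
    # clipping values below the training minimum to 0 is an assumption.
    return min(1.0, max(0.0, scaled))


def assign_genre(scores: Dict[str, float]) -> str:
    """Return the genre whose binary (one-vs-rest) classifier scored highest."""
    return max(scores, key=scores.get)


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of correctly classified documents."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def pick_feature_counts(cv_accuracy: List[float]) -> Tuple[int, int]:
    """cv_accuracy[k - 1] is the cross-validation accuracy with the first k features.

    Returns (i) the number of features with the best accuracy and
    (ii) the lowest number of features after which adding one more
    feature makes the accuracy drop.
    """
    best_k = cv_accuracy.index(max(cv_accuracy)) + 1
    first_drop_k = len(cv_accuracy)  # used when accuracy never drops
    for k in range(1, len(cv_accuracy)):
        if cv_accuracy[k] < cv_accuracy[k - 1]:
            first_drop_k = k
            break
    return best_k, first_drop_k
```

For instance, with cross-validation accuracies [70, 75, 74, 80, 79] for the first one to five features, criterion i) selects 4 features and criterion ii) selects 2.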

5.2 Feature Ranking Results

In order to investigate the types of linguistic features contributing most to AGC, we focused on the lingtw set. In particular, we carried out an in-depth analysis of the grafting-based feature ranking resulting from the classification of the held-out test set. Ranked features were categorized into five classes: syntactic, morpho-syntactic, lexical, raw-text and token-based features. Figure 2 provides a genre-independent view, reporting the average percentage distribution (across genres) of the different feature types within the first 10, 50 and 80 ranked features. As shown, syntactic features play the most relevant role. They cover 45% and 42% of the first 10 and 50 features respectively, and remain the most predictive ones also when 80 features are considered (representing 39.38% of the set). On the other hand, the share of token-based features increases as a larger number of ranked features is considered (they cover 2.5%, 14.5% and 24.06% of the 10, 50 and 80 feature sets respectively).

Figure 2: Genre-independent average distribution of different feature types in the top 10, 50 and 80 ranked sets.

Consider now the distribution of different types of features across genres reported in Table 2: notable differences can be observed. In particular, Literature and Scientific prose represent two opposite poles. Token-based features (W) are more predictive for literary texts than for the other genres (i.e. they represent 10%, 28% and 32.50% of the top 10, 50 and 80 features respectively). On the contrary, syntactic features (S) play a more important role for Scientific prose than for the other genres (covering respectively 60%, 50% and 42.50% of the top 10, 50 and 80 features).
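
The genre-independent percentages quoted above can be recomputed from Table 2 by averaging each feature type's share over the four genres. The Python sketch below does so; the dictionary simply transcribes Table 2, and the names TABLE_2, FEATURE_TYPES and average_distribution are illustrative.

```python
# S = syntactic, P = morpho-syntactic, L = lexical, R = raw text, W = token-based.
FEATURE_TYPES = ("S", "P", "L", "R", "W")

# Percentages transcribed from Table 2: genre -> top-k -> (S, P, L, R, W).
TABLE_2 = {
    "Journalism": {10: (30, 40, 30, 0, 0), 50: (40, 38, 10, 0, 12),
                   80: (38.75, 28.75, 6.25, 1.25, 25)},
    "Literature": {10: (40, 30, 20, 0, 10), 50: (34, 28, 8, 2, 28),
                   80: (33.75, 26.25, 6.25, 1.25, 32.50)},
    "Education": {10: (50, 20, 10, 20, 0), 50: (44, 32, 6, 4, 14),
                  80: (42.5, 30, 5, 2.5, 20)},
    "Science": {10: (60, 10, 20, 10, 0), 50: (50, 32, 10, 4, 4),
                80: (42.5, 30, 6.25, 2.5, 18.75)},
}


def average_distribution(top_k: int) -> dict:
    """Average each feature type's share across the four genres."""
    rows = [TABLE_2[genre][top_k] for genre in TABLE_2]
    return {
        ftype: sum(row[i] for row in rows) / len(rows)
        for i, ftype in enumerate(FEATURE_TYPES)
    }


for k in (10, 50, 80):
    print(k, average_distribution(k))
```

Averaging gives 45, 42 and 39.375 for syntactic features and 2.5, 14.5 and 24.0625 for token-based features in the top 10, 50 and 80 sets, which matches the rounded figures reported in the text.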
                                                           Journalism   Literature   Education   Science
             Sentence length                                   –            82           6         32
             Word length                                       56           50           3          3
             Type/Token Ratio (forms)                          3            95          94          4
             Parse tree depth                                  11            3          37         94
             Maximum length of dependency links                42           24          48         56
             Post-verbal subject                               16           22          90         76
             Pre-verbal object                                 31           21          47         42
             Passive subject                                   7            53          17          5

Table 3: Different ranking positions of a selection of features across genres. Features which were not
selected during ranking have no specified rank in the table.
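
Three of the features listed in Table 3 are straightforward to compute once a sentence has been tokenized and dependency-parsed. The Python sketch below follows the definitions used in this section: parse tree depth as the longest path from the root to a leaf, link length as the number of words occurring between head and dependent, and Type/Token Ratio as types over tokens computed on word forms. The input encoding (a list of 1-based head indices with 0 for the root), the function names and the toy sentence are assumptions made for illustration; they are not the output format of the DeSR parser.

```python
from typing import List


def type_token_ratio(forms: List[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(forms)) / len(forms)


def parse_tree_depth(heads: List[int]) -> int:
    """Longest path (counted in arcs) from the root to a leaf.

    heads[i] is the 1-based position of the head of token i + 1; 0 marks
    the root. The input is assumed to be a well-formed tree (no cycles).
    """
    def distance_to_root(token: int) -> int:
        steps = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    return max(distance_to_root(t) for t in range(1, len(heads) + 1))


def max_dependency_link_length(heads: List[int]) -> int:
    """Largest number of words occurring between a dependent and its head."""
    return max(
        (abs(head - dep) - 1 for dep, head in enumerate(heads, start=1) if head != 0),
        default=0,
    )


# Toy sentence "Il gatto mangia il pesce" with a hand-written analysis:
# Il -> gatto, gatto -> mangia, mangia = root, il -> pesce, pesce -> mangia.
forms = ["il", "gatto", "mangia", "il", "pesce"]
heads = [2, 3, 0, 5, 3]
print(type_token_ratio(forms), parse_tree_depth(heads), max_dependency_link_length(heads))
```

Whether the link length should count only the intervening words, as done here, or also one of the endpoints is not specified in the text, so the off-by-one convention is an assumption.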


Let us now focus on the role played by individual features across genres. Table 3 reports the different rank positions associated with a selection of features in the classification of the four genres. Raw text features (i.e. sentence and word length) turned out to play a key role in the classification of educational materials (Education) compared with the other genres (e.g. Literature). A feature capturing the lexical richness of texts such as the Type/Token Ratio (TTR), which refers to the ratio between the number of lexical types and the number of tokens (considered as single forms) within a text, is similarly ranked for Journalism and Science, while it plays a less relevant role in the classification of educational material and literary texts. Moving to syntax, it should be noted that two features characterizing the overall sentence structure, i.e. the depth of the whole parse tree (calculated in terms of the longest path from the root of the dependency tree to some leaf) and the maximum length of dependency links (calculated in terms of the words occurring between the syntactic head and the dependent), play a key role in the classification of the Literature and Journalism genres. For the latter, it is interesting to contrast the high rank associated with the parse tree depth feature and the irrelevant role played by sentence length (typically taken as a proxy of the underlying grammatical structure): this indicates that syntactic features are more effective in discriminating genres. Other features which turned out to play a relevant role in AGC are concerned with the relative ordering of subject and object with respect to the verbal head: their non-canonical orders, i.e. post-verbal subject and pre-verbal object, play a key role in the classification of the Literature and Journalism genres. On the contrary, the use of passive voice (inferred from the presence of passive subjects) is less relevant for the classification of Literature, whereas it is highly ranked in the characterization of scientific writing and newspaper articles.

6 Conclusion

In this paper we investigated the impact and role of different feature types for Automatic Genre Classification. The goal was pursued by carrying out incremental feature selection through Grafting, using TinyEst. Two sets of features were taken into account, token-based and structure-based. The results show the key role played by syntactic features, a result which is new with respect to the AGC literature. Another original contribution concerns the role of different feature types, which turned out to vary across textual genres, suggesting the specialization of features in binary genre classification tasks (e.g. Literature vs. other genres). The features contributing to AGC for Italian are possibly influenced by the language dealt with. Although it is widely acknowledged that linguistic variation across genres is a language universal, the question is whether similar linguistic features can be expected to play a similar role across languages. While this might be the case for features such as TTR, use of passive voice, tenses or pronouns, features concerned with the ordering of sentence constituents or the overall sentence structure (e.g. parse tree depth or dependency length) may be distinctive of a specific language or language family. Further directions of research thus include a comparison of results in a multilingual perspective as well as across a wider variety of genres.

Acknowledgements

The reported research started in the framework of a Short Term Mobility program of international exchanges funded by CNR, and continued within the project “Smart News, Social sensing for breaking news”, funded by the Tuscany Region under the FAR-FAS 2014 program.

References

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi and Joseph Turian. 2009. Accurate Dependency Parsing with a Stacked Multilayer Perceptron. Proceedings of Evalita 2009.

Douglas Biber. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press, Cambridge, UK.

Douglas Biber and Susan Conrad. 2009. Genre, Register, Style. Cambridge University Press.

Kevin Collins-Thompson. 2014. Computational assessment of text readability: a survey of current and future research. Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics, 165(2), 97–135.

Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. Proceedings of the ACL Workshop on Computational Linguistics for Literature, 59–63.

Scott A. Crossley and Max Louwerse. 2007. Multi-dimensional register classification using bigrams. International Journal of Corpus Linguistics, 12(4), 453–478.

Daniël de Kok. 2011. Discriminative features in reversible stochastic attribute-value grammars. Proceedings of the EMNLP Workshop on Language Generation and Evaluation, 54–63.

Daniël de Kok. 2013. Reversible Stochastic Attribute-Value Grammars. Rijksuniversiteit Groningen.

Felice Dell’Orletta. 2009. Ensemble system for Part-of-Speech tagging. Proceedings of Evalita’09, Evaluation of NLP and Speech Tools for Italian.

Felice Dell’Orletta, Simonetta Montemagni, Eva Maria Vecchi and Giulia Venturi. 2011. Tecnologie linguistico-computazionali per il monitoraggio della competenza linguistica italiana degli alunni stranieri nella scuola primaria e secondaria. Percorsi migranti: uomini, diritto, lavoro, linguaggi, 319–366.

Felice Dell’Orletta, Simonetta Montemagni and Giulia Venturi. 2014. Assessing document and sentence readability in less resourced languages and across textual genres. International Journal of Applied Linguistics, 165(2), 319–366.

Felice Dell’Orletta, Martijn Wieling, Andrea Cimino, Giulia Venturi and Simonetta Montemagni. 2014. Assessing the Readability of Sentences: Which Corpora and Features? Proceedings of 9th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2014), 163–173.

Alex Chengyu Fang and Jing Cao. 2010. Enhanced Genre Classification through Linguistically Fine-Grained POS Tag. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), 223–232.

Chengyu Alex Fang and Jing Cao. 2015. Text Genres and Registers: The Computation of Linguistic Features. Springer.

Aidan Finn and Nicholas Kushmerick. 2003. Learning to classify documents according to genre. Proceedings of the IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, 1–26.

Brett Kessler, Geoffrey Nunberg and Hinrich Schütze. 1997. Automatic detection of text genre. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of ACL (ACL/EACL’97), 223–232.

David Lee. 2001. Genres, registers, text types, domains, and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, (3), 37–72.

Chul Lim, Kong Lee and Gil Kim. 2004. Multiple sets of features for automatic genre classification of web documents. Information Processing and Management, (41), 1263–1276.

R. Marinelli, L. Biagini, R. Bindi, S. Goggi, M. Monachini, P. Orsolini, E. Picchi, S. Rossi, N. Calzolari and A. Zampolli. 2003. The Italian Parole corpus: an overview. Computational Linguistics in Pisa, Special Issue, XVI-XVII, 401–421.

Alexander Mehler, Serge Sharoff and Marina Santini (Eds.). 2010. Genres on the Web. Springer Series - Text, Speech and Language Technology.

Simon Perkins, Kevin Lacker and James Theiler. 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, (3), 1333–1356.

Marina Santini. 2004. Identification of Genres on the Web: a Multi-Faceted Approach. Proceedings of ECIR 2004 (26th European Conference on IR Research), University of Sunderland (UK), 1–8.

Marina Santini. 2007. Enhanced Genre Classification through Linguistically Fine-Grained POS Tag. Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), 223–232.

Serge Sharoff. 2010. In the garden and in the jungle: Comparing genres in the BNC and internet. In Mehler (2010), 149–166.

Efstathios Stamatatos, Nikos Fakotakis and George Kokkinakis. 2001. Automatic text categorization in terms of genre and author. Computational Linguistics, (26), 471–495.

Joel Tetreault, Daniel Blanchard and Aoife Cahill. 2013. A Report on the First Native Language Identification Shared Task. Proceedings of the ACL Workshop on Innovative Use of NLP for Building Educational Applications, 48–57.