              Classifying Italian newspaper text: news or editorial?

                     Pietro Totis                                   Manfred Stede
           Università degli Studi di Udine                Applied Computational Linguistics
        totis.pietro@spes.uniud.it                          University of Potsdam, Germany
                                                             stede@uni-potsdam.de



                    Abstract

English. We present a text classifier that can distinguish Italian news stories from editorials. Inspired by earlier work on English, we built a suitable train/test corpus and implemented a range of features, which can predict the distinction with an accuracy of 89.12%. As demonstrated by the earlier work, such a feature-based approach outperforms simple bag-of-words models when being transferred to new domains. We argue that the technique can also be used to distinguish opinionated from non-opinionated text outside of the realm of newspapers.

Italiano. We present a technique for classifying Italian newspaper articles as either news stories or editorials. Inspired by previous publications on English, we built a corpus suited to this purpose and selected a set of textual features that can distinguish the genre with an accuracy of 89.12%. As demonstrated by previous work, this approach based on properties of the text shows better results than others when transferred to new topics. We also believe that this technique can be used successfully in contexts other than newspaper articles to distinguish texts that do contain the author's opinions from those that do not.

1   Introduction

The computational task of text classification typically targets the question of domain: Is a text about sports, the economy, local politics, etc.? But texts can also be grouped by their genre: Is it a business letter, a personal homepage, a cooking recipe, and so on? In this paper, we perform genre classification on newspaper text and are specifically interested in the question of whether a text communicates a news report or gives an opinion, i.e., whether it is an editorial (or a similar opinionated piece). This task is relevant for many information extraction applications based on newspaper text, and it can also be extended from newspapers to other kinds of text where the distinction “opinionated or not” is of interest, as in sentiment analysis or argumentation mining.

Our starting point is the work by (Krüger et al., 2017), who presented a news/editorial classifier for English. They demonstrated that using linguistically-motivated features leads to better results than bag-of-words or POS-based models when it comes to changing the domain of the text (which newspaper, which time of origin, which type of content). To transfer the approach to Italian, we assembled a suitable corpus for training and testing, selected preprocessing tools, and adapted the features used by the classifier of Krüger et al. Our results are in the same range as those of the original work, indicating that the problem can be solved for Italian in much the same way. We found some differences in the relative feature strengths, however.

After considering related work in Section 2, we describe our corpus (Section 3) and the classification experiments (Section 4), and then conclude.

2   Related Work

In early work, (Karlgren and Cutting, 1994) ran genre classification experiments on the Brown Corpus and employed the distribution of POS-tags as well as surface-based features such as the length of words, sentences and documents, the type/token ratio, and the frequency of the words ‘therefore’, ‘I’, ‘me’, ‘it’, ‘that’ and ‘which’. Among the experiments, the classification of ‘press editorial’
yielded 30% errors, and that of ‘press reportage’ 25%. On the same data, (Kessler et al., 1997) used additional lexical features (Latinate affixes, date expressions, etc.) and punctuation. The authors reported the following accuracies: reportage 83%, editorial 61%, scitech 83%, legal 20%, nonfiction (= other expository writing) 47%, fiction 94%.

The alternative method is to refrain from any linguistic analysis and instead use bag-of-tokens (Yu and Hatzivassiloglou, 2003), bag-of-words (Freund et al., 2006; Finn and Kushmerick, 2003) or bag-of-character-n-gram (Sharoff et al., 2010) models. This has the obvious advantage of being knowledge-free and yields very good results in the domains of the training data, but, as found for instance by Finn and Kushmerick, a bag-of-words model performs very badly in cross-domain experiments. Likewise, (Petrenz and Webber, 2011) show in their replication experiments that this idea is highly vulnerable to topic/domain shifting: the models largely learn from the content words in the training texts, and these can differ greatly from day to day, as the news and the opinions on them reflect current affairs.

(Toprak and Gurevych, 2009) experimented with various lexical features: word-based features included unigrams, bigrams, variants with surrounding tokens, as well as frequency-amended lemma features (using a tf*idf measure); lexicon features exploited the Subjectivity Clues Lexicon (Wilson et al., 2005), SentiWordNet (Esuli and Sebastiani, 2006), and a list of communication and mental verbs. It turned out that the word-based feature class outperforms the other classes, with an accuracy of up to 0.857. Specifically, the tf*idf representation was successful. Such frequency-based representations are known to be effective for classical topic categorization tasks, and this study provides an indication that they may also help for related tasks (especially when the class distribution is skewed). Another finding was that plain unigrams beat the larger n-grams and certain context features.

(Cimino et al., 2017) investigated the role of different feature types in the task of automatic genre classification. In this study, a set of relevant features is extracted across different levels of linguistic description (lexical, morpho-syntactic and syntactic), and a meaningful subset is then selected through an incremental feature selection procedure. The results show that syntactic features are the most effective for discriminating between different text genres.

Finally, as mentioned earlier, we build our work on that of (Krüger et al., 2017), who systematically tested a meaningful set of linguistic features. Among several classifiers from the WEKA library, the SMO classifiers performed best, and the models based on linguistic features outperformed standard bag-of-lemma approaches across different genres, while the latter still performed very well on the same genre on which they were trained. Krüger et al. then tested which features are most predictive for each class, and related these observations to their original expectations.

3   Dataset

For our study, we built a corpus of about 1000 Italian newspaper articles, divided equally into editorials and news articles.

The editorials have been collected from the website of the Italian newspaper “Il Manifesto”; we removed headers and footers that serve as metadata for the newspaper, such as “2017 IL NUOVO MANIFESTO SOCIETÀ COOP. EDITRICE”. The news articles are from the Adige corpus¹, a collection of news stories from the local newspaper L’Adige, categorized into different news topics such as sport, finance or culture. The corpus is also annotated with semantic information related to temporal expressions and entities; however, we have not exploited these annotations, since they are not available for the editorials.

Both corpora have been annotated using the TreeTagger tool² (Schmid, 1994), which provides an annotation of the form WORD, POS-TAG, LEMMA.

¹ http://ontotext.fbk.eu/icab.html
² Future improvements include using a more modern POS tagger such as UDPipe: https://ufal.mff.cuni.cz/udpipe

In order to reproduce the types of classification features used by (Krüger et al., 2017), some lexical resources are needed. The corresponding Italian vocabulary has been collected from different sources:

• A list of connectives, categorized into temporal, causal, contrastive and expansive connectives, has been obtained from LICO (Feltracco et al., 2016), a lexicon for Italian connectives.
• A list of communication verbs (say, argue, state, etc.) has been obtained from the lexical database MultiWordNet³, for a total of 54 entries.

• Sentiment features rely on the Sentix⁴ lexicon for Italian sentiment analysis, which assigns to each lemma a positive and a negative score, plus polarity and intensity scores.

³ http://multiwordnet.fbk.eu/english/home.php
⁴ http://valeriobasile.github.io/twita/sentix.html

4   Experiments

4.1   Main experiment: feature performance

In our experiments, we were primarily interested in comparing the accuracies obtained by (i) linguistic features, (ii) unigram counts, and (iii) part-of-speech tag counts, as well as their combinations, as indicators for classifying the newspaper articles from the dataset. Four different classifiers from the WEKA library have been tested: linear and polynomial SMO (kernel with exponent e = 2), J48 trees and a Naive Bayes classifier, evaluated with 10-fold cross-validation. The SMO classifiers proved to be the most accurate, with the polynomial SMO having marginally higher scores than the linear counterpart. In Table 1 we provide the results obtained with this approach. It can be seen that combining feature sets generally outperforms the individual sets, and in fact the combination of all three yields the best results.

           Acc.    Prec.   Recall  F1
   L       83.35   86.04   79.42   82.60
   P       84.49   85.80   82.50   84.11
   U       82.29   80.29   85.38   82.75
   L+U     87.75   88.88   86.15   87.50
   L+P     87.27   88.46   85.58   87.00
   U+P     87.37   87.31   87.31   87.31
   L+P+U   89.09   89.64   88.27   88.95

Table 1: Linear SMO results. L: linguistic features, P: POS tagging, U: unigrams

   Feature                  Weight
   LING:Pronouns            3.5452
   LING:TemporalConn        2.0647
   LING:SentPos             1.8040
   LING:Negations           1.7301
   LING:SentNeg             1.6609
   LING:Past                1.3686
   LING:ContrastiveConn     1.2816
   LING:Infinitive          1.2230
   LING:SentAdjPol          1.2114
   LING:SentAdjNeg          1.0880
   LING:CondImp             1.0796
   LING:Gerund              1.0653
   LING:Commas              0.9658
   LING:SentInt             0.9593
   LING:Imperfect           0.7801

Table 2: Linguistic features pointing to opinionated text

           Acc.    Prec.   Recall  F1
   L       83.90   84.21   82.75   83.47
   P       64.71   63.08   69.49   66.12
   U       39.17   43.30   70.00   53.50
   L+U     65.00   50.57   73.33   59.86
   L+P     72.57   70.37   71.70   71.03
   U+P     50.83   50.57   73.33   59.86
   L+P+U   61.34   57.83   81.35   67.60

Table 3: Linear SMO results on Amazon reviews and Wikipedia articles

   Feature                  Weight
   LING:Citations           4.8912
   LING:Complexity          2.6676
   LING:PastPerfect         2.1070
   LING:Future              2.0092
   LING:TokenLength         1.8754
   LING:CausalConn          1.7568
   LING:SentPol             0.9710
   LING:VoS                 0.7414
   LING:Imperative          0.6871
   LING:FSPronouns          0.6518
   LING:FPronouns           0.6518
   LING:Modals              0.4237

Table 4: Linguistic features pointing to news text

Our set of linguistic features was modeled closely after that of Krüger et al., because we wanted to know how well it can be transferred to languages other than English. These features can be summarized as follows: text statistics (length of a sentence, frequency of digits, etc.); the ratio of punctuation symbols; the ratio of temporal, causal and other connectives; verb tenses; pronouns (esp. 1st and 2nd person); and sentiment indicators.
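As a rough sketch of the classification step (an illustration only, not our actual WEKA setup), the following Python snippet uses scikit-learn’s support vector machine as a stand-in for WEKA’s linear SMO and evaluates a feature matrix with 10-fold cross-validation, reporting the measures shown in Table 1; file names and the feature layout are assumptions.

    import numpy as np
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X: one row per article, one column per feature (linguistic features,
    #    filtered unigram counts, or POS-tag counts, depending on the setting).
    # y: 1 for editorials, 0 for news stories.
    X = np.load("features.npy")    # hypothetical file names
    y = np.load("labels.npy")

    # Linear SVM as a stand-in for WEKA's linear SMO; the polynomial variant
    # (exponent e = 2) would correspond to SVC(kernel="poly", degree=2).
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

    scores = cross_validate(model, X, y, cv=10,
                            scoring=("accuracy", "precision", "recall", "f1"))
    for name in ("accuracy", "precision", "recall", "f1"):
        print(name, round(scores["test_" + name].mean(), 4))

The rows of Table 1 then correspond to running such an evaluation with the respective subset of feature columns.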
Beyond these, the set also includes the presence of modal verbs and negation operators, morphological features of the matrix verb (tense, mood), as well as some selected part-of-speech and basic text-statistics features, as they had already been proposed in the early related work.

The feature weights assigned by the linear classifier are shown in Tables 2 and 4 in order to highlight which linguistic features represent good indicators of one or the other type of article, and with how much strength.

The results obtained offer interesting analogies with the English corpus analysed by (Krüger et al., 2017). For instance, pronouns, negations and sentiment represent strong indicators of opinionated texts, while complexity, future tense, communication verbs, token length and causal connectives are all features pointing towards news reports in both languages. An interesting difference is the role of the past tense, which for English had been found to correlate more with news than with editorials, whereas in our data it points towards opinionated text (cf. Table 2).

4.2   Testing domain change robustness

We then evaluated another aspect of the task, viz. domain robustness: we split the news corpus into a training set (categories Attualità, Sport and Economia) and a test set (categories Cultura and Trento) in order to evaluate the robustness of the classifier when unseen categories are presented. All the classification scores in this setting show a drop of only about 0.03%, demonstrating that the classifier is not overfitted to the topics of the articles.

Finally, to further test domain change robustness, we tested the classifier – with the model trained on the newspaper corpora – on a set of 60 Amazon reviews versus 60 Wikipedia articles (all randomly chosen). As the results in Table 3 show, the linguistic features remain remarkably robust on this quite different data. The poor results for unigrams are not so surprising, but they have to be taken with a grain of salt, because we employed the same low-frequency filtering as in the main experiment: unigrams that occur fewer than five times are discarded in order to reduce the feature space. This may well lead to poorer results for a small data set like the 120 texts used here.

4.3   Replication

Although we cannot make public all the data we used in this experiment, we have uploaded our code to a public repository⁵ to provide a description of our implementation.

⁵ https://bitbucket.org/PietroTotis/classifying-italian-newspaper-text-news-or-editorial/src/master/

5   Conclusion

We presented, to our knowledge, the first classifier that is able to distinguish ‘news’ from ‘editorials’ in an Italian newspaper corpus. It follows the linguistic-feature-oriented approach proposed for English by (Krüger et al., 2017), who had demonstrated that it outperforms lexical and POS-based models. In our implementation, the distinction between the two sub-genres can be drawn quite reliably, with an accuracy of 89.09%. Our results are comparable to those of Krüger et al., which indicates (again, to our knowledge for the first time) that their feature space can be applied successfully to languages other than English.

Our central concern for this kind of task is robustness against domain changes of different kinds. To this end, Krüger et al. had worked with different newspaper sources and demonstrated the utility of the feature approach in such settings. While we were not able to assemble large corpora from different papers, we ran other experiments in the same vein: the first shows that the system is robust against changes in the sections of the newspaper (i.e., economy versus local affairs, and so on); in the second, we applied the classifier, as trained on the newspaper data, to the distinction between Italian Wikipedia articles and Amazon reviews, where the results remained stable as well. We take this as an indication that the classifier captures a general difference between ‘opinionated’ and ‘non-opinionated’ text, and not just some ‘ad hoc’ phenomena of certain newspaper sub-genres.

References

[Cimino et al.2017] Andrea Cimino, Martijn Wieling, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2017. Identifying predictive features for textual genre classification: the key role of syntax. In Roberto Basili, Malvina Nissim, and Giorgio Satta, editors, Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11–
13, 2017, volume 2006 of CEUR Workshop Proceedings. CEUR-WS.org.

[Esuli and Sebastiani2006] Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC’06), pages 417–422.

[Feltracco et al.2016] Anna Feltracco, Elisabetta Jezek, Bernardo Magnini, and Manfred Stede. 2016. LICO: A lexicon of Italian connectives. In Proceedings of the 3rd Italian Conference on Computational Linguistics (CLiC-it), Napoli.

[Finn and Kushmerick2003] Aidan Finn and Nicholas Kushmerick. 2003. Learning to classify documents according to genre. In IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis.

[Freund et al.2006] Luanne Freund, Charles L. A. Clarke, and Elaine G. Toms. 2006. Towards genre classification for IR in the workplace. In Proceedings of the 1st International Conference on Information Interaction in Context, IIiX, pages 30–36, New York, NY, USA. ACM.

[Karlgren and Cutting1994] Jussi Karlgren and Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of the 15th Conference on Computational Linguistics - Volume 2, COLING ’94, pages 1071–1075, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Kessler et al.1997] Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 32–38. Association for Computational Linguistics.

[Krüger et al.2017] Katarina R. Krüger, Anna Lukowiak, Jonathan Sonntag, Saskia Warzecha, and Manfred Stede. 2017. Classifying news versus opinions in newspapers: Linguistic features for domain independence. Natural Language Engineering, 23(5):687–707.

[Petrenz and Webber2011] Philipp Petrenz and Bonnie Webber. 2011. Stable classification of text genres. Computational Linguistics, 37(2):385–393.

[Schmid1994] Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester.

[Sharoff et al.2010] Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The Web Library of Babel: evaluating genre collections. In Proceedings of the Seventh Conference on International Language Resources and Evaluation, pages 3063–3070.

[Toprak and Gurevych2009] Cigdem Toprak and Iryna Gurevych. 2009. Document level subjectivity classification experiments in DEFT’09 challenge. In Proceedings of the DEFT’09 Text Mining Challenge, pages 89–97, Paris, France.

[Wilson et al.2005] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT-EMNLP 2005.

[Yu and Hatzivassiloglou2003] Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 129–136. Association for Computational Linguistics.