<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classifying Italian newspaper text: news or editorial?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pietro Totis</string-name>
          <email>totis.pietro@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manfred Stede</string-name>
          <email>stede@uni-potsdam.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Applied Computational Linguistics, University of Potsdam</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Udine</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2006</year>
      </pub-date>
      <volume>2006</volume>
      <fpage>417</fpage>
      <lpage>422</lpage>
      <abstract>
        <p>English. We present a text classifier that can distinguish Italian news stories from editorials. Inspired by earlier work on English, we built a suitable train/test corpus and implemented a range of features, which can predict the distinction with an accuracy of 89.12%. As demonstrated by the earlier work, such a feature-based approach outperforms simple bag-of-words models when transferred to new domains. We argue that the technique can also be used to distinguish opinionated from non-opinionated text outside the realm of newspapers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italiano (English translation). We present a technique for
classifying Italian newspaper articles as news stories or editorials.
Drawing on earlier publications on English, we built a corpus suited to
the purpose and selected a set of textual features that can distinguish
the genre with an accuracy of 89.12%. As shown by the earlier work, this
approach based on properties of the text gives better results than others
when transferred to new topics. We also believe that this technique can
be used successfully in contexts other than newspaper articles, to
distinguish texts that contain the author's opinions from those that do
not.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>The computational task of text classification typically
targets the question of domain: is a text about sports, the economy,
local politics, etc.? But texts can also be grouped by their genre: is it
a business letter, a personal homepage, a cooking recipe, and so on? In
this paper, we perform genre classification on newspaper text and are
specifically interested in whether a text communicates a news report or
gives an opinion, i.e., whether it is an editorial (or some similar
opinionated piece). This task is relevant for many information extraction
applications based on newspaper text, and it can also be extended from
newspapers to other kinds of text where the distinction “opinionated or
not” is of interest, as in sentiment analysis or argumentation mining.</p>
      <p>
        Our starting point is the work by
        <xref ref-type="bibr" rid="ref1 ref2">Krüger et al. (2017)</xref>
        , who presented a news/editorial classifier for English. They
demonstrated that linguistically motivated features lead to better
results than bag-of-words or POS-based models when the domain of the
text changes (which newspaper, which time of origin, which type of
content). To transfer the approach to Italian, we assembled a suitable
corpus for training and testing, selected preprocessing tools, and
adapted the features used by the classifier of Krüger et al. Our results
are in the same range as those of the original work, indicating that the
problem can be solved for Italian in much the same way. We found some
differences in the relative feature strengths, however.
      </p>
      <p>After considering related work in Section 2, we
describe our corpus (Section 3) and the
classification experiments (Section 4), and then conclude.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>
        In early work, Karlgren and Cutting (1994) ran genre
classification experiments on the Brown Corpus and employed the
distribution of POS tags as well as surface-based features such as
length of words, sentences and documents, type/token ratio, and the
frequency of the words ‘therefore’, ‘I’, ‘me’, ‘it’, ‘that’ and ‘which’.
Among the experiments, the classification of ‘press editorial’ yielded
30% errors, and that of ‘press reportage’ 25%. On the same data,
        <xref ref-type="bibr" rid="ref1">Kessler et al. (1997)</xref>
        used additional lexical features (Latinate affixes, date
expressions, etc.) and punctuation. The authors reported these
accuracies: reportage 83%, editorial 61%, scitech 83%, legal 20%,
nonfiction (= other expository writing) 47%, fiction 94%.
      </p>
      <p>
        The alternative method is to refrain from any
linguistic analysis and instead use bag-of-tokens
(2003), bag-of-words (Freund et al., 2006),
        <xref ref-type="bibr" rid="ref8">(Finn
and Kushmerick, 2003)</xref>
        or
bag-of-character-n-gram
        <xref ref-type="bibr" rid="ref5">(Sharoff et al., 2010)</xref>
        models. This has the obvious advantage of requiring no linguistic
knowledge and yields very good results in the domains of the training
data but, as found for instance by Finn and Kushmerick, a bag-of-words
model performs very badly in cross-domain experiments. Likewise,
        <xref ref-type="bibr" rid="ref3">Petrenz and Webber (2011)</xref>
        show in their replication experiments that this approach is highly
vulnerable to topic and domain shifts: the models largely learn from the
content words in the training texts, and these can differ greatly from
day to day, as the news and the opinions on it reflect current affairs.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref6">Toprak and Gurevych (2009)</xref>
        experimented with various lexical features. Word-based features
included unigrams, bigrams, variants with surrounding tokens, as well as
frequency-weighted lemma features (using a tf*idf measure); lexicon
features exploited the Subjectivity Clues Lexicon
        <xref ref-type="bibr" rid="ref7">(Wilson et al., 2005)</xref>
        , SentiWordNet (Esuli and Sebastiani, 2006), and a list of
communication and mental verbs. It turned out that the word-based
features outperformed the other classes, with an accuracy of up to
0.857. The tf*idf representation was particularly successful. Such
frequency-based representations are known to be effective for classical
topic categorization tasks, and this study provides an indication that
they may also help for related tasks (especially when the class
distribution is skewed). Another finding was that plain unigrams beat
the larger n-grams and certain context features.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref1 ref2">Cimino et al. (2017)</xref>
        investigated the role of different feature types in the task of
Automatic Genre Classification. In this study, a set of relevant
features is extracted across different levels of linguistic description
(lexical, morpho-syntactic and syntactic), and a meaningful subset is
then selected through an incremental feature selection procedure. The
results show that syntactic features are the most effective for
discriminating between different text genres.
      </p>
      <p>
        Finally, as mentioned earlier, we build our work
on that of
        <xref ref-type="bibr" rid="ref1 ref2">Krüger et al. (2017)</xref>
        , who systematically tested a meaningful set of linguistic features.
Among several classifiers from the WEKA libraries, the SMO classifiers
performed best, and the models based on linguistic features outperformed
standard bag-of-lemma approaches across different genres, although the
latter still performed very well on the same genre on which they were
trained. Krüger et al. then tested which features are most predictive
for each class, and related these observations to their original
expectations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 Dataset</title>
      <p>For our study, we built a corpus of about 1000
Italian newspaper articles, which are equally divided
into editorials and news articles.</p>
      <p>The editorials were collected from the website of the
Italian newspaper “Il Manifesto”; we removed headers and footers that
serve as metadata for the newspaper, such as “2017 IL NUOVO MANIFESTO
SOCIETÀ COOP. EDITRICE”. The news articles are from the Adige
corpus<sup>1</sup>, a collection of news stories from the local
newspaper L’Adige, categorized into different topics of news such as
sport, finance or culture. The corpus is also annotated with semantic
information related to temporal expressions and entities. However, we
have not exploited these annotations, since they were not available for
the editorials.</p>
      <p>
        Both corpora have been annotated using the
TreeTagger tool<sup>2</sup>
        <xref ref-type="bibr" rid="ref4">(Schmid, 1994)</xref>
        , which provides
an annotation of the form WORD, POS-TAG,
LEMMA.
      </p>
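      <p>As a minimal sketch of how such annotations might be read, the following Python fragment parses tagger output into (word, POS tag, lemma) triples. It assumes one token per line with three tab-separated fields; the function name, the sample sentence and the tag labels in it are illustrative assumptions, not taken from our implementation.</p>
      <preformat preformat-type="code">
```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged_text(output: str) -> List[Token]:
    """Parse tab-separated WORD / POS-TAG / LEMMA lines into Token triples.

    Lines that do not have exactly three fields (e.g. blank lines or
    markup) are skipped rather than guessed at.
    """
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical tagger output; the tag labels are for illustration only.
sample = "Il\tDET:def\til\ngoverno\tNOM\tgoverno\nha\tVER:pres\tavere\n"
tokens = parse_tagged_text(sample)
print([t.lemma for t in tokens])
```
      </preformat>
      <p>Keeping the three fields together per token makes it straightforward to compute lemma-based and tag-based features from the same pass over the text.</p>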
      <p>
        In order to reproduce the types of classification
features used by
        <xref ref-type="bibr" rid="ref1 ref2">Krüger et al. (2017)</xref>
        , some
lexical resources are needed. The corresponding
Italian vocabulary has been collected from different
sources:
      </p>
      <p>A list of connectives, categorized into
temporal, causal, contrastive and expansive
connectives, has been obtained from LICO
(Feltracco et al., 2016), a lexicon for Italian
connectives.</p>
      <p><sup>1</sup> http://ontotext.fbk.eu/icab.html</p>
      <p><sup>2</sup> Future improvements include using a more modern POS tagger such as UDPipe: https://ufal.mff.cuni.cz/udpipe</p>
      <p>A list of communication verbs (say, argue,
state, etc.) has been obtained from the
lexical database MultiWordNet<sup>3</sup>, for a total of 54
entries.</p>
      <p>[Table 1: classification results per feature set. Rows: L, P, U, L+U, L+P, U+P, L+P+U.]</p>
      <p>Sentiment features rely on the Sentix<sup>4</sup> lexicon
for Italian sentiment analysis, which assigns
to each lemma a positive and a negative score,
plus a score of polarity and intensity.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Experiments</title>
      <p>[Feature weight table, Feature column only: LING:PRONOUNS, LING:TEMPORALCONN, LING:SENT POS, LING:NEGATIONS, LING:SENT NEG, LING:PAST, LING:CONTRASTIVECONN, LING:INFINITIVE, LING:SENT ADJ POL, LING:SENT ADJ NEG, LING:CONDIMP, LING:GERUND, LING:COMMAS, LING:SENT INT, LING:IMPERFECT.]</p>
      <p><sup>3</sup> http://multiwordnet.fbk.eu/english/home.php</p>
      <p><sup>4</sup> http://valeriobasile.github.io/twita/sentix.html</p>
      <p>… part-of-speech tag counts, and their combinations as
indicators for classifying the newspaper articles from the dataset. Four
different classifiers from the WEKA library have been tested: linear and
polynomial SMO (kernel with exponent e = 2), J48 trees and a Naive Bayes
classifier, each evaluated with 10-fold cross-validation. The SMO
classifiers proved to be the most accurate, with the polynomial SMO
having marginally higher scores than its linear counterpart. In Table 1
we provide the results obtained with that approach. It can be seen that
combining feature sets generally outperforms the individual sets, and in
fact the combination of all three yields the best results.</p>
      <p>Our set of linguistic features was modeled
closely after that of Krüger et al., because we
wanted to know how well it can be transferred to
languages other than English. These features can
be summarized as follows: text statistics (length
of a sentence, frequency of digits, etc.); ratio of
punctuation symbols; ratio of temporal, causal and
other connectives; verb tenses; pronouns (esp. 1st
and 2nd person); and sentiment indicators.
The set also includes the presence of modal verbs
and negation operators, morphological features of
the matrix verb (tense, mood), as well as some
selected part-of-speech and basic text statistic
features, as already proposed in the
early related work.</p>
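      <p>To make the flavour of these features concrete, the following Python sketch computes a handful of ratio features over one tagged document. It is an illustration under stated assumptions, not our implementation: the function name, the feature names and the two tiny word lists are hypothetical stand-ins for the lexical resources described in Section 3, and the tag labels in the sample are placeholders.</p>
      <preformat preformat-type="code">
```python
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, pos_tag, lemma)

# Illustrative mini-lexicons (assumptions, not the resources used in the paper).
FIRST_SECOND_PRONOUNS = {"io", "tu", "noi", "voi"}
NEGATIONS = {"non", "mai", "nessuno", "niente"}


def linguistic_features(sentences: List[List[Token]]) -> Dict[str, float]:
    """Compute a few ratio features over one tokenized document."""
    tokens = [tok for sent in sentences for tok in sent]
    n_tokens = len(tokens) or 1   # guard against empty documents
    n_sents = len(sentences) or 1
    lemmas = [lemma.lower() for _, _, lemma in tokens]
    words = [word for word, _, _ in tokens]
    return {
        "avg_sentence_length": len(tokens) / n_sents,
        "pronoun_ratio": sum(l in FIRST_SECOND_PRONOUNS for l in lemmas) / n_tokens,
        "negation_ratio": sum(l in NEGATIONS for l in lemmas) / n_tokens,
        "comma_ratio": words.count(",") / n_tokens,
        "digit_ratio": sum(w.isdigit() for w in words) / n_tokens,
    }


# Two short hypothetical sentences; tag labels are placeholders.
doc = [
    [("Noi", "PRO", "noi"), ("non", "NEG", "non"), ("crediamo", "VER", "credere"),
     (",", "PON", ","), ("mai", "ADV", "mai")],
    [("Sono", "VER", "essere"), ("passati", "VER", "passare"), ("10", "NUM", "10"),
     ("anni", "NOM", "anno"), (".", "SENT", ".")],
]
feats = linguistic_features(doc)
print(feats)
```
      </preformat>
      <p>Normalizing counts by document length, as in this sketch, keeps the features comparable between short news items and longer editorials.</p>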
      <p>The feature weights assigned by the linear
classifier are shown in Tables 2 and 4 in order to
highlight which linguistic features are good
indicators of one or the other type of article, and
how strong each indicator is.</p>
      <p>
        The results obtained offer interesting analogies
with the English corpus analysed by
        <xref ref-type="bibr" rid="ref1 ref2">Krüger et al. (2017)</xref>
        . For instance, pronouns, negations and
sentiment are strong indicators of
opinionated texts, while complexity, future,
communication verbs, token length and causal connectives are
all features pointing towards news reports in both
languages. An interesting difference is the role of
past tense, which for English had been found to
correlate more with news than with editorials, whereas
here it plays a different role.
      </p>
      <sec id="sec-5-1">
        <title>4.2 Testing domain change robustness</title>
        <p>We then evaluated another aspect of the task,
viz. domain robustness: we split the news corpus
into a training set (categories Attualità, Sport and
Economia) and a test set (categories Cultura and
Trento) in order to evaluate the robustness of the
classifier when it is given articles from unseen categories.
All classification results in this setting
show a drop in performance of only about 0.03%,
which suggests that the classifier
is not overfitted to the topics of the articles.</p>
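        <p>The category-based split can be sketched as follows. This is a minimal illustration that assumes each news article is available as a record with a category label; the record layout, the field names and the function name are hypothetical, and only the category names come from the corpus.</p>
        <preformat preformat-type="code">
```python
from typing import Dict, List, Tuple

Article = Dict[str, str]  # hypothetical record with "category" and "text" keys

TRAIN_CATEGORIES = {"Attualità", "Sport", "Economia"}
TEST_CATEGORIES = {"Cultura", "Trento"}


def split_by_category(articles: List[Article]) -> Tuple[List[Article], List[Article]]:
    """Split articles so that test categories never occur in training."""
    train = [a for a in articles if a["category"] in TRAIN_CATEGORIES]
    test = [a for a in articles if a["category"] in TEST_CATEGORIES]
    return train, test


# Toy records standing in for news articles.
articles = [
    {"category": "Sport", "text": "..."},
    {"category": "Cultura", "text": "..."},
    {"category": "Economia", "text": "..."},
    {"category": "Trento", "text": "..."},
    {"category": "Attualità", "text": "..."},
]
train, test = split_by_category(articles)
print(len(train), len(test))
```
        </preformat>
        <p>Because the two category sets are disjoint, any topic vocabulary the model picks up during training cannot be reused directly on the test articles.</p>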
        <p>Finally, to further test domain change
robustness, we applied the classifier, with the model
trained on the newspaper corpora, to a set of 60
Amazon reviews versus 60 Wikipedia articles (all
randomly chosen). As the results in Table 3 show,
the linguistic features remain remarkably robust
on this quite different data as well. The bad results for
unigrams are not so surprising, but
they have to be taken with a grain of salt, because
we employed the same low-frequency filtering as
in the main experiment: unigrams that occur fewer
than five times are not considered, in order
to reduce the feature space. This might well lead
to poorer results for a small data set like the 120
texts used here.</p>
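        <p>The low-frequency filter described above can be sketched in a few lines of Python. The sketch assumes that occurrences are counted over the whole training collection and that documents are given as lists of lemmas; the function names and the toy data are hypothetical.</p>
        <preformat preformat-type="code">
```python
from collections import Counter
from typing import List

MIN_COUNT = 5  # unigrams seen fewer than five times are dropped


def build_vocabulary(documents: List[List[str]], min_count: int = MIN_COUNT) -> List[str]:
    """Keep only unigrams whose total count reaches min_count."""
    counts = Counter(unigram for doc in documents for unigram in doc)
    return sorted(w for w, c in counts.items() if c >= min_count)


def unigram_vector(doc: List[str], vocabulary: List[str]) -> List[int]:
    """Count vector of one document over the filtered vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocabulary]


# Toy collection: "governo" occurs 5 times, "crisi" twice, "sport" once.
docs = [
    ["governo", "governo", "governo", "crisi"],
    ["governo", "governo", "crisi", "sport"],
]
vocab = build_vocabulary(docs)
print(vocab, unigram_vector(docs[0], vocab))
```
        </preformat>
        <p>On a collection of only 120 texts, a fixed threshold of this kind removes a much larger share of the unigrams than it does on the main corpus, which is consistent with the caveat above.</p>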
      </sec>
      <sec id="sec-5-2">
        <title>4.3 Replication</title>
        <p>Although we cannot make public all the data we
used in this experiment, we uploaded our code to
a public repository<sup>5</sup> to provide a description of our
implementation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Conclusion</title>
      <p>
        We presented, to our knowledge, the first
classifier that is able to distinguish ‘news’ from
‘editorials’ in an Italian newspaper corpus. It follows
the linguistic feature-oriented approach proposed for English by
        <xref ref-type="bibr" rid="ref1 ref2">Krüger et al. (2017)</xref>
        , who had demonstrated that it outperforms lexical and POS-based
models. In our implementation, with an accuracy
of 89.09%, the distinction between the two
subgenres can be drawn quite reliably. Our results are
comparable to those of Krüger et al., which
indicates (again, to our knowledge for the first time)
that their feature space can be applied successfully
to languages other than English.
      </p>
      <p>Our central concern for this kind of task is
robustness against domain changes of different
kinds. To this end, Krüger et al. had worked with
different newspaper sources and demonstrated the
utility of the feature approach in such settings.
While we were not able to assemble large corpora
from different papers, we ran two other experiments in
the same vein. The first shows that the
system is robust against changing the sections of the
newspaper (i.e., economy versus local affairs, and
so on). In the second, we applied the classifier,
as trained on the newspaper data, to the distinction
between Italian Wikipedia articles and Amazon
reviews, where the results remained stable as well.
We take this as an indication that the classifier
captures a general difference between ‘opinionated’
and ‘non-opinionated’ text, and not just some ‘ad
hoc’ phenomena of certain newspaper sub-genres.</p>
      <p><sup>5</sup> https://bitbucket.org/PietroTotis/classifying-italian-newspaper-text-newsor-editorial/src/master/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Cimino et al.2017]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Cimino</surname>
          </string-name>
          , Martijn Wieling, Felice Dell'Orletta,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Identifying predictive features for textual genre classification: the key role of syntax</article-title>
          . In Roberto Basili, Malvina Nissim, and Giorgio Satta, editors,
          <source>Proceedings of the Fourth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2017</year>
          ), Rome, Italy, December 11-
          [Kessler et al.1997]
          <string-name>
            <given-names>Brett</given-names>
            <surname>Kessler</surname>
          </string-name>
          , Geoffrey Nunberg, and Hinrich Schütze.
          <year>1997</year>
          .
          <article-title>Automatic detection of text genre</article-title>
          .
          In
          <source>Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Krüger et al.2017]
          Katarina R. Krüger, Anna Lukowiak, Jonathan Sonntag, Saskia Warzecha, and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Stede</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Classifying news versus opinions in newspapers: Linguistic features for domain independence</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>23</volume>
          (
          <issue>5</issue>
          ):
          <fpage>687</fpage>
          -
          <lpage>707</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Petrenz and Webber2011] Philipp Petrenz and Bonnie Webber
          .
          <year>2011</year>
          .
          <article-title>Stable classification of text genres</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>37</volume>
          (
          <issue>2</issue>
          ):
          <fpage>385</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Schmid1994]
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>In Proceedings of International Conference on New Methods in Language Processing</source>
          , Manchester.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Sharoff et al.2010]
          <string-name>
            <given-names>Serge</given-names>
            <surname>Sharoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhili</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Katja</given-names>
            <surname>Markert</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The Web Library of Babel: evaluating genre collections</article-title>
          .
          <source>Proceedings of the Seventh Conference on International Language Resources and Evaluation</source>
          , pages
          <fpage>3063</fpage>
          -
          <lpage>3070</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Toprak and Gurevych2009] Cigdem Toprak and Iryna Gurevych
          .
          <year>2009</year>
          .
          <article-title>Document level subjectivity classification experiments in deft'09 challenge</article-title>
          .
          <source>In Proceedings of the DEFT'09 Text Mining Challenge</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>97</lpage>
          , Paris, France.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Wilson et al.2005] Theresa Wilson, Janyce Wiebe, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Recognizing contextual polarity in phrase-level sentiment analysis</article-title>
          .
          <source>In Proceedings of HLT-EMNLP-2005.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Yu and Hatzivassiloglou2003] Hong Yu and Vasileios Hatzivassiloglou
          .
          <year>2003</year>
          .
          <article-title>Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences</article-title>
          .
          <source>In Proceedings of the 2003 conference on Empirical methods in natural language processing</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>