=Paper= {{Paper |id=Vol-2765/169 |storemode=property |title=CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper) |pdfUrl=https://ceur-ws.org/Vol-2765/paper169.pdf |volume=Vol-2765 |authors=Lorenzo De Mattei,Michele Cafagna,Felice Dell'Orletta,Malvina Nissim,Albert Gatt |dblpUrl=https://dblp.org/rec/conf/evalita/MatteiCDNG20 }} ==CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper)== https://ceur-ws.org/Vol-2765/paper169.pdf
                             CHANGE-IT @ EVALITA 2020:
                          Change Headlines, Adapt News, GEnerate
Lorenzo De Mattei
University of Pisa; CLCG, University of Groningen; ItaliaNLP Lab, ILC-CNR, Pisa, Italy
lorenzo.demattei@di.unipi.it

Michele Cafagna
Aptus.AI, Pisa, Italy; University of Malta, Malta
michele@aptus.ai

Felice Dell’Orletta
ItaliaNLP Lab, ILC-CNR, Pisa, Italy
felice.dellorletta@ilc.cnr.it

Malvina Nissim
CLCG, University of Groningen, The Netherlands
m.nissim@rug.nl

Albert Gatt
University of Malta, Malta
albert.gatt@um.edu.mt


Abstract

We propose a generation task for Italian, more specifically a style transfer task for headlines of Italian newspapers. This is the first shared task on generation included in the EVALITA evaluation framework. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this aim in mind, we release to the participating teams not only training data, but also a baseline sequence to sequence model that performs the task, in order to help everyone get started, even those not accustomed to Natural Language Generation (NLG) approaches. At the same time, we explore the complex issue of automatic evaluation of generated text, which is receiving particular attention in the NLG community.

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Task and Motivation

We propose a generation task for Italian in the context of the EVALITA 2020 campaign (Basile et al., 2020). More specifically, we design a style transfer task for headlines of Italian newspapers.

We believe it is the first time that a shared task on generation is offered in the context of EVALITA. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this goal in mind, we release to the potential participating teams not only training data, but also a baseline sequence to sequence model that performs the task, in order to help everyone get started, even those not yet accustomed to generation models. This baseline model casts the style transfer problem as an extreme summarisation task, which shows how versatile the problem is in terms of possible approaches. At the same time, this task will help to further explore the complex issue of evaluation of generated text, which is receiving particular attention in the international Natural Language Generation community (Gatt and Krahmer, 2018; van der Lee et al., 2019).

Task The task is cast as a “headline translation” problem, and it is as follows. Given a collection of headlines from two Italian newspapers at opposite ends of the political spectrum, call them G and R, change all G-headlines to headlines in style R, and all R-headlines to headlines in style G.

In the context of this task we need to take care of two crucial aspects: data and evaluation. Details on data are provided in Section 2, and on evaluation in Section 3.

2   Data

We have collected news from two of the most important Italian newspapers situated at opposite ends of the political spectrum, namely la Repubblica (left) and Il Giornale (right), totalling approximately 152,000 article-headline pairs, with the two newspapers equally represented. Although the task only concerns headline change, the teams will receive both the headlines and their respective full articles.

Leveraging an alignment procedure described below (see Cafagna et al. (2019) for further details), we account for potential topic biases in the two newspapers, and we split the data set into strongly aligned, weakly aligned and non-aligned news. This information is useful in the creation of the datasets that we need to train our three evaluation classifiers (see Section 3). Additionally, it could help to better disentangle newspaper-specific style.

Alignment We compute the tf-idf vectors of all the articles of both newspapers and create subsets of relevant news by filtering on date, i.e. considering only news published within approximately the same short temporal range in the two sources. On the tf-idf vectors we then compute cosine similarities for all news in the resulting subset, rank them, and retain only the alignments that are above a certain threshold. The threshold is chosen taking into consideration a trade-off between number of documents and quality of alignment. We choose two different thresholds: one is stricter (≥ 0.5) and we use it to select the best alignments (strict alignments); the other one is looser (≥ 0.185 and < 0.5), and we refer to the alignments it selects as weak alignments. We consider the rest as basically not aligned.

 cosine score    newspaper   alignment
 0.96 (strict)   rep         Estroverso o nevrotico? Lo dice la foto scelta per il profilo social
                             en:[Extrovert or neurotic? The photo chosen for the social profile says so]
                 gio         L’immagine del profilo usata nei social network rivela la nostra personalità
                             en:[The profile picture used in social networks reveals our personality]
 0.5 (strict)    rep         Egitto, governo si dimette a sorpresa
                             en:[Egypt, government resigns surprisingly]
                 gio         Egitto, il governo si dimette
                             en:[Egypt, government resigns]
 0.185 (loose)   rep         Elezioni presidenziali Francia, la Chiesa non si schiera né per Macron né per Le Pen
                             en:[Presidential elections France, the Church does not take sides either for Macron or for Le Pen]
                 gio         Il primo voto con l’incubo Isis ma il terrorismo esce sconfitto
                             en:[The first vote with the Isis nightmare but terrorism comes out defeated]

Table 1: Examples of alignments between La Repubblica and Il Giornale, extracted with different similarity scores. The second and the third examples would fall into the strict and the loose sets, respectively, according to the thresholds used to split the alignments. The first two headline pairs are well aligned, while the third pair has a very loose alignment.

Data splits We split the dataset into strongly aligned news, which are selected using the stricter threshold (∼20K aligned pairs, set A∗ in Figure 1a), and weakly aligned and non-aligned news (∼100K article-headline pairs equally distributed among the two newspapers, set R in Figure 1a).

The strictly aligned data is further split as shown in Figure 1a; this yields a total of four sets over the whole dataset (A1, A2, A3, and R). A2 is left aside and used as test set for the final style transfer task. The remaining three sets are used for training the evaluation classifiers and the system for the target task. These are shown in Figure 1b. Note that all sets always contain the headlines’ respective full articles as well, though these are not necessarily used.

Format The data is distributed in the form of one CSV file with the following fields:
    id, headline, article, label [R,G]

3   Evaluation

Human evaluation is generally viewed as the most desirable method to assess generated text (Novikova et al., 2018; van der Lee et al., 2019). However, human evaluation is not always a viable option, due to limited resources, but also due to the fact that humans might not be capable of reliably assessing the task at hand. Related to the current challenge, De Mattei et al. (2020a) have shown that people find it difficult to identify subtle stylistic differences between texts.

Automatic, reliable metrics should therefore also be sought (Novikova et al., 2017). For our task, we propose a fully automatic strategy based on a series of classifiers to assess style strength and content preservation. For style, we train a single classifier (main). For content, we train two classifiers that perform two ‘sanity checks’: one ensures that the two headlines (original and transformed) are still compatible (HH classifier); the other ensures that the headline is still compatible with the original article (AH classifier). See also Figure 1b.

In what follows we describe these classifiers in
more detail. When discussing baseline results, we will show how the contribution of each classifier is crucial towards a comprehensive evaluation.

(a) Overall data splits

(b) Training/test sets

 EVALUATION   train & test   main   R+A3+A1
              train & test   HH     A1 + random pairs
              train & test   AH     R+A3+A1
 TASK         train                 R+A3
              test                  A2

Figure 1: Data splits and their use in the different training sets

Main classifier The main classifier uses a pre-trained BERT (Devlin et al., 2019) encoder with a linear classifier on top, fine-tuned with a batch size of 256 and sequences truncated at 32 tokens for 6 epochs with learning rate 1e-05. Given a headline, this classifier can distinguish the two sources with an f-score of approximately 80% (see Table 2). Since style transfer is deemed successful if the original style is lost in favour of the target style, we use this classifier to assess how many times a style transfer system manages to reverse the main classifier’s decisions.

HH classifier This classifier checks compatibility between the original and the generated headline. We use the same architecture as for the main classifier with a slightly different configuration: max. sequence length of 64 tokens, batch size of 128 for 2 epochs (early-stopped), with learning rate 1e-05. Being trained on strictly aligned data as positive instances (A1), with a corresponding amount of random pairs as negative instances, it should learn whether two headlines describe the same content or not. Performance on gold data is .96 (Table 2).

AH classifier This classifier performs yet another content-related check. It takes a headline and its corresponding article, and tells whether the headline is appropriate for the article. The classifier is trained on article-headline pairs from both the strongly aligned and the weakly and non-aligned instances (R+A3+A1, Figure 1b). At test time, the generated headline is checked for compatibility against the source article. We use the same base model as for the main and HH classifiers, with a batch size of 8, the same learning rate and 6 epochs. Performance on gold data is ≥ .97 (Table 2).

                   prec    rec     f-score
 main   rep        0.77    0.83    0.80
        gio        0.84    0.78    0.81
 HH     match      0.98    0.95    0.96
        no match   0.95    0.98    0.96
 AH     match      0.96    0.99    0.98
        no match   0.99    0.96    0.97

Table 2: Performance of the evaluation classifiers on gold data.

Overall compliancy We calculate a compliancy score which assesses the proportion of times the following three outcomes are all successful: (i) the HH classifier predicts ‘match’; (ii) the AH classifier predicts ‘match’; (iii) the main classifier’s decision is reversed. As an upper bound, we find the compliancy score for gold data at 74.3% for transfer from La Repubblica to Il Giornale (rep2gio), and 78.1% for the opposite direction (gio2rep).

4   Baseline System

We developed a baseline system using a summarisation approach, where headlines are viewed as an extreme case of summarisation and generated from the article. We exploit article-headline generators trained on opposite sources to do the transfer, as done in De Mattei et al. (2020b). The advantage of this approach is that in principle it does not require parallel data for training.

Specifically, we use two pointer-generator networks (See et al., 2017), which include a pointing mechanism able to copy words from the
source as well as pick them from a fixed vocabulary, thereby allowing better handling of out-of-vocabulary words.

One model is trained on the la Repubblica portion of the training set, the other on Il Giornale. In a style transfer setting we use these models as follows: given a headline from Il Giornale, for example, the model trained on la Repubblica can be run over the corresponding article from Il Giornale to generate a headline in the style of la Repubblica, and vice versa.

 Il Giornale → La Repubblica

    E in Sicilia è scattata l’allerta rossa
    [en: And in Sicily it’s now red alert]
      −→ Migranti, la Protezione civile continua dimenticata
         [en: Migrants, the Civil Protection Department goes on forgotten]

    Nozze gay, toghe contro i sindaci: “Le trascrizioni sono illegittime”
    [en: Gay marriages, gowns against mayors: “Transcriptions are not valid”]
      −→ Il Consiglio di Stato boccia le nozze gay all’estero
         [en: The State Council rejects gay marriages abroad]

 La Repubblica → Il Giornale

    Castelnuovo, lo sdegno di cittadini e associazioni: “Attacco all’integrazione che funziona”
    [en: Castelnuovo, the indignation of citizens and associations: “Attack on the integration that works”]
      −→ I migranti non sono più rifugiati
         [en: Migrants are not refugees anymore]

    Da Renzi a Di Maio, ecco il reddito dichiarato dai politici italiani. Fedeli il ministro con l’imponibile più alto
    [en: From Renzi to Di Maio: here is the income declared by the Italian politicians. Fedeli is the minister with the highest taxable income]
      −→ Grillo e Giggino italiani conquistano l’elenco dei redditi italiani
         [en: Grillo and Giggino Italians conquer the list of Italian incomes]

Table 3: Examples of headlines generated by the baseline system.

The results of the baseline system, measured as performance of each classifier as well as the overall compliancy score, are reported in Table 4.

             HH      AH      Main    compl.
 rep2gio     .649    .876    .799    .449
 gio2rep     .639    .871    .435    .240
 avg         .644    .874    .616    .345

Table 4: Baseline performance on test data.

5   Outlook

This shared task proposal was intended to stimulate research in NLG, with a specific focus on style transfer and automatic evaluation, in the Italian community. Over ten teams officially expressed their interest in participating in the shared task, but in the end no runs were submitted. We do hope that the materials developed in the context of this challenge will nevertheless be of use to promote research in a field that is still under-researched in the Italian NLP landscape. All materials are available: https://github.com/michelecafagna26/CHANGE-IT.
References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Michele Cafagna, Lorenzo De Mattei, and Malvina Nissim. 2019. Embeddings shifts as proxies for different word use in italian newspapers. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), Bari, Italy.

Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020a. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).

Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020b. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark, September. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana, June. Association for Computational Linguistics.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan, October–November. Association for Computational Linguistics.