=Paper= {{Paper |id=Vol-2765/169 |storemode=property |title=CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper) |pdfUrl=https://ceur-ws.org/Vol-2765/paper169.pdf |volume=Vol-2765 |authors=Lorenzo De Mattei,Michele Cafagna,Felice Dell'Orletta,Malvina Nissim,Albert Gatt |dblpUrl=https://dblp.org/rec/conf/evalita/MatteiCDNG20 }} ==CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper)== https://ceur-ws.org/Vol-2765/paper169.pdf

CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate

Lorenzo De Mattei
University of Pisa; CLCG, University of Groningen; ItaliaNLP Lab, ILC-CNR, Pisa, Italy
lorenzo.demattei@di.unipi.it

Michele Cafagna
Aptus.AI, Pisa, Italy; University of Malta, Malta
michele@aptus.ai

Felice Dell’Orletta
ItaliaNLP Lab, ILC-CNR, Pisa, Italy
felice.dellorletta@ilc.cnr.it

Malvina Nissim
CLCG, University of Groningen, The Netherlands
m.nissim@rug.nl

Albert Gatt
University of Malta, Malta
albert.gatt@um.edu.mt
Abstract

We propose a generation task for Italian: more specifically, a style transfer task for headlines of Italian newspapers. This is the first shared task on generation included in the EVALITA evaluation framework. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this aim in mind, we release to the participating teams not only training data, but also a baseline sequence-to-sequence model that performs the task, in order to help everyone get started, even when not accustomed to Natural Language Generation (NLG) approaches. At the same time, we explore the complex issue of automatic evaluation of generated text, which is receiving particular attention in the NLG community.

1 Task and Motivation

We propose a generation task for Italian in the context of the EVALITA 2020 campaign (Basile et al., 2020). More specifically, we design a style transfer task for headlines of Italian newspapers.

We believe it is the first time that a shared task on generation is offered in the context of EVALITA. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this goal in mind, we release to the potential participating teams not only training data, but also a baseline sequence-to-sequence model that performs the task, in order to help everyone get started, even when not yet accustomed to generation models. This baseline model casts the style transfer problem as an extreme summarisation task, which shows how versatile the problem is in terms of possible approaches. At the same time, this task will help to further explore the complex issue of evaluation of generated text, which is receiving particular attention in the international Natural Language Generation community (Gatt and Krahmer, 2018; van der Lee et al., 2019).

Task The task is cast as a “headline translation” problem, defined as follows. Given a collection of headlines from two Italian newspapers at opposite ends of the political spectrum, call them G and R, change all G-headlines into headlines in style R, and all R-headlines into headlines in style G.

In the context of this task we need to take care of two crucial aspects: data and evaluation. Details on data are provided in Section 2, and on evaluation in Section 3.

2 Data

We have collected news from two of the most important Italian newspapers situated at opposite ends of the political spectrum, namely la Repubblica (left) and Il Giornale (right), totalling approximately 152,000 article-headline pairs, with the two newspapers equally represented. Although the task only concerns headline change, the teams will receive both the headlines and their respective full articles.
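As a rough illustration of working with the distributed data, the following Python sketch reads article-headline pairs from a CSV file with the fields id, headline, article and label (the fields listed under Format in this section) and counts the pairs per newspaper. It assumes a comma-separated file with a header row and label values R and G; the helper names read_pairs and label_counts and the toy sample are hypothetical and are not taken from the released materials.

```python
import csv
import io
from collections import Counter

# Field names as listed in the Format paragraph of the paper.
FIELDS = ("id", "headline", "article", "label")


def read_pairs(handle):
    """Yield one dict per article-headline pair from an open CSV handle.

    Assumption: the file is comma-separated and starts with a header row.
    """
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield {name: row[name] for name in FIELDS}


def label_counts(pairs):
    """Count pairs per newspaper label (assumed to be 'R' or 'G')."""
    return Counter(pair["label"] for pair in pairs)


# Toy in-memory sample standing in for the real file (invented content).
SAMPLE = (
    "id,headline,article,label\n"
    "1,Titolo uno,Testo del primo articolo,R\n"
    "2,Titolo due,Testo del secondo articolo,G\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
counts = label_counts(pairs)
print(counts["R"], counts["G"])
```

With the real file, the same functions could be applied to a handle opened with open(path, newline="", encoding="utf-8"); the label counts give a quick check that the two newspapers are indeed equally represented.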
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Leveraging an alignment procedure described below (see Cafagna et al. (2019) for further details), we account for potential topic biases in the two newspapers, and we split the data set into strongly, weakly and not-aligned news. This information is useful in the creation of the datasets that we need to train our three evaluation classifiers (see Section 3). Additionally, it could help to better disentangle newspaper-specific style.

cosine score    newspaper  alignment
0.96 (strict)   rep        Estroverso o nevrotico? Lo dice la foto scelta per il profilo social
                           en:[Extrovert or neurotic? The photo chosen for the social profile says so]
                gio        L’immagine del profilo usata nei social network rivela la nostra personalità
                           en:[The profile picture used in social networks reveals our personality]
0.5 (strict)    rep        Egitto, governo si dimette a sorpresa
                           en:[Egypt, government resigns surprisingly]
                gio        Egitto, il governo si dimette
                           en:[Egypt, government resigns]
0.185 (loose)   rep        Elezioni presidenziali Francia, la Chiesa non si schiera né per Macron né per Le Pen
                           en:[Presidential elections France, the Church does not take sides either for Macron or for Le Pen]
                gio        Il primo voto con l’incubo Isis ma il terrorismo esce sconfitto
                           en:[The first vote with the Isis nightmare but terrorism comes out defeated]

Table 1: Example of alignments between La Repubblica and Il Giornale, extracted with different similarity scores. The second and the third examples would fall into the strict and the loose sets, respectively, according to the thresholds used to split the alignments. The first two headline pairs are well aligned, while the third pair has a very loose alignment.

Alignment We compute the tf-idf vectors of all the articles of both newspapers and create subsets of relevant news filtering by date, i.e. considering only news which were published in approximately the same, short, temporal range for the two sources. On the tf-idf vectors we then compute cosine similarities for all news in the resulting subset, rank them, and retain only the alignments that are above a certain threshold. The threshold is chosen taking into consideration a trade-off between number of documents and quality of alignment. We choose two different thresholds: one is stricter (≥ 0.5) and we use it to select the best alignments (strict alignments); the other one is looser (≥ 0.185 and < 0.5), and we define these latter as weak alignments. We consider the rest as basically not aligned.

Data splits We split the dataset into strongly aligned news, which are selected using the stricter threshold (∼20K aligned pairs, set A∗ in Figure 1a), and weakly aligned and non-aligned news (∼100K article-headline pairs equally distributed among the two newspapers, set R in Figure 1a). The strictly aligned data is further split as shown in Figure 1a; this yields a total of four sets over the whole dataset (A1, A2, A3, and R). A2 is left aside and used as test set for the final style transfer task. The remaining three sets are used for training the evaluation classifiers and the system for the target task. These are shown in Figure 1b. Note that all sets also always contain the headlines’ respective full articles, though these are not necessarily used.

Format The data is distributed in the form of one CSV file with the following fields:
id, headline, article, label [R,G]

3 Evaluation

Human evaluation is generally viewed as the most desirable method to assess generated text (Novikova et al., 2018; van der Lee et al., 2019). However, human evaluation is not always a viable option, due to resources, but also due to the fact that humans might not be capable of reliably assessing the task at hand. Related to the current challenge, De Mattei et al. (2020a) have shown that people find it difficult to identify subtle stylistic differences between texts.

Automatic, reliable metrics should therefore also be sought (Novikova et al., 2017). For our task, we propose a fully automatic strategy based on a series of classifiers to assess style strength and content preservation. For style, we train a single classifier (main). For content, we train two classifiers that perform two ‘sanity checks’: one ensures that the two headlines (original and transformed) are still compatible (HH classifier); the other ensures that the headline is still compatible with the original article (AH classifier). See also Figure 1b. In what follows we describe these classifiers in
more detail. When discussing baseline results, we will show how the contribution of each classifier is crucial towards a comprehensive evaluation.

Figure 1: Data splits and their use in the different training sets. (a) Overall data splits. (b) Training/test sets: for the evaluation, the main and AH classifiers are trained and tested on R+A3+A1 and the HH classifier on A1 + random pairs; for the task, systems are trained on R+A3 and tested on A2.

Main classifier The main classifier uses a pretrained BERT (Devlin et al., 2019) encoder with a linear classifier on top, fine-tuned with a batch size of 256 and sequences truncated at 32 tokens for 6 epochs with learning rate 1e-05. Given a headline, this classifier can distinguish the two sources with an f-score of approximately 80% (see Table 2). Since style transfer is deemed successful if the original style is lost in favour of the target style, we use this classifier to assess how many times a style transfer system manages to reverse the main classifier’s decisions.

HH classifier This classifier checks compatibility between the original and the generated headline. We use the same architecture as for the main classifier with a slightly different configuration: max. sequence length of 64 tokens, batch size of 128 for 2 epochs (early-stopped), with learning rate 1e-05. Being trained on strictly aligned data as positive instances (A1), with a corresponding amount of random pairs as negative instances, it should learn whether two headlines describe the same content or not. Performance on gold data is .96 (Table 2).

AH classifier This classifier performs yet another content-related check. It takes a headline and its corresponding article, and tells whether the headline is appropriate for the article. The classifier is trained on article-headline pairs from both the strongly aligned and the weakly and non-aligned instances (R+A3+A1, Figure 1b). At test time, the generated headline is checked for compatibility against the source article. We use the same base model as for the main and HH classifiers with batch size of 8, same learning rate and 6 epochs. Performance on gold data is >.97 (Table 2).

classifier  class     prec  rec   f-score
main        rep       0.77  0.83  0.80
main        gio       0.84  0.78  0.81
HH          match     0.98  0.95  0.96
HH          no match  0.95  0.98  0.96
AH          match     0.96  0.99  0.98
AH          no match  0.99  0.96  0.97

Table 2: Performance of the evaluation classifiers on gold data.

Overall compliancy We calculate a compliancy score which assesses the proportion of times the following three outcomes are all successful: (i) the HH classifier predicts ‘match’; (ii) the AH classifier predicts ‘match’; (iii) the main classifier’s decision is reversed. As upper bound, we find the compliancy score for gold at 74.3% for transfer from La Repubblica to Il Giornale (rep2gio), and 78.1% for the opposite direction (gio2rep).

4 Baseline System

We developed a baseline system using a summarisation approach, where headlines are viewed as an extreme case of summarisation and generated from the article. We exploit article-headline generators trained on opposite sources to do the transfer, as done in De Mattei et al. (2020b). The advantage of this approach is that in principle it doesn’t require parallel data for training.

Specifically, we use two pointer-generator networks (See et al., 2017), which include a pointing mechanism able to copy words from the
source as well as pick them from a fixed vocabulary, thereby allowing better handling of out-of-vocabulary words.

One model is trained on the la Repubblica portion of the training set, the other on Il Giornale. In a style transfer setting we use these models as follows: given a headline from Il Giornale, for example, the model trained on la Repubblica can be run over the corresponding article from Il Giornale to generate a headline in the style of la Repubblica, and vice versa.

Il Giornale → La Repubblica

original:  E in Sicilia è scattata l’allerta rossa
           [en: And in Sicily it’s now red alert]
generated: Migranti, la Protezione civile continua dimenticata
           [en: Migrants, the Civil Protection Department goes on forgotten]

original:  Nozze gay, toghe contro i sindaci: “Le trascrizioni sono illegittime”
           [en: Gay marriages, gowns against mayors: “Transcriptions are not valid”]
generated: Il Consiglio di Stato boccia le nozze gay all’estero
           [en: The State Council rejects gay marriages abroad]

La Repubblica → Il Giornale

original:  Castelnuovo, lo sdegno di cittadini e associazioni: “Attacco all’integrazione che funziona”
           [en: Castelnuovo, the indignation of citizens and associations: “Attack to the integration that works”]
generated: I migranti non sono più rifugiati
           [en: Migrants are not refugees anymore]

original:  Da Renzi a Di Maio, ecco il reddito dichiarato dai politici italiani. Fedeli il ministro con l’imponibile più alto
           [en: From Renzi to Di Maio: here it’s the income declared by the Italian politicians. Fedeli is the minister with the highest taxable income]
generated: Grillo e Giggino italiani conquistano l’elenco dei redditi italiani
           [en: Grillo and Giggino Italians conquer the list of Italian incomes]

Table 3: Examples of headlines generated by the baseline system.

The results of the baseline system, measured as performance of each classifier as well as the overall compliancy score, are reported in Table 4.

direction  HH    AH    Main  compl.
rep2gio    .649  .876  .799  .449
gio2rep    .639  .871  .435  .240
avg        .644  .874  .616  .345

Table 4: Baseline performance on test data.

5 Outlook

This shared task proposal was intended to stimulate research in NLG, with a specific focus on style transfer and automatic evaluation, in the Italian community. Over ten teams expressed their interest in participating in the shared task officially, but eventually there were no submitted runs. We do hope that the materials developed in the context of this challenge will nevertheless be of use to promote research in a field that is still under-researched in the Italian NLP landscape. All materials are available: https://github.com/michelecafagna26/CHANGE-IT.
References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Michele Cafagna, Lorenzo De Mattei, and Malvina Nissim. 2019. Embeddings shifts as proxies for different word use in Italian newspapers. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), Bari, Italy.

Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020a. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).

Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020b. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark, September. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana, June. Association for Computational Linguistics.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan, October–November. Association for Computational Linguistics.