=Paper=
{{Paper
|id=Vol-2765/169
|storemode=property
|title=CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper169.pdf
|volume=Vol-2765
|authors=Lorenzo De Mattei,Michele Cafagna,Felice Dell'Orletta,Malvina Nissim,Albert Gatt
|dblpUrl=https://dblp.org/rec/conf/evalita/MatteiCDNG20
}}
==CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate (short paper)==
'''Authors'''
* Lorenzo De Mattei (University of Pisa; CLCG, University of Groningen; ItaliaNLP Lab, ILC-CNR, Pisa, Italy), lorenzo.demattei@di.unipi.it
* Michele Cafagna (Aptus.AI, Pisa, Italy; University of Malta, Malta), michele@aptus.ai
* Felice Dell’Orletta (ItaliaNLP Lab, ILC-CNR, Pisa, Italy), felice.dellorletta@ilc.cnr.it
* Malvina Nissim (CLCG, University of Groningen, The Netherlands), m.nissim@rug.nl
* Albert Gatt (University of Malta, Malta), albert.gatt@um.edu.mt

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
We propose a generation task for Italian, more specifically a style transfer task for headlines of Italian newspapers. This is the first shared task on generation included in the EVALITA evaluation framework. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this aim in mind, we release to the participating teams not only training data, but also a baseline sequence to sequence model that performs the task, in order to help everyone get started, even those not accustomed to Natural Language Generation (NLG) approaches. Contextually, we explore the complex issue of automatic evaluation of generated text, which is receiving particular attention in the NLG community.

===1 Task and Motivation===
We propose a generation task for Italian in the context of the EVALITA 2020 campaign (Basile et al., 2020). More specifically, we design a style transfer task for headlines of Italian newspapers.

We believe it is the first time that a shared task on generation is offered in the context of EVALITA. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this goal in mind, we release to the potential participating teams not only training data, but also a baseline sequence to sequence model that performs the task, in order to help everyone get started, even those not yet accustomed to generation models. This baseline model casts the style transfer problem as an extreme summarisation task, which shows how versatile the problem is in terms of possible approaches. Contextually, this task will help to further explore the complex issue of evaluation of generated text, which is receiving particular attention in the international Natural Language Generation community (Gatt and Krahmer, 2018; van der Lee et al., 2019).

'''Task''' The task is cast as a “headline translation” problem, and it is as follows. Given a collection of headlines from two Italian newspapers at opposite ends of the political spectrum, call them G and R, change all G-headlines into headlines in style R, and all R-headlines into headlines in style G.

In the context of this task we need to take care of two crucial aspects: data and evaluation. Details on data are provided in Section 2, and on evaluation in Section 3.

===2 Data===
We have collected news coming from two of the most important Italian newspapers situated at opposite ends of the political spectrum, namely la Repubblica (left) and Il Giornale (right), totalling approximately 152,000 article-headline pairs, with the two newspapers equally represented. Although the task only concerns headline change, the teams will receive both the headlines and their respective full articles.

Leveraging an alignment procedure described below (see Cafagna et al. (2019) for further details), we account for potential topic biases in the two newspapers, and we split the data set into strongly, weakly and not-aligned news. This information is useful in the creation of the datasets that we need to train our three evaluation classifiers (see Section 3). Additionally, it could help to better disentangle newspaper-specific style.

{| class="wikitable"
|+ Table 1: Example of alignments between La Repubblica and Il Giornale, extracted with different similarity scores. The second and the third examples would fall into the strict and the loose sets, respectively, according to the thresholds used to split the alignments. The first two headline pairs are well aligned, while the third pair has a very loose alignment.
! cosine score !! newspaper !! alignment
|-
| rowspan="2" | 0.96 (strict) || rep || Estroverso o nevrotico? Lo dice la foto scelta per il profilo social en:[Extrovert or neurotic? The photo chosen for the social profile says so]
|-
| gio || L’immagine del profilo usata nei social network rivela la nostra personalità en:[The profile picture used in social networks reveals our personality]
|-
| rowspan="2" | 0.5 (strict) || rep || Egitto, governo si dimette a sorpresa en:[Egypt, government resigns surprisingly]
|-
| gio || Egitto, il governo si dimette en:[Egypt, government resigns]
|-
| rowspan="2" | 0.185 (loose) || rep || Elezioni presidenziali Francia, la Chiesa non si schiera né per Macron né per Le Pen en:[Presidential elections France, the Church does not take sides either for Macron or for Le Pen]
|-
| gio || Il primo voto con l’incubo Isis ma il terrorismo esce sconfitto en:[The first vote with the Isis nightmare but terrorism comes out defeated]
|}

'''Alignment''' We compute the tf-idf vectors of all the articles of both newspapers and create subsets of relevant news filtering by date, i.e. considering only news which were published in approximately the same, short, temporal range for the two sources. On the tf-idf vectors we then compute cosine similarities for all news in the resulting subset, rank them, and retain only the alignments that are above a certain threshold. The threshold is chosen taking into consideration a trade-off between number of documents and quality of alignment. We choose two different thresholds: one is stricter (≥ 0.5) and we use it to select the best alignments (strict alignments); the other one is looser (≥ 0.185, and < 0.5), and we define these latter as weak alignments. We consider the rest as basically not aligned.

'''Data splits''' We split the dataset into strongly aligned news, which are selected using the stricter threshold (∼20K aligned pairs, set A∗ in Figure 1a), and weakly aligned and non-aligned news (∼100K article-headline pairs equally distributed among the two newspapers, set R in Figure 1a). The strictly aligned data is further split as shown in Figure 1a; this yields a total of four sets over the whole dataset (A1, A2, A3, and R). A2 is left aside and used as test set for the final style transfer task. The remaining three sets are used for training the evaluation classifiers and the system for the target task. These are shown in Figure 1b. Note that all sets also always contain the headlines’ respective full articles, though these are not necessarily used.

Figure 1: Data splits and their use in the different training sets. (a) Overall data splits. (b) Training/test sets: for evaluation (train & test), the main classifier uses R+A3+A1, the HH classifier uses A1 + random pairs, and the AH classifier uses R+A3+A1; for the task, training uses R+A3 and testing uses A2.

'''Format''' The data is distributed in the form of one CSV file with the following fields: id, headline, article, label [R,G]

===3 Evaluation===
Human evaluation is generally viewed as the most desirable method to assess generated text (Novikova et al., 2018; van der Lee et al., 2019). However, human evaluation is not always a viable option, due to resources, but also due to the fact that humans might not be capable of reliably assessing the task at hand. Related to the current challenge, De Mattei et al. (2020a) have shown that people find it difficult to identify subtle stylistic differences between texts.

Automatic, reliable metrics should therefore also be sought (Novikova et al., 2017). For our task, we propose a fully automatic strategy based on a series of classifiers to assess style strength and content preservation. For style, we train a single classifier (main). For content, we train two classifiers that perform two ‘sanity checks’: one ensures that the two headlines (original and transformed) are still compatible (HH classifier); the other ensures that the headline is still compatible with the original article (AH classifier). See also Figure 1b. In what follows we describe these classifiers in more detail. When discussing baseline results, we will show how the contribution of each classifier is crucial towards a comprehensive evaluation.

'''Main classifier''' The main classifier uses a pre-trained BERT (Devlin et al., 2019) encoder with a linear classifier on top, fine-tuned with a batch size of 256 and sequences truncated at 32 tokens for 6 epochs with learning rate 1e-05. Given a headline, this classifier can distinguish the two sources with an f-score of approximately 80% (see Table 2). Since style transfer is deemed successful if the original style is lost in favour of the target style, we use this classifier to assess how many times a style transfer system manages to reverse the main classifier’s decisions.

'''HH classifier''' This classifier checks compatibility between the original and the generated headline. We use the same architecture as for the main classifier with a slightly different configuration: max. sequence length of 64 tokens, batch size of 128 for 2 epochs (early-stopped), with learning rate 1e-05. Being trained on strictly aligned data as positive instances (A1), with a corresponding amount of random pairs as negative instances, it should learn whether two headlines describe the same content or not. Performance on gold data is .96 (Table 2).

'''AH classifier''' This classifier performs yet another content-related check. It takes a headline and its corresponding article, and tells whether the headline is appropriate for the article. The classifier is trained on article-headline pairs from both the strongly aligned and the weakly and non-aligned instances (R+A3+A1, Figure 1b). At test time, the generated headline is checked for compatibility against the source article. We use the same base model as for the main and HH classifiers with batch size of 8, same learning rate and 6 epochs. Performance on gold data is >.97 (Table 2).

{| class="wikitable"
|+ Table 2: Performance of the evaluation classifiers on gold data.
! classifier !! class !! prec !! rec !! f-score
|-
| rowspan="2" | main || rep || 0.77 || 0.83 || 0.80
|-
| gio || 0.84 || 0.78 || 0.81
|-
| rowspan="2" | HH || match || 0.98 || 0.95 || 0.96
|-
| no match || 0.95 || 0.98 || 0.96
|-
| rowspan="2" | AH || match || 0.96 || 0.99 || 0.98
|-
| no match || 0.99 || 0.96 || 0.97
|}

'''Overall compliancy''' We calculate a compliancy score which assesses the proportion of times the following three outcomes are all successful: (i) the HH classifier predicts ‘match’; (ii) the AH classifier predicts ‘match’; (iii) the main classifier’s decision is reversed. As an upper bound, we find the compatibility score for gold at 74.3% for transfer from La Repubblica to Il Giornale (rep2gio), and 78.1% for the opposite direction (gio2rep).

===4 Baseline System===
We developed a baseline system using a summarisation approach, where headlines are viewed as an extreme case of summarisation and generated from the article. We exploit article-headline generators trained on opposite sources to do the transfer, as done in De Mattei et al. (2020b). The advantage of this approach is that in principle it does not require parallel data for training.

Specifically, we use two pointer-generator networks (See et al., 2017), which include a pointing mechanism able to copy words from the source as well as pick them from a fixed vocabulary, thereby allowing better handling of out-of-vocabulary words.

One model is trained on the la Repubblica portion of the training set, the other on Il Giornale. In a style transfer setting we use these models as follows: given a headline from Il Giornale, for example, the model trained on la Repubblica can be run over the corresponding article from Il Giornale to generate a headline in the style of la Repubblica, and vice versa.

{| class="wikitable"
|+ Table 3: Examples of headlines generated by the baseline system.
! direction !! original headline !! generated headline
|-
| rowspan="2" | Il Giornale → La Repubblica || E in Sicilia è scattata l’allerta rossa [en: And in Sicily it’s now red alert] || Migranti, la Protezione civile continua dimenticata [en: Migrants, the Civil Protection Department goes on forgotten]
|-
| Nozze gay, toghe contro i sindaci: “Le trascrizioni sono illegittime” [en: Gay marriages, gowns against mayors: “Transcriptions are not valid”] || Il Consiglio di Stato boccia le nozze gay all’estero [en: The State Council rejects gay marriages abroad]
|-
| rowspan="2" | La Repubblica → Il Giornale || Castelnuovo, lo sdegno di cittadini e associazioni: “Attacco all’integrazione che funziona” [en: Castelnuovo, the indignation of citizens and associations: “Attack to the integration that works”] || I migranti non sono più rifugiati [en: Migrants are not refugees anymore]
|-
| Da Renzi a Di Maio, ecco il reddito dichiarato dai politici italiani. Fedeli il ministro con l’imponibile più alto [en: From Renzi to Di Maio: here it’s the income declared by the Italian politicians. Fedeli is the minister with the highest taxable income] || Grillo e Giggino italiani conquistano l’elenco dei redditi italiani [en: Grillo and Giggino Italians conquer the list of Italian incomes]
|}

The results of the baseline system, measured as performance of each classifier as well as the overall compliancy score, are reported in Table 4.

{| class="wikitable"
|+ Table 4: Baseline performance on test data.
! direction !! HH !! AH !! Main !! compl.
|-
| rep2gio || .649 || .876 || .799 || .449
|-
| gio2rep || .639 || .871 || .435 || .240
|-
| avg || .644 || .874 || .616 || .345
|}

===5 Outlook===
This shared task proposal was intended to stimulate research in NLG, with a specific focus on style transfer and automatic evaluation, in the Italian community. Over ten teams expressed their interest in participating in the shared task officially, but eventually there were no submitted runs. We do hope that the materials developed in the context of this challenge will nevertheless be of use to promote research in a field that is still under-researched in the Italian NLP landscape. All materials are available: https://github.com/michelecafagna26/CHANGE-IT.

===References===
* Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.
* Michele Cafagna, Lorenzo De Mattei, and Malvina Nissim. 2019. Embeddings shifts as proxies for different word use in italian newspapers. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), Bari, Italy.
* Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020a. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).
* Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020b. Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May. European Language Resources Association (ELRA).
* Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.
* Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
* Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark, September. Association for Computational Linguistics.
* Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana, June. Association for Computational Linguistics.
* Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
* Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan, October–November. Association for Computational Linguistics.
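As an illustration of the alignment procedure from Section 2 (tf-idf vectors, pairwise cosine similarity, and the two thresholds 0.5 and 0.185), the following is a minimal sketch rather than the authors' implementation. The whitespace tokenisation, the smoothed idf formula, and the function names `tfidf_vectors`, `cosine`, `alignment_bucket` and `align` are assumptions made for this example; it also assumes the inputs were already filtered to the same short date range.

```python
import math
from collections import Counter

STRICT_THRESHOLD = 0.5   # strict alignments (value stated in the paper)
WEAK_THRESHOLD = 0.185   # weak alignments (value stated in the paper)


def tfidf_vectors(docs):
    """Return one sparse tf-idf dict per document.

    Assumption: lower-cased whitespace tokens and a smoothed idf;
    the paper does not specify either detail.
    """
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alignment_bucket(score):
    """Map a cosine score to the strict / weak / not aligned sets."""
    if score >= STRICT_THRESHOLD:
        return "strict"
    if score >= WEAK_THRESHOLD:
        return "weak"
    return "not aligned"


def align(rep_articles, gio_articles):
    """Score every rep/gio article pair and rank by decreasing similarity."""
    vectors = tfidf_vectors(rep_articles + gio_articles)
    rep_vecs = vectors[:len(rep_articles)]
    gio_vecs = vectors[len(rep_articles):]
    scored = [
        (cosine(r, g), i, j)
        for i, r in enumerate(rep_vecs)
        for j, g in enumerate(gio_vecs)
    ]
    scored.sort(reverse=True)
    return [(i, j, score, alignment_bucket(score)) for score, i, j in scored]
```

Calling `align` on the articles of one date window returns index pairs with their score and bucket, so the strict pairs can be routed to set A∗ and the remainder to set R. For a corpus of this size an all-pairs loop would be slow, and a sparse matrix library would be the more practical choice.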
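The overall compliancy score from Section 3 can be sketched as follows. This is a hedged illustration, not the official scorer: the `Judgement` record and its field names are hypothetical, and "the main classifier's decision is reversed" is interpreted here as the main classifier assigning a different label to the generated headline than to the original one.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Classifier outputs for one transferred headline (hypothetical record)."""
    hh_match: bool     # HH classifier predicts 'match'
    ah_match: bool     # AH classifier predicts 'match'
    main_before: str   # main classifier label for the original headline ('R' or 'G')
    main_after: str    # main classifier label for the generated headline ('R' or 'G')


def is_compliant(judgement):
    """True only if all three outcomes described in the paper hold."""
    style_reversed = judgement.main_after != judgement.main_before
    return judgement.hh_match and judgement.ah_match and style_reversed


def compliancy(judgements):
    """Proportion of headlines for which all three checks succeed."""
    if not judgements:
        return 0.0
    return sum(is_compliant(j) for j in judgements) / len(judgements)
```

Because the three checks are combined with a logical AND, the compliancy score can never exceed the lowest of the three individual classifier scores, which is consistent with the values reported in Table 4.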