UPB at GermEval-2020 Task 3: Assessing Summaries for German Texts using BERTScore and Sentence-BERT

Andrei Paraschiv
University Politehnica of Bucharest, Romania
Computer Science and Engineering Department
andrei.paraschiv74@stud.acs.upb.ro

Dumitru-Clementin Cercel
University Politehnica of Bucharest, Romania
Computer Science and Engineering Department
clementin.cercel@gmail.com

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The overwhelming amount of online text information available today has increased the need for more research on its automatic summarization. In this work, we describe our participation in GermEval-2020, Task 3: German Text Summarization. We compare two BERT-based metrics, Sentence-BERT and BERTScore, for automatically evaluating the quality of summaries in the German language. Our lowest error rate was 31.9925, ranking us 4th out of 6 participating teams.

1 Introduction

The objective of the text summarization task is to generate a condensed and coherent representation of the input text that retains its important ideas and preserves the meaning of the original (Allahyari et al., 2017). Automatic summarization is a hard problem because the system must understand the content, context, and meaning of the text; most often, additional word-level knowledge is required to complete the task (Malviya and Tiwary, 2016).

A major issue in this task is evaluating the quality of automatically generated summaries. Since human evaluation is expensive, time-consuming, and prone to subjective biases, automatic metrics have sparked the interest of researchers. Because summary evaluation shares similarities with the evaluation of Machine Translation (MT), many evaluation metrics originate in that area of research (Papineni et al., 2002).

Summarization skill assessment is often used to test the reading proficiency and the cognitive acquisitions of learners (Grabe and Jiang, 2013). In addition, automated summary scoring tools can help students improve their reading comprehension and can also lead to improvements in educational applications.

There are two kinds of evaluation methods for summaries: extrinsic evaluation, where the candidate summary is judged by how useful it is for a specific task, and intrinsic evaluation, based on a deep analysis of the candidate summary, for instance a comparison with the original text, with a reference summary, or with the text generated by another automated system (Jones and Galliers, 1995).

Shared Task 3, proposed by the organizers of GermEval 2020, encouraged participants to suggest a metric for an intrinsic evaluation of candidate summaries for German texts against reference summaries. The quality of each candidate summary is indicated by a score between 0 and 1, where 0 denotes a "bad summary" and 1 a "good summary". Our approaches rely on two recently introduced measures for evaluating summary quality, Sentence-BERT (Reimers and Gurevych, 2019) and BERTScore (Zhang et al., 2019), and we assess their performance on the competition dataset to observe how well they correlate with human judgment.

In the next section, we cover the work relevant to this research task. Section 3 presents our methodology. Section 4 then presents the results of our experiments. Finally, we discuss the conclusions of the paper.
2 Related Work

For almost twenty years, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) have been the most widely used metrics to assess summaries. These measures, based on n-gram matching, stand out through their simplicity and a relatively good correlation with human evaluations. Although these metrics and their variants are widely used, there are valid objections to their limitations (Reiter, 2018).

In recent years, metrics based on word embeddings, as well as measures based on deep learning models, have gained more attention from researchers. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are dense representations of words in a vector space. Using these representations rather than the n-gram decomposition of the texts, researchers have computed summary similarity scores, either by enhancing existing metrics like BLEU (Wang and Merlo, 2016; Servan et al., 2016) or by using an adapted version of the Earth Mover's Distance proposed by Rubner et al. (1998) (Li et al., 2019; Echizen-ya et al., 2019; Clark et al., 2019). These representations proved to be more in tune with human judgment than traditional measures such as ROUGE, METEOR, and BLEU.

Another application of deep learning to summary scoring is measures learned by a model. For instance, models like ReVal (Gupta et al., 2015) or RUSE (Shimanaka et al., 2018) learn sentence-level embeddings for the input sentences and then compute a similarity score between them. A common architecture for summary scoring is the siamese neural network (Bromley et al., 1994). Ruseti et al. (2018) used a siamese BiGRU neural network to score candidate summaries against the source text. Further, Xia et al. (2019) proposed three architectures (i.e., CNN, LSTM, and attention mechanism-based LSTM) to assess students' reading comprehension by scoring their summaries against the source text.

Pre-trained language models based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have improved performance on many natural language processing tasks in the last year. In contrast to previous word embeddings, these contextual embeddings can produce different vector representations for the same word in distinct sentences, depending on the neighboring words. Since contextual embeddings also capture the context of the words in the token representations of the input sentences, evaluation metrics based on them tend to correlate better with human evaluations. For instance, both the BERT adaptation of RUSE and BERT with an appended regressor outperformed the original RUSE model (Shimanaka et al., 2019). Also, Zhao et al. (2019) show that MoverScore, the Word Mover's Distance (Kusner et al., 2015) computed over contextualized embeddings, can achieve state-of-the-art performance.

3 Methodology

We adopt two novel BERT-based metrics, Sentence-BERT (Reimers and Gurevych, 2019) and BERTScore (Zhang et al., 2019), to automatically assess pairs of German candidate-reference summaries. Specifically, for the two metrics, we evaluate five different pre-trained BERT models, as listed in Table 1. In each experiment, we generated a score between 0 and 1 for every candidate-reference summary pair and then submitted the resulting file to the competition website for error evaluation.

BERT Model                        | BERT Version | Corpora used for training
Deepset.ai [1]                    | Cased        | Wikipedia, legal data, news
bert-base-german-europeana-uc [2] | Uncased      | Europeana newspapers
bert-base-german-uc [2]           | Uncased      | Wikipedia, subtitles, news, CommonCrawl
literary-german-bert [3]          | Uncased      | German fiction literature
bert-adapted-german-press [4]     | Uncased      | Newspapers

Table 1: Collection of pre-trained BERT models for the German language used in our study.

[1] https://deepset.ai/german-bert
[2] https://github.com/dbmdz/berts
[3] https://huggingface.co/severinsimmler/literary-german-bert
[4] https://huggingface.co/severinsimmler/german-press-bert

Sentence-BERT. In order to derive fixed embeddings for the two input summaries (i.e., the candidate and the reference summary), Sentence-BERT uses a siamese network architecture with a pooling layer on top of BERT. Three pooling strategies are available: using the output corresponding to the [CLS] token, the mean of the vector representations of all output tokens, or the max-over-time of these output vectors. Our experiments indicated that only the mean-vector strategy delivers optimal scores. Through fine-tuning, Sentence-BERT produces summary-level embeddings that capture both the semantics and the context of these texts. The two summary embeddings can then be compared using the cosine similarity measure.
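To make this setup concrete, the following minimal sketch builds a Sentence-BERT-style scorer with mean pooling over a German BERT model and compares a candidate-reference pair via cosine similarity. It assumes the sentence-transformers library; the model name and the example texts are placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of the Sentence-BERT scoring setup described above:
# mean pooling over BERT token outputs, then cosine similarity between
# the candidate and reference summary embeddings.
# Assumes the sentence-transformers library; model name is illustrative.
import torch
from sentence_transformers import SentenceTransformer, models

word_model = models.Transformer("bert-base-german-cased")   # any model from Table 1
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,    # the mean-vector pooling strategy
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[word_model, pooling])

candidate = "die zusammenfassung des kandidaten ..."        # placeholder texts
reference = "die referenzzusammenfassung ..."
emb = model.encode([candidate, reference], convert_to_tensor=True)
score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
print(f"Sentence-BERT similarity: {score:.4f}")
```

The same mean-pooled encoder is the one we fine-tune later (Section 4.2) before it is used for scoring.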
BERTScore. In contrast to Sentence-BERT, BERTScore is a token-level matching metric. Since BERT-based models use a WordPiece tokenizer (Schuster and Nakajima, 2012), the candidate summary s_c and the reference summary s_r are split into k and m tokens, respectively. The vector space representations v^c and v^r of s_c and s_r are then computed through the 12 Transformer layers (Vaswani et al., 2017). Using a greedy matching approach, the resulting tokens are paired and the precision, recall, and F1 scores are determined:

R_{BERT} = \frac{1}{k} \sum_{v_i^c \in v^c} \max_{v_j^r \in v^r} (v_i^c)^\top v_j^r

P_{BERT} = \frac{1}{m} \sum_{v_j^r \in v^r} \max_{v_i^c \in v^c} (v_i^c)^\top v_j^r

F1_{BERT} = 2 \cdot \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}

Additionally, we compute inverse document frequencies (idf) over the source texts of the summaries for each word in all candidate-reference summary pairs and use them for importance weighting in BERTScore, as described in the original paper. We also tested the score re-scaling strategy suggested by the authors, but it did not improve performance.
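To illustrate the greedy matching behind these formulas, here is a minimal sketch using the Hugging Face transformers library, with the last hidden layer as token representations; idf weighting is omitted for brevity, and the model name is only a placeholder.

```python
# Minimal sketch of BERTScore-style greedy token matching (no idf weighting).
# Follows the R_BERT / P_BERT / F1_BERT definitions above; assumes the
# Hugging Face transformers library, model name is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-german-cased"  # stands in for any model in Table 1
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_vectors(text: str) -> torch.Tensor:
    """L2-normalised last-layer token representations for one summary."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bert_score(candidate: str, reference: str):
    vc = token_vectors(candidate)                          # k x dim
    vr = token_vectors(reference)                          # m x dim
    sim = vc @ vr.T                                        # k x m similarity matrix
    r = sim.max(dim=1).values.mean()                       # R_BERT: best match per candidate token
    p = sim.max(dim=0).values.mean()                       # P_BERT: best match per reference token
    f1 = 2 * p * r / (p + r)
    return p.item(), r.item(), f1.item()
```

Because the token vectors are L2-normalised, the dot products in the similarity matrix are cosine similarities, so the row-wise and column-wise maxima correspond directly to the matching terms in R_BERT and P_BERT above.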
4 Performance Evaluation

4.1 Corpus

The experimental data consisted of 216 German-language source texts together with their reference summaries and the summaries proposed for evaluation. More specifically, there were 24 distinct source texts, each paired with one reference summary and nine summaries proposed for evaluation. All texts were provided in lower case, with punctuation and quotations intact. The length of the source texts varied from around 2,000 to 12,000 characters, averaging around 5,800 characters. The length of the reference summaries varied from 3% to 13% of the source text length, with an average of 6%. The candidate summaries varied from 0.6% to 21% of the source text length, also averaging around 6%.

4.2 BERT Fine-tuning

We fine-tuned the aforementioned BERT models (see Table 1) using the Opusparcus corpus (Creutz, 2018), which provides 3,168 human-annotated paraphrase pairs sourced from the OpenSubtitles2016 collection of parallel corpora (Lison and Tiedemann, 2016). The paraphrase pairs are rated on a scale from 1 to 4 in 0.5 increments, where 4 indicates a good match and 1 a bad match. For fine-tuning, we mapped these ratings to the [0, 1] interval according to Table 2.

Opusparcus Rating | Similarity Score
4                 | 0.85
3.5               | 0.70
3                 | 0.50
2.5               | 0.30
2                 | 0.20
1.5               | 0.10
1                 | 0.05

Table 2: Mapping from the Opusparcus ratings to the similarity scores for each paraphrase pair, used for fine-tuning Sentence-BERT and, via the fine-tuned BERT models, BERTScore.

To train Sentence-BERT, we used the Opusparcus dataset with the modified scores for 5 training epochs and a mean squared error loss. We then used the fine-tuned BERT models as the basis for computing BERTScore.
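A compact sketch of this fine-tuning step is shown below, again assuming the sentence-transformers library: the Opusparcus ratings are mapped to similarity scores as in Table 2, and the model is trained with a cosine-similarity regression objective (a mean squared error between the predicted cosine similarity and the target score). The file path and column layout are hypothetical; only the score mapping and the epoch count come from the setup described above.

```python
# Hedged sketch: fine-tune a Sentence-BERT model on Opusparcus paraphrase pairs
# whose 1-4 ratings are mapped to [0, 1] similarity scores (Table 2).
# Assumes sentence-transformers; file path and column layout are illustrative.
import csv
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

RATING_TO_SCORE = {4.0: 0.85, 3.5: 0.70, 3.0: 0.50,
                   2.5: 0.30, 2.0: 0.20, 1.5: 0.10, 1.0: 0.05}

word_model = models.Transformer("bert-base-german-cased")         # any model from Table 1
pooling = models.Pooling(word_model.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_model, pooling])

train_examples = []
with open("opusparcus_de_annotated.tsv", encoding="utf-8") as f:  # hypothetical file
    for sent1, sent2, rating in csv.reader(f, delimiter="\t"):
        train_examples.append(
            InputExample(texts=[sent1, sent2], label=RATING_TO_SCORE[float(rating)]))

train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)                   # MSE on cosine similarity
model.fit(train_objectives=[(train_loader, train_loss)], epochs=5)
model.save("sbert-german-opusparcus")
```

The saved model can then either score candidate-reference pairs directly with cosine similarity (Sentence-BERT) or serve as the underlying encoder for the BERTScore computation sketched earlier.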
4.3 Results

Table 3 shows the results of our experiments. First, we found that training Sentence-BERT with the literary-german-bert and bert-adapted-german-press models, using the score mapping from the Opusparcus ratings to the [0, 1] interval, delivered a more accurate evaluation.

For BERTScore, after trying out the vectors from several attention heads, we concluded that using the last layer for the token representations yields the best performance. Using the BERT models fine-tuned with Sentence-BERT as the basis for BERTScore improved the error rate for all pre-trained BERT models, but it had the most significant impact on the case-sensitive version from deepset.ai, which delivered our best result of 31.9925. Fine-tuning the uncased BERT versions with Sentence-BERT before applying BERTScore did add some improvement, but the small decrease in error may not justify the computational effort. For the cased BERT version, on the other hand, the increase in performance was significant. Overall, BERTScore correlated more closely with the human evaluators, regardless of the pre-trained BERT model used. The idf weighting improved the final result by about 1 percentage point.

BERT Model                    | Sentence-BERT | BERT-Score | BERT-Score with idf | BERT-Score with fine-tuning and idf
Deepset.ai                    | 37.2916       | 35.6950    | 35.3121             | 31.9925
bert-base-german-europeana-uc | 35.2817       | 32.9403    | 32.2169             | 32.0194
bert-base-german-uc           | 42.7792       | 34.1719    | 33.4136             | 40.5780
literary-german-bert          | 36.5822       | 44.7441    | 43.2454             | 35.5773
bert-adapted-german-press     | 36.5098       | 33.1080    | 32.2967             | 35.3199

Table 3: Error rates comparing the metrics: Sentence-BERT trained on Opusparcus, BERT-Score without fine-tuning, BERT-Score without fine-tuning but with idf weighting, and BERT-Score with both fine-tuning and idf weighting, for different pre-trained BERT models of the German language.

As expected, since the provided summaries had no capitalization and capitalization is highly important in the German language, the case-sensitive version without fine-tuning performed worse for both metrics. Also, the BERT model pre-trained on the Europeana newspaper corpus performed best for both metrics.

As seen in Table 4, the score obtained by our best model is at least 10 percentage points better than the baselines. Surprisingly, among the baseline scoring methods, BLEU performed best.

Baseline | Score
BLEU     | 41.4299
ROUGE-1  | 42.6328
ROUGE-2  | 55.7044
ROUGE-L  | 43.7750
METEOR   | 48.0823

Table 4: Results using the baseline scoring methods: BLEU, three variants of ROUGE (i.e., ROUGE-1 using unigram overlap, ROUGE-2 using bigram overlap, and ROUGE-L using the longest common subsequence), and METEOR.

5 Conclusions

In this paper, we analyzed the robustness of two metrics (i.e., Sentence-BERT and BERTScore) based on the pre-trained BERT language model, applied to the automatic assessment of summary quality. Intuitively, Sentence-BERT learns embeddings for the two input summaries, whereas BERTScore focuses on the token-level embeddings in each summary and computes an average score from them. Compared to classical scoring methods such as BLEU, ROUGE, or METEOR, these metrics are more compute-intensive and lack the simple explainability that classical scores provide. Also, as seen in our experiments, the scores can differ depending on which pre-trained BERT model is used.

Since BERT embeddings are context-dependent, the simpler approach, BERTScore, proves to be more in tune with the human evaluators. Computationally, BERTScore is also much easier to streamline, since it does not require an additional training dataset. Given the lack of high-quality, manually annotated paraphrase datasets in German, the easiest option for production use would be BERTScore with an appropriate cased model. We also showed that BERTScore applied to a BERT model fine-tuned on a paraphrase dataset with the Sentence-BERT similarity objective can lead to a higher correlation between human assessments and the automatic scores.

References

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. arXiv preprint arXiv:1707.02268.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737-744.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover's similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748-2760.

Mathias Creutz. 2018. Open Subtitles paraphrase corpus for six languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Hiroshi Echizen-ya, Kenji Araki, and Eduard Hovy. 2019. Word embedding-based automatic MT evaluation metric using word position information. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1874-1883.

William Grabe and Xiangying Jiang. 2013. Assessing reading. The Companion to Language Assessment, 1:185-200.

Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066-1072.

Karen Sparck Jones and Julia R. Galliers. 1995. Evaluating Natural Language Processing Systems: An Analysis and Review, volume 1083. Springer Science & Business Media.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957-966.

Pairui Li, Chuan Chen, Wujie Zheng, Yuetang Deng, Fanghua Ye, and Zibin Zheng. 2019. STD: An automatic evaluation metric for machine translation based on word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(10):1497-1506.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923-929.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Shrikant Malviya and Uma Shanker Tiwary. 2016. Knowledge based summarization and document generation using Bayesian network. Procedia Computer Science, 89:333-340.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393-401.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 1998. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 59-66. IEEE.

Stefan Ruseti, Mihai Dascalu, Amy M. Johnson, Danielle S. McNamara, Renu Balyan, Kathryn S. McCarthy, and Stefan Trausan-Matu. 2018. Scoring summaries using recurrent neural networks. In International Conference on Intelligent Tutoring Systems, pages 191-201. Springer.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149-5152. IEEE.
Christophe Servan, Alexandre Bérard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2vec vs DBnary: Augmenting METEOR using vector representations or lexical resources? In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159-1168.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751-758.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine translation evaluation with BERT regressor. arXiv preprint arXiv:1907.12679.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Haozhou Wang and Paola Merlo. 2016. Modifications of machine translation evaluation metrics by using word embeddings. In Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6), pages 33-41.

Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2019. Automatic learner summary assessment for reading comprehension. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2532-2542.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622.