UPB at GermEval-2020 Task 3: Assessing Summaries for German Texts using BERTScore and Sentence-BERT

Andrei Paraschiv
University Politehnica of Bucharest, Romania
Computer Science and Engineering Department
andrei.paraschiv74@stud.acs.upb.ro

Dumitru-Clementin Cercel
University Politehnica of Bucharest, Romania
Computer Science and Engineering Department
clementin.cercel@gmail.com

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The overwhelming amount of online text information available today has increased the need for more research on its automatic summarization. In this work, we describe our participation in GermEval-2020, Task 3: German Text Summarization. We compare two BERT-based metrics, Sentence-BERT and BERTScore, for automatically evaluating the quality of summaries in the German language. Our lowest error rate was 31.9925, ranking us 4th out of 6 participating teams.

1 Introduction

The objective of the text summarization task is to generate a condensed and coherent representation of the input text that retains its important ideas and preserves the meaning of the original (Allahyari et al., 2017). Automatic summarization is a hard problem because the system must understand the content, context, and meaning of the text; most often, additional word-level knowledge is required to complete the task (Malviya and Tiwary, 2016).

A major issue in this task is evaluating the quality of automatically generated summaries. Since human evaluation is expensive, time-consuming, and prone to subjective biases, automatic metrics have sparked the interest of researchers. Because summary evaluation shares similarities with the evaluation of Machine Translation (MT), many evaluation metrics originate in that area of research (Papineni et al., 2002).

Summarization skill assessment is often used to test the reading proficiency and the cognitive acquisitions of learners (Grabe and Jiang, 2013). In addition, automated summary scoring tools can help students improve their reading comprehension and can also lead to improvements in educational applications.

There are two kinds of evaluation methods for summaries: extrinsic evaluation, where the candidate summary is judged by how useful it is for a specific task, and intrinsic evaluation, based on a deep analysis of the candidate summary, for instance a comparison with the original text, with a reference summary, or with the text generated by another automated system (Jones and Galliers, 1995).

Shared Task 3, proposed by the organizers of GermEval 2020, encouraged participants to suggest a metric for an intrinsic evaluation of candidate summaries for German texts against reference summaries. The quality of each candidate summary is indicated by a score between 0 and 1, where 0 denotes a "bad summary" and 1 a "good summary". Our approaches rely on two recently introduced measures for evaluating summary quality, Sentence-BERT (Reimers and Gurevych, 2019) and BERTScore (Zhang et al., 2019), and we assess their performance on the competition dataset to observe how well they correlate with human judgment.

In the next section, we cover the work relevant to this research task. Section 3 presents our methodology. Section 4 then presents the results of our experiments. Finally, we discuss the conclusions of the paper.
2 Related Work

For almost twenty years, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) have been the most widely used metrics to assess summaries. These measures, based on n-gram matching, stand out through their simplicity and a relatively good correlation with human evaluations. Although these metrics and their variants are widely used, there are valid objections to their limitations (Reiter, 2018).

In recent years, metrics based on word embeddings, as well as measures based on deep learning models, have gained more attention from researchers. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are dense representations of words in a vector space. Using these representations rather than the n-gram decomposition of the texts, researchers have computed summary similarity scores, either by enhancing existing metrics like BLEU (Wang and Merlo, 2016; Servan et al., 2016) or by using an adapted version of the Earth Mover's Distance proposed by Rubner et al. (1998) (Li et al., 2019; Echizen-ya et al., 2019; Clark et al., 2019). These representations proved to be more in tune with human judgment than traditional measures such as ROUGE, METEOR, and BLEU.

Another application of deep learning to summary scoring is measures learned by a model. For instance, models like ReVal (Gupta et al., 2015) or RUSE (Shimanaka et al., 2018) learn sentence-level embeddings for the input sentences and then compute a similarity score between them. A common architecture for summary scoring is the siamese neural network (Bromley et al., 1994). Ruseti et al. (2018) used a siamese BiGRU neural network to score candidate summaries against the source text. Further, Xia et al. (2019) proposed three architectures (i.e., CNN, LSTM, and attention mechanism-based LSTM) to assess students' reading comprehension by scoring their summaries against the source text.

Pre-trained language models based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have improved performance on many natural language processing tasks in the last year. In contrast to previous word embeddings, these contextual embeddings can produce different vector representations for the same word in distinct sentences, depending on the neighboring words. Since contextual embeddings also capture the context of the words in the token representations of the input sentences, evaluation metrics based on them tend to correlate better with human evaluations. For instance, both the BERT adaptation of RUSE and BERT with an appended regressor outperformed the original RUSE model (Shimanaka et al., 2019). Also, Zhao et al. (2019) show that MoverScore, the Word Mover's Distance (Kusner et al., 2015) computed over contextualized embeddings, can achieve state-of-the-art performance.

3 Methodology

We adopt two novel BERT-based metrics, Sentence-BERT (Reimers and Gurevych, 2019) and BERTScore (Zhang et al., 2019), to automatically assess pairs of German candidate-reference summaries. Specifically, for the two metrics, we evaluate five different pre-trained BERT models, as listed in Table 1. In each experiment, we generated a score between 0 and 1 for every candidate-reference summary pair and then submitted the resulting file to the competition website for error evaluation.

BERT Model                        | BERT Version | Corpora used for training
Deepset.ai [1]                    | Cased        | Wikipedia, legal data, news
bert-base-german-europeana-uc [2] | Uncased      | Europeana newspapers
bert-base-german-uc [2]           | Uncased      | Wikipedia, subtitles, news, CommonCrawl
literary-german-bert [3]          | Uncased      | German fiction literature
bert-adapted-german-press [4]     | Uncased      | Newspapers

Table 1: Collection of pre-trained BERT models for the German language used in our study.

[1] https://deepset.ai/german-bert
[2] https://github.com/dbmdz/berts
[3] https://huggingface.co/severinsimmler/literary-german-bert
[4] https://huggingface.co/severinsimmler/german-press-bert

Sentence-BERT. In order to derive fixed embeddings for the two input summaries (i.e., the candidate and the reference summary), Sentence-BERT uses a siamese network architecture with a pooling layer on top of BERT. Three pooling strategies are available: using the output corresponding to the [CLS] token, the mean of the vector representations of all output tokens, or the max-over-time of these output vectors. Our experiments indicated that only the mean-vector strategy delivers optimal scores. Through fine-tuning, Sentence-BERT produces summary-level embeddings that capture both the semantics and the context of these texts. The two summary embeddings can then be compared using the cosine similarity measure.
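To make this setup concrete, the following minimal sketch builds a Sentence-BERT-style scorer with mean pooling over a German BERT model and compares a candidate-reference pair via cosine similarity. It assumes the sentence-transformers library; the model name and the example texts are placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of the Sentence-BERT scoring setup described above:
# mean pooling over BERT token outputs, then cosine similarity between
# the candidate and reference summary embeddings.
# Assumes the sentence-transformers library; model name is illustrative.
import torch
from sentence_transformers import SentenceTransformer, models

word_model = models.Transformer("bert-base-german-cased")   # any model from Table 1
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,    # the mean-vector pooling strategy
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[word_model, pooling])

candidate = "die zusammenfassung des kandidaten ..."        # placeholder texts
reference = "die referenzzusammenfassung ..."
emb = model.encode([candidate, reference], convert_to_tensor=True)
score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
print(f"Sentence-BERT similarity: {score:.4f}")
```

The same mean-pooled encoder is the one we fine-tune later (Section 4.2) before it is used for scoring.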
BERTScore. In contrast to Sentence-BERT, BERTScore is a token-level matching metric. Since BERT-based models use a WordPiece tokenizer (Schuster and Nakajima, 2012), the candidate summary s_c and the reference summary s_r are split into k and m tokens, respectively. The vector space representations v^c and v^r of s_c and s_r are then computed through the 12 Transformer layers (Vaswani et al., 2017). Using a greedy matching approach, the resulting tokens are paired and the precision, recall, and F1 scores are determined:

R_{BERT} = \frac{1}{k} \sum_{v_i^c \in v^c} \max_{v_j^r \in v^r} (v_i^c)^\top v_j^r

P_{BERT} = \frac{1}{m} \sum_{v_j^r \in v^r} \max_{v_i^c \in v^c} (v_i^c)^\top v_j^r

F1_{BERT} = 2 \cdot \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}

Additionally, we compute inverse document frequencies (idf) over the source texts of the summaries for each word in all candidate-reference summary pairs and use them for importance weighting in BERTScore, as described in the original paper. We also tested the score re-scaling strategy suggested by the authors, but it did not improve performance.
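To illustrate the greedy matching behind these formulas, here is a minimal sketch using the Hugging Face transformers library, with the last hidden layer as token representations; idf weighting is omitted for brevity, and the model name is only a placeholder.

```python
# Minimal sketch of BERTScore-style greedy token matching (no idf weighting).
# Follows the R_BERT / P_BERT / F1_BERT definitions above; assumes the
# Hugging Face transformers library, model name is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-german-cased"  # stands in for any model in Table 1
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_vectors(text: str) -> torch.Tensor:
    """L2-normalised last-layer token representations for one summary."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bert_score(candidate: str, reference: str):
    vc = token_vectors(candidate)                          # k x dim
    vr = token_vectors(reference)                          # m x dim
    sim = vc @ vr.T                                        # k x m similarity matrix
    r = sim.max(dim=1).values.mean()                       # R_BERT: best match per candidate token
    p = sim.max(dim=0).values.mean()                       # P_BERT: best match per reference token
    f1 = 2 * p * r / (p + r)
    return p.item(), r.item(), f1.item()
```

Because the token vectors are L2-normalised, the dot products in the similarity matrix are cosine similarities, so the row-wise and column-wise maxima correspond directly to the matching terms in R_BERT and P_BERT above.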
4 Performance Evaluation

4.1 Corpus

The experimental data consisted of 216 German-language source texts together with their reference summaries and the summaries proposed for evaluation. More specifically, there were 24 distinct source texts, each paired with one reference summary and nine summaries proposed for evaluation. All texts were provided in lower case, with punctuation and quotations intact. The length of the source texts varied from around 2,000 to 12,000 characters, averaging around 5,800 characters. The length of the reference summaries varied from 3% to 13% of the source text length, with an average of 6%. The candidate summaries varied from 0.6% to 21% of the source text length, also averaging around 6%.

4.2 BERT Fine-tuning

We fine-tuned the aforementioned BERT models (see Table 1) using the Opusparcus corpus (Creutz, 2018), which provides 3,168 human-annotated paraphrase pairs sourced from the OpenSubtitles2016 collection of parallel corpora (Lison and Tiedemann, 2016). The paraphrase pairs are rated on a scale from 1 to 4 in 0.5 increments, where 4 indicates a good match and 1 a bad match. For fine-tuning, we mapped these ratings to the [0, 1] interval according to Table 2.

Opusparcus Rating | Similarity Score
4                 | 0.85
3.5               | 0.70
3                 | 0.50
2.5               | 0.30
2                 | 0.20
1.5               | 0.10
1                 | 0.05

Table 2: Mapping from the Opusparcus ratings to the similarity scores for each paraphrase pair, used for fine-tuning Sentence-BERT and, via the fine-tuned BERT models, BERTScore.

To train Sentence-BERT, we used the Opusparcus dataset with the modified scores for 5 training epochs and a mean squared error loss. We then used the fine-tuned BERT models as the basis for computing BERTScore.
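A compact sketch of this fine-tuning step is shown below, again assuming the sentence-transformers library: the Opusparcus ratings are mapped to similarity scores as in Table 2, and the model is trained with a cosine-similarity regression objective (a mean squared error between the predicted cosine similarity and the target score). The file path and column layout are hypothetical; only the score mapping and the epoch count come from the setup described above.

```python
# Hedged sketch: fine-tune a Sentence-BERT model on Opusparcus paraphrase pairs
# whose 1-4 ratings are mapped to [0, 1] similarity scores (Table 2).
# Assumes sentence-transformers; file path and column layout are illustrative.
import csv
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

RATING_TO_SCORE = {4.0: 0.85, 3.5: 0.70, 3.0: 0.50,
                   2.5: 0.30, 2.0: 0.20, 1.5: 0.10, 1.0: 0.05}

word_model = models.Transformer("bert-base-german-cased")         # any model from Table 1
pooling = models.Pooling(word_model.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_model, pooling])

train_examples = []
with open("opusparcus_de_annotated.tsv", encoding="utf-8") as f:  # hypothetical file
    for sent1, sent2, rating in csv.reader(f, delimiter="\t"):
        train_examples.append(
            InputExample(texts=[sent1, sent2], label=RATING_TO_SCORE[float(rating)]))

train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)                   # MSE on cosine similarity
model.fit(train_objectives=[(train_loader, train_loss)], epochs=5)
model.save("sbert-german-opusparcus")
```

The saved model can then either score candidate-reference pairs directly with cosine similarity (Sentence-BERT) or serve as the underlying encoder for the BERTScore computation sketched earlier.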
4.3 Results

Table 3 shows the results of our experiments. First, we found that training Sentence-BERT with the literary-german-bert and bert-adapted-german-press models, using the score mapping from the Opusparcus ratings to the [0, 1] interval, delivered a more accurate evaluation.

For BERTScore, after trying out the vectors from several attention heads, we concluded that using the last layer for the token representations yields the best performance. Using the BERT models fine-tuned with Sentence-BERT as the basis for BERTScore improved the error rate for all pre-trained BERT models, but it had the most significant impact on the case-sensitive version from deepset.ai, which delivered our best result of 31.9925. Fine-tuning the uncased BERT versions with Sentence-BERT before applying BERTScore did add some improvement, but the small decrease in error may not justify the computational effort. For the cased BERT version, on the other hand, the increase in performance was significant. Overall, BERTScore correlated more closely with the human evaluators, regardless of the pre-trained BERT model used. The idf weighting improved the final result by about 1 percentage point.

BERT Model                    | Sentence-BERT | BERT-Score | BERT-Score with idf | BERT-Score with fine-tuning and idf
Deepset.ai                    | 37.2916       | 35.6950    | 35.3121             | 31.9925
bert-base-german-europeana-uc | 35.2817       | 32.9403    | 32.2169             | 32.0194
bert-base-german-uc           | 42.7792       | 34.1719    | 33.4136             | 40.5780
literary-german-bert          | 36.5822       | 44.7441    | 43.2454             | 35.5773
bert-adapted-german-press     | 36.5098       | 33.1080    | 32.2967             | 35.3199

Table 3: Error rates comparing the metrics: Sentence-BERT trained on Opusparcus, BERT-Score without fine-tuning, BERT-Score without fine-tuning but with idf weighting, and BERT-Score with both fine-tuning and idf weighting, for different pre-trained BERT models of the German language.

As expected, since the provided summaries had no capitalization and capitalization is highly important in the German language, the case-sensitive version without fine-tuning performed worse for both metrics. Also, the BERT model pre-trained on the Europeana newspaper corpus performed best for both metrics.

As seen in Table 4, the score obtained by our best model is at least 10 percentage points better than the baselines. Surprisingly, among the baseline scoring methods, BLEU performed best.

Baseline | Score
BLEU     | 41.4299
ROUGE-1  | 42.6328
ROUGE-2  | 55.7044
ROUGE-L  | 43.7750
METEOR   | 48.0823

Table 4: Results using the baseline scoring methods: BLEU, three variants of ROUGE (i.e., ROUGE-1 using unigram overlap, ROUGE-2 using bigram overlap, and ROUGE-L using the longest common subsequence), and METEOR.

5 Conclusions

In this paper, we analyzed the robustness of two metrics (i.e., Sentence-BERT and BERTScore) based on the pre-trained BERT language model, applied to the automatic assessment of summary quality. Intuitively, Sentence-BERT learns embeddings for the two input summaries, whereas BERTScore focuses on the token-level embeddings in each summary and computes an average score from them. Compared to classical scoring methods such as BLEU, ROUGE, or METEOR, these metrics are more compute-intensive and lack the simple explainability that classical scores provide. Also, as seen in our experiments, the scores can differ depending on which pre-trained BERT model is used.

Since BERT embeddings are context-dependent, the simpler approach, BERTScore, proves to be more in tune with the human evaluators. Computationally, BERTScore is also much easier to streamline, since it does not require an additional training dataset. Given the lack of high-quality, manually annotated paraphrase datasets in German, the easiest option for production use would be BERTScore with an appropriate cased model. We also showed that BERTScore applied to a BERT model fine-tuned on a paraphrase dataset with the Sentence-BERT similarity objective can lead to a higher correlation between human assessments and the automatic scores.

References

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. arXiv preprint arXiv:1707.02268.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737-744.

Elizabeth Clark, Asli Celikyilmaz, and Noah A. Smith. 2019. Sentence mover's similarity: Automatic evaluation for multi-sentence texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2748-2760.

Mathias Creutz. 2018. Open Subtitles paraphrase corpus for six languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Hiroshi Echizen-ya, Kenji Araki, and Eduard Hovy. 2019. Word embedding-based automatic MT evaluation metric using word position information. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1874-1883.

William Grabe and Xiangying Jiang. 2013. Assessing reading. The Companion to Language Assessment, 1:185-200.

Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066-1072.

Karen Sparck Jones and Julia R. Galliers. 1995. Evaluating Natural Language Processing Systems: An Analysis and Review, volume 1083. Springer Science & Business Media.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957-966.

Pairui Li, Chuan Chen, Wujie Zheng, Yuetang Deng, Fanghua Ye, and Zibin Zheng. 2019. STD: An automatic evaluation metric for machine translation based on word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(10):1497-1506.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923-929.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Shrikant Malviya and Uma Shanker Tiwary. 2016. Knowledge based summarization and document generation using Bayesian network. Procedia Computer Science, 89:333-340.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393-401.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 1998. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 59-66. IEEE.

Stefan Ruseti, Mihai Dascalu, Amy M. Johnson, Danielle S. McNamara, Renu Balyan, Kathryn S. McCarthy, and Stefan Trausan-Matu. 2018. Scoring summaries using recurrent neural networks. In International Conference on Intelligent Tutoring Systems, pages 191-201. Springer.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149-5152. IEEE.
Christophe Servan, Alexandre Bérard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2vec vs DBnary: Augmenting METEOR using vector representations or lexical resources? In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159-1168.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751-758.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine translation evaluation with BERT regressor. arXiv preprint arXiv:1907.12679.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Haozhou Wang and Paola Merlo. 2016. Modifications of machine translation evaluation metrics by using word embeddings. In Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6), pages 33-41.

Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2019. Automatic learner summary assessment for reading comprehension. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2532-2542.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622.