<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UPB at GermEval-2020 Task 3: Assessing Summaries for German Texts using BERTScore and Sentence-BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrei Paraschiv</string-name>
          <email>andrei.paraschiv74@stud.acs.upb.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru-Clementin Cercel</string-name>
          <email>clementin.cercel@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <addr-line>Computer Science and Engineering Department, Bucharest, Romania</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>The overwhelming amount of online text information available today has increased the need for research on its automatic summarization. In this work, we describe our participation in GermEval 2020, Task 3: German Text Summarization. We compare two BERT-based metrics, Sentence-BERT and BERTScore, for automatically evaluating the quality of summaries in the German language. Our lowest achieved error rate was 31.9925, ranking us 4th out of 6 participating teams.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The objective of the text summarization task is to generate a condensed and coherent representation of the input text that retains its important ideas as well as the meaning of the original
        <xref ref-type="bibr" rid="ref1">(Allahyari et al., 2017)</xref>
        . Automatic summarization is a hard problem since the system must understand the content, context, and meaning of the text. Most often, additional word-level knowledge is required to complete the task
        <xref ref-type="bibr" rid="ref16">(Malviya and Tiwary, 2016)</xref>
        .
      </p>
      <p>
        In this task, a major issue is evaluating the quality of automatically generated summaries. Since human evaluation is expensive, time-consuming, and prone to subjective biases, automatic metrics have sparked the interest of researchers. Sharing similarities with the evaluation of Machine Translation (MT), many evaluation metrics originate in that area of research
        <xref ref-type="bibr" rid="ref18">(Papineni et al., 2002)</xref>
        .
      </p>
      <p>
        Summarization skill assessment is often used to test the reading proficiency and cognitive acquisition of learners
        <xref ref-type="bibr" rid="ref9">(Grabe and Jiang, 2013)</xref>
        . In addition, automated summary scoring tools can help students improve their reading comprehension and also lead to improvements in educational applications.
      </p>
      <p>
        There are two kinds of summary evaluation methods: extrinsic evaluation, where the candidate summary is judged by how useful it is for a specific task, and intrinsic evaluation, based on a deep analysis of the candidate summary, for instance, a comparison with the original text, with a reference summary, or with the text generated by another automated system
        <xref ref-type="bibr" rid="ref11">(Jones and Galliers, 1995)</xref>
        .
      </p>
      <p>
        The shared Task 3 proposed by the organizers of GermEval 2020 encouraged participants to suggest a metric for the intrinsic evaluation of candidate summaries for German text data against reference summaries. The quality of each candidate summary is indicated by a score between 0 and 1, where 0 denotes a "bad summary" and 1 a "good summary". Our approaches rely on two newly introduced measures for evaluating summary quality, Sentence-BERT
        <xref ref-type="bibr" rid="ref20">(Reimers and Gurevych, 2019)</xref>
        and BERTScore
        <xref ref-type="bibr" rid="ref31">(Zhang et al., 2019)</xref>
        , and we assess their performance on the competition dataset to observe how well they correlate with human judgment.
      </p>
      <p>In the next section, we cover work relevant to the goal of this research task. Section 3 presents the methodology used in our case. Then, Section 4 presents the results of our experiments. Finally, we discuss the conclusions of the paper.</p>
      <sec id="sec-1-1">
        <title>2 Related Work</title>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Pre-trained German BERT models evaluated in this work and the corpora used for their training.</p>
          </caption>
          <table>
            <thead>
              <tr><th>BERT Model</th><th>Corpora used for training</th></tr>
            </thead>
            <tbody>
              <tr><td>Deepset.ai</td><td>Wikipedia, legal data, news</td></tr>
              <tr><td>bert-base-german-europeana-uc</td><td>Europeana newspapers</td></tr>
              <tr><td>bert-base-german-uc</td><td>Wikipedia, subtitles, news, Common Crawl</td></tr>
              <tr><td>literary-german-bert</td><td>German fiction literature</td></tr>
              <tr><td>bert-adapted-german-press</td><td>Newspapers</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          ROUGE
          <xref ref-type="bibr" rid="ref3">(Lin, 2004)</xref>
          and METEOR
          <xref ref-type="bibr" rid="ref2">(Banerjee and Lavie, 2005)</xref>
          are the most used metrics to assess summaries. These measures, based on n-gram matching, stand out through their simplicity and a relatively good correlation with human evaluations. Although these metrics and their variants are widely used, there are valid objections to their limitations
          <xref ref-type="bibr" rid="ref21">(Reiter, 2018)</xref>
          .
        </p>
        <p>
          In recent years, metrics based on word embeddings, as well as measures based on deep learning models, have gained more attention from researchers. Word embeddings
          <xref ref-type="bibr" rid="ref17 ref19">(Mikolov et al., 2013; Pennington et al., 2014)</xref>
          are dense representations of words in a vector space. Using these representations rather than an n-gram decomposition of the texts, researchers computed summary similarity scores. Either by enhancing existing metrics like BLEU
          <xref ref-type="bibr" rid="ref25 ref29">(Wang and Merlo, 2016; Servan et al., 2016)</xref>
          or by using an adapted version of the Earth Mover's Distance proposed by
          <xref ref-type="bibr" rid="ref22">Rubner et al. (1998)</xref>
          <xref ref-type="bibr" rid="ref13 ref8 ref5">(Li et al., 2019; Echizen-ya et al., 2019; Clark et al., 2019)</xref>
          , these representations proved to be more in tune with human judgment than traditional measures such as ROUGE, METEOR, and BLEU.
        </p>
        <p>
          Another application of deep learning to evaluation metrics for scoring summaries is measures learned by a model. For instance, models like ReVal
          <xref ref-type="bibr" rid="ref10">(Gupta et al., 2015)</xref>
          or RUSE
          <xref ref-type="bibr" rid="ref26">(Shimanaka et al., 2018)</xref>
          learn sentence-level embeddings for the input sentences and then compute a similarity score between them. A common architecture in summary scoring is the siamese neural network
          <xref ref-type="bibr" rid="ref4">(Bromley et al., 1994)</xref>
          . Ruseti et al. (2018) used a siamese BiGRU neural network to score candidate summaries against the source text. Further, Xia et al. (2019) proposed three architectures (i.e., CNN, LSTM, and attention mechanism-based LSTM) to assess students' reading comprehension by scoring their summaries against the source text.
        </p>
        <p>
          Pre-trained language models based on
Transformers
          <xref ref-type="bibr" rid="ref28">(Vaswani et al., 2017)</xref>
          , such as BERT
          <xref ref-type="bibr" rid="ref7">(Devlin et al., 2019)</xref>
          and RoBERTa
          <xref ref-type="bibr" rid="ref15">(Liu et al., 2019)</xref>
          ,
have improved performance on many natural language processing tasks in the last year. In contrast to previous word embeddings, these contextual embeddings can produce different vector representations for the same word in distinct sentences, depending on the neighboring words. Since contextual embeddings also capture the context of words in the token representations of the input sentences, evaluation metrics based on them tend to be more correlated with human evaluations. For instance, both the BERT adaptation of RUSE and BERT with an appended regressor outperformed the individual RUSE model
          <xref ref-type="bibr" rid="ref27">(Shimanaka et al., 2019)</xref>
          . Also, Zhao et al. (2019) show that MoverScore, the Word Mover's Distance
          <xref ref-type="bibr" rid="ref12">(Kusner et al., 2015)</xref>
          over contextualized embeddings, can achieve state-of-the-art performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Methodology</title>
      <p>
        In our case, we adopt two novel BERT-based
metrics, Sentence-BERT
        <xref ref-type="bibr" rid="ref20">(Reimers and Gurevych, 2019)</xref>
        and BERTScore
        <xref ref-type="bibr" rid="ref31">(Zhang et al., 2019)</xref>
        , to automatically assess pairs of German candidate-reference summaries. Specifically, for the two metrics, we evaluate five different pre-trained BERT models, as listed in Table 1. In each experiment, we generated a score between 0 and 1 for every candidate-reference summary pair and then submitted the resulting file to the competition website for error evaluation.
      </p>
      <p>Sentence-BERT In order to derive fixed-size embeddings for the two input summaries (i.e., the candidate and the reference summary, respectively), Sentence-BERT uses a siamese network architecture with a pooling layer on top of BERT. There are three scenarios available for the pooling layer: using the output corresponding to the [CLS] token, taking the mean of the vector representations over all BERT output tokens, and taking the max-over-time of these output vectors. Our experiments indicated that only the mean-vector scenario delivers optimal scores.</p>
      <p>The pre-trained models from Table 1 are available at https://deepset.ai/german-bert, https://github.com/dbmdz/berts, https://huggingface.co/severinsimmler/literary-german-bert, and https://huggingface.co/severinsimmler/german-press-bert.</p>
      <p>Through fine-tuning, Sentence-BERT produces summary-level embeddings that capture both the semantics and the context of these texts. The two summary embeddings can then be compared using the cosine similarity measure.</p>
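      <p>A minimal numpy sketch of this comparison step, with toy two-dimensional vectors standing in for real Sentence-BERT token outputs (the helper names and the toy values are ours, for illustration only):</p>

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    # Mean pooling over the token vectors, the strategy that worked best
    # in our experiments.
    return token_vectors.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy token embeddings for a candidate and a reference summary
# (real Sentence-BERT vectors are much higher-dimensional).
candidate_tokens = np.array([[0.2, 0.8], [0.4, 0.6]])
reference_tokens = np.array([[0.5, 0.5]])

score = cosine_similarity(mean_pool(candidate_tokens),
                          mean_pool(reference_tokens))
```

      <p>The resulting score already lies in a range suitable for the competition's [0, 1] scale when the embeddings have non-negative similarity.</p>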
      <p>
        BERTScore In contrast to Sentence-BERT,
BERTScore is a token-level matching metric.
Since BERT-based models use a Wordpiece
tokenizer
        <xref ref-type="bibr" rid="ref24">(Schuster and Nakajima, 2012)</xref>
        , both the candidate (sc) and reference (sr) summaries are split into k and m tokens, respectively. The vector-space representations vc and vr, for sc and sr respectively, are then computed through the 12 Transformer layers
        <xref ref-type="bibr" rid="ref28">(Vaswani et al., 2017)</xref>
        . Using a greedy matching approach, the resulting tokens are paired, and the precision, recall, and F1 scores are determined:
      </p>
      <p>P_{BERT} = \frac{1}{k} \sum_{v_i^c \in v_c} \max_{v_j^r \in v_r} (v_i^c)^\top v_j^r</p>
      <p>R_{BERT} = \frac{1}{m} \sum_{v_j^r \in v_r} \max_{v_i^c \in v_c} (v_i^c)^\top v_j^r</p>
      <p>F1_{BERT} = \frac{2 \, P_{BERT} \, R_{BERT}}{P_{BERT} + R_{BERT}}</p>
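      <p>The greedy matching above can be sketched in numpy as follows. The toy vectors and all names here are ours; the optional idf arguments mirror the importance weighting described in the BERTScore paper (uniform weights reduce to the plain formulas):</p>

```python
import numpy as np

def bert_score(vc, vr, idf_c=None, idf_r=None):
    """Greedy-matching BERTScore from token embeddings.

    vc: (k, d) candidate token vectors; vr: (m, d) reference token vectors.
    idf_c / idf_r: optional importance weights per token (uniform if omitted).
    """
    # Normalize rows so dot products become cosine similarities.
    vc = vc / np.linalg.norm(vc, axis=1, keepdims=True)
    vr = vr / np.linalg.norm(vr, axis=1, keepdims=True)
    sim = vc @ vr.T  # (k, m) pairwise token similarities

    idf_c = np.ones(len(vc)) if idf_c is None else np.asarray(idf_c)
    idf_r = np.ones(len(vr)) if idf_r is None else np.asarray(idf_r)

    # Precision: each candidate token greedily matches its best reference token.
    p = (sim.max(axis=1) * idf_c).sum() / idf_c.sum()
    # Recall: each reference token greedily matches its best candidate token.
    r = (sim.max(axis=0) * idf_r).sum() / idf_r.sum()
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Toy example: two candidate tokens, one reference token.
p, r, f1 = bert_score(np.array([[1.0, 0.0], [0.0, 1.0]]),
                      np.array([[1.0, 0.0]]))  # p = 0.5, r = 1.0
```

      <p>In our experiments, the real token vectors come from the last Transformer layer of the fine-tuned German BERT models.</p>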
      <p>Additionally, for each word from all candidate-reference summary pairs, we compute the inverse document frequency (idf) based on the source texts of the summaries and use it for importance weighting in BERTScore, as described in the original paper. We also tested the score re-scaling strategy suggested by the authors, but performance did not improve.</p>
    </sec>
    <sec id="sec-3">
      <title>4 Performance Evaluation</title>
      <sec id="sec-2-1">
        <title>4.1 Corpus</title>
        <p>The experimental data consisted of 216 entries, each containing a German-language source text, its reference summary, and a summary proposed for evaluation. More specifically, there were 24 distinct source texts, each with one reference summary and nine summaries proposed for evaluation. All texts were provided in lower case, with punctuation and quotations intact. The length of the source texts varied from around 2,000 to 12,000 characters, averaging around 5,800 characters. The length of the reference summaries varied from 3% to 13% of the source text length, with an average of 6%. Moreover, the length of the candidate summaries varied from 0.6% to 21% of the source text length, with an average of around 6%.</p>
      </sec>
      <sec id="sec-3-1">
        <title>4.2 BERT Fine-tuning</title>
        <p>
          We fine-tune the aforementioned BERT models (see Table 1) using the Opusparcus corpus
          <xref ref-type="bibr" rid="ref6">(Creutz, 2018)</xref>
          , which introduced 3168 human-annotated paraphrase pairs sourced from the OpenSubtitles2016 collection of parallel corpora
          <xref ref-type="bibr" rid="ref14">(Lison and Tiedemann, 2016)</xref>
          . The paraphrase pairs are scored on a scale from 1 to 4, in 0.5 increments, where 4 is a good match and 1 is a bad match. For our fine-tuning purposes, we translated the scores to the [0, 1] interval according to Table 2.
        </p>
        <p>In order to train Sentence-BERT, we used the Opusparcus dataset with the modified scores for five training epochs, with a mean squared error loss. Further, we use the fine-tuned BERT models as the basis for computing BERTScore.</p>
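        <p>The Table 2 score translation used for these training labels amounts to a simple lookup; a minimal sketch (the helper name is ours, the values are those of Table 2):</p>

```python
# Translation of Opusparcus ratings (1-4 in 0.5 steps) to the [0, 1]
# similarity labels used for fine-tuning (values from Table 2).
RATING_TO_SCORE = {
    4.0: 0.85, 3.5: 0.70, 3.0: 0.50, 2.5: 0.30,
    2.0: 0.20, 1.5: 0.10, 1.0: 0.05,
}

def to_similarity_label(rating: float) -> float:
    # Snap to the nearest 0.5 step, then look up the translated score.
    return RATING_TO_SCORE[round(rating * 2) / 2]
```

        <p>With the sentence-transformers library, such labels would typically be wrapped in InputExample pairs and trained with CosineSimilarityLoss, which minimizes the mean squared error between the predicted cosine similarity and the label.</p>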
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Translation of Opusparcus ratings to similarity scores in the [0, 1] interval.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Opusparcus Rating</th><th>Similarity Score</th></tr>
            </thead>
            <tbody>
              <tr><td>4</td><td>0.85</td></tr>
              <tr><td>3.5</td><td>0.70</td></tr>
              <tr><td>3</td><td>0.50</td></tr>
              <tr><td>2.5</td><td>0.30</td></tr>
              <tr><td>2</td><td>0.20</td></tr>
              <tr><td>1.5</td><td>0.10</td></tr>
              <tr><td>1</td><td>0.05</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In Table 3, we show the results of our experiments. First of all, we find that training Sentence-BERT with the literary-german-bert and bert-adapted-german-press models, using the score translation from Opusparcus ratings to the [0, 1] interval, delivered a more accurate evaluation.</p>
        <p>For BERTScore, after trying out the vectors from several layers, we concluded that using the last layer for the token representations yields the best performance. Using the BERT models fine-tuned with Sentence-BERT as the basis for BERTScore improved the error rate for all pre-trained BERT models, but had a significant impact on the case-sensitive version from deepset.ai, which delivered the best result of 31.9925. Fine-tuning the uncased BERT versions with Sentence-BERT before applying BERTScore did add some improvement, but the small decrease in error may not justify the computational effort. On the other hand, for the cased BERT version, the increase in performance was significant.</p>
        <p>Overall, BERTScore correlated more closely with the human evaluators, regardless of the pre-trained BERT model used. The idf weighting improved the final result by about 1 percentage point.</p>
        <p>As expected, since the provided summaries had no capitalization and since capitalization carries significant meaning in the German language, the case-sensitive version, without fine-tuning, performed worse for both metrics. Also, the BERT model pre-trained on the Europeana newspapers corpus performed the best for both metrics.</p>
        <p>As seen in Table 4, the scores obtained by our best model are at least 10 percentage points better than the baselines. Surprisingly, of all the baseline scoring methods, BLEU performed the best.</p>
        <p>In this paper, we analyzed the robustness of two different metrics (i.e., Sentence-BERT and BERTScore) based on the pre-trained BERT language model, with application to the automatic assessment of summary quality. Intuitively, Sentence-BERT learns embeddings for the two input summaries, whereas BERTScore focuses on the token-level embeddings in each summary and computes an average score from them. Compared to classical scoring methods, like BLEU, ROUGE, or METEOR, these metrics are more compute-intensive and lack the simple explainability that classical scores provide. Also, as seen in our experiments, the scores can differ depending on which pre-trained BERT model is used.</p>
        <p>Since BERT embeddings are context-dependent, the simpler approach, BERTScore, proves to be more in tune with the human evaluators. Also, computationally, BERTScore is much easier to streamline since it does not require an additional training dataset. Due to the lack of qualitative, manually annotated datasets of German paraphrases, the easiest use in production would be BERTScore with an appropriate cased model. We also showed that BERTScore applied to a BERT model fine-tuned on a paraphrase dataset with the Sentence-BERT similarity objective can lead to a higher correlation between human assessments and the automatic scores.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Mehdi</given-names>
            <surname>Allahyari</surname>
          </string-name>
          , Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei,
          <string-name>
            <given-names>Elizabeth D</given-names>
            <surname>Trippe</surname>
          </string-name>
          , Juan B Gutierrez, and
          <string-name>
            <given-names>Krys</given-names>
            <surname>Kochut</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Text summarization techniques: a brief survey</article-title>
          .
          <source>arXiv preprint arXiv:1707.02268</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Satanjeev</given-names>
            <surname>Banerjee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alon</given-names>
            <surname>Lavie</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Meteor: An automatic metric for mt evaluation with improved correlation with human judgments</article-title>
          .
          <source>In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Chin-Yew</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Proceedings of Workshop on Text Summarization Branches Out</source>
          , PostConference Workshop of ACL.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jane</given-names>
            <surname>Bromley</surname>
          </string-name>
          , Isabelle Guyon,
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Eduard Säckinger, and
          <string-name>
            <given-names>Roopak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Signature verification using a "siamese" time delay neural network</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          Asli Celikyilmaz, and Noah A Smith
          .
          <year>2019</year>
          .
          <article-title>Sentence mover's similarity: Automatic evaluation for multi-sentence texts</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>2748</fpage>
          -
          <lpage>2760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Creutz</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Open subtitles paraphrase corpus for six languages</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Hiroshi</given-names>
            <surname>Echizen-ya</surname>
          </string-name>
          , Kenji Araki, and Eduard Hovy
          .
          <year>2019</year>
          .
          <article-title>Word embedding-based automatic mt evaluation metric using word position information</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>1874</fpage>
          -
          <lpage>1883</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>William</given-names>
            <surname>Grabe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xiangying</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Assessing reading. The companion to language assessment</article-title>
          ,
          <volume>1</volume>
          :
          <fpage>185</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Rohit</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Constantin Orasan, and Josef van Genabith.
          <year>2015</year>
          .
          <article-title>Reval: A simple and effective machine translation evaluation metric based on recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1066</fpage>
          -
          <lpage>1072</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          and Julia R Galliers
          .
          <year>1995</year>
          .
          <article-title>Evaluating natural language processing systems: An analysis and review</article-title>
          , volume
          <volume>1083</volume>
          . Springer Science &amp; Business Media.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Matt</given-names>
            <surname>Kusner</surname>
          </string-name>
          , Yu Sun, Nicholas Kolkin, and
          <string-name>
            <given-names>Kilian</given-names>
            <surname>Weinberger</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>From word embeddings to document distances</article-title>
          .
          <source>In International conference on machine learning</source>
          , pages
          <fpage>957</fpage>
          -
          <lpage>966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Pairui</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chuan</given-names>
            <surname>Chen</surname>
          </string-name>
          , Wujie Zheng, Yuetang Deng, Fanghua Ye, and
          <string-name>
            <given-names>Zibin</given-names>
            <surname>Zheng</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Std: An automatic evaluation metric for machine translation based on word embeddings</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>27</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1497</fpage>
          -
          <lpage>1506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and Jörg Tiedemann.
          <year>2016</year>
          .
          <article-title>Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          , pages
          <fpage>923</fpage>
          -
          <lpage>929</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Shrikant</given-names>
            <surname>Malviya</surname>
          </string-name>
          and Uma Shanker Tiwary.
          <year>2016</year>
          .
          <article-title>Knowledge based summarization and document generation using bayesian network</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>89</volume>
          :
          <fpage>333</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <given-names>WeiJing</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In Proceedings of the 40th annual meeting on association for computational linguistics</source>
          , pages
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          . arXiv preprint arXiv:1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Ehud</given-names>
            <surname>Reiter</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A structured review of the validity of BLEU</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>44</volume>
          (
          <issue>3</issue>
          ):
          <fpage>393</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Yossi</given-names>
            <surname>Rubner</surname>
          </string-name>
          , Carlo Tomasi, and
          <string-name>
            <given-names>Leonidas J</given-names>
            <surname>Guibas</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>A metric for distributions with applications to image databases</article-title>
          .
          <source>In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271)</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Ruseti</surname>
          </string-name>
          , Mihai Dascalu, Amy M Johnson, Danielle S McNamara, Renu Balyan, Kathryn S McCarthy, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Trausan-Matu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Scoring summaries using recurrent neural networks</article-title>
          .
          <source>In International Conference on Intelligent Tutoring Systems</source>
          , pages
          <fpage>191</fpage>
          -
          <lpage>201</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Mike</given-names>
            <surname>Schuster</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kaisuke</given-names>
            <surname>Nakajima</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Japanese and Korean voice search</article-title>
          .
          <source>In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>5149</fpage>
          -
          <lpage>5152</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Christophe</given-names>
            <surname>Servan</surname>
          </string-name>
          , Alexandre Bérard, Zied Elloumi, Hervé Blanchon, and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Besacier</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources</article-title>
          ?
          <source>In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</source>
          , pages
          <fpage>1159</fpage>
          -
          <lpage>1168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Hiroki</given-names>
            <surname>Shimanaka</surname>
          </string-name>
          , Tomoyuki Kajiwara, and
          <string-name>
            <given-names>Mamoru</given-names>
            <surname>Komachi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>RUSE: Regressor using sentence embeddings for automatic machine translation evaluation</article-title>
          .
          <source>In Proceedings of the Third Conference on Machine Translation: Shared Task Papers</source>
          , pages
          <fpage>751</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Hiroki</given-names>
            <surname>Shimanaka</surname>
          </string-name>
          , Tomoyuki Kajiwara, and
          <string-name>
            <given-names>Mamoru</given-names>
            <surname>Komachi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Machine translation evaluation with BERT regressor</article-title>
          . arXiv preprint arXiv:1907.12679.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Haozhou</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paola</given-names>
            <surname>Merlo</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Modifications of machine translation evaluation metrics by using word embeddings</article-title>
          .
          <source>In Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Menglin</given-names>
            <surname>Xia</surname>
          </string-name>
          , Ekaterina Kochmar, and
          <string-name>
            <given-names>Ted</given-names>
            <surname>Briscoe</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Automatic learner summary assessment for reading comprehension</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>2532</fpage>
          -
          <lpage>2542</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Tianyi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Varsha Kishore, Felix Wu, Kilian Q Weinberger, and
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Artzi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          . arXiv preprint arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Wei</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Maxime</given-names>
            <surname>Peyrard</surname>
          </string-name>
          , Fei Liu,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Gao</surname>
          </string-name>
          , Christian M Meyer, and
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Eger</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MoverScore: Text generation evaluating with contextualized embeddings and Earth Mover Distance</article-title>
          . arXiv preprint arXiv:1909.02622.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>