<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid Ensemble Predictor as Quality Metric for German Text Summarization: Fraunhofer IAIS at GermEval 2020 Task 3</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>David Biesner</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose an alternative quality metric to evaluate automatically generated texts based on an ensemble of different scores, combining simple rule-based metrics with more complex models of very different nature, including ROUGE, tf-idf, neural sentence embeddings, and a matrix factorization method. Our approach achieved one of the top scores on the second German Text Summarization Challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In our previous work on automatic text
summarization
        <xref ref-type="bibr" rid="ref3">(Brito et al., 2019)</xref>
        , we concluded by
criticizing the suitability of ROUGE scores
        <xref ref-type="bibr" rid="ref5">(Lin, 2004)</xref>
        for overall evaluation purposes. These and other
common quality metrics found in the automatic
text summarization literature like BLEU
        <xref ref-type="bibr" rid="ref7">(Papineni
et al., 2002)</xref>
        or METEOR
        <xref ref-type="bibr" rid="ref2">(Banerjee and Lavie,
2005)</xref>
        are far from optimal since they only
focus on lexical overlap as a proxy for assessing
content selection. Not only do they penalize
certain abstractions (e.g. when the original sentences
are heavily reformulated or when synonyms are
used), but they also ignore other aspects that are
usually considered desirable in good summaries,
including grammatical correctness and
compactness.
      </p>
      <p>The second German Text Summarization
Challenge aims to address this issue by releasing a text
corpus with several summaries per text<sup>1</sup>. Its
participants were asked to rate these summaries with new
ideas and solutions regarding an automatic quality
assessment of German text summarizations. We
propose to combine the advantages of neural
approaches that excel at encoding semantic textual
similarity (and are thus suitable to predict content)
with statistical and rule-based metrics that can
evaluate other important summarization aspects such
as compactness and abstractiveness.</p>
      <p><sup>1</sup> https://swisstext-and-konvens-2020.org/2nd-german-text-summarization-challenge.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted
under Creative Commons License Attribution 4.0
International (CC BY 4.0).
These co-first authors contributed equally to this work.</p>
      <p>In our approach, we employ an ensemble of 7
statistically significant predictors (p-value &lt; 15%)
in a linear regression model (see Table 2).
Comparing our predictions to the competition host’s own
non-public annotations, we achieved a score (i.e.
loss) of 33.72, one of the lowest and therefore best
scores among the participating teams.</p>
      <p>In the following sections, we detail the different
metrics that we considered and how we optimized
their combination.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Experimental Setup</title>
      <p>This section describes our experimental setup,
namely the underlying dataset and the
methodological approach.</p>
      <sec id="sec-2-1">
        <title>2.1 Data</title>
        <p>The shared task organizers released a corpus
consisting of 216 texts with a corresponding reference
summary and a generated summary, each of them
rated with a value between 0 (bad) and 1 (excellent).</p>
        <p>In order to evaluate the methods we manually
annotated all summaries in the dataset with a score
from 0 to 1. We independently rated a part of the
corpus each, such that different human biases can
be compensated to a certain extent. A submission
of these annotations to the competition received a
high score, indicating a large similarity to the gold
standard annotations set by the organizers.
Additionally, we expanded the dataset by considering
the given reference summaries as perfect generated
summaries with an automatic score of 1.</p>
        <p>This results in a dataset of 248 summary texts
with their corresponding score, which is used to
evaluate the unsupervised methods described
below.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Methodology</title>
        <p>We address this challenge as a metric learning
problem, where we define a set of unsupervised
predictors, each covering one or several of
the required properties of a good summary (content
relevance, compactness, abstractiveness and
grammatical correctness). After calculating all (unsupervised)
predictor scores for each document, we apply
min-max normalization to ensure all scores lie in
the closed interval [0, 1]. In a final step, we
ensemble these predictors in a capped linear regression
model (output between 0 and 1), which is trained
via ordinary least squares on our manual summary
annotations (see Section 2.1). We iteratively
remove non-significant predictors (p-value ≥ 15%)
and re-run the regression model until all predictors
yield significant t-statistics, i.e. until the two-sided
85% confidence interval of each coefficient
excludes zero. Due to the limited number of documents
and the loss of interpretability, we refrain from
including non-linearities (e.g. multiple layers,
nonlinear activation functions, interaction terms of
different polynomial degrees, etc.) in the regression
model. Also, by using a simple linear ensemble
model, we reduce the likelihood of overfitting on
our annotations, especially since no validation set
for parameter tuning is available.</p>
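        <p>The normalization and capping steps described above can be sketched as follows.
This is a minimal illustration only: the function names are hypothetical, the handling of a
constant predictor is our own assumption, and the coefficients in the usage note are toy values rather than
fitted ones.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale the raw scores of one predictor to the closed interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant predictor carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def capped_linear_score(features: Sequence[float],
                        coefs: Sequence[float],
                        intercept: float) -> float:
    """Linear combination of normalized predictor scores, clipped to [0, 1]."""
    raw = intercept + sum(c * x for c, x in zip(coefs, features))
    return min(1.0, max(0.0, raw))
```
]]></preformat>
        <p>For example, <monospace>capped_linear_score([1.0, 1.0], [0.8, 0.7], 0.1)</monospace> has a raw
value of 1.6 and is therefore capped at 1.0.</p>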
        <p>The following subsections focus on our
predictors and describe their functionality. We start
by presenting three content predictors, which all
determine the most important words in the original
text and compute the fraction of these
words that occur in the generated summaries. We
assume that the most important words in a document
capture the essence of the text and thus function as a
proxy for content relevance. We continue with
neural language model driven predictors, which
primarily focus on content relevance and
grammatical correctness. We also include the standard
quality metrics for automatic summary evaluation,
ROUGE, BLEU, and METEOR, which all aim
to measure content relevance as well. The
remaining predictors are mainly rule-based and refer
largely to compactness and abstractiveness.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2.1 Tf-Idf content predictor</title>
        <p>A very popular text vectorization method is tf–idf
(Term frequency – Inverse document frequency).
It is a frequency-based statistic, which intends to
reflect how important a word is to a “document” in
a corpus.</p>
        <p>Given that our entire corpus contains N
documents and the vocabulary of our corpus is of size
K, we can collect the individual tf–idf scores in
a matrix <inline-formula><tex-math>M \in \mathbb{R}^{N \times K}</tex-math></inline-formula>. Each row vector in this
matrix corresponds to a document embedding. We
find the 10 most important words per document by
sorting the tf–idf scores within each
embedding in decreasing order.</p>
        <p>We utilize the sklearn<sup>2</sup> implementation of
the TfidfVectorizer and restrict our
vocabulary to words with a document frequency below 0.9.
Before vectorization, we apply lower-casing,
punctuation and stop word removal, and stemming to
the entire text corpus, which helps to better capture
meaning and content in the text’s vector
representation.</p>
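        <p>To make the content predictor concrete, the following sketch ranks the words of each
document by a plain tf–idf score and computes the fraction of top words that reappear in a summary.
It is a simplified stand-in, not the sklearn pipeline itself: it uses the textbook weighting
tf · log(N / df) (sklearn's TfidfVectorizer applies a smoothed idf and L2 normalization by default),
it expects already preprocessed token lists, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import List, Sequence


def top_tfidf_words(docs: Sequence[Sequence[str]], k: int = 10) -> List[List[str]]:
    """Return the k highest-scoring tf-idf words for every tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    top_words = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        # Sort by decreasing score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words.append(ranked[:k])
    return top_words


def content_score(important_words: Sequence[str], summary_tokens: Sequence[str]) -> float:
    """Fraction of the important words that occur in the summary."""
    if not important_words:
        return 0.0
    present = set(summary_tokens)
    return sum(w in present for w in important_words) / len(important_words)
```
]]></preformat>
        <p>The same <monospace>content_score</monospace> helper would apply unchanged to any other list of
important words, since only the word selection step differs between the content predictors.</p>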
      </sec>
      <sec id="sec-2-4">
        <title>2.2.2 NMF content predictor</title>
        <p>
          NMF (Nonnegative Matrix Factorization)
          <xref ref-type="bibr" rid="ref4 ref6">(Paatero
and Tapper, 1994; Lee and Seung, 2001)</xref>
          is a
common matrix factorization technique frequently used
for topic modeling. In previous work, we find that
NMF achieves good results in clustering document
words to a predefined number of latent topics.
Assuming that a good summary should cover all main
topics in a text, we apply NMF on each document,
and determine the top 5 important words per
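latent topic dimension. In particular, we factorize
the document’s symmetrical co-occurrence matrix<sup>3</sup>
<inline-formula><tex-math>S \in \mathbb{R}^{N \times N}</tex-math></inline-formula> into a nonnegative loading matrix
<inline-formula><tex-math>W \in \mathbb{R}^{N \times M}</tex-math></inline-formula> and a nonnegative affinity matrix
<inline-formula><tex-math>H \in \mathbb{R}^{M \times N}</tex-math></inline-formula>,
        </p>
        <p>
          <disp-formula><label>(1)</label><tex-math>S = W H + E,</tex-math></disp-formula>
where N is the vocabulary size of the document in
question, M = 10 is the number of latent topics
and <inline-formula><tex-math>E \in \mathbb{R}^{N \times N}</tex-math></inline-formula> is the error matrix, whose elements
approach zero for a perfect decomposition.</p>
        <p><sup>2</sup> https://github.com/scikit-learn/scikit-learn.</p>
        <p><sup>3</sup> We apply the same document preprocessing as in Section
2.2.1 before calculating the co-occurrence matrix. Also, we
choose a window size of 5 and each context word j
contributes 1/d to the total word pair count, given it is d words
apart from the base word i.</p>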
        <p>For both W and <inline-formula><tex-math>H^{T}</tex-math></inline-formula>, we assign each word (row
vector) to the latent topic dimension with the
highest value. Next, we sort the assigned
words per topic in decreasing order, so that the most distinct topic
words are ranked on top. Finally, we obtain the
important words per document by removing all duplicates
from the selected topic words of W and H.</p>
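        <p>The distance-weighted co-occurrence counts that serve as input to the factorization
(window size 5, weight 1/d for a context word d positions away) can be sketched as below.
The sparse dictionary representation and the function name are our own illustrative choices, and
counting each pair in both directions is an assumption made to obtain a symmetric matrix.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def cooccurrence(tokens: Sequence[str], window: int = 5) -> Dict[Tuple[str, str], float]:
    """Symmetric co-occurrence counts where a pair d words apart contributes 1/d."""
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for i, base in enumerate(tokens):
        # Look ahead only; adding both (base, ctx) and (ctx, base) keeps S symmetric.
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            ctx = tokens[j]
            counts[(base, ctx)] += 1.0 / d
            counts[(ctx, base)] += 1.0 / d
    return dict(counts)
```
]]></preformat>
        <p>The resulting pair counts would then be arranged into the N × N matrix S over the
document vocabulary before applying NMF.</p>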
      </sec>
      <sec id="sec-2-5">
        <title>2.2.3 Flair NER content predictor</title>
        <p>
          Flair
          <xref ref-type="bibr" rid="ref1">(Akbik et al., 2018)</xref>
          is a specific contextual
string embedding architecture. The backbone of
the flair framework is a pretrained character-based
language model (based on an LSTM<sup>4</sup>-RNN), which
is bidirectionally trained on a huge independent text
corpus for different languages, including German.
        </p>
        <p>
          Built on top of this language model, the
framework provides a German named entity tagger,
which is pretrained on the CoNLL-03 dataset
          <xref ref-type="bibr" rid="ref9">(Sang
and De Meulder, 2003)</xref>
          . First, raw and unprocessed
text is fed sequentially into the encoding part of the
bidirectional language model. Second, we retrieve
for each word i a contextual embedding by
concatenating the forward model’s hidden state after word
i and the backward model’s hidden state before
word i. This word embedding is then passed into a
vanilla BiLSTM-CRF<sup>5</sup> sequence labeler.
        </p>
        <p>We apply this sequence tagger to our raw
input documents and consider all predicted named
entities as the document’s important words.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.2.4 Flair grammar predictor</title>
        <p>In order to evaluate grammatical correctness, we
again leverage the aforementioned flair language
model, which was trained as an auto-encoder to
correctly predict the next character in a text. For
a grammatically correct text we would expect the
model to mostly guess the next character correctly.
A text with grammatical errors however would not
match the expectations of the model, thus creating
a larger reconstruction error on the characters that
do not fit grammatically. To assess grammatical
correctness, we feed the summary text through the
model and score the summary based on the
accumulated reconstruction error.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.2.5 Sentence-BERT predictor</title>
        <p>
          We explore how sentence embeddings can be used
to measure how semantically similar a
summary is to its original text. In particular,
we infer sentence embeddings with the pretrained
bert-base-german-uncased BERT model from
HuggingFace’s transformers library
          <xref ref-type="bibr" rid="ref10">(Wolf et al.,
2019)</xref>
          in the fashion proposed with the
Sentence-BERT architecture
          <xref ref-type="bibr" rid="ref8">(Reimers and Gurevych, 2019)</xref>
          .
The output of the BERT model is max-pooled to
obtain a fixed-size vector for each processed piece
of text. This way, we can obtain embeddings for
both the original text and each of the summaries.
The resulting predictor score is thus the cosine
similarity of the summary vector with the original text
vector.
        </p>
        <p><sup>4</sup> Long Short-Term Memory.</p>
        <p><sup>5</sup> Bi-directional Long Short-Term Memory Conditional
Random Field.</p>
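        <p>The pooling and similarity steps can be illustrated independently of the BERT model.
The sketch below assumes that the per-token output vectors are already available as plain lists of
floats; the helper names are hypothetical, and returning 0.0 for a zero vector is our own convention.</p>
        <preformat><![CDATA[
```python
import math
from typing import List, Sequence


def max_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over all token vectors, giving one fixed-size vector."""
    return [max(dim) for dim in zip(*token_vectors)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: undefined similarity is reported as 0.0
    return dot / (norm_u * norm_v)
```
]]></preformat>
        <p>The predictor score for one summary would then be
<monospace>cosine_similarity(max_pool(summary_vectors), max_pool(text_vectors))</monospace>.</p>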
      </sec>
      <sec id="sec-2-8">
        <title>2.2.6 ROUGE predictor</title>
        <p>The ROUGE score is a classic metric for assessing
the quality of summaries. Even though it alone is
not sufficient to evaluate summaries it can give
useful insight when applied in an ensemble setting. We
calculate the rouge-1, rouge-2 and rouge-L scores
between the summary and both the full original
text and the reference summary. While rouge-1
and rouge-2 calculate the overlap of unigrams and
bigrams (i.e. single words and adjacent word pairs)
between reference text and summary, rouge-L
evaluates the longest common subsequence between
reference and summary.</p>
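        <p>As a rough illustration of the three variants, the sketch below computes recall-oriented
n-gram overlap and a longest-common-subsequence score on tokenized text. Whether recall, precision
or an F-measure is used is not stated above, so the recall form is an assumption here, and the
function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of all n-grams in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference: Sequence[str], summary: Sequence[str], n: int) -> float:
    """Share of reference n-grams that also occur in the summary (clipped counts)."""
    ref, hyp = ngrams(reference, n), ngrams(summary, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((ref & hyp).values()) / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(reference: Sequence[str], summary: Sequence[str]) -> float:
    """Longest common subsequence length relative to the reference length."""
    if not reference:
        return 0.0
    return lcs_length(reference, summary) / len(reference)
```
]]></preformat>
        <p>For the reference "der hund jagt die katze" and the summary "der hund sieht die katze",
this yields a rouge-1 recall of 0.8, a rouge-2 recall of 0.5 and a rouge-L recall of 0.8.</p>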
      </sec>
      <sec id="sec-2-9">
        <title>2.2.7 BLEU predictor</title>
        <p>BLEU is a metric that calculates an n-gram
precision between one or multiple reference texts and
a summary hypothesis, in which n-gram counts
in the summary are compared to their maximum
count in one of the references.</p>
      </sec>
      <sec id="sec-2-10">
        <title>2.2.8 METEOR predictor</title>
        <p>METEOR is a metric that calculates a harmonic
mean of the recall and precision of a unigram
matching between a reference text and a summary,
and additionally takes word order into account.</p>
      </sec>
      <sec id="sec-2-11">
        <title>2.2.9 Compactness predictor</title>
        <p>We calculate the compactness score as the
compression rate with respect to the original text, where the
text length is measured by the number of
characters.</p>
      </sec>
      <sec id="sec-2-12">
        <title>2.2.10 Number matching predictor</title>
        <p>A good summary should be factually correct.
While there might be some ambiguity from
different word choices between original text and
summary, there usually is only one way to display exact
numbers like dates. We thus expect every number
in the summary to also appear in the original text.
To assess factual correctness regarding numbers,
we count how many of the numbers in the summary
are also present in the text.</p>
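        <p>A possible reading of this check is sketched below. The regular expression for numbers,
the normalization to a fraction, and the score of 1.0 for summaries without any number are all
assumptions made for illustration; the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
import re

# Digits, optionally grouped by "." or "," (e.g. 2019, 1.500, 12.03.2019).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def number_matching_score(text: str, summary: str) -> float:
    """Fraction of numbers in the summary that appear verbatim in the original text."""
    text_numbers = set(NUMBER.findall(text))
    summary_numbers = NUMBER.findall(summary)
    if not summary_numbers:
        return 1.0  # assumption: no numbers means nothing can be factually wrong
    hits = sum(n in text_numbers for n in summary_numbers)
    return hits / len(summary_numbers)
```
]]></preformat>
        <p>Because the comparison is purely on the surface string, a date such as 12.03.2019 in the
text would not match a bare 2019 in the summary under this sketch.</p>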
      </sec>
      <sec id="sec-2-13">
        <title>2.2.11 Sentence copying predictor</title>
        <p>At times, one can generate a usable summary by
simply extracting the first sentences of the
original text, since they often provide an introduction
and therefore a mini-summary of the remaining
text. However, the goal of our evaluation is
finding abstractive and novel summaries. We therefore
perform a binary check on whether the summary
exactly matches the first sentences of the original
text and assign a 1 if they are extracted from the
original text and a 0 if they are more abstracted.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Evaluation</title>
      <p>In this section, we report and analyze our results
of employing a capped linear regression model to
ensemble the significant subset of our predictors
to generate a representative summarization quality
metric. We start by fitting a capped linear
regression model to the full set of predictors, including
an intercept, and consider the p-values of each
predictor. We iteratively remove the most insignificant
predictor (largest p-value) and re-run the linear
regression. We stop once all predictors are
statistically significant at the 15% level.</p>
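      <p>The final regression model on the remaining 7
significant predictors is described in Table 1.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Predictor</th><th>coef</th><th>std err</th></tr>
          </thead>
          <tbody>
            <tr><td>constant</td><td>0.072</td><td>0.095</td></tr>
            <tr><td>tfidf content</td><td>0.535</td><td>0.107</td></tr>
            <tr><td>flair grammar</td><td>0.226</td><td>0.109</td></tr>
            <tr><td>sbert</td><td>0.169</td><td>0.106</td></tr>
            <tr><td>sentence copying</td><td>-0.168</td><td>0.064</td></tr>
            <tr><td>rouge-1</td><td>2.560</td><td>0.571</td></tr>
            <tr><td>rouge-2</td><td>-1.531</td><td>0.340</td></tr>
            <tr><td>rouge-L</td><td>-1.329</td><td>0.646</td></tr>
          </tbody>
        </table>
      </table-wrap>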
      <p>The columns show the estimated coefficients,
standard errors and p-values of each predictor.
Since all predictors have been normalized
(minmax normalization) prior to the regression, their
regression coefficients are directly comparable in
magnitude. It can be seen that the rouge-1 predictor
has the highest coefficient and thus, is most
important for predicting the summary evaluation score.
However, the other predictors also contribute
significantly to the prediction outcome, which becomes
evident when comparing the final ensemble error
of 33.72 (see Table 2) to the individual rouge-1
error of 35.99 (see Table 3).</p>
      <p>Further, the coefficients of the sentence copying,
rouge-2 and rouge-L predictors imply a negative
correlation to the annotated summary scores. This
is expected because all three predictors yield high
scores, when entire sentences, bigrams or common
subsequences of the original documents get copied
to or make up the generated summaries. Yet, our
annotations favor abstractive summaries which is
why a higher score of one of the above predictors
indicates a worse summary when taking
abstractiveness as a quality indicator into account.</p>
      <p>Table 2 shows the final error values obtained
by different predictor ensembles in the shared task
public ranking. Although more predictors increase
the likelihood of overfitting on our manual
annotations, which might be expected to lower our final error score,
one can observe the opposite. Removing
insignificant predictors actually yields the best
performing model and puts us among the top participating
teams.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Ensemble</th><th>Error</th><th>Predictors</th></tr>
          </thead>
          <tbody>
            <tr><td>7 predictors</td><td>33.72</td><td>constant, tfidf content, flair grammar, sentence copying, sbert, rouge-1, rouge-2, rouge-L</td></tr>
            <tr><td>10 predictors</td><td>33.90</td><td>+ nmf content, bleu, meteor</td></tr>
            <tr><td>13 predictors</td><td>33.82</td><td>+ flair ner content, compression, number matching</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4 Comparison with standard metrics</title>
      <p>In order to show the validity of our approach and
its improvement over previously established
methods, we take a look at the performance of BLEU,
METEOR and ROUGE as single predictors.</p>
      <p>We implement each metric using the standard
definition and further employ min-max
normalization as described above in order to receive a metric
that assigns a score between 0 (bad) and 1 (good)
so that both extremes appear in the dataset. This
approach is developed entirely without manual
annotations. The scores received on the challenge
task are depicted in the middle column of Table 3.</p>
      <p>Furthermore, we use our manual annotations to
adjust the predictors to the available dataset, fitting
a linear regression of a single predictor to the
annotated summary scores. These scores are depicted
in the right column of Table 3.</p>
      <p>As already indicated, we see that using these
metrics out-of-the-box results in significantly worse
performance than both the fitted variants and our
ensemble approach. While the fitted metrics perform
considerably better than their original
counterparts, we still see a distinct improvement when
employing an ensemble of different predictors.</p>
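      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Predictor</th><th>Error (original)</th><th>Error (fitted)</th></tr>
          </thead>
          <tbody>
            <tr><td>rouge-1</td><td /><td /></tr>
            <tr><td>rouge-2</td><td /><td /></tr>
            <tr><td>rouge-L</td><td /><td /></tr>
            <tr><td>bleu</td><td /><td /></tr>
            <tr><td>meteor</td><td /><td /></tr>
          </tbody>
        </table>
      </table-wrap>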
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Future Work</title>
      <p>We showed that a hybrid combination of
rule-based, statistical and deep-learning techniques
outperforms other alternatives for automatic
evaluation of automatically generated German text
summarization given the provided shared task dataset.</p>
      <p>Although the text corpus covers a wide range of
topics, the text style is quite homogeneous. Mostly,
it consists of generally grammatically perfect
descriptive texts. It would be interesting to test if
our approach also works for more informal noisy
texts. Furthermore, it would also be interesting to
evaluate different state-of-the-art summarization
approaches with our new metric.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors of this work were supported in part
by the Fraunhofer Research Center for Machine
Learning (RCML) within the Fraunhofer Cluster of
Excellence Cognitive Internet Technologies (CCIT)
and by the Competence Center for Machine
Learning Rhine Ruhr (ML2R) which is funded by the
Federal Ministry of Education and Research of
Germany (grant no. 01IS18038B). We gratefully
acknowledge this support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Akbik</surname>
          </string-name>
          , Duncan Blythe, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          ,
          Santa Fe, New Mexico, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Satanjeev</given-names>
            <surname>Banerjee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alon</given-names>
            <surname>Lavie</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Meteor: An automatic metric for mt evaluation with improved correlation with human judgments</article-title>
          .
          <source>In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Eduardo</given-names>
            <surname>Brito</surname>
          </string-name>
          , Max Lübbering, David Biesner, Lars Patrick Hillebrand, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bauckhage</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Towards supervised extractive text summarization via RNN-based sequence classification</article-title>
          . arXiv preprint arXiv:1911.06121.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Daniel D</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H Sebastian</given-names>
            <surname>Seung</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Algorithms for non-negative matrix factorization</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>556</fpage>
          -
          <lpage>562</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Text Summarization Branches Out</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          , Barcelona, Spain. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Pentti</given-names>
            <surname>Paatero</surname>
          </string-name>
          and
          <string-name>
            <given-names>Unto</given-names>
            <surname>Tapper</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values</article-title>
          .
          <source>Environmetrics</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <given-names>Wei-Jing</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In Proceedings of the 40th annual meeting on association for computational linguistics</source>
          , pages
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Erik F Sang and Fien De Meulder</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>arXiv preprint cs/0306050.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al.
          <year>2019</year>
          .
          <article-title>Transformers: State-of-the-art natural language processing</article-title>
          . arXiv preprint arXiv:1910.03771.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>