         Hybrid Ensemble Predictor as Quality Metric for German Text
          Summarization: Fraunhofer IAIS at GermEval 2020 Task 3

          David Biesner∗†‡, Eduardo Brito∗†§, Lars Patrick Hillebrand∗†‡, Rafet Sifa†

             † Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany
             § Fraunhofer Center for Machine Learning, Germany
             ‡ B-IT, University of Bonn, Endenicher Allee 19a, 53115 Bonn, Germany
             ∗ These co-first authors contributed equally to this work.

   Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
   License Attribution 4.0 International (CC BY 4.0).



                        Abstract

    We propose an alternative quality metric to evaluate automatically generated texts based
    on an ensemble of different scores, combining simple rule-based metrics with more complex
    models of very different nature, including ROUGE, tf-idf, neural sentence embeddings, and
    a matrix factorization method. Our approach achieved one of the top scores on the second
    German Text Summarization Challenge.

1   Introduction

In our previous work on automatic text summarization (Brito et al., 2019), we concluded by
criticizing the suitability of ROUGE scores (Lin, 2004) for overall evaluation purposes. These and
other common quality metrics found in the automatic text summarization literature, like BLEU
(Papineni et al., 2002) or METEOR (Banerjee and Lavie, 2005), are far from optimal since they
only focus on lexical overlap as a proxy for assessing content selection. Not only do they penalize
certain abstractions (e.g. when the original sentences are heavily reformulated or when synonyms
are used), but they also ignore other aspects that are usually considered desirable in good
summaries, including grammatical correctness and compactness.
   The second German Text Summarization Challenge aims to address this issue by releasing a
text corpus with several summaries per text
(https://swisstext-and-konvens-2020.org/2nd-german-text-summarization-challenge).
Its participants were asked to rate these summaries, contributing new ideas and solutions for the
automatic quality assessment of German text summaries. We propose to combine the advantages
of neural approaches that excel at encoding semantic textual similarity (and are thus suitable to
predict content) with statistical and rule-based metrics that can evaluate other important
summarization aspects such as compactness and abstractiveness.
   In our approach, we employ an ensemble of 7 statistically significant predictors (p-value < 15%)
in a linear regression model (see Table 2). Comparing our predictions to the competition host's
own non-public annotations, we achieved a score (i.e. loss) of 33.72, one of the lowest and therefore
best scores among the participating teams.
   In the following sections, we detail the different metrics that we considered and how we
optimized their combination.

2   Experimental Setup

This section describes our experimental setup, namely the underlying dataset and the
methodological approach.

2.1   Data

The shared task organizers released a corpus consisting of 216 texts with a corresponding reference
summary and a generated summary, each of them rated with a value between 0 (bad) and 1
(excellent).
   In order to evaluate our methods, we manually annotated all summaries in the dataset with a
score from 0 to 1. Each of us independently rated a part of the corpus, such that different human
biases can be compensated to a certain extent. A submission of these annotations to the
competition received a high score, indicating a large similarity to the gold standard annotations
set by the organizers. Additionally, we expanded the dataset by considering the given reference
summaries as perfect generated summaries with an automatic score of 1.
   This results in a dataset of 248 summary texts with their corresponding scores, which is used
to evaluate the unsupervised methods described below.

2.2   Methodology

We address this challenge as a metric learning problem, where we define a set of unsupervised
predictors, each covering one or several of the required properties of a good summary (content
relevance, compactness, abstractiveness and grammatical correctness). After calculating all
predictor scores (unsupervised) for each document, we apply min-max normalization to ensure
that all scores lie in the closed interval [0, 1]. In a final step, we ensemble these predictors in a
capped linear regression model (output between 0 and 1), which is trained via ordinary least
squares on our manual summary annotations (see Section 2.1). We iteratively remove
non-significant predictors (p-value ≥ 15%) and re-run the regression model until all predictors
yield significant t-statistics, i.e. their coefficients are significantly different from zero in a two-sided
test at the 15% level. Due to the limited number of documents and the loss of interpretability, we
refrain from including non-linearities (e.g. multiple layers, non-linear activation functions,
interaction terms of different polynomial degrees, etc.) in the regression model. Also, by using a
simple linear ensemble model, we reduce the likelihood of overfitting on our annotations, especially
since no validation set for parameter tuning is available.
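As an illustration, the following is a minimal sketch of this ensembling step using statsmodels. It
assumes the min-max normalized predictor scores are collected in a pandas DataFrame X and the
manual annotations in a Series y; all names are illustrative and not taken from our actual
implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm


def fit_capped_ensemble(X: pd.DataFrame, y: pd.Series, alpha: float = 0.15):
    """Backward elimination of insignificant predictors, then an OLS ensemble."""
    features = list(X.columns)
    while True:
        design = sm.add_constant(X[features])      # the intercept is always included
        model = sm.OLS(y, design).fit()
        pvalues = model.pvalues.drop("const")      # never eliminate the intercept itself
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha or len(features) == 1:
            return model, features
        features.remove(worst)                     # drop the least significant predictor


def predict_capped(model, X: pd.DataFrame, features) -> np.ndarray:
    """Linear prediction clipped to the [0, 1] score range ("capped" regression)."""
    raw = model.predict(sm.add_constant(X[features]))
    return np.clip(raw, 0.0, 1.0)
```

Keeping the intercept throughout mirrors the final model in Table 1, where the constant is retained
despite its comparatively large p-value.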
   The following subsections focus on our predictors and describe their functionality. We start by
presenting three content predictors, which all determine the most important words in the original
text and compute the fraction of these words that occur in the generated summary. We assume
that the most important words in a document capture the essence of the text and thus serve as a
proxy for content relevance. We continue with predictors driven by neural language models, which
primarily focus on content relevance and grammatical correctness. We also include the standard
quality metrics for automatic summary evaluation, ROUGE, BLEU, and METEOR, which likewise
aim to measure content relevance. The remaining predictors are mainly rule-based and refer largely
to compactness and abstractiveness.

2.2.1   Tf-Idf content predictor

A very popular text vectorization method is tf-idf (term frequency–inverse document frequency).
It is a frequency-based statistic, which intends to reflect how important a word is to a "document"
in a corpus.
   Given that our entire corpus contains N documents and the vocabulary of our corpus is of size
K, we can collect the individual tf-idf scores in a matrix $M \in \mathbb{R}^{N \times K}$. Each
row vector in this matrix corresponds to a document embedding. We find the top 10 important
words per document by sorting the tf-idf scores within each embedding in decreasing order.
   We utilize the TfidfVectorizer implementation from sklearn
(https://github.com/scikit-learn/scikit-learn) and restrict our vocabulary to words with a
document frequency below 0.9. Before vectorization, we apply lower-casing, punctuation and stop
word removal, and stemming to the entire text corpus, which helps to better capture meaning and
content in the text's vector representation.
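A minimal sketch of this predictor with scikit-learn (the corpus variable, the summary token set
and the helper name are illustrative; the actual preprocessing steps are those described above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# preprocessed_texts: the lower-cased, stemmed original texts with punctuation
# and stop words removed (illustrative variable name).
vectorizer = TfidfVectorizer(max_df=0.9)
tfidf_matrix = vectorizer.fit_transform(preprocessed_texts)       # shape (N, K)
vocabulary = np.array(vectorizer.get_feature_names_out())


def tfidf_content_score(doc_index: int, summary_tokens: set, top_k: int = 10) -> float:
    """Fraction of the document's top-k tf-idf words that occur in the summary."""
    row = tfidf_matrix[doc_index].toarray().ravel()
    top_words = vocabulary[np.argsort(row)[::-1][:top_k]]
    return sum(word in summary_tokens for word in top_words) / top_k
```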
2.2.2   NMF content predictor

NMF (Nonnegative Matrix Factorization) (Paatero and Tapper, 1994; Lee and Seung, 2001) is a
common matrix factorization technique frequently used for topic modeling. In previous work, we
found that NMF achieves good results in clustering document words into a predefined number of
latent topics. Assuming that a good summary should cover all main topics of a text, we apply
NMF to each document and determine the top 5 important words per latent topic dimension. In
particular, we factorize the document's symmetric co-occurrence matrix
$S \in \mathbb{R}^{N \times N}$ into a nonnegative loading matrix
$W \in \mathbb{R}^{N \times M}$ and a nonnegative affinity matrix
$H \in \mathbb{R}^{M \times N}$,

    $$S = WH + E, \tag{1}$$

where N is the vocabulary size of the document in question, M = 10 is the number of latent topics
and $E \in \mathbb{R}^{N \times N}$ is the error matrix, whose elements approach zero for a
perfect decomposition. (We apply the same document preprocessing as in Section 2.2.1 before
calculating the co-occurrence matrix. Also, we choose a window size of 5, and each context word j
contributes 1/d to the total word pair count, given that it is d words apart from the base word i.)
   For both W and $H^T$, we assign each word (row vector) to the latent topic dimension with
the highest value. Next, we sort the assigned words per topic in decreasing order, so that the most
distinctive topic words are ranked on top. Finally, we obtain the important words per document by
removing all duplicates from the selected topic words of W and H.
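The following sketch illustrates one way to implement this word extraction with scikit-learn's NMF.
The co-occurrence construction follows the window and 1/d weighting described above, while the
function and variable names are ours rather than taken from the original implementation.

```python
from collections import defaultdict

import numpy as np
from sklearn.decomposition import NMF


def cooccurrence_matrix(tokens, vocab, window: int = 5) -> np.ndarray:
    """Symmetric co-occurrence counts, weighting a word pair at distance d by 1/d."""
    index = {w: i for i, w in enumerate(vocab)}
    S = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                a, b = index[word], index[tokens[i + d]]
                S[a, b] += 1.0 / d
                S[b, a] += 1.0 / d
    return S


def nmf_topic_words(tokens, vocab, n_topics: int = 10, top_k: int = 5):
    """Important words of a document: top-k words of each latent topic from W and H."""
    # tokens: preprocessed document tokens; vocab: their unique types, e.g. sorted(set(tokens)).
    S = cooccurrence_matrix(tokens, vocab)
    nmf = NMF(n_components=min(n_topics, len(vocab)), init="nndsvda", max_iter=500)
    W = nmf.fit_transform(S)               # loading matrix, shape (N, M)
    H = nmf.components_                    # affinity matrix, shape (M, N)
    important = set()
    for factor in (W, H.T):                # treat the rows of W and H^T alike
        topics = defaultdict(list)
        for word_idx, row in enumerate(factor):
            topics[int(np.argmax(row))].append((row.max(), vocab[word_idx]))
        for words in topics.values():
            important.update(w for _, w in sorted(words, reverse=True)[:top_k])
    return important
```

The resulting word set is then compared against the summary in the same way as for the tf-idf
predictor, i.e. as the fraction of important words that occur in the summary.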
2.2.3   Flair NER content predictor

Flair (Akbik et al., 2018) is a specific contextual string embedding architecture. The backbone of
the flair framework is a pretrained character-based language model (based on an LSTM (Long
Short-Term Memory) RNN), which is trained bidirectionally on a huge independent text corpus for
different languages, including German.
   Built on top of this language model, the framework provides a German named entity tagger,
which is pretrained on the CoNLL-03 dataset (Sang and De Meulder, 2003). First, raw and
unprocessed text is fed sequentially into the encoding part of the bidirectional language model.
Second, we retrieve for each word i a contextual embedding by concatenating the forward model's
hidden state after word i and the backward model's hidden state before word i. This word
embedding is then passed into a vanilla BiLSTM-CRF (bidirectional LSTM with a Conditional
Random Field) sequence labeler.
   We apply this sequence tagger to our raw input documents and consider all predicted named
entities as the document's important words.
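A hedged sketch of this predictor with the flair library (the scoring function is our simplification;
for long documents one would typically split the text into sentences before tagging):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# German NER model pretrained on CoNLL-03.
tagger = SequenceTagger.load("de-ner")


def ner_content_score(text: str, summary: str) -> float:
    """Fraction of named entities found in the original text that reappear in the summary."""
    sentence = Sentence(text)
    tagger.predict(sentence)
    entities = {span.text.lower() for span in sentence.get_spans("ner")}
    if not entities:
        return 1.0                      # nothing to match against
    summary_lower = summary.lower()
    return sum(entity in summary_lower for entity in entities) / len(entities)
```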
2.2.4   Flair grammar predictor

In order to evaluate grammatical correctness, we again leverage the aforementioned flair language
model, which was trained to correctly predict the next character in a text. For a grammatically
correct text, we would expect the model to mostly guess the next character correctly. A text with
grammatical errors, however, would not match the expectations of the model, thus creating a
larger prediction error on the characters that do not fit grammatically. To assess grammatical
correctness, we feed the summary text through the model and score the summary based on the
accumulated prediction error.
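As a rough sketch of how such a score could be computed: we assume here that the character
language model behind flair's German embeddings exposes a perplexity computation (present as
calculate_perplexity in recent flair versions; otherwise the per-character losses can be accumulated
manually), and the mapping from perplexity to a score is an illustrative choice, since the min-max
normalization described above rescales it anyway.

```python
from flair.embeddings import FlairEmbeddings

# Forward character-level language model that backs the German flair embeddings.
char_lm = FlairEmbeddings("german-forward").lm


def grammar_score(summary: str) -> float:
    """Higher score for summaries whose characters the language model predicts well."""
    # NOTE: calculate_perplexity is assumed to be available on flair's LanguageModel class.
    perplexity = char_lm.calculate_perplexity(summary)
    return 1.0 / (1.0 + perplexity)     # illustrative monotone transform
```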
2.2.5   Sentence-BERT predictor

We explore how sentence embeddings can be used to measure "how similar" (semantically) a
summary is to its original text. In particular, we infer sentence embeddings with the pretrained
bert-base-german-uncased BERT model from HuggingFace's transformers library (Wolf et al.,
2019), in the fashion proposed by the Sentence-BERT architecture (Reimers and Gurevych, 2019).
The output of the BERT model is max-pooled to obtain a fixed-size vector for each processed piece
of text. This way, we can obtain embeddings for both the original text and each of the summaries.
The resulting predictor score is then the cosine similarity of the summary vector with the original
text vector.
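A sketch of this predictor with the transformers library; the model identifier below is illustrative,
long inputs are simply truncated to the model's maximum length, and no attention-mask handling
is shown since single, unpadded texts are processed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative identifier for a German uncased BERT model on the HuggingFace hub.
MODEL_NAME = "dbmdz/bert-base-german-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Max-pool the token embeddings of the last hidden layer into one fixed-size vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    hidden = model(**inputs).last_hidden_state.squeeze(0)    # (seq_len, hidden_dim)
    return hidden.max(dim=0).values


def sbert_score(text: str, summary: str) -> float:
    """Cosine similarity between the summary embedding and the original text embedding."""
    return torch.nn.functional.cosine_similarity(embed(text), embed(summary), dim=0).item()
```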
2.2.6   ROUGE predictor

The ROUGE score is a classic metric for assessing the quality of summaries. Even though it is not
sufficient on its own to evaluate summaries, it can give useful insight when applied in an ensemble
setting. We calculate the rouge-1, rouge-2 and rouge-L scores between the summary and both the
full original text and the reference summary. While rouge-1 and rouge-2 calculate the overlap of
unigrams and bigrams (i.e. single words and adjacent word pairs) between reference text and
summary, rouge-L evaluates the longest common subsequence between reference and summary.
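A sketch using one of the available Python ROUGE implementations (here the rouge package on
PyPI; any implementation producing rouge-1, rouge-2 and rouge-L scores works equally well):

```python
from rouge import Rouge   # `rouge` PyPI package; other ROUGE implementations work as well

rouge = Rouge()


def rouge_scores(summary: str, reference: str) -> dict:
    """rouge-1, rouge-2 and rouge-L F1 scores of the summary against a reference text."""
    scores = rouge.get_scores(summary, reference)[0]
    return {name: values["f"] for name, values in scores.items()}
```

We compute these scores twice per generated summary: once against the full original text and once
against the reference summary.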
2.2.7   BLEU predictor

BLEU is a metric that calculates an n-gram precision between one or multiple reference texts and a
summary hypothesis, in which the n-gram counts in the summary are compared to their maximum
count in one of the references.

2.2.8   METEOR predictor

METEOR is a metric that calculates a harmonic mean between the recall and precision of an
n-gram matching that also considers the word order between a reference text and a summary.
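Both metrics are available in NLTK. A sketch (tokenization is simplified to whitespace splitting,
the METEOR implementation additionally requires NLTK's WordNet data, and the exact input
conventions differ slightly between NLTK versions):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score   # requires NLTK's wordnet data


def bleu_predictor(summary: str, references: list[str]) -> float:
    """Modified n-gram precision of the summary against one or more reference texts."""
    hypothesis = summary.split()
    refs = [ref.split() for ref in references]
    return sentence_bleu(refs, hypothesis, smoothing_function=SmoothingFunction().method1)


def meteor_predictor(summary: str, references: list[str]) -> float:
    """Harmonic mean of unigram precision and recall with a word-order penalty."""
    return meteor_score([ref.split() for ref in references], summary.split())
```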
2.2.9   Compactness predictor

We calculate the compactness score as the compression rate with respect to the original text, where
the text length is measured by the number of characters.

2.2.10   Number matching predictor

A good summary should be factually correct. While there might be some ambiguity from different
word choices between the original text and the summary, there is usually only one way to express
exact numbers such as dates. We thus expect every number in the summary to also appear in the
original text. To assess factual correctness regarding numbers, we count how many of the numbers
in the summary are also present in the text.

2.2.11   Sentence copying predictor

At times, one can generate a usable summary by simply extracting the first sentences of the
original text, since they often provide an introduction and therefore a mini-summary of the
remaining text. However, the goal of our evaluation is finding abstractive and novel summaries.
We therefore perform a binary check on whether the summary exactly matches the first sentences
of the original text, assigning a 1 if the summary is extracted from the original text and a 0 if it is
more abstractive.
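These three rule-based predictors are straightforward to implement; the following is a sketch. The
direction of the compression rate and the prefix-based copying check are our simplifications, since
the subsequent min-max normalization and the regression coefficients absorb scale and sign.

```python
import re


def compactness_score(text: str, summary: str) -> float:
    """Compression rate of the summary with respect to the original text, in characters."""
    return len(summary) / max(len(text), 1)


def number_matching_score(text: str, summary: str) -> float:
    """Fraction of the numbers mentioned in the summary that also appear in the original text."""
    summary_numbers = re.findall(r"\d+(?:[.,]\d+)*", summary)
    text_numbers = set(re.findall(r"\d+(?:[.,]\d+)*", text))
    if not summary_numbers:
        return 1.0
    return sum(n in text_numbers for n in summary_numbers) / len(summary_numbers)


def sentence_copying_flag(text: str, summary: str) -> int:
    """1 if the summary is simply the beginning of the original text, 0 otherwise."""
    return int(text.strip().startswith(summary.strip()))
```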
3   Evaluation

In this section, we report and analyze the results of employing a capped linear regression model to
ensemble the significant subset of our predictors into a representative summarization quality
metric. We start by fitting a capped linear regression model to the full set of predictors, including
an intercept, and consider the p-value of each predictor. We iteratively remove the most
insignificant predictor (largest p-value) and re-run the linear regression. We stop once all
predictors are statistically significant at the 15% level.
   The final regression model on the remaining 7 significant predictors is described in Table 1.

                          coef    std err   P>|t|
    constant              0.072    0.095    0.447
    tfidf content         0.535    0.107    0.000
    flair grammar         0.226    0.109    0.038
    sbert                 0.169    0.106    0.110
    sentence copying     -0.168    0.064    0.009
    rouge-1               2.560    0.571    0.000
    rouge-2              -1.531    0.340    0.000
    rouge-L              -1.329    0.646    0.041

Table 1: Regression coefficients, standard errors and p-values for the final predictor set.

   The columns show the estimated coefficient, standard error and p-value of each predictor. Since
all predictors have been normalized (min-max normalization) prior to the regression, their
regression coefficients are directly comparable in magnitude. It can be seen that the rouge-1
predictor has the highest coefficient and is thus most important for predicting the summary
evaluation score. However, the other predictors also contribute significantly to the prediction
outcome, which becomes evident when comparing the final ensemble error of 33.72 (see Table 2) to
the individual rouge-1 error of 35.99 (see Table 3).
   Further, the coefficients of the sentence copying, rouge-2 and rouge-L predictors imply a
negative correlation with the annotated summary scores. This is expected because all three
predictors yield high scores when entire sentences, bigrams or common subsequences of the original
documents are copied to, or make up, the generated summaries. Our annotations, however, favor
abstractive summaries, which is why a higher score on one of these predictors indicates a worse
summary when taking abstractiveness into account as a quality indicator.
   Table 2 shows the final error values obtained by different predictor ensembles in the shared task
public ranking. Although adding more predictors could be expected to lower our final error score,
it also increases the likelihood of overfitting on our manual annotations, and we indeed observe the
opposite: removing insignificant predictors actually yields the best performing model and puts us
among the top participating teams.

    Ensemble        Error   Predictors
    7 predictors    33.72   constant, tfidf content, flair grammar, sentence copying,
                            sbert, rouge-1, rouge-2, rouge-L
    10 predictors   33.90   + nmf content, bleu, meteor
    13 predictors   33.82   + flair ner content, compression, number matching

Table 2: Error values obtained in the shared task public ranking by different predictor ensembles.
A lower value means better performance.

4   Comparison with standard metrics

In order to show the validity of our approach and its improvement over previously established
methods, we take a look at the performance of BLEU, METEOR and ROUGE as single predictors.
   We implement each metric using the standard definition and further employ min-max
normalization as described above, in order to obtain a metric that assigns a score between 0 (bad)
and 1 (good) such that both extremes appear in the dataset. This approach is developed entirely
without manual annotations. The scores received on the challenge task are depicted in the middle
column of Table 3.
   Furthermore, we use our manual annotations to adjust the predictors to the available dataset,
fitting a linear regression of a single predictor to the annotated summary scores. These scores are
depicted in the right column of Table 3.
   As already indicated, we see that using these metrics out of the box results in significantly
worse performance than both the fitted variants and our ensemble approach. While the fitted
metrics perform considerably better than their original counterparts, we still see a distinct
improvement when employing an ensemble of different predictors.

    Predictor   Error (original)   Error (fitted)
    rouge-1          44.26             35.99
    rouge-2          52.50             36.08
    rouge-L          44.27             36.12
    bleu             64.16             36.11
    meteor           53.05             36.06

Table 3: Error values obtained by some of the common evaluation metrics for automatic text
summarization after uploading their scores to the shared task public ranking. A lower value means
better performance. The middle column shows the errors for the min-max normalized predictor
scores. The right column shows the final errors for the normalized predictor scores fitted via linear
regression to our manual summary annotations.

5   Conclusion and Future Work

We showed that a hybrid combination of rule-based, statistical and deep-learning techniques
outperforms other alternatives for the automatic evaluation of automatically generated German
text summaries on the provided shared task dataset.
   Although the text corpus covers a wide range of topics, the text style is quite homogeneous: it
mostly consists of grammatically correct descriptive texts. It would be interesting to test whether
our approach also works for more informal, noisy texts. Furthermore, it would also be interesting
to evaluate different state-of-the-art summarization approaches with our new metric.

Acknowledgments

The authors of this work were supported in part by the Fraunhofer Research Center for Machine
Learning (RCML) within the Fraunhofer Cluster of Excellence Cognitive Internet Technologies
(CCIT) and by the Competence Center for Machine Learning Rhine-Ruhr (ML2R), which is funded
by the Federal Ministry of Education and Research of Germany (grant no. 01|S18038B). We
gratefully acknowledge this support.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence
  labeling. In Proceedings of the 27th International Conference on Computational Linguistics,
  pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with
  improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic
  and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Eduardo Brito, Max Lübbering, David Biesner, Lars Patrick Hillebrand, and Christian Bauckhage.
  2019. Towards supervised extractive text summarization via RNN-based sequence classification.
  arXiv preprint arXiv:1911.06121.

Daniel D. Lee and H. Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In
  Advances in Neural Information Processing Systems, pages 556–562.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text
  Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational
  Linguistics.

Pentti Paatero and Unto Tapper. 1994. Positive matrix factorization: A non-negative factor model
  with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for
  automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the
  Association for Computational Linguistics, pages 311–318. Association for Computational
  Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese
  BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural
  Language Processing. Association for Computational Linguistics.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task:
  Language-independent named entity recognition. arXiv preprint cs/0306050.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
  Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers:
  State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.