=Paper=
{{Paper
|id=Vol-2624/germeval-task3-paper3
|storemode=property
|title=Hybrid Ensemble Predictor as Quality Metric for German Text Summarization: Fraunhofer IAIS at GermEval 2020 Task 3
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task3-paper3.pdf
|volume=Vol-2624
|authors=David Biesner,Eduardo Brito,Lars Patrick Hillebrand,Rafet Sifa
|dblpUrl=https://dblp.org/rec/conf/swisstext/BiesnerBHS20
}}
==Hybrid Ensemble Predictor as Quality Metric for German Text Summarization: Fraunhofer IAIS at GermEval 2020 Task 3==
David Biesner∗†‡, Eduardo Brito∗†§, Lars Patrick Hillebrand∗†‡, Rafet Sifa†

† Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany
§ Fraunhofer Center for Machine Learning, Germany
‡ B-IT, University of Bonn, Endenicher Allee 19a, 53115 Bonn, Germany
∗ These co-first authors contributed equally to this work.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We propose an alternative quality metric to evaluate automatically generated texts based on an ensemble of different scores, combining simple rule-based metrics with more complex models of very different nature, including ROUGE, tf-idf, neural sentence embeddings, and a matrix factorization method. Our approach achieved one of the top scores on the second German Text Summarization Challenge.

1 Introduction

In our previous work on automatic text summarization (Brito et al., 2019), we concluded by criticizing the suitability of ROUGE scores (Lin, 2004) for overall evaluation purposes. These and other common quality metrics found in the automatic text summarization literature, like BLEU (Papineni et al., 2002) or METEOR (Banerjee and Lavie, 2005), are far from optimal since they focus only on lexical overlap as a proxy for assessing content selection. They not only penalize certain abstractions (e.g. when the original sentences are heavily reformulated or when synonyms are applied) but also ignore other aspects that are usually considered desirable in good summaries, including grammatical correctness and compactness.

The second German Text Summarization Challenge (https://swisstext-and-konvens-2020.org/2nd-german-text-summarization-challenge) aims to address this issue by releasing a text corpus with several summaries per text. Its participants were asked to rate these summaries with new ideas and solutions for the automatic quality assessment of German text summarizations. We propose to combine the advantages of neural approaches that excel at encoding semantic textual similarity (and are thus suitable to predict content) with statistical and rule-based metrics that can evaluate other important summarization aspects such as compactness and abstractiveness.

In our approach, we employ an ensemble of 7 statistically significant predictors (p-value < 15%) in a linear regression model (see Table 2). Comparing our predictions to the competition host's own non-public annotations, we achieved a score (i.e. loss) of 33.72, one of the lowest and therefore best scores among the participating teams.

In the following sections, we detail the different metrics that we considered and how we optimized their combination.

2 Experimental Setup

This section describes our experimental setup, namely the underlying dataset and the methodological approach.

2.1 Data

The shared task organizers released a corpus consisting of 216 texts with a corresponding reference summary and a generated summary, each of them rated with a value between 0 (bad) and 1 (excellent).

In order to evaluate the methods, we manually annotated all summaries in the dataset with a score from 0 to 1. We independently rated a part of the corpus each, such that different human biases can be compensated to a certain extent. A submission of these annotations to the competition received a high score, indicating a large similarity to the gold standard annotations set by the organizers. Additionally, we expanded the dataset by considering the given reference summaries as perfect generated summaries with an automatic score of 1.
This results in a dataset of 248 summary texts with their corresponding score, which is used to evaluate the unsupervised methods described below.

2.2 Methodology

We address this challenge as a metric learning problem, where we define a set of unsupervised predictors covering one or several features that address the required properties of a good summary (content relevance, compactness, abstractiveness and grammatical correctness). After calculating all predictor scores (unsupervised) for each document, we apply min-max normalization to ensure all scores lie in the closed [0, 1] interval. In a final step, we ensemble these predictors in a capped linear regression model (output between 0 and 1), which is trained via ordinary least squares on our manual summary annotations (see Section 2.1). We iteratively remove non-significant predictors (p-value ≥ 15%) and re-run the regression model until all predictors yield significant t-statistics, namely their coefficients lie within the two-sided 85% confidence interval. Due to the limited number of documents and the loss of interpretability, we refrain from including non-linearities (e.g. multiple layers, non-linear activation functions, interaction terms of different polynomial degrees, etc.) in the regression model. Also, by using a simple linear ensemble model, we reduce the likelihood of overfitting on our annotations, especially since no validation set for parameter tuning is available.

The following subsections focus on our predictors and describe their functionality. We start by presenting three content predictors, which all determine the most important words in the original text and compute the fraction of these words that occur in the generated summaries. We assume that the most important words in a document capture the essence of the text and thus function as a proxy for content relevance. We continue with neural language model driven predictors, which primarily focus on content relevance and grammatical correctness. We also include the standard quality metrics for automatic summary evaluation, ROUGE, BLEU, and METEOR, which all aim to measure content relevance as well. The remaining predictors are mainly rule-based and refer largely to compactness and abstractiveness.
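To make the combination step concrete, the following is a minimal sketch (not the code used for the submission) of how such a capped, iteratively pruned linear ensemble could be implemented with pandas and statsmodels; the function names, the clipping-based capping and the handling of the intercept are our own assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def min_max_normalize(scores: pd.DataFrame) -> pd.DataFrame:
    """Scale every predictor column into the closed [0, 1] interval."""
    return (scores - scores.min()) / (scores.max() - scores.min())

def fit_pruned_ols(scores: pd.DataFrame, annotations: pd.Series, alpha: float = 0.15):
    """Fit OLS with an intercept and iteratively drop the least significant
    predictor until every remaining predictor has a p-value below alpha."""
    X = sm.add_constant(min_max_normalize(scores))
    while True:
        model = sm.OLS(annotations, X).fit()
        pvalues = model.pvalues.drop("const")   # the intercept is always kept
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return model
        X = X.drop(columns=[worst])             # remove the least significant predictor and refit

def predict_capped(model, scores: pd.DataFrame) -> np.ndarray:
    """Cap the linear predictions to the [0, 1] score range.
    (For simplicity, new scores are normalized by their own range here.)"""
    X = sm.add_constant(min_max_normalize(scores))[model.params.index]
    return np.clip(model.predict(X), 0.0, 1.0)
```

In this sketch, `annotations` would hold the manual summary scores from Section 2.1 and each column of `scores` one of the predictors described below.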
2.2.1 Tf-Idf content predictor

A very popular text vectorization method is tf-idf (term frequency - inverse document frequency). It is a frequency-based statistic which intends to reflect how important a word is to a "document" in a corpus.

Given that our entire corpus contains N documents and the vocabulary of our corpus is of size K, we can collect the individual tf-idf scores in a matrix M ∈ R^(N×K). Each row vector in this matrix corresponds to a document embedding. We find the top 10 important words per document by sorting the tf-idf scores within each embedding in decreasing order.

We utilize the sklearn implementation (https://github.com/scikit-learn/scikit-learn) of the TfidfVectorizer and restrict our vocabulary to words with a document frequency below 0.9. Before vectorization, we apply lower-casing, punctuation and stop word removal, and stemming to the entire text corpus, which helps to better capture meaning and content in the text's vector representation.
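A minimal sketch of such a tf-idf content predictor is shown below; the concrete preprocessing helpers (NLTK's German stop word list and Snowball stemmer) and the tokenization regex are our own assumptions, not necessarily the exact pipeline used in our experiments:

```python
import re
from nltk.corpus import stopwords                      # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("german")
stop_words = set(stopwords.words("german"))

def preprocess(text: str) -> str:
    """Lower-case, strip punctuation, remove stop words and stem."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

def tfidf_content_scores(texts, summaries, top_k=10):
    """Fraction of the top-k tf-idf words of each text that occur in its summary."""
    vectorizer = TfidfVectorizer(max_df=0.9)            # drop words with document frequency above 0.9
    matrix = vectorizer.fit_transform(preprocess(t) for t in texts)
    vocabulary = vectorizer.get_feature_names_out()
    scores = []
    for row, summary in zip(matrix, summaries):
        weights = row.toarray().ravel()
        top_words = vocabulary[weights.argsort()[::-1][:top_k]]
        summary_tokens = set(preprocess(summary).split())
        scores.append(sum(word in summary_tokens for word in top_words) / top_k)
    return scores
```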
2.2.2 NMF content predictor

NMF (nonnegative matrix factorization) (Paatero and Tapper, 1994; Lee and Seung, 2001) is a common matrix factorization technique frequently used for topic modeling. In previous work, we found that NMF achieves good results in clustering document words into a predefined number of latent topics. Assuming that a good summary should cover all main topics of a text, we apply NMF to each document and determine the top 5 important words per latent topic dimension. In particular, we factorize the document's symmetric co-occurrence matrix S ∈ R^(N×N) into a nonnegative loading matrix W ∈ R^(N×M) and a nonnegative affinity matrix H ∈ R^(M×N),

S = WH + E,    (1)

where N is the vocabulary size of the document in question, M = 10 is the number of latent topics and E ∈ R^(N×N) is the error matrix, whose elements approach zero for a perfect decomposition. We apply the same document preprocessing as in Section 2.2.1 before calculating the co-occurrence matrix; we choose a window size of 5, and each context word j contributes 1/d to the total word pair count, given that it is d words apart from the base word i.
For both W and H^T, we assign each word (row vector) to the latent topic dimension with the highest value. Next, we sort the assigned words per topic in decreasing order, so that the most distinct topic words are ranked on top. Finally, we get the important words per document by removing all duplicates from the selected topic words of W and H.
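A rough sketch of this NMF-based selection of topic words, using scikit-learn's NMF on the weighted co-occurrence matrix described above; the initialization, iteration count and the guard for very short documents are our own assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

def cooccurrence_matrix(tokens, window=5):
    """Symmetric co-occurrence matrix; a context word at distance d adds 1/d to the pair count."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    S = np.zeros((len(vocabulary), len(vocabulary)))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                a, b = index[word], index[tokens[i + d]]
                S[a, b] += 1.0 / d
                S[b, a] += 1.0 / d
    return S, vocabulary

def nmf_topic_words(tokens, n_topics=10, top_k=5):
    """Important words of one document: top-k words per latent topic, taken from W and H^T."""
    S, vocabulary = cooccurrence_matrix(tokens)
    n_topics = min(n_topics, len(vocabulary))            # guard against very short documents
    model = NMF(n_components=n_topics, init="nndsvda", max_iter=500)
    W = model.fit_transform(S)                           # loading matrix, shape (N, M)
    H = model.components_                                # affinity matrix, shape (M, N)
    important = set()
    for factor in (W, H.T):                              # treat W and H^T the same way
        topic_of_word = factor.argmax(axis=1)            # assign each word to its strongest topic
        for topic in range(n_topics):
            members = np.where(topic_of_word == topic)[0]
            ranked = members[np.argsort(-factor[members, topic])]
            important.update(vocabulary[i] for i in ranked[:top_k])
    return important
```

Here `tokens` is assumed to be the preprocessed token list of a single document (same preprocessing as in Section 2.2.1); the predictor score is then again the fraction of these important words found in the generated summary.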
2.2.3 Flair NER content predictor

Flair (Akbik et al., 2018) is a contextual string embedding architecture. The backbone of the flair framework is a pretrained character-based language model (based on an LSTM (Long Short-Term Memory) RNN), which is bidirectionally trained on a huge independent text corpus for different languages, including German. Built on top of this language model, the framework provides a German named entity tagger, which is pretrained on the CoNLL-03 dataset (Sang and De Meulder, 2003). First, raw and unprocessed text is fed sequentially into the encoding part of the bidirectional language model. Second, we retrieve for each word i a contextual embedding by concatenating the forward model's hidden state after word i and the backward model's hidden state before word i. This word embedding is then passed into a vanilla BiLSTM-CRF (bidirectional LSTM with a conditional random field) sequence labeler.

We apply this sequence tagger to our raw input documents and consider all predicted named entities as the document's important words.
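A sketch of how such a named-entity based content predictor could look with the flair library; the model identifier "de-ner" and the scoring convention for texts without entities are our own assumptions:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("de-ner")   # pretrained German NER tagger

def flair_ner_content_score(original: str, summary: str) -> float:
    """Fraction of named entities found in the original text that reappear in the summary."""
    sentence = Sentence(original)
    tagger.predict(sentence)
    entities = {span.text.lower() for span in sentence.get_spans("ner")}
    if not entities:
        return 1.0                       # nothing to check
    summary_lower = summary.lower()
    return sum(entity in summary_lower for entity in entities) / len(entities)
```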
2.2.4 Flair grammar predictor

In order to evaluate grammatical correctness, we again leverage the aforementioned flair language model, which was trained to correctly predict the next character in a text. For a grammatically correct text, we would expect the model to guess the next character correctly most of the time. A text with grammatical errors, however, would not match the expectations of the model, thus creating a larger reconstruction error on the characters that do not fit grammatically. To assess grammatical correctness, we feed the summary text through the model and score the summary based on the accumulated reconstruction error.

2.2.5 Sentence-BERT predictor

We explore how sentence embeddings can be used to measure how semantically similar a summary is to its original text. In particular, we infer sentence embeddings with the pretrained bert-base-german-uncased BERT model from HuggingFace's transformers library (Wolf et al., 2019), in the fashion proposed by the Sentence-BERT architecture (Reimers and Gurevych, 2019). The output of the BERT model is max-pooled to obtain a fixed-size vector for each processed piece of text. This way, we obtain embeddings for both the original text and each of the summaries. The resulting predictor score is the cosine similarity of the summary vector with the original text vector.
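The following sketch shows the idea with the transformers library; the truncation to 512 tokens, the exact pooling details and the availability of the checkpoint name under current Hub identifiers are assumptions on our side:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-german-uncased"   # checkpoint name as given in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Max-pool the last hidden layer of BERT into one fixed-size vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, hidden_size)
    return hidden.max(dim=1).values.squeeze(0)

def sbert_score(original: str, summary: str) -> float:
    """Cosine similarity between the embeddings of the original text and the summary."""
    return torch.nn.functional.cosine_similarity(embed(original), embed(summary), dim=0).item()
```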
2.2.6 ROUGE predictor

The ROUGE score is a classic metric for assessing the quality of summaries. Even though it alone is not sufficient to evaluate summaries, it can give useful insight when applied in an ensemble setting. We calculate the rouge-1, rouge-2 and rouge-L scores between the summary and both the full original text and the reference summary. While rouge-1 and rouge-2 calculate the overlap of unigrams and bigrams (i.e. single words and adjacent word pairs) between reference text and summary, rouge-L evaluates the longest common subsequence between reference and summary.

2.2.7 BLEU predictor

BLEU is a metric that calculates an n-gram precision between one or multiple reference texts and a summary hypothesis, in which n-gram counts in the summary are compared to their maximum count in one of the references.

2.2.8 METEOR predictor

METEOR is a metric that calculates a harmonic mean between the recall and precision of an n-gram matching that considers word order between a reference text and a summary.
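All three metrics are available in off-the-shelf packages; the sketch below uses the rouge package from PyPI and NLTK, where the whitespace tokenization, the BLEU smoothing choice and the METEOR call signature of recent NLTK versions are our assumptions:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score    # requires nltk.download("wordnet")
from rouge import Rouge                                  # the "rouge" package on PyPI

def standard_metric_scores(reference: str, summary: str) -> dict:
    """ROUGE-1/2/L, BLEU and METEOR between a reference text and a summary."""
    rouge_scores = Rouge().get_scores(summary, reference)[0]
    ref_tokens, sum_tokens = reference.split(), summary.split()
    return {
        "rouge-1": rouge_scores["rouge-1"]["f"],
        "rouge-2": rouge_scores["rouge-2"]["f"],
        "rouge-l": rouge_scores["rouge-l"]["f"],
        "bleu": sentence_bleu([ref_tokens], sum_tokens,
                              smoothing_function=SmoothingFunction().method1),
        # recent NLTK versions expect pre-tokenized input for METEOR
        "meteor": meteor_score([ref_tokens], sum_tokens),
    }
```

For the ROUGE predictors the same call is made twice, once against the full original text and once against the reference summary, as described above.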
2.2.9 Compactness predictor

We calculate the compactness score as the compression rate with respect to the original text, where text length is measured by the number of characters.

2.2.10 Number matching predictor

A good summary should be factually correct. While there might be some ambiguity from different word choices between original text and summary, there usually is only one way to display exact numbers like dates. We thus expect every number in the summary to also appear in the original text.
To assess factual correctness regarding numbers, we count how many of the numbers in the summary are also present in the text.
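A possible implementation of this check; the regular expression for numbers and the convention for summaries that contain no numbers at all are our own choices:

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)?")

def number_matching_score(original: str, summary: str) -> float:
    """Fraction of the numbers in the summary that also appear in the original text."""
    summary_numbers = NUMBER_PATTERN.findall(summary)
    if not summary_numbers:
        return 1.0                                   # nothing to verify
    original_numbers = set(NUMBER_PATTERN.findall(original))
    return sum(n in original_numbers for n in summary_numbers) / len(summary_numbers)
```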
2.2.11 Sentence copying predictor

At times, one can generate a usable summary by simply extracting the first sentences of the original text, since they often provide an introduction and therefore a mini-summary of the remaining text. However, the goal of our evaluation is to find abstractive and novel summaries. We therefore perform a binary check on whether the summary exactly matches the first sentences of the original text, assigning a 1 if the summary is extracted from the original text and a 0 if it is more abstractive.

3 Evaluation

In this section, we report and analyze the results of employing a capped linear regression model to ensemble the significant subset of our predictors into a representative summarization quality metric. We start by fitting a capped linear regression model to the full set of predictors, including an intercept, and consider the p-values of each predictor. We iteratively remove the most insignificant predictor (largest p-value) and re-run the linear regression. We stop once all predictors are statistically significant at the 15% level.

The final regression model on the remaining 7 significant predictors is described in Table 1.

Predictor           coef     std err   P>|t|
constant            0.072    0.095     0.447
tfidf content       0.535    0.107     0.000
flair grammar       0.226    0.109     0.038
sbert               0.169    0.106     0.110
sentence copying   -0.168    0.064     0.009
rouge-1             2.560    0.571     0.000
rouge-2            -1.531    0.340     0.000
rouge-L            -1.329    0.646     0.041

Table 1: Regression coefficients, standard errors and p-values for the final predictor set.

The columns show the estimated coefficients, standard errors and p-values of each predictor. Since all predictors have been normalized (min-max normalization) prior to the regression, their regression coefficients are directly comparable in magnitude. The rouge-1 predictor has the highest coefficient and is thus most important for predicting the summary evaluation score. However, the other predictors also contribute significantly to the prediction outcome, which becomes evident when comparing the final ensemble error of 33.72 (see Table 2) to the individual rouge-1 error of 35.99 (see Table 3).

Further, the coefficients of the sentence copying, rouge-2 and rouge-L predictors imply a negative correlation with the annotated summary scores. This is expected because all three predictors yield high scores when entire sentences, bigrams or common subsequences of the original documents are copied to or make up the generated summaries. Yet, our annotations favor abstractive summaries, which is why a higher score on one of these predictors indicates a worse summary when abstractiveness is taken into account as a quality indicator.

Table 2 shows the final error values obtained by different predictor ensembles in the shared task public ranking. Although one might expect additional predictors to lower our final error score, they also increase the likelihood of overfitting on our manual annotations, and we observe the opposite: removing insignificant predictors actually yields the best-performing model and puts us among the top participating teams.

Ensemble        Error   Predictors
7 predictors    33.72   constant, tfidf content, flair grammar, sentence copying, sbert, rouge-1, rouge-2, rouge-L
10 predictors   33.90   + nmf content, bleu, meteor
13 predictors   33.82   + flair ner content, compression, number matching

Table 2: Error values obtained in the shared task public ranking by different predictor ensembles. A lower value means better performance.

4 Comparison with standard metrics

In order to show the validity of our approach and its improvement over previously established methods, we take a look at the performance of BLEU, METEOR and ROUGE as single predictors.

We implement each metric using its standard definition and further employ min-max normalization as described above in order to obtain a metric that assigns a score between 0 (bad) and 1 (good), so that both extremes appear in the dataset. This approach is developed entirely without manual annotations. The scores received on the challenge task are depicted in the middle column of Table 3.

Furthermore, we use our manual annotations to adjust the predictors to the available dataset, fitting a linear regression of a single predictor to the annotated summary scores. These scores are depicted in the right column of Table 3.
As already signified, we see that using these metrics out of the box results in significantly worse performance than both the fitted variants and our ensemble approach. While the fitted metrics perform considerably better than their original counterparts, we still see a distinct improvement when employing an ensemble of different predictors.

Predictor   Error (original)   Error (fitted)
rouge-1     44.26              35.99
rouge-2     52.50              36.08
rouge-L     44.27              36.12
bleu        64.16              36.11
meteor      53.05              36.06

Table 3: Error values obtained by some of the common evaluation metrics for automatic text summarization after uploading their scores to the shared task public ranking. A lower value means better performance. The middle column shows the errors for the min-max normalized predictor scores. The right column shows the final errors for the normalized predictor scores fitted via linear regression to our manual summary annotations.

5 Conclusion and Future Work

We showed that a hybrid combination of rule-based, statistical and deep-learning techniques outperforms other alternatives for the automatic evaluation of automatically generated German text summaries on the provided shared task dataset. Although the text corpus covers a wide range of topics, the text style is quite homogeneous: it mostly consists of generally grammatically perfect descriptive texts. It would be interesting to test whether our approach also works for more informal, noisy texts. Furthermore, it would also be interesting to evaluate different state-of-the-art summarization approaches with our new metric.

Acknowledgments

The authors of this work were supported in part by the Fraunhofer Research Center for Machine Learning (RCML) within the Fraunhofer Cluster of Excellence Cognitive Internet Technologies (CCIT) and by the Competence Center for Machine Learning Rhine Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038B). We gratefully acknowledge this support.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638-1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72.

Eduardo Brito, Max Lübbering, David Biesner, Lars Patrick Hillebrand, and Christian Bauckhage. 2019. Towards supervised extractive text summarization via RNN-based sequence classification. arXiv preprint arXiv:1911.06121.

Daniel D. Lee and H. Sebastian Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.

Pentti Paatero and Unto Tapper. 1994. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111-126.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.