2nd German Text Summarization Challenge

Dominik Frefel, Manfred Vogel, Fabian Märki
University of Applied Sciences Northwestern Switzerland
Institute of Data Science, Bahnhofstrasse 6, 5210 Windisch
dominik.frefel@fhnw.ch  manfred.vogel@fhnw.ch  fabian.maerki@fhnw.ch

Overview

Automatic text summarization has made tremendous progress in recent years. However, the rating of a summary is still an open research topic. Especially when it comes to measuring abstractiveness, existing evaluation metrics like ROUGE, BLEU or METEOR show severe shortcomings.

In the 2nd German Text Summarization Challenge we aimed to explore new ideas and solutions regarding an automatic quality assessment of German text summarizations. For the challenge, we provided a text corpus together with several summaries per text. The goal was to assign a quality measure in the range from 0 (bad) to 1 (excellent) to each summary. We asked the participants to consider aspects such as correctness in content and grammar as well as facets like compactness and abstractiveness. The participants were able to submit (and resubmit) their solution to our evaluation board. Each submission was evaluated automatically, and the achieved rank was published on the leaderboard.

Data

The provided dataset consists of 24 distinct source texts from our German summarization corpus (Frefel, 2020). For each source text it contains one reference summary and 9 summaries proposed for evaluation. The summaries were generated by various summarization algorithms and by humans. Each summary was evaluated by the task organizers and given a score between 0 and 1. All texts are provided in lower case, with punctuation and quotations intact. The source texts are on average 786 tokens long. The reference summaries contain on average 46 tokens and the generated summaries 38 tokens. The average compression ratio is 6%.

Evaluation

The participants' submissions are ranked by the mean squared error of their score predictions. We use our own German ROUGE-1 implementation as a baseline (Frefel, 2020). It scores an error of 32.098. Refer to Table 1 for the results of all participants.

Rank  Participant       Error
1     David Biesner     29.037
2     UPB               31.993
3     ROUGE-1 Baseline  32.098
4     Inovex            34.630

Table 1: Challenge results

References

Dominik Frefel. 2020. Summarization corpora of Wikipedia articles. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 6653–6657, Marseille, France. European Language Resources Association.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
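
To make the ranking procedure from the Evaluation section concrete, the following Python sketch computes the mean squared error between a submission's predicted scores and the organizers' scores. It is an illustrative assumption, not the challenge's actual evaluation code; the file names, CSV layout, and any scaling of the reported error are hypothetical.

import csv

def load_scores(path):
    # Read summary ids and scores from a hypothetical two-column CSV: id,score
    with open(path, newline="", encoding="utf-8") as f:
        return {row["id"]: float(row["score"]) for row in csv.DictReader(f)}

def mean_squared_error(reference, prediction):
    # Average squared difference over all summaries present in the reference
    errors = [(reference[k] - prediction[k]) ** 2 for k in reference]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    # Hypothetical file names; each file holds one score in [0, 1] per summary
    gold = load_scores("organizer_scores.csv")
    pred = load_scores("submission.csv")
    print(f"MSE: {mean_squared_error(gold, pred):.5f}")

Under this sketch, submissions would simply be sorted by the resulting error, lowest first.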