Re-Evaluating GermEval17 Using German Pre-Trained Language Models

Matthias Aßenmacher  Alessandra Corvonato  Christian Heumann
Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris}@stat.uni-muenchen.de, alessandracorvonato@yahoo.de

Abstract

The lack of a commonly used benchmark data set (collection) such as (Super)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of the German and multilingual BERT models currently available via the huggingface transformers library on four subtasks of Aspect-based Sentiment Analysis (ABSA) from the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for a similar ABSA task (Pontiki et al., 2014) and similar models (pre-BERT vs. BERT-based) for the English language, and we check whether the improvements reported there correspond to those we observe for German.

1 Introduction

(Aspect-based) Sentiment Analysis is often used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). These models have been practically replaced by language models relying on (parts of) the Transformer architecture, a framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), which achieved state-of-the-art (SOTA) performance on several benchmark tasks, mainly for the English language, and became a milestone in the field of NLP. Up to now, only a few researchers have focused on sentiment-related problems for German reviews, even though language-specific evaluation is a crucial driving force for more universal model development and improvement.
Unique characteristics of the different languages present different challenges to the models, which is why evaluating solely on English data is a severe shortcoming.

The first shared task on German ABSA that provides a large annotated data set for training and evaluation is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams back then analyzed the data using mostly standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, different pre-trained BERT models are available today for a variety of languages, including German. We re-analyzed the complete GermEval17 task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We evaluate which of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and by how much) BERT-based models are able to improve the pre-BERT SOTA in German ABSA with the SOTA developments for English ABSA, taking SemEval-2014 (Pontiki et al., 2014) as an example.

We first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 holds the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 The GermEval17 Task(s)

The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB), the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook, and Q&A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev), and a synchronic (testsyn) test set. A diachronic test set (testdia) was collected the same way from November 2016 to January 2017 in order to test for temporal robustness.

The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive", and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General) and Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between an exact match (Subtask D1) and an overlapping match tolerating errors of +/− one token (Subtask D2).

3 Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4; Pontiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA), which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined at different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis tasks using BERT-based language models (Hoang et al., 2019; Xu et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi et al., 2020; Tao and Fang, 2020).

In comparison, little research deals with German ABSA. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German Document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding the GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models: one is an ensemble of ELMo (Peters et al., 2018), GloVe, and a bi-attentive classification network (BCN; McCann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformer-based Sentiment Analysis model (TSA), reaching a score of 0.789 on the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test set. Schmitt et al. (2018) advanced the SOTA for Subtask C by employing biLSTMs and CNNs to carry out end-to-end Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data sets for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.
4 Materials and Methods

Data: The GermEval17 data is freely available in .xml and .tsv format[1]. Each data split (train, validation, test) in .tsv format contains the following variables:

• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)

[1] The data sets (in both formats) can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/.

For documents annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Visibly, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml format, which additionally holds the information on the starting and ending sequence positions of the target phrases.

The data set comprises ∼26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for validation, and a synchronic test data set. Table 1 displays the number of documents for each split.

  train   dev    testsyn  testdia
  19,432  2,369  2,566    1,842

Table 1: Number of documents per split of the data set.

While roughly 74% of the documents form the train set, the development split and the synchronic test split contain around 9% and around 10%, respectively. The remaining 7% of the data belong to the diachronic set (cf. Tab. 1). Table 2 shows the relevance distribution per data split. This unveils a pretty skewed distribution of the labels, since the relevant documents represent the clear majority with over 80% in each split.

  Relevance  train   dev    testsyn  testdia
  true       16,201  1,931  2,095    1,547
  false      3,231   438    471      295

Table 2: Relevance distribution for Subtask A.

The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25–31% to the negative, and only 4–6% to the positive class.

  Sentiment  train   dev    testsyn  testdia
  negative   5,045   589    780      497
  neutral    13,208  1,632  1,681    1,237
  positive   1,179   148    105      108

Table 3: Sentiment distribution for Subtask B.

Table 4 holds the distribution of the 20 different aspect categories assigned to the documents[2]. It shows the number of documents containing certain categories, without differentiating between how often a category appears within a given document. The relative distribution of the aspect categories is similar between the splits. On average, there are ∼1.12 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution (Wojatzki et al., 2017).

[2] Multiple annotations per document are possible; for a detailed category description see https://sites.google.com/view/germeval2017-absa/data.

  Category                        train   dev    testsyn  testdia
  Allgemein                       11,454  1,391  1,398    1,024
  Zugfahrt                        1,687   177    241      184
  Sonstige Unregelmäßigkeiten     1,277   139    224      164
  Atmosphäre                      990     128    148      53
  Ticketkauf                      540     64     95       48
  Service und Kundenbetreuung     447     42     63       27
  Sicherheit                      405     59     84       42
  Informationen                   306     28     58       35
  Connectivity                    250     22     36       73
  Auslastung und Platzangebot     231     25     35       20
  DB App und Website              175     20     28       18
  Komfort und Ausstattung         125     18     24       11
  Barrierefreiheit                53      14     9        2
  Image                           42      6      0        3
  Toiletten                       41      5      7        4
  Gastronomisches Angebot         38      2      3        3
  Reisen mit Kindern              35      3      7        2
  Design                          29      3      4        2
  Gepäck                          12      2      2        6
  QR-Code                         0       1      1        0
  total                           18,137  2,149  2,467    1,721
  # documents with aspects        16,200  1,930  2,095    1,547
  ∅ different aspects/document    1.12    1.11   1.18     1.11

Table 4: Aspect category distribution for Subtask C. Multiple mentions of the same aspect category in a document are only considered once.

Pre-trained architectures: BERT was initially introduced in a base (110M parameters) and a large (340M parameters) variant; Sanh et al. (2019) proposed an even smaller BERT model (DistilBERT, 66M parameters) trained via knowledge distillation (Hinton et al., 2015). The exact model specifications regarding the number of layers (L), the number of attention heads (A), and the embedding size (H) of the available German BERT models are depicted in the last column of Table 5. Both architectures were pre-trained on the Masked Language Modeling task as well as on the auxiliary Next Sentence Prediction task (BERT only) and can subsequently be fine-tuned on a task at hand. We include three German (Distil)BERT models pre-trained by DBMDZ[3] and one by Deepset.ai[4].

  Model variant                        Pre-training corpus                      Properties
  bert-base-german-cased               12GB of German text (deepset.ai)         L=12, H=768, A=12, 110M parameters
  bert-base-german-dbmdz-cased         16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
  bert-base-german-dbmdz-uncased       16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
  bert-base-multilingual-cased         Largest Wikipedias (top 104 languages)   L=12, H=768, A=12, 179M parameters
  bert-base-multilingual-uncased       Largest Wikipedias (top 102 languages)   L=12, H=768, A=12, 168M parameters
  distilbert-base-german-cased         16GB of German text (dbmdz)              L=6, H=768, A=12, 66M parameters
  distilbert-base-multilingual-cased   Largest Wikipedias (top 104 languages)   L=6, H=768, A=12, 134M parameters

Table 5: Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For all available models, see: https://huggingface.co/transformers/pretrained_models.html.
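The parameter counts in Table 5 can be roughly reproduced from the architecture constants L and H alone. The following back-of-the-envelope estimate is our own illustration, not part of the original evaluation; the vocabulary size of ~30k WordPiece tokens is an assumption for the German models (the multilingual models use a much larger vocabulary, which explains their higher counts):

```python
def bert_params(L, H, vocab=30_000, max_pos=512, type_vocab=2):
    """Rough parameter count of a BERT-style encoder.
    Embeddings: token + position + segment vectors plus one LayerNorm.
    Per layer: four attention projections, a feed-forward block with
    inner size 4H, and two LayerNorms; finally a pooler layer."""
    embeddings = (vocab + max_pos + type_vocab) * H + 2 * H
    attention = 4 * (H * H + H)                    # Q, K, V, output projections
    ffn = (H * 4 * H + 4 * H) + (4 * H * H + H)    # two dense layers
    layer = attention + ffn + 2 * 2 * H            # plus two LayerNorms
    pooler = H * H + H
    return embeddings + L * layer + pooler

print(f"{bert_params(L=12, H=768) / 1e6:.0f}M")  # 109M, close to Table 5's 110M

# DistilBERT (L=6) additionally drops the pooler and the segment embeddings:
distil = bert_params(L=6, H=768, type_vocab=0) - (768 * 768 + 768)
print(f"{distil / 1e6:.0f}M")  # 66M, matching distilbert-base-german-cased
```

The remaining gap to the published figures comes from the exact vocabulary sizes of the individual checkpoints.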
[3] MDZ Digital Library team at the Bavarian State Library. Visit https://www.digitale-sammlungen.de for details and https://github.com/dbmdz/berts for their repository of pre-trained BERT models.
[4] Visit https://deepset.ai/german-bert for details.

The latter is pre-trained on German Wikipedia (6GB of raw text), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020), and news articles (3.6GB). DBMDZ combined Wikipedia, EU Bookshop (Skadiņš et al., 2014), Open Subtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019), and News Crawl (Haddow, 2018) into a corpus with a total size of 16GB and ∼2,350M tokens. Besides these, we use the three multilingual (Distil)BERT models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. every character is lower-cased) while the other five are "cased".

The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7; the transformers module (version 4.0.1) and torch (version 1.7.1) were used[6]. The considered values for the fine-tuning hyperparameters follow the recommendations of Devlin et al. (2019):

• batch size ∈ {16, 32},
• Adam learning rate ∈ {5e-5, 3e-5, 2e-5},
• number of epochs ∈ {2, 3, 4}.

After evaluating the model performance for combinations[7] of the different hyperparameters, all pre-trained architectures were fine-tuned with a learning rate of 5e-5 for four epochs, which turned out to be the most promising combination across the different models. The maximum sequence length was set to 256, which is sufficient since the evaluated data set consists of rather short social media texts, and a batch size of 32 was chosen.

[6] Source code is available on GitHub: https://github.com/ac74/reevaluating_germeval2017. The results are fully reproducible for Subtasks A, B, and C. For Subtask D, reproducibility could not be ensured; the micro F1 scores fluctuate across different runs by +/−0.01 around the reported values.
[7] Due to memory limitations, not every hyperparameter combination was applicable.

Other models: Eight teams officially participated in the GermEval17 shared task; five of them analyzed Subtask A, all of them Subtask B, and two each Subtasks C and D. We furthermore consider the system by Ruppert et al. (2017) in addition to the participants' models from 2017, even though they were the organizers and did not "officially" participate; they also tackled all four subtasks. Since 2017, several other authors have analyzed (parts of) the GermEval17 subtasks using more advanced models, which we also consider for comparison here. Table 6 shows which authors employed which kinds of models to solve which task.

5 Results

For the re-evaluation, we used the latest data provided in .xml format. Duplicates were not removed, in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels[5]. For Subtask D, the BIO tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B- for "Beginning" and continuing with I- for "Inner"; if a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

[5] "positve" in the train set was replaced with "positive"; " negative" in the testdia set was replaced with "negative".
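The BIO-tag construction described above can be sketched as follows; this is a minimal illustration under the assumption of token-level target spans, not the authors' actual preprocessing code:

```python
def bio_tags(tokens, target_start, target_end, entity):
    """Assign BIO tags for one opinion target: B-<entity> to the first
    target token, I-<entity> to the remaining target tokens, O elsewhere."""
    tags = ["O"] * len(tokens)
    for i in range(target_start, target_end):
        tags[i] = ("B-" if i == target_start else "I-") + entity
    return tags

# The example from above: "fährt nicht" annotated as Zugfahrt:negative.
print(bio_tags(["fährt", "nicht"], 0, 2, "Zugfahrt:negative"))
# ['B-Zugfahrt:negative', 'I-Zugfahrt:negative']
```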
  Model                                       A  B  C1  C2  D1  D2
  Models from 2017 (Wojatzki et al., 2017;
    Ruppert et al., 2017)                     X  X  X   X   X   X
  Our BERT models                             X  X  X   X   X   X
  CNN (Attia et al., 2018)                    –  X  –   –   –   –
  CNN+FastText (Schmitt et al., 2018)         –  –  X   X   –   –
  ELMo+GloVe+BCN (Biesialska et al., 2020)    –  X  –   –   –   –
  ELMo+TSA (Biesialska et al., 2020)          –  X  –   –   –   –
  FastText (Guhr et al., 2020)                –  X  –   –   –   –
  bert-base-german-cased (Guhr et al., 2020)  –  X  –   –   –   –

Table 6: An overview of all the models discussed in this article; an "X" in a column indicates that the architecture was evaluated on the respective subtask.

Subtask A: The Relevance Classification is a binary document classification task with the classes true and false. Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold).

  Language model                        testsyn  testdia
  Best model 2017 (Sayyed et al., 2017) 0.903    0.906
  bert-base-german-cased                0.950    0.939
  bert-base-german-dbmdz-cased          0.951    0.946
  bert-base-german-dbmdz-uncased        0.957    0.948
  bert-base-multilingual-cased          0.942    0.933
  bert-base-multilingual-uncased        0.944    0.939
  distilbert-base-german-cased          0.944    0.939
  distilbert-base-multilingual-cased    0.941    0.932

Table 7: F1 scores for Subtask A on synchronic and diachronic test sets.

All the models outperform the best result achieved in 2017 for both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With micro F1 scores of 0.957 and 0.948, respectively, the best scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020), and Guhr et al. (2020) did not evaluate their models on this task.

Subtask B: Subtask B refers to the Document-level Polarity, which is a multi-class classification task with three classes. Table 8 demonstrates the performances on the two test sets.

  Language model                               testsyn  testdia
  Best models 2017 (testsyn: Ruppert et al.,
    2017; testdia: Sayyed et al., 2017)        0.767    0.750
  bert-base-german-cased                       0.798    0.793
  bert-base-german-dbmdz-cased                 0.799    0.785
  bert-base-german-dbmdz-uncased               0.807    0.800
  bert-base-multilingual-cased                 0.790    0.780
  bert-base-multilingual-uncased               0.784    0.766
  distilbert-base-german-cased                 0.798    0.776
  distilbert-base-multilingual-cased           0.777    0.770
  CNN (Attia et al., 2018)                     0.755    –
  ELMo+GloVe+BCN (Biesialska et al., 2020)     0.782    –
  ELMo+TSA (Biesialska et al., 2020)           0.789    –
  FastText (Guhr et al., 2020)                 0.698†   –
  bert-base-german-cased (Guhr et al., 2020)   0.789†   –

Table 8: Micro-averaged F1 scores for Subtask B on synchronic and diachronic test sets. † Guhr et al. (2020) created their own (balanced & unbalanced) data splits, which limits comparability. We compare to the performance on the unbalanced data since it more likely resembles the original data splits.

All models outperform the best model from 2017 by 1.0–4.0 percentage points for the synchronic and by 1.6–5.0 percentage points for the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models generally perform worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) does not even come close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) are pretty competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.
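The micro-averaged F1 reported throughout these tables pools true positives, false positives, and false negatives over all documents before computing a single F1 value; for single-label tasks such as Subtasks A and B it coincides with accuracy. A minimal sketch (our own illustration, not the official evaluation script):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over label sets.
    gold/pred: one set of labels per document, so the same code covers
    the single-label (A, B) and multi-label (C) settings."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Single-label case (Subtask B): micro F1 equals accuracy.
gold = [{"neutral"}, {"negative"}, {"positive"}, {"neutral"}]
pred = [{"neutral"}, {"neutral"}, {"positive"}, {"neutral"}]
print(micro_f1(gold, pred))  # 0.75
```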
Subtask C: Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment (Subtask C2) classification, each being a multi-label classification task[8]. As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 has 60 labels, since each aspect category can be combined with each of the three sentiments. Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table 9.

[8] This leads to a change of the activation function in the final layer from softmax to sigmoid, combined with a binary cross-entropy loss.

  Language model                         testsyn  testdia
  Best model 2017 (Ruppert et al., 2017) 0.537    0.556
  bert-base-german-cased                 0.756    0.762
  bert-base-german-dbmdz-cased           0.756    0.781
  bert-base-german-dbmdz-uncased         0.761    0.791
  bert-base-multilingual-cased           0.706    0.734
  bert-base-multilingual-uncased         0.723    0.752
  distilbert-base-german-cased           0.738    0.768
  distilbert-base-multilingual-cased     0.716    0.744
  CNN+FastText (Schmitt et al., 2018)    0.523    0.557

Table 9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 15 in Appendix A.

All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only other authors to evaluate their models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models. Next, Table 10 shows the results for Subtask C2.

  Language model                         testsyn  testdia
  Best model 2017 (Ruppert et al., 2017) 0.396    0.424
  bert-base-german-cased                 0.634    0.663
  bert-base-german-dbmdz-cased           0.628    0.663
  bert-base-german-dbmdz-uncased         0.655    0.689
  bert-base-multilingual-cased           0.571    0.634
  bert-base-multilingual-uncased         0.553    0.631
  distilbert-base-german-cased           0.629    0.663
  distilbert-base-multilingual-cased     0.589    0.642
  CNN+FastText (Schmitt et al., 2018)    0.423    0.465

Table 10: Micro-averaged F1 scores for Subtask C2 (Aspect+Sentiment) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 16 in Appendix A.

Here, the pre-trained models surpass the best model from 2017 by 15.7–25.9 percentage points and 20.7–26.5 percentage points, respectively, on the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model, reaching scores of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.

Subtask D: Subtask D refers to the Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between an exact (Subtask D1) and an overlapping match (Subtask D2), tolerating a deviation of +/− one token. Here, "entities" are identified by their BIO tags. It is noteworthy that there are fewer entities here than for Subtask C, since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, fewer documents are available for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.

In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to those using an additional CRF layer for Subtask D1.

  Language model                             testsyn  testdia
  Best model 2017 (Ruppert et al., 2017)     0.229    0.301
  without CRF:
    bert-base-german-cased                   0.460    0.455
    bert-base-german-dbmdz-cased             0.480    0.466
    bert-base-german-dbmdz-uncased           0.492    0.501
    bert-base-multilingual-cased             0.447    0.457
    bert-base-multilingual-uncased           0.429    0.404
    distilbert-base-german-cased             0.347    0.357
    distilbert-base-multilingual-cased       0.430    0.419
  with CRF:
    bert-base-german-cased                   0.446    0.443
    bert-base-german-dbmdz-cased             0.466    0.444
    bert-base-german-dbmdz-uncased           0.515    0.518
    bert-base-multilingual-cased             0.472    0.466
    bert-base-multilingual-uncased           0.477    0.452
    distilbert-base-german-cased             0.424    0.403
    distilbert-base-multilingual-cased       0.436    0.418

Table 11: Entity-level micro-averaged F1 scores for Subtask D1 (exact match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 17 in Appendix B.

The best performing model is the uncased German BERT-BASE model by dbmdz with a CRF layer on both test sets, with scores of 0.515 and 0.518, respectively.
Overall, the results from 2017 20.7–26.5 percentage points, respectively, for the are outperformed by 11.8–28.6 percentage points Language model testsyn testdia • The monolingual DistilBERT model is pretty Best models 2017 (testsyn : Lee et al., 2017) competitive, it consistently outperforms its 0.348 0.365 (testdia : Ruppert et al., 2017) bert-base-german-cased 0.471 0.474 multilingual counterpart as well as the multi- lingual BERT models on the subtasks A – C without CRF bert-base-german-dbmdz-cased 0.491 0.488 bert-base-german-dbmdz-uncased 0.501 0.518 bert-base-multilingual-cased 0.457 0.473 and is at least competitive to the monolingual bert-base-multilingual-uncased 0.435 0.417 BERT models. distilbert-base-german-cased 0.397 0.407 distilbert-base-multilingual-cased 0.433 0.429 For D1 and D2 we observe a rather clear domi- bert-base-german-cased 0.455 0.457 nance of the uncased monolingual model which is bert-base-german-dbmdz-cased 0.476 0.469 not observable to this extent for the other tasks. with CRF bert-base-german-dbmdz-uncased 0.523 0.533 bert-base-multilingual-cased 0.476 0.474 bert-base-multilingual-uncased 0.484 0.464 6 Discussion distilbert-base-german-cased 0.433 0.423 distilbert-base-multilingual-cased 0.442 0.427 After having observed a notable performance in- crease for German ABSA when employing pre- Table 12: Entity-level micro-averaged F1 scores for trained models, the next step is to compare these Subtask D2 (overlapping match) on synchronic and di- achronic test sets. A detailed overview of per-class per- observations to what was reported for the English formances for error analysis can be found in Table 18 language. Therefore, we examine the temporal de- in Appendix B. velopment of the SOTA performance on the most widely adopted data sets for English ABSA, orig- inating from the SemEval Shared Tasks (Pontiki on the synchronic test set and 5.6–21.7 percentage et al., 2014, 2015, 2016). When looking at pub- points on the diachronic test set. 
lic leaderboards, e.g. https://paperswithcode.com/, For the overlapping match (cf. Tab. 12), the best Subtask SB2 (aspect term polarity) from SemEval- system from 2017 are outperformed by 4.9–17.5 2014 is the task which attracts most of the re- percentage points on the synchronic and by 4.2– searchers. This task is related, but not perfectly 16.8 percentage points on the diachronic test set. similar, to Subtask C2, since in this case, the as- Again, the uncased German BERT-BASE model by pect term is always a word which has to present dbmdz with CRF layer performs best with an mi- in the given review. For this task, a comparison cro F1 score of 0.523 on the synchronic and 0.533 of pre-BERT and BERT-based methods reveals no on the diachronic set. To our knowledge, there big ”jump” in the performance values, but rather a were no other models to compare our performance steady increase over time (cf. Tab. 13). values with, besides the results from 2017. Language model Laptops Restaurants Main Takeaways For the first two subtasks, which are rather simple binary and multi-class clas- Best model SemEval-2014 0.7048 0.8095 pre-BERT (Pontiki et al., 2014) sification tasks, the pre-trained models are able to MemNet (Tang et al., 2016) 0.7221 0.8095 improve a little upon the already pretty decent per- formance values from 2017. Further, we do not see HAPN (Li et al., 2018) 0.7727 0.8223 large differences between the different pre-trained models. 
Nevertheless, the small differences we can observe already point in the same direction as what can be observed for the primary ABSA tasks of interest, C1 and C2:

• Uncased models tend to outperform their cased counterparts for the monolingual models; for the multilingual models this cannot be clearly confirmed.
• Monolingual models outperform the multilingual ones.
• There are no large performance differences between the two cased BERT models by DBMDZ and Deepset.ai, which suggests only a minor influence of the different corpora on which the models were pre-trained.

                                        Laptops   Restaurants
  BERT-based
    BERT-SPC (Song et al., 2019)        0.7899    0.8446
    BERT-ADA (Rietzler et al., 2020)    0.8023    0.8789
    LCF-ATEPC (Yang et al., 2019)       0.8229    0.9018

Table 13: Development of the SOTA Accuracy for the aspect term polarity task (SemEval-2014; Pontiki et al., 2014). Selected models were picked from https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval.

Clearly more related, but unfortunately also less used, are the subtasks SB3 (aspect category extraction; comparable to Subtask C1) and SB4 (aspect category polarity; comparable to Subtask C2) from SemEval-2014.9 Limitations with respect to comparability arise from the different numbers of categories: Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17), which leads to an easier classification problem and is reflected in the already rather high scores of the 2014 baselines. Table 14 shows the performance of the best model from 2014 as well as the performance of subsequent (pre-BERT and BERT-based) models for subtasks SB3 and SB4.

                                                Restaurants
  Language model                              SB3       SB4
  pre-BERT
    Best model SemEval-2014
    (Pontiki et al., 2014)                    0.8857    0.8292
    ATAE-LSTM (Wang et al., 2016)             —         0.840
  BERT-based
    BERT-pair (Sun et al., 2019)              0.9218    0.899
    CG-BERT (Wu and Ong, 2020)                0.9162†   0.901†
    QACG-BERT (Wu and Ong, 2020)              0.9264    0.904†

Table 14: Development of the SOTA F1 score (SB3) and Accuracy (SB4) for the aspect category extraction/polarity task (SemEval-2014; Pontiki et al., 2014). † Additional auxiliary sentences were used.

In contrast to what can be observed for SB2, in this case the performance increase on SB4 caused by the introduction of BERT is quite striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points. So when taking into account the potential room for improvement (0.16 for SB4 vs. 0.60 for C2), the improvements relative to the potential (0.06/0.16 for SB4 vs. 0.23/0.60 for C2) are quite similar.

Another issue is that (partly) highly specialized (T)ABSA architectures were used for improving the SOTA on the SemEval-2014 tasks, while we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data, which should be an objective for future research.

9 Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016 the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA).

7 Conclusion

As one would have hoped, all the state-of-the-art pre-trained language models clearly outperform all models from 2017, proving the power of transfer learning also for German ABSA. Throughout the presented analyses, the models always achieve similar results between the synchronic and the diachronic test sets, indicating temporal robustness of the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models for a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator for polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy since they were retrieved from social media. That means that the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, some participating teams in 2017 already pursued an elaborate pre-processing of the text data in order to eliminate some noise (Hövelmann and Friedrich, 2017; Sayyed et al., 2017; Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings of Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.
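Normalization of this kind, e.g. mapping "S-Bahn" and "S Bahn" to "sbahn" before lower-casing, can be sketched in a few lines. The function name and any replacement rules beyond the cited "S-Bahn" example are our own illustrative assumptions, not the exact pipeline of Hövelmann and Friedrich (2017):

```python
import re

def normalize(text: str) -> str:
    """Illustrative pre-processing in the spirit of Hövelmann and
    Friedrich (2017): unify spelling variants, then lower-case.
    The rule set here is a hypothetical minimal example."""
    # Collapse variants such as "S-Bahn" / "S Bahn" into one token.
    text = re.sub(r"\bS[- ]Bahn\b", "sbahn", text)
    # Lower-casing removes inconsistent capitalization in noisy
    # social media text and acts as a sort of regularization.
    return text.lower()

print(normalize("Die S-Bahn war verspätet, die S Bahn IMMER!"))
# → "die sbahn war verspätet, die sbahn immer!"
```

In a real pipeline, such rules would be extended to further domain-specific spelling variants; the point is merely that an uncased model sees a more homogeneous token distribution after this step.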
References

Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. 2018. Multilingual multi-class sentiment classification using convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Valentin Barriere and Alexandra Balahur. 2020. Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 266–271, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Salima Behdenna, Fatiha Barigou, and Ghalem Belalem. 2018. Document level sentiment analysis: A survey. EAI Endorsed Transactions on Context-aware Systems and Applications, 4:154339.

Katarzyna Biesialska, Magdalena Biesialska, and Henryk Rybinski. 2020. Sentiment analysis with contextual embeddings and self-attention. arXiv preprint arXiv:2003.05574.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

M. Esplà-Gomis, M. Forcada, Gema Ramírez-Sánchez, and Hieu T. Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In MT-Summit.

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans-Joachim Böhme. 2020. Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1627–1632, Marseille, France.

Barry Haddow. 2018. News Crawl Corpus.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-based sentiment analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 187–196, Turku, Finland. Linköping University Electronic Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Leonard Hövelmann and Christoph M. Friedrich. 2017. Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2020. Adversarial training for aspect-based sentiment analysis with BERT.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Lishuang Li, Yang Liu, and AnQiao Zhou. 2018. Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 181–189, Brussels, Belgium. Association for Computational Linguistics.

Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Stephen Mayhew, Tatiana Tsygankova, and Dan Roth. 2019. ner and pos when nothing is capitalized. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6256–6261, Hong Kong, China. Association for Computational Linguistics.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Pruthwik Mishra, Vandan Mujadia, and Soujanya Lanka. 2017. GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Malte Ostendorff, Till Blume, and Saskia Ostendorff. 2020. Towards an Open Platform for Legal Information. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20, pages 385–388, New York, NY, USA. Association for Computing Machinery.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphee de Clercq, Veronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4933–4941, Marseille, France. European Language Resources Association.

Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is Multilingual BERT Fluent in Language Generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.

Eugen Ruppert, Abhishek Kumar, and Chris Biemann. 2017. LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Zeeshan Ali Sayyed, Daniel Dakota, and Sandra Kübler. 2017. IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval 2017. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Martin Schmitt, Simon Steinheber, Konrad Schreiber, and Benjamin Roth. 2018. Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1109–1114, Brussels, Belgium. Association for Computational Linguistics.

Uladzimir Sidarenka. 2017. PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 1850–1855, Reykjavik, Iceland. European Language Resources Association (ELRA).

Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, and Yanghui Rao. 2019. Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.

Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.

Jie Tao and Xing Fang. 2020. Toward multi-label sentiment analysis: a transfer learning based approach. Journal of Big Data, 7:1.

Zhengxuan Wu and Desmond C. Ong. 2020. Context-guided BERT for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis.

Heng Yang, Biqing Zeng, JianHao Yang, Youwei Song, and Ruyang Xu. 2019. A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, USA.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 1–12, Berlin, Germany.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Appendix

A Detailed results (per category) for Subtask C

It may be interesting to have a more detailed look at the model performance for this subtask because of the high number of classes and their skewed distribution by investigating the performance on category level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances, which in this case are also displayed in Table 4. Seven categories are summarized in Rest because they have an F1 score of 0 for both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

                                            testsyn            testdia
  Aspect Category                       Score   Support    Score   Support
  Allgemein                             0.854   1,398      0.877   1,024
  Sonstige Unregelmäßigkeiten           0.782     224      0.785     164
  Connectivity                          0.750      36      0.838      73
  Zugfahrt                              0.678     241      0.687     184
  Auslastung und Platzangebot           0.645      35      0.667      20
  Sicherheit                            0.602      84      0.639      42
  Atmosphäre                            0.600     148      0.532      53
  Barrierefreiheit                      0.500       9      0           2
  Ticketkauf                            0.481      95      0.506      48
  Service und Kundenbetreuung           0.476      63      0.417      27
  DB App und Website                    0.455      28      0.563      18
  Informationen                         0.329      58      0.464      35
  Komfort und Ausstattung               0.286      24      0          11
  Rest                                  0          24      0          20

Table 15: Micro-averaged F1 scores and support by aspect category (Subtask C1). Seven categories are summarized in Rest and each shows a score of 0.

The F1 scores for Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities) and Connectivity are the highest. 13 categories, mostly similar between the two test sets, show a positive F1 score on at least one of the two test sets. For the categories subsumed under Rest, the model was not able to learn how to correctly identify these categories.

Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as majority class. Over 50% of the true labels belong to this class. Table 16 shows that only 12 out of 60 labels can be detected by the model.

                                                testsyn            testdia
  Aspect+Sentiment Category                 Score   Support    Score   Support
  Allgemein:neutral                         0.804   1,108      0.832     913
  Sonstige Unregelmäßigkeiten:negative      0.782     221      0.793     159
  Zugfahrt:negative                         0.645     197      0.725     149
  Sicherheit:negative                       0.640      78      0.585      39
  Allgemein:negative                        0.582     258      0.333      80
  Atmosphäre:negative                       0.569     126      0.447      39
  Connectivity:negative                     0.400      20      0.291      46
  Ticketkauf:negative                       0.364      42      0.298      34
  Auslastung und Platzangebot:negative      0.350      31      0.211      17
  Allgemein:positive                        0.214      41      0.690      33
  Zugfahrt:positive                         0.154      34      0          34
  Service und Kundenbetreuung:negative      0.146      36      0.174      21
  Rest                                      0         343      0         180

Table 16: Micro-averaged F1 scores and support by Aspect+Sentiment category (Subtask C2). 48 categories are summarized in Rest and each shows a score of 0.

All the aspect categories displayed in Table 16 are also visible in Table 15, and most of them have negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that the 48 remaining categories could not be identified due to data sparsity. However, having this in mind, the model achieves a relatively high overall performance for both Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly owed to the high scores of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It might be interesting to investigate whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein and Allgemein:neutral, or documents with sentiment neutral, since these are usually less interesting for a company. We observe a large drop in the overall F1 score, which is attributed to the absence of the strong majority class and the resulting data loss. Indeed, the classification of some single categories could be improved, but the rare categories could still not be identified by the model.

B Detailed results (per category) for Subtask D

Similarly to Subtask C, the results for the best model are investigated in more detail. Table 17 gives the detailed classification report for the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed. The table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

                                                testsyn            testdia
  Category                                  Score   Support    Score   Support
  Zugfahrt:negative                         0.702     622      0.729     495
  Sonstige Unregelmäßigkeiten:negative      0.681     693      0.581     484
  Sicherheit:negative                       0.604     337      0.457     122
  Connectivity:negative                     0.598      56      0.620     109
  Barrierefreiheit:negative                 0.595      14      0           3
  Auslastung und Platzangebot:negative      0.579      66      0.447      31
  Connectivity:positive                     0.571      26      0.555      60
  Allgemein:negative                        0.545     807      0.343     139
  Atmosphäre:negative                       0.500     403      0.337     164
  Ticketkauf:negative                       0.383      96      0.583      74
  Ticketkauf:positive                       0.368      59      0          13
  Komfort und Ausstattung:negative          0.357      24      0          16
  Atmosphäre:neutral                        0.348      40      0.111      14
  Service und Kundenbetreuung:negative      0.323      74      0.286      31
  Informationen:negative                    0.301      68      0.505      46
  Zugfahrt:positive                         0.276      62      0.343      83
  DB App und Website:negative               0.232      39      0.375      33
  DB App und Website:neutral                0.188      23      0          11
  Sonstige Unregelmäßigkeiten:neutral       0.179      13      0.222       2
  Allgemein:positive                        0.157      86      0.586      92
  Service und Kundenbetreuung:positive      0.115      23      0           5
  Atmosphäre:positive                       0.105      26      0          15
  Ticketkauf:neutral                        0.040     144      0.222      25
  Connectivity:neutral                      0          11      0.211      15
  Toiletten:negative                        0          15      0.160      23
  Rest                                      0         355      0         115

Table 17: Micro-averaged F1 scores and support by Aspect+Sentiment entity with exact match (Subtask D1). 35 categories are summarized in Rest, each of them exhibiting a score of 0.

For Subtask D1, the model returns a positive score on 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative can be classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative for the synchronic test set, and by Connectivity:negative and Allgemein:positive for the diachronic set. Visibly, the scores between the two test sets differ more here than in the classification report of the previous task.

The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories than for the exact match. The third-best score on the diachronic test data is now Sonstige Unregelmäßigkeiten:negative. Besides this, the top three categories per test set remain the same.

                                                testsyn            testdia
  Category                                  Score   Support    Score   Support
  Zugfahrt:negative                         0.708     622      0.739     495
  Sonstige Unregelmäßigkeiten:negative      0.697     693      0.617     484
  Sicherheit:negative                       0.607     337      0.475     122
  Connectivity:negative                     0.598      56      0.620     109
  Barrierefreiheit:negative                 0.595      14      0           3
  Auslastung und Platzangebot:negative      0.579      66      0.447      31
  Connectivity:positive                     0.571      26      0.555      60
  Allgemein:negative                        0.561     807      0.363     139
  Atmosphäre:negative                       0.505     403      0.358     164
  Ticketkauf:negative                       0.383      96      0.583      74
  Ticketkauf:positive                       0.368      59      0          13
  Komfort und Ausstattung:negative          0.357      24      0          16
  Atmosphäre:neutral                        0.348      40      0.111      14
  Service und Kundenbetreuung:negative      0.323      74      0.286      31
  Informationen:negative                    0.301      68      0.505      46
  Zugfahrt:positive                         0.276      62      0.343      83
  DB App und Website:negative               0.261      39      0.406      33
  DB App und Website:neutral                0.188      23      0          11
  Sonstige Unregelmäßigkeiten:neutral       0.179      13      0.222       2
  Allgemein:positive                        0.157      86      0.586      92
  Service und Kundenbetreuung:positive      0.115      23      0           5
  Atmosphäre:positive                       0.105      26      0          15
  Ticketkauf:neutral                        0.040     144      0.222      25
  Connectivity:neutral                      0          11      0.211      15
  Toiletten:negative                        0          15      0.160      23
  Rest                                      0         355      0         112

Table 18: Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). 35 categories are summarized in Rest and each shows a score of 0.

Apart from the fact that this is a different kind of task than before, one can notice that even though the overall micro F1 scores are lower for Subtask D than for Subtask C, the model manages to successfully identify a larger variety of categories, i.e. it achieves a positive score for more categories. This is probably due to the more balanced data for Subtask D than for Subtask C2, resulting in a lower overall score and mostly higher scores per category.
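The influence of a strong majority class on the micro-averaged F1 score, which we point to above, can be illustrated with a minimal sketch. The per-class (tp, fp, fn) counts below are invented for illustration and do not correspond to the GermEval17 data:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-class counts: one majority class, two rare classes.
counts = {
    "Allgemein": (900, 100, 100),  # dominates the pooled totals
    "Toiletten": (1, 5, 14),
    "Gepäck":    (0, 2, 10),       # never predicted correctly
}

# Micro-averaging pools the counts over all classes, so the
# majority class dominates the resulting score.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# Macro-averaging weights every class equally and thus exposes
# the poor performance on the rare classes.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

print(f"micro F1: {micro_f1:.3f}")  # close to the majority class's own F1
print(f"macro F1: {macro_f1:.3f}")  # dragged down by the rare classes
```

This is why the relatively high overall micro F1 scores for Subtasks C1 and C2 coexist with F1 scores of 0 for many rare categories; reporting macro-averaged scores alongside would make that gap explicit.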