Re-Evaluating GermEval17 Using German Pre-Trained Language Models

Re-Evaluating GermEval17 Using German Pre-Trained Language Models MatthiasAßenmacher matthias@stat.uni-muenchen.de Department of Statistics Ludwig-Maximilians-Universität

Munich Germany

AlessandraCorvonato alessandracorvonato@yahoo.de Department of Statistics Ludwig-Maximilians-Universität

Munich Germany

Re-Evaluating GermEval17 Using German Pre-Trained Language Models AE61EDBA280C88770EDB20913ABBE7B9 GROBID - A machine learning software for extracting information from scholarly documents

The lack of a commonly used benchmark data set (collection) such as (Super) GLUE (Wang et al., 2018(Wang et al., , 2019) ) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP-research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of German and multilingual BERT models currently available via the huggingface transformers library on four subtasks of Aspect-based Sentiment Analysis (ABSA) from the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017;Schmitt et al., 2018;Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for a similar ABSA task (Pontiki et al., 2014) and similar models (pre-BERT vs. BERT-based) for the English language and we check whether the reported improvements correspond to those we observe for German.

Introduction

(Aspect-based) Sentiment Analysis is often used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). Those models have been practically replaced by language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transfomers), achieving state-of-the-art (SOTA) performance on several benchmark tasks -mainly for the English language -and becoming a milestone in the field of NLP.

Up to now, only a few researchers have focused on sentiment related problems for German reviews, despite language-specific evaluation is a crucial driving force for a more universal model development and improvement. Unique characteristics of the different languages present different challenges to the models, which is why sole evaluation on English data is a severe shortcoming.

The first shared task on German ABSA, which provides a large annotated data set for training and evaluation, is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams back then analyzed the data using mostly standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, today, different pre-trained BERT models are available for a variety of different languages, including German. We re-analyzed the complete GermEval17 Task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We evaluate which one of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and how much) BERT-based models are able to improve the pre-BERT SOTA in German ABSA with the SOTA developments for English ABSA by the example of SemEval-2014 (Pontiki et al., 2014).

We first give an overview on the GermEval17 tasks (cf. Sec. 2) and on related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 holds the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.

2 The GermEval17 Task(s)

The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB) -the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook and Q&A websites from May 2015 to June 2016. The documents were manually annotated, and split into a training (train), a development (dev) and a synchronic (test syn ) test set. A diachronic test set (test dia ) was collected the same way from November 2016 to January 2017 in order to test for temporal robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive" and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to the Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of +/− one token (Subtask D2).

Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4;Pon-tiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA) which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined on different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis task using BERTbased language models (Hoang et al., 2019;Xu et al., 2019;Sun et al., 2019;Li et al., 2019;Karimi et al., 2020;Tao and Fang, 2020).

In comparison, little research deals with German ABSA. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German Document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding the GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models: One is an ensemble of ELMo (Peters et al., 2018), GloVe and a bi-attentive classification network (BCN; Mc-Cann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformerbased Sentiment Analysis model (TSA), reaching a score of 0.789 for the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test set. Schmitt et al. (2018) advanced the SOTA for Subtask C by employing biLSTMs and CNNs to carry out end-toend Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data set for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.

Materials and Methods

Data The GermEval17 data is freely available in .xml-and .tsv-format1 . Each data split (train, validation, test) in .tsv-format contains the following variables:

• document id (URL)

• document text • relevance label (true, false)

• document-level sentiment label (negative, neutral, positive)

• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)

For documents which are annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Visibly, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml-format, which additionally holds the information on the starting and ending sequence positions of the target phrases.

The data set comprises ∼ 26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for validation and a synchronic test data set. The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25-31% to the negative and only 4-6% to the positive class. (Hinton et al., 2015). The exact model specifications regarding number of layers (L), number of attention heads (A) and embedding size (H) for available German BERT models are depicted in the last column of Table 5. Both architectures were pre-trained on the Masked Language Modeling task as well as on the auxiliary Next Sentence Prediction task (only BERT) and can subsequently be fine-tuned on a task at hand.

We include three German (Distil)BERT models pre-trained by DBMDZ3 and one by Deepset.ai4 . The latter one is pre-trained using German Wikipedia (6GB raw text files), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020) and news articles (3.6GB). DBMDZ combined Wikipedia, EU Bookshop (Skadin ¸š et al., 2014), Open Subtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019) and News Crawl (Haddow, 2018) to a corpus with a total size of 16GB with ∼ 2, 350M tokens. Besides this, we use the three multilingual (Distil)BERT models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. every character is lower-cased) while the other five models are "cased" ones.

Results

For the re-evaluation, we used the latest data provided in .xml-format. Duplicates were not removed, in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels5 . For Subtask D, the BIO-tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B-for "Beginning" and continuing with I-for "Inner". If a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7. Moreover, the transformers module (version 4.0.1) and torch (version 1.7.1) were used 6 . The considered values for the hyperparameters for fine-tuning follow the recommendations of Devlin et al. (2019):

• Batch size ∈ {16, 32},

• Adam learning rate ∈ {5e,3e,2e} − 5,

• # epochs ∈ {2, 3, 4}.

After evaluating the model performance for combinations7 of the different hyperparameters, all pretrained architectures were fine-tuned with a learning rate of 5e-5 for four epochs, which turned out to be the most promising combination across the different models. The maximum sequence length was set to 256, which is sufficient since the evaluated data set consists of rather short texts from social media, and a batch size of 32 was chosen.

Other models Eight teams officially participated in the GermEval17 shared task, five of which analyzed Subtask A, all of them Subtask B and two repectively Subtask C and D. We furthermore consider the system by Ruppert et al. (2017) though they were the organizers and did not "officially" participate. They also tackled all four subtasks. Since 2017 several other authors analyzed (parts of) the GermEval17 subtasks using more advanced models, which we also consider for comparison here. Table 6 shows which authors employed which kinds of models to solve which task.

Subtask

A B C1 C2 D1 D2 Models from 2017 X X X X X X (Wojatzki et al., 2017;Ruppert et al., 2017) Our BERT models X X X X X X CNN (Attia et al., 2018) -X ----CNN+FastText (Schmitt et al., 2018) --X X --ELMo+GloVe+BCN (Biesialska et al., 2020) -X ----ELMo+TSA (Biesialska et al., 2020) -X ----FastText (Guhr et al., 2020) -X ---bert-base-german-cased -X ---- (Guhr et al., 2020) Table 6: An overview on all the models discussed in this article, an "X" in a column indicates that the architecture was evaluated on the respective subtask.

Subtask A The Relevance Classification is a binary document classification task with classes true and false. (Biesialska et al., 2020) 0.782 -ELMo+TSA (Biesialska et al., 2020) 0.789 -FastText (Guhr et al., 2020) 0.698 † bert-base-german-cased (Guhr et al., 2020) All models outperform the best model from 2017 by 1.0-4.0 percentage points for the synchronic, and by 1.6-5.0 percentage points for the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models perform generally worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) comes not even close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) Here, the pre-trained models surpass the best model from 2017 by 15.7-25.9 percentage points and 20.7-26.5 percentage points, respectively, for the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model reaching scores of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both, Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.

Subtask D Subtask D refers to the Opinion Target Extraction (OTE) and is thus a tokenlevel classification task. As this is a rather difficult task, Wojatzki et al. (2017) In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to when using a CRF layer for Subtask D1. on the synchronic test set and 5.6-21.7 percentage points on the diachronic test set.

For the overlapping match (cf. Tab. 12), the best system from 2017 are outperformed by 4.9-17.5 percentage points on the synchronic and by 4.2-16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model by dbmdz with CRF layer performs best with an micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.

Main Takeaways For the first two subtasks, which are rather simple binary and multi-class classification tasks, the pre-trained models are able to improve a little upon the already pretty decent performance values from 2017. Further, we do not see large differences between the different pre-trained models. Nevertheless, the small differences we can observe, already point in the same direction as what can be observed for the primary ABSA tasks of interest, C1 and C2:

• Uncased models have a tendency of outperforming their cased counterparts for the monolingual models, for multilingual models this cannot be clearly confirmed. • Monolingual models outperform the multilingual ones. • There are no large performance differences between the two cased BERT models by DBMDZ and Deepset.ai, which suggests only a minor influence of the different corpora, which the models were pre-trained on.

• The monolingual DistilBERT model is pretty competitive, it consistently outperforms its multilingual counterpart as well as the multilingual BERT models on the subtasks A -C and is at least competitive to the monolingual BERT models. For D1 and D2 we observe a rather clear dominance of the uncased monolingual model which is not observable to this extent for the other tasks.

Discussion

After having observed a notable performance increase for German ABSA when employing pretrained models, the next step is to compare these observations to what was reported for the English language. Therefore, we examine the temporal development of the SOTA performance on the most widely adopted data sets for English ABSA, originating from the SemEval Shared Tasks (Pontiki et al., 2014(Pontiki et al., , 2015(Pontiki et al., , 2016)). When looking at public leaderboards, e.g. https://paperswithcode.com/, Subtask SB2 (aspect term polarity) from SemEval-2014 is the task which attracts most of the researchers. This task is related, but not perfectly similar, to Subtask C2, since in this case, the aspect term is always a word which has to present in the given review. For this task, a comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).

Language model

Laptops Restaurants

Best model SemEval-2014 0.7048 0.8095 pre-BERT (Pontiki et al., 2014) MemNet (Tang et al., 2016) Pontiki et al., 2014). Selected models were picked from https://paperswithcode.com/sota/aspectbased-sentiment-analysis-on-semeval.

Clearly more related, but unfortunately also less used, are the subtasks SB3 (aspect category extraction; comparable to Subtask C1) and SB4 (aspect category polarity; comparable to Subtask C2) from SemEval-2014. 9 Limitations with respect to comparability arise from the different numbers of categories: Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17) which leads to an easier classification problem and is reflected in the already pretty high scores of the 2014 baselines. Table 14 shows the performance of the best model from 2014 as well as performance of subsequent (pre-BERT and BERT-based) models for subtasks SB3 and SB4. In contrast to what can be observed for SB2, in this case, the performance increase on SB4 caused by the introduction of BERT seems to be kind of striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points. So when taking into account the potential room for improvement (0.16 for SB4 vs. 0.60 for C2), the improvements relative to the potential (0.06/0.16 for SB4 vs. 0.23/0.60 for C2) are quite similar. Another issue is that (partly) highly specialized (T)ABSA architectures were used for improving the SOTA on the SemEval-2014 tasks, while we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data which should be an objective for future research. 9 Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016 the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA).

Conclusion

As one would have hoped, all the state-of-the art pre-trained language models clearly outperform all the models from 2017, proving the power of transfer learning also for German ABSA. Throughout the presented analyses, the models always achieve similar results between the synchronic and the diachronic test sets, indicating temporal robustness for the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models for a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator for polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy since they were retrieved from social media. That means that the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, already some participating teams in 2017 pursued an elaborate pre-processing on the text data in order to eliminate some noise (Hövelmann and Friedrich, 2017;Sayyed et al., 2017;Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings from Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.

It may be interesting to have a more detailed look at the model performance for this subtask because of the high number of classes and their skewed distribution by investigating the performance on category-level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances, which are also displayed in Table 4 All the aspect categories displayed in Table 16 are also visible in Table 15 and most of them have negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that here, 48 categories could not be identified due to data sparsity. However, having this in mind, the model achieves a relatively high overall performance for both, Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly owed to the high score of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It might be interesting whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein, Allgemein:neutral or documents with sentiment neutral since these are usually less interesting for a company. We observe a large drop in the overall F1 score which is attributed to the absence of the strong majority class and the resulting data loss. Indeed, the classification for some single categories could be improved, but the rare categories could still not be identified by the model.

B Detailed results (per category) for Subtask D

Similar as for Subtask C, the results for the best model are investigated in more detail. For Subtask D1, the model returns a positive score on 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative can be classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative for the synchronic test set and by Connectivity:negative and Allgemein:positive for the diachronic set. Visibly, the scores between the two test sets differ more here than in the classification report of the previous task.

The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories than for the exact match. The third-best score on the diachronic test data is now Sonstige Unregelmäßigkeiten:negative. Besides this, the top three categories per test set remain the same.

Apart from the fact that this is a different kind of task than before, one can notice that even though the overall micro F1 scores are lower for Subtask D than for Subtask C, the model manages to successfully identify a larger variety of categories, i.e. it achieves a positive score for more categories. This is probably due to the more balanced data for Subtask D than for Subtask C2, resulting in a lower overall score and mostly higher scores per category.

Table 11displays thenumber of documents for each split.traindev test syn test dia19,432 2,3692,566 1,842

Table 1 :1Number of documents per split of the data set.While roughly 74% of the documents form the trainset, the development split and the synchronic testsplit contain around 9% and around 10%, respec-tively. The remaining 7% of the data belong tothe diachronic set (cf. Tab. 1). Table 2 showsthe relevance distribution per data split. This un-veils a pretty skewed distribution of the labels sincethe relevant documents represent the clear majoritywith over 80% in each split.Relevancetraindev test syn test diatrue16,201 1,9312,095 1,547false3,231438471295

Table 2 :2Relevance distribution for Subtask A.

Table 44

holds the distribution of the 20 differentaspect categories assigned to the documents 2 . It

Table 44: Aspect category distribution for Subtask C.Multiple mentions of the same aspect category in a doc-ument are only considered once.Pre-trained architectures BERT was initiallyintroduced in a base (110M parameters) and alarge (340M) variant, Sanh et al. (2019) pro-posed an even smaller BERT model (DistilBERT,60M parameters) trained via knowledge distillation

Table 5 :5Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For all available models, see: https://huggingface.co/transformers/pretrained_models.html.

Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold). Subtask B Subtask B refers to the Documentlevel Polarity, which is a multi-class classification task with three classes. Table 8 demonstrates the performances on the two test sets:

Language modeltest syn test diaBest models 2017 (test syn : Ruppert et al., 2017) (test dia : Sayyed et al., 2017)0.7670.750bert-base-german-cased0.7980.793bert-base-german-dbmdz-cased0.7990.785bert-base-german-dbmdz-uncased0.8070.800bert-base-multilingual-cased0.7900.780bert-base-multilingual-uncased0.7840.766distilbert-base-german-cased0.7980.776distilbert-base-multilingual-cased0.7770.770CNN (Attia et al., 2018)0.755-ELMo+GloVe+BCNLanguage modeltest syn test diaBest model 2017 (Sayyed et al., 2017)0.9030.906bert-base-german-cased0.9500.939bert-base-german-dbmdz-cased0.9510.946bert-base-german-dbmdz-uncased0.9570.948bert-base-multilingual-cased0.9420.933bert-base-multilingual-uncased0.9440.939distilbert-base-german-cased0.9440.939distilbert-base-multilingual-cased0.9410.932Table 7: F1 scores for Subtask A on synchronic anddiachronic test sets.All the models outperform the best result achievedin 2017 for both test data sets. For the synchronictest set, the previous best result is surpassed by3.8-5.4 percentage points. For the diachronic testset, the absolute difference to the best contender of2017 varies between 2.6 and 4.2 percentage points.With a micro F1 score of 0.957 and 0.948, respec-tively, the best scoring pre-trained language modelis the uncased German BERT-BASE variant bydbmdz, followed by its cased version. All the

pre-trained models perform slightly better on the synchronic test data than on the diachronic data.Attia et al. (2018),Schmitt et al. (2018),Biesialska et al. (2020) andGuhr et al. (2020) did not evaluate their models on this task.

Table 8 :80.789 † -Micro-averaged F1 scores for Subtask B on synchronic and diachronic test sets.† Guhr et al. (2020) created their own (balanced & unbalanced) data splits, which limits comparability. We compare to the performance on the unbalanced data since it more likely resembles the original data splits.

Table 9 :9are pretty competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models. with each of the three sentiments. Consistent withLee et al. (2017) andMishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table15in Appendix A.All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported bySchmitt et al. (2018), who are the only ones of the other authors to evaluate their models on this tasks. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8-23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models. Next, Table10shows the results for Subtask C2:

Subtask C Subtask C is split into Aspect-only(Subtask C1) and Aspect+Sentiment Classification(Subtask C2), each being a multi-label classifica-tion task 8 . As the organizers provide 20 aspectcategories, Subtask C1 includes 20 labels, whereasSubtask C2 has 60 labels since each aspect category

Table 12 :12The best performing model is the uncased German BERT-BASE model by dbmdz with CRF layer on both test sets, with a score of 0.515 and 0.518, respectively. Overall, the results from 2017 are outperformed by 11.8-28.6 percentage points Entity-level micro-averaged F1 scores for Subtask D2 (overlapping match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table18in Appendix B.Language modeltest syn test diaBest models 2017 (test syn : Lee et al., 2017) (test dia : Ruppert et al., 2017)0.3480.365bert-base-german-cased0.4710.474without CRFbert-base-german-dbmdz-cased bert-base-german-dbmdz-uncased bert-base-multilingual-cased bert-base-multilingual-uncased distilbert-base-german-cased0.491 0.501 0.457 0.435 0.3970.488 0.518 0.473 0.417 0.407distilbert-base-multilingual-cased0.4330.429bert-base-german-cased0.4550.457bert-base-german-dbmdz-cased0.4760.469with CRFbert-base-german-dbmdz-uncased bert-base-multilingual-cased bert-base-multilingual-uncased0.523 0.476 0.4840.533 0.474 0.464distilbert-base-german-cased0.4330.423distilbert-base-multilingual-cased0.4420.427

Table 13 :13Development of the SOTA Accuracy for the aspect term polarity task(SemEval-2014; 0.72210.8095

Table 14 :14Development of the SOTA F1 score (SB3) and Accuracy (SB4) for the aspect category extraction/polarity task(SemEval-2014;Pontiki et al., 2014).† Additional auxiliary sentences were used.

in this case. Seven categories are summarized in Rest because they have an F1 score of 0 for both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

test syntest diaAspect CategoryScore Support Score SupportAllgemein0.8541,398 0.8771,024Sonstige Unregelmäßigkeiten0.782224 0.785164Connectivity0.75036 0.83873Zugfahrt0.678241 0.687184Auslastung und Platzangebot0.64535 0.66720Sicherheit0.60284 0.63942Atmosphäre0.600148 0.53253Barrierefreiheit0.500902Ticketkauf0.48195 0.50648Service und Kundenbetreuung 0.47663 0.41727DB App und Website0.45528 0.56318Informationen0.32958 0.46435Komfort und Ausstattung0.28624011Rest024020

Table 15 :15Micro-averaged F1 scores and support by aspect category (Subtask C1). Seven categories are summarized in Rest and show each a score of 0. Connectivity are the highest. 13 categories, mostly similar between the two test sets, show a positive F1 score on at least one of the two test sets. For the categories subsumed under Rest, the model was not able to learn how to correctly identify these categories.Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as majority class. Over 50% of the true labels belong to this class. Table16shows that only 12 out of 60 labels can be detected by the model (see Table16).The F1 scores for Allgemein (General),Sonstige Unregelmäßigkeiten (Other ir-

Table 16 :16Micro-averaged F1 scores and support by As-pect+Sentiment category (Subtask C2). 48 categories are summarized in Rest and show each a score of 0.

Table 17 gives the detailed classification report for the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed. The table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

test syntest diaCategoryScore Support Score SupportZugfahrt:negative0.702622 0.729495Sonstige Unregelmäßigkeiten:negative0.681693 0.581484Sicherheit:negative0.604337 0.457122Connectivity:negative0.59856 0.620109Barrierefreiheit:negative0.5951403Auslastung und Platzangebot:negative0.57966 0.44731Connectivity:positive0.57126 0.55560Allgemein:negative0.545807 0.343139Atmosphäre:negative0.500403 0.337164Ticketkauf:negative0.38396 0.58374Ticketkauf:positive0.36859013Komfort und Ausstattung:negative0.35724016Atmosphäre:neutral0.34840 0.11114Service und Kundenbetreuung:negative 0.32374 0.28631Informationen:negative0.30168 0.50546Zugfahrt:positive0.27662 0.34383DB App und Website:negative0.23239 0.37533DB App und Website:neutral0.18823011Sonstige Unregelmäßigkeiten:neutral0.17913 0.2222Allgemein:positive0.15786 0.58692Service und Kundenbetreuung:positive 0.1152305Atmosphäre:positive0.10526015Ticketkauf:neutral0.040144 0.22225Connectivity:neutral011 0.21115Toiletten:negative015 0.16023Rest03550115

Table 17 :17Micro-averaged F1 scores and support by As-pect+Sentiment entity with exact match (Subtask D1). 35 categories are summarized in Rest, each of them exhibiting a score of 0.

Table 18 :18Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). 35 categories are summarized in Rest and show each a score of 0.

The data sets (in both formats) can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/.Multiple annotations per document are possible;for a detailed category description see https://sites.google.com/view/germeval2017-absa/data.MDZ Digital Library team at the Bavarian State Library. Visit https://www.digitale-sammlungen.de for details and https://github.com/dbmdz/berts for their repository on pre-trained BERT models. 4 Visit https://deepset.ai/german-bert for details.5 "positve" in train set was replaced with "positive", " negative" in test dia set was replaced with "negative".Due to memory limitations, not every hyperparameter combination was applicable.This leads to a change of activation functions in the final layer from softmax to sigmoid + binary cross entropy loss.

Appendix

A Detailed results (per category) for Subtask C

Multilingual multi-class sentiment classification using convolutional neural networks MohammedAttia YounesSamih AliElkahky LauraKallmeyer Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Miyazaki, Japan

ELRA 2018 Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation ValentinBarriere AlexandraBalahur 10.18653/v1/2020.coling-main.23 Proceedings of the 28th International Conference on Computational Linguistics the 28th International Conference on Computational Linguistics

Barcelona, Spain

2020 International Committee on Computational Linguistics Document level sentiment analysis: A survey SalimaBehdenna FatihaBarigou GhalemBelalem 10.4108/eai.14-3-2018.154339 EAI Endorsed Transactions on Contextaware Systems and Applications 4 154339 2018 KatarzynaBiesialska MagdalenaBiesialska HenrykRybinski arXiv:2003.05574 Sentiment analysis with contextual embeddings and self-attention 2020 arXiv preprint Enriching word vectors with subword information PiotrBojanowski EdouardGrave ArmandJoulin TomasMikolov Transactions of the Association for Computational Linguistics 5 2017 Learning phrase representations using rnn encoder-decoder for statistical machine translation KyunghyunCho BartVan Merriënboer CaglarGulcehre DzmitryBahdanau FethiBougares HolgerSchwenk YoshuaBengio arXiv:1406.1078 2014 arXiv preprint A Twitter corpus and benchmark resources for German sentiment analysis MarkCieliebak JanMilanDeriu DominicEgger FatihUzdilli 10.18653/v1/W17-1106 Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media the Fifth International Workshop on Natural Language Processing for Social Media

Valencia, Spain

Association for Computational Linguistics 2017 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding JacobDevlin Ming-WeiChang KentonLee KristinaToutanova 10.18653/v1/N19-1423 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Long and Short Papers the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Minneapolis, Minnesota

Association for Computational Linguistics 2019 1 ParaCrawl: Web-scale parallel corpora for the languages of the EU MEsplà-Gomis MForcada GemaRamírez-Sánchez HieuTHoang MT-Summit 2019 Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems OliverGuhr Anne-KathrinSchumann FrankBahrmann Hans-JoachimBöhme Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) the 12th Conference on Language Resources and Evaluation (LREC 2020)

Marseille, France

2020 News Crawl Corpus BarryHaddow 2018 Distilling the knowledge in a neural network GeoffreyHinton OriolVinyals JeffDean arXiv:1503.02531 2015 arXiv preprint Aspect-based sentiment analysis using BERT MickelHoang OskarAlija Bihorac JacoboRouces Proceedings of the 22nd Nordic Conference on Computational Linguistics the 22nd Nordic Conference on Computational Linguistics

Turku, Finland

Linköping University Electronic Press 2019 Long short-term memory SeppHochreiter JürgenSchmidhuber Neural computation 9 8 1997 Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity LeonardHövelmann ChristophMFriedrich Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Adversarial training for aspect-based sentiment analysis with bert AkbarKarimi LeonardoRossi AndreaPrati 2020 UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection Ji-UngLee SteffenEger JohannesDaxenberger IrynaGurevych Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Hierarchical attention based position-aware network for aspect-level sentiment analysis LishuangLi YangLiu AnqiaoZhou 10.18653/v1/K18-1018 Proceedings of the 22nd Conference on Computational Natural Language Learning the 22nd Conference on Computational Natural Language Learning

Brussels, Belgium

Association for Computational Linguistics 2018 Exploiting BERT for end-to-end aspect-based sentiment analysis XinLi LidongBing WenxuanZhang WaiLam 10.18653/v1/D19-5505 Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019) the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Hong Kong, China

2019 Association for Computational Linguistics OpenSubti-tles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles PierreLison JörgTiedemann Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC the 10th International Conference on Language Resources and Evaluation (LREC 2016. 2016 ner and pos when nothing is capitalized StephenMayhew TatianaTsygankova DanRoth Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processingand the 9th International Joint Conference on Natural Language Processing the 2019 Conference on Empirical Methods in Natural Language Processingand the 9th International Joint Conference on Natural Language Processing

Hong Kong, China

2019 Association for Computational Linguistics Learned in translation: Contextualized word vectors BryanMccann JamesBradbury CaimingXiong RichardSocher Advances in Neural Information Processing Systems Curran Associates, Inc 2017 30 GermEval 2017: Sequence based Models for Customer Feedback Analysis PruthwikMishra VandanMujadia SoujanyaLanka Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures PedroJavier OrtizSuárez BenoîtSagot LaurentRomary 10.14618/IDS-PUB-9021 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7)

Cardiff, United Kingdom

Leibniz-Institut für Deutsche Sprache 2019 Towards an Open Platform for Legal Information MalteOstendorff TillBlume SaskiaOstendorff 10.1145/3383583.3398616 Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20 the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20

New York, NY, USA

Association for Computing Machinery 2020 Glove: Global vectors for word representation JeffreyPennington RichardSocher ChristopherManning Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) the 2014 conference on empirical methods in natural language processing (EMNLP) 2014 MarkMatthew E Peters MohitNeumann MattIyyer ChristopherGardner KentonClark LukeLee Zettlemoyer arXiv:1802.05365 Deep contextualized word representations 2018 arXiv preprint Semeval-2016 task 5: Aspect based sentiment analysis MariaPontiki DimitrisGalanis HarisPapageorgiou IonAndroutsopoulos SureshManandhar Al-SmadiMohammad MahmoudAl-Ayyoub YanyanZhao BingQin OrpheeDe Clercq VeroniqueHoste MariannaApidianaki XavierTannier NataliaLoukachevitch EvgenyKotelnikov NuriaBel SaludMaría Zafra Güls ¸en Eryigit 10.18653/v1/S16-1002 Proceedings of the 10th International Workshop on Semantic Evaluation the 10th International Workshop on Semantic Evaluation

SemEval-

2016. 2016 SemEval-2015 task 12: Aspect based sentiment analysis MariaPontiki DimitrisGalanis HarisPapageorgiou SureshManandhar IonAndroutsopoulos 10.18653/v1/S15-2082 Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Denver, Colorado

Association for Computational Linguistics 2015 SemEval-2014 task 4: Aspect based sentiment analysis MariaPontiki DimitrisGalanis JohnPavlopoulos HarrisPapageorgiou IonAndroutsopoulos SureshManandhar 10.3115/v1/S14-2004 Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) Association for Computational Linguistics the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Dublin, Ireland

2014 Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification AlexanderRietzler SebastianStabinger PaulOpitz StefanEngl Proceedings of the 12th Language Resources and Evaluation Conference the 12th Language Resources and Evaluation Conference

seille, France

European Language Resources Association 2020. Mar- Is Multilingual BERT Fluent in Language Generation? SamuelRönnqvist JennaKanerva TapioSalakoski FilipGinter Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing the First NLPL Workshop on Deep Learning for Natural Language Processing

Turku, Finland

Linköping University Electronic Press 2019 LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis EugenRuppert AbhishekKumar ChrisBiemann Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter VictorSanh LysandreDebut JulienChaumond ThomasWolf arXiv:1910.01108 2019 arXiv preprint IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval ZeeshanAli Sayyed DanielDakota SandraKübler Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017. 2017 Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks MartinSchmitt SimonSteinheber KonradSchreiber BenjaminRoth 10.18653/v1/D18-1139 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing the 2018 Conference on Empirical Methods in Natural Language Processing

Brussels, Belgium

Association for Computational Linguistics 2018 PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network UladzimirSidarenka Proceedings of the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus JörgRaivis Skadin ¸š RobertsTiedemann DaigaRozis Deksne Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014) the 9th International Conference on Language Resources and Evaluation (LREC 2014)

Reykjavik, Iceland

ELRA 2014 Attentional encoder network for targeted sentiment classification YouweiSong JiahaiWang TaoJiang ZhiyueLiu YanghuiRao arXiv:1902.09314 2019 arXiv preprint Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence ChiSun LuyaoHuang XipengQiu 10.18653/v1/N19-1035 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Long and Short Papers the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Minneapolis, Minnesota

Association for Computational Linguistics 2019 1 Aspect level sentiment classification with deep memory network DuyuTang BingQin TingLiu arXiv:1605.08900 2016 arXiv preprint Toward multi-label sentiment analysis: a transfer learning based approach JieTao XingFang 10.1186/s40537-019-0278-0 Journal of Big Data 7 1 2020 Attention Is All You Need AshishVaswani NoamShazeer NikiParmar JakobUszkoreit LlionJones AidanNGomez LukaszKaiser IlliaPolosukhin 31st Conference on Neural Information Processing Systems (NIPS 2017)

Long Beach, California, USA

2017 Superglue: A stickier benchmark for general-purpose language understanding systems AlexWang YadaPruksachatkun NikitaNangia AmanpreetSingh JulianMichael FelixHill OmerLevy SamuelBowman Advances in neural information processing systems 2019 Glue: A multi-task benchmark and analysis platform for natural language understanding AlexWang AmanpreetSingh JulianMichael FelixHill OmerLevy Samuel R Bowman arXiv:1804.07461 2018 arXiv preprint Attention-based lstm for aspectlevel sentiment classification YequanWang MinlieHuang XiaoyanZhu LiZhao Proceedings of the 2016 conference on empirical methods in natural language processing the 2016 conference on empirical methods in natural language processing 2016 GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback MichaelWojatzki EugenRuppert SarahHolschneider TorstenZesch ChrisBiemann Proceedings of the GermEval 2017 -Shared Task on Aspectbased Sentiment in Social Media Customer Feedback the GermEval 2017 -Shared Task on Aspectbased Sentiment in Social Media Customer Feedback

Berlin, Germany

2017 Transformers: State-of-the-Art Natural Language Processing ThomasWolf LysandreDebut VictorSanh JulienChaumond ClementDelangue AnthonyMoi PierricCistac TimRault RémiLouf MorganFuntowicz JoeDavison SamShleifer ClaraPatrick Von Platen YacineMa JulienJernite CanwenPlu TevenXu SylvainLe Scao MariamaGugger QuentinDrame AlexanderMLhoest Rush 10.18653/v1/2020.emnlp-demos.6 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations Association for Computational Linguistics the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 2020 Contextguided bert for targeted aspect-based sentiment analysis ZhengxuanWu DesmondCOng arXiv:2010.07523 2020 arXiv preprint Bert post-training for review reading comprehension and aspect-based sentiment analysis HuXu BingLiu LeiShu PhilipSYu 2019 HengYang BiqingZeng JianhaoYang YouweiSong RuyangXu arXiv:1912.07976 A multi-task learning model for chinese-oriented aspect polarity classification and aspect term extraction 2019 arXiv preprint