=Paper= {{Paper |id=Vol-2957/paper1 |storemode=property |title=Re-Evaluating GermEval17 Using German Pre-Trained Language Models |pdfUrl=https://ceur-ws.org/Vol-2957/paper1.pdf |volume=Vol-2957 |authors=Matthias Aßenmacher,Alessandra Corvonato,Christian Heumann |dblpUrl=https://dblp.org/rec/conf/swisstext/AssenmacherCH21 }} ==Re-Evaluating GermEval17 Using German Pre-Trained Language Models== https://ceur-ws.org/Vol-2957/paper1.pdf
Re-Evaluating GermEval17 Using German Pre-Trained Language Models

Matthias Aßenmacher1♠        Alessandra Corvonato1♣        Christian Heumann1♠

1 Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
♠ {matthias,chris}@stat.uni-muenchen.de, ♣ alessandracorvonato@yahoo.de




Abstract

The lack of a commonly used benchmark data set (collection) such as (Super)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English, neglecting the uncertainty involved in transferring conclusions drawn for English to other languages. We evaluate the performance of the German and multilingual BERT models currently available via the huggingface transformers library on four subtasks of Aspect-based Sentiment Analysis (ABSA) from the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for a similar ABSA task (Pontiki et al., 2014) and similar models (pre-BERT vs. BERT-based) for the English language, and we check whether the reported improvements correspond to those we observe for German.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

1   Introduction

(Aspect-based) Sentiment Analysis is often used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). Those models have been practically replaced by language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), achieving state-of-the-art (SOTA) performance on several benchmark tasks - mainly for the English language - and becoming a milestone in the field of NLP.

Up to now, only a few researchers have focused on sentiment-related problems for German reviews, even though language-specific evaluation is a crucial driving force for more universal model development and improvement. Unique characteristics of the different languages present different challenges to the models, which is why evaluating solely on English data is a severe shortcoming.

The first shared task on German ABSA providing a large annotated data set for training and evaluation is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams back then analyzed the data using mostly standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, today different pre-trained BERT models are available for a variety of languages, including German. We re-analyzed the complete GermEval17 Task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We evaluate which of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and how much) BERT-based models are able to improve the pre-
BERT SOTA in German ABSA with the SOTA developments for English ABSA by the example of SemEval-2014 (Pontiki et al., 2014).

We first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 holds the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.

2   The GermEval17 Task(s)

The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB), the German national railway company. The main data was crawled from various social media platforms such as Twitter, Facebook and Q&A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev) and a synchronic test set (testsyn). A diachronic test set (testdia) was collected the same way from November 2016 to January 2017 in order to test for temporal robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive" and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General) or Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of +/- one token (Subtask D2).

3   Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4; Pontiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA), which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined on different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis task using BERT-based language models (Hoang et al., 2019; Xu et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi et al., 2020; Tao and Fang, 2020).

In comparison, little research deals with German ABSA. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German Document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding the GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models: One is an ensemble of ELMo (Peters et al., 2018), GloVe and a bi-attentive classification network (BCN; McCann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformer-based Sentiment Analysis model (TSA), reaching a score of 0.789 on the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test set. Schmitt et al. (2018) advanced the SOTA for Subtask C by employing biLSTMs and CNNs to carry out end-to-end Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data set for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.

4   Materials and Methods

Data   The GermEval17 data is freely available in .xml- and .tsv-format (the data sets in both formats can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/). Each data split (train, validation, test) in .tsv-format contains the following variables:

• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)
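The aspect field packs an aspect category, a sub-category and a polarity into a single string. A minimal parser for this annotation can be sketched as follows (the category#sub-category:polarity layout follows the example above; treating whitespace as the separator between multiple aspects is our assumption):

```python
def parse_aspects(cell):
    """Split a .tsv aspect cell like 'Ticketkauf#Haupt:negative'
    into (category, polarity) pairs."""
    pairs = []
    for item in cell.split():                  # assumed separator between aspects
        label, _, polarity = item.rpartition(":")
        category = label.split("#")[0]         # drop the sub-category part
        pairs.append((category, polarity))
    return pairs
```

For instance, parse_aspects("Ticketkauf#Haupt:negative") yields [("Ticketkauf", "negative")], i.e. the label granularity needed for Subtask C1 (category only) and, together with the polarity, for Subtask C2.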
For documents which are annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Notably, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml-format, which additionally holds the information on the starting and ending sequence positions of the target phrases.

The data set comprises ∼26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for validation and a synchronic test data set. Table 1 displays the number of documents for each split.

             train     dev   testsyn   testdia
            19,432   2,369     2,566     1,842

Table 1: Number of documents per split of the data set.

While roughly 74% of the documents form the train set, the development split and the synchronic test split contain around 9% and around 10%, respectively. The remaining 7% of the data belong to the diachronic set (cf. Tab. 1). Table 2 shows the relevance distribution per data split. This reveals a clearly skewed label distribution, since the relevant documents represent the clear majority with over 80% in each split.

   Relevance    train     dev   testsyn   testdia
   true        16,201   1,931     2,095     1,547
   false        3,231     438       471       295

Table 2: Relevance distribution for Subtask A.

The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25-31% to the negative and only 4-6% to the positive class.

   Sentiment    train     dev   testsyn   testdia
   negative     5,045     589       780       497
   neutral     13,208   1,632     1,681     1,237
   positive     1,179     148       105       108

Table 3: Sentiment distribution for Subtask B.

Table 4 holds the distribution of the 20 different aspect categories assigned to the documents2. It shows the number of documents containing certain categories without differentiating between how often a category appears within a given document. The relative distribution of the aspect categories is similar between the splits. On average, there are ∼1.12 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution (Wojatzki et al., 2017).

2 Multiple annotations per document are possible; for a detailed category description see https://sites.google.com/view/germeval2017-absa/data.

   Category                        train     dev   testsyn   testdia
   Allgemein                      11,454   1,391     1,398     1,024
   Zugfahrt                        1,687     177       241       184
   Sonstige Unregelmäßigkeiten     1,277     139       224       164
   Atmosphäre                        990     128       148        53
   Ticketkauf                        540      64        95        48
   Service und Kundenbetreuung       447      42        63        27
   Sicherheit                        405      59        84        42
   Informationen                     306      28        58        35
   Connectivity                      250      22        36        73
   Auslastung und Platzangebot       231      25        35        20
   DB App und Website                175      20        28        18
   Komfort und Ausstattung           125      18        24        11
   Barrierefreiheit                   53      14         9         2
   Image                              42       6         0         3
   Toiletten                          41       5         7         4
   Gastronomisches Angebot            38       2         3         3
   Reisen mit Kindern                 35       3         7         2
   Design                             29       3         4         2
   Gepäck                             12       2         2         6
   QR-Code                             0       1         1         0
   total                          18,137   2,149     2,467     1,721
   # documents with aspects       16,200   1,930     2,095     1,547
   ∅ different aspects/document     1.12    1.11      1.18      1.11

Table 4: Aspect category distribution for Subtask C. Multiple mentions of the same aspect category in a document are only considered once.

Pre-trained architectures   BERT was initially introduced in a base (110M parameters) and a large (340M) variant; Sanh et al. (2019) proposed an even smaller BERT model (DistilBERT, 66M parameters) trained via knowledge distillation
 Model variant                                  Pre-training corpus                      Properties
 bert-base-german-cased                         12GB of German text (deepset.ai)         L=12, H=768, A=12, 110M parameters
 bert-base-german-dbmdz-cased                   16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
 bert-base-german-dbmdz-uncased                 16GB of German text (dbmdz)              L=12, H=768, A=12, 110M parameters
 bert-base-multilingual-cased                   Largest Wikipedias (top 104 languages)   L=12, H=768, A=12, 179M parameters
 bert-base-multilingual-uncased                 Largest Wikipedias (top 102 languages)   L=12, H=768, A=12, 168M parameters
 distilbert-base-german-cased                   16GB of German text (dbmdz)              L=6, H=768, A=12, 66M parameters
 distilbert-base-multilingual-cased             Largest Wikipedias (top 104 languages)   L=6, H=768, A=12, 134M parameters

Table 5: Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For
all available models, see: https://huggingface.co/transformers/pretrained_models.html.
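The parameter counts in Table 5 can be roughly reproduced from the listed L and H values plus the vocabulary size; the multilingual variants are larger almost entirely because of their bigger vocabularies. A back-of-the-envelope sketch (biases, LayerNorm and pooler weights are ignored, and the vocabulary sizes of ∼31k for the German and ∼120k for the multilingual cased models are approximate):

```python
def approx_bert_params(num_layers, hidden, vocab, max_pos=512, type_vocab=2):
    """Rough BERT-style encoder parameter count (weight matrices only)."""
    # token, position and segment embeddings
    embeddings = (vocab + max_pos + type_vocab) * hidden
    # per layer: self-attention (Q, K, V and output projections)
    attention = 4 * hidden * hidden
    # per layer: feed-forward block with inner size 4*hidden
    feed_forward = 2 * 4 * hidden * hidden
    return embeddings + num_layers * (attention + feed_forward)

# German BERT-base (~110M), German DistilBERT (~66M), multilingual BERT (~179M)
german_base = approx_bert_params(12, 768, 31_000)
german_distil = approx_bert_params(6, 768, 31_000)
multi_base = approx_bert_params(12, 768, 119_547)
```

The estimate lands within a few percent of the counts in Table 5 and makes explicit that DistilBERT's savings come from halving the layer count while the embedding matrix stays untouched, which is also why the multilingual DistilBERT remains comparatively large.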


(Hinton et al., 2015). The exact model specifications regarding the number of layers (L), the number of attention heads (A) and the hidden size (H) for the available German BERT models are depicted in the last column of Table 5. Both architectures were pre-trained on the Masked Language Modeling task as well as on the auxiliary Next Sentence Prediction task (BERT only) and can subsequently be fine-tuned on a task at hand.

We include three German (Distil)BERT models pre-trained by DBMDZ3 and one by Deepset.ai4. The latter one is pre-trained using German Wikipedia (6GB raw text files), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020) and news articles (3.6GB). DBMDZ combined Wikipedia, EU Bookshop (Skadiņš et al., 2014), Open Subtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019) and News Crawl (Haddow, 2018) into a corpus with a total size of 16GB and ∼2,350M tokens. Besides this, we use the three multilingual (Distil)BERT models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. every character is lower-cased) while the other five models are "cased" ones.

3 MDZ Digital Library team at the Bavarian State Library. Visit https://www.digitale-sammlungen.de for details and https://github.com/dbmdz/berts for their repository on pre-trained BERT models.
4 Visit https://deepset.ai/german-bert for details.

5   Results

For the re-evaluation, we used the latest data provided in .xml-format. Duplicates were not removed in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels5. For Subtask D, the BIO-tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B- for "Beginning" and continuing with I- for "Inside". If a token does not belong to any entity, the tag O for "Outside" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

5 "positve" in the train set was replaced with "positive", " negative" in the testdia set was replaced with "negative".

The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7. Moreover, the transformers module (version 4.0.1) and torch (version 1.7.1) were used6. The considered values for the fine-tuning hyperparameters follow the recommendations of Devlin et al. (2019):

• Batch size ∈ {16, 32},
• Adam learning rate ∈ {5e-5, 3e-5, 2e-5},
• # epochs ∈ {2, 3, 4}.

After evaluating the model performance for combinations7 of the different hyperparameters, all pre-trained architectures were fine-tuned with a learning rate of 5e-5 for four epochs, which turned out to be the most promising combination across the different models. The maximum sequence length was set to 256, which is sufficient since the evaluated data set consists of rather short texts from social media, and a batch size of 32 was chosen.

6 Source code is available on GitHub: https://github.com/ac74/reevaluating_germeval2017. The results are fully reproducible for Subtasks A, B and C. For Subtask D, reproducibility could not be ensured; the micro F1 scores fluctuate across different runs between +/-0.01 around the reported values.
7 Due to memory limitations, not every hyperparameter combination was applicable.

Other models   Eight teams officially participated in the GermEval17 shared task; five of them analyzed Subtask A, all of them Subtask B, and two each Subtasks C and D. We furthermore consider the system by Ruppert et al. (2017) in addition to the participants' models from 2017, even
though they were the organizers and did not ”offi-                               Subtask B Subtask B refers to the Document-
cially” participate. They also tackled all four sub-                             level Polarity, which is a multi-class classification
tasks. Since 2017 several other authors analyzed                                 task with three classes. Table 8 demonstrates the
(parts of) the GermEval17 subtasks using more ad-                                performances on the two test sets:
vanced models, which we also consider for compar-
                                                                                  Language model                                      testsyn   testdia
ison here. Table 6 shows which authors employed
                                                                                  Best models 2017 (testsyn : Ruppert et al., 2017)
which kinds of models to solve which task.                                        (testdia : Sayyed et al., 2017)
                                                                                                                                      0.767     0.750

 Subtask                                         A   B   C1     C2     D1   D2    bert-base-german-cased                              0.798     0.793
                                                                                  bert-base-german-dbmdz-cased                        0.799     0.785
 Models from 2017
                                                 X   X   X      X      X    X     bert-base-german-dbmdz-uncased                      0.807     0.800
 (Wojatzki et al., 2017; Ruppert et al., 2017)
                                                                                  bert-base-multilingual-cased                        0.790     0.780
 Our BERT models                                 X   X   X      X      X    X     bert-base-multilingual-uncased                      0.784     0.766
 CNN (Attia et al., 2018)                        –   X   –      –      –    –     distilbert-base-german-cased                        0.798     0.776
 CNN+FastText (Schmitt et al., 2018)             –   –   X      X      –    –     distilbert-base-multilingual-cased                  0.777     0.770
 ELMo+GloVe+BCN (Biesialska et al., 2020)        –   X   –      –      –    –
 ELMo+TSA (Biesialska et al., 2020)              –   X   –      –      –    –
                                                                                  CNN (Attia et al., 2018)                            0.755       –
 FastText (Guhr et al., 2020)                    –   X   –      –      –    –     ELMo+GloVe+BCN (Biesialska et al., 2020)            0.782       –
 bert-base-german-cased                                                           ELMo+TSA (Biesialska et al., 2020)                  0.789       –
                                                 –   X   –       –     –    –     FastText (Guhr et al., 2020)                        0.698†      –
 (Guhr et al., 2020)
                                                                                  bert-base-german-cased (Guhr et al., 2020)          0.789†      –
Table 6: An overview on all the models discussed in
this article, an ”X” in a column indicates that the archi-                       Table 8: Micro-averaged F1 scores for Subtask B on
tecture was evaluated on the respective subtask.                                 synchronic and diachronic test sets.
                                                                                 †
                                                                                   Guhr et al. (2020) created their own (balanced & un-
                                                                                 balanced) data splits, which limits comparability. We
Subtask A The Relevance Classification is a                                      compare to the performance on the unbalanced data
binary document classification task with classes                                 since it more likely resembles the original data splits.
true and false. Table 7 displays the micro F1
score obtained by each language model on each                                    All models outperform the best model from 2017
test set (best result per data set in bold).                                     by 1.0–4.0 percentage points for the synchronic,
                                                                                 and by 1.6–5.0 percentage points for the diachronic
 Language model                                              testsyn   testdia   test set. On the synchronic test set, the uncased
 Best model 2017 (Sayyed et al., 2017)                        0.903     0.906    German BERT-BASE model by dbmdz performs
 bert-base-german-cased                                       0.950     0.939    best with a score of 0.807, followed by its cased
 bert-base-german-dbmdz-cased                                 0.951     0.946
 bert-base-german-dbmdz-uncased                               0.957     0.948
                                                                                 variant with 0.799. For the diachronic test set, the
 bert-base-multilingual-cased                                 0.942     0.933    uncased German BERT-BASE model exceeds the
 bert-base-multilingual-uncased                               0.944     0.939    other models with a score of 0.800, followed by
 distilbert-base-german-cased                                 0.944     0.939
 distilbert-base-multilingual-cased                           0.941     0.932    the cased German BERT-BASE model reaching
                                                                                 a score of 0.793. The three multilingual models
Table 7: F1 scores for Subtask A on synchronic and                               perform generally worse than the German mod-
diachronic test sets.

All the models outperform the best result achieved in 2017 for both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With micro F1 scores of 0.957 and 0.948, respectively, the best-scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020) and Guhr et al. (2020) did not evaluate their models on this task.

els on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) does not even come close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) are quite competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.

Subtask C Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment Classification (Subtask C2), each being a multi-label classification task.⁸ As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 has 60 labels, since each aspect category

⁸ This leads to a change of activation function in the final layer from softmax to sigmoid, combined with a binary cross-entropy loss.
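The switch described in footnote 8, from a softmax output layer to independent sigmoid units trained with a binary cross-entropy loss, can be sketched in a few lines of plain Python (the three-label toy logits and targets below are purely illustrative and not taken from the experiments): each label receives its own probability, so several of the 20 (C1) or 60 (C2) labels can be active for one document, which a softmax layer cannot express.

```python
import math

def sigmoid(z):
    """Independent per-label probability, as used for a multi-label head."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Mutually exclusive class probabilities, as used for a single-label head."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(probs, targets):
    """Sum of per-label BCE terms; `targets` is a multi-hot vector."""
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )

# Toy logits for three aspect categories; two aspects are active at once,
# a configuration a softmax layer cannot assign high probability to both of.
logits = [2.0, -1.5, 1.0]
targets = [1, 0, 1]                        # multi-hot: aspects 0 and 2 present

probs = [sigmoid(z) for z in logits]       # each in (0, 1), independent
loss = binary_cross_entropy(probs, targets)
predicted = [int(p > 0.5) for p in probs]  # per-label threshold at 0.5
```

With a per-label threshold of 0.5, this sketch recovers the multi-hot target exactly, while the softmax probabilities over the same logits are forced to sum to one.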
can be combined with each of the three sentiments. Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table 9:

Language model                          testsyn   testdia
Best model 2017 (Ruppert et al., 2017)   0.537     0.556
bert-base-german-cased                   0.756     0.762
bert-base-german-dbmdz-cased             0.756     0.781
bert-base-german-dbmdz-uncased           0.761     0.791
bert-base-multilingual-cased             0.706     0.734
bert-base-multilingual-uncased           0.723     0.752
distilbert-base-german-cased             0.738     0.768
distilbert-base-multilingual-cased       0.716     0.744
CNN+FastText (Schmitt et al., 2018)      0.523     0.557

Table 9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 15 in Appendix A.

All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only ones of the other authors to evaluate their models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. Once more, the multilingual models exhibit the poorest performance amongst the evaluated models. Next, Table 10 shows the results for Subtask C2:

Language model                          testsyn   testdia
Best model 2017 (Ruppert et al., 2017)   0.396     0.424
bert-base-german-cased                   0.634     0.663
bert-base-german-dbmdz-cased             0.628     0.663
bert-base-german-dbmdz-uncased           0.655     0.689
bert-base-multilingual-cased             0.571     0.634
bert-base-multilingual-uncased           0.553     0.631
distilbert-base-german-cased             0.629     0.663
distilbert-base-multilingual-cased       0.589     0.642
CNN+FastText (Schmitt et al., 2018)      0.423     0.465

Table 10: Micro-averaged F1 scores for Subtask C2 (Aspect+Sentiment) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 16 in Appendix A.

Here, the pre-trained models surpass the best model from 2017 by 15.7–25.9 percentage points and 20.7–26.5 percentage points, respectively, for the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model, reaching scores of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both Subtask C1 and Subtask C2, all the displayed models perform better on the diachronic than on the synchronic test data.

Subtask D Subtask D refers to Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between an exact match (Subtask D1) and an overlapping match (Subtask D2), the latter tolerating a deviation of +/- one token. Here, "entities" are identified by their BIO tags. It is noteworthy that there are fewer entities here than for Subtask C, since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, fewer documents are available for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.

In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to those using a CRF layer for Subtask D1. The best performing model is the uncased German BERT-BASE model by dbmdz with a CRF layer on both test sets, with scores of 0.515 and 0.518, respectively. Overall, the results from 2017 are outperformed by 11.8–28.6 percentage points

              Language model                           testsyn   testdia
              Best model 2017 (Ruppert et al., 2017)    0.229     0.301
without CRF   bert-base-german-cased                    0.460     0.455
              bert-base-german-dbmdz-cased              0.480     0.466
              bert-base-german-dbmdz-uncased            0.492     0.501
              bert-base-multilingual-cased              0.447     0.457
              bert-base-multilingual-uncased            0.429     0.404
              distilbert-base-german-cased              0.347     0.357
              distilbert-base-multilingual-cased        0.430     0.419
with CRF      bert-base-german-cased                    0.446     0.443
              bert-base-german-dbmdz-cased              0.466     0.444
              bert-base-german-dbmdz-uncased            0.515     0.518
              bert-base-multilingual-cased              0.472     0.466
              bert-base-multilingual-uncased            0.477     0.452
              distilbert-base-german-cased              0.424     0.403
              distilbert-base-multilingual-cased        0.436     0.418

Table 11: Entity-level micro-averaged F1 scores for Subtask D1 (exact match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 17 in Appendix B.
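Since entities in Subtask D are identified by their BIO tags, the difference between an exact match (D1) and an overlapping match with a tolerance of one token (D2) can be sketched in plain Python (a minimal illustration; the aspect label Zugfahrt and the helper names are our own, not taken from the shared-task tooling):

```python
def bio_spans(tags):
    """Extract (label, start, end) entity spans from a BIO tag sequence;
    `end` is exclusive. An I- tag with a new label closes the open span."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != label
        ):
            if start is not None:
                spans.append((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans

def exact_match(gold, pred):
    """Subtask D1: label and both boundaries must agree."""
    return gold == pred

def overlapping_match(gold, pred, tolerance=1):
    """Subtask D2: same label, boundaries may deviate by +/- `tolerance` tokens."""
    g_label, g_start, g_end = gold
    p_label, p_start, p_end = pred
    return (
        g_label == p_label
        and abs(g_start - p_start) <= tolerance
        and abs(g_end - p_end) <= tolerance
    )

# Gold entity covers tokens 1-2; the prediction only covers token 2.
gold = bio_spans(["O", "B-Zugfahrt", "I-Zugfahrt", "O"])
pred = bio_spans(["O", "O", "B-Zugfahrt", "O"])
```

On this example the predicted span fails the exact match but passes the overlapping match, since its start boundary deviates by exactly one token.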
              Language model                                 testsyn   testdia
              Best models 2017 (testsyn: Lee et al., 2017;
              testdia: Ruppert et al., 2017)                  0.348     0.365
without CRF   bert-base-german-cased                          0.471     0.474
              bert-base-german-dbmdz-cased                    0.491     0.488
              bert-base-german-dbmdz-uncased                  0.501     0.518
              bert-base-multilingual-cased                    0.457     0.473
              bert-base-multilingual-uncased                  0.435     0.417
              distilbert-base-german-cased                    0.397     0.407
              distilbert-base-multilingual-cased              0.433     0.429
with CRF      bert-base-german-cased                          0.455     0.457
              bert-base-german-dbmdz-cased                    0.476     0.469
              bert-base-german-dbmdz-uncased                  0.523     0.533
              bert-base-multilingual-cased                    0.476     0.474
              bert-base-multilingual-uncased                  0.484     0.464
              distilbert-base-german-cased                    0.433     0.423
              distilbert-base-multilingual-cased              0.442     0.427

Table 12: Entity-level micro-averaged F1 scores for Subtask D2 (overlapping match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 18 in Appendix B.

on the synchronic test set and by 5.6–21.7 percentage points on the diachronic test set.

For the overlapping match (cf. Tab. 12), the best systems from 2017 are outperformed by 4.9–17.5 percentage points on the synchronic and by 4.2–16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model by dbmdz with a CRF layer performs best, with a micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.

Main Takeaways For the first two subtasks, which are rather simple binary and multi-class classification tasks, the pre-trained models are able to improve a little upon the already quite decent performance values from 2017. Further, we do not see large differences between the different pre-trained models. Nevertheless, the small differences we can observe already point in the same direction as what can be observed for the primary ABSA tasks of interest, C1 and C2:

• Uncased models tend to outperform their cased counterparts among the monolingual models; for the multilingual models, this cannot be clearly confirmed.
• Monolingual models outperform the multilingual ones.
• There are no large performance differences between the two cased BERT models by DBMDZ and Deepset.ai, which suggests only a minor influence of the different corpora the models were pre-trained on.
• The monolingual DistilBERT model is quite competitive: it consistently outperforms its multilingual counterpart as well as the multilingual BERT models on Subtasks A–C and is at least competitive with the monolingual BERT models.

For D1 and D2, we observe a rather clear dominance of the uncased monolingual model, which is not observable to this extent for the other tasks.

6 Discussion

After having observed a notable performance increase for German ABSA when employing pre-trained models, the next step is to compare these observations to what was reported for the English language. Therefore, we examine the temporal development of the SOTA performance on the most widely adopted data sets for English ABSA, originating from the SemEval Shared Tasks (Pontiki et al., 2014, 2015, 2016). When looking at public leaderboards, e.g. https://paperswithcode.com/, Subtask SB2 (aspect term polarity) from SemEval-2014 is the task that attracts most of the researchers. This task is related, but not perfectly similar, to Subtask C2, since in this case the aspect term is always a word that has to be present in the given review. For this task, a comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).

            Language model                       Laptops   Restaurants
pre-BERT    Best model SemEval-2014
            (Pontiki et al., 2014)               0.7048      0.8095
            MemNet (Tang et al., 2016)           0.7221      0.8095
            HAPN (Li et al., 2018)               0.7727      0.8223
BERT-based  BERT-SPC (Song et al., 2019)         0.7899      0.8446
            BERT-ADA (Rietzler et al., 2020)     0.8023      0.8789
            LCF-ATEPC (Yang et al., 2019)        0.8229      0.9018

Table 13: Development of the SOTA Accuracy for the aspect term polarity task (SemEval-2014; Pontiki et al., 2014). Selected models were picked from https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval.

Clearly more related, but unfortunately also less used, are the subtasks SB3 (aspect category extraction; comparable to Subtask C1) and SB4 (aspect category polarity; comparable to Subtask C2)
from SemEval-2014.⁹ Limitations with respect to comparability arise from the different numbers of categories: Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17), which leads to an easier classification problem and is reflected in the already quite high scores of the 2014 baselines. Table 14 shows the performance of the best model from 2014 as well as the performance of subsequent (pre-BERT and BERT-based) models for subtasks SB3 and SB4.

⁹ Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016, the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA).

                                                 Restaurants
            Language model                       SB3       SB4
pre-BERT    Best model SemEval-2014
            (Pontiki et al., 2014)               0.8857    0.8292
            ATAE-LSTM (Wang et al., 2016)          —       0.840
BERT-based  BERT-pair (Sun et al., 2019)         0.9218    0.899
            CG-BERT (Wu and Ong, 2020)           0.9162†   0.901†
            QACG-BERT (Wu and Ong, 2020)         0.9264    0.904†

Table 14: Development of the SOTA F1 score (SB3) and Accuracy (SB4) for the aspect category extraction/polarity task (SemEval-2014; Pontiki et al., 2014). † Additional auxiliary sentences were used.

In contrast to what can be observed for SB2, the performance increase on SB4 caused by the introduction of BERT is rather striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points. Still, when taking into account the potential room for improvement (0.16 for SB4 vs. 0.60 for C2), the improvements relative to that potential (0.06/0.16 for SB4 vs. 0.23/0.60 for C2) are quite similar.

Another issue is that (partly) highly specialized (T)ABSA architectures were used for improving the SOTA on the SemEval-2014 tasks, while we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data, which should be an objective for future research.

7 Conclusion

As one would have hoped, all the state-of-the-art pre-trained language models clearly outperform all the models from 2017, proving the power of transfer learning also for German ABSA. Throughout the presented analyses, the models always achieve similar results on the synchronic and the diachronic test sets, indicating temporal robustness of the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models for a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator of polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy, since they were retrieved from social media. This means that the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, some participating teams already pursued elaborate pre-processing of the text data in 2017 in order to eliminate some of this noise (Hövelmann and Friedrich, 2017; Sayyed et al., 2017; Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings of Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.
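The effect of such noise-reducing pre-processing can be sketched in plain Python (a minimal illustration with a single hand-written rule in the spirit of the "S-Bahn" example; the normalize helper is our own, not the actual pipeline of Hövelmann and Friedrich (2017)):

```python
import re

def normalize(text):
    """Lower-case the text and collapse spelling variants such as
    "S-Bahn" / "S Bahn" into a single token "sbahn". A real pipeline
    would apply many such rules; this one suffices for the example."""
    text = text.lower()
    # unify hyphenated/spaced compounds like "s-bahn", "s bahn" -> "sbahn"
    text = re.sub(r"\bs[- ]bahn\b", "sbahn", text)
    return text

print(normalize("Die S-Bahn war verspätet, die S Bahn kam nicht!"))
# -> die sbahn war verspätet, die sbahn kam nicht!
```

After normalization, both spelling variants map to the same token, so a model no longer has to learn them separately; this is the regularization-like effect conjectured above.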
References

Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. 2018. Multilingual multi-class sentiment classification using convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Valentin Barriere and Alexandra Balahur. 2020. Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 266–271, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Salima Behdenna, Fatiha Barigou, and Ghalem Belalem. 2018. Document level sentiment analysis: A survey. EAI Endorsed Transactions on Context-aware Systems and Applications, 4:154339.

Katarzyna Biesialska, Magdalena Biesialska, and Henryk Rybinski. 2020. Sentiment analysis with contextual embeddings and self-attention. arXiv preprint arXiv:2003.05574.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

M. Esplà-Gomis, M. Forcada, Gema Ramírez-Sánchez, and Hieu T. Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In MT Summit.

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans-Joachim Böhme. 2020. Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1627–1632, Marseille, France.

Barry Haddow. 2018. News Crawl Corpus.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-based sentiment analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 187–196, Turku, Finland. Linköping University Electronic Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Leonard Hövelmann and Christoph M. Friedrich. 2017. Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2020. Adversarial training for aspect-based sentiment analysis with BERT.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Lishuang Li, Yang Liu, and AnQiao Zhou. 2018. Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 181–189, Brussels, Belgium. Association for Computational Linguistics.

Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Stephen Mayhew, Tatiana Tsygankova, and Dan Roth. 2019. ner and pos when nothing is capitalized. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6256–6261, Hong Kong, China. Association for Computational Linguistics.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Pruthwik Mishra, Vandan Mujadia, and Soujanya Lanka. 2017. GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Malte Ostendorff, Till Blume, and Saskia Ostendorff. 2020. Towards an Open Platform for Legal Information. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20, pages 385–388, New York, NY, USA. Association for Computing Machinery.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4933–4941, Marseille, France. European Language Resources Association.

Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is Multilingual BERT Fluent in Language Generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.

Eugen Ruppert, Abhishek Kumar, and Chris Biemann. 2017. LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Zeeshan Ali Sayyed, Daniel Dakota, and Sandra Kübler. 2017. IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval 2017. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Martin Schmitt, Simon Steinheber, Konrad Schreiber, and Benjamin Roth. 2018. Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1109–1114, Brussels, Belgium. Association for Computational Linguistics.

Uladzimir Sidarenka. 2017. PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM
                                                           Network. In Proceedings of the GermEval 2017
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou,       – Shared Task on Aspect-based Sentiment in Social
 Suresh Manandhar, and Ion Androutsopoulos. 2015.          Media Customer Feedback, Berlin, Germany.
 SemEval-2015 task 12: Aspect based sentiment
 analysis. In Proceedings of the 9th International       Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and
 Workshop on Semantic Evaluation (SemEval 2015),           Daiga Deksne. 2014. Billions of Parallel Words for
 pages 486–495, Denver, Colorado. Association for          Free: Building and Using the EU Bookshop Cor-
 Computational Linguistics.                                pus. In Proceedings of the 9th International Confer-
                                                           ence on Language Resources and Evaluation (LREC
Maria Pontiki, Dimitris Galanis, John Pavlopoulos,         2014), pages 1850–1855, Reykjavik, Iceland. Euro-
 Harris Papageorgiou, Ion Androutsopoulos, and             pean Language Resources Association (ELRA).
Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, and Yanghui Rao. 2019. Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.

Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.

Jie Tao and Xing Fang. 2020. Toward multi-label sentiment analysis: a transfer learning based approach. Journal of Big Data, 7:1.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, USA.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 1–12, Berlin, Germany.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhengxuan Wu and Desmond C. Ong. 2020. Context-guided BERT for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis.

Heng Yang, Biqing Zeng, JianHao Yang, Youwei Song, and Ruyang Xu. 2019. A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.

Appendix

A   Detailed results (per category) for Subtask C

Because of the high number of classes and their skewed distribution, it is worth taking a more detailed look at the model performance for this subtask on the category level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances of each category, which in this case are also displayed in Table 4. Seven categories are summarized in Rest because they have an F1 score of 0 on both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

                                       testsyn            testdia
 Aspect Category                    Score  Support     Score  Support
 Allgemein                          0.854    1,398     0.877    1,024
 Sonstige Unregelmäßigkeiten        0.782      224     0.785      164
 Connectivity                       0.750       36     0.838       73
 Zugfahrt                           0.678      241     0.687      184
 Auslastung und Platzangebot        0.645       35     0.667       20
 Sicherheit                         0.602       84     0.639       42
 Atmosphäre                         0.600      148     0.532       53
 Barrierefreiheit                   0.500        9         0        2
 Ticketkauf                         0.481       95     0.506       48
 Service und Kundenbetreuung        0.476       63     0.417       27
 DB App und Website                 0.455       28     0.563       18
 Informationen                      0.329       58     0.464       35
 Komfort und Ausstattung            0.286       24         0       11
 Rest                                   0       24         0       20

Table 15: Micro-averaged F1 scores and support by aspect category (Subtask C1). Seven categories are summarized in Rest, each with a score of 0.

The F1 scores for Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities) and Connectivity are the highest. 13 categories, mostly the same on both test sets, show a positive F1 score on at least one of the two test sets. For the categories subsumed under Rest, the model was not able to learn how to identify them correctly.

Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as the majority class: over 50% of the true labels belong to it. Table 16 shows that only 12 out of 60 labels can be detected by the model.

                                           testsyn            testdia
 Aspect+Sentiment Category              Score  Support     Score  Support
 Allgemein:neutral                      0.804    1,108     0.832      913
 Sonstige Unregelmäßigkeiten:negative   0.782      221     0.793      159
 Zugfahrt:negative                      0.645      197     0.725      149
 Sicherheit:negative                    0.640       78     0.585       39
 Allgemein:negative                     0.582      258     0.333       80
 Atmosphäre:negative                    0.569      126     0.447       39
 Connectivity:negative                  0.400       20     0.291       46
 Ticketkauf:negative                    0.364       42     0.298       34
 Auslastung und Platzangebot:negative   0.350       31     0.211       17
 Allgemein:positive                     0.214       41     0.690       33
 Zugfahrt:positive                      0.154       34         0       34
 Service und Kundenbetreuung:negative   0.146       36     0.174       21
 Rest                                       0      343         0      180

Table 16: Micro-averaged F1 scores and support by Aspect+Sentiment category (Subtask C2). 48 categories are summarized in Rest, each with a score of 0.

All the aspect categories displayed in Table 16 also appear in Table 15, and most of them carry negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that the 48 categories in Rest could not be identified due to data sparsity. Keeping this in mind, the model nevertheless achieves a relatively high overall performance on both Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly due to the high scores of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It might be interesting to see whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein and Allgemein:neutral, or documents with neutral sentiment, since these are usually less interesting for a company. We observe a large drop in the overall F1 score, which we attribute to the absence of the strong majority class and the resulting loss of data. The classification of some individual categories could indeed be improved, but the rare categories could still not be identified by the model.

B   Detailed results (per category) for Subtask D

As for Subtask C, the results of the best model are investigated in more detail. Table 17 gives the detailed classification report of the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed. The table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

                                          testsyn            testdia
 Category                              Score  Support     Score  Support
 Zugfahrt:negative                     0.702      622     0.729      495
 Sonstige Unregelmäßigkeiten:negative  0.681      693     0.581      484
 Sicherheit:negative                   0.604      337     0.457      122
 Connectivity:negative                 0.598       56     0.620      109
 Barrierefreiheit:negative             0.595       14         0        3
 Auslastung und Platzangebot:negative  0.579       66     0.447       31
 Connectivity:positive                 0.571       26     0.555       60
 Allgemein:negative                    0.545      807     0.343      139
 Atmosphäre:negative                   0.500      403     0.337      164
 Ticketkauf:negative                   0.383       96     0.583       74
 Ticketkauf:positive                   0.368       59         0       13
 Komfort und Ausstattung:negative      0.357       24         0       16
 Atmosphäre:neutral                    0.348       40     0.111       14
 Service und Kundenbetreuung:negative  0.323       74     0.286       31
 Informationen:negative                0.301       68     0.505       46
 Zugfahrt:positive                     0.276       62     0.343       83
 DB App und Website:negative           0.232       39     0.375       33
 DB App und Website:neutral            0.188       23         0       11
 Sonstige Unregelmäßigkeiten:neutral   0.179       13     0.222        2
 Allgemein:positive                    0.157       86     0.586       92
 Service und Kundenbetreuung:positive  0.115       23         0        5
 Atmosphäre:positive                   0.105       26         0       15
 Ticketkauf:neutral                    0.040      144     0.222       25
 Connectivity:neutral                      0       11     0.211       15
 Toiletten:negative                        0       15     0.160       23
 Rest                                      0      355         0      115

Table 17: Micro-averaged F1 scores and support by Aspect+Sentiment entity with exact match (Subtask D1). 35 categories are summarized in Rest, each with a score of 0.

For Subtask D1, the model returns a positive score for 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative is classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative on the synchronic test set and by Connectivity:negative and Allgemein:positive on the diachronic set. Visibly, the scores differ more between the two test sets here than in the classification report of the previous task.

The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories
                                             testsyn            testdia
 Category                                Score Support     Score Support
 Zugfahrt:negative                       0.708       622   0.739        495
 Sonstige Unregelmäßigkeiten:negative   0.697       693   0.617        484
 Sicherheit:negative                     0.607       337   0.475        122
 Connectivity:negative                   0.598        56   0.620        109
 Barrierefreiheit:negative               0.595        14       0          3
 Auslastung und Platzangebot:negative    0.579        66   0.447         31
 Connectivity:positive                   0.571        26   0.555         60
 Allgemein:negative                      0.561       807   0.363        139
 Atmosphäre:negative                    0.505       403   0.358        164
 Ticketkauf:negative                     0.383        96   0.583         74
 Ticketkauf:positive                     0.368        59       0         13
 Komfort und Ausstattung:negative        0.357        24       0         16
 Atmosphäre:neutral                     0.348        40   0.111         14
 Service und Kundenbetreuung:negative    0.323        74   0.286         31
 Informationen:negative                  0.301        68   0.505         46
 Zugfahrt:positive                       0.276        62   0.343         83
 DB App und Website:negative             0.261        39   0.406         33
 DB App und Website:neutral              0.188        23       0         11
 Sonstige Unregelmäßigkeiten:neutral    0.179        13   0.222          2
 Allgemein:positive                      0.157        86   0.586         92
 Service und Kundenbetreuung:positive    0.115        23       0          5
 Atmosphäre:positive                    0.105        26       0         15
 Ticketkauf:neutral                      0.040       144   0.222         25
 Connectivity:neutral                        0        11   0.211         15
 Toiletten:negative                          0        15   0.160         23
 Rest                                        0       355       0        112


Table 18: Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). 35 categories are summarized in Rest, each with a score of 0.


than for the exact match. The third-best score on the diachronic test data is now achieved by Sonstige Unregelmäßigkeiten:negative. Apart from this, the top three categories per test set remain the same.
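The difference between the two matching regimes can be sketched as follows. This is a minimal illustration under our own span representation ((start, end, label) triples with exclusive end offsets); the helper functions and offsets are illustrative, not the official GermEval 2017 scorer:

```python
# Exact match (Subtask D1) requires identical span boundaries and an
# identical Aspect+Sentiment label; overlapping match (Subtask D2) only
# requires the character ranges to intersect while the labels agree.

def exact_match(pred, gold):
    # Spans are (start, end, label) triples with exclusive ends.
    return pred == gold

def overlap_match(pred, gold):
    (ps, pe, pl), (gs, ge, gl) = pred, gold
    return pl == gl and ps < ge and gs < pe

gold = (10, 25, "Zugfahrt:negative")
pred = (10, 20, "Zugfahrt:negative")  # right label, slightly short span
assert not exact_match(pred, gold)    # missed under exact match (D1)
assert overlap_match(pred, gold)      # counted under overlapping match (D2)
```

Since every exact hit is also an overlapping hit, the D2 scores can only be equal to or higher than the D1 scores, which matches the pattern in Tables 17 and 18.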
   Apart from the fact that this is a different kind of task than before, one can notice that even though the overall micro F1 scores are lower for Subtask D than for Subtask C, the model manages to correctly identify a larger variety of categories, i.e. it achieves a positive score for more of them. This is probably because the data for Subtask D are more balanced than for Subtask C2, resulting in a lower overall score but mostly higher scores per category.
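The dependence of the micro-averaged F1 score on the majority class, which drives the differences discussed above, can be illustrated with a small sketch. The label counts below are invented for illustration, not taken from the GermEval17 data:

```python
# Toy illustration of why micro-averaged F1 puts a lot of weight on
# majority classes: counts are pooled over all classes before the score
# is computed, so a dominant class can mask rare-class failures.

gold = ["Allgemein:neutral"] * 90 + ["Toiletten:negative"] * 10
pred = ["Allgemein:neutral"] * 100  # the rare class is never predicted

def micro_f1(gold, pred):
    # With exactly one label per document, pooled precision and recall
    # both equal accuracy, so micro F1 reduces to the hit rate.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    # Unweighted mean of per-class F1 scores.
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == p == c for g, p in zip(gold, pred))
        fp = sum(p == c != g for g, p in zip(gold, pred))
        fn = sum(g == c != p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

print(micro_f1(gold, pred))  # 0.9, although one of two classes is never found
print(macro_f1(gold, pred))  # ~0.47, exposing the missed rare class
```

A macro-averaged score over the per-category results in Tables 15–18 would accordingly penalize the categories summarized in Rest much more strongly than the reported micro average does.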