=Paper=
{{Paper
|id=Vol-2957/paper1
|storemode=property
|title=Re-Evaluating GermEval17 Using German Pre-Trained Language Models
|pdfUrl=https://ceur-ws.org/Vol-2957/paper1.pdf
|volume=Vol-2957
|authors=Matthias Aßenmacher,Alessandra Corvonato,Christian Heumann
|dblpUrl=https://dblp.org/rec/conf/swisstext/AssenmacherCH21
}}
==Re-Evaluating GermEval17 Using German Pre-Trained Language Models==
Matthias Aßenmacher¹  Alessandra Corvonato¹  Christian Heumann¹
¹Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris}@stat.uni-muenchen.de, alessandracorvonato@yahoo.de
Abstract

The lack of a commonly used benchmark data set (collection) such as (Super)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of German and multilingual BERT models currently available via the huggingface transformers library on four subtasks of Aspect-based Sentiment Analysis (ABSA) from the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for a similar ABSA task (Pontiki et al., 2014) and similar models (pre-BERT vs. BERT-based) for the English language, and we check whether the reported improvements correspond to those we observe for German.
1 Introduction

(Aspect-based) Sentiment Analysis is often used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014). Those models have practically been replaced by language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), achieving state-of-the-art (SOTA) performance on several benchmark tasks, mainly for the English language, and becoming a milestone in the field of NLP.

Up to now, only a few researchers have focused on sentiment-related problems for German reviews, even though language-specific evaluation is a crucial driving force for more universal model development and improvement. Unique characteristics of the different languages present different challenges to the models, which is why sole evaluation on English data is a severe shortcoming.

The first shared task on German ABSA, which provides a large annotated data set for training and evaluation, is the GermEval17 Shared Task (Wojatzki et al., 2017). The participating teams back then analyzed the data using mostly standard machine learning techniques such as SVMs, CRFs, or LSTMs. In contrast to 2017, today different pre-trained BERT models are available for a variety of languages, including German. We re-analyzed the complete GermEval17 task using seven pre-trained BERT models suitable for German provided by the huggingface transformers library (Wolf et al., 2020). We evaluate which of the models is best suited for the different GermEval17 subtasks by comparing their performance values. Furthermore, we compare our findings on whether (and how much) BERT-based models are able to improve the pre-BERT SOTA in German ABSA with the SOTA developments for English ABSA, using SemEval-2014 (Pontiki et al., 2014) as an example.

We first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 holds the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 The GermEval17 Task(s)

The GermEval17 Shared Task (Wojatzki et al., 2017) is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB), the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook, and Q&A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev), and a synchronic (test_syn) test set. A diachronic test set (test_dia) was collected in the same way from November 2016 to January 2017 in order to test for temporal robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive", and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General) or Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of +/- one token (Subtask D2).
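To make the two matching criteria concrete, the following sketch (ours, not taken from the shared task's official evaluation scripts; the helper name and the token-index representation of spans are illustrative assumptions) compares a gold and a predicted opinion target span under both criteria:

    def spans_match(gold, pred, tolerance=0):
        """Compare a gold and a predicted entity span (start, end, label) given
        as token positions. tolerance=0 mimics the exact match of Subtask D1,
        tolerance=1 the overlapping match of Subtask D2 (+/- one token)."""
        (gs, ge, gl), (ps, pe, pl) = gold, pred
        return gl == pl and abs(gs - ps) <= tolerance and abs(ge - pe) <= tolerance

    gold = (4, 6, "Zugfahrt:negative")
    print(spans_match(gold, (4, 6, "Zugfahrt:negative")))               # D1: True
    print(spans_match(gold, (3, 6, "Zugfahrt:negative"), tolerance=1))  # D2: True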
3 Related Work

Already before BERT, many researchers focused on (English) Sentiment Analysis (Behdenna et al., 2018). The most common architectures were traditional machine learning classifiers and recurrent neural networks (RNNs). SemEval14 (Task 4; Pontiki et al., 2014) was the first workshop to introduce Aspect-based Sentiment Analysis (ABSA), which was expanded within SemEval15 Task 12 (Pontiki et al., 2015) and SemEval16 Task 5 (Pontiki et al., 2016). Here, restaurant and laptop reviews were examined at different granularities. The best model at SemEval16 was an SVM/CRF architecture using GloVe embeddings (Pennington et al., 2014). However, many works recently focused on re-evaluating the SemEval Sentiment Analysis task using BERT-based language models (Hoang et al., 2019; Xu et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi et al., 2020; Tao and Fang, 2020).

In comparison, little research deals with German ABSA. For instance, Barriere and Balahur (2020) trained a multilingual BERT model for German Document-level Sentiment Analysis on the SB-10k data set (Cieliebak et al., 2017). Regarding GermEval17 Subtask B, Guhr et al. (2020) considered both FastText (Bojanowski et al., 2017) and BERT, achieving notable improvements. Biesialska et al. (2020) made use of ensemble models: one is an ensemble of ELMo (Peters et al., 2018), GloVe, and a bi-attentive classification network (BCN; McCann et al., 2017), achieving a score of 0.782, and the other one consists of ELMo and a Transformer-based Sentiment Analysis model (TSA), reaching a score of 0.789 on the synchronic test data set. Moreover, Attia et al. (2018) trained a convolutional neural network (CNN), achieving a score of 0.7545 on the synchronic test set. Schmitt et al. (2018) advanced the SOTA for Subtask C by employing biLSTMs and CNNs to carry out end-to-end Aspect-based Sentiment Analysis. The highest score was achieved using an end-to-end CNN architecture with FastText embeddings, scoring 0.523 and 0.557 on the synchronic and diachronic test data sets for Subtask C1, respectively, and 0.423 and 0.465 for Subtask C2.
4 Materials and Methods

Data. The GermEval17 data is freely available in .xml and .tsv format¹. Each data split (train, validation, test) in .tsv format contains the following variables:

• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)

¹The data sets (in both formats) can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/.

For documents which are annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Visibly, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml format, which additionally holds the information on the starting and ending sequence positions of the target phrases.
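As an illustration, the .tsv splits can be read with pandas; a minimal sketch, assuming a local file name for the downloaded train split and that the columns appear in the order listed above without a header row:

    import pandas as pd

    # Column order follows the variable list above; the file name is an assumption.
    COLUMNS = ["url", "text", "relevance", "sentiment", "aspect"]

    train = pd.read_csv("train.tsv", sep="\t", header=None,
                        names=COLUMNS, quoting=3)  # quoting=3 = csv.QUOTE_NONE
    print(train["sentiment"].value_counts())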
The data set comprises ~26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for validation, and a synchronic test data set. Table 1 displays the number of documents for each split.

train    dev    test_syn    test_dia
19,432   2,369  2,566       1,842

Table 1: Number of documents per split of the data set.

While roughly 74% of the documents form the train set, the development split and the synchronic test split contain around 9% and around 10%, respectively. The remaining 7% of the data belong to the diachronic set (cf. Tab. 1). Table 2 shows the relevance distribution per data split. This unveils a pretty skewed distribution of the labels, since the relevant documents represent the clear majority with over 80% in each split.

Relevance    train    dev    test_syn    test_dia
true         16,201   1,931  2,095       1,547
false        3,231    438    471         295

Table 2: Relevance distribution for Subtask A.

The distribution of the sentiments is depicted in Table 3, which shows that between 65% and 69% (per split) belong to the neutral class, 25–31% to the negative and only 4–6% to the positive class.

Sentiment    train    dev    test_syn    test_dia
negative     5,045    589    780         497
neutral      13,208   1,632  1,681       1,237
positive     1,179    148    105         108

Table 3: Sentiment distribution for Subtask B.

Table 4 holds the distribution of the 20 different aspect categories assigned to the documents². It shows the number of documents containing certain categories without differentiating between how often a category appears within a given document. The relative distribution of the aspect categories is similar between the splits. On average, there are ~1.12 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution (Wojatzki et al., 2017).

²Multiple annotations per document are possible; for a detailed category description see https://sites.google.com/view/germeval2017-absa/data.

Category                        train    dev    test_syn    test_dia
Allgemein                       11,454   1,391  1,398       1,024
Zugfahrt                        1,687    177    241         184
Sonstige Unregelmäßigkeiten     1,277    139    224         164
Atmosphäre                      990      128    148         53
Ticketkauf                      540      64     95          48
Service und Kundenbetreuung     447      42     63          27
Sicherheit                      405      59     84          42
Informationen                   306      28     58          35
Connectivity                    250      22     36          73
Auslastung und Platzangebot     231      25     35          20
DB App und Website              175      20     28          18
Komfort und Ausstattung         125      18     24          11
Barrierefreiheit                53       14     9           2
Image                           42       6      0           3
Toiletten                       41       5      7           4
Gastronomisches Angebot         38       2      3           3
Reisen mit Kindern              35       3      7           2
Design                          29       3      4           2
Gepäck                          12       2      2           6
QR-Code                         0        1      1           0
total                           18,137   2,149  2,467       1,721
# documents with aspects        16,200   1,930  2,095       1,547
∅ different aspects/document    1.12     1.11   1.18        1.11

Table 4: Aspect category distribution for Subtask C. Multiple mentions of the same aspect category in a document are only considered once.
Model variant                        Pre-training corpus                       Properties
bert-base-german-cased               12GB of German text (deepset.ai)          L=12, H=768, A=12, 110M parameters
bert-base-german-dbmdz-cased         16GB of German text (dbmdz)               L=12, H=768, A=12, 110M parameters
bert-base-german-dbmdz-uncased       16GB of German text (dbmdz)               L=12, H=768, A=12, 110M parameters
bert-base-multilingual-cased         Largest Wikipedias (top 104 languages)    L=12, H=768, A=12, 179M parameters
bert-base-multilingual-uncased       Largest Wikipedias (top 102 languages)    L=12, H=768, A=12, 168M parameters
distilbert-base-german-cased         16GB of German text (dbmdz)               L=6, H=768, A=12, 66M parameters
distilbert-base-multilingual-cased   Largest Wikipedias (top 104 languages)    L=6, H=768, A=12, 134M parameters

Table 5: Pre-trained models provided by huggingface transformers (version 4.0.1) suitable for German. For all available models, see: https://huggingface.co/transformers/pretrained_models.html.
Pre-trained architectures. BERT was initially introduced in a base (110M parameters) and a large (340M) variant; Sanh et al. (2019) proposed an even smaller BERT model (DistilBERT, 66M parameters) trained via knowledge distillation (Hinton et al., 2015). The exact model specifications regarding number of layers (L), number of attention heads (A), and embedding size (H) for the available German BERT models are depicted in the last column of Table 5. Both architectures were pre-trained on the Masked Language Modeling task as well as on the auxiliary Next Sentence Prediction task (only BERT) and can subsequently be fine-tuned on a task at hand.

We include three German (Distil)BERT models pre-trained by DBMDZ³ and one by Deepset.ai⁴. The latter one is pre-trained using German Wikipedia (6GB raw text files), the Open Legal Data dump (2.4GB; Ostendorff et al., 2020), and news articles (3.6GB). DBMDZ combined Wikipedia, EU Bookshop (Skadiņš et al., 2014), Open Subtitles (Lison and Tiedemann, 2016), CommonCrawl (Ortiz Suárez et al., 2019), ParaCrawl (Esplà-Gomis et al., 2019), and News Crawl (Haddow, 2018) to a corpus with a total size of 16GB and ~2,350M tokens. Besides this, we use the three multilingual (Distil)BERT models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. every character is lower-cased) while the other five models are "cased" ones.

³MDZ Digital Library team at the Bavarian State Library. Visit https://www.digitale-sammlungen.de for details and https://github.com/dbmdz/berts for their repository on pre-trained BERT models.
⁴Visit https://deepset.ai/german-bert for details.
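Each of these checkpoints can be loaded by its model id via the transformers library. A minimal sketch for one model from Table 5 (the classification head on top is freshly initialized and still has to be fine-tuned; the three labels here correspond to Subtask B):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_NAME = "bert-base-german-dbmdz-uncased"  # any model id from Table 5

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=3)  # e.g. negative/neutral/positive (Subtask B)

    inputs = tokenizer("Die Bahn hat schon wieder Verspätung.",
                       return_tensors="pt", truncation=True, max_length=256)
    logits = model(**inputs).logits  # shape: (1, 3)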
5 Results

For the re-evaluation, we used the latest data provided in .xml format. Duplicates were not removed, in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels⁵. For Subtask D, the BIO tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B- for "Beginning" and continuing with I- for "Inner". If a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.

⁵"positve" in the train set was replaced with "positive"; " negative" in the test_dia set was replaced with "negative".
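A minimal sketch of the BIO tagging step described above (the helper function is ours; it assumes whitespace tokenization and character-level (start, end, label) annotations as provided in the .xml data):

    def bio_tags(tokens, spans):
        """Assign BIO tags to whitespace-separated tokens, given character-level
        (start, end, label) target annotations."""
        tags = ["O"] * len(tokens)
        offsets, pos = [], 0
        for tok in tokens:                      # recover character offsets
            offsets.append((pos, pos + len(tok)))
            pos += len(tok) + 1                 # +1 for the separating space
        for start, end, label in spans:
            hits = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
            for j, i in enumerate(hits):
                tags[i] = ("B-" if j == 0 else "I-") + label
        return tags

    print(bio_tags("fährt nicht".split(), [(0, 11, "Zugfahrt:negative")]))
    # -> ['B-Zugfahrt:negative', 'I-Zugfahrt:negative']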
The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7. Moreover, the transformers module (version 4.0.1) and torch (version 1.7.1) were used⁶. The considered values for the hyperparameters for fine-tuning follow the recommendations of Devlin et al. (2019):

• batch size ∈ {16, 32},
• Adam learning rate ∈ {5e-5, 3e-5, 2e-5},
• number of epochs ∈ {2, 3, 4}.

After evaluating the model performance for combinations⁷ of the different hyperparameters, all pre-trained architectures were fine-tuned with a learning rate of 5e-5 for four epochs, which turned out to be the most promising combination across the different models. The maximum sequence length was set to 256, which is sufficient since the evaluated data set consists of rather short texts from social media, and a batch size of 32 was chosen.

⁶Source code is available on GitHub: https://github.com/ac74/reevaluating_germeval2017. The results are fully reproducible for Subtasks A, B and C. For Subtask D, reproducibility could not be ensured. The micro F1 scores fluctuate across different runs between +/-0.01 around the reported values.
⁷Due to memory limitations, not every hyperparameter combination was applicable.
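A condensed sketch of the fine-tuning setup described above; `model` is assumed to be loaded as shown earlier, and `train_dataset` is a placeholder for a dataset yielding dicts of tensors that include the labels:

    import torch
    from torch.utils.data import DataLoader
    from transformers import AdamW  # transformers' AdamW implementation

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=5e-5)  # chosen learning rate

    model.train()
    for epoch in range(4):                          # chosen number of epochs
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss              # labels are in the batch
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()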
Other models. Eight teams officially participated in the GermEval17 shared task; five of them analyzed Subtask A, all of them Subtask B, and two each Subtasks C and D. We furthermore consider the system by Ruppert et al. (2017) in addition to the participants' models from 2017, even though they were the organizers and did not "officially" participate. They also tackled all four subtasks. Since 2017, several other authors have analyzed (parts of) the GermEval17 subtasks using more advanced models, which we also consider for comparison here. Table 6 shows which authors employed which kinds of models to solve which task.

Subtask                                                        A    B    C1   C2   D1   D2
Models from 2017 (Wojatzki et al., 2017; Ruppert et al., 2017) X    X    X    X    X    X
Our BERT models                                                X    X    X    X    X    X
CNN (Attia et al., 2018)                                       –    X    –    –    –    –
CNN+FastText (Schmitt et al., 2018)                            –    –    X    X    –    –
ELMo+GloVe+BCN (Biesialska et al., 2020)                       –    X    –    –    –    –
ELMo+TSA (Biesialska et al., 2020)                             –    X    –    –    –    –
FastText (Guhr et al., 2020)                                   –    X    –    –    –    –
bert-base-german-cased (Guhr et al., 2020)                     –    X    –    –    –    –

Table 6: An overview of all the models discussed in this article; an "X" in a column indicates that the architecture was evaluated on the respective subtask.
Subtask A. The Relevance Classification is a binary document classification task with classes true and false. Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold).

Language model                          test_syn    test_dia
Best model 2017 (Sayyed et al., 2017)   0.903       0.906
bert-base-german-cased                  0.950       0.939
bert-base-german-dbmdz-cased            0.951       0.946
bert-base-german-dbmdz-uncased          0.957       0.948
bert-base-multilingual-cased            0.942       0.933
bert-base-multilingual-uncased          0.944       0.939
distilbert-base-german-cased            0.944       0.939
distilbert-base-multilingual-cased      0.941       0.932

Table 7: F1 scores for Subtask A on synchronic and diachronic test sets.
All the models outperform the best result achieved in 2017 on both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With micro F1 scores of 0.957 and 0.948, respectively, the best scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020), and Guhr et al. (2020) did not evaluate their models on this task.

Subtask B. Subtask B refers to the Document-level Polarity, which is a multi-class classification task with three classes. Table 8 demonstrates the performances on the two test sets:

Language model                                                             test_syn    test_dia
Best models 2017 (test_syn: Ruppert et al., 2017;
  test_dia: Sayyed et al., 2017)                                           0.767       0.750
bert-base-german-cased                                                     0.798       0.793
bert-base-german-dbmdz-cased                                               0.799       0.785
bert-base-german-dbmdz-uncased                                             0.807       0.800
bert-base-multilingual-cased                                               0.790       0.780
bert-base-multilingual-uncased                                             0.784       0.766
distilbert-base-german-cased                                               0.798       0.776
distilbert-base-multilingual-cased                                         0.777       0.770
CNN (Attia et al., 2018)                                                   0.755       –
ELMo+GloVe+BCN (Biesialska et al., 2020)                                   0.782       –
ELMo+TSA (Biesialska et al., 2020)                                         0.789       –
FastText (Guhr et al., 2020)                                               0.698†      –
bert-base-german-cased (Guhr et al., 2020)                                 0.789†      –

Table 8: Micro-averaged F1 scores for Subtask B on synchronic and diachronic test sets.
†Guhr et al. (2020) created their own (balanced & unbalanced) data splits, which limits comparability. We compare to the performance on the unbalanced data since it more likely resembles the original data splits.

All models outperform the best model from 2017 by 1.0–4.0 percentage points for the synchronic, and by 1.6–5.0 percentage points for the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models generally perform worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model (Guhr et al., 2020) does not even come close to the baseline from 2017, while the ELMo-based models (Biesialska et al., 2020) are pretty competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.

Subtask C. Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment Classification (Subtask C2), each being a multi-label classification task⁸. As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 has 60 labels, since each aspect category can be combined with each of the three sentiments. Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table 9.

⁸This leads to a change of activation functions in the final layer from softmax to sigmoid + binary cross entropy loss.
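The change of the output layer described in footnote 8 amounts to treating every label as an independent binary decision. A minimal sketch with dummy tensors, using the 20 labels of Subtask C1:

    import torch

    loss_fn = torch.nn.BCEWithLogitsLoss()          # sigmoid + binary cross entropy

    logits = torch.randn(8, 20)                     # dummy batch: 8 documents, 20 labels
    targets = torch.randint(0, 2, (8, 20)).float()  # multi-hot gold labels
    loss = loss_fn(logits, targets)

    predictions = torch.sigmoid(logits) > 0.5       # independent per-label decision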
Language model                          test_syn    test_dia
Best model 2017 (Ruppert et al., 2017)  0.537       0.556
bert-base-german-cased                  0.756       0.762
bert-base-german-dbmdz-cased            0.756       0.781
bert-base-german-dbmdz-uncased          0.761       0.791
bert-base-multilingual-cased            0.706       0.734
bert-base-multilingual-uncased          0.723       0.752
distilbert-base-german-cased            0.738       0.768
distilbert-base-multilingual-cased      0.716       0.744
CNN+FastText (Schmitt et al., 2018)     0.523       0.557

Table 9: Micro-averaged F1 scores for Subtask C1 (Aspect-only) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 15 in Appendix A.

All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only ones of the other authors to evaluate their models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models. Next, Table 10 shows the results for Subtask C2:
Language model                          test_syn    test_dia
Best model 2017 (Ruppert et al., 2017)  0.396       0.424
bert-base-german-cased                  0.634       0.663
bert-base-german-dbmdz-cased            0.628       0.663
bert-base-german-dbmdz-uncased          0.655       0.689
bert-base-multilingual-cased            0.571       0.634
bert-base-multilingual-uncased          0.553       0.631
distilbert-base-german-cased            0.629       0.663
distilbert-base-multilingual-cased      0.589       0.642
CNN+FastText (Schmitt et al., 2018)     0.423       0.465

Table 10: Micro-averaged F1 scores for Subtask C2 (Aspect+Sentiment) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 16 in Appendix A.

Here, the pre-trained models surpass the best model from 2017 by 15.7–25.9 percentage points and 20.7–26.5 percentage points, respectively, for the synchronic and diachronic test sets. Again, the best model is the uncased German BERT-BASE dbmdz model, reaching scores of 0.655 and 0.689, respectively. The CNN models (Schmitt et al., 2018) are also outperformed. For both Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.

Subtask D. Subtask D refers to Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between exact (Subtask D1) and overlapping match (Subtask D2), tolerating a deviation of +/- one token. Here, "entities" are identified by their BIO tags. It is noteworthy that there are fewer entities here than for Subtask C, since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, there are fewer documents at disposal for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.

Language model                            test_syn    test_dia
Best model 2017 (Ruppert et al., 2017)    0.229       0.301
without CRF:
  bert-base-german-cased                  0.460       0.455
  bert-base-german-dbmdz-cased            0.480       0.466
  bert-base-german-dbmdz-uncased          0.492       0.501
  bert-base-multilingual-cased            0.447       0.457
  bert-base-multilingual-uncased          0.429       0.404
  distilbert-base-german-cased            0.347       0.357
  distilbert-base-multilingual-cased      0.430       0.419
with CRF:
  bert-base-german-cased                  0.446       0.443
  bert-base-german-dbmdz-cased            0.466       0.444
  bert-base-german-dbmdz-uncased          0.515       0.518
  bert-base-multilingual-cased            0.472       0.466
  bert-base-multilingual-uncased          0.477       0.452
  distilbert-base-german-cased            0.424       0.403
  distilbert-base-multilingual-cased      0.436       0.418

Table 11: Entity-level micro-averaged F1 scores for Subtask D1 (exact match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 17 in Appendix B.
In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to the same models using a CRF layer for Subtask D1. The best performing model is the uncased German BERT-BASE model by dbmdz with CRF layer on both test sets, with scores of 0.515 and 0.518, respectively. Overall, the results from 2017 are outperformed by 11.8–28.6 percentage points on the synchronic test set and by 5.6–21.7 percentage points on the diachronic test set.

Language model                            test_syn    test_dia
Best models 2017 (test_syn: Lee et al.,
  2017; test_dia: Ruppert et al., 2017)   0.348       0.365
without CRF:
  bert-base-german-cased                  0.471       0.474
  bert-base-german-dbmdz-cased            0.491       0.488
  bert-base-german-dbmdz-uncased          0.501       0.518
  bert-base-multilingual-cased            0.457       0.473
  bert-base-multilingual-uncased          0.435       0.417
  distilbert-base-german-cased            0.397       0.407
  distilbert-base-multilingual-cased      0.433       0.429
with CRF:
  bert-base-german-cased                  0.455       0.457
  bert-base-german-dbmdz-cased            0.476       0.469
  bert-base-german-dbmdz-uncased          0.523       0.533
  bert-base-multilingual-cased            0.476       0.474
  bert-base-multilingual-uncased          0.484       0.464
  distilbert-base-german-cased            0.433       0.423
  distilbert-base-multilingual-cased      0.442       0.427

Table 12: Entity-level micro-averaged F1 scores for Subtask D2 (overlapping match) on synchronic and diachronic test sets. A detailed overview of per-class performances for error analysis can be found in Table 18 in Appendix B.

For the overlapping match (cf. Tab. 12), the best system from 2017 is outperformed by 4.9–17.5 percentage points on the synchronic and by 4.2–16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model by dbmdz with CRF layer performs best, with a micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.
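The paper does not spell out the exact architecture of its CRF variant; a typical realization, sketched below under that assumption, places a linear-chain CRF on top of the token-level emission scores (the class name and the choice of the pytorch-crf package are ours):

    import torch
    from torchcrf import CRF                      # pip install pytorch-crf
    from transformers import AutoModel

    class BertCrfTagger(torch.nn.Module):
        def __init__(self, model_name, num_tags):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.classifier = torch.nn.Linear(
                self.encoder.config.hidden_size, num_tags)
            self.crf = CRF(num_tags, batch_first=True)

        def forward(self, input_ids, attention_mask, tags=None):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            emissions = self.classifier(hidden)   # per-token tag scores
            mask = attention_mask.bool()
            if tags is not None:                  # training: negative log-likelihood
                return -self.crf(emissions, tags, mask=mask)
            return self.crf.decode(emissions, mask=mask)  # inference: best tag paths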
Main Takeaways. For the first two subtasks, which are rather simple binary and multi-class classification tasks, the pre-trained models are able to improve a little upon the already pretty decent performance values from 2017. Further, we do not see large differences between the different pre-trained models. Nevertheless, the small differences we can observe already point in the same direction as what can be observed for the primary ABSA tasks of interest, C1 and C2:

• Uncased models have a tendency to outperform their cased counterparts among the monolingual models; for the multilingual models this cannot be clearly confirmed.
• Monolingual models outperform the multilingual ones.
• There are no large performance differences between the two cased BERT models by DBMDZ and Deepset.ai, which suggests only a minor influence of the different corpora on which the models were pre-trained.
• The monolingual DistilBERT model is pretty competitive: it consistently outperforms its multilingual counterpart as well as the multilingual BERT models on Subtasks A – C, and is at least competitive with the monolingual BERT models.

For D1 and D2 we observe a rather clear dominance of the uncased monolingual model, which is not observable to this extent for the other tasks.

6 Discussion

After having observed a notable performance increase for German ABSA when employing pre-trained models, the next step is to compare these observations to what was reported for the English language. Therefore, we examine the temporal development of the SOTA performance on the most widely adopted data sets for English ABSA, originating from the SemEval Shared Tasks (Pontiki et al., 2014, 2015, 2016). When looking at public leaderboards, e.g. https://paperswithcode.com/, Subtask SB2 (aspect term polarity) from SemEval-2014 is the task which attracts most of the researchers. This task is related, but not perfectly similar, to Subtask C2, since in this case the aspect term is always a word which has to be present in the given review. For this task, a comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).

Language model                                     Laptops    Restaurants
pre-BERT:
  Best model SemEval-2014 (Pontiki et al., 2014)   0.7048     0.8095
  MemNet (Tang et al., 2016)                       0.7221     0.8095
  HAPN (Li et al., 2018)                           0.7727     0.8223
BERT-based:
  BERT-SPC (Song et al., 2019)                     0.7899     0.8446
  BERT-ADA (Rietzler et al., 2020)                 0.8023     0.8789
  LCF-ATEPC (Yang et al., 2019)                    0.8229     0.9018

Table 13: Development of the SOTA Accuracy for the aspect term polarity task (SemEval-2014; Pontiki et al., 2014). Selected models were picked from https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval.
Clearly more related, but unfortunately also less used, are the subtasks SB3 (aspect category extraction; comparable to Subtask C1) and SB4 (aspect category polarity; comparable to Subtask C2) from SemEval-2014.⁹ Limitations with respect to comparability arise from the different numbers of categories: Subtask SB4 only exhibits five aspect categories (as opposed to 20 categories for GermEval17), which leads to an easier classification problem and is reflected in the already pretty high scores of the 2014 baselines. Table 14 shows the performance of the best model from 2014 as well as the performance of subsequent (pre-BERT and BERT-based) models for subtasks SB3 and SB4.

⁹Since the data sets (Restaurants and Laptops) have been further developed for SemEval-2015 and SemEval-2016, subtasks SB3 and SB4 are revisited under the names Slot 1 and Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2 from SemEval-2015 aims at OTE and thus corresponds to Subtask D from GermEval17. For SemEval-2016 the same task names as in 2015 were used, subdivided into Subtask 1 (sentence-level ABSA) and Subtask 2 (text-level ABSA).

                                                   Restaurants
Language model                                     SB3        SB4
pre-BERT:
  Best model SemEval-2014 (Pontiki et al., 2014)   0.8857     0.8292
  ATAE-LSTM (Wang et al., 2016)                    –          0.840
BERT-based:
  BERT-pair (Sun et al., 2019)                     0.9218     0.899
  CG-BERT (Wu and Ong, 2020)                       0.9162†    0.901†
  QACG-BERT (Wu and Ong, 2020)                     0.9264     0.904†

Table 14: Development of the SOTA F1 score (SB3) and Accuracy (SB4) for the aspect category extraction/polarity task (SemEval-2014; Pontiki et al., 2014).
†Additional auxiliary sentences were used.

In contrast to what can be observed for SB2, in this case the performance increase on SB4 caused by the introduction of BERT seems to be rather striking. While the ATAE-LSTM (Wang et al., 2016) only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points. So when taking into account the potential room for improvement (0.16 for SB4 vs. 0.60 for C2), the improvements relative to the potential (0.06/0.16 for SB4 vs. 0.23/0.60 for C2) are quite similar.

Another issue is that (partly) highly specialized (T)ABSA architectures were used for improving the SOTA on the SemEval-2014 tasks, while we "only" applied standard pre-trained German BERT models without any task-specific modifications or extensions. This leaves room for further improvements on this task on German data, which should be an objective for future research.

7 Conclusion

As one would have hoped, all the state-of-the-art pre-trained language models clearly outperform all the models from 2017, proving the power of transfer learning also for German ABSA. Throughout the presented analyses, the models always achieve similar results between the synchronic and the diachronic test sets, indicating temporal robustness of the models. Nonetheless, the diachronic data was collected only half a year after the main data. It would be interesting to see whether the trained models would return similar predictions on data collected a couple of years later.

The uncased German BERT-BASE model by dbmdz achieves the best results across all subtasks. Since Rönnqvist et al. (2019) showed that monolingual BERT models often outperform the multilingual models for a variety of tasks, one might have already suspected that a monolingual German BERT performs best across the performed tasks. It may not seem evident at first that an uncased language model ends up as the best performing model since, e.g. in Sentiment Analysis, capitalized letters might be an indicator of polarity. In addition, since nouns and beginnings of sentences always start with a capital letter in German, one might assume that lower-casing the whole text changes the meaning of some words and thus confuses the language model. Nevertheless, the GermEval17 documents are very noisy since they were retrieved from social media. That means that the data contains many misspellings, grammar and expression mistakes, dialect, and colloquial language. For this reason, some participating teams in 2017 already pursued an elaborate pre-processing of the text data in order to eliminate some noise (Hövelmann and Friedrich, 2017; Sayyed et al., 2017; Sidarenka, 2017). Among other things, Hövelmann and Friedrich (2017) transformed the text to lower-case and replaced, for example, "S-Bahn" and "S Bahn" with "sbahn". We suppose that in this case, lower-casing the texts improves the data quality by eliminating some of the noise and acts as a sort of regularization. As a result, the uncased models potentially generalize better than the cased models. The findings of Mayhew et al. (2019), who compare cased and uncased pre-trained models on social media data for NER, corroborate this hypothesis.
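This behavior is easy to inspect with the tokenizers themselves; an illustrative sketch (the exact subword splits depend on the respective vocabularies, but the uncased tokenizer lower-cases its input, so surface variants collapse to the same token sequence):

    from transformers import AutoTokenizer

    cased = AutoTokenizer.from_pretrained("bert-base-german-dbmdz-cased")
    uncased = AutoTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")

    print(cased.tokenize("S-Bahn"), cased.tokenize("s-bahn"))      # may differ
    print(uncased.tokenize("S-Bahn"), uncased.tokenize("s-bahn"))  # identical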
References

Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. 2018. Multilingual multi-class sentiment classification using convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Valentin Barriere and Alexandra Balahur. 2020. Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 266–271, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Salima Behdenna, Fatiha Barigou, and Ghalem Belalem. 2018. Document level sentiment analysis: A survey. EAI Endorsed Transactions on Context-aware Systems and Applications, 4:154339.

Katarzyna Biesialska, Magdalena Biesialska, and Henryk Rybinski. 2020. Sentiment analysis with contextual embeddings and self-attention. arXiv preprint arXiv:2003.05574.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 45–51, Valencia, Spain. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

M. Esplà-Gomis, M. Forcada, Gema Ramírez-Sánchez, and Hieu T. Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In MT-Summit.

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans-Joachim Böhme. 2020. Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1627–1632, Marseille, France.

Barry Haddow. 2018. News Crawl Corpus.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-based sentiment analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 187–196, Turku, Finland. Linköping University Electronic Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Leonard Hövelmann and Christoph M. Friedrich. 2017. FastText and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2020. Adversarial training for aspect-based sentiment analysis with BERT.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Lishuang Li, Yang Liu, and AnQiao Zhou. 2018. Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 181–189, Brussels, Belgium. Association for Computational Linguistics.

Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. 2019. Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 34–41, Hong Kong, China. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Stephen Mayhew, Tatiana Tsygankova, and Dan Roth. 2019. NER and POS when nothing is capitalized. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6256–6261, Hong Kong, China. Association for Computational Linguistics.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, volume 30, pages 6294–6305. Curran Associates, Inc.

Pruthwik Mishra, Vandan Mujadia, and Soujanya Lanka. 2017. GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Malte Ostendorff, Till Blume, and Saskia Ostendorff. 2020. Towards an Open Platform for Legal Information. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, JCDL '20, pages 385–388, New York, NY, USA. Association for Computing Machinery.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4933–4941, Marseille, France. European Language Resources Association.

Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is Multilingual BERT Fluent in Language Generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.

Eugen Ruppert, Abhishek Kumar, and Chris Biemann. 2017. LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Zeeshan Ali Sayyed, Daniel Dakota, and Sandra Kübler. 2017. IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval 2017. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Martin Schmitt, Simon Steinheber, Konrad Schreiber, and Benjamin Roth. 2018. Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1109–1114, Brussels, Belgium. Association for Computational Linguistics.

Uladzimir Sidarenka. 2017. PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.

Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 1850–1855, Reykjavik, Iceland. European Language Resources Association (ELRA).

Youwei Song, Jiahai Wang, Tao Jiang, Zhiyue Liu, and Yanghui Rao. 2019. Attentional encoder network for targeted sentiment classification. arXiv preprint arXiv:1902.09314.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.

Duyu Tang, Bing Qin, and Ting Liu. 2016. Aspect level sentiment classification with deep memory network. arXiv preprint arXiv:1605.08900.
Jie Tao and Xing Fang. 2020. Toward multi-label sentiment analysis: a transfer learning based approach. Journal of Big Data, 7:1.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, USA.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 1–12, Berlin, Germany.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhengxuan Wu and Desmond C. Ong. 2020. Context-guided BERT for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis.

Heng Yang, Biqing Zeng, JianHao Yang, Youwei Song, and Ruyang Xu. 2019. A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.

Appendix

A Detailed results (per category) for Subtask C

It may be interesting to have a more detailed look at the model performance for this subtask, because of the high number of classes and their skewed distribution, by investigating the performance on category level. Table 15 shows the performance of the uncased German BERT-BASE model by dbmdz per test set for Subtask C1. The support indicates the number of appearances, which in this case are also displayed in Table 4. Seven categories are summarized in Rest because they have an F1 score of 0 on both test sets, i.e. the model is not able to correctly identify any of these seven aspects appearing in the test data. The table is sorted by the score on the synchronic test set.

                                     test_syn          test_dia
Aspect Category                      Score   Support   Score   Support
Allgemein                            0.854   1,398     0.877   1,024
Sonstige Unregelmäßigkeiten          0.782   224       0.785   164
Connectivity                         0.750   36        0.838   73
Zugfahrt                             0.678   241       0.687   184
Auslastung und Platzangebot          0.645   35        0.667   20
Sicherheit                           0.602   84        0.639   42
Atmosphäre                           0.600   148       0.532   53
Barrierefreiheit                     0.500   9         0       2
Ticketkauf                           0.481   95        0.506   48
Service und Kundenbetreuung          0.476   63        0.417   27
DB App und Website                   0.455   28        0.563   18
Informationen                        0.329   58        0.464   35
Komfort und Ausstattung              0.286   24        0       11
Rest                                 0       24        0       20

Table 15: Micro-averaged F1 scores and support by aspect category (Subtask C1). Seven categories are summarized in Rest and each shows a score of 0.
The F1 scores for Allgemein (General), Sonstige Unregelmäßigkeiten (Other irregularities), and Connectivity are the highest. 13 categories, mostly similar between the two test sets, show a positive F1 score on at least one of the two test sets. For the categories subsumed under Rest, the model was not able to learn how to correctly identify these categories.

Subtask C2 exhibits a similar distribution of the true labels, with the Aspect+Sentiment category Allgemein:neutral as majority class. Over 50% of the true labels belong to this class. Table 16 shows that only 12 out of 60 labels can be detected by the model.

                                           test_syn          test_dia
Aspect+Sentiment Category                  Score   Support   Score   Support
Allgemein:neutral                          0.804   1,108     0.832   913
Sonstige Unregelmäßigkeiten:negative       0.782   221       0.793   159
Zugfahrt:negative                          0.645   197       0.725   149
Sicherheit:negative                        0.640   78        0.585   39
Allgemein:negative                         0.582   258       0.333   80
Atmosphäre:negative                        0.569   126       0.447   39
Connectivity:negative                      0.400   20        0.291   46
Ticketkauf:negative                        0.364   42        0.298   34
Auslastung und Platzangebot:negative       0.350   31        0.211   17
Allgemein:positive                         0.214   41        0.690   33
Zugfahrt:positive                          0.154   34        0       34
Service und Kundenbetreuung:negative       0.146   36        0.174   21
Rest                                       0       343       0       180

Table 16: Micro-averaged F1 scores and support by Aspect+Sentiment category (Subtask C2). 48 categories are summarized in Rest and each shows a score of 0.

All the aspect categories displayed in Table 16 are also visible in Table 15, and most of them have negative sentiment. Allgemein:neutral and Sonstige Unregelmäßigkeiten:negative show the highest scores. Again, we assume that here, 48 categories could not be identified due to data sparsity. However, having this in mind, the model achieves a relatively high overall performance for both Subtask C1 and C2 (cf. Tab. 9 and Tab. 10). This is mainly owed to the high score of the majority classes Allgemein and Allgemein:neutral, respectively, because the micro F1 score puts a lot of weight on majority classes. It might be interesting whether the classification of the rare categories can be improved by balancing the data. We experimented with removing general categories such as Allgemein, Allgemein:neutral or documents with sentiment neutral, since these are usually less interesting for a company. We observe a large drop in the overall F1 score, which is attributed to the absence of the strong majority class and the resulting data loss. Indeed, the classification for some single categories could be improved, but the rare categories could still not be identified by the model.
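To illustrate how strongly the micro average is driven by a majority class, consider a small constructed example with dummy labels (ours):

    from sklearn.metrics import f1_score

    y_true = ["Allgemein"] * 90 + ["Zugfahrt"] * 8 + ["Toiletten"] * 2
    y_pred = ["Allgemein"] * 98 + ["Zugfahrt"] * 2

    print(f1_score(y_true, y_pred, average="micro"))  # 0.90, carried by Allgemein
    print(f1_score(y_true, y_pred, average="macro"))  # ~0.32, treats classes equally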
B Detailed results (per category) for Subtask D

Similar to Subtask C, the results for the best model are investigated in more detail. Table 17 gives the detailed classification report for the uncased German BERT-BASE model with CRF layer on Subtask D1. Only entities that were correctly detected at least once are displayed. The table is sorted by the score on the synchronic test set. The classification report for Subtask D2 is displayed analogously in Table 18.

                                           test_syn          test_dia
Category                                   Score   Support   Score   Support
Zugfahrt:negative                          0.702   622       0.729   495
Sonstige Unregelmäßigkeiten:negative       0.681   693       0.581   484
Sicherheit:negative                        0.604   337       0.457   122
Connectivity:negative                      0.598   56        0.620   109
Barrierefreiheit:negative                  0.595   14        0       3
Auslastung und Platzangebot:negative       0.579   66        0.447   31
Connectivity:positive                      0.571   26        0.555   60
Allgemein:negative                         0.545   807       0.343   139
Atmosphäre:negative                        0.500   403       0.337   164
Ticketkauf:negative                        0.383   96        0.583   74
Ticketkauf:positive                        0.368   59        0       13
Komfort und Ausstattung:negative           0.357   24        0       16
Atmosphäre:neutral                         0.348   40        0.111   14
Service und Kundenbetreuung:negative       0.323   74        0.286   31
Informationen:negative                     0.301   68        0.505   46
Zugfahrt:positive                          0.276   62        0.343   83
DB App und Website:negative                0.232   39        0.375   33
DB App und Website:neutral                 0.188   23        0       11
Sonstige Unregelmäßigkeiten:neutral        0.179   13        0.222   2
Allgemein:positive                         0.157   86        0.586   92
Service und Kundenbetreuung:positive       0.115   23        0       5
Atmosphäre:positive                        0.105   26        0       15
Ticketkauf:neutral                         0.040   144       0.222   25
Connectivity:neutral                       0       11        0.211   15
Toiletten:negative                         0       15        0.160   23
Rest                                       0       355       0       115

Table 17: Micro-averaged F1 scores and support by Aspect+Sentiment entity with exact match (Subtask D1). 35 categories are summarized in Rest, each of them exhibiting a score of 0.

For Subtask D1, the model returns a positive score on 25 entity categories on at least one of the two test sets. The category Zugfahrt:negative can be classified best on both test sets, followed by Sonstige Unregelmäßigkeiten:negative and Sicherheit:negative for the synchronic test set, and by Connectivity:negative and Allgemein:positive for the diachronic set. Visibly, the scores between the two test sets differ more here than in the classification report of the previous task.
The report for the overlapping match (cf. Tab. 18) shows slightly better results on some categories than for the exact match. The third-best score on the diachronic test data is now Sonstige Unregelmäßigkeiten:negative. Besides this, the top three categories per test set remain the same.

                                           test_syn          test_dia
Category                                   Score   Support   Score   Support
Zugfahrt:negative                          0.708   622       0.739   495
Sonstige Unregelmäßigkeiten:negative       0.697   693       0.617   484
Sicherheit:negative                        0.607   337       0.475   122
Connectivity:negative                      0.598   56        0.620   109
Barrierefreiheit:negative                  0.595   14        0       3
Auslastung und Platzangebot:negative       0.579   66        0.447   31
Connectivity:positive                      0.571   26        0.555   60
Allgemein:negative                         0.561   807       0.363   139
Atmosphäre:negative                        0.505   403       0.358   164
Ticketkauf:negative                        0.383   96        0.583   74
Ticketkauf:positive                        0.368   59        0       13
Komfort und Ausstattung:negative           0.357   24        0       16
Atmosphäre:neutral                         0.348   40        0.111   14
Service und Kundenbetreuung:negative       0.323   74        0.286   31
Informationen:negative                     0.301   68        0.505   46
Zugfahrt:positive                          0.276   62        0.343   83
DB App und Website:negative                0.261   39        0.406   33
DB App und Website:neutral                 0.188   23        0       11
Sonstige Unregelmäßigkeiten:neutral        0.179   13        0.222   2
Allgemein:positive                         0.157   86        0.586   92
Service und Kundenbetreuung:positive       0.115   23        0       5
Atmosphäre:positive                        0.105   26        0       15
Ticketkauf:neutral                         0.040   144       0.222   25
Connectivity:neutral                       0       11        0.211   15
Toiletten:negative                         0       15        0.160   23
Rest                                       0       355       0       112

Table 18: Micro-averaged F1 scores and support by Aspect+Sentiment entity with overlapping match (Subtask D2). 35 categories are summarized in Rest and each shows a score of 0.

Apart from the fact that this is a different kind of task than before, one can notice that even though the overall micro F1 scores are lower for Subtask D than for Subtask C, the model manages to successfully identify a larger variety of categories, i.e. it achieves a positive score for more categories. This is probably due to the more balanced data for Subtask D than for Subtask C2, resulting in a lower overall score and mostly higher scores per category.