<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Re-Evaluating GermEval17 Using German Pre-Trained Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <aff>Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>0</volume>
      <issue>11</issue>
      <fpage>6256</fpage>
      <lpage>6261</lpage>
      <abstract>
        <p>The lack of a commonly used benchmark data set (collection) such as (Super)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP research. It concentrates a large part of the research on English and neglects the uncertainty involved in transferring conclusions drawn for the English language to other languages. We evaluate the performance of the German and multilingual BERT models currently available via the huggingface transformers library on four subtasks of Aspect-based Sentiment Analysis (ABSA) from the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for a similar ABSA task (Pontiki et al., 2014) and similar models (pre-BERT vs. BERT-based) for the English language, and we check whether the reported improvements correspond to those we observe for German.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        (Aspect-based) Sentiment Analysis is often used to transform reviews into helpful information on how a product or service of a company is perceived among the customers. Until recently, Sentiment Analysis was mainly conducted using traditional machine learning and recurrent neural networks, like LSTMs
        <xref ref-type="bibr" rid="ref15">(Hochreiter and Schmidhuber, 1997)</xref>
        or GRUs
        <xref ref-type="bibr" rid="ref6">(Cho et al., 2014)</xref>
        . These models have largely been replaced by language models relying on (parts of) the Transformer architecture, a novel framework proposed by Vaswani et al. (2017). Devlin et al. (2019) developed a Transformer-encoder-based language model called BERT (Bidirectional Encoder Representations from Transformers), achieving state-of-the-art (SOTA) performance on several benchmark tasks, mainly for the English language, and becoming a milestone in the field of NLP.
      </p>
      <p>Up to now, only a few researchers have focused on sentiment-related problems for German reviews, even though language-specific evaluation is a crucial driving force for more universal model development and improvement. The unique characteristics of different languages pose different challenges to the models, which is why evaluating solely on English data is a severe shortcoming.</p>
      <p>
        The first shared task on German ABSA, which
provides a large annotated data set for training
and evaluation, is the GermEval17 Shared Task
        <xref ref-type="bibr" rid="ref47">(Wojatzki et al., 2017)</xref>
        . The participating teams
back then analyzed the data using mostly
standard machine learning techniques such as SVMs,
CRFs, or LSTMs. In contrast to 2017, today,
different pre-trained BERT models are available for
a variety of different languages, including
German. We re-analyzed the complete GermEval17
Task using seven pre-trained BERT models
suitable for German provided by the huggingface
transformers library (Wolf et al., 2020). We
evaluate which one of the models is best suited
for the different GermEval17 subtasks by
comparing their performance values. Furthermore, we
compare our findings on whether (and how much)
BERT-based models are able to improve the
preBERT SOTA in German ABSA with the SOTA
developments for English ABSA by the example
of SemEval-2014
        <xref ref-type="bibr" rid="ref30">(Pontiki et al., 2014)</xref>
        .
      </p>
      <p>We first give an overview of the GermEval17 tasks (cf. Sec. 2) and of related work (cf. Sec. 3). Second, we present the data and the models (cf. Sec. 4), while Section 5 holds the results of our re-evaluation. Sections 6 and 7 conclude our work by stating our main findings and drawing parallels to the English language.</p>
    </sec>
    <sec id="sec-2">
      <title>The GermEval17 Task(s)</title>
      <p>
        The GermEval17 Shared Task
        <xref ref-type="bibr" rid="ref47">(Wojatzki et al.,
2017)</xref>
        is a task on analyzing aspect-based sentiments in customer reviews about "Deutsche Bahn" (DB), the German public train company. The main data was crawled from various social media platforms such as Twitter, Facebook and Q&amp;A websites from May 2015 to June 2016. The documents were manually annotated and split into a training (train), a development (dev) and a synchronic (testsyn) test set. A diachronic test set (testdia) was collected in the same way from November 2016 to January 2017 in order to test for temporal robustness. The task comprises four subtasks representing a complete classification pipeline. Subtask A is a binary Relevance Classification task which aims at identifying whether the feedback refers to DB. Subtask B aims at classifying the Document-level Polarity ("negative", "positive" and "neutral"). In Subtask C, the model has to identify all the aspect categories with associated sentiment polarities in a relevant document. This multi-label classification task was divided into Subtask C1 (Aspect-only) and Subtask C2 (Aspect+Sentiment). For this purpose, the organizers defined 20 different aspect categories, e.g. Allgemein (General) and Sonstige Unregelmäßigkeiten (Other irregularities). Finally, Subtask D refers to Opinion Target Extraction (OTE), i.e. a sequence labeling task extracting the linguistic phrase used to express an opinion. We differentiate between exact match (Subtask D1) and overlapping match, tolerating errors of ± one token (Subtask D2).
      </p>
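      <p>To make the pipeline above concrete, the following sketch traces one invented document through the four subtasks; the document and all labels are illustrative and not taken from the GermEval17 data.</p>

```python
# One invented DB feedback document traced through the GermEval17 subtasks.
# Every label below is made up for illustration only.
doc = "Der Zug fährt mal wieder nicht."

pipeline_output = {
    "A_relevance": True,                                 # Subtask A: refers to DB
    "B_document_polarity": "negative",                   # Subtask B: document level
    "C1_aspects": ["Zugfahrt"],                          # Subtask C1: aspect only
    "C2_aspect_sentiments": [("Zugfahrt", "negative")],  # Subtask C2: aspect+polarity
    "D_opinion_target": "fährt mal wieder nicht",        # Subtask D: extracted phrase
}
```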
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        Already before BERT, many researchers focused
on (English) Sentiment Analysis
        <xref ref-type="bibr" rid="ref3">(Behdenna et al.,
2018)</xref>
        . The most common architectures were
traditional machine learning classifiers and recurrent
neural networks (RNNs). SemEval14
        <xref ref-type="bibr" rid="ref30">(Task 4;
Pontiki et al., 2014)</xref>
        was the first workshop to introduce
Aspect-based Sentiment Analysis (ABSA) which
was expanded within SemEval15 Task 12
        <xref ref-type="bibr" rid="ref29">(Pontiki
et al., 2015)</xref>
        and SemEval16 Task 5
        <xref ref-type="bibr" rid="ref28">(Pontiki et al.,
2016)</xref>
        . Here, restaurant and laptop reviews were
examined on different granularities. The best model
at SemEval16 was an SVM/CRF architecture using
GloVe embeddings
        <xref ref-type="bibr" rid="ref26">(Pennington et al., 2014)</xref>
        .
        However, many recent works focused on re-evaluating the SemEval Sentiment Analysis tasks using BERT-based language models
        <xref ref-type="bibr" rid="ref14 ref17 ref17 ref2 ref20 ref25 ref31 ref4 ref40 ref42 ref48 ref49">(Hoang et al., 2019; Xu
et al., 2019; Sun et al., 2019; Li et al., 2019; Karimi
et al., 2020; Tao and Fang, 2020)</xref>
        .
      </p>
      <p>
        In comparison, little research deals with German
ABSA. For instance, Barriere and Balahur (2020)
trained a multilingual BERT model for German
Document-level Sentiment Analysis on the SB-10k
data set
        <xref ref-type="bibr" rid="ref7">(Cieliebak et al., 2017)</xref>
        . Regarding the
GermEval17 Subtask B, Guhr et al. (2020)
considered both FastText
        <xref ref-type="bibr" rid="ref5">(Bojanowski et al., 2017)</xref>
        and
BERT, achieving notable improvements. Biesialska
et al. (2020) made use of ensemble models: One is
an ensemble of ELMo
        <xref ref-type="bibr" rid="ref27">(Peters et al., 2018)</xref>
        , GloVe
and a bi-attentive classification network (BCN;
McCann et al., 2017), achieving a score of 0.782, and
the other one consists of ELMo and a
Transformerbased Sentiment Analysis model (TSA), reaching
a score of 0.789 for the synchronic test data set.
Moreover, Attia et al. (2018) trained a
convolutional neural network (CNN), achieving a score of
0.7545 on the synchronic test set. Schmitt et al.
(2018) advanced the SOTA for Subtask C by
employing biLSTMs and CNNs to carry out
end-toend Aspect-based Sentiment Analysis. The highest
score was achieved using an end-to-end CNN
architecture with FastText embeddings, scoring 0.523
and 0.557 on the synchronic and diachronic test
data set for Subtask C1, respectively, and 0.423
and 0.465 for Subtask C2.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Materials and Methods</title>
      <p>Data The GermEval17 data is freely available in .xml- and .tsv-format1. Each data split (train, validation, test) in .tsv-format contains the following variables:
• document id (URL)
• document text
• relevance label (true, false)
• document-level sentiment label (negative, neutral, positive)
• aspects with respective polarities (e.g. Ticketkauf#Haupt:negative)</p>
      <p>1The data sets (in both formats) can be obtained from http://ltdata1.informatik.uni-hamburg.de/germeval2017/.</p>
      <p>For documents annotated as irrelevant, the sentiment label is set to neutral and no aspects are available. Notably, the .tsv-formatted data does not contain the target expressions or their associated sequence positions. Consequently, Subtask D can only be conducted using the data in .xml-format, which additionally holds the information on the starting and ending sequence positions of the target phrases.</p>
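      <p>A minimal sketch of reading the .tsv splits described above; the exact column order and the space-separated "Category#Sub:polarity" aspect encoding are assumptions based on the description, not a verified parser for the official files.</p>

```python
# Hypothetical reader for GermEval17-style .tsv rows (column order assumed):
# id <TAB> text <TAB> relevance <TAB> sentiment <TAB> aspects (optional).
import csv
import io

def read_germeval_tsv(fileobj):
    """Yield one dict per document; the aspect column format is an assumption."""
    for row in csv.reader(fileobj, delimiter="\t"):
        doc_id, text, relevance, sentiment = row[:4]
        aspects = []
        if len(row) > 4 and row[4]:
            for pair in row[4].split(" "):
                category, _, polarity = pair.rpartition(":")
                aspects.append((category, polarity))
        yield {"id": doc_id, "text": text,
               "relevance": relevance == "true",
               "sentiment": sentiment, "aspects": aspects}

sample = "http://example/doc1\tZug ist ausgefallen.\ttrue\tnegative\tZugfahrt#Haupt:negative\n"
docs = list(read_germeval_tsv(io.StringIO(sample)))
```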
      <p>The data set comprises 26k documents in total, including the diachronic test set with around 1.8k examples. Further, the main data was randomly split by the organizers into a train data set for training, a development data set for validation and a synchronic test data set. Table 1 displays the number of documents for each split.</p>
      <p>[Table 1: number of documents per split. train: 19,432; dev: 2,369; testsyn: 2,566; testdia: 1,842.]</p>
      <p>
While roughly 74% of the documents form the train
set, the development split and the synchronic test
split contain around 9% and around 10%,
respectively. The remaining 7% of the data belong to
the diachronic set (cf. Tab. 1). Table 2 shows
the relevance distribution per data split. This
unveils a pretty skewed distribution of the labels since
the relevant documents represent the clear majority
with over 80% in each split.
The distribution of the sentiments is depicted in
Table 3, which shows that between 65% and 69%
(per split) belong to the neutral class, 25–31% to
the negative and only 4–6% to the positive class.</p>
      <p>Table 4 holds the distribution of the 20 different aspect categories assigned to the documents2. It shows the number of documents containing certain categories, without differentiating between how often a category appears within a given document.</p>
      <p>2Multiple annotations per document are possible; for a detailed category description see https://sites.google.com/view/germeval2017-absa/data.</p>
      <p>
        [Table 3: sentiment distribution (negative, neutral, positive) per split.]
The relative distribution of the aspect categories is similar between the splits. On average, there are 1.12 different aspects per document. Again, the label distribution is heavily skewed, with Allgemein (General) clearly representing the majority class, as it is present in 75.8% of the documents with aspects. The second most frequent category is Zugfahrt (Train ride), appearing in around 13.8% of the documents. This strong imbalance in the aspect categories leads to an almost Zipfian distribution
        <xref ref-type="bibr" rid="ref47">(Wojatzki et al., 2017)</xref>
        .
      </p>
      <p>[Table 4: number of documents per aspect category (Allgemein, Zugfahrt, Sonstige Unregelmäßigkeiten, Atmosphäre, Ticketkauf, Service und Kundenbetreuung, Sicherheit, Informationen, Connectivity, Auslastung und Platzangebot, DB App und Website, Komfort und Ausstattung, Barrierefreiheit, Image, Toiletten, Gastronomisches Angebot, Reisen mit Kindern, Design, Gepäck, QR-Code), plus the total number of documents with aspects and the average number of different aspects per document.]</p>
    </sec>
    <sec id="sec-5">
      <title>Pre-trained architectures</title>
      <p>
        BERT was initially introduced in a base (110M parameters) and a large (340M parameters) variant. Sanh et al. (2019) proposed an even smaller BERT model (DistilBERT, 60M parameters) trained via knowledge distillation
        <xref ref-type="bibr" rid="ref13">(Hinton et al., 2015)</xref>
        . The exact model specifications regarding the number of layers (L), number of attention heads (A) and embedding size (H) for the available German BERT models are depicted in the last column of Table 5. Both architectures were pre-trained on the Masked Language Modeling task as well as on the auxiliary Next Sentence Prediction task (BERT only) and can subsequently be fine-tuned on the task at hand.
      </p>
      <p>
        We include three German (Distil)BERT models
pre-trained by DBMDZ3 and one by Deepset.ai4.
The latter one is pre-trained using German
Wikipedia (6GB raw text files), the Open
Legal Data dump
        <xref ref-type="bibr" rid="ref25">(2.4GB; Ostendorff et al.,
2020)</xref>
        and news articles (3.6GB). DBMDZ combined Wikipedia, EU Bookshop (Skadiņš et al., 2014), Open Subtitles
        <xref ref-type="bibr" rid="ref21 ref41 ref46">(Lison and
Tiedemann, 2016)</xref>
        , CommonCrawl
        <xref ref-type="bibr" rid="ref24">(Ortiz Suárez et al., 2019)</xref>
        , ParaCrawl
        <xref ref-type="bibr" rid="ref9">(Esplà-Gomis et al., 2019)</xref>
        and News Crawl
        <xref ref-type="bibr" rid="ref12">(Haddow, 2018)</xref>
        into a corpus with a total size of 16GB and 2,350M tokens. Besides this, we use the three multilingual (Distil)BERT models included in the transformers module. This amounts to five BERT and two DistilBERT models, two of which are "uncased" (i.e. every character is lower-cased) while the other five models are "cased" ones.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>For the re-evaluation, we used the latest data provided in .xml-format. Duplicates were not removed, in order to make our results as comparable as possible. We tokenized the documents and fixed single spelling mistakes in the labels5. For Subtask D, the BIO tags were added based on the provided sequence positions, i.e. one entity corresponds to at least one token tag starting with B- for "Beginning" and continuing with I- for "Inner". If a token does not belong to any entity, the tag O for "Outer" is assigned. For instance, the sequence "fährt nicht" (engl. "does not run") consists of two tokens and would receive the entity Zugfahrt:negative and the token tags [B-Zugfahrt:negative, I-Zugfahrt:negative] if it refers to a DB train which is not running.</p>
      <p>3MDZ Digital Library team at the Bavarian State Library. Visit https://www.digitale-sammlungen.de for details and https://github.com/dbmdz/berts for their repository of pre-trained BERT models.</p>
      <p>4Visit https://deepset.ai/german-bert for details.</p>
      <p>5"positve" in the train set was replaced with "positive"; " negative" in the testdia set was replaced with "negative".</p>
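      <p>The BIO conversion described above can be sketched as follows; whitespace tokenization and the (start, end, label) span format are simplifications of the actual .xml annotations.</p>

```python
# Sketch: derive token-level BIO tags from character-level target spans.
# Whitespace tokenization is a simplifying assumption.
def bio_tags(text, spans):
    """spans: list of (char_start, char_end, label) opinion targets."""
    tokens, positions, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        positions.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for (s, e, label) in spans:
        inside = False
        for i, (ts, te) in enumerate(positions):
            if ts < e and te > s:  # token overlaps the annotated span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tokens, tags

# "fährt nicht" (characters 4-15) annotated as Zugfahrt:negative
tokens, tags = bio_tags("Zug fährt nicht", [(4, 15, "Zugfahrt:negative")])
```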
      <p>The models were fine-tuned on one Tesla V100 PCIe 16GB GPU using Python 3.8.7. Moreover, the transformers module (version 4.0.1) and torch (version 1.7.1) were used6. The considered hyperparameter values for fine-tuning follow the recommendations of Devlin et al. (2019):
• Batch size ∈ {16, 32},
• Adam learning rate ∈ {5e-5, 3e-5, 2e-5},
• Number of epochs ∈ {2, 3, 4}.</p>
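      <p>The search space above amounts to 18 candidate configurations; a sketch of enumerating them (the actual fine-tuning call via the transformers library is elided):</p>

```python
# Enumerate the hyperparameter grid recommended by Devlin et al. (2019).
from itertools import product

grid = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "epochs": [2, 3, 4],
}

# 2 * 3 * 3 = 18 candidate configurations
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# The combination that performed best across models in this study:
best = {"batch_size": 32, "learning_rate": 5e-5, "epochs": 4}
```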
      <p>After evaluating the model performance for
combinations7 of the different hyperparameters, all
pretrained architectures were fine-tuned with a
learning rate of 5e-5 for four epochs, which turned out
to be the most promising combination across the
different models. The maximum sequence length
was set to 256, which is sufficient since the
evaluated data set consists of rather short texts from
social media, and a batch size of 32 was chosen.
Other models Eight teams officially participated in the GermEval17 shared task: five of them analyzed Subtask A, all of them Subtask B, and two each Subtasks C and D. In addition to the participants' models from 2017, we furthermore consider the system by Ruppert et al. (2017), even though they were the organizers and did not "officially" participate. They also tackled all four subtasks. Since 2017, several other authors have analyzed (parts of) the GermEval17 subtasks using more advanced models, which we also consider for comparison here. Table 6 shows which authors employed which kinds of models to solve which task.</p>
      <p>6Source code is available on GitHub: https://github.com/ac74/reevaluating germeval2017. The results are fully reproducible for Subtasks A, B and C. For Subtask D, reproducibility could not be ensured; the micro F1 scores fluctuate across different runs by +/-0.01 around the reported values.</p>
      <p>7Due to memory limitations, not every hyperparameter combination was applicable.</p>
      <p>
        [Table 6: overview of which models were applied to which of the Subtasks A, B, C1, C2, D1 and D2: Models from 2017
        <xref ref-type="bibr" rid="ref33 ref47">(Wojatzki et al., 2017; Ruppert et al., 2017)</xref>
        , our BERT models, CNN
        <xref ref-type="bibr" rid="ref1">(Attia et al., 2018)</xref>
        , CNN+FastText
        <xref ref-type="bibr" rid="ref36">(Schmitt et al., 2018)</xref>
        , ELMo+GloVe+BCN and ELMo+TSA
        <xref ref-type="bibr" rid="ref4">(Biesialska et al., 2020)</xref>
        , as well as FastText and bert-base-german-cased
        <xref ref-type="bibr" rid="ref10">(Guhr et al., 2020)</xref>
        .]
      </p>
      <p>Subtask A The Relevance Classification is a binary document classification task with the classes true and false. Table 7 displays the micro F1 score obtained by each language model on each test set (best result per data set in bold).</p>
      <p>
        [Table 7: Best model 2017
        <xref ref-type="bibr" rid="ref35">(Sayyed et al., 2017)</xref>
        and the seven pre-trained models: bert-base-german-cased, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-multilingual-cased, bert-base-multilingual-uncased, distilbert-base-german-cased, distilbert-base-multilingual-cased.]
      </p>
      <p>
        All the models outperform the best result achieved in 2017 on both test data sets. For the synchronic test set, the previous best result is surpassed by 3.8–5.4 percentage points. For the diachronic test set, the absolute difference to the best contender of 2017 varies between 2.6 and 4.2 percentage points. With micro F1 scores of 0.957 and 0.948, respectively, the best scoring pre-trained language model is the uncased German BERT-BASE variant by dbmdz, followed by its cased version. All the pre-trained models perform slightly better on the synchronic test data than on the diachronic data. Attia et al. (2018), Schmitt et al. (2018), Biesialska et al. (2020) and Guhr et al. (2020) did not evaluate their models on this task.
Subtask B Subtask B refers to the Document-level Polarity, a multi-class classification task with three classes. Table 8 demonstrates the performances on the two test sets. All models outperform the best model from 2017 by 1.0–4.0 percentage points for the synchronic, and by 1.6–5.0 percentage points for the diachronic test set. On the synchronic test set, the uncased German BERT-BASE model by dbmdz performs best with a score of 0.807, followed by its cased variant with 0.799. For the diachronic test set, the uncased German BERT-BASE model exceeds the other models with a score of 0.800, followed by the cased German BERT-BASE model reaching a score of 0.793. The three multilingual models generally perform worse than the German models on this task. Besides this, all the models perform slightly better on the synchronic data set than on the diachronic one. The FastText-based model
        <xref ref-type="bibr" rid="ref10">(Guhr et al., 2020)</xref>
        does not even come close to the baseline from 2017, while the ELMo-based models
        <xref ref-type="bibr" rid="ref4">(Biesialska et al., 2020)</xref>
        are pretty competitive. Interestingly, two of the multilingual models are even outperformed by these ELMo-based models.
Subtask C Subtask C is split into Aspect-only (Subtask C1) and Aspect+Sentiment Classification (Subtask C2), each being a multi-label classification task8. As the organizers provide 20 aspect categories, Subtask C1 includes 20 labels, whereas Subtask C2 has 60 labels, since each aspect category can be combined with each of the three sentiments. (8This leads to a change of the activation function in the final layer from softmax to sigmoid, with a binary cross-entropy loss.) Consistent with Lee et al. (2017) and Mishra et al. (2017), we do not account for multiple mentions of the same label in one document. The results for Subtask C1 are shown in Table 9:
        [Table 9: Best model 2017
        <xref ref-type="bibr" rid="ref33 ref47">(Ruppert et al., 2017)</xref>
        , the seven pre-trained models, and CNN+FastText
        <xref ref-type="bibr" rid="ref36">(Schmitt et al., 2018)</xref>
        .]
        All pre-trained German BERTs clearly surpass the best performance from 2017 as well as the results reported by Schmitt et al. (2018), who are the only ones of the other authors to evaluate their models on this task. Regarding the synchronic test set, the absolute improvement ranges between 16.9 and 22.4 percentage points, while for the diachronic test data, the models outperform the previous results by 17.8–23.5 percentage points. The best model is again the uncased German BERT-BASE model by dbmdz, reaching scores of 0.761 and 0.791, respectively, followed by the two cased German BERT-BASE models. One more time, the multilingual models exhibit the poorest performances amongst the evaluated models. Next, Table 10 shows the results for Subtask C2:
        [Table 10: Best model 2017
        <xref ref-type="bibr" rid="ref33 ref47">(Ruppert et al., 2017)</xref>
        , the seven pre-trained models, and CNN+FastText
        <xref ref-type="bibr" rid="ref36">(Schmitt et al., 2018)</xref>
        .]
        Here, the pre-trained models surpass the best model
from 2017 by 15.7–25.9 percentage points and
20.7–26.5 percentage points, respectively, for the
synchronic and diachronic test sets. Again, the best
model is the uncased German BERT-BASE dbmdz
model reaching scores of 0.655 and 0.689,
respectively. The CNN models
        <xref ref-type="bibr" rid="ref36">(Schmitt et al., 2018)</xref>
        are also outperformed. For both Subtask C1 and C2, all the displayed models perform better on the diachronic than on the synchronic test data.
Subtask D Subtask D refers to Opinion Target Extraction (OTE) and is thus a token-level classification task. As this is a rather difficult task, Wojatzki et al. (2017) distinguish between exact (Subtask D1) and overlapping match (Subtask D2), tolerating a deviation of ± one token. Here, "entities" are identified by their BIO tags. It is noteworthy that there are fewer entities here than for Subtask C, since document-level aspects or sentiments could not always be assigned to a certain sequence in the document. As a result, fewer documents are at our disposal for this task, namely 9,193. The remaining data has 1.86 opinions per document on average. The majority class is now Sonstige Unregelmäßigkeiten:negative with around 15.4% of the true entities (16,650 in total), leading to more balanced data than in Subtask C.
      </p>
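      <p>Footnote 8 above, spelled out: for the multi-label Subtasks C1/C2, the final layer uses one sigmoid per label with a binary cross-entropy loss instead of a softmax over mutually exclusive classes. A minimal pure-Python sketch with invented logits:</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_multilabel(logits, targets):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Two aspect labels active at once: impossible under a softmax over classes,
# unproblematic with per-label sigmoids. The logits are made up.
logits = [2.0, -1.5, 3.0]
targets = [1, 0, 1]
loss = bce_multilabel(logits, targets)
preds = [int(sigmoid(z) > 0.5) for z in logits]
```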
      <p>
        [Table 11: micro F1 for Subtask D1: Best model 2017
        <xref ref-type="bibr" rid="ref33 ref47">(Ruppert et al., 2017)</xref>
        and the seven pre-trained models, each evaluated without and with a CRF layer.]
      </p>
      <p>In Table 11, we compare the pre-trained models using an "ordinary" softmax layer to those using a CRF layer for Subtask D1.</p>
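      <p>The difference matters because a per-token softmax argmax can emit invalid tag sequences such as O followed by I-X, while a CRF scores whole sequences. A toy Viterbi decoding sketch with invented scores (not the paper's implementation):</p>

```python
# Toy comparison: independent per-token argmax vs. Viterbi decoding with a
# transition score that forbids O -> I-X. All scores are invented.
NEG_INF = float("-inf")

def viterbi(emissions, tags, transition):
    """emissions: list of dicts tag -> score; transition(a, b) -> score."""
    prev = {t: emissions[0][t] for t in tags}
    back = []
    for emit in emissions[1:]:
        cur, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda s: prev[s] + transition(s, t))
            cur[t] = prev[best_prev] + transition(best_prev, t) + emit[t]
            ptr[t] = best_prev
        back.append(ptr)
        prev = cur
    best = max(tags, key=lambda t: prev[t])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-X", "I-X"]
def trans(a, b):  # hard constraint: I-X may only follow B-X or I-X
    return NEG_INF if b == "I-X" and a == "O" else 0.0

emissions = [{"O": 1.0, "B-X": 0.0, "I-X": 0.0},
             {"O": 0.4, "B-X": 0.3, "I-X": 0.5}]
greedy = [max(tags, key=lambda t: e[t]) for e in emissions]  # invalid O, I-X
path = viterbi(emissions, tags, trans)                       # consistent O, O
```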
      <p>The best performing model is the uncased
German BERT-BASE model by dbmdz with CRF
layer on both test sets, with a score of 0.515 and
0.518, respectively. Overall, the results from 2017
are outperformed by 11.8–28.6 percentage points
on the synchronic test set and 5.6–21.7 percentage
points on the diachronic test set.</p>
      <p>For the overlapping match (cf. Tab. 12), the best system from 2017 is outperformed by 4.9–17.5 percentage points on the synchronic and by 4.2–16.8 percentage points on the diachronic test set. Again, the uncased German BERT-BASE model by dbmdz with CRF layer performs best, with a micro F1 score of 0.523 on the synchronic and 0.533 on the diachronic set. To our knowledge, there were no other models to compare our performance values with, besides the results from 2017.
Main Takeaways For the first two subtasks, which are rather simple binary and multi-class classification tasks, the pre-trained models are able to improve a little upon the already pretty decent performance values from 2017. Further, we do not see large differences between the different pre-trained models. Nevertheless, the small differences we can observe already point in the same direction as what can be observed for the primary ABSA tasks of interest, C1 and C2:
• Uncased models tend to outperform their cased counterparts among the monolingual models; for the multilingual models this cannot be clearly confirmed.
• Monolingual models outperform the multilingual ones.
• There are no large performance differences between the two cased BERT models by DBMDZ and Deepset.ai, which suggests only a minor influence of the different corpora the models were pre-trained on.
• The monolingual DistilBERT model is pretty competitive: it consistently outperforms its multilingual counterpart as well as the multilingual BERT models on Subtasks A–C and is at least competitive with the monolingual BERT models.</p>
      <p>For D1 and D2, we observe a rather clear dominance of the uncased monolingual model, which is not observable to this extent for the other tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>
        After having observed a notable performance
increase for German ABSA when employing
pretrained models, the next step is to compare these
observations to what was reported for the English
language. Therefore, we examine the temporal
development of the SOTA performance on the most
widely adopted data sets for English ABSA,
originating from the SemEval Shared Tasks
        <xref ref-type="bibr" rid="ref28 ref29 ref30">(Pontiki
et al., 2014, 2015, 2016)</xref>
        . When looking at public leaderboards, e.g. https://paperswithcode.com/, Subtask SB2 (aspect term polarity) from SemEval-2014 is the task which attracts most of the researchers. This task is related, but not perfectly similar, to Subtask C2, since in this case the aspect term is always a word which has to be present in the given review. For this task, a comparison of pre-BERT and BERT-based methods reveals no big "jump" in the performance values, but rather a steady increase over time (cf. Tab. 13).
      </p>
      <sec id="sec-7-1">
        <p>
          [Table 13: SOTA development on SemEval-2014 SB2 (Laptops, Restaurants). Pre-BERT models: Best model SemEval-2014
          <xref ref-type="bibr" rid="ref30">(Pontiki et al., 2014)</xref>
          , MemNet
          <xref ref-type="bibr" rid="ref41">(Tang et al., 2016)</xref>
          , HAPN
          <xref ref-type="bibr" rid="ref19">(Li et al., 2018)</xref>
          . BERT-based models: BERT-SPC
          <xref ref-type="bibr" rid="ref39">(Song et al., 2019)</xref>
          , BERT-ADA
          <xref ref-type="bibr" rid="ref31">(Rietzler et al., 2020)</xref>
          , LCF-ATEPC
          <xref ref-type="bibr" rid="ref50">(Yang et al., 2019)</xref>
          .]
        </p>
        <p>Clearly more related, but unfortunately also less
used, are the subtasks SB3 (aspect category
extraction; comparable to Subtask C1) and SB4
(aspect category polarity; comparable to Subtask C2)
from SemEval-2014.9 Limitations with respect
to comparability arise from the different numbers
of categories: Subtask SB4 only exhibits five
aspect categories (as opposed to 20 categories for
GermEval17) which leads to an easier
classification problem and is reflected in the already pretty
high scores of the 2014 baselines. Table 14 shows
the performance of the best model from 2014 as
well as performance of subsequent (pre-BERT and
BERT-based) models for subtasks SB3 and SB4.</p>
      </sec>
      <sec id="sec-7-2">
        <p>
          [Table 14: performance on SemEval-2014 SB3 and SB4 (Restaurants). Pre-BERT models: Best model SemEval-2014
          <xref ref-type="bibr" rid="ref30">(Pontiki et al., 2014)</xref>
          , ATAE-LSTM
          <xref ref-type="bibr" rid="ref46">(Wang et al., 2016)</xref>
          . BERT-based models: BERT-pair
          <xref ref-type="bibr" rid="ref40">(Sun et al., 2019)</xref>
          , CG-BERT and QACG-BERT
          <xref ref-type="bibr" rid="ref17 ref2 ref25 ref31 ref4 ref42 ref48">(Wu and Ong, 2020)</xref>
          .]
        </p>
        <p>
          In contrast to what can be observed for SB2, the performance increase on SB4 caused by the introduction of BERT is striking. While the ATAE-LSTM
          <xref ref-type="bibr" rid="ref46">(Wang et al., 2016)</xref>
          only slightly increased the performance compared to 2014, the BERT-based models led to a jump of more than 6 percentage points. So when taking into account the potential room for improvement (0.16 for SB4 vs. 0.60 for C2), the improvements relative to this potential (0.06/0.16 for SB4 vs. 0.23/0.60 for C2) are quite similar.
        </p>
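        <p>The relative-improvement comparison above, spelled out with the figures quoted in the text:</p>

```python
# Improvements relative to the remaining headroom, as argued above.
headroom_sb4, gain_sb4 = 0.16, 0.06   # SemEval-2014 SB4
headroom_c2, gain_c2 = 0.60, 0.23     # GermEval17 Subtask C2

rel_sb4 = gain_sb4 / headroom_sb4     # 0.375
rel_c2 = gain_c2 / headroom_c2        # ~0.383
```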
        <p>Another issue is that (partly) highly specialized
(T)ABSA architectures were used to improve
the SOTA on the SemEval-2014 tasks, while we
"only" applied standard pre-trained German BERT
models without any task-specific modifications or
extensions. This leaves room for further
improvements on German data, which should
be an objective for future research.</p>
        <p>9Since the data sets (Restaurants and Laptops) have been
further developed for SemEval-2015 and SemEval-2016,
subtasks SB3 and SB4 are revisited under the names Slot 1 and
subtasks SB3 and SB4 are revisited under the names Slot 1 and
Slot 3 for the in-domain ABSA in SemEval-2015. Slot 2
from SemEval-2015 targets Opinion Target Expression (OTE)
extraction and thus corresponds to
Subtask D from GermEval17. For SemEval-2016 the same
task names as in 2015 were used, subdivided into Subtask 1
(sentence-level ABSA) and Subtask 2 (text-level ABSA).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>As one would have hoped, all the state-of-the-art
pre-trained language models clearly outperform
the models from 2017, demonstrating the power of
transfer learning also for German ABSA. Throughout
the presented analyses, the models achieve
similar results on the synchronic and the
diachronic test sets, indicating temporal robustness.
Nonetheless, the diachronic data
was collected only half a year after the main data;
it would be interesting to see whether the trained
models would return similar predictions on data
collected a couple of years later.</p>
      <p>
        The uncased German BERT-BASE model by
dbmdz achieves the best results across all subtasks.
Since Rönnqvist et al. (2019) showed that
monolingual BERT models often outperform the
multilingual models for a variety of tasks, one might
have already suspected that a monolingual
German BERT would perform best across these
tasks. It may not seem evident at first that an
uncased language model ends up as the best
performing model since, e.g. in Sentiment Analysis,
capitalized letters might be an indicator for
polarity. In addition, since nouns and beginnings of
sentences always start with a capital letter in
German, one might assume that lower-casing the whole
text changes the meaning of some words and thus
confuses the language model. Nevertheless, the
GermEval17 documents are very noisy since they
were retrieved from social media. That means that
the data contains many misspellings, grammar and
expression mistakes, dialect, and colloquial
language. For this reason, some participating
teams in 2017 already pursued elaborate pre-processing
of the text data in order to eliminate some noise
        <xref ref-type="bibr" rid="ref16 ref22 ref23 ref33 ref35 ref35 ref37 ref5">(Hövelmann and Friedrich, 2017; Sayyed et al.,
2017; Sidarenka, 2017)</xref>
        . Among other things,
Hövelmann and Friedrich (2017) transformed the
text to lower-case and replaced, for example,
"SBahn" and "S Bahn" with "sbahn". We suppose
that in this case, lower-casing improves
the data quality by eliminating some of this noise
and acts as a form of regularization. As a result,
the uncased models potentially generalize better
than the cased models. The findings from
Mayhew et al. (2019), who compare cased and uncased
pre-trained models on social media data for NER,
corroborate this hypothesis.
      </p>
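A minimal sketch of the kind of noise-reducing normalization described above, assuming a simple regex-based rule for the S-Bahn variants; the actual pipeline of Hövelmann and Friedrich (2017) is not reproduced here:

```python
import re

def normalize(text):
    # Lower-case everything (uncased-model style preprocessing) ...
    text = text.lower()
    # ... and collapse spelling variants such as "SBahn" / "S Bahn" / "S-Bahn"
    # into a single token. This rule is illustrative, not the original one.
    return re.sub(r"\bs[ -]?bahn\b", "sbahn", text)

print(normalize("Die S Bahn war verspätet, die SBahn auch!"))
# -> die sbahn war verspätet, die sbahn auch!
```

Rules like this reduce the effective vocabulary of noisy social-media text, which is one plausible reason the uncased models generalize better here.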
    </sec>
    <sec id="sec-9">
      <title>Appendix A: Detailed results (per category) for Subtask C</title>
      <p>It may be interesting to have a more detailed look
at the model performance for this subtask because
of the high number of classes and their skewed
distribution by investigating the performance on
category-level. Table 15 shows the performance
of the uncased German BERT-BASE model by
dbmdz per test set for Subtask C1. The support
indicates the number of occurrences, which in this
case are also displayed in Table 4. Seven categories
are summarized in Rest because they have an F1
score of 0 for both test sets, i.e. the model is not
able to correctly identify any of these seven aspects
appearing in the test data. The table is sorted by
the score on the synchronic test set.
</p>
      <table-wrap>
        <label>Table 15</label>
        <caption><p>Per-category performance (F1 score and support) of the uncased German BERT-BASE model by dbmdz for Subtask C1, per test set.</p></caption>
        <table>
          <thead>
            <tr><th>Aspect Category</th><th>Score (testsyn)</th><th>Support (testsyn)</th><th>Score (testdia)</th><th>Support (testdia)</th></tr>
          </thead>
          <tbody>
            <tr><td>Allgemein</td><td>0.854</td><td>1,398</td><td>0.877</td><td>1,024</td></tr>
            <tr><td>Sonstige Unregelmäßigkeiten</td><td>0.782</td><td>224</td><td>0.785</td><td>164</td></tr>
            <tr><td>Connectivity</td><td>0.750</td><td>36</td><td>0.838</td><td>73</td></tr>
            <tr><td>Zugfahrt</td><td>0.678</td><td>241</td><td>0.687</td><td>184</td></tr>
            <tr><td>Auslastung und Platzangebot</td><td>0.645</td><td>35</td><td>0.667</td><td>20</td></tr>
            <tr><td>Sicherheit</td><td>0.602</td><td>84</td><td>0.639</td><td>42</td></tr>
            <tr><td>Atmosphäre</td><td>0.600</td><td>148</td><td>0.532</td><td>53</td></tr>
            <tr><td>Barrierefreiheit</td><td>0.500</td><td>9</td><td>0</td><td>2</td></tr>
            <tr><td>Ticketkauf</td><td>0.481</td><td>95</td><td>0.506</td><td>48</td></tr>
            <tr><td>Service und Kundenbetreuung</td><td>0.476</td><td>63</td><td>0.417</td><td>27</td></tr>
            <tr><td>DB App und Website</td><td>0.455</td><td>28</td><td>0.563</td><td>18</td></tr>
            <tr><td>Informationen</td><td>0.329</td><td>58</td><td>0.464</td><td>35</td></tr>
            <tr><td>Komfort und Ausstattung</td><td>0.286</td><td>24</td><td>0</td><td>11</td></tr>
            <tr><td>Rest</td><td>0</td><td>24</td><td>0</td><td>20</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The F1 scores for Allgemein (General),
Sonstige Unregelmäßigkeiten (Other
irregularities) and Connectivity are the highest.
13 categories, mostly similar between the two test
sets, show a positive F1 score on at least one of
the two test sets. For the categories subsumed
under Rest, the model was not able to learn how to
correctly identify these categories.</p>
      <p>Subtask C2 exhibits a similar distribution of the
true labels, with the Aspect+Sentiment category
Allgemein:neutral as the majority class: over
50% of the true labels belong to it. Table 16
shows that only 12 out of 60 labels can be detected
by the model.</p>
      <p>All the aspect categories displayed in
Table 16 are also visible in Table 15, and
most of them have negative sentiment.
Allgemein:neutral and Sonstige
Unregelmäßigkeiten:negative show the
highest scores. Again, we assume that 48
categories could not be identified due to data
sparsity. Having this in mind, the model nevertheless
achieves a relatively high overall performance
for both Subtask C1 and C2 (cf. Tab. 9 and
Tab. 10). This is mainly due to the high
score of the majority classes Allgemein and
Allgemein:neutral, respectively, because
the micro F1 score puts a lot of weight on majority
classes. It might be interesting to see whether the
classification of the rare categories can be improved
by balancing the data. We experimented with
removing general categories such as Allgemein
and Allgemein:neutral, or documents with
sentiment neutral, since these are usually less
interesting for a company. We observe a large
drop in the overall F1 score, which we attribute to
the absence of the strong majority class and the
resulting data loss. Indeed, the classification of
some single categories could be improved, but the
rare categories could still not be identified by the
model.</p>
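The dominance of the majority class in the micro F1 score can be illustrated with a small self-contained example (a single-label simplification with hypothetical toy labels; when micro-averaging over all classes in this setting, micro F1 coincides with accuracy):

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Per-class F1 plus micro and macro averages (single-label case).
    Micro-averaging pools all decisions, so the majority class dominates;
    macro-averaging weights every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    micro = sum(tp.values()) / len(y_true)   # equals accuracy here
    macro = sum(per_class.values()) / len(labels)
    return per_class, micro, macro

# Toy data: 8 of 10 gold labels belong to the majority class.
y_true = ["Allgemein"] * 8 + ["Connectivity", "Ticketkauf"]
y_pred = ["Allgemein"] * 9 + ["Ticketkauf"]
per_class, micro, macro = f1_report(y_true, y_pred)
print(round(micro, 2), round(macro, 2))  # micro is much higher than macro
```

Getting the majority class right already yields a high micro F1, while the macro average exposes the unrecognized rare categories.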
    </sec>
    <sec id="sec-12">
      <title>Appendix B: Detailed results (per category) for Subtask D</title>
      <p>Similarly to Subtask C, the results for the best
model are investigated in more detail. Table 17
gives the detailed classification report for the
uncased German BERT-BASE model with CRF layer
on Subtask D1. Only entities that were correctly
detected at least once are displayed. The table is
sorted by the score on the synchronic test set. The
classification report for Subtask D2 is displayed
analogously in Table 18.
</p>
      <table-wrap>
        <label>Table 17</label>
        <caption><p>Per-category performance (F1 score and support, synchronic test set) of the uncased German BERT-BASE model with CRF layer for Subtask D1.</p></caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Score (testsyn)</th><th>Support (testsyn)</th></tr>
          </thead>
          <tbody>
            <tr><td>Zugfahrt:negative</td><td>0.702</td><td>622</td></tr>
            <tr><td>Sonstige Unregelmäßigkeiten:negative</td><td>0.681</td><td>693</td></tr>
            <tr><td>Sicherheit:negative</td><td>0.604</td><td>337</td></tr>
            <tr><td>Connectivity:negative</td><td>0.598</td><td>56</td></tr>
            <tr><td>Barrierefreiheit:negative</td><td>0.595</td><td>14</td></tr>
            <tr><td>Auslastung und Platzangebot:negative</td><td>0.579</td><td>66</td></tr>
            <tr><td>Connectivity:positive</td><td>0.571</td><td>26</td></tr>
            <tr><td>Allgemein:negative</td><td>0.545</td><td>807</td></tr>
            <tr><td>Atmosphäre:negative</td><td>0.500</td><td>403</td></tr>
            <tr><td>Ticketkauf:negative</td><td>0.383</td><td>96</td></tr>
            <tr><td>Ticketkauf:positive</td><td>0.368</td><td>59</td></tr>
            <tr><td>Komfort und Ausstattung:negative</td><td>0.357</td><td>24</td></tr>
            <tr><td>Atmosphäre:neutral</td><td>0.348</td><td>40</td></tr>
            <tr><td>Service und Kundenbetreuung:negative</td><td>0.323</td><td>74</td></tr>
            <tr><td>Informationen:negative</td><td>0.301</td><td>68</td></tr>
            <tr><td>Zugfahrt:positive</td><td>0.276</td><td>62</td></tr>
            <tr><td>DB App und Website:negative</td><td>0.232</td><td>39</td></tr>
            <tr><td>DB App und Website:neutral</td><td>0.188</td><td>23</td></tr>
            <tr><td>Sonstige Unregelmäßigkeiten:neutral</td><td>0.179</td><td>13</td></tr>
            <tr><td>Allgemein:positive</td><td>0.157</td><td>86</td></tr>
            <tr><td>Service und Kundenbetreuung:positive</td><td>0.115</td><td>23</td></tr>
            <tr><td>Atmosphäre:positive</td><td>0.105</td><td>26</td></tr>
            <tr><td>Ticketkauf:neutral</td><td>0.040</td><td>144</td></tr>
            <tr><td>Connectivity:neutral</td><td>0</td><td>11</td></tr>
            <tr><td>Toiletten:negative</td><td>0</td><td>15</td></tr>
            <tr><td>Rest</td><td>0</td><td>355</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>For Subtask D1, the model returns a
positive score on 25 entity categories on at
least one of the two test sets. The category
Zugfahrt:negative can be classified best
on both test sets, followed by Sonstige
Unregelmäßigkeiten:negative and
Sicherheit:negative for the synchronic
test set and by Connectivity:negative and
Allgemein:positive for the diachronic set.
Visibly, the scores between the two test sets differ
more here than in the classification report of the
previous task.</p>
      <p>The report for the overlapping match
(Aspect+Sentiment entity with overlapping match,
Subtask D2; cf. Tab. 18) shows slightly better
results on some categories than for the exact match.
35 categories are summarized in Rest and each
shows a score of 0.</p>
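The distinction between the two matching regimes can be sketched with character-offset spans; the helper functions and offsets below are illustrative, not the official GermEval17 evaluation code:

```python
def exact_match(gold, pred):
    # Subtask D1-style: predicted span must have identical offsets.
    return gold == pred

def overlap_match(gold, pred):
    # Subtask D2-style: spans only need to share at least one character.
    (gs, ge), (ps, pe) = gold, pred   # half-open [start, end) offsets
    return max(gs, ps) < min(ge, pe)

gold = (10, 18)   # gold opinion target expression
pred = (12, 18)   # prediction misses the first two characters
print(exact_match(gold, pred), overlap_match(gold, pred))  # False True
```

Predictions that clip or extend a target expression fail the exact criterion but still count under the overlapping one, which explains the slightly higher D2 scores.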
      <p>The third-best score on the diachronic
test data is now Sonstige
Unregelmäßigkeiten:negative. Besides
this, the top three categories per test set remain the
same.</p>
      <p>Apart from the fact that this is a different kind of
task than before, one can notice that even though
the overall micro F1 scores are lower for Subtask D
than for Subtask C, the model manages to
successfully identify a larger variety of categories, i.e. it
achieves a positive score for more categories. This
is probably due to the more balanced data for
Subtask D than for Subtask C2, resulting in a lower
overall score and mostly higher scores per category.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Mohammed</given-names>
            <surname>Attia</surname>
          </string-name>
          , Younes Samih, Ali Elkahky, and
          <string-name>
            <given-names>Laura</given-names>
            <surname>Kallmeyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Multilingual multi-class sentiment classification using convolutional neural networks</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ), Miyazaki, Japan. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Barriere</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexandra</given-names>
            <surname>Balahur</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Computational Linguistics</source>
          , pages
          <fpage>266</fpage>
          -
          <lpage>271</lpage>
          , Barcelona, Spain (Online).
          <source>International Committee on Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Salima</given-names>
            <surname>Behdenna</surname>
          </string-name>
          , Fatiha Barigou, and
          <string-name>
            <given-names>Ghalem</given-names>
            <surname>Belalem</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Document level sentiment analysis: A survey</article-title>
          .
          <source>EAI Endorsed Transactions on Contextaware Systems and Applications</source>
          ,
          <volume>4</volume>
          :
          <fpage>154339</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Katarzyna</given-names>
            <surname>Biesialska</surname>
          </string-name>
          , Magdalena Biesialska, and
          <string-name>
            <given-names>Henryk</given-names>
            <surname>Rybinski</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Sentiment analysis with contextual embeddings and self-attention</article-title>
          . arXiv preprint arXiv:2003.05574.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          , Jan Milan Deriu, Dominic Egger, and
          <string-name>
            <given-names>Fatih</given-names>
            <surname>Uzdilli</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Twitter corpus and benchmark resources for German sentiment analysis</article-title>
          .
          <source>In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>51</lpage>
          , Valencia, Spain. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Esplà-Gomis</surname>
          </string-name>
          , M. Forcada, Gema Ramírez-Sánchez, and
          <string-name>
            <given-names>Hieu T.</given-names>
            <surname>Hoang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>ParaCrawl: Web-scale parallel corpora for the languages of the EU</article-title>
          . In MTSummit.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Oliver</given-names>
            <surname>Guhr</surname>
          </string-name>
          ,
          <string-name>
            <surname>Anne-Kathrin</surname>
            <given-names>Schumann</given-names>
          </string-name>
          , Frank Bahrmann, and
          <string-name>
            <given-names>Hans-Joachim</given-names>
            <surname>Böhme</surname>
          </string-name>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems</article-title>
          .
          <source>In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC</source>
          <year>2020</year>
          ), pages
          <fpage>1627</fpage>
          -
          <lpage>1632</lpage>
          , Marseille, France.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Barry</given-names>
            <surname>Haddow</surname>
          </string-name>
          .
          <year>2018</year>
          . News Crawl Corpus.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          Oriol Vinyals, and Jeff Dean
          .
          <year>2015</year>
          .
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
          <source>arXiv preprint arXiv:1503.02531</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Mickel</given-names>
            <surname>Hoang</surname>
          </string-name>
          , Oskar Alija Bihorac, and
          <string-name>
            <given-names>Jacobo</given-names>
            <surname>Rouces</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Aspect-based sentiment analysis using BERT</article-title>
          .
          <source>In Proceedings of the 22nd Nordic Conference on Computational Linguistics</source>
          , pages
          <fpage>187</fpage>
          -
          <lpage>196</lpage>
          , Turku, Finland. Linköping University Electronic Press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Leonard</given-names>
            <surname>Hövelmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christoph M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Akbar</given-names>
            <surname>Karimi</surname>
          </string-name>
          , Leonardo Rossi, and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Prati</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Adversarial training for aspect-based sentiment analysis with bert</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Ji-Ung</surname>
            <given-names>Lee</given-names>
          </string-name>
          , Steffen Eger, Johannes Daxenberger, and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Lishuang</surname>
            <given-names>Li</given-names>
          </string-name>
          , Yang Liu, and
          <string-name>
            <given-names>AnQiao</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hierarchical attention based position-aware network for aspect-level sentiment analysis</article-title>
          .
          <source>In Proceedings of the 22nd Conference on Computational Natural Language Learning</source>
          , pages
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          , Brussels, Belgium. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Xin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lidong</given-names>
            <surname>Bing</surname>
          </string-name>
          , Wenxuan Zhang, and
          <string-name>
            <given-names>Wai</given-names>
            <surname>Lam</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Exploiting BERT for end-to-end aspect-based sentiment analysis</article-title>
          .
          <source>In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)</source>
          , pages
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          , Hong Kong, China. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and Jörg Tiedemann.
          <year>2016</year>
          .
          <article-title>OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC</source>
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Mayhew</surname>
          </string-name>
          , Tatiana Tsygankova, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>ner and pos when nothing is capitalized</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          . Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          , pages
          <fpage>6294</fpage>
          -
          <lpage>6305</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Pruthwik</given-names>
            <surname>Mishra</surname>
          </string-name>
          , Vandan Mujadia, and
          <string-name>
            <given-names>Soujanya</given-names>
            <surname>Lanka</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>GermEval 2017: Sequence based Models for Customer Feedback Analysis</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Pedro Javier</given-names>
            <surname>Ortiz Suárez</surname>
          </string-name>
          , Benoît Sagot, and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Romary</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures</article-title>
          .
          <source>In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC7)</source>
          , Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Malte</given-names>
            <surname>Ostendorff</surname>
          </string-name>
          , Till Blume, and
          <string-name>
            <given-names>Saskia</given-names>
            <surname>Ostendorff</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Towards an Open Platform for Legal Information</article-title>
          .
          <source>In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in</source>
          <year>2020</year>
          , JCDL '
          <volume>20</volume>
          , pages
          <fpage>385</fpage>
          --
          <lpage>388</lpage>
          , New York, NY, USA. Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Matthew E.</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Maria</given-names>
            <surname>Pontiki</surname>
          </string-name>
          , Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar,
          <string-name>
            <surname>Mohammad</surname>
            <given-names>AL</given-names>
          </string-name>
          -Smadi, Mahmoud Al-Ayyoub,
          <string-name>
            <given-names>Yanyan</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bing</given-names>
            <surname>Qin</surname>
          </string-name>
          , Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Zafra, and Gülşen Eryiğit.
          <year>2016</year>
          .
          <article-title>Semeval-2016 task 5: Aspect based sentiment analysis</article-title>
          .
          <source>In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Maria</given-names>
            <surname>Pontiki</surname>
          </string-name>
          , Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SemEval-2015 task 12: Aspect based sentiment analysis</article-title>
          .
          <source>In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)</source>
          , pages
          <fpage>486</fpage>
          -
          <lpage>495</lpage>
          , Denver, Colorado. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Maria</given-names>
            <surname>Pontiki</surname>
          </string-name>
          , Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and
          <string-name>
            <given-names>Suresh</given-names>
            <surname>Manandhar</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>SemEval-2014 task 4: Aspect based sentiment analysis</article-title>
          .
          <source>In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          , Dublin, Ireland. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Rietzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Stabinger</surname>
          </string-name>
          , Paul Opitz, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Engl</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification</article-title>
          .
          <source>In Proceedings of the 12th Language Resources and Evaluation Conference</source>
          , pages
          <fpage>4933</fpage>
          -
          <lpage>4941</lpage>
          , Marseille, France. European Language Resources Association.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Rönnqvist</surname>
          </string-name>
          , Jenna Kanerva, Tapio Salakoski, and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Ginter</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Is Multilingual BERT Fluent in Language Generation?</article-title>
          .
          <source>In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          , Turku, Finland. Linköping University Electronic Press.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Eugen</given-names>
            <surname>Ruppert</surname>
          </string-name>
          , Abhishek Kumar, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Victor</given-names>
            <surname>Sanh</surname>
          </string-name>
          , Lysandre Debut, Julien Chaumond, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          .
          <source>arXiv preprint arXiv:1910.01108</source>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Zeeshan Ali</given-names>
            <surname>Sayyed</surname>
          </string-name>
          , Daniel Dakota, and Sandra Kübler.
          <year>2017</year>
          .
          <article-title>IDS-IUCL: Investigating Feature Selection and Oversampling for GermEval 2017</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schmitt</surname>
          </string-name>
          , Simon Steinheber, Konrad Schreiber, and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1109</fpage>
          -
          <lpage>1114</lpage>
          , Brussels, Belgium. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <given-names>Uladzimir</given-names>
            <surname>Sidarenka</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>Raivis</given-names>
            <surname>Skadiņš</surname>
          </string-name>
          , Jörg Tiedemann, Roberts Rozis, and
          <string-name>
            <given-names>Daiga</given-names>
            <surname>Deksne</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)</source>
          , pages
          <fpage>1850</fpage>
          -
          <lpage>1855</lpage>
          , Reykjavik, Iceland. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <given-names>Youwei</given-names>
            <surname>Song</surname>
          </string-name>
          , Jiahai Wang, Tao Jiang, Zhiyue Liu, and
          <string-name>
            <given-names>Yanghui</given-names>
            <surname>Rao</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Attentional encoder network for targeted sentiment classification</article-title>
          .
          <source>arXiv preprint arXiv:1902.09314</source>.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <given-names>Chi</given-names>
            <surname>Sun</surname>
          </string-name>
          , Luyao Huang, and
          <string-name>
            <given-names>Xipeng</given-names>
            <surname>Qiu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>380</fpage>
          -
          <lpage>385</lpage>
          , Minneapolis, Minnesota. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <given-names>Duyu</given-names>
            <surname>Tang</surname>
          </string-name>
          , Bing Qin, and Ting Liu.
          <year>2016</year>
          .
          <article-title>Aspect level sentiment classification with deep memory network</article-title>
          .
          <source>arXiv preprint arXiv:1605.08900</source>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <given-names>Jie</given-names>
            <surname>Tao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xing</given-names>
            <surname>Fang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Toward multi-label sentiment analysis: a transfer learning based approach</article-title>
          .
          <source>Journal of Big Data</source>
          ,
          <volume>7</volume>
          :
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All You Need</article-title>
          .
          <source>In 31st Conference on Neural Information Processing Systems (NIPS 2017)</source>
          , Long Beach, California, USA.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SuperGLUE: A stickier benchmark for general-purpose language understanding systems</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3266</fpage>
          -
          <lpage>3280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1804.07461</source>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <given-names>Yequan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Minlie Huang,
          <string-name>
            <given-names>Xiaoyan</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Attention-based LSTM for aspect-level sentiment classification</article-title>
          .
          <source>In Proceedings of the 2016 conference on empirical methods in natural language processing</source>
          , pages
          <fpage>606</fpage>
          -
          <lpage>615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wojatzki</surname>
          </string-name>
          , Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</article-title>
          .
          <source>In Proceedings of the GermEval 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          , Berlin, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <given-names>Zhengxuan</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Desmond C.</given-names>
            <surname>Ong</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Context-guided BERT for targeted aspect-based sentiment analysis</article-title>
          .
          <source>arXiv preprint arXiv:2010.07523</source>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <given-names>Hu</given-names>
            <surname>Xu</surname>
          </string-name>
          , Bing Liu, Lei Shu, and
          <string-name>
            <given-names>Philip S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT post-training for review reading comprehension and aspect-based sentiment analysis</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <given-names>Heng</given-names>
            <surname>Yang</surname>
          </string-name>
          , Biqing Zeng, JianHao Yang, Youwei Song, and
          <string-name>
            <given-names>Ruyang</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction</article-title>
          .
          <source>arXiv preprint arXiv:1912.07976</source>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>