=Paper=
{{Paper
|id=Vol-2624/paper9
|storemode=property
|title=X-stance: A Multilingual Multi-Target Dataset for Stance Detection
|pdfUrl=https://ceur-ws.org/Vol-2624/paper9.pdf
|volume=Vol-2624
|authors=Jannis Vamvas,Rico Sennrich
|dblpUrl=https://dblp.org/rec/conf/swisstext/VamvasS20
}}
==X-stance: A Multilingual Multi-Target Dataset for Stance Detection==

Jannis Vamvas¹ and Rico Sennrich¹,²

¹ Department of Computational Linguistics, University of Zurich
² School of Informatics, University of Edinburgh

{vamvas,sennrich}@cl.uzh.ch
Abstract

We extract a large-scale stance detection dataset from comments written by candidates of elections in Switzerland. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67,000 comments on more than 150 political issues (targets). Unlike stance detection models that have specific target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.

1 Introduction

In recent years many datasets have been created for the task of automated stance detection, advancing natural language understanding systems for political science, opinion research and other application areas. Typically, such benchmarks (Mohammad et al., 2016a) are composed of short pieces of text commenting on politicians or public issues and are manually annotated with their stance towards a target entity (e.g. Climate Change, or Trump). However, they are limited in scope on multiple levels (Küçük and Can, 2020).

First of all, it is questionable how well current stance detection methods perform in a cross-lingual setting, as the multilingual datasets available today are relatively small and specific to a single target (Taulé et al., 2017, 2018). Furthermore, specific models tend to be developed for each single target or pair of targets (Sobhani et al., 2017). Concerns have been raised that cross-target performance is often considerably lower than fully supervised performance (Küçük and Can, 2020).

In this paper we propose a much larger dataset that combines multilinguality and a multitude of topics and targets. X-stance comprises more than 150 questions about Swiss politics and more than 67k answers given by candidates running for political office in Switzerland. Questions are available in four languages: English, Swiss Standard German, French, and Italian. The language of a comment depends on the candidate's region of origin.

We have extracted the data from the voting advice application Smartvote. Candidates respond to questions mainly in categorical form (yes / rather yes / rather no / no). They can also submit a free-text comment to justify or explain their categorical answer. An example is given in Figure 1.

We transform the dataset into a stance detection task by interpreting the question as a natural-language representation of the target, and the commentary as the input to be classified.

The dataset is split into a multilingual training set and into several test sets to evaluate zero-shot cross-lingual and cross-target transfer. To provide a baseline, we fine-tune a multilingual BERT model (Devlin et al., 2019) on X-stance. We show that the baseline accuracy is comparable to previous stance detection benchmarks while leaving ample room for improvement. In addition, the model can generalize to a degree both cross-lingually and in a cross-target setting.

We have made the dataset and the code for reproducing the baseline models publicly available.¹

¹ http://doi.org/10.5281/zenodo.3831317

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Question #3414 (available in all languages)
EN: Should Switzerland strive for a free trade agreement with the USA?
DE: Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?
FR: La Suisse devrait-elle conclure un accord de libre-échange avec les Etats-Unis?

Comment #26597 (German), Label: FAVOR
Mit unserem zweitwichtigsten Handelspartner sollten wir ein Freihandelsabkommen haben. [With our second most important trading partner we should have a free trade agreement.]

Comment #21421 (French), Label: AGAINST
Les accords de libre-échange menacent la qualité des produits suisses. [The free trade agreements jeopardize the quality of the Swiss products.]

Figure 1: Example of a question and two answers in the X-stance dataset. The answers were submitted by electoral candidates on a voting advice website. The author of the German comment was in favor of the issue; the author of the French comment against. Both authors use comments to explain their respective stance.
2 Related Work

Multilingual Stance Detection. In the context of the IberEval shared tasks, two related multilingual datasets have been created (Taulé et al., 2017, 2018). Both are collections of annotated Spanish and Catalan tweets. Crucially, the tweets in both languages focus on the same issue (Catalan independence); given this fact they are the first truly multilingual stance detection datasets known to us.

With regard to the languages covered by X-stance, only monolingual datasets seem to be available. For French, a collection of tweets on French presidential candidates has been annotated with stance (Lai et al., 2020). Similarly, two datasets of Italian tweets on the occasion of the 2016 constitutional referendum have been created (Lai et al., 2018, 2020). With regard to German, a corpus of 270 sentences has been annotated with fine-grained stance and attitude information (Clematide et al., 2012). Furthermore, fine-grained stance detection has been qualitatively studied on a large corpus of Facebook posts (Klenner et al., 2017).

Multi-Target Stance Detection. The SemEval-2016 task on detecting stance in tweets (Mohammad et al., 2016b) offers data concerning multiple targets (Atheism, Climate Change, Feminism, Hillary Clinton, and Abortion). In the supervised subtask A, participants tended to develop a target-specific model for each of those targets. In subtask B, cross-target transfer to the target "Donald Trump" was tested, for which no annotated training data were provided. While this required the development of more universal models, their performance was generally much lower.

Sobhani et al. (2017) introduced a multi-target stance dataset which provides two targets per instance. For example, a model designed in this framework is supposed to simultaneously classify a tweet with regard to Clinton and with regard to Trump. While in theory the framework allows for more than two targets, it is still restricted to a finite and clearly defined set of targets. It focuses on modeling the dependencies of multiple targets within the same text sample, while our approach focuses on learning stance detection from many samples with many different targets.

Representation Learning for Stance Detection. In a target-specific setting, Ghosh et al. (2019) perform a systematic evaluation of stance detection approaches. They also evaluate BERT (Devlin et al., 2019) and find that it consistently outperforms previous approaches. However, they only experiment with a single-segment encoding of the input, preventing cross-target transfer of the model. Augenstein et al. (2016) propose a conditional encoding approach to encode both the target and the tweet as sequences. They use a bidirectional LSTM to condition the encoding of the tweets on the encoding of the target, and then apply a nonlinear projection on the conditionally encoded tweet. This allows them to train a model that can generalize to previously unseen targets.
3 The X-stance Dataset

3.1 Task Definition

The input provided by X-stance is two-fold: (A) a natural language question concerning a political issue; (B) a natural language commentary on a specific stance towards the question.

The label to be predicted is either 'favor' or 'against'. This corresponds to a standard established by Mohammad et al. (2016a). However, X-stance differs from that dataset in that it lacks a 'neither' class; all comments refer to either a 'favor' or an 'against' position. The task posed by X-stance is thus a binary classification task.

As an evaluation metric we report the macro-average of the F1-score for 'favor' and the F1-score for 'against', similar to Mohammad et al. (2016b). We use this metric mainly to strengthen comparability with the previous benchmarks.
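As a concrete reference, the metric can be computed as in the following minimal sketch. The use of scikit-learn is our illustrative choice here, not necessarily the tooling behind the reported scores:

```python
from sklearn.metrics import f1_score

def macro_f1(gold, predicted):
    # Macro-average of the F1-scores for 'favor' and 'against',
    # following Mohammad et al. (2016b).
    return f1_score(gold, predicted,
                    labels=["favor", "against"], average="macro")

# Toy usage with hypothetical labels:
print(macro_f1(["favor", "against", "favor"],
               ["favor", "favor", "favor"]))
```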
3.2 Data Collection

Provenance. We downloaded the questions and answers via the Smartvote API.² The downloaded data cover 175 communal, cantonal and national elections between 2011 and 2020.

² https://smartvote.ch

All candidates in an election who participate in Smartvote are asked the same set of questions, but depending on the locale they see translated versions of the questions. They can answer each question with either 'yes', 'rather yes', 'rather no', or 'no'. They can supplement each answer with a comment of at most 500 characters.

The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election. A detailed documentation of the design of Smartvote and the editing process of the questions is provided by Thurman and Gasser (2009).

Preprocessing. We merged the two labels on each pole into a single label: 'yes' and 'rather yes' were combined into 'favor'; 'rather no' and 'no' into 'against'. This improves the consistency of the data and the comparability to previous stance detection datasets. We did not further preprocess the text of the comments.

Language Identification. As the API does not provide the language of comments, we employed a language identifier to automatically annotate this information. We used the langdetect library (Shuyo, 2010). For each responder we classified all the comments jointly, assuming that responders did not switch code during the answering of the questionnaire.

We applied the identifier in a two-step approach. In the first run we allowed the identifier to output all 55 languages that it supports out of the box, plus Romansh, the fourth official language in Switzerland.³ We found that no Romansh comments were detected and that all unexpected outputs were misclassifications of German, French or Italian comments. We further concluded that few or no Swiss German comments are in the dataset; otherwise, some of them would have manifested themselves via misclassifications (e.g. as Dutch). In the second run, drawing on these conclusions, we restricted the identifier's set of choices to English, French, German and Italian.

³ Namely the Rumantsch Grischun variety; the language profile was created using resources from the Zurich Parallel Corpus Collection (Graën et al., 2019) and the Quotidiana corpus (https://github.com/ProSvizraRumantscha/corpora).
Filtering. We pre-filtered the questions and answers to improve the quality of the dataset. To keep the domain of the data manageable, we set a focus on national-level questions. Therefore, all questions and corresponding answers pertaining to national elections were included.

In the context of communal and cantonal elections, candidates have answered both local questions and a subset of the national questions. Of those elections, we only considered answers to questions that had also been asked in a national election. They were only used to augment the training set, while the validation and test sets were restricted to answers from national elections.

We discarded the fewer than 20 comments classified as English. Furthermore, we discarded instances that met any of the following conditions:

• The question is not a closed question or does not address a clearly defined political issue.
• No comment was submitted by the candidate, or the comment is shorter than 50 characters.
• The comment starts with "but" or a similar indicator that the comment is not self-contained.
• The comment contains a URL.

In total, a fifth of the comments were filtered out.
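The answer-level filters could be implemented along the following lines; the list of continuation markers is hypothetical, as the paper only mentions "but"-like indicators:

```python
import re

# Hypothetical indicators that a comment is not self-contained.
CONTINUATION_MARKERS = ("but", "aber", "mais", "ma ")

def keep_comment(comment, language):
    # Discard English comments (fewer than 20 in total).
    if language == "en":
        return False
    # Discard missing or very short comments.
    if comment is None or len(comment) < 50:
        return False
    # Discard comments that are likely not self-contained.
    if comment.strip().lower().startswith(CONTINUATION_MARKERS):
        return False
    # Discard comments containing a URL.
    if re.search(r"https?://|www\.", comment):
        return False
    return True
```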
Topics. The questions have been organized by the Smartvote editors into categories (such as "Economy"). We further consolidated the predefined categories into 12 broad topics (Table 1).

Topic                          Questions   Answers
Digitisation                   2           1,168
Economy                        23          6,899
Education                      16          7,639
Finances                       15          3,980
Foreign Policy                 16          4,393
Immigration                    19          6,270
Infrastructure & Environment   31          9,590
Security                       20          5,193
Society                        17          6,275
Welfare                        15          8,508
Total (training topics)        174         59,915
Healthcare                     11          4,711
Political System               9           2,645
Total (held-out topics)        20          7,356

Table 1: Number of questions and answers per topic.

Compliance. The dataset is shared under a CC BY-NC 4.0 license. Copyright remains with www.smartvote.ch.

Given the sensitive nature of the data, we increase the anonymity of the data by hashing the respondents' IDs. No personal attributes of the respondents are included in the dataset. We provide a data statement (Bender and Friedman, 2018) in Appendix B.
3.3 Data Split

We held out the topics "Healthcare" and "Political System" from the training data and created a separate cross-topic test set that contains the questions and answers related to those topics.

Furthermore, in order to test cross-question generalization performance within previously seen topics, we manually selected 16 held-out questions that are distributed over the remaining 10 topics. We selected the held-out questions manually because we wanted to make sure that they are truly unseen and that no paraphrases of the questions are found in the training set.

We designated Italian as a test-only language, since relatively few comments have been written in Italian. From the remaining German and French data we randomly selected a percentage of respondents as validation or test respondents.

As a result we received one training set, one validation set and four test sets. The sizes of the sets are listed in Table 2. We did not consider test sets that are cross-lingual and cross-target at the same time, as they would have been too small to yield significant results.

      Intra-target          Cross-question          Cross-topic
      (new answers to       (new questions
      known questions)      within known topics)
DE    Train: 33,850         Test: 3,143             Test: 5,269
      Valid: 3,479
      Test: 2,871
FR    Train: 11,790         Test: 1,170             Test: 1,914
      Valid: 1,284
      Test: 1,055
IT    Test: 1,173           Test: (110)             Test: (173)

Table 2: Number of answer instances in the training, validation and test sets. The upper left corner represents a multilingually supervised task, where training, validation and test data are from exactly the same domain. The top-to-bottom axis gives rise to a cross-lingual transfer task, where a model trained on German and French is evaluated on Italian answers to the same questions. The left-to-right axis represents a continuous shift of domain: in the middle column, the model is tested on previously unseen questions that belong to the same topics as seen during training. In the right column the model encounters unseen answers to unseen questions within an unseen topic. The two test sets in parentheses are too small for a significant evaluation.
[Figure 2: plot of the per-question proportion of the 'favor' class (25% to 100%), grouped by topic (Digitisation, Economy, Education, Finances, Foreign Policy, Immigration, Infrastructure, Security, Society, Welfare, and the held-out Healthcare and Political System), with the overall mean marked.]

Figure 2: Proportion of 'favor' labels per question, grouped by topic. While the proportion of favorable answers varies from question to question, it is balanced overall.
3.4 Analysis

Some observations regarding the composition of X-stance can be made.

Class Distribution. Figure 2 visualizes the proportion of 'favor' and 'against' stances for each target in the dataset. The ratio differs between questions but is relatively evenly distributed across the topics. In particular, the questions in the held-out topics (with a 'favor' ratio of 49.4%) have a similar class distribution to the questions in the other topics (with a 'favor' ratio of 50.0%).

Linguistic Properties. Not every question is unique; some questions are paraphrases describing the same political issue. For example, in the 2015 election, the candidates were asked: "Should the consumption of cannabis as well as its possession for personal use be legalised?" Four years later they were asked: "Should cannabis use be legalized?" However, we do not see any need to consolidate those duplicates because they contribute to the diversity of the training data.

We further observe that while some questions in the dataset are quite short, some questions are rather convoluted. For example, a typical long question reads:

  Some 1% of direct payments to Swiss agriculture currently go to organic farming operations. Should this proportion be increased at the expense of standard farming operations as part of Switzerland's 2014-2017 agricultural policy?

Such longer questions might be more challenging to process semantically.

Languages. The X-stance dataset has more German samples than French samples. The language ratio of about 3:1 is consistent across all training and test sets. Given the two languages, it is possible either to train two monolingual models or to train a single model in a multi-source setup (McDonald et al., 2011). We choose a multi-source baseline because M-BERT is known to benefit from multilingual training data both in a supervised and in a cross-lingual scenario (Kondratyuk and Straka, 2019).

4 Baseline Experiments

We evaluate four baselines to obtain an impression of the difficulty of the task.

4.1 Majority Class Baselines

The first pair of baselines uses the most frequent class in the training set for prediction. Specifically, the global majority class baseline predicts the most frequent class across all training targets, while the target-wise majority class baseline predicts the class that is most frequent for a given target question. The latter can only be applied to the intra-target test sets.

4.2 Bag-of-Words Baseline

As a second baseline, we train a fastText bag-of-words linear classifier (Joulin et al., 2017). For each comment, we select the translation of the question that matches its language and concatenate it to the comment. We tokenize the text using the Europarl preprocessing tools (Koehn, 2005).

The 'against' class was slightly upsampled in the training data so that the classes are balanced when summing over all questions and topics. We use the standard settings provided by the fastText library.⁴ Optimal hyperparameters from the following ranges were determined based on the validation accuracy:

⁴ https://github.com/facebookresearch/fastText

• Learning rate: 0.1, 0.2, 1
• Number of epochs: 5, 50

The word vectors were set to a size of 300. We do not initialize them with pre-trained multilingual embeddings since preliminary experiments did not show a beneficial effect.
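Such a classifier can be trained with the fastText Python bindings as sketched below. The training file layout with a "__label__" prefix per line is fastText's standard supervised format; the file name and the exact concatenation are our assumptions:

```python
import fasttext

# Each line of the (hypothetical) training file looks like:
# __label__favor <question in the comment's language> <comment>
model = fasttext.train_supervised(
    input="xstance_train.txt",
    dim=300,   # word vector size used for this baseline
    lr=0.1,    # from the grid {0.1, 0.2, 1}
    epoch=5,   # from the grid {5, 50}
)

# Prediction on a question-comment concatenation:
labels, probabilities = model.predict(
    "Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben? "
    "Mit unserem zweitwichtigsten Handelspartner sollten wir ein "
    "Freihandelsabkommen haben.")
```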
4.3 Multilingual BERT Baseline

As our main baseline model we fine-tune multilingual BERT (M-BERT) on the task (Devlin et al., 2019). M-BERT has been pre-trained jointly on 104 languages⁵ and has established itself as a state of the art for various multilingual tasks (Wu and Dredze, 2019; Pires et al., 2019). Within the field of stance detection, BERT can outperform both feature-based and other neural approaches in a monolingual English setting (Ghosh et al., 2019).

⁵ https://github.com/google-research/bert/blob/master/multilingual.md

Architecture. In the context of BERT we interpret the X-stance task as sequence pair classification, inspired by natural language inference tasks (Bowman et al., 2015). We follow the procedure outlined by Devlin et al. (2019) for such tasks. We designate the question as segment A and the comment as segment B. The two segments are separated with the special token [SEP], and the special token [CLS] is prepended to the sequence. The final hidden state corresponding to [CLS] is then classified by a linear layer. We fine-tune the full model with a cross-entropy loss, using the AllenNLP library (Gardner et al., 2018) as a basis for our implementation.
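The segment layout can be reproduced with the Hugging Face transformers library as in the sketch below; this illustrates the encoding only and is not the AllenNLP-based implementation used for the reported results:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # 'favor' vs. 'against'

question = "Should Switzerland strive for a free trade agreement with the USA?"
comment = "Les accords de libre-échange menacent la qualité des produits suisses."

# Encodes [CLS] question [SEP] comment [SEP]: the question is segment A,
# the comment is segment B.
inputs = tokenizer(question, comment, truncation=True,
                   max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # linear layer over the [CLS] hidden state
```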
Training. As above, we balanced out the number of classes in the training set. We use a batch size of 16 and a maximum sequence length of 512 subwords, and perform a grid search over the following hyperparameters based on the validation accuracy:

• Learning rate: 5e-5, 3e-5, 2e-5
• Number of epochs: 3, 4

The grid search was repeated independently for every variant that we test in the following subsections. Furthermore, the standard recommendations for fine-tuning BERT were used: Adam with β1 = 0.9 and β2 = 0.999; an L2 weight decay of 0.01; a learning rate warmup over the first 10% of the steps; and a linear decay of the learning rate. A dropout probability of 0.1 was set on all layers.
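A sketch of this recipe in PyTorch, reusing the model from the previous sketch; the AdamW optimizer and the transformers scheduler helper are our tooling choices approximating the described Adam-with-weight-decay setup, and the step count is derived from Table 2:

```python
import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 3                 # from the grid {3, 4}
steps_per_epoch = 45640 // 16  # DE + FR training instances, batch size 16
num_training_steps = num_epochs * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.999), weight_decay=0.01)
# Warmup over the first 10% of the steps, then linear decay.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps)
```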
Results. Table 3 shows the results for the cross-lingual setting.

                               DE      FR      IT
Majority class (global)        33.1    34.8    34.4
Majority class (target-wise)   60.8    65.1    59.3
fastText                       69.9    71.2    53.7
M-BERT                         76.8    76.6    70.2

Table 3: Baseline scores in the cross-lingual setting. No Italian samples were seen during training, making this a case of zero-shot cross-lingual transfer. The scores are reported as the macro-average of the F1-scores for 'favor' and for 'against'.

M-BERT performs consistently better than the previous baselines. Even the zero-shot performance in Italian, while significantly lower than the supervised scores, is much better than the target-wise majority class baseline.

Results for the cross-target setting are given in Table 4. Similar to the cross-lingual setting, model performance drops in the cross-target setting, but M-BERT remains the strongest baseline and easily surpasses the majority class baselines. Furthermore, the cross-question score of M-BERT is slightly lower than the cross-topic score.

4.4 How Important is Consistent Language?

The default setup preserves horizontal language consistency in that the language of the questions always corresponds to the language of the comments. For example, the Italian test instances are combined with the Italian version of the questions, even though during training the model has only ever seen the German and French versions of them. An alternative concept is vertical language consistency, whereby the questions are consistently presented in one language, regardless of the comment. To test whether horizontal or vertical consistency is more helpful, we train and evaluate M-BERT on a dataset variant where all questions are in their English version. We chose English as a lingua franca because it had the largest share of data during the pre-training of M-BERT.
                               Intra-target          Cross-question        Cross-topic
                               DE     FR     Mean    DE     FR     Mean    DE     FR     Mean
Majority class (global)        33.1   34.8   33.9    36.4   37.9   37.1    32.1   33.8   32.9
Majority class (target-wise)   60.8   65.1   62.9    -      -      -       -      -      -
fastText                       69.9   71.2   70.5    62.0   65.6   63.7    63.1   65.5   64.3
M-BERT                         76.8   76.6   76.6    68.5   68.4   68.4    68.9   70.9   69.9

Table 4: Baseline scores in the cross-target setting. For each test set we separately report a German and a French score, as well as their harmonic mean.
Results are shown in Table 5. While the effect is negligible in most settings, cross-lingual performance increases when all questions are in English.

4.5 How Important are the Segments?

In order to rule out that only the questions or only the comments are necessary to optimally solve the task, we conduct some additional experiments:

• Only use a single segment containing the comment, removing the questions from the training and test data (missing questions).
• Only use the question and remove the comment (missing comments).

In both cases the performance decreases across all evaluation settings (Table 5). The loss in performance is much higher when comments are missing, indicating that the comments contain the most important information about stance. As can be expected, the score achieved without comments is only slightly different from the target-wise majority class baseline.

But there is also a loss in performance when the questions are missing, which underlines the importance of pairing both pieces of text. The effect of missing questions is especially strong in the supervised and cross-lingual settings. To illustrate this, we provide in Table A8 some examples of comments that occur with multiple different targets in the training set. Those examples can explain why the target can be essential for disambiguating a stance detection problem. On the other hand, the effect of omitting the questions is less pronounced in the cross-target settings.

The above single-segment experiments tell us that both the comment and the question provide crucial information. But it is possible that the M-BERT model, even though trained on both segments, mainly looks at a single segment at test time. To rule this out, we probe the model with randomized data at test time (see the code sketch after this list):

• Test the model on versions of the test sets where the comments remain in place but the questions are shuffled randomly (random questions). We make sure that the random questions come from the same test set and language as the original questions.
• Keep the questions in place and randomize the comments (random comments). Again, we shuffle the comments only within test set boundaries.

The results in Table 5 show that the performance of the model decreases in both cases, confirming that it learns to take into account both segments.
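The question-shuffling probe could be implemented as follows; the instance dictionaries are a hypothetical representation of the test data:

```python
import random

def randomize_questions(test_set, seed=42):
    # Shuffle the questions within one test set while the comments stay
    # in place, so that question and comment no longer belong together.
    # Operating on a single test set at a time keeps the shuffle within
    # test set (and language) boundaries.
    rng = random.Random(seed)
    questions = [instance["question"] for instance in test_set]
    rng.shuffle(questions)
    return [dict(instance, question=q)
            for instance, q in zip(test_set, questions)]
```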
4.6 How Important are Spelled-Out Targets?

Finally, we test whether the target really needs to be represented by natural language (e.g. "Do you support X?"). An alternative is to represent the target with a trainable embedding instead.

In order to fit target embeddings smoothly into our architecture, we represent each target type with a different reserved symbol from the M-BERT vocabulary. Segment A is then set to this symbol instead of a natural language question.

The results for this experiment are listed in the bottom row of Table 5. An M-BERT model that learns target embeddings instead of encoding a question performs clearly worse in the supervised and cross-lingual settings. From this we conclude that spelled-out natural language questions provide important linguistic detail that can help in stance detection.
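One way to realize this, sketched below, is to map each question ID to one of BERT's reserved placeholder tokens; the exact "[unusedN]" naming and the question IDs are assumptions for illustration:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

question_ids = [3414, 3415, 3416]  # hypothetical target question IDs
# One reserved symbol per target type; its embedding is trained from
# scratch during fine-tuning.
target_symbols = {qid: "[unused%d]" % (i + 1)
                  for i, qid in enumerate(question_ids)}

# Segment A is the symbol instead of the spelled-out question:
inputs = tokenizer(target_symbols[3414],
                   "Mit unserem zweitwichtigsten Handelspartner sollten "
                   "wir ein Freihandelsabkommen haben.",
                   return_tensors="pt")
```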
5 Discussion

Our experiments show that M-BERT achieves a reasonable accuracy on X-stance, outperforming the majority class baselines and a fastText classifier.
                            Supervised   Cross-Lingual   Cross-Question   Cross-Topic
M-BERT                      76.6         70.2            68.4             69.9
— with English questions    76.1         71.7            68.5             69.4
— with missing questions    73.2         67.1            67.8             69.3
— with missing comments     64.2         60.5            51.1             48.6
— with random questions     56.0         52.5            47.7             48.5
— with random comments      50.7         50.7            48.2             48.7
— with target embeddings    70.1         66.0            68.4             69.0

Table 5: Results for additional experiments. The cross-lingual score is the F1-score on the Italian test set. For the supervised, cross-question and cross-topic settings we report the harmonic mean of the German and French scores.
To put the supervised score into context, we list in Table 6 the scores that variants of BERT have achieved on other stance detection datasets. It seems that the supervised part of X-stance has a similar difficulty to the SemEval-2016 (Mohammad et al., 2016a) and MPCHI (Sen et al., 2018) datasets on which BERT has previously been evaluated.

Dataset        Evaluation            Score
SemEval-2016   Ghosh et al. (2019)   75.1
MPCHI          Ghosh et al. (2019)   75.6
X-stance       this paper            76.6

Table 6: Performance of BERT-like models on different supervised stance detection benchmarks.

On the other hand, in the cross-lingual and cross-target settings, the mean score drops by 6–8 percentage points compared to the supervised setting; while zero-shot transfer is possible to a degree, it can still be improved.

The additional experiments (Table 5) validate the results and show that the sequence-pair classification approach to stance detection is justified.

It is interesting to see what errors the M-BERT model makes. Table A7 presents instances where it predicts the wrong label with high confidence. These examples indicate that many comments express their stance only on a very implicit level, and thus hint at a potential weakness of the dataset. Because on the voting advice platform the label is explicitly shown to readers in addition to the comments, the comments do not need to express the stance explicitly.

Manual annotation could eliminate very implicit samples in a future version of the dataset. However, the sheer size and breadth of the dataset could not realistically be achieved with manual annotation, and, in our view, largely compensate for the implicitness of the texts.

6 Conclusion

We have presented a new dataset for political stance detection called X-stance. The dataset extends over a broad range of topics and issues regarding national Swiss politics. This diversity of topics opens up an opportunity to further study multi-target learning. Moreover, being partly Swiss Standard German, partly French and Italian, the dataset promotes a multilingual approach to stance detection.

By compiling formal commentary by politicians on political questions, we add a new text genre to the field of stance detection. We also propose a question-answer format that allows us to condition stance detection models on a target naturally.

Our baseline results with multilingual BERT show that the model has some capability to perform zero-shot transfer to unseen languages and to unseen targets (both within a topic and to unseen topics). However, there is some gap in performance that future work could address. We expect that the X-stance dataset could furthermore be a valuable resource for fields such as argument mining, argument search or topic classification.

Acknowledgments

This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). We would like to thank Isabelle Augenstein, Anne Göhring and the anonymous reviewers for helpful feedback.
References

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 876–885, Austin, Texas. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Simon Clematide, Stefan Gindl, Manfred Klenner, Stefanos Petrakis, Robert Remus, Josef Ruppenhofer, Ulli Waltinger, and Michael Wiegand. 2012. MLSA — a multi-layered reference corpus for German sentiment analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3551–3556, Istanbul, Turkey. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, and Saptarshi Ghosh. 2019. Stance detection in web and social media: A comparative study. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 75–87. Springer.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling large parallel corpora: The Zurich Parallel Corpus Collection. In Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC), pages 1–8. Leibniz-Institut für Deutsche Sprache.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Manfred Klenner, Don Tuggener, and Simon Clematide. 2017. Stance detection in Facebook posts of a German right-wing party. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 31–40, Valencia, Spain. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit 2005, pages 79–86.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys, 53(1).

Mirko Lai, Alessandra Teresa Cignarella, Delia Irazú Hernández Farías, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. Multilingual stance detection in social media political debates. Computer Speech & Language, page 101075.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In International Conference on Applications of Natural Language to Information Systems, pages 15–27. Springer.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A dataset for detecting stance in tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3945–3952, Portorož, Slovenia. European Language Resources Association (ELRA).

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Anirban Sen, Manjira Sinha, Sandya Mannarswamy, and Shourya Roy. 2018. Stance classification of multi-perspective consumer health information. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pages 273–281.

Nakatani Shuyo. 2010. Language detection library for Java.

Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu. 2017. A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 551–557, Valencia, Spain. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, volume 1881, pages 157–177.

Mariona Taulé, Francisco Rangel, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the task on multimodal stance detection in tweets on Catalan #1Oct referendum. In 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018, volume 2150, pages 149–166.

James Thurman and Urs Gasser. 2009. Three case studies from Switzerland: Smartvote. Berkman Center Research Publications.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.
A Examples
Question: Befürworten Sie eine vollständige Liberalisierung der Geschäftsöffnungszeiten? [Are you in favour of a complete liberalisation of business hours for shops?]
Comment: Ausser Sonntag. Dies sollte ein Ruhetag bleiben können. [Except Sunday. That should remain a day of rest.]
Gold label: FAVOR. Predicted probability of the gold label: 0.001.

Question: Soll die Schweiz innerhalb der nächsten vier Jahre EU-Beitrittsverhandlungen aufnehmen? [Should Switzerland embark on negotiations in the next four years to join the EU?]
Comment: In den nächsten vier Jahren ist dies wohl unrealistisch. [For the next four years this is probably unrealistic.]
Gold label: FAVOR. Predicted probability of the gold label: 0.005.

Question: Befürworten Sie einen Ausbau des Landschaftsschutzes? [Are you in favour of extending landscape protection?]
Comment: Wenn es darum geht erneuerbare Energien zu fördern, ist sogar eine Lockerung angebracht. [When it comes to promoting renewable energy, even a relaxation is appropriate.]
Gold label: AGAINST. Predicted probability of the gold label: 0.006.

Question: La Suisse devrait-elle engager des négociations pour un accord de libre échange avec les Etats-Unis? [Should Switzerland start negotiations with the USA on a free trade agreement?]
Comment: Il faut cependant en parallèle veiller à ce que la Suisse ne soit pas mise de côté par les Etats-Unis ! [At the same time it must be ensured that Switzerland is not sidelined by the United States!]
Gold label: AGAINST. Predicted probability of the gold label: 0.010.

Table A7: Some classification errors where the predicted probability of the correct label is especially low. The examples have been taken from the validation set.

Comment: Ich will offene Grenzen für Waren und selbstverantwortliche mündige Bürger. Der Staat hat kein Recht, uns einzuschränken. [I want open borders for goods and responsible citizens. The state has no right to restrict us.]
Favorable towards: Soll die Schweiz mit den USA Verhandlungen über ein Freihandelsabkommen aufnehmen? [Should Switzerland start negotiations with the USA on a free trade agreement?]
Against: Soll die Schweiz das Schengen-Abkommen mit der EU kündigen und wieder verstärkte Personenkontrollen direkt an der Grenze einführen? [Should Switzerland terminate the Schengen Agreement with the EU and reintroduce increased identity checks directly on the border?]

Comment: Hier gilt der Grundsatz der Eigenverantwortung und Selbstbestimmung des Unternehmens! [The principle of personal responsibility and corporate self-regulation applies here!]
Favorable towards: Sind Sie für eine vollständige Liberalisierung der Ladenöffnungszeiten? [Are you in favour of the complete liberalization of shop opening times?]
Against: Würden Sie die Einführung einer Frauenquote in Verwaltungsräten börsenkotierter Unternehmen befürworten? [Would you support the introduction of a women's quota for the Boards of Directors of listed companies?]

Table A8: Two comments that imply a positive stance towards one target issue but a negative stance towards another target issue. Such cases can be found in the dataset because respondents have copy-pasted some comments. These examples have been extracted from the training set.
B Data Statement
Curation rationale In order to study the automatic detection of stances on political issues, questions
and candidate responses on the voting advice application smartvote.ch were downloaded. Mainly
data pertaining to national-level issues were included to reduce variability.
Language variety The training set consists of questions and answers in Swiss Standard German and
Swiss French (74.1% de-CH; 25.9% fr-CH). The test sets also contain questions and answers in Swiss
Italian (67.1% de-CH; 24.7% fr-CH; 8.2% it-CH). The questions have also been translated into English.
Speaker demographic (answers)
• Candidates for communal, cantonal or national elections in Switzerland who have filled out an
online questionnaire.
• Age: 18 or older – mixed.
• Gender: Unknown – mixed.
• Race/ethnicity: Unknown – mixed.
• Native language: Unknown – mixed.
• Socioeconomic status: Unknown – mixed.
• Different speakers represented: 7581.
• Presence of disordered speech: Unknown.
Speech situation
• The questions were edited and translated by political scientists for a public voting advice website.
• The answers were written between 2011 and 2020 by the users of the website.
Text characteristics Questions, answers, arguments and comments regarding political issues.