                                                     X-stance:
             A Multilingual Multi-Target Dataset for Stance Detection

                                     Jannis Vamvas¹          Rico Sennrich¹,²
                     ¹ Department of Computational Linguistics, University of Zurich
                              ² School of Informatics, University of Edinburgh
                                   {vamvas,sennrich}@cl.uzh.ch


                         Abstract

    We extract a large-scale stance detection dataset from comments written by candidates in Swiss elections. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67 000 comments on more than 150 political issues (targets). Unlike stance detection models that are specific to particular target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. “Do you support X?”). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.

1   Introduction

In recent years, many datasets have been created for the task of automated stance detection, advancing natural language understanding systems for political science, opinion research and other application areas. Typically, such benchmarks (Mohammad et al., 2016a) are composed of short pieces of text commenting on politicians or public issues and are manually annotated with their stance towards a target entity (e.g. Climate Change, or Trump). However, they are limited in scope on multiple levels (Küçük and Can, 2020).

First of all, it is questionable how well current stance detection methods perform in a cross-lingual setting, as the multilingual datasets available today are relatively small and specific to a single target (Taulé et al., 2017, 2018). Furthermore, specific models tend to be developed for each single target or pair of targets (Sobhani et al., 2017). Concerns have been raised that cross-target performance is often considerably lower than fully supervised performance (Küçük and Can, 2020).

In this paper we propose a much larger dataset that combines multilinguality with a multitude of topics and targets. X-stance comprises more than 150 questions about Swiss politics and more than 67k answers given by candidates running for political office in Switzerland. Questions are available in four languages: English, Swiss Standard German, French, and Italian. The language of a comment depends on the candidate’s region of origin.

We have extracted the data from the voting advice application Smartvote. Candidates respond to questions mainly in categorical form (yes / rather yes / rather no / no). They can also submit a free-text comment to justify or explain their categorical answer. An example is given in Figure 1.

We transform the dataset into a stance detection task by interpreting the question as a natural-language representation of the target, and the commentary as the input to be classified.

The dataset is split into a multilingual training set and into several test sets to evaluate zero-shot cross-lingual and cross-target transfer. To provide a baseline, we fine-tune a multilingual BERT model (Devlin et al., 2019) on X-stance. We show that the baseline accuracy is comparable to previous stance detection benchmarks while leaving ample room for improvement. In addition, the model can generalize to a degree both cross-lingually and in a cross-target setting.

We have made the dataset and the code for reproducing the baseline models publicly available.¹

    ¹ http://doi.org/10.5281/zenodo.3831317

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    Question #3414 (available in all languages)

      EN: Should Switzerland strive for a free trade agreement with the USA?
      DE: Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?
      FR: La Suisse devrait-elle conclure un accord de libre-échange avec les Etats-Unis?

    Comment #26597 (German), Label: FAVOR
      Mit unserem zweitwichtigsten Handelspartner sollten wir ein Freihandelsabkommen haben.
      [With our second most important trading partner we should have a free trade agreement.]

    Comment #21421 (French), Label: AGAINST
      Les accords de libre-échange menacent la qualité des produits suisses.
      [Free trade agreements jeopardize the quality of Swiss products.]

Figure 1: Example of a question and two answers in the X-stance dataset. The answers were submitted by electoral candidates on a voting advice website. The author of the German comment was in favor of the issue; the author of the French comment was against it. Both authors use comments to explain their respective stance.


2   Related Work

Multilingual Stance Detection In the context of the IberEval shared tasks, two related multilingual datasets have been created (Taulé et al., 2017, 2018). Both are collections of annotated Spanish and Catalan tweets. Crucially, the tweets in both languages focus on the same issue (Catalan independence); given this fact, they are the first truly multilingual stance detection datasets known to us.

With regard to the languages covered by X-stance, only monolingual datasets seem to be available. For French, a collection of tweets on French presidential candidates has been annotated with stance (Lai et al., 2020). Similarly, two datasets of Italian tweets on the occasion of the 2016 constitutional referendum have been created (Lai et al., 2018, 2020). For German, a corpus of 270 sentences has been annotated with fine-grained stance and attitude information (Clematide et al., 2012). Furthermore, fine-grained stance detection has been qualitatively studied on a large corpus of Facebook posts (Klenner et al., 2017).

Multi-Target Stance Detection The SemEval-2016 task on detecting stance in tweets (Mohammad et al., 2016b) offers data concerning multiple targets (Atheism, Climate Change, Feminism, Hillary Clinton, and Abortion). In the supervised subtask A, participants tended to develop a target-specific model for each of those targets. In subtask B, cross-target transfer to the target “Donald Trump” was tested, for which no annotated training data were provided. While this required the development of more universal models, their performance was generally much lower.

Sobhani et al. (2017) introduced a multi-target stance dataset which provides two targets per instance. For example, a model designed in this framework is supposed to simultaneously classify a tweet with regard to Clinton and with regard to Trump. While in theory the framework allows for more than two targets, it is still restricted to a finite and clearly defined set of targets. It focuses on modeling the dependencies of multiple targets within the same text sample, while our approach focuses on learning stance detection from many samples with many different targets.

Representation Learning for Stance Detection In a target-specific setting, Ghosh et al. (2019) perform a systematic evaluation of stance detection approaches. They also evaluate BERT (Devlin et al., 2019) and find that it consistently outperforms previous approaches. However, they only experiment with a single-segment encoding of the input, preventing cross-target transfer of the model. Augenstein et al. (2016) propose a conditional encoding approach to encode both the target and the tweet as sequences. They use a bidirectional LSTM to condition the encoding of the tweets on the encoding of the target, and then apply a nonlinear projection on the conditionally encoded tweet. This allows them to train a model that can generalize to previously unseen targets.
3   The X-stance Dataset

3.1   Task Definition

The input provided by X-stance is two-fold: (A) a natural language question concerning a political issue; (B) a natural language commentary expressing a specific stance towards the question.

The label to be predicted is either ‘favor’ or ‘against’. This corresponds to a standard established by Mohammad et al. (2016a). However, X-stance differs from that dataset in that it lacks a ‘neither’ class; all comments express either a ‘favor’ or an ‘against’ position. The task posed by X-stance is thus a binary classification task.

As an evaluation metric we report the macro-average of the F1-score for ‘favor’ and the F1-score for ‘against’, similar to Mohammad et al. (2016b). We use this metric mainly to strengthen comparability with previous benchmarks.
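The following minimal sketch shows how this metric can be computed, assuming that gold and predicted labels are given as lists of the strings ‘favor’ and ‘against’ (the function names are ours, not part of the released code):

    def f1(gold, pred, label):
        # Precision, recall and F1 for one class.
        tp = sum(g == p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    def macro_f1(gold, pred):
        # Macro-average over the two classes, reported as a percentage.
        return 100 * (f1(gold, pred, 'favor') + f1(gold, pred, 'against')) / 2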
3.2   Data Collection

Provenance We downloaded the questions and answers via the Smartvote API². The downloaded data cover 175 communal, cantonal and national elections between 2011 and 2020.

    ² https://smartvote.ch

All candidates in an election who participate in Smartvote are asked the same set of questions, but depending on the locale they see translated versions of the questions. They can answer each question with either ‘yes’, ‘rather yes’, ‘rather no’, or ‘no’. They can supplement each answer with a comment of at most 500 characters.

The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election. A detailed documentation of the design of Smartvote and of the editing process of the questions is provided by Thurman and Gasser (2009).

Preprocessing We merged the two labels on each pole into a single label: ‘yes’ and ‘rather yes’ were combined into ‘favor’; ‘rather no’ and ‘no’ into ‘against’. This improves the consistency of the data and the comparability to previous stance detection datasets. We did not further preprocess the text of the comments.
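A minimal sketch of this mapping (the key strings are illustrative; the raw values returned by the API may differ):

    LABEL_MAP = {
        'yes': 'favor',
        'rather yes': 'favor',
        'rather no': 'against',
        'no': 'against',
    }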
Language Identification As the API does not provide the language of the comments, we employed a language identifier to annotate this information automatically. We used the langdetect library (Shuyo, 2010). For each respondent we classified all the comments jointly, assuming that respondents did not code-switch during the answering of the questionnaire.

We applied the identifier in a two-step approach. In the first run we allowed the identifier to output all 55 languages that it supports out of the box, plus Romansh, the fourth official language of Switzerland³. We found that no Romansh comments were detected and that all unexpected outputs were misclassifications of German, French or Italian comments. We further concluded that there are few or no Swiss German comments in the dataset; otherwise, some of them would have manifested themselves via misclassifications (e.g. as Dutch). In the second run, drawing on these conclusions, we restricted the identifier’s set of choices to English, French, German and Italian.

    ³ Namely the Rumantsch Grischun variety; the language profile was created using resources from the Zurich Parallel Corpus Collection (Graën et al., 2019) and the Quotidiana corpus (https://github.com/ProSvizraRumantscha/corpora).
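The joint classification can be sketched as follows. The Python port of the library does not expose custom language profiles directly, so the restriction applied in the second run is approximated here with a whitelist:

    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # make langdetect deterministic

    ALLOWED = {'en', 'fr', 'de', 'it'}  # choices allowed in the second run

    def classify_respondent(comments):
        # Classify all comments of one respondent jointly by concatenating
        # them, assuming the respondent did not code-switch.
        text = ' '.join(comments)
        candidates = [c for c in detect_langs(text) if c.lang in ALLOWED]
        return max(candidates, key=lambda c: c.prob).lang if candidates else None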
Filtering We pre-filtered the questions and answers to improve the quality of the dataset. To keep the domain of the data surveyable, we set a focus on national-level questions: all questions, and the corresponding answers, pertaining to national elections were included. In the context of communal and cantonal elections, candidates have answered both local questions and a subset of the national questions. Of those elections, we only considered answers to questions that had also been asked in a national election, and we used them only to augment the training set; the validation and test sets were restricted to answers from national elections.

We discarded the fewer than 20 comments classified as English. Furthermore, we discarded instances that met any of the following conditions (a sketch of these filters is given below):

   • The question is not a closed question or does not address a clearly defined political issue.
   • No comment was submitted by the candidate, or the comment is shorter than 50 characters.
   • The comment starts with “but” or a similar indicator that the comment is not self-contained.
   • The comment contains a URL.

In total, a fifth of the comments were filtered out.
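A sketch of the instance-level filters (the list of non-self-containment indicators is an illustrative assumption, not the exact list used for the dataset):

    import re

    MIN_COMMENT_LENGTH = 50
    NON_SELF_CONTAINED = ('aber', 'mais', 'ma', 'but')  # hypothetical indicators

    def keep_comment(comment):
        if comment is None or len(comment) < MIN_COMMENT_LENGTH:
            return False
        if comment.strip().lower().startswith(NON_SELF_CONTAINED):
            return False
        if re.search(r'https?://|www\.', comment):
            return False
        return True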


Topics The questions have been organized by the Smartvote editors into categories (such as “Economy”). We further consolidated the predefined categories into 12 broad topics (Table 1).

    Topic                           Questions    Answers
    Digitisation                            2       1168
    Economy                                23       6899
    Education                              16       7639
    Finances                               15       3980
    Foreign Policy                         16       4393
    Immigration                            19       6270
    Infrastructure & Environment           31       9590
    Security                               20       5193
    Society                                17       6275
    Welfare                                15       8508
    Total (training topics)               174     59 915
    Healthcare                             11       4711
    Political System                        9       2645
    Total (held-out topics)                20       7356

Table 1: Number of questions and answers per topic.

Compliance The dataset is shared under a CC BY-NC 4.0 license. Copyright remains with www.smartvote.ch. Given the sensitive nature of the data, we increase the anonymity of the respondents by hashing their IDs. No personal attributes of the respondents are included in the dataset. We provide a data statement (Bender and Friedman, 2018) in Appendix B.

3.3   Data Split

We held out the topics “Healthcare” and “Political System” from the training data and created a separate cross-topic test set that contains the questions and answers related to those topics.

Furthermore, in order to test cross-question generalization within previously seen topics, we manually selected 16 held-out questions that are distributed over the remaining 10 topics. We selected the held-out questions manually because we wanted to make sure that they are truly unseen and that no paraphrases of them are found in the training set.

We designated Italian as a test-only language, since relatively few comments have been written in Italian. From the remaining German and French data, we randomly selected a percentage of respondents as validation or test respondents.

As a result we obtained one training set, one validation set and four test sets. Their sizes are listed in Table 2. We did not consider test sets that are cross-lingual and cross-target at the same time, as they would have been too small to yield significant results.

          Intra-target            Cross-question             Cross-topic
          (new answers to         (new questions within
          known questions)        known topics)

    DE    Train: 33 850           Test: 3143                 Test: 5269
          Valid:  3479
          Test:   2871

    FR    Train: 11 790           Test: 1170                 Test: 1914
          Valid:  1284
          Test:   1055

    IT    Test:   1173            Test: (110)                Test: (173)

Table 2: Number of answer instances in the training, validation and test sets. The upper left corner represents a multilingually supervised task, where training, validation and test data are from exactly the same domain. The top-to-bottom axis gives rise to a cross-lingual transfer task, where a model trained on German and French is evaluated on Italian answers to the same questions. The left-to-right axis represents a continuous shift of domain: in the middle column, the model is tested on previously unseen questions that belong to the same topics as seen during training; in the right column, the model encounters unseen answers to unseen questions within an unseen topic. The two test sets in parentheses are too small for a significant evaluation.
[Figure 2 (chart): per-question proportions of the ‘favor’ class, grouped by topic (Digitisation, Economy, Education, Finances, Foreign Policy, Immigration, Infrastructure, Security, Society, Welfare; held-out: Healthcare, Political System), with the overall mean at about 50%.]

Figure 2: Proportion of ‘favor’ labels per question, grouped by topic. While the proportion of favorable answers varies from question to question, it is balanced overall.


3.4   Analysis

Some observations regarding the composition of X-stance can be made.

Class Distribution Figure 2 visualizes the proportion of ‘favor’ and ‘against’ stances for each target in the dataset. The ratio differs between questions but is relatively evenly distributed across the topics. In particular, the questions in the held-out topics (with a ‘favor’ ratio of 49.4%) have a similar class distribution as the questions in the other topics (with a ‘favor’ ratio of 50.0%).

Linguistic Properties Not every question is unique; some questions are paraphrases describing the same political issue. For example, in the 2015 election, the candidates were asked: “Should the consumption of cannabis as well as its possession for personal use be legalised?” Four years later they were asked: “Should cannabis use be legalized?” However, we do not see any need to consolidate those duplicates because they contribute to the diversity of the training data.

We further observe that while some questions in the dataset are quite short, others are rather convoluted. For example, a typical long question reads:

    Some 1% of direct payments to Swiss agriculture currently go to organic farming operations. Should this proportion be increased at the expense of standard farming operations as part of Switzerland’s 2014-2017 agricultural policy?

Such longer questions might be more challenging to process semantically.

Languages The X-stance dataset has more German samples than French samples. The language ratio of about 3:1 is consistent across all training and test sets. Given the two languages, it is possible to either train two monolingual models or to train a single model in a multi-source setup (McDonald et al., 2011). We choose a multi-source baseline because M-BERT is known to benefit from multilingual training data both in a supervised and in a cross-lingual scenario (Kondratyuk and Straka, 2019).

4   Baseline Experiments

We evaluate four baselines to obtain an impression of the difficulty of the task.

4.1   Majority Class Baselines

The first pair of baselines uses the most frequent class in the training set for prediction. Specifically, the global majority class baseline predicts the most frequent class across all training targets, while the target-wise majority class baseline predicts the class that is most frequent for a given target question. The latter can only be applied to the intra-target test sets.
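A sketch of both baselines, assuming the data sets are lists of dicts with ‘question_id’ and ‘label’ keys (an assumption about the data format):

    from collections import Counter

    def majority_baselines(train, test):
        # Global majority class over all training targets.
        global_majority = Counter(ex['label'] for ex in train).most_common(1)[0][0]
        # Majority class per target question.
        per_target = {}
        for ex in train:
            per_target.setdefault(ex['question_id'], Counter())[ex['label']] += 1
        global_preds = [global_majority for _ in test]
        # Only defined for targets seen in training (intra-target test sets).
        target_preds = [per_target[ex['question_id']].most_common(1)[0][0]
                        for ex in test]
        return global_preds, target_preds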
4.2   Bag-of-Words Baseline

As a second baseline, we train a fastText bag-of-words linear classifier (Joulin et al., 2017). For each comment, we select the translation of the question that matches its language and concatenate it to the comment. We tokenize the text using the Europarl preprocessing tools (Koehn, 2005). The ‘against’ class was slightly upsampled in the training data so that the classes are balanced when summing over all questions and topics.
We use the standard settings provided by the fastText library.⁴ Optimal hyperparameters from the following ranges were determined based on the validation accuracy:

   • Learning rate: 0.1, 0.2, 1
   • Number of epochs: 5, 50

    ⁴ https://github.com/facebookresearch/fastText
The word vectors were set to a size of 300. We do not initialize them with pre-trained multilingual embeddings, since preliminary experiments did not show a beneficial effect. A sketch of the training call follows.
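This minimal sketch assumes the question–comment pairs have been written to a fastText-formatted file (the file name is hypothetical):

    import fasttext

    # Each training line pairs the question (in the comment's language) with
    # the comment, prefixed by a label, e.g.:
    #   __label__favor <question text> <comment text>
    model = fasttext.train_supervised(
        input='xstance.train.txt',  # hypothetical file name
        lr=0.1,     # one value from the searched range {0.1, 0.2, 1}
        epoch=5,    # one value from the searched range {5, 50}
        dim=300,    # word vector size used for this baseline
    )
    labels, probs = model.predict('<question tokens> <comment tokens>')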
4.3   Multilingual BERT Baseline

As our main baseline model we fine-tune multilingual BERT (M-BERT) on the task (Devlin et al., 2019). M-BERT has been pre-trained jointly on 104 languages⁵ and has established itself as a state of the art for various multilingual tasks (Wu and Dredze, 2019; Pires et al., 2019). Within the field of stance detection, BERT can outperform both feature-based and other neural approaches in a monolingual English setting (Ghosh et al., 2019).

    ⁵ https://github.com/google-research/bert/blob/master/multilingual.md

Architecture In the context of BERT, we interpret the X-stance task as sequence pair classification, inspired by natural language inference tasks (Bowman et al., 2015). We follow the procedure outlined by Devlin et al. (2019) for such tasks. We designate the question as segment A and the comment as segment B. The two segments are separated by the special token [SEP], and the special token [CLS] is prepended to the sequence. The final hidden state corresponding to [CLS] is then classified by a linear layer.

We fine-tune the full model with a cross-entropy loss, using the AllenNLP library (Gardner et al., 2018) as a basis for our implementation.
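To illustrate the input encoding, here is an analogous sketch with the HuggingFace tokenizer; our implementation is based on AllenNLP, so this is a reconstruction rather than the original code:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

    # Segment A: the question; segment B: the comment (example from Figure 1).
    encoded = tokenizer(
        'Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?',
        'Mit unserem zweitwichtigsten Handelspartner sollten wir ein '
        'Freihandelsabkommen haben.',
        truncation=True,
        max_length=512,
    )
    # encoded['input_ids'] corresponds to [CLS] question [SEP] comment [SEP];
    # the final hidden state of [CLS] is fed to a linear classification layer.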
                                                       ments. For example, the Italian test instances are
Training As above, we balanced out the num-
                                                       combined with the Italian version of the questions,
ber of classes in the training set. We use a batch
                                                       even though during training the model has only
size of 16 and a maximum sequence length of 512
                                                       ever seen the German and French version of them.
subwords, and performed a grid search over the
                                                          An alternative concept is vertical language con-
following hyperparameters based on the validation
                                                       sistency, whereby the questions are consistently
accuracy:
                                                       presented in one language, regardless of the com-
  • Learning rate: 5e-5, 3e-5, 2e-5                    ment. To test whether horizontal or vertical con-
  • Number of epochs: 3, 4                             sistency is more helpful, we train and evaluate
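A sketch of this optimization recipe; AdamW stands in here for Adam with decoupled L2 weight decay, and the number of training steps is a placeholder:

    import torch
    from transformers import (BertForSequenceClassification,
                              get_linear_schedule_with_warmup)

    model = BertForSequenceClassification.from_pretrained(
        'bert-base-multilingual-cased', num_labels=2  # 'favor' vs. 'against'
    )
    num_steps = 1000  # placeholder: total number of fine-tuning steps

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=2e-5,  # one value from the searched range {5e-5, 3e-5, 2e-5}
        betas=(0.9, 0.999),
        weight_decay=0.01,
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_steps),  # warmup over the first 10%
        num_training_steps=num_steps,           # followed by a linear decay
    )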
Results Table 3 shows the results for the cross-lingual setting. M-BERT performs consistently better than the other baselines. Even the zero-shot performance in Italian, while significantly lower than the supervised scores, is much better than the target-wise majority class baseline.

                                    DE      FR      IT
    Majority class (global)       33.1    34.8    34.4
    Majority class (target-wise)  60.8    65.1    59.3
    fastText                      69.9    71.2    53.7
    M-BERT                        76.8    76.6    70.2

Table 3: Baseline scores in the cross-lingual setting. No Italian samples were seen during training, making this a case of zero-shot cross-lingual transfer. The scores are reported as the macro-average of the F1-scores for ‘favor’ and for ‘against’.

Results for the cross-target setting are given in Table 4. Similar to the cross-lingual setting, model performance drops, but M-BERT remains the strongest baseline and easily surpasses the majority class baselines. Furthermore, the cross-question score of M-BERT is slightly lower than the cross-topic score.

                                     Intra-target         Cross-question        Cross-topic
                                   DE    FR   Mean       DE    FR   Mean      DE    FR   Mean
    Majority class (global)      33.1  34.8   33.9     36.4  37.9   37.1    32.1  33.8   32.9
    Majority class (target-wise) 60.8  65.1   62.9        -     -      -       -     -      -
    fastText                     69.9  71.2   70.5     62.0  65.6   63.7    63.1  65.5   64.3
    M-BERT                       76.8  76.6   76.6     68.5  68.4   68.4    68.9  70.9   69.9

Table 4: Baseline scores in the cross-target setting. For each test set we separately report a German and a French score, as well as their harmonic mean.
4.4   How Important is Consistent Language?

The default setup preserves horizontal language consistency: the language of the questions always corresponds to the language of the comments. For example, the Italian test instances are combined with the Italian version of the questions, even though during training the model has only ever seen the German and French versions of them.

An alternative concept is vertical language consistency, whereby the questions are consistently presented in one language, regardless of the comment. To test whether horizontal or vertical consistency is more helpful, we train and evaluate M-BERT on a dataset variant where all questions are in their English version. We chose English as a lingua franca because it had the largest share of data during the pre-training of M-BERT.


Results are shown in Table 5. While the effect is negligible in most settings, cross-lingual performance increases when all questions are in English.

                                  Supervised   Cross-Lingual   Cross-Question   Cross-Topic
    M-BERT                              76.6            70.2             68.4          69.9
    — with English questions            76.1            71.7             68.5          69.4
    — with missing questions            73.2            67.1             67.8          69.3
    — with missing comments             64.2            60.5             51.1          48.6
    — with random questions             56.0            52.5             47.7          48.5
    — with random comments              50.7            50.7             48.2          48.7
    — with target embeddings            70.1            66.0             68.4          69.0

Table 5: Results for additional experiments. The cross-lingual score is the F1-score on the Italian test set. For the supervised, cross-question and cross-topic settings we report the harmonic mean of the German and French scores.
4.5   How Important are the Segments?

In order to rule out that only the questions or only the comments are sufficient to optimally solve the task, we conduct additional experiments:

   • Only use a single segment containing the comment, removing the questions from the training and test data (missing questions).
   • Only use the question and remove the comment (missing comments).

In both cases the performance decreases across all evaluation settings (Table 5). The loss in performance is much higher when comments are missing, indicating that the comments contain the most important information about stance. As can be expected, the score achieved without comments differs only slightly from the target-wise majority class baseline.

But there is also a loss in performance when the questions are missing, which underlines the importance of pairing both pieces of text. The effect of missing questions is especially strong in the supervised and cross-lingual settings. To illustrate this, we provide in Table A8 some examples of comments that occur with multiple different targets in the training set. Those examples show why the target can be essential for disambiguating a stance detection problem. On the other hand, the effect of omitting the questions is less pronounced in the cross-target settings.

The above single-segment experiments tell us that both the comment and the question provide crucial information. But it is possible that the M-BERT model, even though trained on both segments, mainly looks at a single segment at test time. To rule this out, we probe the model with randomized data at test time (a sketch follows the list):

   • Test the model on versions of the test sets where the comments remain in place but the questions are shuffled randomly (random questions). We make sure that the random questions come from the same test set and language as the original questions.
   • Keep the questions in place and randomize the comments (random comments). Again, we shuffle the comments only within test set boundaries.
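A sketch of how such probe sets can be constructed, assuming each test set is a list of dicts with ‘question’ and ‘comment’ keys (an assumption about the data format):

    import random

    def randomize(test_set, field, seed=42):
        # Return a copy of the test set in which one segment is shuffled.
        # Shuffling stays within the given test set, and hence within one
        # language.
        rng = random.Random(seed)
        values = [ex[field] for ex in test_set]
        rng.shuffle(values)
        return [{**ex, field: value} for ex, value in zip(test_set, values)]

    # Usage: randomize(test_set, 'question') or randomize(test_set, 'comment')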
The results in Table 5 show that the performance of the model decreases in both cases, confirming that it learns to take both segments into account.

4.6   How Important are Spelled-Out Targets?

Finally, we test whether the target really needs to be represented by natural language (e.g. “Do you support X?”). An alternative is to represent the target with a trainable embedding instead.

In order to fit target embeddings smoothly into our architecture, we represent each target type with a different reserved symbol from the M-BERT vocabulary. Segment A is then set to this symbol instead of a natural language question.
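A sketch of this variant; the use of ‘[unusedN]’ placeholder tokens is an illustrative assumption about which reserved symbols are chosen:

    def build_target_symbols(examples):
        # Map each question ID to a reserved placeholder token, assuming
        # enough such tokens exist in the M-BERT vocabulary.
        question_ids = sorted({ex['question_id'] for ex in examples})
        return {qid: '[unused{}]'.format(i + 1)
                for i, qid in enumerate(question_ids)}

    def encode_with_target_symbol(tokenizer, example, target_symbol):
        # Segment A is a single reserved symbol instead of the question
        # text; the symbol's embedding is learned during fine-tuning.
        return tokenizer(
            target_symbol[example['question_id']],
            example['comment'],
            truncation=True,
            max_length=512,
        )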
The results for this experiment are listed in the bottom row of Table 5. An M-BERT model that learns target embeddings instead of encoding a question performs clearly worse in the supervised and cross-lingual settings. From this we conclude that spelled-out natural language questions provide important linguistic detail that can help in stance detection.

5   Discussion

Our experiments show that M-BERT achieves a reasonable accuracy on X-stance, outperforming the majority class baselines and the fastText classifier.


To put the supervised score into context, we list scores that variants of BERT have achieved on other stance detection datasets in Table 6. It seems that the supervised part of X-stance has a similar difficulty as the SemEval-2016 (Mohammad et al., 2016a) or MPCHI (Sen et al., 2018) datasets, on which BERT has previously been evaluated.

    Dataset         Evaluation             Score
    SemEval-2016    Ghosh et al. (2019)     75.1
    MPCHI           Ghosh et al. (2019)     75.6
    X-stance        this paper              76.6

Table 6: Performance of BERT-like models on different supervised stance detection benchmarks.

On the other hand, in the cross-lingual and cross-target settings, the mean score drops by 6–8 percentage points compared to the supervised setting; while zero-shot transfer is possible to a degree, it can still be improved.

The additional experiments (Table 5) validate the results and show that the sequence-pair classification approach to stance detection is justified.

It is interesting to see what errors the M-BERT model makes. Table A7 presents instances where the model predicts the wrong label with high confidence. These examples indicate that many comments express their stance only on a very implicit level, and thus hint at a potential weakness of the dataset. Because on the voting advice platform the categorical label is explicitly shown to readers in addition to the comments, the comments do not need to express the stance explicitly.

Manual annotation could eliminate very implicit samples in a future version of the dataset. However, the sheer size and breadth of the dataset could not realistically be achieved with manual annotation, and, in our view, largely compensates for the implicitness of the texts.

6   Conclusion

We have presented a new dataset for political stance detection called X-stance. The dataset extends over a broad range of topics and issues regarding national Swiss politics. This diversity of topics opens up an opportunity to further study multi-target learning. Moreover, being partly Swiss Standard German, partly French and partly Italian, the dataset promotes a multilingual approach to stance detection.

By compiling formal commentary by politicians on political questions, we add a new text genre to the field of stance detection. We also propose a question–answer format that allows us to condition stance detection models on a target naturally.

Our baseline results with multilingual BERT show that the model has some capability to perform zero-shot transfer to unseen languages and to unseen targets (both within a topic and to unseen topics). However, there is a gap in performance that future work could address. We expect that the X-stance dataset could furthermore be a valuable resource for fields such as argument mining, argument search or topic classification.

Acknowledgments

This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). We would like to thank Isabelle Augenstein, Anne Göhring and the anonymous reviewers for helpful feedback.
References

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 876–885, Austin, Texas. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Simon Clematide, Stefan Gindl, Manfred Klenner, Stefanos Petrakis, Robert Remus, Josef Ruppenhofer, Ulli Waltinger, and Michael Wiegand. 2012. MLSA — a multi-layered reference corpus for German sentiment analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3551–3556, Istanbul, Turkey. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, and Saptarshi Ghosh. 2019. Stance detection in web and social media: A comparative study. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 75–87. Springer.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling large parallel corpora: The Zurich Parallel Corpus Collection. In Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC), pages 1–8. Leibniz-Institut für Deutsche Sprache.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Manfred Klenner, Don Tuggener, and Simon Clematide. 2017. Stance detection in Facebook posts of a German right-wing party. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 31–40, Valencia, Spain. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. Machine Translation Summit 2005, pages 79–86.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys, 53(1).

Mirko Lai, Alessandra Teresa Cignarella, Delia Irazú Hernández Farías, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. Multilingual stance detection in social media political debates. Computer Speech & Language, page 101075.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In International Conference on Applications of Natural Language to Information Systems, pages 15–27. Springer.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A dataset for detecting stance in tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3945–3952, Portorož, Slovenia. European Language Resources Association (ELRA).

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California. Association for Computational Linguistics.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
  How multilingual is multilingual BERT? In Pro-
  ceedings of the 57th Annual Meeting of the Asso-
  ciation for Computational Linguistics, pages 4996–
  5001, Florence, Italy. Association for Computa-
  tional Linguistics.
Anirban Sen, Manjira Sinha, Sandya Mannarswamy,
  and Shourya Roy. 2018. Stance classification of
  multi-perspective consumer health information. In
  Proceedings of the ACM India Joint International
  Conference on Data Science and Management of
  Data, pages 273–281.
Nakatani Shuyo. 2010. Language detection library for Java.
Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu.
  2017. A dataset for multi-target stance detection. In
  Proceedings of the 15th Conference of the European
  Chapter of the Association for Computational Lin-
  guistics: Volume 2, Short Papers, pages 551–557,
  Valencia, Spain. Association for Computational Lin-
  guistics.
Mariona Taulé, M. Antònia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2017, volume 1881, pages 157–177.
Mariona Taulé, Francisco Rangel, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the task on multimodal stance detection in tweets on Catalan #1Oct referendum. In 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IberEval 2018, volume 2150, pages 149–166.
James Thurman and Urs Gasser. 2009. Three case studies from Switzerland: Smartvote. Berkman Center Research Publications.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, be-
  cas: The surprising cross-lingual effectiveness of
  BERT. In Proceedings of the 2019 Conference on
  Empirical Methods in Natural Language Processing
  and the 9th International Joint Conference on Natu-
  ral Language Processing (EMNLP-IJCNLP), pages
  833–844, Hong Kong, China. Association for Com-
  putational Linguistics.
A     Examples

Question: Befürworten Sie eine vollständige Liberalisierung der Geschäftsöffnungszeiten? [Are you in favour of a complete liberalisation of business hours for shops?]
Comment: Ausser Sonntag. Dies sollte ein Ruhetag bleiben können. [Except Sunday. That should remain a day of rest.]
Gold label: FAVOR. Predicted probability of the gold label: 0.001

Question: Soll die Schweiz innerhalb der nächsten vier Jahre EU-Beitrittsverhandlungen aufnehmen? [Should Switzerland embark on negotiations in the next four years to join the EU?]
Comment: In den nächsten vier Jahren ist dies wohl unrealistisch. [For the next four years this is probably unrealistic.]
Gold label: FAVOR. Predicted probability of the gold label: 0.005

Question: Befürworten Sie einen Ausbau des Landschaftsschutzes? [Are you in favour of extending landscape protection?]
Comment: Wenn es darum geht erneuerbare Energien zu fördern, ist sogar eine Lockerung angebracht. [When it comes to promoting renewable energy, even a relaxation is appropriate.]
Gold label: AGAINST. Predicted probability of the gold label: 0.006

Question: La Suisse devrait-elle engager des négociations pour un accord de libre échange avec les Etats-Unis? [Should Switzerland start negotiations with the USA on a free trade agreement?]
Comment: Il faut cependant en parallèle veiller à ce que la Suisse ne soit pas mise de côté par les Etats-Unis ! [At the same time it must be ensured that Switzerland is not sidelined by the United States!]
Gold label: AGAINST. Predicted probability of the gold label: 0.010

Table A7: Some classification errors where the predicted probability of the correct label is especially low. The examples have been taken from the validation set.




Comment: Ich will offene Grenzen für Waren und selbstverantwortliche mündige Bürger. Der Staat hat kein Recht, uns einzuschränken. [I want open borders for goods and responsible citizens. The state has no right to restrict us.]
Favorable towards: Soll die Schweiz mit den USA Verhandlungen über ein Freihandelsabkommen aufnehmen? [Should Switzerland start negotiations with the USA on a free trade agreement?]
But against: Soll die Schweiz das Schengen-Abkommen mit der EU kündigen und wieder verstärkte Personenkontrollen direkt an der Grenze einführen? [Should Switzerland terminate the Schengen Agreement with the EU and reintroduce increased identity checks directly at the border?]

Comment: Hier gilt der Grundsatz der Eigenverantwortung und Selbstbestimmung des Unternehmens! [The principle of personal responsibility and corporate self-regulation applies here!]
Favorable towards: Sind Sie für eine vollständige Liberalisierung der Ladenöffnungszeiten? [Are you in favour of the complete liberalization of shop opening times?]
But against: Würden Sie die Einführung einer Frauenquote in Verwaltungsräten börsenkotierter Unternehmen befürworten? [Would you support the introduction of a women's quota for the boards of directors of listed companies?]

Table A8: Two comments that imply a positive stance towards one target issue but a negative stance towards another target issue. Such cases can be found in the dataset because respondents have copy-pasted some comments. These examples have been extracted from the training set.
B    Data Statement
Curation rationale In order to study the automatic detection of stances on political issues, questions
and candidate responses on the voting advice application smartvote.ch were downloaded. Mainly
data pertaining to national-level issues were included to reduce variability.
Language variety The training set consists of questions and answers in Swiss Standard German and
Swiss French (74.1% de-CH; 25.9% fr-CH). The test sets also contain questions and answers in Swiss
Italian (67.1% de-CH; 24.7% fr-CH; 8.2% it-CH). The questions have also been translated into English.
Speaker demographic (answers)

    • Candidates for communal, cantonal or national elections in Switzerland who have filled out an
      online questionnaire.

    • Age: 18 or older – mixed.

    • Gender: Unknown – mixed.

    • Race/ethnicity: Unknown – mixed.

    • Native language: Unknown – mixed.

    • Socioeconomic status: Unknown – mixed.

    • Different speakers represented: 7581.

    • Presence of disordered speech: Unknown.

Speech situation

    • The questions were edited and translated by political scientists for a public voting advice website.

    • The answers were written between 2011 and 2020 by the users of the website.

Text characteristics Questions, answers, arguments and comments regarding political issues.