X-stance: A Multilingual Multi-Target Dataset for Stance Detection

Jannis Vamvas¹  Rico Sennrich¹,²
¹ Department of Computational Linguistics, University of Zurich
² School of Informatics, University of Edinburgh
{vamvas,sennrich}@cl.uzh.ch

Abstract

We extract a large-scale stance detection dataset from comments written by candidates of elections in Switzerland. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67 000 comments on more than 150 political issues (targets). Unlike stance detection models that have specific target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.

1 Introduction

In recent years many datasets have been created for the task of automated stance detection, advancing natural language understanding systems for political science, opinion research and other application areas. Typically, such benchmarks (Mohammad et al., 2016a) are composed of short pieces of text commenting on politicians or public issues and are manually annotated with their stance towards a target entity (e.g. Climate Change, or Trump). However, they are limited in scope on multiple levels (Küçük and Can, 2020).

First of all, it is questionable how well current stance detection methods perform in a cross-lingual setting, as the multilingual datasets available today are relatively small, and specific to a single target (Taulé et al., 2017, 2018). Furthermore, specific models tend to be developed for each single target or pair of targets (Sobhani et al., 2017). Concerns have been raised that cross-target performance is often considerably lower than fully supervised performance (Küçük and Can, 2020).

In this paper we propose a much larger dataset that combines multilinguality and a multitude of topics and targets. X-stance comprises more than 150 questions about Swiss politics and more than 67k answers given by candidates running for political office in Switzerland. Questions are available in four languages: English, Swiss Standard German, French, and Italian. The language of a comment depends on the candidate's region of origin.

We have extracted the data from the voting advice application Smartvote. Candidates respond to questions mainly in categorical form (yes / rather yes / rather no / no). They can also submit a free-text comment to justify or explain their categorical answer. An example is given in Figure 1.

We transform the dataset into a stance detection task by interpreting the question as a natural-language representation of the target, and the commentary as the input to be classified.

The dataset is split into a multilingual training set and into several test sets to evaluate zero-shot cross-lingual and cross-target transfer. To provide a baseline, we fine-tune a multilingual BERT model (Devlin et al., 2019) on X-stance. We show that the baseline accuracy is comparable to previous stance detection benchmarks while leaving ample room for improvement. In addition, the model can generalize to a degree both cross-lingually and in a cross-target setting.

We have made the dataset and the code for reproducing the baseline models publicly available.¹
¹ http://doi.org/10.5281/zenodo.3831317
Question #3414 – Available in all languages
EN: Should Switzerland strive for a free trade agreement with the USA?
DE: Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?
FR: La Suisse devrait-elle conclure un accord de libre-échange avec les Etats-Unis?

Comment #26597 (German) – Label: FAVOR
Mit unserem zweitwichtigsten Handelspartner sollten wir ein Freihandelsabkommen haben.
[With our second most important trading partner we should have a free trade agreement.]

Comment #21421 (French) – Label: AGAINST
Les accords de libre-échange menacent la qualité des produits suisses.
[Free trade agreements threaten the quality of Swiss products.]

Figure 1: Example of a question and two answers in the X-stance dataset. The answers were submitted by electoral candidates on a voting advice website. The author of the German comment was in favor of the issue; the author of the French comment was against it. Both authors use comments to explain their respective stance.

2 Related Work

Multilingual Stance Detection  In the context of the IberEval shared tasks, two related multilingual datasets have been created (Taulé et al., 2017, 2018). Both are collections of annotated Spanish and Catalan tweets. Crucially, the tweets in both languages focus on the same issue (Catalan independence); given this fact, they are the first truly multilingual stance detection datasets known to us.

With regard to the languages covered by X-stance, only monolingual datasets seem to be available. For French, a collection of tweets on French presidential candidates has been annotated with stance (Lai et al., 2020). Similarly, two datasets of Italian tweets on the occasion of the 2016 constitutional referendum have been created (Lai et al., 2018, 2020). With regard to German, a corpus of 270 sentences has been annotated with fine-grained stance and attitude information (Clematide et al., 2012). Furthermore, fine-grained stance detection has been qualitatively studied on a large corpus of Facebook posts (Klenner et al., 2017).

Multi-Target Stance Detection  The SemEval-2016 task on detecting stance in tweets (Mohammad et al., 2016b) offers data concerning multiple targets (Atheism, Climate Change, Feminism, Hillary Clinton, and Abortion). In the supervised subtask A, participants tended to develop a target-specific model for each of those targets. In subtask B, cross-target transfer to the target "Donald Trump" was tested, for which no annotated training data were provided. While this required the development of more universal models, their performance was generally much lower.

Sobhani et al. (2017) introduced a multi-target stance dataset which provides two targets per instance. For example, a model designed in this framework is supposed to simultaneously classify a tweet with regard to Clinton and with regard to Trump. While in theory the framework allows for more than two targets, it is still restricted to a finite and clearly defined set of targets. It focuses on modeling the dependencies of multiple targets within the same text sample, while our approach focuses on learning stance detection from many samples with many different targets.

Representation Learning for Stance Detection  In a target-specific setting, Ghosh et al. (2019) perform a systematic evaluation of stance detection approaches. They also evaluate BERT (Devlin et al., 2019) and find that it consistently outperforms previous approaches. However, they only experiment with a single-segment encoding of the input, preventing cross-target transfer of the model.
mad et al., 2016b) offers data concerning multi- (2016) propose a conditional encoding approach ple targets (Atheism, Climate Change, Feminism, to encode both the target and the tweet as se- Hillary Clinton, and Abortion). In the supervised quences. They use a bidirectional LSTM to condi- subtask A, participants tended to develop a target- tion the encoding of the tweets on the encoding of specific model for each of those targets. In sub- the target, and then apply a nonlinear projection on Topic Questions Answers depending on the locale they see translated ver- sions of the questions. They can answer each Digitisation 2 1168 question with either ‘yes’, ‘rather yes’, ‘rather no’, Economy 23 6899 or ‘no’. They can supplement each answer with a Education 16 7639 comment of at most 500 characters. Finances 15 3980 The questions asked on Smartvote have been Foreign Policy 16 4393 edited by a team of political scientists. They are Immigration 19 6270 intended to cover a broad range of political is- Infrastructure & Environment 31 9590 sues relevant at the time of the election. A de- Security 20 5193 tailed documentation of the design of Smartvote Society 17 6275 and the editing process of the questions is provided Welfare 15 8508 by Thurman and Gasser (2009). Total (training topics) 174 59 915 Healthcare 11 4711 Preprocessing We merged the two labels on Political System 9 2645 each pole into a single label: ‘yes’ and ‘rather yes’ Total (held-out topics) 20 7356 were combined into ‘favor’; ‘rather no’, or ‘no’ into ‘against‘. This improves the consistency of Table 1: Number of questions and answers per topic. the data and the comparability to previous stance detection datasets. We did not further preprocess the text of the comments. the conditionally encoded tweet. This allows them to train a model that can generalize to previously Language Identification As the API does not unseen targets. provide the language of comments, we employed a language identifier to automatically annotate 3 The X-stance Dataset this information. We used the langdetect li- brary (Shuyo, 2010). For each responder we clas- 3.1 Task Definition sified all the comments jointly, assuming that re- The input provided by X-stance is two-fold: (A) sponders did not switch code during the answering a natural language question concerning a politi- of the questionnaire. cal issue; (B) a natural language commentary on We applied the identifier in a two-step approach. a specific stance towards the question. In the first run we allowed the identifier to out- The label to be predicted is either ‘favor’ or put all 55 languages that it supports out of the ‘against‘. This corresponds to a standard estab- box, plus Romansh, the fourth official language in lished by Mohammad et al. (2016a). However, Switzerland3 . We found that no Romansh com- X -stance differs from that dataset in that it lacks a ments were detected and that all unexpected out- ‘neither’ class; all comments refer to either a ‘fa- puts were misclassifications of German, French or vor’ or an ‘against‘ position. The task posed by Italian comments. We further concluded that little X -stance is thus a binary classification task. or no Swiss German comments are in the dataset; As an evaluation metric we report the macro- otherwise, some of them would have manifested average of the F1-score for ‘favor’ and the F1- themselves via misclassifications (e.g. as Dutch). score for ‘against’, similar to Mohammad et al. In the second run, drawing from these conclu- (2016b). 
3.2 Data Collection

Provenance  We downloaded the questions and answers via the Smartvote API.² The downloaded data cover 175 communal, cantonal and national elections between 2011 and 2020.
² https://smartvote.ch

All candidates in an election who participate in Smartvote are asked the same set of questions, but depending on the locale they see translated versions of the questions. They can answer each question with either 'yes', 'rather yes', 'rather no', or 'no'. They can supplement each answer with a comment of at most 500 characters.

The questions asked on Smartvote have been edited by a team of political scientists. They are intended to cover a broad range of political issues relevant at the time of the election. A detailed documentation of the design of Smartvote and of the editing process of the questions is provided by Thurman and Gasser (2009).

Preprocessing  We merged the two labels on each pole into a single label: 'yes' and 'rather yes' were combined into 'favor'; 'rather no' and 'no' into 'against'. This improves the consistency of the data and the comparability to previous stance detection datasets. We did not further preprocess the text of the comments.

Language Identification  As the API does not provide the language of the comments, we employed a language identifier to annotate this information automatically. We used the langdetect library (Shuyo, 2010). For each respondent we classified all comments jointly, assuming that respondents did not switch languages while answering the questionnaire.

We applied the identifier in a two-step approach. In the first run we allowed the identifier to output all 55 languages that it supports out of the box, plus Romansh, the fourth official language in Switzerland.³ We found that no Romansh comments were detected and that all unexpected outputs were misclassifications of German, French or Italian comments. We further concluded that few or no Swiss German comments are in the dataset; otherwise, some of them would have manifested themselves via misclassifications (e.g. as Dutch). In the second run, drawing on these conclusions, we restricted the identifier's set of choices to English, French, German and Italian.
³ Namely the Rumantsch Grischun variety; the language profile was created using resources from the Zurich Parallel Corpus Collection (Graën et al., 2019) and the Quotidiana corpus (https://github.com/ProSvizraRumantscha/corpora).
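A minimal sketch of the second, restricted run is shown below. It uses the Python port of langdetect rather than the original Java library, and it emulates the restricted set of choices by filtering the candidate list; the function and variable names are ours.

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic
ALLOWED = {"en", "fr", "de", "it"}  # choices allowed in the second run

def identify_language(comments):
    """Classify all comments of one respondent jointly, assuming that the
    respondent did not switch languages during the questionnaire."""
    candidates = detect_langs(" ".join(comments))  # sorted by probability
    for candidate in candidates:
        if candidate.lang in ALLOWED:
            return candidate.lang
    return None  # no allowed language detected; inspect manually
```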
Filtering  We pre-filtered the questions and answers to improve the quality of the dataset. To keep the domain of the data surveyable, we set a focus on national-level questions. Therefore, all questions and corresponding answers pertaining to national elections were included.

In the context of communal and cantonal elections, candidates have answered both local questions and a subset of the national questions. Of those elections, we only considered answers to questions that had also been asked in a national election. They were only used to augment the training set, while the validation and test sets were restricted to answers from national elections.

We discarded the fewer than 20 comments classified as English. Furthermore, we discarded instances that met any of the following conditions:

• The question is not a closed question or does not address a clearly defined political issue.
• No comment was submitted by the candidate, or the comment is shorter than 50 characters.
• The comment starts with "but" or a similar indicator that the comment is not self-contained.
• The comment contains a URL.

In total, a fifth of the comments were filtered out.

Topics  The questions have been organized by the Smartvote editors into categories (such as "Economy"). We further consolidated the predefined categories into 12 broad topics (Table 1).

Topic                           Questions   Answers
Digitisation                         2        1168
Economy                             23        6899
Education                           16        7639
Finances                            15        3980
Foreign Policy                      16        4393
Immigration                         19        6270
Infrastructure & Environment        31        9590
Security                            20        5193
Society                             17        6275
Welfare                             15        8508
Total (training topics)            174      59 915
Healthcare                          11        4711
Political System                     9        2645
Total (held-out topics)             20        7356

Table 1: Number of questions and answers per topic.

Compliance  The dataset is shared under a CC BY-NC 4.0 license. Copyright remains with www.smartvote.ch. Given the sensitive nature of the data, we increase the anonymity of the data by hashing the respondents' IDs. No personal attributes of the respondents are included in the dataset. We provide a data statement (Bender and Friedman, 2018) in Appendix B.

3.3 Data Split

We held out the topics "Healthcare" and "Political System" from the training data and created a separate cross-topic test set that contains the questions and answers related to those topics.

Furthermore, in order to test cross-question generalization performance within previously seen topics, we manually selected 16 held-out questions that are distributed over the remaining 10 topics. We selected the held-out questions manually because we wanted to make sure that they are truly unseen and that no paraphrases of the questions are found in the training set.

We designated Italian as a test-only language, since relatively few comments have been written in Italian. From the remaining German and French data we randomly selected a percentage of respondents as validation or test respondents.

As a result we obtained one training set, one validation set and four test sets. The sizes of the sets are listed in Table 2. We did not consider test sets that are cross-lingual and cross-target at the same time, as they would have been too small to yield significant results.

      Intra-target         Cross-question          Cross-topic
      (new answers to      (new questions          (new questions
      known questions)     within known topics)    within new topics)
DE    Train: 33 850        Test: 3143              Test: 5269
      Valid:  3479
      Test:   2871
FR    Train: 11 790        Test: 1170              Test: 1914
      Valid:  1284
      Test:   1055
IT    Test:   1173         Test: (110)             Test: (173)

Table 2: Number of answer instances in the training, validation and test sets. The upper left corner represents a multilingually supervised task, where training, validation and test data are from exactly the same domain. The top-to-bottom axis gives rise to a cross-lingual transfer task, where a model trained on German and French is evaluated on Italian answers to the same questions. The left-to-right axis represents a continuous shift of domain: in the middle column, the model is tested on previously unseen questions that belong to the same topics as seen during training; in the right column, the model encounters unseen answers to unseen questions within an unseen topic. The two test sets in parentheses are too small for a significant evaluation.

3.4 Analysis

Some observations regarding the composition of X-stance can be made.

Class Distribution  Figure 2 visualizes the proportion of 'favor' and 'against' stances for each target in the dataset. The ratio differs between questions but is distributed relatively equally across the topics. In particular, the questions in the held-out topics (with a 'favor' ratio of 49.4%) have a similar class distribution to the questions in the other topics (with a 'favor' ratio of 50.0%).

[Figure 2 (plot) omitted: it shows, for each of the 12 topics, the proportion of the 'favor' class per question on a 0–100% scale, with the held-out topics and the overall mean marked.]
Figure 2: Proportion of 'favor' labels per question, grouped by topic. While the proportion of favorable answers varies from question to question, it is balanced overall.

Linguistic Properties  Not every question is unique; some questions are paraphrases describing the same political issue. For example, in the 2015 election the candidates were asked: "Should the consumption of cannabis as well as its possession for personal use be legalised?" Four years later they were asked: "Should cannabis use be legalized?" However, we do not see any need to consolidate those duplicates because they contribute to the diversity of the training data.

We further observe that while some questions in the dataset are quite short, other questions are rather convoluted. For example, a typical long question reads: "Some 1% of direct payments to Swiss agriculture currently go to organic farming operations. Should this proportion be increased at the expense of standard farming operations as part of Switzerland's 2014–2017 agricultural policy?" Such longer questions might be more challenging to process semantically.

Languages  The X-stance dataset has more German samples than French samples. The language ratio of about 3:1 is consistent across all training and test sets. Given the two languages, it is possible either to train two monolingual models or to train a single model in a multi-source setup (McDonald et al., 2011). We choose a multi-source baseline because M-BERT is known to benefit from multilingual training data both in a supervised and in a cross-lingual scenario (Kondratyuk and Straka, 2019).

4 Baseline Experiments

We evaluate four baselines to obtain an impression of the difficulty of the task.

4.1 Majority Class Baselines

The first pair of baselines uses the most frequent class in the training set for prediction. Specifically, the global majority class baseline predicts the most frequent class across all training targets, while the target-wise majority class baseline predicts the class that is most frequent for a given target question. The latter can only be applied to the intra-target test sets.
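The two baselines can be sketched as follows; this is our own illustrative code, not the released implementation, and the data format is an assumption.

```python
from collections import Counter

def fit_majority_baselines(train_instances):
    """train_instances: iterable of (question_id, label) pairs."""
    global_counts = Counter(label for _, label in train_instances)
    per_target = {}
    for question_id, label in train_instances:
        per_target.setdefault(question_id, Counter())[label] += 1

    global_majority = global_counts.most_common(1)[0][0]
    # The target-wise baseline is undefined for questions without training
    # answers, which is why it only applies to the intra-target test sets.
    target_majority = {q: counts.most_common(1)[0][0]
                       for q, counts in per_target.items()}
    return global_majority, target_majority
```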
4.2 Bag-of-Words Baseline

As a second baseline, we train a fastText bag-of-words linear classifier (Joulin et al., 2017). For each comment, we select the translation of the question that matches its language, and concatenate it to the comment. We tokenize the text using the Europarl preprocessing tools (Koehn, 2005). The 'against' class was slightly upsampled in the training data so that the classes are balanced when summing over all questions and topics.

We use the standard settings provided by the fastText library.⁴ Optimal hyperparameters from the following range were determined based on the validation accuracy:
• Learning rate: 0.1, 0.2, 1
• Number of epochs: 5, 50
The word vectors were set to a size of 300. We do not initialize them with pre-trained multilingual embeddings, since preliminary experiments did not show a beneficial effect.
⁴ https://github.com/facebookresearch/fastText
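A minimal sketch of this baseline with the fastText Python bindings is shown below. The input file path is hypothetical; each line is assumed to pair a label with the tokenized question (in the comment's language) concatenated to the tokenized comment.

```python
import fasttext

# Training file format (one instance per line), e.g.:
# __label__favor soll der bundesrat ein freihandelsabkommen ... mit unserem ...
model = fasttext.train_supervised(
    input="xstance.train.txt",  # hypothetical path
    lr=0.1,     # from the grid {0.1, 0.2, 1}
    epoch=5,    # from the grid {5, 50}
    dim=300,    # word vector size used above
)
labels, probabilities = model.predict(
    "la suisse devrait-elle conclure un accord de libre-échange ...")
```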
4.3 Multilingual BERT Baseline

As our main baseline model we fine-tune multilingual BERT (M-BERT; Devlin et al., 2019) on the task. M-BERT has been pre-trained jointly on 104 languages⁵ and has established itself as a state of the art for various multilingual tasks (Wu and Dredze, 2019; Pires et al., 2019). Within the field of stance detection, BERT can outperform both feature-based and other neural approaches in a monolingual English setting (Ghosh et al., 2019).
⁵ https://github.com/google-research/bert/blob/master/multilingual.md

Architecture  In the context of BERT we interpret the X-stance task as sequence pair classification, inspired by natural language inference tasks (Bowman et al., 2015). We follow the procedure outlined by Devlin et al. (2019) for such tasks. We designate the question as segment A and the comment as segment B. The two segments are separated by the special token [SEP], and the special token [CLS] is prepended to the sequence. The final hidden state corresponding to [CLS] is then classified by a linear layer. We fine-tune the full model with a cross-entropy loss, using the AllenNLP library (Gardner et al., 2018) as a basis for our implementation.

Training  As above, we balanced out the number of classes in the training set. We use a batch size of 16 and a maximum sequence length of 512 subwords, and performed a grid search over the following hyperparameters based on the validation accuracy:
• Learning rate: 5e-5, 3e-5, 2e-5
• Number of epochs: 3, 4
The grid search was repeated independently for every variant that we test in the following subsections. Furthermore, the standard recommendations for fine-tuning BERT were used: Adam with β1 = 0.9 and β2 = 0.999; an L2 weight decay of 0.01; a learning rate warmup over the first 10% of the steps; and a linear decay of the learning rate. A dropout probability of 0.1 was set on all layers.
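To make the sequence-pair encoding concrete, the following sketch reproduces it with the HuggingFace Transformers library instead of the AllenNLP implementation used for our experiments; the example texts are taken from Figure 1.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # 'favor' / 'against'

question = "Soll der Bundesrat ein Freihandelsabkommen mit den USA anstreben?"
comment = ("Mit unserem zweitwichtigsten Handelspartner "
           "sollten wir ein Freihandelsabkommen haben.")

# Encodes [CLS] question [SEP] comment [SEP] with segment IDs A and B.
inputs = tokenizer(question, comment, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # classification head over [CLS]
prediction = logits.argmax(dim=-1).item()
```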
Results  Table 3 shows the results for the cross-lingual setting. M-BERT performs consistently better than the previous baselines. Even the zero-shot performance in Italian, while significantly lower than the supervised scores, is much better than the target-wise majority class baseline.

                              DE     FR     IT
Majority class (global)      33.1   34.8   34.4
Majority class (target-wise) 60.8   65.1   59.3
fastText                     69.9   71.2   53.7
M-BERT                       76.8   76.6   70.2

Table 3: Baseline scores in the cross-lingual setting. No Italian samples were seen during training, making this a case of zero-shot cross-lingual transfer. The scores are reported as the macro-average of the F1-scores for 'favor' and for 'against'.

Results for the cross-target setting are given in Table 4. Similar to the cross-lingual setting, model performance drops in the cross-target setting, but M-BERT remains the strongest baseline and easily surpasses the majority class baselines. Furthermore, the cross-question score of M-BERT is slightly lower than the cross-topic score.

                              Intra-target        Cross-question      Cross-topic
                              DE    FR    Mean    DE    FR    Mean    DE    FR    Mean
Majority class (global)       33.1  34.8  33.9    36.4  37.9  37.1    32.1  33.8  32.9
Majority class (target-wise)  60.8  65.1  62.9    -     -     -       -     -     -
fastText                      69.9  71.2  70.5    62.0  65.6  63.7    63.1  65.5  64.3
M-BERT                        76.8  76.6  76.6    68.5  68.4  68.4    68.9  70.9  69.9

Table 4: Baseline scores in the cross-target setting. For each test set we separately report a German and a French score, as well as their harmonic mean.

4.4 How Important is Consistent Language?

The default setup preserves horizontal language consistency, in that the language of the questions always corresponds to the language of the comments. For example, the Italian test instances are combined with the Italian version of the questions, even though during training the model has only ever seen the German and French versions of them.

An alternative concept is vertical language consistency, whereby the questions are consistently presented in one language, regardless of the comment. To test whether horizontal or vertical consistency is more helpful, we train and evaluate M-BERT on a dataset variant where all questions are in their English version. We chose English as a lingua franca because it had the largest share of data during the pre-training of M-BERT.

Results are shown in Table 5. While the effect is negligible in most settings, cross-lingual performance increases when all questions are in English.

                              Supervised   Cross-Lingual   Cross-Question   Cross-Topic
M-BERT                        76.6         70.2            68.4             69.9
— with English questions      76.1         71.7            68.5             69.4
— with missing questions      73.2         67.1            67.8             69.3
— with missing comments       64.2         60.5            51.1             48.6
— with random questions       56.0         52.5            47.7             48.5
— with random comments        50.7         50.7            48.2             48.7
— with target embeddings      70.1         66.0            68.4             69.0

Table 5: Results for additional experiments. The cross-lingual score is the F1-score on the Italian test set. For the supervised, cross-question and cross-topic settings we report the harmonic mean of the German and French scores.

4.5 How Important are the Segments?

In order to rule out that the questions alone or the comments alone are sufficient to optimally solve the task, we conduct some additional experiments:

• Only use a single segment containing the comment, removing the questions from the training and test data (missing questions).
• Only use the question and remove the comment (missing comments).

In both cases the performance decreases across all evaluation settings (Table 5). The loss in performance is much higher when the comments are missing, indicating that the comments contain the most important information about the stance. As can be expected, the score achieved without comments is only slightly different from the target-wise majority class baseline.

But there is also a loss in performance when the questions are missing, which underlines the importance of pairing both pieces of text. The effect of missing questions is especially strong in the supervised and cross-lingual settings. To illustrate this, we provide in Table A8 some examples of comments that occur with multiple different targets in the training set. Those examples can explain why the target can be essential for disambiguating a stance detection problem. On the other hand, the effect of omitting the questions is less pronounced in the cross-target settings.

The above single-segment experiments tell us that both the comment and the question provide crucial information. But it is possible that the M-BERT model, even though trained on both segments, mainly looks at a single segment at test time. To rule this out, we probe the model with randomized data at test time (the first probe is sketched in code at the end of Section 4):

• Test the model on versions of the test sets where the comments remain in place but the questions are shuffled randomly (random questions). We make sure that the random questions come from the same test set and language as the original questions.
• Keep the questions in place and randomize the comments (random comments). Again we shuffle the comments only within test set boundaries.

The results in Table 5 show that the performance of the model decreases in both cases, confirming that it learns to take both segments into account.

4.6 How Important are Spelled-Out Targets?

Finally, we test whether the target really needs to be represented by natural language (e.g. "Do you support X?"). An alternative is to represent the target with a trainable embedding instead.

In order to fit target embeddings smoothly into our architecture, we represent each target type with a different reserved symbol from the M-BERT vocabulary. Segment A is then set to this symbol instead of a natural language question.

The results for this experiment are listed in the bottom row of Table 5. An M-BERT model that learns target embeddings instead of encoding a question performs clearly worse in the supervised and cross-lingual settings. From this we conclude that spelled-out natural language questions provide important linguistic detail that can help in stance detection.
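As an illustration, the random-questions probe of Section 4.5 can be sketched as follows; this is our own code, and the field names are assumptions about the data format.

```python
import random

def randomize_questions(test_set, seed=0):
    """Keep the comments in place but pair each one with a random question
    from the same test set, which also guarantees the same language."""
    rng = random.Random(seed)
    questions = [instance["question"] for instance in test_set]
    rng.shuffle(questions)
    return [dict(instance, question=question)
            for instance, question in zip(test_set, questions)]
```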
5 Discussion

Our experiments show that M-BERT achieves a reasonable accuracy on X-stance, outperforming the majority class baselines and the fastText classifier.

To put the supervised score into context, we list scores that variants of BERT have achieved on other stance detection datasets in Table 6. It seems that the supervised part of X-stance has a similar difficulty to the SemEval-2016 (Mohammad et al., 2016a) or MPCHI (Sen et al., 2018) datasets, on which BERT has previously been evaluated.

Dataset        Evaluation            Score
SemEval-2016   Ghosh et al. (2019)   75.1
MPCHI          Ghosh et al. (2019)   75.6
X-stance       this paper            76.6

Table 6: Performance of BERT-like models on different supervised stance detection benchmarks.

On the other hand, in the cross-lingual and cross-target settings, the mean score drops by 6–8 percentage points compared to the supervised setting; while zero-shot transfer is possible to a degree, it can still be improved.

The additional experiments (Table 5) validate the results and show that the sequence-pair classification approach to stance detection is justified.

It is interesting to see what errors the M-BERT model makes. Table A7 presents instances where it predicts the wrong label with high confidence. These examples indicate that many comments express their stance only on a very implicit level, and thus hint at a potential weakness of the dataset. Because on the voting advice platform the label is explicitly shown to readers in addition to the comments, the comments do not need to express the stance explicitly.

Manual annotation could eliminate very implicit samples in a future version of the dataset. However, the sheer size and breadth of the dataset could not realistically be achieved with manual annotation, and, in our view, largely compensates for the implicitness of the texts.

6 Conclusion

We have presented a new dataset for political stance detection called X-stance. The dataset extends over a broad range of topics and issues regarding national Swiss politics. This diversity of topics opens up an opportunity to further study multi-target learning. Moreover, being partly Swiss Standard German, partly French and partly Italian, the dataset promotes a multilingual approach to stance detection.

By compiling formal commentary by politicians on political questions, we add a new text genre to the field of stance detection. We also propose a question–answer format that allows us to condition stance detection models on a target naturally.

Our baseline results with multilingual BERT show that the model has some capability to perform zero-shot transfer to unseen languages and to unseen targets (both within a topic and to unseen topics). However, there remains a gap in performance that future work could address. We expect that the X-stance dataset could furthermore be a valuable resource for fields such as argument mining, argument search or topic classification.

Acknowledgments

This work was funded by the Swiss National Science Foundation (project MUTAMUR; no. 176727). We would like to thank Isabelle Augenstein, Anne Göhring and the anonymous reviewers for helpful feedback.

References

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 876–885, Austin, Texas. Association for Computational Linguistics.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Simon Clematide, Stefan Gindl, Manfred Klenner, Stefanos Petrakis, Robert Remus, Josef Ruppenhofer, Ulli Waltinger, and Michael Wiegand. 2012. MLSA — a multi-layered reference corpus for German sentiment analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3551–3556, Istanbul, Turkey. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, and Saptarshi Ghosh. 2019. Stance detection in web and social media: A comparative study. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 75–87. Springer.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling large parallel corpora: The Zurich Parallel Corpus Collection. In Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC), pages 1–8. Leibniz-Institut für Deutsche Sprache.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Manfred Klenner, Don Tuggener, and Simon Clematide. 2017. Stance detection in Facebook posts of a German right-wing party. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 31–40, Valencia, Spain. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit 2005, pages 79–86.
Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys, 53(1).

Mirko Lai, Alessandra Teresa Cignarella, Delia Irazú Hernández Farías, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2020. Multilingual stance detection in social media political debates. Computer Speech & Language, page 101075.

Mirko Lai, Viviana Patti, Giancarlo Ruffo, and Paolo Rosso. 2018. Stance evolution and Twitter interactions in an Italian political debate. In International Conference on Applications of Natural Language to Information Systems, pages 15–27. Springer.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A dataset for detecting stance in tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3945–3952, Portorož, Slovenia. European Language Resources Association (ELRA).

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California. Association for Computational Linguistics.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Anirban Sen, Manjira Sinha, Sandya Mannarswamy, and Shourya Roy. 2018. Stance classification of multi-perspective consumer health information. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pages 273–281.

Nakatani Shuyo. 2010. Language detection library for Java.

Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu. 2017. A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 551–557, Valencia, Spain. Association for Computational Linguistics.

Mariona Taulé, M. Antònia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the task on stance and gender detection in tweets on Catalan independence at IberEval 2017. In 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017), volume 1881, pages 157–177.

Mariona Taulé, Francisco Rangel, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the task on multimodal stance detection in tweets on Catalan #1Oct referendum. In 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), volume 2150, pages 149–166.

James Thurman and Urs Gasser. 2009. Three case studies from Switzerland: Smartvote. Berkman Center Research Publications.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics.

A Examples

Question: Befürworten Sie eine vollständige Liberalisierung der Geschäftsöffnungszeiten? [Are you in favour of a complete liberalisation of business hours for shops?]
Comment: Ausser Sonntag. Dies sollte ein Ruhetag bleiben können. [Except Sunday. That should remain a day of rest.]
Gold label: FAVOR. Predicted probability of gold label: 0.001

Question: Soll die Schweiz innerhalb der nächsten vier Jahre EU-Beitrittsverhandlungen aufnehmen? [Should Switzerland embark on negotiations in the next four years to join the EU?]
Comment: In den nächsten vier Jahren ist dies wohl unrealistisch. [In the next four years this is probably unrealistic.]
Gold label: FAVOR. Predicted probability of gold label: 0.005

Question: Befürworten Sie einen Ausbau des Landschaftsschutzes? [Are you in favour of extending landscape protection?]
Comment: Wenn es darum geht, erneuerbare Energien zu fördern, ist sogar eine Lockerung angebracht. [When it comes to promoting renewable energy, even a relaxation is appropriate.]
Gold label: AGAINST. Predicted probability of gold label: 0.006

Question: La Suisse devrait-elle engager des négociations pour un accord de libre-échange avec les Etats-Unis? [Should Switzerland start negotiations with the USA on a free trade agreement?]
Comment: Il faut cependant en parallèle veiller à ce que la Suisse ne soit pas mise de côté par les Etats-Unis! [At the same time it must be ensured that Switzerland is not sidelined by the United States!]
Gold label: AGAINST. Predicted probability of gold label: 0.010

Table A7: Some classification errors where the predicted probability of the correct label is especially low. The examples have been taken from the validation set.
Comment: Ich will offene Grenzen für Waren und selbstverantwortliche mündige Bürger. Der Staat hat kein Recht, uns einzuschränken. [I want open borders for goods and responsible, mature citizens. The state has no right to restrict us.]
...is favorable towards: Soll die Schweiz mit den USA Verhandlungen über ein Freihandelsabkommen aufnehmen? [Should Switzerland start negotiations with the USA on a free trade agreement?]
...but against: Soll die Schweiz das Schengen-Abkommen mit der EU kündigen und wieder verstärkte Personenkontrollen direkt an der Grenze einführen? [Should Switzerland terminate the Schengen Agreement with the EU and reintroduce increased identity checks directly at the border?]

Comment: Hier gilt der Grundsatz der Eigenverantwortung und Selbstbestimmung des Unternehmens! [The principle of personal responsibility and corporate self-regulation applies here!]
...is favorable towards: Sind Sie für eine vollständige Liberalisierung der Ladenöffnungszeiten? [Are you in favour of the complete liberalization of shop opening times?]
...but against: Würden Sie die Einführung einer Frauenquote in Verwaltungsräten börsenkotierter Unternehmen befürworten? [Would you support the introduction of a women's quota for the boards of directors of listed companies?]

Table A8: Two comments that imply a positive stance towards one target issue but a negative stance towards another target issue. Such cases can be found in the dataset because respondents have copy-pasted some comments. These examples have been extracted from the training set.

B Data Statement

Curation rationale  In order to study the automatic detection of stances on political issues, questions and candidate responses on the voting advice application smartvote.ch were downloaded. Mainly data pertaining to national-level issues were included to reduce variability.

Language variety  The training set consists of questions and answers in Swiss Standard German and Swiss French (74.1% de-CH; 25.9% fr-CH). The test sets also contain questions and answers in Swiss Italian (67.1% de-CH; 24.7% fr-CH; 8.2% it-CH). The questions have also been translated into English.

Speaker demographic (answers)
• Candidates for communal, cantonal or national elections in Switzerland who have filled out an online questionnaire.
• Age: 18 or older – mixed.
• Gender: Unknown – mixed.
• Race/ethnicity: Unknown – mixed.
• Native language: Unknown – mixed.
• Socioeconomic status: Unknown – mixed.
• Different speakers represented: 7581.
• Presence of disordered speech: Unknown.

Speech situation
• The questions were edited and translated by political scientists for a public voting advice website.
• The answers were written between 2011 and 2020 by the users of the website.

Text characteristics  Questions, answers, arguments and comments regarding political issues.