ACQuA at Same Side Stance Classification 2019

Alexander Bondarenko, Ekaterina Shirshakova, Niklas Homann, and Matthias Hagen
Martin-Luther-Universität Halle-Wittenberg
alexander.bondarenko@informatik.uni-halle.de

Abstract

We describe the ACQuA team's participation in the "Same Side Stance Classification" shared task (are two given arguments both on the pro or con side for some topic?) that was run as part of the ArgMining 2019 workshop.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, the popularity of social media and discussion platforms has led to online pro and con argumentation on almost every topic. Still, since not all contributions in such online discussions clearly indicate their stance or polarity, automatically identifying a post's stance could help readers quickly get an overview of a discussion, similar to debating portals with pro/con arguments.

In this extended abstract, we report on our participation in the "Same Side Stance Classification" shared task. The task was run as a pilot at the ArgMining 2019 workshop and stated the problem as: given two arguments, decide whether either both support or both attack some controversial topic like gay marriage—i.e., whether the two arguments are "on the same side."

Given that the available time prior to the pilot edition of the shared task was rather limited, we decided to focus on examining the effectiveness of simple word n-gram features and several variants of sentiment detection for same side classification. We experiment with three respective classifiers: (1) a simple rule-based method "counting" positive and negative terms, (2) a rule-based method with sentiment flipping that uses sentiment and shifter lexicons, and (3) a gradient boosting decision tree-based method using word n-grams as features.

Not too surprisingly, the evaluation results for the three classifiers show that relying on sentiment words or word n-grams alone cannot really solve stance classification. Our "best" models achieve an accuracy of 0.54 on binary-labeled balanced test sets—obviously only a very slight improvement over random guessing.

2 Related Work

Stance classification has been studied in numerous research publications proposing different features. For instance, Walker et al. (2012) analyzed 11 feature types and showed that Naïve Bayes using POS tags achieved better results than word unigrams, while HaCohen-Kerner et al. (2017) applied an SVM classifier to 18 feature types extracted from tweets (hashtags, slang and emojis, POS tags, character and word n-grams, etc.) and reported good performance for character skip n-grams. Nevertheless, word n-grams have been a very common choice in many stance classification experiments.

Also common for stance classification is the use of sentiment attributes. For instance, Somasundaran and Wiebe (2010) combined argumentation-based features (1- to 3-grams extracted from sentiments and argument targets) with sentiment-based features (a sentiment lexicon with negative and positive words).

Comparing different classification models, Liu et al. (2016) showed in their evaluation that gradient boosting decision trees outperform SVMs for stance classification. More recently, neural approaches have been successfully applied to stance classification: Popat et al. (2019) tuned BERT with hidden state representations, and Durmus et al. (2019) used BERT fine-tuned with path information extracted from argument trees for 741 topics from kialo.com.

Given the limited time prior to the shared task, we simply wanted to test word n-grams (gradient boosting tree-based classifier) and sentiment features (rule-based classifiers) as common feature types for stance classification.
3 Task and Data

The "Same Side Stance Classification" shared task has two experimental settings: within-topic (argumentative topics for training and test are the same) and cross-topic (argumentative topics for training and test are different).

The provided data are argumentative topics and corresponding pairs of arguments collected from the debating portals idebate.org, debatepedia.org, debatewise.org, and debate.org. The data is split into training sets (within-topic: 63,903 argument pairs for the two topics abortion and gay marriage; cross-topic: 61,048 argument pairs for the topic abortion) and test sets (within-topic: 31,475 argument pairs for the two topics abortion and gay marriage; cross-topic: 6,163 argument pairs for the topic gay marriage). We randomly split the provided training sets into local training, validation, and test sets (80:10:10).

4 ACQuA Runs

Our three runs are based (1) on a rule-based classifier, (2) on a rule-based classifier with sentiment flipping, and (3) on gradient boosting decision trees (code available at https://github.com/webis-de/argmining19-acqua-same-side/).

Rule-based classification. Argument stances can either support or attack some argumentative topic. In other words, they can convey a positive or a negative "sentiment" towards the topic. Since the shared task is topic-agnostic (i.e., there is no need to distinguish topic-specific argumentation vocabulary), our first run only tries to identify whether a pair of arguments expresses the same sentiment. So far, a plethora of approaches have been proposed to classify the sentiment of opinions as positive or negative (or neutral), but given the time constraints of the task participation we decided to investigate whether sentiment signals in the simplest form of lexicon-based counts of positive or negative terms can contribute to same side classification.

Employing the sentiment lexicon of Hu and Liu (2004), we use sentiment marker keyword lists for sentiment detection (e.g., good vs. bad). Depending on whether the positive or the negative markers have a higher total count, the rule-based classifier assigns the respective label to the argument—note that sentiment-flipping terms (e.g., not bad) are not part of our first run. If the counts of positive and negative markers are equal or if an argument does not contain any marker, a random label is assigned. This is the case for about 25% of the provided within-topic and about 20% of the provided cross-topic training pairs (12% of the within-topic and about 19% of the cross-topic test pairs). Finally, if the counter-based sentiments of an argument pair agree, the pair is classified as "same side."
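As an illustration, the following is a minimal sketch of this first run; the tiny example word lists stand in for the Hu and Liu (2004) lexicon, and all identifiers are placeholders rather than the names used in our actual code.

import random

# Toy stand-ins for the positive/negative word lists of Hu and Liu (2004).
POSITIVE_WORDS = {"good", "great", "benefit", "support"}
NEGATIVE_WORDS = {"bad", "harmful", "wrong", "danger"}


def argument_sentiment(text):
    """Label one argument as 'pos' or 'neg' by counting lexicon matches;
    ties and marker-free arguments receive a random label."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return random.choice(["pos", "neg"])


def same_side(argument_a, argument_b):
    """A pair is predicted as 'same side' iff both counter-based labels agree."""
    return argument_sentiment(argument_a) == argument_sentiment(argument_b)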
Rule-based classification with sentiment flipping. We re-implemented a sentiment classifier that is one step in a three-step approach proposed by Bar-Haim et al. (2017) to classify a single claim's stance as pro or con with respect to some controversial topic. The complete approach combines argument target identification with sentiment detection and consistency/contrastiveness classification. In a semester-long student project, we re-implemented parts of this approach and verified that it produces results similar to the originally reported performances.

In the setting of the "Same Side Stance Classification" shared task, we applied only the sentiment classifier, which follows the approach of Ding et al. (2008) and uses sentiment word counts matched against the lexicon of Hu and Liu (2004) (the same lexicon that is used in our first approach) and the shifter lexicon of Polanyi and Zaenen (2006) (sentiment shifters flip the polarity of sentiment words). We could not directly apply the target identifier and the contrast classifier due to differences in the semantic structures of the IBM and Same Side datasets.

If the counts of positive and negative sentiments are equal or if an argument does not contain any sentiment words, the argument pair is labeled as being on the same side (this reflects the majority label in the IBM dataset). This is the case for about 4% of the provided within-topic and about 0.3% of the provided cross-topic pairs in the official test set.
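The sketch below illustrates one way such flipping can be wired into the counting scheme; the three-token shifter window, the tiny example lexicons, and the fallback handling are our own simplifying assumptions for this illustration, not the exact rules of Ding et al. (2008) or the full Polanyi and Zaenen (2006) lexicon.

from typing import Optional

# Toy stand-ins for the Hu and Liu (2004) sentiment lexicon and the
# Polanyi and Zaenen (2006) shifter lexicon.
POSITIVE_WORDS = {"good", "great", "benefit"}
NEGATIVE_WORDS = {"bad", "harmful", "wrong"}
SHIFTERS = {"not", "never", "hardly"}
WINDOW = 3  # assumption: a shifter flips sentiment words up to 3 tokens to its right


def flipped_sentiment(text) -> Optional[str]:
    """Return 'pos', 'neg', or None for a tie / no sentiment words."""
    tokens = text.lower().split()
    pos = neg = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE_WORDS:
            polarity = 1
        elif token in NEGATIVE_WORDS:
            polarity = -1
        else:
            continue
        # Flip the polarity if a shifter occurs shortly before the sentiment word.
        if any(t in SHIFTERS for t in tokens[max(0, i - WINDOW):i]):
            polarity = -polarity
        if polarity > 0:
            pos += 1
        else:
            neg += 1
    if pos == neg:
        return None
    return "pos" if pos > neg else "neg"


def same_side_with_flipping(argument_a, argument_b):
    label_a = flipped_sentiment(argument_a)
    label_b = flipped_sentiment(argument_b)
    if label_a is None or label_b is None:
        return True  # fallback to "same side", the majority label in the IBM data
    return label_a == label_b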
Gradient boosting decision tree. In our third run, we use the fast gradient boosting framework LightGBM (Ke et al., 2017), which employs tree-based learning algorithms. LightGBM is often used for text classification tasks, even in one of the winning approaches of the Kaggle competition on identifying duplicate Quora questions (Iyer et al., 2017). We use token frequencies and tf-idf-weighted bags of 1-, 2-, 3-, 1–2-, and 1–3-gram lemmas as features (often used in text classification tasks).

As LightGBM returns a confidence for its predictions, we ran preliminary experiments with different confidence thresholds on our local training and validation sets to select the best-performing parameters. The following features and thresholds achieved the highest accuracy in these pilot experiments: tf-idf-weighted unigram lemmas and a confidence threshold of 0.520 for the within-topic setup and of 0.501 for the cross-topic setup.
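A minimal sketch of this setup, using scikit-learn's TfidfVectorizer together with LightGBM's scikit-learn interface, is given below; the unlemmatized tokens, the concatenation of the two arguments of a pair into one document, and the default LightGBM hyperparameters are simplifying assumptions of this sketch rather than our exact configuration.

from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tf-idf-weighted unigram features; lowercased tokens stand in for the lemmas
# used in our runs.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1), lowercase=True),
    LGBMClassifier(),  # hyperparameters other than the decision threshold omitted
)


def pair_text(argument_a, argument_b):
    # Simplification: represent an argument pair by concatenating both texts.
    return argument_a + " " + argument_b


def train(pairs, labels):
    """pairs: list of (argument_a, argument_b) strings; labels: 1 = same side."""
    model.fit([pair_text(a, b) for a, b in pairs], labels)


def predict_same_side(pairs, threshold=0.520):
    """Apply the tuned confidence threshold (0.520 within-topic, 0.501 cross-topic)
    to the predicted probability of the 'same side' class."""
    probabilities = model.predict_proba([pair_text(a, b) for a, b in pairs])[:, 1]
    return [int(p >= threshold) for p in probabilities]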
5 Experiments and Results

We use our local training, validation, and test sets (80:10:10) to train, validate, and test the LightGBM-based classifier, and we only test the two rule-based classifiers locally (they do not have a training step); the classification accuracies on the local test set are given in Table 1.

Table 1: Classification accuracy on our local test set.

  Model                      within-topic   cross-topic
  Rule-based                     0.51           0.51
  Rule-based with flipping       0.50           0.50
  LightGBM                       0.54           0.52
  Informed guessing              0.50           0.50

The simple rule-based and LightGBM approaches perform only very slightly better than random guessing informed about the balanced data (50:50 same / different side). One possible reason for the rule-based classifier without flipping is that about 25% of the cases were randomly decided due to ties in the numbers of positive/negative terms. Surprisingly, considering sentiment flipping only worsened the performance. In the case of the LightGBM approach, simple word n-gram lemmas probably are still not sufficient features for a stance classification decision tree.

Even though our approaches performed very poorly on the local data, we submitted all three approaches with their best parameter settings as runs for the shared task. To this end, the LightGBM-based approach was trained on the full official training set.

The accuracies of all three runs as reported by the task organizers are shown in Table 2. Not too surprisingly, also on the official test set, the performance of the rule-based approaches and of the LightGBM-based approach does not really improve upon informed random guessing (50:50 label balance). Note that the slightly better performance of the rules without flipping on the official test set compared to our local test set might be due to fewer random decisions in case of ties in the numbers of positive/negative dictionary words (12% vs. 25%).

Table 2: Classification accuracy on the official test set.

  Model                      within-topic   cross-topic
  Rule-based                     0.54           0.50
  Rule-based with flipping       0.50           0.50
  LightGBM                       0.51           0.50
  Informed guessing              0.50           0.50

6 Conclusion

We have submitted three approaches to the shared task on same side stance classification (i.e., deciding whether two arguments are "on the same side" for a given topic): (1) a simple rule-based sentiment-oriented approach, (2) a rule-based sentiment classifier with flipping, and (3) gradient boosted decision trees with tf-idf-weighted unigram lemmas as features.

None of our runs really improves upon informed random guessing. Sentiment in the simplistic form of our rule-based models does not seem to help much in same side classification.

A proper adaptation of IBM Research's complete stance classifier to the same side classification task, as well as training classifiers over word embeddings and deploying neural classifiers, are interesting directions for future research.

Acknowledgments

This work has been partially supported by the Deutsche Forschungsgemeinschaft (DFG) within the project "Answering Comparative Questions with Arguments (ACQuA)" (grant HA 5851/2-1) that is part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).

References

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance Classification of Context-Dependent Claims. In Proceedings of ACL 2017, pages 251–261.

Xiaowen Ding, Bing Liu, and Philip S. Yu. 2008. A Holistic Lexicon-Based Approach to Opinion Mining. In Proceedings of WSDM 2008, pages 231–240.

Esin Durmus, Faisal Ladhak, and Claire Cardie. 2019. Determining Relative Argument Specificity and Stance for Complex Argumentative Structures. In Proceedings of ACL 2019, pages 4630–4641.

Yaakov HaCohen-Kerner, Ziv Ido, and Ronen Ya'akobov. 2017. Stance Classification of Tweets using Skip Char Ngrams. In Proceedings of ECML PKDD 2017, pages 266–278.

Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of SIGKDD 2004, pages 168–177.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of NIPS 2017, pages 3146–3154.

Can Liu, Wen Li, Bradford Demarest, Yue Chen, Sara Couture, Daniel Dakota, Nikita Haduong, Noah Kaufman, Andrew Lamont, Manan Pancholi, Kenneth Steimel, and Sandra Kübler. 2016. IUCL at SemEval-2016 Task 6: An Ensemble Model for Stance Detection in Twitter. In Proceedings of SemEval-2016, pages 394–400.

Livia Polanyi and Annie Zaenen. 2006. Contextual Valence Shifters. In Computing Attitude and Affect in Text: Theory and Applications, pages 1–10. Springer.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2019. STANCY: Stance Classification Based on Consistency Cues. In Proceedings of EMNLP-IJCNLP 2019, pages 6412–6417.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing Stances in Ideological Online Debates. In Proceedings of the CAAGET Workshop at NAACL HLT 2010, pages 116–124.

Marilyn A. Walker, Pranav Anand, Rob Abbott, Jean E. Fox Tree, Craig Martelly, and Joseph King. 2012. That is Your Evidence?: Classifying Stance in Online Political Debate. Decision Support Systems, 53(4):719–729.