ACQuA at Same Side Stance Classification 2019

Alexander Bondarenko, Ekaterina Shirshakova, Niklas Homann, and Matthias Hagen
Martin-Luther-Universität Halle-Wittenberg
alexander.bondarenko@informatik.uni-halle.de

Abstract

We describe the ACQuA team's participation in the "Same Side Stance Classification" shared task (are two given arguments both on the pro or con side for some topic?) that was run as part of the ArgMining 2019 workshop.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, the popularity of social media and discussion platforms has led to online pro and con argumentation on almost every topic. Still, since not all contributions in such online discussions clearly indicate their stance or polarity, automatically identifying a post's stance could help readers quickly get an overview of a discussion, similar to debating portals with pro/con arguments.

In this extended abstract, we report on our participation in the "Same Side Stance Classification" shared task. The task was run as a pilot at the ArgMining 2019 workshop and stated the problem as: given two arguments, decide whether either both support or both attack some controversial topic like gay marriage—i.e., whether the two arguments are "on the same side."

Given that the available time prior to the pilot edition of the shared task was rather limited, we decided to focus on examining the effectiveness of simple word n-gram features and several variants of sentiment detection for same side classification. We experiment with three respective classifiers: (1) a simple rule-based method "counting" positive and negative terms, (2) a rule-based method with sentiment flipping that uses sentiment and shifter lexicons, and (3) a gradient boosting decision tree-based method using word n-grams as features.

Not too surprisingly, the evaluation results for the three classifiers show that relying on sentiment words or word n-grams alone cannot really solve stance classification. Our "best" models achieve an accuracy of 0.54 on binary-labeled balanced test sets—obviously only a very slight improvement over random guessing.

2 Related Work

Stance classification has been studied in numerous research publications proposing different features. For instance, Walker et al. (2012) analyzed 11 feature types and showed that Naïve Bayes using POS tags achieved better results than word unigrams, while HaCohen-Kerner et al. (2017) applied an SVM classifier to 18 feature types extracted from tweets (hashtags, slang and emojis, POS tags, character and word n-grams, etc.) and reported good performance for character skip n-grams. Nevertheless, word n-grams have been a very common choice in many stance classification experiments.

Also common for stance classification is the use of sentiment attributes. For instance, Somasundaran and Wiebe (2010) combined argumentation-based features (1- to 3-grams extracted from sentiments and argument targets) with sentiment-based features (a sentiment lexicon with negative and positive words).

Comparing different classification models, Liu et al. (2016) showed in their evaluation that gradient boosting decision trees outperform SVMs for stance classification. More recently, neural approaches have been successfully applied to stance classification: Popat et al. (2019) tuned BERT with hidden state representations, and Durmus et al. (2019) used BERT fine-tuned with path information extracted from argument trees for 741 topics from kialo.com.

Given the limited time prior to the shared task, we simply wanted to test word n-grams (gradient boosting tree-based classifier) and sentiment features (rule-based classifiers) as common feature types for stance classification.
3 Task and Data

The "Same Side Stance Classification" shared task has two experimental settings: within-topic (argumentative topics for training and test are the same) and cross-topic (argumentative topics for training and test are different).

The provided data are argumentative topics and corresponding pairs of arguments collected from the debating portals idebate.org, debatepedia.org, debatewise.org, and debate.org. The data is split into training sets (within-topic: 63,903 argument pairs for the two topics abortion and gay marriage; cross-topic: 61,048 argument pairs for the topic abortion) and test sets (within-topic: 31,475 argument pairs for the two topics abortion and gay marriage; cross-topic: 6,163 argument pairs for the topic gay marriage). We randomly split the provided training sets into local training, validation, and test sets (80:10:10).

4 ACQuA Runs

Our three runs are based (1) on a rule-based classifier, (2) on a rule-based classifier with sentiment flipping, and (3) on gradient boosting decision trees (code available at https://github.com/webis-de/argmining19-acqua-same-side/).

Rule-based classification. Argument stances can either support or attack some argumentative topic. In other words, they can convey a positive or a negative "sentiment" towards the topic. Since the shared task is topic-agnostic (i.e., there is no need to distinguish topic-specific argumentation vocabulary), our first run only tries to identify whether a pair of arguments expresses the same sentiment. So far, a plethora of approaches have been proposed to classify the sentiment of opinions as positive or negative (or neutral), but given the time constraints of the task participation we decided to investigate whether sentiment signals in the simplest form of lexicon-based counts of positive or negative terms can contribute to same side classification.

Employing the sentiment lexicon of Hu and Liu (2004), we use sentiment marker keyword lists for sentiment detection (e.g., good vs. bad). Depending on whether the positive or the negative markers have a higher total count, the rule-based classifier assigns the respective label to the argument—note that sentiment-flipping terms (e.g., not bad) are not part of our first run. If the counts of positive and negative markers are equal or if an argument does not contain any marker, a random label is assigned. This is the case for about 25% of the provided within-topic and about 20% of the provided cross-topic training pairs (12% of the within-topic and about 19% of the cross-topic test pairs). Finally, if the counter-based sentiments of an argument pair agree, the pair is classified as "same side."
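As an illustration, the following is a minimal sketch of this first run; the tiny example word lists stand in for the Hu and Liu (2004) lexicon, and all identifiers are placeholders rather than the names used in our actual code.

import random

# Toy stand-ins for the positive/negative word lists of Hu and Liu (2004).
POSITIVE_WORDS = {"good", "great", "benefit", "support"}
NEGATIVE_WORDS = {"bad", "harmful", "wrong", "danger"}


def argument_sentiment(text):
    """Label one argument as 'pos' or 'neg' by counting lexicon matches;
    ties and marker-free arguments receive a random label."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return random.choice(["pos", "neg"])


def same_side(argument_a, argument_b):
    """A pair is predicted as 'same side' iff both counter-based labels agree."""
    return argument_sentiment(argument_a) == argument_sentiment(argument_b)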
Rule-based classification with sentiment flipping. We re-implemented a sentiment classifier that is one step in a three-step approach proposed by Bar-Haim et al. (2017) to classify a single claim's stance as pro or con with respect to some controversial topic. The complete approach combines argument target identification with sentiment detection and consistency/contrastiveness classification. In a semester-long student project, we re-implemented parts of this approach and verified that it produces results similar to the originally reported performances.

In the setting of the "Same Side Stance Classification" shared task, we applied only the sentiment classifier, which follows the approach of Ding et al. (2008) and uses sentiment word counts matched against the lexicon of Hu and Liu (2004) (the same lexicon that is used in our first approach) and the shifter lexicon of Polanyi and Zaenen (2006) (sentiment shifters flip the polarity of sentiment words). We could not directly apply the target identifier and the contrast classifier due to differences in the semantic structures of the IBM and Same Side datasets.

If the counts of positive and negative sentiments are equal or if an argument does not contain any sentiment words, the argument pair is labeled as being on the same side (this reflects the majority label in the IBM dataset). This is the case for about 4% of the provided within-topic and about 0.3% of the provided cross-topic pairs in the official test set.
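The sketch below illustrates one way such flipping can be wired into the counting scheme; the three-token shifter window, the tiny example lexicons, and the fallback handling are our own simplifying assumptions for this illustration, not the exact rules of Ding et al. (2008) or the full Polanyi and Zaenen (2006) lexicon.

from typing import Optional

# Toy stand-ins for the Hu and Liu (2004) sentiment lexicon and the
# Polanyi and Zaenen (2006) shifter lexicon.
POSITIVE_WORDS = {"good", "great", "benefit"}
NEGATIVE_WORDS = {"bad", "harmful", "wrong"}
SHIFTERS = {"not", "never", "hardly"}
WINDOW = 3  # assumption: a shifter flips sentiment words up to 3 tokens to its right


def flipped_sentiment(text) -> Optional[str]:
    """Return 'pos', 'neg', or None for a tie / no sentiment words."""
    tokens = text.lower().split()
    pos = neg = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE_WORDS:
            polarity = 1
        elif token in NEGATIVE_WORDS:
            polarity = -1
        else:
            continue
        # Flip the polarity if a shifter occurs shortly before the sentiment word.
        if any(t in SHIFTERS for t in tokens[max(0, i - WINDOW):i]):
            polarity = -polarity
        if polarity > 0:
            pos += 1
        else:
            neg += 1
    if pos == neg:
        return None
    return "pos" if pos > neg else "neg"


def same_side_with_flipping(argument_a, argument_b):
    label_a = flipped_sentiment(argument_a)
    label_b = flipped_sentiment(argument_b)
    if label_a is None or label_b is None:
        return True  # fallback to "same side", the majority label in the IBM data
    return label_a == label_b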
Gradient boosting decision tree. In our third run, we use the fast gradient boosting framework LightGBM (Ke et al., 2017), which employs tree-based learning algorithms. LightGBM is often used for text classification tasks, even in one of the winning approaches of the Kaggle competition on identifying duplicate Quora questions (Iyer et al., 2017). We use token frequencies and tf-idf-weighted bags of 1-, 2-, 3-, 1–2-, and 1–3-gram lemmas as features (often used in text classification tasks).

As LightGBM returns a confidence for its predictions, we ran preliminary experiments with different confidence thresholds on our local training and validation sets to select the best-performing parameters. The following features and thresholds achieved the highest accuracy in these pilot experiments: tf-idf-weighted unigram lemmas and a confidence threshold of 0.520 for the within-topic setup and of 0.501 for the cross-topic setup.
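A minimal sketch of this setup, using scikit-learn's TfidfVectorizer together with LightGBM's scikit-learn interface, is given below; the unlemmatized tokens, the concatenation of the two arguments of a pair into one document, and the default LightGBM hyperparameters are simplifying assumptions of this sketch rather than our exact configuration.

from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tf-idf-weighted unigram features; lowercased tokens stand in for the lemmas
# used in our runs.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1), lowercase=True),
    LGBMClassifier(),  # hyperparameters other than the decision threshold omitted
)


def pair_text(argument_a, argument_b):
    # Simplification: represent an argument pair by concatenating both texts.
    return argument_a + " " + argument_b


def train(pairs, labels):
    """pairs: list of (argument_a, argument_b) strings; labels: 1 = same side."""
    model.fit([pair_text(a, b) for a, b in pairs], labels)


def predict_same_side(pairs, threshold=0.520):
    """Apply the tuned confidence threshold (0.520 within-topic, 0.501 cross-topic)
    to the predicted probability of the 'same side' class."""
    probabilities = model.predict_proba([pair_text(a, b) for a, b in pairs])[:, 1]
    return [int(p >= threshold) for p in probabilities]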
5 Experiments and Results

We use our local training, validation, and test sets (80:10:10) to train, validate, and test the LightGBM-based classifier, and we only test the two rule-based classifiers locally (they do not have a training step); the classification accuracies on the local test set are given in Table 1.

Table 1: Classification accuracy on our local test set.

  Model                      within-topic   cross-topic
  Rule-based                     0.51           0.51
  Rule-based with flipping       0.50           0.50
  LightGBM                       0.54           0.52
  Informed guessing              0.50           0.50

The simple rule-based and LightGBM approaches perform only very slightly better than random guessing informed about the balanced data (50:50 same / different side). One possible reason for the rule-based classifier without flipping is that about 25% of the cases were randomly decided due to ties in the numbers of positive/negative terms. Surprisingly, considering sentiment flipping only worsened the performance. In the case of the LightGBM approach, simple word n-gram lemmas probably are still not sufficient features for a stance classification decision tree.

Even though our approaches performed very poorly on the local data, we submitted all three approaches with their best parameter settings as runs for the shared task. To this end, the LightGBM-based approach was trained on the full official training set.

The accuracies of all three runs as reported by the task organizers are shown in Table 2. Not too surprisingly, also on the official test set, the performance of the rule-based approaches and of the LightGBM-based approach does not really improve upon informed random guessing (50:50 label balance). Note that the slightly better performance of the rules without flipping on the official test set compared to our local test set might be due to fewer random decisions in case of ties in the numbers of positive/negative dictionary words (12% vs. 25%).

Table 2: Classification accuracy on the official test set.

  Model                      within-topic   cross-topic
  Rule-based                     0.54           0.50
  Rule-based with flipping       0.50           0.50
  LightGBM                       0.51           0.50
  Informed guessing              0.50           0.50

6 Conclusion

We have submitted three approaches to the shared task on same side stance classification (i.e., deciding whether two arguments are "on the same side" for a given topic): (1) a simple rule-based sentiment-oriented approach, (2) a rule-based sentiment classifier with flipping, and (3) gradient boosted decision trees with tf-idf-weighted unigram lemmas as features.

None of our runs really improves upon informed random guessing. Sentiment in the simplistic form of our rule-based models does not seem to help much in same side classification.

A proper adaptation of IBM Research's complete stance classifier to the same side classification task, as well as training classifiers over word embeddings and deploying neural classifiers, are interesting directions for future research.

Acknowledgments

This work has been partially supported by the Deutsche Forschungsgemeinschaft (DFG) within the project "Answering Comparative Questions with Arguments (ACQuA)" (grant HA 5851/2-1) that is part of the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999).

References

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance Classification of Context-Dependent Claims. In Proceedings of ACL 2017, pages 251–261.

Xiaowen Ding, Bing Liu, and Philip S. Yu. 2008. A Holistic Lexicon-Based Approach to Opinion Mining. In Proceedings of WSDM 2008, pages 231–240.

Esin Durmus, Faisal Ladhak, and Claire Cardie. 2019. Determining Relative Argument Specificity and Stance for Complex Argumentative Structures. In Proceedings of ACL 2019, pages 4630–4641.

Yaakov HaCohen-Kerner, Ziv Ido, and Ronen Ya'akobov. 2017. Stance Classification of Tweets using Skip Char Ngrams. In Proceedings of ECML PKDD 2017, pages 266–278.

Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of SIGKDD 2004, pages 168–177.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of NIPS 2017, pages 3146–3154.

Can Liu, Wen Li, Bradford Demarest, Yue Chen, Sara Couture, Daniel Dakota, Nikita Haduong, Noah Kaufman, Andrew Lamont, Manan Pancholi, Kenneth Steimel, and Sandra Kübler. 2016. IUCL at SemEval-2016 Task 6: An Ensemble Model for Stance Detection in Twitter. In Proceedings of SemEval-2016, pages 394–400.

Livia Polanyi and Annie Zaenen. 2006. Contextual Valence Shifters. In Computing Attitude and Affect in Text: Theory and Applications, pages 1–10. Springer.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2019. STANCY: Stance Classification Based on Consistency Cues. In Proceedings of EMNLP-IJCNLP 2019, pages 6412–6417.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing Stances in Ideological Online Debates. In Proceedings of the CAAGET Workshop at NAACL HLT 2010, pages 116–124.

Marilyn A. Walker, Pranav Anand, Rob Abbott, Jean E. Fox Tree, Craig Martelly, and Joseph King. 2012. That is Your Evidence?: Classifying Stance in Online Political Debate. Decision Support Systems, 53(4):719–729.