Analysing the Submissions to the Same Side Stance Classification Task

Yamen Ajjour
Bauhaus-Universität Weimar, Germany
yamen.ajjour@uni-weimar.de

Khalid Al-Khatib
Leipzig University, Germany
khalid.alkhatib@uni-leipzig.de

Abstract

This paper presents an analysis of the submissions to the first shared task on same-side stance classification. The analysis draws attention to the potential of combining the submissions in ensemble models, demonstrates the cases in which the top-performing submissions succeed in resolving the same-side stance and those in which they fail, and puts forward several suggestions for enhancing the datasets used in the shared task.

1 Introduction

The recently proposed task of same-side stance classification aims at identifying whether two arguments share the same or a different stance toward a given topic. Approaching this task, the first shared-task competition¹ was introduced at the second symposium of the RATIO priority program² and conducted at the ArgMining workshop at ACL 2019³. In this shared task, two sets of arguments belonging to the topics of abortion and gay marriage were sampled from the args.me corpus (Ajjour et al., 2019) and prepared for two experimental settings: cross-topics and within-topic. Eleven different systems were submitted to the shared task. These systems employed several supervised classifiers with various features, achieving an effectiveness that ranges between 0.5 and 0.77 in terms of accuracy.

The paper at hand presents diverse insights into same-side stance classification based on analysing the systems submitted to the shared task. In particular, we examine the effectiveness of aggregating the submitted classifiers by combining them into two ensemble models (majority and oracle). The two models were evaluated in both experimental settings.

In addition to analyzing the ensemble models, we scrutinize the data cases that most of the classifiers tackle successfully (i.e., easy cases) and the cases in which most of the classifiers fail (i.e., hard cases). Also, we conduct a manual inspection of the task data, bringing to light its limitations and proposing several suggestions to enhance it.

Our experiments show that while the majority ensemble is comparable to the best systems, the oracle ensemble achieves optimal effectiveness. This shows that almost all instances in the test dataset were classified correctly by at least one submitted system. The inability of the majority ensemble to outperform the submitted classifiers indicates the dominance of the top two systems (Trier University and Leipzig University). Overall, the results show the potential of using ensemble models to tackle the same-side stance classification task.

Regarding the case inspection, we identify diverse easy cases for the classifiers, including when the stance towards the topic is stated explicitly using a linguistic indicator, when an argument questions certain statements in the other argument of the pair, and when the two arguments embody contradicting statements. For the hard cases, we notice that the classifiers fail to predict the correct stance when the knowledge about the discussed topic is insufficient to resolve the stance, as well as when the two arguments have partial agreement/disagreement.

Lastly, towards improving the shared task datasets, we observe some problems in the data, such as the treatment of debate meta-information as arguments. Based on our investigation of web resources, we propose suggestions for sampling higher-quality data for the task.

¹ https://events.webis.de/sameside-19/
² http://ratio.sc.cit-ec.uni-bielefeld.de/events/yearly-symposium-may-2019/
³ argmining19.webis.de

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
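For illustration, an instance of this task can be thought of as a pair of argument texts on a topic together with a binary same-side label. The following minimal Python sketch uses field names of our own choosing; it does not reflect the official format of the shared task data.

from dataclasses import dataclass

# Illustrative representation of a same-side stance classification instance
# (field names are our own assumptions, not the official data format).
@dataclass
class ArgumentPair:
    topic: str          # e.g., "gay marriage" or "abortion"
    argument1: str      # text of the first argument
    argument2: str      # text of the second argument
    is_same_side: bool  # True if both arguments take the same stance on the topic

pair = ArgumentPair(
    topic="gay marriage",
    argument1="... because i don't believe in gay marriage ...",
    argument2="... i am against gay marriage personally ...",
    is_same_side=True,
)
print(pair.is_same_side)  # -> True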
2 Submission Ensembles

In this section, we first report on the results of the individual classifiers that were submitted to the shared task. Then, we present the two ensembles (oracle and majority) and compare their effectiveness to that of the individual classifiers.

2.1 Classifiers' Effectiveness

To exclude potential noise that may be introduced by ineffective classifiers, we consider here only those classifiers that achieved an accuracy higher than 0.6 in both the cross-topics and within-topic experiments. Table 1 shows the results of the classifiers that satisfied our quality constraint. The constraint applies to five of the eleven submitted classifiers.

                        Within-Topic         Cross-Topics
Team                    Pre   Rec   Acc      Pre   Rec   Acc
Trier University        0.85  0.66  0.77     0.73  0.72  0.73
Leipzig University      0.79  0.73  0.77     0.72  0.72  0.72
IBM Research            0.69  0.59  0.66     0.62  0.49  0.60
TU Darmstadt            0.68  0.52  0.64     0.64  0.59  0.63
Düsseldorf University   0.70  0.33  0.60     0.72  0.53  0.66

Table 1: The results of the submissions that achieved more than 0.6 accuracy in the within-topic and cross-topics experiments, in terms of precision (Pre), recall (Rec), and accuracy (Acc).

2.2 Combined Results: Ensembles

To aggregate the classifiers, we combine the predictions of the submitted classifiers in a majority ensemble as well as an oracle ensemble. Both ensembles utilize the predictions of the most effective submitted classifiers. The majority ensemble predicts the stance label of an argument pair using the majority vote of the classifiers' predictions, while the oracle ensemble uses the ground-truth labels to pick a classifier with the correct predicted label, if one exists.

                  Within-Topic         Cross-Topics
Ensemble          Pre   Rec   Acc      Pre   Rec   Acc
Oracle            0.99  1.00  1.00     1.00  0.99  1.00
Majority          0.82  0.64  0.75     0.75  0.60  0.70

Table 2: The results of the ensemble classifiers oracle and majority in the within-topic and cross-topics experiments, in terms of precision (Pre), recall (Rec), and accuracy (Acc).

Table 2 shows the results of the oracle and majority ensembles in the cross-topics and within-topic experiments. The oracle ensemble reaches an accuracy of 1.00 in both experiments. This shows that combining several classifiers to tackle the same-side stance classification task is a promising direction to pursue. The results also show that almost all instances in the test dataset were classified correctly by at least one system. In comparison to the top classifier, the majority ensemble achieves lower accuracy in both experiments. Still, it achieves a precision of 0.75 in the cross-topics experiment, which is 0.02 points higher than the top classifier (Trier University). Besides, the majority ensemble achieves higher precision than the second-best classifier (Leipzig University). The inability of the majority ensemble to improve over the best systems overall signals the superiority of the top two systems (Trier University and Leipzig University) over the other three systems. However, since almost all instances in the test dataset were classified correctly by at least one of the systems, it seems that the different systems learned different patterns for the task.
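To make the two aggregation schemes concrete, the following minimal Python sketch shows one way the majority and oracle ensembles can be implemented; the function names, the list-based interface, and the example labels are our own illustrative assumptions, not part of any submitted system.

from collections import Counter

def majority_vote(predictions):
    # Return the label predicted by most classifiers for one argument pair.
    return Counter(predictions).most_common(1)[0][0]

def oracle_pick(predictions, gold_label):
    # Return the gold label if at least one classifier predicted it;
    # otherwise fall back to the majority vote.
    return gold_label if gold_label in predictions else majority_vote(predictions)

# Example: five classifiers vote on one argument pair.
votes = ["same", "different", "same", "same", "different"]
print(majority_vote(votes))             # -> same
print(oracle_pick(votes, "different"))  # -> different (at least one classifier was right)

The oracle is, of course, not a deployable classifier, since it peeks at the gold labels; it only serves as an upper bound on what a perfect combination of the submitted systems could achieve.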
3 Case Analysis

In this section, we present the outcomes of manually analyzing the predictions of the eleven systems submitted to the shared task. We examine the argument pairs that were classified correctly (or wrongly) by most of the systems. A careful review of these pairs reveals some easy and hard cases for same-side stance classification. In the following, we discuss these cases in detail.

3.1 Easy Cases

In total, we found 1,234 pairs that all the submitted systems classified correctly: 1,215 in the cross-topics experiment and 19 in the within-topic experiment. From these pairs, we determined four cases in which classifying the same-side stance is computationally feasible (i.e., easy cases):

1. The stance towards the same topic is expressed explicitly in the two arguments:

Argument 1. . . . because i don't believe in gay marriage . . .

Argument 2. . . . i want to first off point out that i am against gay marriage personally . . .

2. The two arguments include contradicting statements:

Argument 1. . . . marriage is not a recognition of love and compassion . . .

Argument 2. marriage is about love. . . .

3. An argument questions a certain statement in the other argument:

Argument 1. people should be allowed to make their own choices in life with out having their human rights taken away.

Argument 2. i would like to know how people making their own choices has their rights taken away in the first place. give me something to argue about!

4. An argument quotes a certain statement in the other argument:

Argument 1. i also gave references stating that in the bible homosexuality isn't even accepted.

Argument 2. "i also gave references stating that in the bible homosexuality isn't even accepted" oops - sorry - the bible isn't admissible as a source of law in the us.

3.2 Hard Cases

In the test dataset, 126 argument pairs proved difficult for the systems to classify (125 of them in the cross-topics experiment). Two cases were noticeable in these pairs:

1. Further knowledge about the discussed topic is needed to resolve the stance:

Argument 1. gay marriage violates religious freedoms

Argument 2. gay marriage is a negligible change to institution of marriage

2. The two arguments agree on one aspect related to the topic but disagree on other aspects:

Argument 1. marriage is a euphemism for using the government to enforce a relationship. there's no problem with gays getting married, but they shouldn't marry with government involvement.

Argument 2. i say we let the gays get married. it's not like it affects anyone but them anyway.

4 Data Quality

The shared task datasets are derived from the args.me corpus (Wachsmuth et al., 2017b). This corpus incorporates five different debate platforms: four comprise arguments in monological form, while one embraces arguments within dialogues (aka debates). Because the latter is the largest platform, contributing more than 182,198 arguments (63%) to the args.me corpus, it largely dominates the shared task datasets.

Deriving arguments from dialogues, however, requires extensive preprocessing, including removing meta-dialogue and meta-user information, de-contextualizing arguments, and filtering low-quality texts that contain abusive language or spam.
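As a rough illustration of what such filtering could look like, the Python sketch below flags texts that consist solely of debate meta-information by matching a few handcrafted patterns; the pattern list, the length threshold, and the function name are our own assumptions and are by no means exhaustive.

import re

# Handcrafted patterns for debate meta-information (illustrative, not exhaustive).
META_PATTERNS = [
    r"\bthis round is for acceptance\b",
    r"\bmy opponent (has|had) forfeited\b",
    r"\bi accept (this|the) debate\b",
    r"\bgood luck to my opponent\b",
]

def is_meta_only(text, max_length=200):
    # Flag short texts that match a meta-debate pattern and therefore
    # should probably not be treated as arguments.
    text = text.lower()
    return len(text) <= max_length and any(re.search(p, text) for p in META_PATTERNS)

print(is_meta_only("this round is for acceptance only."))  # -> True
print(is_meta_only("marriage is about love."))             # -> False

In practice, such heuristics would only be one component of a broader cleaning pipeline alongside de-contextualization and abusive-language filtering.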
This preprocessing was not performed for the shared task datasets, which led to several invalid argument instances. Overall, we found two main problematic cases:

1. The argument consists solely of debate meta-information:

Argument. this round is for acceptance only. the rest will be for argumentation.

Argument. my opponent had forfeited the round, so my arguments stand unchallenged.

2. The argument contains an ad hominem attack:

Argument. like i said i didnt copy crap! and if you are going to acusse me for something i didn't do, then i wish to never have another debate with you again.

Given that these cases occur frequently in the shared task datasets, we suggest the following improvements:

• Using only monological sources of arguments, since dialogues require the preprocessing steps mentioned above.

• Conducting manual annotation or validation of the argument pairs, especially those included in the test datasets.

5 Conclusion

Analysing the output of shared tasks is key to learning lessons and prompting future development. This paper addresses the new shared task of same-side stance classification, presenting an analysis of its submissions and data. In particular, we have found that ensemble models have the potential to increase the effectiveness of tackling the task. We have also observed that missing knowledge in the arguments and the possibility of partial agreement/disagreement between them are the main challenges of the task.

References

Yamen Ajjour, Henning Wachsmuth, Johannes Kiesel, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Data Acquisition for Argument Search: The args.me corpus. In 42nd German Conference on Artificial Intelligence (KI 2019). Springer.