Analysing the Submissions to the Same Side Stance Classification Task

Yamen Ajjour
Bauhaus-Universität Weimar, Germany
yamen.ajjour@uni-weimar.de

Khalid Al-Khatib
Leipzig University, Germany
khalid.alkhatib@uni-leipzig.de

Abstract

This paper presents an analysis of the submissions to the first shared task on same-side stance classification. The analysis draws attention to the potential of combining the submissions in ensemble models, demonstrates the cases in which the top-performing submissions succeed in resolving the same-side stance and those in which they fail, and puts forward several suggestions for enhancing the datasets used in the shared task.

1 Introduction

The recently proposed task of same-side stance classification aims at identifying whether two arguments share the same or a different stance toward a given topic. Approaching this task, the first shared-task competition¹ was introduced at the second symposium of the RATIO priority program² and conducted at the ArgMining workshop at ACL 2019³. In this shared task, two sets of arguments belonging to the topics of abortion and gay marriage were sampled from the args.me corpus (Ajjour et al., 2019) and prepared for two experimental settings: cross-topics and within-topic. Eleven different systems were submitted to the shared task. These systems employed several supervised classifiers with various features, achieving an effectiveness that ranges between 0.5 and 0.77 in terms of accuracy.

The paper at hand presents diverse insights into same-side stance classification based on analysing the systems submitted to the shared task. In particular, we examine the effectiveness of aggregating the submitted classifiers by combining them into two ensemble models (majority and oracle). The two models were evaluated in both experimental settings.

In addition to analyzing the ensemble models, we scrutinize the data cases that most of the classifiers tackle successfully (i.e., easy cases) and the cases in which most of the classifiers fail (i.e., hard cases). Also, we conduct a manual inspection of the task data, bringing to light its limitations and proposing several suggestions to enhance it.

Our experiments show that while the majority ensemble is comparable to the best systems, the oracle ensemble achieves optimal effectiveness. This shows that almost all instances in the test dataset were classified correctly by at least one submitted system. The inability of the majority ensemble to outperform the submitted classifiers indicates the dominance of the top two systems (Trier University and Leipzig University). Overall, the results show the potential of using ensemble models to tackle the same-side stance classification task.

Regarding the case inspection, we identify diverse easy cases for the classifiers, including when the stance towards the topic is stated explicitly using a linguistic indicator, when an argument questions certain statements in the other argument of the pair, and when the two arguments embody contradicting statements. For the hard cases, we notice that the classifiers fail to predict the correct stance when the knowledge about the discussed topic is insufficient to resolve the stance, as well as when the two arguments have partial agreement/disagreement.

Lastly, towards improving the shared task datasets, we observe some problems in the data, such as the treatment of debate meta-information as arguments. Based on our investigation of web resources, we propose suggestions for sampling higher-quality data for the task.

¹ https://events.webis.de/sameside-19/
² http://ratio.sc.cit-ec.uni-bielefeld.de/events/yearly-symposium-may-2019/
³ argmining19.webis.de

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
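For illustration, an instance of this task can be thought of as a pair of argument texts on a topic together with a binary same-side label. The following minimal Python sketch uses field names of our own choosing; it does not reflect the official format of the shared task data.

from dataclasses import dataclass

# Illustrative representation of a same-side stance classification instance
# (field names are our own assumptions, not the official data format).
@dataclass
class ArgumentPair:
    topic: str          # e.g., "gay marriage" or "abortion"
    argument1: str      # text of the first argument
    argument2: str      # text of the second argument
    is_same_side: bool  # True if both arguments take the same stance on the topic

pair = ArgumentPair(
    topic="gay marriage",
    argument1="... because i don't believe in gay marriage ...",
    argument2="... i am against gay marriage personally ...",
    is_same_side=True,
)
print(pair.is_same_side)  # -> True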
2 Submission Ensembles

In this section, we first report on the results of the individual classifiers that were submitted to the shared task. Then, we present the two ensembles (oracle and majority) and compare their effectiveness to that of the individual classifiers.

2.1 Classifiers' Effectiveness

To exclude potential noise that may be introduced by ineffective classifiers, we consider here only those classifiers that achieved an accuracy higher than 0.6 in both the cross-topics and within-topic experiments. Table 1 shows the results of the classifiers that satisfied our quality constraint. The constraint applies to five of the eleven submitted classifiers.

                        Within-Topic         Cross-Topics
Team                    Pre   Rec   Acc      Pre   Rec   Acc
Trier University        0.85  0.66  0.77     0.73  0.72  0.73
Leipzig University      0.79  0.73  0.77     0.72  0.72  0.72
IBM Research            0.69  0.59  0.66     0.62  0.49  0.60
TU Darmstadt            0.68  0.52  0.64     0.64  0.59  0.63
Düsseldorf University   0.70  0.33  0.60     0.72  0.53  0.66

Table 1: The results of the submissions that achieved more than 0.6 accuracy in the within-topic and cross-topics experiments, in terms of precision (Pre), recall (Rec), and accuracy (Acc).

2.2 Combined Results: Ensembles

To aggregate the classifiers, we combine the predictions of the submitted classifiers in a majority ensemble as well as an oracle ensemble. Both ensembles utilize the predictions of the most effective submitted classifiers. The majority ensemble predicts the stance label of an argument pair using the majority vote of the classifiers' predictions, while the oracle ensemble uses the ground-truth labels to pick a classifier with the correct predicted label, if one exists.

                  Within-Topic         Cross-Topics
Ensemble          Pre   Rec   Acc      Pre   Rec   Acc
Oracle            0.99  1.00  1.00     1.00  0.99  1.00
Majority          0.82  0.64  0.75     0.75  0.60  0.70

Table 2: The results of the ensemble classifiers oracle and majority in the within-topic and cross-topics experiments, in terms of precision (Pre), recall (Rec), and accuracy (Acc).

Table 2 shows the results of the oracle and majority ensembles in the cross-topics and within-topic experiments. The oracle ensemble reaches an accuracy of 1.00 in both experiments. This shows that combining several classifiers to tackle the same-side stance classification task is a promising direction to pursue. The results also show that almost all instances in the test dataset were classified correctly by at least one system. In comparison to the top classifier, the majority ensemble achieves lower accuracy in both experiments. Still, it achieves a precision of 0.75 in the cross-topics experiment, which is 0.02 points higher than the top classifier (Trier University). Besides, the majority ensemble achieves higher precision than the second-best classifier (Leipzig University). The inability of the majority ensemble to improve over the best systems overall signals the superiority of the top two systems (Trier University and Leipzig University) over the other three systems. However, since almost all instances in the test dataset were classified correctly by at least one of the systems, it seems that the different systems learned different patterns for the task.
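To make the two aggregation schemes concrete, the following minimal Python sketch shows one way the majority and oracle ensembles can be implemented; the function names, the list-based interface, and the example labels are our own illustrative assumptions, not part of any submitted system.

from collections import Counter

def majority_vote(predictions):
    # Return the label predicted by most classifiers for one argument pair.
    return Counter(predictions).most_common(1)[0][0]

def oracle_pick(predictions, gold_label):
    # Return the gold label if at least one classifier predicted it;
    # otherwise fall back to the majority vote.
    return gold_label if gold_label in predictions else majority_vote(predictions)

# Example: five classifiers vote on one argument pair.
votes = ["same", "different", "same", "same", "different"]
print(majority_vote(votes))             # -> same
print(oracle_pick(votes, "different"))  # -> different (at least one classifier was right)

The oracle is, of course, not a deployable classifier, since it peeks at the gold labels; it only serves as an upper bound on what a perfect combination of the submitted systems could achieve.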
3 Case Analysis

In this section, we present the outcomes of manually analyzing the predictions of the eleven systems submitted to the shared task. We examine the argument pairs that were classified correctly (or wrongly) by most of the systems. A careful review of these pairs reveals some easy and hard cases for same-side stance classification. In the following, we discuss these cases in detail.

3.1 Easy Cases

In total, we found 1,234 pairs that all the submitted systems classified correctly: 1,215 in the cross-topics experiment and 19 in the within-topic experiment. From these pairs, we determined four cases in which classifying the same-side stance is computationally feasible (i.e., easy cases):

1. The stance towards the same topic is expressed explicitly in the two arguments:

Argument 1. . . . because i don't believe in gay marriage . . .

Argument 2. . . . i want to first off point out that i am against gay marriage personally . . .

2. The two arguments include contradicting statements:

Argument 1. . . . marriage is not a recognition of love and compassion . . .

Argument 2. marriage is about love. . . .

3. An argument questions a certain statement in the other argument:

Argument 1. people should be allowed to make their own choices in life with out having their human rights taken away.

Argument 2. i would like to know how people making their own choices has their rights taken away in the first place. give me something to argue about!

4. An argument quotes a certain statement in the other argument:

Argument 1. i also gave references stating that in the bible homosexuality isn't even accepted.

Argument 2. "i also gave references stating that in the bible homosexuality isn't even accepted" oops - sorry - the bible isn't admissible as a source of law in the us.

3.2 Hard Cases

In the test dataset, 126 argument pairs proved difficult for the systems to classify (125 of them in the cross-topics experiment). Two cases were noticeable in these pairs:

1. Further knowledge about the discussed topic is needed to resolve the stance:

Argument 1. gay marriage violates religious freedoms

Argument 2. gay marriage is a negligible change to institution of marriage

2. The two arguments agree on one aspect related to the topic but disagree on other aspects:

Argument 1. marriage is a euphemism for using the government to enforce a relationship. there's no problem with gays getting married, but they shouldn't marry with government involvement.

Argument 2. i say we let the gays get married. it's not like it affects anyone but them anyway.

4 Data Quality

The shared task datasets are derived from the args.me corpus (Wachsmuth et al., 2017b). This corpus incorporates five different debate platforms: four comprise arguments in monological form, while one embraces arguments within dialogues (aka debates). Because the latter is the largest platform, contributing more than 182,198 arguments (63%) to the args.me corpus, it largely dominates the shared task datasets.

Deriving arguments from dialogues, however, requires extensive preprocessing, including removing meta-dialogue and meta-user information, de-contextualizing arguments, and filtering low-quality texts that contain abusive language or spam.
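As a rough illustration of what such filtering could look like, the Python sketch below flags texts that consist solely of debate meta-information by matching a few handcrafted patterns; the pattern list, the length threshold, and the function name are our own assumptions and are by no means exhaustive.

import re

# Handcrafted patterns for debate meta-information (illustrative, not exhaustive).
META_PATTERNS = [
    r"\bthis round is for acceptance\b",
    r"\bmy opponent (has|had) forfeited\b",
    r"\bi accept (this|the) debate\b",
    r"\bgood luck to my opponent\b",
]

def is_meta_only(text, max_length=200):
    # Flag short texts that match a meta-debate pattern and therefore
    # should probably not be treated as arguments.
    text = text.lower()
    return len(text) <= max_length and any(re.search(p, text) for p in META_PATTERNS)

print(is_meta_only("this round is for acceptance only."))  # -> True
print(is_meta_only("marriage is about love."))             # -> False

In practice, such heuristics would only be one component of a broader cleaning pipeline alongside de-contextualization and abusive-language filtering.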
This preprocessing was not performed for the shared task datasets, which led to several invalid argument instances. Overall, we found two main problematic cases:

1. The argument consists solely of debate meta-information:

Argument. this round is for acceptance only. the rest will be for argumentation.

Argument. my opponent had forfeited the round, so my arguments stand unchallenged.

2. The argument contains an ad hominem attack:

Argument. like i said i didnt copy crap! and if you are going to acusse me for something i didn't do, then i wish to never have another debate with you again.

Given that these cases occur frequently in the shared task datasets, we suggest the following improvements:

• Using only monological sources of arguments, since dialogues require the preprocessing steps mentioned above.

• Conducting manual annotation or validation of the argument pairs, especially those included in the test datasets.

5 Conclusion

Analysing the output of shared tasks is key to learning lessons and prompting future development. This paper addresses the new shared task of same-side stance classification, presenting an analysis of its submissions and data. In particular, we have found that ensemble models have the potential to increase the effectiveness of tackling the task. We have also observed that missing knowledge in the arguments and the possibility of partial agreement/disagreement between them are the main challenges of the task.

References

Yamen Ajjour, Henning Wachsmuth, Johannes Kiesel, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Data Acquisition for Argument Search: The args.me corpus. In 42nd German Conference on Artificial Intelligence (KI 2019). Springer.