<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Same Side Stance Classification Task: Facilitating Argument Stance Classification by Fine-tuning a BERT Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Ollinger</string-name>
          <email>stefan.ollinger@gmx.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorik Dumani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premtim Sahitaj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Bergmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Schenkel</string-name>
          <email>schenkelg@uni-trier.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Trier, D-54286 Trier, Germany</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Computational argumentation is currently an intensively investigated field of research. Its goal is to find the best pro and con arguments for a user-given topic, either to form an opinion for oneself or to persuade others to adopt a certain standpoint. While existing argument mining methods can find appropriate arguments for a topic, a correct classification into pro and con is not yet reliable. The same side stance classification task provides a dataset of argument pairs labeled by whether or not both arguments share the same stance; it does not require distinguishing between topic-specific pro and con vocabulary, as only the similarity of arguments within a stance needs to be assessed. Our contribution to the task builds on the BERT architecture: we fine-tuned a pre-trained BERT model for three epochs and used the first 512 tokens of each argument to predict whether two arguments share the same stance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Argumentation is an everyday human
activity. We argue in domains such as health, law,
and politics, either trying to find standpoints that
are acceptable because they are supported by
reasons, or to persuade others of a certain point of
view and, if necessary, of carrying out certain actions.
Computational Argumentation (CA) aims to find
argument representations and models that are well
suited for computation with arguments. CA is
a young and fast-growing field of research. In the
simplest case, an argument is a claim supported
or opposed by at least one premise
        <xref ref-type="bibr" rid="ref6">(Peldszus and
Stede, 2013)</xref>
        . An example of a claim c could be
“We need to abolish nuclear energy”; examples of
premises that support and oppose this claim could
be p1 = “Renewable energy sources will
eventually be able to replace fossil fuel and nuclear
energy” and p2 = “Nuclear energy is a cheap
alternative to fossil fuels”, respectively. Common tasks
in CA include argument mining (AM) and
argument retrieval (AR). AM reconstructs arguments
from textual sources, e.g., in the form of an argument
graph. AR finds all relevant arguments for a topic.
Existing argument search engines like ARGS1 or
ARGUMENTEXT2 search for the best supporting
and opposing premises for a user query on a
usually controversial topic and list them separately as
pro and con. The correct classification of stances
is therefore a fundamental task in computational
argumentation. However, a shortcoming of
current stance classification algorithms is that their
classifiers must be trained for a particular topic,
i.e., they cannot be reliably applied across
topics
        <xref ref-type="bibr" rid="ref10 ref9">(Webis, 2019a)</xref>
        .
      </p>
      <p>In the task SAME SIDE (STANCE)
CLASSIFICATION, a simplified variant is
examined, namely whether two given arguments on a
topic have the same stance. For example, p1 and
p2 have different stances, but p1 and p3 = “The
danger from radioactive contamination should be
avoided” would have the same stance on the topic
nuclear energy. The particular difficulty lies in the
fact that p1 and p3 are syntactically very different,
so we have to decide on a semantic level whether
the stances are the same.</p>
      <p>
        In this paper, we present a method in which we
fine-tune a pre-trained BERT
        <xref ref-type="bibr" rid="ref4">(Devlin et al., 2019)</xref>
        model to decide whether two arguments have the
same stance. In Section 2 we discuss related work.
In Section 3 we specify the dataset. Then, in
Section 4 we describe the implementation and the
evaluation of our approach. Finally, Section 5
concludes the paper.
      </p>
      <p>1www.args.me
2www.argumentext.de</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        In our implementation we make use of BERT
        <xref ref-type="bibr" rid="ref4">(Devlin et al., 2019)</xref>
        which achieved state-of-the-art
results in many NLP tasks such as Natural
Language Inference (MNLI), semantic textual
similarity (STS), and others. BERT makes use
of the Transformer architecture
        <xref ref-type="bibr" rid="ref8">(Vaswani et al.,
2017)</xref>
        , more precisely, it applies bidirectional
masked language model training to this
architecture. In contrast to previous embedding
techniques like WORD2VEC
        <xref ref-type="bibr" rid="ref5">(Mikolov et al., 2013)</xref>
        , it learns contextualized
representations of words in their context.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref7">(Stab and Gurevych, 2014)</xref>
        address
argumentative relation classification: the relation between
two argument components is classified as support
or non-support. To this end, a range of
structural, lexical, and syntactic features is
defined and extracted for each argument component
pair, and the classification is done with an SVM.
      </p>
      <p>Bar-Haim et al. (2017a) address stance
classification of premises towards a claim topic. Here,
the classification task is divided into
simpler sub-tasks. Bar-Haim et al. (2017b) extend this
work with a more extensive sentiment lexicon and
contextual features.</p>
      <p>The presented approaches mostly classify the
relation of a premise towards a claim. In the same
side stance classification, in contrast, the relation
between two premises is considered. Furthermore,
we do not apply feature engineering, but rely on the
neural network to extract good features.</p>
    </sec>
    <sec id="sec-3">
      <title>3 The Provided Dataset</title>
      <p>The arguments from the provided dataset were
extracted from the four web sources idebate.org,
debatepedia.org, debatewise.org, and debate.org.
Each instance consists of the seven fields that are
depicted in Table 1.</p>
      <p>The two most discussed topics “abortion” and
“gay marriage” were chosen, and two experiments
were set up for the same side stance classification
task. The first experiment addresses the
classification within topics and consists of a training set
with arguments for a set of topics (abortion and
gay marriage) and a test set with arguments
related to the same set of topics. Table 2 gives
an overview of the data within topics.</p>
      <p>For the second experiment, which addresses the
classification across topics, the training set
contains arguments for one topic (abortion) and the
arguments of the test set are related to another set of
topics. The class Same Side contains 31,195 instances,
the class Different Side 29,853.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>The seven fields of a dataset instance.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>id</td><td>The id of the instance.</td></tr>
            <tr><td>topic</td><td>The title of the debate. It can be a general topic (e.g. abortion) or a topic with a stance (e.g. abortion should be legalized).</td></tr>
            <tr><td>argument1</td><td>A pro or con argument related to the topic.</td></tr>
            <tr><td>argument1 id</td><td>The ID of argument1.</td></tr>
            <tr><td>argument2</td><td>A pro or con argument related to the topic.</td></tr>
            <tr><td>argument2 id</td><td>The ID of argument2.</td></tr>
            <tr><td>is same stance</td><td>True or False. True in case argument1 and argument2 have the same stance towards the topic and False otherwise.</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>In this section we describe the experimental setup
and evaluate our approach utilizing BERT and
compare it to an SVM baseline.</p>
      <sec id="sec-4-1">
        <title>4.1 Hypotheses</title>
        <p>In order to measure the performance of our
approach, the following hypotheses were formulated
and are subject of this evaluation:</p>
        <p>H1: A Transformer-based sequence
classification improves upon the SVM baseline.</p>
        <p>H2: The large Transformer model
outperforms the smaller base model.</p>
        <p>H3: Longer input sequences yield better
classification results than shorter sequence lengths.</p>
        <p>H4: Classification of full sentences performs
better than including partial input sentences.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Experimental Setup</title>
        <p>
          First, we divided the provided dataset into
training and test sets (90% and 10%). This results
in 57,512 training pairs and 6,391 test pairs for
within-topics classification, as well as 54,943
training and 6,105 test pairs for cross-topics
classification, all taken from the labeled training
data of the shared task. Then we used BERT
          <xref ref-type="bibr" rid="ref4">(Devlin et al., 2019)</xref>
          3 in our
implementation and fine-tuned it to classify whether
two arguments share the same stance. We used both
provided models, base and large, always fine-tuning
for three epochs. All models use lower-cased token
sequences and vocabulary. It should be noted that
BERT is limited to a fixed input size, the
maximum being 512 tokens for the pre-trained models.
Longer input sequences are truncated to the
maximum sequence length. This truncation can lead to
a loss of information, which we evaluate with
hypotheses 3 and 4.
        </p>
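        <p>The truncation of an argument pair to BERT's fixed input size can be sketched as follows. This is our own simplified longest-first scheme in plain Python, shown for illustration only; the actual preprocessing in the transformers library operates on WordPiece tokens and may truncate differently.</p>

```python
def truncate_pair(tokens_a, tokens_b, max_len=512):
    """Trim two token lists so that, together with the three special
    tokens [CLS], [SEP], [SEP], they fit into max_len positions.
    Tokens are removed from the end of the currently longer argument."""
    budget = max_len - 3  # reserve room for [CLS] arg1 [SEP] arg2 [SEP]
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()  # drop the last token of the longer sequence
        else:
            b.pop()
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
```

        <p>With max_len=512, a pair whose summed length is at most 509 tokens passes through unchanged; longer pairs lose tokens from the ends of the arguments, which is the information loss examined by hypotheses 3 and 4.</p>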
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Results and Discussion</title>
        <p>Figure 1 shows the results within the same topic
with varying maximum sequence length;
Figure 2 shows the results across topics.
Argument pairs whose length exceeds the maximum
sequence length are uniformly truncated. In the
within-topics evaluation, large yields higher accuracy
than base in four of five cases. In the cross-topics
evaluation, the result is similar. Therefore, hypothesis 2
can be accepted. Nevertheless, the smaller base
model comes quite close to large.</p>
        <p>The SVM baseline, supplied by the shared
task organizers, achieves 54% accuracy within
topics and 58% across topics. The base model
already improves upon this with the smallest
sequence length of 32 tokens. Thus, hypothesis 1
can be accepted. This result is possibly due to a
Transformer having a larger model capacity and
employing better-suited representations of
natural language text compared to an SVM.</p>
        <p>Next, we take a look at argumentative input of
varying maximum sequence length. We can see
from Figure 1 and Figure 2 that the classification
benefits from more contextual information.
Hypothesis 3 can therefore be accepted. One question
is why a model using 64 tokens already performs
quite well. Figure 3 shows the distribution of the
summed argument lengths. We can observe
that the majority (76%) consists of fewer than 512
tokens. As we can infer from Figure 4, the
distribution of the lengths is usually considerably
below 512 tokens. This explains why models with
rather short contextual information already perform
quite well.</p>
        <p>3https://github.com/huggingface/transformers</p>
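        <p>The share of pairs that fit into the model input could be computed roughly as follows. This sketch uses whitespace splitting as a stand-in for BERT's WordPiece tokenizer; the 76% figure above stems from the actual dataset, not from this helper.</p>

```python
def fraction_within_limit(pairs, max_tokens=512):
    """Return the fraction of argument pairs whose summed token count
    stays within the model's maximum sequence length."""
    fitting = sum(
        1 for a, b in pairs
        if len(a.split()) + len(b.split()) <= max_tokens  # crude token count
    )
    return fitting / len(pairs)
```

        <p>Note that WordPiece splits rare words into several sub-tokens, so a whitespace count underestimates the true sequence length somewhat.</p>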
        <p>
          Since the input sequences are truncated, the
Transformer model also trains on incomplete,
partial natural language sentences. In order to see
whether a model learns better from full
sentences, we filter out all argument pairs
longer than 512 tokens, reducing the available
training and testing data (no trunc train/test). As can
be seen in Table 3 and Table 4, the Transformer
is able to learn from partial sentences. Therefore,
hypothesis 4 has to be rejected. The highest
results are achieved when testing is also done on
untruncated, full sentences. This result is an indicator
of what could be achieved with Transformers with a
larger or variable maximum sequence length, such
as explored by
          <xref ref-type="bibr" rid="ref3">(Dai et al., 2019)</xref>
          .
        </p>
        <p>[Figures 1 and 2: classification accuracy of the base and large models for maximum sequence lengths 32, 64, 128, 256, and 512.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>
        In this paper we have contributed to the SAME
SIDE (STANCE) CLASSIFICATION task and
proposed a method which uses a fine-tuned BERT
model to determine whether two given arguments
have the same stance. Our method outperformed
the baseline of the organizers. In
our evaluation, the large model performs better
than the base model. Our results also show that
longer input sequences are classified better than
shorter ones, and that classifying whole sentences
does not perform better than classifying partial
sentences. According to the organizers’
leaderboard
        <xref ref-type="bibr" rid="ref10 ref9">(Webis, 2019b)</xref>
        4 our approach performed
best across topics, with precision and recall values
of 0.72 and an accuracy of 0.73. For within-topics,
we achieved the best performance, as did the
ASV team from Leipzig University, with an
accuracy of 0.77. However, for this task we had a
higher precision (0.85 vs. 0.79) but a lower recall
(0.66 vs. 0.73).
      </p>
      <p>4Ranking on the 16th August 2019.</p>
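      <p>For reference, the reported figures follow the standard confusion-matrix definitions, which can be sketched as follows (our own helper with illustrative counts, not the organizers' evaluation script):</p>

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts
    (positive class: 'same side')."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # how many predicted positives are correct
        "recall": tp / (tp + fn),     # how many actual positives are found
    }
```

      <p>A higher precision with lower recall, as in the within-topics comparison above, means the classifier is more conservative: it labels fewer pairs as same-side, but is right more often when it does.</p>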
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work has been funded by the Deutsche
Forschungsgemeinschaft (DFG) within the project
ReCAP, Grant Number 375342983 - 2018-2020,
as part of the Priority Program “Robust
Argumentation Machines (RATIO)” (SPP-1999).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Roy</given-names>
            <surname>Bar-Haim</surname>
          </string-name>
          , Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and
          <string-name>
            <given-names>Noam</given-names>
            <surname>Slonim</surname>
          </string-name>
          . 2017a.
          <article-title>Stance classification of context-dependent claims</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          EACL 2017
          , pages
          <fpage>251</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Roy</given-names>
            <surname>Bar-Haim</surname>
          </string-name>
          , Lilach Edelstein, Charles Jochim, and
          <string-name>
            <given-names>Noam</given-names>
            <surname>Slonim</surname>
          </string-name>
          . 2017b.
          <article-title>Improving claim stance classification with lexical knowledge expansion and context utilization</article-title>
          .
          <source>In Proceedings of the 4th Workshop on Argument Mining</source>
          ,
          <source>ArgMining@EMNLP</source>
          <year>2017</year>
          , pages
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Zihang</given-names>
            <surname>Dai</surname>
          </string-name>
          , Zhilin Yang,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          , Jaime G. Carbonell, Quoc Viet Le, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Transformer-xl: Attentive language models beyond a fixed-length context</article-title>
          .
          <source>In Proceedings of the 57th Conference of the Association for Computational Linguistics</source>
          ,
          ACL 2019
          , pages
          <fpage>2978</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Gregory S. Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</source>
          <year>2013</year>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Peldszus</surname>
          </string-name>
          and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Stede</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>From argument diagrams to argumentation mining in texts: A survey</article-title>
          .
          <source>International Journal of Cognitive Informatics and Natural Intelligence</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Stab</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Identifying argumentative discourse structures in persuasive essays</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29</source>
          ,
          <year>2014</year>
          , Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL
          , pages
          <fpage>46</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          4-9 December 2017
          , Long Beach, CA, USA, pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          Webis. 2019a.
          <article-title>Same Side Stance Classification</article-title>
          . https://sameside.webis.de/. Accessed: 2019-12-06.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          Webis. 2019b.
          <article-title>Same Side Stance Classification Leaderboard</article-title>
          . https://sameside.webis.de/leaderboard.html. Accessed: 2019-12-06.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>