<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Same Side Stance Classification Using Contextualized Sentence Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erik Körner</string-name>
          <email>&lt;first&gt;.&lt;last&gt;@uni-leipzig.de</email>
        </contrib>
        <aff>Leipzig University</aff>
      </contrib-group>
      <abstract>
        <p>The same side stance classification shared task surveyed approaches for deciding whether two arguments take the same stance towards a particular topic. We show that embeddings derived from the transformer model BERT (Devlin et al., 2019) outperform traditional bag-of-words and count-based word embeddings, yielding one of the two best-performing models on this task at the time of writing. In this paper, we detail our approach and further explore which of its hyperparameters influence the accuracy of our model with respect to the two task variants studied. We conclude that our model is good enough for the shared task but may need a more exhaustive inspection when exposed to a broader variety of data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>At the sixth Argument Mining workshop,
ArgMining 2019, the same side stance classification
problem was introduced by Ajjour et al. (2020)
as a shared task for the argument mining
community. Identifying the stance of an argument
towards a topic is a fundamental problem in
computational argumentation. The same side task, for short,
presents a new problem variant, namely to
classify whether two arguments share the same stance
without the need to identify the stance itself. The
underlying hypothesis is that this can be achieved
in a topic-agnostic manner, since, presumably, only
the similarity of the two given arguments needs to be
assessed. To allow for the task’s evaluation, the
organizers have provided two datasets to test this
hypothesis. Our contribution to the same side task
is an approach based on the transformer neural
network architecture, one of the two best-performing
submissions to the shared task. Here, we detail our
experiments with hyperparameter settings and data
“preprocessing” to optimize our approach ahead of
submission.</p>
      <p>In what follows, Section 2 reviews related work,
Section 3 explains the provided datasets, Section 4
introduces our approach, and Section 5 reports on
our evaluation.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Prior work on stance classification focuses on
detecting the stance of individual arguments towards
a certain topic and only marginally exploits
argument similarity. Sridhar et al. (2014) describe a
stance classification approach that uses both
linguistic and structural features to predict the stance
of posts in an online debate forum. It models
author and post relations as a weighted graph
and predicts the stance with a set of logic
rules. Walker et al. (2012) exploit the dialogic
structure of online debates to outperform
content-based models. As opinionated language in social
media typically expresses a stance towards a topic,
this makes it possible to link stance
classification to target-dependent sentiment
classification, as demonstrated by Ebrahimi et al. (2016).
Stance classification in tweets was also studied at
SemEval 2016 (Task 6, Mohammad et al. (2016)),
where most participants used n-gram features and
word embeddings, sometimes combined with
sentiment dictionaries. Stance classification also gained
recognition in argument mining, as demonstrated
by Sobhani et al. (2015).</p>
      <p>
        The same side task’s leading hypothesis bears
structural similarity to measuring semantic textual
similarity, on which a number of shared tasks have
been organized
        <xref ref-type="bibr" rid="ref1 ref15 ref4">(Agirre et al., 2013; Xu et al., 2015;
Cer et al., 2017)</xref>
        , and a variety of datasets compiled
        <xref ref-type="bibr" rid="ref6 ref8">(Dolan and Brockett, 2005; Ganitkevitch et al.,
2013)</xref>
        . This suggests that contemporary language
models like BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        , which
represent the state of the art in these tasks, may be a
good starting point to solve the same side task.
      </p>
    </sec>
    <sec id="sec-2a">
      <title>Data</title>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <caption>
          <p>Number of instances (same/different stance) and unique arguments per task and topic.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Task</th>
              <th>Topic</th>
              <th>Instances (same/diff.)</th>
              <th>Unique (arg1/arg2)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>within</td>
              <td>abortion</td>
              <td>20,834 / 20,006</td>
              <td>9,192 (7,107 / 7,068)</td>
            </tr>
            <tr>
              <td>within</td>
              <td>gay marriage</td>
              <td>13,277 / 9,786</td>
              <td>4,391 (3,406 / 3,392)</td>
            </tr>
            <tr>
              <td>cross</td>
              <td>abortion</td>
              <td/>
              <td/>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
The data used for the same side task are derived
from the args.me corpus
        <xref ref-type="bibr" rid="ref3">(Ajjour et al., 2019)</xref>
        ,
comprising pairs of arguments sampled from one of
two topics, namely “abortion” and “gay marriage.”
Each argument pair possesses a binary label,
indicating whether they take the same stance on their
topic or not. The arguments as well as the
labels have been collected from online debate
forums, such as idebate.org, debatepedia.org,
debatewise.org, and debate.org. The shared task is split
into two variants: in the “within
topic” task, arguments on both topics are supplied
for training as well as for testing, while in the “cross
topics” task, arguments on one topic (“abortion”)
are supplied for training, whereas arguments on
the other topic are used for testing.
      </p>
      <p>Table 1 shows the numbers of positive and
negative cases per task and topic. The datasets for both
tasks are of roughly the same size. As individual
arguments are reused to increase the number of
instances, the “Unique” column shows how many
instances remain if every argument is used only once.
Table 2 shows characteristics of the arguments
when using the BERT WordPiece tokenizer and
the NLTK sentence segmenter. The true number
of words may be slightly smaller than the token
count, since WordPiece may split longer words into
multiple sub-word tokens. Note the wide range of
argument lengths.</p>
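      <p>To illustrate, the argument statistics in Table 2 can be computed along the following lines; this is a sketch using the Hugging Face implementation of the BERT WordPiece tokenizer together with NLTK, not the exact script used for the paper.</p>
      <preformat preformat-type="code">
# Sketch: length statistics per argument using a BERT WordPiece
# tokenizer and the NLTK segmenters. The Hugging Face tokenizer is an
# assumption for illustration; any WordPiece implementation works alike.
import nltk
from transformers import BertTokenizer

nltk.download("punkt", quiet=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def argument_stats(argument):
    """Return WordPiece token, word, and sentence counts for one argument."""
    return {
        "wordpieces": len(tokenizer.tokenize(argument)),
        "words": len(nltk.word_tokenize(argument)),
        "sentences": len(nltk.sent_tokenize(argument)),
    }

print(argument_stats("Abortion should be legal. It is a matter of choice."))
      </preformat>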
      <p>Since, at the time of writing, the test set labels
had not yet been released, we split the training
datasets into subsets for training and validation for
our experiments.</p>
    </sec>
    <sec id="sec-3">
      <title>Measuring Stance Similarity</title>
      <p>
        The same side task essentially requires assessing
a certain kind of similarity between two arguments.
We hence chose to reuse models that were
originally developed for paraphrase detection and
for measuring semantic textual similarity
        <xref ref-type="bibr" rid="ref1 ref15">(Agirre
et al., 2013; Xu et al., 2015)</xref>
        . Below, we review our
baselines and introduce our BERT-based model.
      </p>
      <sec id="sec-3-1">
        <title>Baseline</title>
        <p>
          The organizers provided a baseline that represents
arguments as n-gram count vectors and uses an SVM for
classification (https://github.com/webis-de/argmining19-same-side-classification),
achieving 54% accuracy for within-,
and 52% for cross-topic classification (Table 3).
As our first attempt, and second baseline, we used
Doc2Vec
          <xref ref-type="bibr" rid="ref9">(Le and Mikolov, 2014)</xref>
          as implemented
in Gensim
          <xref ref-type="bibr" rid="ref11">(Řehůřek and Sojka, 2010)</xref>
          and also an
SVM for classification. With accuracies of 53%
and 59%, respectively, this model showed no
notable improvement over the organizers'
baseline. Slightly better results were achieved
with a DBOW-DMM concatenation model and a
stochastic gradient descent classifier. A better
performance might have been possible using more data
for training, or a pre-trained model.
        </p>
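        <p>A minimal sketch of this Doc2Vec baseline is given below, assuming Gensim and scikit-learn; the hyperparameters and the concatenation of the two argument vectors are illustrative rather than the exact shared-task configuration.</p>
        <preformat preformat-type="code">
# Sketch of the Doc2Vec baseline: embed each argument with Gensim's
# Doc2Vec, concatenate the two argument vectors of a pair, and train a
# linear SVM on the result. Settings are illustrative assumptions.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC

def train_doc2vec_baseline(pairs, labels):
    """pairs: list of (arg1, arg2) strings; labels: 1 = same side, 0 = not."""
    texts = [t for pair in pairs for t in pair]
    corpus = [TaggedDocument(t.lower().split(), [i]) for i, t in enumerate(texts)]
    d2v = Doc2Vec(corpus, vector_size=300, min_count=2, epochs=20)
    features = np.array([
        np.concatenate([d2v.infer_vector(a.lower().split()),
                        d2v.infer_vector(b.lower().split())])
        for a, b in pairs
    ])
    return d2v, LinearSVC().fit(features, labels)
        </preformat>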
      </sec>
      <sec id="sec-3-2">
        <title>BERT for Same Side Classification</title>
        <p>
          Our approach is based on the well-known BERT
model
          <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
          . Using an existing setup
for sentence pair classification
(https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html)
and adapting it to the same side task's data yielded
promising results out of the box: fine-tuning the
pre-trained uncased BERT-base model bert_12_768_12
from the GluonNLP model zoo
(http://gluon-nlp.mxnet.io/model_zoo/bert/index.html)
with multi-label classification and a max_seq_len of 128 for
3 epochs yielded an accuracy of 83% on the
within-topic task.
        </p>
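        <p>The following sketch shows an equivalent sentence pair fine-tuning setup; it uses the Hugging Face transformers API for brevity, whereas our submission used the GluonNLP BERT scripts referenced above.</p>
        <preformat preformat-type="code">
# Sketch of sentence pair fine-tuning for same side classification,
# written against the Hugging Face transformers API (an assumption; the
# submission used GluonNLP). Both arguments are packed into one input:
# [CLS] arg1 [SEP] arg2 [SEP]
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

enc = tokenizer("Abortion is a human right.", "Banning abortion harms women.",
                truncation=True, max_length=128, return_tensors="pt")
labels = torch.tensor([1])  # 1 = same side

out = model(**enc, labels=labels)  # returns cross-entropy loss and logits
out.loss.backward()                # one training step; optimizer omitted
        </preformat>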
        <p>The classification model employs the standard
pre-trained BERT model architecture with an
additional classification layer, consisting of a dropout
of 0.1 and a dense layer with sigmoid activation.
This layer accepts a pooled vector representation
from the model based on the last hidden state of
the [CLS] token, the first token for each input
sequence intended to represent the whole sequence.
The outputs of the classification layer are either
two classes (multi-class) or a single, binary output
for regression.
We experimented with different hyperparameter
settings: the number of epochs for fine-tuning, with
at least 3 and at most 5; the split between training
and validation instances, which we initially set to
70:30 and for the final models to 90:10; the model
output and loss function, which was either multi-label
with softmax cross-entropy loss or binary with
sigmoid binary cross-entropy loss; and the
parameter max_seq_len, which determines the maximum
number of tokens the model accepts. The latter
defaults to 128 but can be increased up to 512 tokens.</p>
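        <p>A sketch of the classification layer described above, under the assumption of a PyTorch-style pooled [CLS] vector of size 768, could look as follows; the names are ours, not from the GluonNLP scripts.</p>
        <preformat preformat-type="code">
# Sketch of the classification layer: dropout of 0.1 followed by a dense
# layer on the pooled [CLS] representation. The binary variant uses a
# single sigmoid unit; the multi-class variant outputs two logits.
import torch.nn as nn

class SameSideHead(nn.Module):
    def __init__(self, hidden_size=768, binary=True):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.dense = nn.Linear(hidden_size, 1 if binary else 2)
        self.binary = binary

    def forward(self, cls_vector):
        logits = self.dense(self.dropout(cls_vector))
        return logits.sigmoid().squeeze(-1) if self.binary else logits
        </preformat>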
        <p>Since many arguments in the data are rather long
(Table 2), a longer max_seq_len turns out to be
necessary. With a setting of 128, a single
argument can on average have only 64 tokens, since the
model combines the pair of arguments into a single
sequential representation. The remaining tokens
of an argument are truncated from the end until the
pair fits the length restriction. With a max_seq_len
of 512, 75% of all arguments can instead be fed
completely into the model. To test whether the stance
of an argument is expressed at certain positions,
we also modified the model to truncate arguments
from the front, from the end, and randomly from
both sides, until they fit the length restriction.</p>
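        <p>The three truncation strategies can be sketched as follows; the helper below is a simplification that operates on one token list, while the actual model truncates the combined argument pair.</p>
        <preformat preformat-type="code">
# Sketch of the truncation strategies: drop tokens from the end (the
# default), from the front, or randomly from both sides until the
# sequence fits the length restriction.
import random

def truncate(tokens, limit, mode="end"):
    """Trim a token list to at most `limit` tokens."""
    tokens = list(tokens)
    while len(tokens) > limit:
        if mode == "end":
            tokens.pop()       # drop the last token
        elif mode == "front":
            tokens.pop(0)      # drop the first token
        else:                  # "both": drop randomly from either side
            tokens.pop(0) if random.random() >= 0.5 else tokens.pop()
    return tokens

print(truncate("a b c d e f".split(), 4, mode="front"))  # ['c', 'd', 'e', 'f']
        </preformat>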
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>For evaluation, we split the datasets supplied for
the within and the cross-topic tasks randomly into
training and validation sets. Since the cross-topic
task’s dataset contains only a single topic, and since
the labels for the test set with the other topic are not
available yet, we evaluated the model on the same
topic, as exemplified by the organizers' baseline
scripts; our results are to be considered with that
in mind. Since both experiments were to be
considered in isolation, we abstained from evaluating
our cross-topic model with the other topic (“gay
marriage”) supplied for the within-topic task. For
the official results, we refer to the shared task’s
leaderboard, partially reproduced in Table 4. We
employ accuracy, precision, recall, and the
macro-averaged F1 as performance measures.</p>
      <p>The final training / validation split consisted of
a random split of 90% for training and the rest for
validation. Due to the construction of the data,
arguments are reused across pairings (Table 1). We
failed to correct for this during sampling, so
individual arguments may occur in both the training
and the validation set, opening the potential for
information leakage.</p>
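      <p>One way to reduce such leakage, sketched below under the assumption that each argument carries a unique identifier, would be a group-aware split such as scikit-learn's GroupShuffleSplit; a complete solution would also have to account for arguments reused in the second position of a pair.</p>
      <preformat preformat-type="code">
# Sketch of a leakage-aware split: all pairs sharing the same first
# argument end up on the same side of the split. `arg1_ids` is a
# hypothetical list of identifiers, one per pair.
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(pairs, labels, arg1_ids, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    train_idx, val_idx = next(splitter.split(pairs, labels, groups=arg1_ids))
    return train_idx, val_idx
      </preformat>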
      <p>
[Table 3: Accuracy, precision, recall, and macro-averaged F1 for the models BERT-base 128 E, BERT-base 512, and BERT-base 512 E (within), Doc2Vec DBOW-DMM with SVM, LogReg, and SGD classifiers (within and cross), and the organizers' baseline (within and cross).]
        Tables 3 and 4 give an overview of the relative
performance differences between our models and of the
success of parameter tuning for the best-performing
model. Neither the baseline model provided by
the organizers nor our own model using Doc2Vec
embeddings outperforms random classification by a
large margin. Only the transformer-based model
BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        , together with a
classification layer, achieved an improvement of about
20% out of the box, and by tuning its hyperparameters
as outlined above, we achieve a 30-35% accuracy
improvement over the baseline.
      </p>
      <p>Starting with the BERT-base model with
multi-label output and a sequence length of 128, we
achieve 82% accuracy for the within- and 85%
accuracy for the cross-topic task after 3 epochs of
fine-tuning with a training/validation split of 70:30.
Switching from multi-label to a single binary
output with corresponding loss function, we gain
4% accuracy; with a longer sequence length of 512
we gain 5% accuracy. Using the longer sequence
length and truncating longer text sequences from
the front instead of the back, we gain another 3%
to about 90% accuracy on the within-topic task.
Truncating from both ends of longer arguments, so
that we retain the middle part, is detrimental. Note,
however, that only about 25% of the arguments
are longer than the maximum sequence length
restriction (Table 2), so that only that portion of all
instances is affected. We also tried to artificially
double the sequence length by feeding both the
front and the end of an argument through the same
model, concatenating the output before
classification. While doubling the time per epoch of
fine-tuning, this yielded less than 1% accuracy gain.</p>
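      <p>A sketch of this two-window variant, assuming a Hugging Face-style encoder that exposes a pooled output, is shown below; the wiring is ours and only illustrates the idea of sharing one encoder for the front and the end of an argument.</p>
      <preformat preformat-type="code">
# Sketch of the doubled-context variant: encode the front window and the
# end window of an over-long input with the same BERT encoder, then
# concatenate the two pooled vectors before the binary classification layer.
import torch
import torch.nn as nn

class TwoWindowClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                     # shared BERT encoder
        self.head = nn.Linear(2 * hidden_size, 1)  # single binary output

    def forward(self, front_ids, end_ids):
        front = self.encoder(front_ids).pooler_output  # pooled front window
        end = self.encoder(end_ids).pooler_output      # pooled end window
        return torch.sigmoid(self.head(torch.cat([front, end], dim=-1)))
      </preformat>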
      <p>To summarize, increasing the sequence length
to 512 so that most argument pairs fit entirely into
the model input and using the sigmoid binary cross
entropy loss, we achieved the best performance.
Truncation seems to matter somewhat, but more
so for shorter model sequence lengths than for
longer ones, as there is no effect if there is nothing
to truncate. For short sequence lengths, truncating
from the front performs better than truncating from
the end, which suggests that the stance-determining
part may be found at the end of an argument.</p>
      <sec id="sec-4-1">
        <title>Effect of Fine-tuning</title>
        <p>We took a closer look at how the resulting
prediction is affected by differently fine-tuned models.
Tables 5 and 6 show model performance per epoch.
Choosing the best-performing model for the
within-topic task, with a sequence length of 512 and
truncation from the front, we evaluated the model
untuned (i.e., with an untrained classification layer)
and after each epoch of fine-tuning for five epochs.
It is clearly visible that an untrained model has
a strong bias towards a positive same stance
prediction and that fine-tuning is necessary to better
generalize to predict also negative same stance
labels. However, while every additional epoch of
fine-tuning may improve the model, it may at the
same time overfit it to the topics used for training,
limiting generalization to unknown topics.</p>
        <p>As can be seen in Table 5, a single epoch of
fine-tuning is almost enough to get close to the best
result. This naturally depends on how much training
data is supplied, and our current evaluation may be
biased by the fact that some arguments occur both
in the training and the validation data. More epochs
of fine-tuning yield diminishing gains, suggesting
that more as well as more diverse training data may
have a stronger impact.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We showed that, using the transformer model
BERT, we are able to achieve state-of-the-art
performance on the same side stance classification task
(source code: https://github.com/webis-de/SAMESIDE-19/;
official leaderboard: https://sameside.webis.de/leaderboard.html).</p>
      <p>These results have to be taken with a grain of salt,
though, since there is reason to doubt that the same
side task can be reduced to measuring a kind of
textual similarity, as not all nuances of expressing a
stance towards a topic may be captured. An analysis
of topic-specific vocabulary, for instance, may be
required to identify the stance in certain cases.
As the official results differ by over 10%
accuracy from our own evaluation results,
a more thorough separation of training and
evaluation data is required to prevent information leakage
and to account for the artificial nature of the task's
datasets. A more diverse selection of training data
may help the model generalize better and
improve accuracy on unseen topics. We further found
that if a model for semantic similarity generally
performs poorly, the stance classification may not
be good enough to be useful.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          , Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and
          <string-name>
            <given-names>Weiwei</given-names>
            <surname>Guo</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>*SEM 2013 shared task: Semantic textual similarity</article-title>
          .
          <source>In Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          , Volume
          <volume>1</volume>
          :
          <source>Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>43</lpage>
          , Atlanta, Georgia, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Yamen</given-names>
            <surname>Ajjour</surname>
          </string-name>
          , Khalid Al-Khatib, Philipp Cimiano, Roxanne El-Baff, Basil Ell, Henning Wachsmuth, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Same Side Stance Classification: An Overview of the First Shared Task (to appear)</article-title>
          . https://sameside.webis.de.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Yamen</given-names>
            <surname>Ajjour</surname>
          </string-name>
          , Henning Wachsmuth, Johannes Kiesel, Martin Potthast, Matthias Hagen, and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Data Acquisition for Argument Search: The args.me corpus</article-title>
          .
          <source>In 42nd German Conference on Artificial Intelligence (KI</source>
          <year>2019</year>
          ). Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Cer</surname>
          </string-name>
          , Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and
          <string-name>
            <given-names>Lucia</given-names>
            <surname>Specia</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation</article-title>
          .
          <source>In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          , Vancouver, Canada. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>William B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Brockett</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Automatically constructing a corpus of sentential paraphrases</article-title>
          .
          <source>In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Javid</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          , Dejing Dou, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Lowd</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A joint sentiment-target-stance model for stance classification in tweets</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: Technical Papers</source>
          , pages
          <fpage>2656</fpage>
          -
          <lpage>2665</lpage>
          , Osaka, Japan.
          <source>The COLING 2016 Organizing Committee.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Juri</given-names>
            <surname>Ganitkevitch</surname>
          </string-name>
          , Benjamin Van Durme, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>PPDB: The paraphrase database</article-title>
          .
          <source>In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>758</fpage>
          -
          <lpage>764</lpage>
          , Atlanta, Georgia. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          . CoRR, abs/1405.4053.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          , Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and
          <string-name>
            <given-names>Colin</given-names>
            <surname>Cherry</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>SemEval-2016 task 6: Detecting stance in tweets</article-title>
          .
          <source>In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>41</lpage>
          , San Diego, California. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          , Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Parinaz</given-names>
            <surname>Sobhani</surname>
          </string-name>
          , Diana Inkpen, and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Matwin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>From argumentation mining to stance classification</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on Argumentation Mining</source>
          , pages
          <fpage>67</fpage>
          -
          <lpage>77</lpage>
          , Denver, CO. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Dhanya</given-names>
            <surname>Sridhar</surname>
          </string-name>
          , Lise Getoor, and
          <string-name>
            <given-names>Marilyn</given-names>
            <surname>Walker</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Collective stance classification of posts in online debate forums</article-title>
          .
          <source>In Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>117</lpage>
          , Baltimore, Maryland. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Marilyn A.</given-names>
            <surname>Walker</surname>
          </string-name>
          , Pranav Anand, Robert Abbott, and
          <string-name>
            <given-names>Ricky</given-names>
            <surname>Grant</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Stance classification using dialogic properties of persuasion</article-title>
          .
          <source>In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>592</fpage>
          -
          <lpage>596</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Wei</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chris</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT)</article-title>
          .
          <source>In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2015</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          , Denver, Colorado. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>