<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Positional Transformers for Claim Span Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Sullivan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navid Madani</string-name>
          <email>smadani@buffalo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sougata Saha</string-name>
          <email>sougatas@buffalo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rohini Srihari</string-name>
          <email>rohini@buffalo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University at Buffalo</institution>
          ,
          <addr-line>Buffalo, NY 14260</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given the vast amount of misinformation present in today's social media environment, it is critical to be able to identify claims made in social media posts to facilitate the fact-verification process. For this reason, the CLAIMSCAN (Task B) shared task introduces the objective of claim span identification, which requires identifying spans of text within tweets that correspond to (allegedly) factual claims made by users. In this submission to CLAIMSCAN Task B, we introduce the positional transformer architecture for claim span identification. This architecture utilizes a novel, position-sensitive attention mechanism that is able to outperform all other submissions to the shared task, but still falls behind a few of the task organizers' more complex baseline models. In this paper, we discuss the positional transformer architecture, the training and data pre-processing procedures used for CLAIMSCAN Task B, and our results on this task.</p>
      </abstract>
      <kwd-group>
        <kwd>CLAIMSCAN</kwd>
        <kwd>Claim span identification</kwd>
        <kwd>Social media</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Today’s online social media users are inundated with various claims about current (and past)
events, hindering their ability to discern fact from fiction. To combat this deluge of
(mis)information, and better equip users to filter out false claims online, Task B of the CLAIMSCAN
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shared task requires systems to identify spans in tweets corresponding to claims (“assertion[s]
that deserve[...] our attention” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or “argumentative component[s] in which the speaker or
writer conveys the central, contentious conclusion of their argument” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), using the Claim Unit
Recognition in Tweets (CURT) dataset. The ultimate goal of this line of research is to “empower
the fact checkers”; in other words, to facilitate fact-checking in tweets by flagging only those
parts of the text that require verification.
      </p>
      <p>
        In this submission to CLAIMSCAN Task B, we introduce the positional transformer
architecture, a modified variant of the transformer encoder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] architecture that utilizes a
position-sensitive attention mechanism (positional attention). We find that the positional transformer
outperforms all other submissions to this task, with the exception of some of the organizers'
baseline models, one of which employs gated, additive/subtractive attention between the input
text and a series of hand-crafted templates describing the content and/or structure that a given
claim may take. This suggests that, in the absence of the requisite time or resources needed to
hand-craft such templates, the positional transformer may present one of the current optimal
architectures for claim span identification.
      </p>
      <p>Table 1: Example tweets from the CURT dataset.
(1) Some young people still think only older folks die when they get Coronavirus and younger people are
somehow immune to it. Young people have also died with Covid19 and some face health complications
that could last them for their lifetime #COVID19 #COVIDIOTS Stay At Home Save Lives
(2) @micaela_ayye I am only worried about it bc there is no cure/ vaccine. Atm the only the only thing we
can do is treatment of symptoms, so keeping them hydrated and seeing whether or not their bodies kill
it. #coronavirus
(3) just in three tsa officers at sanjose intl airport have tested positive for coronavirus tsa all tsa employees
they have come in contact with them over the past 14 days are quarantined at home sjc airtravel full
statement
(4) How about China pick up the worlds dead bodies that they killed with their bio-weapon #coronavirus</p>
    </sec>
    <sec id="sec-3">
      <title>2. Task and Dataset Description</title>
      <p>The CURT dataset consists of 6044, 756, and 755 tweets in the train, development, and test sets,
respectively. Each tweet is pre-segmented into a list of “tokens”; note that (perhaps obviously)
these “tokens” do not necessarily align with the tokens generated by a given language model’s
tokenizer, and some of the “tokens” in the token lists provided in the dataset may be segmented
into two or more tokens by the LM’s tokenizer. Claim spans are then given as pairs of list
indices indicating the start/end tokens of each span. See Table 1 for examples of tweets from
the dataset and claim spans within them.</p>
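      <p>To make the data format concrete, the following Python sketch shows one way a CURT-style example could be represented and converted to per-token labels. The field names, the example tweet, and the assumption that span indices are inclusive are our own illustrations rather than the dataset's actual schema.</p>
      <preformat><![CDATA[
# Hypothetical representation of a single CURT-style example: a pre-segmented
# token list plus claim spans given as (start, end) token-index pairs.
example = {
    "tokens": ["Young", "people", "have", "also", "died", "with", "Covid19", "#COVID19"],
    "claim_spans": [(0, 6)],  # assumed inclusive end index
}

def span_to_token_labels(tokens, claim_spans):
    """Convert (start, end) span indices into binary per-token labels."""
    labels = [0] * len(tokens)
    for start, end in claim_spans:
        for i in range(start, end + 1):
            labels[i] = 1
    return labels

print(span_to_token_labels(example["tokens"], example["claim_spans"]))
# -> [1, 1, 1, 1, 1, 1, 1, 0]
]]></preformat>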
      <p>
        The vast majority of the tweets in the dataset contain at least one claim span, and some
contain two or more. A small fraction—approximately 0.4%—of the tweets do not contain any
claim spans. On average, approximately 56% of the tokens in each tweet in the CURT dataset
are contained within a claim span. See Sundriyal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for more details on the construction
of and statistics regarding this dataset.
      </p>
      <p>The evaluation metric for this task is the token-level F1 score, averaged over each tweet in
the dataset.</p>
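      <p>A minimal sketch of this metric as we understand it: token-level F1 is computed per tweet over binary in-span/out-of-span labels and then averaged across tweets. The handling of degenerate cases (e.g. tweets with no claim tokens) is an assumption on our part.</p>
      <preformat><![CDATA[
def token_f1(gold, pred):
    """Token-level F1 for a single tweet (binary in-span labels)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # convention for degenerate cases (assumption)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def corpus_score(gold_tweets, pred_tweets):
    """Average the per-tweet token-level F1 over the whole dataset."""
    scores = [token_f1(g, p) for g, p in zip(gold_tweets, pred_tweets)]
    return sum(scores) / len(scores)
]]></preformat>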
    </sec>
    <sec id="sec-4">
      <title>3. Related Work</title>
      <p>
        The organizers of the CLAIMSCAN task introduce the DaBERTa (Description Aware RoBERTa;
see Sundriyal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ]) architecture as a baseline for the claim span identification task. We refer
interested readers to their work for a detailed description of the DaBERTa architecture, but
provide a brief overview below.
      </p>
      <p>
        DaBERTa consists of the RoBERTa-base model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] coupled with a Description Infuser Network (DescNet) classification head. First, the claim
description templates—hand-crafted templates describing the content and/or structure that a given
claim may take—are passed through a pre-trained RoBERTa model to obtain sentence embeddings.
Then, each input tweet is passed through another pre-trained RoBERTa model, and a Compositional
De-Attention (CoDA; [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) block generates a representation of the claim description embeddings, attention-weighted by
the input tweet tokens. The claim description representations, along with the RoBERTa-encoded
tweet text, are then passed to the Interactive Gating Mechanism (IGM): a series of
pointwise-multiplicative gates (cf. output gates in LSTM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] architectures), which aims to capture semantically similar and dissimilar features between the
claim description representations and the tweet text. Finally, the output of the IGM is passed to a
Conditional Random Field (CRF; [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) layer to obtain predicted labels for each of the tokens in the tweet. This entire architecture is
then trained end-to-end. Furthermore, the token target labels are Beginning-Inside-Outside (BIO; [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) encoded: tokens occurring at the beginning of a claim span, tokens within a claim span not
contained within the previous category, and tokens not contained within a claim span each receive
separate respective labels.
      </p>
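      <p>To make the two label schemes concrete, the following sketch contrasts BIO encoding with the binary inside/outside labels used later in this paper; the token sequence and span are purely illustrative.</p>
      <preformat><![CDATA[
tokens = ["Young", "people", "have", "died", "with", "Covid19", "#COVID19"]
claim_spans = [(0, 5)]  # one claim span over tokens 0-5 (inclusive; assumption)

# BIO encoding: B = beginning of a claim span, I = inside, O = outside.
bio = ["O"] * len(tokens)
for start, end in claim_spans:
    bio[start] = "B"
    for i in range(start + 1, end + 1):
        bio[i] = "I"
# bio == ["B", "I", "I", "I", "I", "I", "O"]

# Binary inside/outside encoding (as used in Section 4.1 of this paper).
io = [1 if tag != "O" else 0 for tag in bio]
# io == [1, 1, 1, 1, 1, 1, 0]
]]></preformat>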
      <p>The DaBERTa architecture achieves a token-level F1 score of 0.8604 on the CURT dataset
(see Table 2). This represents the current state-of-the-art for claim span detection, which is a
novel task first introduced in the CLAIMSCAN competition.</p>
      <p>
        Due to its novelty, there (perhaps obviously) does not exist any prior work on claim span
detection, aside from that of Sundriyal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, claim span detection is closely
related to the field of Argument Mining (AM), which involves identifying arguments in text and
relations between them [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As such, we briefly discuss related work from this area.
      </p>
      <p>
        In their work on AM, Chakrabarty et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] utilize a Rhetorical Structure Theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] parser
(for feature extraction) coupled with a fine-tuned BERT model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to identify argument spans
in online discussions. The authors find that this approach yields high recall, but low precision,
on AM tasks. Similarly, Cheng et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] feed pre-trained BERT embeddings to a bidirectional
LSTM [15] classification head to detect reviewers’ objections and authors’ rebuttals in data
gathered from the peer-review process.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4. Approach Description</title>
      <p>In this section, we first outline the positional transformer architecture (Section 4.1). We then
discuss the data preprocessing procedures and training setup/hyperparameters utilized in this
submission to the CLAIMSCAN task (Section 4.2).1</p>
      <p>1All code available on GitHub: https://github.com/mjs227/CLAIMSCAN</p>
      <sec id="sec-5-1">
        <title>4.1. Positional Transformer</title>
        <p>
          As mentioned in Section 1 above, the positional transformer is a modified variant of the
transformer encoder [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] architecture that utilizes a position-sensitive attention mechanism that we
refer to as positional attention. As with the DescNet and BiLSTM components of the approaches
of Sundriyal et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Cheng et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (respectively) discussed in Section 3 above, the
positional transformer is merely a classification head designed for sequence classification, and must
be strapped on top of an embedding model—following Sundriyal et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we use RoBERTa-base
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as the underlying language model for this task.
        </p>
        <p>Given an input text T of length n (in tokens), the positional transformer acts on each token
individually, while taking into account the tokens surrounding it. First, T is passed to the RoBERTa
model to obtain a sequence of embeddings. These embeddings, of dimension d_LM, are then
downsampled via a linear layer to the significantly smaller positional transformer embedding
dimension d_PT, to obtain a sequence of d_PT-dimensional embeddings x′. Then, we construct the
window tensor W(x′) ∈ ℝ^(n × (w_l + w_r + 1) × d_PT), where for each 1 ≤ i ≤ n, the window
W(x′)_i ∈ ℝ^((w_l + w_r + 1) × d_PT) is centered on the i-th token embedding x′_i.</p>
        <p>The window W(x′)_i consists of x′_i along with the w_l and w_r tokens preceding and
following it (Equation 1), where w_l and w_r are hyperparameters denoting the left- and right-hand
window sizes, respectively:</p>
        <p>W(x′)_i = (x″_(i−w_l), …, x″_(i−1), x″_i, x″_(i+1), …, x″_(i+w_r))   (1)</p>
        <p>To account for those tokens x′_i near the beginning and end of the sequence such that
i − w_l &lt; 1 or i + w_r &gt; n, we define x″_j (for all integers j) as in Equation 2 below, where x′_0
and x′_(n+1) denote the BOS and EOS token embeddings, respectively:</p>
        <p>x″_j = x′_0 if j &lt; 1;   x″_j = x′_(n+1) if j &gt; n;   x″_j = x′_j otherwise   (2)</p>
        <p>Thus, the window corresponding to the first lexical token in the sequence, W(x′)_1, consists of
w_l copies of the BOS token embedding (x′_0) concatenated with x′_(1 : w_r+1) (i.e. x′_1, along with
the following w_r token embeddings in x′). On the other hand, the window corresponding to the last
lexical token in the sequence, W(x′)_n, consists of x′_(n−w_l : n) (i.e. x′_n, along with the w_l
preceding token embeddings in x′) concatenated with w_r copies of the EOS token embedding
(x′_(n+1)).</p>
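        <p>The windowing step (Equations 1 and 2) can be sketched as follows, assuming a PyTorch implementation; the function and variable names are illustrative rather than taken from the released code.</p>
        <preformat><![CDATA[
import torch

def build_windows(embeddings, w_left, w_right):
    """Build per-token context windows with BOS/EOS padding (Equations 1-2).

    `embeddings` has shape (n + 2, d): the downsampled RoBERTa embeddings for
    <s>, the n lexical tokens, and </s>. Returns (n, w_left + w_right + 1, d).
    """
    n = embeddings.size(0) - 2
    bos, eos = embeddings[0], embeddings[-1]
    windows = []
    for i in range(1, n + 1):          # positions of the lexical tokens
        row = []
        for j in range(i - w_left, i + w_right + 1):
            if j < 1:
                row.append(bos)        # pad with copies of the BOS embedding
            elif j > n:
                row.append(eos)        # pad with copies of the EOS embedding
            else:
                row.append(embeddings[j])
        windows.append(torch.stack(row))
    return torch.stack(windows)
]]></preformat>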
        <p>
          Then, each window W(x′)_i is passed to the positional transformer itself. The positional
transformer consists of N positional transformer layers {L_k} (1 ≤ k ≤ N), along with a single
positional transformer final layer L_F. Each of the non-final layers is architecturally identical (see
Figure 1), and consists of w_l + w_r + 1 linear layers {F_k^j} (−w_l ≤ j ≤ w_r), along with the
positional attention block. Positional attention is similar to dot-product attention as in Vaswani et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], but restricted to each window W_i = W(x′)_i and focused solely on the center of W_i (i.e. the
i-th token x_i). As such, there are a few critical differences between “classical” dot-product attention
and positional attention; namely, there is only one query vector (corresponding to the center x_i),
and a unique key projection matrix for each of the w_l + w_r + 1 positions in the input window.
        </p>
        <p>−1 )0 is the center (i.e. corresponds to the token   ) of the window   −1 ;
−1 )1 belong to the left- and right-hand contexts of   −1 (respectively), and
correspond to the tokens immediately to the left/right of   .</p>
        <p>The motivation behind the separate feed-forward layers and key projection matrices in each
positional transformer layer is to model the unique contributions that each position in the input
window makes towards the prediction of the label for the center x_i. For example, if a token to
the left of x_i belongs to a claim span and represents the beginning of a clause, that may increase
the likelihood that x_i belongs to a claim span. However, if such a token occurs to the right of x_i,
that may decrease the likelihood that x_i belongs to a claim span. This is conceptually similar to
disentangled attention (e.g. DeBERTa; [16]), but, rather than implementing disentanglement via
separate embeddings for position and content, the positional transformer uses separate matrices
for each position.</p>
        <p>For each 1 ≤ i ≤ n and each 0 ≤ k ≤ N, let W_i^k denote the output of the k-th non-final layer
applied to the i-th window—i.e. W_i^0 = W(x′)_i, and for all 1 ≤ k′ ≤ N, W_i^(k′) = L_(k′)(W_i^(k′−1)).
Now, let (W_i^k)_0 denote the embedding corresponding to x_i (the center of W_i^k), and for each
1 ≤ j_l ≤ w_l and 1 ≤ j_r ≤ w_r, let (W_i^k)_(−j_l) and (W_i^k)_(j_r) denote the token embeddings
j_l positions to the left and j_r positions to the right of the center (W_i^k)_0, respectively.</p>
        <p>
          Within the positional attention block of the k-th layer, there are w_l left-hand key-projection
(linear) layers {K_k^j} (−w_l ≤ j &lt; 0), w_r right-hand key-projection layers {K_k^j} (0 &lt; j ≤ w_r),
and a single central key-projection layer K_k^0, along with a single query-projection layer Q_k. For
each embedding (W_i^(k−1))_(−j) in the left-hand window, its corresponding key vector is defined to
be K_k^(−j)((W_i^(k−1))_(−j)). Similarly, for each embedding (W_i^(k−1))_j in the right-hand
window, its corresponding key vector is defined to be K_k^j((W_i^(k−1))_j). The central input,
(W_i^(k−1))_0, corresponds to the key vector K_k^0((W_i^(k−1))_0) and the query vector
Q_k((W_i^(k−1))_0). Finally, we compute the attention weights α_i by taking the softmax of the dot
products of each key vector with the single query vector (Equation 3):
        </p>
        <p>(α′_i)_j = K_k^j((W_i^(k−1))_j) ⋅ Q_k((W_i^(k−1))_0), for all −w_l ≤ j ≤ w_r   (3a)</p>
        <p>α_i = softmax(α′_i)   (3b)</p>
        <p>
          Due to the low values of w_l and w_r used for this task, we do not normalize the attention
values by dividing by the square root of the key dimension as in Vaswani et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We then obtain an attention-weighted representation c_i for the central input, as in “classical”
dot-product attention—i.e. as the (ℓ2-normalized) attention-weighted sum of each embedding within
the window (see Equation 4):
        </p>
        <p>c_i = ℓ2( Σ_(j = −w_l)^(w_r) (α_i)_j (W_i^(k−1))_j )   (4)</p>
        <p>The central input c_i is then passed through the central feed-forward (linear) layer F_k^0, an
additive skip connection, and a second ℓ2-normalization. For all −w_l ≤ j ≤ w_r such that j ≠ 0,
(W_i^(k−1))_j is passed through the (unique) j-th feed-forward (linear) layer F_k^j and
ℓ2-normalized (see Equation 5). The resulting window W_i^k is then passed on to the next layer
L_(k+1):</p>
        <p>(W_i^k)_j = ℓ2(F_k^0(c_i) + c_i) if j = 0;   (W_i^k)_j = ℓ2(F_k^j((W_i^(k−1))_j)) otherwise   (5)</p>
        <p>After passing through the N non-final layers, the resulting window W_i^N is then passed to the
final layer L_F, which consists solely of a positional attention block. The output of this attention
block—an attention-weighted representation of the tokens surrounding the central input x_i—is
the output of the positional transformer.</p>
        <p>
          Unlike Sundriyal et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we do not employ BIO encoding for the token labels, as we found that it decreased
performance (F1 score) on the development set. Rather, we utilize a simpler, binary
(“inside/outside”) labeling scheme. As such, the d_PT-dimensional output of the positional
transformer with respect to the input x_i is passed to a d_PT × 1 sigmoid layer to obtain a predicted
label for the i-th token.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Data Preprocessing and Training</title>
        <p>For the claim span detection task on the CURT dataset, we utilized a three-layer positional
transformer (four layers total, including the final layer) with window sizes w_l = 5 and w_r = 7. The
entire RoBERTa → Positional Transformer pipeline was trained end-to-end using token-level
binary cross-entropy loss and the Adam [17] optimizer, with learning rates of 10^−4 and 10^−5 for
RoBERTa and the positional transformer, respectively. The model was trained for 15 epochs,
with early stopping if development set performance did not increase after five epochs.</p>
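        <p>The two learning rates can be realized with optimizer parameter groups; the sketch below assumes PyTorch, with illustrative stand-in modules for the two trained components.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

# Illustrative stand-ins for the two trained components (names are assumptions).
roberta_encoder = nn.Linear(768, 64)          # stands in for RoBERTa-base
positional_transformer = nn.Linear(64, 1)     # stands in for the classification head

# Separate learning rates for the encoder and the head, as described above.
optimizer = torch.optim.Adam([
    {"params": roberta_encoder.parameters(), "lr": 1e-4},
    {"params": positional_transformer.parameters(), "lr": 1e-5},
])
loss_fn = nn.BCELoss()  # token-level binary cross-entropy
]]></preformat>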
        <p>Recall from the discussion in Section 2 that the input texts are pre-segmented into a list of
“tokens” (hereafter segments) in the CURT dataset, but that these segments do not necessarily
align with the tokens generated by the RoBERTa tokenizer. As such, for each input text, we
apply the tokenizer to each input segment and concatenate the resulting tokens to pass as
input to the model. For training, given a segment and its label, each token within that segment
receives the segment's label. For inference, we pool over the predicted labels for each token within
the segment; we tested max-, min-, and mean-pooling, and found that mean-pooling yielded the
highest F1 scores on the development set.</p>
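        <p>The segment-to-token alignment and the inference-time pooling can be sketched as follows, assuming the Hugging Face tokenizer for RoBERTa; the helper names and the 0.5 decision threshold are our own assumptions.</p>
        <preformat><![CDATA[
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def segments_to_tokens(segments, segment_labels):
    """Tokenize each pre-segmented 'token' and copy its label to every subword.

    (Whitespace handling for RoBERTa's BPE is glossed over in this sketch.)
    """
    input_ids, token_labels, owner = [], [], []
    for seg_idx, (seg, label) in enumerate(zip(segments, segment_labels)):
        ids = tokenizer(seg, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        token_labels.extend([label] * len(ids))
        owner.extend([seg_idx] * len(ids))  # which segment each subword came from
    return input_ids, token_labels, owner

def pool_predictions(probs, owner, num_segments, threshold=0.5):
    """Mean-pool subword probabilities back to segment-level labels at inference."""
    sums, counts = [0.0] * num_segments, [0] * num_segments
    for p, seg_idx in zip(probs, owner):
        sums[seg_idx] += p
        counts[seg_idx] += 1
    return [int(sums[i] / counts[i] >= threshold) for i in range(num_segments)]
]]></preformat>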
        <p>Finally, we augmented the data by POS-tagging each word in the input texts, using the NLTK
UnigramTagger 2 trained on the Brown [18] corpus. While a more sophisticated POS-tagging
model likely would have yielded better results, the pre-segmented nature of the input texts
made applying a POS tagger with beyond-unigram complexity exceedingly difficult.</p>
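        <p>For reference, a unigram tagger of this kind can be trained as in the following sketch; the example word list is illustrative, and handling of words unseen in Brown (which receive no tag by default) is left unspecified here.</p>
        <preformat><![CDATA[
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger

nltk.download("brown")  # one-time download of the Brown corpus

# Train a unigram POS tagger on the Brown corpus, then tag each pre-segmented
# word independently (words unseen in Brown receive a None tag by default).
tagger = UnigramTagger(brown.tagged_sents())
print(tagger.tag(["young", "people", "have", "died"]))
]]></preformat>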
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Discussion</title>
      <p>The positional transformer achieves an F1 score of 0.8344: the highest-performing submission
to CLAIMSCAN Task B, and fifth-best when compared against the organizers' baseline models (see
Table 2). Unfortunately, due to issues with the submission portal, participants were unable to view
their scores after submission; this made optimizing performance with respect to the test set
somewhat problematic. In particular, it was difficult to ascertain whether our model was
“overfitting” the development set, in the sense that we were not able to optimize our model/training
hyperparameters with respect to the test set. We believe that the positional transformer could
have achieved a higher F1 score on this task had that not been the case (as we observed F1
scores above 0.85 on the development set), and would have liked to perform ablation studies to
this effect. Unfortunately, the test set labels for the task have yet to be released at this time.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this paper, we introduced the positional transformer token classification architecture for claim
span identification in the CLAIMSCAN (Task B) shared task. We found that this architecture
outperforms all other submissions to this task, but falls short of the performance of some of the
task organizers’ baseline models. Given certain limitations regarding the submission format
of this task, however, we remain optimistic about the utility of the positional transformer for
claim span identification. In the immediate future, we aim to conduct ablation studies on this
model to identify the optimal hyperparameter configuration for this task. Additionally, we
hope to evaluate the positional transformer on other sequence classification tasks, given the
task-agnostic nature of this architecture.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulastya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Empowering the factcheckers! automatic identification of claim spans on Twitter</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>7701</fpage>
          -
          <lpage>7715</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.525. doi:10.18653/v1/2022.emnlp-main.525.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Toulmin</surname>
          </string-name>
          , The uses of argument, Cambridge university press,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stab</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Parsing argumentation structures in persuasive essays</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>43</volume>
          (
          <year>2017</year>
          )
          <fpage>619</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (2019).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hui</surname>
          </string-name>
          , Compositional de-attention networks,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation 9</source>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          , in: ICML,
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ramshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <source>Text Chunking Using Transformation-Based Learning</source>
          , Springer Netherlands, Dordrecht,
          <year>1999</year>
          . URL: https://doi.org/10.1007/978-94-017-2390-9_10. doi:10.1007/978-94-017-2390-9_10.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <article-title>Five years of argument mining: A data-driven analysis</article-title>
          .,
          <source>in: IJCAI</source>
          , volume
          <volume>18</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>5427</fpage>
          -
          <lpage>5433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hidey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McKeown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , AMPERSAND:
          <article-title>Argument mining for PERSuAsive oNline discussions</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>2933</fpage>
          -
          <lpage>2943</lpage>
          . URL: https://aclanthology.org/D19-1291. doi:10.18653/v1/D19-1291.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taboada</surname>
          </string-name>
          , W. C.
          <article-title>Mann, Applications of rhetorical structure theory</article-title>
          ,
          <source>Discourse studies 8</source>
          (
          <year>2006</year>
          )
          <fpage>567</fpage>
          -
          <lpage>588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] L. Cheng, L. Bing,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          , L. Si, APE:
          <article-title>Argument pair extraction from peer review and rebuttal via multi-task learning</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Online, 2020, pp. 7000–7011. URL: https://aclanthology.org/2020.emnlp-main.569. doi:10.18653/v1/2020.emnlp-main.569.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 4, IEEE, 2005, pp. 2047–2052.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W. N. Francis, H. Kucera, Brown corpus manual, Letters to the Editor 5 (1979) 7.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>