=Paper=
{{Paper
|id=Vol-3681/T3-3
|storemode=property
|title=Positional Transformers for Claim Span Identification
|pdfUrl=https://ceur-ws.org/Vol-3681/T3-3.pdf
|volume=Vol-3681
|authors=Michael Sullivan,Navid Madani,Sougata Saha,Rohini Srihari
|dblpUrl=https://dblp.org/rec/conf/fire/SullivanMSS23
}}
==Positional Transformers for Claim Span Identification==
Michael Sullivan, Navid Madani, Sougata Saha and Rohini Srihari
University at Buffalo, Buffalo, NY 14260, United States

Abstract: Given the vast amount of misinformation present in today's social media environment, it is critical to be able to identify claims made in social media posts in order to facilitate the fact-verification process. For this reason, the CLAIMSCAN (Task B) shared task introduces the objective of claim span identification, which requires identifying spans of text within tweets that correspond to (allegedly) factual claims made by users. In this submission to CLAIMSCAN Task B, we introduce the positional transformer architecture for claim span identification. This architecture utilizes a novel, position-sensitive attention mechanism, and outperforms all other submissions to the shared task, but still falls behind a few of the task organizers' more complex baseline models. In this paper, we discuss the positional transformer architecture, the training and data pre-processing procedures used for CLAIMSCAN Task B, and our results on this task.

Keywords: CLAIMSCAN, Claim span identification, Social media, Transformers

1. Introduction

Today's online social media users are inundated with various claims about current (and past) events, hindering their ability to discern fact from fiction. To combat this deluge of (mis-)information, and to better equip users to filter out false claims online, Task B of the CLAIMSCAN [1] shared task requires systems to identify spans in tweets corresponding to claims ("assertion[s] that deserve[...] our attention" [2], or "argumentative component[s] in which the speaker or writer conveys the central, contentious conclusion of their argument" [3]), using the Claim Unit Recognition in Tweets (CURT) dataset. The ultimate goal of this line of research is to "empower the fact checkers"; in other words, to facilitate fact-checking in tweets by flagging only those parts of the text that require verification.

In this submission to CLAIMSCAN Task B, we introduce the positional transformer architecture, a modified variant of the transformer encoder [4] architecture that utilizes a position-sensitive attention mechanism (positional attention). We find that the positional transformer outperforms all other submissions to this task, with the exception of some of the organizers' baseline models, one of which employs gated, additive/subtractive attention between the input text and a series of hand-crafted templates describing the content and/or structure that a given claim may take. This suggests that, in the absence of the requisite time or resources needed to hand-craft such templates, the positional transformer may present one of the current optimal architectures for claim span identification.

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
mjs227@buffalo.edu (M. Sullivan); smadani@buffalo.edu (N. Madani); sougatas@buffalo.edu (S. Saha); rohini@buffalo.edu (R. Srihari)
https://www.acsu.buffalo.edu/~mjs227 (M. Sullivan); https://sougata-ub.github.io/ (S. Saha); https://www.acsu.buffalo.edu/~rohini/ (R. Srihari)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Table 1: Examples of tweets from the CURT dataset. Claim spans are underlined.
- Some young people still think only older folks die when they get Coronavirus and younger people are somehow immune to it. Young people have also died with Covid19 and some face health complications that could last them for their lifetime #COVID19 #COVIDIOTS Stay At Home Save Lives
- @micaela_ayye I am only worried about it bc there is no cure/ vaccine. Atm the only thing we can do is treatment of symptoms, so keeping them hydrated and seeing whether or not their bodies kill it. #coronavirus
- just in three tsa officers at sanjose intl airport have tested positive for coronavirus tsa all tsa employees they have come in contact with them over the past 14 days are quarantined at home sjc airtravel full statement
- How about China pick up the worlds dead bodies that they killed with their bio-weapon #coronavirus

2. Task and Dataset Description

The CURT dataset consists of 6044, 756, and 755 tweets in the train, development, and test sets, respectively. Each tweet is pre-segmented into a list of "tokens"; note that (perhaps obviously) these "tokens" do not necessarily align with the tokens generated by a given language model's tokenizer, and some of the "tokens" in the token lists provided in the dataset may be segmented into two or more tokens by the LM's tokenizer. Claim spans are then given as pairs of list indices indicating the start/end tokens of each span. See Table 1 for examples of tweets from the dataset and claim spans within them.

The vast majority of the tweets in the dataset contain at least one claim span, and some contain two or more; a small fraction (approximately 0.4%) of the tweets do not contain any claim spans. On average, approximately 56% of the tokens in each tweet in the CURT dataset are contained within a claim span. See Sundriyal et al. [1] for more details on the construction of and statistics regarding this dataset. The evaluation metric for this task is the token-level F1 score, averaged over each tweet in the dataset.

3. Related Work

The organizers of the CLAIMSCAN task introduce the DaBERTa (Description Aware RoBERTa; see Sundriyal et al. [1]) architecture as a baseline for the claim span identification task. We refer interested readers to their work for a detailed description of the DaBERTa architecture, but provide a brief overview below.

DaBERTa consists of the RoBERTa-base model [5] coupled with a Description Infuser Network (DescNet) classification head. First, the claim description templates d_i (hand-crafted templates describing the content and/or structure that a given claim may take) are passed through a pre-trained RoBERTa model to obtain sentence embeddings d'_i. Then, each input tweet t_j is passed through another pre-trained RoBERTa model, and a Compositional De-Attention (CoDA; [6]) block generates a representation z_j of the claim description embeddings d'_i, attention-weighted by the input tweet tokens t_{j,k}. The claim description representations z_j, along with the RoBERTa-encoded tweet text t_j, are then passed to the Interactive Gating Mechanism (IGM), a series of pointwise-multiplicative gates (cf. output gates in LSTM [7] architectures) that aims to capture semantically similar and dissimilar features between z_j and t_j. Finally, the output of the IGM is passed to a Conditional Random Field (CRF; [8]) layer to obtain predicted labels for each of the tokens in t_j. This entire architecture is then trained end-to-end; a rough sketch of the pointwise-multiplicative gating idea is given below.
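To make the gating idea more concrete, the following is a minimal, illustrative sketch of a pointwise-multiplicative gate between a tweet encoding and a claim-description representation. It is not DaBERTa's actual IGM: the gate structure, the fusion, and all dimensions here are our assumptions, and we refer readers to Sundriyal et al. [1] for the real architecture.

```python
import torch
import torch.nn as nn

class PointwiseGate(nn.Module):
    """Illustrative pointwise-multiplicative gate (not the actual IGM)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # gate conditioned on both inputs
        self.proj = nn.Linear(dim, dim)      # candidate features from z_j

    def forward(self, t_j: torch.Tensor, z_j: torch.Tensor) -> torch.Tensor:
        # t_j, z_j: (seq_len, dim). A sigmoid gate decides, per dimension, how
        # much of the (projected) description representation to mix into the
        # tweet encoding; the multiplication is pointwise, the fusion additive.
        g = torch.sigmoid(self.gate(torch.cat([t_j, z_j], dim=-1)))
        return t_j + g * torch.tanh(self.proj(z_j))
```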
In the DaBERTa setup, the token target labels are additionally Beginning-Inside-Outside (BIO; [9]) encoded: tokens occurring at the beginning of a claim span, tokens inside a claim span other than the initial token, and tokens not contained within any claim span each receive separate labels. The DaBERTa architecture achieves a token-level F1 score of 0.8604 on the CURT dataset (see Table 2). This represents the current state-of-the-art for claim span detection, which is a novel task first introduced in the CLAIMSCAN competition. Due to its novelty, there (perhaps obviously) does not exist any prior work on claim span detection, aside from that of Sundriyal et al. [1]. However, claim span detection is closely related to the field of Argument Mining (AM), which involves identifying arguments in text and relations between them [10]. As such, we briefly discuss related work from this area.

In their work on AM, Chakrabarty et al. [11] utilize a Rhetorical Structure Theory [12] parser (for feature extraction) coupled with a fine-tuned BERT model [13] to identify argument spans in online discussions. The authors find that this approach yields high recall, but low precision, on AM tasks. Similarly, Cheng et al. [14] feed pre-trained BERT embeddings to a bidirectional LSTM [15] classification head to detect reviewers' objections and authors' rebuttals in data gathered from the peer-review process.

4. Approach Description

In this section, we first outline the positional transformer architecture (Section 4.1). We then discuss the data preprocessing procedures and training setup/hyperparameters utilized in this submission to the CLAIMSCAN task (Section 4.2).¹

¹ All code available on GitHub: https://github.com/mjs227/CLAIMSCAN

4.1. Positional Transformer

As mentioned in Section 1 above, the positional transformer is a modified variant of the transformer encoder [4] architecture that utilizes a position-sensitive attention mechanism that we refer to as positional attention. As with the DescNet and BiLSTM components of the approaches of Sundriyal et al. [1] and Cheng et al. [14] (respectively) discussed in Section 3 above, the positional transformer is merely a classification head designed for sequence classification, and must be placed on top of an embedding model; following Sundriyal et al. [1], we use RoBERTa-base [5] as the underlying language model for this task.

Given an input text t of length N (in tokens), the positional transformer acts on each token individually, while taking into account the tokens surrounding it. First, t is passed to the RoBERTa model to obtain a sequence of embeddings. This sequence of d_RoBERTa-dimensional embeddings is then downsampled via a linear layer to the significantly smaller positional transformer embedding dimension (d_PT), to obtain a sequence of d_PT-dimensional embeddings t'. Then, we construct the window tensor W(t') ∈ R^(N × (S_L^W + S_R^W + 1) × d_PT), where for each 1 ≤ i ≤ N, W(t')_i ∈ R^((S_L^W + S_R^W + 1) × d_PT) is centered on the i-th token embedding t'_i. The window W(t')_i consists of the S_L^W and S_R^W tokens preceding and following t'_i, where S_L^W and S_R^W are hyperparameters denoting the left- and right-hand window sizes, respectively:

W(t')_i = Concat(t''_{i−S_L^W}, ..., t''_{i−1}, t''_i, t''_{i+1}, ..., t''_{i+S_R^W})    (1)

where t''_k, which extends t' to indices outside of the range 1, ..., N, is defined in Equation 2 below. A rough sketch of this window construction is given below.
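The following is a minimal sketch of the window construction under our reading of Equations 1 and 2; it is not the authors' released code, and the function and variable names are ours.

```python
import torch

def build_windows(t_prime: torch.Tensor, bos: torch.Tensor, eos: torch.Tensor,
                  s_left: int, s_right: int) -> torch.Tensor:
    """Stack a (s_left + s_right + 1)-sized window around every token position.

    t_prime: (N, d_pt) downsampled token embeddings t'_1, ..., t'_N
    bos, eos: (d_pt,) embeddings of the BOS and EOS tokens (t'_0 and t'_{N+1})
    returns: (N, s_left + s_right + 1, d_pt), i.e. W(t') as in Equation 1
    """
    n = t_prime.size(0)
    # Out-of-range positions are mapped to the BOS/EOS embeddings (Equation 2),
    # which amounts to padding t' with copies of them on either side.
    padded = torch.cat([bos.expand(s_left, -1), t_prime, eos.expand(s_right, -1)], dim=0)
    # Window i is centered on t'_i and spans positions i - s_left .. i + s_right.
    return torch.stack([padded[i:i + s_left + s_right + 1] for i in range(n)], dim=0)
```

With S_L^W = S_R^W = 7 and d_PT = 5 (the values used in Section 4.2), each token is thus represented by a 15 × 5 window.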
To account for those tokens t'_i near the beginning and end of the sequence, i.e. those for which i − S_L^W < 1 or i + S_R^W > N, we define t''_k for all integers k as follows:

t''_k = t'_0 if k < 1;  t''_k = t'_{N+1} if k > N;  t''_k = t'_k otherwise    (2)

where t'_0 and t'_{N+1} denote the embeddings of the BOS and EOS tokens, respectively. For example, the window corresponding to the first lexical (i.e. non-BOS/EOS) token in the sequence, W(t')_1, consists of S_L^W copies of the BOS token embedding (t'_0) concatenated with t'_{1:S_R^W+1} (t'_1, along with the following S_R^W token embeddings in t'). On the other hand, the window corresponding to the last lexical token in the sequence, W(t')_N, consists of t'_{N−S_L^W:N} (t'_N, along with the S_L^W preceding token embeddings in t') concatenated with S_R^W copies of the EOS token embedding (t'_{N+1}).

Each window W(t')_i is then passed to the positional transformer itself. The positional transformer consists of M positional transformer layers {L_k}_{1 ≤ k ≤ M}, along with a single positional transformer final layer L_F. Each of the non-final layers is architecturally identical (see Figure 1), and consists of S_L^W + S_R^W + 1 linear layers {FF_j^k}_{−S_L^W ≤ j ≤ S_R^W}, along with the positional attention block. Positional attention is similar to dot-product attention as in Vaswani et al. [4], but restricted to each window W_i (writing W_i for W(t')_i) and focused solely on the center of W_i (i.e. the i-th token t_i). As such, there are a few critical differences between "classical" dot-product attention and positional attention; namely, there is only one query vector (corresponding to the center t_i), and a unique key projection matrix for each of the S_L^W + S_R^W + 1 positions in the input window.

Figure 1: Architecture of a (non-final) positional transformer layer. In this example, S_L^W = S_R^W = 1 for readability. The input x_0 = (W_i^{k−1})_0 is the center (i.e. corresponds to the token t_i) of the window W_i^{k−1}; x_{−1} = (W_i^{k−1})_{−1} and x_1 = (W_i^{k−1})_1 belong to the left- and right-hand contexts of W_i^{k−1} (respectively), and correspond to the tokens immediately to the left/right of t_i.

The motivation behind the separate feed-forward layers and key projection matrices in each positional transformer layer is to model the unique contributions that each position in the input window makes towards the prediction of the label for the center t_i. For example, if a token to the left of t_i belongs to a claim span and represents the beginning of a clause, that may increase the likelihood that t_i belongs to a claim span. However, if such a token occurs to the right of t_i, it may instead decrease the likelihood that t_i belongs to a claim span. This is conceptually similar to disentangled attention (e.g. DeBERTa; [16]), but rather than implementing disentanglement via separate embeddings for position and content, the positional transformer uses separate matrices for each position.

For each 1 ≤ i ≤ N and each 0 ≤ k ≤ M, let W_i^k denote the representation of the i-th window after the first k non-final layers; i.e., W_i^0 = W_i, and for all 1 ≤ k' ≤ M, W_i^{k'} = L_{k'}(W_i^{k'−1}). Now, let (W_i^k)_0 denote the embedding corresponding to t_i (the center of W_i), and for each 1 ≤ j_L ≤ S_L^W and 1 ≤ j_R ≤ S_R^W, let (W_i^k)_{−j_L} and (W_i^k)_{j_R} denote the token embeddings j_L positions to the left and j_R positions to the right of the center (W_i^k)_0, respectively.

Within the positional attention block of the k-th layer, there are S_L^W left-hand key-projection (linear) layers {K_j^k}_{−S_L^W ≤ j < 0}, S_R^W right-hand key-projection layers {K_j^k}_{0 < j ≤ S_R^W}, and a single central key-projection layer K_0^k, along with a single query-projection layer Q^k. A rough sketch of this attention block is given below; Equations 3 and 4 then state the computation precisely.
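As a concrete illustration, a single positional-attention block might be implemented roughly as follows. This is a sketch under our reading of the description above and of Equations 3 and 4 below; it omits the per-position feed-forward sublayers, the skip connection, and batching, and it is not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalAttention(nn.Module):
    """Sketch of one positional-attention block: a unique key projection for
    every window position, and a single query taken from the window center."""

    def __init__(self, d_pt: int, s_left: int, s_right: int):
        super().__init__()
        self.center = s_left  # index of the central token t_i within the window
        self.keys = nn.ModuleList(
            nn.Linear(d_pt, d_pt) for _ in range(s_left + s_right + 1)
        )
        self.query = nn.Linear(d_pt, d_pt)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (s_left + s_right + 1, d_pt), i.e. one window W_i
        q = self.query(window[self.center])                               # single query vector
        k = torch.stack([proj(x) for proj, x in zip(self.keys, window)])  # one key per position
        attn = torch.softmax(k @ q, dim=0)   # Equation 3; no 1/sqrt(d) scaling
        alpha = attn @ window                # Equation 4: weighted sum of window embeddings
        return F.normalize(alpha, dim=-1)    # L2-normalization
```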
For each embedding (W_i^{k−1})_{−j_L} in the left-hand window, its corresponding key vector is defined to be K^k_{−j_L}((W_i^{k−1})_{−j_L}). Similarly, for each embedding (W_i^{k−1})_{j_R} in the right-hand window, its corresponding key vector is defined to be K^k_{j_R}((W_i^{k−1})_{j_R}). The central input, (W_i^{k−1})_0, corresponds to the key vector K_0^k((W_i^{k−1})_0) and the query vector Q^k((W_i^{k−1})_0). Finally, we compute the attention weights A^k by taking the softmax of the dot products of each key vector with the single query vector:

(A')^k_j = K_j^k((W_i^{k−1})_j) · Q^k((W_i^{k−1})_0)   for all −S_L^W ≤ j ≤ S_R^W    (3a)
A^k = softmax((A')^k)    (3b)

Due to the low values of d_PT used for this task, we do not normalize the attention values by dividing by the square root of the key dimension as in Vaswani et al. [4]. We then obtain an attention-weighted representation of the central input, α^k, as in "classical" dot-product attention, i.e. as the attention-weighted sum of each embedding within the window:

α^k = Norm_L2( Σ_{j=−S_L^W}^{S_R^W} A^k_j (W_i^{k−1})_j )    (4)

The attention output α^k is then passed through the central feed-forward (linear) layer FF_0^k, an additive skip connection, and a second L2-normalization, before being outputted by the layer L_k. For all −S_L^W ≤ j ≤ S_R^W such that j ≠ 0, (W_i^{k−1})_j is passed through the (unique) j-th feed-forward (linear) layer FF_j^k and L2-normalized:

(W_i^k)_j = Norm_L2(FF_0^k(α^k)) if j = 0;  (W_i^k)_j = Norm_L2(FF_j^k((W_i^{k−1})_j)) otherwise    (5)

After passing through the M non-final layers, the resulting window representation W_i^M is passed to the final layer L_F, which consists solely of a positional attention block. The output of this attention block (an attention-weighted representation of the tokens surrounding the central input t_i) is the output of the positional transformer.

Unlike Sundriyal et al. [1], we do not employ BIO encoding for the token labels, as we found that it decreased performance (F1 score) on the development set. Rather, we utilize a simpler, binary ("inside/outside") labeling scheme. As such, the d_PT-dimensional output of the positional transformer with respect to the input W_i is passed to a d_PT × 1 sigmoid layer to obtain a predicted label for the i-th token.

4.2. Data Preprocessing and Training

For the claim span detection task on the CURT dataset, we utilized a three-layer positional transformer (four layers total, including the final layer) with d_PT = 5 and S_L^W = S_R^W = 7. The entire RoBERTa → positional transformer pipeline was trained end-to-end using token-level binary cross-entropy loss and the Adam [17] optimizer, with learning rates of 10^−4 and 10^−5 for RoBERTa and the positional transformer, respectively. The model was trained for 15 epochs, with early stopping if development set performance did not increase after five epochs.

Recall from the discussion in Section 2 that the input texts in the CURT dataset are pre-segmented into a list of "tokens" (hereafter segments), but that these segments do not necessarily align with the tokens generated by the RoBERTa tokenizer. As such, for each input text t, we apply the tokenizer to each input segment t_i in t and concatenate the resulting tokens to pass as input to the model. For training, given a segment t_i with label l_i, each token within t_i receives the label l_i; a rough sketch of this alignment step is given below.
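As an illustration of this alignment step, the sketch below tokenizes each CURT segment with the Hugging Face roberta-base tokenizer and repeats the segment's binary label over the resulting sub-tokens. The use of the Hugging Face tokenizer and the leading-space handling are our assumptions rather than details reported above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def align_segments(segments: list[str], labels: list[int]):
    """Tokenize each pre-segmented CURT 'token' (segment) and propagate its
    binary claim label to every sub-token produced by the tokenizer."""
    input_ids, token_labels = [], []
    for seg, lab in zip(segments, labels):
        # Leading space so RoBERTa's BPE treats each segment as word-initial
        # (an assumption about the preprocessing, not a detail from the paper).
        ids = tokenizer(" " + seg, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        token_labels.extend([lab] * len(ids))  # every sub-token inherits l_i
    return input_ids, token_labels

# e.g. align_segments(["Young", "people", "have", "also", "died"], [1, 1, 1, 1, 1])
```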
For inference, we pool over the predicted labels for each token within the segment; we tested max-, min-, and mean-pooling, and found that mean-pooling yielded the highest F1 scores on the development set. Finally, we augmented the data by POS-tagging each word in the input texts, using the NLTK UnigramTagger² trained on the Brown [18] corpus. While a more sophisticated POS-tagging model likely would have yielded better results, the pre-segmented nature of the input texts made applying a POS tagger with beyond-unigram complexity exceedingly difficult.

² https://www.nltk.org/api/nltk.tag.html

5. Results and Discussion

The positional transformer achieves an F1 score of 0.8344: the highest-performing submission to CLAIMSCAN Task B, but the fifth-best result overall once the organizers' baseline models are included (see Table 2).

Table 2: Experimental results of the positional transformer (PT) and baseline architectures on the CURT dataset (token-level F1).
- PT: 0.8344
- DaBERTa: 0.8604
- BERT+CRF: 0.8368
- SpanBERT+CRF: 0.8390
- RoBERTa+CRF: 0.8457

Unfortunately, due to issues with the submission portal, participants were unable to view their scores after submission; this made optimizing performance with respect to the test set somewhat problematic. In particular, it was difficult to ascertain whether our model was "over-fitting" the development set, since we were not able to optimize our model/training hyperparameters with respect to the test set. We believe that the positional transformer could have achieved a higher F1 score on this task had that not been the case (as we observed F1 scores above 0.85 on the development set), and would have liked to perform ablation studies to this effect. Unfortunately, the test set labels for the task have yet to be released at the time of writing.

6. Conclusion

In this paper, we introduced the positional transformer token classification architecture for claim span identification in the CLAIMSCAN (Task B) shared task. We found that this architecture outperforms all other submissions to this task, but falls short of the performance of some of the task organizers' baseline models. Given certain limitations regarding the submission format of this task, however, we remain optimistic about the utility of the positional transformer for claim span identification. In the immediate future, we aim to conduct ablation studies on this model to identify the optimal hyperparameter configuration for this task. Additionally, we hope to evaluate the positional transformer on other sequence classification tasks, given the task-agnostic nature of this architecture.

References

[1] M. Sundriyal, A. Kulkarni, V. Pulastya, M. S. Akhtar, T. Chakraborty, Empowering the fact-checkers! Automatic identification of claim spans on Twitter, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 7701–7715. URL: https://aclanthology.org/2022.emnlp-main.525. doi:10.18653/v1/2022.emnlp-main.525.
[2] S. E. Toulmin, The Uses of Argument, Cambridge University Press, 2003.
[3] C. Stab, I. Gurevych, Parsing argumentation structures in persuasive essays, Computational Linguistics 43 (2017) 619–659.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[6] Y. Tay, A. T. Luu, A. Zhang, S. Wang, S. C. Hui, Compositional de-attention networks, Advances in Neural Information Processing Systems 32 (2019).
[7] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[8] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: ICML, 2001, pp. 282–289.
[9] L. A. Ramshaw, M. P. Marcus, Text Chunking Using Transformation-Based Learning, Springer Netherlands, Dordrecht, 1999. URL: https://doi.org/10.1007/978-94-017-2390-9_10. doi:10.1007/978-94-017-2390-9_10.
[10] E. Cabrio, S. Villata, Five years of argument mining: A data-driven analysis, in: IJCAI, volume 18, 2018, pp. 5427–5433.
[11] T. Chakrabarty, C. Hidey, S. Muresan, K. McKeown, A. Hwang, AMPERSAND: Argument mining for PERSuAsive oNline discussions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2933–2943. URL: https://aclanthology.org/D19-1291. doi:10.18653/v1/D19-1291.
[12] M. Taboada, W. C. Mann, Applications of rhetorical structure theory, Discourse Studies 8 (2006) 567–588.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[14] L. Cheng, L. Bing, Q. Yu, W. Lu, L. Si, APE: Argument pair extraction from peer review and rebuttal via multi-task learning, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7000–7011. URL: https://aclanthology.org/2020.emnlp-main.569. doi:10.18653/v1/2020.emnlp-main.569.
[15] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, volume 4, IEEE, 2005, pp. 2047–2052.
[16] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021.
[17] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[18] W. N. Francis, H. Kucera, Brown Corpus Manual, Brown University, Providence, RI, 1979.