<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Current language models' poor performance on pragmatic aspects of natural language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Albert Pritzkau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Waldmüller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Blanc</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michaela Geierhos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulrich Schade</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Communication, Information Processing and Ergonomics (FKIE)</institution>
          ,
          <addr-line>Fraunhoferstraße 20, 53343 Wachtberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Institute for Cyber Defence and Smart Data (CODE), University of the Bundeswehr Munich</institution>
          ,
          <addr-line>Werner-Heisenberg-Weg 39, 85577 Neubiberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the following system description, we present our approach to claim detection in tweets. We address both Subtask A, a binary sequence classification task, and Subtask B, a token classification task. For the first of the two subtasks, each input chunk (in this case, each tweet) was given a class label. For the second subtask, a label was assigned to each individual token in an input sequence. In order to match each utterance with the appropriate class label, we used pre-trained RoBERTa (A Robustly Optimized BERT Pretraining Approach) language models for sequence classification. Using the provided data and annotations as training data, we fine-tuned a model for each of the two classification tasks. Although the resulting models serve as adequate baseline models, the exploratory data analysis suggests fundamental problems in the structure of the training data. We argue that such tasks cannot be fully solved if pragmatic aspects of language are ignored. This type of information, often contextual and thus not explicitly stated in written language, is insufficiently represented in the current models. For this reason, we posit that the provided training data is under-specified and imperfectly suited to these classification tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Pragmatics</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Text Classification</kwd>
        <kwd>RoBERTa</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Political rhetoric, propaganda, and advertising are all examples of persuasive discourse. As
defined by Lakoff [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], persuasive discourse is the non-reciprocal “attempt or intention of one
party to change the behavior, feelings, intentions, or viewpoint of another by communicative
means”. Thus, in addition to the purely content-related features of communication, the
discursive context of utterances plays a central role. The shared task CLAIMSCAN’2023 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
on the topic Uncovering Truth in Social Media through Claim Detection and Identification of
Claim Spans considers claims as a key element of current information campaigns, with the
aim to mislead and deceive. The goal of both Subtasks A and B is to develop systems that can
effectively detect and identify claims in social media text. The utterance of a particular claim is
understood as a communicative phenomenon. This approach assumes that communication
depends not only on the meaning of the words in an utterance but also on what speakers intend
to communicate with a particular utterance. In linguistics, such an approach is adopted by the
field of pragmatics. It is not always possible to deduce the function of an utterance from its
form. Additional contextual information is often needed. Recent research [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] suggests the
possibility that transformer-based networks capture structural information about language,
ranging from orthographic, morphological, and syntactic up to semantic features. Beyond
these features, the behavior of these architectures remains almost entirely unexplored. This task is an attempt
to explore the limits of the prevailing approach, in particular, to investigate the ability of
transformers to capture pragmatic features.
      </p>
      <p>
        The shared task CLAIMSCAN’2023 defines the following subtasks:
Subtask A. Claim Detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: The task is a binary classification problem, where the
objective is to label the given social media post as a claim or non-claim. A claim is an assertive
statement that may or may not have evidence.
      </p>
      <p>
        Subtask B. Claim Span Identification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: The task is to identify the words/phrases that
contribute to the claims made in the given social media post. A claim is an assertive statement
that may or may not be supported by evidence.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        The linguistic field of pragmatics regards speaking as acting, or more precisely, as acting with
the intention of manipulating the audience. The speech act called assertion [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] means to make
a statement so that the audience is informed about something. According to Grice’s cooperative
principle [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the information provided must be relevant, helpful, and true in the context of
the discourse. Since we are attuned to this principle, false claims are effective if they show
no signs of falsehood or duplicity. We simply follow the cooperative principle and take the
statement to be true, with all the consequences this implies. Signs of falsehood or duplicity can
save us from being deceived in this way. Such signs can be contradictions of one’s own beliefs (e.g., ‘Hawaiian
wildfire is an attack experiment of a weather weapon conducted by the US military’), a wrong
style, e.g. excessive emotion in a news text (e.g., ‘Hawaiian wildfire is a scandalous attack
experiment of a perfidious weather weapon conducted by the sleazy US military’), or untypical
grammatical errors like omitting determiners (e.g., ‘Hawaiian wildfire is attack experiment of
weather weapon conducted by US military’). However, some of these signs might be overlooked
because of our attunement to the cooperative principle in general and Grice’s maxim of quality
(ibidem) in particular. An automatic system, not being subject to this attunement, might therefore perform better than humans at detecting false claims.
      </p>
      <sec id="sec-2-1">
        <title>Task descriptions</title>
        <p>This paper describes the participation in both subtasks. The challenge for Subtask A is to decide
whether a given tweet contains a claim. Accordingly, the task is formulated as a binary
classification problem. Beyond the mere identification of claims, Subtask B involves the delineation of
text intervals containing said claims. For each token in a tweet, it must be determined whether the token
is part of a claim; the claim span is then derived from these token-level decisions. The model thus
predicts the indices of the span intervals for each tweet.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Exploratory Data Analysis</title>
        <p>The organizers of the CodaLab competition CLAIMSCAN’2023 have released the datasets for
both subtasks. Each subtask dataset consists of a training set, a development set, and a test set,
all focused on discussions related to the COVID-19 pandemic.</p>
        <p>The labeled data for Subtask A, obtained from 8,483 tweets, includes both the training set of
6,986 tweets and the development set of 1,497 tweets, resulting in a ratio of 82:18. Assuming that
the split was already validated, we did not apply any resampling. Both sets consist of the tweets
in plain text with an additional binary label, claim or non-claim. Although the definition of a claim
was given as ‘an assertive statement that may or may not have evidence’, we observed
questionable annotations in the training set. For example, the tweet
‘Older but still relevant: Health products that make false or misleading claims to prevent, treat or
cure #COVID19 may put your health at risk via HealthCanada #cdnhealth</p>
        <p>https://t.co/9dFNXaV3gW’
is labeled as a non-claim. However, the tweet
‘coronavirus altnews founder shekhar gupta and others spread unverified claims by a fake twitter
account’
is marked as a claim. For the purpose of submission, an unlabeled test set consisting of 1,489
tweets was used.</p>
        <p>For Subtask B, the training set contained 6,044 tweets and the development set 756 tweets,
resulting in a ratio of 89:11. The test dataset contained 755 entries. In contrast to Subtask A, here,
in addition to the tweet text and the claim label, the start index and the end index of the token
spans corresponding to the claims were also provided. In total, 7,585 spans were annotated as
claims, meaning that some tweets contained more than one claim. As in Subtask A, we made
several notable observations regarding the labeled training data. We observed an instance of an
impossible annotation in line 19 of the training set. This anomaly raised questions regarding
the quality of data and the need for quality control mechanisms when building the dataset.
Furthermore, during our analysis of annotation spans, it was revealed that 235 ‘@’ mentions and
16 URLs (starting with “https://...”) were present in the annotated text. We discovered that colons
appeared to be the most indicative feature for identifying the beginning of a claim, with 846
instances manifesting this pattern within the training dataset. Additionally, the data analysis
suggests the utilization of keyword-based sampling in the construction of the training dataset.
This is particularly evident from Figure 3: for example, the
account name of Donald Trump (@realdonaldtrump) appears among the top 30 most frequent words
(see Figure 3b). Surprisingly, we found that cleaning the training data resulted in a poorer
performance of the model.</p>
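        <p>A minimal sketch of how such span statistics can be computed is given below. It assumes the Subtask B training data is available as a CSV file with hypothetical columns text and claim_spans (a list of character offsets); the actual format of the released files may differ.</p>
        <preformat>
# Minimal sketch of the span statistics reported above (counts of '@' mentions,
# URLs, and claims directly preceded by a colon). The file name, column names,
# and span format are assumptions, not the official data format.
import ast
import re
import pandas as pd

df = pd.read_csv("claimscan_subtaskB_train.csv")  # hypothetical file name

mention_count = url_count = colon_prefix_count = 0
for _, row in df.iterrows():
    text = row["text"]
    for start, end in ast.literal_eval(row["claim_spans"]):  # (start, end) character offsets
        span = text[start:end]
        mention_count += len(re.findall(r"@\w+", span))
        url_count += len(re.findall(r"https://\S+", span))
        if text[:start].rstrip().endswith(":"):  # annotated claim begins right after a colon
            colon_prefix_count += 1

print(mention_count, url_count, colon_prefix_count)
        </preformat>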
        <p>[Figure 1: Class distribution – (a) training set, (b) development set, (c) test set]</p>
        <p>The value and meaning of accuracy and other well-known performance metrics of an
analytical model can be greatly affected by data imbalance. As shown in Figures 1a and 1b, the
class distribution is skewed. This poses a challenge for the balanced learning of the model, as
the non-claim class is significantly underrepresented. When comparing the distributions of
annotation length in the training set, development set, and test set, as shown in Figure 2, it
becomes apparent that these significantly deviate from each other and, in some cases, exhibit a
strong concentration of data points within specific groups.</p>
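        <p>A small illustration of this effect, with made-up labels: on a skewed label distribution, a classifier that always predicts the majority class obtains high accuracy but a poor macro-averaged F1 score, which is why the latter is the more informative metric here.</p>
        <preformat>
# Illustration (with made-up labels) of why accuracy is misleading under class
# imbalance while macro-F1 penalizes ignoring the minority class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 90 + [0] * 10   # 90 claims, 10 non-claims (skewed distribution)
y_pred = [1] * 100             # degenerate model: always predict "claim"

print(accuracy_score(y_true, y_pred))             # 0.9
print(f1_score(y_true, y_pred, average="macro"))  # about 0.47
        </preformat>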
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System overview</title>
      <p>In this study, we evaluate and compare a sequence classification approach on the given data
with different augmentations. The comparison is performed at the level of models trained on the
same dataset. The different evaluation paradigms result from applying sequence classifier
heads to a pre-trained base model. We suggest that contextual information leads to a
qualitative difference in the scores.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-trained language representation</title>
        <p>
          At the core of any solution to a given task is a pre-trained language model derived from
BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. BERT stands for Bidirectional Encoder Representations from Transformers.
[Figure 2: Annotation length distribution – (a) training set, (b) development set, (c) test set]
It is based on the transformer model architectures introduced by Vaswani et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The general
approach consists of two stages. First, BERT is pre-trained on large amounts of text, with
the unsupervised goal of masked language modeling and next sentence prediction. Second,
this pre-trained network is then fine-tuned on task-specific, labeled data. The transformer
architecture consists of two parts, an encoder and a decoder, for each of the two stages. The
encoder used in BERT is an attention-based architecture for NLP. It works by performing a
small, constant number of steps. In each step, it applies an attention mechanism to understand
the relationships between all the words in a sentence, regardless of their respective positions.
By pre-training language representations, the encoder yields models that can either be used to
extract high-quality language features from text data or to fine-tune these models for specific
NLP tasks (classification, entity recognition, question answering, etc.).
[Figure 3: Term distributions – (a) in annotation spans, (b) in full text]
We rely on RoBERTa [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], a pre-trained encoder model that builds on BERT’s language masking strategy. However,
it modifies key hyperparameters in BERT, such as removing BERT’s next-sentence pre-training
objective and training with much larger mini-batches and learning rates. In addition, RoBERTa
has been trained on an order of magnitude more data than BERT, for a longer period of time.
This allows RoBERTa representations to generalize to downstream tasks even better than BERT.
        </p>
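        <p>As an illustration of this two-stage approach, the following sketch loads a pre-trained RoBERTa encoder and attaches a sequence classification head for fine-tuning. The checkpoint name roberta-base is an assumption; the specific checkpoint used is not stated above.</p>
        <preformat>
# Sketch of the pre-train/fine-tune paradigm with Hugging Face Transformers.
# "roberta-base" is an assumed checkpoint; any RoBERTa checkpoint would do.
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# The pre-trained encoder already yields contextual features; only the small
# classification head on top is randomly initialized and learned during fine-tuning.
inputs = tokenizer("Hawaiian wildfire is an attack experiment of a weather weapon",
                   truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # shape [1, 2]: scores for non-claim vs. claim
        </preformat>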
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Binary Sequence Classification Problem</title>
        <p>
          Model Architecture – NLytics. Subtask A is considered to be a binary classification problem.
The models for the experimental setup were based on RoBERTa. For the classification task,
fine-tuning is first performed using RobertaForSequenceClassification [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] as the pre-trained model. RobertaForSequenceClassification optimizes
a binary cross-entropy loss using an AdamW optimizer [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] with an initial learning rate set to
2e-5. After a warm-up period during which the learning rate increases linearly from 0 to the
initial learning rate, the optimizer is scheduled to decrease the actual learning rate linearly to 0.
Training was started with a maximum of 20 epochs. However, this relatively high number
is significantly reduced by an early stopping callback that monitors the performance of the
model on the validation dataset. A patience of 5 epochs is set for this callback. For this setup,
fine-tuning was done on an NVIDIA TESLA V100 GPU using the PyTorch [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] framework with
a vocabulary size of 50,265 and an input size of 512.
        </p>
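        <p>The optimization schedule described above can be sketched as follows. The data loaders over the tokenized tweets and the number of warm-up steps are assumptions made for illustration; they are not specified in the text.</p>
        <preformat>
# Sketch of the fine-tuning schedule described above: AdamW at lr 2e-5, linear
# warm-up followed by linear decay to 0, at most 20 epochs, early stopping with
# patience 5. train_loader / dev_loader (DataLoaders over the tokenized tweets)
# and the number of warm-up steps are assumptions.
import torch
from transformers import RobertaForSequenceClassification, get_linear_schedule_with_warmup

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

num_epochs = 20
num_training_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=num_training_steps)

def evaluate(model, loader):
    """Mean loss on the development set, used for early stopping."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for batch in loader:
            total += model(**batch).loss.item()
    return total / max(len(loader), 1)

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss    # cross-entropy over the claim / non-claim labels
        loss.backward()
        optimizer.step()
        scheduler.step()              # warm-up, then linear decay of the learning rate

    val_loss = evaluate(model, dev_loader)
    if val_loss >= best_val_loss:     # no improvement on the development set
        bad_epochs += 1
        if bad_epochs >= patience:
            break                     # early stopping
    else:
        best_val_loss, bad_epochs = val_loss, 0
        </preformat>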
        <p>
          Model Architecture – CODE. The experimental setup and approach for the binary
classification problem are almost identical to the one above. Instead of RoBERTa, we used BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
Therefore, we fine-tuned the model using BertForSequenceClassification. This model was also
trained for five epochs, following the same approach described above. An NVIDIA GeForce RTX
3090 GPU with 24 GB of memory was used for fine-tuning with PyTorch [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Token Classification Problem</title>
        <p>Tagging format. We have transformed the initial span markup into the IOB (Inside, Outside,
Begin) tagging format. Since we have only one possible entity class, each token is assigned
one of the tags O, B-claim, or I-claim.</p>
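        <p>A minimal sketch of this conversion is shown below. It assumes whitespace tokenization and claim spans given as (start, end) character offsets; the exact tokenization and span format of the released data may differ.</p>
        <preformat>
# Sketch: convert character-level claim spans into token-level IOB tags.
# Whitespace tokenization and the (start, end) span format are assumptions.
def spans_to_iob(text, spans):
    tokens, tags, offset = [], [], 0
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        inside = any(start >= s and e >= end for s, e in spans)
        if not inside:
            tags.append("O")
        elif tags and tags[-1].endswith("claim"):
            tags.append("I-claim")      # continuation of the current claim
        else:
            tags.append("B-claim")      # first token of a claim span
        tokens.append(token)
    return list(zip(tokens, tags))

# the span covers "spread unverified claims" (characters 28-52)
print(spans_to_iob("coronavirus altnews founder spread unverified claims", [(28, 52)]))
        </preformat>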
        <p>
          Model Architecture – NLytics. Subtask B is considered to be a token classification problem.
We have fine-tuned a RoBERTa model to predict the above IOB tags for each token in the input
sentence. In the default configuration, each token is classified independently of the surrounding
tokens. Although the surrounding tokens are taken into account in the contextualized
embeddings, there is no modeling of the dependency between the predicted labels: for example, an I tag
cannot logically follow an O tag. Since RoBERTa does not model the dependencies between the
predicted tokens, we further added a linear-chain Conditional Random Field (CRF) model [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as
an additional layer, in order to model the dependency between the predicted labels of individual
tokens. Since the sequence of an I tag following an O tag does not occur in the training set, the CRF
learns to assign a very low probability to the transition from an O tag to an I tag. The CRF
receives the logits for each input token, and makes a prediction for the entire input sequence,
taking into account the dependencies between the labels, similar to Lample et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Note
that RoBERTa works with byte pair encoding (BPE) units, while the CRF needs to work with
whole words. Thus, only head tokens were used as input to the CRF, and any word continuation
tokens were omitted. The models for the experimental setup are based on RoBERTa. For the
classification task, fine-tuning is first performed using RobertaForSequenceClassification [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] as the pre-trained model. RobertaForSequenceClassification optimizes
a cross-entropy loss using an AdamW optimizer [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] with an
initial learning rate set to 2e-5. After a warm-up period during which the learning rate increases
linearly from 0 to the initial learning rate, the optimizer is scheduled to decrease the actual
learning rate linearly to 0. Training was started with a maximum of 20 epochs. However,
this relatively high number is significantly reduced by an early stopping callback that monitors
the performance of the model on the validation dataset. A patience of five epochs is set for
this callback. For this setup, fine-tuning was done on an NVIDIA TESLA V100 GPU using the
PyTorch [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] framework with a vocabulary size of 50,265 and an input size of 512.
        </p>
        <p>
          Model Architecture – CODE. We also experiment with an alternative setup for the token
classification problem of Subtask B, using a simplified tag set. In this setup, the RoBERTa model
is fine-tuned to predict a binary label (0 or 1) for each token, describing whether the token is
part of a claim or not. Unlike the IOB tag set, the first token of a claim is not distinguished
and is assigned the same label 1 as the subsequent tokens that are part of the claim. For this
experiment, we stop fine-tuning the RobertaForSequenceClassification model after 4 epochs on
the training set to avoid overfitting, as we empirically observe a degradation of the performance
on the validation dataset after this point. We observed that, with this setup, short sequences of only
one or two tokens were regularly annotated incorrectly as claims. We therefore filter out, in a second
step, those predicted claims that are shorter than three words, in order to
reduce noise and obtain more realistic annotations.
        </p>
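        <p>The following sketch illustrates the CRF extension of the NLytics setup for a single sentence: word-level IOB tags, emissions restricted to the first BPE unit of each word, and Viterbi decoding over the whole sequence. The pytorch-crf package and the roberta-base checkpoint are assumptions; the text above does not name a specific CRF implementation.</p>
        <preformat>
# Sketch of the RoBERTa + linear-chain CRF setup over head (word-initial) BPE tokens.
# Assumptions: the pytorch-crf package, the roberta-base checkpoint, batch size 1.
import torch
from torchcrf import CRF
from transformers import RobertaTokenizerFast, RobertaForTokenClassification

TAGS = ["O", "B-claim", "I-claim"]
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaForTokenClassification.from_pretrained("roberta-base", num_labels=len(TAGS))
crf = CRF(len(TAGS), batch_first=True)

words = ["altnews", "founder", "spread", "unverified", "claims"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
logits = encoder(**enc).logits[0]              # one score vector per BPE unit

# keep only the first BPE unit of each word; continuation units are omitted
word_ids = enc.word_ids()
heads = [i for i, w in enumerate(word_ids)
         if w is not None and (i == 0 or word_ids[i - 1] != w)]
emissions = logits[heads].unsqueeze(0)         # shape [1, num_words, num_tags]

gold = torch.tensor([[0, 0, 1, 2, 2]])         # O O B-claim I-claim I-claim
loss = -crf(emissions, gold)                   # negative log-likelihood for training
pred = crf.decode(emissions)[0]                # Viterbi decoding at inference time
print([TAGS[t] for t in pred])
        </preformat>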
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We participated in both Subtasks A and B. Because of the similar approach, these working notes
describe the results of two teams, NLytics and CODE. The official evaluation results for the test
set are shown in Tables 1 and 6. In the following, the results are presented for each subtask.
In the discussion of the results, we address the reasons for the differences in the performance
of the two teams. The submissions were optimized for the minimum validation loss to avoid
overfitting the resulting model. During the training phase, we focused on finding the best
combinations of deep learning methods and optimizing the corresponding hyperparameter
settings. Fine-tuning pre-trained language models like RoBERTa on downstream tasks has
become ubiquitous in NLP research and applied NLP. Even without extensive preprocessing of
the training data, we already achieved competitive results. The resulting models serve as strong
baselines that, when fine-tuned, significantly outperform models trained from scratch.</p>
      <sec id="sec-4-1">
        <title>4.1. Subtask A</title>
        <p>The model checkpoint with the minimum validation error was selected for submission. For
NLytics, this minimum was reached after four epochs of training. The class-related differences in
model performance shown in Table 2 clearly reflect the class imbalance in the initial distribution
(cf. Figure 1). Different data cleaning strategies intended to mitigate the impact of technical structures
such as URLs or account names on the linguistic evaluation had a negative impact on the
performance of the resulting models on the development set. For example, URLs were replaced
with a unique placeholder sequence to clean up the data. The same was done with the account names.</p>
        <p>As shown in Table 1, the Macro-F1 value for CODE differs from that of NLytics by 0.0476. This
discrepancy is due to the choice of the model, as the model with the lower Macro-F1 used an
uncased BERT model, despite following the same approach.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Subtask B</title>
        <p>For NLytics, the model checkpoint with the minimum validation error was reached after three
epochs of training. Table 3 shows the corresponding evaluation metrics. The best result could
only be achieved by extending the model with the CRF. Similar to the results of Subtask A, the
data cleaning strategies had a negative impact on the performance of the resulting models on
the development set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The use of neural architectures in the field of pragmatics remains largely unexplored. The
limitations are clearly demonstrated by the results of the given task. In the future, we would like
to extend the current approach by adding features that represent the extended communicative
context. Our research aims at the specification of a consistent goal function that is adapted
to the discursive context of manipulative communication. We hypothesize that the target
variables of this function, in the form of different discourse elements, will respond to different
features of the given communicative context. If the required features cannot be derived from the
linguistic structure of the utterances, they have to be obtained from the extended context of the
communication. We are investigating ways to make external features available to the training
process. Thus, in order to identify pragmatic features and to learn how to take advantage of
them, the application of XAI methods seems promising.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Lakoff</surname>
          </string-name>
          ,
          <article-title>Persuasive discourse and ordinary conversation, with examples from advertising, Analyzing discourse: Text and talk (</article-title>
          <year>1982</year>
          )
          <fpage>25</fpage>
          -
          <lpage>42</lpage>
          . Publisher: Georgetown, Georgetown University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Overview of the claimscan-2023: Uncovering truth in social media through claim detection and identification of claim spans</article-title>
          , in: Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Tenney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          , R. T.
          <string-name>
            <surname>McCoy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B. V.</given-names>
          </string-name>
          <string-name>
            <surname>Durme</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          <string-name>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Das</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <article-title>What do you learn from context? Probing for sentence structure in contextualized word representations</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1905.06316.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <article-title>What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3651</fpage>
          -
          <lpage>3657</lpage>
          . URL: https://aclanthology.org/P19-1356. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          -1356.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty, LESA:
          <article-title>Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>3178</fpage>
          -
          <lpage>3188</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .eacl-main.
          <volume>277</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pulastya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Akhtar</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Empowering the Fact-checkers! Automatic Identification of Claim Spans on Twitter</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>7701</fpage>
          -
          <lpage>7715</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>525</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <article-title>How to do things with words</article-title>
          , Oxford University Press,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Searle</surname>
          </string-name>
          , Sprechakte: ein sprachphilosophischer Essay, Suhrkamp,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Grice</surname>
          </string-name>
          ,
          <article-title>Logic and conversation</article-title>
          , in: Speech acts, Brill,
          <year>1975</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2018</year>
          ). arXiv:1810.04805v2.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>2017</volume>
          <source>-Decem</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5999</fpage>
          -
          <lpage>6009</lpage>
          . ISSN:
          <volume>10495258</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , V. Stoyanov,
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          , arXiv e-prints (
          <year>2019</year>
          ) arXiv-
          <fpage>1907</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. v. Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-Art Natural Language Processing</article-title>
          , in: arxiv.org,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .emnlp-demos.
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          , Decoupled Weight Decay Regularization,
          <source>in: 7th International Conference on Learning Representations, ICLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          , S. Chintala,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>CoRR</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1912.01703. ISSN: 10495258.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          ,
          <source>in: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          , ICML '
          <fpage>01</fpage>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2001</year>
          , p.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <article-title>Neural architectures for named entity recognition, in: 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          ,
          <source>NAACL HLT 2016 - Proceedings of the Conference, Association for Computational Linguistics (ACL)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>270</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/n16-
          <fpage>1030</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>