=Paper=
{{Paper
|id=Vol-3681/T3-2
|storemode=property
|title=Current language models' poor performance on pragmatic aspects of natural language
|pdfUrl=https://ceur-ws.org/Vol-3681/T3-2.pdf
|volume=Vol-3681
|authors=Albert Pritzkau,Julia Waldmüller,Olivier Blanc,Michaela Geierhos,Ulrich Schade
|dblpUrl=https://dblp.org/rec/conf/fire/PritzkauWBGS23
}}
==Current language models' poor performance on pragmatic aspects of natural language==
Albert Pritzkau (1,*,†), Julia Waldmüller (2,†), Olivier Blanc (2,†), Michaela Geierhos (2) and Ulrich Schade (1)

(1) Fraunhofer Institute for Communication, Information Processing and Ergonomics (FKIE), Fraunhoferstraße 20, 53343 Wachtberg, Germany
(2) Research Institute for Cyber Defence and Smart Data (CODE), University of the Bundeswehr Munich, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany

Forum for Information Retrieval Evaluation, December 15–18, 2023, India
* Corresponding author. † These authors contributed equally.
albert.pritzkau@fkie.fraunhofer.de (A. Pritzkau); julia.waldmueller@unibw.de (J. Waldmüller); olivier.blanc@unibw.de (O. Blanc); michaela.geierhos@unibw.de (M. Geierhos); ulrich.schade@fkie.fraunhofer.de (U. Schade)
https://www.fkie.fraunhofer.de (A. Pritzkau); https://go.unibw.de/waldmueller (J. Waldmüller); https://go.unibw.de/blanc (O. Blanc); https://go.unibw.de/geierhos (M. Geierhos); https://www.fkie.fraunhofer.de (U. Schade)
ORCID: 0000-0001-7985-0822 (A. Pritzkau); 0000-0002-8180-5606 (M. Geierhos)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

With the following system description, we present our approach to claim detection in tweets. We address both Subtask A, a binary sequence classification task, and Subtask B, a token classification task. For the first subtask, each input chunk (in this case, each tweet) was assigned a class label. For the second subtask, a label was assigned to each individual token of an input sequence. To match each utterance with the appropriate class label, we used pre-trained RoBERTa (A Robustly Optimized BERT Pretraining Approach) language models for sequence classification. Using the provided data and annotations as training data, we fine-tuned a model for each of the two classification tasks. Although the resulting models serve as adequate baselines, our exploratory data analysis suggests fundamental problems in the structure of the training data. We argue that such tasks cannot be fully solved if pragmatic aspects of language are ignored. This type of information, often contextual and thus not explicitly stated in written language, is insufficiently represented in current models. For this reason, we posit that the provided training data is under-specified and imperfectly suited to these classification tasks.

Keywords: Pragmatics, Information Extraction, Text Classification, RoBERTa

1. Introduction

Political rhetoric, propaganda, and advertising are all examples of persuasive discourse. As defined by Lakoff [1], persuasive discourse is the non-reciprocal “attempt or intention of one party to change the behavior, feelings, intentions, or viewpoint of another by communicative means”. Thus, in addition to the purely content-related features of communication, the discursive context of utterances plays a central role. The shared task CLAIMSCAN’2023 [2], on the topic Uncovering Truth in Social Media through Claim Detection and Identification of Claim Spans, considers claims a key element of current information campaigns that aim to mislead and deceive.
The goal of both Subtasks A and B is to develop systems that can effectively detect and identify claims in social media text. The utterance of a particular claim is understood as a communicative phenomenon. This approach assumes that communication depends not only on the meaning of the words in an utterance but also on what speakers intend to communicate with that utterance. In linguistics, such an approach is adopted by the field of pragmatics. It is not always possible to deduce the function of an utterance from its form; additional contextual information is often needed. Recent research [3, 4] suggests that transformer-based networks capture structural information about language, ranging from orthographic, morphological, and syntactic features up to semantic ones. Beyond these features, however, the capabilities of these architectures remain almost entirely unexplored. This task is an attempt to probe the limits of the prevailing approach, in particular, to investigate the ability of transformers to capture pragmatic features.

The shared task CLAIMSCAN’2023 defines the following subtasks:

Subtask A. Claim Detection [5]: The task is a binary classification problem, where the objective is to label a given social media post as a claim or non-claim. A claim is an assertive statement that may or may not have evidence.

Subtask B. Claim Span Identification [6]: The task is to identify the words/phrases that contribute to the claims made in a given social media post. A claim is an assertive statement that may or may not be supported by evidence.

2. Background

The linguistic field of pragmatics regards speaking as acting, or more precisely, as acting with the intention of manipulating the audience. The speech act called assertion [7, 8] means to make a statement so that the audience is informed about something. According to Grice’s cooperative principle [9], the information provided must be relevant, helpful, and true in the context of the discourse. Since we are attuned to this principle, false claims are effective if they show no signs of falsehood or duplicity: we simply follow the cooperative principle and take the statement to be true, with all the consequences this implies. Signs of falsehood or duplicity can protect us from being deceived in this way. Such signs can be violations of one’s own beliefs (e.g., ‘Hawaiian wildfire is an attack experiment of a weather weapon conducted by the US military’), a wrong style, e.g., excessive emotion in a news text (e.g., ‘Hawaiian wildfire is a scandalous attack experiment of a perfidious weather weapon conducted by the sleazy US military’), or untypical grammatical errors such as omitted determiners (e.g., ‘Hawaiian wildfire is attack experiment of weather weapon conducted by US military’). However, some of these signs might be overlooked because of our attunement to the cooperative principle in general and Grice’s maxim of quality (ibidem) in particular. A system might therefore perform better than humans at detecting false claims.

Task descriptions

This paper describes our participation in both subtasks. The challenge in Subtask A is to decide whether a given tweet contains a claim. Accordingly, the task is formulated as a binary classification problem. Beyond the mere identification of claims, Subtask B involves the delineation of the text intervals containing those claims. For each token in a tweet, it must be determined whether it is part of a claim, and subsequently the claim span is to be derived. The model should then predict the indices of the span intervals for each tweet.
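To make this decoding step concrete, the following is a minimal sketch (our own illustration, not code provided with the task) of how per-token claim labels could be converted back into character-level span indices. It assumes a simple inside/outside labeling scheme and a HuggingFace fast tokenizer that exposes character offsets; the model producing `token_labels` is left out, and the exact index convention of the official submission format may differ.

```python
from transformers import AutoTokenizer

# Hypothetical setup for illustration; model and tagging scheme are assumptions.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def decode_spans(text: str, token_labels: list[int]) -> list[tuple[int, int]]:
    """Convert per-token labels (1 = inside a claim, 0 = outside)
    into character-level (start, end) span indices."""
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=True)
    spans, start, end = [], None, None
    for (char_start, char_end), label in zip(enc["offset_mapping"], token_labels):
        if char_start == char_end:          # special tokens map to empty offsets
            continue
        if label == 1:
            start = char_start if start is None else start
            end = char_end
        elif start is not None:             # a claim span just ended
            spans.append((start, end))
            start, end = None, None
    if start is not None:                   # span runs to the end of the tweet
        spans.append((start, end))
    return spans
```

A different tokenizer or a BIO scheme would change the details, but the offset mapping is what ties sub-word tokens back to positions in the original tweet text.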
Exploratory Data Analysis

The organizers of the CodaLab competition CLAIMSCAN’2023 released the datasets for both subtasks. Each subtask dataset consists of a training set, a development set, and a test set, all focused on discussions related to the COVID-19 pandemic.

The labeled data for Subtask A, obtained from 8,483 tweets, comprises a training set of 6,986 tweets and a development set of 1,497 tweets, a ratio of 82:18. Assuming that this split had already been validated, we did not apply any resampling. Both sets consist of the tweets in plain text with an additional binary label, claim or non-claim. While the definition of a claim was given as “an assertive statement that may or may not have evidence”, we observed questionable annotations in the training set. For example, the tweet ‘Older but still relevant: Health products that make false or misleading claims to prevent, treat or cure #COVID19 may put your health at risk via HealthCanada #cdnhealth https://t.co/9dFNXaV3gW’ is labeled as a non-claim, whereas the tweet ‘coronavirus altnews founder shekhar gupta and others spread unverified claims by a fake twitter account’ is marked as a claim. For the purpose of submission, an unlabeled test set of 1,489 tweets was used.

For Subtask B, the training set contained 6,044 tweets and the development set 756 tweets, a ratio of 89:11; the test set contained 755 entries. In contrast to Subtask A, here, in addition to the tweet text and the claim label, the start and end indices of the token spans corresponding to the claims were also provided. In total, 7,585 spans were annotated as claims, meaning that some tweets contained more than one claim. As in Subtask A, we made several notable observations regarding the labeled training data. We found an instance of an impossible annotation in line 19 of the training set. This anomaly raises questions about data quality and the need for quality-control mechanisms when building the dataset. Furthermore, our analysis of the annotation spans revealed that 235 ‘@’ mentions and 16 URLs (starting with “https://...”) were present in the annotated text. We found that colons appeared to be the most indicative feature for identifying the beginning of a claim, with 846 instances of this pattern in the training dataset. Additionally, the data analysis suggests that keyword-based sampling was used in the construction of the training dataset. This is particularly evident from Figure 3 and is supported, for example, by the fact that the account name of Donald Trump (@realdonaldtrump) appears among the top 30 most frequent words (see Figure 3b). Surprisingly, we found that cleaning the training data resulted in poorer model performance.

[Figure 1: Label distribution and class imbalance for Subtask A. Panels: (a) training set, (b) development set, (c) test set; axes: claim label vs. count.]

The value and meaning of accuracy and other well-known performance metrics of an analytical model can be greatly affected by data imbalance. As shown in Figures 1a and 1b, the class distribution is skewed. This poses a challenge for the balanced learning of the model, as the non-claim class is significantly underrepresented.
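Counts like those above are straightforward to reproduce. The sketch below shows one way to do so with pandas; the file and column names are assumptions for illustration, since the exact release format may differ.

```python
import pandas as pd

# Hypothetical file/column names; the released files may be organized differently.
train = pd.read_csv("subtask_a_train.csv")           # columns: tweet, label
print(train["label"].value_counts(normalize=True))   # claim vs. non-claim imbalance

spans = pd.read_csv("subtask_b_train.csv")           # columns: tweet, start, end
claim_text = spans.apply(lambda r: r["tweet"][r["start"]:r["end"]], axis=1)

print("mentions in spans:", claim_text.str.count(r"@\w+").sum())
print("urls in spans:", claim_text.str.count(r"https?://\S+").sum())
# Colons directly preceding an annotated span start:
print("colon-led spans:", sum(
    row["tweet"][:row["start"]].rstrip().endswith(":")
    for _, row in spans.iterrows()
))
```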
When comparing the distributions of annotation length in the training, development, and test sets, as shown in Figure 2, it becomes apparent that they deviate significantly from each other and, in some cases, exhibit a strong concentration of data points within specific groups.

3. System overview

In this study, we evaluate and compare a sequence classification approach on the given data with different augmentations. The comparison is performed at the level of models trained on the same dataset. The different evaluation paradigms result from applying sequence classifier heads to a pre-trained base model. We suggest that contextual information leads to a qualitative difference in the scores.

3.1. Pre-trained language representation

At the core of any solution to a given task is a pre-trained language model derived from BERT [10]. BERT stands for Bidirectional Encoder Representations from Transformers; it is a deep bidirectional transformer encoder pre-trained on large unlabeled text corpora.
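As a rough, self-contained sketch of such a fine-tuning setup for Subtask A (the toy data and hyperparameters are illustrative, not our actual configuration):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-ins for the shared-task data; real training would load the CSVs.
train_ds = Dataset.from_dict({
    "tweet": ["masks cause oxygen deprivation", "stay home and stay safe"],
    "label": [1, 0],  # 1 = claim, 0 = non-claim
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

def tokenize(batch):
    # Tweets are short; 128 sub-word tokens is a comfortable upper bound.
    return tokenizer(batch["tweet"], truncation=True,
                     padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="claim-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds.map(tokenize, batched=True),
)
trainer.train()
```

For Subtask B, the same skeleton applies with `AutoModelForTokenClassification` and per-token labels aligned to sub-word tokens via the tokenizer’s word IDs.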