
ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning

Xin Wei

Md Reshad Ul Hoque

Jian Wu

Jiang Li

Old Dominion University, Norfolk, VA 23529, USA

The growth of scientific papers in the past decades calls for effective claim extraction tools to automatically and accurately locate key claims in unstructured text. Such claims will benefit content-wise aggregated exploration of scientific knowledge beyond the metadata level. One challenge of building such a model is how to effectively use limited labeled training data. In this paper, we compared transfer learning and contrastive learning frameworks in terms of performance, time, and training data size. We found contrastive learning has better performance at a lower cost of data across all models. Our contrastive-learning-based model ClaimDistiller has the highest performance, boosting the F1 score of the base models by 3-4%, and achieved an F1=87.45%, improving the state-of-the-art by more than 7% on the same benchmark data previously used for this task. The same phenomenon is observed on another benchmark dataset, and ClaimDistiller consistently has the best performance. Qualitative assessment on a small sample of out-of-domain data indicates that the model generalizes well. Our source code and datasets are available at https://github.com/lamps-lab/sci-claim-distiller.

Keywords: Scientific Claim Extraction, Word Embedding, Deep Learning, Transfer Learning, Contrastive Learning

1. Introduction

Because of the rapid increase of scientific papers indexed by digital libraries [1] [2], there is an emergent need to help readers efficiently grasp the main ideas of research papers. This can be achieved by developing algorithms to extract and aggregate key information from unstructured scholarly text. Existing machine learning methods have been developed to extract metadata, such as title, authors, year, and venue, e.g., [3], non-textual content such as figures and tables, e.g., [4], and high-level semantic information such as keywords, e.g., [5]. However, extracting scientific claims, which convey key findings and contributions, from unstructured text remains challenging because scientific ideas can be conveyed in a more complicated way than in general text such as newspapers and Wikipedia articles. Although deep learning has shown promising results for open-domain extractive summarization and key sentence identification, e.g., [6, 7], it is still challenging to train robust deep learning models on scientific papers [8] because of the lack of large-scale training data. Obtaining such training data usually requires domain knowledge, which regular crowdsourcing workers may not possess. Identifying key claims from scientific papers can also be time-consuming for domain experts. In addition, mining claims from scientific papers has been shown to be an important step toward automatically assessing reproducibility in social and behavioral sciences and other domains, e.g., [9, 10], which is investigated in DARPA's Systematizing Confidence in Open Research and Evidence (SCORE) program [11].

We define a scientific claim as a sentence that provides the core findings of a scientific paper. One example is given in Figure 1. Existing datasets with annotated claims are scarce and not available in all domains. Current datasets on claim extraction include the CoreSC dataset [12], with 265 articles in physical chemistry and biochemistry. The Dr. Inventor dataset [13] contains claims extracted from 40 computer graphics articles. Another dataset used in a recent paper [14] contains claims extracted from 1,500 scientific abstracts in the biomedical domain. Due to data scarcity, it is important to develop models that efficiently use existing data. In a recent paper [14], the authors introduced transfer learning to perform scientific claim extraction. In this paper, we explore alternative ways for this task.

Transfer learning uses the knowledge extracted from one or more source tasks, which usually have a high amount of resources, to accomplish a target task, which usually has a lower amount of resources. Transfer learning works by pretraining a neural model using data for the source tasks. The model is then retrained by freezing the weights of a portion of the neural network and learning the weights of the other portion of the same network [15]. Transfer learning has been adopted in computer vision (CV) and natural language processing (NLP) tasks, e.g., [16] [17].

Transfer learning relaxes the i.i.d. (independent and identically distributed) requirement for training and testing datasets. To be specific, the classes in the source data do not necessarily need to be the same as those in the target data. This is usually fulfilled by the extremely large sizes of source datasets, such as the ImageNet-21k dataset with 14.2 million images [18]. Source data used in NLP is usually on the order of tens of megabytes or more. Data size is a limit for claim extraction, and as a result transfer learning does not deliver enough power. In this paper, we introduce a contrastive learning framework, which uses significantly less training data and achieves comparable or better performance.

Self-supervised contrastive learning, a type of self-supervised representation learning, efficiently leverages limited training data and has demonstrated promising results in multiple CV and NLP tasks, e.g., [19, 20]. This method puts similar samples close to each other while pushing 'negative' samples far apart in the feature space [21]. For example, in image classification, data can be augmented by cutting and rotation. We can adjust the loss function to make the augmented samples from the same image close to each other and augmented samples from different images far away. In this way, the model can learn the features without looking at labels. The drawback of self-supervised contrastive learning is that the correlation of features between images belonging to the same class is ignored. This can be mitigated by leveraging label information, which is the idea of supervised contrastive learning [22].

In this paper, we compared transfer learning and supervised contrastive learning frameworks in terms of performance, time, and training data size. We found contrastive learning has better performance at a lower cost of data across all models on both datasets. We propose a contrastive-learning-based model ClaimDistiller, the backbone of which is a recurrent neural model with supervised contrastive learning. We demonstrate that the supervised contrastive learning mechanism improves the model performance by a significant margin with fewer training samples and less training time.

Our best model achieves F1=87.45% when trained and tested on SciCE. We further trained the model on another benchmark dataset, SciARK, and contrastive learning methods obtained better performance across all models than transfer learning. ClaimDistiller consistently outperforms all other models. The contributions of the paper are as follows:

1. We proposed using supervised contrastive learning (SCL) for scientific claim extraction. The results show that SCL achieves a comparable or better performance than transfer learning with significantly less training data and training time. The best model achieves an F1=87.45% on the SciCE dataset.
2. We compared 10 commonly used methods of text augmentation for training SCL in the context of scientific claim extraction. All methods exhibit a marginal effect on the model performance.
3. Our best model was trained and evaluated on a standard benchmark in the biomedical domain. The model exhibited reasonably good generalizability when tested in the computer science domain.

Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 3rd AI + Informetrics (EEKEAII2023), June 26, 2023, Santa Fe, New Mexico, USA and Online. † These authors contributed equally. Email: xwei001@odu.edu (X. Wei); j1wu@odu.edu (J. Wu). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Related Work

Scientific claim extraction is closely related to extractive document summarization and argumentation mining, which are more explored in the literature. The goal of extractive document summarization is to extract text that is much shorter than the original documents and delivers the main idea of the given documents [23]. A survey on extractive document summarization for scientific papers can be found in [24]. The text output by extractive document summarization may contain several key sentences that provide a high-level description of the original text. These sentences may not necessarily describe the core findings. Therefore, these methods cannot directly be used for extracting scientific claims.

Argument mining automatically extracts the structure of inference and reasoning presented in natural language text [25]. In argument mining, premises were extracted from news [26], social media [27], scientific articles [28], and Wikipedia [29]. Existing argument mining methods include heuristic methods [30, 31] and classical machine learning methods [32]. Recently, deep learning methods, including weak supervision and transfer learning mechanisms, have been proposed [33].

There are limited publications on scientific claim extraction. Dernoncourt et al. [34] developed a scientific discourse dataset, PubMed-RCT, in which sentences were labeled into five classes, namely, background, introduction, method, result, and conclusion. However, claims were not explicitly labeled in this dataset. Recently, a human-annotated scientific claim extraction dataset in biomedical domains was published [14]. Existing methods used for scientific claim extraction include rule-based and deep learning methods. Rule-based methods were used to extract claims from scientific papers in Jansen et al. [30]. Achakulvisut et al. [14] proposed a model consisting of a bidirectional long short-term memory (BiLSTM) network stacked with a conditional random field (CRF) model trained in a transfer learning framework. They trained their model on the PubMed-RCT dataset and then fine-tuned the model on their in-house SciCE dataset.

Non-claim sentence: This article is part of a Special Issue entitled Advances in High Density Lipoprotein Formation and Metabolism: A Tribute to John F. Oram (1945-2010).

3. Data

The claims to be extracted should be absolute, independent, core findings of the paper. A conclusion may not necessarily be a claim, but a claim is highly likely to be a conclusion. Claims may appear in the abstracts and the body text, but in our research task, we focus on extracting claims from abstracts, assuming that authors should put the core findings of the paper in the abstracts.

The data used in this paper includes three corpora. The first corpus was built by Achakulvisut et al. [14], which is the largest dataset so far for scientific claim extraction. For convenience, we call it the scientific claim extraction (SciCE) dataset.

Specifically, the dataset labels three types of claims:

Type 1: A statement that declares something is better;
Type 2: A statement that proposes something new;
Type 3: A statement that describes a new finding or a new cause-effect relationship.

The corpus contains 1,500 scientific abstracts in the biomedical domain. Each sentence in the abstracts was labeled by domain experts into two categories, namely, claim and non-claim. An example of a claim sentence and a non-claim sentence in an abstract is shown in Figure 1. Each abstract contains 5 to 10 sentences (Figure 2). One abstract may contain more than one claim (Figure 3). The majority of the abstracts contain 1–2 claims, and about half of the abstracts contain only 1 claim. The dataset contains in total 2276 claims and 9426 non-claims. For a fair comparison, we adopt the split of the original dataset, in which the numbers of training, test, and validation samples are 750, 375, and 375, respectively.

The second corpus is the PubMed-RCT dataset [34], designed for the discourse prediction task, which is to predict the discourse types for a sequence of sentences in one abstract. In our paper, it is used as the source dataset for transfer learning. PubMed-RCT is a larger dataset consisting of 20,000 abstracts, including 2.3 million sentences selected from the MEDLINE/PubMed Baseline Database published in 2016. The abstracts are in biomedical and life sciences domains, particularly in randomized controlled trials (RCTs). The discourse type for each sentence is one of five classes: Objective, Introduction, Method, Result, and Conclusion. The Method and Result classes each contain one-third of all labeled sentences. The remaining one-third contains sentences labeled as the other three classes. The number of sentences in an abstract is between 3 and 51, with an average of 11.6. This dataset will be used for pre-training in transfer learning.

A third dataset, SciARK, was introduced in a recent work [35]. It is a relatively small dataset composed of abstracts from 689 academic papers with 9055 sentences. The numbers of abstracts in the training, testing, and validation samples are 350, 269, and 70, respectively, as split by the authors. Each sentence is annotated as Claim, Evidence, or Nonetype. Unlike SciCE and PubMed-RCT, this dataset is multidisciplinary, with abstracts of scientific publications related to a broad spectrum of Sustainable Development Goals (SDG) domains. When using the dataset, we merge "Evidence" and "Nonetype" into "non-claim" and treat it as a binary-class dataset (claim vs. non-claim).

4. Proposed Framework: ClaimDistiller

We formulate the claim extraction task as a classification problem on a sequence of sentences, where the model predicts a class label, claim or non-claim, for each sentence. In regular classification models, text is represented in the form of vectors, and training a good representation is essential for classification. We improve the models by adopting supervised contrastive learning to generate better representations. We propose a framework called ClaimDistiller for extracting scientific claims from abstracts.

4.1. Supervised Contrastive Learning

Self-supervised contrastive learning [20] methods can be used to generate representations for non-labeled data. Such a method treats each sample in the dataset as a class and compares samples pairwise after data augmentation to obtain "apparent similarities", from which it generates a representation for each sample. Supervised contrastive learning [22] methods extend this framework to labeled data. The key idea is to train a representation that pulls together samples of the same class while simultaneously pushing apart samples of different classes in the embedding space. This step helps to create more accurate embeddings, so that subsequent classification based on them can achieve better performance than regular supervised learning.

In self-supervised contrastive learning each sample is considered a class, while in supervised contrastive learning each label is considered a class. As a result, in self-supervised contrastive learning the training process requires 2N augmented samples for the N samples in the training data, but in supervised contrastive learning the model can be trained with either N or 2N augmented samples. In our task we use supervised contrastive learning to train the model. We tried both N and 2N augmented samples. The supervised contrastive loss function is defined as:

$$\mathcal{L}^{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{1}$$

Here $i$ is the index of an arbitrary sample in the augmented dataset $I$. $P(i)$ is the set of samples in the same class as $i$, except sample $i$ itself. $A(i)$ is the set of samples in the augmented dataset except sample $i$. $z_i$, $z_p$, and $z_a$ stand for the representations of the anchor, a positive sample, and any other sample in $A(i)$ (which includes the negative samples), respectively. $\tau$ is the temperature parameter, which adjusts the distance between different classes in the embedding space.

4.2. Framework Architecture

Our proposed framework is based on supervised contrastive learning. The architecture of the framework is shown in Figure 4. The SCL is implemented in two stages. In the first stage, we augment each labeled sentence into two sentences with similar semantics. This augmented dataset is fed into the encoder and supports the Stage 1 training. The encoder, along with the projection head, which is composed of several dense layers, minimizes the supervised contrastive loss to obtain the optimal embeddings, grouping positive samples together and pushing negative samples far away. In Stage 2, we keep the encoder and freeze the weights in its dense layers, and add two more dense layers for classification. The classifier is trained to minimize the cross-entropy loss function.
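The supervised contrastive loss of Eq. (1), which the encoder and projection head minimize in Stage 1, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name, the L2 normalization of the representations, and the default temperature of 0.1 are assumptions made for illustration only.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of Eq. (1) for a small batch of sentence representations.

    embeddings: list of equal-length float vectors (one per augmented sample).
    labels: list of class labels, e.g. 1 for claim and 0 for non-claim.
    temperature: tau in Eq. (1); the default value is an assumption.
    """
    # Assumption: representations are L2-normalized before taking dot products.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total = 0.0
    for i in range(n):
        # z_i . z_j / tau for every sample j in the batch.
        sims = [
            sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
        ]
        others = [j for j in range(n) if j != i]                    # A(i)
        positives = [j for j in others if labels[j] == labels[i]]   # P(i)
        if not positives:
            continue  # an anchor with no positive contributes nothing
        log_denominator = math.log(sum(math.exp(sims[a]) for a in others))
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
    return total
```

When same-class representations are already close together, each log term is near zero, so the loss is small; when positives are far from the anchor, the loss grows. In a real training loop this quantity would be computed on framework tensors so that gradients can flow back into the encoder.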

4.3. Data Augmentation

Data augmentation is an essential part of contrastive learning methods: it creates the dataset used for pre-training by generating sentences with similar semantics. We investigate five types of methods and their variants to augment text given a labeled sentence.

1. Round Trip Translation (RTT) [36]. This method first translates the sentence from English to French and then translates it back to English. Translation is based on Google translation services as well as Amazon Translate [36].
2. Wordnet Synonym Replacement [36]. This method replaces words in the sentence with their synonyms. Replaceable words, such as verbs and nouns, are selected from a sentence using a part-of-speech tagger. Then a number of words are selected out of them following a Geometric distribution and replaced by their synonyms, which are given by a synonym library provided by WordNet.
3. EDA (Easy Data Augmentation) Synonym Replacement [37]. Randomly pick a word (not a stop word) from the sentence and then replace the word with one of its synonyms chosen at random.
4. EDA Random Deletion [37]. Randomly remove each word in the sentence with a configurable probability. We use the default probability value 0.2.
5. EDA Random Insertion [37]. Find a random synonym of a random word (not a stop word) in the sentence and then insert the synonym into a random position in the sentence.

We further generate augmented data by two data augmentation methods to obtain a bigger dataset for pretraining. A comparison of the results will be given in Section 7.

[Figure 4: Architecture of the framework. First-stage learning: example sentences ("John is going to town", "Joe is walking to town", "Mary is running to town") are fed to the encoder (WC-BiLSTM), followed by Dense, ReLU, and Dense layers trained with the supervised contrastive loss. Second-stage learning: the encoder layers are frozen and the classifier (WC-BiLSTM, Dense, Dense) is trained with the cross-entropy loss to produce the classification results.]

5. Experiment Setups

5.1. Base Models

As mentioned above, the first stage is to encode the input sentence into a vector. We experiment with three types of encoders, each in three settings: the original encoder, the encoder trained with transfer learning, and the encoder trained with SCL.

1. CNN-1D. Similar to the regular CNN used in feature extraction from 2-dimensional images, the 1-dimensional CNN has been used for extracting features from word sequences, e.g., [38]. This method works by sliding a window with a fixed width over a sequence and convolving features of tokens covered by the window [39]. Average pooling was used to aggregate features from individual tokens. Similar to a 2D CNN, the 1D CNN can be used for extracting patterns from local 1D patches (i.e., sub-sequences) of sequences. After each word-level token is converted to an initial vector, 1D convolutional layers with kernels of a fixed window size were used to extract the patterns (Figure 6). These layers can recognize patterns in an input sequence. We used a 2-layer 1D CNN, whose output is flattened before the representation is fed to a dense fully-connected layer for classification.
2. USE-dense. We adopted the pre-trained Universal Sentence Encoder (USE) [40] to encode claim text into dense 512-dimensional vectors. The initial embeddings produced by USE were fine-tuned on the SciCE corpus, after which the sentences were encoded to dense feature vectors used by the fully-connected layer for classification.
3. WC-BiLSTM (Word and Character embedding Bidirectional Long Short-Term Memory). One drawback of applying pre-trained word embeddings is that unseen words have to be encoded as a default vector at prediction time. The representations of these words could only be inferred from surrounding words. Word prefixes and suffixes often contain semantic information. Therefore, we combine pre-trained Word2Vec embeddings [41] with character embeddings [42] to encode unseen words. The combined embedding is fed to bidirectional long short-term memory (BiLSTM) layers to extract patterns from claim sentences (Figure 7). Finally, the representations were passed to a fully-connected layer for classification.
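To make the sliding-window operation of the CNN-1D base model concrete, the following is a minimal sketch of a single 1D convolutional filter followed by average pooling, written in plain Python. The function names, the single filter, and the ReLU activation are illustrative assumptions; the model described above stacks two convolutional layers with many filters.

```python
def conv1d_single_filter(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of token vectors.

    sequence: list of T token vectors, each of dimension D.
    kernel: list of W weight vectors (W = window size), each of dimension D.
    Returns T - W + 1 activations, one per window position.
    """
    window = len(kernel)
    activations = []
    for start in range(len(sequence) - window + 1):
        acc = bias
        for offset in range(window):
            token = sequence[start + offset]
            acc += sum(x * w for x, w in zip(token, kernel[offset]))
        activations.append(max(0.0, acc))  # ReLU is an assumption
    return activations


def average_pool(activations):
    """Aggregate the per-window activations into a single feature."""
    return sum(activations) / len(activations)
```

With a window size of 5, a sentence of T tokens yields T - 4 window positions, which is why very short sentences need padding before convolution.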

5.2. Experiments

To evaluate the robustness of the proposed framework, we investigate the base models in different training frameworks: only the base model, transfer learning, and supervised contrastive learning. Figure 5 shows a comparison of the three different training frameworks. 'Network' in this figure can be any of the base models. The training frameworks are as follows:

1. Base Model. In this setting, the base model is trained directly on the SciCE corpus, without pre-training.
2. Transfer Learning. In this setting, the neural classifier is first pre-trained using the PubMed-RCT corpus and then fine-tuned on the SciCE corpus. During the fine-tuning stage, we freeze the weights of all layers except the fully-connected classification layer. We then replace that fully-connected layer with a new layer whose outputs correspond to the classes in the target dataset.
3. Supervised Contrastive Learning. As discussed in Section 4, in supervised contrastive learning the neural network is first pre-trained with augmented training data from the SciCE corpus and then fine-tuned on the original SciCE data. Note that in this setting, only SciCE is used, which is a dataset much smaller than the PubMed-RCT dataset.

[Figure 5: Comparison of the three training frameworks. Panels: "Train with base models" (SciCE dataset, network, training), "Train with transfer learning" (PubMed dataset, pre-train network, freeze embedding layer, fine-tuning on SciCE dataset), and "Train with contrastive learning" (SciCE dataset, data augmentation, network, fine-tuning on SciCE dataset). Each panel outputs claim / non-claim labels.]

As a result, we have in total 9 experiment specifications: 3 different frameworks for each base model. In addition, we include the following two experiments from previous academic papers as baselines:

1. Heuristic Method. This baseline is adopted from Sateli & Witte [31]. This method used gazetteering, deictic phrases, and hand-crafted rules to match against the text. The sentence containing the deictic phrase must be a statement in the form of a factual implication, and must have a comparative voice or assert a property of the author's contribution, such as novelty or performance.
2. CRF-based Transfer Learning. This baseline is adopted from Achakulvisut et al. [14], in which transfer learning was applied to a conditional random field (CRF) model. This is the state of the art to the best of our knowledge. This method treats claim extraction as a sequence tagging task and uses the CRF to capture the dependencies of the label of the current sentence on the features and labels of neighboring sentences.

[Figure 6: 1D convolution with window size = 5 sliding over the initial sequence to produce the output features.]

6. Evaluation

6.1. Evaluation Metrics

The proposed methods and baselines are evaluated using the standard precision, recall, and F1 scores, defined below.

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R} \tag{2}$$

In Eq. (2), $P$ and $R$ stand for precision and recall, respectively. TP is the number of predicted claims that are true. FP is the number of predicted claims that are false. FN is the number of predicted non-claims that are false, i.e., claims that the model missed. $F_1$ is the harmonic mean of $P$ and $R$.
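As a small worked illustration of Eq. (2), the sketch below computes the three metrics from raw counts and checks the harmonic-mean relation against the precision and recall reported for ClaimDistiller on SciCE. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R, and F1 from counts of true positives, false positives,
    and false negatives, following Eq. (2)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = harmonic_mean(precision, recall)
    return precision, recall, f1


def harmonic_mean(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported on SciCE: precision = 87.08%, recall = 87.83%.
reported_f1 = round(harmonic_mean(87.08, 87.83), 2)
```

Here `reported_f1` evaluates to 87.45, matching the F1 score quoted for the best model.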

In addition, we also compare the training time. The training time was measured as the time elapsed between when the program started taking inputs (including pretraining) and when the model stopped training after certain numbers of epochs.

6.2. Experiment Details

All the experiments were performed on a single computer with a CPU with 4 physical cores, 16GB RAM, solid-state disks, and an Nvidia V100 GPU.

For the CNN-1D model, the window size was set to 5. Because the convolution is performed at the word level, we truncated sentences longer than 120 words and padded sentences shorter than 120 words.
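The fixed-length preprocessing described above can be sketched as follows. The padding token name and the choice to pad on the right are assumptions, since the text only states the target length of 120 words.

```python
def pad_or_truncate(tokens, max_len=120, pad_token="<PAD>"):
    """Return exactly max_len tokens: longer inputs are cut, shorter ones padded.

    pad_token and right-side padding are illustrative assumptions.
    """
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))
```

Every sentence then has the same length, so a batch can be stored as a rectangular array before the convolution is applied.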

When training the WC-BiLSTM model, the learning rate was set to 0.001, the batch size was set to 256, and the dropout rate was 0.5. Each encoder model and its variants were trained for a maximum of 50 epochs, within which the loss on the validation data reached its minimum. Early stopping was applied to avoid overfitting. We found that at this stage, the loss functions had asymptotically converged to the minimum.
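The stopping rule can be sketched as below, using a precomputed list of per-epoch validation losses as a stand-in for real training. The 50-epoch cap comes from the text; the patience value of 3 and the function name are assumptions, as the stopping criterion is not specified in detail.

```python
def select_best_epoch(validation_losses, max_epochs=50, patience=3):
    """Return (best_epoch, best_loss) under a simple early-stopping rule.

    Training stops after max_epochs, or once the validation loss has not
    improved for `patience` consecutive epochs (patience is an assumption).
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if epoch > max_epochs:
            break
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss
```

In practice the model weights saved at `best_epoch` would be the ones used for evaluation.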

6.3. Results

The results of the experiments are shown in Table 2. The first column shows the evaluation results of models trained on SciCE. The second column shows the training time. The third column shows the evaluation results of models trained on SciARK. The training time on SciARK is shown in column 4. In column 5 we compare the training data size for all scenarios.

As seen in Table 2, deep-learning-based models achieve much better performance than rule-based models, which suggests that the semantic features of scientific claims are complicated and are better represented by neural models.

In general, transfer-learning-based models achieve better performance than the corresponding original encoders by ∆F1 = 0.74–3.18. The efficacy of transfer learning comes from the source data used for pre-training. The discourse information in the PubMed-RCT corpus used here is relevant and helps improve the performance.

The comparison of transfer learning and contrastive learning is performed on two datasets: SciCE and SciARK. Contrastive learning achieves better performance than transfer learning for nearly all models: SCL beats transfer learning by ∆F1 = 0.82–2.72% on the SciCE dataset and by ∆F1 = 1.32–1.72% on the SciARK dataset. The only exception is that CNN-1D-contrastive underperformed CNN-1D-transfer by 0.52%. Therefore, SCL in general achieves a comparable or better performance than transfer learning.

The contrastive-learning-based model ClaimDistiller has the best performance across all metrics compared with other models, achieving F1=87.45%, precision=87.08%, and recall=87.83%. With SciARK, ClaimDistiller has the best performance with F1=88.93%, precision=90.02%, and recall=89.47%.

The training time needed for each model varies. In general, transfer learning needs significantly more time for training than supervised contrastive learning. The last column gives a clear comparison of training data size: contrastive learning uses fewer than 6,000 sentences, while transfer learning uses 2 million sentences for pre-training in order to achieve the performance reported in Table 1.

Notes to the results table:
1 Quoted from reference because they used the same test data.
2 Testing data size is in the parentheses. Measured by number of sentences.
3 Training time including both pre-training and fine-tuning.
4 Measured by number of sentences in training dataset.
5 WC-BiLSTM-contrastive.

7. Discussion

7.1. Data Augmentation Analysis

As discussed in Section 4, we tried several methods of text augmentation. Table 3 shows the experimental results obtained with the best model, WC-BiLSTM-contrastive. The results show that the various text augmentation methods have only a marginal effect on the classification performance of the SCL base model, with F1 ranging from 86.11% to 87.45%. Wordnet synonym replacement achieves the best performance, while random deletion is the worst. We therefore use Wordnet synonym replacement as the data augmentation method.
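For reference, the random deletion method compared here can be sketched in a few lines of plain Python. This is a simplified illustration rather than the EDA library code: the function name, whitespace tokenization, and the fallback of keeping one random word when every word would be deleted are assumptions.

```python
import random


def random_deletion(sentence, p=0.2, rng=None):
    """Remove each word independently with probability p (default 0.2).

    rng: optional random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()
    words = sentence.split()  # assumption: simple whitespace tokenization
    if len(words) <= 1:
        return sentence
    kept = [word for word in words if rng.random() >= p]
    if not kept:
        # Assumption: never return an empty sentence; keep one random word.
        kept = [rng.choice(words)]
    return " ".join(kept)
```

Because deletion can drop content words that carry the claim itself, it is plausible that this method preserves the label less reliably than synonym replacement, which is consistent with it ranking last in the comparison.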

7.2. Error Analysis

In this section, we perform error analysis focusing on the best model: WC-BiLSTM-contrastive. Out of the 375 abstracts in the testing set of SciCE, this model correctly predicts all the claims and non-claims in 125 abstracts. As shown in Figure 8, in the remaining 250 abstracts, the majority have 1–2 wrongly predicted sentences, with a maximum of 4 prediction errors in a single abstract. As shown in Figure 9, the error rates are all below 0.5, with an average of 0.13.

We demonstrate two examples containing typical errors in the prediction results for case studies (Figure 10). The ground truth claims are highlighted in blue. Green labels mean the sentences are non-claims and red labels mean the sentences are claims. Labels with red frames indicate wrong predictions. In the first example, the model is able to identify all the claims, but it mistakenly recognizes two sentences as claims. In the second example, there should be two claims but the model only identified one of them.

Example 1 is challenging because the two false positives look like claims but when they are read together with the first sentence, it is clear that the second sentence (starting with “The discussion emphasizes”) describes what the authors have done in the paper and the third sentence (starting with “A fundamental need”) describes a background, which is the motivation of the research.

Example 2 contains a false negative. It is not straightforward to determine why the sentence starting with “Results indicated that” was misclassified as a non-claim, because the leading pattern clearly indicates that the sentence conveys key findings. The error analyses indicate that although the recurrent model attempted to incorporate context information, it may still miss the nuances of semantics. Fine-tuning the hyperparameters may help, but a more sophisticated and robust model is needed to capture the nuances. One method is to combine latent and rule-based features. Another possible method is to leverage the "knowledge" encoded in large language models (LLMs), e.g., GPT-3, or to use the LLM-adapter method to train an adapter for this task.

Prediction results on abstract 1:

Grounded in a socio-ecological framework, we describe salient health care system and policy factors that influence engagement in human immunodeficiency virus (HIV) clinical care. [Non-claim]

The discussion emphasizes successful programs and models of service delivery and highlights the limitations of current, fragmented health care system components in supporting effective, efficient, and sustained patient engagement across a continuum of care. [Claim]

A fundamental need exists for improved synergies between funding and service agencies that provide HIV testing, prevention, treatment, and supportive services. [Claim]

We propose a feedback loop whereby actionable, patient-level surveillance of HIV testing and engagement in care activities inform educational outreach and resource allocation to support integrated "testing and linkage to care plus" service delivery. [Claim]

Ongoing surveillance of programmatic performance in achieving defined benchmarks for linkage of patients who have newly diagnosed HIV infection and retention of those patients in care is imperative to iteratively inform further educational efforts, resource allocation, and refinement of service delivery. [Claim]

Figure 10: Two examples of errors in the prediction results in the test set of SciCE. The ground truth claims are highlighted in blue. Green labels mean the sentences are non-claims and red labels mean the sentences are claims. Labels with red frames indicate wrong predictions.

7.3. Domain Adaptability

The SciCE corpus is in the biomedical domain. To test whether the model performs well in a different domain, we applied the best model (WC-BiLSTM-contrastive) to classify sentences in a random selection of 30 abstracts of computer science papers. Out of the 195 sentences in this dataset, 60 sentences were predicted as claims. Visual inspection of these predicted claims showed that 50 of them are consistent with the definition of claims [14]. Examples of the successfully predicted claims are given in Figure 11.

This post-hoc evaluation indicates that the model's precision on computer science abstracts (P ≈ 83.3%) is roughly consistent with its precision in the biomedical domain, implying that claims in these two domains are usually written with similar language patterns. We also observed that the model tends to omit claims, indicating that more robust domain adaptation may be needed to improve recall.
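The precision estimate above can be reproduced from the two counts reported in this section. The following is a minimal sketch; the variable names are illustrative and do not come from the released code.

```python
# Counts reported in Section 7.3 for the 30 computer science abstracts.
total_sentences = 195             # all sentences in the sample
predicted_claims = 60             # sentences the model labeled as claims
consistent_with_definition = 50   # predicted claims judged correct on inspection

# Precision = correct positive predictions / all positive predictions.
precision = consistent_with_definition / predicted_claims
print(f"Precision: {precision:.1%}")  # prints "Precision: 83.3%"
```

Recall cannot be derived from these counts alone, because the 135 sentences predicted as non-claims were not annotated; this is why the observation about omitted claims is stated qualitatively.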

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.

These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.

Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner.

We show that it significantly outperforms competitive baselines.

7.4. Visualization on SciCE Data

To further qualitatively demonstrate the effect of supervised contrastive learning, we project the 128-dimensional vectors output by the WC-BiLSTM base model into a 2-dimensional feature space using t-SNE [43], and then compare the result with the projection obtained with supervised contrastive learning. Figure 12 shows that the model with supervised contrastive learning groups sentences of the same class together, making the two classes more separated in the feature space.
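The grouping effect can be related to the objective itself. The sketch below implements a supervised contrastive loss in the general form given in [22] using only the Python standard library. The temperature of 0.1 and the 2-dimensional toy vectors are illustrative assumptions, not values taken from this paper, and the function name `supcon_loss` is hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors that have a positive."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples that share the anchor's label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        # Numerically stable log of the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors

labels = [1, 1, 0, 0]  # 1 = claim, 0 = non-claim (toy data)
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]      # classes interleaved
separated = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]  # classes clustered

loss_mixed = supcon_loss(mixed, labels)
loss_separated = supcon_loss(separated, labels)
print(loss_separated < loss_mixed)  # prints "True"
```

Because the loss is lower when same-class embeddings lie close together and far from the other class, minimizing it pushes the encoder toward the kind of class-wise clusters seen in the projection.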

Figure 12: The t-SNE plots showing the effects of supervised contrastive learning. The upper panel shows the two classes without supervised contrastive learning. The lower panel shows the two classes with supervised contrastive learning. Orange dots represent claims and blue dots represent non-claims.

8. Conclusion

To automatically obtain scientific findings from the ever-increasing volume of scientific papers, an effective and efficient claim-extracting tool is becoming increasingly important for information aggregation, summarization, and retrieval of scientific papers. One bottleneck of this task is the limitation of annotated training data. The challenge is how to efficiently use existing limited data.

We propose the ClaimDistiller framework, which uses supervised contrastive learning on top of existing text encoders to boost the performance of classification. We showcased the efficacy of this mechanism on two benchmark datasets. Our results establish a new state of the art on the SciCE dataset, outperforming the existing method, which used transfer learning on a BiLSTM-CRF architecture, by 7%. We demonstrated that SCL achieved comparable or higher F1 scores than transfer learning methods with significantly less training data and time. Future research will explore hybrid methods and LLMs to capture nuances of context.

Acknowledgments

This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under cooperative agreement No. W911NF-19-2-0272. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. We also acknowledge the partial support from the Virginia Commonwealth Cyber Initiative (CCI) Grant #H-4Q21-014.

References

[1] Weber, Applying artificial intelligence in the science & technology cycle, Inf. Serv. Use 39 (2019) 303–318. URL: https://doi.org/10.3233/ISU-190062. doi:10.3233/ISU-190062.

[2] Lo, L. L. Wang, Neumann, Kinney, D. S. Weld, S2orc: The semantic scholar open research corpus, arXiv preprint arXiv:1911.02782 (2019).

[3] Lipinski, Yao, Breitinger, Beel, Gipp, Evaluation of header metadata extraction approaches and tools for scientific pdf documents, in: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, ACM, New York, NY, USA, 2013, pp. 385–386. URL: http://doi.acm.org/10.1145/2467696.2467753. doi:10.1145/2467696.2467753.

[4] Siegel, Lourie, Power, W. Ammar, Extracting scientific figures with distantly supervised neural networks, in: J. Chen, M. A. Gonçalves, J. M. Allen, E. A. Fox, Kan, V. Petras (Eds.), Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018, ACM, 2018, pp. 223–232. URL: http://doi.acm.org/10.1145/3197026.3197040. doi:10.1145/3197026.3197040.

[5] Bulgarov, Caragea, A comparison of supervised keyphrase extraction models, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, New York, NY, USA, 2015, pp. 13–14. URL: http://doi.acm.org/10.1145/2740908.2742776. doi:10.1145/2740908.2742776.

[6] Gong, Liu, Generic text summarization using relevance measure and latent semantic analysis, in: W. B. Croft, D. J. Harper, D. H. Kraft, J. Zobel (Eds.), SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA, ACM, 2001, pp. 19–25. URL: https://doi.org/10.1145/383952.383955. doi:10.1145/383952.383955.

[7] Liu, Zhang, Wan, Xia, He, P. S. Yu, HETFORMER: heterogeneous transformer with sparse attention for long-text extractive summarization, in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021, pp. 146–154. URL: https://doi.org/10.18653/v1/2021.emnlp-main.13. doi:10.18653/v1/2021.emnlp-main.13.

[8] Jiang, Zhao, Qin, T. Liu, N. V. Chawla, M. Jiang, The role of "condition": a novel scientific knowledge graph representation and construction model, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1634–1642.

[9] Hutson, Artificial intelligence faces reproducibility crisis, Science 359 (2018) 725–726. URL: http://science.sciencemag.org/content/359/6377/725. doi:10.1126/science.359.6377.725.

[10] C. F. Camerer, Dreber, Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, G. Nave, B. A. Nosek, T. Pfeiffer, A. Altmejd, N. Buttrick, T. Chan, Chen, Forsell, Gampa, E. Heikensten, L. Hummer, T. Imai, S. Isaksson, D. Manfredi, J. Rose, E.-J. Wagenmakers, H. Wu, Evaluating the replicability of social science experiments in nature and science between 2010 and 2015, Nature Human Behaviour 2 (2018) 637–644. URL: https://doi.org/10.1038/s41562-018-0399-z. doi:10.1038/s41562-018-0399-z.

[11] Alipourfard, Arendt, D. M. Benjamin, N. Benkler, M. M. Bishop, M. Burstein, M. Bush, J. Caverlee, Chen, Clark, et al., Systematizing confidence in open research and evidence (score), 2021. URL: osf.io/preprints/socarxiv/46mnb. doi:10.31235/osf.io/46mnb.

[12] Liakata, Teufel, Siddharthan, C. Batchelor, Corpora for the conceptualisation and zoning of scientific papers (2010).

[13] Fisas, Saggion, Ronzano, On the discoursive structure of computer graphics research papers, in: Proceedings of the 9th linguistic annotation workshop, 2015, pp. 42–51.

[14] Achakulvisut, Bhagavatula, D. E. Acuna, K. P. Körding, Claim extraction in biomedical publications using deep discourse model and transfer learning, CoRR abs/1907.00962 (2019). URL: http://arxiv.org/abs/1907.00962. arXiv:1907.00962.

[15] Zhuang, Qi, Duan, Xi, Zhu, Zhu, Xiong, He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (2021) 43–76. doi:10.1109/JPROC.2020.3004555.

[16] Sharif Razavian, Azizpour, Sullivan, S. Carlsson, Cnn features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 806–813.

[17] Devlin, M.- Chang, Lee, Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[18] Kolesnikov, Beyer, Zhai, J. Puigcerver, J. Yung, S. Gelly, N. Houlsby, Big transfer (bit): General visual representation learning, in: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part V 16, Springer, 2020, pp. 491–507.

[19] Bachman, R. D. Hjelm, W. Buchwalter, Learning Representations by Maximizing Mutual Information across Views, Curran Associates Inc., Red Hook, NY, USA, 2019.

[20] Chen, Kornblith, Norouzi, G. E. Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1597–1607. URL: http://proceedings.mlr.press/v119/chen20j.html.

[21] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, F. Makedon, A survey on contrastive self-supervised learning, Technologies 9 (2020) 2.

[22] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020) 18661–18673.

[23] A. Barrera, R. M. Verma, Combining syntax and semantics for automatic extractive single-document summarization, in: A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 13th International Conference, CICLing 2012, New Delhi, India, March 11-17, 2012, Proceedings, Part II, volume 7182 of Lecture Notes in Computer Science, Springer, 2012, pp. 366–377. URL: https://doi.org/10.1007/978-3-642-28601-8_31. doi:10.1007/978-3-642-28601-8_31.

[24] N. Ibrahim Altmami, M. El Bachir Menai, Automatic summarization of scientific articles: A survey, Journal of King Saud University - Computer and Information Sciences 34 (2022) 1011–1028. URL: https://www.sciencedirect.com/science/article/pii/S1319157820303554. doi:https://doi.org/10.1016/j.jksuci.2020.04.020.

[25] J. Lawrence, C. Reed, Argument mining: A survey, Computational Linguistics 45 (2020) 765–818.

[26] C. Sardianos, I. M. Katakis, G. Petasis, V. Karkaletsis, Argument extraction from news, in: Proceedings of the 2nd Workshop on Argumentation Mining, ArgMining@HLT-NAACL 2015, June 4, 2015, Denver, Colorado, USA, The Association for Computational Linguistics, 2015, pp. 56–66. URL: https://doi.org/10.3115/v1/w15-0508. doi:10.3115/v1/w15-0508.

[27] M. Dusmanu, E. Cabrio, S. Villata, Argument mining on twitter: Arguments, facts and sources, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Association for Computational Linguistics, 2017, pp. 2317–2322. URL: https://doi.org/10.18653/v1/d17-1245. doi:10.18653/v1/d17-1245.

[28] C. Stab, C. Kirschner, J. Eckle-Kohler, I. Gurevych, Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective, in: E. Cabrio, S. Villata, A. Z. Wyner (Eds.), Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, Forlì-Cesena, Italy, July 21-25, 2014, volume 1341 of CEUR Workshop Proceedings, CEUR-WS.org, 2014. URL: http://ceur-ws.org/Vol-1341/paper5.pdf.

[29] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and verification, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 809–819. URL: https://doi.org/10.18653/v1/n18-1074. doi:10.18653/v1/n18-1074.

[30] T. Jansen, T. Kuhn, Extracting core claims from scientific articles, in: T. Bosse, B. Bredeweg (Eds.), BNAIC 2016: Artificial Intelligence - 28th Benelux Conference on Artificial Intelligence, Amsterdam, The Netherlands, November 10-11, 2016, Revised Selected Papers, volume 765 of Communications in Computer and Information Science, Springer, 2016, pp. 32–46. URL: https://doi.org/10.1007/978-3-319-67468-1_3. doi:10.1007/978-3-319-67468-1_3.

[31] B. Sateli, R. Witte, Semantic representation of scientific literature: bringing claims, contributions and named entities onto the linked open data cloud, PeerJ Computer Science 1 (2015) e37. URL: https://doi.org/10.7717/peerj-cs.37. doi:10.7717/peerj-cs.37.

[32] S. Yuan, B. Yu, Hclaime: A tool for identifying health claims in health news headlines, Inf. Process. Manag. 56 (2019) 1220–1233. URL: https://doi.org/10.1016/j.ipm.2019.03.001. doi:10.1016/j.ipm.2019.03.001.

[33] M. Dehghani, A. Severyn, S. Rothe, J. Kamps, Learning to learn from weak supervision by full supervision, 2017. URL: https://arxiv.org/abs/1711.11383. doi:10.48550/ARXIV.1711.11383.

[34] F. Dernoncourt, J. Y. Lee, Pubmed 200k RCT: a dataset for sequential sentence classification in medical abstracts, in: G. Kondrak, T. Watanabe (Eds.), Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017, Volume 2: Short Papers, Asian Federation of Natural Language Processing, 2017, pp. 308–313. URL: https://aclanthology.org/I17-2052/.

[35] A. Fergadis, D. Pappas, A. Karamolegkou, H. Papageorgiou, Argumentation mining in scientific literature for sustainable development, in: Proceedings of the 8th Workshop on Argument Mining, 2021, pp. 100–111.

[36] V. Marivate, T. Sefara, Improving short text classification through global augmentation methods, in: Machine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020, Dublin, Ireland, August 25-28, 2020, Proceedings 4, Springer, 2020, pp. 385–399.

[37] Wei, Zou, Eda: Easy data augmentation techniques for boosting performance on text classification tasks, arXiv preprint arXiv:1901.11196 (2019).

[38] Kim, Huang, Tomala, Matthews, Strubell, Saunders, McCallum, E. Olivetti, Machine-learned and codified synthesis parameters of oxide materials, Scientific Data 4 (2017) 170127. URL: https://doi.org/10.1038/sdata.2017.127. doi:10.1038/sdata.2017.127.

[39] Kiranyaz, Avci, Abdeljaber, Ince, Gabbouj, D. J. Inman, 1d convolutional neural networks and applications: A survey, Mechanical Systems and Signal Processing 151 (2021) 107398. URL: https://www.sciencedirect.com/science/article/pii/S0888327020307846. doi:https://doi.org/10.1016/j.ymssp.2020.107398.

[40] Cer, Yang, Kong, Hua, Limtiaco, R. S. John, N. Constant, Guajardo-Cespedes, Yuan, Tar, Strope, Kurzweil, Universal sentence encoder for english, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 169–174. URL: https://aclanthology.info/papers/D18-2029/d18-2029.

[41] Mikolov, Chen, G. Corrado, Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781 (2013). URL: http://arxiv.org/abs/1301.3781.

[42] Xie, Avati, Arivazhagan, Jurafsky, A. Y. Ng, Neural language correction with character-based attention, CoRR abs/1603.09727 (2016). URL: http://arxiv.org/abs/1603.09727. arXiv:1603.09727.

[43] Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).