<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xin Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Reshad Ul Hoque</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiang Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Old Dominion University</institution>
          ,
          <addr-line>Norfolk, VA, 23529</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The growth of scientific papers in the past decades calls for effective claim extraction tools to automatically and accurately locate key claims from unstructured text. Such claims will benefit content-wise aggregated exploration of scientific knowledge beyond the metadata level. One challenge of building such a model is how to effectively use limited labeled training data. In this paper, we compared transfer learning and contrastive learning frameworks in terms of performance, time and training data size. We found contrastive learning has better performance at a lower cost of data across all models. Our contrastive-learning-based model ClaimDistiller has the highest performance, boosting the F1 score of the base models by 3-4%, and achieved an F1=87.45%, improving the state-of-the-art by more than 7% on the same benchmark data previously used for this task. The same phenomenon is observed on another benchmark dataset, and ClaimDistiller consistently has the best performance. Qualitative assessment on a small sample of out-of-domain data indicates that the model generalizes well. Our source codes and datasets can be found here: https://github.com/lamps-lab/sci-claim-distiller.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Claim Extraction</kwd>
        <kwd>Word Embedding</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Contrastive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Because of the rapid increase of scientific papers indexed by digital libraries [<xref ref-type="bibr" rid="ref1">1</xref>] [<xref ref-type="bibr" rid="ref2">2</xref>], there is an emerging need to help readers efficiently grasp the main ideas of research papers. This can be achieved by developing algorithms that extract and aggregate key information from unstructured scholarly text. Existing machine learning methods have been developed to extract metadata, such as title, authors, year, and venue, e.g., [<xref ref-type="bibr" rid="ref3">3</xref>], non-textual content such as figures and tables, e.g., [<xref ref-type="bibr" rid="ref4">4</xref>], and high-level semantic information such as keywords, e.g., [<xref ref-type="bibr" rid="ref5">5</xref>]. However, extracting scientific claims, which convey key findings and contributions, from unstructured text remains challenging because scientific ideas can be conveyed in a more complicated way than general text as used in newspapers and Wikipedia articles. Although deep learning has shown promising results for open domain extractive summarization and key sentence identification, e.g., [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>], it is still challenging to train robust deep learning models on scientific papers [<xref ref-type="bibr" rid="ref8">8</xref>] because of the lack of large-scale training data. Obtaining such training data usually requires domain knowledge, which regular crowdsourcing workers may not possess. Identifying key claims from scientific papers can also be time-consuming for domain experts. In addition, mining claims from scientific papers has been shown to be an important step toward automatically assessing reproducibility in social and behavioral sciences and other domains, e.g., [<xref ref-type="bibr" rid="ref9">9, 10</xref>], which is investigated in DARPA’s Systematizing Confidence in Open Research and Evidence (SCORE) program [11].</p>
      <p>We define a scientific claim as a sentence that provides the core findings of a scientific paper. One example is given in Figure 1. Existing datasets with annotated claims are scarce and not available in all domains. Current datasets on claim extraction include the CoreSC dataset [12] with 265 articles in physical chemistry and biochemistry. The Dr. Inventor dataset [13] contains claims extracted from 40 computer graphics articles. Another dataset used in a recent paper [14] contains claims extracted from 1,500 scientific abstracts in the biomedical domain. Due to data scarcity, it is important to develop models that efficiently use existing data. In a recent paper [14] the authors introduced transfer learning to perform scientific claim extraction. In this paper, we explore alternative ways for this task.</p>
      <p>Transfer learning uses the knowledge extracted from one or more source tasks, which usually have a high amount of resources, to accomplish a target task, which usually has a lower amount of resources. Transfer learning works by pretraining a neural model using data for the source tasks. The model is retrained by freezing the weights of a portion of a neural network and learning the weights of the other portion of the same neural network [15]. Transfer learning has been adopted in computer vision (CV) and natural language processing (NLP) tasks, e.g., [16] [17].
      </p>
      <p>Transfer learning relaxes the i.i.d. (independent and identically distributed) requirement for training and testing datasets. To be specific, the classes in the source data do not necessarily need to be the same as those in the target data. This is usually fulfilled by the extremely large sizes of source datasets, such as the ImageNet-21k dataset with 14.2 million images [18]. Source data used in NLP is usually on the order of tens of megabytes or more. Data size is a limit for claim extraction, and as a result transfer learning does not deliver enough power. In this paper, we introduce a contrastive learning framework which uses significantly less training data and achieves comparable or better performance.</p>
      <p>Self-supervised contrastive learning, a type of self-supervised representation learning, efficiently leverages limited training data and has demonstrated promising results in multiple CV and NLP tasks, e.g., [19, 20]. This method puts similar samples close to each other while pushing ‘negative’ samples far apart in the feature space [21]. For example, in image classification, data can be augmented by cutting and rotation. We can adjust the loss function to make the augmented samples from the same image close to each other and augmented samples from different images far away. In this way, the model can learn the features without looking at labels. The drawback of self-supervised contrastive learning is that the correlation of features between images belonging to the same class is ignored. This can be mitigated by leveraging label information, which is the idea of supervised contrastive learning [22].</p>
      <p>In this paper, we compared transfer learning and supervised contrastive learning frameworks in terms of performance, time and training data size. We found contrastive learning has better performance at a lower cost of data across all models on both datasets. We propose a contrastive-learning-based model ClaimDistiller, the backbone of which is a recurrent neural model with supervised contrastive learning. We demonstrate that the supervised contrastive learning mechanism improves the model performance by a significant margin with fewer training samples and less training time.</p>
      <p>Our best model achieves F1=87.45% when trained and tested on SciCE. We further trained the model on another benchmark dataset SciARK, and contrastive learning methods obtained better performance across all models than transfer learning. ClaimDistiller consistently outperforms all other models.</p>
      <p>The contributions of the paper are as follows:</p>
      <p>1. We proposed using supervised contrastive learning (SCL) for scientific claim extraction. The results show that SCL achieves a comparable or better performance than transfer learning with significantly less training data and training time. The best model achieves an F1=87.45% on the SciCE dataset.</p>
      <p>2. We compared 10 commonly used methods of text augmentation for training SCL in the context of scientific claim extraction. All methods exhibit a marginal effect on the model performance.</p>
      <p>3. Our best model was trained and evaluated on a standard benchmark in the biomedical domain. The model exhibited reasonably good generalizability when it was tested in the computer science domain.</p>
      <p>Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 3rd AI + Informetrics (EEKEAII2023), June 26, 2023, Santa Fe, New Mexico, USA and Online. † These authors contributed equally. Contact: xwei001@odu.edu (X. Wei); j1wu@odu.edu (J. Wu). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>[Figure 1: An example of a claim sentence and a non-claim sentence in an abstract. Non-claim sentence: "This article is part of a Special Issue entitled Advances in High Density Lipoprotein Formation and Metabolism: A Tribute to John F. Oram (1945-2010)."]</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>Scientific claim extraction is closely related to extractive document summarization and argumentation mining, which are more explored in the literature. The goal of extractive document summarization is to extract text that is much shorter than the original documents and delivers the main idea of the given documents [23]. A survey on extractive document summarization for scientific papers can be found in [24]. The text output by extractive document summarization may contain several key sentences that provide a high-level description of the original text. These sentences may not necessarily describe the core findings. Therefore, these methods cannot directly be used for extracting scientific claims.</p>
        <p>Argument mining automatically extracts the structure of inference and reasoning presented in natural language text [25]. In argument mining, premises were extracted from news [26], social media [27], scientific articles [28], and Wikipedia [29]. Existing argument mining methods include heuristic methods [30, 31] and classical machine learning methods [32]. Recently, deep learning methods, including weak supervision and transfer learning mechanisms, have been proposed [33].</p>
        <p>There are limited publications on scientific claim extraction. Dernoncourt et al. [34] developed a scientific discourse dataset PubMed-RCT, in which sentences were labeled into five classes, namely, background, introduction, method, result, and conclusion. However, claims were not explicitly labeled in this dataset. Recently, a human-annotated scientific claim extraction dataset in biomedical domains was published [14]. Existing methods used for scientific claim extraction include rule-based and deep learning methods. Rule-based methods were used to extract claims from scientific papers in Jansen et al. [30]. Achakulvisut et al. [14] proposed a model consisting of a bidirectional long short-term memory (BiLSTM) network stacked with a conditional random field (CRF) model trained in a transfer learning framework. They trained their model on the PubMed-RCT dataset and then fine-tuned the model on their in-house SciCE dataset.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
      <p>The claims to be extracted should be absolute, independent, core findings of the paper. A conclusion may not necessarily be a claim, but a claim is highly likely to be a conclusion. Claims may appear in the abstracts and the body text, but in our research task, we focus on extracting claims from abstracts, assuming that authors should put the core findings of the paper in the abstracts.</p>
      <p>The data used in this paper includes three corpora. The first corpus was built by Achakulvisut et al. [14], which is the largest dataset so far for scientific claim extraction. For convenience, we call it the scientific claim extraction (SciCE) dataset.</p>
      <p>Specifically, the dataset labels three types of claims:</p>
      <sec id="sec-2-1">
        <title>Type 1: A statement that declares something is better;</title>
      </sec>
      <sec id="sec-2-2">
        <title>Type 2: A statement that proposes something new;</title>
      </sec>
      <sec id="sec-2-3">
        <title>Type 3: A statement that describes a new finding or a new cause-effect relationship.</title>
        <p>The corpus contains 1,500 scientific abstracts in the biomedical domain. Each sentence in the abstracts was labeled by domain experts into two categories, namely, claim and non-claim. An example of a claim sentence and a non-claim sentence in an abstract is shown in Figure 1. Each abstract contains 5 to 10 sentences (Figure 2). One abstract may contain more than one claim (Figure 3). The majority of the abstracts contain 1–2 claims, and about half of the abstracts contain only 1 claim. The dataset contains in total 2276 claims and 9426 non-claims. For a fair comparison, we adopt the split of the original dataset, in which the training, test, and validation sets contain 750, 375, and 375 abstracts, respectively.</p>
        <p>The second corpus is the PubMed-RCT dataset [34], designed for the discourse prediction task, which was to predict the discourse types for a sequence of sentences in one abstract. In our paper, it is used as the source dataset for transfer learning. PubMed-RCT is a larger dataset consisting of 20,000 abstracts, including 2.3 million sentences selected from the MEDLINE/PubMed Baseline Database published in 2016. The abstracts are in biomedical and life sciences domains, and particularly in randomized controlled trials (RCTs). The discourse type for each sentence is one of the five classes, Objective, Introduction, Method, Result, and Conclusion. The Method and Result classes each contain one-third of all labeled sentences. The remaining one-third contains sentences labeled as the other three classes. The number of sentences in an abstract is between 3 and 51, with an average of 11.6. This dataset will be used for pre-training in transfer learning.</p>
        <p>A third dataset, SciARK, was introduced in a recent work [35]. It is a relatively small dataset composed of abstracts from 689 academic papers with 9055 sentences. The numbers of abstracts in the training, testing, and validation sets are 350, 269, and 70, respectively, as split by the authors. Each sentence is annotated as Claim, Evidence, or Nonetype. Unlike SciCE and PubMed-RCT, this dataset is multidisciplinary, with abstracts of scientific publications related to a broad spectrum of Sustainable Development Goals (SDG) domains. When using the dataset, we merge "Evidence" and "Nonetype" as "non-claim" and treat it as a binary-class dataset (claim vs. non-claim).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Proposed Framework: ClaimDistiller</title>
      <p>We formulate the claim extraction task as a classification problem on a sequence of sentences, where the model predicts a class label, claim or non-claim, for each sentence. In regular classification models, text is represented in the form of vectors, and training a good representation is essential for classification. We improve the models by adopting supervised contrastive learning to generate better representations. We propose a framework called ClaimDistiller for extracting scientific claims from abstracts.</p>
    </sec>
    <sec id="sec-4">
      <title>4.1. Supervised Contrastive Learning</title>
      <p>Self-supervised contrastive learning [20] methods can be used to generate representations for non-labeled data. They treat each sample in the dataset as a class and compare samples pairwise after data augmentation to obtain “apparent similarities”, and further generate representations for each sample. Supervised contrastive learning [22] methods extend this framework to labeled data. The key idea is to train a representation that pulls together samples of the same class while simultaneously pushing apart different classes in the embedding space. This step helps to create more accurate embeddings, and thus subsequent classification based on them can achieve better performance than regular supervised learning.</p>
      <p>In self-supervised contrastive learning each sample is considered a class, while in supervised contrastive learning each label is considered a class. As a result, in self-supervised contrastive learning the training process requires 2<italic>N</italic> augmented samples for the <italic>N</italic> samples in the training data, but in supervised contrastive learning, the model can be trained with either <italic>N</italic> or 2<italic>N</italic> augmented samples. In our task we use supervised contrastive learning to train the model. We tried both <italic>N</italic> and 2<italic>N</italic> augmented samples. The supervised contrastive loss function is defined as:</p>
      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathcal{L}^{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}</tex-math>
        </disp-formula>
      </p>
      <p>Here <italic>i</italic> is the index of an arbitrary sample in the augmented dataset <italic>I</italic>. <italic>P</italic>(<italic>i</italic>) is the set of samples in the same class as <italic>i</italic>, except sample <italic>i</italic> itself. <italic>A</italic>(<italic>i</italic>) is the set of samples in the augmented dataset except sample <italic>i</italic>. <italic>z</italic><sub>i</sub>, <italic>z</italic><sub>p</sub> and <italic>z</italic><sub>a</sub> stand for the representations of the anchor, positive, and negative samples, respectively. τ is the temperature parameter, which adjusts the distance between different classes in the embedding space.</p>
      <sec id="sec-4-1">
        <title>4.2. Framework Architecture</title>
        <p>Our proposed framework is based on supervised contrastive learning. The architecture of the framework is shown in Figure 4. The SCL can be implemented in two stages. In the first stage, we augment each labeled sentence into two sentences with similar semantics. This augmented dataset is fed into the encoder and supports the Stage 1 training. The encoder, along with the projection head, which is composed of several dense layers, minimizes the supervised contrastive loss to obtain the optimal embeddings in order to group positive samples together and push negative samples far away. In Stage 2, we keep the encoder and freeze the weights in its dense layers, and add two more dense layers for classification. The classifier is trained to minimize the cross-entropy loss function.</p>
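        <p>As a rough illustration of the supervised contrastive loss in Eq. (1), the following is a minimal pure-Python sketch over one batch of embeddings. It is not the authors' implementation: the function name, the L2 normalisation of the embeddings, and the default temperature of 0.1 are assumptions made here for illustration.</p>
        <preformat>
```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (sketch of Eq. 1).

    embeddings: list of equal-length float lists, one per augmented sample.
    labels: list of class labels aligned with embeddings.
    temperature: assumed value; the text does not state the one used.
    """
    # L2-normalise so the dot product is a cosine similarity
    # (a common convention, assumed here).
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    def sim(i, j):
        # z_i . z_j / tau
        return sum(a * b for a, b in zip(normed[i], normed[j])) / temperature

    n = len(normed)
    total = 0.0
    for i in range(n):
        # P(i): samples sharing the anchor's label, excluding the anchor.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # A(i): every sample except the anchor.
        others = [a for a in range(n) if a != i]
        log_denom = math.log(sum(math.exp(sim(i, a)) for a in others))
        inner = sum(sim(i, p) - log_denom for p in positives)
        total += -inner / len(positives)
    return total
```
        </preformat>
        <p>With this sketch, a batch whose same-label embeddings are close together yields a smaller loss than the same batch with labels assigned across the clusters, which is the behaviour the loss is meant to reward.</p>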
        <sec id="sec-4-1-1">
          <title>4.3. Data Augmentation</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>Data augmentation is an essential part of contrastive learning methods: it creates the dataset used for pre-training from sentences with similar semantics. We investigate five types of methods and their variants to augment text given a labeled sentence.</p>
        <p>[Figure 4: Architecture of the framework. First-stage learning: an original sentence and its augmented variants are passed through the encoder (WC-BiLSTM) and a projection head (Dense, ReLU, Dense) trained with the supervised contrastive loss. Second-stage learning: the encoder layers are frozen and a classifier (WC-BiLSTM, Dense, Dense) is trained with the cross-entropy loss to produce the classification results.]</p>
        <p>1. Round Trip Translation (RTT) [36]. This method first translates the sentence from English to French and then translates it back to English. Translation is based on Google translation services as well as Amazon Translate [36].</p>
        <p>2. WordNet Synonym Replacement [36]. This method replaces words with their synonyms in the sentence. Replaceable words such as verbs and nouns are selected from a sentence using a part-of-speech tagger. Then a number of words are selected out of them following a geometric distribution and replaced by their synonyms, which are given by a synonym library provided by WordNet.</p>
        <p>3. EDA (Easy Data Augmentation) Synonym Replacement [<xref ref-type="bibr" rid="ref10">37</xref>]. Randomly pick a word (not a stop word) from the sentence and then replace the word with one of its synonyms chosen at random.</p>
        <p>4. EDA Random Deletion [<xref ref-type="bibr" rid="ref10">37</xref>]. Randomly remove any word in the sentence with a configurable probability. We use the default probability value 0.2.</p>
        <p>5. EDA Random Insertion [<xref ref-type="bibr" rid="ref10">37</xref>]. Find a random synonym of a random word (not a stop word) in the sentence and then insert the synonym at a random position in the sentence.</p>
        <p>We further generate augmented data by two data augmentation methods to obtain a bigger dataset for pre-training. A comparison of the results will be given in Section 7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment Setups</title>
      <sec id="sec-5-1">
        <title>5.1. Base Models</title>
        <p>As mentioned above, the first stage is to encode the input sentence into a vector. We experiment with three types of encoders, each in three settings: the original encoder, the encoder trained with transfer learning, and the encoder trained with SCL.</p>
        <p>1. CNN-1D. Similar to the regular CNN used in feature extraction from 2-dimensional images, the 1-dimensional CNN has been used for extracting features from word sequences, e.g., [<xref ref-type="bibr" rid="ref11">38</xref>]. This method works by sliding a window with a fixed width over a sequence and convolving features of tokens covered by the window [<xref ref-type="bibr" rid="ref12">39</xref>]. Average pooling was used to aggregate features from individual tokens. Similar to a 2D CNN, the 1D CNN can be used for extracting patterns from local 1D patches (i.e., sub-sequences) of sequences. After each word-level token is converted to an initial vector, 1D convolutional layers with convolutional kernels of a fixed window size were used to extract the patterns (Figure 6). These layers can recognize patterns in an input sequence. We used a 2-layer 1D CNN, which is flattened at the end before the representation is fed to a dense fully-connected layer for classification.</p>
        <p>2. USE-dense. We adopted the pre-trained Universal Sentence Encoder (USE) [<xref ref-type="bibr" rid="ref13">40</xref>] to encode claim text into dense 512-dimensional vectors. The initial embeddings produced by USE were fine-tuned on the SciCE corpus, after which the sentences were encoded to dense feature vectors used by the fully-connected layer for classification.</p>
        <p>3. WC-BiLSTM (Word and Character embedding Bidirectional Long Short-Term Memory). One drawback of applying pre-trained word embedding is that unseen words have to be encoded as a default vector at prediction time. The representations of these words could only be inferred from surrounding words. Word prefixes and suffixes often contain semantic information. Therefore, we combine pre-trained Word2Vec embedding [<xref ref-type="bibr" rid="ref14">41</xref>] with character embedding [<xref ref-type="bibr" rid="ref15">42</xref>] to encode unseen words. The combined embedding is fed to bidirectional long short-term memory (BiLSTM) layers to extract patterns from claim sentences (Figure 7). Finally, the representations were passed to a fully-connected layer for classification.
        </p>
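        <p>The sliding-window operation behind the CNN-1D base model can be sketched in a few lines of plain Python. This is only an illustrative sketch of a single convolutional filter followed by average pooling, assuming a ReLU activation; the function names are hypothetical and the actual model stacks two convolutional layers with learned weights.</p>
        <preformat>
```python
def conv1d(token_vectors, kernel, bias=0.0):
    """Slide a fixed-width window over a sequence of token vectors.

    token_vectors: list of T vectors, each of dimension D.
    kernel: list of W vectors of dimension D (one filter of width W).
    Returns T - W + 1 features, one per window position, after ReLU.
    """
    width = len(kernel)
    features = []
    for start in range(len(token_vectors) - width + 1):
        window = token_vectors[start:start + width]
        value = bias
        # Element-wise product of the window and the filter, summed up.
        for token, weights in zip(window, kernel):
            value += sum(t * w for t, w in zip(token, weights))
        features.append(max(0.0, value))  # ReLU (assumed activation)
    return features


def average_pool(features):
    """Aggregate window features into a single value by averaging."""
    return sum(features) / len(features) if features else 0.0
```
        </preformat>
        <p>For a sequence of six one-dimensional tokens and a window of width 5, the filter is applied at two positions, so two features are produced before pooling.</p>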
        <sec id="sec-5-1-1">
          <title>5.2. Experiments</title>
          <p>To evaluate the robustness of the proposed framework, we investigate the base models in different training frameworks: only the base model, transfer learning, and supervised contrastive learning. Figure 5 shows a comparison of the three different training frameworks. ‘Network’ in this figure can be any of the base models. The training frameworks are as follows:</p>
          <p>[Figure 5: Comparison of the three training frameworks. Panels: train with base models; train with transfer learning; train with contrastive learning. Diagram labels include the SciCE dataset, the PubMed dataset, data augmentation, pre-training, freezing the embedding layer, and fine-tuning; each panel outputs claim or non-claim labels.]</p>
          <p>2. Transfer Learning. In this setting, the neural classifier is first pre-trained using the PubMed-RCT corpus and then fine-tuned on the SciCE corpus. During the fine-tuning stage, we freeze the weights of all layers except the fully-connected classification layer. We then replace that fully-connected layer with a new layer whose outputs are the classes in the target dataset.</p>
          <p>3. Supervised Contrastive Learning. As discussed in Section 4, in supervised contrastive learning the neural network is first pre-trained with augmented training data from the SciCE corpus and then fine-tuned on the original SciCE data. Note that in this setting, only SciCE is used, which is a dataset much smaller than the PubMed-RCT dataset.</p>
          <p>As a result, we have in total 9 experiment specifications: 3 different frameworks for each base model. In addition, we include the following two experiments from previous academic papers as baselines:</p>
          <p>1. Heuristic Method. This baseline is adopted from Sateli &amp; Witte [31]. This method used gazetteering, deictic phrases and hand-crafted rules to match against the text. The sentence containing the deictic phrase must be a statement in the form of a factual implication, and have a comparative voice or assert a property of the author’s contribution, such as novelty or performance.</p>
          <p>2. CRF-based Transfer Learning. This baseline is adopted from Achakulvisut et al. [14], in which transfer learning was applied on a conditional random field (CRF) model. This is the state-of-the-art to the best of our knowledge. This method treats claim extraction as a sequence tagging task and uses CRF to capture the dependencies of the label of the current sentence on the features and labels of neighboring sentences.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <p>[Figure 6: 1D convolution over an initial sequence with window size = 5, producing output features.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <sec id="sec-6-1">
        <title>6.1. Evaluation Metrics</title>
        <sec id="sec-6-1-1">
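          <p>The proposed methods and baselines are evaluated using the standard precision, recall, and F1 scores, defined below.</p>
          <p>
            <disp-formula id="eq-2">
              <label>(2)</label>
              <tex-math>P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}</tex-math>
            </disp-formula>
          </p>
          <p>In Eq. (2), <italic>P</italic> and <italic>R</italic> stand for precision and recall, respectively. TP is the number of predicted claims that are true. FP is the number of predicted claims that are false. FN is the number of predicted non-claims that are false, i.e., true claims that were predicted as non-claims. F1 is the harmonic mean of <italic>P</italic> and <italic>R</italic>.</p>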
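          <p>The metrics of Eq. (2) can be computed from sentence-level labels as in the short sketch below. The function name and the label strings are illustrative choices made here, not taken from the released code.</p>
          <preformat>
```python
def precision_recall_f1(gold, predicted, positive="claim"):
    """Compute precision, recall and F1 for the positive (claim) class.

    gold, predicted: aligned lists of sentence labels.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    # Guard against empty denominators when nothing is predicted or labeled.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```
          </preformat>
          <p>As a consistency check on the formula, the precision of 87.08% and recall of 87.83% reported for SciCE give 2 x 87.08 x 87.83 / (87.08 + 87.83), which is approximately 87.45%, in agreement with the reported F1.</p>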
          <p>In addition, we also compare the training time. The training time was measured as the time elapsed between when the program started taking inputs (including pre-training) and when the model stopped training after a certain number of epochs.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Experiment Details</title>
        <sec id="sec-6-2-1">
          <p>All the experiments were performed on a single computer with a CPU with 4 physical cores, 16GB RAM, solid-state disks, and an Nvidia V100 GPU.</p>
          <p>When working on the CNN-1D model, the window size was set to 5. Because the convolution is performed on the word level, we truncated sentences longer than 120 words and padded sentences shorter than 120 words.</p>
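          <p>The length normalisation described above can be sketched as follows. The padding token string is a hypothetical placeholder chosen for this sketch; the text only specifies the target length of 120 words.</p>
          <preformat>
```python
MAX_LEN = 120        # sentence length in words, as stated in the text
PAD_TOKEN = "[PAD]"  # hypothetical padding symbol


def pad_or_truncate(words, max_len=MAX_LEN, pad_token=PAD_TOKEN):
    """Return exactly max_len tokens: clip long inputs, pad short ones."""
    clipped = words[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))
```
          </preformat>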
          <p>When training the WC-BiLSTM model, the learning rate was set to 0.001, the batch size was set to 256, and the dropout rate was 0.5. Each encoder model and its variants were trained for a maximum of 50 epochs, within which the loss function on the validation data reached its minimum. Early stopping was applied to avoid overfitting. We found that at this stage, the loss functions have asymptotically converged to the minimum.</p>
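          <p>A training loop with early stopping on the validation loss might look like the sketch below. The 50-epoch cap comes from the text; the patience value and the callback interface are assumptions introduced here, since the stopping criterion is not specified in detail.</p>
          <preformat>
```python
def train_with_early_stopping(run_epoch, max_epochs=50, patience=3):
    """Train until the validation loss stops improving.

    run_epoch(epoch): hypothetical callback that trains one epoch and
        returns the validation loss for that epoch.
    patience: assumed number of non-improving epochs tolerated.
    Returns (best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if best_loss > val_loss:
            # New minimum of the validation loss: reset the counter.
            best_loss, best_epoch, stale = val_loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation loss has stopped improving
    return best_epoch, best_loss
```
          </preformat>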
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Results</title>
        <sec id="sec-6-3-1">
          <p>The results of the experiments are shown in Table 2. The first column shows the evaluation results of models trained on SciCE. The second column shows the training time. The third column shows the evaluation results of models trained on SciARK. The training time on SciARK is shown in column 4. In column 5 we compare the training data size for all scenarios.</p>
          <p>As seen in Table 2, deep-learning-based models achieve much better performance than rule-based models, which suggests that the semantic features of scientific claims are complicated and are better represented by neural models.</p>
          <p>In general, transfer-learning-based models achieve better performance than the corresponding original encoders by ∆F1=0.74–3.18. The efficacy of transfer learning comes from the source data used for pre-training. The discourse information in the PubMed-RCT corpus used here is relevant and helps improve the performance.</p>
          <p>The comparison of transfer learning and contrastive learning is performed on two datasets: SciCE and SciARK. Contrastive learning achieves better performance than transfer learning across the models, with one exception. SCL beats transfer learning by ∆F1=0.82–2.72% on the SciCE dataset and by ∆F1=1.32–1.72% on the SciARK dataset. The only exception is that CNN-1D-contrastive underperformed CNN-1D-transfer by 0.52%. Therefore, SCL in general achieves a comparable or better performance than transfer learning.</p>
          <p>The contrastive-learning-based model ClaimDistiller has the best performance across all metrics compared with other models on SciCE, achieving F1=87.45%, precision=87.08%, and recall=87.83%. On SciARK, ClaimDistiller also has the best performance, with F1=88.93%, precision=90.02%, and recall=89.47%.</p>
          <p>The training time needed for each model varies. In general, transfer learning needs significantly more time for training than supervised contrastive learning. The last column gives a clear comparison of the training data size for contrastive learning and transfer learning: contrastive learning uses fewer than 6,000 sentences, while transfer learning uses 2 million sentences for pre-training in order to achieve the performance reported in Table 1.</p>
          <p>Table notes: (1) Quoted from reference because they used the same test data. (2) Testing data size is in the parentheses, measured by number of sentences. (3) Training time including both pre-training and fine-tuning. (4) Measured by number of sentences in training dataset. (5) WC-BiLSTM-contrastive.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. Data Augmentation Analysis</title>
        <p>As discussed in Section 4, we tried several text
augmentation methods. Table 3 shows the experimental results
obtained with the best model, WC-BiLSTM-contrastive.
The results show that the choice of
text augmentation method has only a marginal effect on the
classification performance of the SCL base model, with
F1 ranging from 86.11% to 87.45%. WordNet
synonym replacement achieves the best performance,
while random deletion is the worst. We therefore use
WordNet synonym replacement as the data
augmentation method.</p>
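        <p>To make the two augmentation methods named above concrete, the following is a minimal sketch of synonym replacement and random deletion on a tokenized sentence. It is not the authors' implementation: the small <monospace>TOY_SYNONYMS</monospace> table is a hypothetical stand-in for WordNet lookups, and the function names and parameters (<monospace>n</monospace>, <monospace>p</monospace>) are assumptions chosen for illustration.</p>
        <preformat><![CDATA[
```python
import random

# Hypothetical toy synonym table; the paper draws synonyms from WordNet instead.
TOY_SYNONYMS = {
    "results": ["findings", "outcomes"],
    "show": ["indicate", "demonstrate"],
    "improves": ["enhances", "boosts"],
}


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok.lower() in TOY_SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(TOY_SYNONYMS[out[i].lower()])
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "Results show that the method improves recall".split()
augmented = synonym_replacement(sentence, n=2, rng=rng)
shortened = random_deletion(sentence, p=0.2, rng=rng)
print(" ".join(augmented))
print(" ".join(shortened))
```
]]></preformat>
        <p>Synonym replacement keeps the sentence length and label intact while varying the wording, whereas random deletion can remove cue words that signal a claim.</p>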
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Error Analysis</title>
        <p>In this section, we perform error analysis focusing on
the best model: WC-BiLSTM-contrastive. Out of the 375
abstracts in the testing set of SciCE, this model correctly
predicts all the claims and non-claims in 125 abstracts.
As shown in Figure 8, the majority of the remaining 250 abstracts
have 1–2 wrongly predicted sentences,
with a maximum of 4 prediction errors in a single abstract.
As shown in Figure 9, the error rates are all below
0.5, with an average of 0.13.</p>
        <p>We demonstrate two examples containing typical
errors in the prediction results for case studies (Figure 10).
The ground truth claims are highlighted in blue. Green
labels mean the sentences are non-claims and red labels
mean sentences are claims. Labels with red frames
indicate wrong predictions. In the first example, the model
is able to identify all the claims, but it mistakenly
recognizes two sentences as claims. In the second example,
there should be two claims but the model only identified
one of them.</p>
        <p>Example 1 is challenging because the two false
positives look like claims but when they are read together
with the first sentence, it is clear that the second sentence
(starting with “The discussion emphasizes”) describes
what the authors have done in the paper and the third
sentence (starting with “A fundamental need”) describes
a background, which is the motivation of the research.</p>
        <p>Example 2 contains a false negative. It is not
straightforward to determine why the sentence starting with
“Results indicated that” was misclassified as a non-claim,
because the leading pattern clearly indicates that the sentence
conveys key findings. The error analyses indicate that
although the recurrent model attempts to incorporate
context information, it may still miss the nuances of
semantics. Fine-tuning the hyperparameters may help, but
a more sophisticated and robust model is needed to
capture these nuances. One method is to combine latent and
rule-based features. Another possible method is to
leverage the "knowledge" encoded in large language models
(LLMs), e.g., GPT-3, or to use the LLM-adapter method to
train an adapter for this task.</p>
        <p>Prediction results on abstract 1:</p>
        <p>Grounded in a socio-ecological framework, we describe salient health care system and policy factors
that influence engagement in human immunodeficiency virus (HIV) clinical care. [Non-claim]</p>
        <p>The discussion emphasizes successful programs and models of service delivery and highlights the
limitations of current, fragmented health care system components in supporting effective, efficient, and
sustained patient engagement across a continuum of care. [Claim]</p>
        <p>A fundamental need exists for improved synergies between funding and service agencies that provide
HIV testing, prevention, treatment, and supportive services. [Claim]</p>
        <p>We propose a feedback loop whereby actionable, patient-level surveillance of HIV testing and
engagement in care activities inform educational outreach and resource allocation to support integrated
"testing and linkage to care plus" service delivery. [Claim]</p>
        <p>Ongoing surveillance of programmatic performance in achieving defined benchmarks for linkage of
patients who have newly diagnosed HIV infection and retention of those patients in care is imperative
to iteratively inform further educational efforts, resource allocation, and refinement of service delivery. [Claim]</p>
        <p>Figure 10: Two examples of errors in the prediction results in
the test set of SciCE. The ground truth claims are highlighted
in blue. Green labels mean the sentences are non-claims and
red labels mean sentences are claims. Labels with red frames
indicate wrong predictions.</p>
        <p><bold>7.3. Domain Adaptability</bold></p>
        <p>The SciCE corpus is in the biomedical domain. To test
whether the model performs well in a different domain,
we applied the best model (WC-BiLSTM-contrastive) to
classify sentences in a random selection of 30 abstracts
of computer science papers. Out of the 195 sentences in
this dataset, 60 sentences were predicted as claims. By
visually examining these predicted claims, we found that 50 of them are
consistent with the definition of claims [14]. Examples of
the successfully predicted claims are given in Figure 11.</p>
        <p>This post-hoc evaluation result indicates that the model’s
precision for computer science abstracts (P ≈ 83.3%) is roughly
consistent with that for the biomedical domain,
implying that claims in these two different domains are
usually written with similar language patterns. We also
observed that the model tends to omit claims, indicating
that a more robust domain adaptation may be needed to
improve the recall.</p>
        <p>We demonstrate that scaling up language models greatly improves
task-agnostic, few-shot performance, sometimes even becoming competitive
with prior state-of-the-art fine-tuning approaches.</p>
        <p>These results highlight the importance of previously overlooked design
choices, and raise questions about the source of recently reported
improvements.</p>
        <p>Moreover, DETR can be easily generalized to produce panoptic
segmentation in a unified manner.</p>
        <p>We show that it significantly outperforms competitive baselines.</p>
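        <p>The precision estimate quoted for the computer science sample follows directly from the counts reported in the text (60 predicted claims, 50 of them judged correct). The short sketch below reproduces that arithmetic; the variable and function names are illustrative. Recall cannot be computed the same way, because the sentences that were not predicted as claims were not manually checked.</p>
        <preformat><![CDATA[
```python
def precision(true_positives, predicted_positives):
    """Fraction of predicted positives that are actually correct."""
    if predicted_positives == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return true_positives / predicted_positives


total_sentences = 195    # sentences in the 30 computer science abstracts
predicted_claims = 60    # sentences the model labeled as claims
confirmed_claims = 50    # predicted claims judged consistent with the definition

p = precision(confirmed_claims, predicted_claims)
predicted_rate = predicted_claims / total_sentences
print(f"precision = {p:.1%}")
print(f"share of sentences predicted as claims = {predicted_rate:.1%}")
```
]]></preformat>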
      </sec>
      <sec id="sec-7-3">
        <title>7.4. Visualization on SciCE Data</title>
        <p>
          To further qualitatively demonstrate the effect of
supervised contrastive learning, we project the
128-dimensional vectors output by the WC-BiLSTM base
model into a 2-dimensional feature space using t-SNE
[
          <xref ref-type="bibr" rid="ref16">43</xref>
          ], and then compare the projection with the one obtained with supervised
contrastive learning. Figure 12 shows that the model with
supervised contrastive learning groups samples of the same class
together, making the two classes more separated in the feature
space.
        </p>
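        <p>The grouping effect described above is what a supervised contrastive objective rewards. The following is a minimal NumPy sketch of such a loss, assuming the formulation of Khosla et al. [22] with cosine similarity and a temperature; whether ClaimDistiller uses exactly this variant and this temperature value is an assumption, and the function name <monospace>supcon_loss</monospace> is illustrative. The toy 2-dimensional vectors stand in for the 128-dimensional encoder outputs.</p>
        <preformat><![CDATA[
```python
import numpy as np


def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with at least one positive."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)  # unit vectors
    sim = z @ z.T / temperature                                     # scaled cosine similarity
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)                         # never contrast a sample with itself
    sim_max = sim.max(axis=1, keepdims=True)                        # for numerical stability
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~self_mask   # same class, different sample
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


labels = np.array([0, 0, 1, 1])  # e.g., 0 = non-claim, 1 = claim
# Same-class vectors point in similar directions (well-separated classes).
separated = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
# Same-class vectors point in different directions (classes mixed together).
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])

print("loss, separated classes:", supcon_loss(separated, labels))
print("loss, mixed classes:", supcon_loss(mixed, labels))
```
]]></preformat>
        <p>The loss is lower when vectors of the same class lie close together, so minimizing it pushes the encoder toward the class-wise clusters seen in the projection.</p>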
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>Figure 12: The t-SNE plots showing the effects of
supervised contrastive learning. The upper panel shows the two
classes without supervised contrastive learning. The lower
panel shows the two classes with supervised contrastive
learning. Orange dots represent claims and blue dots represent
non-claims.</p>
      <p>To automatically obtain scientific findings from the ever-increasing
volume of scientific papers, an effective and
efficient claim-extraction tool is becoming increasingly
important for information aggregation, summarization,
and retrieval of scientific papers. One bottleneck of this
task is the limited amount of annotated training data. The
challenge is how to efficiently use the existing limited data.</p>
      <p>We propose the ClaimDistiller framework, which uses
supervised contrastive learning on top of existing text
encoders to boost the performance of classification. We
showcased the efficacy of this mechanism on two benchmark
datasets. Our results establish a new state-of-the-art
on the SciCE dataset, outperforming the existing method,
which used transfer learning on a BiLSTM-CRF
architecture, by 7%. We demonstrated that SCL achieved
comparable or higher F1 scores compared with transfer
learning methods with significantly less training data
and time. Future research will explore hybrid methods
and LLMs to capture nuances of context.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Defense Advanced
Research Projects Agency (DARPA) under cooperative
agreement No. W911NF-19-2-0272. The content
of the information does not necessarily reflect the
position or the policy of the Government, and no official
endorsement should be inferred. We also acknowledge
the partial support from the Virginia Commonwealth
Cyber Initiative (CCI) Grant #H-4Q21-014.</p>
      <sec id="sec-9-1">
        <title>References</title>
        <p>national Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1597–1607. URL: http://proceedings.mlr.press/v119/chen20j.html.</p>
        <p>[21] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, F. Makedon, A survey on contrastive self-supervised learning, Technologies 9 (2020) 2.</p>
        <p>[22] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020) 18661–18673.</p>
        <p>[23] A. Barrera, R. M. Verma, Combining syntax and semantics for automatic extractive single-document summarization, in: A. F. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 13th International Conference, CICLing 2012, New Delhi, India, March 11-17, 2012, Proceedings, Part II, volume 7182 of Lecture Notes in Computer Science, Springer, 2012, pp. 366–377. URL: https://doi.org/10.1007/978-3-642-28601-8_31. doi:10.1007/978-3-642-28601-8_31.</p>
        <p>[24] N. Ibrahim Altmami, M. El Bachir Menai, Automatic summarization of scientific articles: A survey, Journal of King Saud University - Computer and Information Sciences 34 (2022) 1011–1028. URL: https://www.sciencedirect.com/science/article/pii/S1319157820303554. doi:10.1016/j.jksuci.2020.04.020.</p>
        <p>[25] J. Lawrence, C. Reed, Argument mining: A survey, Computational Linguistics 45 (2020) 765–818.</p>
        <p>[26] C. Sardianos, I. M. Katakis, G. Petasis, V. Karkaletsis, Argument extraction from news, in: Proceedings of the 2nd Workshop on Argumentation Mining, ArgMining@HLT-NAACL 2015, June 4, 2015, Denver, Colorado, USA, The Association for Computational Linguistics, 2015, pp. 56–66. URL: https://doi.org/10.3115/v1/w15-0508. doi:10.3115/v1/w15-0508.</p>
        <p>[27] M. Dusmanu, E. Cabrio, S. Villata, Argument mining on twitter: Arguments, facts and sources, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, Association for Computational Linguistics, 2017, pp. 2317–2322. URL: https://doi.org/10.18653/v1/d17-1245. doi:10.18653/v1/d17-1245.</p>
        <p>[28] C. Stab, C. Kirschner, J. Eckle-Kohler, I. Gurevych, Argumentation mining in persuasive essays and scientific articles from the discourse structure perspective, in: E. Cabrio, S. Villata, A. Z. Wyner (Eds.), Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, Forlì-Cesena, Italy, July 21-25, 2014, volume 1341 of CEUR Workshop Proceedings, CEUR-WS.org, 2014. URL: http://ceur-ws.org/Vol-1341/paper5.pdf.</p>
        <p>[29] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and verification, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 809–819. URL: https://doi.org/10.18653/v1/n18-1074. doi:10.18653/v1/n18-1074.</p>
        <p>[30] T. Jansen, T. Kuhn, Extracting core claims from scientific articles, in: T. Bosse, B. Bredeweg (Eds.), BNAIC 2016: Artificial Intelligence - 28th Benelux Conference on Artificial Intelligence, Amsterdam, The Netherlands, November 10-11, 2016, Revised Selected Papers, volume 765 of Communications in Computer and Information Science, Springer, 2016, pp. 32–46. URL: https://doi.org/10.1007/978-3-319-67468-1_3. doi:10.1007/978-3-319-67468-1_3.</p>
        <p>[31] B. Sateli, R. Witte, Semantic representation of scientific literature: bringing claims, contributions and named entities onto the linked open data cloud, PeerJ Computer Science 1 (2015) e37. URL: https://doi.org/10.7717/peerj-cs.37. doi:10.7717/peerj-cs.37.</p>
        <p>[32] S. Yuan, B. Yu, Hclaime: A tool for identifying health claims in health news headlines, Inf. Process. Manag. 56 (2019) 1220–1233. URL: https://doi.org/10.1016/j.ipm.2019.03.001. doi:10.1016/j.ipm.2019.03.001.</p>
        <p>[33] M. Dehghani, A. Severyn, S. Rothe, J. Kamps, Learning to learn from weak supervision by full supervision, 2017. URL: https://arxiv.org/abs/1711.11383. doi:10.48550/ARXIV.1711.11383.</p>
        <p>[34] F. Dernoncourt, J. Y. Lee, Pubmed 200k RCT: a dataset for sequential sentence classification in medical abstracts, in: G. Kondrak, T. Watanabe (Eds.), Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017, Volume 2: Short Papers, Asian Federation of Natural Language Processing, 2017, pp. 308–313. URL: https://aclanthology.org/I17-2052/.</p>
        <p>[35] A. Fergadis, D. Pappas, A. Karamolegkou, H. Papageorgiou, Argumentation mining in scientific literature for sustainable development, in: Proceedings of the 8th Workshop on Argument Mining, 2021, pp. 100–111.</p>
        <p>[36] V. Marivate, T. Sefara, Improving short text classifi</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Applying artificial intelligence in the science &amp; technology cycle</article-title>
          ,
          <source>Inf. Serv. Use</source>
          <volume>39</volume>
          (
          <year>2019</year>
          )
          <fpage>303</fpage>
          -
          <lpage>318</lpage>
          . URL: https://doi.org/10.3233/ISU-190062. doi:10.3233/ISU-190062.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kinney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <article-title>S2orc: The semantic scholar open research corpus</article-title>
          , arXiv preprint arXiv:1911.02782 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Lipinski, K. Yao, C. Breitinger, J. Beel, B. Gipp, Evaluation of header metadata extraction approaches and tools for scientific pdf documents, in: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, ACM, New York, NY, USA, 2013, pp. 385-386. URL: http://doi.acm.org/10.1145/2467696.2467753. doi:10.1145/2467696.2467753.</mixed-citation>
        <mixed-citation>[10] C. F. Camerer, A. Dreber, F. Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, G. Nave, B. A. Nosek, T. Pfeiffer, A. Altmejd, N. Buttrick, T. Chan, Y. Chen, E. Forsell, A. Gampa, E. Heikensten, L. Hummer, T. Imai, S. Isaksson, D. Manfredi, J. Rose, E.-J. Wagenmakers, H. Wu, Evaluating the replicability of social science experiments in nature and science between 2010 and 2015, Nature Human Behaviour 2 (2018) 637-644. URL: https://doi.org/10.1038/s41562-018-0399-z. doi:10.1038/s41562-018-0399-z.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] N. Siegel, N. Lourie, R. Power, W. Ammar, Extracting scientific figures with distantly supervised neural networks, in: J. Chen, M. A. Gonçalves, J. M. Allen, E. A. Fox, M. Kan, V. Petras (Eds.), Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018, ACM, 2018, pp. 223-232. URL: http://doi.acm.org/10.1145/3197026.3197040. doi:10.1145/3197026.3197040.</mixed-citation>
        <mixed-citation>[11] N. Alipourfard, B. Arendt, D. M. Benjamin, N. Benkler, M. M. Bishop, M. Burstein, M. Bush, J. Caverlee, Y. Chen, C. Clark, et al., Systematizing confidence in open research and evidence (score), 2021. URL: osf.io/preprints/socarxiv/46mnb. doi:10.31235/osf.io/46mnb.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] F. Bulgarov, C. Caragea, A comparison of supervised keyphrase extraction models, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, New York, NY, USA, 2015, pp. 13-14. URL: http://doi.acm.org/10.1145/2740908.2742776. doi:10.1145/2740908.2742776.</mixed-citation>
        <mixed-citation>[12] M. Liakata, S. Teufel, A. Siddharthan, C. Batchelor, Corpora for the conceptualisation and zoning of scientific papers (2010).</mixed-citation>
        <mixed-citation>[13] B. Fisas, H. Saggion, F. Ronzano, On the discoursive structure of computer graphics research papers, in: Proceedings of the 9th linguistic annotation workshop, 2015, pp. 42-51.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Gong, X. Liu, Generic text summarization using relevance measure and latent semantic analysis, in: W. B. Croft, D. J. Harper, D. H. Kraft, J. Zobel (Eds.), SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA, ACM, 2001, pp. 19-25. URL: https://doi.org/10.1145/383952.383955. doi:10.1145/383952.383955.</mixed-citation>
        <mixed-citation>[14] T. Achakulvisut, C. Bhagavatula, D. E. Acuna, K. P. Körding, Claim extraction in biomedical publications using deep discourse model and transfer learning, CoRR abs/1907.00962 (2019). URL: http://arxiv.org/abs/1907.00962. arXiv:1907.00962.</mixed-citation>
        <mixed-citation>[15] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (2021) 43-76. doi:10.1109/JPROC.2020.3004555.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>HETFORMER: heterogeneous transformer with sparse attention for long-text extractive summarization</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021</source>
          , Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>154</lpage>
          . URL: https://doi.org/10.18653/v1/2021.emnlp-main.13. doi:10.18653/v1/2021.emnlp-main.13.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>The role of "condition": a novel scientific knowledge graph representation and construction model</article-title>
          , in:
          <source>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1634</fpage>
          -
          <lpage>1642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hutson</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence faces reproducibility crisis</article-title>
          ,
          <source>Science</source>
          <volume>359</volume>
          (
          <year>2018</year>
          )
          <fpage>725</fpage>
          -
          <lpage>726</lpage>
          . URL: http://science.sciencemag.org/content/359/6377/725. doi:10.1126/science.359.6377.725.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharif Razavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azizpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carlsson</surname>
          </string-name>
          ,
          <article-title>CNN features off-the-shelf: an astounding baseline for recognition</article-title>
          , in:
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition workshops</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>806</fpage>
          -
          <lpage>813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Puigcerver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>Big transfer (BiT): General visual representation learning</article-title>
          , in:
          <source>Computer Vision-ECCV 2020: 16th European Conference</source>
          , Glasgow, UK, August 23-28, 2020, Proceedings, Part V 16, Springer,
          <year>2020</year>
          , pp.
          <fpage>491</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bachman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Hjelm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Buchwalter</surname>
          </string-name>
          ,
          <article-title>Learning Representations by Maximizing Mutual Information across Views</article-title>
          , Curran Associates Inc., Red Hook, NY, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>A simple framework for contrastive learning of visual representations</article-title>
          , in: Proceedings of the 37th Inter
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          <article-title>cation through global augmentation methods</article-title>
          , in:
          <source>Machine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020</source>
          , Dublin, Ireland, August 25-28, 2020, Proceedings 4, Springer,
          <year>2020</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>399</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>EDA: Easy data augmentation techniques for boosting performance on text classification tasks</article-title>
          , arXiv preprint arXiv:1901.11196 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Olivetti</surname>
          </string-name>
          ,
          <article-title>Machine-learned and codified synthesis parameters of oxide materials</article-title>
          ,
          <source>Scientific Data</source>
          <volume>4</volume>
          (
          <year>2017</year>
          )
          170127. URL: https://doi.org/10.1038/sdata.2017.127. doi:10.1038/sdata.2017.127.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kiranyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Avci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Abdeljaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ince</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabbouj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Inman</surname>
          </string-name>
          ,
          <article-title>1d convolutional neural networks and applications: A survey</article-title>
          ,
          <source>Mechanical Systems and Signal Processing</source>
          <volume>151</volume>
          (
          <year>2021</year>
          )
          107398. URL: https://www.sciencedirect.com/science/article/pii/S0888327020307846. doi:10.1016/j.ymssp.2020.107398.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guajardo-Cespedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Universal sentence encoder for English</article-title>
          , in:
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations</source>
          , Brussels, Belgium, October 31 - November 4, 2018,
          <year>2018</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>174</lpage>
          . URL: https://aclanthology.info/papers/D18-2029/d18-2029.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>CoRR</source>
          abs/1301.3781 (
          <year>2013</year>
          ). URL: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Avati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Neural language correction with character-based attention</article-title>
          ,
          <source>CoRR</source>
          abs/1603.09727 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1603.09727. arXiv:1603.09727.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Visualizing data using t-SNE</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>