=Paper=
{{Paper
|id=Vol-3230/paper-05
|storemode=property
|title=Long Tailed Entity Extraction of Model Names using Distant Supervision
|pdfUrl=https://ceur-ws.org/Vol-3230/paper-05.pdf
|volume=Vol-3230
|authors=Swayatta Daw,Vikram Pudi
|dblpUrl=https://dblp.org/rec/conf/birws/DawP22
}}
==Long Tailed Entity Extraction of Model Names using Distant Supervision==
Swayatta Daw¹, Vikram Pudi¹
¹Data Sciences and Analytics Center, IIIT Hyderabad, India
Abstract
We introduce the task of long-tailed detection of model entities from scientific documents. We use
distant supervision with an external Knowledge Base (KB) to generate synthetic training data, together with a
simple entity replacement technique that significantly improves performance by addressing the overfitting
problem supervised NER baselines face on small datasets. We introduce strong baselines for
this task which are evaluated on our annotated gold standard dataset. We also release the distantly
supervised silver labels generated using the KB. We introduce this model as part of a starting point for an
end-to-end automated framework to extract relevant model names and link them with their respective
cited papers from research documents. We believe this task will serve as an important starting point to
map the research landscape in a scalable manner, needing minimal human intervention.
Keywords: Long-Tailed Entity, Entity Extraction, NER, Information Extraction, Scientific Literature
1. Introduction
Long tailed entities are named entities which rarely occur in text documents. For these types of
entities, the task of Named Entity Recognition (NER) is non-trivial. Recent approaches have
aimed at solving the problem of NER using supervised training using deep learning models.
However, supervised learning techniques require a large amount of token-level labelled data
for NER tasks. Annotating a large number of tokens can be time-consuming, expensive and
laborious. For real-life applications, the lack of labelled data has become a bottleneck for adopting
deep learning models for NER tasks.
Most scientific named entities can be classified as long-tailed entities because of the rarity
and domain-specificity of their occurrence. Recent work on NER in scientific documents has
been concentrated around detecting biomedical named entities [1] or scientific entities like
tasks, methods and datasets [2, 3, 4]. Some papers like [5] focus on the detection of a single
specific entity-type (like dataset names) from scientific documents. Although previous work
has focused on identifying methods [2, 3] as named entities, what constitutes a method can
vary significantly in human-annotated data. The authors of [2] report a
Kappa score of 76.9% for inter-annotator agreement on the SciERC dataset, which is widely used
as a benchmark for scientific entity extraction.
BIR 2022: 12th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2022, April 10, 2022,
hybrid.
$ swayatta.daw@research.iiit.ac.in (S. Daw); vikram@iiit.ac.in (V. Pudi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
NER has traditionally been treated as a sequence labelling problem, using CRF [6] and HMM
[7]. Recent approaches have used deep learning-based models [8] to address this task, which
require a large amount of labelled data to train. The high cost of labelling remains the main
challenge to train such models on rare long tailed entity types, where availability of labelled data
is scarce. In order to address the label scarcity problem, several methods like Active Learning
[9], Distant Supervision [10, 11, 12], Reinforcement Learning-based Distant Supervision [13, 14]
have been proposed. [5] focused on detecting dataset mentions from scientific text and used
data augmentation to overcome the label scarcity problem. Here, we leverage an external
Knowledge Base and a large-scale unlabelled corpus for our distantly supervised approach,
using a simple entity replacement technique to prevent overfitting. We introduce
the task of detecting model entity names from scientific documents. Papers with Code (PwC¹)
is a community driven corpus that serves to automatically list models that solve particular
subtasks, with links to the scientific research paper that introduced the model. Our aim is to
build a similar but automated end-to-end pipeline that detects model names from scientific
papers and benchmarks them against other similar models that solve the same task. We believe
the task introduced in this paper (extraction of model names from scientific documents) to be a
significant step forward towards the whole pipeline. This task is non-trivial mainly due to the
lack of availability of token-level high-quality labelled data which is required for training deep
learning models and the shortage of human annotated gold standard dataset for evaluation.
To address the above bottlenecks, we present a simple yet effective technique leveraging an
external Knowledge Base and a large unlabelled corpus (both of which are cheap and easy
to obtain) to generate our training dataset. We believe this simple technique can be easily
extended to any other domain given the availability of a domain-specific Knowledge Base and
unlabelled text corpora. Utilising this training set, we are able to establish a strong baseline for
this task using a standard BERT-CRF model. In order to evaluate our performance for this task,
we present a high quality human annotated gold standard evaluation dataset.
Using our trained models, we create an automated framework of detecting model names of
related work from research papers. We define related work as prior research work done by the
scientific community for the same or a similar related task that has been investigated by the
original paper. Our pipeline contains two steps: Firstly, we build a sentence intent classifier that
classifies whether a citation sentence contains information regarding related work or not. Then
we extract model names from the positively labelled sentences using our trained NER model
and link them to their respective citation mentions using a string distance based technique,
introduced by [15]. We believe this framework is a starting point to effectively map the entire
research landscape in a scalable manner.
2. Annotation
In order to create the gold labels for our evaluation test-set, we randomly sample
abstracts from a large set of arXiv research papers². We also introduce randomly sampled
papers from the DBLP citation dataset to add to the diversity of the train and test-set selection.

¹ https://github.com/paperswithcode/paperswithcode-data
² https://www.kaggle.com/Cornell-University/arxiv
Figure 1: A few example sentences with annotated model named entities highlighted in blue. We only
consider strict span matching while detecting. Examples: "Our labeling model was built upon SciBERT
(Beltagy et al., 2019), a pre-trained language model based on BERT (Devlin et al., 2019) but trained
on a large corpus of scientific text."; "There are two models in the transformers, which can handle
multilingual posts – multilingual-BERT [19] and XLM-Roberta [57]."; "For example, KG-BART encoded
the graph structure of KGs with knowledge embedding algorithms like TransE (Bordes et al., 2013),
and then took the informative entity embeddings as auxiliary input (Liu et al., 2021)."
Considering our end goal of automating a high precision framework of extracting related
model names and to minimise ambiguity, we consider only named models as model entities for
this task. A few examples are BERT+BiLSTM+CRF, KG-BERT (with overlap), LSTM + Attention, and
DeepWalk. We consider both single named entities and combinations of multiple model named
entities for annotation. A few example sentences with model entities are displayed in Figure 1.
We consider strict span matches for the model entity names, and do not consider any partial
matches or synonyms. We also consider plural variants of entity names as matches.
We aim to minimise ambiguity by considering only those model named entities that we can
verify in Google Scholar and Semantic Scholar. We follow the process of identifying a
candidate model name and reviewing the existing Computer Science literature to verify whether
it is a model name entity or not by identifying its usage in the literature. A simple criterion
we use is to check whether the model (or a variant of it) has been mentioned in a results
table and compared with baselines or other related models in previous literature. Only after this
thorough review, we annotate a named entity as a model name. We discard any sentence if
a model named entity within the sentence does not satisfy the defining criteria. Hence, we
believe we reduce ambiguity sufficiently to allow a single annotator for the
entire annotation process. All annotations have been done by a graduate NLP researcher
who is also a co-author of this paper. The overall statistics of the training and test sets are
provided in Table 1.
Table 1
Statistics of the train-set and the annotated test-set

        Sentences  Tokens   Entities  Unique Entities  Avg # tokens/sentence  Avg # entities/sentence
Train   7800       232600   19012     14748            29.82                  2.44
Test    1000       22873    3647      1249             22.87                  3.65
Total   8800       255473   22659     15672            29.03                  2.57
3. Training Set Creation with Entity Replacement
For the unlabelled corpus, we use the arXiv dataset containing ~227,000 abstracts from various
domains of Computer Science. We use the Papers with Code (PwC) corpus as a reference
Knowledge Base to obtain a total of 14,748 model entity names. We use this list of named
entities and create a set of distant silver labels by extracting the corresponding sentences out
of the arXiv dataset that contain the same entity mention. We aim for exact matches while also
considering plural forms of the entity words. We obtain a total of 7800 sentences that contain a
model named entity mention.
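The silver-labelling step above can be sketched as follows. This is an illustrative reconstruction (the function and tag names are our own), assuming whitespace tokenisation and the BIO scheme; it matches KB entity names exactly, also allowing a trailing plural "s":

```python
def silver_label(sentence, kb_entities):
    """Assign BIO labels to tokens of `sentence` for any KB entity mention.

    Exact token-level match against each KB entity string, with a plural
    variant allowed on the final token (e.g. "CNNs" matches "CNN").
    """
    tokens = sentence.split()
    labels = ["O"] * len(tokens)
    for entity in kb_entities:
        ent_toks = entity.split()
        n = len(ent_toks)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # exact match on all but the last token; last token may be plural
            if window[:-1] == ent_toks[:-1] and \
               window[-1] in (ent_toks[-1], ent_toks[-1] + "s"):
                labels[i] = "B-Model"
                for j in range(i + 1, i + n):
                    labels[j] = "I-Model"
    return list(zip(tokens, labels))
```

A sentence such as "We use LSTM + Attention here" would yield B-Model on "LSTM" and I-Model on "+" and "Attention", matching the maximal-span convention described above.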
We plot each model entity against its frequency of occurrence in the entire corpus of obtained
sentences; the plot is provided in Figure 2. We notice that the distribution is long-tailed in
nature, which is consistent with our hypothesis about scientific named entities as discussed in
the Introduction section. This means that a few predominant, popular models occur most
frequently in the literature, after which the distribution tapers into a long tail where most
entities occur far less often. The head of the distribution can be attributed to the widespread
use of certain models (like CNN) in the existing Computer Science research literature.
However, such a skewed distribution is unsuitable for training supervised custom NER models:
the models tend to memorise and overfit to certain named entities. Hence, we use a simple
entity replacement technique to deal with this bottleneck. More specifically, we detect the
span of the model named entity in each sentence and replace it with another entity drawn
from the set of model entities obtained from the Knowledge Base. We execute this process
capping each entity's frequency at at most 2 occurrences to maintain uniformity. After the
process completes, entities are uniformly distributed across the training sentences.
Figure 2: Distribution of entity occurrence frequency in the training dataset pre-replacement
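A minimal sketch of the replacement step. The input format here is our own illustrative choice (each training example as a tuple of tokens, span start, span end, entity name); the cap of 2 occurrences per entity follows the description above:

```python
import random
from collections import Counter

def rebalance(labelled_sentences, kb_entities, max_per_entity=2, seed=0):
    """Flatten the long-tailed entity distribution by entity replacement.

    Whenever an entity has already appeared `max_per_entity` times, its span
    is swapped for a randomly chosen KB entity that is still under the cap.
    Each example is (tokens, span_start, span_end, entity_name).
    """
    rng = random.Random(seed)
    counts = Counter()
    out = []
    for tokens, start, end, name in labelled_sentences:
        if counts[name] >= max_per_entity:
            # pick a replacement entity that is still under-represented
            candidates = [e for e in kb_entities if counts[e] < max_per_entity]
            if candidates:
                name = rng.choice(candidates)
                tokens = tokens[:start] + name.split() + tokens[end:]
        counts[name] += 1
        out.append((tokens, name))
    return out
```

After this pass, no entity name dominates the training set, which is the property the paper relies on to prevent memorisation.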
Table 2
Result on Evaluation Dataset
Model Precision Recall F1
BiLSTM + CRF (w/o replacement) 0.205 0.519 0.294
BERT + CRF (w/o replacement) 0.389 0.310 0.345
SciBERT+CRF (w/o replacement) 0.391 0.312 0.346
BERT+CRF (with replacement) 0.575 0.563 0.569
BiLSTM + CRF (with replacement) 0.628 0.631 0.629
SciBERT+CRF (with replacement) 0.641 0.632 0.636
4. Distantly Supervised NER Model
We aim to classify each token into its candidate labels among the BIO-tags. We use pre-trained
BERT-based contextualised embeddings to capture the distributed representations from the
sequence of tokens. We aim to detect the entire entity span and classify the entity span into
specific entity types. We formulate this as a sequence labelling task, mapping the
sequence of tokens to a sequence of BIO labels. We treat the entire set of training sentences
as the distantly labelled training data.
We experiment with multiple baselines which are standard for the sequence labelling process.
• BiLSTM + CRF: This BiLSTM-CRF model captures the contextual representations and
encodes them into a bidirectional hidden state using BiLSTM. The CRF layer models
the dependency among a sequence of tokens by considering the entire sequence label
probability distribution.
• SciBERT + CRF: This model contains pre-trained SciBERT [16] embeddings trained on
large scientific corpus. The SciBERT-embeddings are passed onto a CRF layer that models
each sequence probability distribution.
• BERT+CRF: This consists of a pretrained BERT-model and a CRF layer to model sequence-
level dependencies.
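In all three baselines, the CRF layer selects the highest-scoring tag sequence given per-token emission scores and tag-transition scores. A minimal Viterbi decode illustrates this (a pure-Python sketch, not the allennlp implementation used in the paper; tag names follow the BIO scheme above):

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions:   [T][K] score of tag k for token t (from BiLSTM/BERT/SciBERT)
    transitions: [K][K] score of moving from tag j to tag k
    """
    K = len(tags)
    score = list(emissions[0])  # best path score ending in each tag, step 0
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(K):
            best_prev = max(range(K), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
            ptr.append(best_prev)
        score, back = new_score, back + [ptr]
    # backtrack from the best final tag
    k = max(range(K), key=lambda j: score[j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return [tags[k] for k in reversed(path)]
```

Because the transition scores are learned jointly with the emissions, the CRF can penalise invalid sequences such as an I-Model directly following an O, which token-independent classification cannot.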
We evaluate the models on the gold test labels. We find that the SciBERT model in combination
with a CRF layer provides the best performance. We also find that the entity replacement technique
is particularly effective when dealing with long tailed entity distributions. We find that this
simple technique offers a significant boost in performance across all models. It is effective in
countering the overfitting bottleneck and successfully prevents memorisation of named entities.
We rely only on distant labels to obtain strong performance on gold labels. We illustrate our
best performing model in Figure 3.
5. Sentence Intent Classifier
We train a classifier to detect whether a sentence contains relevant information regarding
models that solve a similar task as specified in the target research paper. For a target scientific
document, we define a relevant model name as a model that the author has cited, which solves
a task that is similar or relevant to the original task that the target paper is solving.

Figure 3: Our SciBERT-CRF model for sequence tagging

To create an automatically labelled dataset, we iterate over all sentences in the research corpora. If a sentence
contains the phrases 'Related Work', 'Previous Work' or 'Baseline', we take the 15 sentences
occurring after it. We assign positive labels to sentences containing model entity mentions by
referring to our KB. We consider the maximal span for entity matching between our unlabelled
text and the KB. For negative samples, we randomly sample from all sentences, making sure
the trigger phrases are absent and the sentence contains no model entity mentions. We keep
an equal distribution of positive and negative labels. An example of a positive and negative
label is shown in Figure 4.
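The weak-labelling heuristic above can be sketched as follows (an illustrative reconstruction; function names and the simple substring matching are our own simplifications of the maximal-span matching described in the text):

```python
TRIGGERS = ("Related Work", "Previous Work", "Baseline")

def weak_intent_labels(sentences, kb_entities, window=15):
    """Produce weak (sentence, label) pairs for the intent classifier.

    Sentences within `window` positions after a trigger phrase become
    positives (label 1) if they mention a KB model entity. Sentences outside
    any window that contain neither a trigger nor an entity become negatives.
    """
    labelled = []
    in_window = 0
    for sent in sentences:
        if any(t in sent for t in TRIGGERS):
            in_window = window  # open a new 15-sentence window
            continue
        has_entity = any(e in sent for e in kb_entities)
        if in_window > 0:
            in_window -= 1
            if has_entity:
                labelled.append((sent, 1))
        elif not has_entity:
            labelled.append((sent, 0))
    return labelled
```

Balancing the two classes, as the paper does, would then amount to downsampling the larger of the two label sets.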
5.1. Training the classifier
The most commonly used approach of averaging BERT embeddings or using the output of the
first token (the [CLS] token) yields subpar sentence representations [17]. Hence, we choose
Sentence-BERT [17], a modification of the pretrained BERT network that uses siamese and
triplet network structures to derive semantically meaningful sentence embeddings that can
Figure 4: Sentence intent examples. Positive (green): "The authors have introduced a probabilistic
framework based on Hidden Markov Random Fields (HMRFs) for semi-supervised clustering that
combines the constraint-based and distance-based approaches in a unified framework." Negative
(red): "All processing units perform the same computation, specified by equation (1), and are
locally connected to their three neighbours."
be compared using dot product. It takes a sentence as input and returns the corresponding
sentence-level representation as output. We use Sentence-BERT to encode the sentences and use
Logistic Regression as our binary classifier to train it on 15,518 labelled sentences, containing
both citation and non-citation sentences. Positive and negative samples are equally distributed.
The sentence dataset size is kept small to avoid compromising on the quality of the labels. The
train-test split followed is 75-25. The testset accuracy (which, again, consists of both citation
and non-citation sentences) is 86.41%.
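Conceptually, the classifier reduces to binary logistic regression over fixed sentence embeddings. A toy sketch (plain gradient descent stands in for a library solver, and the feature vectors here stand in for Sentence-BERT embeddings; all names are ours):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    """Train a minimal binary logistic regression on dense feature vectors.

    X: list of feature vectors (in the paper, Sentence-BERT embeddings)
    y: list of 0/1 labels; returns a predict(x) -> 0/1 function.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            g = p - yi                       # gradient of log-loss wrt z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g

    def predict(x):
        z = sum(wj * xj for wj, xj in zip(w, x)) + b
        return 1 if z > 0 else 0

    return predict
```

In practice the embedding step dominates the cost; the linear classifier itself trains in seconds on 15,518 sentences.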
6. Entity Citation Linker
For entity citation linking, we iterate over all possible extracted entity and citation
combinations and compute their closeness score, i.e. the string distance between the entity
and the citation occurrence. We first take all the citations and keep the closest entity per citation.
Then, we take all the entities and keep the closest citations per entity. This linking process is
able to accurately link most of the extracted entities with their closest citations, as demonstrated
by [15].
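A sketch of this mutual-nearest linking heuristic (character offsets stand in for the string-distance measure of [15]; the function name and first-occurrence lookup are our own simplifications):

```python
def link_entities(sentence, entities, citations):
    """Link each citation marker to its nearest extracted entity by character
    distance, keeping only mutual nearest (entity <-> citation) pairs."""
    def pos(mention):
        # character offset of the first occurrence in the sentence
        return sentence.index(mention)

    links = []
    for cit in citations:
        nearest_ent = min(entities, key=lambda e: abs(pos(e) - pos(cit)))
        # keep the pair only if the citation is also that entity's nearest
        if min(citations, key=lambda c: abs(pos(c) - pos(nearest_ent))) == cit:
            links.append((nearest_ent, cit))
    return links
```

On a sentence like "The authors use CNN [1] layer on top of BERT [2] embeddings", this pairs CNN with [1] and BERT with [2], while an entity far from every marker is simply left unlinked.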
7. Pipeline Formation
We show two end-to-end pipelines in this paper. First, we show the entire training process for
both model entity extraction and sentence intent classification. We use the same unlabelled
corpora and Knowledge Base (KB) for both training processes. The automatic data labelling
process using the external KB, followed by entity replacement, is shown in Figure 5. This
transformed dataset is used as distantly supervised training labels for the SciBERT-CRF
NER model. The figure also illustrates the sentence intent classifier approach, where the KB
is utilised to obtain weakly labelled binary sentences from 'Related Work' sections to train
the classifier.

Figure 5: Training pipeline for the Sentence Intent Classifier and NER model using unlabelled
corpora and KB
For the automated framework, we use a two stage pipeline. It takes a scientific research paper
as input and obtains the citation sentences from it. Using the trained sentence intent classifier,
we segregate the sentences into positive and negative labels. Only positively labelled sentences
are passed into the next stage of the pipeline. We use the trained SciBERT-CRF NER model to
extract entity mentions from the sentences. The model mentions are then linked with their
respective citations using our Entity-Citation linker. The framework is illustrated in Figure 6.
Figure 6: Entire automated framework utilising the trained models. The input is a scientific
document and the output from the pipeline is a set of predicted model entities linked with
their citations.
8. Error Analysis
We conduct error analysis for model entity extraction, sentence intent classification and entity
citation linking. Some precision error is introduced because, in the training set, we consider
the maximum span of each entity, so the occurrence of I-Model tags (tokens lying inside a
named entity) is high. In our evaluation dataset, however, the number of B-Model entities is
far higher, which leads the model to misclassify an O as an I in a few sentences.
Also, due to the use of citation sentences in the evaluation dataset, our model sometimes
recognises the citation marker occurring right after an entity as an I-Model token. Moreover,
most citation sentences in the evaluation dataset have a large number of named entities
occurring adjacently, as is common in citation contexts. The model, which is trained on
sentences from abstracts only, sometimes fails to recognise all of them as entities.
For the sentence intent classification, our classifier often recognises sentences containing
dataset names as a positive label. This can be attributed to the fact that citation sentences that
refer to different datasets often have a similar structure to those citing model names of prior
work. Lastly, for the entity citation linker, an entity associated with a citation marker
sometimes occurs in the initial part of a sentence and is not the closest entity to the
citation. This can lead to missed or incorrect links.
9. Implementation details
We use the PyTorch framework to implement our NER model. We use the pre-trained SciBERT
tokenizer and embeddings as input to a dropout layer with a dropout probability of 0.5 to prevent
overfitting. We use a learning rate of 1e-5 and train all models for 20 epochs. We pass the
output from the dropout layer through a linear layer with input dimension equal to the hidden
dimension of SciBERT embeddings (768) and output dimension equal to the number of labels
(4). We train the BiLSTM-CRF model for 20 epochs. We annotate the evaluation dataset in the
standard CoNLL BIO format. For Sentence-BERT, we use pretrained models available in Pytorch.
We use the DBLP corpus consisting of ~43K papers as our unlabelled research corpus to obtain
the distant labels for training the classifier. For the Knowledge Base (KB), we use PwC public
data corpus. For the CRF layer, we use the allennlp-models³ library. We use regular expressions
to extract citation sentences from papers written in the Springer LNCS/LNAI format.
10. Conclusion and future work
We have introduced a novel task of long-tailed model entity recognition from scientific docu-
ments. We evaluate multiple baselines on our gold standard evaluation set. We also find that a simple
strategy of entity replacement works well on small labelled datasets for distant supervision. We
hope to extend this technique to different types of entities with low labelled data availability.
We integrate our model in the automated pipeline framework to extract model names from
scientific research documents and link them to their respective citations. For future work, we
aim to utilise this pipeline on a large research corpus to obtain a map of benchmarked model
names linked with their respective papers on a much larger scale. We believe our work will
serve as an important starting point for mapping the entire research landscape of computer
science.
References
[1] V. Kocaman, D. Talby, Biomedical named entity recognition at scale, CoRR abs/2011.06315
(2020). URL: https://arxiv.org/abs/2011.06315. arXiv:2011.06315.
[2] Y. Luan, L. He, M. Ostendorf, H. Hajishirzi, Multi-task identification of entities, rela-
tions, and coreference for scientific knowledge graph construction, in: Proceedings
of the 2018 Conference on Empirical Methods in Natural Language Processing, Asso-
ciation for Computational Linguistics, Brussels, Belgium, 2018, pp. 3219–3232. URL:
https://aclanthology.org/D18-1360. doi:10.18653/v1/D18-1360.
[3] S. Jain, M. van Zuylen, H. Hajishirzi, I. Beltagy, Scirex: A challenge dataset for document-
level information extraction, in: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, 2020. arXiv:2005.00512.
[4] S. Mesbah, C. Lofi, M. V. Torre, A. Bozzon, G.-J. Houben, Tse-ner: An iterative approach
for long-tail entity extraction in scientific publications, in: International Semantic Web
Conference, Springer, 2018, pp. 127–143.
[5] Q. Liu, P. cheng Li, W. Lu, Q. Cheng, Long-tail dataset entity recognition based on data
augmentation, in: EEKE@JCDL, 2020.
[6] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic models for
segmenting and labeling sequence data, in: ICML, 2001.
[7] H. L. Chieu, H. Ng, Named entity recognition with a maximum entropy approach, in:
CoNLL, 2003.
³ https://github.com/allenai/allennlp-models
[8] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, ArXiv
abs/1812.09449 (2018).
[9] S. Goldberg, D. Z. Wang, C. Grant, A probabilistically integrated system for crowd-
assisted text labeling and extraction, J. Data and Information Quality 8 (2017). URL:
https://doi.org/10.1145/3012003. doi:10.1145/3012003.
[10] X. Wang, Y. Guan, Y. Zhang, Q. Li, J. Han, Pattern-enhanced named entity recognition
with distant supervision, in: 2020 IEEE International Conference on Big Data (Big Data),
2020, pp. 818–827. doi:10.1109/BigData50022.2020.9378052.
[11] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, BOND: bert-assisted open-
domain named entity recognition with distant supervision, CoRR abs/2006.15509 (2020).
URL: https://arxiv.org/abs/2006.15509. arXiv:2006.15509.
[12] M. A. Hedderich, L. Lange, D. Klakow, ANEA: distant supervision for low-resource named
entity recognition, CoRR abs/2102.13129 (2021). URL: https://arxiv.org/abs/2102.13129.
arXiv:2102.13129.
[13] F. Nooralahzadeh, J. T. Lønning, L. Øvrelid, Reinforcement-based denoising of distantly
supervised NER with partial annotation, in: Proceedings of the 2nd Workshop on Deep
Learning Approaches for Low-Resource NLP (DeepLo 2019), Association for Computational
Linguistics, Hong Kong, China, 2019, pp. 225–233. URL: https://aclanthology.org/D19-6125.
doi:10.18653/v1/D19-6125.
[14] Y. Yang, W. Chen, Z. Li, Z. He, M. Zhang, Distantly supervised NER with partial annotation
learning and reinforcement learning, in: Proceedings of the 27th International Conference
on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New
Mexico, USA, 2018, pp. 2159–2169. URL: https://aclanthology.org/C18-1183.
[15] S. Ganguly, V. Pudi, Competing algorithm detection from research papers, in: Proceedings
of the 3rd IKDD Conference on Data Science, 2016, CODS ’16, Association for Computing
Machinery, New York, NY, USA, 2016. doi:10.1145/2888451.2888473.
[16] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, arXiv
preprint arXiv:1903.10676 (2019). URL: https://www.aclweb.org/anthology/D19-1371/.
[17] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-
networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,
China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410/.