Extraction of Competing Models using Distant Supervision
and Graph Ranking
 Swayatta Daw, Vikram Pudi
Data Sciences and Analytics Center
IIIT Hyderabad, India
swayatta.daw@research.iiit.ac.in, vikram@iiit.ac.in


Abstract

We introduce the task of detecting competing model entities in scientific documents. We define competing models as those models that solve a particular task that is investigated in the target research document. The task is challenging because contextual information from the entire target document is required to predict the model entities; hence, traditional sequence labelling approaches fail in such settings. Furthermore, model entities themselves are long-tailed in nature, i.e., their prevalence in scientific literature is limited, and labelled data for training supervised learning techniques is scarce. To address these bottlenecks, we combine an unsupervised graph ranking algorithm with a SciBERT-CRF-based sequence labeller to predict the entities, and introduce a strong baseline using this pipeline. To address the label scarcity of long-tailed model entities, we use distant supervision, leveraging an external Knowledge Base (KB) to generate synthetic training data. We address the problem of overfitting on small datasets for supervised NER baselines using a simple entity replacement technique. We introduce this model as a starting point for an end-to-end automated framework that extracts relevant model names from research documents and links them with their respective cited papers. We believe this task will serve as an important starting point to map the research landscape of computer science in a scalable manner, needing minimal human intervention. The code and dataset are available at: https://github.com/Swayatta/Competing-Models.

Keywords
NER, Graph Ranking, Distant Supervision, CEUR-WS



Proceedings of the AAAI-22 Workshop on Scientific Document Understanding at the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22).
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).


1. Introduction

The number of scientific publications in the computer science domain has increased exponentially in the recent past. Hence, it has become increasingly cumbersome for researchers to keep track of the advancement of the research landscape. Often, research papers introduce new models that perform strongly in comparison with the baselines or advance the state-of-the-art. In order to effectively benchmark models and compare their performances, it is important to be able to map the research landscape for similar or related tasks. Papers with Code (PwC, https://paperswithcode.com/) is a community-driven corpus that lists models that solve particular subtasks, with links to the scientific research paper that introduced the model. Our aim is to build a similar but automated end-to-end pipeline which detects model names from scientific papers and benchmarks them against other similar models that solve the same task.

In this paper, we introduce the task of extracting competing model names from a research paper. We establish an end-to-end pipeline that extracts all the competing model names from a research paper and links them to their respective citations. While browsing related work for a given task, a researcher has to manually visit every research paper that uses a competing model for the same task. This process is time-consuming if a survey of a research landscape is to be done on a large scale. Our motivation is to automate this process by automatically extracting model names that solve a similar task and linking them to their corresponding cited papers. If executed on a large scale, this pipeline would be able to effectively map the computer science research landscape in an automatic and scalable manner with minimal human intervention.

We introduce a strong baseline for this task by combining an unsupervised document-level graph ranking algorithm and a supervised BERT-based sequence tagger to obtain model name entities. Essentially, we treat the relevant keyphrases extracted by the graph ranker as a superset of candidates for the sequence labeller.

We introduce two datasets for this task. For training the supervised sequence tagger, we create weakly supervised distant labels using an external Knowledge Base and unlabelled corpora. We also release a manually annotated dataset for evaluating the sequence tagger. For evaluating the entire framework of competing model name extraction, we release another dataset with full-paper document-level annotation. Furthermore,
we use a simple entity citation linking technique to link the extracted model names with their respective citations in the research document. We believe this task will be a significant step forward towards mapping the research landscape of computer science.

Our contributions can be summarised as follows:

    • We introduce a novel approach of treating ranked keyphrases as a superset of candidates for the sequence labeller in solving this task. To the best of our knowledge, this approach has not been used in prior research work. We believe this approach can be extended to other similar tasks that require document-level contextual information for NER.
    • We create an annotated dataset of full papers for evaluation of the pipeline. Previous datasets for sequence labelling in the scientific literature focused only on annotating abstracts of scientific papers [1, 2]. We believe incorporating full-length document information is crucial to capture the entire document context; hence we introduce a full-paper annotated dataset for final evaluation.
    • We introduce strong baselines while relying only on distantly supervised weak labels to train our sequence labeller. We evaluate the trained model on our annotated evaluation dataset.


2. Related Work

Unsupervised Ranking Algorithms for Keyphrase Extraction: EmbedRank [3] extracts candidate phrases based on POS sequences and uses sentence embeddings (Doc2Vec or Sent2Vec) to represent both the candidate phrases and the document in the same high-dimensional vector space, ranking the candidates by cosine similarity with respect to the document embedding. [4] propose WikiRank, an unsupervised automatic keyphrase extraction method that links semantic meaning to text. In graph-based ranking algorithms, candidate phrases are treated as nodes and related candidate phrases are connected by edges. TextRank [5] considered related candidates as co-occurring phrases within a given window. SingleRank [6] added weights to the edges between related candidates. SGRank [7] and PositionRank [8] incorporated statistical and positional heuristics into a graph-based algorithm to obtain ranked keyphrases. MultipartiteRank [9] is an advanced version of TextRank that incorporates positional knowledge into edge weights, leading to state-of-the-art performance over benchmark datasets.

Sequence labelling for Named Entity Recognition: Long-tailed entities are named entities which rarely occur in text documents. For these types of entities, the task of Named Entity Recognition (NER) is non-trivial. Recent approaches have aimed at solving the problem of NER using supervised training with deep learning models. However, supervised learning techniques require a large amount of token-level labelled data for NER tasks. Annotating a large number of tokens can be time-consuming, expensive and laborious. For real-life applications, the lack of labelled data has become a bottleneck in adopting deep learning models for NER tasks.

Most scientific named entities can be classified as long-tailed entities because of the rarity and domain-specificity of their occurrence. Recent work on NER in scientific documents has concentrated on detecting biomedical named entities [10] or scientific entities like tasks, methods and datasets [1, 2, 11]. Some papers, like [12], focus on the detection of a single specific entity type (such as dataset names) from scientific documents. Although previous work has focused on identifying methods [1, 2] as named entities, what constitutes a method can vary significantly in human-annotated data: the authors of [1] report a Kappa score of 76.9% for inter-annotator agreement on the SciERC dataset, which is widely used as a benchmark for scientific entity extraction.

NER has traditionally been treated as a sequence labelling problem, using CRFs [13] and HMMs [14]. Recent approaches have used deep learning based models [15] to address this task, which require a large amount of labelled data to train. The high cost of labelling remains the main challenge in training such models on rare long-tailed entity types, where labelled data is scarce. To address the label scarcity problem, several methods such as Active Learning [16], Distant Supervision [17, 18, 19] and Reinforcement Learning-based Distant Supervision [20, 21] have been proposed. [12] focused on detecting dataset mentions from scientific text and used data augmentation to overcome the label scarcity problem.


3. Motivation

Papers with Code (PwC, https://github.com/paperswithcode/paperswithcode-data) is a community-driven corpus that serves to list models that solve particular subtasks, with links to the scientific research paper that introduced the model. Our aim is to build a similar but automated end-to-end pipeline that detects model names from scientific papers and benchmarks them against other similar models that solve the same task. We believe the task introduced in this paper (extraction of competing model names from scientific documents) to be a significant step forward towards the whole pipeline.
Type          | Sentence                                                                                                                                                                                                                                                                               | Paper Title
Competing     | Other transition-based models extend TransE to additionally use projection vectors or matrices to translate head and tail embeddings into the relation vector space, such as: TransH (Wang et al., 2014), TransR (Lin et al., 2015b), TransD (Ji et al., 2015), STransE (Nguyen et al., 2016b) and TranSparse (Ji et al., 2016). | A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
Competing     | In Table 2, we compare SCIBERT results with reported BIOBERT results on the subset of datasets included in (Lee et al., 2019).                                                                                                                                                          | SCIBERT: A Pretrained Language Model for Scientific Text
Non-competing | TransE [4] is a translation based model inspired by Word2Vec [16]                                                                                                                                                                                                                       | On Evaluating Embedding Models for Knowledge Base Completion
Non-competing | (Xie et al. 2016) use convolutional neural networks (CNN) to encode word sequences in entity descriptions.                                                                                                                                                                              | KG-BERT: BERT for Knowledge Graph Completion
Non-competing | To find the hyper-parameters, we used HyperOpt (Bergstra et al., 2015), which uses Bayesian optimization.                                                                                                                                                                               | Tabular Data: Deep Learning is Not All You Need

Table 1
Few examples of competing and non-competing models. The competing models are highlighted in bold, whereas the non-competing models are highlighted in underlined italic.



4. Task Definition

We define competing models as model names that attempt to solve the same task as investigated by the target research paper. For example, if a research paper investigates the task of producing knowledge base embeddings, TransR [22] will be a competing model name, as it has been introduced by prior research work to solve the same task. If a research paper investigates the task of Question Answering, some competing model names can be the T5 model [23] or XLNet [24], because these models have been used to solve this task in prior research work. A non-competing model name would be a model that has not been used directly to solve the same task. We provide a few examples to illustrate the difference between a competing and a non-competing model in Table 1. For the first two examples, the models highlighted in bold are competing models because they directly solve the task investigated in the input research paper. For the third example, TransE is a competing model, but Word2Vec is not. The reason is that TransE directly produces Knowledge Base embeddings that aid in Knowledge Base completion (the target task in the research paper), whereas Word2Vec is a language model that TransE is inspired by, as denoted in the sentence. Hence, Word2Vec only contributes indirectly to the research task and is a non-competing model. Similarly, HyperOpt, in the last example, is non-competing, as it is an algorithm the authors used for hyperparameter search and is not a model that contributes directly to solving the task investigated in the input research paper.

Our task in this paper is to detect competing model names given an input research document. Also, after extracting the model names, we link the extracted entities with their respective cited papers.

[Figure 1 shows three example sentences with annotated model name entities: "In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence."; "Inspired by the unique feature representation learning capability of deep autoencoder, we propose a novel model, named Deep Autoencoder-like NMF (DANMF), for community detection."; "We introduce the Multi-View Transformation Network (MVTN) that regresses optimal view-points for 3D shape recognition, building upon advances in differentiable rendering."]
Figure 1: Example sentences with annotated model name entities


5. Annotation Process

We create two datasets for training and evaluation. We annotate sentences from scientific papers with the token-level BIO tagging scheme to evaluate our sequence labeller, which only uses contextual information from an input sentence for sequence tagging. To evaluate the whole pipeline, we provide document-level annotations with full-length research papers as input and competing model names as the annotated output. We use two different datasets for a more comprehensive evaluation, as our pipeline uses two stages. The first stage involves extracting candidate keyphrases utilising the entire document-level information for keyphrase ranking. The second stage is our sequence labeller that uses sentence-level information to find model named entities. We describe the annotation process for the sequence labelling dataset first. Considering our end goal
of automating a high-precision framework for extracting related model names, and to minimise ambiguity, we consider only named models as model entities for this task. A few examples are: NMN+LSTM+FT, SpERT (with overlap), B-BOT + Attention and CL loss, SA-FastRCNN, DS-CNNs (Random Walk), Sparse Transformer 59M (strided). We consider model entities that have a unique name or that are formed by a combination of other model names, e.g. NMN+LSTM+FT. A few example sentences with model entities are displayed in Figure 1. We define and annotate the test corpus using the standard BIO tagging scheme. Each model entity was defined to have the maximum span length. For acronyms, we consider the full-length entity name instead of the short-form acronym if it occurs in the text, e.g. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. On average, there are 2.5 tokens per entity. We refer to Google Scholar and Semantic Scholar to confirm entity types. We randomly selected a subset of abstracts from the arXiv dataset containing 1.7M+ papers' data and metadata and randomly selected sentences from them to annotate. Also, we randomly sampled the DBLP citation dataset containing 1,511,035 papers, obtained the full-length versions of the available papers using DOI matching, and obtained a random sample of sentences from the full text. We use two different corpora because we want our model to be evaluated on multiple domains within computer science and different publication venues. All the statistics related to our annotated corpus and train set are provided in Table 2.

                              Train     Test      Total
# sentences                   7800      1000      8800
# tokens                      232600    22873     255473
# entities                    19012     3647      22659
# unique entities             14748     1249      15672
avg # tokens per sentence     29.82     22.873    29.03
avg # entities per sentence   2.44      3.65      2.57

Table 2
Overall statistics of the train and evaluation datasets for sequence labeller evaluation

For evaluating the whole pipeline, we annotated full-length research papers. We read through the introduction and find out the task the paper solves. Then we browse the entire paper and find all mentions of model names that solve a similar task. The process has a low level of ambiguity because a majority of the model mentions occur in the related work section, citation contexts or the experimental results section. It is a standard practice among authors to cite the relevant research paper if they mention any model names from prior research work. Hence, we only consider models that the authors cite to be candidates for competing models. We make sure the labelled entities are model names by referring to Google Scholar and Semantic Scholar. If there is any ambiguity regarding whether a labelled entity is a model name or not, we discard the full paper. To infer if a model is a competing model or not, we find the task or the problem the paper solves. This is usually mentioned clearly in the introduction and the related work section. We label the model entities (that the authors mention as solving a similar problem or task as the original paper) as competing models. To further verify that the claim by the authors is indeed true, we visit the cited research paper and ensure that the model is solving a similar task. Furthermore, we only consider papers where the "competing" relation among models is clear and discard any paper where there is ambiguity regarding this relation. Hence, we ensure ambiguity to be significantly low regarding our annotations. The statistical details about the annotations are provided in Table 3.

# total papers                75
# total sentences             34656
# avg sentences per paper     462.08
# entities                    622
# unique entities             473
# avg entities per paper      8.29

Table 3
Overall statistics of the document-level annotated dataset for evaluation of the entire pipeline

As we ensure a negligible level of ambiguity, we use only one human annotator (one of the authors of this paper) for our annotation process. We believe the need for multiple annotators for inter-annotator agreement is insignificant for our task, as a low level of ambiguity is ensured by considering only named models and clearly defined tasks with competing model names.
6. Method

Our entire pipeline has two components. Firstly, we extract all citation sentences from the input research paper and combine them to create a mini-document. We use a graph ranking algorithm to extract all the candidate keyphrases from this mini-document; this graph ranking algorithm utilises document-level information to rank keyphrases. Secondly, we use a sequence labeller to extract named entities from the citation sentences. Lastly, we merge the results of the graph ranker and the sequence labeller to output the final competing model entities. In the subsection Sequence Tagging, we provide details about the training process and the model for our sequence tagger. In the subsection Graph-Ranking Algorithm, we provide details about the unsupervised graph ranking algorithm for keyphrase extraction.

6.1. Graph-Ranking Algorithm

We use MultipartiteRank [9], as it has proved to be the state-of-the-art among keyphrase ranking algorithms, performing particularly well on longer scholarly documents. We briefly describe how we use this algorithm for unsupervised keyphrase extraction.

Let 𝐶 be the set of all citation sentences in a document 𝑑. 𝐶 forms an ordered set of citation sentences, which is collectively treated as a document. We build a graph representation of 𝐶. A set of candidate keyphrases 𝐾 is extracted from 𝐶. The candidate keyphrases 𝐾 are grouped into topics based on the stem forms of the words they share, using hierarchical agglomerative clustering with average linkage. The candidate keyphrases are used to build a multipartite graph, where the nodes are keyphrase candidates that are connected only if they belong to different topics. The edge between two nodes is weighted by the inverse of the distance between the two keyphrases 𝐾𝑖, 𝐾𝑗 in 𝐶. The weight 𝑤𝑖𝑗 is calculated as the sum of the inverse distances between 𝐾𝑖 and 𝐾𝑗:

    w_{ij} = \sum_{p_i \in P(K_i)} \sum_{p_j \in P(K_j)} \frac{1}{|p_i - p_j|}

where 𝑃(𝐾𝑖) is the set of word offset positions of 𝐾𝑖. The first occurring candidates of each topic are promoted more, as they capture higher relevance. The weights of the first occurring candidates of each topic are modified according to:

    w_{ij} = w_{ij} + \alpha \cdot e^{1/p_i} \sum_{K_k \in T(K_j) \setminus \{K_j\}} w_{ki}

where 𝛼 is a hyperparameter that controls the strength of the weight adjustment, 𝑇(𝐾𝑗) is the set of candidates belonging to the same topic as 𝐾𝑗, and 𝑝𝑖 is the offset position of the first occurrence of candidate 𝐾𝑖. After the graph is built, a ranking algorithm is used to order the keyphrase candidates 𝐾𝑖; we adopt the popular TextRank algorithm [5] for the ranking mechanism. A final set of top-ranked keyphrases 𝐾̃ is obtained.
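To make the weighting above concrete, the following is a minimal, illustrative sketch of the inverse-distance weights and the first-occurrence boost computed from candidate offset positions. It is a simplified stand-in, not our implementation: in practice a library implementation of MultipartiteRank (e.g., in the pke toolkit) is used, and the subsequent TextRank/PageRank ranking step over the weighted graph is omitted here. Function names and the 1-based word offsets are assumptions for illustration.

```python
import math
from itertools import combinations

def inverse_distance_weight(pos_i, pos_j):
    """Sum of inverse offset distances between two candidates (first equation above)."""
    return sum(1.0 / abs(pi - pj) for pi in pos_i for pj in pos_j if pi != pj)

def build_multipartite_weights(positions, topic_of, alpha=1.1):
    """positions: candidate -> list of 1-based word offsets; topic_of: candidate -> topic id.
    Returns directed edge weights between candidates of different topics, with the
    first occurring candidate of each topic promoted (second equation above)."""
    w = {}
    for i, j in combinations(positions, 2):
        if topic_of[i] != topic_of[j]:                 # multipartite: no edges inside a topic
            w[(i, j)] = w[(j, i)] = inverse_distance_weight(positions[i], positions[j])
    for topic in set(topic_of.values()):
        members = [c for c in positions if topic_of[c] == topic]
        first = min(members, key=lambda c: min(positions[c]))   # first occurring candidate K_j
        for i in positions:
            if topic_of[i] == topic or (i, first) not in w:
                continue
            p_i = min(positions[i])                    # first offset of candidate K_i
            boost = sum(w.get((k, i), 0.0) for k in members if k != first)
            w[(i, first)] += alpha * math.exp(1.0 / p_i) * boost
    return w
```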
6.2. Sequence Tagging

For training our sequence tagger, we rely only on distant labels created using an external Knowledge Base and an unlabelled research text corpus. We also demonstrate that for long-tailed entity types, there is a need to ensure a fairer distribution of entity occurrences in order to prevent overfitting, which occurs in the form of the model memorising certain popular entity names. The details about the training set creation are provided in the subsection Training Set Creation with Entity Replacement. The details about the model and the results on the evaluation set are provided in the subsection Distantly Supervised NER Model. The training process overview for the sequence labeller is shown in Figure 2.

[Figure 2 sketches the training pipeline: sentences containing model name mentions (e.g., "The optimized 4-layer BiLSTM model was then calibrated and validated for multiple prediction horizons.") are drawn from unlabelled corpora and weakly labelled against the Knowledge Base; entity replacement swaps mentions (e.g., "BiLSTM" becomes "TransE"), and the resulting distantly labelled training data is fed to SciBERT embeddings followed by a CRF layer that outputs BIO tags such as B-Model I-Model O O O O.]
Figure 2: Training pipeline for the Sequence Labeller

6.2.1. Training Set Creation with Entity Replacement

We utilise the publicly available Papers with Code (PwC) corpus as a Knowledge Base. We crawl PwC and obtain all the model names occurring in the metadata for each task and subtask, obtaining a total of 14,748 model names. For the unlabelled corpora, we use a total of 227,000 abstracts from arXiv and obtain all sentences (7800) containing a model name mention. We find that the occurrence of some model names is much more frequent in the literature (e.g., CNN). Due to the small dataset size and the large imbalance in entity mentions, the model is prone to overfitting. To mitigate this, we use a simple entity replacement technique, where we find all model entity mentions and randomly replace them with other names to ensure a fairer distribution. The distribution pre-replacement is shown in Figure 4. After replacement, all 14,748 model entities occur at least once and each entity occurs at most twice in the train dataset.
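The following is a minimal, illustrative sketch (not the released code) of how distant BIO labels and entity replacement could be produced from a list of KB model names and unlabelled sentences. The function names, the longest-match heuristic, and the simple per-entity cap are assumptions for illustration.

```python
import random
from collections import Counter

def distant_bio_labels(tokens, kb_names):
    """Assign B-Model / I-Model / O tags by longest exact match against KB model names."""
    names = sorted((tuple(n.split()) for n in kb_names), key=len, reverse=True)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = next((n for n in names if tuple(tokens[i:i + len(n)]) == n), None)
        if match:
            labels[i:i + len(match)] = ["B-Model"] + ["I-Model"] * (len(match) - 1)
            i += len(match)
        else:
            i += 1
    return labels

def replace_entities(examples, kb_names, max_per_entity=2, seed=13):
    """Randomly swap frequent model mentions with other KB names so that no single
    surface form dominates the training set (simplified: swapped-in names may still
    slightly exceed the cap)."""
    rng = random.Random(seed)
    counts, out = Counter(), []
    for tokens, labels in examples:
        new_tokens, new_labels, i = [], [], 0
        while i < len(tokens):
            if labels[i] == "B-Model":
                j = i + 1
                while j < len(tokens) and labels[j] == "I-Model":
                    j += 1
                name = " ".join(tokens[i:j])
                if counts[name] >= max_per_entity:      # too frequent: replace with another KB name
                    name = rng.choice(kb_names)
                counts[name] += 1
                parts = name.split()
                new_tokens += parts
                new_labels += ["B-Model"] + ["I-Model"] * (len(parts) - 1)
                i = j
            else:
                new_tokens.append(tokens[i]); new_labels.append(labels[i]); i += 1
        out.append((new_tokens, new_labels))
    return out
```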


6.2.2. Distantly Supervised NER Model

We treat NER as a sequence labelling problem. Given a sequence of 𝑁 tokens 𝑋 = [𝑥1, ..., 𝑥𝑁], we aim to find an entity that is a span of tokens 𝑠 = [𝑥𝑖, ..., 𝑥𝑗] (0 ≤ 𝑖 ≤ 𝑗 ≤ 𝑁) associated with the entity type model name. We formulate this as a sequence labelling task of assigning a sequence of labels 𝑌 = [𝑦1, ..., 𝑦𝑁]. The aim of our sequence labeller is to classify each token into an entity type as per the BIO tagging scheme.

We consider 𝐾 train sentences denoted as {(𝑋𝑘, 𝑌𝑘)}, 𝑘 = 1, ..., 𝐾, with distant token-level annotations. We aim to learn a function 𝑓(𝑋, 𝜃) which can correctly predict the entity labels for a train sentence 𝑋𝑘. We minimise the loss

    \theta^* = \arg\min_{\theta} \frac{1}{K} \sum_{k=1}^{K} l(Y_k, f(X_k, \theta))

over {(𝑋𝑘, 𝑌𝑘)}, 𝑘 = 1, ..., 𝐾, where 𝜃 denotes the model parameters and 𝑙 is the cross-entropy loss.

We experiment with multiple baselines which are standard for the sequence labelling process:

    • A BiLSTM + CRF model, where the bidirectional contextual representations are captured by the BiLSTM model and the resultant representations are passed to a Conditional Random Field (CRF) that produces sequence labels as output.
    • A BERT + CRF model, where the contextualised embeddings are captured by a pre-trained BERT base uncased model and passed onto the CRF layer to produce token labels.
    • A SciBERT + CRF model, where the domain-specific contextualised embeddings are captured by a pre-trained SciBERT [25] model. SciBERT is a BERT-based language model trained on large unlabelled scientific corpora using the MLM objective. The output embeddings are passed to the linear CRF layer, which predicts token labels from contextual representations.
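As a small illustration of the objective above, the per-batch loss over distantly labelled sentences could be computed as in the sketch below. This assumes plain token-level cross-entropy and an illustrative padding convention; the CRF baselines listed above replace this per-token term with a CRF log-likelihood.

```python
import torch
import torch.nn.functional as F

def distant_label_loss(logits, tags, pad_id=-100):
    """Average cross-entropy l(Y_k, f(X_k, theta)) over a batch of distantly labelled
    sentences; logits: [batch, seq_len, num_tags], tags: [batch, seq_len] with pad_id
    marking padded positions (illustrative convention)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten tokens across the batch
        tags.reshape(-1),
        ignore_index=pad_id,                   # ignore padded positions
    )
```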



Figure 4: Distribution of entity occurrence frequency in the training dataset pre-replacement

We evaluate our baselines on our evaluation dataset and the results are displayed in Table 4. We demonstrate that entity replacement provides a significant boost in performance for each of these models. The reason is that the model does not memorise entity names for the replaced dataset and instead uses the context to predict the entity types. The results also show that standard NER approaches can provide decent results on the evaluation dataset while relying only on weakly labelled training data.
                                  P       R       F1
BiLSTM + CRF (w/o replacement)    0.205   0.519   0.294
BERT + CRF (w/o replacement)      0.389   0.310   0.345
SciBERT + CRF (w/o replacement)   0.391   0.312   0.346
BERT + CRF (with replacement)     0.575   0.563   0.569
BiLSTM + CRF (with replacement)   0.628   0.631   0.629
SciBERT + CRF (with replacement)  0.641   0.632   0.636

Table 4
Results on the evaluation dataset

                                  P       R       F1
TextRank                          0.063   0.273   0.098
PositionRank                      0.098   0.841   0.162
SingleRank                        0.105   0.863   0.179
MultipartiteRank                  0.123   0.834   0.214
SciBERT-CRF                       0.290   0.764   0.420
TextRank + SciBERT-CRF            0.512   0.235   0.322
PositionRank + SciBERT-CRF        0.608   0.661   0.633
SingleRank + SciBERT-CRF          0.609   0.679   0.642
MultipartiteRank + SciBERT-CRF    0.639   0.672   0.655

Table 5
Results of evaluation on the document-level annotated dataset
7. Combining Graph-Ranker and Sequence Tagger

We use the unsupervised keyphrase extraction algorithm to capture only those keyphrases that are most relevant to the document. Although the sequence tagger performs well on detecting model name mentions using sentences as the contextual information, we need to capture document-level relevance as well in order to extract competing models. The reason is that not all model name mentions are relevant to the task the given target research paper aims to solve. Hence, we predict only those entities which are common to both the top-ranked keyphrases and the extracted model names from our distantly supervised sequence tagger. More formally,

    𝑌″ = 𝑌̃ ∩ 𝐾̃

where 𝑌̃ is the set of entities predicted by the sequence tagger, 𝐾̃ is the set of top-ranked keyphrases and 𝑌″ is the final set of predicted entities. The entire inference pipeline is illustrated in Figure 3.

[Figure 3 sketches the inference pipeline: citation sentences (e.g., "The authors use CNN [1] layer on top of BERT [2] embeddings.") are extracted from the target scientific document and passed both to the graph-ranker and to the trained NER model (SciBERT embeddings + CRF); the predicted entities common to both (e.g., CNN and BERT) are then mapped by the entity-citation linker to their citation markers ([1] and [2]).]
Figure 3: Inference Pipeline of the end-to-end framework


8. Results

We use the evaluation metrics of micro-averaged Precision, Recall and F1-score to evaluate the performance of the different baselines investigated. We use the full document-level annotated dataset for this evaluation.

We report the results in Table 5. We compare the performance of four unsupervised graph-rankers for keyphrase extraction: TextRank [5], SingleRank [26], PositionRank [8] and MultipartiteRank [9]. We observe that the recall is highest for SingleRank, as it extracts most of the relevant candidate keyphrases and ensures a high amount of entity coverage. For the SciBERT-CRF model, we notice that even though the recall is high, the precision is significantly low: although it detects model entity mentions with good accuracy while considering sentences as contextual information, as reported in Table 4, not all detected models are competing. In order to discern which of the extracted candidate entities are competing models, document context is needed. Hence, we find that combining the two approaches leads to a significant boost in precision while maintaining a decent recall. The highest performance is yielded by the combination of MultipartiteRank with SciBERT-CRF, despite MultipartiteRank having a slightly lower recall than SingleRank. This can be attributed to the higher precision of MultipartiteRank among all unsupervised keyphrase extraction algorithms investigated, which in turn can be attributed to the fact that it aims to select the most relevant phrases by incorporating positional information into the edge weights among the candidate keyphrases. Hence, its combination with the sequence labeller yields the highest F1-score among all combinations.


9. Entity Citation Linker

The entity citation linker is inspired by the prior work of [27]. The aim of this algorithm is to link the entities with their corresponding citations. The first step is to obtain all the possible entities and the citations. Then, a closeness score is calculated for each entity-citation pair, which is the string distance between the entity and the citation. Then, we take all the citations and keep only the closest citation per entity. Finally, we take all the entities and keep the closest entity per citation. As demonstrated by the authors, this technique is able to accurately map most entities to their corresponding citations. We use this technique to link all the extracted model entities with their respective citations.
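As an illustration of the linking heuristic described above (a minimal sketch of the mutual-closest idea, not the exact implementation of [27]), closeness is assumed here to be the absolute character-offset distance within a sentence, and all names are illustrative.

```python
def link_entities_to_citations(entities, citations):
    """entities, citations: lists of (surface_form, char_offset) within one sentence.
    Returns {entity: citation} pairs via the mutual-closest heuristic."""
    if not entities or not citations:
        return {}
    # Keep only the closest citation per entity.
    best_cit = {e: min(citations, key=lambda c: abs(e[1] - c[1])) for e in entities}
    # Then keep only the closest entity per citation.
    links = {}
    for c in citations:
        candidates = [e for e, cit in best_cit.items() if cit == c]
        if candidates:
            e = min(candidates, key=lambda e: abs(e[1] - c[1]))
            links[e[0]] = c[0]
    return links

# Example: "The authors use CNN [1] layer on top of BERT [2] embeddings."
# entities  = [("CNN", 16), ("BERT", 40)]
# citations = [("[1]", 20), ("[2]", 45)]
# link_entities_to_citations(entities, citations) -> {"CNN": "[1]", "BERT": "[2]"}
```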
10. Error Analysis

We conduct error analysis for the unsupervised keyphrase extraction, the model entity extraction using sequence labelling, the two-stage framework and the entity citation linking. For the keyphrase extraction, the graph-ranker extracts most of the relevant model candidates. However, precision suffers significantly, as many of the extracted keyphrases are not model names - a few examples being domain names like 'Information Retrieval' or 'Networking architecture', dataset names like 'SQuAD 1.1', or other terms that are relevant to the research paper.

For the sequence labeller, we observe mainly two types of error. First, we notice a precision error being introduced into the model: in the training set we consider the maximum span of each entity, so the occurrence of I-Model (a token lying inside a named entity) is relatively high, whereas in the evaluation test set of the sequence labeller the occurrence of single-token B-Model entities is far higher. This leads to the misclassification of O tokens as I-Model by the model. Also, although the model is able to detect model entities reasonably well given the sentence as the context, it is unable to discern competing models from unrelated ones. This leads to a significant precision decrease when evaluated on the document-level annotated evaluation set.

Furthermore, after evaluating the performance of the two-stage pipeline on the document-level annotated dataset, we find that the model often mistakes dataset names for model entity mentions. This can be attributed to the high relevance of datasets with respect to the research paper.

Lastly, for the entity citation linker, sometimes an entity that is associated with a citation marker occurs in the initial part of a sentence and is not the closest entity to the citation. This can lead to missed or incorrect linking.


11. Implementation details

We implement the NER model in PyTorch. For tokenization, we use the pre-trained SciBERT tokenizer. The embedding layer is the output from the pre-trained SciBERT model. We include a dropout layer with a dropout probability of 0.5 to reduce overfitting. The learning rate is set to 1e-5 and we train all models for a total of 10 epochs. The output from the dropout layer is passed through a linear layer with input dimension equal to the hidden dimension of SciBERT (768). For all unsupervised graph rankers, we use the same hyperparameter settings as specified in their respective papers.
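As an illustration of the setup described above (a minimal sketch under stated assumptions, not the released code), the SciBERT + dropout + linear + CRF tagger could look as follows in PyTorch. The CRF layer is assumed to come from the pytorch-crf package (torchcrf), the checkpoint name and class names are illustrative, and the CRF negative log-likelihood stands in for the token-level loss described in Section 6.2.2.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pytorch-crf package (assumed dependency)

class SciBertCrfTagger(nn.Module):
    """SciBERT embeddings -> dropout(0.5) -> linear(768 -> num_tags) -> CRF."""
    def __init__(self, num_tags=3, model_name="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.5)
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_tags)  # 768 -> |{B, I, O}|
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.linear(self.dropout(hidden))

    def loss(self, input_ids, attention_mask, tags):
        # Negative CRF log-likelihood over the distantly labelled sentences.
        em = self.emissions(input_ids, attention_mask)
        return -self.crf(em, tags, mask=attention_mask.bool(), reduction="mean")

    def predict(self, input_ids, attention_mask):
        em = self.emissions(input_ids, attention_mask)
        return self.crf.decode(em, mask=attention_mask.bool())

# Illustrative training setup following the values reported above:
# model = SciBertCrfTagger()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # trained for 10 epochs
```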
12. Conclusion and Future work

We have introduced the task of extraction of competing models from a research paper. We use a novel approach of treating relevant keyphrases extracted using an unsupervised graph ranking algorithm as a superset of the candidates predicted by a BERT-based sequence labeller. We also use distant supervision to train our sequence labeller. We test our sequence labeller and the entire pipeline on two annotated datasets. We also utilise a simple entity replacement technique to reduce overfitting in the sequence labeller. Finally, we use the entity-citation linking technique to link all the extracted model entities with their respective citations. We believe this work to be a significant step forward in mapping the research landscape of computer science in an automated and scalable manner.


References

[1] Y. Luan, L. He, M. Ostendorf, H. Hajishirzi, Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3219–3232. URL: https://aclanthology.org/D18-1360. doi:10.18653/v1/D18-1360.
[2] S. Jain, M. van Zuylen, H. Hajishirzi, I. Beltagy, SciREX: A challenge dataset for document-level information extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. arXiv:2005.00512.
[3] K. Bennani-Smires, C. C. Musat, M. Jaggi, A. Hossmann, M. Baeriswyl, EmbedRank: Unsupervised keyphrase extraction using sentence embeddings, ArXiv abs/1801.04470 (2018).
[4] Y. Yu, V. Ng, WikiRank: Improving keyphrase extraction based on background knowledge, ArXiv abs/1803.09000 (2018).
[5] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: EMNLP, 2004.
[6] X. Wan, J. Xiao, Single document keyphrase extraction using neighborhood knowledge, in: Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI'08, AAAI Press, 2008, pp. 855–860.
[7] S. Danesh, T. Sumner, J. H. Martin, SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction, in: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, Denver, Colorado, 2015, pp. 117–126. URL: https://aclanthology.org/S15-1013. doi:10.18653/v1/S15-1013.
[8] C. Florescu, C. Caragea, PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 1105–1115. URL: https://aclanthology.org/P17-1102. doi:10.18653/v1/P17-1102.
[9] F. Boudin, Unsupervised keyphrase extraction with multipartite graphs, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 667–672. URL: https://aclanthology.org/N18-2105. doi:10.18653/v1/N18-2105.
[10] V. Kocaman, D. Talby, Biomedical named entity recognition at scale, CoRR abs/2011.06315 (2020). URL: https://arxiv.org/abs/2011.06315. arXiv:2011.06315.
[11] S. Mesbah, C. Lofi, M. V. Torre, A. Bozzon, G.-J. Houben, TSE-NER: An iterative approach for long-tail entity extraction in scientific publications, in: International Semantic Web Conference, Springer, 2018, pp. 127–143.
[12] Q. Liu, P. cheng Li, W. Lu, Q. Cheng, Long-tail dataset entity recognition based on data augmentation, in: EEKE@JCDL, 2020.
[13] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: ICML, 2001.
[14] H. L. Chieu, H. Ng, Named entity recognition with a maximum entropy approach, in: CoNLL, 2003.
[15] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, ArXiv abs/1812.09449 (2018).
[16] S. Goldberg, D. Z. Wang, C. Grant, A probabilistically integrated system for crowd-assisted text labeling and extraction, J. Data and Information Quality 8 (2017). URL: https://doi.org/10.1145/3012003. doi:10.1145/3012003.
[17] X. Wang, Y. Guan, Y. Zhang, Q. Li, J. Han, Pattern-enhanced named entity recognition with distant supervision, in: 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 818–827. doi:10.1109/BigData50022.2020.9378052.
[18] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, BOND: BERT-assisted open-domain named entity recognition with distant supervision, CoRR abs/2006.15509 (2020). URL: https://arxiv.org/abs/2006.15509. arXiv:2006.15509.
[19] M. A. Hedderich, L. Lange, D. Klakow, ANEA: Distant supervision for low-resource named entity recognition, CoRR abs/2102.13129 (2021). URL: https://arxiv.org/abs/2102.13129. arXiv:2102.13129.
[20] F. Nooralahzadeh, J. T. Lønning, L. Øvrelid, Reinforcement-based denoising of distantly supervised NER with partial annotation, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 225–233. URL: https://aclanthology.org/D19-6125. doi:10.18653/v1/D19-6125.
[21] Y. Yang, W. Chen, Z. Li, Z. He, M. Zhang, Distantly supervised NER with partial annotation learning and reinforcement learning, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 2159–2169. URL: https://aclanthology.org/C18-1183.
[22] Y. Lin, Z. Liu, M. Sun, Y. Liu, X. Zhu, Learning entity and relation embeddings for knowledge graph completion, in: AAAI, 2015.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[24] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: NeurIPS, 2019.
[25] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019). URL: https://www.aclweb.org/anthology/D19-1371/.
[26] X. Wan, J. Xiao, Single document keyphrase extraction using neighborhood knowledge, in: AAAI, 2008.
[27] S. Ganguly, V. Pudi, Competing algorithm detection from research papers, in: Proceedings of the 3rd IKDD Conference on Data Science, CODS '16, Association for Computing Machinery, New York, NY, USA, 2016. doi:10.1145/2888451.2888473.