Identifying Used Methods and Datasets in Scientific Publications

Michael Färber, Alexander Albers, Felix Schüber
Karlsruhe Institute of Technology (KIT), Germany

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Although it has become common to assess publications and researchers by means of their citation count (e.g., using the h-index), measuring the impact of scientific methods and datasets (e.g., using an h-index for datasets) has been performed only to a limited extent. This is not surprising because the usage information of methods and datasets is typically not explicitly provided by the authors, but hidden in a publication's text. In this paper, we propose an approach to identifying methods and datasets in texts that have actually been used by the authors. Our approach first recognizes datasets and methods in the text by means of a domain-specific named entity recognition method with minimal human interaction. It then classifies these mentions into used vs. non-used based on the textual contexts. The obtained labels are aggregated on the document level and integrated into the Microsoft Academic Knowledge Graph modeling publications' metadata. In experiments based on the Microsoft Academic Graph, we show that both method and dataset mentions can be identified and correctly classified with respect to their usage to a high degree. Overall, our approach facilitates method and dataset recommendation, enhanced paper recommendation, and scientific impact quantification. It can be extended in such a way that it can identify mentions of any entity type (e.g., task).

1 Introduction

In the past, a huge variety of scientific methods and datasets has been proposed in the different scientific disciplines. For instance, Wikipedia lists several hundred datasets for the area of machine learning (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research). It is therefore unsurprising that researchers are often unaware of which scientific methods or data sets have already been used for a given research topic. Furthermore, in digital libraries, such information regarding the usage of scientific methods and datasets can be very useful. For instance, this information allows us to measure the impact of publications and researchers in novel ways (e.g., an h-index for datasets). In this way, authors providing methods and datasets can be awarded properly in the light of FAIR data principles and open research efforts.
The usage of methods and datasets is typically not given explicitly, but mentioned in publications' full texts. Identifying scientific methods and datasets in texts can be considered as domain-specific named entity recognition. In the scholarly domain, a few approaches have been proposed for identifying concepts such as datasets (Mesbah et al. 2018; Luan 2019; Luan et al. 2018; Tsai, Kundu, and Roth 2013). For instance, Tsai, Kundu, and Roth (2013) propose a method to extract concepts from scientific publications. They limit their extraction method to entities that are followed by a citation indicator and extract all mentioned concepts, rather than only the ones explicitly used. Gábor et al. (2018), in contrast, proposed a method to classify entity mentions into used and non-used. However, usage relations are only considered between entities of a specific type and not with respect to the papers' authors. Overall, a state-of-the-art approach that can recognize and classify scientific methods and datasets is, to the best of our knowledge, missing so far. Moreover, no large data set has been published that allows tasks for method/dataset-centric scientific impact quantification.

In this paper, we develop a framework to recognize entities of type DATASET and METHOD in scientific publications, as well as to classify them as used vs. non-used. Our framework consists of a domain-specific named entity recognition step, a classification step for determining the actual usage, and an aggregation step for retrieving the used methods and datasets on the document level. Our approach is designed to extract information about entities from scientific publications in an automated way, requiring minimal human interaction. We provide the usage information of about 771,000 methods and 449,000 datasets online for further usage. Moreover, we integrate the information into the Microsoft Academic Knowledge Graph (MAKG), which models information on more than 120 million scientific publications, and thereby provides the basis for scientific impact quantification studies (e.g., designing "h-index"-like metrics for scientific methods and datasets).

Overall, the main contributions of this paper are as follows:
• We develop a named entity recognition approach that extracts scientific methods and datasets from texts. Our approach extends preliminary works (Mesbah et al. 2018) by using state-of-the-art embedding techniques.
• We develop novel approaches to identify in texts the methods and datasets authors have indeed used in their papers.
• We create an evaluation dataset of 1,000 sentences with annotated methods and datasets and provide it to the public.
• We perform extensive experiments and identify the best classification method for the proposed task.
• We analyze the results of applying our framework to computer science papers.
• We extend the MAKG with the usage information concerning methods and datasets mentioned in 510,027 papers and provide it to the public.

Our data and code are publicly available at https://github.com/michaelfaerber/scholarly-entity-usage-detection.
The rest of our paper is structured as follows: In Section 2, we outline related work concerning domain-specific named entity recognition and usage classification. In Section 3, we describe our methods for named entity recognition and usage classification. We present our evaluation in Section 4 and our generated dataset in Section 5, before summarizing our findings in Section 6.

2 Related Work

In the following paragraphs, we outline the most relevant works concerning named entity recognition for long-tail entities and the extraction of aspects of entities.

Named Entity Recognition for Long-Tail Entities. In general, existing named entity recognition (NER) approaches are of diverse nature: They utilize gazetteers, rules, parts-of-speech tagging, dependency trees, or machine learning techniques. State-of-the-art NER approaches are often based on long short-term memory networks (LSTMs) (Mysore et al. 2017), conditional random fields (CRFs) (Mesbah et al. 2018; Vliegenthart et al. 2019), or a combination of both (Lample et al. 2016; Ma and Hovy 2016; Luan 2019; Jain et al. 2020). Although many approaches to named entity recognition exist, most of them require a considerable amount of human interaction for the creation of sufficient training data. Few approaches take into consideration that most of the considered entities are long-tail entities (i.e., appearing infrequently in documents and often not represented in public knowledge repositories, such as Wikidata). To reduce the required amount of human-labeled training data, iterative and active learning techniques have been proposed, particularly for scientific publications (Tchoua et al. 2019; Mesbah et al. 2018; Vliegenthart et al. 2019; Luan et al. 2018). Mesbah et al. (2018), for instance, introduce TSE-NER, which iteratively expands a predefined seed set of terms without additional human input. The authors apply several heuristic filtering methods to automatically create positive and negative classification examples. Our approach to named entity recognition is based on TSE-NER, but extends it by using SciBERT embeddings. Vliegenthart et al. (2019) also extend the TSE-NER approach by relying on human feedback for newly added labels. Although the authors achieve a lower rate of added false positives, this semi-supervised technique reintroduces the need for human labor and thus does not meet our requirements. Tchoua et al. (2019) present a dedicated NER approach for material sciences to recognize polymer names. The approach is based on active learning to overcome the data sparsity problem. Luan et al. (2018) introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. Although the approach is valuable in settings where not only named entities but facts need to be extracted from text, the authors do not specifically consider the usage of datasets and methods by the papers' authors.

Identifying Aspects of Entities. Apart from recognizing named entities, a few approaches take additional aspects of the entities, such as the actual usage of entities, into account. Gupta and Manning (2011) introduce a method to identify the focus, domain of application, and technique from computational linguistics papers, but this approach only extracts broad topics. Jain et al. (2020) focus on detecting and extracting salient information from publications. They define salient information as information (e.g., named entities) that is needed to describe the results of an article. In contrast, our goal is to find all used entities to gain enhanced insight into the general usage of methods and datasets.

3 Approach

Our framework for identifying the methods and datasets authors use in a given text document is depicted in Figure 1. We can differentiate between the following steps:
1. We build a named entity recognition model to extract named entities from a given scientific paper.
2. We perform a classification of each named entity into used and non-used (i.e., merely mentioned) on a sentence level.
3. We aggregate the sentence-level classifications of all named entities in a document.

The obtained list of methods and datasets used per document can be further analyzed in various ways. For a neat alignment with papers' metadata, we extend the Microsoft Academic Knowledge Graph (MAKG) with this new data. In this way, metadata of publications, authors, venues, and research areas can be used for advanced scholarly data mining (e.g., for novel ways of research impact assessment).

In the following, we present the single steps of our pipeline in more detail.
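To make the three steps concrete, the following toy sketch wires them together in Python. The gazetteer lookup and the cue-phrase heuristic are merely stand-ins for the CRF-based recognizer of Section 3.1 and the learned usage classifiers of Section 3.2; all names, cue phrases, and example sentences are illustrative assumptions and do not correspond to our actual implementation.

```python
# Toy end-to-end sketch of the three pipeline steps. The gazetteer and the
# cue-phrase heuristic are illustrative stand-ins for the CRF tagger and the
# learned usage classifier described in Sections 3.1 and 3.2.
from collections import Counter

GAZETTEER = {"MNIST", "SVM"}          # assumed seed entities, for illustration only
USAGE_CUES = ("we use", "we used", "we train", "we evaluate on")

def recognize_entities(sentence):
    """Step 1: return entity mentions found in the sentence."""
    return [entity for entity in GAZETTEER if entity in sentence]

def classify_usage(sentence):
    """Step 2: classify the sentence context as used vs. merely mentioned."""
    return any(cue in sentence.lower() for cue in USAGE_CUES)

def used_entities(document_sentences):
    """Step 3: aggregate sentence-level predictions per entity by majority vote."""
    votes = {}
    for sentence in document_sentences:
        for mention in recognize_entities(sentence):
            votes.setdefault(mention, Counter())[classify_usage(sentence)] += 1
    return sorted(m for m, v in votes.items() if v[True] > v[False])

doc = ["We used MNIST for all experiments.",
       "SVM baselines are discussed in related work."]
print(used_entities(doc))   # -> ['MNIST']
```

In the full pipeline, the gazetteer and the cue phrases would be replaced by the trained CRF tagger and the classifiers evaluated in Section 4.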
3.1 Named Entity Recognition

For named entity recognition, we adapt TSE-NER (Mesbah et al. 2018) to our needs. TSE-NER is based on the hypothesis that entities of the same type are mostly used in a similar context. For example, objects of the entity type DATASET may be mentioned in the documents via phrases such as "we used data set X" or "we could achieve a recall of 0.4 on data set Y." Identifying such patterns automatically in the text allows us to identify additional, unknown entities in the text – particularly long-tail entities. The contexts of these newly found entity mentions can then be mined in another iteration, leading to additional patterns for named entity recognition.

Figure 1: Overview of our framework.

An in-depth introduction to the original TSE-NER approach is provided by Mesbah et al. (2018). In the following, we outline the main steps of our named entity recognition approach and the main differences from the original TSE-NER approach.
1. We start with an initial set of METHOD and DATASET instances as seed terms (e.g., "SVM" and "MNIST"). These seed terms can, for instance, be gathered from existing knowledge graphs. In contrast to the original approach of Mesbah et al., we consider all computer science methods and datasets. The seed term selection is explained in Section 4.
2. We expand the list of seed terms by applying term and sentence expansion (TSE). In contrast to the original method, we use SciBERT as a semantic relatedness method and cluster the new entities using k-means (a minimal sketch of this step is given below).
3. Using the expanded set of entities, we annotate named entities in the training data. As context for each named entity, we consider the current sentence as well as the preceding and subsequent sentence.
4. Using the annotated training set, we apply our NER approach and thereby identify new entity candidates. We use a CRF algorithm to learn the patterns of the data.
5. Finally, we filter the entity candidates to prevent misclassification and ensure data quality. We start with simple parts-of-speech analysis and stop-word removal methods to keep relevant nouns. Then, we use knowledge graph information and similarity scores to remove those entities with low similarity and no reference.

The output of our named entity recognition approach is a list of mentioned scientific methods and datasets with their positions in the texts.
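The following sketch illustrates the modified term-expansion step (step 2): candidate terms are embedded with SciBERT and grouped with k-means. The model checkpoint, the mean pooling, and the number of clusters are illustrative assumptions for the sake of the example, not necessarily the exact settings of our implementation.

```python
# Sketch of the term-expansion step (step 2): embed candidate terms with
# SciBERT and group them with k-means. Pooling strategy and k are
# illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(terms):
    """Mean-pooled SciBERT token embeddings, one vector per term."""
    inputs = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (batch, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

candidates = ["support vector machine", "MNIST", "ImageNet", "random forest"]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embed(candidates))
# Clusters that contain known seed terms are kept; the remaining terms in the
# same cluster become new candidate entities for the next iteration.
print(dict(zip(candidates, labels)))
```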
3.2 Usage Classification

In total, we present four approaches for detecting used entity mentions of type METHOD or DATASET. For each model, we first apply an embedding-based method to transform the texts into a feature space, and then apply a classification algorithm to classify usage. In the following, we outline our approaches.

Model 1: TF-IDF + Random Forest. As a baseline model, we use term frequency-inverse document frequency (tf-idf) to represent the words of a text as vectors. Based on preliminary evaluations of several standard classification methods, we choose a random forest classifier for the classification into used and non-used.

Model 2: SciBERT + Random Forest. For our second model, we make use of SciBERT (Beltagy, Lo, and Cohan 2019), a BERT-based language model pretrained on scientific publications. This embedding model has been used for various tasks, such as scientific text classification and recommendation. In our use case, we use SciBERT embeddings to create feature vectors and a random forest classifier for the binary classification (see the sketch below).

Model 3: SciBERT + SciBERT. Our third model is based on a fine-tuned SciBERT model for sequence classification. Beltagy, Lo, and Cohan (2019) show that fine-tuning SciBERT clearly improves the classification score, especially in the field of computer science. Hence, in comparison to the second model, we now also use SciBERT to perform the classification by fine-tuning it on our annotated data. For the classification task, SciBERT uses a linear classification layer.

Model 4: SciBERT + CNN. Our fourth model uses SciBERT embeddings as feature vectors and a convolutional neural network (CNN) for the classification task. We use the CNN architecture introduced by Kim (2014) as an advanced classification technique to capture the complex structure of word embeddings, which should result in a more accurate classification score.
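As an illustration of Model 2, the following sketch derives sentence vectors from SciBERT and feeds them to a random forest. Using the [CLS] token as the sentence representation and the shown hyperparameters are assumptions made for this example; the toy sentences and labels are not taken from our annotated data.

```python
# Sketch of Model 2 (SciBERT features + random forest). The [CLS] pooling and
# the classifier settings are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def sentence_vectors(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0, :].numpy()  # [CLS] vector

train_sentences = ["We train an SVM on the extracted features.",   # toy examples
                   "An SVM is a maximum-margin classifier."]
train_labels = [1, 0]   # 1 = used, 0 = merely mentioned

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(sentence_vectors(train_sentences), train_labels)
print(clf.predict(sentence_vectors(["We evaluate our model on MNIST."])))
```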
3.3 Document-level Aggregation

The method described above allows us to make a prediction for each occurrence of a named entity (i.e., an entity-level prediction). To predict at the document level whether each unique named entity of a document is used or only mentioned or proposed, we aggregate all entity-level predictions to a document-level prediction using a majority vote.

3.4 Augmenting Publications' Metadata

We use our results to extend the MAKG (Färber 2019), which models publications' metadata for all scientific disciplines. Given that the MAKG is provided in the Resource Description Framework (RDF), we introduce the property :used_methods, which associates a paper with a used method. Because no knowledge graph contains all of the extracted methods and datasets, we refrain from linking to URIs in other knowledge graphs.
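The sketch below shows how a document-level result could be serialized as RDF triples for the MAKG extension using rdflib. The namespace and the property name are placeholders and do not necessarily match the IRIs used in the published MAKG files; the paper identifier and method strings are toy values.

```python
# Sketch of the metadata extension (Section 3.4): emit one RDF triple per
# paper/used-method pair. Namespace and property name are placeholders.
from rdflib import Graph, Literal, Namespace

MAKG_PAPER = Namespace("https://makg.org/entity/")        # assumed entity namespace
EX = Namespace("https://example.org/property/")           # assumed property namespace

g = Graph()
g.bind("ex", EX)
used_methods = {"12345": ["convolutional neural network", "SVM"]}   # toy data

for paper_id, methods in used_methods.items():
    for method in methods:
        g.add((MAKG_PAPER[paper_id], EX.used_method, Literal(method)))

print(g.serialize(format="turtle"))
```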
4 Evaluation

In the following, we outline our evaluations of all three steps of our pipeline. First, we compare the results of our modified TSE-NER model to the original paper. Next, we evaluate our usage classification models on our annotated test data. Finally, we apply our pipeline to full-text papers from the computer science domain to analyze trends over time in various computer science fields.

4.1 Named Entity Recognition

Evaluation Settings

(1) Training. We train our named entity recognition model on all 7 million abstracts of computer science papers given in the Microsoft Academic Graph (MAG; v2019-12-26) (Sinha et al. 2015). For the methods, we use the same 50 seed terms as the authors of the original paper. For DATASETS, we create our own set of seed terms because we were only able to expand very few sentences from our corpus using the original terms: we extract 73 data set names from Wikipedia (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research) and Wikidata (https://w.wiki/RrU) based on our knowledge in the machine learning domain. For our initial assessment, we run two iterations for each entity type, which according to the authors should already yield good results with a high precision value. Running more than two iterations increases recall at the cost of precision due to the addition of too many unrelated seed terms.

(2) Testing. To evaluate the NER approach, we use the SciREX dataset (Jain et al. 2020), which includes annotations of full-text papers from the machine learning domain for the METHOD and DATASET entity types. In this way, we can reuse existing evaluation data sets and compare our evaluation results with those of the original TSE-NER (Mesbah et al. 2018). Although the authors of TSE-NER only apply their evaluation to triples consisting of a sentence containing the test entity, as well as the preceding and the succeeding sentence (Mesbah et al. 2018), we apply our model to full-text documents, which we regard as a more realistic setting.

As in the original paper, we calculate precision, recall, and F1 scores for the named entity recognition of METHOD and DATASET instances. We count partial matches as correct predictions because in most cases we do not need to cover the full span of an entity to gain meaningful insight.
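The following sketch shows one way to score predicted spans such that partial matches count as correct, which is the counting scheme assumed here. The exact matching logic of our evaluation scripts may differ; spans are represented as (start, end) character offsets.

```python
# Sketch of span-level scoring where a predicted span counts as correct if it
# overlaps any gold span (partial matches count as hits).
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]        # spans are (start, end) offsets

def precision_recall_f1(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(predicted=[(0, 4), (10, 14)], gold=[(2, 6), (20, 25)]))
# -> (0.5, 0.5, 0.5)
```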
Evaluation Results

(1) Study on Embeddings. The original TSE-NER approach is based on word2vec embeddings. Thus, we first analyze how using SciBERT token embeddings instead of word2vec embeddings for term clustering and similar-term filtering (see steps 2 and 5 in Section 3.1) influences the clustering performance. We qualitatively study the clustering results of the term expansion in the first iteration for the METHOD type and find that, in general, both approaches generate very consistent clusters that differ based on various computer science fields. Given that the word2vec model had to be trained from scratch, it achieves surprisingly good results. Nevertheless, clustering based on SciBERT embeddings yields far more and richer terms, because it is not limited to just bigrams. Single clusters contain more variations of the same terms and generally contain better results. One risk of using SciBERT is that terms such as Netflix or GitHub are clustered together with dataset names, which is likely caused by these terms being used in the context of datasets but not being recognized jointly with neighboring terms. This may decrease the NER performance if names of other unrelated organizations are added as a result in the following iterations.

(2) NER Evaluation Results. Mesbah et al. (2018) achieve precision and recall values of 0.79 and 0.24 for the METHOD type and 0.83 and 0.10 for the DATASET type. The authors' TSE-NER model was trained based on 100 initial seed terms and the same sentence expansion and filtering strategies as our model. As shown in Table 1, we are not able to achieve a similarly high precision value as the authors of the original paper, who used around 15,000 full-text papers as their corpus. The obvious reason is that publications' abstracts, as used by us, may be publicly available to a large extent and therefore may be a good data source, but seem to contain method and dataset names only to a limited degree. To improve the performance of TSE-NER, we choose to replicate a more similar corpus by using 25,060 full-text papers instead of 7 million abstracts from the MAG, as well as narrowing the domain to include only machine learning papers. Although we see equal or higher recall values, this corpus does not improve the F1 scores significantly or, in the case of data sets, even reduces the metric.

Table 1: Evaluation of our modified TSE-NER model on the SciREX data set using precision, recall, and F1 score.

  Training corpus       Abstracts           Full texts
  Metric              P     R     F1      P     R     F1
  Method            0.44  0.14  0.21    0.26  0.45  0.33
  Data set          0.33  0.27  0.30    0.20  0.29  0.25

Figure 2 and Figure 3 illustrate the named entity recognition for two exemplary sections from the SciREX data set. We can observe that, in general, the approach produces decent results. The approach sometimes fails to capture the complete span of an entity mention (e.g., the first word in "character embedding layer"). Some of the false positive predictions are not too far-fetched, such as "vector space", but others, such as "query", "answer", and "context", are less similar to names of methods. This indicates that there is still potential to introduce better filtering strategies. One recurring problem for the DATASET model is that the term "dataset" is recognized without any specific names in its context.

Figure 2: Example prediction of our trained TSE-NER model (top) versus ground truth (bottom) for the METHOD type after two iterations.

Figure 3: Example prediction of our trained TSE-NER model (top) versus ground truth (bottom) for the DATASET type after two iterations.

To further compare our results with the TSE-NER publication (Mesbah et al. 2018), Table 2 shows the number of methods and datasets collected in each step based on the corpus containing papers' abstracts. While the original TSE-NER model used nearly 30,000 method names, our model is only able to use 3,403 method names as training data for the CRF. Training on the full-text corpus yields 8,355 named entities for training. This leads to more than 90,000 extracted named entities after the CRF training, compared to 7,469 named entities when training on abstracts, but still does not achieve the same results as Mesbah et al. (2018). One obvious reason for that may be that neither of our training corpora contains as many seed entities, which results in fewer found terms and sentences. Another reason may be that the found sentences contain fewer similar neighboring terms (e.g., fewer enumerations of method names or datasets), which would result in smaller cluster sizes and thus fewer added terms.

Table 2: TSE-NER training details using papers' abstracts as corpus. The table shows the number of entities after each training step for the first and second iteration.

             Iteration   Size of seed set   Expanded entities   Extracted entities   Filtered entities
  Method         1              50                4,273               4,032                  453
                 2             503                3,403               7,469                1,031
  Data set       1              73                  354               1,450                    6
                 2              79                  403               2,378                  187

Despite the inferior evaluation results for our domain-specific named entity recognition of methods and datasets, we nevertheless believe they are sufficient for the subsequent knowledge graph expansion and trend analysis. Because we aggregate all found entities on the document level, we assume that a few missing mentions of the same entity would not affect the outcome significantly. For the subsequent tasks, we use the NER model trained on abstracts instead of full texts, because we favor higher precision over recall for the knowledge graph extension.
4.2 Usage Classification

Evaluation Dataset

We needed to create a new dataset for training and evaluating our usage classification models. To this end, two authors (computer scientists) manually annotated 1,000 sentences concerning the usage of mentioned methods and data sets (500 per entity type and person; see Table 4 for more statistics). We reuse a subset of the SciREX data set (Jain et al. 2020), which already contains annotated entities for the METHOD and DATASET type, and manually annotate whether an entity has been used in the given sentence and context. To reduce training bias, we also drop duplicate entities. We only annotated an entity as used if this is obvious from reading the sentence containing the entity and its surrounding context. In any uncertain cases, we annotate the entity as non-used. This way, we aim to achieve high precision on the sentence level while still being able to decide for an entity on the document level, using our entity aggregation step, whether the entity has been used. We also label an entity as used if it has been used in a comparison of multiple approaches (i.e., as a baseline). In this way, we allow a thorough tracking of used methods and datasets, facilitating scientific impact quantification.

To ensure high data quality and consistency of our annotated data, we select 100 annotated entities of the METHOD and DATASET type to calculate the inter-annotator agreement. We achieve a satisfactory κ score of 0.86 for methods and 0.91 for datasets.

Finally, we drop invalid entity types (e.g., entities from SciREX that are classified as material type but do not make sense as a data set type) and create a training and test set. Using the same amount of used and non-used entities, we have 802 entries for the METHOD type and 492 entries for the DATASET type. For the evaluation, we split the annotated data into training and test sets with a ratio of three to one.

Table 4: Key statistics of our annotated data set.

  Entity Type   # annotated sentences   # annotated entities   # used entities   # mentioned entities   # balanced entities   κ score
  Method                1,000                    909                  508                  401                   802           0.858
  Data set              1,000                    841                  595                  246                   492           0.909
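The κ scores in Table 4 correspond to Cohen's kappa on the doubly annotated subset. The following sketch shows the computation with scikit-learn on toy labels (1 = used, 0 = merely mentioned); the values shown are not our actual annotations.

```python
# Sketch of the inter-annotator agreement computation using Cohen's kappa
# on toy labels for the doubly annotated entities.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = used, 0 = merely mentioned
annotator_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

print(round(cohen_kappa_score(annotator_a, annotator_b), 3))   # -> 0.783
```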
Evaluation Settings

Because our usage classification task constitutes a binary classification problem, we evaluate our models using precision, recall, and F1 score. As outlined in Section 3.2, we evaluate four models: (1) a random forest with tf-idf representations, (2) a random forest with SciBERT embeddings, (3) a SciBERT classification model with SciBERT embeddings, and (4) a CNN model with SciBERT embeddings for the text representation.

Evaluation Results

Comparison of Methods. Table 3 shows the evaluation results concerning the usage classification of method and dataset occurrences. For METHOD entities, the fine-tuned SciBERT model performs better with only a single sentence as input and achieves the best recall. The combined SciBERT and CNN model works best when the preceding and succeeding sentences are available as context. It achieves a similarly high recall and slightly better precision than the fine-tuned SciBERT model.

For DATASET entities, both the fine-tuned SciBERT model and the CNN model achieve higher recall than they do for classifying METHOD entities. SciBERT still achieves relatively high precision scores but works better when neighboring sentences are available. For the CNN model, precision scores are significantly lower than they are for method entities.

Neither random forest model manages to compete with the more sophisticated models, but both work slightly better on the DATASET entity type. Using the SciBERT sentence embeddings instead of tf-idf consistently results in a significantly higher precision at the cost of slightly lower recall values.

Table 3: Precision, recall, and F1 scores for our usage classification models. We train each model with a single sentence as input as well as with the preceding and succeeding sentences, for both methods and data sets. Further, we show the generalization capabilities of models that have been trained on the METHOD type and then applied to DATASET entities.

                                           Method              Data set          Generalization
  Model                                  P     R     F1      P     R     F1      P     R     F1
  Single input sentence
  Random Forest (TF-IDF)               0.56  0.83  0.67    0.56  0.83  0.67    0.57  0.89  0.70
  Random Forest + SciBERT              0.75  0.76  0.75    0.71  0.81  0.76    0.57  0.96  0.71
  SciBERT (fine-tuned)                 0.73  0.92  0.81    0.76  0.89  0.82    0.68  0.93  0.79
  SciBERT + CNN                        0.76  0.79  0.77    0.52  0.95  0.67    0.58  0.96  0.73
  With surrounding sentences for context
  Random Forest (TF-IDF)               0.69  0.76  0.72    0.69  0.76  0.72    0.54  0.92  0.68
  Random Forest + SciBERT              0.75  0.76  0.75    0.73  0.84  0.78    0.57  0.95  0.71
  SciBERT (fine-tuned)                 0.76  0.84  0.80    0.70  0.96  0.81    0.64  0.95  0.76
  SciBERT + CNN                        0.75  0.91  0.83    0.54  0.92  0.68    0.58  0.96  0.72

On manual inspection, we identified that the SciBERT and CNN models do not work when only a single sentence is given but critical information about an entity from the preceding or succeeding sentence is needed for the decision. For instance, in the following excerpt, the usage of the method is not recognized if only the second sentence is given to the models: "In this paper, we introduce Invariant Information Clustering (IIC), a method that [...]. IIC is a generic clustering algorithm that directly trains [...]."

Furthermore, it can be seen that pronouns, such as "we," give the models a strong hint that an entity has been used. However, in some cases, such as mathematical notations, this may lead to a false positive classification: "We can write the joint update for all as Restrict the update to define a contraction mapping in the Euclidean metric."

Generalization across Entity Types. We also evaluate how well the usage classification models generalize to other entity types. For this purpose, we apply all models trained on the METHOD entity type to DATASET entities. All examined models perform slightly worse regarding the F1 score, but still achieve very high recall values. This suggests that sentences in which methods are proposed or described do not differ too much from sentences that contain information about datasets. Out of all tested models, the SciBERT model generalizes best to another entity type.

Further Studies. We also study whether information about the current section improves the performance of our classification models. Thus, we prepend the title of the current section to the input sentence and retrain all models. Our results show negligible performance improvements from this modification.

Finally, we investigate the extent to which our created data set differs from the SciREX data set (Jain et al. 2020) containing salient information of publications. Specifically, we study the degree to which our definition of used entities differs from the salient entities considered by Jain et al. Salient entities are defined as necessary to describe the results of a paper and thus are semantically similar to our definition of used entities. We find for our method annotation set that only 12 out of 1,000 entries are labeled as salient in the original paper, which results in an MCC of 0.027 with our labels. For datasets, 39 entries are labeled as salient, with an MCC of 0.011. In comparison, our created annotation data contains roughly similar amounts of used and non-used (e.g., proposed, only mentioned) entities, which allows us to extract and analyze considerably more used entities than we can with the saliency approach.
4.3 Application

We apply our framework to a corpus of 25,060 full-text machine learning papers from the MAG (Sinha et al. 2015) combined with unpaywall. The publication dates range from 2005 to 2018, and for each year we draw the same number of papers to compare relative usages. We process the publications using GROBID (Lopez 2009) to extract the full text as well as the title and all section names. We extract 438,707 method and 98,276 dataset entities from our corpus. Out of all extracted entities, 56% are classified as used concerning the methods and 68% concerning the datasets.

Analyzing Relative Usage. We first study how many publications used specific entities compared to the number of publications in which the same entities were only mentioned. This relative measurement allows us to perform a more granular trend analysis because irrelevant entities that are never actually used will not be over-represented in the results.

Figure 4a shows this relative usage for selected machine learning methods over time. The usage of artificial neural networks (ANNs) and support vector machines (SVMs) is mostly constant between 60 and 75% for all papers that mention one or the other term, but a slight downward trend is discernible for plain ANNs. The relative usage of principal component analysis (PCA) shows a higher variability due to fewer absolute mentions, but PCA is used in up to 75% of the papers in which it is mentioned. For convolutional neural networks (CNNs), we only show values from 2012 and later because only a few mentions of CNNs occur in earlier years. Still, a clear trend is visible: at the beginning in 2012, only around 35% of papers that mentioned CNNs also used them for their work, whereas in 2018 the value was greater than 55%.

Figure 4: Relative usage of methods and datasets over time. (a) Usage of selected machine learning methods (ANN, CNN, SVM, PCA) over time relative to total mentions in papers. (b) Usage of selected machine learning methods over time relative to all computer vision papers. (c) Usage of machine learning data sets (MNIST, ImageNet, Wikipedia, PubMed) over time.

Analyzing Specific Domains. For another data study, we leverage the knowledge of the MAKG to select only publications from a specific computer science domain and analyze this subset of publications over time. Figure 4b shows the usage of selected machine learning methods in the computer vision field, which is one of the most popular categories by number of papers in our set. Here, we only analyze the relative number of publications in which an entity has been used, instead of the number of named entity occurrences. Until 2015, the most used methods were ANNs and SVMs, which together have been used in around 30% of all computer vision papers. Since 2014, the usage of CNNs has steadily grown, and they are now the most used computer vision method. In turn, the number of papers that use SVMs and PCA has rather declined. Compared with Figure 4a, it can be seen that the relative usage of CNNs has increased since 2016. All this demonstrates that such a study would not be possible without an approach as proposed in this paper, which determines the actual usage of mentioned entities.

We also apply our classification pipeline to DATASET entities. Figure 4c shows the absolute number of publications for the top four extracted datasets. A clear trend is visible for image recognition data sets, such as MNIST and ImageNet, which also correlates with the usage of CNNs in the computer vision domain. This again confirms the rising popularity of the specific domain. Another trend is visible for Wikipedia, which has become popular in research on knowledge representation and natural language processing.
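The relative-usage statistic behind Figure 4a can be computed from the document-level predictions as the per-year share of mentioning papers that actually use an entity. The following pandas sketch illustrates this on toy rows; the column names are illustrative and do not reflect our actual data layout.

```python
# Sketch of the relative-usage statistic: per year, the share of papers that
# use an entity among all papers that mention it (toy document-level rows).
import pandas as pd

df = pd.DataFrame({
    "year":   [2016, 2016, 2016, 2017, 2017, 2017, 2017],
    "entity": ["CNN", "CNN", "CNN", "CNN", "CNN", "SVM", "SVM"],
    "used":   [True, False, True, True, True, False, True],
})

relative_usage = (df.groupby(["entity", "year"])["used"]
                    .mean()                 # share of mentioning papers that use the entity
                    .mul(100)
                    .rename("usage [%]"))
print(relative_usage)
```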
datasets mentioned in 510,000 papers, can be used for re- Multi-Task Identification of Entities, Relations, and Corefer- search impact quantification tasks and further studies in the ence for Scientific Knowledge Graph Construction. In Pro- area of digital libraries. ceedings of the 2018 Conference on Empirical Methods in In the future, we plan to use our framework with respect Natural Language Processing, EMNLP’18, 3219–3232. to other entity types, such as task and evaluation metric. Fi- nally, a promising idea is to build a recommender system for Ma, X.; and Hovy, E. H. 2016. End-to-end Sequence Label- scientific publications using our framework. ing via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computa- References tional Linguistics, ACL’16. Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A Pre- Mesbah, S.; Lofi, C.; Torre, M. V.; Bozzon, A.; and Houben, trained Language Model for Scientific Text. In Inui, K.; G. 2018. TSE-NER: An Iterative Approach for Long-Tail Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Entity Extraction in Scientific Publications. In Proceedings Conference on Empirical Methods in Natural Language of the International Semantic Web Conference, ISWC’18, Processing and the 9th International Joint Conference on 127–143. 8 Mysore, S.; Kim, E.; Strubell, E.; others; and Olivetti, E. 2017. Automatically Extracting Action Graphs from Mate- rials Science Synthesis Procedures. CoRR abs/1711.06872. Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.-J. P.; and Wang, K. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of 24th International Conference on World Wide Web Companion, WWW’15, 243–246. Tchoua, R.; Ajith, A.; Hong, Z.; Ward, L.; Chard, K.; Audus, D.; Patel, S.; de Pablo, J.; and Foster, I. 2019. Active Learn- ing Yields Better Training Data for Scientific Named Entity Recognition. In Proceedings of the 15th International Con- ference on eScience, eScience’19, 126–135. Tsai, C.-T.; Kundu, G.; and Roth, D. 2013. Concept-Based Analysis of Scientific Literature. In Proceedings of the 22nd ACM International Conference on Information and Knowl- edge Management, CIKM’13, 1733–1738. Vliegenthart, D.; Mesbah, S.; Lofi, C.; Aizawa, A.; and Boz- zon, A. 2019. Coner: A Collaborative Approach for Long- Tail Named Entity Recognition in Scientific Publications. In Proceedings of the 23rd International Conference on Theory and Practice of Digital Libraries, TPDL’19, 3–17. 9