Identifying Used Methods and Datasets in Scientific Publications

Michael Färber, Alexander Albers, Felix Schüber
Karlsruhe Institute of Technology (KIT), Germany

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Although it has become common to assess publications and researchers by means of their citation count (e.g., using the h-index), measuring the impact of scientific methods and datasets (e.g., using an h-index for datasets) has been performed only to a limited extent. This is not surprising because the usage information of methods and datasets is typically not explicitly provided by the authors, but hidden in a publication's text. In this paper, we propose an approach to identifying methods and datasets in texts that have actually been used by the authors. Our approach first recognizes datasets and methods in the text by means of a domain-specific named entity recognition method with minimal human interaction. It then classifies these mentions into used vs. non-used based on the textual contexts. The obtained labels are aggregated on the document level and integrated into the Microsoft Academic Knowledge Graph modeling publications' metadata. In experiments based on the Microsoft Academic Graph, we show that both method and dataset mentions can be identified and correctly classified with respect to their usage to a high degree. Overall, our approach facilitates method and dataset recommendation, enhanced paper recommendation, and scientific impact quantification. It can be extended in such a way that it can identify mentions of any entity type (e.g., task).

1 Introduction

In the past, a huge variety of scientific methods and datasets has been proposed in the different scientific disciplines. For instance, Wikipedia lists several hundred datasets for the area of machine learning (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research). It is therefore unsurprising that researchers are often unaware of which scientific methods or data sets have already been used for a given research topic. Furthermore, in digital libraries, such information regarding the usage of scientific methods and datasets can be very useful. For instance, this information allows us to measure the impact of publications and researchers in novel ways (e.g., an h-index for datasets). In this way, authors providing methods and datasets can be awarded properly in the light of FAIR data principles and open research efforts.
The usage of methods and datasets is typically not given explicitly, but mentioned in publications' full texts. Identifying scientific methods and datasets in texts can be considered as domain-specific named entity recognition. In the scholarly domain, a few approaches have been proposed for identifying concepts such as datasets (Mesbah et al. 2018; Luan 2019; Luan et al. 2018; Tsai, Kundu, and Roth 2013). For instance, Tsai, Kundu, and Roth (2013) propose a method to extract concepts from scientific publications. They limit their extraction method to entities that are followed by a citation indicator and extract all mentioned concepts, rather than only the ones explicitly used. Gábor et al. (2018), in contrast, proposed a method to classify entity mentions into used and non-used. However, usage relations are only considered between entities of a specific type and not with respect to the papers' authors. Overall, a state-of-the-art approach that can recognize and classify scientific methods and datasets is, to the best of our knowledge, missing so far. Moreover, no large data set has been published that allows tasks for method/dataset-centric scientific impact quantification.

In this paper, we develop a framework to recognize entities of type DATASET and METHOD in scientific publications, as well as to classify them as used vs. non-used. Our framework consists of a domain-specific named entity recognition step, a classification step for determining the actual usage, and an aggregation step for retrieving the used methods and datasets on the document level. Our approach is designed to extract information about entities from scientific publications in an automated way, requiring minimal human interaction. We provide the usage information of about 771,000 methods and 449,000 datasets online for further usage. Moreover, we integrate the information into the Microsoft Academic Knowledge Graph (MAKG), which models information on more than 120 million scientific publications, and thereby provides the basis for scientific impact quantification studies (e.g., designing "h-index"-like metrics for scientific methods and datasets).

Overall, the main contributions of this paper are as follows:
• We develop a named entity recognition approach that extracts scientific methods and datasets from texts. Our approach extends preliminary works (Mesbah et al. 2018) by using state-of-the-art embedding techniques.
• We develop novel approaches to identify in texts the methods and datasets authors have indeed used in their papers.
• We create an evaluation dataset of 1,000 sentences with annotated methods and datasets and provide it to the public.
• We perform extensive experiments and identify the best classification method for the proposed task.
• We analyze the results of applying our framework to computer science papers.
• We extend the MAKG with the usage information concerning methods and datasets mentioned in 510,027 papers and provide it to the public.

Our data and code are publicly available at https://github.com/michaelfaerber/scholarly-entity-usage-detection.
The rest of our paper is structured as follows: In Section 2, we outline related work concerning domain-specific named entity recognition and usage classification. In Section 3, we describe our methods for named entity recognition and usage classification. We present our evaluation in Section 4 and our generated dataset in Section 5, before summarizing our findings in Section 6.

2 Related Work

In the following paragraphs, we outline the most relevant works concerning named entity recognition for long-tail entities and the extraction of aspects of entities.

Named Entity Recognition for Long-Tail Entities. In general, existing named entity recognition (NER) approaches are of diverse nature: They utilize gazetteers, rules, parts-of-speech tagging, dependency trees, or machine learning techniques. State-of-the-art NER approaches are often based on long short-term memory networks (LSTMs) (Mysore et al. 2017), conditional random fields (CRFs) (Mesbah et al. 2018; Vliegenthart et al. 2019), or a combination of both (Lample et al. 2016; Ma and Hovy 2016; Luan 2019; Jain et al. 2020). Although many approaches to named entity recognition exist, most of them require a considerable amount of human interaction for the creation of sufficient training data. Few approaches take into consideration that most of the considered entities are long-tail entities (i.e., appearing infrequently in documents and often not represented in public knowledge repositories, such as Wikidata). To reduce the required amount of human-labeled training data, iterative and active learning techniques have been proposed, particularly for scientific publications (Tchoua et al. 2019; Mesbah et al. 2018; Vliegenthart et al. 2019; Luan et al. 2018). Mesbah et al. (2018), for instance, introduce TSE-NER, which iteratively expands a predefined seed set of terms without additional human input. The authors apply several heuristic filtering methods to automatically create positive and negative classification examples. Our approach to named entity recognition is based on TSE-NER, but extends it by using SciBERT embeddings. Vliegenthart et al. (2019) also extend the TSE-NER approach by relying on human feedback for newly added labels. Although the authors achieve a lower rate of added false positives, this semi-supervised technique reintroduces the need for human labor and thus does not meet our requirements. Tchoua et al. (2019) present a dedicated NER approach for material sciences to recognize polymer names. The approach is based on active learning to overcome the data sparsity problem. Luan et al. (2018) introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. Although the approach is valuable in settings where not only named entities but facts need to be extracted from text, the authors do not specifically consider the usage of datasets and methods by the papers' authors.

Identifying Aspects of Entities. Apart from recognizing named entities, a few approaches take additional aspects of the entities, such as the actual usage of entities, into account. Gupta and Manning (2011) introduce a method to identify the focus, domain of application, and technique from computational linguistics papers, but this approach only extracts broad topics. Jain et al. (2020) focus on detecting and extracting salient information from publications. They define salient information as information (e.g., named entities) that is needed to describe the results of an article. In contrast, our goal is to find all used entities to gain enhanced insight into the general usage of methods and datasets.

3 Approach

Our framework for identifying the methods and datasets authors use in a given text document is depicted in Figure 1. We can differentiate between the following steps:
1. We build a named entity recognition model to extract named entities from a given scientific paper.
2. We perform a classification of each named entity into used and non-used (i.e., merely mentioned) on a sentence level.
3. We aggregate the sentence-level classifications of all named entities in a document.

The obtained list of methods and datasets used per document can be further analyzed in various ways. For a neat alignment with papers' metadata, we extend the Microsoft Academic Knowledge Graph (MAKG) with this new data. In this way, metadata of publications, authors, venues, and research areas can be used for advanced scholarly data mining (e.g., for novel ways of research impact assessment).

In the following, we present the single steps of our pipeline in more detail.
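To make the three steps concrete, the following toy sketch wires them together in Python. The gazetteer lookup and the cue-phrase heuristic are merely stand-ins for the CRF-based recognizer of Section 3.1 and the learned usage classifiers of Section 3.2; all names, cue phrases, and example sentences are illustrative assumptions and do not correspond to our actual implementation.

```python
# Toy end-to-end sketch of the three pipeline steps. The gazetteer and the
# cue-phrase heuristic are illustrative stand-ins for the CRF tagger and the
# learned usage classifier described in Sections 3.1 and 3.2.
from collections import Counter

GAZETTEER = {"MNIST", "SVM"}          # assumed seed entities, for illustration only
USAGE_CUES = ("we use", "we used", "we train", "we evaluate on")

def recognize_entities(sentence):
    """Step 1: return entity mentions found in the sentence."""
    return [entity for entity in GAZETTEER if entity in sentence]

def classify_usage(sentence):
    """Step 2: classify the sentence context as used vs. merely mentioned."""
    return any(cue in sentence.lower() for cue in USAGE_CUES)

def used_entities(document_sentences):
    """Step 3: aggregate sentence-level predictions per entity by majority vote."""
    votes = {}
    for sentence in document_sentences:
        for mention in recognize_entities(sentence):
            votes.setdefault(mention, Counter())[classify_usage(sentence)] += 1
    return sorted(m for m, v in votes.items() if v[True] > v[False])

doc = ["We used MNIST for all experiments.",
       "SVM baselines are discussed in related work."]
print(used_entities(doc))   # -> ['MNIST']
```

In the full pipeline, the gazetteer and the cue phrases would be replaced by the trained CRF tagger and the classifiers evaluated in Section 4.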
3.1 Named Entity Recognition

For named entity recognition, we adapt TSE-NER (Mesbah et al. 2018) to our needs. TSE-NER is based on the hypothesis that entities of the same type are mostly used in a similar context. For example, objects of the entity type DATASET may be mentioned in the documents via phrases such as "we used data set X" or "we could achieve a recall of 0.4 on data set Y." Identifying such patterns automatically in the text allows us to identify additional, unknown entities in the text – particularly long-tail entities. The contexts of these newly found entity mentions can then be mined in another iteration, leading to additional patterns for named entity recognition.

Figure 1: Overview of our framework.

An in-depth introduction to the original TSE-NER approach is provided by Mesbah et al. (2018). In the following, we outline the main steps of our named entity recognition approach and the main differences from the original TSE-NER approach.
1. We start with an initial set of METHOD and DATASET instances as seed terms (e.g., "SVM" and "MNIST"). These seed terms can, for instance, be gathered from existing knowledge graphs. In contrast to the original approach of Mesbah et al., we consider all computer science methods and datasets. The seed term selection is explained in Section 4.
2. We expand the list of seed terms by applying term and sentence expansion (TSE). In contrast to the original method, we use SciBERT as a semantic relatedness method and cluster the new entities using k-means (a minimal sketch of this step is given below).
3. Using the expanded set of entities, we annotate named entities in the training data. As context for each named entity, we consider the current sentence as well as the preceding and subsequent sentence.
4. Using the annotated training set, we apply our NER approach and thereby identify new entity candidates. We use a CRF algorithm to learn the patterns of the data.
5. Finally, we filter the entity candidates to prevent misclassification and ensure data quality. We start with simple parts-of-speech analysis and stop-word removal methods to keep relevant nouns. Then, we use knowledge graph information and similarity scores to remove those entities with low similarity and no reference.

The output of our named entity recognition approach is a list of mentioned scientific methods and datasets with their positions in the texts.
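The following sketch illustrates the modified term-expansion step (step 2): candidate terms are embedded with SciBERT and grouped with k-means. The model checkpoint, the mean pooling, and the number of clusters are illustrative assumptions for the sake of the example, not necessarily the exact settings of our implementation.

```python
# Sketch of the term-expansion step (step 2): embed candidate terms with
# SciBERT and group them with k-means. Pooling strategy and k are
# illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(terms):
    """Mean-pooled SciBERT token embeddings, one vector per term."""
    inputs = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (batch, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

candidates = ["support vector machine", "MNIST", "ImageNet", "random forest"]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embed(candidates))
# Clusters that contain known seed terms are kept; the remaining terms in the
# same cluster become new candidate entities for the next iteration.
print(dict(zip(candidates, labels)))
```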
3.2 Usage Classification

In total, we present four approaches for detecting used entity mentions of type METHOD or DATASET. For each model, we first apply an embedding-based method to transform the texts into a feature space, and then apply a classification algorithm to classify usage. In the following, we outline our approaches.

Model 1: TF-IDF + Random Forest. As a baseline model, we use term frequency-inverse document frequency (tf-idf) to represent the words of a text as vectors. Based on preliminary evaluations of several standard classification methods, we choose a random forest classifier for the classification into used and non-used.

Model 2: SciBERT + Random Forest. For our second model, we make use of SciBERT (Beltagy, Lo, and Cohan 2019), a BERT-based language model pretrained on scientific publications. This embedding model has been used for various tasks, such as scientific text classification and recommendation. In our use case, we use SciBERT embeddings to create feature vectors and a random forest classifier for the binary classification (see the sketch below).

Model 3: SciBERT + SciBERT. Our third model is based on a fine-tuned SciBERT model for sequence classification. Beltagy, Lo, and Cohan (2019) show that fine-tuning SciBERT clearly improves the classification score, especially in the field of computer science. Hence, in comparison to the second model, we now also use SciBERT to perform the classification by fine-tuning it on our annotated data. For the classification task, SciBERT uses a linear classification layer.

Model 4: SciBERT + CNN. Our fourth model uses SciBERT embeddings as feature vectors and a convolutional neural network (CNN) for the classification task. We use the CNN architecture introduced by Kim (2014) as an advanced classification technique to capture the complex structure of word embeddings, which should result in a more accurate classification score.
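As an illustration of Model 2, the following sketch derives sentence vectors from SciBERT and feeds them to a random forest. Using the [CLS] token as the sentence representation and the shown hyperparameters are assumptions made for this example; the toy sentences and labels are not taken from our annotated data.

```python
# Sketch of Model 2 (SciBERT features + random forest). The [CLS] pooling and
# the classifier settings are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def sentence_vectors(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0, :].numpy()  # [CLS] vector

train_sentences = ["We train an SVM on the extracted features.",   # toy examples
                   "An SVM is a maximum-margin classifier."]
train_labels = [1, 0]   # 1 = used, 0 = merely mentioned

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(sentence_vectors(train_sentences), train_labels)
print(clf.predict(sentence_vectors(["We evaluate our model on MNIST."])))
```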
3.3 Document-level Aggregation

The method described above allows us to make a prediction for each occurrence of a named entity (i.e., an entity-level prediction). To predict at the document level whether each unique named entity of a document is used or only mentioned or proposed, we aggregate all entity-level predictions to a document-level prediction using a majority vote.

3.4 Augmenting Publications' Metadata

We use our results to extend the MAKG (Färber 2019), which models publications' metadata for all scientific disciplines. Given that the MAKG is provided in the Resource Description Framework (RDF), we introduce the property :used_methods, which associates a paper with a used method. Because no knowledge graph contains all of the extracted methods and datasets, we refrain from linking to URIs in other knowledge graphs.
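The sketch below shows how a document-level result could be serialized as RDF triples for the MAKG extension using rdflib. The namespace and the property name are placeholders and do not necessarily match the IRIs used in the published MAKG files; the paper identifier and method strings are toy values.

```python
# Sketch of the metadata extension (Section 3.4): emit one RDF triple per
# paper/used-method pair. Namespace and property name are placeholders.
from rdflib import Graph, Literal, Namespace

MAKG_PAPER = Namespace("https://makg.org/entity/")        # assumed entity namespace
EX = Namespace("https://example.org/property/")           # assumed property namespace

g = Graph()
g.bind("ex", EX)
used_methods = {"12345": ["convolutional neural network", "SVM"]}   # toy data

for paper_id, methods in used_methods.items():
    for method in methods:
        g.add((MAKG_PAPER[paper_id], EX.used_method, Literal(method)))

print(g.serialize(format="turtle"))
```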
4 Evaluation

In the following, we outline our evaluations of all three steps of our pipeline. First, we compare the results of our modified TSE-NER model to the original paper. Next, we evaluate our usage classification models on our annotated test data. Finally, we apply our pipeline to full-text papers from the computer science domain to analyze trends over time in various computer science fields.

4.1 Named Entity Recognition

Evaluation Settings

(1) Training. We train our named entity recognition model on all 7 million abstracts of computer science papers given in the Microsoft Academic Graph (MAG; v2019-12-26) (Sinha et al. 2015). For the methods, we use the same 50 seed terms as the authors of the original paper. For DATASETS, we create our own set of seed terms because we were only able to expand very few sentences from our corpus using the original terms: we extract 73 data set names from Wikipedia (https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research) and Wikidata (https://w.wiki/RrU) based on our knowledge in the machine learning domain. For our initial assessment, we run two iterations for each entity type, which according to the authors should already yield good results with a high precision value. Running more than two iterations increases recall at the cost of precision due to the addition of too many unrelated seed terms.

(2) Testing. To evaluate the NER approach, we use the SciREX dataset (Jain et al. 2020), which includes annotations of full-text papers from the machine learning domain for the METHOD and DATASET entity types. In this way, we can reuse existing evaluation data sets and compare our evaluation results with those of the original TSE-NER (Mesbah et al. 2018). Although the authors of TSE-NER only apply their evaluation to triples consisting of a sentence containing the test entity, as well as the preceding and the succeeding sentence (Mesbah et al. 2018), we apply our model to full-text documents, which we regard as a more realistic setting.

As in the original paper, we calculate precision, recall, and F1 scores for the named entity recognition of METHOD and DATASET instances. We count partial matches as correct predictions because in most cases we do not need to cover the full span of an entity to gain meaningful insight.
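The following sketch shows one way to score predicted spans such that partial matches count as correct, which is the counting scheme assumed here. The exact matching logic of our evaluation scripts may differ; spans are represented as (start, end) character offsets.

```python
# Sketch of span-level scoring where a predicted span counts as correct if it
# overlaps any gold span (partial matches count as hits).
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]        # spans are (start, end) offsets

def precision_recall_f1(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(predicted=[(0, 4), (10, 14)], gold=[(2, 6), (20, 25)]))
# -> (0.5, 0.5, 0.5)
```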
Evaluation Results

(1) Study on Embeddings. The original TSE-NER approach is based on word2vec embeddings. Thus, we first analyze how using SciBERT token embeddings instead of word2vec embeddings for term clustering and similar-term filtering (see steps 2 and 5 in Section 3.1) influences the clustering performance. We qualitatively study the clustering results of the term expansion in the first iteration for the METHOD type and find that, in general, both approaches generate very consistent clusters that differ based on various computer science fields. Given that the word2vec model had to be trained from scratch, it achieves surprisingly good results. Nevertheless, clustering based on SciBERT embeddings yields far more and richer terms, because it is not limited to just bigrams. Single clusters contain more variations of the same terms and generally contain better results. One risk of using SciBERT is that terms such as Netflix or GitHub are clustered together with dataset names, which is likely caused by these terms being used in the context of datasets but not being recognized jointly with neighboring terms. This may decrease the NER performance if names of other unrelated organizations are added as a result in the following iterations.

(2) NER Evaluation Results. Mesbah et al. (2018) achieve precision and recall values of 0.79 and 0.24 for the METHOD type and 0.83 and 0.10 for the DATASET type. The authors' TSE-NER model was trained based on 100 initial seed terms and the same sentence expansion and filtering strategies as our model. As shown in Table 1, we are not able to achieve a similarly high precision value as the authors of the original paper, who used around 15,000 full-text papers as their corpus. The obvious reason is that publications' abstracts, as used by us, may be publicly available to a large extent and therefore may be a good data source, but seem to contain method and dataset names only to a limited degree. To improve the performance of TSE-NER, we choose to replicate a more similar corpus by using 25,060 full-text papers instead of 7 million abstracts from the MAG, as well as narrowing the domain to include only machine learning papers. Although we see equal or higher recall values, this corpus does not improve the F1 scores significantly or, in the case of data sets, even reduces the metric.

Table 1: Evaluation of our modified TSE-NER model on the SciREX data set using precision, recall, and F1 score.

  Training corpus       Abstracts           Full texts
  Metric              P     R     F1      P     R     F1
  Method            0.44  0.14  0.21    0.26  0.45  0.33
  Data set          0.33  0.27  0.30    0.20  0.29  0.25

Figure 2 and Figure 3 illustrate the named entity recognition for two exemplary sections from the SciREX data set. We can observe that, in general, the approach produces decent results. The approach sometimes fails to capture the complete span of an entity mention (e.g., the first word in "character embedding layer"). Some of the false positive predictions are not too far-fetched, such as "vector space", but others, such as "query", "answer", and "context", are less similar to names of methods. This indicates that there is still potential to introduce better filtering strategies. One recurring problem for the DATASET model is that the term "dataset" is recognized without any specific names in its context.

Figure 2: Example prediction of our trained TSE-NER model (top) versus ground truth (bottom) for the METHOD type after two iterations.

Figure 3: Example prediction of our trained TSE-NER model (top) versus ground truth (bottom) for the DATASET type after two iterations.

To further compare our results with the TSE-NER publication (Mesbah et al. 2018), Table 2 shows the number of methods and datasets collected in each step based on the corpus containing papers' abstracts. While the original TSE-NER model used nearly 30,000 method names, our model is only able to use 3,403 method names as training data for the CRF. Training on the full-text corpus yields 8,355 named entities for training. This leads to more than 90,000 extracted named entities after the CRF training, compared to 7,469 named entities when training on abstracts, but still does not achieve the same results as Mesbah et al. (2018). One obvious reason for that may be that neither of our training corpora contains as many seed entities, which results in fewer found terms and sentences. Another reason may be that the found sentences contain fewer similar neighboring terms (e.g., fewer enumerations of method names or datasets), which would result in smaller cluster sizes and thus fewer added terms.

Table 2: TSE-NER training details using papers' abstracts as corpus. The table shows the number of entities after each training step for the first and second iteration.

             Iteration   Size of seed set   Expanded entities   Extracted entities   Filtered entities
  Method         1              50                4,273               4,032                  453
                 2             503                3,403               7,469                1,031
  Data set       1              73                  354               1,450                    6
                 2              79                  403               2,378                  187

Despite the inferior evaluation results for our domain-specific named entity recognition of methods and datasets, we nevertheless believe they are sufficient for the subsequent knowledge graph expansion and trend analysis. Because we aggregate all found entities on the document level, we assume that a few missing mentions of the same entity would not affect the outcome significantly. For the subsequent tasks, we use the NER model trained on abstracts instead of full texts, because we favor higher precision over recall for the knowledge graph extension.
4.2 Usage Classification

Evaluation Dataset

We needed to create a new dataset for training and evaluating our usage classification models. To this end, two authors (computer scientists) manually annotated 1,000 sentences concerning the usage of mentioned methods and data sets (500 per entity type and person; see Table 4 for more statistics). We reuse a subset of the SciREX data set (Jain et al. 2020), which already contains annotated entities for the METHOD and DATASET type, and manually annotate whether an entity has been used in the given sentence and context. To reduce training bias, we also drop duplicate entities. We only annotated an entity as used if this is obvious from reading the sentence containing the entity and its surrounding context. In any uncertain cases, we annotate the entity as non-used. This way, we aim to achieve high precision on the sentence level while still being able to decide for an entity on the document level, using our entity aggregation step, whether the entity has been used. We also label an entity as used if it has been used in a comparison of multiple approaches (i.e., as a baseline). In this way, we allow a thorough tracking of used methods and datasets, facilitating scientific impact quantification.

To ensure high data quality and consistency of our annotated data, we select 100 annotated entities of the METHOD and DATASET type to calculate the inter-annotator agreement. We achieve a satisfactory κ score of 0.86 for methods and 0.91 for datasets.

Finally, we drop invalid entity types (e.g., entities from SciREX that are classified as material type but do not make sense as a data set type) and create a training and test set. Using the same amount of used and non-used entities, we have 802 entries for the METHOD type and 492 entries for the DATASET type. For the evaluation, we split the annotated data into training and test sets with a ratio of three to one.

Table 4: Key statistics of our annotated data set.

  Entity Type   # annotated sentences   # annotated entities   # used entities   # mentioned entities   # balanced entities   κ score
  Method                1,000                    909                  508                  401                   802           0.858
  Data set              1,000                    841                  595                  246                   492           0.909
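The κ scores in Table 4 correspond to Cohen's kappa on the doubly annotated subset. The following sketch shows the computation with scikit-learn on toy labels (1 = used, 0 = merely mentioned); the values shown are not our actual annotations.

```python
# Sketch of the inter-annotator agreement computation using Cohen's kappa
# on toy labels for the doubly annotated entities.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = used, 0 = merely mentioned
annotator_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

print(round(cohen_kappa_score(annotator_a, annotator_b), 3))   # -> 0.783
```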
Evaluation Settings

Because our usage classification task constitutes a binary classification problem, we evaluate our models using precision, recall, and F1 score. As outlined in Section 3.2, we evaluate four models: (1) a random forest with tf-idf representations, (2) a random forest with SciBERT embeddings, (3) a SciBERT classification model with SciBERT embeddings, and (4) a CNN model with SciBERT embeddings for the text representation.

Evaluation Results

Comparison of Methods. Table 3 shows the evaluation results concerning the usage classification of method and dataset occurrences. For METHOD entities, the fine-tuned SciBERT model performs better with only a single sentence as input and achieves the best recall. The combined SciBERT and CNN model works best when the preceding and succeeding sentences are available as context. It achieves a similarly high recall and slightly better precision than the fine-tuned SciBERT model.

For DATASET entities, both the fine-tuned SciBERT model and the CNN model achieve higher recall than they do for classifying METHOD entities. SciBERT still achieves relatively high precision scores but works better when neighboring sentences are available. For the CNN model, precision scores are significantly lower than they are for method entities.

Neither random forest model manages to compete with the more sophisticated models, but both work slightly better on the DATASET entity type. Using the SciBERT sentence embeddings instead of tf-idf consistently results in a significantly higher precision at the cost of slightly lower recall values.

Table 3: Precision, recall, and F1 scores for our usage classification models. We train each model with a single sentence as input as well as with the preceding and succeeding sentences, for both methods and data sets. Further, we show the generalization capabilities of models that have been trained on the METHOD type and then applied to DATASET entities.

                                           Method              Data set          Generalization
  Model                                  P     R     F1      P     R     F1      P     R     F1
  Single input sentence
  Random Forest (TF-IDF)               0.56  0.83  0.67    0.56  0.83  0.67    0.57  0.89  0.70
  Random Forest + SciBERT              0.75  0.76  0.75    0.71  0.81  0.76    0.57  0.96  0.71
  SciBERT (fine-tuned)                 0.73  0.92  0.81    0.76  0.89  0.82    0.68  0.93  0.79
  SciBERT + CNN                        0.76  0.79  0.77    0.52  0.95  0.67    0.58  0.96  0.73
  With surrounding sentences for context
  Random Forest (TF-IDF)               0.69  0.76  0.72    0.69  0.76  0.72    0.54  0.92  0.68
  Random Forest + SciBERT              0.75  0.76  0.75    0.73  0.84  0.78    0.57  0.95  0.71
  SciBERT (fine-tuned)                 0.76  0.84  0.80    0.70  0.96  0.81    0.64  0.95  0.76
  SciBERT + CNN                        0.75  0.91  0.83    0.54  0.92  0.68    0.58  0.96  0.72

On manual inspection, we identified that the SciBERT and CNN models do not work when only a single sentence is given but critical information about an entity from the preceding or succeeding sentence is needed for the decision. For instance, in the following excerpt, the usage of the method is not recognized if only the second sentence is given to the models: "In this paper, we introduce Invariant Information Clustering (IIC), a method that [...]. IIC is a generic clustering algorithm that directly trains [...]."

Furthermore, it can be seen that pronouns, such as "we," give the models a strong hint that an entity has been used. However, in some cases, such as mathematical notations, this may lead to a false positive classification: "We can write the joint update for all as Restrict the update to define a contraction mapping in the Euclidean metric."

Generalization across Entity Types. We also evaluate how well the usage classification models generalize to other entity types. For this purpose, we apply all models trained on the METHOD entity type to DATASET entities. All examined models perform slightly worse regarding the F1 score, but still achieve very high recall values. This suggests that sentences in which methods are proposed or described do not differ too much from sentences that contain information about datasets. Out of all tested models, the SciBERT model generalizes best to another entity type.

Further Studies. We also study whether information about the current section improves the performance of our classification models. Thus, we prepend the title of the current section to the input sentence and retrain all models. Our results show negligible performance improvements from this modification.

Finally, we investigate the extent to which our created data set differs from the SciREX data set (Jain et al. 2020) containing salient information of publications. Specifically, we study the degree to which our definition of used entities differs from the salient entities considered by Jain et al. Salient entities are defined as necessary to describe the results of a paper and thus are semantically similar to our definition of used entities. We find for our method annotation set that only 12 out of 1,000 entries are labeled as salient in the original paper, which results in an MCC of 0.027 with our labels. For datasets, 39 entries are labeled as salient, with an MCC of 0.011. In comparison, our created annotation data contains roughly similar amounts of used and non-used (e.g., proposed, only mentioned) entities, which allows us to extract and analyze considerably more used entities than we can with the saliency approach.
4.3 Application

We apply our framework to a corpus of 25,060 full-text machine learning papers from the MAG (Sinha et al. 2015) combined with unpaywall. The publication dates range from 2005 to 2018, and for each year we draw the same number of papers to compare relative usages. We process the publications using GROBID (Lopez 2009) to extract the full text as well as the title and all section names. We extract 438,707 method and 98,276 dataset entities from our corpus. Out of all extracted entities, 56% are classified as used concerning the methods and 68% concerning the datasets.

Analyzing Relative Usage. We first study how many publications used specific entities compared to the number of publications in which the same entities were only mentioned. This relative measurement allows us to perform a more granular trend analysis because irrelevant entities that are never actually used will not be over-represented in the results.

Figure 4a shows this relative usage for selected machine learning methods over time. The usage of artificial neural networks (ANNs) and support vector machines (SVMs) is mostly constant between 60 and 75% for all papers that mention one or the other term, but a slight downward trend is discernible for plain ANNs. The relative usage of principal component analysis (PCA) shows a higher variability due to fewer absolute mentions, but PCA is used in up to 75% of the papers in which it is mentioned. For convolutional neural networks (CNNs), we only show values from 2012 and later because only a few mentions of CNNs occur in earlier years. Still, a clear trend is visible: at the beginning in 2012, only around 35% of papers that mentioned CNNs also used them for their work, whereas in 2018 the value was greater than 55%.

Figure 4: Relative usage of methods and datasets over time. (a) Usage of selected machine learning methods (ANN, CNN, SVM, PCA) over time relative to total mentions in papers. (b) Usage of selected machine learning methods over time relative to all computer vision papers. (c) Usage of machine learning data sets (MNIST, ImageNet, Wikipedia, PubMed) over time.

Analyzing Specific Domains. For another data study, we leverage the knowledge of the MAKG to select only publications from a specific computer science domain and analyze this subset of publications over time. Figure 4b shows the usage of selected machine learning methods in the computer vision field, which is one of the most popular categories by number of papers in our set. Here, we only analyze the relative number of publications in which an entity has been used, instead of the number of named entity occurrences. Until 2015, the most used methods were ANNs and SVMs, which together have been used in around 30% of all computer vision papers. Since 2014, the usage of CNNs has steadily grown, and they are now the most used computer vision method. In turn, the number of papers that use SVMs and PCA has rather declined. Compared with Figure 4a, it can be seen that the relative usage of CNNs has increased since 2016. All this demonstrates that such a study would not be possible without an approach as proposed in this paper, which determines the actual usage of mentioned entities.

We also apply our classification pipeline to DATASET entities. Figure 4c shows the absolute number of publications for the top four extracted datasets. A clear trend is visible for image recognition data sets, such as MNIST and ImageNet, which also correlates with the usage of CNNs in the computer vision domain. This again confirms the rising popularity of the specific domain. Another trend is visible for Wikipedia, which has become popular in research on knowledge representation and natural language processing.
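The relative-usage statistic behind Figure 4a can be computed from the document-level predictions as the per-year share of mentioning papers that actually use an entity. The following pandas sketch illustrates this on toy rows; the column names are illustrative and do not reflect our actual data layout.

```python
# Sketch of the relative-usage statistic: per year, the share of papers that
# use an entity among all papers that mention it (toy document-level rows).
import pandas as pd

df = pd.DataFrame({
    "year":   [2016, 2016, 2016, 2017, 2017, 2017, 2017],
    "entity": ["CNN", "CNN", "CNN", "CNN", "CNN", "SVM", "SVM"],
    "used":   [True, False, True, True, True, False, True],
})

relative_usage = (df.groupby(["entity", "year"])["used"]
                    .mean()                 # share of mentioning papers that use the entity
                    .mul(100)
                    .rename("usage [%]"))
print(relative_usage)
```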
datasets mentioned in 510,000 papers, can be used for re- Multi-Task Identification of Entities, Relations, and Corefer- search impact quantification tasks and further studies in the ence for Scientific Knowledge Graph Construction. In Pro- area of digital libraries. ceedings of the 2018 Conference on Empirical Methods in In the future, we plan to use our framework with respect Natural Language Processing, EMNLP’18, 3219–3232. to other entity types, such as task and evaluation metric. Fi- nally, a promising idea is to build a recommender system for Ma, X.; and Hovy, E. H. 2016. End-to-end Sequence Label- scientific publications using our framework. ing via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computa- References tional Linguistics, ACL’16. Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A Pre- Mesbah, S.; Lofi, C.; Torre, M. V.; Bozzon, A.; and Houben, trained Language Model for Scientific Text. In Inui, K.; G. 2018. TSE-NER: An Iterative Approach for Long-Tail Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Entity Extraction in Scientific Publications. In Proceedings Conference on Empirical Methods in Natural Language of the International Semantic Web Conference, ISWC’18, Processing and the 9th International Joint Conference on 127–143. 8 Mysore, S.; Kim, E.; Strubell, E.; others; and Olivetti, E. 2017. Automatically Extracting Action Graphs from Mate- rials Science Synthesis Procedures. CoRR abs/1711.06872. Sinha, A.; Shen, Z.; Song, Y.; Ma, H.; Eide, D.; Hsu, B.-J. P.; and Wang, K. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of 24th International Conference on World Wide Web Companion, WWW’15, 243–246. Tchoua, R.; Ajith, A.; Hong, Z.; Ward, L.; Chard, K.; Audus, D.; Patel, S.; de Pablo, J.; and Foster, I. 2019. Active Learn- ing Yields Better Training Data for Scientific Named Entity Recognition. In Proceedings of the 15th International Con- ference on eScience, eScience’19, 126–135. Tsai, C.-T.; Kundu, G.; and Roth, D. 2013. Concept-Based Analysis of Scientific Literature. In Proceedings of the 22nd ACM International Conference on Information and Knowl- edge Management, CIKM’13, 1733–1738. Vliegenthart, D.; Mesbah, S.; Lofi, C.; Aizawa, A.; and Boz- zon, A. 2019. Coner: A Collaborative Approach for Long- Tail Named Entity Recognition in Scientific Publications. In Proceedings of the 23rd International Conference on Theory and Practice of Digital Libraries, TPDL’19, 3–17. 9