Distant supervision for silver label generation of software mentions in social scientific publications*

Katarina Boland1[0000-0003-2958-9712] and Frank Krüger2[0000-0002-7925-3363]

1 GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
2 Institute of Communications Engineering, University of Rostock, Rostock, Germany
katarina.boland@gesis.org, frank.krueger@uni-rostock.de

* This work was partially carried out at GESIS - Leibniz Institute for the Social Sciences and was financially supported by the GESIS research grant GG-2019-015 and the German Research Foundation (DFG) within the CRC 1270.

Abstract. Many scientific investigations rely on software for a range of different tasks, including statistical data analysis, data pre-processing, and data presentation. The choice of software may greatly influence not only the research process but also the derived findings, e.g. when errors in the used software lead to incorrect computations or biases. In order to increase the transparency of research and the verifiability of findings, knowledge of the used software is therefore crucial. However, explicit links between publications and the software they use are usually not available. In addition, software is, unlike literature, often not cited in a standardized way, which makes the automatic generation of such links difficult. While recent Named Entity Recognition (NER) approaches yield excellent results for a wide range of use-cases and tasks, they typically require large sets of annotated data which may be hard to acquire. In this paper, we investigate the use of weakly supervised approaches with distant supervision to create silver labels for training supervised software mention extraction methods via transfer learning. We show that by combining even only a small number of weakly supervised approaches, a silver standard corpus can be created that serves as a useful basis for transfer learning.

Keywords: Software Mention Extraction · Silver Standard · Mining of Scientific Publications · Open Science · Distant Supervision

1 Introduction

Today, software is used for a variety of tasks in all steps of the research process, from data collection and data analysis to the presentation and dissemination of findings. It can therefore shape both the process and the outcomes of scientific investigations in significant ways. Eklund et al., for instance, discovered inflated false-positive rates in the analysis of fMRI data when using standard fMRI analysis software packages [8]. Research findings relying on analyses with these packages may therefore be systematically flawed. Another recently identified problem concerns the automatic date formatting in Excel, which was shown to mistakenly convert gene names into dates [25] and may thereby introduce errors into datasets. Provenance information, including knowledge about the software involved in scientific investigations, is thus crucial to create understandable, traceable, and reproducible research that meets the requirements of open science and enables the implementation of recently proposed mechanisms for quality control and reproducibility [12]. Links between software, created datasets, and research findings would enable the explicit modelling of provenance information and the tracing of biases and errors throughout all stages of the research process. Moreover, assessing the usage of software in scientific publications could serve as a basis for rewarding software as research output [20], further advancing open science.
However, such links are not easily identifiable. While software citation standards exist (e.g., by FORCE11 [23]), none of them has yet become universally established in scientific publications. Some researchers include only the name of the software, others add information about the manufacturer and the version. This complicates the automated extraction of such statements and thereby the automatic detection of links. Manual analyses of software mentions have been conducted previously [12,19], but were limited to small sets of publications (90 and 40, respectively) due to the high costs of manual annotation.

Recently, deep neural networks have gained increasing interest in the domain of NER [13,2], a task that software mention identification can be cast as. Neural models for NER provide outstanding recognition results but require a large training corpus with labelled entities. The provision of such labelled data is often the bottleneck of neural NER, as it is typically created manually by multiple annotators. Different approaches have been proposed to overcome this issue, for instance semi-supervised learning [26] and distant supervision [6]. Another approach is the use of a so-called silver standard corpus (SSC) [21], which, in contrast to a gold standard corpus (GSC), is created by automatic labelling with a combination of different classifiers. The quality of SSCs is much lower than that of GSCs. However, recent work showed that neural NER can be improved by transfer learning, where the network is first trained on an SSC and later on a GSC [10]. This reduces the necessary size of the GSC while at the same time increasing the recognition performance of the classifier.

The objective of this case study is to investigate whether and how weakly supervised classifiers can be employed to create an SSC for the extraction of software mentions from scientific publications which can later be used for transfer learning. We apply three weakly supervised classifiers on a small manually created GSC in order to create silver labels, which are then used for training a supervised classifier to predict the gold labels of the GSC.

The remainder of this paper is structured as follows. We first provide an overview of current approaches to NER in general and software mention identification in particular in Section 2. The applied weakly supervised classifiers for named entity extraction are described in Section 3 and our method for combining them in Section 4. The GSC is introduced in Section 5. Section 6 presents the evaluation and discussion of results before we conclude with Section 7.

2 Related work

Recent approaches to extracting software mentions from scientific publications can be divided into three groups: manual extraction, rule-based approaches, and supervised machine learning-based approaches. Manual approaches, sometimes called content analysis, typically work on small corpora with fewer than 100 articles or focus on particular software. Li et al. analysed the usage of the statistical software R [16] and LAMMPS [14] in 400 articles, while Nangia and Katz [19] and Howison [12] concentrated on software in general in 40 and 90 articles, respectively. An automatic approach to software mention identification is implemented in the BioNerDS system by Duck et al. [7], who used a rule-based system based on syntactic token features and a dictionary of known software names.
In later work, they employed a post-processing step based on supervised machine learning, which resulted in recognition rates of .67 F1. Another rule-based system, which automatically identifies mentions of R, was implemented by Li and Yan [15]. Due to their particular focus on R and a dictionary of R packages, they were able to reach recognition rates of .94 F1. Pan et al. [20] introduced an iterative bootstrapping approach to software mention identification which achieves .58 F1. Additionally, approaches exist that analyse references to software and code based on URLs to repositories [1,22]. However, there are currently no other supervised approaches for the identification of usage statements for software in scientific publications. One reason might be the lack of a dataset of sufficient size and quality.

For the related task of extracting dataset references from scientific literature, we again find both semi-supervised and rule-based systems as well as supervised approaches. Boland et al. [4] employed a pattern-based iterative bootstrapping algorithm, named InfoLink, and were able to reach a precision of up to 1, but with a very low recall of .3. Another semi-supervised approach was introduced by Ghavimi et al. [9], who used a dictionary of dataset names and employed similarity scores for identification, with a recognition rate of .85 F1. Lu et al. used supervised learning to identify datasets based on a training set of 1,000 sections obtained by active learning and achieved a precision of .82 and a recall of .59 [17].

NER on scientific texts has also been applied to other target entities, with a particular interest in biomedical publications, for instance for the identification of drugs, genes, proteins, and diseases [5]. This was also fostered by the BioCreative challenges, which addressed gene names as well as chemical and drug names. In this vein, Luo et al. employed neural models to identify chemical names in scientific texts with .91 F1 on a labelled corpus of 10,000 abstracts (training set: 3,500) with 84,355 labelled entities [18]. In detail, they used a BiLSTM-CRF with an additional attention layer, character-, word-, and dictionary embeddings, and other linguistic features. Chemical names differ from software names in their lexical structure: they typically exhibit combinations of characters, numbers, and special characters, while software names are often composed of common lexicon words. Beyond the domain of scientific publications, neural NER methods have reached superior recognition rates [13,2] but often require large training sets with several thousands of labelled entities. In addition, target entities with high occurrence frequencies are often chosen, which makes even small training sets more effective. As shown in Section 5, software mention statements, particularly in the social sciences, are very rare, which requires even larger training sets. One way to overcome this problem is the use of distant supervision to create large annotated corpora, which enables the training of sophisticated methods even for very fine-grained entity typing tasks [6]. A different approach to overcome the lack of large training datasets is the application of transfer learning, which allows trained concepts to be transferred, e.g. between different application domains or languages.
Giorgi and Bader recently illustrated the benefit of transfer learning for biomedical NER [10] on datasets with a small number of labels. They transferred a neural model for NER from a noisy SSC to a GSC, which led to substantial increases in the recognition rates.

To summarize, semi-supervised approaches can achieve high precision but suffer from low recall. Supervised approaches produce more reliable results but require large sets of labelled training data. The application of distant supervision and transfer learning allows labelled datasets to be created automatically and exploited for pre-training more powerful supervised methods.

3 Weakly supervised Named Entity Extraction

To overcome the data acquisition bottleneck for labelled corpora, we choose a small selection of openly available named entity extraction tools for the creation of silver labels. As described in the related work section, weakly supervised tools naturally suffer from relatively low recall. However, since they implement different algorithms and use different features, we expect the tools to produce diverging annotations, potentially complementing each other when combined.

3.1 BioNerDS

The Bioinformatics Named Entity Recogniser for Databases and Software (BioNerDS) [7] is a rule-based system for the recognition of software and databases in scientific publications from the domain of bioinformatics. Besides hard-coded rules, it employs a dictionary of software and database names collected from Wikipedia, Bioconductor, and other sources. BioNerDS implements a scoring system in which the scores of the different features are summed up and an entity type is assigned when a particular threshold is exceeded. The highest scores are provided by dictionary matches, but matches of Hearst patterns or positive head nouns also receive positive scores. Furthermore, the occurrence of a URL, a reference, or a version number is considered a positive hint. Negative scores are assigned, for instance, for matches with the English dictionary, negative head nouns, or partial word matches. The threshold for a positive classification was chosen to lie slightly below the score of a match with the dictionary of known entities. As a result, known entities are, given a positive context, almost certainly recognised.

3.2 InfoLink

InfoLink [4] is a weakly supervised, iterative, pattern-based bootstrapping approach developed for extracting dataset references from (social) scientific publications. Initially, seed words are searched in the corpus to identify patterns from their surrounding contexts. By alternating pattern identification and entity extraction, the dictionary of entities is increased iteratively. InfoLink relies on the surface form, i.e. the surrounding words of seed mentions, with some heuristics to normalize years and numbers and a frequency-based pattern scoring mechanism. Patterns consist of regular expressions and Lucene queries for increased efficiency.

3.3 SPIED

The Stanford Pattern-based Information Extraction and Diagnostics (SPIED) system [11] also implements a semi-supervised approach to named entity recognition.
In essence, it operates similarly to InfoLink but includes different and more complex scoring mechanisms and features, such as edit distance-based features, distributional similarity, and TF-IDF weighting; in addition, its patterns include part-of-speech (POS) information rather than relying solely on surface strings.

4 Method

We first apply each weakly supervised tool separately on the corpus to retrieve a list of patterns and terms classified as software mentions. We create one BIO file for each tool and corpus. (The BIO format is a common format for annotated texts in named entity recognition: each token receives a Begin, In, or Outside tag signalling whether it is the first token of an entity of interest (B), a subsequent token of such an entity (I), or not part of any entity to annotate (O).) For this purpose, we search all retrieved terms in the input texts and treat each occurrence as a software mention. For InfoLink, we additionally receive as output a list of regular expression patterns that can be applied directly to the input texts without requiring further pre-processing. We create a second BIO file for InfoLink by searching these patterns in the input texts. Since this has the potential to disambiguate software mentions from homonymous other entities, we use these predictions in our combined classifier but keep the term search variant for comparison.

Weakly supervised approaches depend to a large part on the usefulness of their given seeds. Since our aim is to generate a silver standard for conditions where no or little training data is available, we do not use knowledge about the distribution of software mentions in the training data to construct a seed set. Instead, we use Wikidata for distant supervision. Since we are mainly interested in finding software that is used for processing and analysing data in social scientific publications, in order to gain provenance information on generated data and findings, we query Wikidata for all instances belonging to the classes "statistical package" or "mathematical software". Note that while it would also be possible to use an extensive list of all known software names, this would introduce more noise because software names often consist of common nouns (see Section 2), while providing little extra information relevant to our use-case. We instead rely on the weakly supervised approaches for expanding the list of software names. We incorporate all language variants and alternative names listed in Wikidata. This results in a list of 47 software names, of which 10 and 8 are mentioned at least once in the training and test set, respectively.

In the second step, we combine the predictions of all tools and use their majority votes as silver labels. As a supervised approach, we model the extraction of software mentions as a sequence labelling task using Conditional Random Fields (CRFs). The CRF is trained on the silver labels and may use the tools' individual predictions and additional output as features. This additional output comprises confidence values for InfoLink and BioNerDS as well as information on the rules employed by BioNerDS. In addition, we permit the CRF to use a small number of simple features as further cues. These are listed in Table 1.

Table 1. Features used for the CRFs: dependency tag, fine-grained POS tag, coarse POS tag, surface form, lemma, is alpha, is stop, shape, sentence length, sentence number, word number.
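The following sketch illustrates the two steps just described: deriving silver labels by majority vote over the per-token BIO predictions of the weakly supervised tools, and training a CRF on these silver labels with simple per-token features in the spirit of Table 1. It is a minimal sketch rather than the implementation used in this study; the library choices (sklearn-crfsuite for the CRF, spaCy for the linguistic features), the helper names, and the toy sentence are our own assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation): majority-vote silver labels
# plus CRF training. Assumes sklearn-crfsuite and spaCy with the small English
# model installed (python -m spacy download en_core_web_sm).
from collections import Counter
from typing import Dict, List

import sklearn_crfsuite
import spacy

nlp = spacy.load("en_core_web_sm")


def majority_vote(tool_labels: List[List[str]]) -> List[str]:
    """Combine aligned per-token BIO predictions of several tools by majority vote."""
    silver = []
    for token_votes in zip(*tool_labels):
        label, count = Counter(token_votes).most_common(1)[0]
        # Require a strict majority; fall back to 'O' otherwise.
        silver.append(label if count > len(token_votes) / 2 else "O")
    return silver


def sentence_features(sentence: str) -> List[Dict]:
    """Simple per-token features in the spirit of Table 1 (spaCy is an assumption here)."""
    doc = nlp(sentence)
    return [
        {
            "dep": tok.dep_,        # dependency tag
            "tag": tok.tag_,        # fine-grained POS tag
            "pos": tok.pos_,        # coarse POS tag
            "text": tok.text,       # surface form
            "lemma": tok.lemma_,
            "is_alpha": tok.is_alpha,
            "is_stop": tok.is_stop,
            "shape": tok.shape_,
            "sent_len": len(doc),
            "word_number": tok.i,
        }
        for tok in doc
    ]


# Hypothetical usage: one toy sentence, one aligned BIO sequence per tool.
# (In this example spaCy's tokenisation coincides with whitespace tokenisation.)
sentences = ["All analyses were carried out using SPSS version 21 ."]
predictions_per_tool = [
    [["O", "O", "O", "O", "O", "O", "B", "O", "O", "O"]],  # e.g. BioNerDS
    [["O", "O", "O", "O", "O", "O", "B", "O", "O", "O"]],  # e.g. InfoLink (pattern search)
    [["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]],  # e.g. SPIED
]

X = [sentence_features(s) for s in sentences]
y_silver = [
    majority_vote([tool[i] for tool in predictions_per_tool])
    for i in range(len(sentences))
]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y_silver)
print(crf.predict(X))
```

In the same way, the tools' additional outputs mentioned above (individual predictions, confidence values, rule information) could simply be added as further keys to the per-token feature dictionaries.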
The threshold for accepting or rejecting patterns has to be set manually for InfoLink. Since we do not want to rely on annotated data for parameter tuning, we use the configuration that was optimal for the extraction of dataset references [4].

5 Dataset and Preprocessing

In order to measure the quality of our approach, we created a GSC of articles from the social sciences published by PLoS (https://www.plos.org/). Out of all articles having the keyword "Social sciences", we randomly selected 200. Following [7], we automatically extracted all "Methods and Materials" sections, as software mentions are expected to primarily occur there. Eight articles were removed from the set as they did not contain a "Methods and Materials" section. The resulting texts were annotated with the brat annotation software [24] by six annotators who were instructed to annotate software names without additional information such as producers or versions. For about 10% of the sentences, randomly selected across all annotators, a second annotation was obtained in order to assess the quality of the annotation. The inter-rater agreement was computed using Cohen's κ and reached an almost perfect agreement of κ=.82.

Overall, 462 (263 unique) software mention statements were found across all articles during annotation. Their distribution is detailed in Table 2. The number of articles that contained no software mentions at all was 45 (23%).

Table 2. Number of articles with the given numbers of software mentions.

# software   0     1     2     3     4    5    6    >6   sum
# articles   45    46    43    21    12   7    6    12   192
% articles   23.4  24.0  22.4  10.9  6.3  3.6  3.1  6.3  100

Table 3 lists the 10 most common software names including their frequencies in the training and test set. Note that software may be listed multiple times with different spellings, e.g. for "Matlab" and "SPSS". Since our aim at this point is the identification of software mention statements rather than their disambiguation and linking, we do not align these different variants. Table 4 lists the number of unique software mentions that occurred at least n times in the corpus.

Table 3. The 10 most common software mentions and their numbers of occurrences overall and in the training and test set, respectively. An X signals that the Wikidata seeds contain the software name.

            PLINK  Matlab  MEGA  MATLAB  SPM8  Prism  SPSS  Stata  SAS  R software
# overall   17     16      15    13      12    12     11    10     9    9
# train     12     13      9     9       10    11     5     5      9    9
# test      5      3       6     4       2     1      6     5      -    -
wikidata    X      X       X     X       X     -      -     X      -    -

Table 4. Number of different software mentions occurring with the respective frequencies.

# occurrences  ≥13  ≥12  ≥11  ≥10  ≥9  ≥6  ≥5  ≥4  ≥3  ≥2  ≥1
# software     1    2    3    4    8   9   11  14  21  52  215

The annotated corpus was split into sentences using the NLTK sentence splitter [3], resulting in 12,480 sentences. Afterwards, a whitespace-based tokenisation was applied, resulting in 347,544 tokens. The annotated token sequence was finally represented as a BIO sequence; 462 of these tokens were annotated with the begin tag, 120 with the in tag, and the remaining ones with the outside tag. From this corpus, we created a training and a test set with 75 and 25 percent of the articles, respectively.

6 Evaluation

6.1 Metrics

To measure the performance of the software mention detection task, we distinguish between exact and partial matches and compute precision, recall, and F-measure for each of them.
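This distinction can be illustrated with a small sketch of span comparison. This is not the SemEval evaluation script used in the paper (referenced below); the span representation, the helper names, and the simplified recall computation are our own assumptions for illustration.

```python
# Illustrative sketch of exact vs. partial span matching (a simplification,
# not the SemEval 2013 evaluation script used in the paper).
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], pred: List[Span], mode: str = "exact"):
    """Return precision, recall and F1 for exact or partial span matching."""
    if mode == "exact":
        tp = sum(1 for p in pred if p in gold)
    else:  # partial: any overlap between a prediction and a gold span counts
        tp = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Example: the prediction covers only part of the first gold span
# and misses the second gold span entirely.
gold = [(6, 8), (12, 13)]
pred = [(6, 7)]
print(score(gold, pred, "exact"))    # (0.0, 0.0, 0.0)
print(score(gold, pred, "partial"))  # (1.0, 0.5, 0.666...)
```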
Exact match means that the entire name of the software was recognised with the correct range, while a partial match signals that a certain overlap between the label and the prediction exists. We used the SemEval 2013 evaluation script by David Batista. (The original script can be obtained from https://github.com/davidsbatista/NER-Evaluation/blob/7de8a231d5fd94ced0ef10c42971a30cd3b744b3/ner_evaluation/ner_eval.py; we adjusted the calculation of the overlapping range by an offset of 1 and added the calculation of F1 scores.)

6.2 Experimental setup

We measure the performance of the weakly supervised approaches, individually and in combination, as well as of direct distant supervision using labels from Wikidata, and assess their applicability for transfer learning by using the silver labels as ground truth for training a CRF (Silver CRF). For combining the predictions of the weakly supervised approaches and creating silver labels, we test three different methods:

1. majority: majority vote of the predicted labels
2. conservative: tokens are only labelled as belonging to a software mention if all classifiers agree on it belonging to this category
3. greedy: tokens are labelled as belonging to a software mention when at least one classifier labels them as such

The conservative and greedy conditions are expected to maximize precision and recall, respectively. As an upper bound, we train a CRF on the gold labels of our GSC (Gold CRF).

To evaluate the robustness of the approach with respect to seed selection and to give insights into the usefulness of the Wikidata seeds, we illustrate the effects of choosing different seed sets for the weakly supervised approaches. For this, we create bins of software mentions depending on their number of occurrences in the training set (a schematic sketch of this binning is given below, after the discussion of Figure 1). The intuition is that the most frequently mentioned software names will also be the most well-known and can be identified without consulting external knowledge sources. The less frequent a mention is, the less likely it is to be incorporated into a seed set when the occurrence of software mentions in the corpus is not known in advance, which is typically the case. Finally, we test the effects of using the silver labels and the outputs of the weakly supervised approaches as additional features for the Gold CRF.

For the weakly supervised approaches and the direct labelling of software mentions using Wikidata supervision, we evaluate the performance on both the training and the test set. The CRFs are trained on the training set and evaluated on the test set.

Fig. 1. F-scores of the weakly supervised tools with distant supervision using different seed sets.

6.3 Results

The performance of the weakly supervised tools with distant supervision and the influence of the choice of seeds is illustrated in Figure 1. The X axis represents the different seed sets used; 13 denotes the set containing all software mentions occurring at least 13 times in the training set (the maximum number), 12 the set of all mentions occurring at least 12 times, and so forth (see Table 4). Wikidata represents the seed set obtained by querying Wikidata. As expected, performance generally increases when seeds are added. Especially when the number of seeds exceeds a certain threshold (4 and 9 seeds in this case, i.e. when only seeds occurring at least 10 or 6 times are used, respectively), there is a significant increase in performance.
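For concreteness, the frequency-threshold seed sets shown on the X axis of Figure 1 could be derived from the annotated training data roughly as follows. This is a hypothetical sketch rather than the procedure's actual implementation, and the toy counts do not reflect the corpus statistics from Table 4.

```python
# Hypothetical sketch: build one seed set per frequency threshold from the
# annotated training mentions (names and structures are illustrative only).
from collections import Counter
from typing import Dict, List, Set


def seed_sets_by_frequency(training_mentions: List[str],
                           thresholds: List[int]) -> Dict[int, Set[str]]:
    """For each threshold n, return the software names mentioned >= n times."""
    counts = Counter(training_mentions)
    return {
        n: {name for name, c in counts.items() if c >= n}
        for n in thresholds
    }


# Toy example with made-up counts.
mentions = ["SPSS"] * 11 + ["Stata"] * 6 + ["PLINK"] * 3
print(seed_sets_by_frequency(mentions, [10, 6, 3]))
# {10: {'SPSS'}, 6: {'SPSS', 'Stata'}, 3: {'SPSS', 'Stata', 'PLINK'}}
```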
At the same time, adding seeds can harm performance for the pattern induction approaches, as ambiguous and rare mentions may increase the likelihood of generating deficient extraction patterns. The seed set obtained from Wikidata leads to performance close to that of the optimal seed set, which shows that distant supervision using lists of well-known software for seeding the algorithms is a feasible approach. The numbers for term search show the impact of the used seeds for comparison. When near-complete information on the mentioned software is available, there is little or no gain from applying weakly supervised approaches in addition to searching for the known names directly. However, even then precision may suffer from ambiguous names that may refer to software or to other entities, such as the software package "R", which has a high influence when used in a set with only 3 other, less ambiguous seeds (the set of seeds with ≥10 mentions). The performance of the two pattern generating approaches (SPIED and InfoLink) on the training set is considerably worse than on the test set. An analysis of the induced patterns reveals that this is due to the higher number of ambiguous software names in the former, more precisely the high number of occurrences of the software "R", which causes the generation of deficient patterns. For InfoLink, the pattern search variant succeeds in disambiguating software mentions from homonyms not referring to software, as reflected by its higher precision compared to the term search variant. However, many software mentions are missed, reducing recall considerably. InfoLink yields the best results for partial matches on both the training and the test set. Yet, it also has the highest divergence between the scores for exact and partial matches, reflecting its strength in detecting mentions but its weakness in determining the exact boundaries of the matches. This is caused by its reliance on surface features rather than incorporating knowledge gained from linguistic features such as POS tags.

Fig. 2. Comparison of the performances of the different classifiers on the test set using the Wikidata software names as seeds / for distant supervision.

The results for the combination of the different tools and their usage for silver standard generation are illustrated in Figure 2. The upper bound for the classifier (Gold CRF) reaches .54 F1 on the test set. The majority vote silver labels obtain .41 F1, with the recall being closer to the upper bound than the precision. The greedy variant achieves the same F-score but is biased towards maximizing recall at the cost of precision, yielding higher recall values than the Gold CRF. The conservative variant suffers from low recall, causing its F-score to be low (.2) while achieving a higher precision than the Gold CRF. The combination of the weakly supervised approaches with distant supervision outperforms both the direct creation of silver labels from the Wikidata software names and the application of the approaches individually. The Silver CRFs achieve lower scores than the direct application of the weakly supervised approaches on the test set. We attribute this to the higher difficulty of the training set, which results in decreased performance for the pattern induction approaches. These noisy labels are used for training the classifier, which is then applied on the test set, while the weakly supervised approaches are applied on the easier test set directly.
Finally, the best result is achieved by feeding the silver labels as additional features to the Gold CRF. While this has a slightly negative impact on precision, it increases recall by a larger margin, resulting in .6 F1 with a still very high precision of .87.

7 Conclusion and Outlook

We investigated the use of weakly supervised classifiers and Wikidata for distant supervision for the extraction of software mentions from social scientific publications without requiring manual annotations. We compared generating silver labels by directly labelling mentions according to the Wikidata information with using these names as seeds for different information extraction tools. We showed that, in doing so, a silver standard with relatively high-precision annotations can be created that may serve to pre-train more powerful algorithms using transfer learning. Since each classifier uses different features and scoring mechanisms, their combination yields the best results, showing that they partly complement each other. Furthermore, we showed that the predictions of weakly supervised classifiers can provide useful features for supervised methods, which leads to good results even with only a small training set. In this case study, we employed a small set of basic features for the supervised approaches to demonstrate the feasibility of the approach. In future work, we will use more sophisticated features and supervised classifiers with transfer learning to exploit the generated SSC and extract software mentions from larger collections.

References

1. Allen, A., Teuben, P.J., Ryan, P.W.: Schroedinger's code: A preliminary study on research source code availability and link persistence in astrophysics. The Astrophysical Journal Supplement Series 236(1), 10 (May 2018). https://doi.org/10.3847/1538-4365/aab764
2. Beltagy, I., Cohan, A., Lo, K.: SciBERT: Pretrained contextualized embeddings for scientific text (2019)
3. Bird, S., Loper, E., Klein, E.: Natural Language Processing with Python. O'Reilly Media Inc (2009)
4. Boland, K., Ritze, D., Eckert, K., Mathiak, B.: Identifying references to datasets in publications. In: International Conference on Theory and Practice of Digital Libraries. pp. 150–161. Springer (2012)
5. Campos, D., Matos, S., Oliveira, J.L.: Biomedical named entity recognition: a survey of machine-learning tools. In: Theory and Applications for Advanced Text Mining. IntechOpen (2012)
6. Choi, E., Levy, O., Choi, Y., Zettlemoyer, L.: Ultra-fine entity typing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 87–96. Association for Computational Linguistics (2018), http://aclweb.org/anthology/P18-1009
7. Duck, G., Kovacevic, A., Robertson, D.L., Stevens, R., Nenadic, G.: Ambiguity and variability of database and software names in bioinformatics. Journal of Biomedical Semantics 6(1), 29 (2015)
8. Eklund, A., Nichols, T.E., Knutsson, H.: Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences p. 201602413 (2016)
9. Ghavimi, B., Mayr, P., Lange, C., Vahdati, S., Auer, S.: A semi-automatic approach for detecting dataset references in social science texts. Information Services & Use 36(3-4), 171–187 (2016)
10. Giorgi, J.M., Bader, G.: Transfer learning for biomedical named entity recognition with neural networks (Feb 2018). https://doi.org/10.1101/262790
11. Gupta, S., Manning, C.D.: Improved pattern learning for bootstrapped entity extraction. In: Computational Natural Language Learning (CoNLL) (2014)
12. Howison, J., Bullard, J.: Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology 67(9), 2137–2155 (2016)
13. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/n16-1030
14. Li, K., Lin, X., Greenberg, J.: Software citation, reuse and metadata considerations: An exploratory study examining LAMMPS. Proceedings of the Association for Information Science and Technology 53(1), 1–10 (2016)
15. Li, K., Yan, E.: Co-mention network of R packages: Scientific impact and clustering structure. Journal of Informetrics 12(1), 87–100 (2018)
16. Li, K., Yan, E., Feng, Y.: How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics 11(4), 989–1002 (2017)
17. Lu, M., Bangalore, S., Cormode, G., Hadjieleftheriou, M., Srivastava, D.: A dataset search engine for the research document corpus. In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on. pp. 1237–1240. IEEE (2012)
18. Luo, L., Yang, Z., Yang, P., Zhang, Y., Wang, L., Lin, H., Wang, J.: An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34(8), 1381–1388 (Nov 2017). https://doi.org/10.1093/bioinformatics/btx761
19. Nangia, U., Katz, D.S.: Understanding software in research: Initial results from examining Nature and a call for collaboration. In: 2017 IEEE 13th International Conference on e-Science (e-Science). pp. 486–487 (Oct 2017). https://doi.org/10.1109/eScience.2017.78
20. Pan, X., Yan, E., Wang, Q., Hua, W.: Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics 9(4), 860–871 (2015)
21. Rebholz-Schuhmann, D., Yepes, A.J.J., Mulligen, E.M.V., Kang, N., Kors, J., Milward, D., Corbett, P., Buyko, E., Beisswanger, E., Hahn, U.: CALBC Silver Standard Corpus. Journal of Bioinformatics and Computational Biology 08(01), 163–179 (Feb 2010). https://doi.org/10.1142/s0219720010004562
22. Russell, P.H., Johnson, R.L., Ananthan, S., Harnke, B., Carlson, N.E.: A large-scale analysis of bioinformatics code on GitHub. PLOS ONE 13(10), e0205898 (Oct 2018). https://doi.org/10.1371/journal.pone.0205898
23. Smith, A.M., Katz, D.S., Niemeyer, K.E.: Software citation principles. PeerJ Computer Science 2, e86 (Sep 2016). https://doi.org/10.7717/peerj-cs.86
24. Stenetorp, P., Pyysalo, S., Topic, G., Ohta, T., Ananiadou, S., Tsujii, J.: brat: a web-based tool for NLP-assisted text annotation. In: EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 23-27, 2012. pp. 102–107 (2012), http://aclweb.org/anthology/E/E12/E12-2021.pdf
25. Zeeberg, B.R., Riss, J., Kane, D.W., Bussey, K.J., Uchio, E., Linehan, W.M., Barrett, J.C., Weinstein, J.N.: Mistaken identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5(1), 80 (2004). https://doi.org/10.1186/1471-2105-5-80
26. Zhou, Z.H.: A brief introduction to weakly supervised learning. National Science Review 5(1), 44–53 (2017)