1st Workshop on AI + Informetrics - AII2021

Building the Extraction Model of the Software Entities from Full-Text of Research Articles Based on BERT

Chuan Jiang1[0000-0003-2436-9411], Dongbo Wang1[0000-0002-9894-9550], Si Shen2[0000-0002-6990-410X], Wenhao Ye3[0000-0003-2811-4248], Jiangfeng Liu1[0000-0001-7268-7313]

1 College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
2 School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
3 School of Information Management, Nanjing University, Nanjing 210023, China

Abstract. Software entities in the full-text of research articles are vital academic resources. Extracting software entities from the full-text of research articles can improve the process of knowledge organisation and is an important aspect of knowledge entitymetrics. In this study, the full-text of research articles was collated from the journal Scientometrics from 2010 to 2020. The extracted software entities were subjected to metric analysis and mining from different perspectives, such as the distribution in the different structures of articles, the number of mentions and citations, and the time-series evolution. To build an automated software entity extraction model and thereby provide tools for entitymetrics, machine learning and deep learning models, namely, the conditional random field (CRF), Bi-LSTM-CRF, and the bidirectional encoder representations from transformers (BERT), were established. The highest F1 value, 84.99%, was achieved with BERT. The future implications of the study include the application of the BERT-based model to software entity extraction from other journals to deepen the mining and the analysis of software entities from multiple perspectives.
Keywords: Deep learning; BERT; Full-Text; Entitymetrics

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The growth of full-text research article data has been accompanied by continuous progress in data analysis and extraction technology. The demand for deep mining of the full-text of research articles has increased in the fields of bibliometrics and knowledge organisation owing to the improvement in the metric analysis and the visualisation research of the literature metadata. Consequently, academic research on the full-text of research articles has gained widespread attention.

Software is a crucial aspect of academic research, as it is essential for facilitating academic findings and interdisciplinary exchange and collaboration; most current research is driven by data and software (Nangia and Katz 2017). Extracting software entities from the full-text of research articles innovates the form of knowledge organisation and is key to knowledge entitymetrics for the full-text of research articles (Ding et al. 2013). Currently, software entities are predominantly extracted from metadata and footnotes, generally by using dictionaries and heuristic rules. However, these extraction methods typically encounter drawbacks, such as low data volume and limited extraction performance. Therefore, extraction methods with greater accuracy are required for bibliometrics.

To fill this gap, this study uses corpus technology to develop a corpus of software entities from the full text of the research articles published in Scientometrics from 2010 to 2020. The software entities were extracted from Scientometrics using the BERT model, a recent deep learning model for natural language processing (NLP).
These entities were subjected to metric analysis and mining from different perspectives, such as the distribution in the different structures of articles, number of mentions and citations, and time-series evolution. The distribution laws of the software entities extracted from Scientometrics were discussed. Consequently, the research trend and the objects in Scientometrics were retrospectively analysed from 2010 to 2020.

2 Literature Review

2.1 Content Analysis of Full-Text of Research Articles and Its Applications

The content analysis of the full-text of research articles involves deep mining of the structure of the articles, citations, lexical style, syntax, and the topics. It can be combined with conventional bibliometrics to improve the reliability and the accuracy of the latter. The content analysis of the full-text of research articles is based on the citation, linguistic, and thematic perspectives.

Citation Perspective. Combining the conventional citation count with content analysis can increase both the depth and the width of the citation content analysis and improve the clustering performance. Zhang et al. (2013) proposed a new framework for the citation content analysis, which employs NLP techniques to analyse the syntax and the semantics of the documents. This framework was suitable to analyse the characteristic citation style and the background behind the citation. Jeong et al. (2014) collected the full-text of the research articles from JASIST and proposed an improved method for the author co-citation analysis. This method involved measuring the similarities between the authors by considering the citation contents and presented more details than the conventional method. Hu et al. (2017) collected the full-text of the research articles from the Journal of Informetrics to analyse the conditions under which citations were mentioned repeatedly in the full text, such as lengthy research articles and self-citations.
Moreover, these citations were usually repeated in the same section of a research article. Small et al. (2017) acquired the full-text research articles data from PubMed and analysed the contextual sentences for the citations through lexical analysis and machine learning. This method was intended to be applied to the knowledge discovery process in biomedicine. To study the distinctions and the relationship between the highly cited methodology and the non-methodology papers, Small (2018) collected the full-text of the research articles from PubMed. Methodology and non-methodology papers in biomedicine were differentiated with the help of the corpus technology and with machine learning. It was observed that the methodology papers were predominant among the highly cited papers. Bu et al. (2018) collected the full-text of the research articles from JASIST. They improved the conventional author co-citation analysis by introducing the number of citation mentions and the contextual vocabulary for citations. The improved method presented a better clustering performance.

Linguistic Perspective. The full-text of research articles can be analysed by the NLP techniques and by the linguistic methodology to understand the style, language habits, and text structure. Yan and Zhu (2018) collected article abstracts from PubMed to explore the semantic variations of the words used in biomedical literature and applied the Word2Vec model and the topic model to this end. Li et al. (2018) collected the abstract data from WOS and extracted the verbs and the nouns from the WOS-related sentences. They observed that WOS was usually mentioned as a database. To study the language distribution in non-formal academic exchanges, Yu et al. (2018) collected the full-text of the research articles from Scopus and Tweets.
They reported an interdisciplinary difference in the language distribution in academic tweets and literature and concluded that English had become a common language in non-formal academic exchanges. To reveal the difference between non-native and native English-speaking scholars in writing, Lu et al. (2019) collected the full-text of the research articles from a PLoS journal. They attempted to differentiate the writing style based on the syntactic and the lexical perspectives. Thelwall (2019) collected the full-text of the research articles from PubMed Central and reported substantial differences in the article structure between the literature of various fields.

Topic Analysis Perspective. Topic mining of the full-text of research articles can derive more information than the conventional methodology and exhibits better performance. Glänzel and Thijs (2017) applied the NLP techniques and clustering to the full text of the research articles in astronomy and astrophysics to characterise and recognise the corresponding topics and clusters. To identify the preferred topics in the bibliometrics journals, Zhang et al. (2018) collected the full-text of the research articles from the top three journals in bibliometrics and proposed a topic extraction method that combined K-means and Word2vec. Thijs and Glänzel (2018) acquired the full-text of the research articles from Scientometrics and applied the Stanford NLP method to extract noun phrases from the full-text. They studied the contribution of the vocabulary composition to the hybrid clustering of the topics in the full-text of research articles. They reported consistency between the word clusters and the bibliographic coupling.

The content analysis of the full-text of research articles integrated with conventional bibliometrics is beneficial in establishing prominent findings in research articles and in improving the performance of conventional extraction methods.
2.2 Entitymetrics and Knowledge Discovery Based on Full-Text of Research Articles

In recent years, there has been a tremendous increase in the volume of full-text research articles data along with the progress in analysis and extraction techniques. Increasing attention has been drawn to fine-grained entitymetrics and knowledge mining of the full-text of research articles.

Entitymetrics. The concept of entitymetrics was first proposed by Ding et al. (2013) as a measure of the influence of the knowledge unit. Entitymetrics highlights the importance of entities in the process of knowledge discovery. The entities are further divided into knowledge entities and evaluation entities. The recent studies conducted on entitymetrics have been primarily concerned with the metrics of the knowledge entities, including software, data resources, algorithms, models, domain entities, and terms. Dictionaries and heuristic rules are the commonly used methods to extract entities from the full-text of research articles.

A majority of academic researchers currently use software for research. Some scholars have found through surveys that software entities often have a low citation rate and that software is shared only in a few specific disciplines (Park and Wolfram 2019). Li et al. (2017) investigated the citation of the R package to account for the granularity of the software entities and observed that the citation of the R package was inconsistent with that of the R software. The authors proposed the fine-grained citation of the single functions of the software environment and the package. Recent studies based on software entity metrics have generally considered the full-text of the research articles from PLoS ONE as the object. The software entities are extracted by using various methods, such as bootstrapping, dictionaries, and machine learning.
The influence of software entities is assessed in terms of citations, mentions, and frequency, and the relationship between the packages is analysed by developing an entity network (Yan and Zhu 2015; Yan and Pan 2015; Pan et al. 2015; Zhao et al. 2018). Wang and Zhang (2018) extracted algorithmic entities using dictionaries and rules from the full-text of the research articles from ACL. The academic influence of the algorithmic entities in the field of NLP was then studied by considering the number and the position of the mentions. They reported that the support vector machine (SVM) exhibited greater influence. Zhao et al. (2018) collected the full text of the research articles from PLoS ONE to analyse the mentions and citations of the database entities, and they observed that the dataset reuse rate was less than 30%. This indicates that researchers tend to develop their own datasets. Ding et al. (2013) extracted entities such as genes and drugs from the full-text of research articles in PubMed to account for the field-specific terms and entities. They developed a field-specific entity citation network to assess the influence of entities in the biomedical field. Chen and Luo (2019) collected textual data from the Web of Science and Scopus databases, and entities were extracted from the abstracts by using the BERT model. A network for inference on the scholarly knowledge graph was developed to enhance the metric analysis of the research articles.

The above studies are primarily focused on specific fields. Among the studies on the extraction of field-independent terms, Yan et al. (2017) applied NLP techniques and rules to the full-text of the research articles from PLoS ONE. They extracted the academic terms of various disciplines and discovered a power-law pattern in the frequency distribution of the terms in each discipline.
Chen and Yan (2017) employed the NLP techniques and scoring rules to extract terms from the abstracts of research articles. A term network was developed to analyse the importance of multi-field terms and their time-series evolution.

Knowledge discovery. Entity extraction from the full-text of research articles can facilitate the assessment of the influence of entities and also contribute to knowledge discovery. Current studies on entity extraction for knowledge discovery are mainly limited to the biomedical field wherein the use of dictionaries and machine learning is preferred. The correlations between the knowledge entities are mined by network analysis. Lv et al. (2018) collated the abstracts of papers on autism from PubMed to analyse the topological correlation between the drugs for autism and to mine the recent trend in the drug development. The drug entities were extracted based on the MeSH Translation Table, and the drug entity network was then developed. Yu et al. (2015) applied dictionaries to extract the medical database entities from PubMed. They developed a database link network to analyse the topological structure and the main paths of the network based on which the database use, link, and evolution were tracked. Song et al. (2015) extracted the biomedical entity correlations from the abstract dataset of PubMed by using machine learning and rules. They developed a biomedical entity network and proposed a semantic path-based method for biomedical knowledge discovery. Song et al. (2013) used the conditional random field (CRF) and the Unified Medical Language System to extract the gene entities from the abstracts of papers in MEDLINE. The gene entity network was established based on the citation correlations. The interactions between the gene entities were identified by network analysis.

Studies conducted on entitymetrics based on the full-text of research articles generally consider knowledge entities as objects (e.g., software).
Meanwhile, studies conducted on knowledge discovery are primarily focused on the biomedical field, where the entities are extracted by rules and dictionaries. Although such extraction methods exhibit high precision and speed, their recall is low, and they generally fail to recognise new entities in the field. They also face the drawbacks of small data size and a limited number of open datasets; that is, the data are insufficient for large-scale, accurate entitymetrics and knowledge discovery.

3 Metric Analysis of Software Entities

The data used in this study were obtained from the papers of Scientometrics from 2010 to June 2020. A Web crawler was employed, and 3,522 papers published in Scientometrics were collected in total, as shown in Fig. 1.

Fig. 1. Distribution of scholar documents in Scientometrics from 2010 to 2020

The PyQuery package (PASGRIMAUD 2017) was used to parse the HTML full text of the research articles. Letters, reports, and inaccurate and blank data were eliminated. Thus, 3,493 full-text research articles comprising 156,318 paragraphs were obtained and stored in a MySQL database.

For the acquired full-text data, this research uses the Brat annotation platform (Stenetorp et al. 2012) to label the software entities in the academic full texts. As shown in Fig. 2, Brat is a tool widely used in the field of natural language processing; it can be used to label entities and entity relationships, build syntax trees, and so on. This research first uses the Stanford CoreNLP toolkit (Manning et al. 2014) to segment the paragraphs in the MySQL database, converting the collection of academic full-text chapters into a collection of sentences, and then imports the sentences into the Brat annotation platform to construct the scientometrics software entity corpus by manual annotation.
Based on this, the software entity extraction model is constructed, and wrongly annotated software entities are corrected by analysing the differences between the prediction results of the software entity extraction model and the corpus labelling results. Finally, we obtained the scientometrics software entity annotation corpus.

Fig. 2. The example of manual annotation software entities in Brat

To better illustrate the software entities in the academic full text, we provide three sentences containing software entities:

• The patent networks were drawn by using UCINET 6.0 (Borgatti et al.). In the following section, we describe the structural features of patent networks in overall network and cluster levels.
• After extracting binary relations that appear together in each sentence of patents using the Stanford parser, unintended or too-general binary relations are filtered out using English stopwords (STOPWORDS).
• Full-text search is supported using Solr (The Apache Software Foundation) to index the contents of the database.

3.1 Distribution of Software Entities in Different Structures of Articles

The distribution of the software entity mentions in different structures of the articles was further analysed in this study. The collected full-text research articles were divided into several parts according to the common classification method for the structure of the articles proposed by Ding et al. (2013). The collected documents were divided into the following parts by the manual annotation process: Introduction, Related Work, Method, Experiment and Result, and Discussion and Conclusion. As shown in Fig. 3, the software entities were predominantly mentioned in Method and in Experiment and Result. This distribution pattern corresponds with the general organisation of a research article.
In research papers, the tools and the software are generally introduced in these two sections, with the steps of the software implementation occasionally being explained in greater detail. The software entity mentions also appear frequently in Related Work, where the previous studies involving the use of the relevant software are generally reviewed. Conversely, the Introduction and the Discussion and Conclusion primarily describe the significance and the contribution of the study and do not often mention the software entities.

Fig. 3. Distribution of the number of software entity mentions in different structures of articles

3.2 Mentions and Citations of Software Entities

The number of software entity mentions in the full-text of the research articles from Scientometrics from 2010 to 2020 was analysed to determine whether recent academic efforts were driven by software. According to Fig. 4, the number of software entity mentions increased over the years, with a peak of 2,047 mentions in 2018. This result indicates that studies published in Scientometrics have attached increasing significance to the use of software.

Fig. 4. Distribution of the number of software entity mentions in Scientometrics

The number of software entity mentions has been increasing in recent years, with software entities becoming vital academic knowledge in research articles. The standardised citation of software entities is intended to recognise the software developers and is an academic norm that must be conformed to. The distribution of the number of software entity citations in Scientometrics was analysed from 2010 to 2020. As shown in Fig. 5, the number of software entity citations has increased overall over the years. This indicates the increasing significance attached to standardised software citations in the research articles.

Fig. 5. Distribution of the number of software entity citations in Scientometrics

The correlation between the number of software entity mentions and the corresponding number of documents was analysed from 2010 to 2020, as shown in Fig. 6. A power law was observed in this correlation over the years. In particular, about 80% of the research articles had seven or fewer software entity mentions; only 20% of the research articles had more than seven. This finding was consistent with the conclusion of Pan et al. (2015) for PLoS ONE data.

Fig. 6. Relationship between the number of software mentions and the number of documents

3.3 Time-Series Analysis of Software Entities

The top 10 software entities in terms of the number of mentions were analysed from 2010 to 2020. The time-series analysis presented the research trends and the objects in the papers published in Scientometrics within the past decade. Consequently, a basic picture of the mainstream methodology used in academic research can be obtained and the prevailing topics in this field can be reviewed. Manual proofreading was performed for the extracted software entities to obtain statistics with greater accuracy. Problems such as inconsistency in the version number, capitalisation, and software abbreviations and acronyms were rectified by merging variants, for example, “AMOS, AMOS 22,” “JAVA, Java, java,” and “Excel, MS Excel.”

Table 1. Top 10 most frequently mentioned software entities from 2010 to 2015

As presented in Tables 1 and 2, the top 10 software entities in terms of the number of software entity mentions from 2010 to 2020 included the bibliometrics tools, VOSviewer and CiteSpace; the software packages for the analysis of network data, Pajek and UCINET; the data analysis tools, Excel, SPSS, and R; and the citation management tools, Mendeley and CiteULike. The data from Google and Twitter are generally accessed through APIs to obtain the metadata and full texts of research papers.
An API is an interface that facilitates interactions between various software programs. An API has been employed in this study as well.

VOSviewer appeared in the yearly top 10 in 10 of the 11 years and was absent only in 2013. CiteSpace appeared five times. This result indicates that the bibliometrics research papers published in Scientometrics preferred the use of VOSviewer. Among the network data analysis tools, Pajek appeared eight times in 11 years and UCINET seven times. This indicates that Pajek and UCINET were the mainstream software for analysing network data. SPSS and Excel were the primary tools for journal sorting and statistical analysis used in the research papers from Scientometrics. Each of these two software packages appeared 10 times in 11 years.

Table 2. Top 10 most frequently mentioned software entities from 2016 to 2020

Altmetrics research has undergone continuous development in recent years as a complement to bibliometrics research. Certain citation management tools are used in combination with the data from Twitter and Google. Research papers and the academic influence of the authors can thus be better analysed by incorporating social media and website analysis. This combination can make up for the drawbacks faced by the conventional citation analysis methods. The reference manager Mendeley appeared 10 times, five times as often as CiteULike. Thus, Mendeley was more favoured in Altmetrics research. The social media platforms Twitter, Facebook, and Google were the top 3 in terms of the number of mentions. They represented the latest trend in Scientometrics, which appears to have favoured Altmetrics in recent years.

Many other new software packages appeared from 2010 to 2020, although they were mentioned less frequently in the past 11 years. For example, SCIgen is a program that generates random computer science research papers automatically, including graphs, figures, flow charts, and citations.
Certain niche software products, such as the ASE tool, PoP (Publish or Perish Software), and Sci2 (Science of Science Tool), are now available for citation analysis research. Recently, Python has become a mainstream tool in data science. It was one of the top 10 software entities mentioned in Scientometrics in 2017 and 2020. ScientoPy has also been developed as an open-source Python-based scientometric analysis tool. From the emergence of Google+, which is Google’s social media platform, in 2020, it can be inferred that Google+ may become a new research object for Altmetrics research.

4 Models

Machine learning and deep learning models are used in the remainder of this study to build a high-performance software entity extraction model.

4.1 Conditional Random Field

The conditional random field (CRF) (Lafferty et al. 2011) is a popular method for NLP tasks such as entity recognition, word segmentation, and part-of-speech tagging. Here, software entity extraction from the full-text of the research articles in Scientometrics was converted into a sequence tagging task. In formulae 1–2, $X = \{X_1, X_2, \ldots, X_{n-1}, X_n\}$ represents the characters of a sentence in the full-text of the research articles from Scientometrics, for example, “Two other computer programs that can be used to construct graph-based maps are CiteSpace.” $Y = \{Y_1, Y_2, \ldots, Y_{n-1}, Y_n\}$ is the sequence of tags for the characters in the sentence, such as the start, middle, and end tags for software entity characters.

The CRF models the tag sequence $Y$ conditioned on the given $X$. The binary feature functions $t_k$ and $s_l$ are used to extract the character-level features from the full-text of the research articles in Scientometrics. $\lambda_k$ and $\mu_l$ are the weights of the feature functions, which are dynamically adjusted in the software entity extraction model.
$Z(x)$ is a normalisation factor that ensures that the conditional probability $P(y|x)$ falls within the range of 0–1. $P(y|x)$ is the overall score of the entity tags corresponding to the characters in the full text of research articles.

$$P(y|x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i) \right) \tag{1}$$

$$Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i) \right) \tag{2}$$

4.2 Bidirectional LSTM-CRF

Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) is a special type of recurrent neural network. The architecture of the bidirectional LSTM, as shown in Fig. 7, can be used to extract contextual information and improve the software entity extraction performance. The neurons of the LSTM were calculated by using formulae 3–8, where $i_t$, $o_t$, $f_t$, and $c_t$ are the input gate, the output gate, the forget gate, and the memory cell for the $t$-th input character in the full-text of the research articles from Scientometrics. They control the input, the output, the forget, and the memory functions of the information in the full-text of the research articles. $x_t$ is the embedding vector of the $t$-th character, and $h_t$ is the output vector of the hidden neuron at the $t$-th moment. Further, $W$ and $b$ are the trainable weights and biases, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{3}$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{4}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \tag{6}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{7}$$

$$h_t = o_t \odot \tanh(c_t) \tag{8}$$

Fig. 7. Architecture of the Bi-LSTM-CRF model for software entity extraction from Scientometrics

The bidirectional LSTM was used to model only the features of the full text of the research articles from Scientometrics and not those of the software entity labels. Therefore, we introduced the CRF layer (Huang et al. 2015) to model the labels, with the architecture shown in Fig. 7.
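The gate updates in formulae 3–8 can be made concrete with a single time step. The following is a minimal sketch in plain Python, not the authors' implementation: the `params` dictionary, the helper names, and the use of nested lists for the weight matrices are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Logistic function used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, v: Sequence[float], b: Sequence[float]) -> Vector:
    """Return W.v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(
    x_t: Sequence[float],
    h_prev: Sequence[float],
    c_prev: Sequence[float],
    params: Dict[str, Tuple[Matrix, Vector]],
) -> Tuple[Vector, Vector]:
    """One LSTM time step.

    `params` (a hypothetical container) maps "f", "i", "o", "c" to (W, b),
    where each W has shape hidden x (hidden + input).
    """
    z = list(h_prev) + list(x_t)  # concatenation [h_{t-1}, x_t]

    def gate(name: str, activation) -> Vector:
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)      # formula 3: forget gate
    i_t = gate("i", sigmoid)      # formula 4: input gate
    o_t = gate("o", sigmoid)      # formula 5: output gate
    c_hat = gate("c", math.tanh)  # formula 6: candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]  # formula 7
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                    # formula 8
    return h_t, c_t


# Toy usage: hidden size 2, input size 3, all weights and biases zero.
hidden, inputs = 2, 3
zero_params = {
    g: ([[0.0] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
    for g in "fioc"
}
h_t, c_t = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -2.0], zero_params)
```

With zero weights every gate evaluates to 0.5 and the candidate memory to 0, so the new cell state is simply half of the previous one. A bidirectional layer would run such steps once from left to right and once from right to left over the sentence and concatenate the two hidden vectors at each position.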
The CRF layer was able to effectively reduce the errors that arise when the tags are predicted independently. In particular, the start tag of a software entity is more likely to be followed by the middle tag or the end tag. The formal expression is shown in formula 9, where $P \in \mathbb{R}^{n \times k}$ is the matrix of entity tag scores for the features of the input full-text of the research articles; $P_{t, y_t}$ is the score of tag $y_t$ for the $t$-th word, $n$ is the number of words in the current input sentence, and $k$ is the number of tags. $A \in \mathbb{R}^{k \times k}$ is the tag transition probability matrix. The CRF layer learns and optimises the state transition probability matrix $A$ and uses the Viterbi algorithm for decoding. The highest tag score of the entire sequence is thus obtained along with the globally optimal solution, $y = \{y_1, \ldots, y_{n-1}, y_n\}$.

$$S(x, y) = \sum_{t=1}^{n+1} \left( A_{y_{t-1}, y_t} + P_{t, y_t} \right) \tag{9}$$

4.3 Bidirectional Encoder Representations from Transformers

The bidirectional encoder representations from transformers (BERT) is a deep language representation model based on modifying the common bidirectional language model (Devlin et al. 2018). A transformer encoder based entirely on a self-attention mechanism was used to model the full-text documents from Scientometrics. BERT is superior to other neural network models as it is pre-trained on a large-scale unlabelled corpus. During the software entity extraction process from the full-text of the research articles in Scientometrics, BERT only needs to be fine-tuned with the top-level parameters to predict the software entity tags.

Fig. 8. Architecture of the BERT model for software entity extraction from Scientometrics

As shown in Fig. 8, the CLS token always appears at the start of each input sentence from the full-text of research articles from Scientometrics as a token of the sentence segmentation. The SEP token appears at the end of the sentence.
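The input layout just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' preprocessing code: it treats the sentence as an already tokenised word list instead of applying BERT's WordPiece tokeniser, the function name `build_bert_input` is hypothetical, and the limit of 512 is the maximum input length reported for BERT in this paper.

```python
from typing import Dict, List

MAX_SEQ_LEN = 512  # maximum input length reported for the BERT model


def build_bert_input(words: List[str], max_len: int = MAX_SEQ_LEN) -> Dict[str, list]:
    """Wrap one tokenised sentence with [CLS]/[SEP] and derive the auxiliary ids."""
    body = words[: max_len - 2]  # leave room for the two special tokens
    tokens = ["[CLS]"] + body + ["[SEP]"]
    return {
        "tokens": tokens,
        "position_ids": list(range(len(tokens))),  # 0, 1, 2, ... per token
        "segment_ids": [0] * len(tokens),          # single sentence: one segment
    }


example = build_bert_input(["Full-text", "search", "is", "supported", "using", "Solr"])
```

For sequence tagging, each real token then receives one predicted tag, while the two special tokens are ignored when the software entities are read off.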
The encodings of the three layers, namely, the word, word position, and sentence segmentation layers, are added to obtain the word embedding used as the input. The multi-layer transformer based on the self-attention mechanism is then applied to generate the contextual semantic representation for each word in the full-text of the research articles from Scientometrics. Finally, a softmax layer placed just before the output layer predicts the software entity tags, from which the software entities are extracted.

5 Experiments

5.1 Experimental Data Processing

In this study, the full text of the research articles from Scientometrics stored in the database was organised in paragraphs. As the maximum input length of the BERT model is 512, the Stanford NLP method (Manning et al. 2014) was used to split the paragraphs into sentences. The character length of each sentence was kept under 512 so that the text input satisfied the requirements of the model input. A total of 659,191 sentences with over 20 million characters were obtained after parsing. These sentences were then input into Brat for the text annotation (Stenetorp et al. 2012). The software entities in the documents were manually annotated, and 13,269 software entities were identified.

After the annotation, the textual data were converted into the format required by the sequence tagging model, and the BMES tagging scheme was used. In Fig. 9, the words are in the first column and the tags in the second column. B-Software, M-Software, and E-Software represent the beginning word, the middle word, and the ending word of a software entity, respectively. If a software entity consisted of one word, it was tagged as S-Software. The words that did not denote software entities were tagged as O. The sentences were separated by a blank line.

Fig. 9. Example of the input of the sequence tagging model

5.2 Model Parameters and Experimental Environment

Three models, namely, CRF++, BiLSTM-CRF, and BERT, were run, and their software entity extraction performance was compared to identify the one with the best performance. CRF++ implements the CRF, a discriminative probabilistic undirected graphical model. In this study, the CRF++ 0.58 package (Kudo 2005) was used in combination with the basic feature template for software entity extraction.

The BiLSTM-CRF model consists of the embedding, bidirectional LSTM, and CRF layers. Gradient clipping was adopted to avoid gradient explosion and vanishing, with clip = 5.0, and the learning rate was set to 0.001. The dimensionality of the word embedding was 300. The number of hidden neurons in the LSTM layer was set to 256, and the number of LSTM layers was set to 2. The batch size was 512. The number of training epochs was 200, and the Adam optimiser was used for the gradient descent optimisation. Early stopping was implemented to avoid overfitting and to shorten the training time; that is, the training was terminated if the F-value on the cross-validation set did not increase within 10 iterations.

The transformer is the primary component of BERT, a neural network model proposed by Google in 2018. Thus far, BERT has achieved great success in 11 NLP tasks. The output layer of the pre-trained English BERT model was modified by transfer learning so that the model was better suited for software entity extraction from the full-text of the research articles from Scientometrics. The number of hidden neurons was set to 768; the number of attention heads was 12; the warmup proportion was 0.3; the learning rate was 2.0E-5; the batch size was 16; the maximum sequence length was 512; and the number of training epochs was set to 3.

The training of a neural network typically involves a significant load of parallel computing and matrix calculation.
Therefore, throughput and response speed are insufficient if the deep learning process is performed on a CPU. In this study, NVIDIA Tesla P40 GPUs were used to train the neural networks. This GPU delivers a data handling capacity over 60 times higher than that of a CPU and an inference capacity of up to 47 TOPS (tera operations per second). The configuration of the computer used in the experiment was as follows: CPU: 48 Intel(R) Xeon(R) CPUs E5-2650 v4 @ 2.20GHz; memory: 256 GB; GPU: 6 NVIDIA Tesla P40 cards; video memory: 24 GB; and operating system: CentOS 3.10.0.

5.3 Analysis of the Entity Extraction Performance

Precision, recall, and the F1 value were used to evaluate the software entity extraction performance of the models. They are calculated as shown in formulas 10-12. Precision is the proportion of the software entities extracted by the model that are correct, and recall is the proportion of the software entities in the corpus that the model extracts correctly. The F1 value is the harmonic mean of precision and recall and measures the overall performance of a model in recognising software entities.

$$\text{Precision} = \frac{\text{Number of software entities extracted correctly}}{\text{Total number of software entities extracted}} \times 100\% \tag{10}$$

$$\text{Recall} = \frac{\text{Number of software entities extracted correctly}}{\text{Total number of software entities in the corpus}} \times 100\% \tag{11}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100\% \tag{12}$$

Based on the corpus annotated above, the CRF, Bi-LSTM-CRF, and BERT models were applied in software entity extraction experiments on the full-text of the research articles from Scientometrics. Ten-fold cross-validation was carried out for each model to reduce the influence of random errors on the experimental results. In each fold, the corpus was split with a 9:1 ratio into a training set and a test set.
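Formulas 10-12 can be illustrated with a short sketch that reads entities off BMES tag sequences of the kind described in Section 5.1 and then counts matches. It is a sketch under stated assumptions rather than the evaluation script used in this study: the function names are hypothetical, and an extracted entity is assumed to count as correct only when its start and end positions both match the annotation exactly.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start index, end index), both inclusive


def bmes_to_spans(tags: List[str]) -> Set[Span]:
    """Collect software entity spans from one BMES-tagged sentence."""
    spans: Set[Span] = set()
    start = None  # index of the currently open B-Software tag, if any
    for i, tag in enumerate(tags):
        if tag == "S-Software":  # single-word entity
            spans.add((i, i))
            start = None
        elif tag == "B-Software":  # open a new entity
            start = i
        elif tag == "M-Software":  # stay inside the open entity, if any
            continue
        elif tag == "E-Software":  # close the open entity
            if start is not None:
                spans.add((start, i))
            start = None
        else:  # "O" (or an unknown tag) discards an unfinished entity
            start = None
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 in percent (exact-match assumption)."""
    correct = extracted = total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = bmes_to_spans(gold_tags)
        pred_spans = bmes_to_spans(pred_tags)
        correct += len(gold_spans & pred_spans)
        extracted += len(pred_spans)
        total += len(gold_spans)
    precision = 100.0 * correct / extracted if extracted else 0.0
    recall = 100.0 * correct / total if total else 0.0
    # Precision and recall are already percentages, so F1 is one as well.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy sentence: two annotated entities, of which the model finds one.
    gold = [["O", "B-Software", "E-Software", "O", "S-Software"]]
    pred = [["O", "B-Software", "E-Software", "O", "O"]]
    print(precision_recall_f1(gold, pred))
```

In the toy sentence the model extracts one entity and it is correct, while the annotation contains two, so precision is 100% and recall is 50%.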
The averages of the precision, the recall, and F1 were calculated to measure and compare the overall entity extraction performance of the models.

Table 3. Comparison of the software entity extraction performance across the models

As shown in Table 3, BERT achieved the highest average recall and F1 among the three models, at 83.89% and 84.99%, respectively. CRF++ had the highest precision, at 90.03%. The average F1 of the CRF++ model was 76.84%, which was lower than that of the Bi-LSTM-CRF and the BERT models by 1.83 and 8.15 percentage points, respectively. These results indicate that, even without complex manual features, CRF++ recognised high-frequency entities well. Compared with the deep learning models, however, CRF++ does not incorporate a neural network for automatic feature extraction and lacks a semantic similarity mechanism such as word embedding. It therefore failed to recognise semantically related software entities, and the recall of the Bi-LSTM-CRF and BERT was higher than that of CRF++ by 8.78 and 16.83 percentage points, respectively. The BERT model is pre-trained on a large-scale unlabelled corpus with a transformer based on a self-attention mechanism, and its output layer is modified with the help of transfer learning so that it is better suited to software entity extraction from the full-text of the research articles from Scientometrics. Owing to these advantages, the BERT model exhibited the strongest semantic modelling and the highest overall performance, with a precision, recall, and F1 of 86.13%, 83.89%, and 84.99%, respectively.

6 Conclusion

In this study, the full-text of research articles was collated from Scientometrics from 2010 to 2020. The corpus for the software entities in the full-text of the research articles was developed.
The extracted software entities were further analysed from various perspectives, such as their distribution across the different parts of a document, the number of software entity mentions and citations, and their time-series evolution. Software entities were mentioned more frequently in the Method and the Experiment and Result sections. A generally increasing trend was observed in the number of software entity mentions and citations in Scientometrics from 2010 to 2020. This indicates that Scientometrics attached growing importance to the use of software to power academic research and to the standardised citation of software entities. A power law was observed in the relationship between the number of software mentions and the number of documents. The time-series analysis of the software entities showed that Scientometrics has strongly favoured Altmetrics research in recent years, with an increase in the number of mentions of the relevant software products. New software entities that power research on new topics, such as ScientoPy and Google+, were also mentioned.

The machine learning and deep learning models CRF, Bi-LSTM-CRF, and BERT were built to extract the software entities from the research papers in Scientometrics. The highest precision (90.03%) was achieved with CRF++; the highest recall and F1 value, 83.89% and 84.99%, respectively, were achieved with BERT.

This study has certain limitations. For example, it covered only one journal, Scientometrics. In the future, the data sources will be expanded, and the BERT-based model built in the present study will be applied to software entity extraction from other journals to deepen the mining and the analysis of software entities from multiple perspectives.

Acknowledgements

We are very grateful to the 25 graduate and undergraduate students who participated in the annotation of the software entities.
The authors acknowledge the National Natural Science Foundation of China (Grant Number: 71974094) for financial support.

References

Bu, Y., Wang, B., Huang, W. B., Che, S., & Huang, Y. (2018). Using the appearance of citations in full text on author co-citation analysis. Scientometrics, 116(1), 275-289.
Chen, H., & Luo, X. (2019). An automatic literature knowledge graph and reasoning network modeling framework based on ontology and natural language processing. Advanced Engineering Informatics, 42, 100959.
Chen, Z., & Yan, E. (2017). Domain-independent term extraction & term network for scientific publications. iConference 2017 Proceedings.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS ONE, 8(8), e71416.
Glänzel, W., & Thijs, B. (2017). Using hybrid methods and 'core documents' for the representation of clusters and topics: the astronomy dataset. Scientometrics, 111(2), 1071-1087.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Hu, Z., Lin, G., Sun, T., & Hou, H. (2017). Understanding multiply mentioned references. Journal of Informetrics, 11(4), 948-958.
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Jeong, Y. K., Song, M., & Ding, Y. (2014). Content-based author co-citation analysis. Journal of Informetrics, 8(1), 197-211.
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Li, K., & Yan, E. (2018). Co-mention network of R packages: Scientific impact and clustering structure. Journal of Informetrics, 12(1), 87-100.
Li, K., Rollins, J., & Yan, E. (2018). Web of Science use in published research and review papers 1997–2017: A selective, dynamic, cross-domain, content-based analysis. Scientometrics, 115(1), 1-20.
Li, K., Yan, E., & Feng, Y. (2017). How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics, 11(4), 989-1002.
Lu, C., Bu, Y., Wang, J., Ding, Y., Torvik, V., Schnaars, M., & Zhang, C. (2019). Examining scientific writing styles from the perspective of linguistic complexity. Journal of the Association for Information Science and Technology, 70(5), 462-475.
Lv, Y., Ding, Y., Song, M., & Duan, Z. (2018). Topology-driven trend analysis for drug discovery. Journal of Informetrics, 12(3), 893-905.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
Nangia, U., & Katz, D. S. (2017, September). Track 1 paper: Surveying the US national postdoctoral association regarding software use and training in research. In Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1).
Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860-871.
Park, H., & Wolfram, D. (2019). Research software citation in the Data Citation Index: Current practices and implications for research software sharing and reuse. Journal of Informetrics, 13(2), 574-582.
Pasgrimaud, G. (2017). Pyquery: A jquery-like library for Python.
Small, H. (2018). Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty. Journal of Informetrics, 12(2), 461-480.
Small, H., Tseng, H., & Patek, M. (2017). Discovering discoveries: Identifying biomedical discoveries using citation contexts. Journal of Informetrics, 11(1), 46-62.
Song, M., Han, N. G., Kim, Y. H., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS ONE, 8(12), e84639.
Song, M., Heo, G. E., & Ding, Y. (2015). SemPathFinder: Semantic path analysis for discovering publicly unknown knowledge. Journal of Informetrics, 9(4), 686-703.
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. I. (2012, April). BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 102-107).
Thelwall, M. (2019). The rhetorical structure of science? A multidisciplinary analysis of article headings. Journal of Informetrics, 13(2), 555-563.
Thijs, B., & Glänzel, W. (2018). The contribution of the lexical component in hybrid clustering, the case of four decades of "Scientometrics". Scientometrics, 115(1), 21-33.
Wang, Y., & Zhang, C. (2018, March). Using full-text of research articles to analyze academic impact of algorithms. In International Conference on Information (pp. 395-401). Springer, Cham.
Yan, E., & Pan, X. (2015). A bootstrapping method to assess software impact in full-text papers. In ISSI.
Yan, E., & Zhu, Y. (2015). Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods. Journal of Informetrics, 9(3), 455-465.
Yan, E., & Zhu, Y. (2018). Tracking word semantic change in biomedical literature. International Journal of Medical Informatics, 109, 76-86.
Yan, E., Williams, J., & Chen, Z. (2017). Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach. PLoS ONE, 12(11), e0187762.
Yu, H., Xu, S., & Xiao, T. (2018). Is there Lingua Franca in informal scientific communication? Evidence from language distribution of scientific tweets. Journal of Informetrics, 12(3), 605-617.
Yu, Q., Ding, Y., Song, M., Song, S., Liu, J., & Zhang, B. (2015). Tracing database usage: Detecting main paths in database link networks. Journal of Informetrics, 9(1), 1-15.
Zhang, G., Ding, Y., & Milojević, S. (2013). Citation content analysis (CCA): A framework for syntactic and semantic analysis of citation content. Journal of the American Society for Information Science and Technology, 64(7), 1490-1503.
Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H., & Zhang, G. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099-1117.
Zhao, M., Yan, E., & Li, K. (2018). Data set mentions and citations: A content analysis of full-text publications. Journal of the Association for Information Science and Technology, 69(1), 32-46.