=Paper=
{{Paper
|id=Vol-3745/paper23
|storemode=property
|title=Quantifying Scientific Novelty of Doctoral Theses with Bio-BERT Model
|pdfUrl=https://ceur-ws.org/Vol-3745/paper23.pdf
|volume=Vol-3745
|authors=Alex J. Yang,Yi Bu,Ying Ding,Meijun Liu
|dblpUrl=https://dblp.org/rec/conf/eeke/Yang00L24
}}
==Quantifying Scientific Novelty of Doctoral Theses with Bio-BERT Model==
Alex J. Yang (School of Information Management, Nanjing University, Nanjing, China), Yi Bu (Department of Information Management, Peking University, Beijing, China), Ying Ding (School of Information, University of Texas at Austin, Austin, TX, USA), and Meijun Liu (Institute for Global Public Policy, Fudan University, Shanghai, China)

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online

EMAIL: alexjieyang@outlook.com (A. J. Yang); buyi@pku.edu.cn (Y. Bu); ying.ding@ischool.utexas.edu (Y. Ding); meijunliu@fudan.edu.cn (M. Liu)

© Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Scientific novelty plays a pivotal role in advancing scholarly endeavors, driving the evolution of knowledge across various disciplines. In this paper, we present a methodology for quantifying the scientific novelty of biomedical doctoral theses utilizing the Bio-BERT model. Leveraging BERN2 for bio-entity extraction and normalization, we analyze a dataset comprising 305,693 doctoral theses to generate unique bio-entity combinations. Employing Bio-BERT, we calculate the semantic distance between bio-entities and establish a criterion for identifying novel entity pairings. We introduce a novelty score to assess the scientific novelty of each thesis, providing a nuanced evaluation of unique entity combinations. Our findings contribute to the discourse on scientific novelty assessment, offering insights into the evolving landscape of biomedical research and providing a framework for enhanced analysis of scholarly innovation for early-career scientists based on their doctoral theses.

Keywords

Biomedical research, Bio-BERT model, Doctoral theses, BERN2

1. Introduction

Scientific novelty serves as a cornerstone in scholarly pursuits, driving the progression of knowledge across diverse fields. Originating from Schumpeter's seminal insights on business cycles in the 1930s, the concept of scientific novelty underscores the transformative nature of innovation, wherein novel theories, methodologies, data, or discoveries emerge to shape subsequent investigations (1). Over time, this perspective has become integral to the examination of innovation, permeating scholarly discourse and guiding inquiries into the novelty of scientific artifacts such as publications, patents, and grant proposals (2-6).

With the exponential growth of scientific data, researchers have turned to various methodologies to operationalize and quantify scientific novelty, often leveraging textual information or citation data to delineate knowledge elements and their combinations (7, 8). For instance, Fleming (2001) proposes evaluating novelty in patents by identifying unexplored technology classes (2), while Boudreau et al. (2016) advocate for assessing grant proposals based on unique MeSH keyword combinations (7). Despite these endeavors, challenges persist in accurately capturing the intricate interplay of knowledge components.

In this context, recent advancements aim to refine methodologies for gauging scientific novelty, drawing inspiration from combinatorial approaches that consider the semantic relationships between knowledge elements (9). Liu et al. (2022) propose an innovative methodology for assessing scientific novelty in biomedical publications related to coronavirus (10), utilizing bio-entities as fundamental knowledge units and employing a pre-trained Bio-BERT model to measure their semantic distance. By scrutinizing entity pairs and identifying novel combinations based on a semantic distance threshold, this approach offers a nuanced perspective on scientific novelty, surpassing traditional methods reliant solely on textual or citation-based analyses.

Building upon this pioneering framework, our study endeavors to evaluate the scientific novelty of biomedical doctoral theses through a comprehensive five-step method. By adopting the approach outlined by Liu et al. (2022) (10), which integrates domain-specific contexts and semantic analysis, we aspire to enhance the precision and depth of our analysis, providing invaluable insights into the evolving landscape of biomedical research. Through this endeavor, we contribute to the ongoing discourse on scientific novelty assessment, advancing methodologies to better encapsulate the richness and complexity of scholarly innovation.

The primary data source for this study is the Sciences and Engineering Collection of The ProQuest Dissertations & Theses Citation Index (PQDT). PQDT stands as the world's largest multidisciplinary dissertation database, housing over 5.5 million dissertations from universities worldwide and serving as an official repository for the US Library of Congress. From a compilation of US higher education institutions provided by the Carnegie Commission on Higher Education, we gather records of doctoral theses from the Science and Engineering collection of PQDT. This dataset encompasses 1,109,491 theses from 828 US institutions, spanning publication years 1960 to 2016. PQDT offers comprehensive information about dissertations, including author details, advisors, universities, subjects, and publication years. Each thesis is associated with one or more subjects chosen by the author, which can be mapped to 22 broader disciplines. Prioritizing data accuracy, we analyze doctoral theses published from 1980 to 2016, retaining 313,274 theses in the biomedical sciences, encompassing biological science, health, and medical science.

Figure 1: Steps of quantifying scientific novelty of doctoral theses.

2. Extracting and disambiguating bio-entities

We utilize BERN2 (11), an advanced neural biomedical tool, to extract biomedical entities from a corpus comprising 313,274 doctoral theses. BERN2 comprises two principal models: (1) Named Entity Recognition (NER), which discerns nine types of biomedical entities (gene/protein, disease, drug/chemical, species, mutation, cell line, cell type, DNA, and RNA) employing a multi-task NER model; and (2) Named Entity Normalization (NEN), which associates annotated entities with concept unique identifiers using a combination of rule-based and neural network-based NEN models. BERN2's superiority over existing biomedical text mining tools (12) lies in its ability to provide more efficient annotations.

We opt to extract bio-entities from the titles and abstracts of doctoral theses rather than relying on full texts for several reasons. First, although the PQDT database offers access to 3 million full texts of doctoral dissertations added since 1997, a download limit is imposed, whereas titles and abstracts are available for nearly all doctoral theses added since 1980. Second, the title succinctly encapsulates the main topic addressed by the author, while the abstract provides a summary of the substantive content. Utilizing titles and abstracts instead of full texts therefore ensures higher data accessibility and a denser concentration of relevant vocabulary reflecting the publication's topic, as well as reduced computation time and simpler data preprocessing.

Utilizing BERN2, we extract 1,519,599 annotated bio-entity names from the titles and abstracts of 305,693 doctoral theses in the final dataset. In 2.42% of the 313,274 doctoral theses, we fail to extract any bio-entity; these theses are excluded from further analyses, leaving a subset of 305,693 doctoral theses. The 1,519,599 annotated bio-entity names were disambiguated and linked to 118,349 unique bio-entity IDs. The standard name for each ID was determined as the most frequently occurring bio-entity name associated with it in the biomedical doctoral theses. When several associated names were tied for the highest number of occurrences, one of them was randomly designated as the standard name.

Subsequently, we establish pairings among the 118,349 distinct bio-entity IDs by analyzing their co-occurrence in the dataset comprising 305,693 doctoral theses. Among these theses, 8.45% exclusively mentioned a single bio-entity, rendering the generation of any bio-entity combinations impossible. Consequently, these instances were excluded from subsequent analyses, leaving us with 277,288 doctoral theses and resulting in the generation of 68,949,061 unique bio-entity combinations.

3. Measuring the distance of two bio-entities

Using the standard names associated with the 118,349 unique bio-entity IDs obtained in the previous step, we convert each standard bio-entity name into a vector representation using a Bio-BERT model. We then calculate the distance <math>D_{i,j}</math> between two bio-entities, denoted by <math>i</math> and <math>j</math>, for any entity combination that is generated from the doctoral theses using Equation 1:

<math>D_{i,j} = 1 - \mathrm{CosSim}_{i,j}</math> (1)

where <math>\mathrm{CosSim}_{i,j}</math> is the cosine similarity between entities <math>i</math> and <math>j</math>, based on their corresponding vector representations obtained from the Bio-BERT model. Examples of an entity vector space for three theses based on the Bio-BERT model are shown in Figure 2a-b.

We develop a criterion to determine what qualifies as a novel combination of entities. To do this, we analyze the distribution of cosine distances among all pairs of entities in our dataset. If the cosine distance between the two constituent entities of a pair falls within the top 10% of this distribution, we consider it a novel entity pairing. The 90th percentile of the distribution corresponds to a cosine distance of 0.279 (Figure 2c). Any entity pair with a cosine distance greater than 0.279 is considered to be a novel combination. We further define a novel thesis as a doctoral thesis that includes at least one novel entity combination/pair.

To provide a nuanced evaluation of each doctoral thesis's scientific novelty, we introduce the novelty score. This score is calculated as the proportion of novel entity pairs out of the total number of entity combinations generated within a given thesis. As an illustration, consider a thesis that mentions three bio-entities: a, b, and c. Within this thesis, the number of generated entity combinations is <math>C_3^2 = 3</math>. Out of these three entity pairs, only the combination of a and b meets our novelty criterion, which requires the cosine distance between the two bio-entities to be greater than 0.279. Accordingly, the novelty score for this particular thesis is 1/3. The novelty score is bounded between 0 and 1, with a higher score indicating a greater degree of novelty. This metric provides a precise and continuous measure of the unique combinations of entities present in each thesis.

Figure 2: The illustration of how to measure novelty scores for doctoral theses using the Bio-BERT model. (a) An entity vector space containing all entities extracted from three sample doctoral theses based on Bio-BERT. (b) The distribution of cosine distances between entities for all entity pairs extracted from the three sample doctoral theses. Within each thesis, the entity pairs are ordered from left to right based on their cosine distance values. (c) The distribution of cosine distance for all entity pairs extracted from all doctoral theses in this study.
If the cosine distance between the two constituent entities of an entity pair falls within the top 10% of this distribution, it is considered a novel entity pair.

Acknowledgements

This study is sponsored by the National Natural Science Foundation of China (72104054, 72104007), the Shanghai Pujiang Talent program (21PJC026), and the Key Project of the National Natural Science Foundation of China (72234001). We thank Mr. Grant Guo for his technical support.

References

1. J. A. Schumpeter, Business cycles (McGraw-Hill, New York, 1939), vol. 1.
2. L. Fleming, Recombinant uncertainty in technological search. Management Science 47, 117-132 (2001).
3. B. Uzzi, S. Mukherjee, M. Stringer, B. Jones, Atypical combinations and scientific impact. Science 342, 468-472 (2013).
4. M. L. Weitzman, Recombinant growth. The Quarterly Journal of Economics 113, 331-360 (1998).
5. D. K. Simonton, Scientific creativity as constrained stochastic behavior: the integration of product, person, and process perspectives. Psychological Bulletin 129, 475 (2003).
6. J. Wang, S. Shibayama, Mentorship and creativity: Effects of mentor creativity and mentoring style. Research Policy 51, 104451 (2022).
7. K. J. Boudreau, E. C. Guinan, K. R. Lakhani, C. Riedl, Looking across and looking beyond the knowledge frontier: Intellectual distance, novelty, and resource allocation in science. Management Science 62, 2765-2783 (2016).
8. S. Chai, A. Menon, Breakthrough recognition: Bias against novelty and competition for attention. Research Policy 48, 733-747 (2019).
9. P. Azoulay, J. S. Graff Zivin, G. Manso, Incentives and creativity: evidence from the academic life sciences. The RAND Journal of Economics 42, 527-554 (2011).
10. M. Liu et al., Pandemics are catalysts of scientific novelty: Evidence from COVID-19. J Assoc Inf Sci Technol 73, 1065-1078 (2022).
11. M. Sung et al., BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics 38, 4837-4839 (2022).
12. D. Kim et al., A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729-73740 (2019).
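
The pairing step, the distance in Equation 1, and the novelty score described in the text can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-dimensional toy vectors (standing in for Bio-BERT embeddings), and the nearest-rank percentile rule are assumptions made for illustration. Only the distance definition, the 0.279 threshold, and the a, b, c example come from the text.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Equation 1: distance = 1 - cosine similarity of two entity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def entity_pairs(entity_ids):
    """All unordered pairs of distinct entity IDs co-occurring in one thesis."""
    return list(combinations(sorted(set(entity_ids)), 2))


def percentile_threshold(distances, q=0.90):
    """Nearest-rank percentile (an assumed rule; the text does not specify one)."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def novelty_score(entity_ids, vectors, threshold):
    """Share of a thesis's entity pairs whose distance exceeds the threshold.

    Returns None for a thesis with fewer than two entities, mirroring the
    exclusion of single-entity theses described in the text.
    """
    pairs = entity_pairs(entity_ids)
    if not pairs:
        return None
    novel = sum(
        1 for i, j in pairs
        if cosine_distance(vectors[i], vectors[j]) > threshold
    )
    return novel / len(pairs)


# Hypothetical toy vectors chosen so that only the pair (a, b) is novel.
toy_vectors = {"a": (1.0, 0.0), "b": (0.6, 0.8), "c": (0.8, 0.4)}

score = novelty_score(["a", "b", "c"], toy_vectors, threshold=0.279)
print(round(score, 3))  # 0.333
```

In a full pipeline, `percentile_threshold` would be applied once to the distances of all entity pairs in the corpus to obtain the cut-off, and `novelty_score` would then be evaluated per thesis with that fixed value.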