=Paper=
{{Paper
|id=Vol-3745/paper23
|storemode=property
|title=Quantifying Scientific Novelty of Doctoral Theses with Bio-BERT Model
|pdfUrl=https://ceur-ws.org/Vol-3745/paper23.pdf
|volume=Vol-3745
|authors=Alex J. Yang,Yi Bu,Ying Ding,Meijun Liu
|dblpUrl=https://dblp.org/rec/conf/eeke/Yang00L24
}}
==Quantifying Scientific Novelty of Doctoral Theses with Bio-BERT Model==
<pdf width="1500px">https://ceur-ws.org/Vol-3745/paper23.pdf</pdf>
<pre>
                                Quantifying scientific novelty of doctoral theses
                                with Bio-BERT model ⋆
                                Alex J. Yang1, Yi Bu2, Ying Ding3, and Meijun Liu4

                                1 School of Information Management, Nanjing University, Nanjing, China

                                2 Department of Information Management, Peking University, Beijing, China

                                3 School of Information, University of Texas at Austin, Austin, TX, USA

                                4 Institute for Global Public Policy, Fudan University, Shanghai, China


                                                    Abstract
                                                    Scientific novelty plays a pivotal role in advancing scholarly endeavors, driving the evolution
                                                    of knowledge across various disciplines. In this paper, we present a methodology for
                                                    quantifying the scientific novelty of biomedical doctoral theses utilizing the Bio-BERT model.
                                                    Leveraging BERN2 for bio-entity extraction and normalization, we analyze a dataset
                                                    comprising 305,693 doctoral theses to generate unique bio-entity combinations. Employing
                                                    Bio-BERT, we calculate the semantic distance between bio-entities and establish a criterion
                                                    for identifying novel entity pairings. We introduce a novelty score to assess the scientific
                                                    novelty of each thesis, providing a nuanced evaluation of unique entity combinations. Our
                                                    findings contribute to the discourse on scientific novelty assessment, offering insights into the
                                                    evolving landscape of biomedical research and providing a framework for enhanced analysis
                                                    of scholarly innovation for early-career scientists based on their doctoral theses.

                                                    Keywords
                                                    Biomedical research, Bio-BERT model, Doctoral theses, BERN2


                                1. Introduction

                                    Scientific novelty serves as a cornerstone in scholarly pursuits, driving the
                                progression of knowledge across diverse fields. Originating from Schumpeter's
                                seminal insights on business cycles in the 1930s, the concept of scientific
                                novelty underscores the transformative nature of innovation, wherein novel
                                theories, methodologies, data, or discoveries emerge to shape subsequent
                                investigations (1). Over time, this perspective has become integral to the
                                examination of innovation, permeating scholarly discourse and guiding inquiries
                                into the novelty of scientific artifacts such as publications, patents, and
                                grant proposals (2-6).
                                    With the exponential growth of scientific data, researchers have turned to
                                various methodologies to operationalize and quantify scientific novelty, often
                                leveraging textual information or citation data to delineate knowledge elements
                                and their combinations (7, 8). For instance, Fleming (2001) proposes evaluating
                                novelty in patents by identifying unexplored technology classes (2), while
                                Boudreau et al. (2016) advocate for assessing grant proposals based on unique
                                MeSH keyword combinations (7). Despite these endeavors, challenges persist in
                                accurately capturing the intricate interplay of knowledge components.
                                    In this context, recent advancements aim to refine methodologies for gauging
                                scientific novelty,


                                Joint Workshop of the 5th Extraction and Evaluation of Knowledge
                                Entities from Scientific Documents and the 4th AI + Informetrics
                                (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online
                                EMAIL: alexjieyang@outlook.com (A. J. Yang); buyi@pku.edu.cn (Y.
                                Bu); ying.ding@ischool.utexas.edu (Y. Ding); meijunliu@fudan.edu.cn
                                (M. Liu)
                                              © Copyright 2024 for this paper by its authors. Use permitted under
                                              Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

drawing inspiration from combinatorial approaches that consider the semantic relationships between
knowledge elements (9). Liu et al. (2022) propose an innovative methodology for assessing scientific
novelty in biomedical publications related to coronavirus (10), utilizing bio-entities as fundamental
knowledge units and employing a pre-trained Bio-BERT model to measure their semantic distance. By
scrutinizing entity pairs and identifying novel combinations based on a semantic distance threshold,
this approach offers a nuanced perspective on scientific novelty, surpassing traditional methods
reliant solely on textual or citation-based analyses.
     Building upon this pioneering framework, our study endeavors to evaluate the scientific novelty
of biomedical doctoral theses through a comprehensive five-step method. By adopting the approach
outlined by Liu et al. (2022) (10), which integrates domain-specific contexts and semantic analysis,
we aspire to enhance the precision and depth of our analysis, providing invaluable insights into the
evolving landscape of biomedical research. Through this endeavor, we contribute to the ongoing
discourse on scientific novelty assessment, advancing methodologies to better encapsulate the
richness and complexity of scholarly innovation.
     The primary data source for this study is the Sciences and Engineering Collection of The
ProQuest Dissertations & Theses Citation Index (PQDT). PQDT stands as the world's largest
multidisciplinary dissertation database, housing over 5.5 million dissertations from universities
worldwide and serving as an official repository for the US Library of Congress. From a compilation
of US higher education institutions provided by the Carnegie Commission on Higher Education, we
gather records of doctoral theses from the Science and Engineering collection of PQDT. This dataset
encompasses 1,109,491 theses from 828 US institutions, spanning publication years 1960 to 2016. PQDT
offers comprehensive information about dissertations, including author details, advisors,
universities, subjects, and publication years. Each thesis is associated with one or more subjects
chosen by the author, which can be mapped to 22 broader disciplines. Prioritizing data accuracy, we
analyze doctoral theses published from 1980 to 2016, retaining 313,274 theses in the biomedical
sciences encompassing biological science, health, and medical science.


                   Figure 1: Steps of quantifying scientific novelty of doctoral theses.

2. Extracting and disambiguating bio-entities

     We utilize BERN2 (11), an advanced neural biomedical tool, to extract biomedical entities from
a corpus comprising 313,274 doctoral theses. BERN2 comprises two principal models: (1) Named Entity
Recognition (NER), which uses a multi-task NER model to discern nine types of biomedical entities
(gene/protein, disease, drug/chemical, species, mutation, cell line, cell type, DNA, and RNA); and
(2) Named Entity Normalization (NEN), which associates annotated entities with concept unique
identifiers using a combination of rule-based and neural network-based NEN models. BERN2's
superiority over existing biomedical text mining tools (12) lies in its ability to provide more
efficient annotations.
     We opt to extract bio-entities from the titles and abstracts of doctoral theses rather than
relying on full texts for several reasons. Firstly, although the PQDT database offers access to 3
million full texts of doctoral dissertations added since 1997, a download limit is imposed. Titles
and abstracts, by contrast, are available for nearly all doctoral theses added since 1980. The title
succinctly encapsulates the main topic addressed by the author, while the abstract provides a
summary of the substantive content. Utilizing titles and abstracts instead of full texts ensures
higher data accessibility, a denser concentration of relevant vocabulary reflecting the
publication's topic, as well as advantages such as


reduced computation time and simplified data preprocessing.
     Utilizing BERN2, we extract 1,519,599 annotated bio-entity names from the titles and abstracts
of the doctoral theses. In 2.42% of the 313,274 doctoral theses, we fail to extract any bio-entity;
these theses are excluded from further analyses, leaving a subset of 305,693 doctoral theses. The
1,519,599 annotated bio-entity names were disambiguated and linked to 118,349 unique bio-entity IDs.
The standard name for each ID was determined as the most frequently occurring bio-entity name
associated with it in the biomedical doctoral theses. In cases of multiple associated names with
equal occurrences, one was randomly designated as the standard name.
     Subsequently, we establish pairings among the 118,349 distinct bio-entity IDs by analyzing
their co-occurrence in the dataset comprising 305,693 doctoral theses. Among these theses, 8.45%
exclusively mentioned a single bio-entity, rendering the generation of any bio-entity combinations
impossible. Consequently, these instances were excluded from subsequent analyses, leaving us with
277,288 doctoral theses and resulting in the generation of 68,949,061 unique bio-entity
combinations.

3. Measuring the distance of two bio-entities

     Using the standard names associated with the 118,349 unique bio-entity IDs obtained in the
previous step, we convert each standard bio-entity name into a vector representation using a
Bio-BERT model. We then calculate the distance between two bio-entities $i$ and $j$, denoted
$D_{i,j}$, for any entity combination that is generated from the doctoral theses using Equation 1:

          $D_{i,j} = 1 - \mathrm{CosSim}_{i,j}$          (1)

where $\mathrm{CosSim}_{i,j}$ is the cosine similarity between entities $i$ and $j$ based on their
corresponding vector representations that are obtained from the Bio-BERT model. Examples of an
entity vector space for three theses based on the Bio-BERT model are shown in Figure 2a-b.
     We develop a criterion to determine what qualifies as a novel combination of entities. To do
this, we analyze the distribution of cosine distances among all pairs of entities in our dataset. If
the cosine distance between the two constituent entities of a pair falls within the top 10% of this
distribution, we consider it a novel entity pairing. The 90th percentile of the distribution
corresponds to a cosine distance of 0.279 (Figure 2c). Any entity pair with a cosine distance
greater than 0.279 is considered to be a novel combination. We further define a novel thesis as a
doctoral thesis that includes at least one novel entity combination/pair.
     To provide a nuanced evaluation of each doctoral thesis's scientific novelty, we introduce the
novelty score. This score is calculated by determining the proportion of novel entity pairs out of
the total number of entity combinations generated within a given thesis. As an illustration, let us
consider a thesis that mentions three bio-entities: a, b, and c. Within this thesis, the number of
generated entity combinations is calculated as $\binom{3}{2} = 3$. Out of these three entity pairs,
only the combination of a and b meets our novelty criterion, which requires the cosine distance
between the two bio-entities to be greater than 0.279. Accordingly, the novelty score for this
particular thesis is 1/3. The novelty score is bounded between 0 and 1, with a higher score
indicating a greater degree of novelty. This metric provides a precise and continuous measure of
the unique combinations of entities present in each thesis.
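
The novelty criterion and the per-thesis novelty score described in this section can be written
compactly as follows. This is a sketch of one consistent notation, not a formula taken from the
paper: the symbols $\mathbf{v}_i$, $\tau$, $E_t$, $n_t$, and $N_t$ are introduced here for
illustration only, while $D_{i,j}$ and $\mathrm{CosSim}_{i,j}$ follow Equation 1.

```latex
% v_i: Bio-BERT vector of the standard name of entity i (assumed notation)
\mathrm{CosSim}_{i,j}
  = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert},
\qquad
D_{i,j} = 1 - \mathrm{CosSim}_{i,j}

% tau: 90th percentile of D over all entity pairs in the dataset (reported as 0.279)
\text{pair } \{i,j\} \text{ is novel} \iff D_{i,j} > \tau, \qquad \tau = 0.279

% E_t: set of distinct bio-entities in thesis t, with n_t = |E_t| >= 2
N_t = \frac{\bigl|\{\, \{i,j\} \subseteq E_t : i \neq j,\ D_{i,j} > \tau \,\}\bigr|}{\binom{n_t}{2}},
\qquad 0 \le N_t \le 1
```

For the three-entity illustration in the text, $n_t = 3$ gives $\binom{3}{2} = 3$ pairs, of which
one is novel, so $N_t = 1/3$.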



Figure 2: The illustration of how to measure novelty scores for doctoral theses using the Bio-BERT
model. (a) An entity vector space containing all entities extracted from three sample doctoral theses based
on Bio-BERT. (b) The distribution of cosine distances between entities for all entity pairs extracted from the
three sample doctoral theses. Within each thesis, the entity pairs are ordered from left to right based on
their cosine distance values. (c) The distribution of cosine distance for all entity pairs extracted from all
doctoral theses in this study. If the cosine distance between the two constituent entities of an entity pair falls
within the top 10% of this distribution (above the 90th percentile), it is considered a novel entity pair.
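
The procedure illustrated in Figure 2 can be sketched in a few lines of Python. This is a minimal
sketch under stated assumptions, not the authors' implementation: the two-dimensional vectors are
toy stand-ins for Bio-BERT embeddings, and the function names (cosine_distance, entity_pairs,
pair_distances, novelty_threshold, novelty_score) are hypothetical. Only the 0.279 cut-off and the
three-entity example come from the text.

```python
import math
from itertools import combinations
from statistics import quantiles


def cosine_distance(u, v):
    """Equation 1: one minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def entity_pairs(entities):
    """All unordered pairs of the distinct entities mentioned in one thesis."""
    return list(combinations(sorted(set(entities)), 2))


def pair_distances(theses, vectors):
    """Cosine distance for every unique pair that co-occurs in some thesis."""
    unique_pairs = {p for ents in theses.values() for p in entity_pairs(ents)}
    return {(i, j): cosine_distance(vectors[i], vectors[j]) for i, j in unique_pairs}


def novelty_threshold(distances, percentile=90):
    """Percentile cut-off of the pair-distance distribution (linear interpolation)."""
    return quantiles(distances.values(), n=100, method="inclusive")[percentile - 1]


def novelty_score(entities, distances, threshold):
    """Share of a thesis's entity pairs whose distance exceeds the threshold."""
    pairs = entity_pairs(entities)
    if not pairs:
        return None  # fewer than two entities: no pair can be formed
    return sum(distances[p] > threshold for p in pairs) / len(pairs)


def unit(deg):
    """Unit vector at the given angle; a toy substitute for an embedding."""
    return (math.cos(math.radians(deg)), math.sin(math.radians(deg)))


# Illustrative data only: a and b are 60 degrees apart, c lies between them.
vectors = {"a": unit(0), "b": unit(60), "c": unit(30)}
theses = {"thesis_1": ["a", "b", "c"]}
distances = pair_distances(theses, vectors)

REPORTED_THRESHOLD = 0.279  # 90th-percentile distance reported in the text
score = novelty_score(theses["thesis_1"], distances, REPORTED_THRESHOLD)
print(round(score, 3))  # only (a, b) exceeds the threshold: 1 of 3 pairs
```

On a real corpus the threshold would be derived with novelty_threshold over all unique pairs rather
than hard-coded; the fixed value is used here only because three toy pairs do not form a meaningful
distribution.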

Acknowledgements

This study is sponsored by the National Natural Science Foundation of China (72104054, 72104007),
the Shanghai Pujiang Talent program (21PJC026), and the Key Project of the National Natural Science
Foundation of China (72234001). We thank Mr. Grant Guo for his technical support.

References

1.    J. A. Schumpeter, Business cycles (McGraw-Hill, New York, 1939), vol. 1.
2.    L. Fleming, Recombinant uncertainty in technological search. Management Science 47, 117-132
      (2001).
3.    B. Uzzi, S. Mukherjee, M. Stringer, B. Jones, Atypical combinations and scientific impact.
      Science 342, 468-472 (2013).
4.    M. L. Weitzman, Recombinant growth. The Quarterly Journal of Economics 113, 331-360 (1998).
5.    D. K. Simonton, Scientific creativity as constrained stochastic behavior: the integration of
      product, person, and process perspectives. Psychological Bulletin 129, 475 (2003).
6.    J. Wang, S. Shibayama, Mentorship and creativity: Effects of mentor creativity and mentoring
      style. Research Policy 51, 104451 (2022).
7.    K. J. Boudreau, E. C. Guinan, K. R. Lakhani, C. Riedl, Looking across and looking beyond the
      knowledge frontier: Intellectual distance, novelty, and resource allocation in science.
      Management Science 62, 2765-2783 (2016).
8.    S. Chai, A. Menon, Breakthrough recognition: Bias against novelty and competition for
      attention. Research Policy 48, 733-747 (2019).
9.    P. Azoulay, J. S. Graff Zivin, G. Manso, Incentives and creativity: evidence from the academic
      life sciences. The RAND Journal of Economics 42, 527-554 (2011).
10.   M. Liu et al., Pandemics are catalysts of scientific novelty: Evidence from COVID-19. J Assoc
      Inf Sci Technol 73, 1065-1078 (2022).
11.   M. Sung et al., BERN2: an advanced neural biomedical named entity recognition and
      normalization tool. Bioinformatics 38, 4837-4839 (2022).
12.   D. Kim et al., A neural named entity recognition and multi-type normalization tool for
      biomedical text mining. IEEE Access 7, 73729-73740 (2019).

</pre>