<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Quantifying scientific novelty of doctoral theses with Bio-BERT model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alex J. Yang</string-name>
          <email>alexjieyang@outlook.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Bu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ying Ding</string-name>
          <email>ying.ding@ischool.utexas.edu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meijun Liu</string-name>
          <email>meijunliu@fudan.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Management, Peking University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Global Public Policy, Fudan University</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Information Management, Nanjing University</institution>
          ,
          <addr-line>Nanjing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Information, University of Texas at Austin</institution>
          ,
          <addr-line>Austin, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scientific novelty plays a pivotal role in advancing scholarly endeavors, driving the evolution of knowledge across various disciplines. In this paper, we present a methodology for quantifying the scientific novelty of biomedical doctoral theses utilizing the Bio-BERT model. Leveraging BERN2 for bio-entity extraction and normalization, we analyze a dataset comprising 305,693 doctoral theses to generate unique bio-entity combinations. Employing Bio-BERT, we calculate the semantic distance between bio-entities and establish a criterion for identifying novel entity pairings. We introduce a novelty score to assess the scientific novelty of each thesis, providing a nuanced evaluation of unique entity combinations. Our findings contribute to the discourse on scientific novelty assessment, offering insights into the evolving landscape of biomedical research and providing a framework for enhanced analysis of scholarly innovation for early-career scientists based on their doctoral theses.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical research</kwd>
        <kwd>Bio-BERT model</kwd>
        <kwd>Doctoral theses</kwd>
        <kwd>BERN2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction</p>
      <p>Scientific novelty serves as a cornerstone in
scholarly pursuits, driving the progression of
knowledge across diverse fields. Originating from
Schumpeter's seminal insights on business cycles in
the 1930s, the concept of scientific novelty
underscores the transformative nature of innovation,
wherein novel theories, methodologies, data, or
discoveries emerge to shape subsequent
investigations (1). Over time, this perspective has
become integral to the examination of innovation,
permeating scholarly discourse and guiding
inquiries into the novelty of scientific artifacts such
as publications, patents, and grant proposals (2-6).</p>
      <p>© Copyright 2024 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>With the exponential growth of scientific data,
researchers have turned to various methodologies
to operationalize and quantify scientific novelty,
often leveraging textual information or citation data
to delineate knowledge elements and their
combinations (7, 8). For instance, Fleming (2001)
proposes evaluating novelty in patents by
identifying unexplored technology classes (2), while
Boudreau et al. (2016) advocate for assessing grant
proposals based on unique MeSH keyword
combinations (7). Despite these endeavors,
challenges persist in accurately capturing the
intricate interplay of knowledge components.</p>
      <p>In this context, recent advancements aim to
refine methodologies for gauging scientific novelty,
drawing inspiration from combinatorial approaches
that consider the semantic relationships between
knowledge elements (9). Liu et al. (2022) propose an
innovative methodology for assessing scientific
novelty in biomedical publications related to
coronavirus (10), utilizing bio-entities as
fundamental knowledge units and employing a
pretrained Bio-BERT model to measure their semantic
distance. By scrutinizing entity pairs and identifying
novel combinations based on a semantic distance
threshold, this approach offers a nuanced
perspective on scientific novelty, surpassing
traditional methods reliant solely on textual or
citation-based analyses.</p>
      <p>Building on this framework, we evaluate the
scientific novelty of biomedical doctoral theses with
a five-step method. Adopting the approach of
Liu et al. (2022) (10), which integrates
domain-specific context and semantic analysis, we
aim to improve the precision and depth of the
analysis and to offer insight into the evolving
landscape of biomedical research. In doing so, we
contribute to the ongoing discourse on scientific
novelty assessment and to methods that better
capture the richness and complexity of scholarly
innovation.</p>
      <p>The primary data source for this study is the
Sciences and Engineering Collection of The
ProQuest Dissertations &amp; Theses Citation Index
(PQDT). PQDT stands as the world's largest
multidisciplinary dissertation database, housing
over 5.5 million dissertations from universities
worldwide and serving as an official repository for
the US Library of Congress. From a compilation of
US higher education institutions provided by the
Carnegie Commission on Higher Education, we
gather records of doctoral theses from the Sciences
and Engineering Collection of PQDT. This dataset
encompasses 1,109,491 theses from 828 US
institutions, spanning publication years 1960 to
2016. PQDT offers comprehensive information
about dissertations, including author details,
advisors, universities, subjects, and publication years.
Each thesis is associated with one or more subjects
chosen by the author, which can be mapped to 22
broader disciplines. Prioritizing data accuracy, we
analyze doctoral theses published from 1980 to
2016, retaining 313,274 theses in the biomedical
sciences encompassing biological science, health,
and medical science.</p>
      <p>2. Extracting and disambiguating bio-entities</p>
      <p>We utilize BERN2 (11), an advanced neural
biomedical tool, to extract biomedical entities from
a corpus comprising 313,274 doctoral theses. BERN2
comprises two principal models: (1) Named Entity
Recognition (NER), which discerns nine types of
biomedical entities—gene/protein, disease,
drug/chemical, species, mutation, cell line, cell type,
DNA, and RNA—employing a multi-task NER model;
and (2) Named Entity Normalization (NEN), which
associates annotated entities with concept unique
identifiers using a combination of rule-based and
neural network-based NEN models. Compared with
existing biomedical text mining tools (12), BERN2
provides more efficient annotations.</p>
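      <p>To make the two BERN2 stages concrete, the following minimal sketch collects (mention, entity type, concept identifier) triples from one annotated record. The record layout (an "annotations" list whose items carry "mention", "obj" and "id" fields, with "CUI-less" marking a mention that could not be normalized) is an assumption about the tool's JSON output, and the sample record, the function name extract_entities, and the choice to drop unnormalized mentions are hypothetical illustrations rather than the pipeline used in this study.</p>
      <preformat><![CDATA[
```python
# Hypothetical record shaped like a BERN2 JSON response. The field names are
# an assumption about the tool's output; the values are illustrative only.
SAMPLE_RESPONSE = {
    "text": "Role of p53 mutations in human breast cancer",
    "annotations": [
        {"mention": "p53", "obj": "gene", "id": ["NCBIGene:7157"]},
        {"mention": "human", "obj": "species", "id": ["NCBITaxon:9606"]},
        {"mention": "breast cancer", "obj": "disease", "id": ["mesh:D001943"]},
        {"mention": "compound X", "obj": "drug", "id": ["CUI-less"]},
    ],
}


def extract_entities(response):
    """Return (mention, entity_type, concept_id) triples for normalized mentions."""
    triples = []
    for annotation in response.get("annotations", []):
        for concept_id in annotation.get("id", []):
            # Assumption: a mention recognized by NER but not linked to any
            # concept identifier is skipped, since it cannot be disambiguated.
            if concept_id == "CUI-less":
                continue
            triples.append((annotation["mention"], annotation["obj"], concept_id))
    return triples


entities = extract_entities(SAMPLE_RESPONSE)
```
]]></preformat>
      <p>In this sketch the NER stage supplies the mention and its type, while the NEN stage supplies the identifier that later allows different surface names to be merged.</p>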
      <p>We extract bio-entities from the titles and
abstracts of doctoral theses rather than from full
texts for two reasons. First, although PQDT offers
access to 3 million full texts of doctoral
dissertations added since 1997, it imposes a
download limit, whereas titles and abstracts are
available for nearly all doctoral theses added since
1980. Second, the title succinctly states the main
topic addressed by the author and the abstract
summarizes the substantive content, so together
they offer a denser concentration of vocabulary
relevant to the thesis topic. Using titles and
abstracts also reduces computation time and
simplifies data preprocessing.</p>
      <p>Applying BERN2 to the titles and abstracts of
the 313,274 doctoral theses, we fail to extract any
bio-entity from 2.42% of them. These theses are
excluded from further analyses, leaving a subset of
305,693 doctoral theses from which we extract
1,519,599 annotated bio-entity names. The annotated
names are disambiguated and linked to 118,349
unique bio-entity IDs. The standard name for each ID
is the bio-entity name most frequently associated
with it in the biomedical doctoral theses. When
several names are tied for the highest frequency,
one of them is randomly designated as the standard
name.</p>
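      <p>The choice of a standard name per bio-entity ID can be sketched as follows. This is a minimal illustration, assuming the input is a list of (entity ID, name) pairs and reading the random rule as a tie-break among equally frequent names; the function name standard_names and the fixed seed are hypothetical and are used only to make the example reproducible.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter, defaultdict


def standard_names(mentions, seed=0):
    """Map each entity ID to its most frequent name.

    mentions: iterable of (entity_id, name) pairs, one per annotated mention.
    Ties for the highest frequency are broken at random (assumed tie rule).
    """
    rng = random.Random(seed)  # seeded so the illustration is reproducible
    counts = defaultdict(Counter)
    for entity_id, name in mentions:
        counts[entity_id][name] += 1

    chosen = {}
    for entity_id, counter in counts.items():
        top = max(counter.values())
        tied = sorted(name for name, n in counter.items() if n == top)
        chosen[entity_id] = rng.choice(tied)
    return chosen


# Hypothetical toy mentions, for illustration only.
toy_mentions = [
    ("gene:1", "p53"),
    ("gene:1", "TP53"),
    ("gene:1", "p53"),
    ("disease:1", "breast cancer"),
]
names = standard_names(toy_mentions)
```
]]></preformat>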
      <p>Subsequently, we pair the 118,349 distinct
bio-entity IDs according to their co-occurrence
within the 305,693 doctoral theses. Among these
theses, 8.45% mention only a single bio-entity, so
no bio-entity combination can be generated from
them. These theses are excluded from subsequent
analyses, leaving 277,288 doctoral theses and
68,949,061 unique bio-entity combinations.</p>
      <p>3. Measuring the distance of two bio-entities</p>
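      <p>Before distances can be measured, the entity pairs that co-occur within each thesis have to be enumerated. The sketch below shows one way to do this, assuming each thesis is represented by the list of bio-entity IDs found in its title and abstract; the function name unique_pairs and the toy corpus are hypothetical.</p>
      <preformat><![CDATA[
```python
from itertools import combinations


def unique_pairs(thesis_entities):
    """Enumerate co-occurring entity pairs.

    thesis_entities maps a thesis ID to the entity IDs found in it.
    Returns (pairs_by_thesis, all_pairs). Theses with fewer than two
    distinct entities yield no pair and are left out.
    """
    pairs_by_thesis = {}
    all_pairs = set()
    for thesis_id, entity_ids in thesis_entities.items():
        # Deduplicate and sort so that (a, b) and (b, a) are the same pair.
        distinct = sorted(set(entity_ids))
        if len(distinct) < 2:
            continue
        pairs = list(combinations(distinct, 2))
        pairs_by_thesis[thesis_id] = pairs
        all_pairs.update(pairs)
    return pairs_by_thesis, all_pairs


# Hypothetical toy corpus, for illustration only.
toy_corpus = {"t1": ["a", "b", "c"], "t2": ["a"], "t3": ["b", "a", "a"]}
pairs_by_thesis, all_pairs = unique_pairs(toy_corpus)
```
]]></preformat>
      <p>In the toy corpus, thesis t2 mentions a single entity and therefore drops out, mirroring the exclusion of single-entity theses described above.</p>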
      <p>Using the standard names associated with the
118,349 unique bio-entity IDs obtained in the
previous step, we convert each standard bio-entity
name into a vector representation using a Bio-BERT
model. We then calculate the distance $D_{i,j}$ between two
bio-entities, denoted by $i$ and $j$, for any
entity combination that is generated from the
doctoral theses using Equation 1:</p>
      <p>$$D_{i,j} = 1 - S_{i,j} \qquad (1)$$
where $S_{i,j}$ is the cosine similarity between
entities $i$ and $j$ based on their corresponding vector
representations obtained from the Bio-BERT
model. Examples of an entity vector space for
three theses based on the Bio-BERT model are
shown in panels a and b of the accompanying figure.</p>
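      <p>Equation 1 defines the distance as one minus the cosine similarity of two entity vectors. The following sketch computes it for plain Python sequences. It assumes the vectors have already been obtained from a Bio-BERT model, a step that is outside the sketch, and the function name cosine_distance is hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy two-dimensional vectors standing in for entity embeddings.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])
```
]]></preformat>
      <p>Orthogonal vectors give a distance of 1, and vectors pointing in the same direction give a distance of approximately 0, whatever their lengths.</p>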
      <p>We develop a criterion to determine what
qualifies as a novel combination of entities. To do
this, we analyze the distribution of cosine distances
among all pairs of entities in our dataset. If the
cosine distance between the two constituent entities
of a pair falls within the top 10% of this distribution,
we consider it a novel entity pairing. The 90th
percentile of the distribution corresponds to a
cosine distance of 0.279 (panel c of the accompanying figure),
so any entity pair with a cosine distance greater than
0.279 is considered a novel combination. We
further define a novel thesis as a doctoral thesis that
includes at least one novel entity pair.</p>
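      <p>The criterion above can be sketched in two steps: take the 90th percentile of all pairwise distances as the threshold, then flag a thesis as novel if any of its pairs exceeds it. The sketch also returns the proportion of novel pairs, which the next paragraph introduces as the novelty score. The function names, the toy distances, and the interpolation rule used for the percentile are assumptions, since the text does not specify how the percentile was computed.</p>
      <preformat><![CDATA[
```python
import statistics


def novelty_threshold(distances):
    """90th percentile of the pairwise-distance distribution.

    Uses linear interpolation between data points (method="inclusive"),
    which is an assumed convention.
    """
    return statistics.quantiles(distances, n=10, method="inclusive")[-1]


def score_thesis(pair_distances, threshold):
    """Return (is_novel, novelty_score) for one thesis.

    pair_distances maps each entity pair of the thesis to its distance.
    A pair is novel when its distance is strictly greater than the threshold.
    """
    if not pair_distances:
        raise ValueError("a thesis needs at least one entity pair")
    novel = [pair for pair, d in pair_distances.items() if d > threshold]
    return bool(novel), len(novel) / len(pair_distances)


# Hypothetical distances for a thesis mentioning entities a, b and c.
toy_thesis = {("a", "b"): 0.30, ("a", "c"): 0.10, ("b", "c"): 0.20}
is_novel, score = score_thesis(toy_thesis, threshold=0.279)
```
]]></preformat>
      <p>With the reported threshold of 0.279, only the pair (a, b) qualifies in the toy thesis, so the thesis counts as novel and one of its three pairs is novel.</p>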
      <p>To provide a nuanced evaluation of each
doctoral thesis’s scientific novelty, we introduce the
novelty score. This score is calculated by
determining the proportion of novel entity pairs out
of the total number of entity combinations
generated within a given thesis. As an illustration,
consider a thesis that mentions three bio-entities:
a, b, and c. The number of entity combinations
generated within this thesis is $\binom{3}{2} = 3$.
Of these three entity pairs, only the combination
of a and b meets our novelty criterion, which
requires the cosine distance between the two
bio-entities to be greater than 0.279. Accordingly, the
novelty score for this particular thesis is 1/3. The
novelty score is bounded between 0 and 1, with a
higher score indicating a greater degree of novelty.
This metric provides a precise and continuous
measure of the unique combinations of entities
present in each thesis.</p>
      <p>Acknowledgements</p>
      <p>This study is sponsored by the National Natural
Science Foundation of China (72104054, 72104007),
the Shanghai Pujiang Talent program (21PJC026),
and the Key Project of the National Natural Science
Foundation of China (72234001). We thank Mr. Grant
Guo for his technical support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Schumpeter</surname>
          </string-name>
          ,
          <source>Business Cycles</source>
          (McGraw-Hill, New York,
          <year>1939</year>
          ), vol.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
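          <string-name>
            <given-names>L.</given-names>
            <surname>Fleming</surname>
          </string-name>
          ,
          <article-title>Recombinant uncertainty in technological search</article-title>
          .
          <source>Management Science</source>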
          <volume>47</volume>
          ,
          <fpage>117</fpage>
          -
          <lpage>132</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Atypical combinations and scientific impact</article-title>
          .
          <source>Science</source>
          <volume>342</volume>
          ,
          <fpage>468</fpage>
          -
          <lpage>472</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Weitzman</surname>
          </string-name>
          ,
          <article-title>Recombinant growth</article-title>
          .
          <source>The Quarterly Journal of Economics</source>
          <volume>113</volume>
          ,
          <fpage>331</fpage>
          -
          <lpage>360</lpage>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Psychological Bulletin</source>
          <volume>129</volume>
          ,
          <issue>475</issue>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Research Policy</source>
          <volume>51</volume>
          ,
          <issue>104451</issue>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
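          <string-name>
            <given-names>K. J.</given-names>
            <surname>Boudreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Guinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Lakhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riedl</surname>
          </string-name>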
          ,
          <article-title>Looking across and looking beyond the knowledge frontier: Intellectual distance, novelty, and resource allocation in science</article-title>
          .
          <source>Management Science</source>
          <volume>62</volume>
          ,
          <fpage>2765</fpage>
          -
          <lpage>2783</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Research Policy</source>
          <volume>48</volume>
          ,
          <fpage>733</fpage>
          -
          <lpage>747</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Azoulay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Graff Zivin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Manso</surname>
          </string-name>
          ,
          <article-title>Incentives and creativity: evidence from the academic life sciences</article-title>
          .
          <source>The RAND Journal of Economics</source>
          <volume>42</volume>
          ,
          <fpage>527</fpage>
          -
          <lpage>554</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          et al.,
          <article-title>Pandemics are catalysts of scientific novelty: Evidence from COVID-19</article-title>
          .
          <source>J Assoc Inf Sci Technol</source>
          <volume>73</volume>
          ,
          <fpage>1065</fpage>
          -
          <lpage>1078</lpage>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          et al.,
          <article-title>BERN2: an advanced neural biomedical named entity recognition and normalization tool</article-title>
          .
          <source>Bioinformatics</source>
          <volume>38</volume>
          ,
          <fpage>4837</fpage>
          -
          <lpage>4839</lpage>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          et al.,
          <article-title>A neural named entity recognition and multi-type normalization tool for biomedical text mining</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          ,
          <fpage>73729</fpage>
          -
          <lpage>73740</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>