<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, Journal of the Association for Information Science and Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Is Dynamicity All You Need?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Richard Delwin Myloth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kian Ahrabian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arun Baalaaji Sankar Ananthan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinwei Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Pujara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute</institution>
          ,
          <addr-line>Marina del Rey, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southern California</institution>
          ,
          <addr-line>Los Angeles, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <volume>66</volume>
      <issue>2015</issue>
      <fpage>2215</fpage>
      <lpage>2222</lpage>
      <abstract>
        <p>Scientific domains are fluid entities that change and turn as time passes. Take machine learning as an example. Up until the '90s, most methods were driven by expert knowledge. As time passed, however, more data-driven approaches appeared, finally leading to the advent of deep learning methods. As a result, in a span of 30 years, the field has gone through many changes and breakthroughs and is at a point where many novelties have a life span shorter than five years. In parallel, a typical researcher's career lasts about as long as that 30-year span. Consequently, being a researcher requires shifts in the field of study throughout one's career. Besides, researchers' scientific interests are inherently dynamic and change over time. Hence, there exists a dynamicity to authors' interests and fields of work over time. In this work, we study this phenomenon through systematic approaches for representing and tracking dynamicity in different epochs. Our representation approaches are based on the idea that each author could be represented as a distribution of other authors. Concurrently, our tracking approaches rely on established mathematical concepts for measuring the change between two distributions. We focus on publications in the 2001-2020 range and present a set of analyses, built on top of the introduced approaches, for understanding the potential connection between dynamicity and success.</p>
      </abstract>
      <kwd-group>
        <kwd>Author Dynamicity</kwd>
        <kwd>Causal Analysis</kwd>
        <kwd>Scientific Research Analysis</kwd>
        <kwd>Community Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The past few decades have been an unprecedented era of scientific discoveries, with the sheer number of publications rising steadily [1]. This constant growth of research collaborations has led to the emergence of new interdisciplinary domains, prompting researchers to expand their research horizons. This expansion, combined with the continuous development of scientific domains and the inherent nature of research to explore new areas, results in a potentially volatile set of research directions. This work introduces approaches for systematically studying this fluidity and uncovering interesting behaviors among authors.</p>
      <p>Scientific publications are the information vessels scientists use to communicate their findings, methodologies, and critiques. At the same time, publications are reflections of their authors’ interests and fields of study. These publications are bound together through citations that specify the foundations of each work. As a result, citations create tightly connected groups of publications with similar research directions. Consequently, authors with a high number of interactions in these groups, either through collaborations or citations, are more likely to have similar interests.</p>
      <p>Community detection algorithms are graph partitioning approaches that identify sets of tightly connected nodes that are loosely connected to nodes outside their respective sets [2, 3]. When employed on citation networks, these algorithms yield a set of communities where each community contains highly related publications. These extracted communities could then be exploited for indirectly analyzing authors’ interests through publications and citations as proxies.</p>
      <p>In this work, we study the authors’ dynamicity phenomenon from a relational standpoint. More specifically, we focus on the following research questions:</p>
      <list list-type="order">
        <list-item><p>How can we characterize and quantify the interests and dynamicity of an author?</p></list-item>
        <list-item><p>Is there any connection between dynamicity and success due to reasons such as adaptability or diversity?</p></list-item>
      </list>
      <p>To this end, we first create two knowledge graphs (KGs) from publications in the 2001-2020 period, each encompassing ten years’ worth of scholarly information, i.e., publications and authors. Then, we introduce three vectorizing approaches focused on representing authors’ interests in one epoch, and two tracking approaches focused on quantifying the change in interests between two distinct epochs. Our vectorizing approaches are built on top of relational information in the KGs and represent authors as a distribution of other authors. Meanwhile, our tracking approaches rely on established similarity (cosine similarity) and divergence (relative entropy, i.e., KL divergence) measures. By mixing and matching vectorizers and trackers, these approaches yield six different dynamicity scores for each author (three vectorizers times two trackers). We then use these scores to investigate the connection between authors’ dynamicity and success. Our analyses showcase the connection between success, diversity, and adaptability in research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Bird et al. [4] analyzed community structures in the DBLP bibliographic database to investigate collaborative connections in computer science and interdisciplinary research at the individual, within-area, and network-wide levels. They developed quantifiable metrics such as longitudinal assortativity over the number of publications, collaborators, and career length to study author overlap and migration patterns. Prior to Bird et al. [4], Newman [5] used data from publications in physics, biomedical research, and computer science to build co-authorship collaboration networks. That study looked at the number of publications produced by authors, the number of authors per article, the number of collaborators that scientists have, the existence and size of a significant component of connected scientists, and the degree of clustering in the networks. It examined collaboration patterns among participants and found that these variables follow a power-law distribution and that collaboration relationships are transitive. Paul et al. [6] also used the DBLP database to develop a citation-collaboration network that ranks authors based on their contributions in terms of co-authorship and citations, verifying the rankings against the h-index. They also carried out a comparative examination of the change in author ranking for different parts of the author spectrum over time.
</p>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset</title>
      <p>OpenAlex [7] is a free and open catalog of scholarly entities that provides metadata for publications, authors, venues, institutions, and scientific concepts, along with the relationships among them. It gathers data from sources such as Crossref, Microsoft Academic Graph (MAG), ROR, ORCID, DOAJ, PubMed, PubMed Central, and Unpaywall. We use the OpenAlex dump obtained on 2022-12-07 to construct our dataset for this work. Given this dump, we first extract a KG containing all the publications and their connections, i.e., citation links. Then, we extract two induced KGs by filtering the publications with publication dates within the two ranges 2001-2010 and 2011-2020, naming them CG-2010 and CG-2020, respectively. Following this, we add the authorship information for all the publications in each KG. Finally, we drop all the nodes with a zero degree (in and out) in both KGs. After this procedure, we end up with two temporally-scoped KGs containing authorship and citation information for all the publications in the 2001-2010 and 2011-2020 periods. Table 1 illustrates the statistics of the extracted KGs. To handle the large size of the raw dump, we resorted to using the KGTK toolkit for all our KG processing procedures [8].</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>We break down the problem of characterizing authors’ dynamicity into two sets of approaches: Vectorizers and Trackers. Vectorizers, described in Section 4.1, focus on representing authors’ interests in one epoch. Trackers, described in Section 4.2, focus on quantifying the change in interests between two distinct epochs. When combined, these approaches provide a systematic way of characterizing authors’ dynamicity.</p>
      <sec id="sec-3-1">
        <title>4.1. Vectorizers</title>
        <p>We introduce three approaches for vectorizing authors’ interests in a given epoch. The main idea of all these approaches is that each author’s interests could be modeled through a distribution over the set of other authors. Our first two approaches rely only on the information that could be directly extracted from citation links. In contrast, the third approach uses external information by building upon the output of a community detection algorithm. As a result, the third approach is prone to erroneous information propagated from the underlying community detection algorithm; in return, it gains access to more complex information compared to the first two approaches.</p>
        <sec id="sec-3-1-1">
          <title>4.1.1. Co-authors</title>
          <p>In this approach, we represent an author’s interests through their co-authors. To this end, given two arbitrary authors <inline-formula><tex-math>$i$</tex-math></inline-formula> and <inline-formula><tex-math>$j$</tex-math></inline-formula> and epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>, we define the co-author weight value <inline-formula><tex-math>$\mathrm{CA}_i^t(j)$</tex-math></inline-formula> as
            <disp-formula><label>(1)</label><tex-math>$$\mathrm{CA}_i^t(j) = |\mathcal{P}_i^t \cap \mathcal{P}_j^t|$$</tex-math></disp-formula>
          where <inline-formula><tex-math>$\mathcal{P}_i^t$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>. Building on top of these co-author weight values, for any arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula>, we form the representative vector <inline-formula><tex-math>$V_i^t$</tex-math></inline-formula> as
            <disp-formula><label>(2)</label><tex-math>$$V_i^t = [\mathrm{CA}_i^t(0), \mathrm{CA}_i^t(1), \ldots, \mathrm{CA}_i^t(|\mathcal{A}|)]$$</tex-math></disp-formula>
          where <inline-formula><tex-math>$\mathcal{A}$</tex-math></inline-formula> is the set of all authors in the KG. It is important to note that these representative vectors are extremely sparse due to the large cardinality of <inline-formula><tex-math>$\mathcal{A}$</tex-math></inline-formula>.</p>
        </sec>
      </sec>
      <sec id="sec-4-1">
        <title>4.1.2. Citations</title>
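        <p>In this approach, we represent an author’s interests through the authors they cite and the authors who cite them. To this end, given two arbitrary authors <inline-formula><tex-math>$i$</tex-math></inline-formula> and <inline-formula><tex-math>$j$</tex-math></inline-formula> and epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>, we define the citation weight value <inline-formula><tex-math>$\mathrm{C}_i^t(j)$</tex-math></inline-formula> as
          <disp-formula><label>(3)</label><tex-math>$$\mathrm{C}_i^t(j) = \sum_{p \in \mathcal{P}_i^t} |\mathcal{R}_p^t \cap \mathcal{P}_j^t| + \sum_{p \in \mathcal{P}_j^t} |\mathcal{R}_p^t \cap \mathcal{P}_i^t|$$</tex-math></disp-formula>
        where <inline-formula><tex-math>$\mathcal{P}_i^t$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in epoch <inline-formula><tex-math>$t$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathcal{R}_p^t$</tex-math></inline-formula> is the set of all publications cited by publication <inline-formula><tex-math>$p$</tex-math></inline-formula> in epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>. Building on these citation weight values, for any arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula>, we form the representative vector <inline-formula><tex-math>$V_i^t$</tex-math></inline-formula> following Equation 2, replacing <inline-formula><tex-math>$\mathrm{CA}$</tex-math></inline-formula> with <inline-formula><tex-math>$\mathrm{C}$</tex-math></inline-formula>.</p>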
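        <p>As a rough illustration of the co-author weight (Equation 1) and the citation weight (Equation 3), the following Python sketch computes both as sparse dictionaries over toy data. It is not the pipeline used in this work: the names <monospace>pubs_by_author</monospace> and <monospace>refs_by_pub</monospace> and the toy publications are hypothetical, and the inputs are assumed to be already restricted to a single epoch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def coauthor_weights(author, pubs_by_author):
    """Equation 1 as a sparse dict: shared-publication counts with every author."""
    own = pubs_by_author[author]
    weights = {}
    for other, pubs in pubs_by_author.items():
        shared = len(own & pubs)
        if shared:  # keep only non-zero entries, since the vectors are very sparse
            weights[other] = shared
    return weights


def citation_weights(author, pubs_by_author, refs_by_pub):
    """Equation 3 as a sparse dict: citations in both directions between two authors."""
    own = pubs_by_author[author]
    weights = {}
    for other, pubs in pubs_by_author.items():
        # Publications of `other` cited by publications of `author`.
        outgoing = sum(len(refs_by_pub.get(p, set()) & pubs) for p in own)
        # Publications of `author` cited by publications of `other`.
        incoming = sum(len(refs_by_pub.get(p, set()) & own) for p in pubs)
        if outgoing + incoming:
            weights[other] = outgoing + incoming
    return weights


# Toy single-epoch data (hypothetical): author -> publications, publication -> references.
pubs_by_author = {"a": {"p1", "p2"}, "b": {"p2", "p3"}, "c": {"p4"}}
refs_by_pub = {"p1": {"p3"}, "p3": {"p2"}, "p4": {"p1", "p2"}}

print(coauthor_weights("a", pubs_by_author))
print(citation_weights("a", pubs_by_author, refs_by_pub))
```
]]></preformat>
        <p>Storing only the non-zero weights mirrors the sparsity of the representative vectors; a missing key stands for a zero entry.</p>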
      </sec>
      <sec id="sec-4-2">
        <title>4.1.3. Communities</title>
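        <p>In this approach, we represent an author’s interests through authors with whom they publish in the same research communities. To this end, given a KG encompassing epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>, we first extract the citation graph by removing all non-publication nodes, i.e., authors. Then, we run the Leiden [3] community detection algorithm to extract a set of communities <inline-formula><tex-math>$\mathcal{C}^t$</tex-math></inline-formula>. We rely on the hypothesis that each community represents a somewhat unique field of study. We use a modified version of the Leiden algorithm that limits the maximum number of generated communities and the number of publications in a community. Doing so avoids the creation of large, unfocused communities and of small, insignificant ones. Given the set of extracted communities <inline-formula><tex-math>$\mathcal{C}^t$</tex-math></inline-formula>, for any two arbitrary authors <inline-formula><tex-math>$i$</tex-math></inline-formula> and <inline-formula><tex-math>$j$</tex-math></inline-formula>, we define the co-occurrence weight value <inline-formula><tex-math>$\mathrm{CO}_i^t(j)$</tex-math></inline-formula> as
          <disp-formula><label>(4)</label><tex-math><![CDATA[$$\mathrm{CO}_i^t(j) = \begin{cases} \sum_{c \in \mathcal{C}^t} \frac{|\mathcal{P}_i^c| \log_2(|\mathcal{P}_j^c| + \epsilon)}{|\mathcal{P}_i^t|} & i \neq j \\ 0 & i = j \end{cases}$$ ]]></tex-math></disp-formula>
        where <inline-formula><tex-math>$\mathcal{P}_i^c$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in community <inline-formula><tex-math>$c$</tex-math></inline-formula>, <inline-formula><tex-math>$\mathcal{P}_i^t$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>, and <inline-formula><tex-math>$\epsilon = 0.001$</tex-math></inline-formula>. In this formalization, the effect of each community is weighted by the number of publications an author has in that community, e.g., <inline-formula><tex-math>$|\mathcal{P}_i^c|$</tex-math></inline-formula>. Moreover, each author’s influence is smoothed by taking the log value of their number of publications, e.g., <inline-formula><tex-math>$\log_2(|\mathcal{P}_j^c| + \epsilon)$</tex-math></inline-formula>. The resulting equation highlights the connection between any two authors that have many papers in the same communities and simultaneously waives the need for tracking the communities themselves. Building on top of these co-occurrence weight values, for any arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula>, we can form a representative vector <inline-formula><tex-math>$V_i^t$</tex-math></inline-formula> following Equation 2, replacing <inline-formula><tex-math>$\mathrm{CA}$</tex-math></inline-formula> with <inline-formula><tex-math>$\mathrm{CO}$</tex-math></inline-formula>.</p>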
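        <p>The co-occurrence weight can be sketched in a few lines of Python, under the reading of Equation 4 in which each community contributes the author's publication count in that community times the log-smoothed count of the other author, normalized by the author's total publication count. The mapping <monospace>community_of_pub</monospace> stands in for the output of a community detection run and, like the toy data, is a hypothetical placeholder; no Leiden implementation is invoked here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math

EPSILON = 0.001  # smoothing constant stated in the text


def cooccurrence_weight(i, j, pubs_by_author, community_of_pub):
    """Co-occurrence weight of author j in the vector of author i (one epoch)."""
    if i == j:
        return 0.0
    own_i = pubs_by_author[i]
    own_j = pubs_by_author[j]
    total = 0.0
    for c in set(community_of_pub.values()):
        n_i = sum(1 for p in own_i if community_of_pub.get(p) == c)
        n_j = sum(1 for p in own_j if community_of_pub.get(p) == c)
        # Communities where author i has no publications contribute nothing.
        total += n_i * math.log2(n_j + EPSILON)
    return total / len(own_i)


# Toy data (hypothetical): publications p1 and p2 fall in community 0, p3 and p4 in community 1.
pubs_by_author = {"a": {"p1", "p2"}, "b": {"p2", "p3"}, "c": {"p4"}}
community_of_pub = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}

print(cooccurrence_weight("a", "b", pubs_by_author, community_of_pub))
print(cooccurrence_weight("a", "c", pubs_by_author, community_of_pub))
```
]]></preformat>
        <p>In this toy setting, author b shares a community with author a and receives a small positive weight, whereas author c, who has no publication in that community, receives a strongly negative weight because of the logarithm of the smoothing constant.</p>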
        <sec id="sec-4-2-1">
          <title>4.2. Trackers</title>
          <p>We introduce two tracking approaches for quantifying
the dynamicity between two distinct epochs. These two
approaches are built on well-known mathematical
concepts of cosine similarity and relative entropy.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
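      <title>4.2.1. Cosine Similarity (𝒮-score)</title>
      <p>Given the representative vectors of an arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula> from two time periods, <inline-formula><tex-math>$V_i^t$</tex-math></inline-formula> and <inline-formula><tex-math>$V_i^{t'}$</tex-math></inline-formula>, we calculate the cosine similarity score <inline-formula><tex-math>$\mathcal{S}_i^{t,t'}$</tex-math></inline-formula> defined as
        <disp-formula><label>(5)</label><tex-math>$$\mathcal{S}_i^{t,t'} = \frac{V_i^t \cdot V_i^{t'}}{\|V_i^t\| \, \|V_i^{t'}\|}$$</tex-math></disp-formula>
      The calculated cosine similarity scores represent the stability of authors’ interests across the two epochs, i.e., the higher the value, the more consistent the authors’ interests.</p>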
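      <p>A minimal Python sketch of the cosine similarity score over sparse vectors follows. It assumes each representative vector is stored as a dictionary from author identifier to weight, with missing keys standing for zeros; the function name and the handling of empty vectors are illustrative choices, not part of the method described above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_score(vec_t, vec_t_next):
    """Cosine similarity between two sparse vectors given as {author: weight} dicts."""
    dot = sum(w * vec_t_next.get(a, 0.0) for a, w in vec_t.items())
    norm_t = math.sqrt(sum(w * w for w in vec_t.values()))
    norm_next = math.sqrt(sum(w * w for w in vec_t_next.values()))
    if norm_t == 0.0 or norm_next == 0.0:
        # Assumption: an all-zero vector is treated as sharing nothing with the other epoch.
        return 0.0
    return dot / (norm_t * norm_next)


# Toy vectors (hypothetical) for one author in two epochs.
epoch_one = {"b": 3.0, "c": 4.0}
epoch_two = {"b": 3.0}

print(cosine_score(epoch_one, epoch_two))
```
]]></preformat>
      <p>Only the authors present in both dictionaries contribute to the dot product, so the cost scales with the number of non-zero entries rather than with the total number of authors.</p>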
    </sec>
    <sec id="sec-6">
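      <title>4.2.2. Relative Entropy (ℰ-score)</title>
      <p>Building on top of the representative vectors, for each arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula> in period <inline-formula><tex-math>$t$</tex-math></inline-formula>, we define a probability distribution as
        <disp-formula><label>(6)</label><tex-math>$$\mathcal{F}_i^t(j) = \frac{V_i^t[j] + \alpha}{\sum_{j' \in \mathcal{A}} V_i^t[j'] + \alpha |\mathcal{A}|} \quad \forall j \in \mathcal{A}$$</tex-math></disp-formula>
      where <inline-formula><tex-math>$\alpha = \frac{1}{|\mathcal{A}|}$</tex-math></inline-formula> is the prior probability and <inline-formula><tex-math>$\mathcal{A}$</tex-math></inline-formula> is the set of all authors in the KG. Then, given the probability distributions of an arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula> from two time periods, <inline-formula><tex-math>$\mathcal{F}_i^t$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathcal{F}_i^{t'}$</tex-math></inline-formula>, we calculate the relative entropy <inline-formula><tex-math>$\mathcal{E}_i^{t,t'}$</tex-math></inline-formula> as
        <disp-formula><label>(7)</label><tex-math>$$\mathcal{E}_i^{t,t'} = \mathrm{KL}(\mathcal{F}_i^{t'} \,\|\, \mathcal{F}_i^t) = \sum_{j \in \mathcal{A}} \mathcal{F}_i^{t'}(j) \log\left(\frac{\mathcal{F}_i^{t'}(j)}{\mathcal{F}_i^t(j)}\right)$$</tex-math></disp-formula></p>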
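      <p>The smoothing and relative entropy steps can be sketched in Python as follows. The sketch assumes a small, explicitly listed author set, a natural logarithm (the base is not stated in the text), and sparse vectors stored as dictionaries; the function names and toy values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def smoothed_distribution(vec, authors):
    """Add the prior alpha = 1/|A| to every entry and normalize to a distribution."""
    alpha = 1.0 / len(authors)
    total = sum(vec.get(a, 0.0) for a in authors) + alpha * len(authors)
    return {a: (vec.get(a, 0.0) + alpha) / total for a in authors}


def relative_entropy_score(vec_t, vec_t_next, authors):
    """KL divergence of the later-epoch distribution from the earlier-epoch one."""
    f_t = smoothed_distribution(vec_t, authors)
    f_next = smoothed_distribution(vec_t_next, authors)
    # Natural log is an assumption; smoothing keeps every ratio finite.
    return sum(f_next[a] * math.log(f_next[a] / f_t[a]) for a in authors)


# Toy data (hypothetical): four authors and one author's vectors in two epochs.
authors = ["a", "b", "c", "d"]
epoch_one = {"b": 2.0, "c": 1.0}
epoch_two = {"c": 1.0, "d": 2.0}

print(relative_entropy_score(epoch_one, epoch_two, authors))
```
]]></preformat>
      <p>Because of the prior, authors absent from one epoch still receive a small probability mass, which keeps the logarithm defined when an author appears in only one of the two epochs.</p>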
      <p>In contrast to the cosine similarity score, the calculated relative entropy scores represent the volatility of authors’ interests across the two epochs, i.e., the higher the value, the less consistent the authors’ interests are.</p>
      <sec>
        <title>5. Analyses</title>
        <p>Throughout this section, we run all our analyses on a set of 10,000 randomly sampled authors. More specifically, we do a weighted sampling without replacement using the citation counts. This procedure allows us to manage the computational costs of running these analyses.</p>
      </sec>
      <sec>
        <title>5.2. Entropy Analysis</title>
        <p>In this analysis, we study the connection between diversity and success. We use the authors’ entropy across the extracted communities as a proxy for diversity. As for success, with similar intuitions to those in Section 5.1, we use the average citation count as the proxy metric.</p>
        <p>Formally, given the set of extracted communities <inline-formula><tex-math>$\mathcal{C}^t$</tex-math></inline-formula>, for any arbitrary author <inline-formula><tex-math>$i$</tex-math></inline-formula>, we calculate the entropy across communities <inline-formula><tex-math>$\mathcal{H}_i^t$</tex-math></inline-formula> as
          <disp-formula><label>(8)</label><tex-math>$$p_i^c = \frac{|\mathcal{P}_i^c|}{|\mathcal{P}_i^t|}$$</tex-math></disp-formula>
          <disp-formula><label>(9)</label><tex-math>$$\mathcal{H}_i^t = -\sum_{c \in \mathcal{C}^t} p_i^c \log_2(p_i^c)$$</tex-math></disp-formula>
        where <inline-formula><tex-math>$\mathcal{P}_i^c$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in community <inline-formula><tex-math>$c$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathcal{P}_i^t$</tex-math></inline-formula> is the set of publications by author <inline-formula><tex-math>$i$</tex-math></inline-formula> in epoch <inline-formula><tex-math>$t$</tex-math></inline-formula>. Figure 1 illustrates the results of our analysis.</p>
        <p>We can observe in Figure 1 that, in both epochs, the average citation count increases with entropy up to a point and then drops again. This observation indicates the benefit of having a diverse portfolio; at the same time, too much diversity could negatively impact success.</p>
      </sec>
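      <p>For concreteness, the entropy across communities (Equations 8 and 9) can be computed as in the Python sketch below. It assumes every publication of the author is assigned to exactly one community; the mapping <monospace>community_of_pub</monospace> and the toy publications are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def community_entropy(author_pubs, community_of_pub):
    """Shannon entropy (base 2) of an author's publications across communities."""
    counts = Counter(community_of_pub[p] for p in author_pubs)
    total = len(author_pubs)
    # sum(q * log2(1/q)) equals -sum(q * log2(q)); communities with no
    # publications by this author never appear in `counts`.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy data (hypothetical): four publications spread over communities 0 to 3.
community_of_pub = {"p1": 0, "p2": 1, "p3": 2, "p4": 3}

print(community_entropy({"p1"}, community_of_pub))
print(community_entropy({"p1", "p2"}, community_of_pub))
print(community_entropy({"p1", "p2", "p3", "p4"}, community_of_pub))
```
]]></preformat>
      <p>An author whose publications all fall in one community has zero entropy, and the value grows as the publications spread more evenly over more communities.</p>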
      <sec id="sec-6-2">
        <title>5.1. Statistical Dependence Analysis</title>
        <p>This analysis studies the connection between the introduced stability scores and success across two epochs. We use the relative change in average citation count as the proxy metric for success. The main intuitions behind this metric are 1) citation count is an accepted correlated metric for success in the community, 2) using the average mitigates the effect of a high number of publications from an author, and 3) using relative change locally normalizes the metric values. Moreover, to reduce the potential noise in the data, we remove outliers by filtering out samples whose relative change in average citation count lies more than two standard deviations from the mean.</p>
        <p>To quantify the strength of this connection, we use the established bivariate correlation and univariate linear regression measurements. We also include a random noise vectorizer as a sanity check for our methodology.</p>
        <p>Table 2 presents the results of our analysis with one of the introduced scores as the independent variable <inline-formula><tex-math>$X$</tex-math></inline-formula> and the number of citations as the dependent variable <inline-formula><tex-math>$Y$</tex-math></inline-formula>. As evident from Table 2, every introduced score has a significant connection with success, some in the same direction and some in the opposite direction. Moreover, the “Citations” vectorizer showcases the highest correlation with the measurement for success, which signifies the effect of author interactions.</p>
      </sec>
      <sec id="sec-6-1">
        <title>5.3. Propensity Score Matching Analysis</title>
        <p>This analysis focuses on the potential causal relationship between adaptability and success in two epochs by utilizing the propensity score matching (PSM) technique. We use the increase in entropy and the citation count in the second epoch as proxy metrics for adaptability and success, respectively. Following this, we designate the increase in entropy as the treatment variable and the citation count in the second epoch as the outcome variable. As for the confounding variables, we use the publication counts from both epochs and the citation count in the first epoch.</p>
        <p>Some of the straightforward extensions of our work for future studies are 1) including more authors, 2) using a more extended period, and 3) changing the temporal granularity for tracking changes. Moreover, we used a relatively simple metric as our success proxy; future work could use other metrics, such as the h-index or i10-index.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was funded by the Defense Advanced Research
Projects Agency with award W911NF-19-20271 and with
support from a Keston Exploratory Research Award.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>