=Paper=
{{Paper
|id=Vol-3656/paper9
|storemode=property
|title=Is Dynamicity All You Need?
|pdfUrl=https://ceur-ws.org/Vol-3656/paper9.pdf
|volume=Vol-3656
|authors=Richard Delwin Myloth,Kian Ahrabian,Arun Baalaaji Sankar Ananthan,Xinwei Du,Jay Pujara
|dblpUrl=https://dblp.org/rec/conf/aaai/MylothAADP23
}}
==Is Dynamicity All You Need?==
<pdf width="1500px">https://ceur-ws.org/Vol-3656/paper9.pdf</pdf>
<pre>
                                Is Dynamicity All You Need?
                                Richard Delwin Myloth1,2 , Kian Ahrabian1,2,* , Arun Baalaaji Sankar Ananthan1,2 , Xinwei Du1,2
                                and Jay Pujara1,2
                                1 Information Sciences Institute, Marina del Rey, CA, USA
                                2 University of Southern California, Los Angeles, CA, USA


                                                                           Abstract
                                                                           Scientific domains are fluid entities that change and turn as time passes. Take machine learning as an example. Up until the
                                                                           ’90s, most of the methods were expert-knowledge-driven. However, as time passed, more data-driven approaches appeared,
                                                                           finally leading to the advent of deep learning methods. As a result, in a span of 30 years, the field has gone through many
                                                                           changes and breakthroughs and is at a point where many novelties have a life span of shorter than five years. In parallel, a
                                                                           regular researcher’s career span is around the same length. Consequently, being a researcher requires shifts in the field of
                                                                           study throughout one’s career. Besides, researchers’ scientific interests are inherently dynamic and change over time. Hence,
                                                                           there exists a dynamicity to authors’ interests and fields of work over time. In this work, we study this phenomenon through
                                                                           systematic approaches for representing and tracking dynamicity in different epochs. Our representation approaches are based
                                                                           on the idea that each author could be represented as a distribution of other authors. Concurrently, our tracking approaches
                                                                           rely on established mathematical concepts for measuring the change between two distributions. We focus on the publications
                                                                           in the 2001-2020 range and present a set of analyses built on top of the introduced approaches to understanding the potential
                                                                           connection between dynamicity and success.

                                                                           Keywords
                                                                           Author Dynamicity, Causal Analysis, Scientific Research Analysis, Community Detection


The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA
* Corresponding author.
myloth@usc.edu (R. D. Myloth); ahrabian@usc.edu (K. Ahrabian); arunbaal@usc.edu (A. B. S. Ananthan);
xinweidu@usc.edu (X. Du); jpujara@usc.edu (J. Pujara)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

The past few decades have been an unprecedented era of scientific discoveries, with the sheer number
of publications rising steadily [1]. This constant growth of research collaborations has led to the
emergence of new interdisciplinary domains, prompting researchers to expand their research horizons.
This expansion, combined with the continuous development of scientific domains and the inherent
nature of research to explore new areas, results in a potentially volatile set of research directions.
This work introduces approaches for systematically studying this fluidity and uncovering interesting
behaviors among authors.

Scientific publications are the information vessels scientists use to communicate their findings,
methodologies, and critiques. At the same time, publications are reflections of their authors’
interests and fields of study. These publications are bound together through citations that specify
the foundations of each work. As a result, citations create tightly connected groups of publications
with similar research directions. Consequently, authors with a high number of interactions in these
groups, either through collaborations or citations, are more likely to have similar interests.

Community detection algorithms are graph partitioning approaches that identify sets of tightly
connected nodes that are loosely connected to nodes outside their respective sets [2, 3]. When
employed on citation networks, these algorithms yield a set of communities where each community
contains highly related publications. These extracted communities could then be exploited for
indirectly analyzing authors’ interests through publications and citations as proxies.

In this work, we study the authors’ dynamicity phenomenon from a relational standpoint. More
specifically, we focus on the following research questions:

   1. How can we characterize and quantify the interests and dynamicity of an author?
   2. Is there any connection between dynamicity and success due to reasons such as adaptability
      or diversity?

To this end, we first create two knowledge graphs (KG) from publications in the 2001-2020 period,
each encompassing ten years’ worth of scholarly information, i.e., publications and authors. Then,
we introduce three vectorizing approaches focused on presenting authors’ interest in one epoch, and
two tracking approaches focused on quantifying the change in interests in two distinct epochs. Our
vectorizing approaches are built on top of relational information in the KGs and represent authors
as a distribution of other authors. Meanwhile, our tracking approaches are based on the two
well-known cosine similarity and relative entropy (Kullback–Leibler
divergence) measures. By mix-and-matching, these approaches yield six different dynamicity scores
for each author. We then use these scores to investigate the connection between authors’ dynamicity
and success. Our analyses showcase the connection between success, diversity, and adaptability in
research.

2. Related Work

Bird et al. [4] analyzed community structures in the DBLP bibliographic database to investigate
collaborative connections in computer science and interdisciplinary research at the individual,
within-area, and network-wide levels. They developed quantifiable metrics such as longitudinal
assortativity over the number of publications, collaborators, and career length to study author
overlap and migration patterns. Prior to Bird et al. [4], Newman [5] used data from publications in
physics, biomedical research, and computer science to build co-authorship collaboration networks.
They looked at the number of publications produced by authors, the number of authors per article,
the number of collaborators that scientists have, the existence and size of a significant component
of connected scientists, and the degree of clustering in the networks. They examined collaboration
patterns among participants and discovered that these variables follow a power law distribution and
that collaboration relationships are transitive. Paul et al. [6] also used the DBLP database in
their study to develop a citation-collaboration network to rank authors based on their
contributions in terms of co-authorship and citations while verifying them against the h-index.
They also carried out a comparative examination of the change in author ranking for different parts
of the author spectrum over time.

3. Dataset

OpenAlex [7] is a free and open catalog of scholarly entities that provides metadata for
publications, authors, venues, institutions, and scientific concepts, along with the relationships
among them. It gathers data from sources such as Crossref, Microsoft Academic Graph (MAG), ROR,
ORCID, DOAJ, PubMed, PubMed Central, and Unpaywall. We use the OpenAlex dump obtained on 2022-12-07
to construct our dataset for this work. Given this dump, we first extract a KG containing all the
publications and their connections, i.e., citation links. Then, we extract two induced KGs by
filtering the publications with publication dates within two ranges of 2001-2010 and 2011-2020,
naming them CG-2010 and CG-2020, respectively. Following this, we add the authorship information
for each KG for all the publications. Finally, we drop all the nodes with a zero degree (in and
out) in both KGs. After this procedure, we end up with two temporally-scoped KGs containing
authorship and citation information for all the publications in the 2001-2010 and 2011-2020
periods. Table 1 illustrates the statistics of the extracted KGs. To handle the large size of the
raw dump, we resorted to using the KGTK toolkit for all our KG processing procedures [8].

Table 1
Statistics of the extracted KGs.

  Dataset               CG-2010        CG-2020
  # Publications        19,707,369     33,743,276
  # Authors             20,333,216     36,077,559
  # Citation Links      167,133,583    323,927,950
  # Authorship Links    67,531,472     137,160,724

4. Methodology

We break down the problem of characterizing authors’ dynamicity into two sets of approaches:
Vectorizers and Trackers. Vectorizers, as described in Section 4.1, focus on presenting authors’
interest in one epoch. As described in Section 4.2, trackers focus on quantifying the change in
interests in two distinct epochs. When combined, these approaches provide a systematic way of
characterizing authors’ dynamicity.

4.1. Vectorizers

We introduce three approaches for vectorizing authors’ interests in a given epoch. The main idea of
all these approaches is that each author’s interests could be modeled through a distribution over
the set of other authors. Our first two approaches rely only on the information that could be
directly extracted from citation links. In contrast, the third approach uses external information
by building upon the output of a community detection algorithm. As a result, the third approach is
prone to erroneous information propagated from the underlying community detection algorithm; in
return, it gains access to more complex information compared to the first two approaches.

4.1.1. Co-authors

In this approach, we present an author’s interests through their co-authors. To this end, given two
arbitrary authors $p$ and $q$ and epoch $t$, we define the co-author weight value $\psi_p^t(q)$ as

    $$\psi_p^t(q) = |\mathcal{V}_p^t \cap \mathcal{V}_q^t| \tag{1}$$

where $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$. Building on top of
these co-author weight values, for any arbitrary author $p$, we form the representative vector
$z_p^t$ as

    $$z_p^t = [\psi_p^t(a_0), \psi_p^t(a_1), \ldots, \psi_p^t(a_{|\mathcal{A}|})] \tag{2}$$

where $\mathcal{A}$ is the set of all authors in the KG. It is important to note that these
representative vectors are extremely sparse due to the large cardinality of $\mathcal{A}$.

4.1.2. Citations

In this approach, we present an author’s interests through their citing and cited authors. To this
end, given two arbitrary authors $p$ and $q$ and epoch $t$, we define the citation weight value
$\varphi_p^t(q)$ as

    $$\varphi_p^t(q) = \sum_{v \in \mathcal{V}_p^t} |\mathcal{N}_v^t \cap \mathcal{V}_q^t|
                     + \sum_{u \in \mathcal{V}_q^t} |\mathcal{V}_p^t \cap \mathcal{N}_u^t| \tag{3}$$

where $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$ and
$\mathcal{N}_y^t$ is the set of all publications cited by publication $y$ in epoch $t$. Building on
these citation weight values, for any arbitrary author $p$, we form the representative vector
$z_p^t$ following Equation 2, replacing $\psi_p^t$ with $\varphi_p^t$.

4.1.3. Communities

In this approach, we present an author’s interests through authors with whom they publish in the
same research communities. To this end, given a KG encompassing epoch $t$, we first extract the
citation graph by removing all non-publication nodes, i.e., authors. Then, we run the Leiden [3]
community detection algorithm to extract a set of communities $\mathcal{C}$. We rely on the
hypothesis that each community represents a somewhat unique field of study. We use a modified
version of the Leiden algorithm that limits the maximum number of generated communities and the
number of publications in a community. Doing so avoids the creation of large unfocused or small
insignificant communities. Given the set of extracted communities $\mathcal{C}$, for any two
arbitrary authors $p$ and $q$, we define the co-occurrence weight value $\eta_p^{\mathcal{C}}(q)$ as

    $$\eta_p^{\mathcal{C}}(q) =
      \begin{cases}
        \sum_{c \in \mathcal{C}} \frac{|c_p|}{|\mathcal{V}_p^t|} \log_2(|c_q| + \alpha) & p \neq q \\
        0 & p = q
      \end{cases} \tag{4}$$

where $c_x$ is the set of publications by author $x$ in community $c$, $\mathcal{V}_x^t$ is the set
of publications by author $x$ in epoch $t$, and $\alpha = 0.001$. In this formalization, the effect
of each community is weighted by the share of publications an author has in that community, i.e.,
$|c_p| / |\mathcal{V}_p^t|$. Moreover, each author’s influence is smoothed by taking the log value
of their number of publications, i.e., $\log_2(|c_q| + \alpha)$. The resulting equation highlights
the connection between any two authors that have many papers in the same communities and
simultaneously waives the need for tracking the communities themselves. Building on top of these
co-occurrence weight values, for any arbitrary author $p$, we can form a representative vector
$z_p^t$ following Equation 2, replacing $\psi_p^t$ with $\eta_p^{\mathcal{C}}$.

4.2. Trackers

We introduce two tracking approaches for quantifying the dynamicity between two distinct epochs.
These two approaches are built on the well-known mathematical concepts of cosine similarity and
relative entropy.

4.2.1. Cosine Similarity (𝒮-score)

Given the representative vectors of an arbitrary author $p$ from two time periods, $z_p^t$ and
$z_p^{t'}$, we calculate the cosine similarity score $\mathcal{S}_p^{t,t'}$ defined as

    $$\mathcal{S}_p^{t,t'} = \frac{z_p^t \cdot z_p^{t'}}{\|z_p^t\| \, \|z_p^{t'}\|}. \tag{5}$$

The calculated cosine similarity scores represent the stability of authors’ interests in two
epochs, i.e., the higher the value, the more consistent the authors’ interests.

4.2.2. Relative Entropy (ℰ-score)

Building on top of the representative vectors, for each arbitrary author $p$ in period $t$, we
define a probability distribution as

    $$\mathcal{F}_p^t(q) = \frac{z_p^t[q] + \epsilon}
                                {\sum_{q' \in \mathcal{A}} z_p^t[q'] + \epsilon |\mathcal{A}|}
      \quad \forall q \in \mathcal{A} \tag{6}$$

where $\epsilon = \frac{1}{|\mathcal{A}|}$ is the prior probability and $\mathcal{A}$ is the set of
all authors in the KG. Then, given the probability distributions of an arbitrary author $p$ from
two time periods, $\mathcal{F}_p^t$ and $\mathcal{F}_p^{t'}$, we calculate the relative entropy
$\mathcal{E}_p^{t,t'}$ as

    $$\mathcal{E}_p^{t,t'} = D_{\mathrm{KL}}(\mathcal{F}_p^{t'} \,\|\, \mathcal{F}_p^{t})
      = \sum_{q \in \mathcal{A}} \mathcal{F}_p^{t'}(q)
        \log\left(\frac{\mathcal{F}_p^{t'}(q)}{\mathcal{F}_p^{t}(q)}\right). \tag{7}$$

In contrast to the cosine similarity score, the calculated relative entropy scores represent the
volatility of authors’ interests in two epochs, i.e., the higher the value, the less consistent the
authors’ interests are.

5. Analyses

Throughout this section, we run all our analyses on a set of 10,000 randomly sampled authors. More
specifically, we do a weighted sampling without replacement using the citation counts. This
procedure allows us to manage the computational costs of running these analyses.
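
To make the sampling step and the two tracker scores concrete, the following is a minimal sketch
and not the authors' implementation. It assumes that each representative vector is stored sparsely
as a Python dict mapping author identifiers to non-negative weights. The function names
(weighted_sample, s_score, e_score) are hypothetical, and the sampling uses the Efraimidis-Spirakis
key method as one possible way to draw a weighted sample without replacement; the paper does not
state which procedure was used. The relative entropy is taken as D_KL(F^{t'} || F^t) with the
smoothing of Equation 6, which is how Equation 7 is read here; swap the two arguments if the
opposite direction is intended.

```python
import heapq
import math
import random


def weighted_sample(weights, k, seed=0):
    """Draw k distinct keys with probability proportional to their weight.

    Assumption: Efraimidis-Spirakis keys. Each item gets key log(u) / w with
    u uniform in (0, 1], and the k items with the largest keys are kept.
    """
    rng = random.Random(seed)
    keyed = []
    for author, w in weights.items():
        if w > 0:  # zero-weight authors can never be drawn
            u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
            keyed.append((math.log(u) / w, author))
    return [author for _, author in heapq.nlargest(k, keyed)]


def s_score(z_t, z_t2):
    """Cosine similarity (Equation 5) of two sparse {author: weight} vectors."""
    dot = sum(w * z_t2.get(q, 0.0) for q, w in z_t.items())
    norm_t = math.sqrt(sum(w * w for w in z_t.values()))
    norm_t2 = math.sqrt(sum(w * w for w in z_t2.values()))
    if norm_t == 0.0 or norm_t2 == 0.0:
        return 0.0  # assumption: an all-zero vector counts as fully dissimilar
    return dot / (norm_t * norm_t2)


def e_score(z_t, z_t2, n_authors):
    """Relative entropy D_KL(F^{t'} || F^t) with eps = 1 / |A| smoothing."""
    eps = 1.0 / n_authors
    den_t = sum(z_t.values()) + eps * n_authors
    den_t2 = sum(z_t2.values()) + eps * n_authors
    support = set(z_t) | set(z_t2)
    total = 0.0
    for q in support:
        f_t = (z_t.get(q, 0.0) + eps) / den_t
        f_t2 = (z_t2.get(q, 0.0) + eps) / den_t2
        total += f_t2 * math.log(f_t2 / f_t)
    # Authors absent from both vectors all share the same smoothed probability,
    # so their contribution is added in one step instead of |A| iterations.
    rest = n_authors - len(support)
    if rest > 0:
        f_t, f_t2 = eps / den_t, eps / den_t2
        total += rest * f_t2 * math.log(f_t2 / f_t)
    return total


# Toy data (made up for illustration only).
citation_counts = {"a": 50, "b": 5, "c": 0, "d": 20}
sampled = weighted_sample(citation_counts, k=2)
z_2010 = {"b": 3.0, "c": 1.0}
z_2020 = {"b": 1.0, "d": 4.0}
print(sampled, s_score(z_2010, z_2020), e_score(z_2010, z_2020, n_authors=4))
```

Handling the authors outside both supports in closed form keeps the cost proportional to the number
of non-zero entries, which matters because the vectors are extremely sparse relative to the full
author set.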

Table 2
Univariate linear regression and bivariate correlation metrics between introduced scores and
relative change in average citation count. Legend: PCC: Pearson correlation coefficient.

  Tracker    Vectorizer     PCC       Coef.      SE         t         P > |t|
  𝒮-score    Random         -0.001    -967.70    5156.52    -0.188    0.851
  𝒮-score    Co-authors     -0.121    -26.03     2.15       -12.11    0.000
  𝒮-score    Citations      -0.138    -27.95     2.02       -13.81    0.000
  𝒮-score    Communities    -0.082    -25.72     3.17       -8.12     0.000
  ℰ-score    Random         0.015     47.03      31.15      1.51      0.131
  ℰ-score    Co-authors     -0.057    -0.64      0.11       -5.65     0.000
  ℰ-score    Citations      0.198     3.019      0.15       20.00     0.000
  ℰ-score    Communities    0.048     0.66       0.14       4.73      0.000

Table 3
Treatment effect evaluations. Legend: ATE: Average treatment effect, ATT: Average treatment effect
on the treated, ATU: Average treatment effect on the untreated.

  Metric    Est.        SE        z         P > |z|
  ATE       -189.157    36.274    -5.215    0.000
  ATT       -176.136    29.762    -5.918    0.000
  ATU       -202.178    43.471    -4.651    0.000

Figure 1: The effect of entropy on average citation count.

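
The PCC, Coef., SE, and t columns of Table 2 are the standard outputs of a least-squares fit with a
single predictor. The sketch below shows how such values can be computed for one score; it is an
illustration under stated assumptions, not the authors' code. The helper names (filter_outliers,
univariate_ols) are hypothetical, the two-standard-deviation filter mirrors the outlier removal
described in Section 5.1, and using the population standard deviation in that filter is an
assumption.

```python
import math


def filter_outliers(xs, ys, n_std=2.0):
    """Keep (x, y) pairs whose y lies within n_std standard deviations of mean(y)."""
    mean_y = sum(ys) / len(ys)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))
    kept = [(x, y) for x, y in zip(xs, ys) if abs(y - mean_y) <= n_std * std_y]
    return [x for x, _ in kept], [y for _, y in kept]


def univariate_ols(xs, ys):
    """Fit y = intercept + coef * x and report PCC, coef, its SE, and t."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    coef = sxy / sxx
    intercept = mean_y - coef * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((y - (intercept + coef * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    t_stat = coef / se if se > 0 else math.inf
    return {"pcc": sxy / math.sqrt(sxx * syy), "coef": coef, "se": se, "t": t_stat}


# Toy data (made up): x is a dynamicity score, y a relative change in citations.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
changes = [2.0, 4.0, 5.0, 4.0, 5.0]
xs, ys = filter_outliers(scores, changes)
print(univariate_ols(xs, ys))
```

The p-value column would additionally require the Student t distribution with n - 2 degrees of
freedom, which is left out here to keep the sketch free of third-party dependencies.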
5.1. Statistical Dependence Analysis

This analysis studies the connection between the introduced stability scores and success across two
epochs. We use the relative change in average citation count as the proxy metric for success. The
main intuitions behind this metric are 1) citation count is an accepted correlated metric for
success in the community, 2) using the average mitigates the effect of the high number of
publications from an author, and 3) using relative change locally normalizes the metric values.
Moreover, to reduce the potential noise in the data, we remove the outliers by filtering out
samples that lie more than two standard deviations away from the mean relative change in average
citation count.

To quantify the strength of this connection, we use the established bivariate correlation and
univariate linear regression measurements. We also include a random noise vectorizer as a sanity
check to our methodology. Table 2 presents the results of our analysis with one of the introduced
scores as the independent variable $\mathcal{X}$ and the number of citations as the dependent
variable $\mathcal{Y}$. As evident from Table 2, every introduced score has a significant
connection with success, some in the same direction and some in the opposite direction. Moreover,
the “Citations” vectorizer showcases the highest correlation with the measurement for success,
which signifies the effect of author interactions.

5.2. Entropy Analysis

In this analysis, we study the connection between diversity and success. We use the authors’
entropy across the extracted communities as a proxy for diversity. As for success, with similar
intuitions to the previous section, we use the average citation count as the proxy metric.
Formally, given the set of extracted communities $\mathcal{C}$, for any arbitrary author $p$, we
calculate the entropy across communities $\mathcal{H}_p^{\mathcal{C}}$ as

    $$w_p^c = \frac{|c_p|}{|\mathcal{V}_p^t|} \tag{8}$$

    $$\mathcal{H}_p^{\mathcal{C}} = -\sum_{c \in \mathcal{C}} w_p^c \log_2(w_p^c) \tag{9}$$

where $c_x$ is the set of publications by author $x$ in community $c$ and $\mathcal{V}_x^t$ is the
set of publications by author $x$ in epoch $t$. Figure 1 illustrates the results of our analysis.
We can observe in Figure 1 that in both epochs the average citation count increases with entropy up
to a point and then drops again. This observation indicates the benefit of having a diverse
portfolio, but simultaneously too much diversity could negatively impact success.

5.3. Propensity Score Matching Analysis

This analysis focuses on the potential causal relationship between adaptability and success in two
epochs by utilizing the propensity score matching (PSM) technique. We use the increase in entropy
and citation count in the second epoch as proxy metrics for adaptability and success, respectively.
Following this, we designate the increase in entropy as the treatment variable and the citation
count in the second epoch as the outcome variable. As for the confounding variables, we use the
publication counts
from both epochs and the citation count in the first epoch. To check the matching quality, we plot
one of the confounding variables, i.e., publication counts in the second epoch, against the outcome
variable for both control and treatment groups in Figure 2. Moreover, Table 3 presents the
treatment effect evaluation results. From Table 3, we can observe that the average treatment effect
(ATE) has a larger magnitude than the average treatment effect on the treated (ATT), while both are
negative. This observation indicates that while, in general, the authors have experienced a decline
in the number of citations, the increase in entropy slows down this phenomenon. Hence,
adaptability, i.e., an increase in entropy, could be seen as a remedy for a decline in success.

Figure 2: Matched groups for the confounding variable, i.e., publication count in the second epoch,
for both control and treatment groups against the outcome variable.

6. Conclusion and Future Works

Motivated by our observation of scientific domains’ fluidity and empowered by the emergence of
public repositories of scholarly data, we presented a thorough systematic study of the author
dynamicity phenomenon in this work. With the idea of representing authors’ interests and fields of
work by a distribution of other authors, we introduced three different systematic approaches for
vectorizing each author in a single epoch. Then, to track an author’s behavioral changes between
two epochs, we introduced two approaches built on top of the extracted vectors and well-known
mathematical approaches for quantifying change. Based on these approaches, we presented in-depth
analyses to better understand the connection between success, as measured by citation counts, and
specific dynamic behaviors, as measured through the introduced approaches.

Some of the straightforward extensions of our work for future studies are 1) including more
authors, 2) using a more extended period, and 3) changing the temporal granularity for tracking
changes. Moreover, we used a relatively simple metric as our success proxy; future works could work
with other metrics, such as the h-index or i10-index.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and
with support from a Keston Exploratory Research Award.

References

[1] L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the
    number of publications and cited references, Journal of the Association for Information Science
    and Technology 66 (2015) 2215–2222.
[2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in
    large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (2008) P10008.
[3] V. A. Traag, L. Waltman, N. J. Van Eck, From Louvain to Leiden: guaranteeing well-connected
    communities, Scientific Reports 9 (2019) 1–12.
[4] C. Bird, E. T. Barr, A. Nash, P. T. Devanbu, V. Filkov, Z. Su, Structure and dynamics of
    research collaboration in computer science, in: SDM, 2009.
[5] M. E. Newman, Scientific collaboration networks. I. Network construction and fundamental
    results, Phys Rev E Stat Nonlin Soft Matter Phys 64 (2001) 016131.
[6] P. S. Paul, V. Kumar, P. Choudhury, S. Nandi, Temporal analysis of author ranking using
    citation-collaboration network, in: 2015 7th International Conference on Communication Systems
    and Networks (COMSNETS), 2015, pp. 1–6. doi:10.1109/COMSNETS.2015.7098737.
[7] J. Priem, H. Piwowar, R. Orr, OpenAlex: A fully-open index of scholarly works, authors, venues,
    institutions, and concepts, arXiv preprint arXiv:2205.01833 (2022).
[8] F. Ilievski, D. Garijo, H. Chalupsky, N. T. Divvala, Y. Yao, C. Rogers, R. Li, J. Liu,
    A. Singh, D. Schwabe, et al., KGTK: a toolkit for large knowledge graph manipulation and
    analysis, in: International Semantic Web Conference, Springer, 2020, pp. 278–293.

</pre>