L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, Journal of the Association for Information Science and Technology 66 (2015) 2215-2222.

Is Dynamicity All You Need?

Richard Delwin Myloth, Kian Ahrabian, Arun Baalaaji Sankar Ananthan, Xinwei Du, Jay Pujara

Information Sciences Institute, Marina del Rey, CA, USA
University of Southern California, Los Angeles, CA, USA


Scientific domains are fluid entities that change over time. Take machine learning as an example. Up until the '90s, most methods were driven by expert knowledge. As time passed, more data-driven approaches appeared, finally leading to the advent of deep learning methods. As a result, in a span of 30 years, the field has gone through many changes and breakthroughs and has reached a point where many novelties have a life span shorter than five years. In parallel, a typical researcher's career spans roughly the same 30 years. Consequently, being a researcher requires shifts in the field of study throughout one's career. Besides, researchers' scientific interests are inherently dynamic and change over time. Hence, there is a dynamicity to authors' interests and fields of work. In this work, we study this phenomenon through systematic approaches for representing and tracking dynamicity across different epochs. Our representation approaches are based on the idea that each author can be represented as a distribution over other authors. Our tracking approaches, in turn, rely on established mathematical concepts for measuring the change between two distributions. We focus on publications in the 2001-2020 range and present a set of analyses, built on top of the introduced approaches, aimed at understanding the potential connection between dynamicity and success.

Keywords: Author Dynamicity, Causal Analysis, Scientific Research Analysis, Community Detection

1. Introduction

The past few decades have been an unprecedented era of scientific discoveries, with the sheer number of publications rising steadily [1]. This constant growth of research collaborations has led to the emergence of new interdisciplinary domains, prompting researchers to expand their research horizons. This expansion, combined with the continuous development of scientific domains and the inherent nature of research to explore new areas, results in a potentially volatile set of research directions. This work introduces approaches for systematically studying this fluidity and uncovering interesting behaviors among authors.

Scientific publications are the information vessels scientists use to communicate their findings, methodologies, and critiques. At the same time, publications are reflections of their authors' interests and fields of study. These publications are bound together through citations that specify the foundations of each work. As a result, citations create tightly connected groups of publications with similar research directions. Consequently, authors with a high number of interactions in these groups, either through collaborations or citations, are more likely to have similar interests.

Community detection algorithms are graph partitioning approaches that identify sets of tightly connected nodes that are loosely connected to nodes outside their respective sets [2, 3]. When employed on citation networks, these algorithms yield a set of communities where each community contains highly related publications. These extracted communities could then be exploited for indirectly analyzing authors' interests through publications and citations as proxies.

In this work, we study the authors' dynamicity phenomenon from a relational standpoint. More specifically, we focus on the following research questions:

1. How can we characterize and quantify the interests and dynamicity of an author?
2. Is there any connection between dynamicity and success due to reasons such as adaptability or diversity?

To this end, we first create two knowledge graphs (KG) from publications in the 2001-2020 period, each encompassing ten years' worth of scholarly information, i.e., publications and authors. Then, we introduce three vectorizing approaches focused on representing authors' interests in one epoch, and two tracking approaches focused on quantifying the change in interests between two distinct epochs. Our vectorizing approaches are built on top of relational information in the KGs and represent authors as a distribution over other authors. Meanwhile, our tracking approaches are built on the cosine similarity and relative entropy (Kullback-Leibler divergence) measures. By mixing and matching, these approaches yield six different dynamicity scores for each author. We then use these scores to investigate the connection between authors' dynamicity and success. Our analyses showcase the connection between success, diversity, and adaptability in research.

2. Related Work

Bird et al. [4] analyzed community structures in the DBLP bibliographic database to investigate collaborative connections in computer science and interdisciplinary research at the individual, within-area, and network-wide levels. They developed quantifiable metrics such as longitudinal assortativity over the number of publications, collaborators, and career length to study author overlap and migration patterns. Prior to Bird et al. [4], Newman [5] used data from publications in physics, biomedical research, and computer science to build co-authorship collaboration networks. This study looked at the number of publications produced by authors, the number of authors per article, the number of collaborators that scientists have, the existence and size of a significant component of connected scientists, and the degree of clustering in the networks. It examined collaboration patterns among participants and found that these variables follow a power law distribution and that collaboration relationships are transitive. Paul et al. [6] also used the DBLP database to develop a citation-collaboration network that ranks authors based on their contributions in terms of co-authorship and citations, verifying the rankings against the h-index. They also carried out a comparative examination of the change in author ranking for different parts of the author spectrum over time.

3. Dataset

OpenAlex [7] is a free and open catalog of scholarly entities that provides metadata for publications, authors, venues, institutions, and scientific concepts, along with the relationships among them. It gathers data from sources such as Crossref, Microsoft Academic Graph (MAG), ROR, ORCID, DOAJ, PubMed, PubMed Central, and Unpaywall. We use the OpenAlex dump obtained on 2022-12-07 to construct our dataset for this work. Given this dump, we first extract a KG containing all the publications and their connections, i.e., citation links. Then, we extract two induced KGs by filtering the publications with publication dates within the two ranges of 2001-2010 and 2011-2020, naming them CG-2010 and CG-2020, respectively. Following this, we add the authorship information for all the publications in each KG. Finally, we drop all the nodes with a zero degree (in and out) in both KGs. After this procedure, we end up with two temporally-scoped KGs containing authorship and citation information for all the publications in the 2001-2010 and 2011-2020 periods. Table 1 illustrates the statistics of the extracted KGs. To handle the large size of the raw dump, we used the KGTK toolkit for all our KG processing procedures [8].
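The extraction steps above can be illustrated with a minimal in-memory sketch. It is not the KGTK pipeline used in this work: the function name `build_epoch_kg`, the input layout (a year lookup plus edge lists), and the identifiers are all hypothetical, and the sketch assumes that author and publication identifiers never collide.

```python
def build_epoch_kg(pub_year, citations, authorship, start, end):
    """Sketch of one temporally-scoped KG (hypothetical helper, not KGTK).

    pub_year:   dict publication id -> publication year
    citations:  list of (citing publication, cited publication) pairs
    authorship: list of (author id, publication id) pairs
    """
    # Keep only publications dated inside the epoch.
    pubs = {p for p, year in pub_year.items() if start <= year <= end}
    # Induced subgraph: keep a citation only if both endpoints are in the epoch.
    cites = [(s, t) for s, t in citations if s in pubs and t in pubs]
    # Add authorship edges for the retained publications.
    wrote = [(a, p) for a, p in authorship if p in pubs]
    # Drop publications that ended up with no edge at all (zero degree).
    touched = {node for edge in cites + wrote for node in edge}
    pubs &= touched
    authors = {a for a, _ in wrote}
    return pubs, authors, cites, wrote


pub_year = {"W1": 2005, "W2": 2008, "W3": 2015, "W4": 2003}
citations = [("W2", "W1"), ("W3", "W2")]
authorship = [("A1", "W1"), ("A2", "W2"), ("A1", "W3")]
cg_2010 = build_epoch_kg(pub_year, citations, authorship, 2001, 2010)
cg_2020 = build_epoch_kg(pub_year, citations, authorship, 2011, 2020)
```

In this toy input, the citation from W3 to W2 crosses the epoch boundary and is therefore absent from both induced graphs, and W4 is dropped because it has neither citation links nor authors.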

4. Methodology

We break down the problem of characterizing authors' dynamicity into two sets of approaches: Vectorizers and Trackers. Vectorizers, described in Section 4.1, focus on representing authors' interests in one epoch. Trackers, described in Section 4.2, focus on quantifying the change in interests between two distinct epochs. When combined, these approaches provide a systematic way of characterizing authors' dynamicity.

4.1. Vectorizers

We introduce three approaches for vectorizing authors’ interests in a given epoch. The main idea of all these approaches is that each author’s interests could be modeled through a distribution over the set of other authors. Our first two approaches rely only on the information that could be directly extracted from citation links. In contrast, the third approach uses external information by building upon the output of a community detection algorithm. As a result, the third approach is prone to erroneous information propagated from the underlying community detection algorithm; in return, it gains access to more complex information compared to the first two approaches.

4.1.1. Co-authors

In this approach, we present an author's interests through their co-authors. To this end, given two arbitrary authors $a_i$ and $a_j$ and epoch $t$, we define the co-author weight value $\mathcal{W}^t_{a_i}(a_j)$ as

$$\mathcal{W}^t_{a_i}(a_j) = |\mathcal{P}^t_{a_i} \cap \mathcal{P}^t_{a_j}| \tag{1}$$

where $\mathcal{P}^t_{a_i}$ is the set of publications by author $a_i$ in epoch $t$. Building on top of these co-author weight values, for any arbitrary author $a_i$, we form the representative vector $\mathcal{V}^t_{a_i}$ as

$$\mathcal{V}^t_{a_i} = [\mathcal{W}^t_{a_i}(a_0), \mathcal{W}^t_{a_i}(a_1), \ldots, \mathcal{W}^t_{a_i}(a_{|\mathcal{A}|})] \tag{2}$$

where $\mathcal{A}$ is the set of all authors in the KG. It is important to note that these representative vectors are extremely sparse due to the large cardinality of $\mathcal{A}$.
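As an illustration of Equations 1 and 2, the following sketch stores the representative vector sparsely, keeping only non-zero weights. The function name `coauthor_vector` and the input layout are hypothetical. Excluding the author's own entry is an assumption made here because the vector is described as a distribution over other authors.

```python
from collections import Counter


def coauthor_vector(author, pubs_by_author):
    """Sparse co-author vector: number of shared publications per other author.

    pubs_by_author: dict author id -> set of publication ids in one epoch.
    """
    own = pubs_by_author[author]
    vector = Counter()
    for other, pubs in pubs_by_author.items():
        if other == author:
            continue  # assumption: an author is not their own co-author
        shared = len(own & pubs)  # Equation 1: size of the intersection
        if shared:
            vector[other] = shared  # zero weights are left implicit
    return vector


pubs_by_author = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2"},
    "C": {"p3"},
    "D": {"p4"},
}
vec_a = coauthor_vector("A", pubs_by_author)
```

Leaving zero entries implicit is what keeps the representation manageable when the set of authors is very large.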

4.1.2. Citations

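In this approach, we present an author's interests through the authors they cite and the authors who cite them. To this end, given two arbitrary authors $a_i$ and $a_j$ and epoch $t$, we define the citation weight value $\mathcal{Z}^t_{a_i}(a_j)$ as

$$\mathcal{Z}^t_{a_i}(a_j) = \sum_{p \in \mathcal{P}^t_{a_i}} |\mathcal{R}^t_p \cap \mathcal{P}^t_{a_j}| + \sum_{p \in \mathcal{P}^t_{a_j}} |\mathcal{R}^t_p \cap \mathcal{P}^t_{a_i}| \tag{3}$$

where $\mathcal{P}^t_{a_i}$ is the set of publications by author $a_i$ in epoch $t$ and $\mathcal{R}^t_p$ is the set of all publications cited by publication $p$ in epoch $t$. Building on these citation weight values, for any arbitrary author $a_i$, we form the representative vector following Equation 2, replacing the co-author weight values with the citation weight values $\mathcal{Z}^t_{a_i}$.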
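A small sketch of the citation weight in Equation 3 follows. It counts citations in both directions between the publications of two authors. The function name `citation_weight` and the dictionary-based inputs are hypothetical.

```python
def citation_weight(a_i, a_j, pubs_by_author, cited_by_pub):
    """Citations from a_i's publications to a_j's, plus the reverse direction.

    pubs_by_author: dict author id -> set of publication ids in one epoch
    cited_by_pub:   dict publication id -> set of publication ids it cites
    """
    pubs_i, pubs_j = pubs_by_author[a_i], pubs_by_author[a_j]
    # First sum: how often a_i's publications cite a_j's publications.
    i_cites_j = sum(len(cited_by_pub.get(p, set()) & pubs_j) for p in pubs_i)
    # Second sum: how often a_j's publications cite a_i's publications.
    j_cites_i = sum(len(cited_by_pub.get(p, set()) & pubs_i) for p in pubs_j)
    return i_cites_j + j_cites_i


pubs = {"A": {"p1", "p2"}, "B": {"p3"}}
refs = {"p1": {"p3"}, "p2": {"p3", "p9"}, "p3": {"p1"}}
w_ab = citation_weight("A", "B", pubs, refs)
```

The weight is symmetric in the two authors, so computing it once per unordered pair is enough.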

4.1.3. Communities

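In this approach, we present an author's interests through authors with whom they publish in the same research communities. To this end, given a KG encompassing epoch $t$, we first extract the citation graph by removing all non-publication nodes, i.e., authors. Then, we run the Leiden [3] community detection algorithm to extract a set of communities $\mathcal{C}$. We rely on the hypothesis that each community represents a somewhat unique field of study. We use a modified version of the Leiden algorithm that limits the maximum number of generated communities and the number of publications in a community. Doing so avoids the creation of large, unfocused communities or small, insignificant ones. Given the set of extracted communities $\mathcal{C}$, for any two arbitrary authors $a_i$ and $a_j$, we define the co-occurrence weight value $\mathcal{O}^t_{a_i}(a_j)$ as

$$\mathcal{O}^t_{a_i}(a_j) = \begin{cases} \sum_{c \in \mathcal{C}} \dfrac{|\mathcal{P}^c_{a_i}|}{|\mathcal{P}^t_{a_i}|} \log_2(|\mathcal{P}^c_{a_j}| + \epsilon) & a_i \neq a_j \\ 0 & a_i = a_j \end{cases} \tag{4}$$

where $\mathcal{P}^c_{a_i}$ is the set of publications by author $a_i$ in community $c$, $\mathcal{P}^t_{a_i}$ is the set of publications by author $a_i$ in epoch $t$, and $\epsilon = 0.001$. In this formalization, the effect of each community is weighted by the number of publications an author has in that community, e.g., $|\mathcal{P}^c_{a_i}|$. Moreover, each author's influence is smoothed by taking the log value of their number of publications, e.g., $\log_2(|\mathcal{P}^c_{a_j}| + \epsilon)$. The resulting equation highlights the connection between any two authors that have many papers in the same communities and simultaneously waives the need for tracking the communities themselves. Building on top of these co-occurrence weight values, for any arbitrary author $a_i$, we can form a representative vector following Equation 2, replacing the co-author weight values with the co-occurrence weight values $\mathcal{O}^t_{a_i}$.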
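The co-occurrence weight can be sketched under one reading of Equation 4: the share of the first author's publications that fall in a community multiplies the base-2 log of the second author's publication count in that community plus a small constant of 0.001. This reading, the function name `cooccurrence_weight`, and the nested-dictionary input are assumptions. The sketch also assumes every publication belongs to exactly one community, so that the per-community counts add up to the author's publication count in the epoch.

```python
import math


def cooccurrence_weight(a_i, a_j, pubs_in_comm, eps=0.001):
    """One reading of the co-occurrence weight between two authors.

    pubs_in_comm: dict author id -> dict community id -> publication count
    """
    if a_i == a_j:
        return 0.0  # the weight of an author with themselves is defined as 0
    total_i = sum(pubs_in_comm[a_i].values())
    weight = 0.0
    # Communities where a_i has no publication have a zero share, so they
    # contribute nothing and can be skipped.
    for community, count_i in pubs_in_comm[a_i].items():
        count_j = pubs_in_comm[a_j].get(community, 0)
        weight += (count_i / total_i) * math.log2(count_j + eps)
    return weight


pubs_in_comm = {"A": {"c1": 3, "c2": 1}, "B": {"c1": 4}}
w_ab = cooccurrence_weight("A", "B", pubs_in_comm)
```

Under this reading, a community in which the second author has no publications contributes a negative term, so the weight rewards authors who publish in the same communities and penalizes those who do not.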

4.2. Trackers

We introduce two tracking approaches for quantifying the dynamicity between two distinct epochs. These two approaches are built on well-known mathematical concepts of cosine similarity and relative entropy.

4.2.1. Cosine Similarity (𝒮-score)

Given the representative vectors of an arbitrary author $a$ from two time periods, $\mathcal{V}^t_a$ and $\mathcal{V}^{t'}_a$, we calculate the cosine similarity score $\mathcal{S}^{t,t'}_a$ defined as

$$\mathcal{S}^{t,t'}_a = \frac{\mathcal{V}^t_a \cdot \mathcal{V}^{t'}_a}{\|\mathcal{V}^t_a\| \, \|\mathcal{V}^{t'}_a\|} \tag{5}$$

The calculated cosine similarity scores represent the stability of authors' interests in two epochs, i.e., the higher the value, the more consistent the authors' interests.
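For illustration, the cosine similarity in Equation 5 can be computed directly on sparse vectors stored as dictionaries. The function name `cosine_score` and the convention of returning 0 for an empty vector are assumptions of this sketch.

```python
import math


def cosine_score(vec_old, vec_new):
    """Cosine similarity between two sparse vectors (dict author id -> weight)."""
    # Only authors present in both vectors contribute to the dot product.
    dot = sum(w * vec_new.get(author, 0.0) for author, w in vec_old.items())
    norm_old = math.sqrt(sum(w * w for w in vec_old.values()))
    norm_new = math.sqrt(sum(w * w for w in vec_new.values()))
    if norm_old == 0.0 or norm_new == 0.0:
        return 0.0  # assumption: an empty vector is treated as fully dissimilar
    return dot / (norm_old * norm_new)


stable = cosine_score({"B": 2, "C": 1}, {"B": 2, "C": 1})
shifted = cosine_score({"B": 1}, {"B": 1, "C": 1})
disjoint = cosine_score({"B": 1}, {"C": 1})
```

An author whose vector is unchanged across epochs scores close to 1, while an author whose related authors are entirely different scores 0.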

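4.2.2. Relative Entropy (ℰ-score)

Building on top of the representative vectors, for each arbitrary author $a_i$ in period $t$, we define a probability distribution $\mathcal{F}^t_{a_i}$ as

$$\mathcal{F}^t_{a_i}(a_j) = \frac{\mathcal{V}^t_{a_i}[a_j] + \lambda}{\sum_{a_k \in \mathcal{A}} \mathcal{V}^t_{a_i}[a_k] + \lambda|\mathcal{A}|} \quad \forall a_j \in \mathcal{A} \tag{6}$$

where $\lambda = \frac{1}{|\mathcal{A}|}$ is the prior probability and $\mathcal{A}$ is the set of all authors in the KG. Then, given the probability distributions of an arbitrary author $a_i$ from two time periods, $\mathcal{F}^t_{a_i}$ and $\mathcal{F}^{t'}_{a_i}$, we calculate the relative entropy $\mathcal{E}^{t,t'}_{a_i}$ as

$$\mathcal{E}^{t,t'}_{a_i} = \mathrm{KL}(\mathcal{F}^{t'}_{a_i} \,\|\, \mathcal{F}^t_{a_i}) = \sum_{a_j \in \mathcal{A}} \mathcal{F}^{t'}_{a_i}(a_j) \log\left(\frac{\mathcal{F}^{t'}_{a_i}(a_j)}{\mathcal{F}^t_{a_i}(a_j)}\right) \tag{7}$$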
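The smoothing in Equation 6 and the divergence in Equation 7 can be sketched as follows. The function names are hypothetical, the natural logarithm is assumed because the base of the log is not stated, and the dense loop over all authors is for readability only; at the scale of a full KG, an implementation would need to exploit the sparsity of the vectors.

```python
import math


def smoothed_distribution(vector, authors):
    """Additive smoothing with prior 1/|A|, then normalization to sum to 1."""
    prior = 1.0 / len(authors)
    total = sum(vector.get(a, 0.0) for a in authors) + prior * len(authors)
    return {a: (vector.get(a, 0.0) + prior) / total for a in authors}


def relative_entropy(vec_old, vec_new, authors):
    """KL divergence of the new-epoch distribution from the old-epoch one."""
    f_old = smoothed_distribution(vec_old, authors)
    f_new = smoothed_distribution(vec_new, authors)
    return sum(f_new[a] * math.log(f_new[a] / f_old[a]) for a in authors)


authors = ["B", "C", "D"]
unchanged = relative_entropy({"B": 2, "C": 1}, {"B": 2, "C": 1}, authors)
changed = relative_entropy({"B": 2, "C": 1}, {"D": 3}, authors)
```

Because the prior is strictly positive, every author receives non-zero probability in both epochs, so the ratio inside the logarithm is always defined.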

In contrast to the cosine similarity score, the calculated relative entropy scores represent the volatility of authors' interests in two epochs, i.e., the higher the value, the less consistent the authors' interests are.

5. Analyses

Throughout this section, we run all our analyses on a set of 10,000 randomly sampled authors. More specifically, we do a weighted sampling without replacement using the citation counts. This procedure allows us to manage the computational costs of running these analyses.
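One way to draw such a weighted sample without replacement is the key-based scheme of Efraimidis and Spirakis, sketched below. The sampling algorithm actually used is not specified here, so this choice, the function name `weighted_sample`, and the exclusion of authors with zero citations are all assumptions.

```python
import heapq
import random


def weighted_sample(citation_counts, k, seed=0):
    """Sample k distinct authors with probability increasing in citation count.

    citation_counts: dict author id -> citation count (used as the weight)
    """
    rng = random.Random(seed)
    # Each author gets the key u ** (1 / weight) with u uniform in [0, 1);
    # keeping the k largest keys yields a weighted sample without replacement.
    keyed = (
        (rng.random() ** (1.0 / weight), author)
        for author, weight in citation_counts.items()
        if weight > 0  # assumption: zero-weight authors can never be drawn
    )
    return [author for _, author in heapq.nlargest(k, keyed)]


counts = {"A": 120, "B": 30, "C": 0, "D": 5, "E": 60}
picked = weighted_sample(counts, 3, seed=42)
```

The scheme needs a single pass over the authors, which matters when the candidate pool is far larger than the 10,000 authors retained.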

5.1. Statistical Dependence Analysis

This analysis studies the connection between the introduced dynamicity scores and success across two epochs. We use the relative change in average citation count as the proxy metric for success. The main intuitions behind this metric are 1) citation count is a widely accepted correlate of success in the community, 2) using the average mitigates the effect of a high number of publications from an author, and 3) using relative change locally normalizes the metric values. Moreover, to reduce the potential noise in the data, we remove outliers by filtering out samples that lie more than two standard deviations from the mean relative change in average citation count.

To quantify the strength of this connection, we use the established bivariate correlation and univariate linear regression measurements. We also include a random noise vectorizer as a sanity check for our methodology.

Table 2 presents the results of our analysis with one of the introduced scores as the independent variable and the number of citations as the dependent variable. As evident from Table 2, every introduced score has a significant connection with success, some in the same direction and some in the opposite direction. Moreover, the “Citations” vectorizer showcases the highest correlation with the measurement for success, which signifies the effect of author interactions.

5.2. Entropy Analysis

In this analysis, we study the connection between diversity and success. We use the authors' entropy across the extracted communities as a proxy for diversity. As for success, with similar intuitions to the previous section, we use the average citation count as the proxy metric.

Formally, given the set of extracted communities $\mathcal{C}$, for any arbitrary author $a$, we calculate the entropy across communities $\mathcal{H}_a$ as

$$p^c_a = \frac{|\mathcal{P}^c_a|}{|\mathcal{P}^t_a|} \tag{8}$$

$$\mathcal{H}_a = -\sum_{c \in \mathcal{C}} p^c_a \log_2(p^c_a) \tag{9}$$

where $\mathcal{P}^c_a$ is the set of publications by author $a$ in community $c$ and $\mathcal{P}^t_a$ is the set of publications by author $a$ in epoch $t$. Figure 1 illustrates the results of our analysis.

We can observe in Figure 1 that, in both epochs, the average citation count increases with entropy up to a point and then drops again. This observation indicates the benefit of having a diverse portfolio, while also suggesting that too much diversity could negatively impact success.

5.3. Propensity Score Matching Analysis

This analysis focuses on the potential causal relationship between adaptability and success in two epochs by utilizing the propensity score matching (PSM) technique. We use the increase in entropy and citation count in the second epoch as proxy metrics for adaptability and success, respectively. Following this, we designate the increase in entropy as the treatment variable and the citation count in the second epoch as the outcome variable. As for the confounding variables, we use the publication counts from both epochs and the citation count in the first epoch.

Some of the straightforward extensions of our work for future studies are 1) including more authors, 2) using a more extended period, and 3) changing the temporal granularity for tracking changes. Moreover, we used a relatively simple metric as our success proxy; future work could use other metrics, such as the h-index or i10-index.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.