Is Dynamicity All You Need?

Richard Delwin Myloth (1,2), Kian Ahrabian (1,2,*), Arun Baalaaji Sankar Ananthan (1,2), Xinwei Du (1,2) and Jay Pujara (1,2)
(1) Information Sciences Institute, Marina del Rey, CA, USA
(2) University of Southern California, Los Angeles, CA, USA

The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA
* Corresponding author.
myloth@usc.edu (R. D. Myloth); ahrabian@usc.edu (K. Ahrabian); arunbaal@usc.edu (A. B. S. Ananthan); xinweidu@usc.edu (X. Du); jpujara@usc.edu (J. Pujara)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract
Scientific domains are fluid entities that change and turn as time passes. Take machine learning as an example. Up until the '90s, most of the methods were expert-knowledge-driven. However, as time passed, more data-driven approaches appeared, finally leading to the advent of deep learning methods. As a result, in a span of 30 years, the field has gone through many changes and breakthroughs and is at a point where many novelties have a life span shorter than five years. In parallel, a typical researcher's career span is around the same length as this 30-year period. Consequently, being a researcher requires shifts in the field of study throughout one's career. Moreover, researchers' scientific interests are inherently dynamic and change over time. Hence, there exists a dynamicity to authors' interests and fields of work over time. In this work, we study this phenomenon through systematic approaches for representing and tracking dynamicity in different epochs. Our representation approaches are based on the idea that each author could be represented as a distribution of other authors. Concurrently, our tracking approaches rely on established mathematical concepts for measuring the change between two distributions. We focus on the publications in the 2001-2020 range and present a set of analyses built on top of the introduced approaches to understand the potential connection between dynamicity and success.

Keywords
Author Dynamicity, Causal Analysis, Scientific Research Analysis, Community Detection

1. Introduction

The past few decades have been an unprecedented era of scientific discoveries, with the sheer number of publications rising steadily [1]. This constant growth of research collaborations has led to the emergence of new interdisciplinary domains, prompting researchers to expand their research horizons. This expansion, combined with the continuous development of scientific domains and the inherent nature of research to explore new areas, results in a potentially volatile set of research directions. This work introduces approaches for systematically studying this fluidity and uncovering interesting behaviors among authors.

Scientific publications are the information vessels scientists use to communicate their findings, methodologies, and critiques. At the same time, publications are reflections of their authors' interests and fields of study. These publications are bound together through citations that specify the foundations of each work. As a result, citations create tightly connected groups of publications with similar research directions. Consequently, authors with a high number of interactions in these groups, either through collaborations or citations, are more likely to have similar interests.

Community detection algorithms are graph partitioning approaches that identify sets of tightly connected nodes that are loosely connected to nodes outside their respective sets [2, 3]. When employed on citation networks, these algorithms yield a set of communities where each community contains highly related publications. These extracted communities could then be exploited for indirectly analyzing authors' interests through publications and citations as proxies.

In this work, we study the authors' dynamicity phenomenon from a relational standpoint. More specifically, we focus on the following research questions:

1. How can we characterize and quantify the interests and dynamicity of an author?
2. Is there any connection between dynamicity and success due to reasons such as adaptability or diversity?

To this end, we first create two knowledge graphs (KG) from publications in the 2001-2020 period, each encompassing ten years' worth of scholarly information, i.e., publications and authors. Then, we introduce three vectorizing approaches focused on representing authors' interests in one epoch, and two tracking approaches focused on quantifying the change in interests between two distinct epochs. Our vectorizing approaches are built on top of relational information in the KGs and represent authors as a distribution of other authors. Meanwhile, our tracking approaches are based on two well-known measures: cosine similarity and relative entropy (Kullback–Leibler divergence). By mixing and matching, these approaches yield six different dynamicity scores for each author (three vectorizers times two trackers). We then use these scores to investigate the connection between authors' dynamicity and success. Our analyses showcase the connection between success, diversity, and adaptability in research.

2. Related Work

Bird et al. [4] analyzed community structures in the DBLP bibliographic database to investigate collaborative connections in computer science and interdisciplinary research at the individual, within-area, and network-wide levels. They developed quantifiable metrics such as longitudinal assortativity over the number of publications, collaborators, and career length to study author overlap and migration patterns. Prior to Bird et al. [4], Newman [5] used data from publications in physics, biomedical research, and computer science to build co-authorship collaboration networks. They looked at the number of publications produced by authors, the number of authors per article, the number of collaborators that scientists have, the existence and size of a significant component of connected scientists, and the degree of clustering in the networks. They examined collaboration patterns among participants and discovered that these variables follow a power law distribution and that collaboration relationships are transitive. Paul et al. [6] also used the DBLP database in their study to develop a citation-collaboration network to rank authors based on their contributions in terms of co-authorship and citations while verifying them against the h-index. They also carried out a comparative examination of the change in author ranking for different parts of the author spectrum over time.

3. Dataset

OpenAlex [7] is a free and open catalog of scholarly entities that provides metadata for publications, authors, venues, institutions, and scientific concepts, along with the relationships among them. It gathers data from sources such as Crossref, Microsoft Academic Graph (MAG), ROR, ORCID, DOAJ, PubMed, PubMed Central, and Unpaywall. We use the OpenAlex dump obtained on 2022-12-07 to construct our dataset for this work. Given this dump, we first extract a KG containing all the publications and their connections, i.e., citation links. Then, we extract two induced KGs by filtering the publications with publication dates within two ranges of 2001-2010 and 2011-2020, naming them CG-2010 and CG-2020, respectively. Following this, we add the authorship information for each KG for all the publications. Finally, we drop all the nodes with a zero degree (in and out) in both KGs. After this procedure, we end up with two temporally-scoped KGs containing authorship and citation information for all the publications in the 2001-2010 and 2011-2020 periods. Table 1 illustrates the statistics of the extracted KGs. To handle the large size of the raw dump, we resorted to using the KGTK toolkit for all our KG processing procedures [8].

Table 1
Statistics of the extracted KGs.

Dataset               CG-2010        CG-2020
# Publications        19,707,369     33,743,276
# Authors             20,333,216     36,077,559
# Citation Links      167,133,583    323,927,950
# Authorship Links    67,531,472     137,160,724

4. Methodology

We break down the problem of characterizing authors' dynamicity into two sets of approaches: Vectorizers and Trackers. Vectorizers, as described in Section 4.1, focus on representing authors' interests in one epoch. Trackers, as described in Section 4.2, focus on quantifying the change in interests between two distinct epochs. When combined, these approaches provide a systematic way of characterizing authors' dynamicity.

4.1. Vectorizers

We introduce three approaches for vectorizing authors' interests in a given epoch. The main idea of all these approaches is that each author's interests could be modeled through a distribution over the set of other authors. Our first two approaches rely only on the information that could be directly extracted from citation links. In contrast, the third approach uses external information by building upon the output of a community detection algorithm. As a result, the third approach is prone to erroneous information propagated from the underlying community detection algorithm; in return, it gains access to more complex information compared to the first two approaches.

4.1.1. Co-authors

In this approach, we represent an author's interests through their co-authors. To this end, given two arbitrary authors $p$ and $q$ and epoch $t$, we define the co-author weight value $\psi_p^t(q)$ as

$$\psi_p^t(q) = |\mathcal{V}_p^t \cap \mathcal{V}_q^t| \tag{1}$$

where $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$. Building on top of these co-author weight values, for any arbitrary author $p$, we form the representative vector $z_p^t$ as

$$z_p^t = [\psi_p^t(a_1), \psi_p^t(a_2), \ldots, \psi_p^t(a_{|\mathcal{A}|})] \tag{2}$$

where $\mathcal{A}$ is the set of all authors in the KG. It is important to note that these representative vectors are extremely sparse due to the large cardinality of $\mathcal{A}$.

4.1.2. Citations

In this approach, we represent an author's interests through their citing and cited authors. To this end, given two arbitrary authors $p$ and $q$ and epoch $t$, we define the citation weight value $\varphi_p^t(q)$ as

$$\varphi_p^t(q) = \sum_{v \in \mathcal{V}_p^t} |\mathcal{N}_v^t \cap \mathcal{V}_q^t| + \sum_{u \in \mathcal{V}_q^t} |\mathcal{V}_p^t \cap \mathcal{N}_u^t| \tag{3}$$

where $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$ and $\mathcal{N}_y^t$ is the set of all publications cited by publication $y$ in epoch $t$. Building on these citation weight values, for any arbitrary author $p$, we form the representative vector $z_p^t$ following Equation 2, replacing $\psi_p^t$ with $\varphi_p^t$.

4.1.3. Communities

In this approach, we represent an author's interests through authors with whom they publish in the same research communities. To this end, given a KG encompassing epoch $t$, we first extract the citation graph by removing all non-publication nodes, i.e., authors. Then, we run the Leiden [3] community detection algorithm to extract a set of communities $\mathcal{C}$. We rely on the hypothesis that each community represents a somewhat unique field of study. We use a modified version of the Leiden algorithm that limits the maximum number of generated communities and the number of publications in a community. Doing so avoids the creation of large unfocused communities or small insignificant ones. Given the set of extracted communities $\mathcal{C}$, for any two arbitrary authors $p$ and $q$, we define the co-occurrence weight value $\eta_p^{\mathcal{C}}(q)$ as

$$\eta_p^{\mathcal{C}}(q) = \begin{cases} \sum_{c \in \mathcal{C}} \frac{|c_p|}{|\mathcal{V}_p^t|} \log_2(|c_q| + \alpha) & p \neq q \\ 0 & p = q \end{cases} \tag{4}$$

where $c_x$ is the set of publications by author $x$ in community $c$, $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$, and $\alpha = 0.001$. In this formalization, the effect of each community is weighted by the share of the author's publications that fall in that community, i.e., $|c_p| / |\mathcal{V}_p^t|$. Moreover, each author's influence is smoothed by taking the log value of their number of publications, i.e., $\log_2(|c_q| + \alpha)$. The resulting equation highlights the connection between any two authors that have many papers in the same communities and simultaneously waives the need for tracking the communities themselves. Building on top of these co-occurrence weight values, for any arbitrary author $p$, we can form a representative vector $z_p^t$ following Equation 2, replacing $\psi_p^t$ with $\eta_p^{\mathcal{C}}$.

4.2. Trackers

We introduce two tracking approaches for quantifying the dynamicity between two distinct epochs. These two approaches are built on the well-known mathematical concepts of cosine similarity and relative entropy.

4.2.1. Cosine Similarity (S-score)

Given the representative vectors of an arbitrary author $p$ from two time periods, $z_p^t$ and $z_p^{t'}$, we calculate the cosine similarity score $\mathcal{S}_p^{t,t'}$ defined as

$$\mathcal{S}_p^{t,t'} = \frac{z_p^t \cdot z_p^{t'}}{\|z_p^t\| \, \|z_p^{t'}\|}. \tag{5}$$

The calculated cosine similarity scores represent the stability of authors' interests across the two epochs, i.e., the higher the value, the more consistent the authors' interests.

4.2.2. Relative Entropy (E-score)

Building on top of the representative vectors, for each arbitrary author $p$ in period $t$, we define a probability distribution as

$$\mathcal{F}_p^t(q) = \frac{z_p^t[q] + \epsilon}{\sum_{q' \in \mathcal{A}} z_p^t[q'] + \epsilon |\mathcal{A}|} \quad \forall q \in \mathcal{A} \tag{6}$$

where $\epsilon = \frac{1}{|\mathcal{A}|}$ is the prior probability and $\mathcal{A}$ is the set of all authors in the KG. Then, given the probability distributions of an arbitrary author $p$ from two time periods, $\mathcal{F}_p^t$ and $\mathcal{F}_p^{t'}$, we calculate the relative entropy $\mathcal{E}_p^{t,t'}$ as

$$\mathcal{E}_p^{t,t'} = D_{\mathrm{KL}}(\mathcal{F}_p^{t'} \,\|\, \mathcal{F}_p^t) = \sum_{q \in \mathcal{A}} \mathcal{F}_p^{t'}(q) \log\left(\frac{\mathcal{F}_p^{t'}(q)}{\mathcal{F}_p^t(q)}\right). \tag{7}$$

In contrast to the cosine similarity score, the calculated relative entropy scores represent the volatility of authors' interests across the two epochs, i.e., the higher the value, the less consistent the authors' interests are.

5. Analyses

Throughout this section, we run all our analyses on a set of 10,000 randomly sampled authors. More specifically, we do a weighted sampling without replacement using the citation counts. This procedure allows us to manage the computational costs of running these analyses.

5.1. Statistical Dependence Analysis

This analysis studies the connection between the introduced scores and success across two epochs. We use the relative change in average citation count as the proxy metric for success. The main intuitions behind this metric are 1) citation count is an accepted correlated metric for success in the community, 2) using the average mitigates the effect of the high number of publications from an author, and 3) using relative change locally normalizes the metric values. Moreover, to reduce the potential noise in the data, we remove the outliers by filtering out samples that lie more than two standard deviations from the mean relative change in average citation count.

To quantify the strength of this connection, we use the established bivariate correlation and univariate linear regression measurements. We also include a random noise vectorizer as a sanity check of our methodology. Table 2 presents the results of our analysis with one of the introduced scores as the independent variable $\mathcal{X}$ and the success metric, i.e., the relative change in average citation count, as the dependent variable $\mathcal{Y}$. As evident from Table 2, every introduced score, unlike the random baseline, has a significant connection with success, some in the same direction and some in the opposite direction. Moreover, the "Citations" vectorizer showcases the highest absolute correlation with the measurement for success under both trackers, which signifies the effect of author interactions.

Table 2
Univariate linear regression and bivariate correlation metrics between introduced scores and relative change in average citation count. Legend: PCC: Pearson correlation coefficient.

Tracker    Vectorizer     PCC       Coef.      SE         t         P > |t|
S-score    Random         -0.001    -967.70    5156.52    -0.188    0.851
S-score    Co-authors     -0.121    -26.03     2.15       -12.11    0.000
S-score    Citations      -0.138    -27.95     2.02       -13.81    0.000
S-score    Communities    -0.082    -25.72     3.17       -8.12     0.000
E-score    Random         0.015     47.03      31.15      1.51      0.131
E-score    Co-authors     -0.057    -0.64      0.11       -5.65     0.000
E-score    Citations      0.198     3.019      0.15       20.00     0.000
E-score    Communities    0.048     0.66       0.14       4.73      0.000

5.2. Entropy Analysis

In this analysis, we study the connection between diversity and success. We use the authors' entropy across the extracted communities as a proxy for diversity. As for success, with similar intuitions to the previous section, we use the average citation count as the proxy metric. Formally, given the set of extracted communities $\mathcal{C}$, for any arbitrary author $p$, we calculate the entropy across communities $\mathcal{H}_p^{\mathcal{C}}$ as

$$w_p^c = \frac{|c_p|}{|\mathcal{V}_p^t|} \tag{8}$$

$$\mathcal{H}_p^{\mathcal{C}} = -\sum_{c \in \mathcal{C}} w_p^c \log_2(w_p^c) \tag{9}$$

where $c_x$ is the set of publications by author $x$ in community $c$ and $\mathcal{V}_x^t$ is the set of publications by author $x$ in epoch $t$. Figure 1 illustrates the results of our analysis. We can observe in Figure 1 that, in both epochs, the average citation count increases with entropy up to a point and then drops again. This observation indicates the benefit of having a diverse portfolio, while too much diversity could negatively impact success.

Figure 1: The effect of entropy on average citation count.

5.3. Propensity Score Matching Analysis

This analysis focuses on the potential causal relationship between adaptability and success in two epochs by utilizing the propensity score matching (PSM) technique. We use the increase in entropy and citation count in the second epoch as proxy metrics for adaptability and success, respectively. Following this, we designate the increase in entropy as the treatment variable and the citation count in the second epoch as the outcome variable. As for the confounding variables, we use the publication counts from both epochs and the citation count in the first epoch. To check the matching quality, we plot one of the confounding variables, i.e., publication counts in the second epoch, against the outcome variable for both control and treatment groups in Figure 2. Moreover, Table 3 presents the treatment effect evaluation results. From Table 3, we can observe that the average treatment effect (ATE) is larger in magnitude than the average treatment effect on the treated (ATT), -189.157 versus -176.136, while both are negative. This observation indicates that while, in general, the authors have experienced a decline in the number of citations, the increase in entropy slows down this phenomenon. Hence, adaptability, i.e., an increase in entropy, could be seen as a remedy for a decline in success.

Figure 2: Matched groups for the confounding variable, i.e., publication count in the second epoch, for both control and treatment groups against the outcome variable.

Table 3
Treatment effect evaluations. Legend: ATE: Average treatment effect, ATT: Average treatment effect on the treated, ATU: Average treatment effect on the untreated.

Metric    Est.        SE        z         P > |z|
ATE       -189.157    36.274    -5.215    0.000
ATT       -176.136    29.762    -5.918    0.000
ATU       -202.178    43.471    -4.651    0.000

6. Conclusion and Future Works

Motivated by our observation of scientific domains' fluidity and empowered by the emergence of public repositories of scholarly data, we presented a thorough systematic study of the author dynamicity phenomenon in this work. With the idea of representing authors' interests and fields of work by a distribution of other authors, we introduced three different systematic approaches for vectorizing each author in a single epoch. Then, to track an author's behavioral changes between two epochs, we introduced two approaches built on top of the extracted vectors and well-known mathematical approaches for quantifying change. Based on these approaches, we presented in-depth analyses to better understand the connection between success, as measured by citation counts, and specific dynamic behaviors, as measured through the introduced approaches.

Some of the straightforward extensions of our work for future studies are 1) including more authors, 2) using a more extended period, and 3) changing the temporal granularity for tracking changes. Moreover, we used a relatively simple metric as our success proxy; future works could work with other metrics, such as the h-index or i10-index.

Acknowledgments

This work was funded by the Defense Advanced Research Projects Agency with award W911NF-19-20271 and with support from a Keston Exploratory Research Award.

References

[1] L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, Journal of the Association for Information Science and Technology 66 (2015) 2215–2222.
[2] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (2008) P10008.
[3] V. A. Traag, L. Waltman, N. J. Van Eck, From Louvain to Leiden: guaranteeing well-connected communities, Scientific Reports 9 (2019) 1–12.
[4] C. Bird, E. T. Barr, A. Nash, P. T. Devanbu, V. Filkov, Z. Su, Structure and dynamics of research collaboration in computer science, in: SDM, 2009.
[5] M. E. Newman, Scientific collaboration networks. I. Network construction and fundamental results, Phys Rev E Stat Nonlin Soft Matter Phys 64 (2001) 016131.
[6] P. S. Paul, V. Kumar, P. Choudhury, S. Nandi, Temporal analysis of author ranking using citation-collaboration network, in: 2015 7th International Conference on Communication Systems and Networks (COMSNETS), 2015, pp. 1–6. doi:10.1109/COMSNETS.2015.7098737.
[7] J. Priem, H. Piwowar, R. Orr, OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, arXiv preprint arXiv:2205.01833 (2022).
[8] F. Ilievski, D. Garijo, H. Chalupsky, N. T. Divvala, Y. Yao, C. Rogers, R. Li, J. Liu, A. Singh, D. Schwabe, et al., KGTK: a toolkit for large knowledge graph manipulation and analysis, in: International Semantic Web Conference, Springer, 2020, pp. 278–293.
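As a rough illustration of the co-author and citation vectorizers (Sections 4.1.1 and 4.1.2, Equations 1 to 3), the following Python sketch computes both weights on toy data. The function names, the dictionary inputs (`pubs` maps an author to a set of publication IDs within one epoch, `cites` maps a publication to the set of publications it cites), and the decisions to store only nonzero entries and to skip the author's own entry are assumptions made for this example; they are not taken from the authors' implementation.

```python
def coauthor_weight(pubs, p, q):
    """Equation 1: number of publications shared by authors p and q."""
    return len(pubs[p] & pubs[q])


def citation_weight(pubs, cites, p, q):
    """Equation 3: citations from p's papers to q's papers, plus
    citations from q's papers to p's papers."""
    p_cites_q = sum(len(cites.get(v, set()) & pubs[q]) for v in pubs[p])
    q_cites_p = sum(len(pubs[p] & cites.get(u, set())) for u in pubs[q])
    return p_cites_q + q_cites_p


def sparse_vector(weight, authors, p):
    """Equation 2 as a sparse dict {author: weight}.

    Assumption: zero entries are omitted and p's own entry is skipped,
    following the "distribution of other authors" wording.
    """
    vec = {}
    for q in authors:
        if q == p:
            continue
        w = weight(p, q)
        if w:
            vec[q] = w
    return vec


# Hypothetical toy epoch: three authors, five publications.
pubs = {"p": {"w1", "w2", "w3"}, "a": {"w1", "w2", "w4"}, "b": {"w5"}}
cites = {"w3": {"w5"}, "w4": {"w3"}, "w5": {"w1"}}
authors = sorted(pubs)

z_coauthor = sparse_vector(lambda p, q: coauthor_weight(pubs, p, q), authors, "p")
z_citation = sparse_vector(lambda p, q: citation_weight(pubs, cites, p, q), authors, "p")
print(z_coauthor)
print(z_citation)
```

Storing the vectors as dictionaries is one way to cope with the sparsity noted after Equation 2, since only authors with a nonzero weight occupy memory.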
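The two trackers of Section 4.2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: vectors are sparse dictionaries keyed by author, the logarithm in Equation 7 is taken as the natural logarithm (the text does not state the base), and a zero vector is given an S-score of 0.0. The function names and the toy vectors are hypothetical.

```python
import math


def s_score(z_t, z_t_prime):
    """Equation 5: cosine similarity of two sparse vectors."""
    dot = sum(w * z_t_prime.get(q, 0.0) for q, w in z_t.items())
    norm_t = math.sqrt(sum(w * w for w in z_t.values()))
    norm_tp = math.sqrt(sum(w * w for w in z_t_prime.values()))
    if norm_t == 0.0 or norm_tp == 0.0:
        return 0.0  # assumption: an empty vector has no similarity
    return dot / (norm_t * norm_tp)


def smoothed_distribution(z, authors):
    """Equation 6: additive smoothing with epsilon = 1 / |A|."""
    eps = 1.0 / len(authors)
    total = sum(z.get(q, 0.0) for q in authors) + eps * len(authors)
    return {q: (z.get(q, 0.0) + eps) / total for q in authors}


def e_score(z_t, z_t_prime, authors):
    """Equation 7: D_KL(F^{t'} || F^t), natural logarithm assumed."""
    f_t = smoothed_distribution(z_t, authors)
    f_tp = smoothed_distribution(z_t_prime, authors)
    return sum(f_tp[q] * math.log(f_tp[q] / f_t[q]) for q in authors)


# Hypothetical vectors of one author in two epochs.
authors = ["a", "b", "c", "p"]
z_first = {"a": 2, "b": 1}
z_second = {"a": 1, "c": 1}
print(s_score(z_first, z_second))
print(e_score(z_first, z_second, authors))
```

The smoothing step matters for the E-score: without it, any author who appears in only one of the two epochs would produce a zero probability and an undefined or infinite divergence.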
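The community-based quantities, i.e., the co-occurrence weight of Equation 4 and the entropy across communities of Equations 8 and 9, can be illustrated as follows. The sketch assumes that every publication of an author in the epoch is assigned to exactly one community, so that an author is described by a list with one community label per publication. The function names and the labels are hypothetical.

```python
import math
from collections import Counter


def community_entropy(labels):
    """Equations 8-9: entropy (in bits) of an author's publications
    across communities; labels holds one community per publication."""
    counts = Counter(labels)
    total = sum(counts.values())
    # w * log2(1 / w) equals -w * log2(w) and avoids returning -0.0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def cooccurrence_weight(labels_p, labels_q, alpha=0.001):
    """Equation 4 for p != q. Communities where p has no publication
    contribute zero, so only p's communities are visited."""
    counts_p = Counter(labels_p)
    counts_q = Counter(labels_q)
    total_p = sum(counts_p.values())
    return sum(
        (n_p / total_p) * math.log2(counts_q.get(c, 0) + alpha)
        for c, n_p in counts_p.items()
    )


# Hypothetical authors described by the communities of their papers.
focused = ["ml", "ml", "ml", "ml"]
mixed = ["ml", "ml", "db", "db"]
print(community_entropy(focused))
print(community_entropy(mixed))
print(cooccurrence_weight(["ml", "ml", "db"], focused))
```

Note that with $\alpha = 0.001$ a community in which $q$ has no publication contributes a negative term, since $\log_2(0.001)$ is roughly -9.97, so sharing few communities lowers the weight rather than leaving it at zero.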