Approaches to assessing the semantic similarity and future citation of publications by identifying informative terms with predictive properties

A.Kh. Khakimova1, M.M. Charnine2
aida_khatif@mail.ru | mc@keywen.com
1 ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia;
2 FRC CSC of the Russian Academy of Sciences, Moscow, Russia

The article discusses new approaches to assessing the semantic similarity of documents in a vector space, taking into account statistically significant and informative terms. Informative terms reflect the current state of research in a given field. To select informative terms, an algorithm for calculating the impact factor of the term is proposed. It is shown that informative terms make it possible both to evaluate the semantic similarity of texts and to predict future citations. The developed methods for assessing the semantic similarity and future impact of scientific publications can be used in the framework of “Predictive optimization”, a modern technology that allows us to make decisions based on forecasts. In evaluating research activity and individual scientists, bibliometric indicators often play an important role. However, citation-based indicators are problematic for determining the impact of recent publications: two years after publication, most articles have received only a few citations. The probability of future citation can be predicted using the proposed indicator, IFT.

Keywords: semantic similarity, informative terms, impact factor of the term, citations, statistical analysis, citation prediction.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

1. Introduction

Measuring the similarity between documents is an important component of various tasks such as document clustering, topic detection, topic tracking, question answering, information retrieval and text summarization. For scientific articles, there are two main types of similarity measures: citation-based similarity [1] and semantic textual similarity [2]. These two types of similarity measures should correlate, and maximizing this correlation is a convenient way to adjust the coefficients and parameters on which these measures depend.

Citation-based similarity measures such as bibliographic coupling (two documents share a reference in their bibliographies) and co-citation (two documents are cited by a third document) are an integral component of many information retrieval systems. Semantic textual similarity measures analyze situations where two documents share certain words (co-word linkages [3]), phrases or ideas [4].

Latent Semantic Analysis (LSA) [5] and Generalized Latent Semantic Analysis (GLSA) [6] are the most popular corpus-based techniques for semantic textual similarity [2]. GLSA extends the LSA approach by focusing on term vectors instead of the dual document-term representation.

Efficient filtering of non-informative words remains a problem. LSA and GLSA suffer from noise introduced by typos and by infrequent and non-informative words [6]. To solve this problem, we present a new citation-based method for efficiently filtering the core vocabulary and keeping only content-bearing words. This method is called the Impact Factor of Terms (IFT). It is described in Section 2. IFT assesses the significance and informational content of terms in scientific articles based on citation analysis of the articles containing these terms. IFT is also useful for predicting future citations and promising topics in different subject areas such as smart energy systems.

Maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity is a convenient way to adjust the coefficients and parameters of the IFT method.

IFT is similar to the journal impact factor (JIF), which has been used for many years and has proven effective. JIF is a scientometric index that reflects the yearly average number of citations received by articles published in a given journal in the last two years. If all articles of a journal are highly cited, then this journal has a high JIF value and is considered significant and authoritative. Similarly, if all articles with some general term are highly cited, then this term has a high IFT value and is considered significant and informative. The IFT helps to identify informative terms that indicate significant fundamental ideas. Words and terms with a constantly high IFT (for example, neural networks) denote significant ideas in which interest has been stable for many years. For such informative words, the IFT values are stably high, and the correlation between the IFT values of the current and the next year is also high. This correlation, as well as the conditions for the stability and predictability of the IFT, is discussed in Section 4. Section 3 describes the collection of articles used in experiments to study the empirical properties of IFT, including its correlations. The next section gives a formal description of the IFT.

2. Impact Factor of Terms (IFT)

There are currently several journal ranking systems, but the oldest and most influential is the journal impact factor (JIF). JIF is used as an indicator of the importance of a journal for its field.

A journal's impact factor is based on how often articles published in that journal during the previous two years (e.g. 2017 and 2018) were cited by articles published in a particular year (e.g. 2019).

The higher the JIF, the more often articles in that journal are cited by other articles. Thus, the impact factor can give an approximate idea of how prestigious the journal is in its field of science.

The journal with the highest JIF value is the one that publishes the most frequently cited articles over a two-year period. One easy way to increase JIF is to publish more review articles, which are usually cited more often than research reports [7].

Author Impact Factor (AIF) is an extension of the impact factor to authors. The AIF of an author A in year t is the average number of citations given by papers published in year t to papers published by A in a period of Δt years before year t. AIF is able to capture trends and variations in the influence of scientists over time, in contrast to the h-index, which is a measure that takes into account the entire career path [8].

We extend the impact factor idea to terms and propose a new numerical indicator of the authority of words and terms, called the impact factor of the term (IFT).

IFT (formula 1) can be used to filter the vocabulary effectively, excluding uninformative words and terms. With the help of IFT, we can identify promising topics and ideas, find implicit links between articles and texts, and discover ideologically influential sites.

$$\mathrm{IFT} = \frac{A_t}{N_t}, \qquad (1)$$

where $A_t$ is the number of citations in articles with the term A published in year t to articles with the term A in the period of Δt years up to year t; $N_t$ is the total number of articles with term A for the time period Δt + 1.

Therefore, the IFT of term A in year t is the average number of references made in articles with term A published in year t to articles with term A in the period of Δt years up to year t.

It follows from formula (1) that the method will increase the correlation of the similarity measure of texts with their bibliographic relationship, since the IFT depends linearly on the number of bibliographic references over the past two years (or over a period of Δt years).

Various approaches to the calculation of IFT were investigated.

The modified impact factor of the term (IFTm) is the ratio of citations of articles with term A to the total number of articles with this term over 3 years.

$$\mathrm{IFT}_m = \frac{A_{t-2} + A_{t-1} + A_t}{N}, \qquad (2)$$

where $A_{t-2}$ is the number of citations to articles with the term A two years ago, within that same year; $A_{t-1}$ is the number of citations to articles with term A last year, for that year and the previous one; $A_t$ is the number of citations to articles with term A over the three-year period including the current year; N is the total number of articles with term A for the three years.

Both the IFT and IFTm are computed only for articles in which the given term is in the title. Only citations from articles containing the specified term in the title are taken into account.

3. AI collection (Data Set)

In our experiments, we analyze the DBLP citation network, a collection of articles on Artificial Intelligence from 1936 to 2017 compiled by aminer.org and referred to here as the AI collection.

The citation data is extracted from DBLP (Digital Bibliography & Library Project, dblp.org), ACM (Association for Computing Machinery, acm.org), MAG (Microsoft Academic Graph), and other sources.

We used the V10 version released in October 2017. This data set consists of 3,079,007 articles and 25,166,994 citation relationships. For each article there is a title, authors, year of publication and references. We have processed all titles and citation relationships.

In this paper, the AI collection was analyzed in the different directions described in the next section.

4. Results of a statistical analysis of term trends

The main goal of the statistical analysis of the AI collection is to study the empirical properties of the Impact Factor of Terms (IFT), including the correlation of its current and future values, in order to assess its stability and forecast future citations.

Statistical analysis of the collection was carried out using the authors' Trend+ program, which built a frequency dictionary of all words and terms in the collection. Also, for each term with a frequency of more than 5, Trend+ calculated its trend indicators (trending situations), including the number of articles with this term for the year, the number of citations from other articles with this term, and the IFT and IFTm indicators for the current and next year.

To calculate the correlation, situations/points were selected for different words in different years in which the values of IFT and IFTm of the current year were greater than zero. There could be several such situations for one word in different years. The selected situations were divided into groups differing in the number of articles with the word over the past 3 years. In terms of the number of situations, the IFTm groups turned out to be larger than the IFT groups, because IFTm takes more citations into account. Fig. 1 shows graphs of the number of situations/points in these groups for calculating correlations.

Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper) and IFT, depending on the number of articles with the word in the last 3 years

In Fig. 1, the upper graph corresponds to IFTm and the lower one to IFT. The y-axis represents the number of points for calculating the correlations of the current and future years. The x-axis represents the frequency of terms, i.e. the number of articles with the term over the last 3 years.

On the IFT graph, the maximum number of points, 54326, is reached at X = 5, and the minimum, 2423, at X = 50. On the IFTm graph, the maximum number of points, 91997, is reached at X = 5, and the minimum, 2913, at X = 50.
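As an illustration of how the points counted in Fig. 1 could be assembled, the following minimal sketch computes IFT by formula (1) on a tiny invented collection and then counts the situations with IFT > 0 per group. It is not the Trend+ program. The toy articles, the citation pairs, the function names and the substring test for "term in title" are all assumptions made for the example, as is the reading that both the cited articles and N_t cover the window from year t - Δt to year t inclusive.

```python
from collections import Counter

# Hypothetical toy collection: article id -> (publication year, title).
ARTICLES = {
    1: (2014, "Neural networks for vision"),
    2: (2015, "Deep neural networks"),
    3: (2016, "Neural networks and speech"),
    4: (2016, "Training neural networks faster"),
    5: (2016, "Fuzzy control systems"),
    6: (2015, "Fuzzy logic"),
}
# Citation relationships as (citing id, cited id) pairs (also invented).
CITATIONS = [(2, 1), (3, 1), (3, 2), (4, 2), (4, 5), (5, 6)]


def has_term(title, term):
    """Assumption: a term 'is in the title' if it occurs as a substring."""
    return term in title.lower()


def term_window(term, year, delta_t=2):
    """Ids of articles with the term in the title, published in [year - delta_t, year]."""
    return {
        aid
        for aid, (pub_year, title) in ARTICLES.items()
        if year - delta_t <= pub_year <= year and has_term(title, term)
    }


def ift(term, year, delta_t=2):
    """IFT by formula (1): A_t / N_t, or 0.0 when no article has the term."""
    window = term_window(term, year, delta_t)
    if not window:
        return 0.0
    # Citing side: articles with the term published in the current year.
    citing = {aid for aid in window if ARTICLES[aid][0] == year}
    # A_t: citations from those articles to articles with the term in the window.
    a_t = sum(1 for src, dst in CITATIONS if src in citing and dst in window)
    return a_t / len(window)


def situation_counts(terms, years, delta_t=2):
    """Count (term, year) situations with IFT > 0, grouped by articles in the window."""
    counts = Counter()
    for term in terms:
        for year in years:
            if ift(term, year, delta_t) > 0:
                counts[len(term_window(term, year, delta_t))] += 1
    return counts


if __name__ == "__main__":
    print(ift("neural networks", 2016))
    print(dict(situation_counts(["neural networks", "fuzzy"], range(2014, 2017))))
```

With Δt = 2 the window spans three years, so the group key equals the number of articles with the term over the last 3 years, which is the x-axis of Fig. 1. An IFTm variant would change only the numerator, following formula (2).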
The maximum number of points on both graphs is reached at 5 articles, because the experiment did not analyze terms that occurred fewer than 5 times in the whole collection.

For each group of trending situations/points (i.e., for each X) individually, a correlation was calculated between the current and future values of IFT and IFTm. The results are shown in Fig. 2.

Fig. 2. Graph of IFT correlations (upper) and IFTm correlations of the current and future years depending on the number of articles with the word in the last 3 years

The upper graph shows the IFT correlations, and the lower graph the IFTm correlations. Both graphs behave very similarly, but the correlations of the IFT (upper graph) are almost always greater than the correlations of the IFTm. The correlation on the graphs reaches 0.5 at a frequency of 17 articles over the past three years, 0.6 at 26 articles, and 0.7 at 45 articles. Thus, IFT behaves more stably and predictably than IFTm, but IFTm covers more situations and words/terms.

The graphs show that the higher the current frequency of the term (the number of articles with the term), the higher the correlation and, therefore, the more stably the IFT behaves over time. A stable IFT makes it possible to predict the average number of future citations accurately, since the IFT is exactly equal to the average number of citations of articles with the specified word/term. Thus, words/terms with high frequencies and high IFT values define promising topics in different subject areas such as artificial intelligence or smart energy systems.

The most stable and predictable words/terms with high IFT values are called informative terms. Informative words/terms have high frequencies and IFT values above a certain threshold. The form of the function for filtering non-informative words, which grows with increasing IFT and frequency, can be selected by maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity. As a first approximation, this filtering function can be taken as the product of IFT and frequency, with a certain minimum threshold for IFT.

Here are examples of the most informative words/terms in the collection of AI articles, those with the largest total values of IFT multiplied by the current frequency: web (year 1982), fuzzy (1969), sensor networks (1992), neural (1962), video (1976), social (1971), cognitive (1973), semantic (1967), clustering (1970), neural networks (1986).

These examples point to the most actively and stably developing areas of AI, and also confirm the usefulness of the proposed filtering function and its ability to evaluate the significance and information content of words/terms.

5. Predicting the citations with IFT

Prediction of the citation of scientific works has been studied by many researchers. The described approaches are mainly based on the analysis of a number of features, including information about the authors (number of authors, country, author rating, etc.), features of the journal (total number of citations to the journal, impact factor of the journal), article parameters (topic, volume, number of references, etc.), type of research (for example, original research compared to a literature review), as well as other characteristics (reputation of institutions, etc.). In addition, altmetrics are also used to predict the citation of a scientific paper.

Citation prediction methods have been proposed, for example, by Walters (2006) [9], Haslam et al. (2008) [10], Fu and Aliferis (2010) [11], Wang, Yu and Yu (2011) [12], Wang et al. (2012) [13], Didegah and Thelwall (2013) [14], Yu, Yu, Li and Wang (2014) [15], Onodera and Yoshikane (2015) [16], Cao et al. (2016) [17], Golosovsky and Solomon (2017) [18], Fiala and Tutoky (2018) [19], Bai et al. (2019) [22]. For example, Wang et al. (2013) [20] propose mathematical models that describe how publications accumulate citations over time. Using these models, the authors predict the longer-term citation impact of a publication based on its short-term history. Bornmann et al. (2013) [21] present an empirical analysis of the correlation between short-term and long-term citation indicators.

IFT evaluates the significance and informativeness of terms in scientific articles based on an analysis of the citation of articles with these terms. IFT can also be used to predict future citations of new articles, which matters given the practical importance of incorporating the latest publications in evaluations of scientific performance.

Our model predicts the citation of a publication from the following predictors: the impact factor of significant terms (for example, authors' keywords) and the time of appearance of subsequent articles associated with the original article by implicit links.

The two predictors used are readily available and, unlike most prediction approaches, they make it possible to produce predictions soon after publication.

Citation forecasts have a high degree of uncertainty. We therefore believe that it is more important to know the likelihood that the publication will receive a certain number of citations in the future. Accordingly, we do not predict the average number of citations that the publication should attract in the future; instead, we predict the probability distribution for the future number of citations, based on the developed probabilistic model of the dependence of the number of direct citations on terms with high IFT.

It is important to emphasize that the purpose of our work differs from that of the studies mentioned above. As in those studies, we are interested in predicting future citation. However, many indicators that have been found to correlate with citation impact are easy to manipulate.

For example, suppose researchers know that future citations of a publication will be predicted from the number of pages or the number of references. In this case, authors can artificially increase the number of pages or the number of bibliographic references. Therefore, we consider variables that cannot be changed by the authors of the publication.

Based on IFT values, we can choose informative terms that indicate important fundamental ideas. Words and terms with a consistently high IFT indicate important ideas that have been stable for many years.

In our experiments, we analyze the DBLP citation network, a collection of articles on artificial intelligence from 1936 to 2017 that includes 3,079,007 articles and 25,166,994 citation relationships. Statistical analysis of the collection was carried out using the Trend+ program, which built a frequency dictionary and trend indicators, including the number of articles with a given term per year, the number of citations to other articles with this term, and the IFT and IFTm indicators for the current and next year.

We propose the term “Trend of the initial frequency” (TIF): the number of years from the first article with a certain term to the nth article with this term. A relationship was found between TIF, IFT, and citation trends. It is shown that the higher the trend of the initial frequency, the higher the trend of fresh citation links, that is, the higher the likelihood that citations to the article will appear quickly.

Of particular interest are trend terms with a large number of new articles (more than 10 articles in the previous 2 years). For trend terms, the correlation of current and future IFTm is more than 60%, which allows us to make a fairly confident forecast of IFTm (i.e. a citation forecast) for the next year.
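The two quantities described in the last paragraphs, TIF and the correlation of current and next-year IFTm over trend terms, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the record layout, the field names, the function names and all numbers in RECORDS are invented for the example, and the Pearson coefficient is assumed as the correlation measure because the text does not name one.

```python
import math


def tif(publication_years, n):
    """Years from the first article with a term to its n-th article (TIF)."""
    years = sorted(publication_years)
    if n < 1 or n > len(years):
        raise ValueError("the term has fewer than n articles")
    return years[n - 1] - years[0]


def trend_situations(records, min_articles=10):
    """Keep situations with more than `min_articles` articles in the previous 2 years."""
    return [r for r in records if r["articles_prev_2y"] > min_articles]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical situations: one record per (term, year); values are made up.
RECORDS = [
    {"term": "term_a", "year": 2010, "articles_prev_2y": 25, "iftm_now": 0.40, "iftm_next": 0.45},
    {"term": "term_a", "year": 2011, "articles_prev_2y": 30, "iftm_now": 0.45, "iftm_next": 0.50},
    {"term": "term_b", "year": 2010, "articles_prev_2y": 12, "iftm_now": 0.20, "iftm_next": 0.15},
    {"term": "term_c", "year": 2010, "articles_prev_2y": 4, "iftm_now": 0.90, "iftm_next": 0.10},
]

if __name__ == "__main__":
    trend = trend_situations(RECORDS)
    r = pearson([s["iftm_now"] for s in trend], [s["iftm_next"] for s in trend])
    print(f"trend situations: {len(trend)}, correlation: {r:.3f}")
    print("TIF for n = 3:", tif([2006, 2001, 2003, 2001], 3))
```

On real data the correlation would be computed separately for each frequency group, as in Section 4. A correlation above the reported 60% level is what would justify using the current IFTm as a forecast for the next year.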
One of the goals of our study is to develop a model to predict the impact that recent publications will have in the long run. We summarize how our study differs from existing works:
- we are interested in predicting the long-term citation impact based solely on the impact factors of significant terms (as mentioned above, we do not want to use variables that can be easily manipulated);
- we are interested in predicting the long-term citation impact within one or two years after publication;
- unlike most earlier papers, our interest is in predicting the probability distribution for the future number of citations of a publication. We do not aim to give an accurate estimate of the future number of citations of the publication.

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] Gipp, B. (2014). Citation-based Document Similarity. Citation-based Plagiarism Detection. Springer Fachmedien Wiesbaden, pp. 43-55.
[2] Gomaa, W.H. and Fahmy, A.A. (2013). A survey of text similarity approaches. Int. J. Comput. Appl., vol. 68, no. 13, doi: https://doi.org/10.5120/11638-7118.
[3] Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), pp. 209-223. DOI http://dx.doi.org/10.1016/0048-7333(89)90016-4. URL http://www.sciencedirect.com/science/article/pii/0048733389900164
[4] Charnine, M., Klimenko, S. (2015). Measuring of “Idea-based” Influence of Scientific Papers // Proceedings of the 2015 International Conference on Information Science and Security (ICISS 2015), December 14-16, Seoul, South Korea, pp. 160-164.
[5] Landauer, T.K. & Dumais, S.T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104.
[6] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation. In Proc. of RANLP.
[7] Zaidi, I., Singh, S., Sinha, A., Dwivedi, R. (2015). Current views and implications of journal impact factor: A key note. Indian J Dent. 6(2):113-114. doi:10.4103/0975-962X.154375
[8] Pan, R., Fortunato, S. (2015). Author Impact Factor: tracking the dynamics of individual scientific impact. Sci Rep 4, 4880. https://doi.org/10.1038/srep04880.
[9] Walters, G. (2006). Predicting subsequent citations to articles published in twelve crime-psychology journals: Author impact versus journal impact. Scientometrics, 69(3), pp. 499-510.
[10] Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J., et al. (2008). What makes an article influential? Predicting impact in social and personality psychology. Scientometrics, 76(1), pp. 169-185.
[11] Fu, L., & Aliferis, C. (2010). Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 85(1), pp. 257-270.
[12] Wang, M., Yu, G., & Yu, D. (2011). Mining typical features for highly cited papers. Scientometrics, 87(3), pp. 695-706.
[13] Wang, M., Yu, G., Xu, J., He, H., Yu, D., & An, S. (2012). Development a case-based classifier for predicting highly cited papers. Journal of Informetrics, 6(4), pp. 586-599.
[14] Didegah, F., & Thelwall, M. (2013). Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology, 64(5), pp. 1055-1064.
[15] Yu, T., Yu, G., Li, P.-Y. & Wang, L. (2014). Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics, 101(2), pp. 1233-1252.
[16] Onodera, N. & Yoshikane, F. (2015). Factors affecting citation rates of research articles. Journal of the Association for Information Science and Technology, 66(4), pp. 739-764.
[17] Cao, X., Chen, Y., Liu, K.J.R. (2016). A data analytic approach to quantifying scientific impact. Journal of Informetrics, 10(2), pp. 471-484.
[18] Golosovsky, M., Solomon, S. (2017). Growing complex network of citations of scientific papers: Modeling and measurements. Physical Review E, 95(1), p. 012324.
[19] Fiala, D., Tutoky, G. (2018). PageRank-based prediction of award-winning researchers and the impact of citations. Journal of Informetrics, 11(4), pp. 1044-1068.
[20] Wang, D., Song, C., Barabási, A.-L. (2013). Quantifying long-term scientific impact. Science, 342(6154), pp. 127-132.
[21] Bornmann, L., Leydesdorff, L., & Wang, J. (2013). Which percentile-based approach should be preferred for calculating normalized citation impact values? An empirical comparison of five approaches including a newly developed citation-rank approach (P100). Journal of Informetrics, 7(4), pp. 933-944.
[22] Bai, X., Zhang, F., Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), pp. 407-418.

About the authors

Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru

Charnine Mikhail M., PhD, Senior Researcher, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: mc@keywen.com