Approaches to assessing the semantic similarity and future citation of publications by identifying informative terms with predictive properties

A. Kh. Khakimova
ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia

M. M. Charnine
FRC CSC Russian Academy of Sciences, Moscow, Russia

Keywords: semantic similarity, informative terms, impact factor of the term, citations, statistical analysis, citation prediction

The article discusses new approaches to assessing the semantic similarity of documents in a vector space, taking into account statistically significant and informative terms. Informative terms reflect the current state of research in a given field. To select informative terms, an algorithm for calculating the impact factor of the term (IFT) is proposed. It is shown that informative terms make it possible both to evaluate the semantic similarity of texts and to predict future citations. The developed methods for assessing the semantic similarity and future impact of scientific publications can be used within the framework of "Predictive optimization", a modern technology for making decisions based on forecasts. Bibliometric indicators often play an important role in evaluating research activity and individual scientists. However, citation-based indicators are problematic for determining the impact of recent publications: two years after publication, most articles have received only a few citations. The probability of future citation can be predicted using the proposed indicator, IFT.

Introduction

Measuring the similarity between documents is an important component in various tasks such as document clustering, topic detection, topic tracking, question answering, information retrieval and text summarization.

For scientific articles, there are two main types of similarity measures: citation-based similarity [1] and semantic textual similarity [2]. These two types of similarity measures should correlate, and maximizing this correlation is a convenient way to adjust the coefficients and parameters on which these measures depend.

Citation-based similarity measures such as bibliographic coupling (if two documents share a reference in their bibliography) and co-citation (if two documents are cited by a third document) are an integral component of many information retrieval systems. Semantic textual similarity measures analyze situations where two documents share certain words (co-word linkages [3]), phrases or ideas [4].
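The two citation-based relations described above can be illustrated with a minimal Python sketch. It assumes a hypothetical mapping `refs` from a document id to the ids it cites, and it counts shared references and shared citing documents, which is one common way to turn the two relations into similarity strengths; the function names and the toy data are illustrative only.

```python
def bibliographic_coupling(refs, a, b):
    """Number of references shared by documents a and b."""
    return len(set(refs.get(a, ())) & set(refs.get(b, ())))


def co_citation(refs, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)


# Hypothetical toy citation data: document id -> ids of the documents it cites.
refs = {
    "d1": ["r1", "r2"],
    "d2": ["r2", "r3"],
    "d3": ["d1", "d2"],
    "d4": ["d1", "d2", "r1"],
}

coupling_d1_d2 = bibliographic_coupling(refs, "d1", "d2")  # shared reference: r2
cocited_d1_d2 = co_citation(refs, "d1", "d2")              # cited together by d3 and d4
```

Here d1 and d2 are coupled through one shared reference and are co-cited by two later documents.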

Latent Semantic Analysis (LSA) [5] and Generalized Latent Semantic Analysis (GLSA) [6] are the most popular techniques of Corpus-Based semantic textual similarity [2]. GLSA extends the LSA approach by focusing on term vectors instead of the dual document-term representation.

Efficient filtering of non-informative words remains a problem. LSA and GLSA suffer from noise introduced by typos and by infrequent and non-informative words [6]. To solve this problem, we present a new citation-based method for efficiently filtering the core vocabulary and keeping only content-bearing words. This method is called the Impact Factor of Terms (IFT). It is described in Section 2. IFT assesses the significance and informational content of terms in scientific articles based on citation analysis of the articles containing these terms. IFT is also useful for predicting future citations and promising topics in different subject areas such as smart energy systems.

Maximizing correlation between citation-based similarity and IFT-based semantic textual similarity is a convenient way to adjust the coefficients and parameters of the IFT method.

IFT is similar to the journal impact factor (JIF), which has been used for many years and has proven effective. JIF is a scientometric index that reflects the yearly average number of citations received by articles published in a given journal in the last two years. If all articles of a journal are highly cited, then this journal has a high JIF value and is considered significant and authoritative. Similarly, if all articles containing a given term are highly cited, then this term has a high IFT value and is considered significant and informative. The IFT helps to identify informative terms that indicate significant fundamental ideas. Words and terms with a consistently high IFT (for example, neural networks) denote significant ideas in which interest has been stable for many years. Such words also show a high correlation between the IFT values of the current and the next year. This correlation, as well as the conditions for the stability and predictability of the IFT, is discussed in Section 4. Section 3 describes the collection of articles used in experiments to study the empirical properties of IFT, including its correlations. The next section gives a formal description of the IFT.

Impact Factor of Terms (IFT)

There are currently several journal ranking systems, but the oldest and most influential is the journal impact factor (JIF). JIF is used as an indicator of the importance of a journal for its field.

A journal's impact factor is based on how often articles published in that journal during the previous two years (e.g. 2017 and 2018) were cited by articles published in a particular year (e.g. 2019).

The higher the JIF, the more often articles in that journal are cited by other articles. Thus, the impact factor can give an approximate idea of how prestigious the journal is in its field of science.

The journal with the highest JIF value is the one that publishes the most frequently cited articles over a two-year period. One easy way to increase JIF is to publish more review articles, which are usually cited more often than research reports [7].

Author Impact Factor (AIF) is an extension of the impact factor for authors. The AIF of an author A in year t is the average number of citations given by papers published in year t to papers published by A in a period of Δt years before year t. AIF is able to capture trends and variations in the influence of scientists over time, in contrast to the h-index, which is a measure that takes into account the entire career path [8].

We propose extending the impact factor idea to terms: a new numerical indicator of the authority of words and terms, called the impact factor of the term (IFT).

IFT (formula 1) can be used to effectively filter the dictionary, excluding uninformative words and terms. With the help of IFT, we can identify promising topics and ideas, find implicit links between articles and texts, and discover ideologically influential sites.

$$\mathrm{IFT} = \frac{A_t}{N_t}, \qquad (1)$$

where $A_t$ is the number of citations made by articles with the term A published in year t to articles with the term A published in the period of Δt years up to year t, and $N_t$ is the total number of articles with term A over the period of Δt + 1 years.

Therefore, the IFT of term A in year t is the average number of references cited in articles with term A published in year t to articles with term A in the period ∆t years to year t.
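As an illustration of formula (1), the following minimal Python sketch computes IFT on a toy collection. The `Article` record, the substring match on titles and the toy data are hypothetical. The sketch also assumes that the cited articles are those published in the Δt years strictly before year t (by analogy with JIF), and that N_t counts articles from the Δt + 1 years ending at t; the text leaves some room for other readings.

```python
from typing import NamedTuple


class Article(NamedTuple):
    pid: str
    year: int
    title: str
    references: tuple  # ids of the cited articles


def has_term(article, term):
    """True if the term occurs in the title (case-insensitive substring match)."""
    return term.lower() in article.title.lower()


def ift(articles, term, year, delta_t=2):
    """IFT of `term` in `year` as A_t / N_t, following formula (1)."""
    with_term = {a.pid: a for a in articles if has_term(a, term)}
    # A_t: citations from term articles of year t to term articles published
    # in the delta_t preceding years (assumed window: t - delta_t .. t - 1).
    a_t = sum(
        1
        for a in with_term.values()
        if a.year == year
        for ref in a.references
        if ref in with_term and year - delta_t <= with_term[ref].year < year
    )
    # N_t: term articles published in the delta_t + 1 years ending at t.
    n_t = sum(1 for a in with_term.values() if year - delta_t <= a.year <= year)
    return a_t / n_t if n_t else 0.0


# Hypothetical toy collection.
toy = [
    Article("p1", 2017, "Fuzzy control of pumps", ()),
    Article("p2", 2018, "Fuzzy clustering", ("p1",)),
    Article("p3", 2019, "A fuzzy classifier", ("p1", "p2")),
    Article("p4", 2019, "Fuzzy logic survey", ("p2", "x9")),
    Article("p5", 2019, "Graph search", ("p1",)),
]

ift_fuzzy_2019 = ift(toy, "fuzzy", 2019)  # A_t = 3, N_t = 4
```

In this toy case the 2019 articles with "fuzzy" in the title make three citations to earlier "fuzzy" articles, and four such articles exist in 2017 to 2019, so the sketch yields 3/4. The citation from "Graph search" is ignored because the citing title lacks the term.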

It follows from the IFT formula (1) that the method will certainly increase the correlation of the similarity measure of texts with their bibliographic relationship, since the IFT linearly depends on the number of bibliographic references over the past two years (or over a period of ∆t years).

Various approaches to the calculation of IFT were investigated.

The modified impact factor of the term (IFTm) is the ratio of citations of articles with term A to the total number of articles with this term over 3 years.

$$\mathrm{IFT}_m = \frac{A_{t-2} + A_{t-1} + A_t}{N}, \qquad (2)$$

where $A_{t-2}$ is the number of citations to articles with the term A published two years ago, made in that same year; $A_{t-1}$ is the number of citations to articles with term A made last year, covering that year and the previous one; $A_t$ is the number of citations to articles with term A over the three-year period, including the current year; and $N$ is the total number of articles with term A over the three years.

Both the IFT and IFTm are considered only for articles in which the given term is in the title. Only citations from articles containing the specified term in the title are taken into account.
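The modified indicator can be sketched in Python under one possible reading of formula (2): the three terms in the numerator together are taken to cover every citation whose citing and cited articles both carry the term in the title and both fall within the three-year window ending in year t. This reading, the tuple layout of the records and the toy data are assumptions made for illustration.

```python
def ift_m(articles, term, year):
    """IFTm of `term` in `year` under one reading of formula (2).

    articles: iterable of (article_id, pub_year, title, reference_ids).
    """
    window = range(year - 2, year + 1)  # years t-2, t-1, t
    needle = term.lower()
    # Articles with the term in the title that were published in the window.
    in_window = {
        aid
        for aid, pub_year, title, _refs in articles
        if needle in title.lower() and pub_year in window
    }
    # A_{t-2} + A_{t-1} + A_t: citations with both ends inside `in_window`.
    citations = sum(
        1
        for aid, _year, _title, refs in articles
        if aid in in_window
        for ref in refs
        if ref in in_window
    )
    n = len(in_window)  # N: term articles over the three years
    return citations / n if n else 0.0


# Hypothetical toy collection.
toy = [
    ("p1", 2017, "Fuzzy control of pumps", ()),
    ("p2", 2017, "Fuzzy sets revisited", ("p1",)),
    ("p3", 2018, "Fuzzy clustering", ("p1",)),
    ("p4", 2019, "A fuzzy classifier", ("p2", "p3")),
    ("p5", 2019, "Graph search", ("p1",)),
]

ift_m_fuzzy_2019 = ift_m(toy, "fuzzy", 2019)  # 4 citations / 4 articles
```

Unlike a count restricted to citations made in year t, this version also counts the citations made in 2017 and 2018, so it draws on more citations for the same term.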

AI collection (Data Set)

In our experiments, we analyze the DBLP citation network, a collection of articles on Artificial Intelligence from 1936 to 2017 compiled by aminer.org and referred to here as the AI collection.

The citation data is extracted from DBLP (Digital Bibliography & Library Project dblp.org), ACM (Association for Computing Machinery acm.org), MAG (Microsoft Academic Graph), and other sources.

We used the V10 version released in October 2017. This data set consists of 3,079,007 articles and 25,166,994 citation relationships. For each article, the title, authors, year of publication and references are available. We have processed all titles and citation relationships.

In this paper, the AI collection was analyzed in different directions described in the next Section.

Results of a statistical analysis of term trends

The main goal of the statistical analysis of the AI collection is to study the empirical properties of Impact Factor of Terms (IFT), including the correlation of its current and future values to assess its stability and forecast future citations.

Statistical analysis of the collection was carried out using the authors' Trend+ program, which built a frequency dictionary of all words and terms in the collection. For each term with a frequency of more than 5, Trend+ also calculated its trend indicators (trending situations), including the number of articles with this term per year, the number of citations from other articles with this term, and the IFT and IFTm indicators for the current and next year.

To calculate the correlation, situations (points) were selected for different words in different years in which the current-year values of IFT and IFTm were greater than zero. One word could yield several such situations in different years. The selected situations were divided into groups by the number of articles with the word over the past 3 years. The IFTm groups turned out to contain more situations than the IFT groups, because IFTm takes more citations into account. Fig. 1 shows the number of situations (points) in these groups that were used for calculating correlations. In Fig. 1, the upper graph corresponds to IFTm and the lower one to IFT. The y-axis represents the number of points used for calculating the correlations between the current and future years. The x-axis represents the frequency of terms, i.e. the number of articles with the term over the last 3 years. Both graphs reach their maximum when the number of articles is 5, because the experiment did not analyze terms that occurred fewer than 5 times in the whole collection.

On the IFT graph, the maximum number of points 54326 is reached at X = 5, and the minimum 2423 at X = 50. On the IFTm graph, the maximum number of points 91997 is reached at X = 5, and the minimum 2913 at X = 50.

For each group of trending situations/points (i.e., for each X) individually, a correlation was calculated between the current and future values of IFT and IFTm. The results of calculating the correlations are shown in Fig. 2. The upper graph is the IFT correlations, and the lower graph is the IFTm correlations.
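The per-group calculation can be sketched as follows. The sketch assumes that each situation is a hypothetical triple (number of articles over the last 3 years, current value, next-year value) and that the correlation is the Pearson coefficient; the text does not name the coefficient used, and the toy values are invented for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_by_frequency(situations):
    """Group situations by frequency X and correlate current vs. next-year values."""
    groups = defaultdict(list)
    for freq, current, future in situations:
        if current > 0:  # only situations with a positive current value are kept
            groups[freq].append((current, future))
    result = {}
    for freq, points in sorted(groups.items()):
        if len(points) >= 2:  # a correlation needs at least two points
            result[freq] = pearson([p[0] for p in points], [p[1] for p in points])
    return result


# Hypothetical situations: (frequency, current IFT, next-year IFT).
situations = [
    (5, 1.0, 2.0), (5, 2.0, 4.0), (5, 3.0, 6.0), (5, 0.0, 1.0),
    (6, 1.0, 3.0), (6, 2.0, 2.0), (6, 3.0, 1.0),
    (7, 1.5, 1.0),
]

correlations = correlation_by_frequency(situations)
```

Running the same function on IFT and on IFTm situations would give the two curves compared in Fig. 2, one value per frequency group.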

Both graphs behave very similarly, but the IFT correlations (upper graph) are almost always greater than the IFTm correlations. The correlation reaches 0.5 at a frequency of 17 articles over the past three years, 0.6 at 26 articles, and 0.7 at 45 articles. Thus, IFT behaves more stably and predictably than IFTm, but IFTm covers more situations and words/terms. The graphs show that the higher the current frequency of the term (the number of articles with the term), the higher the correlation and, therefore, the more stable the IFT is over time. A stable IFT makes it possible to accurately predict the average number of future citations, since the IFT is exactly equal to the average number of citations of articles with the specified word/term. Thus, words/terms with high frequencies and high IFT values define promising topics in different subject areas such as artificial intelligence or smart energy systems.

The most stable and predictable words/terms with high IFT values are called informative terms. Informative words/terms have high frequencies and IFT values above a certain threshold. The form of the function for filtering non-informative words, which grows with increasing IFT and frequency, can be selected by maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity. As a first approximation, this filtering function can be taken as the product of IFT and frequency, with a certain minimum threshold for IFT.
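The first-approximation filtering function can be sketched in Python as a score equal to IFT multiplied by frequency, applied only to terms whose IFT reaches a minimum threshold. The threshold value, the function name and the per-term statistics below are hypothetical and chosen only to illustrate the idea.

```python
def informative_terms(stats, min_ift=0.5, top_k=None):
    """Rank terms by IFT * frequency, dropping terms with IFT below `min_ift`.

    stats: dict mapping a term to a pair (ift, frequency).
    """
    scored = {
        term: ift_value * frequency
        for term, (ift_value, frequency) in stats.items()
        if ift_value >= min_ift
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical per-term statistics: term -> (IFT, number of articles).
stats = {
    "neural networks": (1.2, 40),
    "the": (0.1, 900),        # frequent but rarely linked by citations
    "fuzzy": (0.8, 30),
    "typo-word": (2.0, 1),    # passes the threshold but scores low
}

ranking = informative_terms(stats)
ranked_terms = [term for term, _score in ranking]
```

The threshold removes very frequent words with a low IFT, while the multiplication by frequency pushes rare words to the bottom of the ranking. The choice of threshold is left open, as in the text, and could be tuned by the correlation criterion described above.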

Here are examples of the most informative words/terms in the collection of AI articles, i.e. those with the largest total values of IFT multiplied by the current frequency: web (year 1982), fuzzy (1969), sensor networks (1992), neural (1962), video (1976), social (1971), cognitive (1973), semantic (1967), clustering (1970), neural networks (1986).

These examples point to the most actively and stably developing areas of AI, and also confirm the usefulness of the proposed filtering function and its ability to evaluate the significance and information content of words/terms.

Predicting the citations with IFT

The prediction of citations of scientific works has been studied by many researchers. The described approaches are mainly based on the analysis of a number of features, including information about the authors (number of authors, country, author ratings, etc.), features of the journal (total number of citations to the journal, impact factor of the journal), article parameters (topic, length, number of references, etc.), type of research (for example, original research compared to a literature review), as well as other characteristics (reputation of institutions, etc.). In addition, altmetrics are also used to predict the citations of a scientific paper.

Citation prediction methods have been proposed, for example, by Walters (2006) [9], Haslam et al. (2008) [10], Fu and Aliferis (2010) [11], Wang, Yu and Yu (2011) [12], Wang et al. (2012) [13], Didegah and Thelwall (2013) [14], Yu, Yu, Li and Wang (2014) [15], Onodera and Yoshikane (2015) [16], Cao et al. (2016) [17], Golosovsky and Solomon (2017) [18], Fiala and Tutoky (2018) [19], Bai et al. (2019) [20]. For example, Wang et al. (2013) [21] propose mathematical models that describe how publications accumulate citations over time. Using these models, the authors predict the long-term citation impact of a publication from its short-term citation history. Bornmann et al. (2013) [22] present an empirical analysis of the correlation between short-term and long-term citation indicators.

IFT evaluates the significance and informativeness of terms in scientific articles based on an analysis of the citation of articles with these terms. IFT can also be used to predict future citations of new articles.

Given the practical importance of incorporating the latest publications in evaluations of scientific performance, one of the goals of our study is to develop a model to predict the impact that recent publications will have in the long run.

Our model predicts the citations of a publication from the following predictors: the impact factor of significant terms (for example, authors' keywords) and the time of appearance of subsequent articles connected to the original article by implicit links.

Both predictors are readily available and, unlike most prediction approaches, they allow predictions to be made soon after publication.

Citation forecasts have a high degree of uncertainty. We therefore believe that it is more important to know the likelihood that a publication will receive a certain number of citations in the future. Accordingly, we do not predict the average number of citations that the publication should attract; instead, we predict the probability distribution of the future number of citations, based on the developed probabilistic model of the dependence of the number of direct citations on terms with high IFT.

It is important to emphasize that the purpose of our work is different from the studies mentioned above. As in the above studies, we are interested in predicting the future citation. However, many indicators that have been found to correlate with the influence of citation are easy to manipulate.

For example, suppose researchers know that the future citations of a publication will be predicted from the number of pages or the number of references. In this case, authors can artificially increase the number of pages or the number of bibliographic references. Therefore, we consider variables that cannot be changed by the authors of the publication.

Based on IFT values, we can choose informative terms that indicate important fundamental ideas. Words and terms with a consistently high IFT indicate important ideas that have been stable for many years.

In our experiments, we analyze the DBLP citation network, a collection of articles on artificial intelligence from 1936 to 2017 that includes 3,079,007 articles and 25,166,994 citation relationships. Statistical analysis of the collection was carried out using the Trend+ program, which built a frequency dictionary and trend indicators, including the number of articles with a given term per year, the number of citations to other articles with this term, and the IFT and IFTm indicators for the current and next year.

The term "Trend of the initial frequency" (TIF) is proposed: it is the number of years from the first article with a certain term to the nth article with this term. A relationship was found between TIF, IFT, and citation trends. It is shown that the higher the trend of the initial frequency, the higher the trend of fresh citations, that is, the higher the likelihood that citations of the article will appear quickly.
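The TIF definition can be written as a short Python sketch. It assumes that the input is a hypothetical list of publication years of all articles containing the term, and that n is a free parameter; the text does not fix its value.

```python
def tif(publication_years, n):
    """Years from the first article with a term to its n-th article.

    Returns None when the term has fewer than n articles.
    """
    ordered = sorted(publication_years)
    if n < 1 or len(ordered) < n:
        return None
    return ordered[n - 1] - ordered[0]


# Hypothetical publication years of articles containing one term.
years = [1995, 1992, 1996, 1995, 1998]

tif_3 = tif(years, 3)  # third article appears in 1995, first in 1992
tif_5 = tif(years, 5)  # fifth article appears in 1998
```

For this toy term the third article appears 3 years after the first one and the fifth article 6 years after it.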

Of particular interest are trend terms with a large number of new articles (more than 10 articles in the previous 2 years). For such terms, the correlation between current and future IFTm exceeds 0.6, which allows a fairly confident forecast of IFTm (i.e. a citation forecast) for the next year.

We summarize how our study differs from existing works:
- we are interested in predicting the long-term citation impact based solely on the impact factors of significant terms (as mentioned above, we do not want to use variables that can be easily manipulated);
- we are interested in predicting the long-term citation impact within one or two years after publication;
- unlike most earlier papers, we are interested in predicting the probability distribution of the future number of citations of a publication, and we do not aim to give an accurate estimate of that number.

Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper) and IFT (lower), depending on the number of articles with the word in the last 3 years

Fig. 2. Graphs of the IFT correlations (upper) and IFTm correlations (lower) of the current and future years, depending on the number of articles with the word in the last 3 years

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] B. Gipp. Citation-based Document Similarity. Citation-based Plagiarism Detection. Springer Fachmedien Wiesbaden, 2014.
[2] W. H. Gomaa, A. A. Fahmy. A survey of text similarity approaches. Int. J. Comput. Appl., 68(13), 2013. DOI: 10.5120/11638-7118.
[3] L. Leydesdorff. Words and co-words as indicators of intellectual organization. Research Policy, 18(4), 1989. DOI: 10.1016/0048-7333(89)90016-4.
[4] M. Charnine, S. Klimenko. Measuring of "Idea-based" Influence of Scientific Papers. Proceedings of the 2015 International Conference on Information Science and Security (ICISS 2015), Seoul, South Korea, December 14-16, 2015.
[5] T. K. Landauer, S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 1997.
[6] I. Matveeva, G. Levow, A. Farahat, C. Royer. Generalized latent semantic analysis for term representation. Proc. of RANLP, 2005.
[7] I. Zaidi, S. Singh, A. Sinha, R. Dwivedi. Current views and implications of journal impact factor: A key note. Indian J Dent, 6(2), 2015. DOI: 10.4103/0975-962X.154375.
[8] R. Pan, S. Fortunato. Author Impact Factor: tracking the dynamics of individual scientific impact. Sci Rep, 4, 4880, 2015. DOI: 10.1038/srep04880.
[9] G. Walters. Predicting subsequent citations to articles published in twelve crime-psychology journals. 2006.
[12] M. Wang, G. Yu, D. Yu. Mining typical features for highly cited papers. Scientometrics, 87(3), 2011.
[13] M. Wang, G. Yu, J. Xu, H. He, D. Yu, S. An. Development a case-based classifier for predicting highly cited papers. Journal of Informetrics, 6(4), 2012.
[14] F. Didegah, M. Thelwall. Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology, 64(5), 2013.
[15] T. Yu, G. Yu, P.-Y. Li, L. Wang. Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics, 101(2), 2014.
[16] N. Onodera, F. Yoshikane. Factors affecting citation rates of research articles. Journal of the Association for Information Science and Technology, 66(4), 2015.
[17] X. Cao, Y. Chen, K. J. R. Liu. A data analytic approach to quantifying scientific impact. Journal of Informetrics, 10(2), 2016.
[18] M. Golosovsky, S. Solomon. Growing complex network of citations of scientific papers: Modeling and measurements. Physical Review E, 95(1), 12324, 2017.
[19] D. Fiala, G. Tutoky. PageRank-based prediction of award-winning researchers and the impact of citations. Journal of Informetrics, 11(4), 2018.
[20] X. Bai, F. Zhang, I. Lee. Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), 2019.
[21] D. Wang, C. Song, A.-L. Barabási. Quantifying long-term scientific impact. Science, 342(6154), 2013.
[22] L. Bornmann, L. Leydesdorff, J. Wang. Which percentile-based approach should be preferred for calculating normalized citation impact values? An empirical comparison of five approaches including a newly developed citation-rank approach (p100). Journal of Informetrics, 7(4), 2013.

About the authors

ANO «Scientific and Research Center for Information in Physics and Technique»

Charnine Mikhail M., PhD, Senior Researcher

Naberezhnye Chelny, Russia; Nizhny Novgorod, Russia; Moscow, Russia

FRC CSC of the Russian Academy of Sciences aida_khatif@mail