=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-8
|storemode=property
|title=Approaches to assessing the semantic similarity and future citation of publications by identifying informative terms with predictive properties
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-8.pdf
|volume=Vol-2763
|authors=Aida Khakimova,Michael Charnine
}}
==Approaches to assessing the semantic similarity and future citation of publications by identifying informative terms with predictive properties==
A.Kh. Khakimova¹, M.M. Charnine²
aida_khatif@mail.ru | mc@keywen.com
¹ ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia;
² FRC CSC of the Russian Academy of Sciences, Moscow, Russia
The article discusses new approaches to assessing the semantic similarity of documents in a vector space, taking into account statistically significant and informative terms. Informative terms reflect the current state of research in a given field. To select informative terms, an algorithm for calculating the impact factor of the term (IFT) is proposed. It is shown that informative terms make it possible both to evaluate the semantic similarity of texts and to predict future citations. The developed methods for assessing the semantic similarity and future impact of scientific publications can be used in the framework of “Predictive optimization”, a modern technology for making decisions based on forecasts. Bibliometric indicators often play an important role in evaluating research activities and individual scientists. However, citation-based indicators are problematic for determining the impact of recent publications: two years after publication, most articles have received only a few citations. The probability of future citation can be predicted using the proposed indicator, IFT.
Keywords: semantic similarity, informative terms, impact factor of the term, citations, statistical analysis, citation prediction.
1. Introduction

Measuring the similarity between documents is an important component of various tasks such as document clustering, topic detection, topic tracking, question answering, information retrieval and text summarization.

For scientific articles, there are two main types of similarity measures: citation-based similarity [1] and semantic textual similarity [2]. These two types of similarity measures should correlate, and maximizing this correlation is a convenient way to adjust the coefficients and parameters on which these measures depend.

Citation-based similarity measures such as bibliographic coupling (two documents share a reference in their bibliographies) and co-citation (two documents are cited by a third document) are an integral component of many information retrieval systems. Semantic textual similarity measures analyze situations where two documents share certain words (co-word linkages [3]), phrases or ideas [4].

Latent Semantic Analysis (LSA) [5] and Generalized Latent Semantic Analysis (GLSA) [6] are the most popular corpus-based techniques for semantic textual similarity [2]. GLSA extends the LSA approach by focusing on term vectors instead of the dual document-term representation.

Both techniques face the problem of efficiently filtering non-informative words: LSA and GLSA suffer from noise introduced by typos and by infrequent and non-informative words [6]. To solve this problem, we present a new citation-based method for efficiently filtering the core vocabulary and keeping only content-bearing words. This method is called the Impact Factor of Terms (IFT) and is described in Section 2. IFT assesses the significance and informational content of terms in scientific articles based on citation analysis of the articles with these terms. IFT is also useful for predicting future citations and promising topics in different subject areas such as smart energy systems.

Maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity is a convenient way to adjust the coefficients and parameters of the IFT method.

IFT is similar to the journal impact factor (JIF), which has been used for many years and has proven effective. JIF is a scientometric index that reflects the yearly average number of citations received by articles published in a given journal in the last two years. If all articles of a journal are highly cited, then this journal has a high JIF value and is considered significant and authoritative. Similarly, if all articles with some general term are highly cited, then this term has a high IFT value and is considered significant and informative. The IFT helps to identify informative terms that indicate significant fundamental ideas. Words and terms with a constantly high IFT (for example, neural networks) denote significant ideas, interest in which has been stable for many years. For such informative words, the IFT values are stably high, and the IFT values of the current and next year are highly correlated. This correlation, as well as the conditions for the stability and predictability of the IFT, is discussed in Section 4. Section 3 describes the collection of articles used in experiments to study the empirical properties of IFT, including its correlations. The next section gives a formal description of the IFT.

2. Impact Factor of Terms (IFT)

There are currently several journal ranking systems, but the oldest and most influential is the journal impact factor (JIF). JIF is used as an indicator of the importance of a journal for its field.

A journal's impact factor is based on how often articles published in that journal during the previous two years (e.g. 2017 and 2018) were cited by articles published in a particular year (e.g. 2019).

The higher the JIF, the more often articles in that journal are cited by other articles. Thus, the impact factor can give an approximate idea of how prestigious the journal is in its field of science.

The journal with the highest JIF value is the one that publishes the most frequently cited articles over a two-year period. One easy way to increase JIF is to publish more review articles, which are usually cited more often than research reports [7].
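
The two-year journal impact factor described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name `journal_impact_factor`, the tuple layouts and the toy data are hypothetical, and the denominator (the number of articles the journal published in the two preceding years) follows the usual JIF definition rather than anything specified in this paper.

```python
def journal_impact_factor(year, articles, citations):
    """Two-year impact factor of one journal for `year`.

    articles  -- iterable of (article_id, publication_year) for the journal
    citations -- iterable of (citing_year, cited_article_id)
    Both layouts are assumptions made for this sketch.
    """
    # Articles the journal published in the two years before `year`.
    window = {article_id for article_id, pub_year in articles
              if year - 2 <= pub_year <= year - 1}
    if not window:
        return 0.0
    # Citations made in `year` to articles inside that window.
    received = sum(1 for citing_year, cited_id in citations
                   if citing_year == year and cited_id in window)
    return received / len(window)


# Toy data (invented for illustration only).
articles = [("a1", 2017), ("a2", 2018), ("a3", 2016)]
citations = [(2019, "a1"), (2019, "a1"), (2019, "a2"),
             (2018, "a1"), (2019, "a3")]

jif_2019 = journal_impact_factor(2019, articles, citations)
print(jif_2019)  # 3 citations in 2019 to 2 articles from 2017-2018 -> 1.5
```

The citation made in 2018 and the citation to the 2016 article are ignored because they fall outside the citing year and the two-year window respectively.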
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY
4.0)
Author Impact Factor (AIF) is an extension of the impact factor to authors. The AIF of an author A in year t is the average number of citations given by papers published in year t to papers published by A in a period of Δt years before year t. AIF is able to capture trends and variations in the influence of scientists over time, in contrast to the h-index, which is a measure that takes into account the entire career path [8].

We extend the impact factor idea to terms and propose a new numerical indicator of the authority of words and terms, called the impact factor of the term (IFT).

IFT (formula 1) can be used to effectively filter the dictionary, excluding uninformative words and terms. With the help of IFT, we can identify promising topics and ideas, find implicit links between articles and texts, and discover ideologically influential sites.

$$\mathrm{IFT} = \frac{A_t}{N_t}, \qquad (1)$$

where A_t is the number of citations in articles with the term A published in year t to articles with the term A in the period of Δt years up to year t, and N_t is the total number of articles with term A over the time period Δt + 1.

Therefore, the IFT of term A in year t is the average number of citations made by articles with term A published in year t to articles with term A in the period of Δt years up to year t.

It follows from formula (1) that the method will increase the correlation of the similarity measure of texts with their bibliographic relationship, since the IFT depends linearly on the number of bibliographic references over the past two years (or over a period of Δt years).

Various approaches to the calculation of IFT were investigated.

The modified impact factor of the term (IFTm) is the ratio of citations of articles with term A to the total number of articles with this term over 3 years.

$$\mathrm{IFT}_m = \frac{A_{t-2} + A_{t-1} + A_t}{N}, \qquad (2)$$

where A_{t-2} is the number of links to articles with the term A two years ago in the same year; A_{t-1} is the number of links to articles with term A last year for the same and previous years; A_t is the number of links to articles with term A over a three-year period, including the current year; and N is the total number of articles with term A for three years.

Both the IFT and IFTm are considered only for articles in which the given term is in the title. Only citations from articles containing the specified term in the title are taken into account.

3. AI collection (Data Set)

In our experiments, we analyze the DBLP citation network, a collection of articles on Artificial Intelligence from 1936 to 2017, compiled by aminer.org and referred to here as the AI collection.

The citation data is extracted from DBLP (Digital Bibliography & Library Project, dblp.org), ACM (Association for Computing Machinery, acm.org), MAG (Microsoft Academic Graph), and other sources.

We used the V10 version released in October 2017. This data set consists of 3,079,007 articles and 25,166,994 citation relationships. For each article there is a title, authors, year of publication and links. We have processed all titles and citation relationships.

In this paper, the AI collection was analyzed in the different directions described in the next section.

4. Results of a statistical analysis of term trends

The main goal of the statistical analysis of the AI collection is to study the empirical properties of the Impact Factor of Terms (IFT), including the correlation of its current and future values, in order to assess its stability and forecast future citations.

Statistical analysis of the collection was carried out using the authors' Trend+ program, which built a frequency dictionary of all words and terms in the collection. Also, for each term with a frequency of more than 5, Trend+ calculated its trend indicators (trending situations), including the number of articles with this term for the year, the number of citations from other articles with this term, and the IFT and IFTm indicators for the current and next year.

To calculate the correlation, situations/points were selected for different words in different years in which the values of IFT and IFTm of the current year were greater than zero. There could be several such situations for one word in different years. The selected situations were divided into groups differing in the number of articles with a word over the past 3 years. In terms of the number of situations, the IFTm groups turned out to be larger than the IFT groups, because IFTm takes more citations into account. Fig. 1 shows graphs of the number of situations/points in these groups for calculating correlations.
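
As a rough illustration of formula (1), the sketch below computes the IFT of a term on a toy corpus. It is not the Trend+ program. The function name `impact_factor_of_term`, the data layout, the substring match on lower-cased titles, and the reading of the window as the Δt + 1 years from t − Δt to t inclusive are all assumptions made for this example.

```python
def impact_factor_of_term(term, year, articles, citations, delta_t=2):
    """IFT of `term` in `year`, following formula (1): A_t / N_t.

    articles  -- dict: article_id -> (publication_year, title)
    citations -- iterable of (citing_article_id, cited_article_id)
    Assumption: the window covers the years year - delta_t .. year inclusive.
    """
    term = term.lower()
    # Only articles whose title contains the term are considered.
    with_term = {aid: pub_year
                 for aid, (pub_year, title) in articles.items()
                 if term in title.lower()}
    first_year = year - delta_t
    # N_t: articles with the term published inside the window.
    cited_pool = {aid for aid, pub_year in with_term.items()
                  if first_year <= pub_year <= year}
    if not cited_pool:
        return 0.0
    # A_t: citations from term articles of `year` to term articles in the window.
    a_t = sum(1 for citing, cited in citations
              if with_term.get(citing) == year and cited in cited_pool)
    return a_t / len(cited_pool)


# Toy corpus (invented for illustration only).
articles = {
    "p1": (2015, "Neural networks for vision"),
    "p2": (2016, "Deep neural networks"),
    "p3": (2017, "Neural networks in speech"),
    "p4": (2017, "Fuzzy control"),
    "p5": (2017, "Training neural networks faster"),
    "p6": (2016, "Neural networks survey"),
}
citations = [("p3", "p1"), ("p3", "p2"), ("p5", "p2"),
             ("p4", "p1"), ("p5", "p3"), ("p2", "p1")]

ift_nn = impact_factor_of_term("neural networks", 2017, articles, citations)
print(ift_nn)  # 4 qualifying citations / 5 articles with the term -> 0.8
```

The citation from p4 is skipped because its title lacks the term, and the citation from p2 is skipped because p2 was not published in the year being evaluated.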
Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper)
and IFT, depending on the number of articles with the word in the last 3 years
In Fig. 1, the upper graph corresponds to IFTm, and the lower to IFT. The y-axis represents the number of points for calculating the correlations of the current and future years. The x-axis represents the frequency of terms, i.e. the number of articles with the term over the last 3 years. The maximum on both graphs is reached when the number of articles is 5, because the experiment did not analyze terms that occurred fewer than 5 times in the collection over all time.

On the IFT graph, the maximum number of points, 54326, is reached at X = 5, and the minimum, 2423, at X = 50. On the IFTm graph, the maximum number of points, 91997, is reached at X = 5, and the minimum, 2913, at X = 50.

For each group of trending situations/points (i.e., for each X) individually, a correlation was calculated between the current and future values of IFT and IFTm. The results are shown in Fig. 2.
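
The per-group correlation described above might be computed along the following lines. This is a sketch under assumptions: the paper does not name the correlation coefficient, so Pearson's r is used here, and the function names, the point layout `(articles_last_3_years, ift_current, ift_next)` and the toy numbers are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def correlation_by_group(points):
    """Group points by term frequency X and correlate current vs. next-year IFT.

    points -- iterable of (articles_last_3_years, ift_current, ift_next)
    Points whose current-year value is not positive are dropped,
    mirroring the selection rule described in the text.
    """
    groups = defaultdict(list)
    for x, current, future in points:
        if current > 0:
            groups[x].append((current, future))
    return {x: pearson([c for c, _ in pairs], [f for _, f in pairs])
            for x, pairs in sorted(groups.items())}


# Toy points (invented for illustration only).
points = [(5, 1.0, 2.0), (5, 2.0, 4.0), (5, 3.0, 6.0),
          (10, 1.0, 3.0), (10, 2.0, 2.0), (10, 3.0, 1.0),
          (10, 0.0, 5.0)]  # last point is dropped: current IFT is zero

result = correlation_by_group(points)
print(result)
```

On real data each group would hold thousands of points, as reported for Fig. 1, so the degenerate cases that return `None` should be rare.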
Fig. 2. Graph of IFT correlations (upper) and IFTm correlations of the current and future years depending on the number of articles
with the word in the last 3 years
The upper graph shows the IFT correlations, and the lower graph the IFTm correlations.

Both graphs behave very similarly, but the correlations of the IFT (upper graph) are almost always greater than those of the IFTm. The correlation on the graphs reaches 0.5 at a frequency of 17 articles over the past three years, 0.6 at 26 articles, and 0.7 at 45 articles. Thus, IFT behaves more stably and predictably than IFTm, but IFTm covers more situations and words/terms.

The graphs show that the higher the current frequency of the term (the number of articles with the term), the higher the correlation, and therefore the more stably the IFT behaves over time. A stable IFT makes it possible to predict the average number of future citations accurately, since the IFT is exactly equal to the average number of citations of articles with the specified word/term. Thus, words/terms with high frequencies and high IFT values define promising topics in different subject areas such as artificial intelligence or smart energy systems.

The most stable and predictable words/terms with high IFT values are called informative terms. Informative words/terms have high frequencies and IFT values above a certain threshold. The form of the function for filtering non-informative words, which grows with increasing IFT and frequency, can be selected by maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity. As a first approximation, this filtering function can be taken as the product of IFT and frequency, with a certain minimum threshold for IFT.

Here are examples of the most informative words/terms in the collection of AI articles, those with the largest total values of IFT multiplied by the current frequency: web (year 1982), fuzzy (1969), sensor networks (1992), neural (1962), video (1976), social (1971), cognitive (1973), semantic (1967), clustering (1970), neural networks (1986).

These examples point to the most actively and stably developing areas of AI, and also confirm the usefulness of the proposed filtering function and its ability to evaluate the significance and information content of words/terms.

5. Predicting the citations with IFT

Prediction of the citation of scientific works has been studied by many researchers. The described approaches are mainly based on the analysis of a number of features, including information about the authors (number of authors, country, author rating, etc.), features of the journal (total number of links to the journal, impact factor of the journal), article parameters (topic, volume, number of references, etc.), type of research (for example, original research compared to a literature review), as well as other characteristics (reputation of institutions, etc.). In addition, altmetrics are also used to predict the citation of a scientific paper.

Citation prediction methods have been proposed, for example, by Walters (2006) [9], Haslam et al. (2008) [10], Fu and Aliferis (2010) [11], Wang, Yu and Yu (2011) [12], Wang et al. (2012) [13], Didegah and Thelwall (2013) [14], Yu, Yu, Li and Wang (2014) [15], Onodera and Yoshikane (2015) [16], Cao et al. (2016) [17], Golosovsky and Solomon (2017) [18], Fiala and Tutoky (2018) [19], Bai et al. (2019) [22]. For example, Wang et al. (2013) [20] propose mathematical models that describe how publications accumulate citations over time. Using these models, the authors predict the longer-term citation of a publication from its short-term citation history. Bornmann et al. (2013) [21] present an empirical analysis of the correlation between short-term and long-term citation indicators.

IFT evaluates the significance and informativeness of terms in scientific articles based on an analysis of the citation of articles with these terms. IFT can also be used to predict future citations of new articles.

Given the practical importance of incorporating the latest publications in evaluations of scientific performance, one of the goals of our study is to develop a model to predict the impact that recent publications will have in the long run.

Our model predicts the citation of a publication from the following predictors: the impact factor of significant terms (for example, authors' keywords) and the time of appearance of subsequent articles associated by implicit links with the original article.

The two predictors used are readily available and, unlike most prediction approaches, they allow predictions to be made soon after publication.

Citation forecasts have a high degree of uncertainty. We therefore believe that it is more important to know the likelihood that the publication will receive a certain number of links in the future. Accordingly, we do not predict the average number of links that the publication should attract in the future; instead, we predict the probability distribution for the future number of links, based on the developed mathematical probabilistic model of the dependence of the number of direct citations on terms with high IFT.

It is important to emphasize that the purpose of our work is different from the studies mentioned above. As in those studies, we are interested in predicting future citation. However, many indicators that have been found to correlate with citation impact are easy to manipulate.

For example, suppose researchers know that future citations of a publication will be predicted from the number of pages or the number of links. In this case, authors can artificially increase the number of pages or the number of bibliographic references. Therefore, we consider variables that cannot be changed by the authors of the publication.

Based on IFT values, we can choose informative terms that indicate important fundamental ideas. Words and terms with a consistently high IFT indicate important ideas that have been stable for many years.

In our experiments, we analyze the DBLP citation network, which is a collection of articles on artificial intelligence from 1936 to 2017, including 3,079,007 articles and 25,166,994 links. Statistical analysis of the collection was carried out using the Trend+ program, which built a frequency dictionary and trend indicators, including the number of articles with this term per year, the number of links to other articles with this term, and the IFT and IFTm indicators for the current and next year.

The term “Trend of the initial frequency” (TIF) is proposed: this is the number of years from the first article with a certain term to the nth article with this term. A relationship was found between TIF, IFT, and citation trends. It is shown that the higher the trends of the initial frequency, the higher the trends of fresh citation links, that is, the higher the likelihood that links to the article appear quickly.

Of particular interest are trend terms with a large number of new articles (more than 10 articles in the previous 2 years). For trend terms, the correlation of current and future IFTm is more than 0.6, which allows us to make a fairly confident forecast of IFTm (i.e. a citation forecast) for the next year.

We summarize how our study differs from existing works:
˗ we are interested in predicting the long-term impact of citation, based solely on the impact factors of
significant terms (as mentioned above, we do not want to use variables that can be easily manipulated);
˗ we are interested in predicting the long-term impact of citation within one or two years after the publication;
˗ unlike most earlier papers, our interest is in predicting the probability distribution for the future number of links to a publication. We do not aim to give an accurate estimate of the future number of links to the publication.

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] Gipp, B. (2014). Citation-based Document Similarity. Citation-based Plagiarism Detection. Springer Fachmedien Wiesbaden, pp. 43-55.
[2] Gomaa, W.H. and Fahmy, A.A. (2013). A survey of text similarity approaches. Int. J. Comput. Appl., vol. 68, no. 13. doi: https://doi.org/10.5120/11638-7118.
[3] Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), pp. 209-223. DOI http://dx.doi.org/10.1016/0048-7333(89)90016-4. URL http://www.sciencedirect.com/science/article/pii/0048733389900164
[4] Charnine, M., Klimenko, S. (2015). Measuring of “Idea-based” Influence of Scientific Papers // Proceedings of the 2015 International Conference on Information Science and Security (ICISS 2015), December 14-16, Seoul, South Korea, pp. 160-164.
[5] Landauer, T.K. & Dumais, S.T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104.
[6] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation. In Proc. of RANLP.
[7] Zaidi, I., Singh, S., Sinha, A., Dwivedi, R. (2015). Current views and implications of journal impact factor: A key note. Indian J Dent., 6(2), pp. 113-114. doi:10.4103/0975-962X.154375
[8] Pan, R., Fortunato, S. (2015). Author Impact Factor: tracking the dynamics of individual scientific impact. Sci Rep 4, 4880. https://doi.org/10.1038/srep04880.
[9] Walters, G. (2006). Predicting subsequent citations to articles published in twelve crime-psychology journals: Author impact versus journal impact. Scientometrics, 69(3), pp. 499-510.
[10] Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J., et al. (2008). What makes an article influential? Predicting impact in social and personality psychology. Scientometrics, 76(1), pp. 169-185.
[11] Fu, L., & Aliferis, C. (2010). Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 85(1), pp. 257-270.
[12] Wang, M., Yu, G., & Yu, D. (2011). Mining typical features for highly cited papers. Scientometrics, 87(3), pp. 695-706.
[13] Wang, M., Yu, G., Xu, J., He, H., Yu, D., & An, S. (2012). Development of a case-based classifier for predicting highly cited papers. Journal of Informetrics, 6(4), pp. 586-599.
[14] Didegah, F., & Thelwall, M. (2013a). Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology, 64(5), pp. 1055-1064.
[15] Yu, T., Yu, G., Li, P.-Y. & Wang, L. (2014). Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics, 101(2), pp. 1233-1252.
[16] Onodera, N. & Yoshikane, F. (2015). Factors affecting citation rates of research articles. Journal of the Association for Information Science and Technology, 66(4), pp. 739-764.
[17] Cao, X., Chen, Y., Liu, K.J.R. (2016). A data analytic approach to quantifying scientific impact. Journal of Informetrics, 10(2), pp. 471-484.
[18] Golosovsky, M., Solomon, S. (2017). Growing complex network of citations of scientific papers: Modeling and measurements. Physical Review E, 95(1), p. 012324.
[19] Fiala, D., Tutoky, G. (2018). PageRank-based prediction of award-winning researchers and the impact of citations. Journal of Informetrics, 11(4), pp. 1044-1068.
[20] Wang, D., Song, C., Barabási, A.-L. (2013). Quantifying long-term scientific impact. Science, 342(6154), pp. 127-132.
[21] Bornmann, L., Leydesdorff, L., & Wang, J. (2013). Which percentile-based approach should be preferred for calculating normalized citation impact values? An empirical comparison of five approaches including a newly developed citation-rank approach (P100). Journal of Informetrics, 7(4), pp. 933-944.
[22] Bai, X., Zhang, F., Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, 13(1), pp. 407-418.

About the authors

Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru
Charnine Mikhail M., PhD, Senior Researcher, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: mc@keywen.com