Approaches to assessing the semantic similarity and future citation of
  publications by identifying informative terms with predictive properties
                                         A.Kh. Khakimova¹, M.M. Charnine²
                                      aida_khatif@mail.ru | mc@keywen.com
      ¹ ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia;
                           ² FRC CSC of the Russian Academy of Sciences, Moscow, Russia

    The article discusses new approaches to assessing the semantic similarity of documents in a vector space, taking into account
statistically significant and informative terms. Informative terms reflect the current state of research in a given field. To select
informative terms, an algorithm for calculating the impact factor of the term (IFT) is proposed. It is shown that informative terms
make it possible both to evaluate the semantic similarity of texts and to predict future citations. The developed methods for assessing
the semantic similarity and future impact of scientific publications can be used in the framework of “Predictive optimization”, a
modern technology for making decisions based on forecasts. Bibliometric indicators often play an important role in evaluating
research and individual scientists. However, citation-based indicators are problematic for determining the impact of recent
publications: two years after publication, most articles have received only a few citations. The probability of future citation
can be predicted using the proposed indicator, IFT.
    Keywords: semantic similarity, informative terms, impact factor of the term, citations, statistical analysis, citation prediction.

1. Introduction
    Measuring the similarity between documents is an important component in various tasks such as document clustering,
topic detection, topic tracking, question answering, information retrieval and text summarization.
    For scientific articles, there are two main types of similarity measures: citation-based similarity [1] and semantic
textual similarity [2]. These two types of similarity measures should correlate, and maximizing this correlation is a
convenient way to adjust the coefficients and parameters on which these measures depend.
    Citation-based similarity measures such as bibliographic coupling (two documents share a reference in their
bibliographies) and co-citation (two documents are cited by a third document) are an integral component of many
information retrieval systems. Semantic textual similarity measures analyze situations where two documents share certain
words (co-word linkages [3]), phrases or ideas [4].
    Latent Semantic Analysis (LSA) [5] and Generalized Latent Semantic Analysis (GLSA) [6] are the most popular techniques
of corpus-based semantic textual similarity [2]. GLSA extends the LSA approach by focusing on term vectors instead of the
dual document-term representation.
    There is a problem of efficient filtering of non-informative words. LSA and GLSA suffer from noise introduced by typos
and by infrequent and non-informative words [6]. To solve this problem, we present a new citation-based method that
efficiently filters the core vocabulary and keeps only content-bearing words. This method is called the Impact Factor of
Terms (IFT) and is described in Section 2. IFT assesses the significance and informational content of terms in scientific
articles based on citation analysis of the articles with these terms. IFT is also useful for predicting future citations and
promising topics in different subject areas such as smart energy systems.
    Maximizing the correlation between citation-based similarity and IFT-based semantic textual similarity is a convenient
way to adjust the coefficients and parameters of the IFT method.
    IFT is similar to the journal impact factor (JIF), which has been used for many years and has proven effective. JIF is a
scientometric index that reflects the yearly average number of citations received by the articles published in a given
journal in the last two years. If all articles of a journal are highly cited, then this journal has a high JIF value and is
considered significant and authoritative. Similarly, if all articles with some general term are highly cited, then this term
has a high IFT value and is considered significant and informative. The IFT helps to identify informative terms that
indicate significant fundamental ideas. Words and terms with a consistently high IFT (for example, neural networks) denote
significant ideas in which interest has been stable for many years. Such words also show a high correlation between the IFT
values of the current and the next year. This correlation, as well as the conditions for the stability and predictability
of the IFT, is discussed in Section 4. Section 3 describes the collection of articles used in the experiments to study the
empirical properties of IFT, including its correlations. The next section gives a formal description of the IFT.

2. Impact Factor of Terms (IFT)
    There are currently several journal ranking systems, but the oldest and most influential is the journal impact factor
(JIF). JIF is used as an indicator of the importance of a journal for its field.
    A journal's impact factor is based on how often articles published in that journal during the previous two years (e.g.
2017 and 2018) were cited by articles published in a particular year (e.g. 2019).
    The higher the JIF, the more often articles in that journal are cited by other articles. Thus, the impact factor can
give an approximate idea of how prestigious the journal is in its field of science.
    The journal with the highest JIF value is the one that publishes the most frequently cited articles over a two-year
period. One easy way to increase JIF is to publish more review articles, which are usually cited more often than research
reports [7].
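    As a small worked illustration of the two-year JIF window described above, the calculation can be sketched in Python.
The function name and all numbers are hypothetical and serve only to show the arithmetic.

```python
def journal_impact_factor(citations_to_prev_two_years, items_in_prev_two_years):
    """Average number of citations received in the target year by the
    articles a journal published in the two preceding years."""
    if items_in_prev_two_years == 0:
        raise ValueError("no articles published in the two preceding years")
    return citations_to_prev_two_years / items_in_prev_two_years


# Hypothetical journal: 40 articles in 2017 and 60 in 2018,
# cited 250 times in total by articles published in 2019.
jif_2019 = journal_impact_factor(250, 40 + 60)
print(jif_2019)  # 250 / 100 = 2.5
```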


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY
4.0)
    Author Impact Factor (AIF) is an extension of the impact factor for authors. The AIF of an author A in year t is the
average number of citations given by papers published in year t to papers published by A in a period of Δt years before
year t. AIF is able to capture trends and variations in the influence of scientists over time, in contrast to the h-index,
which is a measure that takes into account the entire career path [8].
    We extend the impact factor idea to terms and propose a new numerical indicator of the authority of words and terms,
called the impact factor of the term (IFT).
    IFT (formula 1) can be used to effectively filter the dictionary, excluding uninformative words and terms. With the
help of IFT, we can identify promising topics and ideas, find implicit links between articles and texts, and discover
ideologically influential sites.

    $$\mathrm{IFT} = \frac{A_t}{N_t}, \qquad (1)$$

where $A_t$ is the number of citations made by articles with term A published in year t to articles with term A from the
period of Δt years up to year t; $N_t$ is the total number of articles with term A over the period of Δt + 1 years.
    Thus, the IFT of term A in year t is the average number of references from articles with term A published in year t to
articles with term A from the period of Δt years up to year t.
    It follows from formula (1) that the method will increase the correlation between the text similarity measure and the
bibliographic relationship of the texts, since the IFT depends linearly on the number of bibliographic references over the
past two years (or over a period of Δt years).
    Various approaches to the calculation of IFT were investigated.
    The modified impact factor of the term (IFTm) is the ratio of citations of articles with term A to the total number of
articles with this term over 3 years.

    $$\mathrm{IFT}_m = \frac{A_{t-2} + A_{t-1} + A_t}{N}, \qquad (2)$$

where $A_{t-2}$ is the number of links to articles with term A two years ago, within that same year; $A_{t-1}$ is the
number of links to articles with term A last year, covering that year and the previous one; $A_t$ is the number of links to
articles with term A over the three-year period including the current year; $N$ is the total number of articles with term A
over the three years.
    Both the IFT and IFTm are considered only for articles in which the given term is in the title. Only citations from
articles containing the specified term in the title are taken into account.

3. AI collection (Data Set)
    In our experiments, we analyze the DBLP citation network, which is a collection of articles on Artificial Intelligence
from 1936 to 2017, compiled by aminer.org and referred to here as the AI collection.
    The citation data is extracted from DBLP (Digital Bibliography & Library Project, dblp.org), ACM (Association for
Computing Machinery, acm.org), MAG (Microsoft Academic Graph), and other sources.
    We used the V10 version released in October 2017. This data set consists of 3,079,007 articles and 25,166,994 citation
relationships. For each article there is a title, authors, year of publication and links. We have processed all titles and
citation relationships.
    In this paper, the AI collection was analyzed in different directions described in the next Section.

4. Results of a statistical analysis of term trends
    The main goal of the statistical analysis of the AI collection is to study the empirical properties of the Impact Factor
of Terms (IFT), including the correlation of its current and future values, in order to assess its stability and to forecast
future citations.
    Statistical analysis of the collection was carried out using the authors' Trend+ program, which built a frequency
dictionary of all words and terms in the collection. Also, for each term with a frequency of more than 5, Trend+ calculated
its trend indicators (trending situations), including the number of articles with this term for the year, the number of
citations from other articles with this term, and the IFT and IFTm indicators for the current and next year.
    To calculate the correlation, situations/points were selected for different words in different years in which the values
of IFT and IFTm of the current year were greater than zero. There could be several such situations for one word in different
years. The selected situations were divided into groups differing in the number of articles with the word over the past 3
years. The IFTm groups turned out to contain more situations than the IFT groups, because IFTm takes into account more
citations. Fig. 1 shows graphs of the number of situations/points in these groups for calculating correlations.
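    The IFT indicator of formula (1), as computed for a term and a year, can be sketched in Python as follows. This is a
minimal illustration and not the Trend+ implementation: the `Article` record, the function names, the substring test on
titles, and the reading of the window as the years t - Δt through t inclusive are all assumptions, and the sample data
is invented.

```python
from collections import namedtuple

# Hypothetical record type: the data set provides a title and a year per article.
Article = namedtuple("Article", ["id", "year", "title"])


def has_term(article, term):
    """Assumed matching rule: case-insensitive substring search in the title."""
    return term.lower() in article.title.lower()


def ift(articles, citations, term, year, delta_t=2):
    """Formula (1): IFT = A_t / N_t for `term` in `year`.

    `citations` is a list of (citing_id, cited_id) pairs.
    Assumption: the window covers the years year - delta_t .. year inclusive.
    """
    by_id = {a.id: a for a in articles}
    first_year = year - delta_t

    # N_t: articles with the term in the title published inside the window.
    n_t = sum(1 for a in articles
              if first_year <= a.year <= year and has_term(a, term))
    if n_t == 0:
        return 0.0

    # A_t: citations from articles with the term published in `year`
    # to articles with the term published inside the window.
    a_t = 0
    for citing_id, cited_id in citations:
        citing, cited = by_id.get(citing_id), by_id.get(cited_id)
        if citing is None or cited is None:
            continue
        if (citing.year == year and has_term(citing, term)
                and first_year <= cited.year <= year and has_term(cited, term)):
            a_t += 1
    return a_t / n_t


# Invented toy collection.
articles = [
    Article(1, 2017, "Neural networks for vision"),
    Article(2, 2018, "Deep neural networks"),
    Article(3, 2019, "Neural networks revisited"),
    Article(4, 2019, "Fuzzy control"),
    Article(5, 2019, "Training neural networks fast"),
]
citations = [(3, 1), (3, 2), (5, 1), (4, 1), (5, 4)]

# Three qualifying citations over four articles with the term.
print(ift(articles, citations, "neural networks", 2019))  # 0.75
```

    Only citations in which both the citing and the cited title contain the term are counted, which mirrors the restriction
to titles stated in Section 2.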
 Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper)
                       and IFT (lower), depending on the number of articles with the word in the last 3 years

    In Fig. 1, the upper graph corresponds to IFTm and the lower one to IFT. The y-axis represents the number of points for
calculating the correlations of the current and future years. The x-axis represents the frequency of terms, i.e. the number
of articles with the term over the last 3 years. The maximum number of points on both graphs is reached when the number of
articles is 5, because the experiment did not analyze terms that occurred fewer than 5 times in the collection over the whole
period.
    On the IFT graph, the maximum number of points, 54326, is reached at X = 5, and the minimum, 2423, at X = 50. On the
IFTm graph, the maximum number of points, 91997, is reached at X = 5, and the minimum, 2913, at X = 50.
    For each group of trending situations/points (i.e., for each X) individually, a correlation was calculated between the
current and future values of IFT and IFTm. The results of calculating the correlations are shown in Fig. 2.
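    The per-group correlation described above can be sketched as follows. This is a minimal illustration, not the Trend+
implementation: the tuple layout `(x, current, future)`, the function names and the choice of the Pearson coefficient are
assumptions (the text does not name the coefficient used), and the sample points are invented.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlation_by_group(points):
    """Group points by X (articles with the term in the last 3 years) and
    correlate the current-year indicator with the next-year indicator."""
    groups = defaultdict(list)
    for x, current, future in points:
        if current > 0:  # only situations with a positive current value are kept
            groups[x].append((current, future))

    result = {}
    for x, pairs in sorted(groups.items()):
        if len(pairs) < 2:
            continue  # a correlation needs at least two points
        r = pearson([c for c, _ in pairs], [f for _, f in pairs])
        if r is not None:
            result[x] = r
    return result


# Invented points: (X, indicator in current year, indicator in next year).
points = [
    (5, 1.0, 2.0), (5, 2.0, 4.0), (5, 3.0, 6.0),
    (6, 1.0, 3.0), (6, 2.0, 2.0), (6, 3.0, 1.0),
    (7, 0.0, 1.0),  # dropped: current value is not positive
]
by_group = correlation_by_group(points)
```

    Plotting `by_group` against X would give a curve of the kind shown in Fig. 2, computed once for IFT and once for IFTm.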


 Fig. 2. Graphs of IFT correlations (upper) and IFTm correlations (lower) of the current and future years, depending on the number of
                                              articles with the word in the last 3 years

    The upper graph shows the IFT correlations, and the lower graph shows the IFTm correlations.
    Both graphs behave very similarly, but the correlations of the IFT (upper graph) are almost always greater than the
correlations of the IFTm. The correlation on the graphs reaches 0.5 at a frequency of 17 articles over the past three years,
0.6 at 26 articles, and 0.7 at 45 articles. Thus, IFT behaves more stably and predictably than IFTm, but IFTm covers more
situations and more words/terms.
    The graphs show that the higher the current frequency of the term (the number of articles with the term), the higher the
correlation, and therefore the more stably the IFT behaves over time. A stable IFT makes it possible to predict the average
number of future citations accurately, since the IFT is exactly equal to the average number of citations of articles with
the specified word/term. Thus, words/terms with high frequencies and high IFT values
define promising topics in different subject areas such as artificial intelligence or smart energy systems.
    The most stable and predictable words/terms with high IFT values are called informative terms. Informative words/terms
have high frequencies and IFT values above a certain threshold. The form of the function for filtering non-informative words,
which grows with increasing IFT and frequency, can be selected by maximizing the correlation between citation-based
similarity and IFT-based semantic textual similarity. As a first approximation, this filtering function can be taken as the
product of IFT and frequency with a certain minimum threshold for IFT.
    Here are examples of the most informative words/terms in the collection of AI articles that have the largest total
values of IFT multiplied by the current frequency: web (year 1982), fuzzy (1969), sensor networks (1992), neural (1962),
video (1976), social (1971), cognitive (1973), semantic (1967), clustering (1970), neural networks (1986).
    These examples point to the most actively and stably developing areas of AI, and also confirm the usefulness of the
proposed filtering function and its ability to evaluate the significance and information content of words/terms.

5. Predicting the citations with IFT
    The prediction of citations of scientific works has been studied by many researchers. The described approaches are
mainly based on the analysis of a number of features, including information about the authors (number of authors, country,
author rating, etc.), features of the journal (total number of links to the journal, impact factor of the journal), article
parameters (topic, volume, number of references, etc.), type of research (for example, original research compared to a
literature review), as well as other characteristics (reputation of institutions, etc.). In addition, altmetrics are also
used to predict the citation of a scientific paper.
    Citation prediction methods have been proposed, for example, by Walters (2006) [9], Haslam et al. (2008) [10], Fu and
Aliferis (2010) [11], Wang, Yu and Yu (2011) [12], Wang et al. (2012) [13], Didegah and Thelwall (2013) [14], Yu, Yu, Li
and Wang (2014) [15], Onodera and Yoshikane (2015) [16], Cao et al. (2016) [17], Golosovsky and Solomon (2017) [18], Fiala
and Tutoky (2018) [19], Bai et al. (2019) [22]. For example, Wang et al. (2013) [20] propose mathematical models that
describe how publications accumulate citations over time. Using these models, the authors predict the long-term citation
impact of a publication from its short-term citation history. Bornmann et al. (2013) [21] present an empirical analysis of
the correlation between short-term and long-term citation indicators.
    IFT evaluates the significance and informativeness of terms in scientific articles based on an analysis of the citation
of articles with these terms. IFT can also be used to predict future citations of new articles.
    Given the practical importance of incorporating the latest publications in evaluations of scientific performance, one of
the goals of our study is to develop a model to predict the impact that recent publications will have in the long run.
    Our model predicts the citation of a publication from the following predictors: the impact factor of significant terms
(for example, authors' keywords) and the time of appearance of subsequent articles associated with the original article by
implicit links.
    The two predictors used are readily available, and unlike most prediction approaches, they allow predictions to be made
soon after publication.
    Citation forecasts have a high degree of uncertainty. Therefore, we believe that it is more important to know the
likelihood that the publication will receive a certain number of links in the future. For this reason, we do not predict the
average number of links that the publication should attract in the future; instead, we predict the probability distribution
for the future number of links, based on the developed mathematical probabilistic model of the dependence of the number of
direct citations on terms with high IFT.
    It is important to emphasize that the purpose of our work is different from the studies mentioned above. As in the above
studies, we are interested in predicting future citation. However, many indicators that have been found to correlate with
citation impact are easy to manipulate.
    Suppose researchers know that future citations of a publication will be predicted based, for example, on the number of
pages or the number of links. In this case, authors can artificially increase the number of pages or the number of
bibliographic references. Therefore, we consider variables that cannot be changed by the authors of the publication.
    Based on IFT values, we can choose informative terms that indicate important fundamental ideas. Words and terms with a
consistently high IFT indicate important ideas that have been stable for many years.
    In our experiments, we analyze the DBLP citation network, which is a collection of articles on artificial intelligence
from 1936 to 2017, including 3,079,007 articles and 25,166,994 links. Statistical analysis of the collection was carried
out using the Trend+ program, which built a frequency dictionary and trend indicators, including the number of articles with
a given term per year, the number of links to other articles with this term, and the IFT and IFTm indicators for the current
and next year.
    We propose the indicator “Trend of the initial frequency” (TIF): the number of years from the first article with a
certain term to the nth article with this term. A relationship was found between TIF, IFT, and citation trends. It is shown
that the higher the trends of the initial frequency, the higher the trends of fresh citation links, that is, the higher the
likelihood of quick appearance of links to the article.
    Of particular interest are trend terms with a large number of new articles (more than 10 articles in the previous 2
years). For trend terms, the correlation of current and future IFTm is more than 60%, which allows us to make a fairly
confident forecast of IFTm (i.e. a citation forecast) for the next year.
    We summarize how our study differs from existing works:
˗   we are interested in predicting the long-term impact of citation, based solely on the impact factors of
    significant terms (as mentioned above, we do not want to use variables that can be easily manipulated);
˗   we are interested in predicting the long-term impact of citation within one or two years after the publication;
˗   unlike most earlier papers, our interest is in predicting the probability distribution for the future number of links
    to a publication. We do not aim to give an accurate estimate of the future number of links to the publication.

Acknowledgment
    The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] Gipp, B. (2014). Citation-based Document Similarity. Citation-based Plagiarism Detection. Springer Fachmedien
     Wiesbaden, pp. 43-55.
[2] Gomaa, W.H. and Fahmy, A.A. (2013). A survey of text similarity approaches. Int. J. Comput. Appl., vol. 68, no. 13,
     doi: https://doi.org/10.5120/11638-7118.
[3] Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4),
     pp. 209-223. DOI http://dx.doi.org/10.1016/0048-7333(89)90016-4.
     URL http://www.sciencedirect.com/science/article/pii/0048733389900164
[4] Charnine, M., Klimenko, S. (2015). Measuring of “Idea-based” Influence of Scientific Papers // Proceedings of the 2015
     International Conference on Information Science and Security (ICISS 2015), December 14-16, Seoul, South Korea,
     pp. 160-164.
[5] Landauer, T.K. & Dumais, S.T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of
     acquisition, induction, and representation of knowledge. Psychological Review, 104.
[6] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation.
     In Proc. of RANLP.
[7] Zaidi, I., Singh, S., Sinha, A., Dwivedi, R. (2015). Current views and implications of journal impact factor: A key
     note. Indian J Dent. 6(2):113-114. doi:10.4103/0975-962X.154375
[8] Pan, R., Fortunato, S. (2015). Author Impact Factor: tracking the dynamics of individual scientific impact. Sci Rep 4,
     4880. https://doi.org/10.1038/srep04880.
[9] Walters, G. (2006). Predicting subsequent citations to articles published in twelve crime-psychology journals: Author
     impact versus journal impact. Scientometrics, 69(3), pp. 499-510.
[10] Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J., et al. (2008). What makes an article
     influential? Predicting impact in social and personality psychology. Scientometrics, 76(1), pp. 169-185.
[11] Fu, L., & Aliferis, C. (2010). Using content-based and bibliometric features for machine learning models to predict
     citation counts in the biomedical literature. Scientometrics, 85(1), pp. 257-270.
[12] Wang, M., Yu, G., & Yu, D. (2011). Mining typical features for highly cited papers. Scientometrics, 87(3),
     pp. 695-706.
[13] Wang, M., Yu, G., Xu, J., He, H., Yu, D., & An, S. (2012). Development a case-based classifier for predicting highly
     cited papers. Journal of Informetrics, 6(4), pp. 586-599.
[14] Didegah, F., & Thelwall, M. (2013a). Determinants of research citation impact in nanoscience and nanotechnology.
     Journal of the American Society for Information Science and Technology, 64(5), pp. 1055-1064.
[15] Yu, T., Yu, G., Li, P.-Y. & Wang, L. (2014). Citation impact prediction for scientific papers using stepwise
     regression analysis. Scientometrics, 101(2), pp. 1233-1252.
[16] Onodera, N. & Yoshikane, F. (2015). Factors affecting citation rates of research articles. Journal of the Association
     for Information Science and Technology, 66(4), pp. 739-764.
[17] Cao, X., Chen, Y., Liu, K.J.R. (2016). A data analytic approach to quantifying scientific impact. Journal of
     Informetrics, 10(2), pp. 471-484.
[18] Golosovsky, M., Solomon, S. (2017). Growing complex network of citations of scientific papers: Modeling and
     measurements. Physical Review E, 95(1), p. 012324.
[19] Fiala, D., Tutoky, G. (2018). PageRank-based prediction of award-winning researchers and the impact of citations.
     Journal of Informetrics, 11(4), pp. 1044-1068.
[20] Wang, D., Song, C., Barabási, A.-L. (2013). Quantifying long-term scientific impact. Science, 342(6154), pp. 127-132.
[21] Bornmann, L., Leydesdorff, L., & Wang, J. (2013). Which percentile-based approach should be preferred for calculating
     normalized citation impact values? An empirical comparison of five approaches including a newly developed
     citation-rank approach (p100). Journal of Informetrics, 7(4), pp. 933-944.
[22] Bai, X., Zhang, F., Lee, I. (2019). Predicting the citations of scholarly paper. Journal of Informetrics, Volume 13,
     Issue 1, pp. 407-418.

About the authors
    Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for
Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru
    Charnine Mikhail M., PhD, Senior Researcher, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail:
mc@keywen.com