Simplifying Impact Prediction for Scientific Articles

Thanasis Vergoulis (IMSI, ATHENA RC) vergoulis@athenarc.gr
Ilias Kanellos (IMSI, ATHENA RC) ilias.kanellos@athenarc.gr
Giorgos Giannopoulos (IMSI, ATHENA RC) giann@athenarc.gr
Theodore Dalamagas (IMSI, ATHENA RC) dalamag@athenarc.gr

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Estimating the expected impact of an article is valuable for various applications (e.g., article/cooperator recommendation). Most existing approaches attempt to predict the exact number of citations each article will receive in the near future; however, this is a difficult regression analysis problem. Moreover, most approaches rely on the existence of rich metadata for each article, a requirement that cannot be adequately fulfilled for a large number of them. In this work, we take advantage of the fact that solving a simpler machine learning problem, that of classifying articles based on their expected impact, is adequate for many real-world applications, and we propose a simplified model that can be trained using minimal article metadata. Finally, we examine various configurations of this model and evaluate their effectiveness in solving the aforementioned classification problem.

KEYWORDS

scientific impact, machine learning, classification

1 INTRODUCTION

Predicting the attention a scientific article will attract in the next few years from other articles, i.e., estimating its expected impact¹, is very useful for many applications. For example, consider a recommendation system which suggests articles to researchers based on their interests. Due to the large growth rate in the number of published research works [9], a large number of articles will be retrieved for almost any subject of interest. However, not all of them will be of equal importance. The recommendation system could leverage the expected impact of papers to suggest only the most important works to the user and avoid overwhelming her with a large number of trivial options. The benefits would be similar for other relevant applications, such as expert finding, collaboration recommendation, etc.

Several approaches that attempt to predict the exact number of citations articles will receive in the next few years have been proposed in the literature (see Section 4 for indicative examples). However, this is an extremely difficult regression analysis problem, due to the many factors (some of which are hard to quantify) that may affect the impact of an article (details in Section 2.2). Fortunately, in practice, for many applications, knowing the exact number of future citations is not critical. For instance, in the case of the recommendation system, it is important that the system distinguish the 'impactful' works from those that are of lesser importance; all impactful works will be interesting regardless of the exact number of citations they will receive.

In addition, most existing approaches rely on rich article metadata (e.g., authors, venue, topics). Unfortunately, the available information for many articles in the relevant data sources (e.g., Crossref) is erroneous or incomplete, complicating the learning process of such approaches and creating risks for their effectiveness. Moreover, even when the required metadata are available, the generation of the corresponding machine learning features from them may be extremely time-consuming or even difficult to implement (details in Section 2.3).

In this work, our objective is to take advantage of the previous observations in an attempt to guide and facilitate the work of researchers and developers working on applications that can benefit from predicting the expected impact of scientific articles. In particular, we propose a simplified machine learning approach which is based on the binary classification of articles in two categories ('impactful' / 'impactless') according to their expected impact. In addition, we propose the use of a particular set of features that rely on minimal metadata for each article (only its publication year and its previous citations). We argue that this simpler approach is adequate, significantly easier to implement, and can benefit many applications that require the estimation of the expected impact of articles. Finally, we perform experiments to investigate the effectiveness of this approach using various well-established classifiers. In our experimental setup we seriously take into consideration the fact that our problem is imbalanced by nature, both to carefully select the appropriate evaluation measures and to examine some classification approaches that are particularly tailored to such scenarios.

¹ Since scientific impact has several aspects [3], the term can be defined in diverse ways. In this work, we focus on the definition provided in Section 2.1.
2 OUR APPROACH

2.1 Preliminaries

Scientific articles always include a list of references to other works, and the referenced articles describe work related to the referencing article (e.g., preliminaries, competitive approaches). As a result, the inclusion of an article in the reference list of another (i.e., the one citing it) implies that the latter gives credit to the former². Based on this view, counting the number of distinct articles that include an article of interest in their reference list (i.e., counting its citations) is considered to be an indicator of its impact in the scientific community. Of course, there are also many other aspects of scientific impact [3]; however, the focus of this work is on this type of citation-based expected impact.

² Note that the "amount" of credit may be significantly different for each referenced work and that, in some cases, it may also have a negative sign (when the referencing work criticizes the referenced one).

In particular, we focus on the expected impact of an article at a given time point, which can be defined as follows:

Definition 2.1 (Expected Article Impact). Consider an article a and a time point t. Then, i(a,t), the (expected) impact of a at t, is calculated as the number of citations that a will receive during the period [t, t+y], where y is a problem parameter, which defines a future period of interest.
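To make Definition 2.1 concrete, the following minimal sketch computes i(a,t) from a list of citing years. The year-granularity representation is an illustrative assumption, not the actual schema of the datasets used later:

```python
def expected_impact(citing_years, t, y):
    """i(a, t) per Definition 2.1: the number of citations that article a
    receives during the period [t, t + y].

    citing_years: one entry (a publication year) per citing article;
    year granularity is assumed for all time points.
    """
    return sum(1 for year in citing_years if t <= year <= t + y)

# An article cited in 2009, 2011, 2012 and 2016 has i(a, 2010) = 2
# for y = 3, since only the 2011 and 2012 citations fall in [2010, 2013].
print(expected_impact([2009, 2011, 2012, 2016], t=2010, y=3))  # -> 2
```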
It should be noted that the problem parameter y can be configured based on the characteristics of the dataset used; the optimal option typically depends on the citation dynamics of the scientific fields covered by the dataset. However, y = 3 and y = 5 are two reasonable and very common configurations. Finally, it should be highlighted that the expected impact of an article can only be measured in retrospect, i.e., by monitoring the citations that the article receives y years after the time point of reference.

2.2 Problem definition

Considering the expected impact of articles can be useful for many applications. This is why there is a line of work on methods that attempt to predict the exact impact of each article, i.e., the exact number of citations it is going to receive in the following few years (see Section 4). However, this is a difficult regression analysis problem for many reasons. First of all, there are many factors that may affect the number of citations an article will receive in the future. These factors are related to the quality of the work, the hype of its topic, the prestige of its authors or its venue, and the dissemination effort that will be made in social media, to name only a few. Also, to make matters worse, many of these factors cannot be easily quantified without losing important information (e.g., due to dimensionality reduction in one-hot encodings), affecting the accuracy of the approaches.

Additionally, in practice, many of the aforementioned applications do not require the prediction of the exact number of future citations for each article. It is sufficient for them to simply distinguish between 'impactful' (to-be) and 'impactless' articles. This type of problem is easier and, thus, a traditional classification approach is likely to achieve adequate effectiveness in solving it. Hence, in this work, we focus on a binary impact-based article classification problem that can be formulated as follows:

Definition 2.2 (Impact-based article classification). Consider a collection of scientific articles A and a time point t, and let ī = (Σ_{a∈A} i(a,t)) / |A|. Then, the objective is to classify each a ∈ A in one of two classes: in the class of 'impactful' articles, if i(a,t) > ī, and in the class of 'impactless' articles, otherwise.

In other words, our objective is to identify the articles that receive an above-average number of citations, classifying them as 'impactful' and the rest as 'impactless'. Note that this intuitive distinction is equivalent to the first iteration of the Head/Tail Breaks clustering algorithm, which is tailored to heavy-tailed distributions, like the citation distribution of articles [2] (a small number of articles receive an extremely large number of citations).

An important matter that should be highlighted is that this classification problem is imbalanced by nature. Since the citation distribution of articles is long-tailed, most articles have an impact (i.e., number of citations) well below average. Consequently, the class of 'impactful' articles will always be a minority in the collection (the so-called 'head' of the citation distribution). This is important for two reasons. First, it affects the correct choice of evaluation measures in the experimental setup. For example, using accuracy (i.e., the ratio of correctly classified samples to the complete sample set) is problematic: a trivial classifier that always assigns all articles to the 'impactless' class will always achieve a good performance according to this measure. For this reason, alternative measures like the precision, recall, and F1 of the minority class (i.e., the class of 'impactful' articles) should be used instead. Unfortunately, part of the previous literature (e.g., [18]) overlooks this issue, making it difficult to evaluate the real effectiveness of the corresponding proposed approaches. Secondly, it motivates the examination of classification approaches that are particularly tailored to imbalanced scenarios (see Section 3.1).
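A minimal labeling sketch for Definition 2.2, assuming the impacts have already been computed (e.g., with the expected_impact helper above); the article identifiers are illustrative:

```python
def label_articles(impacts):
    """Label articles per Definition 2.2: 1 ('impactful') if the impact
    exceeds the collection mean, 0 ('impactless') otherwise.

    impacts: dict mapping article id -> i(a, t).
    """
    mean_impact = sum(impacts.values()) / len(impacts)
    return {a: int(i > mean_impact) for a, i in impacts.items()}

# The long tail pulls the mean (here 7.0) above most articles, so only
# 'a4' is labeled impactful -- the natural imbalance discussed above.
print(label_articles({'a1': 0, 'a2': 1, 'a3': 2, 'a4': 25}))
```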
2.3 The proposed feature selection

Many existing machine learning approaches rely on the existence of various article metadata, such as its publication year, author list, venue, main topics, citations, etc. Although nowadays a large portion of such data becomes available through open scholarly graphs [6, 15] or datasets (e.g., DBLP, Crossref), there are many articles for which important information is erroneous, incomplete, or even completely missing. The main reason for this is that many such datasets are created by automatically harvesting, cleaning, and integrating data from heterogeneous (and sometimes noisy) primary sources.

However, even when all the required metadata are available, in many cases the generation of the desired machine learning features involves time-consuming aggregations and other processing tasks and may also be difficult to implement. For example, a number of data cleaning issues arise for approaches using author-based features, since author names have to be disambiguated in the case of synonyms or of different spellings across publication venues. Similarly, venue names might be recorded in different forms (e.g., acronyms vs. full names). Such issues affect the overall quality and, hence, the utility of these metadata.

It is evident that relying on rich article metadata is an important limitation for any machine learning approach that predicts the expected impact of articles. On the other hand, an article's publication year is basic information that is available in the vast majority of cases. As an indicative example, in the Crossref public data file of March 2020³, only 7.85% of the records were missing this information. Moreover, due to the Initiative for Open Citations⁴ (I4OC), an increasing number of publishers (with Elsevier being the most recent one) have committed to openly provide the reference lists of their articles. As a result, the majority of citation data are now available in open scholarly datasets (e.g., in Crossref). To summarize, the citations and the publication years of scientific articles are readily available data.

³ https://doi.org/10.13003/83B2GP
⁴ https://i4oc.org/

Based on the above, we propose a set of features that can be easily calculated using article citations and publication years. In particular, we calculate the following:

• cc_total: The total number of citations ever received by the article (i.e., its 'citation count').
• cc_1y: Citations received by the article in the last year.
• cc_3y: Citations received by the article in the last 3 years.
• cc_5y: Citations received by the article in the last 5 years.

The intuition behind these features is based on the idea of preferential attachment [2] and of its time-restricted version used in recent impact-based article ranking approaches [8]: the articles that are likely to be highly cited in the following few years are most likely those which were intensively cited in the recent past.

It should be noted that, although the minimum value of each feature is zero, its maximum value can vary widely. This is why it is good practice to normalize the features before using them as input to the classifier (see the sketch below).
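The following sketch derives the four features from per-article citing-year lists and rescales them; the data layout and the exact window boundaries are our assumptions for illustration, and min-max scaling is one reasonable normalization choice among several:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def citation_features(citing_years, present_year):
    """Compute [cc_total, cc_1y, cc_3y, cc_5y] for one article.

    citing_years: publication years of the citing articles; a window of
    w years is assumed to end at (and include) present_year.
    """
    years = [y for y in citing_years if y <= present_year]

    def cc(window):
        return sum(1 for y in years if y > present_year - window)

    return [len(years), cc(1), cc(3), cc(5)]

# Feature matrix for a toy collection, with 2010 as the (virtual) present.
X = np.array([citation_features(ys, 2010)
              for ys in ([2001, 2009, 2010], [2010, 2010], [1998])])

# The features share a minimum of zero but have very different maxima,
# so each one is rescaled to [0, 1] before training.
X_scaled = MinMaxScaler().fit_transform(X)
```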
3 EVALUATION

3.1 Setup

Datasets. For our experiments, we collected citations and publication years for scientific articles from two sources:

• PMC: The data were gathered from NCBI's PMC FTP directory⁵ and concern 1.12 million open access scientific articles from the life sciences, published between 1896 and 2016. We removed the data of the last year, since it was incomplete (the entire year was not represented).
• DBLP: The data were collected from AMiner's DBLP Citation Network dataset⁶ [19] and concern 3 million articles published between 1936 and 2018. We removed the data of the last two, incomplete years.

⁵ ftp://ftp.ncbi.nlm.nih.gov/pub/pmc
⁶ https://aminer.org/citation

To create the labeled samples required for our analysis, we follow the hold-out evaluation approach [7]: for each dataset, we select the year t = 2010 as a (virtual) present year and split the dataset in two parts. The first part (articles published until 2010, with 2010 included) is used to calculate the feature vectors described in Section 2.3 for all included articles; the second part is used to calculate the label of each sample, based on its future citations (see Section 2.2). We set y = 3 and y = 5 for the article impact future period (see Section 2.1), which corresponds in both our datasets to the periods 2011−2013 and 2011−2015, respectively. Table 1 summarizes the statistics of the sample sets created by the aforementioned process.

Table 1: Used sample sets
Sample set                 Samples      Impactful samples
PMC 2011−2013 (3 years)    229,207      57,016 (24.88%)
PMC 2011−2015 (5 years)    229,207      61,898 (27.01%)
DBLP 2011−2013 (3 years)   1,695,533    387,506 (22.85%)
DBLP 2011−2015 (5 years)   1,695,533    339,351 (20.01%)

Classifiers. We selected a set of well-known classifiers, along with their cost-sensitive versions⁷. We included the cost-sensitive versions because they target the problem of imbalanced learning by using different misclassification costs for the samples of different classes [5]. As a result, we have configured and evaluated the following classification methods:

• LR: Logistic regression
• cLR: Cost-sensitive logistic regression
• DT: Decision trees
• cDT: Cost-sensitive decision trees
• RF: Random forest
• cRF: Cost-sensitive random forest

⁷ We used Scikit-learn's 'balanced' mode for class_weight to automatically adjust weights inversely proportionally to class frequencies in the input data.

For all methods we used their Scikit-learn [16] implementations, and we followed a two-fold, exhaustive grid search approach to identify the optimal values of their parameters according to the precision, recall, and F1 of the minority class (see the sketch after Table 2). Table 2 summarizes the parameter space examined, while Tables 5 & 6 in the Appendix list all the identified optimal configurations. Each optimal configuration is named [classifier]_[measure], where [classifier] is the name of the corresponding classifier (e.g., LR, cLR) and [measure] stands for the evaluation measure for which the configuration is optimal (e.g., 'prec' for precision).

Table 2: Parameter values examined per classifier
LR & cLR:  'max_iter': 60, 80, 100, 120, 140, 160, 180, 200, 220, 240
           'solver': 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
DT & cDT:  'max_depth': 1−32
           'min_samples_split': 2, 5, 10, 20, 50, 100, 200
           'min_samples_leaf': 1, 4, 7, 10
RF & cRF:  'max_depth': 1, 5, 10, 50
           'n_estimators': 100, 150, 200, 250, 300
           'criterion': 'gini', 'entropy'
           'max_features': 'log2', 'sqrt'
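A condensed sketch of this tuning setup for one of the six methods (cLR); the grid follows Table 2, X_scaled and the labels are carried over from the sketches above, and reading "two-fold" as 2-fold cross-validation (cv=2) is our assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Parameter grid for LR / cLR, as in Table 2.
param_grid = {
    "max_iter": [60, 80, 100, 120, 140, 160, 180, 200, 220, 240],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
}

# class_weight='balanced' yields the cost-sensitive variant (cLR);
# dropping it yields the plain, cost-insensitive LR.
clf = LogisticRegression(class_weight="balanced")

# Exhaustive grid search; scoring="f1" evaluates the minority
# ('impactful') class, yielding the cLR_f1 configuration. Use
# scoring="precision" or "recall" for cLR_prec and cLR_rec.
search = GridSearchCV(clf, param_grid, cv=2, scoring="f1")
# search.fit(X_scaled, y)   # features and labels from the sketches above
# print(search.best_params_)
```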
3.2 Results

Because of the imbalanced nature of the classification problem we study, it is very important to carefully select the measures used to evaluate the effectiveness of the examined approaches. For example, as discussed in Section 2.2, accuracy, which is commonly used for generic classification approaches, is not a good option, since it is dominated by the (mis)classification of samples from the majority class, whereas in most imbalanced problems, like the one we study here, the minority class is the most important one. Therefore, we do not report the accuracy of the examined approaches. In any case, all configurations achieved accuracy between 0.73 and 0.99.

Following the best practices for the evaluation of imbalanced classification approaches, we instead measure the precision, recall, and F1 of the minority class. We indicatively report the same measures for the majority class as well; however, our main objective is to perform well according to the measures calculated for the minority class. Note that each of these three measures may be preferable for different applications. (A minimal evaluation sketch follows below.)
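A minimal sketch of how these per-class measures can be obtained with scikit-learn, assuming y_true holds the hold-out labels and y_pred the predictions of some configuration, with 1 denoting the minority 'impactful' class (the toy values are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 0, 0, 1, 0, 0, 0]   # hold-out labels (1 = 'impactful')
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]   # predictions of some configuration

# With labels=[1, 0], index 0 of each returned array refers to the
# minority 'impactful' class and index 1 to the majority 'rest' class.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 0], zero_division=0)
print(f"impactful: P={prec[0]:.2f} R={rec[0]:.2f} F1={f1[0]:.2f}")
print(f"rest:      P={prec[1]:.2f} R={rec[1]:.2f} F1={f1[1]:.2f}")
```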
Table 3: Precision, recall, and F1 based on future citations in [2011−2013] (3 years). Each cell reports minority|majority class ('impactful'|'rest'). Configurations in Tables 5 & 6.

(a) PMC
Classifier   Precision    Recall       F1
LR_prec      0.85|0.79    0.23|0.99    0.36|0.88
LR_rec       0.85|0.79    0.23|0.99    0.36|0.88
LR_f1        0.85|0.79    0.23|0.99    0.36|0.88
cLR_prec     0.57|0.85    0.52|0.87    0.55|0.86
cLR_rec      0.57|0.85    0.52|0.87    0.55|0.86
cLR_f1       0.57|0.85    0.52|0.87    0.55|0.86
DT_prec      0.66|0.82    0.38|0.93    0.48|0.87
DT_rec       0.66|0.82    0.38|0.93    0.48|0.87
DT_f1        0.66|0.82    0.38|0.93    0.48|0.87
cDT_prec     0.60|0.85    0.52|0.89    0.56|0.87
cDT_rec      0.50|0.87    0.63|0.79    0.56|0.83
cDT_f1       0.52|0.86    0.60|0.81    0.55|0.84
RF_prec      0.70|0.82    0.38|0.95    0.50|0.88
RF_rec       0.71|0.82    0.37|0.95    0.48|0.88
RF_f1        0.71|0.82    0.36|0.95    0.48|0.88
cRF_prec     0.56|0.85    0.53|0.86    0.54|0.85
cRF_rec      0.47|0.87    0.65|0.76    0.55|0.81
cRF_f1       0.48|0.87    0.65|0.77    0.55|0.81

(b) DBLP
Classifier   Precision    Recall       F1
LR_prec      0.97|0.82    0.25|1.00    0.39|0.90
LR_rec       0.96|0.82    0.26|1.00    0.40|0.90
LR_f1        0.96|0.82    0.25|1.00    0.40|0.90
cLR_prec     0.70|0.88    0.57|0.93    0.63|0.90
cLR_rec      0.70|0.88    0.57|0.93    0.63|0.90
cLR_f1       0.71|0.88    0.56|0.93    0.63|0.90
DT_prec      0.80|0.88    0.55|0.96    0.65|0.92
DT_rec       0.72|0.89    0.61|0.93    0.61|0.91
DT_f1        0.72|0.89    0.61|0.93    0.61|0.91
cDT_prec     0.58|0.92    0.74|0.84    0.65|0.88
cDT_rec      0.52|0.93    0.79|0.78    0.63|0.85
cDT_f1       0.58|0.92    0.75|0.84    0.65|0.88
RF_prec      0.72|0.88    0.56|0.94    0.63|0.91
RF_rec       0.72|0.88    0.56|0.94    0.63|0.91
RF_f1        0.77|0.87    0.54|0.95    0.63|0.91
cRF_prec     0.64|0.89    0.63|0.89    0.64|0.89
cRF_rec      0.57|0.92    0.76|0.83    0.65|0.87
cRF_f1       0.58|0.92    0.76|0.84    0.65|0.88

Table 4: Precision, recall, and F1 based on future citations in [2011−2015] (5 years). Each cell reports minority|majority class ('impactful'|'rest'). Configurations in Tables 5 & 6.

(a) PMC
Classifier   Precision    Recall       F1
LR_prec      0.89|0.78    0.26|0.99    0.40|0.87
LR_rec       0.89|0.78    0.26|0.99    0.40|0.87
LR_f1        0.89|0.78    0.25|0.99    0.39|0.87
cLR_prec     0.60|0.82    0.49|0.88    0.54|0.85
cLR_rec      0.60|0.82    0.48|0.88    0.54|0.85
cLR_f1       0.60|0.82    0.49|0.88    0.54|0.85
DT_prec      0.75|0.81    0.38|0.95    0.50|0.87
DT_rec       0.75|0.80    0.35|0.96    0.48|0.87
DT_f1        0.75|0.81    0.39|0.95    0.51|0.87
cDT_prec     0.60|0.82    0.49|0.88    0.54|0.85
cDT_rec      0.50|0.84    0.61|0.78    0.55|0.81
cDT_f1       0.53|0.84    0.60|0.81    0.56|0.82
RF_prec      0.72|0.80    0.37|0.95    0.49|0.87
RF_rec       0.73|0.81    0.41|0.95    0.53|0.87
RF_f1        0.74|0.81    0.41|0.95    0.52|0.87
cRF_prec     0.57|0.82    0.49|0.86    0.52|0.84
cRF_rec      0.50|0.84    0.61|0.77    0.55|0.81
cRF_f1       0.50|0.84    0.61|0.77    0.55|0.81

(b) DBLP
Classifier   Precision    Recall       F1
LR_prec      0.96|0.84    0.24|1.00    0.39|0.91
LR_rec       0.96|0.84    0.24|1.00    0.39|0.91
LR_f1        0.97|0.84    0.24|1.00    0.38|0.91
cLR_prec     0.70|0.90    0.61|0.93    0.65|0.92
cLR_rec      0.73|0.90    0.58|0.94    0.65|0.92
cLR_f1       0.70|0.90    0.60|0.93    0.65|0.92
DT_prec      0.87|0.87    0.42|0.98    0.56|0.92
DT_rec       0.73|0.90    0.56|0.95    0.63|0.92
DT_f1        0.77|0.89    0.52|0.96    0.62|0.92
cDT_prec     0.59|0.93    0.72|0.88    0.65|0.90
cDT_rec      0.47|0.94    0.82|0.77    0.60|0.85
cDT_f1       0.59|0.93    0.72|0.88    0.65|0.90
RF_prec      0.83|0.89    0.52|0.97    0.64|0.93
RF_rec       0.74|0.90    0.56|0.95    0.64|0.92
RF_f1        0.80|0.90    0.56|0.96    0.66|0.93
cRF_prec     0.62|0.91    0.66|0.90    0.64|0.91
cRF_rec      0.59|0.91    0.67|0.89    0.63|0.90
cRF_f1       0.55|0.93    0.76|0.84    0.64|0.89

[Figure 1: Toy example showcasing why cost-sensitive approaches may achieve worse precision. The figure plots samples of the two classes in a two-feature space (feature1, feature2), together with a cost-insensitive and a cost-sensitive separating hyperplane.]

Tables 3 & 4 summarize the results of the performed experiments. The results are very similar for both datasets (PMC and DBLP) and for both values of the parameter y. A general observation is that, when we focus on precision, cost-insensitive classification approaches perform adequately well and, thus, there is no need to resort to cost-sensitive versions. However, the same experiments highlight that the latter can significantly improve effectiveness in terms of recall and F1.

This behavior is not surprising: by default, in several classifiers, the optimization process targets accuracy maximization, since all samples contribute equally to the loss function being minimized. Consequently, in areas of the feature space where the samples of the different classes are not easily separable, the samples of the majority class are favored (i.e., correctly classified) due to their dominance in numbers. Consider, for instance, the two minority class samples (cross marks) and the six majority class ones (circle marks) between the two alternative hyperplanes of the toy example in Figure 1: classifying all of them to the majority class would induce three times less cost to the classifier than classifying them to the minority class.
In this way, the cost-insensitive classifier also achieves good precision for the minority class (no false positives in this example). The drawback is that this results in many false negatives for the minority class (the most important one). Cost-sensitive approaches alleviate this issue, improving the recall and F1 of the minority class, with the counter-effect of a larger number of false positives for the minority class.

Focusing on the differences between the examined classification approaches, cost-insensitive Logistic Regression is, by far, the best option for applications focusing on precision, achieving values between 0.85 and 0.97 across all datasets. However, this is achieved by allowing very significant losses in recall and F1 (values below 0.27 and 0.41 across all datasets, respectively). On the other hand, cost-sensitive Random Forest and Decision Tree classifiers seem to be the best options when recall and F1 are more important (albeit their losses in precision are significant).
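The class_weight mechanism of Section 3.1 implements exactly this cost adjustment. The toy snippet below (synthetic blobs, not our experimental data) typically reproduces the observed trade-off: the balanced model trades minority-class precision for recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 600 majority vs 100 minority samples
# drawn from two overlapping Gaussian blobs in a 2-feature space.
X = np.vstack([rng.normal(0.0, 1.0, (600, 2)),
               rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 600 + [1] * 100)

for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight).fit(X, y)
    p, r, f, _ = precision_recall_fscore_support(
        y, clf.predict(X), labels=[1], zero_division=0)
    print(f"class_weight={weight}: "
          f"P={p[0]:.2f} R={r[0]:.2f} F1={f[0]:.2f}")
```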
4 RELATED WORK

The vast majority of works that attempt to estimate the expected impact of scientific articles focus on predicting the exact number of citations each article will receive in a given future period, a problem known as Citation Count Prediction (CCP). Most of these works incorporate a wide range of features based on the article's content, novelty, author list, venue, topic, citations, and reviews, to name only a few. The corresponding prediction models are based on various regression models like Linear Regression [22, 24], k-NN [22], SVR [10, 14, 22, 24], Gaussian Process Regression [21], the CART model [21, 22], ZINB Regression [4], or various types of neural networks [1, 11–13, 20, 24]. In most works, one or more regression models are tested on the complete dataset, with the notable exception of [10], which first attempts to identify the current citation trend of each article (e.g., early burst, no burst, late burst) and then applies a different model to each case. As elaborated in Section 2.2, CCP is a very difficult problem, and there are many, not easily quantified factors that can significantly affect the performance of such approaches. Also, such approaches rely on article metadata that are difficult to collect and that must undergo complex-to-implement and time-consuming processing (see also Section 2.3).

In another line of work, based on the fact that co-authorship and citation-based features seemed to be effective for earlier approaches, the authors of [17] follow a link-prediction-inspired approach to solve CCP. They also investigate the effectiveness of their approach on a relevant classification problem based on a set of arbitrarily determined classes. However, training their approach requires a heavy pattern mining analysis of the underlying citation network and also considers author- and venue-based features, which face the already discussed issues. It should be noted that there are also some link prediction approaches that aim to reveal missing citations between a set of articles (e.g., [23]); these approaches are, however, irrelevant to the problem of impact prediction. Furthermore, in [18] an impact-based classification problem is studied, but the features of the proposed approach rely on difficult-to-collect article metadata (e.g., information about academic and funding organizations). As a result, this approach cannot be easily used in practice. Finally, there are methods that attempt to estimate the rank of articles based on their expected impact; a thorough survey and experimental study of such methods can be found in [7]. This problem is easier than CCP, since only the partial ordering of the articles according to their expected impact needs to be estimated, but it is still more difficult than the problem we focus on.

5 CONCLUSION

In this work, we propose a simplified approach that can significantly ease the work of researchers and developers working on applications that rely on the prediction of the expected impact of scientific articles. The proposed approach classifies articles in two categories ('impactful' / 'impactless') based on a set of features that can be calculated using a minimal set of article metadata. Furthermore, we experimentally evaluated this approach using various well-established classifiers, showing that the results are more than adequate. The aforementioned experiments were performed with caution, taking into account the imbalanced nature of the classification problem at hand.

In the future, we plan to further investigate the imbalanced nature of the problem by examining other approaches, like methods that perform over-sampling of the minority class, methods that perform under-sampling of the majority class, or methods combining the two (e.g., SMOTEENN). Additionally, we plan to examine a wider range of parameters for the examined approaches, for instance, a range of custom weights for the cost-sensitive approaches. Finally, we plan to take full advantage of the Head/Tail Breaks approach to study a non-binary version of the classification problem.

ACKNOWLEDGMENTS

We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3), which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

REFERENCES

[1] A. Abrishami and S. Aliakbary. 2019. Predicting citation counts based on deep neural network learning techniques. Journal of Informetrics 13, 2 (2019), 485–499.
[2] A. Barabási et al. 2016. Network Science. Cambridge University Press.
[3] J. Bollen, H. Van de Sompel, A. Hagberg, and R. Chute. 2009. A principal component analysis of 39 scientific impact measures. PLoS ONE 4, 6 (2009), e6022.
[4] F. Didegah and M. Thelwall. 2013. Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology 64, 5 (2013), 1055–1064.
[5] H. He and Y. Ma. 2013. Imbalanced Learning: Foundations, Algorithms, and Applications. John Wiley & Sons.
[6] M. Jaradeh, A. Oelen, K. Farfar, M. Prinz, J. D'Souza, G. Kismihók, M. Stocker, and S. Auer. 2019. Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge. In Proc. of K-CAP.
[7] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2019. Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. IEEE TKDE (2019).
[8] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2020. Ranking Papers by their Short-Term Scientific Impact. arXiv preprint arXiv:2006.00951 (2020).
[9] P. Larsen and M. von Ins. 2010. The Rate of Growth in Scientific Publication and the Decline in Coverage Provided by Science Citation Index. Scientometrics 84, 3 (2010), 575–603.
[10] C. Li, Y. Lin, R. Yan, and M. Yeh. 2015. Trend-Based Citation Count Prediction for Research Articles. In PAKDD.
[11] M. Li, J. Xu, B. Ge, J. Liu, J. Jiang, and Q. Zhao. 2019. A Deep Learning Methodology for Citation Count Prediction with Large-scale Biblio-Features. IEEE SMC (2019), 1172–1176.
[12] S. Li, W. Zhao, E. Yin, and J. Wen. 2019. A Neural Citation Count Prediction Model based on Peer Review Text. In EMNLP/IJCNLP.
[13] L. Liu, D. Yu, D. Wang, and F. Fukumoto. 2020. Citation Count Prediction Based on Neural Hawkes Model. IEICE Transactions on Information and Systems (2020), 2379–2388.
[14] A. Livne, E. Adar, J. Teevan, and S. Dumais. 2013. Predicting citation counts using text and graph mining. In Proc. of CompSci.
[15] P. Manghi, C. Atzori, A. Bardi, J. Shirrwagen, H. Dimitropoulos, S. La Bruzzo, I. Foufoulas, A. Löhden, A. Bäcker, A. Mannocci, M. Horst, M. Baglioni, A. Czerniak, K. Kiatropoulou, A. Kokogiannaki, M. De Bonis, M. Artini, E. Ottonello, A. Lempesis, L. Nielsen, A. Ioannidis, C. Bigarella, and F. Summan. 2019. OpenAIRE Research Graph Dump. https://doi.org/10.5281/zenodo.3516918
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825–2830.
[17] N. Pobiedina and R. Ichise. 2016. Citation count prediction as a link prediction problem. Applied Intelligence 44, 2 (2016), 252–268.
[18] Z. Su. 2020. Prediction of future citation count with machine learning and neural network. In IPEC. IEEE, 101–104.
[19] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. 2008. ArnetMiner: Extraction and Mining of Academic Social Networks. In KDD'08. 990–998.
[20] J. Wen, L. Wu, and J. Chai. 2020. Paper Citation Count Prediction Based on Recurrent Neural Network with Gated Recurrent Unit. IEEE ICEIEC (2020), 303–306.
[21] R. Yan, C. Huang, J. Tang, Y. Zhang, and X. Li. 2012. To better stand on the shoulder of giants. In Proc. of ACM/IEEE-CS JCDL. 51–60.
[22] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li. 2011. Citation count prediction: learning to estimate future citations for literature. In Proc. of CIKM. 1247–1252.
[23] X. Yu, Q. Gu, M. Zhou, and J. Han. 2012. Citation Prediction in Heterogeneous Bibliographic Networks. In SDM.
[24] X. Zhu and Z. Ban. 2018. Citation Count Prediction Based on Academic Network Features. IEEE AINA (2018), 534–541.

A USED PARAMETER CONFIGURATIONS

Tables 5 & 6 summarize the configurations of the examined approaches. The names of the parameters correspond to the input parameters of the respective Scikit-learn functions. Omitted input parameters were not configured (their default values were used).
Table 5: Parameter configurations for PMC

Classifier | Configuration for y = 3 | Configuration for y = 5
LR_prec   | 'max_iter': 200, 'solver': 'sag' | 'max_iter': 160, 'solver': 'sag'
LR_rec    | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 80, 'solver': 'sag'
LR_f1     | 'max_iter': 180, 'solver': 'sag' | 'max_iter': 240, 'solver': 'sag'
cLR_prec  | 'max_iter': 100, 'solver': 'sag' | 'max_iter': 60, 'solver': 'sag'
cLR_rec   | 'max_iter': 120, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
cLR_f1    | 'max_iter': 180, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
DT_prec   | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_rec    | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_f1     | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 8, 'min_samples_leaf': 10, 'min_samples_split': 200
cDT_prec  | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_rec   | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_f1    | 'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 20 | 'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 50
RF_prec   | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 200 | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 200
RF_rec    | 'criterion': 'gini', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 300 | 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 300
RF_f1     | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 200 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 300
cRF_prec  | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 100
cRF_rec   | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100
cRF_f1    | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 300
Table 6: Parameter configurations for DBLP

Classifier | Configuration for y = 3 | Configuration for y = 5
LR_prec   | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 100, 'solver': 'sag'
LR_rec    | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
LR_f1     | 'max_iter': 220, 'solver': 'saga' | 'max_iter': 220, 'solver': 'sag'
cLR_prec  | 'max_iter': 200, 'solver': 'sag' | 'max_iter': 180, 'solver': 'sag'
cLR_rec   | 'max_iter': 140, 'solver': 'sag' | 'max_iter': 160, 'solver': 'sag'
cLR_f1    | 'max_iter': 100, 'solver': 'sag' | 'max_iter': 60, 'solver': 'newton-cg'
DT_prec   | 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_rec    | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_f1     | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_prec  | 'max_depth': 14, 'min_samples_leaf': 10, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_rec   | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_f1    | 'max_depth': 11, 'min_samples_leaf': 10, 'min_samples_split': 200 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
RF_prec   | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 100
RF_rec    | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150
RF_f1     | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 250
cRF_prec  | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 250 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 100
cRF_rec   | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100 | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150
cRF_f1    | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 150