Simplifying Impact Prediction for Scientific Articles

Thanasis Vergoulis (IMSI, ATHENA RC) vergoulis@athenarc.gr
Ilias Kanellos (IMSI, ATHENA RC) ilias.kanellos@athenarc.gr
Giorgos Giannopoulos (IMSI, ATHENA RC) giann@athenarc.gr
Theodore Dalamagas (IMSI, ATHENA RC) dalamag@athenarc.gr

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Estimating the expected impact of an article is valuable for various applications (e.g., article/cooperator recommendation). Most existing approaches attempt to predict the exact number of citations each article will receive in the near future; however, this is a difficult regression analysis problem. Moreover, most approaches rely on the existence of rich metadata for each article, a requirement that cannot be adequately fulfilled for a large number of them. In this work, we take advantage of the fact that solving a simpler machine learning problem, that of classifying articles based on their expected impact, is adequate for many real-world applications, and we propose a simplified model that can be trained using minimal article metadata. Finally, we examine various configurations of this model and evaluate their effectiveness in solving the aforementioned classification problem.

KEYWORDS

scientific impact, machine learning, classification

1 INTRODUCTION

Predicting the attention a scientific article will attract in the next few years from other articles, i.e., estimating its expected impact¹, is very useful for many applications. For example, consider a recommendation system which suggests articles to researchers based on their interests. Due to the large growth rate in the number of published research works [9], a large number of articles will be retrieved for almost any subject of interest. However, not all of them will be of equal importance. The recommendation system could leverage the expected impact of papers to suggest only the most important works to the user and avoid overwhelming her with a large number of trivial options. The benefits would be similar for other relevant applications, such as expert finding, collaboration recommendation, etc.

Several approaches that attempt to predict the exact number of citations articles will receive in the next few years have been proposed in the literature (see Section 4 for indicative examples). However, this is an extremely difficult regression analysis problem, due to the many factors (some of which are hard to quantify) that may affect the impact of an article (details in Section 2.2). Fortunately, in practice, for many applications, knowing the exact number of future citations is not critical. For instance, in the case of the recommendation system, it is important that the system distinguish the 'impactful' works from those that are of lesser importance; all impactful works will be interesting regardless of the exact number of citations they will receive.

In addition, most existing approaches rely on rich article metadata (e.g., authors, venue, topics). Unfortunately, the available information for many articles in the relevant data sources (e.g., Crossref) is erroneous or incomplete, complicating the learning process of such approaches and creating risks for their effectiveness. Moreover, even when the required metadata are available, the generation of the corresponding machine learning features from them may be extremely time-consuming or even difficult to implement (details in Section 2.3).

In this work, our objective is to take advantage of the previous observations in an attempt to guide and facilitate the work of researchers and developers working on applications that can benefit from predicting the expected impact of scientific articles. In particular, we propose a simplified machine learning approach which is based on the binary classification of articles in two categories ('impactful' / 'impactless') according to their expected impact. In addition, we propose the use of a particular set of features that rely on minimal metadata for each article (only its publication year and its previous citations). We argue that this simpler approach is adequate, significantly easier to implement, and can benefit many applications that require the estimation of the expected impact of articles. Finally, we perform experiments to investigate the effectiveness of this approach using various well-established classifiers. In our experimental setup we seriously take into consideration the fact that our problem is imbalanced by nature, both to carefully select the appropriate evaluation measures and to examine some classification approaches that are particularly tailored to such scenarios.

¹ Since scientific impact has several aspects [3], the term can be defined in diverse ways. In this work, we focus on the definition provided in Section 2.1.
2 OUR APPROACH

2.1 Preliminaries

Scientific articles always include a list of references to other works, and the referenced articles describe work related to the referencing article (e.g., preliminaries, competitive approaches). As a result, the inclusion of an article in the reference list of another (i.e., the one citing it) implies that the latter gives credit to the former². Based on this view, counting the number of distinct articles that include an article of interest in their reference list (i.e., counting its citations) is considered to be an indicator of its impact in the scientific community. Of course, there are also many other aspects of scientific impact [3]; however, the focus of this work is on this type of citation-based expected impact.

² Note that the "amount" of credit may be significantly different for each referenced work and that, in some cases, it may also have a negative sign (when the referencing work criticizes the referenced one).

In particular, we focus on the expected impact of an article at a given time point, which can be defined as follows:

Definition 2.1 (Expected Article Impact). Consider an article a and a time point t. Then, i(a,t), the (expected) impact of a at t, is calculated as the number of citations that a will receive during the period [t, t+y], where y is a problem parameter, which defines a future period of interest.
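To make Definition 2.1 concrete, the following minimal sketch computes i(a,t) from a list of citing years. The year-granularity representation is an illustrative assumption, not the actual schema of the datasets used later:

```python
def expected_impact(citing_years, t, y):
    """i(a, t) per Definition 2.1: the number of citations that article a
    receives during the period [t, t + y].

    citing_years: one entry (a publication year) per citing article;
    year granularity is assumed for all time points.
    """
    return sum(1 for year in citing_years if t <= year <= t + y)

# An article cited in 2009, 2011, 2012 and 2016 has i(a, 2010) = 2
# for y = 3, since only the 2011 and 2012 citations fall in [2010, 2013].
print(expected_impact([2009, 2011, 2012, 2016], t=2010, y=3))  # -> 2
```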
It should be noted that the problem parameter y can be configured based on the characteristics of the dataset used; the optimal option typically depends on the citation dynamics of the scientific fields covered by the dataset. However, y = 3 and y = 5 are two reasonable and very common configurations. Finally, it should be highlighted that the expected impact of an article can only be measured in retrospect, i.e., by monitoring the citations that the article receives y years after the time point of reference.

2.2 Problem definition

Considering the expected impact of articles can be useful for many applications. This is why there is a line of work on methods that attempt to predict the exact impact of each article, i.e., the exact number of citations it is going to receive in the following few years (see Section 4). However, this is a difficult regression analysis problem for many reasons. First of all, there are many factors that may affect the number of citations an article will receive in the future. These factors are related to the quality of the work, the hype of its topic, the prestige of its authors or its venue, and the dissemination effort that will be made in social media, to name only a few. Also, to make matters worse, many of these factors cannot be easily quantified without losing important information (e.g., due to dimensionality reduction in one-hot encodings), affecting the accuracy of the approaches.

Additionally, in practice, many of the aforementioned applications do not require the prediction of the exact number of future citations for each article. It is sufficient for them to simply distinguish between 'impactful' (to-be) and 'impactless' articles. This type of problem is easier and, thus, a traditional classification approach is likely to achieve adequate effectiveness in solving it. Hence, in this work, we focus on a binary impact-based article classification problem that can be formulated as follows:

Definition 2.2 (Impact-based article classification). Consider a collection of scientific articles A and a time point t, and let ī = (Σ_{a∈A} i(a,t)) / |A|. Then, the objective is to classify each a ∈ A in one of two classes: in the class of 'impactful' articles, if i(a,t) > ī, and in the class of 'impactless' articles, otherwise.

In other words, our objective is to identify the articles that receive an above-average number of citations, classifying them as 'impactful' and the rest as 'impactless'. Note that this intuitive distinction is equivalent to the first iteration of the Head/Tail Breaks clustering algorithm, which is tailored to heavy-tailed distributions, like the citation distribution of articles [2] (a small number of articles receive an extremely large number of citations).

An important matter that should be highlighted is that this classification problem is imbalanced by nature. Since the citation distribution of articles is long-tailed, most articles have an impact (i.e., number of citations) well below average. Consequently, the class of 'impactful' articles will always be a minority in the collection (the so-called 'head' of the citation distribution). This is important for two reasons. First, it affects the correct choice of evaluation measures in the experimental setup. For example, using accuracy (i.e., the ratio of correctly classified samples to the complete sample set) is problematic: a trivial classifier that always assigns all articles to the 'impactless' class will always achieve a good performance according to this measure. For this reason, alternative measures like the precision, recall, and F1 of the minority class (i.e., the class of 'impactful' articles) should be used instead. Unfortunately, part of the previous literature (e.g., [18]) overlooks this issue, making it difficult to evaluate the real effectiveness of the corresponding proposed approaches. Secondly, it motivates the examination of classification approaches that are particularly tailored to imbalanced scenarios (see Section 3.1).
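A minimal labeling sketch for Definition 2.2, assuming the impacts have already been computed (e.g., with the expected_impact helper above); the article identifiers are illustrative:

```python
def label_articles(impacts):
    """Label articles per Definition 2.2: 1 ('impactful') if the impact
    exceeds the collection mean, 0 ('impactless') otherwise.

    impacts: dict mapping article id -> i(a, t).
    """
    mean_impact = sum(impacts.values()) / len(impacts)
    return {a: int(i > mean_impact) for a, i in impacts.items()}

# The long tail pulls the mean (here 7.0) above most articles, so only
# 'a4' is labeled impactful -- the natural imbalance discussed above.
print(label_articles({'a1': 0, 'a2': 1, 'a3': 2, 'a4': 25}))
```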
2.3 The proposed feature selection

Many existing machine learning approaches rely on the existence of various article metadata, such as its publication year, author list, venue, main topics, citations, etc. Although nowadays a large portion of such data becomes available through open scholarly graphs [6, 15] or datasets (e.g., DBLP, Crossref), there are many articles for which important information is erroneous, incomplete, or even completely missing. The main reason for this is that many such datasets are created by automatically harvesting, cleaning, and integrating data from heterogeneous (and sometimes noisy) primary sources.

However, even when all the required metadata are available, in many cases the generation of the desired machine learning features involves time-consuming aggregations and other processing tasks and may also be difficult to implement. For example, a number of data cleaning issues arise for approaches using author-based features, since author names have to be disambiguated in the case of synonyms or of different spellings across publication venues. Similarly, venue names might be recorded in different forms (e.g., acronyms vs. full names). Such issues affect the overall quality and, hence, the utility of these metadata.

It is evident that relying on rich article metadata is an important limitation for any machine learning approach that predicts the expected impact of articles. On the other hand, an article's publication year is basic information that is available in the vast majority of cases. As an indicative example, in the Crossref public data file of March 2020³, only 7.85% of the records were missing this information. Moreover, due to the Initiative for Open Citations⁴ (I4OC), an increasing number of publishers (with Elsevier being the most recent one) have committed to openly provide the reference lists of their articles. As a result, the majority of citation data are now available in open scholarly datasets (e.g., in Crossref). To summarize, the citations and the publication years of scientific articles are readily available data.

³ https://doi.org/10.13003/83B2GP
⁴ https://i4oc.org/

Based on the above, we propose a set of features that can be easily calculated using article citations and publication years. In particular, we calculate the following:

• cc_total: The total number of citations ever received by the article (i.e., its 'citation count').
• cc_1y: Citations received by the article in the last year.
• cc_3y: Citations received by the article in the last 3 years.
• cc_5y: Citations received by the article in the last 5 years.

The intuition behind these features is based on the idea of preferential attachment [2] and of its time-restricted version used in recent impact-based article ranking approaches [8]: the articles that are likely to be highly cited in the following few years are most likely those which were intensively cited in the recent past.

It should be noted that, although the minimum value of each feature is zero, its maximum value can vary widely. This is why it is good practice to normalize the features before using them as input to the classifier (see the sketch below).
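The following sketch derives the four features from per-article citing-year lists and rescales them; the data layout and the exact window boundaries are our assumptions for illustration, and min-max scaling is one reasonable normalization choice among several:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def citation_features(citing_years, present_year):
    """Compute [cc_total, cc_1y, cc_3y, cc_5y] for one article.

    citing_years: publication years of the citing articles; a window of
    w years is assumed to end at (and include) present_year.
    """
    years = [y for y in citing_years if y <= present_year]

    def cc(window):
        return sum(1 for y in years if y > present_year - window)

    return [len(years), cc(1), cc(3), cc(5)]

# Feature matrix for a toy collection, with 2010 as the (virtual) present.
X = np.array([citation_features(ys, 2010)
              for ys in ([2001, 2009, 2010], [2010, 2010], [1998])])

# The features share a minimum of zero but have very different maxima,
# so each one is rescaled to [0, 1] before training.
X_scaled = MinMaxScaler().fit_transform(X)
```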
3 EVALUATION

3.1 Setup

Datasets. For our experiments, we collected citations and publication years for scientific articles from two sources:

• PMC: The data were gathered from NCBI's PMC FTP directory⁵ and concern 1.12 million open access scientific articles from the life sciences, published between 1896 and 2016. We removed the data of the last year, since it was incomplete (the entire year was not represented).
• DBLP: The data were collected from AMiner's DBLP Citation Network dataset⁶ [19] and concern 3 million articles published between 1936 and 2018. We removed the data of the last two, incomplete years.

⁵ ftp://ftp.ncbi.nlm.nih.gov/pub/pmc
⁶ https://aminer.org/citation

To create the labeled samples required for our analysis, we follow the hold-out evaluation approach [7]: for each dataset, we select the year t = 2010 as a (virtual) present year and split the dataset in two parts. The first part (articles published until 2010, with 2010 included) is used to calculate the feature vectors described in Section 2.3 for all included articles; the second part is used to calculate the label of each sample, based on its future citations (see Section 2.2). We set y = 3 and y = 5 for the article impact future period (see Section 2.1), which corresponds in both our datasets to the periods 2011−2013 and 2011−2015, respectively. Table 1 summarizes the statistics of the sample sets created by the aforementioned process.

Table 1: Used sample sets
Sample set                 Samples      Impactful samples
PMC 2011−2013 (3 years)    229,207      57,016 (24.88%)
PMC 2011−2015 (5 years)    229,207      61,898 (27.01%)
DBLP 2011−2013 (3 years)   1,695,533    387,506 (22.85%)
DBLP 2011−2015 (5 years)   1,695,533    339,351 (20.01%)

Classifiers. We selected a set of well-known classifiers, along with their cost-sensitive versions⁷. We included the cost-sensitive versions because they target the problem of imbalanced learning by using different misclassification costs for the samples of different classes [5]. As a result, we have configured and evaluated the following classification methods:

• LR: Logistic regression
• cLR: Cost-sensitive logistic regression
• DT: Decision trees
• cDT: Cost-sensitive decision trees
• RF: Random forest
• cRF: Cost-sensitive random forest

⁷ We used Scikit-learn's 'balanced' mode for class_weight to automatically adjust weights inversely proportionally to class frequencies in the input data.

For all methods we used their Scikit-learn [16] implementations, and we followed a two-fold, exhaustive grid search approach to identify the optimal values of their parameters according to the precision, recall, and F1 of the minority class (see the sketch after Table 2). Table 2 summarizes the parameter space examined, while Tables 5 & 6 in the Appendix list all the identified optimal configurations. Each optimal configuration is named [classifier]_[measure], where [classifier] is the name of the corresponding classifier (e.g., LR, cLR) and [measure] stands for the evaluation measure for which the configuration is optimal (e.g., 'prec' for precision).

Table 2: Parameter values examined per classifier
LR & cLR:  'max_iter': 60, 80, 100, 120, 140, 160, 180, 200, 220, 240
           'solver': 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
DT & cDT:  'max_depth': 1−32
           'min_samples_split': 2, 5, 10, 20, 50, 100, 200
           'min_samples_leaf': 1, 4, 7, 10
RF & cRF:  'max_depth': 1, 5, 10, 50
           'n_estimators': 100, 150, 200, 250, 300
           'criterion': 'gini', 'entropy'
           'max_features': 'log2', 'sqrt'
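A condensed sketch of this tuning setup for one of the six methods (cLR); the grid follows Table 2, X_scaled and the labels are carried over from the sketches above, and reading "two-fold" as 2-fold cross-validation (cv=2) is our assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Parameter grid for LR / cLR, as in Table 2.
param_grid = {
    "max_iter": [60, 80, 100, 120, 140, 160, 180, 200, 220, 240],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
}

# class_weight='balanced' yields the cost-sensitive variant (cLR);
# dropping it yields the plain, cost-insensitive LR.
clf = LogisticRegression(class_weight="balanced")

# Exhaustive grid search; scoring="f1" evaluates the minority
# ('impactful') class, yielding the cLR_f1 configuration. Use
# scoring="precision" or "recall" for cLR_prec and cLR_rec.
search = GridSearchCV(clf, param_grid, cv=2, scoring="f1")
# search.fit(X_scaled, y)   # features and labels from the sketches above
# print(search.best_params_)
```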
3.2 Results

Because of the imbalanced nature of the classification problem we study, it is very important to carefully select the measures used to evaluate the effectiveness of the examined approaches. For example, as discussed in Section 2.2, accuracy, which is commonly used for generic classification approaches, is not a good option, since it is dominated by the (mis)classification of samples from the majority class, whereas in most imbalanced problems, like the one we study here, the minority class is the most important one. Therefore, we do not report the accuracy of the examined approaches. In any case, all configurations achieved accuracy between 0.73 and 0.99.

Following the best practices for the evaluation of imbalanced classification approaches, we instead measure the precision, recall, and F1 of the minority class. We indicatively report the same measures for the majority class as well; however, our main objective is to perform well according to the measures calculated for the minority class. Note that each of these three measures may be preferable for different applications. (A minimal evaluation sketch follows below.)
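A minimal sketch of how these per-class measures can be obtained with scikit-learn, assuming y_true holds the hold-out labels and y_pred the predictions of some configuration, with 1 denoting the minority 'impactful' class (the toy values are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 0, 0, 1, 0, 0, 0]   # hold-out labels (1 = 'impactful')
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]   # predictions of some configuration

# With labels=[1, 0], index 0 of each returned array refers to the
# minority 'impactful' class and index 1 to the majority 'rest' class.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 0], zero_division=0)
print(f"impactful: P={prec[0]:.2f} R={rec[0]:.2f} F1={f1[0]:.2f}")
print(f"rest:      P={prec[1]:.2f} R={rec[1]:.2f} F1={f1[1]:.2f}")
```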
Table 3: Precision, recall, and F1 based on future citations in [2011−2013] (3 years). Each cell reports minority|majority class ('impactful'|'rest'). Configurations in Tables 5 & 6.

(a) PMC
Classifier   Precision    Recall       F1
LR_prec      0.85|0.79    0.23|0.99    0.36|0.88
LR_rec       0.85|0.79    0.23|0.99    0.36|0.88
LR_f1        0.85|0.79    0.23|0.99    0.36|0.88
cLR_prec     0.57|0.85    0.52|0.87    0.55|0.86
cLR_rec      0.57|0.85    0.52|0.87    0.55|0.86
cLR_f1       0.57|0.85    0.52|0.87    0.55|0.86
DT_prec      0.66|0.82    0.38|0.93    0.48|0.87
DT_rec       0.66|0.82    0.38|0.93    0.48|0.87
DT_f1        0.66|0.82    0.38|0.93    0.48|0.87
cDT_prec     0.60|0.85    0.52|0.89    0.56|0.87
cDT_rec      0.50|0.87    0.63|0.79    0.56|0.83
cDT_f1       0.52|0.86    0.60|0.81    0.55|0.84
RF_prec      0.70|0.82    0.38|0.95    0.50|0.88
RF_rec       0.71|0.82    0.37|0.95    0.48|0.88
RF_f1        0.71|0.82    0.36|0.95    0.48|0.88
cRF_prec     0.56|0.85    0.53|0.86    0.54|0.85
cRF_rec      0.47|0.87    0.65|0.76    0.55|0.81
cRF_f1       0.48|0.87    0.65|0.77    0.55|0.81

(b) DBLP
Classifier   Precision    Recall       F1
LR_prec      0.97|0.82    0.25|1.00    0.39|0.90
LR_rec       0.96|0.82    0.26|1.00    0.40|0.90
LR_f1        0.96|0.82    0.25|1.00    0.40|0.90
cLR_prec     0.70|0.88    0.57|0.93    0.63|0.90
cLR_rec      0.70|0.88    0.57|0.93    0.63|0.90
cLR_f1       0.71|0.88    0.56|0.93    0.63|0.90
DT_prec      0.80|0.88    0.55|0.96    0.65|0.92
DT_rec       0.72|0.89    0.61|0.93    0.61|0.91
DT_f1        0.72|0.89    0.61|0.93    0.61|0.91
cDT_prec     0.58|0.92    0.74|0.84    0.65|0.88
cDT_rec      0.52|0.93    0.79|0.78    0.63|0.85
cDT_f1       0.58|0.92    0.75|0.84    0.65|0.88
RF_prec      0.72|0.88    0.56|0.94    0.63|0.91
RF_rec       0.72|0.88    0.56|0.94    0.63|0.91
RF_f1        0.77|0.87    0.54|0.95    0.63|0.91
cRF_prec     0.64|0.89    0.63|0.89    0.64|0.89
cRF_rec      0.57|0.92    0.76|0.83    0.65|0.87
cRF_f1       0.58|0.92    0.76|0.84    0.65|0.88

Table 4: Precision, recall, and F1 based on future citations in [2011−2015] (5 years). Each cell reports minority|majority class ('impactful'|'rest'). Configurations in Tables 5 & 6.

(a) PMC
Classifier   Precision    Recall       F1
LR_prec      0.89|0.78    0.26|0.99    0.40|0.87
LR_rec       0.89|0.78    0.26|0.99    0.40|0.87
LR_f1        0.89|0.78    0.25|0.99    0.39|0.87
cLR_prec     0.60|0.82    0.49|0.88    0.54|0.85
cLR_rec      0.60|0.82    0.48|0.88    0.54|0.85
cLR_f1       0.60|0.82    0.49|0.88    0.54|0.85
DT_prec      0.75|0.81    0.38|0.95    0.50|0.87
DT_rec       0.75|0.80    0.35|0.96    0.48|0.87
DT_f1        0.75|0.81    0.39|0.95    0.51|0.87
cDT_prec     0.60|0.82    0.49|0.88    0.54|0.85
cDT_rec      0.50|0.84    0.61|0.78    0.55|0.81
cDT_f1       0.53|0.84    0.60|0.81    0.56|0.82
RF_prec      0.72|0.80    0.37|0.95    0.49|0.87
RF_rec       0.73|0.81    0.41|0.95    0.53|0.87
RF_f1        0.74|0.81    0.41|0.95    0.52|0.87
cRF_prec     0.57|0.82    0.49|0.86    0.52|0.84
cRF_rec      0.50|0.84    0.61|0.77    0.55|0.81
cRF_f1       0.50|0.84    0.61|0.77    0.55|0.81

(b) DBLP
Classifier   Precision    Recall       F1
LR_prec      0.96|0.84    0.24|1.00    0.39|0.91
LR_rec       0.96|0.84    0.24|1.00    0.39|0.91
LR_f1        0.97|0.84    0.24|1.00    0.38|0.91
cLR_prec     0.70|0.90    0.61|0.93    0.65|0.92
cLR_rec      0.73|0.90    0.58|0.94    0.65|0.92
cLR_f1       0.70|0.90    0.60|0.93    0.65|0.92
DT_prec      0.87|0.87    0.42|0.98    0.56|0.92
DT_rec       0.73|0.90    0.56|0.95    0.63|0.92
DT_f1        0.77|0.89    0.52|0.96    0.62|0.92
cDT_prec     0.59|0.93    0.72|0.88    0.65|0.90
cDT_rec      0.47|0.94    0.82|0.77    0.60|0.85
cDT_f1       0.59|0.93    0.72|0.88    0.65|0.90
RF_prec      0.83|0.89    0.52|0.97    0.64|0.93
RF_rec       0.74|0.90    0.56|0.95    0.64|0.92
RF_f1        0.80|0.90    0.56|0.96    0.66|0.93
cRF_prec     0.62|0.91    0.66|0.90    0.64|0.91
cRF_rec      0.59|0.91    0.67|0.89    0.63|0.90
cRF_f1       0.55|0.93    0.76|0.84    0.64|0.89

[Figure 1: Toy example showcasing why cost-sensitive approaches may achieve worse precision. The figure plots samples of the two classes in a two-feature space (feature1, feature2), together with a cost-insensitive and a cost-sensitive separating hyperplane.]

Tables 3 & 4 summarize the results of the performed experiments. The results are very similar for both datasets (PMC and DBLP) and for both values of the parameter y. A general observation is that, when we focus on precision, cost-insensitive classification approaches perform adequately well and, thus, there is no need to resort to cost-sensitive versions. However, the same experiments highlight that the latter can significantly improve effectiveness in terms of recall and F1.

This behavior is not surprising: by default, in several classifiers, the optimization process targets accuracy maximization, since all samples contribute equally to the loss function being minimized. Consequently, in areas of the feature space where the samples of the different classes are not easily separable, the samples of the majority class are favored (i.e., correctly classified) due to their dominance in numbers. Consider, for instance, the two minority class samples (cross marks) and the six majority class ones (circle marks) between the two alternative hyperplanes of the toy example in Figure 1: classifying all of them to the majority class would induce three times less cost to the classifier than classifying them to the minority class.
In this way, the cost-insensitive classifier also achieves good precision for the minority class (no false positives in this example). The drawback is that this results in many false negatives for the minority class (the most important one). Cost-sensitive approaches alleviate this issue, improving the recall and F1 of the minority class, with the counter-effect of a larger number of false positives for the minority class.

Focusing on the differences between the examined classification approaches, cost-insensitive Logistic Regression is, by far, the best option for applications focusing on precision, achieving values between 0.85 and 0.97 across all datasets. However, this is achieved by allowing very significant losses in recall and F1 (values below 0.27 and 0.41 across all datasets, respectively). On the other hand, cost-sensitive Random Forest and Decision Tree classifiers seem to be the best options when recall and F1 are more important (albeit their losses in precision are significant).
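The class_weight mechanism of Section 3.1 implements exactly this cost adjustment. The toy snippet below (synthetic blobs, not our experimental data) typically reproduces the observed trade-off: the balanced model trades minority-class precision for recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
# Synthetic imbalanced data: 600 majority vs 100 minority samples
# drawn from two overlapping Gaussian blobs in a 2-feature space.
X = np.vstack([rng.normal(0.0, 1.0, (600, 2)),
               rng.normal(1.5, 1.0, (100, 2))])
y = np.array([0] * 600 + [1] * 100)

for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight).fit(X, y)
    p, r, f, _ = precision_recall_fscore_support(
        y, clf.predict(X), labels=[1], zero_division=0)
    print(f"class_weight={weight}: "
          f"P={p[0]:.2f} R={r[0]:.2f} F1={f[0]:.2f}")
```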
4 RELATED WORK

The vast majority of works that attempt to estimate the expected impact of scientific articles focus on predicting the exact number of citations each article will receive in a given future period, a problem known as Citation Count Prediction (CCP). Most of these works incorporate a wide range of features based on the article's content, novelty, author list, venue, topic, citations, and reviews, to name only a few. The corresponding prediction models are based on various regression models like Linear Regression [22, 24], k-NN [22], SVR [10, 14, 22, 24], Gaussian Process Regression [21], the CART model [21, 22], ZINB Regression [4], or various types of neural networks [1, 11–13, 20, 24]. In most works, one or more regression models are tested on the complete dataset, with the notable exception of [10], which first attempts to identify the current citation trend of each article (e.g., early burst, no burst, late burst) and then applies a different model to each case. As elaborated in Section 2.2, CCP is a very difficult problem, and there are many, not easily quantified factors that can significantly affect the performance of such approaches. Also, such approaches rely on article metadata that are difficult to collect and that must undergo complex-to-implement and time-consuming processing (see also Section 2.3).

In another line of work, based on the fact that co-authorship and citation-based features seemed to be effective for earlier approaches, the authors of [17] follow a link-prediction-inspired approach to solve CCP. They also investigate the effectiveness of their approach on a relevant classification problem based on a set of arbitrarily determined classes. However, training their approach requires a heavy pattern mining analysis of the underlying citation network and also considers author- and venue-based features, which face the already discussed issues. It should be noted that there are also some link prediction approaches that aim to reveal missing citations between a set of articles (e.g., [23]); these approaches are, however, irrelevant to the problem of impact prediction. Furthermore, in [18] an impact-based classification problem is studied, but the features of the proposed approach rely on difficult-to-collect article metadata (e.g., information about academic and funding organizations). As a result, this approach cannot be easily used in practice. Finally, there are methods that attempt to estimate the rank of articles based on their expected impact; a thorough survey and experimental study of such methods can be found in [7]. This problem is easier than CCP, since only the partial ordering of the articles according to their expected impact needs to be estimated, but it is still more difficult than the problem we focus on.

5 CONCLUSION

In this work, we propose a simplified approach that can significantly ease the work of researchers and developers working on applications that rely on the prediction of the expected impact of scientific articles. The proposed approach classifies articles in two categories ('impactful' / 'impactless') based on a set of features that can be calculated using a minimal set of article metadata. Furthermore, we experimentally evaluated this approach using various well-established classifiers, showing that the results are more than adequate. The aforementioned experiments were performed with caution, taking into account the imbalanced nature of the classification problem at hand.

In the future, we plan to further investigate the imbalanced nature of the problem by examining other approaches, like methods that perform over-sampling of the minority class, methods that perform under-sampling of the majority class, or methods combining the two (e.g., SMOTEENN). Additionally, we plan to examine a wider range of parameters for the examined approaches, for instance, a range of custom weights for the cost-sensitive approaches. Finally, we plan to take full advantage of the Head/Tail Breaks approach to study a non-binary version of the classification problem.

ACKNOWLEDGMENTS

We acknowledge support of this work by the project "Moving from Big Data Management to Data Science" (MIS 5002437/3), which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

REFERENCES

[1] A. Abrishami and S. Aliakbary. 2019. Predicting citation counts based on deep neural network learning techniques. Journal of Informetrics 13, 2 (2019), 485–499.
[2] A. Barabási et al. 2016. Network Science. Cambridge University Press.
[3] J. Bollen, H. Van de Sompel, A. Hagberg, and R. Chute. 2009. A principal component analysis of 39 scientific impact measures. PLoS ONE 4, 6 (2009), e6022.
[4] F. Didegah and M. Thelwall. 2013. Determinants of research citation impact in nanoscience and nanotechnology. Journal of the American Society for Information Science and Technology 64, 5 (2013), 1055–1064.
[5] H. He and Y. Ma. 2013. Imbalanced Learning: Foundations, Algorithms, and Applications. John Wiley & Sons.
[6] M. Jaradeh, A. Oelen, K. Farfar, M. Prinz, J. D'Souza, G. Kismihók, M. Stocker, and S. Auer. 2019. Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge. In Proc. of K-CAP.
[7] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2019. Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. IEEE TKDE (2019).
[8] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, and Y. Vassiliou. 2020. Ranking Papers by their Short-Term Scientific Impact. arXiv preprint arXiv:2006.00951 (2020).
[9] P. Larsen and M. von Ins. 2010. The Rate of Growth in Scientific Publication and the Decline in Coverage Provided by Science Citation Index. Scientometrics 84, 3 (2010), 575–603.
[10] C. Li, Y. Lin, R. Yan, and M. Yeh. 2015. Trend-Based Citation Count Prediction for Research Articles. In PAKDD.
[11] M. Li, J. Xu, B. Ge, J. Liu, J. Jiang, and Q. Zhao. 2019. A Deep Learning Methodology for Citation Count Prediction with Large-scale Biblio-Features. IEEE SMC (2019), 1172–1176.
[12] S. Li, W. Zhao, E. Yin, and J. Wen. 2019. A Neural Citation Count Prediction Model based on Peer Review Text. In EMNLP/IJCNLP.
[13] L. Liu, D. Yu, D. Wang, and F. Fukumoto. 2020. Citation Count Prediction Based on Neural Hawkes Model. IEICE Transactions on Information and Systems (2020), 2379–2388.
[14] A. Livne, E. Adar, J. Teevan, and S. Dumais. 2013. Predicting citation counts using text and graph mining. In Proc. of CompSci.
[15] P. Manghi, C. Atzori, A. Bardi, J. Shirrwagen, H. Dimitropoulos, S. La Bruzzo, I. Foufoulas, A. Löhden, A. Bäcker, A. Mannocci, M. Horst, M. Baglioni, A. Czerniak, K. Kiatropoulou, A. Kokogiannaki, M. De Bonis, M. Artini, E. Ottonello, A. Lempesis, L. Nielsen, A. Ioannidis, C. Bigarella, and F. Summan. 2019. OpenAIRE Research Graph Dump. https://doi.org/10.5281/zenodo.3516918
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825–2830.
[17] N. Pobiedina and R. Ichise. 2016. Citation count prediction as a link prediction problem. Applied Intelligence 44, 2 (2016), 252–268.
[18] Z. Su. 2020. Prediction of future citation count with machine learning and neural network. In IPEC. IEEE, 101–104.
[19] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. 2008. ArnetMiner: Extraction and Mining of Academic Social Networks. In KDD'08. 990–998.
[20] J. Wen, L. Wu, and J. Chai. 2020. Paper Citation Count Prediction Based on Recurrent Neural Network with Gated Recurrent Unit. IEEE ICEIEC (2020), 303–306.
[21] R. Yan, C. Huang, J. Tang, Y. Zhang, and X. Li. 2012. To better stand on the shoulder of giants. In Proc. of ACM/IEEE-CS JCDL. 51–60.
[22] R. Yan, J. Tang, X. Liu, D. Shan, and X. Li. 2011. Citation count prediction: learning to estimate future citations for literature. In Proc. of CIKM. 1247–1252.
[23] X. Yu, Q. Gu, M. Zhou, and J. Han. 2012. Citation Prediction in Heterogeneous Bibliographic Networks. In SDM.
[24] X. Zhu and Z. Ban. 2018. Citation Count Prediction Based on Academic Network Features. IEEE AINA (2018), 534–541.

A USED PARAMETER CONFIGURATIONS

Tables 5 & 6 summarize the configurations of the examined approaches. The names of the parameters correspond to the input parameters of the respective Scikit-learn functions. Omitted input parameters were not configured (their default values were used).
Table 5: Parameter configurations for PMC

Classifier | Configuration for y = 3 | Configuration for y = 5
LR_prec   | 'max_iter': 200, 'solver': 'sag' | 'max_iter': 160, 'solver': 'sag'
LR_rec    | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 80, 'solver': 'sag'
LR_f1     | 'max_iter': 180, 'solver': 'sag' | 'max_iter': 240, 'solver': 'sag'
cLR_prec  | 'max_iter': 100, 'solver': 'sag' | 'max_iter': 60, 'solver': 'sag'
cLR_rec   | 'max_iter': 120, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
cLR_f1    | 'max_iter': 180, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
DT_prec   | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_rec    | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_f1     | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 8, 'min_samples_leaf': 10, 'min_samples_split': 200
cDT_prec  | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_rec   | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_f1    | 'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 20 | 'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 50
RF_prec   | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 200 | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 200
RF_rec    | 'criterion': 'gini', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 300 | 'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 300
RF_f1     | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 200 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 300
cRF_prec  | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 100
cRF_rec   | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100
cRF_f1    | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 300
Table 6: Parameter configurations for DBLP

Classifier | Configuration for y = 3 | Configuration for y = 5
LR_prec   | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 100, 'solver': 'sag'
LR_rec    | 'max_iter': 80, 'solver': 'sag' | 'max_iter': 140, 'solver': 'sag'
LR_f1     | 'max_iter': 220, 'solver': 'saga' | 'max_iter': 220, 'solver': 'sag'
cLR_prec  | 'max_iter': 200, 'solver': 'sag' | 'max_iter': 180, 'solver': 'sag'
cLR_rec   | 'max_iter': 140, 'solver': 'sag' | 'max_iter': 160, 'solver': 'sag'
cLR_f1    | 'max_iter': 100, 'solver': 'sag' | 'max_iter': 60, 'solver': 'newton-cg'
DT_prec   | 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_rec    | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2
DT_f1     | 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_prec  | 'max_depth': 14, 'min_samples_leaf': 10, 'min_samples_split': 2 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_rec   | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2 | 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2
cDT_f1    | 'max_depth': 11, 'min_samples_leaf': 10, 'min_samples_split': 200 | 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2
RF_prec   | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 100
RF_rec    | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150
RF_f1     | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 250
cRF_prec  | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 250 | 'criterion': 'entropy', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 100
cRF_rec   | 'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 100 | 'criterion': 'gini', 'max_depth': 1, 'max_features': 'log2', 'n_estimators': 150
cRF_f1    | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'n_estimators': 150 | 'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 150