1st Workshop on AI + Informetrics (AII2021)

Important Citations Identification with Semi-supervised Classification Model

Xin An1[0000-0001-7413-9396], Xin Sun1 and Shuo Xu2*[0000-0002-8602-1819]

1 School of Economics & Management, Beijing Forestry University, Beijing 100083, P.R. China
anxin@bjfu.edu.cn (Xin An), sx0118@outlook.com (Xin Sun)
2 College of Economics and Management, Beijing University of Technology, Beijing 100124, P.R. China
xushuo@bjut.edu.cn (Shuo Xu)
* Corresponding author

Abstract. Given that citations are not equally important, various techniques have been presented to identify important citations on the basis of supervised machine learning models. However, only a small volume of data has been manually annotated with labels. To make full use of unlabeled data and improve learning performance, the semi-supervised self-training technique is applied to important citation identification in this work. After six groups of features are engineered, the semi-supervised versions of the SVM and RF models significantly outperform their conventional supervised counterparts when unannotated samples at the 75% and 95% confidence levels, respectively, are added to the training set. The AUC-PR and AUC-ROC of the SVM model are 0.8102 and 0.9622, and those of the RF model reach 0.9248 and 0.9841, both exceeding their supervised counterparts. This demonstrates the effectiveness of our semi-supervised self-training strategy for important citation identification.

Keywords: Important Citation, Semi-supervised Learning, Self-training.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Citations are reckoned as a proxy of scientific knowledge flow in the literature, so they are widely used for academic evaluation purposes such as ranking researchers [1], journals [2], and organizations [3]. However, most studies treat all references of a citing publication as equally important, which is clearly not in line with actual practice. In recent years, researchers have argued that citations are not equally important and have presented various techniques to identify important citations [4-11]. Supervised learning methods are commonly used for this task: they learn the feature space of the labeled data to form a classification model. However, most supervised learning methods require a large amount of labeled data to ensure the performance of the resulting classifiers [12]. Currently, only a small number of citations are labeled manually, owing to the time-consuming annotation process and heavy workload. In other words, large amounts of unlabeled data remain unexploited. The last two decades have witnessed significant progress in semi-supervised learning, and many successful cases from various fields have been reported in the literature [12-15]. However, important citation identification with semi-supervised models remains largely understudied.

To make full use of unlabeled data and improve model performance, a semi-supervised self-training method is deployed in this work. After Section 2 briefly describes the related work, the framework of semi-supervised self-training for important citation identification is introduced in Section 3 along with six groups of features [11]. Section 4 presents the statistics of the labeled and unlabeled data.
In Section 5, experiments with the SVM and RF models armed with the semi-supervised self-training strategy are reported, and Section 6 concludes this work.

2 Related work

In the literature, various techniques have been presented to identify important citations. Valenzuela et al. [4] annotated 465 citations from the ACL anthology and used two supervised learning models (SVM and RF) to classify important citations. Since then, a plethora of studies have applied different supervised learning models to this annotated dataset [6-11], including SVM, RF, Naïve Bayes, K-Nearest Neighbors, Decision Tree, deep learning models, etc. Among all these supervised models, SVM and RF were the most commonly used and outperformed the other candidates. Supervised learning is thus the mainstream technique for this task. However, it relies on a large amount of labeled data to maintain its performance, which conflicts with the reality that labeled data are costly to obtain.

In practice, to overcome the limited amount of labeled data and make full use of unlabeled data, semi-supervised learning algorithms have received increasing attention. Many semi-supervised learning methods have been proposed, such as co-training [13], the semi-supervised support vector machine (S3VM) [14], and self-training [15]. These methods have been shown to be effective in improving predictive performance by leveraging large amounts of unlabeled data together with a small amount of labeled data. Among these approaches, the self-training method expands the training data with predictions on unlabeled data. It is easy to implement and offers great flexibility in threshold setting, which leaves more choices in model selection. Therefore, to make full use of the unlabeled data, the semi-supervised self-training method is adopted to identify important citations in this paper.

3 Methodology

Figure 1 depicts the framework for identifying important citations on the basis of the semi-supervised self-training strategy. First, a supervised learning model (such as SVM or RF) is trained on the labeled data under 5-fold cross-validation. After learning the training set of each fold, the labels of the unlabeled data are predicted. Samples predicted at the 95%, 90%, 85%, 80%, 75%, and 70% confidence levels are selected as pseudo-labeled data and rejoined to the training set. For each fold, the model is retrained on the combined data and evaluated on the testing set, with the involved parameters optimized correspondingly. The areas under the PR and ROC curves are used as performance indicators. A minimal code sketch of this loop follows Fig. 1.

Fig. 1. Framework for identifying important citations with the semi-supervised learning model.
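To make the loop in Fig. 1 concrete, a minimal sketch of one self-training round with a scikit-learn classifier is given below. It is illustrative only: the array arguments, the RF hyper-parameters, and the single 0.95 threshold are assumed placeholders rather than the exact experimental configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train_round(X_labeled, y_labeled, X_unlabeled, confidence=0.95):
    """One self-training round: train, pseudo-label, filter, retrain."""
    # Train the base classifier on the currently labeled data.
    base = RandomForestClassifier(n_estimators=100, random_state=42)
    base.fit(X_labeled, y_labeled)

    # Predict class probabilities for the unlabeled pool; map the most
    # probable column index back to its class label.
    proba = base.predict_proba(X_unlabeled)
    pseudo_labels = base.classes_[proba.argmax(axis=1)]

    # Keep only samples whose top predicted probability meets the threshold.
    keep = proba.max(axis=1) >= confidence
    X_combined = np.vstack([X_labeled, X_unlabeled[keep]])
    y_combined = np.concatenate([y_labeled, pseudo_labels[keep]])

    # Retrain on the combined labeled + pseudo-labeled data.
    retrained = RandomForestClassifier(n_estimators=100, random_state=42)
    retrained.fit(X_combined, y_combined)
    return retrained
```

In our experiments, such a round is carried out inside each of the five cross-validation folds and repeated for every confidence level, so the pseudo-labeled pool differs from fold to fold (see Table 3 in Section 5).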
As for the feature engineering, the following six groups of features from our previous study [11] are used here: G1 (two generative features extracted from the CIM model), G2 (structure-based features, 7 features), G3 (separate-citation-based feature, 1 feature), G4 (author-overlap-based feature, 1 feature), G5 (cue-words-based features, 2 features), and G6 (similarity-based feature, 1 feature). Please refer to [11] for more details.

4 Data and preprocessing

The annotated corpus in [4] is used in this work. The dataset was randomly sampled from the ACL anthology and manually annotated by one expert with the labels 0 (related work), 1 (comparison), 2 (using the work), and 3 (extending the work). To conduct the experiment of identifying important citations, we combine the related-work and comparison classes into an incidental class with label 0, and the using-the-work and extending-the-work classes into an important class with label 1. To reduce the bias introduced by human annotation, inter-annotator agreement was verified between two experts and reached 93.9% on this coarse label set. In the end, 456 pairs of labeled data were collected after preprocessing, of which 14.7% are important citations. Table 1 summarizes the labeled dataset.

Table 1. Summary of the labeled dataset.

  Label   Class        Number of samples
  0       Incidental   389 (85.3%)
  1       Important     67 (14.7%)

The preprocessing steps are as follows: (1) collecting the citing papers in PDF format and converting them to text with Xpdf; (2) parsing the text with ParsCit to extract the title, authors, abstract, and references of each citing paper, as well as the generic section headers; (3) extracting citation contexts with regular expressions; (4) preprocessing all textual information, including citation contexts and abstracts, with the NLTK toolkit. In total, 434 citing papers are collected, which yields 8,541 citing-cited pairs. Table 2 lists the statistics of the citing papers and references. Apart from the 456 labeled pairs described above, 8,085 unlabeled citation pairs remain. The same feature engineering and preprocessing are applied to all unlabeled data.

Table 2. Statistics of citing papers and references.

  Number of citing papers   Number of unique references   Number of citing-cited pairs   Number of unlabeled pairs
  434                       4,590                         8,541                          8,085

5 Experimental results and discussion

As two state-of-the-art discriminative models, SVM and RF are used here as our classifiers. First, the two models are trained on the labeled data. To tune their parameters, grid search with 5-fold cross-validation [16] is used (a hedged sketch of this tuning and evaluation step follows Fig. 2). Figure 2 shows the PR and ROC curves of SVM and RF. The areas under the ROC curve (AUC-ROC) of the SVM and RF models are 0.9287 and 0.9798 respectively, and the areas under the PR curve (AUC-PR) are 0.7628 and 0.9056 respectively. The RF model outperforms the SVM model, which is in accordance with most previous studies [4-11].

Fig. 2. The PR curves (a) and ROC curves (b) of the SVM and RF models on labeled data with the supervised learning strategy.
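As a hedged illustration of this tuning and evaluation step, the sketch below tunes an SVM with GridSearchCV under 5-fold cross-validation and scores a held-out split with AUC-PR and AUC-ROC; the parameter grid and the 80/20 split are assumptions for illustration, not the settings reported above.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import average_precision_score, roc_auc_score

def tune_and_evaluate(X, y):
    """Tune an SVM by grid search with 5-fold CV, then score a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Illustrative search grid; the ranges actually searched are not reported.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
    search = GridSearchCV(SVC(probability=True), param_grid,
                          scoring="average_precision", cv=5)
    search.fit(X_train, y_train)

    # Positive-class scores for the test split.
    scores = search.predict_proba(X_test)[:, 1]
    auc_pr = average_precision_score(y_test, scores)  # summary of the PR curve
    auc_roc = roc_auc_score(y_test, scores)           # area under the ROC curve
    return search.best_params_, auc_pr, auc_roc
```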
Then, semi-supervised self-training on the unlabeled data is conducted. After learning the training set of each fold of the above 5-fold split, the labels of the unlabeled data are predicted, and samples at the 95%, 90%, 85%, 80%, 75%, and 70% confidence levels are selected to rejoin the training set. Table 3 lists the number of new samples per fold at each confidence level. For each fold, the resulting model is retrained on the combined data and evaluated on the testing set, with grid search again used to tune the involved parameters. Table 4 reports the mean AUC-PR and AUC-ROC over the five folds under different confidence levels. The AUC-PR and AUC-ROC of the SVM model reach their maximum at the 75% confidence level (0.8102 and 0.9622, respectively), while the RF model peaks at the 95% confidence level (0.9248 and 0.9841). Both are better than the results of the supervised counterparts above.

Table 3. Number of new samples under different confidence levels.

  Fold   Model   95%     90%     85%     80%     75%     70%
  1      SVM     4,444   5,977   6,714   7,067   7,334   7,533
  1      RF      1,002   2,406   3,670   4,709   5,368   5,909
  2      SVM     3,538   5,863   6,567   6,999   7,279   7,502
  2      RF        944   2,462   3,674   4,663   5,387   6,054
  3      SVM     3,993   5,913   6,649   7,025   7,306   7,517
  3      RF        925   2,462   3,620   4,624   5,369   6,086
  4      SVM     4,362   5,940   6,688   7,040   7,319   7,521
  4      RF        944   2,462   3,674   4,663   5,387   6,054
  5      SVM     3,411   5,853   6,555   6,994   7,271   7,499
  5      RF        944   2,462   3,674   4,663   5,387   6,054

Table 4. Performance of the SVM and RF models with the semi-supervised strategy under different confidence levels.

  Confidence level   SVM AUC-PR   SVM AUC-ROC   RF AUC-PR   RF AUC-ROC
  95%                0.7380       0.9217        0.9248      0.9841
  90%                0.7290       0.9078        0.9015      0.9804
  85%                0.7525       0.9225        0.8811      0.9759
  80%                0.7545       0.9248        0.8463      0.9702
  75%                0.8102       0.9622        0.8331      0.9674
  70%                0.7522       0.9292        0.8374      0.9666

Further, to assess the contribution of each group of features, we perform an additional ablation experiment and observe the changes in mean AUC-PR and mean AUC-ROC (a sketch of this ablation loop follows Table 5). Table 5 shows the mean AUC-PR and AUC-ROC of the SVM model at the 75% confidence level and of the RF model at the 95% confidence level under 5-fold cross-validation, together with their ranks (in parentheses) and the average rank, when different groups of features are added while controlling for the structure-based features (G2). For each combination, the parameters are optimized separately. The baseline model based on the structure-based features alone achieves a mean AUC-PR of about 0.7600 (SVM) and 0.7903 (RF), and a mean AUC-ROC of about 0.8906 (SVM) and 0.4743 (RF). The author-overlap-based feature (G4) ranks first, increasing the AUC-PR to 0.9462 (SVM) and 0.8145 (RF) and the AUC-ROC to 0.9875 (SVM) and 0.4798 (RF). The features based on the CIM (Citation Influence Model) [17] (G1) rank second, which demonstrates that features generated from the generative model can improve the performance of important citation identification. This observation is in accordance with our previous work [11].

Table 5. Performance of the semi-supervised SVM and RF models with different groups of features in terms of mean AUC-PR and AUC-ROC, with ranks in parentheses.

  Features   SVM AUC-PR   SVM AUC-ROC   RF AUC-PR    RF AUC-ROC   Average rank
  G2         0.7600 (3)   0.8906 (6)    0.7903 (5)   0.4743 (5)   4.75
  G2+G1      0.7558 (4)   0.8935 (5)    0.9035 (1)   0.4968 (1)   2.75
  G2+G3      0.7448 (5)   0.8971 (4)    0.8183 (2)   0.4885 (3)   3.50
  G2+G4      0.9462 (1)   0.9875 (1)    0.8145 (3)   0.4798 (4)   2.25
  G2+G5      0.7822 (2)   0.9065 (3)    0.7065 (6)   0.4604 (6)   4.25
  G2+G6      0.6947 (6)   0.9181 (2)    0.7997 (4)   0.4889 (2)   3.50
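A sketch of such a group-wise ablation is given below. It is illustrative only: the feature blocks are assumed to be available as NumPy arrays keyed by group name, and for brevity it scores a supervised RF with cross-validated AUC-ROC rather than rerunning the full semi-supervised procedure for each combination, as is done in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def ablate_feature_groups(groups, y, baseline="G2"):
    """Score the baseline feature group alone and paired with each other group.

    groups: dict mapping a group name ("G1".."G6") to a 2-D feature block
    whose rows align with the label vector y.
    """
    results = {}
    for name, block in groups.items():
        if name == baseline:
            X, combo = groups[baseline], baseline
        else:
            # Stack the candidate group's columns next to the baseline block.
            X, combo = np.hstack([groups[baseline], block]), f"{baseline}+{name}"
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        # Mean AUC-ROC over 5 folds for this feature combination.
        results[combo] = cross_val_score(clf, X, y, cv=5,
                                         scoring="roc_auc").mean()
    return results
```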
6 Conclusion

In this paper, following the practice in [4], we divide citations into important and incidental classes and use a semi-supervised self-training strategy that leverages both labeled and unlabeled data to improve the performance and generalization ability of important citation identification. Through semi-supervised self-training on the unlabeled data, the mean AUC-ROC of the SVM model improves from 0.9287 to 0.9622 and its mean AUC-PR from 0.7628 to 0.8102; the RF model improves from 0.9798 to 0.9841 and from 0.9056 to 0.9248, respectively. This demonstrates the effectiveness of our semi-supervised self-training strategy for important citation identification. Additionally, the CIM-model-based, structure-based, and author-overlap-based features contribute greatly to important citation identification.

Acknowledgements

This research received financial support from the National Natural Science Foundation of China under grant numbers 72004012 and 72074014.

References

1. Hirsch, J.E.: An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences 102(46), 16569-16572 (2005).
2. Garfield, E.: Citation indexes to science: a new dimension in documentation through association of ideas. Science 122, 108-111 (1955).
3. Lazaridis, T.: Ranking university departments using the mean h-index. Scientometrics 82(2), 211-216 (2010).
4. Valenzuela, M., Ha, V., Etzioni, O.: Identifying meaningful citations. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 21-26. AAAI, Austin (2015).
5. Zhu, X., Turney, P., Lemire, D., Vellino, A.: Measuring academic influence: not all citations are equal. Journal of the Association for Information Science and Technology 66(2), 408-427 (2015).
6. Hassan, S.U., Akram, A., Haddawy, P.: Identifying important citations using contextual information from full text. In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1-8. IEEE, New York (2017).
7. Hassan, S.U., Safder, I., Akram, A., Kamiran, F.: A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis. Scientometrics 116(2), 973-996 (2018).
8. Hassan, S.U., Imran, M., Iqbal, S., Aljohani, N.R., Nawaz, R.: Deep context of citations using machine-learning models in scholarly full-text articles. Scientometrics 117(3), 1645-1662 (2018).
9. Qayyum, F., Afzal, M.T.: Identification of important citations by exploiting research articles' metadata and cue-terms from content. Scientometrics 118(1), 21-43 (2019).
10. Wang, M., Zhang, J., Jiao, S., Zhang, X., Zhu, N., Chen, G.: Important citation identification by exploiting the syntactic and contextual information of citations. Scientometrics 125(3), 1-21 (2020).
11. An, X., Sun, X., Xu, S., Hao, L., Li, J.: Important citations identification by exploiting generative model into discriminative model. Journal of Information Science (2021). doi:10.1177/0165551521991034
12. Xu, S., An, X., Qiao, X., Zhu, L., Li, L.: Semi-supervised least-squares support vector regression machines. Journal of Information & Computational Science 8(6), 885-892 (2011).
13. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92-100. ACM, Madison, Wisconsin (1998).
14. Chapelle, O., Sindhwani, V., Keerthi, S.S.: Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9(2) (2008).
15. Tanha, J., van Someren, M., Afsarmanesh, H.: Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics 8(1), 355-370 (2017).
16. Xu, S., Ma, F., Tao, L.: Learn from the information contained in the false splice sites as well as in the true splice sites using SVM. In: Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering, pp. 1360-1366 (2007).
17. Xu, S., Hao, L., An, X., Yang, G., Wang, F.: Emerging research topics detection with multiple machine learning models. Journal of Informetrics 13(4), 100983 (2019).