
Augmenting a COVID-19 Research Knowledge Graph With Influential Papers Prediction

Gollam Rabby

Vojtěch Svátek

Petr Berka

berka@vse.cz

Prague University of Economics and Business, Prague, Czech Republic

2022

13 15

We applied machine learning to predict which COVID-19-related papers will be highly cited, yielding an extension for the Covid-on-the-Web knowledge graph. Symbolic and deep-learning (BERT) ML performed comparably. A LIME-based explanation is also included as part of the produced graph.

Keywords: knowledge graph, COVID-19, research papers, machine learning

1. Introduction

Among the current proliferation of knowledge graphs (KGs), research-oriented ones are a particular species. They can be understood as concise, structured representations of various kinds of scholarly knowledge, and have the potential to bridge the gap between overwhelmingly large corpora of scientific texts and the potential recipients of scholarly knowledge, who only have limited reading capacity. Numerous projects [1], [2] apply NLP techniques to extract key facts from research papers so that they can be exploited independently of their original contexts of publication, without the necessity to read the papers in extenso. The quality of the service provided by the KGs, however, depends on the quality of the papers they represent: knowledge from papers making an impact in the scientific community should thus be prioritized.

2 https://github.com/Wimmics/CovidOnTheWeb

CEUR Workshop Proceedings. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

            Frequency
category    Title and Abstract    Title
low         2541                  8063
high        2538                  8058

2. Methods

From our previous experiments we learned that, in biomedical scientific document processing, the TF-IDF or bag-of-words (BOW) representations with random forest or neural network (BERT) learners achieve state-of-the-art results across different combinations of document representation. Also, in most cases, the abstract and title had more impact on classifying a research paper than the bibliometric data had. Therefore, we only used the research paper titles and abstracts for the predictive task.

3 https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge
4 https://opencitations.net/

Table 2 (partially recoverable): per-class precision (P) by representation and learner, in the original row order. The input column (Title or Both) and the remaining metric columns are incomplete.

Repr.     Learner    P (low)    P (high)
BERT      NN         0.71       0.77
BERT      NN         0.69       0.71
BERT      NN         0.77       0.73
TF-IDF    RF         0.68       0.68
TF-IDF    RF         0.71       0.76
TF-IDF    RF         0.72       0.77
BOW       RF         0.71       0.76
BOW       RF         0.71       0.77
BOW       RF         0.71       0.76

Document representation. The Term Frequency – Inverse Document Frequency (TF-IDF) weighting scheme is the most popular text representation in previous studies. We built the TF-IDF input data table from the unigrams, bigrams and trigrams of the titles and abstracts. A binary representation of the same features (BOW) was also included for comparison.
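The two input data tables described above can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the whitespace tokenizer, the unsmoothed weighting tf · log(N/df), and the names `ngrams` and `build_tables` are assumptions made for illustration, and library implementations typically use smoothed variants.

```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Return all unigrams, bigrams and trigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_tables(documents):
    """Build a TF-IDF table and a binary (BOW) table over the same n-gram features."""
    doc_terms = [Counter(ngrams(doc)) for doc in documents]
    vocabulary = sorted(set().union(*doc_terms))
    n_docs = len(documents)
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(term for counts in doc_terms for term in counts)
    # Assumption: plain idf = log(N / df), without smoothing.
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    tfidf = [[counts[t] * idf[t] for t in vocabulary] for counts in doc_terms]
    bow = [[1 if counts[t] else 0 for t in vocabulary] for counts in doc_terms]
    return vocabulary, tfidf, bow


# Toy titles, invented for this example.
docs = ["covid vaccine trial", "covid protein structure"]
vocabulary, tfidf, bow = build_tables(docs)
```

Both tables share one vocabulary, so a column means the same n-gram in either representation. An n-gram that occurs in every document (here "covid") receives a TF-IDF weight of zero but still has a 1 in the binary table.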

Next, we employed an embedding-based representation that is regarded as cutting-edge among NLP language models, BERT [5], which was trained on English Wikipedia and BooksCorpus. We applied the BERT tokenizer to the same collection of titles and abstracts.

Machine learning algorithms. We used the random forest implementation from the scikit-learn library, with the same hyperparameter optimization as Beranová et al. [3]. The parameters were tuned individually for each input data table (TF-IDF and BOW), with accuracy as the optimization criterion. Next, we applied a simple feed-forward network from the PyTorch library to the BERT representation.
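As a rough illustration of tuning a random forest for accuracy on a TF-IDF table, the following sketch uses scikit-learn's `GridSearchCV`. The parameter grid, the two-fold cross-validation, and the toy titles and labels are assumptions chosen to keep the example small; the actual search space is the one described in [3] and is not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy titles with made-up labels; the real input is the set of paper
# titles and abstracts with their citation categories.
texts = [
    "novel coronavirus vaccine trial results",
    "vaccine efficacy in a randomized trial",
    "coronavirus vaccine trial in adults",
    "randomized vaccine trial outcomes",
    "protein structures of a bat virus",
    "crystal structures of viral protein",
    "structures of a novel viral protein",
    "bat virus protein folding",
]
labels = ["high"] * 4 + ["low"] * 4

# Unigrams, bigrams and trigrams, as in the document representation step.
features = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(texts)

# Illustrative grid only (an assumption, not the grid used in [3]).
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # accuracy is the optimization criterion
    cv=2,
)
search.fit(features, labels)
print(search.best_params_, search.best_score_)
```

Running the same search once per input table (TF-IDF and BOW) would give each representation its own tuned parameters.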

Explanation algorithm. The LIME (Local Interpretable Model-agnostic Explanations) [6] tool shows which feature values affected a given prediction, and how. The explanation is obtained by replacing the explained model locally with an interpretable one. It can only be considered approximate, because the local model is built by perturbing the explained instance: the feature values are varied and the effect of each change on the prediction is observed.
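The perturb-and-fit idea can be sketched in plain Python as follows. This is a simplified illustration, not the LIME library: `black_box` is a hypothetical stand-in classifier with invented weights, the proximity kernel uses the fraction of removed words rather than LIME's cosine distance, and the local surrogate is a weighted least-squares fit solved by hand.

```python
import math
import random


def black_box(words):
    """Hypothetical classifier: score for the 'high' class given the words present."""
    weights = {"novel": -0.03, "structures": -0.02, "vaccine": 0.05}
    return 0.5 + sum(weights.get(w, 0.0) for w in words)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def explain(words, predict, n_samples=300, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)
    d = len(words)
    rows, targets, sample_weights = [], [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(d)]  # 1 = word kept
        kept = [w for w, keep in zip(words, mask) if keep]
        distance = 1.0 - sum(mask) / d  # fraction of words removed
        rows.append([1.0] + [float(v) for v in mask])  # intercept + presence
        targets.append(predict(kept))
        sample_weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    k = d + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [
        [sum(sw * r[i] * r[j] for r, sw in zip(rows, sample_weights)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [
        sum(sw * r[i] * y for r, y, sw in zip(rows, targets, sample_weights))
        for i in range(k)
    ]
    for i in range(k):
        xtwx[i][i] += 1e-9  # tiny ridge term to guard against a singular system
    coefs = solve(xtwx, xtwy)
    # Drop the intercept and rank words by the size of their effect.
    return sorted(zip(words, coefs[1:]), key=lambda p: abs(p[1]), reverse=True)


explanation = explain(["novel", "structures", "of", "the", "vaccine"], black_box)
```

The result is a list of (word, weight) pairs, the same shape as the explanation strings stored in the graph. Because the stand-in classifier is itself linear, the surrogate recovers its weights almost exactly; for a real model the weights are only a local approximation.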

3. Results and Discussion

We split the data by random sampling into 70% training data and 30% test data. The overall accuracy was used to evaluate the results, but we also computed the per-class accuracy. Table 2 shows the precision, recall, F1 score and accuracy (per-class and average) of the neural network (BERT) and random forest approaches on the test data. As we can see, a traditional multi-purpose machine learning algorithm, random forest, performs about as well as a neural network (BERT). This is not so surprising, since in some other reported cases the difference in performance between BERT, TF-IDF, and BOW was also relatively small [7]. Superiority of the neural network prediction could possibly be achieved by training domain-specific language models. However, the TF-IDF and BOW representations are quicker to create, and they enable the use of inherently interpretable machine learning techniques, so that the generated models remain interpretable.
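For readers who want to reproduce the evaluation protocol, here is a minimal standard-library sketch of a 70/30 random split and of the per-class metrics. The function names, the seed, and the example labels are illustrative assumptions, not values from the experiments.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=0):
    """Randomly split items into (train, test) without overlap."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train, test = train_test_split(list(range(10)))

# Made-up labels for illustration only.
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "low", "low", "high"]
high_metrics = per_class_metrics(y_true, y_pred, "high")
low_metrics = per_class_metrics(y_true, y_pred, "low")
overall = accuracy(y_true, y_pred)
```

Reporting the metrics per class as well as overall makes it visible when a model favours one citation category over the other.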

As regards the RDF output, for every research paper from the Covid-on-the-Web KG we store the predicted citation rate category (high or low), together with the citation count from OpenCitations and the LIME-based interpretation. The classified data from the Covid-on-the-Web corpus is available in the GitHub repository (footnote 5). An example follows (footnote 6); the LIME-based explanation, stored just as a long string in lexinfo:explanation, is shown truncated:

    <https://cimple.vse.cz/covid-on-the-web/10.1016/j.ymeth.2005.05.008>
        a fabio:ResearchPaper, bibo:AcademicArticle, schema:ScholarlyArticle ;
        bibo:doi "10.1016/j.ymeth.2005.05.008" ;
        cito:Citation 93 ;
        <https://cimple.vse.cz/covid-on-the-web/expCitationRate> high ;
        lexinfo:explanation "('novel', -0.030867), ('structures', -0.025789), ..." ;
        schema:url <https://doi.org/10.1016/j.ymeth.2005.05.008> .
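A record of this shape could be produced with a few lines of Python, sketched below. The helper names `turtle_literal` and `paper_to_turtle` are hypothetical, and writing the category as a quoted string literal is an assumption made here to obtain valid Turtle; the property names and the base URI are taken from the example above.

```python
BASE = "https://cimple.vse.cz/covid-on-the-web/"


def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def paper_to_turtle(doi, citation_count, category, explanation):
    """Serialize one paper's prediction as a Turtle record (prefixes omitted)."""
    # The explanation is stored as one long string of (word, weight) pairs.
    explanation_text = ", ".join(repr(pair) for pair in explanation)
    lines = [
        f"<{BASE}{doi}>",
        "    a fabio:ResearchPaper, bibo:AcademicArticle, schema:ScholarlyArticle ;",
        f"    bibo:doi {turtle_literal(doi)} ;",
        f"    cito:Citation {citation_count} ;",
        # Assumption: the category is emitted as a quoted literal.
        f"    <{BASE}expCitationRate> {turtle_literal(category)} ;",
        f"    lexinfo:explanation {turtle_literal(explanation_text)} ;",
        f"    schema:url <https://doi.org/{doi}> .",
    ]
    return "\n".join(lines)


record = paper_to_turtle(
    "10.1016/j.ymeth.2005.05.008",
    93,
    "high",
    [("novel", -0.030867), ("structures", -0.025789)],
)
print(record)
```

Keeping the explanation as a single string mirrors the stored format; representing each word and weight as separate triples would make the explanation queryable, at the cost of a larger graph.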

4. Conclusions and future work

We have made an initial exploration of augmenting a research-oriented KG with the predicted impact of the underlying papers, obtained via machine learning.

Our next step will be to evaluate this simple approach in the context of more comprehensive support for users, in particular fact checkers, in getting access to scientific literature and its authors. As regards the actual predictive ML technology, the BERT model, having been trained merely on general textual data (English Wikipedia and the BooksCorpus), did not outperform classical ML models in this first attempt. We assume, however, that it would work better if trained on domain-specific data such as biomedical research papers. We are also considering incorporating external KGs, such as encyclopaedic ones (DBpedia, Wikidata), into the learning process.

5 https://github.com/corei5/Enhancement-of-the-Covid-on-the-Web
6 Prefixes for common vocabularies omitted; they can be retrieved via https://prefix.cc.

5. Acknowledgments

This research is supported by the CIMPLE project (CHIST-ERA-19-XAI-003). The authors would also like to thank Sören Auer and the Open Research Knowledge Graph (ORKG) group for their valuable feedback and for further ideas on connecting this work with other research in the ORKG environment.

[1] M. Y. Jaradeh, Oelen, K. E. Farfar, Prinz, J. D'Souza, G. Kismihók, M. Stocker, S. Auer, Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge, in: Proceedings of the 10th International Conference on Knowledge Capture, 2019, pp. 243-246.

[2] M. E. Deagen, J. P. McCusker, Fateye, Stouffer, L. C. Brinson, D. L. McGuinness, L. S. Schadler, FAIR and interactive data graphics from a scientific knowledge graph, Scientific Data 9 (2022) 1-11.

[3] Beranová, M. P. Joachimiak, Kliegr, Rabby, Sklenák, Why was this cited? Explainable machine learning applied to COVID-19 research literature, Scientometrics (2022) 1-37.

[4] Michel, Gandon, Ah-Kane, Bobasheva, Cabrio, Corby, Gazzotti, Giboin, Marro, Mayer, et al., Covid-on-the-Web: Knowledge graph and services to advance COVID-19 research, in: International Semantic Web Conference, Springer, 2020, pp. 294-310.

[5] Devlin, M.- Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[6] M. T. Ribeiro, Singh, Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.

[7] Mujahid, Lee, Rustam, P. B. Washington, Ullah, A. A. Reshi, I. Ashraf, Sentiment analysis and topic modeling on tweets about online education during COVID-19, Applied Sciences 11 (2021) 8438.