<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Augmenting a COVID-19 Research Knowledge Graph With Influential Papers Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gollam Rabby</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtěch Svátek</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Berka</string-name>
          <email>berka@vse.cz</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Prague University of Economics and Business</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We applied machine learning to predict which COVID-19-related papers will be highly cited, yielding an extension of the Covid-on-the-Web knowledge graph. Symbolic and deep-learning (BERT) ML performed comparably. A LIME-based explanation is also included as part of the produced graph.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graph</kwd>
        <kwd>COVID-19</kwd>
        <kwd>research papers</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Among the current proliferation of knowledge graphs (KGs), research-oriented ones are a particular species. They can be understood as concise, structured representations of various kinds of scholarly knowledge, and they can bridge the gap between overwhelmingly large corpora of scientific texts and the potential recipients of scholarly knowledge, who have only limited reading capacity. Numerous projects [<xref ref-type="bibr" rid="ref1">1</xref>], [<xref ref-type="bibr" rid="ref2">2</xref>] apply NLP techniques to extract key facts from research papers so that they can be exploited independently of their original contexts of publication, without the necessity to read the papers in extenso. However, the quality of the service provided by such KGs depends on the quality of the papers they represent; knowledge from papers that make an impact in the scientific community should thus be prioritized.</p>
      <p>CEUR Workshop Proceedings. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>2</sup> https://github.com/Wimmics/CovidOnTheWeb</p>
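      <table-wrap>
        <table>
          <thead>
            <tr><th /><th>Category</th><th>Frequency</th></tr>
          </thead>
          <tbody>
            <tr><td>Title and Abstract</td><td>low</td><td>2541</td></tr>
            <tr><td>Title and Abstract</td><td>high</td><td>2538</td></tr>
            <tr><td>Title</td><td>low</td><td>8063</td></tr>
            <tr><td>Title</td><td>high</td><td>8058</td></tr>
          </tbody>
        </table>
      </table-wrap>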
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>From our previous experiments we learned that, in biomedical scientific document processing,
TF-IDF or bag-of-words (BOW) representations combined with random forest or neural network (BERT)
learners achieve state-of-the-art results across different combinations of document representation.
Also, in most cases, the abstract and title had more impact on classifying a research paper than
the bibliometric data. Therefore, we used only the titles and abstracts of the research papers for
the predictive task.</p>
      <p><sup>3</sup> https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge
<sup>4</sup> https://opencitations.net/</p>
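      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Precision (P) per class for each learner and representation on the test data. The input column (Title / Both) and the recall, F1 and accuracy columns are not reproduced here.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Learner</th><th>Repr.</th><th>P (low)</th><th>P (high)</th></tr>
          </thead>
          <tbody>
            <tr><td>NN</td><td>BERT</td><td>0.71</td><td>0.77</td></tr>
            <tr><td>NN</td><td>BERT</td><td>0.69</td><td>0.71</td></tr>
            <tr><td>NN</td><td>BERT</td><td>0.77</td><td>0.73</td></tr>
            <tr><td>RF</td><td>TF-IDF</td><td>0.68</td><td>0.68</td></tr>
            <tr><td>RF</td><td>TF-IDF</td><td>0.71</td><td>0.76</td></tr>
            <tr><td>RF</td><td>TF-IDF</td><td>0.72</td><td>0.77</td></tr>
            <tr><td>RF</td><td>BOW</td><td>0.71</td><td>0.76</td></tr>
            <tr><td>RF</td><td>BOW</td><td>0.71</td><td>0.77</td></tr>
            <tr><td>RF</td><td>BOW</td><td>0.71</td><td>0.76</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-2-8">
        <p>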
Document representation. The Term Frequency – Inverse Document Frequency (TF-IDF)
weighting scheme is a widely used text representation in previous
studies. We built the TF-IDF input data table from the unigrams, bigrams and trigrams of the titles
and abstracts. A binary representation of the input data table (BOW), over the same features, was also
included for comparison.</p>
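        <p>The two input data tables described above can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function names, the toy documents and the smoothed IDF formula are assumptions made for the sketch and are not taken from the implementation used in this work.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max from a lower-cased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_tables(docs, n_max=3):
    """Build TF-IDF and binary BOW tables over one shared n-gram vocabulary."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each n-gram.
    df = {g: sum(1 for c in counts if g in c) for g in vocab}
    # Smoothed IDF, one common variant (an assumption, not the paper's formula).
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    tfidf = [[c[g] * idf[g] for g in vocab] for c in counts]
    # BOW here is binary: 1 if the n-gram occurs in the document, else 0.
    bow = [[1 if g in c else 0 for g in vocab] for c in counts]
    return vocab, tfidf, bow


# Toy stand-ins for concatenated titles and abstracts (hypothetical data).
docs = [
    "covid vaccine trial results",
    "covid transmission model",
]
vocab, tfidf, bow = build_tables(docs)
```
]]></preformat>
        <p>Both tables share the same columns, so a learner can be trained on either of them without any change to the feature set, which is what makes the comparison between TF-IDF and BOW direct.</p>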
        <p>
          Next, we employed an embedding-based representation that is regarded as cutting-edge among NLP language models, BERT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which was pre-trained on English Wikipedia and BooksCorpus. We used the BERT tokenizer on the same collection of titles and abstracts.
        </p>
        <p>
          Machine learning algorithms. We used the random forest implementation from the scikit-learn library, with the same hyperparameter optimization as Beranová et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The parameters were tuned individually for every input data table (TF-IDF and BOW), with accuracy as the optimization criterion. Next, we used a simple feed-forward network from the PyTorch library over the BERT representation.
        </p>
        <p>
          Explanation algorithm. The LIME (Local Interpretable Model-agnostic Explanations) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] tool
shows which feature values affected a given prediction, and how. The explanation is obtained by
locally replacing the explained model with an interpretable one. It can only be considered
approximate, because the LIME model is built by perturbing the explained instance: the feature
values are varied and the effect of each individual feature change on the prediction is observed.
        </p>
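        <p>The perturbation idea can be illustrated with a short sketch. It is deliberately not the LIME library: LIME fits a locally weighted linear surrogate model, whereas this simplified stand-in only compares the mean prediction over random perturbations that keep a word with the mean over those that drop it. The function names and the toy classifier are hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def explain(text, predict, n_samples=500, seed=0):
    """Estimate each word's effect on predict(text) by random word removal.

    Assumes the words of the text are distinct.
    """
    rng = random.Random(seed)
    words = text.split()
    present = {w: [] for w in words}  # predictions when the word is kept
    absent = {w: [] for w in words}   # predictions when the word is removed
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]
        sample = " ".join(w for w, keep in zip(words, mask) if keep)
        score = predict(sample)
        for w, keep in zip(words, mask):
            (present if keep else absent)[w].append(score)
    effects = {}
    for w in words:
        if present[w] and absent[w]:
            effects[w] = (sum(present[w]) / len(present[w])
                          - sum(absent[w]) / len(absent[w]))
    # Largest absolute effect first, as (word, weight) pairs.
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)


def toy_predict(text):
    """Hypothetical classifier: score for 'high' rises with the word 'vaccine'."""
    return 0.9 if "vaccine" in text.split() else 0.2


ranking = explain("novel vaccine structures", toy_predict)
```
]]></preformat>
        <p>The result is a list of (word, weight) pairs, the same shape as the explanation strings stored in the graph. Because the weights come from random perturbations, they are approximate and depend on the sample size and the seed.</p>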
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>
        We randomly split the data into 70% for training and 30% for testing. The overall accuracy was
used to evaluate the results, but we also computed the per-class accuracy. Table 2 shows the
precision, recall, F1 score and accuracy (per class and average) of the neural network (BERT) and
random forest approaches on the test data. As we see, random forest, a traditional general-purpose
machine learning algorithm, performs about as well as the neural network (BERT). This is not so surprising,
since in some other reported cases the difference in performance between BERT, TF-IDF
and BOW was also relatively small [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Superior neural network prediction could
possibly be achieved by training domain-specific language models. However, the
TF-IDF and BOW representations are quicker to create, and they enable the use of machine
learning techniques that are inherently interpretable, so that the generated models remain
interpretable.
      </p>
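      <p>The evaluation protocol can be sketched in a few lines. The sketch below assumes a plain random 70/30 split and the usual definitions of per-class precision, recall and F1; the function names and the toy labels are hypothetical and the numbers are not results from this work.</p>
      <preformat><![CDATA[
```python
import random


def split_70_30(items, seed=0):
    """Shuffle a list and split it into 70% training and 30% test data."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def per_class_metrics(y_true, y_pred, labels=("low", "high")):
    """Return precision, recall and F1 per class, plus overall accuracy."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"P": precision, "R": recall, "F1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return report, accuracy


# Toy labels and predictions (hypothetical, for illustration only).
y_true = ["high", "high", "low", "low", "high", "low"]
y_pred = ["high", "low", "low", "low", "high", "high"]
report, accuracy = per_class_metrics(y_true, y_pred)
train, test = split_70_30(list(range(10)))
```
]]></preformat>
      <p>Reporting the metrics per class matters here because an overall accuracy alone would hide a model that favours one of the two citation categories.</p>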
      <p>As regards the RDF output, for every research paper from the Covid-on-the-Web KG we store the
predicted citation rate category (high or low) together with the citation count from OpenCitations
and the LIME-based interpretation. The classified data from the Covid-on-the-Web corpus is available
in the GitHub repository<sup>5</sup>. An example follows<sup>6</sup>; the LIME-based
explanation (stored just as a long string in lexinfo:explanation) is shown truncated:</p>
      <preformat>&lt;https://cimple.vse.cz/covid-on-the-web/10.1016/j.ymeth.2005.05.008&gt;
  a fabio:ResearchPaper, bibo:AcademicArticle, schema:ScholarlyArticle;
  bibo:doi "10.1016/j.ymeth.2005.05.008"; cito:Citation 93;
  &lt;https://cimple.vse.cz/covid-on-the-web/expCitationRate&gt; high;
  lexinfo:explanation "('novel', -0.030867), ('structures', -0.025789), ...";
  schema:url &lt;https://doi.org/10.1016/j.ymeth.2005.05.008&gt;.</preformat>
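      <p>A record in the layout of the example above could be produced by plain string formatting, as in the following sketch. The helper names are hypothetical, prefix declarations are omitted as in the example, and the sketch is not the code from the repository; it only reproduces the properties visible in the example.</p>
      <preformat><![CDATA[
```python
BASE = "https://cimple.vse.cz/covid-on-the-web/"


def format_explanation(pairs):
    """Render (word, weight) pairs as the long explanation string."""
    return ", ".join(f"('{word}', {weight:.6f})" for word, weight in pairs)


def paper_to_turtle(doi, citation_count, rate, explanation):
    """Format one paper record in the layout of the example in the text."""
    # Escape backslashes and double quotes so the string literal stays intact.
    literal = explanation.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"<{BASE}{doi}>",
        "  a fabio:ResearchPaper, bibo:AcademicArticle, schema:ScholarlyArticle;",
        f'  bibo:doi "{doi}"; cito:Citation {citation_count};',
        f"  <{BASE}expCitationRate> {rate};",
        f'  lexinfo:explanation "{literal}";',
        f"  schema:url <https://doi.org/{doi}>.",
    ]
    return "\n".join(lines)


# Only the two explanation pairs visible in the example are used here.
explanation = format_explanation([("novel", -0.030867),
                                  ("structures", -0.025789)])
turtle = paper_to_turtle("10.1016/j.ymeth.2005.05.008", 93, "high", explanation)
print(turtle)
```
]]></preformat>
      <p>For production use, an RDF library that validates terms and handles prefixes would be a safer choice than string formatting.</p>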
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>We have made an initial exploration of augmenting a research-oriented KG with the predicted
impact of the underlying papers, obtained via machine learning.</p>
      <p>Our next step will be to evaluate this simple approach in the context of more comprehensive
support for users, in particular fact checkers, in getting access to scientific literature and its
authors. As regards the actual predictive ML technology, the BERT model, having been
trained merely on general textual data (English Wikipedia and the BooksCorpus), did not outperform
classical ML models in this first attempt. We assume, however, that it would work better if trained on
domain-specific data such as biomedical research papers. We are also considering incorporating external
KGs, such as encyclopaedic ones (DBpedia, Wikidata), into the learning process.</p>
      <p><sup>5</sup> https://github.com/corei5/Enhancement-of-the-Covid-on-the-Web
<sup>6</sup> Prefixes for common vocabularies omitted; they can be retrieved via https://prefix.cc.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgments</title>
      <p>This research is supported by the CIMPLE project (CHIST-ERA-19-XAI-003). The authors
would also like to thank Sören Auer and the Open Research Knowledge Graph (ORKG) group for
providing valuable feedback and further ideas for extending this work with other research in
the ORKG environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kismihók</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Knowledge Capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Deagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCusker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fateye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stouffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Brinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Schadler</surname>
          </string-name>
          ,
          <article-title>FAIR and interactive data graphics from a scientific knowledge graph</article-title>
          ,
          <source>Scientific Data</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Beranová</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Joachimiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kliegr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rabby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sklenák</surname>
          </string-name>
          ,
          <article-title>Why was this cited? explainable machine learning applied to covid-19 research literature</article-title>
          ,
          <source>Scientometrics</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ah-Kane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bobasheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gazzotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giboin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mayer</surname>
          </string-name>
          , et al.,
          <article-title>Covid-on-the-web: Knowledge graph and services to advance covid-19 research</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2020</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mujahid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rustam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Washington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Reshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis and topic modeling on tweets about online education during covid-19</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>8438</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>