<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classifying Scientific Topic Relationships with SciBERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessia Pisu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Livio Pompianu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Salatino</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Osborne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Riboni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Reforgiato Recupero</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Business and Law, University of Milano Bicocca</institution>
          ,
          <addr-line>IT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current AI systems, including smart search engines, recommendation systems, tools for streamlining literature reviews, and interactive question-answering platforms, are becoming indispensable for researchers navigating the vast landscape of scientific knowledge. Taxonomies and ontologies of research topics are key to this process, but creating them manually is costly and often yields outdated results. This poster paper explores the use of the SciBERT model to automatically generate research topic ontologies. Our model excels at identifying semantic relationships between research topics, outperforming traditional methods. This approach promises to streamline the creation of accurate and up-to-date ontologies, enhancing the effectiveness of AI tools for researchers.</p>
      </abstract>
      <kwd-group>
        <kwd>Research Topics</kwd>
        <kwd>Ontology Generation</kwd>
        <kwd>Language Models</kwd>
        <kwd>Knowledge Graph Generation</kwd>
        <kwd>SciBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The current generation of AI technologies, such as smart search engines, recommendation
systems, and question-answering applications, significantly aids researchers in exploring and
interpreting scientific literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite this, the rapid growth of scientific publications,
increasing by about 2.5 million papers annually [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], poses a substantial challenge. Although
large language models have revolutionised natural language processing (NLP) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], they still
encounter limitations in processing extensive volumes of text and in understanding the broader
context of a research area.
      </p>
      <p>
        To address this, scientific knowledge graphs (SKGs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as SemOpenAlex, AIDA-KG,
ORKG, and CS-KG, have become increasingly popular, providing structured and formal representations
of research publications.
      </p>
      <p>
        Research topics are essential for describing research concepts within SKGs, making ontologies
of research topics (e.g., MeSH, UMLS, CSO, NLM) crucial for organising and querying academic
information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Altogether, they empower intelligent systems to efficiently navigate and
understand academic literature, including advanced search engines, interactive conversational
agents, analytics dashboards, and academic recommender systems.
      </p>
      <p>
        However, manually creating ontologies of research topics is costly and time-consuming,
often resulting in outdated representations. To address this challenge, several approaches have
been proposed, including the integration of ontology learning with crowdsourcing methods,
combining statistical analysis with user feedback [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or utilising citation-based clustering
of research papers to infer research topics from the titles and abstracts of documents within
clusters [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Another approach is Klink-2 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which produced the Computer Science Ontology
(CSO) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a widely adopted resource with about 14K topics and 159K semantic relationships.
      </p>
      <p>In the same direction, this poster paper explores the use of SciBERT for generating research
topic ontologies. Our goal is to develop a method that incorporates language model technology
to update CSO and construct large-scale ontologies across scientific disciplines. We developed a
model to identify four semantic relationships (supertopic, subtopic, same-as, and other) between
research topics and compared its performance to traditional feature-based solutions. Preliminary
results show that the transformer-based model significantly outperforms traditional models.
The gold standard and code are available on GitHub (https://github.com/aleessiap/LeveragingLMforGeneratingOntologies.git).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <p>In this section, we first describe the addressed task and the datasets used. We then illustrate
a traditional feature-based approach and our transformer-based technique.</p>
      <sec id="sec-2-1">
        <title>2.1. Task Definition and Datasets</title>
        <p>In this work, we address a single-label multi-class classification problem. The task is to classify
the relationship between a pair of research topics (A, B) according to four categories which are
essential for ontology generation:
• supertopic: A is a parent topic of B. E.g., ontological languages is a broader area than owl
• subtopic: A is a child topic of B. E.g., nosql is a specific area within databases
• same-as: A and B are different labels for the same concept. E.g., haptic interface and
haptic device
• other: A and B do not relate according to the above categories. E.g., blockchain and user
interfaces</p>
        <p>In this context, other can refer either to negative samples or to alternative semantic relationships
not currently considered by our method, such as partOf or contributesTo.</p>
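        <p>As a minimal illustration, the four classes and the example pairs above can be encoded as follows (a hypothetical sketch; the class-to-id mapping is our own choice, not prescribed by the paper):</p>

```python
# The four relationship classes, with the illustrative pairs from the text.
LABELS = ["supertopic", "subtopic", "same-as", "other"]
label2id = {label: i for i, label in enumerate(LABELS)}

examples = [
    ("ontological languages", "owl", "supertopic"),    # A is a parent topic of B
    ("nosql", "databases", "subtopic"),                # A is a child topic of B
    ("haptic interface", "haptic device", "same-as"),  # same concept, different labels
    ("blockchain", "user interfaces", "other"),        # none of the above
]

for a, b, rel in examples:
    print(f"({a}, {b}) -> {rel} (class id {label2id[rel]})")
```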
        <p>
          For our gold standard, we selected portions of the Computer Science Ontology [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that
have been manually checked and improved. CSO is a large ontology covering 14K research
topics, providing an extensive and fine-grained representation of Computer Science. It was
automatically generated using the Klink-2 algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] on 16 million scientific articles.
        </p>
        <p>
          CSO comprises four primary semantic relationships. Among them, superTopicOf and
relatedEquivalent essentially correspond to our superTopic and same-as relationships, respectively.
To construct the gold standard, we selected 4,713 superTopicOf triples from the CSO and
designated them as superTopic instances. Additionally, we chose 3,034 relatedEquivalent triples to
represent equivalence using the same-as relation. We also derived 4,713 subTopic relationships
by reversing the superTopic relationships. Lastly, we randomly paired topics to create 5,151
other relationships, ensuring that none of these pairs shared any of the previously identified
relationships within the CSO. The resulting gold standard dataset consists of 17,611 triples,
divided into 15,154 triples (86%) for the training set, 2,166 triples (12.3%) for the validation set,
and 291 triples (1.7%) for the test set. To prevent bias, we ensured that topic pairs in one set do
not appear in another. Moreover, each test set triple includes at least one topic not present in
the training set. These measures make the test set more challenging compared to those used
for Klink-2 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To compute the features in our feature-based method that involve linking topics
to relevant papers, we queried AIDA-KG [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a KG covering 25 million
publications linked to research topics in CSO.
        </p>
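        <p>The construction steps above can be sketched as follows; the input lists are hypothetical stand-ins for the CSO triples, and the rejection loop mirrors the constraint that negative pairs share no existing relationship in the CSO:</p>

```python
import random

# Hypothetical stand-ins for the triples selected from CSO.
cso_supertopic_of = [("machine learning", "deep learning"), ("databases", "nosql")]
cso_related_equivalent = [("haptic interface", "haptic device")]

triples = [(a, b, "supertopic") for a, b in cso_supertopic_of]
triples += [(a, b, "same-as") for a, b in cso_related_equivalent]
# subTopic instances are derived by reversing the superTopicOf pairs.
triples += [(b, a, "subtopic") for a, b in cso_supertopic_of]

# Random negative pairs ("other"), rejecting pairs already related in CSO.
related = {(a, b) for a, b, _ in triples} | {(b, a) for a, b, _ in triples}
topics = sorted({t for a, b, _ in triples for t in (a, b)})
random.seed(0)
needed = 2
while needed:
    a, b = random.sample(topics, 2)
    if (a, b) not in related:
        triples.append((a, b, "other"))
        related.update({(a, b), (b, a)})
        needed -= 1
```

        <p>At full scale this procedure yields the 17,611 triples of the gold standard, with the additional constraints on the train/validation/test split described above.</p>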
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature-based Method</title>
        <p>
          Our classification task is commonly approached by exploiting numerical features, usually
measuring the frequency and joint usage of the two topics [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Extracted feature vectors are then
classified through mathematical functions or machine learning algorithms [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We devised a
feature-based classification method using the following features for each pair of topics (A, B):
• occA: the frequency of A appearing in paper abstracts
• occB: the frequency of B appearing in paper abstracts
• cooccurrenceAB: the frequency of both A and B appearing together in abstracts
• subsumption: the degree of overlap between the co-occurring topics, computed as
subsumption = cooccurrenceAB/occA − cooccurrenceAB/occB
        </p>
        <p>The first two features indicate the popularity of a topic. The third feature quantifies the
relatedness of two topics. The fourth feature assesses the hierarchical relationship between
the topics. After normalising the features, we trained two ensemble machine learning models:
Gradient Boosting (GB) and Random Forest (RF), varying the number of estimators from 10 to
3,000 to determine the optimal configuration.</p>
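        <p>A sketch of the feature extraction and training setup, assuming illustrative counts; reading subsumption as a difference of conditional co-occurrence rates and using min-max normalisation are our assumptions:</p>

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw counts for topic pairs (A, B): occA, occB, cooccurrenceAB.
raw = [(500, 120, 90), (80, 900, 60), (300, 310, 150), (400, 20, 1)]
labels = [0, 1, 2, 3]  # supertopic, subtopic, same-as, other

def features(occ_a, occ_b, co_ab):
    # subsumption as the difference of conditional co-occurrence rates (assumed form)
    subsumption = co_ab / occ_a - co_ab / occ_b
    return [occ_a, occ_b, co_ab, subsumption]

X = MinMaxScaler().fit_transform([features(*row) for row in raw])

# Two ensemble models; the paper varies the number of estimators from 10 to 3,000.
models = [GradientBoostingClassifier(n_estimators=10, random_state=0),
          RandomForestClassifier(n_estimators=10, random_state=0)]
for model in models:
    model.fit(X, labels)
```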
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Language Model-based Method</title>
        <p>
          Our method leveraging language models relies on SciBERT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], an extension of BERT [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
which is a model highly regarded for its ability to effectively understand and process human
language. SciBERT, trained on scientific literature from Semantic Scholar, enhances BERT’s
capabilities by focusing on the scientific domain.
        </p>
        <p>
          To address our classification task we fine-tuned SciBERT using the training set described
in Section 2.1. Specifically, we used the scibert-scivocab-uncased model from Huggingface.
As optimiser, we selected AdamW [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to prevent overfitting in large models. For the
fine-tuning process, we provided the model with the surface forms of the two topics, separated by a
semicolon. For each pair of topics, we also provided the correct relationship class from the
training set. We experimented with varying the number of epochs from 1 to 10, maintaining 50
warm-up steps. Our best-performing model was achieved when training for five epochs.
        </p>
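        <p>The input encoding can be sketched without the model itself; the helper below is hypothetical, while the semicolon separator, the four classes, and the training choices noted in the comments follow the text:</p>

```python
# Encode a topic pair and its class for fine-tuning. The text uses the
# scibert-scivocab-uncased model from Huggingface with the AdamW optimiser,
# 50 warm-up steps, and reports its best model after five epochs.
LABELS = ["supertopic", "subtopic", "same-as", "other"]
label2id = {name: i for i, name in enumerate(LABELS)}

def encode_pair(topic_a, topic_b, relation):
    # Surface forms separated by a semicolon, as in the fine-tuning setup.
    text = f"{topic_a}; {topic_b}"
    return text, label2id[relation]

text, label = encode_pair("nosql", "databases", "subtopic")
```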
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>Using the test set described in Section 2.1, we evaluated the three methods outlined in the
previous section: Gradient Boosting and Random Forest (both feature-based), and SciBERT
(language model-based). We compared their performance using accuracy, precision, recall, and
F-score, which are standard metrics for text classification.</p>
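      <p>On hypothetical predictions, the four metrics can be computed as follows (macro averaging and scikit-learn are our assumptions; the paper does not state its averaging scheme):</p>

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and predictions over the four classes (0..3).
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```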
      <p>Table 1 reports the experimental results. The language model-based method was far superior
to the feature-based methods in all areas, achieving an impressive F1 score of 0.9129. This was
over 27% higher than the other methods. Among the feature-based approaches, Random Forest
performed better. The language model-based method was particularly effective in recognising
superTopic and subTopic relations, where feature-based methods struggled, likely due to the
presence of unfamiliar topics in the test set.
      <p>The language model-based method generally prioritises precision over recall, particularly
for the relations superTopic, subTopic, and same-as. However, for the other relation, it tends to
miss some semantic connections, resulting in lower precision compared to recall. This suggests
the model may incorrectly classify some related topics as other, an issue we intend to explore
further in future research.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this poster paper, we introduced a new method based on SciBERT to identify the relationship
between research topics and conducted a comparative analysis against feature-based solutions.
We fine-tuned a SciBERT model using a gold standard of triples derived from CSO. The model
achieved an F1 score of 0.9129, a 27% improvement over methods using numerical features.
These findings are significant given the growing demand for detailed ontologies to enhance
content characterisation in scientific KGs.</p>
      <p>In our future work, we aim to develop an innovative method for creating taxonomies of
research topics to improve CSO and create large-scale ontologies across different scientific
fields. We plan to combine language models and numerical features using knowledge injection
techniques and experiment with recent large language models. We also intend to explore
potential challenges when applying these techniques to other research domains and assess the
impact of cross-disciplinary applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bolanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence for literature reviews: Opportunities and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2402.08565</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bornmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mutz</surname>
          </string-name>
          ,
          <article-title>Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>66</volume>
          (
          <year>2015</year>
          )
          <fpage>2215</fpage>
          -
          <lpage>2222</lpage>
          . doi:10.1002/asi.23329.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Medenilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sillos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Leon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elepaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Madriaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aggabao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diaz-Candido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maningo</surname>
          </string-name>
          , et al.,
          <article-title>Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models</article-title>
          ,
          <source>PLOS Digital Health</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          e0000198.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseriparsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs: Opportunities and challenges</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge organization systems of research fields: Resources and challenges, 2024</article-title>
          . URL: https://arxiv.org/abs/2409.04432. arXiv:2409.04432.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wohlgenannt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weichselbraun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scharl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Dynamic integration of multiple evidence sources for ontology learning</article-title>
          ,
          <source>Journal of Information and Data Management</source>
          <volume>3</volume>
          (
          <year>2012</year>
          )
          <fpage>243</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] OpenAlex,
          <article-title>OpenAlex: End-to-end process for topic classification</article-title>
          ,
          <year>2024</year>
          . URL: https://docs.google.com/document/d/1bDopkhuGieQ4F8gGNj7sEc8WSE8mvLZS/edit.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>Klink-2: Integrating multiple web sources to generate semantic topic networks</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2015</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Thanapalasingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>The computer science ontology: a large-scale taxonomy of research areas</article-title>
          , in:
          <source>The Semantic Web - ISWC 2018: 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part II</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Angioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>Aida: A knowledge graph about research dynamics in academia and industry</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>1356</fpage>
          -
          <lpage>1398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          ,
          <year>2019</year>
          . arXiv:1903.10676.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <year>2019</year>
          . arXiv:1711.05101.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>