<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Science and Technology Ontology: A Taxonomy of Emerging Topics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahender Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruby Rani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirko Bottarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gregory Epiphaniou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Maple</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Secure Cyber Systems Research Group, WMG, University of Warwick</institution>
          ,
          <addr-line>Coventry</addr-line>
          ,
          <country country="UK">United Kingdom</country>
          ,
          <addr-line>CV4 7AL</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Ontologies play a critical role in Semantic Web technologies by providing a structured and standardized way to represent knowledge and enabling machines to understand the meaning of data. Several taxonomies and ontologies have been generated, but each typically targets a single domain, and many have proved expensive in time and manual effort. They also lack coverage of unconventional topics that would represent a more holistic and comprehensive view of the knowledge landscape and of interdisciplinary collaborations. Thus, there is a need for an ontology that covers Science and Technology and facilitates multidisciplinary research by connecting topics from different fields and domains that may be related or have commonalities. To address these issues, we present an automatically constructed Science and Technology Ontology (S&amp;TO) that covers unconventional topics in different science and technology domains. The proposed S&amp;TO can promote the discovery of new research areas and collaborations across disciplines. The ontology is constructed by applying BERTopic to a dataset of 393,991 scientific articles collected from Semantic Scholar from October 2021 to August 2022, covering four fields of science. Currently, S&amp;TO includes 5,153 topics and 13,155 semantic relations. The S&amp;TO model can be updated by running BERTopic on more recent datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Science and Technology Ontology</kwd>
        <kwd>Unconventional Topics</kwd>
        <kwd>BERTopic</kwd>
        <kwd>Scientific Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ontologies are a valuable tool for representing and organising knowledge about a specific
topic or set of topics, using a set of concepts, relationships, and rules within the domain [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
They have many applications, including data annotation and visualisation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], forecasting new
research areas [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and scholarly data discovery [5]. Some topic ontologies created in different
domains include the ACM Computing Classification System, the Physics and Astronomy Classification
Scheme (PACS), replaced in 2016 by the Physics Subject Headings (PhySH), the Mathematics
Subject Classification (MSC, https://mathscinet.ams.org/msc/msc2010.html), a taxonomy used in the field of mathematics, and the Medical
Subject Headings (MeSH, https://www.nlm.nih.gov/mesh). Creating these large-scale taxonomies is a complex and costly process
that often requires the expertise of multiple domain experts, making it a time-consuming and
resource-intensive endeavour. Consequently, these taxonomies are often difficult to update
and maintain, quickly becoming outdated as new information and discoveries emerge. As a
result, the practicality and usefulness of these taxonomies are significantly limited. One of
the most notable advancements in ontology generation is the development of a large-scale
automated ontology known as the Computer Science Ontology (CSO) [6]. CSO marks
a significant breakthrough in the representation of research topics in the computer science
domain, providing a structured and comprehensive framework for organising and integrating
knowledge, but it is limited to computer science concepts only.</p>
      <p>Research Challenge: Understanding the dynamics associated with unconventional topics,
which present a more comprehensive and holistic perspective of the knowledge landscape
and interdisciplinary collaborations, poses a considerable challenge. Constructing an ontology
for such unconventional topics requires recognising and collecting essential concepts and
relationships from multiple domains. Furthermore, unconventional topics may call for
multidisciplinary study, necessitating the integration of information from many fields. By
overcoming these challenges, researchers and academicians gain an opportunity to study
new and developing areas of science and technology and to facilitate interdisciplinary
collaboration across varied fields.</p>
      <p>Contribution. This paper presents preliminary work to construct a S&amp;TO ontology that
automatically generates a taxonomy of unconventional S&amp;T topics. S&amp;TO ontology is built by
applying BERTopic to a dataset of 393,991 scientific articles collected from Semantic Scholar
from October 2021 to August 2022, covering four fields of science: computer science, physics,
chemistry, and engineering. Currently, S&amp;TO includes 5,153 topics and 13,155 semantic
relations. Unlike existing ontologies, S&amp;TO can provide many benefits for knowledge
representation and discovery, facilitating interdisciplinary research and enabling dynamic
updates.</p>
      <p>Organisation. The rest of the paper is organized as follows. Section 2 discusses the dataset
and methods for constructing the proposed S&amp;TO. Section 3 gives the proposed S&amp;TO. The
experimental results are discussed in Section 4. Section 5 presents the applications and use cases
of S&amp;TO, and the limitations of the current version are discussed in Section 6. Finally, the
conclusion is given in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Semantic Scholar</title>
        <p>Semantic Scholar hosts many academic publications from various fields, including medical
sciences, agriculture, geoscience, biomedical literature, and computer science. We used the RESTful
Semantic Scholar Academic Graph (S2AG) API to retrieve a sample of these articles [7]. This
API offers users on-demand knowledge of authors, papers, titles, citations, venues, and more.
We obtained 393,991 Science and Technology articles from Semantic Scholar using the S2AG
API. The API provides a dependable data source that allows users to link directly to the related
page on semanticscholar.org, making it a convenient and accessible way to retrieve information
about academic papers.</p>
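        <p>As a sketch, a request to the S2AG search endpoint can be assembled as follows; the endpoint path is the public one documented by Semantic Scholar, while the query string, requested fields, and helper name are illustrative assumptions (no network call is made):</p>

```python
from urllib.parse import urlencode

# Public S2AG paper-search route; the query, fields, and limit below are
# illustrative, not the paper's actual retrieval configuration.
BASE = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query, fields=("title", "abstract", "year"), limit=100):
    """Return the request URL for a paper search; issuing it is left to the caller."""
    params = {"query": query, "fields": ",".join(fields), "limit": limit}
    return f"{BASE}?{urlencode(params)}"

url = build_search_url("topic modelling")
print(url)
```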
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methodologies</title>
        <p>After downloading the dataset, we used the BERTopic method to obtain topics from the
articles; some articles representing multiple disciplines would otherwise be discarded as outliers. To
retain these unconventional articles and reduce the outlier percentage, we adjusted parameters
of BERTopic during topic clustering [8]. Table 3 lists the critical BERTopic parameters used
in the taxonomy generation.</p>
        <p>As shown in Figure 1, our suggested topic modelling workflow consists of five important steps:
sentence embedding, dimension reduction, clustering, vectorisation, and topic representation.
The Sentence Embedding stage, in particular, involves turning textual input into numerical
vectors that capture the underlying semantics of the text. Dimension Reduction is then applied
to the vectors to lower their dimensionality and improve the effectiveness of the clustering
procedure. Clustering is the process of combining similar vectors to generate coherent clusters
of related material. Vectorisation ensures that the extracted topics are of high quality, while
Topic Representation creates an interpretable summary of each topic. Overall, we provide a
robust and effective approach to extracting meaningful unconventional topics from vast and
heterogeneous datasets, utilising the power of BERTopic and careful parameter optimisation to
assure optimal outcomes.</p>
        <p>1. Sentence Embedding: We first transformed the input articles into numerical
representations before analysing them. For this purpose, we utilised sentence
transformers, the default embedding model used by BERTopic. This model can determine the
semantic similarity of different documents. BERTopic provides many pretrained
models by default; among them we tried the following two: "all-MiniLM-L6-v2" and
"paraphrase-MiniLM-L12-v2". While various sentence embedding models are
available, we opted for the "paraphrase-MiniLM-L12-v2" model in this work. This model
effectively balances performance and speed, making it a good fit for our requirements.
Thus, we can effectively translate textual data into numerical form and obtain relevant
insights from large and diverse datasets using sentence transformers in conjunction with
BERTopic.
2. Dimension Reduction: Clustering can be complex since embeddings are often
high-dimensional. To solve this problem, the dimensionality of the embeddings is frequently
reduced to a more practical level. We used the UMAP (Uniform Manifold Approximation
and Projection) technique, which represents local and global high-dimensional features in a
lower-dimensional domain [9]. "n_neighbors" and "n_components" are two important
parameters in the UMAP method. These parameters have a considerable impact on the
size of the generated clusters; larger values, in particular, result in the
formation of larger clusters. We obtained optimal clustering results and extracted
important topics from the input data by carefully tweaking these parameters.
3. Clustering: BERTopic splits the input data into clusters of similar embeddings after the
dimensionality reduction process. The clustering technique's accuracy directly impacts
the quality of the generated topics. K-means [10], Hierarchical Density-Based Spatial
Clustering of Applications with Noise (HDBSCAN) [11], and Agglomerative Clustering
[12] are among the clustering techniques provided by BERTopic. The advantages and
drawbacks of these clustering algorithms are summarised in Table 2, emphasising their
capacity to generate high-quality topics, manage outlier percentages, and limit the danger
of missing unconventional topics. HDBSCAN is a density-based clustering algorithm used
to find clusters of varying densities in a dataset. It works by constructing a hierarchical
tree of clusters based on the density of the data points. It starts by identifying the points
with the highest density and forming a cluster around them. Then, it gradually adds
lower-density points to the cluster until a natural cutoff is reached, indicating the end of
the cluster. According to our findings, HDBSCAN with the prediction_data parameter
set to "True" was the best option. Our method efficiently balances the above elements,
allowing us to obtain meaningful and valuable insights from large and complex datasets.
4. Vectorisation: The CountVectorizer technique turns text documents into vectors of
phrase frequencies. However, it has significant drawbacks, such as failing to consider a
specific topic's relative relevance in an article. To fix this issue, we adopted c-TF-IDF
(Class-Based TF-IDF), a variant of the classic TF-IDF (Term Frequency-Inverse Document
Frequency) method that allocates weights to terms depending on their relevance to a
specific class of documents. We could fine-tune the model's performance in BERTopic by
adjusting its parameters to optimise the clustering process using CountVectorizer with
c-TF-IDF. This method made it possible to create higher-quality topics that are more
interesting and pertinent to the input data.
5. Topic Representation: BERTopic can adjust TF-IDF to work at the cluster level instead
of the document level to obtain a concrete representation of topics from the bag-of-words
matrix. This modified TF-IDF is called c-TF-IDF. For word x in class c, the c-TF-IDF value
is:
W(x, c) = tf(x, c) × log(1 + A / f(x)) (1)
where tf(x, c) denotes the frequency of word x in class c, f(x) denotes the frequency of
word x across all classes, and A denotes the average number of words per class.</p>
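        <p>The c-TF-IDF weighting in Eq. (1) can be sketched in a few lines of Python; the function name and the toy class/token data below are our own illustration, not the paper's implementation:</p>

```python
import math
from collections import Counter

# Sketch of Eq. (1): W(x, c) = tf(x, c) * log(1 + A / f(x)), where tf(x, c) is
# the frequency of word x in class c, f(x) its frequency across all classes,
# and A the average number of words per class.
def c_tf_idf(classes):
    """classes: dict mapping class id -> list of tokens; returns {(word, cls): weight}."""
    total_freq = Counter(tok for toks in classes.values() for tok in toks)
    avg_words = sum(len(toks) for toks in classes.values()) / len(classes)
    weights = {}
    for cls, toks in classes.items():
        for word, count in Counter(toks).items():
            weights[(word, cls)] = count * math.log(1 + avg_words / total_freq[word])
    return weights

# Toy data: "graph" is concentrated in class c1, so it is weighted more
# heavily there than the cross-class word "topic".
w = c_tf_idf({"c1": ["graph", "graph", "topic"], "c2": ["topic", "model"]})
```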
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Science and Technology Ontology Generation</title>
      <p>To create a topic network, also known as a knowledge graph, the metadata provided by
Semantic Scholar is utilised. The construction of topic ontologies involves the definition of the
following components:
• Topics: concepts of the topic ontology (e.g. Sports, Arts, Politics).
• Predicates: kinds of relationships that define the semantic link established between the
ontology concepts. Many predicates can be defined in topic ontologies: hierarchical (e.g.
superTopicOf) and non-hierarchical (e.g., part of, contribute to).
• Relationships: according to predicates and the set of elements they link, relationships are
distinguished. They can be used to characterise the paths in the graphs and denoted as a
triplet (T1, P, T2), where T1 and T2 denote the topics, and P denotes the predicate that
links T1 and T2.</p>
      <sec id="sec-3-1">
        <title>3.1. Topics</title>
        <p>The KeyBERT tool is used on the associated publications to extract keywords representing the
essential concepts and topics within each document to produce the vocabulary for BERTopic.
These keywords are then sent into BERTopic, which generates a complete collection of topics
that capture the overarching topic found in the dataset. The extracted topics are saved in a
database’s "topic_nets_topics" and "topic_nets_topics keywords" (see Figure 2). Each topic’s
weight denotes the number of papers for which it serves as the main association, showing its
relevance within the dataset.</p>
        <p>There are two techniques to establish the relationship between papers and topics:
1. Probability: First, each paper's BERTopic/HDBSCAN probabilities are saved as entries in
the "papers_topics" table. These probabilities indicate how closely each document relates
to each extracted topic.
2. Embedding similarity: Second, using the "main_topic_id" field, the major topic associated
with each paper, as identified by BERTopic, is directly linked to the paper using a SQL
trigger. This allows for efficient querying and analysis of the topics and papers related to
them within the corpus.</p>
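        <p>The second technique can be sketched as follows, using in-memory rows in place of the "papers_topics" table and its SQL trigger; the rows and the helper name are invented for illustration:</p>

```python
# Each row mirrors Table 7 ("corpus_id", "topic_id", "probability"); the
# values themselves are made up for this example.
papers_topics = [
    {"corpus_id": 1, "topic_id": 10, "probability": 0.7},
    {"corpus_id": 1, "topic_id": 11, "probability": 0.2},
    {"corpus_id": 2, "topic_id": 11, "probability": 0.9},
]

def main_topic(corpus_id, rows):
    """Return the topic id with the highest probability for a paper,
    i.e. what the SQL trigger would record as the paper's main topic."""
    candidates = [r for r in rows if r["corpus_id"] == corpus_id]
    return max(candidates, key=lambda r: r["probability"])["topic_id"]

print(main_topic(1, papers_topics))  # paper 1's strongest association
```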
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relationship</title>
        <p>The next step is to create topic networks using the relationships among topics. Currently, the
S&amp;TO ontology is built on 393,991 scientific papers collected from Semantic Scholar from
October 2021 to August 2022. It covers four science fields: computer science, physics, chemistry
and engineering. S&amp;TO follows the SKOS data model and includes the following
semantic relationships:
• "relatedIdentical": a sub-property of skos:related, it denotes that two topics can be
viewed as identical for assessing research topics. The similarity between topics is
calculated as cosine similarity in the SQL stored procedure create_topic_nets. The relationship
between topics is established if the similarity is above the threshold of 0.9.
• "superTopicOf": a sub-property of skos:narrower, it means that a
topic is a super-area of another topic in the graph. For example,
"streaminmg_rsi_retrieval_streaming_regression" is the super-topic of the topics with topic_ids
78 and 101, as shown in Figure 3.
• "CommonArticles": it extracts common articles that appear in the two topics. The link
between two topics is evaluated as the sum of the probability distribution of common
articles assigned to the topics.
• "nSimilarTopics": it returns the top x most similar topics for an input keyword.</p>
        <p>For instance, the top 5 similar topics related to the keyword "motor" are shown in Table 3
(SKOS Simple Knowledge Organization System: http://www.w3.org/2004/02/skos).</p>
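        <p>A minimal sketch of the "relatedIdentical" check follows, assuming topic embeddings are available as plain vectors; the paper computes the same cosine similarity in a SQL stored procedure, and the vectors below are invented:</p>

```python
import math

# Two topics are linked by "relatedIdentical" when the cosine similarity of
# their embeddings exceeds the 0.9 threshold described in the text.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def related_identical(t1, t2, emb, threshold=0.9):
    """Return the (T1, predicate, T2) triplet if the threshold is met, else None."""
    if cosine(emb[t1], emb[t2]) > threshold:
        return (t1, "relatedIdentical", t2)
    return None

# Invented embeddings: topic_a and topic_b point the same way; topic_c does not.
emb = {
    "topic_a": [0.9, 0.1, 0.4],
    "topic_b": [0.85, 0.15, 0.42],
    "topic_c": [0.0, 1.0, 0.0],
}
print(related_identical("topic_a", "topic_b", emb))  # near-parallel vectors link
print(related_identical("topic_a", "topic_c", emb))  # dissimilar topics do not
```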
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results and Discussion</title>
      <p>In the literature [13], ontologies have been evaluated by four methods: gold-standard-based [14],
corpus-based [15], application-based [16], and structure-based methods [17]. The
gold-standard-based method compares the developed ontology with a reference ontology developed earlier.
The corpus-based method compares the developed ontology with the contents of a
text corpus that covers a given domain. The application-based approach considers applications
and evaluations according to their performance across use cases. The structure-based approach
quantifies structure-based properties such as size and ontology complexity.</p>
      <p>Selecting the best evaluation approach and defining the rationale behind evaluating a
developed ontology is necessary. In the proposed study, the science and technology ontology is in the
early stages of development and will grow in future work. Thus, the application-based approach
is not a good fit, because the proposed ontology is not yet mature enough for
application purposes. The proposed ontology is developed
on a subset of the Semantic Scholar dataset; thus, the natural reference ontology would be Semantic
Scholar's own. However, using Semantic Scholar as the gold-standard ontology is impractical due to
its unavailability.</p>
      <p>Structure-based evaluation is performed on several measures, including knowledge coverage
and popularity measures (i.e., number of properties and classes) and structural measures (i.e.,
maximum depth, minimum depth, and depth variance). These measures are adopted based on
the belief that densely populated ontologies with high depth and breadth variance are more
likely to result in meaningful semantic content. Structural metrics are related to the semantic
accuracy of adaptively modelled knowledge in the ontology [19].</p>
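      <p>The structural measures named above can be sketched over a toy superTopicOf hierarchy; the hierarchy and function names are illustrative, not drawn from S&amp;TO itself:</p>

```python
from statistics import pvariance

# Toy superTopicOf hierarchy (invented): maximum depth, minimum depth, and
# depth variance are computed over the depths of its leaf topics.
children = {
    "science": ["physics", "chemistry"],
    "physics": ["optics"],
    "chemistry": [],
    "optics": [],
}

def leaf_depths(node, depth=0):
    """Depths of all leaf topics reachable from node via superTopicOf edges."""
    if not children[node]:
        return [depth]
    return [d for c in children[node] for d in leaf_depths(c, depth + 1)]

depths = leaf_depths("science")
print(max(depths), min(depths), pvariance(depths))
```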
      <p>In the context of the proposed S&amp;TO ontology, we quantified some structural measures by
considering its taxonomic structure. S&amp;TO comprises 5,153 topics and 13,155 semantic
relations (of which 8,052 relations are based on cosine similarity and 5,103 on
probability distribution). Figure 4 shows an example of a knowledge graph covering topics and
semantic relationships. We used Neo4j as the graph database to host the final ontology [18].
Links in green indicate semantic relationships assessed using cosine similarity, while links in
yellow indicate relationships based on the probability distribution of papers assigned to topics.
S&amp;TO covers the maximum number of articles for topic clustering and leaves only 15.47% of
articles as outliers, enabling the extraction of topics belonging to unconventional articles.</p>
      <sec id="sec-4-1">
        <title>4.1. Topics Details</title>
        <p>This section discusses the structure of topic networks, topics in topic networks and keywords
related to topics in the network.</p>
        <p>The "topic_nets" table (Table 4) gives information about the development and status of topic
modelling networks. Each network is assigned a unique identifier known as a "topic_net_id,"
its primary key. The "created_on" parameter specifies the date and time the network was
established. The "status" field offers information on the network's current status, which might
take various values. This field assists in tracking the process's progress and ensuring that
all networks are correctly generated and assessed. In addition, the "year_month" parameter
records the month the network was created. This feature is beneficial for tracking the temporal
evolution of themes within the corpus since it allows researchers to understand how topics and
their associations change over time.</p>
        <p>The "topic_nets_topics" table (Table 5) provides essential information about the topics
associated with the topic networks. Each topic has a unique identifier known as a "topic_id", its
primary key. Each topic links to the corresponding network in "topic_nets" using a unique
identifier known as a "topic_net_id", its foreign key. A descriptive label is assigned to the topic
based on the most common terms in the associated papers. Each topic has a "topic_weight",
which records the number of related papers, reflecting its importance and relevance within the
corpus. It also stores the "embedding", which is used for cosine similarity calculations, and
"similar_topics", an array of topic ids related to searched topics.</p>
        <p>The "topic_nets_topics_keyword" table (Table 6) provides essential information about the
keywords associated with each topic, represented by a unique identifier known as a "topic_id",
which is the primary key. It stores fields such as "number", representing the topic number;
"row", an auto-incremented number; and "keyword", representing the name of the keyword.
In addition, "score" represents a weight associated with the keyword for that topic.</p>
        <p>The "papers_topics" table (Table 7) illustrates the relationship between academic papers and
the topics they cover. The "corpus_id" denotes the unique identifier of the corpus from which
the papers were extracted. The "topic_id" represents the unique identifier of the topic that the
paper covers, making it easier to track papers within that topic. The "probability" represents
the weight of the paper assigned to the topic. This weight indicates the degree to which the
paper covers the topic. The probability value is normalised, i.e., scaled to a range
between 0 and 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Relation Details</title>
        <p>This section discusses the structure of topic relations, such as information on edges and
similarities among topics.</p>
        <p>The "topic_nets_topics_edges" table (Table 8) stores the information on edges among the topics
in "topic_nets_topics" within the network "topic_nets". An edge is established between two topics
represented by their unique identifiers (i.e., "topic_id1" and "topic_id2"). "edge_weight" is the
sum of probabilities (the "probability" field of Table 7) of papers shared between the two topics. The
strength of collaboration "str_of_col" represents a weight computed as the harmonic mean and
normalised based on the topic weights, as shown in Eq. (2).</p>
        <p>str_of_col = HM( edge_weight(T1, T2) / topic_weight(T1), edge_weight(T1, T2) / topic_weight(T2) ) (2)
where HM denotes the harmonic mean.</p>
        <p>The "topic_nets_topics_similarity" table (Table 9) stores the information on edges among the
topics based on their similarity.</p>
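        <p>Eq. (2) can be sketched directly; the numbers are invented for illustration:</p>

```python
from statistics import harmonic_mean

# Strength of collaboration per Eq. (2): the harmonic mean of the shared edge
# weight normalised by each topic's own weight.
def str_of_col(edge_weight, topic_weight_1, topic_weight_2):
    return harmonic_mean([edge_weight / topic_weight_1, edge_weight / topic_weight_2])

s = str_of_col(edge_weight=4.0, topic_weight_1=10.0, topic_weight_2=40.0)
# ratios 0.4 and 0.1 -> harmonic mean 2 / (1/0.4 + 1/0.1) = 0.16
print(s)
```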
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Advantages and Use-cases</title>
      <p>The proposed S&amp;TO with unconventional topics could have the following advantages and use
cases.</p>
      <sec id="sec-5-1">
        <title>5.1. Knowledge expansion</title>
        <p>S&amp;TO can broaden the scope of knowledge representation beyond existing ontologies by
incorporating previously unconsidered topics. Offering a more holistic and comprehensive
view of the knowledge landscape can lead to new insights and discoveries. In medical research,
unconventional topics like holistic therapies or mindfulness practice might be incorporated into
the ontology to provide a more comprehensive view of the more extensive health and wellness
landscape [20].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Interdisciplinary Collaboration</title>
        <p>Second, by connecting topics from diversified fields that may have commonalities or be
connected, an unconventional-topics ontology encourages interdisciplinary collaboration. This can
encourage the discovery of new research areas and cross-disciplinary cooperation, leading to
novel solutions to complicated issues. For example, an ontology incorporating computer science
and psychology topics could make it easier for academics in both domains to collaborate on
human-computer interaction or affective computing [21].</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Scalability and Adaptability</title>
        <p>
          An unconventional-topics ontology has the benefit of being easily updatable and adaptable to
reflect the most contemporary developments and topics, resulting in a dynamic and flexible
knowledge representation system. This capability is significant in fast-paced sectors like
technology and healthcare, where new topics and concepts develop regularly. For example, an
ontology that includes topics relating to emerging technologies such as artificial intelligence
[22] or blockchain [
          <xref ref-type="bibr" rid="ref5">23</xref>
          ] can be easily updated to include new concepts and trends.
        </p>
      </sec>
    </sec>
    <sec id="sec-5-4">
      <title>6. Limitations</title>
      <p>The current version of the proposed S&amp;TO ontology has the following limitations.</p>
      <sec id="sec-5-5">
        <title>6.1. Limited dataset</title>
        <p>The current version of the S&amp;TO ontology is built on the Semantic Scholar dataset covering
393,991 S&amp;T articles from October 2021 to August 2022. However, it could be built on more
datasets.</p>
      </sec>
      <sec id="sec-5-6">
        <title>6.2. Topic labelling</title>
        <p>Since the S&amp;TO ontology utilises BERTopic, an unsupervised topic computation library, the
ontology suffers from the consequences of unlabelled topics. Due to a lack of labelled data, it may be
challenging to determine the significance and relevance of unconventional topics and distinguish
them from noise or irrelevant topics. This is especially challenging when working with massive,
complicated datasets containing various topics.</p>
      </sec>
      <sec id="sec-5-7">
        <title>6.3. Domain coverage</title>
        <p>S&amp;TO can capture various research topics and domains by covering these four domains:
computer science, physics, chemistry and engineering. However, many other disciplines and
subfields within science and technology are not yet included in S&amp;TO. For example, biology,
environmental science, and neuroscience are all essential areas of research that could be
integrated into the ontology to create a more comprehensive and multidisciplinary framework
for understanding scientific research. Expanding the coverage of S&amp;TO to include additional
domains would have several potential benefits.</p>
      </sec>
      <sec id="sec-5-8">
        <title>6.4. Topic quality</title>
        <p>While S&amp;TO represents a significant effort towards organising and categorising S&amp;T topics, there
is still room for improvement regarding the quality of the topics included in the ontology.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion and Future Work</title>
      <p>This paper introduced S&amp;TO, an automated ontology of science and technology covering scientific
study topics. We constructed an ontology encompassing four different
science domains by utilising BERTopic on a collection of 393,991 scientific articles acquired from
Semantic Scholar from October 2021 to August 2022. S&amp;TO can be updated by running BERTopic on
recent datasets, offering a dynamic and flexible foundation for knowledge representation. S&amp;TO
has the potential to broaden the scope of knowledge representation and stimulate
interdisciplinary collaboration, making it a valuable resource for scientists and technologists.</p>
      <p>The S&amp;TO ontology constantly evolves and requires ongoing enhancements to meet the
expanding knowledge landscape’s demands. Currently, S&amp;TO is growing, and we are employing
topic labelling techniques to improve the organisation and comprehension of diferent topics
by giving them meaningful "tags". This makes it easy for users to browse the ontology and
derive valuable insights. Furthermore, we intend to improve the topic quality by investigating
additional methodologies and algorithms for topic modelling and clustering. This will strengthen
the ontology’s accuracy and eficacy in describing the knowledge landscape. In addition, we
intend to expand the ontology to a more extensive dataset, allowing for the inclusion of more
unconventional categories and topics, which will improve and diversify the knowledge base.
[5] S. Fathalla, S. Vahdati, S. Auer, C. Lange, Semsur: a core ontology for the semantic
representation of research findings, Procedia Computer Science 137 (2018) 151–162.
[6] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, E. Motta, The computer
science ontology: a large-scale taxonomy of research areas, in: The Semantic Web–ISWC
2018: 17th International Semantic Web Conference, Monterey, CA, USA, October 8–12,
2018, Proceedings, Part II 17, Springer, 2018, pp. 187–205.
[7] A. D. Wade, The semantic scholar academic graph (s2ag), in: Companion Proceedings of
the Web Conference 2022, 2022, pp. 739–739.
[8] M. Grootendorst, Bertopic: Neural topic modeling with a class-based tf-idf procedure,
arXiv preprint arXiv:2203.05794 (2022).
[9] L. McInnes, J. Healy, J. Melville, Umap: Uniform manifold approximation and projection
for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[10] J. A. Hartigan, M. A. Wong, Algorithm as 136: A k-means clustering algorithm, Journal of
the royal statistical society. series c (applied statistics) 28 (1979) 100–108.
[11] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering., J. Open</p>
      <p>Source Softw. 2 (2017) 205.
[12] D. Müllner, Modern hierarchical, agglomerative clustering algorithms, arXiv preprint
arXiv:1109.2378 (2011).
[13] M. Fernández, C. Overbeeke, M. Sabou, E. Motta, What makes a good ontology? a
casestudy in fine-grained knowledge reuse, in: The Semantic Web: Fourth Asian Conference,
ASWC 2009, Shanghai, China, December 6-9, 2009. Proceedings 4, Springer, 2009, pp.
61–75.
[14] A. Maedche, S. Staab, Measuring similarity between ontologies, in: Knowledge Engineering
and Knowledge Management: Ontologies and the Semantic Web: 13th International
Conference, EKAW 2002 Sigüenza, Spain, October 1–4, 2002 Proceedings 13, Springer,
2002, pp. 251–263.
[15] C. Brewster, H. Alani, S. Dasmahapatra, Y. Wilks, Data driven ontology evaluation (2004).
[16] M. Sabou, J. Gracia, S. Angeletou, M. d’Aquin, E. Motta, Evaluating the semantic web: A
task-based approach, in: The Semantic Web: 6th International Semantic Web Conference,
2nd Asian Semantic Web Conference, ISWC 2007+ ASWC 2007, Busan, Korea, November
11-15, 2007. Proceedings, Springer, 2007, pp. 423–437.
[17] P. Buitelaar, T. Eigner, T. Declerck, OntoSelect: A dynamic ontology library with support
for ontology selection, in: Proc. of the demo session at the International Semantic Web
Conference, Hiroshima, Japan, Nov 2004.
[18] D. Fernandes, J. Bernardino, Graph databases comparison: Allegrograph, arangodb,
infinitegraph, neo4j, and orientdb., in: Data, 2018, pp. 373–380.
[19] D. Sánchez, M. Batet, S. Martínez, J. Domingo-Ferrer, Semantic variance: an intuitive
measure for ontology accuracy evaluation, Engineering Applications of Artificial Intelligence
39 (2015) 89–99.
[20] J. Howard, Artificial intelligence: Implications for the future of work, American journal of
industrial medicine 62 (2019) 917–926.
[21] W. Xu, Toward human-centered ai: a perspective from human-computer interaction,
interactions 26 (2019) 42–46.
[22] C. Zhang, Y. Lu, Study on artificial intelligence: The state of the art and future prospects, Journal of Industrial Information Integration 23 (2021) 100224.</p>
    </sec>
    <sec id="sec-7">
      <title>A. BERTopic Parameters</title>
      <p>Here we summarise the parameters set throughout S&amp;TO development:
• “n_neighbors“ refers to the number of neighbouring data points used to estimate
the manifold. Large values produce a more global view of the structure, while low values
produce a more local one. To strike a good balance, we set this value to 2 based on our
estimation.
• “n_components“ refers to the number of components retained after dimensionality
reduction. This value directly affects clustering performance, so it must be set to an
optimal value. By default it is 5, reducing the dimensionality as much as possible while
preserving the information in the generated embeddings.
• “low_memory“: set to True because we use a huge dataset; this lets UMAP run on machines with less memory at the cost of slower computation.
• “min_cluster_size“: the number of generated clusters depends strongly on the
minimum cluster size, so this value must be tuned. A high value yields a few clusters of
considerable size, while a low value yields many micro-clusters. After several experiments,
a cluster size of 50 was found to be optimal.
• “metric“: the metric HDBSCAN uses to compute distances. We chose Euclidean
because, after dimensionality reduction, the data is low-dimensional and little further
optimisation is necessary. However, if “n_components“ in UMAP is increased, it is
advisable to investigate metrics suited to high-dimensional data.
• “prediction_data“: always set this value to True if you need to predict topics for
unseen documents later; it can be set to False if you do not wish to predict any unseen
data points.
• “min_samples“: defaults to “min_cluster_size“ and controls the number
of outliers generated. Setting this value significantly lower than
“min_cluster_size“ can reduce the amount of noise in the output. Do
note that outliers are to be expected, and forcing the output to have none
may not properly represent the data.
• “top_n_words“ refers to the number of words extracted per topic. In practice, we keep
this value below 30, preferably between 10 and 20: the more words used to represent a
topic, the less relevant each becomes, and the top words are the most representative of
the topic and should remain the focus.
• “min_topic_size“ specifies the minimum size of a topic. The lower the value, the
more topics are created; if it is set too high, no topics may be created, and if it is set too
low, many micro-clusters result.
• “calculate_probabilities“ gives the probability of each topic per document. This
can slow down topic extraction for large numbers of documents.
• “diversity“ controls the diversity of topic representations on a scale from 0 (no
diversity) to 1 (maximum diversity). More diverse topics tend to be less coherent at
smaller cluster sizes. In our case, the diversity is set to 0.4 or above.</p>
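      <p>The parameter choices above can be sketched as a BERTopic configuration. This is a minimal illustration rather than the exact S&amp;TO pipeline code; it assumes the bertopic, umap-learn, and hdbscan packages, and the variable names are ours.</p>

```python
# Illustrative sketch of the configuration described above, assuming
# the bertopic, umap-learn, and hdbscan packages are installed.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# UMAP: dimensionality reduction applied before clustering.
umap_model = UMAP(
    n_neighbors=2,     # local view of the manifold, as chosen above
    n_components=5,    # reduce embeddings to 5 dimensions
    metric="euclidean",
    low_memory=True,   # smaller memory footprint, slower computation
)

# HDBSCAN: density-based clustering of the reduced embeddings.
hdbscan_model = HDBSCAN(
    min_cluster_size=50,    # found optimal after several experiments
    metric="euclidean",     # adequate for low-dimensional data
    prediction_data=True,   # needed to assign topics to unseen documents
)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    top_n_words=20,                 # kept between 10 and 20 in practice
    min_topic_size=50,
    calculate_probabilities=False,  # True slows extraction on large corpora
    low_memory=True,
)
```

      <p>Note that the diversity setting (0.4 or above) is exposed as a constructor parameter only in some BERTopic releases; in newer versions the equivalent effect is obtained through a maximal-marginal-relevance representation model.</p>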
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alani</surname>
          </string-name>
          ,
          <article-title>Semantic sentiment analysis of twitter</article-title>
          ,
          <source>in: The Semantic Web-ISWC</source>
          <year>2012</year>
          : 11th International Semantic Web Conference, Boston, MA, USA, November
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2012</year>
          , Proceedings,
          <source>Part I 11</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>508</fpage>
          -
          <lpage>524</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birukou</surname>
          </string-name>
          , E. Motta,
          <article-title>Automatic classification of springer nature proceedings with smart topic miner</article-title>
          ,
          <source>in: The Semantic Web-ISWC</source>
          <year>2016</year>
          : 15th International Semantic Web Conference, Kobe, Japan,
          <source>October 17-21</source>
          ,
          <year>2016</year>
          , Proceedings,
          <source>Part II 15</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>399</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dudáš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lohmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Svátek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          ,
          <article-title>Ontology visualization methods and tools: a survey of the state of the art</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          <volume>33</volume>
          (
          <year>2018</year>
          )
          <article-title>e10</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          , E. Motta,
          <article-title>Augur: forecasting the emergence of new research topics</article-title>
          ,
          <source>in: Proceedings of the 18th ACM/IEEE on joint conference on digital libraries</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Monrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Schelén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Andersson</surname>
          </string-name>
          ,
          <article-title>A survey of blockchain from the perspectives of applications, challenges, and opportunities</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>117134</fpage>
          -
          <lpage>117151</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>