<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <!-- Corresponding author. †These authors contributed equally.
      stefani.tsaneva@wu.ac.at (S. Tsaneva); ddessi@sharjah.ac.ae (D. Dessì); francesco.osborne@open.ac.uk (F. Osborne);
      marta.sabou@wu.ac.at (M. Sabou) -->
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing Scientific Knowledge Graph Generation Pipelines with LLMs and Human-in-the-Loop</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefani Tsaneva</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Dessì</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Osborne</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Sabou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, College of Computing and Informatics, University of Sharjah</institution>
          ,
          <addr-line>Sharjah</addr-line>
          ,
          <country country="AE">United Arab Emirates</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Business and Law, University of Milano Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Data, Process and Knowledge Management, Vienna University of Economics and Business</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Scientific Knowledge Graphs have recently become a powerful tool for exploring the research landscape and assisting scientific inquiry. It is crucial to generate and validate these resources to ensure they offer a comprehensive and accurate representation of specific research fields. However, manual approaches are not scalable, while automated methods often result in lower-quality resources. In this paper, we investigate novel validation techniques to improve the accuracy of automated KG generation methodologies, leveraging both a human-in-the-loop (HiL) and a large language model (LLM)-in-the-loop. Using the automated generation pipeline of the Computer Science Knowledge Graph as a case study, we demonstrate that precision can be increased by 12% (from 75% to 87%) using only LLMs. Moreover, a hybrid approach incorporating both LLMs and HiL significantly enhances both precision and recall, resulting in a 4% increase in the F1 score (from 77% to 81%).</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Evaluation</kwd>
        <kwd>Scientific Knowledge Graph</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Hybrid Human-AI Workflows</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As the number of open research articles continues to grow, the research community increasingly
needs efficient solutions for knowledge-based content exploration of scientific works. Knowledge
graphs (KGs) have emerged as a crucial technology in this domain due to their ability to structure
information semantically and support intelligent systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Consequently, scientific KGs, which
facilitate the categorization, search, and reasoning over scientific knowledge, have attracted significant
interest (e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7 ref8">2, 3, 4, 5, 6, 7, 8</xref>
        ]). Some of these resources, such as the Open Research Knowledge Graph
(ORKG) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], require manual annotations for their creation. This approach produces high-quality data,
but it limits the coverage and scalability of the curated resources. Other methods aim to generate much
larger resources by integrating scientific content from millions of articles through automated processes.
An example is the Computer Science Knowledge Graph (CS-KG) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a KG of 10 million entities extracted
from 6.7 million publications built using an automatic pipeline called SCICERO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, while
fully automated KG generation approaches provide extensive coverage of the represented area, they
often fall short in terms of the quality of the resulting KGs. Due to the complexity of transforming
natural language into a structured form, for example using the Resource Description Framework (RDF),
misleading or incorrect triples might be extracted. Therefore, it is crucial to incorporate validation
steps as part of the scientific KG generation and curation processes. To this end, SCICERO includes
a module that assesses triples according to domain understanding through an ontology, and a module
that assesses triples using the SciBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] model. However, SCICERO does not make use of common
methods based on human-in-the-loop (HiL) approaches, where domain experts identify incorrect triples
to improve the quality of the resulting KGs. Traditional HiL methods, however, are not scalable [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
thus making their use for large KGs challenging. A potential solution is the adoption of modern large
language models (LLMs), which have demonstrated human-like performance in many natural language
processing tasks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In this paper, we investigate novel validation approaches to improve the accuracy of automated KG
generation pipelines, leveraging both a HiL and an LLM-in-the-loop (LLM-iL). Using SCICERO, the
automated generation pipeline of the Computer Science Knowledge Graph, as a case study [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we
design and test validation modules that can be integrated into the existing pipeline to improve the
quality of the resulting triples, leading to an enhanced CS-KG. For this purpose, we adopt a gold
standard of 3.6K triples1 used in the original evaluation of SCICERO to simulate various workflows
incorporating HiL and LLM-iL validation modules. We designed these workflows to optimize precision,
recall, and F1 score using a subset (600 triples) of the gold standard.
      </p>
      <p>Specifically, we begin by implementing and investigating two approaches integrating a HiL module
within SCICERO. We then replicate these pipelines, replacing the HiL module with an LLM-iL validation,
leveraging GPT-4o2. While the improvement with the LLM validation module was not as pronounced as
with the HiL, we observed a significant increase (up to 12%) in precision without any additional manual
effort. Furthermore, we explore hybrid workflows that integrate both LLM-iL and HiL modules within the
SCICERO framework. The results indicate that even minimal HiL involvement can lead to higher-quality
extractions. Notably, an efective method for reducing human involvement and increasing scalability
involved utilizing the HiL module only to resolve disagreements between automated validators.</p>
      <p>We assess the proposed extended SCICERO workflows by testing their performance on the full gold
standard dataset, confirming that the observed score improvement trends persist.</p>
      <p>The remainder of this paper is structured as follows. Section 2 reviews related research in this area.
Section 3 provides an overview of the current SCICERO pipeline and its relevant validation modules.
In Section 4, we describe the newly implemented HiL and LLM-iL validation modules. The extended
SCICERO pipelines and the methods used to design them are described in Section 5, followed by a
discussion of the evaluation results in Section 6 and a conclusion in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        This paper is situated in the intersection of three research areas: (1) methods for generating scientific
knowledge graphs, (2) expertise comparison of LLMs and HiL, and (3) quality enhancement of KGs
leveraging LLMs. This section offers a brief overview of related work from each of these areas.
Scientific Knowledge Graph Curation. There are two types of curation processes for scientific
knowledge graphs: manual and automated. An exemplary result of the manual curation process is the
Open Research Knowledge Graph (ORKG) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. ORKG describes articles, their contributions, applied
methods, evaluation methodologies, etc. The framework relies on researchers describing the content of
their scientific work manually as RDF triples. While such curation approaches yield good-quality
graphs, they require large amounts of manual work and are, therefore, limited in terms of scalability.
      </p>
      <p>
        In contrast, automatic approaches for the generation of scientific knowledge graphs can cover a
large number of articles. SCICERO, the Computer Science Knowledge Graph generation pipeline [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is
an example of such an automatic curation workflow. While it includes two automatic validator modules
aiming to remove noisy triples, the quality of the graph cannot be guaranteed, since the extracted
triples are not checked by a HiL with expertise in the domain. To overcome the lack of scalability of an
additional HiL validation module, we investigate alternative validation modules that can
extend the SCICERO pipeline.
      </p>
      <p>1 The gold standard is available at https://github.com/danilo-dessi/SKG-pipeline/tree/main/eval
2 GPT-4 Omni: https://openai.com/index/hello-gpt-4o/</p>
      <p>
        LLM-in-the-Loop as an Expert-in-the-Loop. Recently, LLMs have gained much research attention
and have been applied to a variety of tasks across different domains. For instance, LLMs have been
prompted to take qualification tests in non-trivial domains such as clinical chemistry [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], or logic-based
domains such as ontology modeling [13], showing performance comparable to that of top graduate students.
      </p>
      <p>LLMs have also been compared to HiL for specific human intelligence tasks. For instance, in []
LLMs judge the quality of automatically extracted texts and produce annotations similar to experts’
judgments. Motivated by the results presented in the literature, we investigate whether LLMs are suited for
the validation of scientific knowledge graphs.</p>
      <p>
        LLMs for Semantic Resource Validation. Large language model advancements have inspired
several works in the area of semantic resource (e.g., knowledge graphs, ontologies) validation. The
usage of LLMs for the detection of ontology modeling errors is explored in [13] for ontology restriction
defect detection and in [14] for class membership validation. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the authors present an LLM-based
knowledge graph generation workflow, which includes a triples validation step. However, the paper
does not discuss concrete validation results. Complementary to this line of work, NeOn-GPT [15]
integrates an LLM for the correction of ontology errors found through external services.
      </p>
      <p>The mentioned evaluation approaches are tailored towards small resources, with a focus on the
ontological schema of knowledge graphs. Nevertheless, they obtain promising results, motivating the
exploration of LLMs for more complex domains.</p>
      <p>In this paper, we propose a variety of KG validation workflows harnessing a HiL, an LLM-iL, or both
of these modules to overcome the trade-off between KG quality and scalability.</p>
    </sec>
    <sec id="sec-3">
      <title>3. SCICERO: The CS-KG Generation Pipeline</title>
      <p>
        In this section, we briefly introduce the CS-KG generation pipeline called SCICERO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with a focus
on its validation modules. SCICERO takes as inputs a set of scientific texts and an ontology used to
semantically describe the domain knowledge in the field. The pipeline (visualized in Fig. 1) contains
three main steps: (1) extraction, which exploits the CSO Classifier [16] and revised natural language
processing modules based on the CoreNLP suite [17] to produce initial sets of triples; (2) entity and
relationship handling, which merges similar entities, filters generic entities, maps similar relationships
on the same relation label, and integrates the triples coming from the various extractors into a single
set; and (3) triple validation, consisting of two validation modules, namely a Transformer Validator and
an Ontology-based Validator, aiming to discard incorrectly extracted or generated triples.
      </p>
      <p>
        Transformer Validator. The transformer-based validation takes as input the set of refined triples
produced in the extraction and subsequent entity and relationship handling step. The triples include
additional information about their support level, i.e., the number of papers from
which a triple was generated. The support can be interpreted as a confidence value that indicates how
reliable a triple is. Intuitively, a triple that is associated with a considerable number of scientific papers
(e.g., ≥ 5) is considered well-supported [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The transformer validator employs the following process:
(1) define a reliable set of triples (Treliable) and an uncertain set (Tuncertain): Treliable contains statements highly
supported in the literature (e.g., support ≥ 5), while the remaining triples (support &lt; 5) form Tuncertain;
(2) fine-tune a SciBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] transformer model using Treliable and a set of negative triples
generated by corrupting Treliable;
(3) predict whether the triples in Tuncertain are correct, adding them to the TKG set, or incorrect, in which
case they are discarded (Tt-discarded).
      </p>
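      <p>The support-based split and negative-triple generation described above can be sketched as follows. This is a minimal illustration, not SCICERO's actual implementation: the dictionary field names and the corruption strategy (replacing the head or tail with a random entity, as described for the gold standard) are assumptions.</p>

```python
import random

def split_by_support(triples, threshold=5):
    """Partition triples into reliable (support >= threshold) and uncertain sets."""
    reliable, uncertain = [], []
    for t in triples:
        if t["support"] >= threshold:
            reliable.append(t)
        else:
            uncertain.append(t)
    return reliable, uncertain

def corrupt(triples, entity_pool, rng):
    """Generate negative training examples by replacing the head or tail entity."""
    negatives = []
    for t in triples:
        slot = rng.choice(["subject", "object"])  # corrupt either end of the triple
        negatives.append(dict(t, **{slot: rng.choice(entity_pool)}))
    return negatives
```

The reliable set and the corrupted negatives then serve as positive and negative examples for fine-tuning the SciBERT classifier.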
      <p>Ontology-based Validator. The ontology-based validation enables the removal of triples that do not
hold according to an expert understanding of the scientific Computer Science domain. It takes as input
TKG and Treliable and filters out triples not compliant with the ontological schema. For example,
a triple whose subject or object type is not defined as the domain or range of the used relation will be
discarded (To-discarded). Concretely, the triple &lt;dbpedia, uses, core_nlp&gt; would be removed since the
entity dbpedia of type Material cannot use the entity core_nlp of type Method.</p>
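      <p>The domain/range check can be illustrated with a short sketch; the schema and entity-type dictionaries are hypothetical stand-ins for the ontology SCICERO actually consults:</p>

```python
def ontology_filter(triples, entity_types, schema):
    """Keep triples whose subject/object types match the relation's domain and range."""
    consistent, discarded = [], []
    for s, p, o in triples:
        domain, range_ = schema.get(p, (None, None))
        if entity_types.get(s) == domain and entity_types.get(o) == range_:
            consistent.append((s, p, o))
        else:
            discarded.append((s, p, o))  # the To-discarded set
    return consistent, discarded
```

With a schema entry stating that uses relates a Method to a Material, the triple (dbpedia, uses, core_nlp) is discarded exactly as in the example above.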
      <p>SCICERO Evaluation. SCICERO was evaluated using a gold standard of 3.6K triples (CS-KG-3600).
The gold standard was created by sampling the generated CS-KG and selecting 600 triples from each
of the six categories: very high support triples, high support triples, low
support triples, triples discarded by the Transformer Validator (∈ Tt-discarded), triples
discarded by the Ontology-based Validator (∈ To-discarded), and randomly generated triples, which
were created by replacing the head or tail of triples from the CS-KG. Each triple was then manually
annotated as correct or incorrect by 3 senior experts in the Computer Science domain. The ground truth
for each triple was calculated by aggregating the annotations from the experts using a majority vote.</p>
      <p>
        SCICERO has been evaluated on this set of 3.6K triples and achieves 75% precision, 79% recall
and 77% F1 score. The integrated validation modules managed to significantly improve the precision
from the extraction step (54% precision, 95% recall, 69% F1 score), showcasing the importance of
incorporating a validation step in the extraction pipeline to filter out erroneously extracted or generated
statements. Further details about SCICERO’s implementation and the CS-KG can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
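      <p>The reported scores follow the standard precision/recall/F1 definitions over the binary keep/discard decisions. A straightforward way to compute them, treating label 1 as "triple judged correct":</p>

```python
def precision_recall_f1(predicted, gold):
    """Binary P/R/F1, where 1 means a triple is judged correct and kept in the KG."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Applied to the gold-standard labels, these are the metrics used throughout Sections 5 and 6 to compare workflows.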
    </sec>
    <sec id="sec-4">
      <title>4. Human-in-the-loop &amp; LLM Validation Modules</title>
      <p>This section describes human-in-the-loop and LLM-in-the-loop validation modules that can be attached
to SCICERO to further refine the generated triples enhancing the overall quality of the CS-KG.
Human-in-the-Loop Validator. The HiL validation module is designed to incorporate expert
judgments into the KG validation process. When a triple is subjected to human-in-the-loop validation, the
expert judgment overwrites the automatically predicted correctness of the triple. Consequently, the
triple is either incorporated into the final knowledge graph or added to the set of discarded triples
(Th-discarded). In this paper, to simulate various workflows involving human participation at different
stages of the validation pipeline, we utilize the available gold standard (CS-KG-3600). The HiL Validator
module applies a single expert judgment, randomly chosen from the expert annotations, to each triple.
We follow this approach to allow the reusability of the created gold standard while limiting biases
introduced by the usage of the gold standard within the validation pipeline.</p>
      <p>
        LLM Validator. The SCICERO pipeline already includes a SciBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] transformer-based validation.
However, the new module leverages GPT-4o for the validation of statements represented as triples. As
a first step, an initial prompt with instructions is sent, introducing the task and the expected output:
      </p>
      <p>You are an expert in Computer Science and want to help with the identification of incorrect statements
from the domain. The user will provide you with a set of RDF triples in the form (subject, predicate, object).
For each triple from the set answer ’0’ if the statement they represent is incorrect and ’1’ if the modeled
statement is correct. Think step by step when making the decision. Return the classifications of each triple in
the order they were provided and do not add an explanation. Use the format ’0. [triple1] - [0|1], 1.
[triple2] - [0|1], ..., 99. [triple100] - [0|1]’.</p>
      <p>Each following prompt includes a batch of 100 triples to be checked. Whenever the response did
not follow the requested format, the batch was re-sent for validation. To reduce the variability of
LLM-generated results, each batch is sent 3 consecutive times and a majority vote is calculated.</p>
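      <p>The batching and majority-vote scheme can be sketched as below. Here ask_llm is a hypothetical callable standing in for the GPT-4o prompt/response round-trip (including response parsing and the re-send on malformed output), which is not shown:</p>

```python
from collections import Counter

def majority_vote(votes):
    """Most common 0/1 label among the repeated runs."""
    return Counter(votes).most_common(1)[0][0]

def validate_with_llm(triples, ask_llm, batch_size=100, repeats=3):
    """Send triples in batches of `batch_size`; each batch is sent `repeats`
    times and the per-triple majority vote across the runs is kept."""
    labels = []
    for start in range(0, len(triples), batch_size):
        batch = triples[start:start + batch_size]
        runs = [ask_llm(batch) for _ in range(repeats)]  # 3 consecutive calls
        labels.extend(majority_vote([run[i] for run in runs])
                      for i in range(len(batch)))
    return labels
```

With three runs over a binary label there can be no ties, so the majority vote is always well defined.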
    </sec>
    <sec id="sec-5">
      <title>5. Extended SCICERO Workflows</title>
      <p>In this section, we present SCICERO extensions containing one or both of the new validation modules:
the HiL Validator and the LLM Validator. To design the pipeline extensions, we use a subset of
the gold standard from the SCICERO evaluation as an exploratory dataset (CS-KG-600). CS-KG-600
consists of 100 randomly selected triples from each of the six subsets contained within the gold standard
(see Sect. 3) thus retaining a representative sample. Using CS-KG-600, we design various workflows
optimizing the extraction performance. To evaluate the observed benefits of the added validation
modules we test the extraction scores of the new workflows on the complete gold standard and report
our findings in Section 6.</p>
      <sec id="sec-5-1">
        <title>5.1. SCICERO integration with the HiL Validator</title>
        <p>A traditional approach to improve the quality of gathered results is to involve a HiL at the end of
SCICERO’s existing validation pipeline (see Fig. 2). While this strategy offers an improvement in terms
of the precision of the generation pipeline, the human effort needed is enormous, especially for large
resources including millions of triples such as CS-KG.</p>
        <p>Since the extracted triples are organized into Treliable and Tuncertain, the solution can be adapted
following the intuition that highly supported statements are correct. The modified workflow is displayed
in Fig. 3. Treliable triples, passed through the Transformer and Ontology-based Validators, are directly
added to the KG. In contrast, triples with lower support (Tuncertain) receive an additional HiL validation.</p>
        <p>An objective of the presented HiL validation workflows is a reduction of the number of triples to be manually
checked. In contrast, LLMs offer more flexibility in terms of scalable solutions.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. SCICERO integration with the LLM Validator</title>
        <p>Related work has presented impressive results of LLMs, specifically the GPT models, for tasks typically
performed by experts. We, therefore, transform the already presented workflows by replacing the HiL
Validator module with the LLM Validator. Figures 4 and 5 visualize the new pipelines.</p>
        <p>For instance, triples
discarded by the Transformer Validator can be double-checked to ensure no correct triples are being
removed and thus improve recall (Fig. 6).</p>
        <p>Another possibility is the inclusion of the LLM Validator at several positions within the same workflow.
For instance, as shown in Fig. 7, the module can be added both at an early stage to "rescue" discarded
triples and at a later stage with the aim of removing noisy triples which may have been missed by the
previous validation modules.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. SCICERO integration with the HiL Validator and the LLM Validator</title>
        <p>While an LLM-iL validation can offer additional improvements to the pipeline and does not require any
manual effort, it is important to keep at least some level of human oversight. Thus, we propose two
exemplary hybrid workflows which can take advantage of human intelligence at scale.</p>
        <p>Both approaches follow the notion of agreement among the automated validation modules: the
original SCICERO Transformer Validator and the new LLM Validator. We follow the intuition that if
multiple automated approaches assert that a triple is correct (or incorrect) it is likely to be accurate. In
contrast, if a decision conflict between the modules arises, a HiL can be involved in its resolution.</p>
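        <p>The agreement rationale can be expressed as a simple routing rule. In this sketch, ask_human is a placeholder for the HiL Validator, and the 0/1 predictions of the two automated modules are assumed inputs:</p>

```python
def hybrid_route(triples, transformer_pred, llm_pred, ask_human):
    """Agreeing automated validators decide directly; the HiL resolves disagreements."""
    kept, discarded, human_checked = [], [], 0
    for triple, a, b in zip(triples, transformer_pred, llm_pred):
        if a == b:
            verdict = a                  # both validators agree: trust them
        else:
            verdict = ask_human(triple)  # HiL involved only on conflict
            human_checked += 1
        (kept if verdict == 1 else discarded).append(triple)
    return kept, discarded, human_checked
```

The human_checked counter corresponds to the manual-effort figures reported in Section 6: only the disagreement cases, a small fraction of all triples, reach the expert.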
        <p>Figure 8 presents a workflow following the agreement rationale for discarding triples, thus allowing
the SCICERO extraction to recover recall lost through the Transformer Validator.</p>
        <p>We further extend this workflow (Fig. 9) by applying a disagreement strategy at the final step for the
remaining uncertain triples, to ensure they are re-evaluated before being added to the KG.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We present the performance of the extended pipelines on the exploratory triple set and complete gold
standard in Table 1. The scores are color-coded such that improvements are marked in green and a
decrease is visualized in red. For each workflows discussed in Sect. 5, we include the precision, recall,
F1 scores. Additionally, we provide inputs on the added validation efort in terms of number of triples
to be validated either by the LLM Validator or HiL Validator.</p>
      <p>Workflow Design Methodology. We examine whether the performance trends observed on the
triple subset CS-KG-600 remain prominent as the number of triples increases. We report that for all
selected workflows, as shown in Table 1, the observed improvements remain or even increase when
tested against the full gold standard. These results indicate that the designed workflows are robust and
scalable. Therefore, a similar exploration strategy can be employed to develop a suitable validation
workflow for other automatically generated knowledge graphs, provided a partial gold standard is available.
SCICERO Workflows Performance. To select the best-fitted validation pipeline we discuss the
achieved performance with each workflow. Extensions of SCICERO with a HiL module offer
improvements in terms of precision (+6% to +20%) and F1 scores (+1% to +5%). Nevertheless, significant
improvements (i.e., workflow 2) require high manual effort, which introduces a scalability issue.</p>
      <p>As an alternative, workflows 4-7 leverage an LLM rather than a HiL. Depending on the positioning
of the LLM module, either the recall or the precision score can be boosted. For instance, workflow 4 reaches
a precision of 85% (+12% from the baseline) on the CS-KG-600 dataset. However, a significant drop in
recall (-18%) and F1 (-5%) scores is observed. In contrast, workflow 6 increases the recall to over 80%
(+5%) with some losses (-2%) in the precision.</p>
      <p>The best performing workflows, which lead to improvements across all scores for both datasets, are
workflows 8 and 9, employing both the LLM Validator and HiL Validator modules. We see that the
recall can be improved by 5% with minimal HiL effort (validation of approx. 5-6% of the total triples,
workflow 8) without any precision losses. Similarly, workflow 9 offers improvements in both recall
(+3%) and precision (+6%) with slightly higher manual effort (approx. 12-13%).</p>
      <p>The most appropriate workflow should be selected based on the available resources and main
validation goal. For instance, whenever an expert is unavailable and the elimination of noisy triples is
requested, workflow 4 can be followed. In contrast, for a small KG, workflow 1 would deliver the best
results. Lastly, when dealing with a large knowledge graph and only limited availability of experts, a
semi-automatic validation workflow such as workflow 9 can be followed to achieve high-quality results.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>
        In this paper, we present possible solutions extending SCICERO, the generation pipeline of the Computer
Science Knowledge Graph, initially presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We propose two new validation modules that can
be integrated into the framework: one incorporating a human-in-the-loop validation and the other an
LLM-in-the-loop validation leveraging GPT-4o.
      </p>
      <p>Using a subset of the available gold standard from the original SCICERO evaluation as an exploratory
dataset we design workflows with one or both additional validation modules optimizing the extraction
scores. To evaluate the effectiveness of the proposed solutions we measured precision, recall and F1
scores on the complete gold standard, accounting for additional (manual) validation efforts.</p>
      <p>Our findings reveal that (1) the new LLM-based validation module can increase the extraction
precision by 12% reaching 85-87% (following workflow 4; Fig. 4) without any human involvement;
(2) minimal manual effort for validating 5-6% of the produced triples can lead to significant score
improvements (+5% recall from workflow 8; Fig. 8); and (3) the notion of agreement among automated
approaches can effectively determine the need for human-in-the-loop validation.</p>
      <p>Each of the proposed SCICERO extensions enhances the extraction process and depending on the
available resources and objective of the KG extraction, the best-suited workflow can be selected.</p>
      <p>A limitation of this work is the exclusive focus and evaluation on a single KG generation process.
Nevertheless, the added HiL and LLM-iL validation modules are developed independently from SCICERO,
making the designed workflows easily adaptable to other KG generation pipelines.</p>
      <p>In future work, we plan to employ the designed workflows in a dynamic environment, for a new subset
of the CS-KG to further validate the obtained results. We further intend to use the extended SCICERO
pipelines to generate various versions of the CS-KG, enabling cross-validation and comprehensive
analysis of KG-enabled tasks such as hypothesis generation, forecasting of research dynamics, etc.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the FWF HOnEst project (V 745) and the PERKS project (101120323)
co-funded by the European Union. Views and opinions expressed are, however, those of the authors
only and do not necessarily reflect those of the European Union. Neither the European Union nor the
granting authority can be held responsible for them.</p>
      <p>[12] ... outperforming Bing, Bard, ChatGPT-3.5, and humans in clinical chemistry multiple-choice questions,
medRxiv (2024). doi:10.1101/2024.01.08.24300995.
[13] S. Tsaneva, S. Vasic, M. Sabou, LLM-driven ontology evaluation: Verifying ontology restrictions
with ChatGPT, in: The Semantic Web: ESWC Satellite Events, 2024.
[14] B. P. Allen, P. T. Groth, Evaluating class membership relations in knowledge graphs using large
language models, in: The Semantic Web: ESWC Satellite Events, 2024.
[15] N. Fathallah, A. Das, S. De Giorgis, A. Poltronieri, P. Haase, L. Kovriguina, NeOn-GPT: A large
language model-powered pipeline for ontology learning, in: The Semantic Web: ESWC Satellite Events, 2024.
[16] A. A. Salatino, F. Osborne, E. Motta, CSO Classifier 3.0: a scalable unsupervised method for
classifying documents in terms of research topics, Int. J. Digit. Libr. 23 (2022) 91-110. doi:10.1007/S00799-021-00305-Y.
[17] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP
natural language processing toolkit, in: Proceedings of the 52nd Annual Meeting of the Association
for Computational Linguistics, ACL 2014, System Demonstrations, 2014, pp. 55-60. doi:10.3115/V1/P14-5010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] C. Peng, F. Xia, M. Naseriparsa, F. Osborne, Knowledge graphs: Opportunities and challenges, Artificial Intelligence Review 56 (2023) 1–32. doi:10.1007/s10462-023-10465-9.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D'Souza, G. Kismihók, M. Stocker, S. Auer, Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge, in: Proceedings of the 10th International Conference on Knowledge Capture, K-CAP '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 243–246. doi:10.1145/3360901.3364435.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] P. Groth, A. Gibson, J. Velterop, The anatomy of a nanopublication, Information Services &amp; Use 30 (2010) 51–56. doi:10.3233/ISU-2010-0613.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] D. Dessì, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain, Knowledge-Based Systems 258 (2022). doi:10.1016/j.knosys.2022.109945.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] D. Dessì, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, CS-KG: A large-scale knowledge graph of research entities and claims in computer science, in: U. Sattler, A. Hogan, M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò, C. d'Amato (Eds.), The Semantic Web - ISWC 2022, Springer International Publishing, Cham, 2022, pp. 678–696. doi:10.1007/978-3-031-19433-7_39.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] D. Dessì, F. Osborne, D. Reforgiato Recupero, D. Buscaldi, E. Motta, H. Sack, AI-KG: an automatically generated knowledge graph of artificial intelligence, in: The Semantic Web - ISWC 2020: 19th International Semantic Web Conference, Athens, Greece, November 2-6, 2020, Proceedings, Part II, Springer, 2020, pp. 127–143. doi:10.1007/978-3-030-62466-8_9.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] A. Sakor, S. Jozashoori, E. Niazmand, A. Rivas, K. Bougiatiotis, F. Aisopos, E. Iglesias, P. D. Rohde, T. Padiya, A. Krithara, et al., Knowledge4COVID-19: A semantic-based approach for constructing a COVID-19 related knowledge graph from various sources and analyzing treatments' toxicities, Journal of Web Semantics 75 (2023) 100760. doi:10.1016/j.websem.2022.100760.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Q. Wang, M. Li, X. Wang, N. Parulian, G. Han, J. Ma, J. Tu, Y. Lin, R. H. Zhang, W. Liu, A. Chauhan, Y. Guan, B. Li, R. Li, X. Song, Y. Fung, H. Ji, J. Han, S.-F. Chang, J. Pustejovsky, J. Rah, D. Liem, A. Elsayed, M. Palmer, C. Voss, C. Schneider, B. Onyshkevych, COVID-19 literature knowledge graph construction and drug repurposing report generation, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, 2021, pp. 66–77. doi:10.18653/v1/2021.naacl-demos.8.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. doi:10.18653/v1/D19-1371.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8 (2016) 489–508. doi:10.3233/SW-160218.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] H. Khorashadizadeh, N. Mihindukulasooriya, S. Tiwari, J. Groppe, S. Groppe, Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text, 2023. arXiv:2305.08804.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] M. Sallam, K. Al-Salahat, H. Eid, J. Egger, B. Puladi, Human versus artificial intelligence: ChatGPT-4 outperforming Bing, Bard, ChatGPT-3.5, and humans in clinical chemistry multiple-choice questions, medRxiv (2024). doi:10.1101/2024.01.08.24300995.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>