<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEMANTiCS</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Unstructured Text with LLMs for KG Generation with RML</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Maushagen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Sepehri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Audrey Sanctorum</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamara Vanhaecke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga De Troyer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Debruyne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Montefiore Institute of Electrical Engineering and Computer Science, University of Liège</institution>
          ,
          <addr-line>Liège</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Group of In Vitro Toxicology and Dermato-Cosmetology (IVTD), Vrije Universiteit Brussel</institution>
          ,
          <addr-line>Brussels</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Web &amp; Information Systems Engineering (WISE) Lab, Vrije Universiteit Brussel</institution>
          ,
          <addr-line>Brussels</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>20</volume>
      <fpage>17</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We report on an exploratory study using Large Language Models (LLMs) to generate Comma-Separated Values (CSV) files, which are subsequently transformed into the Resource Description Framework (RDF) using the RDF Mapping Language (RML). Prior studies have shown that LLMs sometimes have problems generating valid and well-formed RDF from unstructured texts, i.e., issues with the RDF, not the contents. We wanted to test whether generating CSV led to fewer issues and whether this would be a viable option for allowing domain experts to be an active part of the Knowledge Graph (KG) population process by letting them use familiar tools. We have built a prototype illustrating this idea, and the results seem promising enough for further study. The initial prototype uses zero-shot prompting and is built on GPT-4. The prototype takes the unstructured text and the CSV file's structure as input and uses the latter to generate prompts to fill in the cells' values. Future work includes analyzing the effect of different prompting strategies. The limitation, however, is that such an approach only works for projects where domain experts work with spreadsheets for which mappings already exist.</p>
      </abstract>
      <kwd-group>
        <kwd>KG Construction</kwd>
        <kwd>LLMs</kwd>
        <kwd>End-user Involvement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) enable us to organize, represent, and reason about structured
information integrated from various sources. However, KG construction remains challenging due
to the heterogeneity and complexity of real-world data sources. End-user and domain-expert
involvement in all KG construction activities, such as ontology engineering, data transformation,
data enrichment, and quality assurance, is a challenge requiring bespoke methods and tools, as
exemplified in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we proposed a method in the toxicology domain that relied on
domain experts populating a set of spreadsheets, which are subsequently transformed into RDF
using RML. It also offers end users an alternative based on the block metaphor.
      </p>
      <p>
        Large Language Models (LLMs) have demonstrated their potential for natural language
understanding and generation tasks, and their use has been explored in KG construction. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
for instance, generated RDF from unstructured text and noticed differences when the LLM was
requested to produce Turtle, JSON-LD, etc. LLMs are not only used to generate RDF; their use
has been explored in declarative mappings as well. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors demonstrated that LLMs
can be used to engage with RML [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] mappings and that the output (RDF, queries, etc.) is of fairly
high quality. However, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors explored various LLMs to generate RML mappings
and noticed that the models tended to generate syntactically correct RDF but invalid mappings.
      </p>
      <p>
        As demonstrated in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], applying LLMs to KG construction may still have suboptimal results.
Recognizing that the generation of RDF from unstructured text has some challenges, we explored
using LLMs to distill simple (i.e., CSV) semi-structured information from unstructured text that
domain experts can more easily validate and refine with spreadsheets. We believe this approach
would yield better results in contexts where one has an ontology and data can easily be entered
into such files. This paper elaborates on our approach and reports on our initial findings.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Context</title>
      <p>
        This study was conducted in the context of the TOXIN project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A major part of this project
was to gather information about the in vivo tests described in the Safety Evaluation
Opinions that the Scientific Committee on Consumer Safety (SCCS) issues about cosmetic ingredients
and to integrate it into a knowledge graph. Each Opinion, i.e., dossier, contains information about experiments or
tests of an ingredient (compound) on laboratory animals (the compound, quantities, exposure,
animals, outcomes, ...). The data contained in these dossiers are integrated into the KG to give
toxicologists more efficient access to them.
      </p>
      <p>Our current method for populating the KG relies on domain experts
reading and interpreting Safety Evaluation Opinions to enter the details of experiments in
spreadsheets, which are subsequently transformed into RDF using RML. The method also includes
an alternative end-user approach, based on the block metaphor, to enter the data into the KG
directly.</p>
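      <p>To make the pipeline concrete: in the project, the spreadsheet-to-RDF step is performed declaratively by an RML engine. The Python sketch below only mimics its outcome for a single spreadsheet row; the namespace, class name, and column headers are hypothetical, not the project's actual vocabulary.</p>
      <preformat>
```python
# Illustrative only: the real transformation is declared in RML and executed
# by an RML engine. Here we mimic the result for one spreadsheet row.
BASE = "https://example.org/toxin/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def row_to_triples(row):
    """Map one spreadsheet row (column header to cell value) to RDF triples."""
    subject = BASE + row["study_id"]
    triples = [(subject, RDF_TYPE, BASE + "InVivoStudy")]
    for header, value in row.items():
        if header != "study_id" and value != "-":  # "-" marks an empty cell
            triples.append((subject, BASE + header, value))
    return triples

triples = row_to_triples({"study_id": "study-001",
                          "compound": "homosalate",
                          "species": "rat",
                          "outcome": "-"})
```
      </preformat>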
      <p>While this process ensures the authoritative nature of the data, it is inherently tedious and
time-consuming. The automation of this process was hampered by the variety in structure,
presentation, and even writing style (e.g., the use of negation) across opinions.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Approach</title>
      <p>While LLMs have been shown to be promising, the aforementioned issues with
RDF generation are problematic if the domain experts are not knowledgeable in these
technologies. In the TOXIN project, the ontologies and mappings have already been engineered. We
can thus explore whether a) LLMs are better at generating (CSV) tables or, at least, finding the
relevant information in the text to construct such a table, and b) whether such an approach
could assist domain experts in filling those spreadsheets more efficiently. To this end, we have
built a prototype assistant, see Figure 2, that takes a safety evaluation opinion and the table’s
structure as input.</p>
      <p>In the current prototype (Figure 1), the text about the experiments (or studies) is extracted
using regular expressions (1), and the column headers are used to generate the prompts (2). The
column headers are grouped under categories, and a user can select one or more such categories.
Initial testing quickly showed that the LLM in our experiment, GPT-4, struggled to generate
a coherent CSV with many columns. We therefore generate the following prompt for each column: “Find
the value for the following variable “«column name»” based on the category “«category name»” in
the following text “«text»”. If you can’t find the answer in the text, respond with “-”. Don’t include
any commentary text!”. The result is shown in (3).</p>
      <p>[Figure 1: The proposed process. Safety Evaluation Opinions, interpreted by a domain expert, serve as input to an assistant with an integrated LLM that fills in the spreadsheet.]</p>
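      <p>The per-column prompt generation described above can be sketched as follows; the function name and sample headers are illustrative, not the prototype's actual code.</p>
      <preformat>
```python
# Sketch of the prototype's per-column prompting: one independent prompt per
# column header, worded as the prompt quoted in the text.
TEMPLATE = ('Find the value for the following variable "{column}" based on '
            'the category "{category}" in the following text "{text}". '
            'If you can\'t find the answer in the text, respond with "-". '
            "Don't include any commentary text!")

def make_prompts(category, columns, text):
    """Build one prompt per column header under the selected category."""
    return {col: TEMPLATE.format(column=col, category=category, text=text)
            for col in columns}

prompts = make_prompts("Observations",
                       ["endpoint", "effect"],
                       "Rats exposed to the compound showed no adverse effects.")
```
      </preformat>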
      <p>Prompt: Provide a text quote from
“&lt;&lt;text&gt;&gt;” which is used to answer the
following command, namely
“&lt;&lt;previous prompt for value of the
endpoint variable&gt;&gt;”.</p>
      <p>Domain experts can recompute the whole CSV table, or the value of a single cell, by
resubmitting the prompts. Domain experts can thus engage with cells multiple times. A promising
feature in the prototype is a button prompting the LLM to point to the part of the text that was
used to fill in one of the columns. An example is shown in Figure 3. This feature could assist
the project in ensuring the data entered in the CSV is authoritative.</p>
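      <p>The provenance button can be sketched the same way: a follow-up prompt, worded as quoted above, asks the LLM to return the passage that supported an earlier answer. The function name and sample arguments are illustrative.</p>
      <preformat>
```python
# Sketch of the provenance feature: ask the LLM which passage supported an
# earlier answer. The wording mirrors the follow-up prompt shown in the text.
def make_quote_prompt(text, previous_prompt):
    return ('Provide a text quote from "{text}" which is used to answer the '
            'following command, namely "{prev}".').format(text=text,
                                                          prev=previous_prompt)

quote_prompt = make_quote_prompt(
    "Rats exposed to the compound showed no adverse effects.",
    'Find the value for the following variable "endpoint".')
```
      </preformat>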
    </sec>
    <sec id="sec-5">
      <title>4. Exploratory Results</title>
      <p>While no user studies have been conducted yet, we deem this approach worthwhile to investigate,
given the initial exploratory results. One of the co-authors, a domain expert, found the retrieved
information to be often coherent, though experiments with additional domain experts are
warranted. During this study, we noticed that the prompts generated from the column headers
sometimes misled the LLM because a column header was ambiguous. This was
partly remedied by including the information on the category (e.g., the observed effects of a
compound, which are represented under “Observations”, containing five headers, as shown in
Figure 2). We plan, however, to investigate specific prompts for each column header, which can
be provided to the assistant.</p>
      <p>Our current prototype also does not keep track of past interactions; each prompt is executed
in a new session. Additional experiments should investigate the impact of this choice. What, in our opinion,
is more interesting to explore is the use of one-shot or few-shot prompting. We currently employ
zero-shot prompting with remarkable results. Given the heterogeneity of the Safety Evaluation
Opinions, we wonder whether a few-shot approach would yield better results.</p>
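      <p>A few-shot variant could simply prepend worked examples (a text fragment plus the expected cell value) to the zero-shot instruction. The sketch below shows one way to assemble such a prompt; the example pair is invented for illustration.</p>
      <preformat>
```python
# Hedged sketch of a few-shot prompt: worked examples are prepended to the
# zero-shot instruction. The example data below is invented.
def few_shot_prompt(examples, zero_shot_prompt):
    """examples: list of (source_text, column_header, expected_value)."""
    shots = ['Text: "{0}"\nValue for "{1}": {2}'.format(t, c, v)
             for t, c, v in examples]
    return "\n\n".join(shots + [zero_shot_prompt])

prompt = few_shot_prompt(
    [("Rats received 100 mg/kg bw/day for 28 days.", "dose", "100 mg/kg bw/day")],
    'Find the value for the following variable "dose" in the new text.')
```
      </preformat>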
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions and Future Work</title>
      <p>LLMs have been used to generate KGs, but the state of the art has shown challenges with
hallucinations and with the validity and well-formedness of the generated KG. We wanted to test whether
generating CSV would render KG generation more efficient and ensure domain-expert
involvement. The advantages are twofold: CSV is a simpler and more commonplace data
structure, and domain experts are more adept at manipulating spreadsheets. An initial exploration
of this approach makes us believe it is worth investigating further.</p>
      <p>We developed a prototype that generates CSV based on prompts, which users can copy into
a spreadsheet. These spreadsheets are subsequently transformed into RDF with RML. It is
important to note that the current approach only works for KG projects in which domain experts
use spreadsheets for which mappings to a KG already exist.
      <p>Future work is twofold: exploring different prompting techniques, as described in the previous
section, and integrating the prototype into a workflow for domain experts to allow for domain
expert validation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The TOXIN project is financially supported by Vrije Universiteit Brussel under Grant IRP19.
Some funding came from Cosmetics Europe and the European Chemical Industry Council.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanctorum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riggio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maushagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sepehri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arnesdotter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Delagrange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Kock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vanhaecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Debruyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>De Troyer</surname>
          </string-name>
          ,
          <article-title>End-user engineering of ontology-based knowledge bases</article-title>
          ,
          <source>Behaviour &amp; Information Technology</source>
          <volume>41</volume>
          (
          <year>2022</year>
          )
          <fpage>1811</fpage>
          -
          <lpage>1829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Debruyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Munnelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kilgallon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Crooks</surname>
          </string-name>
          ,
          <article-title>Creating a Knowledge Graph for Ireland's Lost History: Knowledge Engineering and Curation in the Beyond 2022 Project</article-title>
          ,
          <source>ACM Journal on Computing and Cultural Heritage</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>25:1</fpage>
          -
          <lpage>25:25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , C. Stadler,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Radtke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Junghanns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meissner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dziwis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bulert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>LLM-assisted Knowledge Graph Engineering: Experiments with ChatGPT</article-title>
          ,
          <source>in: First Working Conference on Artificial Intelligence Development for a Resilient and Sustainable Tomorrow - AI Tomorrow</source>
          <year>2023</year>
          , Leipzig, Germany, 29-30 June
          <year>2023</year>
          ,
          <source>Informatik Aktuell</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Randles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <article-title>R2[RML]-ChatGPT Framework</article-title>
          , in:
          <source>5th International Workshop on Knowledge Graph Construction (KGCW 2024), co-located with ESWC 2024</source>
          , Hersonissos, Greece, May 27,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Debruyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>The RML ontology: A community-driven modular redesign after a decade of experience in mapping heterogeneous data to RDF</article-title>
          , in: 22nd
          <source>International Semantic Web Conference - ISWC</source>
          <year>2023</year>
          , Athens, Greece, November 6-10,
          <year>2023</year>
          , Proceedings, Part II
          , volume
          <volume>14266</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <article-title>Towards self-configuring Knowledge Graph Construction Pipelines using LLMs - A Case Study with RML</article-title>
          ,
          in:
          <source>5th International Workshop on Knowledge Graph Construction (KGCW 2024), co-located with ESWC 2024</source>
          , Hersonissos, Greece, May 27,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>