<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Model⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrico Daga</string-name>
          <email>enrico.daga@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Carvalho</string-name>
          <email>jason.carvalho@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alba Morales Tirado</string-name>
          <email>alba.morales-tirado@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Open University</institution>
          ,
          <addr-line>Milton Keynes</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Licence Extraction</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Open Digital Rights Language</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
      <abstract>
        <p>Data catalogues play an increasing role in supporting information sharing and reuse on the Web. However, evaluating the reusability of Web resources requires an understanding of the related licence and terms of use. Recent methods for licence representation and reasoning allow exploring Web resources according to their permissions, obligations, and duties. Therefore, licence annotations should be linked to those representations in order to support users in filtering and exploring datasets according to their licencing requirements. However, populating data catalogues with licence information is a tedious and error-prone task. In this paper, we explore the suitability of a Large Language Model (LLM) to support the automatic extraction, annotation, and linking of licence information from reference Web pages of data catalogue items. The approach is evaluated for its capacity to automatically find relevant pages from within a main web page, extract data about copyright and licencing, and link licence descriptions to a knowledge graph of licences expressed in RDF/ODRL. We apply our method to extend the coverage of licence annotations of a data catalogue in the music domain.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CEUR Workshop Proceedings</title>
      <p>ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Data catalogues play an increasing role in supporting information sharing and reuse on the
Web in many domains. However, evaluating the reusability of Web resources requires an
understanding of the licence and terms of use associated with those resources. Recent methods
for licence representation and reasoning allow exploring Web resources according to their
permissions, obligations, and duties [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, licence annotations should be linked to
those representations in order to support users in filtering and exploring datasets according to
their licensing requirements [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        Our starting point is a registry of resources relevant to music research: the musoW
catalogue [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. MusoW is a knowledge graph and Web registry of datasets and projects annotated
with crowd-sourced metadata. We analysed the coverage of licence annotations by querying
the musoW SPARQL endpoint. Figure 1 shows how most resources do not have a specific licence
associated with them (274 items, almost 70% of the registry, against 50 annotated as CC-BY
and smaller groups for other licences). The reasons may vary: 1. the resource does not have a
specific licence; 2. the information was not available at the time the metadata was curated (but
it may be available today); 3. the curator overlooked the information, maybe because it was
hidden in secondary web pages.
      </p>
      <p>The lack of sufficient licensing and terms of use information for published web resources is
a well-known problem whose impact on the broader landscape of content reuse on the web
cannot be overestimated. In the case of the musoW catalogue, we are confident that most of the
annotations are actually correct (or they were correct at the time of their retrieval). However,
supporting curators in collecting such information without having to browse each one of the
websites catalogued manually would certainly contribute to improving the quality and coverage
of the musoW catalogue.</p>
      <p>
        Large Language Models (LLMs) such as OpenAI’s ChatGPT, Meta’s LLaMA, and Google’s
Bard have recently emerged, providing impressive abilities in language generation and opening
new opportunities for interacting with textual content, for example, for detecting and extracting
structured information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this paper, we apply a Large Language Model (LLM) for extracting
licence information from web resources to improve the coverage of licence metadata in Web
registries, considering the case of the musoW registry. Specifically, we pose the following
questions:
RQ1 Can copyright and licence information be derived automatically from web pages?
RQ2 How can copyright and licence information be derived automatically from web pages
using Large Language Models (LLM)?
RQ3 How accurately would an LLM detect the copyright and licence information (in other
words, is it worth pursuing this line of enquiry)?
RQ4 How much can we complete a curated catalogue of licence metadata with an automatic
method based on LLMs?
      </p>
      <p>The rest of the paper is structured as follows. The next Section is dedicated to related work.
Next, we illustrate our methodology to apply a Large Language Model to extract and link licence
information from Web resources (Section 3). We report on extensive experiments in Section 4,
before discussing our results (Section 5) and concluding the work in Section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Licences and terms of use have been a recurrent topic of interest in Web research. Initiatives
include the Creative Commons Rights Expression Language1, the ODRS vocabulary2 proposed
by the Open Data Institute, and the Open Digital Rights Language (ODRL), a W3C specification
to support the definition, exchange and validation of policies [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Online repositories are developed to publish
licences expressed in RDF, including the RDFLicense Dataset3 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and DALICC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which we
use in our work. A formal representation of licences can support users in deciding which
constraints they want to guarantee concerning the use of their data [
        <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
        ].
Datasets should include licence information, facilitating researchers’ decisions to reuse such
resources [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. Computational legal policies allow reasoning on the applicability of terms to data
derived from licenced resources [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The collection and curation of licence metadata is clearly a
necessary step for enabling such applications. Applying natural language processing techniques,
like the ones proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], can facilitate the process of data acquisition. Recently, there
has been increasing work on applying Large Language Models (LLM) to aid the extraction of
structured information from textual content (e.g. [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]). Differently from fine-tuning (e.g.
in RAG [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]), in-context learning allows for tailoring the response flexibly without significant
computational resources [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Emergent research is exploring complex tasks such as interpreting
the content of web pages and navigating4 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Attention has been given to evaluating the
suitability of LLMs in many end-user tasks, as well as to raising concerns about their limitations, for
example, in generating plausible but wrong information (hallucination) and propagating societal
biases derived from the text they have been trained on [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Knowledge graphs play a key
role in bridging the gap between language models and structured data models [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], including
attempts to mitigate known issues in content generated by LLMs such as hallucinations and
biases [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Similarly, LLMs are at the centre of current efforts in aiding knowledge graph
population in various domains [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. In our work, we use LLMs with in-context learning to
identify copyright and licence metadata from web resources, and develop a knowledge extraction
pipeline that generates links between two knowledge graphs: one of catalogue items and the
other of licences represented computationally.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>We tackle the problem by designing a methodology that engages with an LLM by asking it
to perform language understanding tasks. To avoid relying on the LLM’s embedded knowledge
(which is known to be incomplete and often leads to unreliable information due to hallucinations),
we design specific prompts (in-context learning) that make use of its language
processing/predictive abilities but constrain them only to content that we provide. The method is structured as
follows:
Data preparation We start from a list of resources published on the Web for which we want
to know the associated licence. The assumption is that for each resource there is a web page
which includes such information in one of the linked pages.
1Creative Commons rights language: https://creativecommons.org/ns 2The ODRS vocabulary, proposed by the Open Data Institute
3RDFLicense Dataset, https://rdflicense.linkeddata.es/
4See also tools such as vimGPT https://github.com/ishan0102/vimGPT and the BrowserPilot extension of
ChatGPT https://community.openai.com/t/browserpilot-a-plugin-for-enhanced-chatgpt-interactions/297653.
Task 1 :: identify Starting from the main web page of the catalogued resource, we design a
prompt asking the LLM to find no more than three links that may include copyright, privacy,
or licencing information. The resource’s home page is downloaded, and all HTML tags except
anchor tags (links) are removed. This step is necessary to reduce the content size, making
it less expensive to analyse with the LLM. We ask the LLM to find such information in the
content provided. The expected output is a list of links potentially including copyright and
licence information.</p>
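      <p>The reduction step above can be sketched as follows. This is a minimal illustration using Python’s standard library, not the code used in the experiments; the class and function names are ours, and the sketch returns text and links separately rather than keeping the anchor tags inline.</p>

```python
from html.parser import HTMLParser

class AnchorKeepingStripper(HTMLParser):
    """Drop all markup but keep the visible text and the anchor hrefs,
    reducing the page size before it is sent to the LLM."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_starttag(self, tag, attrs):
        # keep only the links; every other tag is discarded
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def reduce_page(html):
    """Return the plain text of a page together with the links it contains."""
    parser = AnchorKeepingStripper()
    parser.feed(html)
    return " ".join(parser.text), parser.links
```

<p>In Task 1, the reduced text and the surviving links would then be embedded in the prompt in place of the full HTML source.</p>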
      <p>Task 2 :: extract We design a prompt asking the LLM to derive copyright, licence, and terms
of use information from a piece of textual content. For each one of the resources and links
collected, we download the HTML page and remove all tags. We then send the content to the
LLM, which is asked to return a structured data object with three main fields: copyright, licence,
and terms of use.</p>
      <p>Task 3 :: link In this step, we focus on linking licencing information to a catalogue of well-known
licences. We designed a prompt asking the LLM to identify a licence from a piece of
text, selecting it from a list provided.</p>
      <p>Evaluation We evaluate each one of the previous steps along two dimensions:
1. the ability of the LLM to provide a syntactically correct answer (following the requested
specification); 2. the ability of the LLM to provide a semantically correct answer (a meaningful
one). Each task involving the LLM included a prompt engineering design phase which was
essentially exploratory, starting from a prompt-as-hypothesis and resulting in a final prompt,
after a short time of incremental trials with the LLM UI dashboard. In what follows, we apply
the methodology to the collection of resources published in the musoW catalogue that do not
have a specified licence and describe our approach in detail.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>In this section, we report on the experiments conducted by applying the methodology outlined
so far to the musoW catalogue of resources that do not have an explicit licence in the metadata.
The experiments were executed using the OpenAI ChatGPT API with the model
gpt-3.5-turbo-16k. The experiments are reproducible with the source code provided in this GitHub project:
https://github.com/polifonia-project/musow-licences-experiments-llm.</p>
      <sec id="sec-5-1">
        <title>4.1. Data preparation</title>
        <p>
          We use two main resources: 1. the musoW catalogue of musical resources on the Web [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] 2. the
DALICC catalogue of licences in RDF/ODRL [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We start by downloading the content from the
musoW SPARQL endpoint5, specifically the resource identifier and name, the main home page
of the resource, some categorical data and the licence metadata6. Next, we obtain the list of
DALICC licences and generate a file summarising the licence description, legal text URL, and
code used as a local name to identify the Linked Data entity7.
        </p>
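        <p>A minimal sketch of this download step is shown below. The query is illustrative only: the prefixes and property names are assumptions rather than the actual musoW schema, and only the request URL is built, without sending it.</p>

```python
import urllib.parse

MUSOW_ENDPOINT = "https://projects.dharc.unibo.it/musow/sparql"

# Illustrative query: the prefixes and property names are assumed,
# not taken from the actual musoW graph.
QUERY = """
SELECT ?resource ?name ?homepage ?licence WHERE {
  ?resource rdfs:label ?name ;
            foaf:homepage ?homepage .
  OPTIONAL { ?resource dcterms:license ?licence }
}
"""

def build_request_url(endpoint, query):
    """Encode a SPARQL SELECT query as a GET request asking for CSV results."""
    params = urllib.parse.urlencode({"query": query, "format": "text/csv"})
    return endpoint + "?" + params
```
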
        <sec id="sec-5-1-1">
          <title>5musoW endpoint: https://projects.dharc.unibo.it/musow/sparql</title>
          <p>6The data file can be inspected at https://github.com/polifonia-project/musow-licences-experiments-llm/blob/main/
Query-16.csv.
7The YAML file can be found in the experiments project folder on GitHub: https://github.com/polifonia-project/
musow-licences-experiments-llm/blob/main/licences.yaml</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Task 1: finding links in web pages</title>
        <p>The first task aims to automatically retrieve links pointing to web pages potentially including
information about copyright, licence, and terms and conditions.</p>
        <p>Prompt engineering.</p>
        <sec id="sec-5-2-1">
          <title>We start with the following prompt as an initial hypothesis:</title>
          <p>SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web.</p>
          <p>USER: Find the link to the pages describing licences, privacy policies, or terms of use
of the content in the following HTML source code. Please respond in a JSON format.
HTML code: {{HTMLCODE}}</p>
          <p>We perform tests with sample web pages from the musoW catalogue and change the prompt
to include more details regarding the expected format and strengthen the reference to HTML
knowledge. The resulting prompt is the following:
SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web.</p>
          <p>You also know how to find information on a web page by reading its HTML content.
USER: Find the link to the pages describing licences, privacy policies, or terms of use
of the content in the following HTML source code. Please respond ONLY with a JSON
format with a list of maximum 3 links, resolved according to this address: {url}
HTML code: {html}</p>
          <p>We iterate over the list of resources without explicit licence information (or marked with any
of the categories that do not refer to a specific licence, as discussed in Section 1). The answers
are saved locally and collected into a table that we later analyse to evaluate the performance of
the LLM under the two dimensions mentioned in our methodology, which we specify as follows:
Q1 Are there any links returned? (Yes/No) Q2 Is the returned content well-formed JSON? (Yes/No)
Q3* Are any of those links relevant? We evaluate the answer on a Likert scale, from definitely
not (1) to surely yes (5). While the first two questions can be answered automatically, we rely
on manual supervision to answer the third one (we indicate this with the asterisk). It needs to
be duly noted that we did not manually check each one of the web pages but only observed the
returned links and assessed whether any of them may potentially provide useful information.
A sample of the results of this task can be seen in Table 1.</p>
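          <p>The two automatic checks (Q1 and Q2) can be sketched as follows. The function name and the handling of the two possible JSON shapes are our assumptions about the LLM output, not the code used in the experiments.</p>

```python
import json

def evaluate_answer(answer):
    """Q2: is the answer well-formed JSON? Q1: does it contain any links?"""
    try:
        parsed = json.loads(answer)
    except (json.JSONDecodeError, TypeError):
        return {"well_formed": False, "has_links": False, "links": []}
    # the model may answer with a bare list or with an object holding a list
    if isinstance(parsed, list):
        links = parsed
    elif isinstance(parsed, dict):
        links = parsed.get("links", [])
    else:
        links = []
    return {"well_formed": True, "has_links": bool(links), "links": links[:3]}
```
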
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Task 2: extract copyright, licence, and terms of use</title>
        <p>The output of the previous step is a set of links for each one of the resources derived from
the content of the home web page. The second task aims to extract information from each
one of those web pages. We used all links returned, independently of our manual relevance
assessment (270 resources and 648 links in total).</p>
        <p>For this task, we want the information to be structured under three dimensions: copyright
statement – who owns the intellectual property of the resource; licence – what is the licence
associated with it (if any); and terms of use – to include any other information regarding the
use of the resource.</p>
        <p>Prompt engineering. We start with the following prompt as an initial hypothesis:
SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web.</p>
        <p>You also know how to find information on a web page by reading its HTML content.
USER: Please list the licences and copyright owners named in the following HTML code.</p>
        <p>Format the answer in JSON with two fields, ’copyright’ and ’licences’. {{HTMLCODE}}</p>
        <p>We perform tests with a sample of content from the web pages of the previous step and refine
the prompt until we obtain sufficiently consistent results. The resulting prompt is the following:
SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web.</p>
        <p>You also know how to find information on a web page by reading its HTML content
and express it in JSON format.</p>
        <p>USER: Please list the licences, copyright owners, and terms and conditions mentioned in
the following text. Respond only with a JSON object with 3 fields, ’copyright’,
’licences’, and ’terms and conditions’. The text is: {text}
We save the responses locally and gather them in a tabular format. Subsequently, we analyse
this data to assess the effectiveness of the LLM based on syntactic and semantic accuracy, as
stated in our methodology. These dimensions are elaborated as follows: Q4 Is the text returned
well-formed JSON? Q5, Q8 Did the LLM find any copyright information? Q6, Q9 Did the LLM
find any licence information? Q7, Q10 Did the LLM find any terms and conditions information?
We pose the last three questions above twice, the first time considering each of the links (web
pages) and associated requests to the LLM, and the second time aggregating all responses related to
each resource and quantifying whether any of the provided links were useful to gather the information.
While the above questions can be answered automatically, we add a qualitative, human-based
assessment of the quality of the results, answering the following additional questions on a
restricted sample of 100 items: Q11* Is the copyright information correct? Q12* Is the licence
information correct?</p>
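        <p>The per-resource aggregation behind Q8–Q10 can be sketched as follows. The data layout (a list of resource/answer pairs) and the function name are our assumptions.</p>

```python
def aggregate(responses):
    """responses: (resource_id, answer) pairs, where answer is the JSON object
    returned by the LLM for one link, or None when it was not well-formed.
    A resource counts as covered when any of its links carries the field."""
    fields = ("copyright", "licences", "terms and conditions")
    per_resource = {}
    for rid, answer in responses:
        found = per_resource.setdefault(rid, {f: False for f in fields})
        if not answer:
            continue
        for f in fields:
            if answer.get(f):
                found[f] = True
    return per_resource
```
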
        <p>Tables 2 and 3 show example annotations for questions Q11 and Q12 respectively. Crucially,
we observed that for all 100 evaluated responses to Q12, the LLM never returned a wrong
answer, while showing some variability in the form (for example, in some cases it did not find a
licence but it still returned some content). We leave the assessment of the information related
to the terms and conditions to future work.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Task 3: link licence descriptions to the licences database</title>
        <p>The expected output of the previous step is a structured JSON object with three fields: copyright,
licence, and terms of use. In this task, we focus on the content returned for the field ’licence’
and aim to automatically link such licence descriptions with the equivalent authoritative entry
derived from the DALICC catalogue of licences expressed in RDF/ODRL. The initial prompt
hypothesis is the following:
SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web.</p>
        <p>You also know how to find information on a web page by reading its HTML content.</p>
        <p>You are also proficient in reading YAML files.</p>
        <p>USER: Given the following list of licences, can you tell me to which licence the
following description refers to {LICENCEEXPR} {YAML}</p>
        <p>We refined the prompt by testing a sample of licence descriptions identified in the previous
step to improve the results. Specifically, we moved the list of licences to the SYSTEM input and
asked to return ’NONE’ if the description would not refer to any specific licence in the list.
SYSTEM: You are an expert in licencing and terms and conditions of resources on the Web
and know the following list of licences: {listOfLicences}
USER: Can you tell me to which licences the following licence description refers to?
The description is {description} −− Please respond by only reporting the selected
licences from the list or ’NONE’ if none is found.</p>
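        <p>The answer to this prompt still needs to be mapped back to a catalogue entry. A minimal sketch of that post-processing is shown below; the licence codes are examples standing in for the DALICC local names, and the function name is ours.</p>

```python
# Example codes, standing in for the local names of the DALICC entities.
LICENCE_CODES = ["CC-BY_v4", "CC-BY-NC_v4", "ExpatLicense"]

def parse_link_answer(answer):
    """Map the LLM reply to a known licence code, or None for 'NONE'
    or unrecognised replies."""
    reply = answer.strip()
    if reply.upper() == "NONE":
        return None
    # accept replies that mention a known code anywhere in the text
    for code in LICENCE_CODES:
        if code in reply:
            return code
    return None
```
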
        <p>We manually evaluate each one of the responses and annotate them as follows: (-1) The
licence described is in the list, but the LLM didn’t find it (or it hallucinated in some way);
(0) The licence described is not in the list and the LLM correctly did not find it; (1) The licence
description found is correct and in the list but the LLM did not link it properly (for example, it
did not respond with the licence code); (2) The licence was found in the list and linked properly
(the correct licence code was returned). In this assessment, we ignore the licence version and
accept to link, for example, a CC licence Version 1 to the equivalent Version 4 in the DALICC
catalogue8. Table 4 shows a sample of annotated responses, limited to the content of Task 2
returning the field ’licence’.</p>
        <p>We quantitatively evaluate this task as follows: Q13* How many correct decisions are made?
(all except -1) Q14* How many licences are correctly not found? (0) Q15* How many licences
are correctly found? (1 and 2) Q16* How many licences are linked to the list? (2) Q13 includes
all answers, positive and negative, while Q14 summarises the licences that were missing from
the sources and therefore not linked. Q15 counts the licences that were found and correct,
even if the linking task didn’t work syntactically, while Q16 only measures the licences that
were correct and properly linked to the list provided.</p>
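        <p>Under the -1/0/1/2 annotation scheme above, the four counts follow directly; a short sketch (the function name is ours, and the sample scores in the usage are invented):</p>

```python
def task3_metrics(scores):
    """scores: one annotation in {-1, 0, 1, 2} per evaluated resource."""
    n = len(scores)
    return {
        "Q13_correct_decisions": sum(s != -1 for s in scores) / n,  # all except -1
        "Q14_correctly_not_found": sum(s == 0 for s in scores) / n,
        "Q15_correctly_found": sum(s in (1, 2) for s in scores) / n,
        "Q16_properly_linked": sum(s == 2 for s in scores) / n,
    }
```
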
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and discussion</title>
      <p>In this section we present the results of our experiments and discuss them in the light
of the initial assumptions and hypotheses. A summary of the results is reported in
Table 5. The results of each one of the steps, the related manual annotations, and the
computed statistics can be reviewed at this address: https://docs.google.com/spreadsheets/d/
1wl-5YKcLVY9wDwSauPWz9NlLyeI7Ga1Da5WJXtOOp18/edit?usp=sharing.</p>
      <p>We can first look at the results of each one of the tasks, in order to gather evidence that would
allow us to answer the main questions.</p>
      <p>Task 1 The first task is related to finding links in web pages that may include copyright or
licence information. The task was executed 313 times, one for each resource home page. The
vast majority of results were provided with a correct JSON syntax (this includes responses with
no links). The LLM was capable of finding links in 86% of the cases, and most of the link sets are
deemed to be potentially relevant (51% were surely relevant and 21% were deemed potentially
relevant by our manual assessment). 8The registry typically includes only the most recent
version of a licence; however, we leave the assessment of licence versions to future work.</p>
      <p>Task 2 The second task aims at extracting textual content from the web pages mentioning
copyright, licence, or terms of use information. The task was executed 648 times, one for each
web page collected in the previous step. Those links covered 86% of the collection (270). A good
proportion of results were provided with a correct JSON syntax – 75% (this includes responses with
no information). Copyright information was found in 66% of the cases (82% of the resources,
221/270), while licence information had a much lower result: 26%, corresponding to less than
half of the resources for which at least one web page was returned in the previous step (43%).
Terms of use are also found with a similar success rate, however, we don’t delve into those now
and leave an assessment of the quality of this additional information to future work. At the
end of this second step, out of 313 initial resources, we obtain copyright information for 221 of
them and licence information for 115 of them, approximately 70% and 36% respectively. The
reasons vary from errors propagated from the previous step to the information not existing at
all on the web pages. Crucially, we validate the quality of the results with a manual supervision
of a sample of 100 resources, for which we find that 65% include correct copyright information
and 100% include correct licence information (or did not find any when none was there). This
information was checked by manually opening each one of the web pages and verifying its
content. Crucially, the LLM did not hallucinate when requested to derive licence information
from a web page; therefore, the returned content, when valid, is also true.</p>
      <p>Task 3 The last task is devoted to automatically linking the licence information to the list
of licences in the Dalicc catalogue. The results of this operation were performed on the 115
resources that included any form of licence information (including cases where such information
was empty, missing, or non-referring to a specific licence). We evaluate the entire result set
manually according to a Likert scale of 5, reflected in questions 13-16 (see Table 5). The prompt
to the LLM was designed to identify licences from the list provided, starting from a text that
supposedly mentions any of them. We can observe how the system made a correct decision
(whether there was a licence from the list or not) in 90% of the cases. However, in more than
half of the cases, there was no licence information – 57%. Nevertheless, the system managed to
correctly identify a licence from the DALICC catalogue for 38 resources (33% of the cases) and
in 25% of the cases it was able to report the correct licence code from the list (76% of the ones
correctly found). With this approach, we managed to retrieve and link licence information for
38 resources in an automatic (or semi-automatic) way, covering 12% of the resources which
originally did not have a licence specified. We conclude this section by discussing the original
research questions.</p>
      <p>[RQ1] Can copyright and licence information be derived automatically from web
pages? We can conclude that it is possible to derive such information from web pages, and
automatic methods involving LLMs can help in processing large numbers of web pages and
gathering relevant information with little human supervision. Crucially, we gathered evidence
that there is little risk of generating plausible but wrong information in the case of licencing,
thus making us confident that it is possible to apply LLMs for extracting licencing information
from the content of web resources (see Table 3). This is not true for copyright, as shown by our
evaluation of Q11 (reported in Table 2).</p>
      <p>[RQ2] How can copyright and licence information be derived automatically from web
pages using Large Language Models (LLM)? Our methodology, which was validated by
our experiments, is an initial answer to this question. However, we performed our experiments
with one specific LLM (ChatGPT) and we acknowledge that a larger study would be needed
to establish what kinds of prompts would be most successful in achieving this task generally,
for example considering portability across different LLMs. Nevertheless, our
experiments are promising and open directions about how to improve the overall workflow
both in terms of accuracy and coverage.</p>
      <p>[RQ3] How accurately would an LLM detect the copyright and licence information (in
other words, is it worth pursuing this line of enquiry)? By looking into the results,
we can observe that most of the decrease in coverage during the pipeline was due either to
difficulties in producing machine-readable content or in actually recognising that the information
is not there (for example, this can be seen by comparing the results of Q9 with Q14). Increasing
correct responses in the case of true negatives seems to be a challenge (sometimes the LLM
returns content that does not include relevant information in one task, which then becomes
ineffective in further tasks; for example, the LLM may return a piece of text that does not
describe a licence in Task 2, and the same text is then correctly not linked to any licence in Task
3). Instead, we can observe how the LLM was particularly accurate in deciding, for example,
whether a certain piece of text included a licence from a given list (Q13). These results are
particularly encouraging and we can definitely see this as a promising research direction.
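The closed-vocabulary framing behind Q13 and Task 3 can be sketched as follows. This is a hedged illustration, not the prompts used in our experiments: the licence list, prompt wording, and helper names are assumptions made for the example. Constraining the answer to a fixed list (plus a NONE option) and rejecting anything outside it is one way to keep hallucinated codes out of the catalogue.

```python
from typing import Optional

# Illustrative licence list; the real pipeline uses a knowledge graph of
# licences expressed in RDF/ODRL.
LICENCE_CODES = ["CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0", "ODbL-1.0", "NONE"]

def build_linking_prompt(snippet: str) -> str:
    """Frame the linking step as a closed-vocabulary choice over known codes."""
    codes = ", ".join(LICENCE_CODES)
    return (
        "Given the following text extracted from a web page, answer with "
        f"exactly one code from this list: {codes}. "
        "Answer NONE if the text does not describe any of these licences.\n\n"
        f"Text: {snippet}\nAnswer:"
    )

def parse_answer(reply: str) -> Optional[str]:
    """Map the model's reply onto the controlled list, rejecting anything
    else so hallucinated codes cannot enter the catalogue."""
    parts = reply.strip().split()
    if not parts:
        return None
    code = parts[0].rstrip(".,")
    return code if code in LICENCE_CODES and code != "NONE" else None
```

In this framing, a reply that is not exactly one of the listed codes is treated as "no licence found", which matches the true-negative behaviour discussed above.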
[RQ4] How much of a curated catalogue of licence metadata can we complete with an
automatic method based on LLMs? This final answer pertains to our case study. We
managed to find new licence information for 38 resources (12% of the set of resources without
licence annotations). We cannot confidently state that these are all the missing ones,
but from the analysis of the results of the intermediate steps in our pipeline, we are confident
that most of the web pages scrutinised did not include licence information (see the results for
Q12 and Q13). This is also coherent with the original statistics in musoW, where most of the
resources did not present licence information. However, our method allowed us to recover more
of them, inspiring us to consider opportunities for adopting LLMs as an aid for curating digital
libraries’ metadata.</p>
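A minimal sketch of how newly linked licences could be written back into a catalogue as RDF, pointing each resource at the URI of its licence description via dcterms:license. This is an assumed illustration, not our implementation, and the example URIs are invented.

```python
# Serialise {resource URI: licence URI} pairs as Turtle triples using
# dcterms:license. Hypothetical helper, for illustration only.
def to_turtle(links: dict) -> str:
    lines = ["@prefix dcterms: <http://purl.org/dc/terms/> ."]
    for resource, licence in sorted(links.items()):
        lines.append(f"<{resource}> dcterms:license <{licence}> .")
    return "\n".join(lines)

snippet = to_turtle({
    "https://example.org/musow/resource/42":
        "http://creativecommons.org/licenses/by/4.0/",
})
```

Emitting plain dcterms:license links keeps the catalogue annotations lightweight while still allowing each licence URI to resolve to a richer RDF/ODRL description.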
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>
        In this paper, we focused on the problem of helping data curators of Web registries to collect
and link licence information. To the best of our knowledge, this is the first work focusing on
extracting licence information from web resources with LLMs. The risk of LLM hallucination is
not fully dispelled by our results. In future work, we want to improve the quality of the
recommendations by refining the prompts and analysing error propagation, as well as extending the
evaluation to copyright and terms of use. Furthermore, we plan a larger evaluation comparing
different LLMs and covering the whole musoW dataset. Finally, we will possibly
integrate the method into the data acquisition workflow [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was supported by the EU’s Horizon Europe research and innovation programme
within the Polifonia project (grant agreement N. 101004746).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>R.</given-names> <surname>Iannella</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Guth</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Pähler</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kasten</surname></string-name>,
          <source>ODRL: Open Digital Rights Language 2.1</source>, Technical Report, W3C,
          <year>2015</year>. URL: https://www.w3.org/ns/odrl/2/ODRL21.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Havur</surname></string-name>,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steyskal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Panasiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mireles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Thurner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schönhofer</surname>
          </string-name>
          ,
          <article-title>Dalicc: a license management framework for digital assets</article-title>
          ,
          <source>Proceedings of the Internationales Rechtsinformatik Symposion (IRIS) 10</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <article-title>Licenses compatibility and composition in the web of data</article-title>
          ,
          <source>in: Proceedings of the Third International Conference on Consuming Linked Data - Volume 905, CEUR-WS.org</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>E.</given-names> <surname>Daga</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>d'Aquin</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Motta</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gangemi</surname></string-name>,
          <article-title>A Bottom-Up Approach for Licences Classification and Selection</article-title>,
          <source>in: Proceedings of the International Workshop on Legal Domain And Semantic Web Applications (LeDA-SWAn), co-located with ESWC 2015</source>, CEUR-WS,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>M.</given-names> <surname>Daquino</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Daga</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>d'Aquin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gangemi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Holland</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Laney</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Meroño-Peñuela</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mulholland</surname></string-name>,
          <article-title>Characterizing the landscape of musical data on the web: State of the art and challenges</article-title>,
          <source>in: Workshop on Humanities in the Semantic Web, co-located with ISWC</source>,
          <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dagdelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Ceder</surname></string-name>,
          <string-name>
            <given-names>K.</given-names>
            <surname>Persson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Structured information extraction from complex scientific text with fine-tuned large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2212.05238</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>M.</given-names> <surname>Steidl</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Iannella</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Rodríguez-Doncel</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Myles</surname></string-name>,
          <source>ODRL Vocabulary &amp; Expression 2.2</source>, W3C Recommendation, W3C,
          <year>2018</year>. URL: https://www.w3.org/TR/2018/REC-odrl-vocab-20180215/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Rodríguez-Doncel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <article-title>A dataset of RDF licenses</article-title>
          , in: R. Hoekstra (Ed.),
          <source>Legal Knowledge and Information Systems. JURIX</source>
          <year>2014</year>
          :
          <article-title>The Twenty-Seventh Annual Conference</article-title>
          ., IOS Press,
          <year>2014</year>
          . doi:10.3233/978-1-61499-468-8-187.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardellino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
          <article-title>Licentia: a Tool for Supporting Users in Data Licensing on the Web of Data</article-title>,
          <source>in: ISWC 2014 Posters &amp; Demo Track, 13th International Semantic Web Conference (ISWC)</source>,
          Riva del Garda, Italy,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bannoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>Relicensing combined datasets</article-title>
          ,
          <source>in: 2021 IEEE 15th International Conference on Semantic Computing (ICSC)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>d'Aquin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gangemi</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Motta</surname></string-name>,
          <article-title>Propagation of Policies in Rich Data Flows</article-title>
          ,
          <source>in: Proceedings of the 8th International Conference on Knowledge Capture, ACM</source>
          ,
          <year>2015</year>
          , p.
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Palmero</given-names>
            <surname>Aprosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <article-title>These Are Your Rights</article-title>,
          <source>in: The Semantic Web: Trends and Challenges</source>
          , volume
          <volume>8465</volume>
          <source>of LNCS</source>
          , Springer International Publishing,
          <year>2014</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>269</lpage>
          . doi:10.1007/978-3-319-07443-6_18.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Forbus</surname>
          </string-name>
          ,
          <article-title>Combining analogy with language models for knowledge extraction</article-title>
          ,
          <source>in: 3rd Conference on Automated Knowledge Base Construction</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Extracting biomedical factual knowledge using pretrained language model and electronic health record context</article-title>
          ,
          <source>in: AMIA Annual Symposium Proceedings</source>
          , volume
          <volume>2022</volume>
          , American Medical Informatics Association,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Woldesenbet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sanapathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Strandberg</surname></string-name>,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          , et al.,
          <article-title>Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events</article-title>
          ,
          <source>arXiv:2307.06439</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name><given-names>W.-t.</given-names> <surname>Yih</surname></string-name>,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Large language models for information retrieval: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2308.07107</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pitawela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name><given-names>H.-T.</given-names> <surname>Chen</surname></string-name>,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Webvln: Vision-and-language navigation on websites</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brundage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <article-title>Understanding the capabilities, limitations, and societal impact of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2102.02503</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap</article-title>
          ,
          <source>arXiv preprint arXiv:2306.08302</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling</article-title>
          ,
          <source>arXiv preprint arXiv:2306.11489</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name><given-names>L.-P.</given-names> <surname>Meyer</surname></string-name>,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bulert</surname>
          </string-name>
          ,
          <article-title>Benchmarking the abilities of large language models for rdf knowledge graph creation and comprehension: How well do llms speak turtle?</article-title>
          ,
          <source>arXiv preprint arXiv:2309.17122</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Daquino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wigham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Giagnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tomasi</surname>
          </string-name>
          ,
          <article-title>CLEF. A Linked Open Data Native System for Crowdsourcing</article-title>,
          <source>ACM JOCCH 16</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>