<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Joint Proceedings of the STAF 2025 Workshops: OCL, OOPSLE, LLM4SE, ICMM, AgileMDE, AI4DPS, and TTC, Koblenz, Germany (CEUR Workshop Proceedings)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Provider-agnostic knowledge graph extraction from user stories using large language models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thayná Camargo da Silva</string-name>
          <email>th.camargodasilva@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leen Lambers</string-name>
          <email>leen.lambers@b-tu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sébastien Mosser</string-name>
          <email>mossers@mcmaster.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kate Revoredo</string-name>
          <email>kate.revoredo@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brandenburg University of Technology Cottbus-Senftenberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Joint Proceedings of the STAF 2025 Workshops: OCL</institution>
          ,
          <addr-line>OOPSLE, LLM4SE, ICMM, AgileMDE, AI4DPS, and TTC. Koblenz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>McSCert, McMaster University</institution>
          ,
          <addr-line>Hamilton, Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
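      <author-notes>
        <fn fn-type="supported-by">
          <p>Kate Revoredo is funded by the Berliner Chancengleichheitsprogramm (BCP) as part of the DiGiTal Graduate Program.</p>
        </fn>
      </author-notes>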
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In agile software development, it is common to employ user stories to capture requirements. Specifying requirements in structured natural language has the advantage that requirements are easily understood by domain experts. As requirements evolve and become more complex over time, their analysis also becomes more difficult. Requirement specifications in the form of knowledge graphs have been proven to be useful to partially automate this analysis and make it more manageable. Related works automate the translation of user stories into knowledge graph representations, making the translation less error-prone and more efficient than the previous manual approach. A recent approach of Arulmohan et al. employs large language models (LLMs) to automate the translation and compares it with alternatives based on dedicated Natural Language Processing (NLP). A large experiment revealed that the latter outperformed the LLM-based solution.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graphs</kwd>
        <kwd>Requirements</kwd>
        <kwd>LLMs</kwd>
        <kwd>LangChain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        User stories [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are commonly used in agile software development to capture requirements in
semi-structured natural language. They are also known as backlog items within the Scrum framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
They describe both the functionality desired by specific stakeholders and the value they provide from
the perspective of these stakeholders. Typically, they follow the Connextra format [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
      </p>
      <p>As a &lt;PERSONA&gt;, I want &lt;ACTIONS over ENTITIES&gt; so that &lt;BENEFIT&gt;.</p>
      <p>
        User stories have the advantage that any stakeholder can support their specification and validation
more easily. When a backlog containing user stories grows, it is difficult to get a structured overview
of all requirements, which may lead to development delays [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Previous work therefore investigates
the translation of user stories into more formal requirement specifications such as domain models
or knowledge graphs [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], enabling their automated requirement analysis, for example, to detect
inconsistencies, overlaps or dependencies. A recent approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] uses large language models (LLMs)
to support knowledge graph extraction from user stories. The extraction is illustrated on a user story
example in Figure 1. On the right, a user story is depicted as the central node of the graph, and the
colored nodes and further relationships depict how this user story can be decomposed into nodes and
edges following the knowledge graph architecture for product backlogs on the left. The Product and
Backlog nodes are omitted for simplicity from the graph on the right, which contains only one user story.</p>
      <p>
        Using a curated dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], based upon a requirements dataset by Dalpiaz et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] consisting of 22 product backlogs containing 1679 user stories, as ground truth, Arulmohan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] compared their LLM-based solution
with Visual Narrator [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], NLP open-source software using techniques such as Part-of-Speech tagging
and rule-based extraction, and a Conditional Random Fields (CRF) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] model, a statistical model that
can be used to predict patterns based on their context. The results showed that while Visual Narrator
performed reasonably well in identifying personas and actions due to their predictable positions within
the text, it struggled with entity extraction, often misidentifying or omitting them. The LLM-based
solution demonstrated a substantial improvement over Visual Narrator in all categories. It required
significantly less development effort and achieved superior results, particularly in identifying actions.
However, their solution was outperformed by the CRF-based model, trained specifically for this task.
The LLM-based solution was still considered a promising approach for rapid prototyping.
      </p>
      <p>
        Because of the stochastic nature of LLMs, the fact that they evolve over time, and the availability of
different LLM providers (i.e., organizations that provide access to an LLM such as OpenAI (GPT) and
Meta (LLaMA)), the same experiment run today or with another LLM would generate different results.
Consequently, we explore in this paper whether we can come up with an LLM-based end-to-end automated
approach for knowledge graph extraction that is provider-agnostic and can be easily (re)evaluated
against a given ground truth. We explain the design and implementation of our solution based on
the LangChain framework in section 3. LangChain [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a robust open-source Python framework for
developing LLM-powered applications, provides connectivity to various LLM providers and offers a
declarative syntax to manage complex LLM interactions. We designed and implemented a
custom-made module focusing particularly on knowledge graph extraction. Moreover, our solution relies on
LangChain for defining a prompt template in advance such that input user stories can be dynamically
added to the prompt, for interacting with the LLM API via chains that standardize and automate feeding inputs to LLMs
and processing their outputs, and for pre-built modules facilitating various tasks such as configuring
LLMs, performing data transformations, and interacting with external databases (e.g., Neo4j [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
In section 3, we concentrate on explaining the details of our custom-made module as well as prompt
template definition. Finally, we also present an evaluation script that enables systematic and automated
(re)evaluation against a given ground truth, and we use this script exemplarily in two experiments
using different LLM providers in section 4. We could show that some of the LLMs are indeed able to
close the gap compared to the CRF approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We conclude the paper in section 5 with a summary
of our contribution, a discussion of the obtained results, and an outlook. This work is based on a master's
thesis [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ], where more details on the approach and its evaluation can be found.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        In this section, we describe the two main streams of research related to ours: i) domain modeling using
artifacts, such as ontologies and knowledge graphs to represent the requirements extracted from user
stories [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; and ii) the use of LLMs to support requirement engineering [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
      </p>
      <p>
        Concerning stream i), a user story modeling ontology was proposed by Mancuso et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
who created a domain-specific modeling language and integrated it into the AOAME modeling tool,
resulting in a visual user story representation. Ladeinde et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] also proposed the use of knowledge graphs to model
user stories. Their approach uses NLP techniques to extract the role, goal, and benefit, and model them
into an ontology. Arulmohan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] extracted knowledge from user stories using LLMs, specifically
ChatGPT (https://platform.openai.com/docs/models/gpt-4), and modeled them into knowledge graphs. The knowledge graph proposed by Arulmohan
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is considered more comprehensive and flexible compared to the models of Mancuso et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and Ladeinde et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], as the former lacks general node types, which complicates querying, while the
latter extracts only three node types and does not standardize relationship types.
      </p>
      <p>
        Regarding stream ii), White et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] propose prompt patterns that leverage large language models
(LLMs), such as ChatGPT, to facilitate the elicitation and identification of missing requirements. This
approach helps capture user needs more comprehensively during the early stages of software
development. Endres et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] explore the use of LLMs to formalize requirements from natural language
intent. This approach holds promise for streamlining the transition from user stories to formal
specifications, improving clarity, and reducing ambiguity. Arulmohan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] pioneered a distinct technique,
differing from traditional NLP methods, to model user stories by applying LLMs, specifically ChatGPT.
While they propose a framework for transforming requirement concepts into a comprehensive domain
representation, they do not address the specific challenge of knowledge graph creation.
      </p>
      <p>In this paper, based on the limitations from both streams, we explore the potential of LLMs for
extracting knowledge from user stories and representing this knowledge using knowledge graphs.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>
          Our solution automates an end-to-end process, from receiving a backlog, through translating the user
stories in the backlog into a graph document, to visualizing the translated knowledge graph in the
Neo4j [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] graph database. It starts with configuring the LLM, a required input of the USGT module
(cf. activity 1 in the BPMN model in Figure 2). The LLM configuration specifies the connection to the
LLM API via the LangChain framework [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which offers the possibility of connecting to several LLM
providers (https://python.langchain.com/docs/integrations/chat/). Then it transforms a user story into Document format, applying a LangChain ready-made
function to convert the user story into a Document object type that is required for LLM interaction
(cf. activity 2 in Figure 2). Then our newly developed UserStoryGraphTransformer (USGT) module
(cf. activities 3 and 4 in Figure 2) interacts with the selected LLM and generates a Graph Document
from the user story Document it is provided with. We will explain in more detail the USGT module
in subsection 3.2. Finally, the Graph Document is stored in the Neo4j database by using LangChain’s
Neo4j integration to establish a connection with the database and ingest the extracted knowledge graph
(cf. activity 5 in Figure 2).
        </p>
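        <p>To make this pipeline concrete, the following minimal Python sketch wires the activities together with LangChain. It is an illustration, not the released code: the UserStoryGraphTransformer constructor and its convert_to_graph_documents signature are assumptions modeled on LangChain's LLMGraphTransformer, and the Neo4j credentials are placeholders.</p>
        <preformat>
# Hypothetical end-to-end pipeline sketch (cf. the activities in Figure 2).
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document
from langchain_community.graphs import Neo4jGraph
from us_graph_transformer import UserStoryGraphTransformer  # module from [19]; API assumed

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)   # activity 1: configure the LLM
story = Document(page_content="As a student, I want to learn how to code, "
                              "so that I can build my own projects.")  # activity 2
usgt = UserStoryGraphTransformer(llm=llm)               # custom USGT module
graph_docs = usgt.convert_to_graph_documents([story])   # activities 3 and 4
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
graph.add_graph_documents(graph_docs)                   # ingest the Graph Document into Neo4j
</preformat>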
        <p>An example of the knowledge graph extracted from “As a student, I want to learn how to code, so that
I can build my own projects.” is depicted in Figure 3: the user story is represented by the grey node, the persona
by the blue, the actions by the red, the entities by the green, and the benefit by the pink node(s).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. UserStoryGraphTransformer Module</title>
        <p>
          We developed the UserStoryGraphTransformer (USGT) module as a specialized module to extract
knowledge graphs from user stories using arbitrary Large Language Models (LLMs). LangChain’s
prebuilt LLMGraphTransformer module automates the construction of knowledge graphs from text data
using an LLM. The LLMGraphTransformer struggled to fully extract nodes and relationships from user
stories due to the fine-grained, domain-specific ontology required and the high information density,
exceeding the capabilities of general knowledge graph extraction. We customized it to more effectively
capture the specific structure and relationships within the user stories. The USGT module processes
each user story independently, in line with the INVEST criteria (Independent, Negotiable, Valuable, Estimable, Small, Testable). This decision is further
supported by the related study of Arulmohan et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where researchers found that when the LLM
was provided with a list of user stories, it tended to confuse nodes between different stories and even
generated misinformation.
        </p>
        <p>The USGT module, as illustrated in the component diagram in Figure 4, comprises two main
components: the LLM Connector and the Graph Transformer.</p>
        <p>The LLM Connector receives as input a user story and an LLM configuration. The user story is a
textual input in a Document format, while the LLM configuration, defined by the user, specifies the
LLM provider and the model to be used. This configuration enables communication with the LLM
API by sending requests and receiving responses. Upon receiving these inputs, the LLM Connector
uses Prompt Templates to define prompts that guide the LLM to identify and extract the desired nodes
and relationships from the user story, and LangChain chains are used to interact with the LLM. As an
output, the LLM Connector delivers the LLM-derived components of a knowledge graph, composed of
nodes and relationships. The Graph Transformer processes the LLM-derived components, enriching them
with additional information and converting nodes and relationships into a Graph Document suitable
for ingestion into the graph database. It ensures that the knowledge graph components extracted using
the LLM Connector adhere to the ontology constraints and formats. The Graph Document is a special
data structure to represent a Knowledge Graph that is required to be ingested by the Neo4j database.</p>
        <p>When analyzing the knowledge graph architecture (cf. Figure 1), we identified several opportunities
to streamline the LLM’s workload, saving resources such as API costs and processing time, while
also improving accuracy. The userstory node represents the input itself and, therefore, does not require
further processing by the LLM to be represented as a node in the knowledge graph. Instead, it is handled
by the Graph Transformer component, by enriching the LLM’s extracted nodes with this userstory
node by default. Additionally, some relationships can be logically inferred from the existing nodes. For
instance, if the LLM is able to extract a persona node, the has_persona relationship can be established
with certainty. The same logic applies to has_action, has_entity, and has_benefit relationships. The
logical inference of relationships allows the LLM to concentrate its resources on extracting more
complex relationships – specifically triggers and targets – that demand a deeper semantic understanding
of the user story content. The LLM Connector therefore guides the LLM to only extract the persona,
action, entity, and benefit nodes and the relationships triggers and targets. The Graph Transformer is
responsible for enriching the knowledge graph with the userstory node and for deriving the logically
inferrable relationships from extracted nodes from the LLM connector.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Implementation</title>
        <p>
          We have implemented in Python the end-to-end process as depicted in Figure 2 and released it as
open source [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The us_graph_transformer.py file contains the core logic of the USGT module.
To facilitate its use and evaluation, the repository includes several supporting files. extractor.py
serves as the primary execution script, while evaluation.py calculates performance metrics using
the ground truth data from pos_baseline/. Data management is handled through template.json
(for input formatting) and the extracted-user-stories/ and evaluation/ directories (for storing
experimental outputs). We focus here on explaining implementation details of the custom-made USGT
module (see subsection 3.2).
        </p>
        <p>
          Models like ChatGPT 3.5 from OpenAI support function calls and provide structured outputs in a
predefined JSON format, simplifying data processing and integration. However, models like Llama 3
by Meta do not have this capability, requiring more explicit prompts and examples to guide the LLM
toward a desired output structure. Therefore, the LLM Connector defines different types of prompt
templates after assessing whether the selected language model supports function calls. For models that support function
calls, an output schema template is defined, containing nodes and relationship keys, whose values will
be filled out by the LLM response. If the selected model does not support function calls, additional
examples and detailed instructions are added to the prompt to guide the LLM to respond in a specific
JSON format. Then, the LLM Connector creates two separate prompts (main and benefit prompts):
one for extracting the persona, action, and entity nodes and their relationships, and another solely for
extracting the benefit node. We made this separation because the first three node types are always
present in any given user story, ensuring that their relationships are consistently extracted. In contrast,
the benefit node is optional and does not participate in the relationships to be extracted by the LLM
(triggers and targets). By isolating the benefit node extraction into its own prompt, the extraction
performance and consistency were improved. To guide both prompts, the 6 Strategies for getting better
results by OpenAI [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] were used as a reference (accessed in November 2024). This guide is composed of
high-level strategies and specific tactics that can be used to implement those
strategies. The purpose of applying these tactics is to improve the chances that the LLM produces
the desired outcome. Nine of the tactics were fully implemented, one was partially implemented, and
nine were not relevant or could not be applied in this solution. Key implemented tactics included:
instructing the model to adopt a specific persona, using delimiters to clearly define input sections,
explicitly specifying task completion steps, providing illustrative examples, and directing the model to
answer using provided reference text. The LLM Connector thus constructs two LangChain chains: one
for the main and one for the benefit prompts. It sends two requests (one for each chain) to the LLM API
and obtains the LLM responses, or LLM-derived components.
        </p>
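        <p>A condensed sketch of the two-chain construction follows (prompt texts abridged; the full templates are in us_graph_transformer.py [19], and the chat model llm and the story document are assumed to be configured as in the sketch in subsection 3.1):</p>
        <preformat>
from langchain_core.prompts import ChatPromptTemplate

main_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a requirements engineer. Extract the Persona, Action and "
               "Entity nodes and the TRIGGERS and TARGETS relationships..."),
    ("human", "{user_story}"),
])
benefit_prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the benefit sentence only if it is explicitly stated..."),
    ("human", "{user_story}"),
])

main_chain = main_prompt | llm        # with function-call support, a structured-output
benefit_chain = benefit_prompt | llm  # schema would be bound to the model instead
responses = [chain.invoke({"user_story": story.page_content})
             for chain in (main_chain, benefit_chain)]  # two requests to the LLM API
</preformat>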
        <p>
          The main prompt uses two different roles: the system role to set the context, and the human role to
provide the input, in this case, the user story. The context part is structured into six different parts:
Overview: Directs the model to assume the role of a requirements engineer and establishes the task
context. Nodes: Introduces the concept of a node, describes the three possible node types (Persona,
Action, and Entity), and provides detailed instructions for extracting all relevant nodes from the user
story. Relationships: Specifies what constitutes a relationship, presents the two possible relationship
types (TRIGGERS between Persona and the main Action, and TARGETS between Action and Entity), and
emphasizes that no other relationships should be extracted. Coreference Resolution: Reinforces the
importance of maintaining consistency and coherence across extracted nodes and relationships. Strict
Compliance: Stresses the necessity for the model to adhere strictly to the given instructions. Example:
Provides an example of a user story, illustrating the extracted nodes and relationships to guide the
model’s output. We show an excerpt of the main prompt in Listing 1, focusing on node extraction. The
complete listing of the prompt is in our repo [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], in the file us_graph_transformer.py.
        </p>
        <p>In addition to the main prompt, five examples were included to illustrate the expected output
format (cf. Listing 2) and guide the LLM in cases where function call support is not available. Here,
text represents the input user story, head and tail correspond to the extracted nodes, head type and tail
type indicate the type of each node, and relation specifies the relationship between them.</p>
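        <p>For illustration, one such example could look as follows in Python-dict form (field spellings are assumed; the actual five examples are listed in us_graph_transformer.py [19]):</p>
        <preformat>
example = {
    "text": "As a student, I want to learn how to code, "
            "so that I can build my own projects.",
    "head": "student", "head_type": "Persona",   # extracted node and its type
    "relation": "TRIGGERS",                      # relationship between head and tail
    "tail": "learn", "tail_type": "Action",
}
</preformat>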
        <p>
          The benefit prompt is much simpler, as it has a single objective: to extract the benefit node, if present.
We refer to our open-source repository [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] (file us_graph_transformer.py) for the listing of the
benefit prompt. It is structured into the following three components: Overview: Instructs the model to
assume the role of a requirements engineer and provides context for the task. Benefit : Defines what
constitutes a benefit sentence within a user story and directs the model to extract it only if it is explicitly
stated. Examples: Provides two examples of user stories—one where the benefit node is present and
can be extracted, and another where it is absent, prompting the LLM to return an empty response.
        </p>
        <p>Once the Graph Transformer receives the LLM-derived knowledge graph components from the LLM
Connector, again their processing depends on the LLM’s support for function calls. If the LLM supports
function calls, the LLM response is already structured into JSON format, and the nodes and relationships
can be directly accessed. In the other case, the LLM response cannot be structured into a JSON format.
It is a string, and if the LLM followed all the instructions provided, it is possible to parse this string
into a JSON format. Both scenarios result in a knowledge graph components object, which contains
the nodes and relationships extracted by the LLM. The nodes’ list is enriched by adding the input user
story as a node. Subsequently, the complete nodes’ list is converted into a list of Graph Node objects.
As mentioned earlier, the logically inferable relationships has_benefit, has_persona,
has_entity, and has_action are derived from the existing nodes and combined with the explicitly extracted relationships from the
knowledge graph components before converting them into graph relationships. Finally, the extracted
graph nodes and relationships are combined into a graph document.</p>
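        <p>A small sketch of this response handling, under the assumption that the structured response exposes nodes and relationships keys as described above:</p>
        <preformat>
import json

def parse_components(response):
    if isinstance(response, dict):   # function-call models: already structured JSON
        data = response
    else:                            # other models: parse the string, assuming the
        data = json.loads(response)  # LLM followed the JSON-format instructions
    return data["nodes"], data["relationships"]
</preformat>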
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>
        We developed an evaluation script (evaluation.py in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) that automatically extracts comparison
data, supporting the users of our solution in answering the following two research questions (RQs).
      </p>
      <p>
        RQ 1. How does the accuracy of knowledge graph extraction of our solution differ when
configuring it with different LLMs?
RQ 2. How does the accuracy of knowledge graph extraction of our solution configured with a
specific LLM compare with the CRF and Visual Narrator solutions as presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]?
We exemplarily answer both RQs by using the same dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as employed for experimentation by
Arulmohan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as ground truth. This means that we used a cleaned version of the annotated dataset
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], representing 87% of the complete version, to ensure comparability of our results with their results
(cf. RQ2). Our evaluation script can be easily generalized to work with a different ground truth. The
evaluation script’s input expects a JSON structure mirroring that of template.json in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], allowing
for flexible evaluation against diverse ground truth datasets.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation metrics</title>
        <p>
          We use a combination of multiple-classification (MC) metrics, such as F-measure, and token-similarity
(TS) metrics, such as BERTScore [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The MC metrics evaluate the LLM’s ability to categorize texts
into various groups, each representing a label. The TS metrics measure the semantic similarity between
the LLM’s generated text and a reference. The metrics are calculated at the user story level but are
averaged across the backlog to provide a broader perspective. This approach ensures that the evaluation
considers variations in context and patterns across different backlogs.
        </p>
        <p>
          We calculate the MC metrics recall, precision and F-measure. As a prerequisite it is essential to define
what qualifies as a correct output from the LLM. We adopted the criteria outlined by Arulmohan et
al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], utilizing three comparison modes: strict, inclusive, and relaxed. The strict comparison considers
a result correct when the LLM produces exactly the same response elements as the ground truth, the
inclusive comparison adds some flexibility, considering the LLM’s output correct if it is a superset of the ground
truth, and the relaxed comparison ignores adjective qualifiers and treats plurals and singulars as the
same. In addition to the experiment by Arulmohan et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we added the evaluation of the benefit node.
The MC metrics were calculated for the strict and inclusive modes, but not for the relaxed mode because
of missing Part-of-Speech annotations on the ground-truth dataset. Based on the three comparison
criteria, each knowledge graph (KG) component generated by the LLM can be classified as follows: True
Positive: An element that is considered equivalent to the ground truth. False Positive: An element
that was incorrectly identified by the LLM, as it should not have been included. False Negative: An
element that should have been identified by the LLM but was not, indicating a missing component.
        </p>
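        <p>A minimal sketch of these MC metrics at the user story level follows (set-based comparison; the inclusive containment test is a simplification of the criteria above, and helper names are assumptions rather than the evaluation.py internals):</p>
        <preformat>
def mc_metrics(predicted, truth, mode="strict"):
    if mode == "strict":
        tp = len(predicted.intersection(truth))  # exact matches only
    else:  # inclusive: an output counts if it contains a ground-truth element
        tp = sum(1 for p in predicted if any(t in p for t in truth))
    fp = len(predicted) - tp                     # incorrectly identified elements
    fn = max(0, len(truth) - tp)                 # missing components
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
</preformat>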
        <p>
          As TS metrics (as opposed to [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), we chose BERTScore [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] to assess the quality of the extracted nodes.
While other metrics like Perplexity, BLEU, ROUGE, and METEOR exist, their reliance on token overlaps
and n-gram matches makes them less suitable for our dataset, which contains many single-token actions
and entities. Because these lexical similarity metrics do not fully capture semantic relationships, they
are less effective when lexical information is limited. BERTScore provides a more relaxed evaluation by
using contextual embeddings to capture deeper meaning and assess semantic similarity.
        </p>
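        <p>As a sketch, BERTScore can be computed with the bert-score package; the exact model and settings used by evaluation.py [19] may differ:</p>
        <preformat>
from bert_score import score  # pip install bert-score

candidates = ["learn how to code"]  # node text extracted by the LLM
references = ["learn to code"]      # corresponding ground-truth node text
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())             # averaged semantic-similarity F1
</preformat>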
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments and results</title>
        <p>Our first experiment uses our evaluation script to exemplarily answer RQ 1 for two different LLM
providers: the first is Llama 3 8B by Meta, an open-source model that is cost-free for research purposes and highly
capable, and the second is GPT-4o mini, the most cost-efficient small model by OpenAI. We
have chosen these models because Llama 3 does not support function calls, while GPT-4o mini does
support this functionality, thereby demonstrating the versatility of our solution to work with different
LLM providers. For the LLM configuration, both models, Llama 3 and GPT-4o mini, were set to a
zero temperature. This configuration parameter is necessary to control the stochasticity of the output,
making it more consistent and predictable, since the output of the LLM is otherwise non-deterministic. Our
experiment involved configuring the LLMs and running the extractor.py script to apply the USGT
module to extract knowledge graphs from all the user stories within the 22 product backlogs, and then
evaluating the results against the ground truth.</p>
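        <p>Switching providers only changes the LLM configuration; a sketch follows (the Llama 3 connector depends on how the model is hosted, here assumed to be a local Ollama instance):</p>
        <preformat>
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # supports function calls
# llm = ChatOllama(model="llama3:8b", temperature=0)  # no function-call support
</preformat>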
        <p>
          We only show detailed results from our evaluation in the strict mode in Table 1 and report on
average results for the inclusive and relaxed modes (see Figure 5). We refer to the folder
user-story-extractor/evaluation in our repo [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for the evaluation details on the other modes. In the strict
mode comparison (Table 1), GPT-4o-mini outperforms Llama 3 in extracting all node types except for
benefit, where Llama 3 shows a slightly higher performance. On average, Llama 3 struggles primarily
with entity and action extraction in this mode. A detailed analysis of the data reveals that Llama 3
frequently mishandles noun qualifiers and verb complements, both of which are critical in this strict
mode. Additionally, a significant performance gap is observed in backlog g02 for benefit extraction.
Upon closer inspection, many user stories in this backlog did not have a benefit to extract, yet Llama 3
either attempted to extract other parts from the story or produced hallucinations as the benefit node.
        </p>
        <p>
          Moving to the inclusive and relaxed modes, both models show improved scores across all categories
as expected (see Figure 5). However, GPT-4o-mini continues to lead with a more consistent performance
across all categories, maintaining its edge in benefit extraction in the inclusive mode. The BERTScore
comparison brings in the semantic alignment perspective, and in this case both models show a good
performance (see [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for concrete evaluation data). Overall, GPT-4o-mini (Figure 5) demonstrates
superior performance across all evaluated categories, with consistently higher F-Measure scores. Even
though it does not have a perfect F-measure under exact string matching, it proves capable of
capturing semantic alignment, particularly for the benefit node, which poses a great challenge since it
has many elements to be extracted. Llama 3, while competitive, exhibits greater variability.
        </p>
        <p>Our second experiment aims to answer RQ 2. We used GPT-4 turbo as the LLM in this experiment. We
conducted it using the same dataset and a zero-temperature configuration of the LLM. The earlier
study employed Conditional Random Fields (CRF), a tailored machine-learning approach trained on 20%
of the dataset. This method outperformed all GPT models (GPT-3.5-turbo-0125, GPT-3.5-turbo-0613,
GPT-4-0125-preview, and GPT-4-0613) and the Visual Narrator technique. When comparing CRF to
GPT-4-turbo using our solution, CRF still leads in performance; however, the gap has narrowed under
the same evaluation metrics, making LLM-based solutions a competitive alternative, in principle.</p>
        <p>The results reported for RQ 1 and RQ 2 show that our solution is able to successfully extract nodes and
relationships from user stories according to the specific ontology proposed; however, the adherence to
the ontology, and therefore the quality of the results, depends on the chosen large language model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion &amp; Discussion</title>
      <p>In this work, we focused on making LLMs available for tasks related to requirements extraction, more
precisely, the reification of knowledge graphs out of user story backlogs. By abstracting the task
under study from the LLM provider, we offered an easy way to select the “best” provider for a given
task, as well as extended validation using a reference ground truth. We observed that LLMs have
strengths and weaknesses depending on the part of the story to extract, which reinforces the need to
be able to seamlessly switch between providers (RQ1). By comparing LLMs to a ground truth and a
non-LLM-based approach, we observed that while a supervised approach such as CRF ends up being more
efficient in terms of training and computation, LLMs come close to its performance while not
requiring annotation, which can be helpful in some scenarios.</p>
      <p>
        Threats to validity. One limitation concerns the evaluation against the ground truth. We reused
existing benchmarks from non-LLM approaches to compare our work with the state of practice, leading
to an incomplete evaluation. Although the data set was annotated through a rigorous process, the exact
annotation of node elements remains somewhat subjective, owing to their intrinsic nature. For
example, out of the text “I want to add type information to my data”, one might consider
the entity to be “type information”, while others might consider it to be “data”, as the latter includes the
former. This subjectivity poses a threat to construct validity, as the evaluation might not consistently
reflect the true quality or correctness of the extracted knowledge graph components. This is in general
mitigated by the operational scale (e.g., thousands of stories) and a choice of evaluation metrics that is
robust to small variations. Another limitation is the nondeterministic nature of LLM outputs. Even
with model parameters such as the temperature set to zero, responses from the LLM may vary across
executions. This variability impacts internal validity, as it can lead to inconsistent experimental results,
making it difficult to attribute the results solely to the factors being tested. Moreover, it affects the
external validity, as the same approach might yield inconsistent results when applied in different settings
or with alternative data sets. Finally, the evaluation of the knowledge graph focused solely on the
extraction of the nodes, as the extraction of relationships was not evaluated in previous comparable
work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Discussion. To determine the most suitable approach, it is necessary to consider specific use
cases and contexts. For tasks that require constant updates, flexibility, or are part of larger, dynamic
systems, LLMs may offer advantages and come with the immediate benefit of not necessitating any
further annotation or data curation for training. But while this free-lunch approach (no need to build a
ground truth) can make sense for prototyping proof-of-concepts, in the long run, it triggers privacy
and security issues: requirements are often trade secrets, and, as such, using an LLM without proper
due diligence might be an issue. This benefit might be impacted if the model has to be fine-tuned.
Another issue is related to the difficulty of properly evaluating the outcome. In addition to the lack of
determinism, not having a proper ground truth for the tasks considered could alter the evaluation: a
lot of research only relies on subjective feedback from interviews to mitigate this, but quantitative
evaluation cannot be overlooked when integrating LLMs into critical systems [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Unfortunately, when
taking the quantitative evaluation path, focusing on metrics such as accuracy or F-score might not be
sufficient or even reasonable in the long run. For example, in our use case, all LLMs have an accuracy
of 1 regarding the identification of the persona. This is highly related to the structure of the Connextra
pattern, where an approach taking the nouns located between the third word and the first comma would
also have the same accuracy. By only focusing on such metrics, we as researchers do not consider the
impact of our decisions on a larger scale. One immediate thing people might think of is the energy
cost of each request, in addition to the cost of training associated with LLMs. But the human cost, for
example, is often overlooked: to build the dataset used for their image-generation features, OpenAI
exposed Kenyan workers to highly disturbing content (including child abuse, bestiality, rape and sexual
slavery), for less than $2/h. Even if, as researchers in software engineering, we are not ethicists, it is still
our responsibility to explore alternative solutions before jumping on the LLM train [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and using it for
use cases where other alternatives would produce results as good as the LLM ones.
      </p>
      <p>
        Outlook. An immediate perspective of our contribution is to extend the benchmark efforts and
define a comprehensive way of evaluating knowledge graph building, leveraging not only the nodes
but also the relations between them. We envision an approach à la Transformer Ranker [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], where the
system can automatically guide practitioners to the LLM that best fits their needs. A second perspective
is to capture the needs of the task at stake, and provide feedback to practitioners on how relevant LLMs
are for such a task, as well as to identify alternative approaches (e.g., heuristic based, supervised) that
could be considered for the same objective.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <article-title>User stories applied: For agile software development</article-title>
          ,
          <source>Addison-Wesley Professional</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Schwaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sutherland</surname>
          </string-name>
          ,
          <article-title>The scrum guide</article-title>
          ,
          <source>Scrum Alliance</source>
          <volume>21</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Robeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. M. E. Van Der Werf</surname>
          </string-name>
          , S. Brinkkemper,
          <article-title>Extracting conceptual models from user stories with visual narrator</article-title>
          ,
          <source>Requirements Engineering</source>
          <volume>22</volume>
          (
          <year>2017</year>
          )
          <fpage>339</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mancuso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Laurenzi</surname>
          </string-name>
          ,
          <article-title>An approach for knowledge graphs-based user stories in agile methodologies</article-title>
          ,
          <source>in: Business Information Research</source>
          , volume
          <volume>493</volume>
          <source>of LNBIP</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ladeinde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Khalajzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kanij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grundy</surname>
          </string-name>
          ,
          <article-title>Extracting queryable knowledge graphs from user stories: An empirical evaluation</article-title>
          , in: ENASE, SCITEPRESS,
          <year>2023</year>
          , pp.
          <fpage>684</fpage>
          -
          <lpage>692</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Arulmohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-J.</given-names>
            <surname>Meurs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mosser</surname>
          </string-name>
          ,
          <article-title>Extracting domain models from textual requirements in the era of large language models</article-title>
          ,
          <source>in: 2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Arulmohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mosser</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-J. Meurs</surname>
          </string-name>
          ,
          <article-title>ace-design/qualified-user-stories: Version 1.0</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/ZENODO.8136975, annotated dataset.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <article-title>Requirements data sets (user stories)</article-title>
          ,
          <year>2018</year>
          . doi:10.17632/7ZBK8ZSD8Y.1, dataset.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Robeer</surname>
          </string-name>
          , G. Lucassen,
          <string-name>
            <surname>J. M. E. M. van der Werf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          , S. Brinkkemper,
          <article-title>Automated extraction of conceptual models from user stories via NLP</article-title>
          , in: RE, IEEE Computer Society,
          <year>2016</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          , in: ICML, Morgan Kaufmann,
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>LangChain</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://github.com/langchain-ai/langchain, last access: 2024-10-10.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          Neo4j
          ,
          <year>2024</year>
          . URL: https://neo4j.com/, last access: 2024-06-12.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. C.</given-names>
            da
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Extracting Knowledge Graphs from User Stories Using Langchain</article-title>
          , Master's thesis,
          <source>BTU Cottbus-Senftenberg</source>
          ,
          <year>2025</year>
          . doi:10.26127/BTUOpen-7038.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Raharjana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Siahaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fatichah</surname>
          </string-name>
          ,
          <article-title>User stories and natural language processing: A systematic literature review</article-title>
          ,
          <source>IEEE access 9</source>
          (
          <year>2021</year>
          )
          <fpage>53811</fpage>
          -
          <lpage>53826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Large language models for software engineering: A systematic literature review</article-title>
          ,
          <source>ACM Trans. Softw. Eng. Methodol</source>
          .
          <volume>33</volume>
          (
          <year>2024</year>
          )
          220:1-220:79. URL: https://doi.org/10.1145/3695988. doi:10.1145/3695988.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hemmat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharbaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kolahdouz-Rahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lano</surname>
          </string-name>
          , S. Y. Tehrani,
          <article-title>Research directions for using LLM in software requirement engineering: a systematic review</article-title>
          ,
          <source>Frontiers in Computer Science</source>
          <volume>7</volume>
          (
          <year>2025</year>
          ). doi:10.3389/fcomp.2025.1519437.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spencer-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design</article-title>
          ,
          <source>in: Generative AI for Efective Software Development</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Endres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fakhoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Lahiri</surname>
          </string-name>
          ,
          <article-title>Can large language models transform natural language intent into formal method postconditions?</article-title>
          ,
          <source>Proc. ACM Softw. Eng</source>
          .
          <volume>1</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3660791.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T. C.</given-names>
            da
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Extracting knowledge graphs from user stories using langchain</article-title>
          ,
          <year>2024</year>
          . doi:10.5281/zenodo.14254058, https://zenodo.org/record/14254058.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Prompt engineering,
          <year>2023</year>
          . URL: https://platform.openai.com/docs/guides/prompt-engineering, last access: 2024-09-07.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , V. Kishore, F. Wu,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          , BERTScore:
          <article-title>Evaluating text generation with bert</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=SkeHuCVFDr.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Norheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rebentisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Draeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kerbrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. L.</given-names>
            <surname>De Weck</surname>
          </string-name>
          ,
          <article-title>Challenges in applying large language models to requirements engineering tasks</article-title>
          ,
          <source>Design Science 10</source>
          (
          <year>2024</year>
          )
          e16. doi:10.1017/dsj.2024.8.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>B.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , T. Menzies,
          <article-title>AI Over-Hype: A Dangerous Threat (and How to Fix It)</article-title>
          ,
          <source>IEEE Software 41</source>
          (
          <year>2024</year>
          )
          <fpage>131</fpage>
          -
          <lpage>138</lpage>
          . doi:10.1109/MS.2024.3439138.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Garbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ploner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <article-title>TransformerRanker: A tool for efficiently finding the best-suited language models for downstream classification tasks</article-title>
          ,
          <year>2024</year>
          . arXiv:2409.05997.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>