<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Barcelona, Spain, April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Evaluating the Capabilities of LLMs in Traceability Maintenance for Automotive System and Software Requirements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vibhashree Hippargi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Kamsties</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jürgen Naumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Area 21 Software GmbH</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dortmund University of Applied Sciences and Arts</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>7</volume>
      <issue>2025</issue>
      <abstract>
        <p>Various researchers have explored the potential of Large Language Models (LLMs) for several software engineering tasks, including design solution generation, coding, and test case creation. This paper presents five empirical studies performed on OpenAI's ChatGPT-4o to analyze its performance to support diferent requirements engineering tasks related to requirements traceability. Using a dataset from an ongoing automotive project and industry experts' assessments as ground truth, we evaluate ChatGPT-4o's ability to assess trace link quality between system requirements, software requirements and test cases, to predict the trace links and also to analyze the quality of the requirements itself. We also tested ChatGPT-4o with an existing project ticket. Our findings in these studies indicate that ChatGPT-4o demonstrated strong performance, as evidenced by the metrics. These results suggest that ChatGPT-4o can be efectively integrated into daily industry practices as a support tool. The dataset is available on GitHub [3].</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;traceability</kwd>
        <kwd>link quality</kwd>
        <kwd>automotive</kwd>
        <kwd>large language model</kwd>
        <kwd>trace link prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>Requirements traceability is a crucial aspect of automotive software engineering, ensuring that all
requirements maintain clear and meaningful connections throughout the software lifecycle. Within the
framework of Automotive SPICE (ASPICE 4.0) standard Base Practices (BP4: Ensure consistency and
establish bidirectional traceability), not only must requirements be traceable, but the trace links must
also exhibit consistency, meaning that each link must be relevant and logically meaningful. Achieving
this level of consistency requires significant efort from human analysts to verify the quality of trace
links. As project complexity increases, so do the requirements and their subsequent connections, leading
to the search for tools to support.</p>
      <p>Maintaining accurate trace links between system requirements, software requirements, and test
cases is resource-intensive, especially when human analysts must verify each link. Furthermore,
inconsistencies in requirements documentation can lead to missed or weak links, negatively afecting
quality and compliance. The problem this short paper addresses is whether LLMs can efectively support
requirements traceability by assessing the link quality and predicting trace links, potentially reducing
the efort needed by human analysts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Design and Evaluation</title>
      <p>
        This section discusses the empirical studies performed on ChatGPT-4o along with their results. The
studies’ design and methodology were inspired by foundational AI-assisted requirements engineering
research [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The studies are supported by and conducted in Area 21 Software GmbH, a software
member of Elmos Semiconductors. 21 system and 24 software requirements (low complexity, no
IP) were selected from an ongoing project on firmware for an Ultrasonic Park Assist IC. Parameter
values were obfuscated, and the dataset was reviewed and approved by experts for integrity. We used
Elmos ChatGPT-4o Workspace equipped with ChatGPT-4o. To simulate real-world challenges, minor
inconsistencies and ambiguities were introduced in the requirements to test the robustness.
Study 1: Trace Link Quality check between System and Software Requirements. For this study,
ChatGPT-4o was prompted to assess and comment on the quality of the trace link with regards to two
aspects, meaningfulness (logically relevant) and completeness (all aspects are reflected). This is based on
how well the software requirement details its linked system requirement. To account for variability in
ChatGPT’s responses across diferent sessions, the task was performed in five separate sessions. The
responses across these sessions were largely consistent. However, to ensure reliability, findings that
appeared in multiple sessions were identified as ChatGPT’s final response.
      </p>
      <p>The evaluation was done using two independent approaches: Pooling and Ranking. In the
poolingbased evaluation, five experts at Area 21 with an average industry experience of 10 years (who will
be referred to as ’expert respondents’ going forward) were requested to perform the same task as
ChatGPT-4o. Their responses were pooled into a consolidated ground truth, against which
ChatGPT4o’s assessment was compared to evaluate its alignment with human judgment and reliability as a
support tool. Its performance was then evaluated for Precision, Recall, and F1 score using TP
(ChatGPT4o matches respondents’ findings), FP (ChatGPT-4o flagged insignificant issues that respondents did
not), FN (ChatGPT-4o missed valid issues that respondents identified), and TN (ChatGPT-4o found
valid issues that respondents overlooked, extending TN’s traditional definition to reflect AI’s ability to
uncover additional insights).</p>
      <p>In the ranking-based evaluation, an expert evaluator, i.e., the Head of Embedded Software
Development at Area 21, was presented with six anonymous responses (five from expert respondents and
one from ChatGPT-4o) and requested to rank them based on response quality and validity, with Rank 1
being the highest. Below is an example of linked requirements given, which ChatGPT-4o analyzed to
assess the trace link quality, evaluating the meaningfulness and completeness of SWRS004 in relation
to its linked SYSRS004.</p>
      <p>SYSRS004: IF performing Erase Backup AND IF Addressed with the Unicast Command THEN
the Sensor IC shall respond with CRM_RSP 0x0: Report Acknowledgement.</p>
      <p>SWRS004: IF performing Erase Backup AND IF Addressed with the Unicast Command THEN
the Software shall prepare to respond with CRM_RSP 0x0: Report Acknowledgment.</p>
      <p>The issue with this linkage is that SWRS 004 states, "the software shall prepare to respond", while
SYSRS 004 explicitly mandates sending the response. This ambiguity was flagged during the
completeness check, as it introduces uncertainty about execution. Both expert respondents and ChatGPT-4o
identified this issue, which adds one count to the TP.</p>
      <p>The pooling based evaluation result is, P=0.96, R=0.71, F1-Score=0.82. That is, the precision is high,
meaning ChatGPT-4o correctly identified most flagged issues. The recall is moderate, as ChatGPT-4o
missed some valid issues that expert respondents identified leveraging their experience and expertise.
Performance can be improved by refining the prompt to include additional evaluation criteria. The
F1-Score is moderate, balancing precision and recall. Time Eficiency: Expert respondents took 120
minutes, while ChatGPT-4o took 20 minutes, with requirements fed manually sequentially (∼ 1 min per
requirement). Future automation could further enhance eficiency.</p>
      <p>The ranking-based evaluation resulted in ChatGPT-4o responses having an average rank between 1
and 2 indicating that they were marked as valid and useful by the expert evaluator. There were only
two outliers that occurred because all responses for those entries addressed similar issues, leading to no
significant diferentiation (charts omitted for reasons of space).
Study 2: Trace Link Quality check between Software Requirements and Test Cases. In this
study, ChatGPT-4o was initially provided with possible combinations of parameter values and their
interpretations from the software requirements to store in its memory. It was then prompted to assess
whether or not the test cases entirely covered its linked software requirement and also suggest any
missing cases (referring to its memory) to attain complete coverage. During its analysis, ChatGPT-4o
not only suggested missing test cases but also identified incorrect parameter values being checked in
existing test cases. ChatGPT-4o’s analysis was reviewed by two test experts at Area 21, who rated its
validity and usefulness based on the survey questions: Q1: Is the Meaningfulness and Completeness
Analysis Valid? Q2: Do the suggested missing test cases improve coverage? Q3: Are there any other
critical aspects or test cases missed? Q4: Please comment on the performance of GenAI. The study was
designed as a quick survey to account for the limited availability of the test experts. This approach
helped minimize time demands while still gathering their insights into whether ChatGPT-4o could be
recommended for similar traceability tasks.</p>
      <p>The expert responses were analyzed using the priority order: "Partial &gt; Yes &gt; No." For Q1 and Q2,
the ChatGPT-4o achieved a combined "Yes" or "Partial" rating of 100%, with no "No" ratings recorded.
This highlights its consistent ability to address the criteria efectively. Furthermore, for Q3, 100% of
the response was "No", meaning, the evaluators confirmed that ChatGPT-4o did not omit any critical
aspects in its test cases, further validating its thoroughness.</p>
      <p>The "Partial" ratings for Q1 and Q2 were attributed primarily (70%) to deficiencies in the quality of the
software requirements rather than the GenAI’s output, indicating that software requirement’s lack of
detail also contributes to the ambiguity in writing test cases. The remaining 30% of the issues were related
to missing preconditions and negative test scenarios or redundant tests for some of the SWRS - Test
cases pair. However, this can be tackled by refining the prompts. Finally, the evaluation of ChatGPT-4o’s
responses for SWRS test cases showed strong performance. Here is the comment from one of the test
experts (also agreed by other) regarding the overall impression of GenAI: "AI is already highly advanced
for evaluating test cases and even requirements. However, there is still room for improvement. It is on a
promising path toward being fully utilized for such purposes. In the future, GenAI might take over such tasks
entirely, leaving us as reviewers to ensure that it has correctly understood and implemented our expectations."
Study 3: Trace Link Prediction between System and Software requirements. This study evaluates
ChatGPT-4o’s performance in predicting trace links between system requirements (SYSRS) and software
requirements (SWRS) in the so-called Standard Firmware project. Existing links, established by engineers
at Area 21, serve as the ground truth. The process involves providing a prompt and presenting SYSRS
and SWRS in randomized order.</p>
      <p>The analysis is conducted seven times, each using the same SYSRS list but a diferent SWRS list. In
each iteration, some SWRS are randomly removed (making it diferent from the previous list), leaving
their originally linked SYSRS unlinked. ChatGPT-4o’s predicted links are then compared to the ground
truth to evaluate precision, recall, and F1 score using TP (correctly linked), FP (falsely linked), TN
(correctly unlinked), FN (falsely unlinked). The final performance is averaged across all runs as P=1,
R=0.97, F1-Score=0.98. Both precision and recall are high, with a slight drop in recall due to a FN.
ChatGPT-4o correctly flagged an incomplete lower-level requirement but was marked false negative as
the ground truth retained the link.</p>
      <p>
        Study 4: Anomaly detection to identify gaps or ambiguities in requirements. In this study,
ChatGPT-4o was instructed to flag any linguistic or content-based ambiguities that may exist within
the requirements. Linguistic ambiguities include lexical, syntactic, semantic, syntax, and pragmatic
ambiguity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Content based ambiguities include Noise (irrelevant information), Silence (missing
important details), Over-specification (including unnecessary solution details), Contradiction (conflicting
definitions), Ambiguity (multiple interpretations of a requirement), Forward reference (referring to
undefined concepts), and Wishful thinking (unrealistic or unverifiable requirements) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The outcome
of this experiment aims to determine whether GenAI can support quality monitoring and improvement
by identifying potential ambiguities in requirement documentation. Below is the requirement and
ChatGPT-4o’s response in the outlined box.
      </p>
      <p>SYSRS 016: AFTER Reception of the Valid CRM_FCC_CMD, THEN the Sensor IC shall set the
PDCM_RSP Parameter KAC = 0x0.</p>
      <p>For the system requirement SYSRS 016 shown above, the term valid leaves room for interpretation.
Such terms like right, valid, graciously, fast, slow cause pragmatic ambiguity and hence should be
avoided in the requirements, especially without a proper definition of what constitutes them.
ChatGPT4o correctly identified this: "The context around what makes CRM_FCC_CMD valid could lead to
misunderstandings. Is the validity based on the correct syntax, values, or another condition? This needs
to be explicitly defined to avoid confusion". For the other requirements, it also identified the presence
of noise and contradiction.</p>
      <p>Study 5: Inconsistencies between system and software requirements - Real Project Ticket. In
this study, ChatGPT-4o was tested on an existing project ticket at Area 21 to determine its ability to
identify gaps in system-software requirement alignment.</p>
      <p>Description: The requirement SYSRS 021 is not suficiently detailed by the related downstream
requirements SWRS 023 and SWRS 024.</p>
      <p>SYSRS 021: The firmware shall allow configurability to select the retry for open and short faults
independently.</p>
      <p>SWRS 023: IF the software detects an open or short error AND retry is configured as enabled in
the DEFAULT_OPEN_SHORT_RETRY value in the COMPILE TIME CONFIGURATION data, the
afected channel shall be re-enabled for one TIMESLICE with the following setup at least every
200 milliseconds. SWRS 024: The COMPILE TIME CONFIGURATION data shall contain values
for DEFAULT_OPEN_SHORT_RETRY.</p>
      <p>The missing piece is "independent configuration": SWRS has only one parameter "requiring open and
short retries to be configured simultaneously." If an application needs separate timing parameters, a
single parameter DEFAULT_OPEN_SHORT_RETRY is insuficient.</p>
      <p>It was tested with ChatGPT-4o, to see if it was able to spot the issue. Prompt: "I will provide you a
system and its linked software requirement. Can you tell me if the latter suficiently details the former?
if not, please explain the issue and suggest revisions." Response: "SYSRS 021 emphasizes independent
configurability for open and short faults. However, SWRS 023 doesn’t clarify how open and short fault
retries are configured separately. The statement -if retry is enabled in DEFAULT_OPEN_SHORT_RETRY
- implies a single retry configuration for both fault types, lacking the necessary independent configurability .
There is no mention of whether the retry for open and short faults can be configured separately, which
is essential according to SYSRS 021. Suggested revision-Parameters- DEFAULT_OPEN_RETRY and
DEFAULT_SHORT_RETRY for open and short fault retries, respectively." The response shows that
ChatGPT-4o successfully identified the misalignment, demonstrating its ability to detect inconsistencies
between system and software requirements and suggest revisions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Threats to Validity</title>
      <p>This section outlines the potential threats to the validity of these empirical studies. Complexity and
Quantity of Requirements: The dataset was small and consisted of low-complexity requirements, as
more complex ones were restricted due to IP and confidentiality constraints. GenAI Tool Used: The
study utilized ChatGPT-4o, but outcomes may vary with other tools like Gemini or even the newer
versions of ChatGPT. Prompt Engineering: The prompts were designed with specific criteria checks
for the task. However, further refinement to include additional criteria could result in more diverse
and comprehensive outputs. Human Responses: The quality of responses from expert respondents was
influenced by their understanding, motivation, and approach to the evaluation task.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>
        The results indicate that ChatGPT-4o demonstrates strong performance during the mentioned use cases
(refer to P, R, F1-score), positioning it as a valuable tool for preliminary analysis. As highlighted in the
GPT-4 Technical Documentation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and by other researchers, human verification is crucial for accuracy
and reliability. ChatGPT-4o shows promise as an eficient assistant for a draft. Readers can refer to
GitHub [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for a detailed view of the prompts and responses, as well as to replicate the case studies.
Future work can explore ChatGPT-4o for broader Requirements Engineering tasks, including custom
GPT models tailored for collaboration and eficiency. OpenAI’s API could enable batch processing,
allowing large-scale requirement analysis overnight for review the next day, significantly improving
the time required for the evaluation.
Sincere appreciation and gratitude are extended to Area 21 Software GmbH for supporting the empirical
studies and the participants for providing valuable insights. Furthermore, we acknowledge Jianwei Shi
for his valuable feedback about the study on trace link prediction.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[1] GPT-4 Technical Report</source>
          .
          <year>2024</year>
          . arXiv:
          <volume>2303</volume>
          .08774 [cs.CL]. url: https://arxiv.org/abs/2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Jin L. C. Guo</surname>
          </string-name>
          et al. “
          <article-title>Natural Language Processing for Requirements Traceability”</article-title>
          . In: To appear
          <source>in Handbook on Natural Language Processing for Requirements Engineering</source>
          , Springer,
          <year>2025</year>
          .
          <year>2024</year>
          . arXiv:
          <volume>2405</volume>
          .10845 [cs.SE]. url: https://arxiv.org/abs/2405.10845.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Vibhashree</given-names>
            <surname>Hippargi</surname>
          </string-name>
          .
          <source>Case Study on ChatGPT for RE</source>
          .
          <year>2025</year>
          . url: https://github.com/VibhaHippargi/ CaseStudyonChatGPTforRE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Kamsties</surname>
          </string-name>
          . “
          <article-title>Understanding ambiguity in requirements engineering”</article-title>
          . In: Engineering and Managing Software Requirements (
          <year>2005</year>
          ). Publisher: Springer, pp.
          <fpage>245</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Meyer</surname>
          </string-name>
          . “
          <article-title>On Formalism in Specifications”</article-title>
          . In: IEEE Software (
          <year>1985</year>
          ). doi:
          <volume>10</volume>
          .1109/MS.
          <year>1985</year>
          .
          <volume>229776</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Fatin</given-names>
            <surname>Muhamad</surname>
          </string-name>
          et al. “
          <article-title>Fault-Prone Software Requirements Specification Detection Using Ensemble Learning for Edge/Cloud Applications”</article-title>
          .
          <source>In: Applied Sciences 13.14</source>
          (
          <year>2023</year>
          ), p.
          <fpage>8368</fpage>
          . doi:
          <volume>10</volume>
          .3390/ app13148368.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Krishna</given-names>
            <surname>Ronanki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Berger</surname>
          </string-name>
          , and Jennifer Horkof. “
          <article-title>Investigating ChatGPT's Potential to Assist in Requirements Elicitation Processes”</article-title>
          .
          <source>In: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)</source>
          .
          <source>IEEE</source>
          .
          <year>2023</year>
          , pp.
          <fpage>354</fpage>
          -
          <lpage>361</lpage>
          . doi:
          <volume>10</volume>
          .1109/SEAA60479.
          <year>2023</year>
          .
          <volume>00061</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>