<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Inductive to Deductive: LLMs-Based Qualitative Data Analysis in Requirements Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Syed Tauhid Ullah Shah</string-name>
          <email>syed.tauhidullahshah@ucalgary.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamad Hussein</string-name>
          <email>mohamad.hussein@ucalgary.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ann Barcomb</string-name>
          <email>ann.barcomb@ucalgary.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Moshirpour</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: A. Hess, A. Susi</institution>
          ,
          <addr-line>E. C. Groen, M. Ruiz, M. Abbas, F. B. Aydemir, M. Daneva, R. Guizzardi, J. Gulden, A. Herrmann, J. Horkoff, S. Kopczyńska, P. Mennig, M. Oriol Hilari, E. Paja, A. Perini, A. Rachmann, K. Schneider, L. Semini, P. Spoletini</addr-line>
          ,
          <institution>A. Vogelsang. Joint Proceedings of REFSQ-2025 Workshops, Doctoral Symposium, Posters &amp; Tools Track, and Education and Training Track.</institution>
          <addr-line>Co-located with REFSQ 2025. Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Calgary</institution>
          ,
          <addr-line>Calgary</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of California, Irvine</institution>
          ,
          <addr-line>Irvine, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Requirements Engineering (RE) is essential for developing complex and regulated software projects. Given the challenges in transforming stakeholder inputs into consistent software designs, Qualitative Data Analysis (QDA) provides a systematic approach to handling free-form data. However, traditional QDA methods are time-consuming and heavily reliant on manual effort. In this paper, we explore the use of Large Language Models (LLMs), including GPT-4, Mistral, and LLaMA-2, to improve QDA tasks in RE. Our study evaluates LLMs' performance in inductive (zero-shot) and deductive (one-shot, few-shot) annotation tasks, revealing that GPT-4 achieves substantial agreement with human analysts in deductive settings, with Cohen's Kappa scores exceeding 0.7, while zero-shot performance remains limited. Detailed, context-rich prompts significantly improve annotation accuracy and consistency, particularly in deductive scenarios, and GPT-4 demonstrates high reliability across repeated runs. These findings highlight the potential of LLMs to support QDA in RE by reducing manual effort while maintaining annotation quality. The structured labels automatically provide traceability of requirements and can be directly utilized as classes in domain models, facilitating systematic software design.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Requirements Engineering is a key process in developing large and complex software systems. It
ensures that the software meets the needs of stakeholders by gathering, organizing, and managing their
requirements systematically [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. QDA is an emerging approach in RE that aids in analyzing unstructured
data like interviews and surveys to identify patterns and insights [2, 3, 4]. One important step in QDA
is labeling or coding, where pieces of text are categorized into themes to make the data more structured
and meaningful [5]. This process helps improve traceability, consistency, and the quality of software
design [6]. However, traditional QDA methods can be slow, inconsistent, and require a lot of manual
work [7].
      </p>
      <p>Recently, Large Language Models (LLMs), such as GPT-4 [8], Gemini [9], and LLaMA-2 [10], have
shown great potential at processing and generating human-like text, making them useful for working
with large sets of unstructured data. Unlike traditional models, LLMs use natural language prompts for
tasks such as text classification [11], summarization [12], and translation [13]. Their adaptability across
zero-shot and few-shot scenarios [14, 9] reduces reliance on extensive training data and computational
resources. In RE, structured outputs like software specifications are essential, and LLMs can help by
generating accurate and contextually relevant outputs [15].</p>
      <p>In this study, we use LLMs, such as GPT-4, Mistral, and LLaMA-2, to assist in qualitative data annotation
for RE, aiming to reduce manual effort and accelerate the analysis process. Our approach uses both
inductive and deductive annotation. To facilitate the alignment of inductive and deductive with the NLP
setup, we treated inductive annotation as zero-shot learning and used one-shot and few-shot learning for
deductive annotation. Our experiments, conducted on two test cases (Library Management System and
Smart Home System), demonstrate that our LLM-based approach achieved fair to substantial agreement
with human analysts in deductive annotation tasks. Specifically, in both test cases, GPT-4 performed
better than the other LLMs, showing stronger agreement with human analysts. Contextual examples in
detailed prompts led to notable performance gains, especially during the shift from zero-shot to one-shot
scenarios. Providing rich context was key, as it produced much better results than using limited or no
context. Our findings demonstrate that LLMs can effectively support qualitative data annotation in
RE, offering faster and more consistent results. Additionally, the structured labels generated by these
models help create domain models, which are critical for systematic software design and development.
This not only reduces manual effort but also ensures greater consistency and accuracy, improving the
overall quality of software design.</p>
      <p>Our work is structured around the following key research questions:
• RQ1: To what extent does our LLM-based approach align with human analysts in both inductive
and deductive annotation tasks?
• RQ2: How do different prompt designs (zero-shot and few-shot) and lengths (short, medium,
long) affect the accuracy and reliability of the annotations generated by LLMs?
• RQ3: How consistent are the LLM-generated labels across multiple runs?
• RQ4: How do various contextual settings affect the effectiveness of our LLM-based annotation
approach?</p>
      <sec id="sec-1-1">
        <title>Overall, our contributions can be summarized as follows:</title>
        <p>• We conducted a comprehensive assessment of both open-source and proprietary LLMs to
determine their utility in supporting QDA within RE. Our study spans various models, including
GPT-4, Mistral, and LLaMA-2.
• We explored the effectiveness of different annotation strategies (inductive and deductive) across
various settings (zero-shot, one-shot, and few-shot). Our findings illustrate the impacts of these
strategies on the performance of LLMs, with deductive (few-shot) annotation achieving higher
agreement with human analysts. For instance, GPT-4 reached a Cohen’s Kappa score of up to
0.738, indicating substantial agreement.
• We investigated the influence of prompt length and contextual information on the performance
of LLMs. Detailed, context-rich prompts significantly enhanced the accuracy of LLMs. In the
few-shot setting, the precision and recall for GPT-4 were notably high, at 0.80 and 0.79, respectively,
demonstrating its effectiveness in closely mirroring human analytical processes.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>In this literature review, we explore two critical areas: the role of QDA in RE (Section 2.1) and the
application of LLMs in RE (Section 2.2) for QDA-assisted RE.</p>
      <sec id="sec-2-1">
        <title>2.1. Qualitative Data Analysis (QDA)-based RE</title>
        <p>QDA is a key technique in RE for analyzing unstructured stakeholder inputs, such as interviews
and surveys, to extract patterns and generate actionable insights [16]. Qualitative labeling is used to
identify domain concepts and latent requirements. These coded insights are then mapped to classes
or components in a domain model, ensuring that stakeholder needs are accurately reflected in the
system design [17]. While QDA improves traceability and accuracy in requirements specification,
traditional methods are labor-intensive, inconsistent, and prone to subjectivity [18, 19]. Tools like
Computer Assisted Qualitative Data Analysis Software (CAQDAS) aim to support the process but often
lack adaptability to dynamic RE environments [20]. Recent efforts like QDAcity-RE [20, 21] have shown
that QDA techniques help extract domain concepts from unstructured stakeholder interviews and
documentation. This approach uses manual qualitative coding to generate traceable domain models by
mapping labeled requirements to classes or components, ensuring consistency and traceability in the
design process. However, the repetitive and manual nature of these processes underscores the need for
automation to improve scalability and efficiency.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large Language Models (LLMs) in Requirements Engineering (RE)</title>
        <p>LLMs, such as GPT-4, Mistral, and LLaMA-2, have shown promise in automating RE tasks like
requirements classification, ambiguity detection, and documentation synthesis [22, 23]. Their adaptability
across zero-shot and few-shot scenarios enables efficient processing of unstructured data with minimal
training [14]. Recent studies have explored the application of LLMs in qualitative research within
software engineering [24]. For example, Alhoshan et al. [25] demonstrated the potential of LLMs for
requirements classification without task-specific training, while Kici et al. [26] showed the effectiveness
of transfer learning for RE tasks. Despite this progress, applying LLMs to QDA for RE remains
underexplored, presenting an opportunity to address the limitations of traditional QDA and to enhance
scalability and accuracy in RE processes.</p>
        <p>Although LLMs have been widely studied in RE and QDA independently, their integration for QDA
in RE is still new. Using LLMs for QDA can greatly improve efficiency and accuracy by automating
annotations and reducing errors from manual work; it can simplify the process, make it more reliable and
scalable, and better meet the changing demands of RE.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Qualitative Data Analysis (QDA)</title>
      <p>For our study, we focused on two specific test cases: a Library Management System and a Smart Home
system. The Library Management System test case involves managing resources like cataloging, user
management, loans, and digital resources. The Smart Home System test case focuses on automating
tasks such as security, energy control, and device management. While the two primary test cases were
sourced from the PURE dataset [27], we supplemented these with additional SRS and FRS documents
from the internet to ensure a comprehensive dataset. Following the extensive data collection, we
applied QDA to our test cases. Our primary goal was to convert the requirement statements from
these documents into actionable insights by assigning precise labels to distinct segments. These labels,
akin to UML classes, help structure the requirements, making them more comprehensible and aiding
their integration into the software development lifecycle. This structured approach ensures that the
requirements are clear, precise, and aligned with the overall goals of the software engineering process.
To maintain precision and reliability, we assigned two independent analysts (Analyst 1 and Analyst 2)
to review and label the same set of requirement documents. Both analysts have a software engineering
background, with Analyst 1 having 1.5 years and Analyst 2 having 8 months of experience working
with software requirements. First, both analysts labeled the requirement documents independently.
We then measured their agreement using Cohen’s Kappa, a statistical measure of inter-rater agreement
for qualitative (categorical) items; because it accounts for agreement occurring by chance, it is more
robust than simple percent agreement, and a score above 0.70 typically indicates a substantial level of
agreement between raters. After that, the analysts met to discuss and resolve any differences, creating
a unified set of labels. This iterative process combined their insights into a unified analytical framework.
The total time and effort spent by the analysts in this QDA process are summarized in Table 1. We
reached a substantial agreement of 0.80 for the Library Management System and 0.78 for the Smart
Home System. The Library Management System used labels such as ’Notification,’ ’Loan,’ ’Reservation,’
’Catalog,’ etc., while the Smart Home System included ’Sensor,’ ’Light,’ ’Thermostat,’ ’Device,’ etc.
These labels ensure stakeholder inputs are directly linked to corresponding elements in the domain
model.</p>
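      <p>As an illustration of this agreement measure, Cohen’s Kappa can be computed directly from two analysts’ label lists. The following minimal sketch uses only the Python standard library; the label lists are illustrative, not the study’s actual annotations:</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lbl] * freq_b.get(lbl, 0) for lbl in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels only (not the study's data):
analyst_1 = ["Loan", "Catalog", "Loan", "Notification", "Reservation", "Catalog"]
analyst_2 = ["Loan", "Catalog", "Loan", "Catalog", "Reservation", "Catalog"]
print(round(cohens_kappa(analyst_1, analyst_2), 2))  # → 0.76
```

      <p>Here the analysts agree on 5 of 6 items (observed agreement 0.833), while chance agreement from the marginals is 11/36, giving a Kappa of 0.76, i.e. substantial agreement.</p>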
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Overview</title>
        <p>Figure 1 outlines our approach to integrating LLMs into QDA for RE. We begin by taking requirement
statements (Section 3) as input. The requirements are subsequently formatted into structured prompts
optimized for inductive or deductive annotations (Section 4.2). Inductive prompts, used in zero-shot
learning, allow LLMs to identify patterns without predefined categories, while deductive prompts,
supporting one-shot and few-shot learning, include examples for consistency with defined categories.
LLMs (Section 4.3) process these prompts to generate structured labels (Section 4.4), which categorize
and interpret requirements, providing actionable insights for further development. This approach
simplifies the QDA process, reducing manual effort while leveraging LLM capabilities effectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompt Design</title>
        <p>We created clear and structured prompts to convert the collected requirements into a format that LLMs
can understand and label. Table 2 summarizes our prompt templates, while Table 3 provides details on
our context levels. Our design considers three independent factors:
1. Shot Type: This factor refers to the number of examples included in the prompt. In a zero-shot
prompt, no examples are provided, so the LLM relies entirely on its built-in knowledge. A one-shot
prompt includes one example to guide the model, while a few-shot prompt provides several examples to
clearly show the desired labeling.
2. Prompt Length: This factor measures how much instruction is given. A short prompt provides
minimal instructions, a medium prompt adds additional details, and a long prompt gives in-depth
guidance. For instance, a long prompt might explain specific aspects of QDA such as traceability,
stakeholder intent, and consistency.
3. Contextual vs. Non-Contextual: This aspect determines whether the prompt includes background
information. Non-contextual prompts provide only the requirement statement, while contextual prompts
offer system details to improve understanding. We define three levels: no context (requirement only),
some context (brief system description), and full context (comprehensive system details).</p>
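        <p>The three factors can be combined mechanically when assembling a prompt. The sketch below is a hypothetical prompt builder of our own devising; the instruction wording and the example requirements are placeholders, not the exact templates from Tables 2 and 3:</p>

```python
# Hypothetical few-shot example pool (placeholder requirements and labels).
EXAMPLES = [
    ("The system shall notify users when a reserved book becomes available.", "Notification"),
    ("Members can borrow up to five items at a time.", "Loan"),
]

# Placeholder instruction texts for the three prompt lengths.
INSTRUCTIONS = {
    "short": "Label the requirement with a single domain concept.",
    "medium": "Label the requirement with a single domain concept, as in qualitative coding.",
    "long": ("Label the requirement with a single domain concept, as in qualitative "
             "coding. Preserve traceability to stakeholder intent and keep labels consistent."),
}

def build_prompt(requirement, shots=0, length="short", context=""):
    parts = [INSTRUCTIONS[length]]
    if context:                            # contextual vs. non-contextual
        parts.append(f"System context: {context}")
    for text, label in EXAMPLES[:shots]:   # zero-, one-, or few-shot
        parts.append(f"Requirement: {text}\nLabel: {label}")
    parts.append(f"Requirement: {requirement}\nLabel:")
    return "\n\n".join(parts)

# One-shot, long, contextual prompt for a new requirement:
prompt = build_prompt("Users shall renew loans online.", shots=1,
                      length="long", context="A library management system.")
```

        <p>Varying only <code>shots</code>, <code>length</code>, and <code>context</code> reproduces the full experimental grid of prompt conditions.</p>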
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Selection</title>
        <p>We used state-of-the-art LLMs, including GPT-4 [8], Mistral [28], and LLaMA-2 [10], for their abilities in
understanding and generating natural language and suitability for the complex task of QDA in RE. We
prompt these models with specific software requirement data to understand the context of requirements,
recognize domain-specific terminology, and map requirement statements to relevant labels.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Output Labels</title>
        <p>Our approach focuses on generating labels that organize and interpret requirement statements,
converting unstructured data into clear and actionable insights. These labels are critical for understanding
stakeholder needs and ensuring that requirements align with their expectations [29]. By improving
communication among teams, the labels also play a key role in creating domain models, which are
essential for systematic software design [21]. To achieve accurate and relevant labels, we employ both
inductive and deductive strategies, supported by contextual prompts. This dual strategy improves the
precision and relevance of the labeling process. Additionally, these QDA-based annotations ensure
automatic traceability by linking each label back to its corresponding stakeholder input [20].</p>
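        <p>To make the traceability idea concrete, the following minimal sketch (with illustrative requirement IDs and labels of our own, not the study’s data) groups annotated requirements by label; each label becomes a candidate domain-model class whose attached IDs trace back to stakeholder inputs:</p>

```python
# Illustrative (requirement ID, statement, label) triples.
annotations = [
    ("REQ-01", "The system shall email overdue notices.", "Notification"),
    ("REQ-02", "Members can place holds on items.", "Reservation"),
    ("REQ-03", "Holds expire after seven days.", "Reservation"),
]

def to_domain_model(annotations):
    """Group annotated requirements by label. Each label becomes a candidate
    class; the attached requirement IDs give automatic traceability links."""
    model = {}
    for req_id, text, label in annotations:
        model.setdefault(label, []).append(req_id)
    return model

model = to_domain_model(annotations)
# Each candidate class traces back to its stakeholder inputs:
# {'Notification': ['REQ-01'], 'Reservation': ['REQ-02', 'REQ-03']}
```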
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metrics</title>
        <p>We assessed the performance of the LLMs using several key metrics to evaluate their accuracy and
agreement in annotation tasks. Inter-rater agreement was measured using Cohen’s Kappa, which
quantifies the level of agreement between the labels generated by the LLMs and those assigned by
human analysts, with higher values indicating stronger agreement. To evaluate the consistency of the
labels across multiple experimental runs, we analyzed the standard deviation (SD) and the Intraclass
Correlation Coefficient (ICC). A lower SD indicates minimal variability in the labels, while ICC values
above 0.85 demonstrate excellent reliability. In addition to reliability and consistency, we evaluated the
accuracy of the LLMs, which measures the proportion of correct labels out of all predictions. Precision
was used to determine how many of the labels identified by the LLMs were correct, providing insights
into their ability to avoid false positives. Recall assessed the ability of the LLMs to identify all
relevant labels, minimizing the risk of missing important instances (false negatives). Finally, we used the
F1-score, the harmonic mean of precision and recall, to provide a balanced measure of the performance
of the models, with higher scores indicating a good trade-off between precision and recall. In this study,
we used only the labels on which both analysts reached consensus as the ground truth for evaluating
LLM performance.</p>
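        <p>These classification metrics can be reproduced with a short standard-library sketch. One assumption to note: we macro-average precision and recall over labels and take the harmonic mean of the two averages for F1, which is one common convention; the paper does not state which averaging it uses:</p>

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 for multi-class
    labels. Undefined ratios (a label never predicted, or absent from the
    ground truth) count as 0, as is conventional."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for lbl in labels:
        tp = sum(t == lbl and p == lbl for t, p in zip(y_true, y_pred))
        predicted = sum(p == lbl for p in y_pred)
        actual = sum(t == lbl for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f1

# Toy example: one "Loan" requirement mislabeled as "Catalog".
acc, prec, rec, f1 = macro_scores(
    ["Loan", "Loan", "Catalog", "Sensor"],
    ["Loan", "Catalog", "Catalog", "Sensor"],
)
# acc = 0.75; prec, rec, and f1 are each about 0.833
```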
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Implementation</title>
        <p>We carried out all experiments using Python and PyTorch (https://pytorch.org/). For the Mistral and
LLaMA-2 models, we used the 7B configuration from Hugging Face’s Transformers library
(https://huggingface.co/transformers/), which provides access to pre-trained models, while for GPT-4 we
used the GPT-4 Turbo API (https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4). To ensure
fair comparisons, we set the temperature parameter to 0.0 across all models, which minimizes
randomness and makes outputs consistent. The experiments were conducted on high-performance
computing clusters equipped with NVIDIA A100 GPUs to handle the computational demands. The
source code for all experiments and evaluations is publicly available at
https://github.com/SyedTauhidUllahShah/LLM4QDARE.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. LLMs vs. Human Analysts (RQ1)</title>
        <p>To evaluate the effectiveness of LLMs in aiding QDA-based annotation tasks within RE, we compared
their performance against human analysts for both inductive and deductive settings. We used Cohen’s
Kappa, a widely recognized statistical measure for assessing inter-rater agreement, to quantify agreement
levels between LLM-generated labels and those derived by human analysts (described in detail in
Section 3). This measure highlights the reliability and consistency of the LLMs’ performance in replicating
human judgment, aligning with practices in qualitative research [30] and LLM-assisted content analysis
[31].</p>
        <p>Table 4 reports the Cohen’s Kappa results for various prompt designs (zero-shot, one-shot, few-shot)
and test cases (Library Management System and Smart Home System). Our empirical assessment across
various settings for both test cases yielded significant insights into the capabilities of LLMs. Notably,
GPT-4 consistently outperformed other models such as LLaMA-2 and Mistral, achieving the highest
Cohen’s Kappa scores. Specifically, in the few-shot setting, GPT-4 achieved scores of 0.738 and 0.734
for the Library Management System and the Smart Home System, respectively, indicating substantial
agreement with human analysts and highlighting its robustness in these settings.</p>
        <p>However, it is important to note that the agreement levels in the zero-shot setting were around 0.54,
which is not typically considered a strong outcome. This observation suggests that while LLMs can
approach the performance of human analysts in scenarios where some guidance (one-shot or few-shot)
is provided, their effectiveness in fully autonomous, inductive annotation tasks (zero-shot) remains
limited. This analysis highlights that, although LLMs show promise, particularly in deductive settings
where they can match or even exceed human performance, they still require refinement for inductive
tasks where no initial guidance is given. This detailed understanding addresses RQ1, indicating that
while LLMs hold significant potential to support human efforts in RE annotation processes, their current
application is more reliable in deductive annotation tasks than inductive ones.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Influence of Prompt Design on Annotation Outcomes (RQ2)</title>
        <p>To assess the impact of different prompt lengths, we executed a series of experiments across the two
distinct test cases. The results, summarized in Table 5, indicate that while long prompts generally
provide the best performance, medium prompts also offer a good balance of context and efficiency.
Short prompts, although less detail-intensive, often fall short in tasks requiring detailed contextual
understanding.</p>
        <p>This analysis directly addresses RQ2, demonstrating that careful prompt design is essential for
maximizing the effectiveness of LLMs in annotation tasks within RE. The finding is also consistent with the
broader literature [32, 33], which emphasizes that the detailed contextual information in long prompts
significantly enhances LLM performance by reducing ambiguity. Our findings highlight the potential
for optimizing LLM performance in practical applications by tailoring prompts to balance context and
efficiency.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Consistency Analysis of LLM-Generated Labels Across Multiple Runs (RQ3)</title>
        <p>The consistency analysis of LLM-generated labels across multiple runs, as shown in Table 6, revealed
that GPT-4 exhibited the highest consistency among the tested models. Specifically, GPT-4 achieved
the lowest standard deviations of 0.034 for the Library Management System and 0.037 for the Smart
Home System. Additionally, GPT-4 obtained the highest ICC values of 0.93 and 0.92 for the Library
Management System and Smart Home System, respectively. These results indicate a high degree of
reliability and stability in the generated labels, surpassing the performance of LLaMA-2 and Mistral,
which also demonstrated good consistency but with slightly higher variability.</p>
        <p>The high ICC values (&gt;0.85) across all models affirm that LLM-generated labels are consistently
reproducible within the same class, ensuring reliable outputs that closely align with the performance of
human analysts. These findings show that GPT-4 is a reliable tool for helping with QDA in RE, making
it easier to extract and organize insights from requirements data with less manual work.</p>
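        <p>Run-to-run consistency can be probed with a simple sketch like the one below, which computes the fraction of items labeled identically across runs and the SD of per-run agreement with a reference annotation. The run data are invented for illustration, and a full ICC computation would typically use a dedicated statistics package:</p>

```python
from statistics import pstdev

# Labels from three hypothetical runs over the same five requirements
# (illustrative data, not the study's outputs).
runs = [
    ["Loan", "Sensor", "Catalog", "Light", "Device"],
    ["Loan", "Sensor", "Catalog", "Light", "Device"],
    ["Loan", "Sensor", "Catalog", "Thermostat", "Device"],
]

# Fraction of items that received the identical label in every run.
stable = sum(len(set(labels)) == 1 for labels in zip(*runs)) / len(runs[0])

# Variability (population SD) of per-run agreement with a reference annotation.
reference = ["Loan", "Sensor", "Catalog", "Light", "Device"]
per_run_acc = [sum(a == b for a, b in zip(run, reference)) / len(reference)
               for run in runs]
sd = pstdev(per_run_acc)
# stable = 0.8; per_run_acc = [1.0, 1.0, 0.8]; sd ≈ 0.094
```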
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Impact of Contextual Backgrounds (RQ4)</title>
        <p>To address RQ4, we evaluated the impact of varying levels of contextual background on the effectiveness
of LLM-generated labels. Specifically, we adjusted the amount of context provided in the prompts,
ranging from no context to full context. The results, as shown in Table 7, demonstrated that the
inclusion of richer contextual information in the prompts significantly improved the performance of all
evaluated models, including LLaMA-2, Mistral, and GPT-4.</p>
        <p>Specifically, GPT-4 exhibited the highest Cohen’s Kappa scores across all scenarios, achieving scores
of 0.738 for the Library Management System and 0.734 for the Smart Home System in the full-context
setting. These findings indicate that GPT-4 is particularly effective at leveraging detailed contextual
information to generate accurate and consistent labels.</p>
        <p>The improvement in performance with increased context suggests that providing comprehensive
background information enables LLMs to better understand and interpret the requirements, resulting
in more precise annotation. This highlights the importance of designing context-rich prompts to
maximize the potential of LLMs for automating and refining QDA processes within RE. By incorporating
detailed contextual information, LLMs can deliver outputs that accurately reflect the complexities of
the requirements, thereby improving the accuracy and reliability of the annotation process.</p>
      </sec>
      <sec id="sec-5-7">
        <title>5.7. Performance Evaluation with Detailed Metrics</title>
        <p>To further validate our results, we incorporated additional evaluation metrics: accuracy, precision, recall,
and F1-score. The detailed performance evaluation, presented in Table 8, shows that GPT-4 consistently
outperforms LLaMA-2 and Mistral across all metrics. Specifically, GPT-4 achieves the highest accuracy,
precision, and recall in both zero-shot and few-shot settings for the Library Management and Smart
Home test cases. Although in the inductive scenario, the model is not provided with explicit examples,
it still outputs a single label per requirement that is evaluated against the ground truth. In the deductive
scenario, implemented as few-shot learning, the model is guided by explicit examples to generate labels.
In both cases, the task is treated as a multi-class classification problem. For instance, in the few-shot
setting for the Library Management test case, GPT-4 achieves an accuracy of 0.86, a precision of 0.80,
recall of 0.79, and an F1-score of 0.79, demonstrating its superior ability to correctly and consistently
categorize requirement statements.</p>
        <p>Similarly, in the Smart Home test case, GPT-4 again leads with an accuracy of 0.85, a precision of 0.79,
a recall of 0.78, and an F1-score of 0.785 in the few-shot setting. This analysis supports our earlier
findings from Cohen’s Kappa and ICC, showing that GPT-4 is reliable for automating QDA tasks in RE.
The higher precision and recall suggest that GPT-4 not only identifies the correct labels more often but
also misses fewer important instances, making the annotations more complete and accurate.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Threats to Validity</title>
      <p>In this section, we discuss the potential threats to the validity of our study on the application of LLMs
for QDA in RE.</p>
      <sec id="sec-6-1">
        <title>6.1. Internal Validity</title>
        <p>One challenge in this study is the potential bias in pre-trained LLMs such as GPT-4, Mistral, and
LLaMA-2. Since these models are trained on vast datasets, their outputs may reflect underlying biases
that could skew the annotation results and fail to fully capture the nuances of RE. To minimize this risk,
we carefully designed prompts with detailed context to guide the models toward more accurate and
relevant annotations. Another concern is the consistency of human annotations. Different analysts may
interpret and label the same requirement statements in slightly different ways, which could introduce
inconsistencies in the dataset used for evaluation. To address this, we used an inter-rater reliability
phase, where analysts reviewed their annotations together, resolving discrepancies to improve label
consistency. Prompt design also plays a crucial role in the accuracy of LLM-generated annotations.
Poorly structured or vague prompts can lead to unreliable results. To improve performance, we tested
prompts with different lengths and levels of contextual information, refining them through an iterative
process to ensure clarity and effectiveness.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. External Validity</title>
        <p>Our study evaluates LLM performance using two test cases, Library Management and Smart Home
systems, which may not fully capture the diversity of software systems in practice. Results could vary
when applied to different domains, particularly those with unique complexities or highly specialized
requirements. The dataset, while sourced from multiple documents, may not represent the full range
of real-world projects. A broader selection of requirement documents covering various industries
and project types would strengthen the evaluation and improve the generalizability of our findings.
Contextual information in prompts also plays a key role in guiding LLMs toward accurate annotations,
but our prompts may not fully capture every detail of different RE contexts. Ensuring clarity and
relevance across diverse scenarios remains a challenge. Further refinement, incorporating real-world
feedback, is needed to enhance the applicability of this approach.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>This paper explored the application of LLMs, specifically GPT-4, Mistral, and LLaMA-2, to aid and enhance
the annotation processes in RE. Our findings demonstrate that GPT-4, in particular, significantly reduces
the manual effort required for annotation, achieving high levels of accuracy and consistency comparable
to human analysts. The performance of these models is notably improved with detailed, context-rich
prompts, underscoring the importance of prompt design in leveraging LLM capabilities effectively. Our
work highlights that while GPT-4 and other LLMs show promise in deductive annotation tasks (one-shot
and few-shot settings), achieving substantial agreement with human analysts, their effectiveness
in inductive annotation tasks (zero-shot) remains limited. This calls for further development and
optimization of LLM strategies to enhance their performance across all types of annotation tasks. The
potential for broader adoption of LLMs in RE is clear, suggesting that these models can aid QDA,
increase efficiency, and reduce subjectivity. The structured labels generated by LLMs not only improve
the efficiency and reliability of the QDA process but also facilitate the creation of domain models,
simplifying the software design process and enhancing overall project efficiency. Future work should
focus on extending these results to more diverse scenarios and further refining the training processes to
address any inherent model biases. By doing so, the utility and reliability of LLMs in enhancing various
aspects of software development processes can be significantly expanded.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , J. M. Atlee, Research directions in requirements engineering,
          <source>Future of software engineering (FOSE'07)</source>
          (
          <year>2007</year>
          )
          <fpage>285</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Carrizo, O. Dieste, N. Juristo, Systematizing requirements elicitation technique selection, Information and Software Technology 56 (2014) 644–669.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Mucha, The QDAcity-RE-RS Method for Creating Complete, Consistent, and Traceable Requirements Specifications, Friedrich-Alexander-Universitaet Erlangen-Nuernberg (Germany), 2023.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Kaufmann, J. Krause, N. Harutyunyan, A. Barcomb, D. Riehle, A validation of QDAcity-RE for domain modeling using qualitative data analysis, Requirements Engineering (2021). URL: https://link.springer.com/article/10.1007/s00766-021-00360-6. doi:10.1007/s00766-021-00360-6.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Saldaña, The coding manual for qualitative researchers (2021).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. Treude, Qualitative data analysis in software engineering: Techniques and teaching insights, arXiv preprint arXiv:2406.08228 (2024).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] S. Tsang, An experiment exploring the theoretical and methodological challenges in developing a semi-automated approach to analysis of small-n qualitative data, arXiv preprint arXiv:2002.04513 (2020).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al., Gemini: a family of highly capable multimodal models, arXiv preprint arXiv:2312.11805 (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, G. Wang, Text classification via large language models, in: The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, Transactions of the Association for Computational Linguistics 12 (2024) 39–57.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. Zhang, B. Haddow, A. Birch, Prompting large language model for machine translation: A case study, in: International Conference on Machine Learning, PMLR, 2023, pp. 41092–41110.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Krishna, B. Gaur, A. Verma, P. Jalote, Using LLMs in software requirements specifications: An empirical evaluation, arXiv preprint arXiv:2404.17842 (2024).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Nuseibeh, S. Easterbrook, Requirements engineering: a roadmap, in: Proceedings of the Conference on the Future of Software Engineering, 2000, pp. 35–46.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Kaufmann, J. Krause, N. Harutyunyan, A. Barcomb, D. Riehle, A validation of QDAcity-RE for domain modeling using qualitative data analysis, Requirements Engineering 27 (2022) 31–51.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] N.-C. Chen, R. Kocielnik, M. Drouhard, V. Peña-Araya, J. Suh, K. Cen, X. Zheng, C. R. Aragon, Challenges of applying machine learning to qualitative coding, in: ACM SIGCHI Workshop on Human-Centered Machine Learning, 2016.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] B. Glaser, A. Strauss, Discovery of grounded theory: Strategies for qualitative research, Routledge, 2017.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] A. Kaufmann, D. Riehle, The QDAcity-RE method for structural domain modeling using qualitative data analysis, Requirements Engineering 24 (2019) 85–102.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Kaufmann, A. Barcomb, D. Riehle, Supporting interview analysis with autocoding, in: 53rd Hawaii International Conference on System Sciences, HICSS 2020, Maui, Hawaii, USA, January 7-10, 2020, ScholarSpace, 2020, pp. 1–10.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Vogelsang, J. Fischbach, Using large language models for natural language processing tasks in requirements engineering: A systematic guideline, arXiv e-prints (2024).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, arXiv preprint arXiv:2310.03533 (2023).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] M. Bano, R. Hoda, D. Zowghi, C. Treude, Large language models for qualitative research in software engineering: exploring opportunities and challenges, Automated Software Engineering 31 (2024) 8.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] W. Alhoshan, A. Ferrari, L. Zhao, Zero-shot learning for requirements classification: An exploratory study, Information and Software Technology 159 (2023) 107202.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] D. Kici, G. Malik, M. Cevik, D. Parikh, A. Basar, A BERT-based transfer learning approach to text classification on software requirements specifications, in: Canadian AI, 2021.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Ferrari, G. O. Spagnolo, S. Gnesi, PURE: a dataset of public requirements documents, in: 2017 IEEE 25th International Requirements Engineering Conference (RE), 2017, pp. 502–505.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] K. E. Wiegers, J. Beatty, Software requirements, Pearson Education, 2013.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] M. L. Coleman, M. Ragan, T. Dari, Intercoder reliability for use in qualitative research and evaluation, Measurement and Evaluation in Counseling and Development 57 (2024) 136–146.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] R. Chew, J. Bollenbacher, M. Wenger, J. Speer, A. Kim, LLM-assisted content analysis: Using large language models to support deductive coding, arXiv preprint arXiv:2306.14924 (2023).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] M. Turpin, J. Michael, E. Perez, S. Bowman, Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting, Advances in Neural Information Processing Systems 36 (2024).</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>