<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Barcelona, Catalunya, Spain, April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>a Preliminary Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Fantechi</string-name>
          <email>alessandro.fantechi@unifi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Gnesi</string-name>
          <email>stefania.gnesi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Semini</string-name>
          <email>laura.semini@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Firenze</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISTI-CNR, Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>17</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>With the rapid advances of AI-based tools, the question of whether to use such tools or conventional rule-based tools often arises in many application domains. In this paper, we address this question when considering the issue of ambiguity in requirements documents. For this purpose, we consider GPT-3, the third-generation Generative Pretrained Transformer language model developed by OpenAI, and we compare its ambiguity detection capability with that of a publicly available rule-based NLP tool on a few example requirements documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Ambiguity detection in requirements</kwd>
        <kwd>chatGPT</kwd>
        <kwd>rule-based NLP tools</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Since GPT-3 is among the largest language models constructed to date, we decided it was worth
evaluating its ability to analyze software requirements, and comparing its performance against
that of a traditional rule-based NLP tool.</p>
      <p>
        In this paper, we present a first step in this direction: we compared, on a few
example requirements documents, the ambiguity detection ability of chatGPT with that of a
publicly available rule-based NLP tool, QuARS, which we already used in previous work for
ambiguity and variability detection in requirements [
        <xref ref-type="bibr" rid="ref5">5, 6, 7</xref>
        ].
      </p>
      <p>The experiments described below aim at giving a first answer to the following research
questions: RQ1 Can chatGPT be used to detect ambiguities in requirements? RQ2 How does
the performance of chatGPT for ambiguity detection compare to that of a rule-based NLP tool?</p>
      <p>The scope of the experiments is limited to four requirements documents and to a single query
asked to chatGPT; however, since chatGPT returns different answers when the same question
is asked again, we have run each query a few times.</p>
      <p>Section 2 briefly introduces the issue of ambiguity detection in requirements, and the two
different detection approaches of the two tools. Section 3 describes the example requirements
documents used as a benchmark. The analysis of the data generated by the experiments in view
of the research questions is addressed in Section 4. Final sections on threats to validity, lessons
learned and conclusions follow.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Ambiguity detection</title>
      <p>Software requirements are normally expressed informally through natural language sentences,
which are potentially ambiguous, and this ambiguity is a known source of problems in the later
stages of software development. In the requirement engineering community, many tools have
been developed to help the analyst in detecting ambiguous requirements.</p>
      <sec id="sec-2-1">
        <title>2.1. Rule based NLP tools for ambiguity detection</title>
        <p>In the last decades, several tools (e.g. [8, 9, 10, 11, 12, 13]) have been developed that address the
automated analysis of requirements documents by means of Natural Language Processing (NLP)
techniques [14] with the purpose of detecting ambiguities in them. This kind of analysis is aimed at
identifying typical natural language defects, especially focusing on ambiguity sources. We list
in Table 1 the most common sources of ambiguity, with a classification inspired by [15, 16, 17].</p>
        <p>As a representative of these NLP tools, in this work we apply QuARS (Quality Analyzer
for Requirement Specifications), developed in our lab [18], which shows a good performance
when compared with similar tools [7]. QuARS performs an automatic linguistic analysis of a
requirements document in plain text format, according to the deterministic rules defined by a
given quality model. Its output indicates the defective requirements and highlights the words
that reveal the defect. The defect identification process includes lexical and syntactical analysis,
while semantic analysis is not supported.</p>
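        <p>As an illustration of this kind of lexical analysis, here is a minimal sketch of a QuARS-style detector in Python. The defect classes and indicator terms are illustrative assumptions inspired by Table 1, not the actual QuARS quality model or dictionaries.</p>

```python
# Sketch of a QuARS-style lexical defect detector (illustrative only):
# each defect class maps to a dictionary of indicator terms, and any
# requirement containing an indicator is flagged, together with the
# wording that triggered the defect.
INDICATORS = {
    "vagueness": {"suitable", "adequate", "possibly"},
    "optionality": {"or", "and/or", "any"},
    "quantifiers": {"always", "never", "all"},
}

def detect_defects(requirements):
    """Return (line number, indicator, defect class) for each match."""
    findings = []
    for lineno, text in enumerate(requirements, start=1):
        # crude tokenization: lowercase and strip basic punctuation
        words = text.lower().replace(",", " ").replace(".", " ").split()
        for defect_class, terms in INDICATORS.items():
            for term in terms:
                if term in words:
                    findings.append((lineno, term, defect_class))
    return findings
```

        <p>Such a detector is deterministic and explainable: like QuARS, it reports the exact wording that reveals the defect, but its evidence is purely lexical.</p>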
      </sec>
      <sec id="sec-2-2">
        <title>2.2. chatGPT for ambiguity detection</title>
        <p>As an AI large language model (LLM), chatGPT does not use rules, in the traditional sense, to
detect ambiguities. Instead, it uses training data and algorithms to generate an answer. LLMs
are such complex algorithms that it is arduous, if not infeasible, to know exactly how and why
the model returns a particular result (lack of explainability and transparency), and it is rare
to get the same answer twice (non-reproducibility). These are well-known issues that need to
be considered when switching from rule-based approaches to LLMs, particularly if there is a
need to guarantee a quality level of the requirements. The purpose of this work, however, is to
investigate whether chatGPT has reasonable performance in ambiguity detection compared
with rule-based tools, such that it could be a useful tool in software development, alone or
in combination with rule-based tools. To the best of our knowledge, there is no documentation
or literature so far on the ambiguity detection capabilities of chatGPT.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data preparation</title>
      <p>To perform our experiments we have used two simple requirements documents introduced in
previous papers, and two third-party requirements documents (all documents are available at
https://github.com/Vibe-NLP/RequirementsForValidation):</p>
      <p>Coffee machine, that gives a few requirements of an automatic coffee vending machine;</p>
      <p>E-shop, that describes a simple online shopping system;</p>
      <p>Library, that describes the requirements for the System Administration Module of an urban
library system;</p>
      <p>DigitalHome, that specifies the requirements for developing a domotic system.</p>
      <p>In Table 2 we summarise some characteristics of the considered documents. In Tables 3 and 4
we present the requirements of the coffee machine and E-shop, respectively.</p>
      <p>C1 After inserting a suitable coin, the user shall choose a beverage and select the amount of sugar.</p>
      <p>C2 The machine shall offer, as beverages, Coffee and Cappuccino or Tea.</p>
      <p>C3 The machine shall always offer coffee.</p>
      <p>C4 A ringtone possibly has to be played after beverage delivery.</p>
      <p>C5 After the beverage is taken, the machine returns idle.</p>
      <p>C6 The British market requires tea and excludes any ring tone.</p>
      <p>E1 The system shall enable the user to enter the search text on the screen.</p>
      <p>E2 The system shall display all the matching products based on the search.</p>
      <p>E3 The system possibly notifies with a pop-up the user when no matching product is found on the search.</p>
      <p>E4 The system shall allow a user to create his profile and set his credentials.</p>
      <p>E5 The system shall authenticate user credentials to enter the profile.</p>
      <p>E6 The system shall display the list of active orders and/or the list of completed orders in the customer profile.</p>
      <p>E7 The system shall maintain customer email information as a required part of customer profile.</p>
      <p>E8 The system shall send an order confirmation to the user through email.</p>
      <p>E9 The system shall allow an user to add and remove products in the shopping cart.</p>
      <p>E10 The system shall display various shipping methods.</p>
      <p>E11 The order shall be shipped to the client address or, if the shipping to store service is available, to an associated
store.</p>
      <p>E12 The system shall enable the user to select the shipping method.</p>
      <p>E13 The system may display the current tracking information about the order.</p>
      <p>E14 The system shall display the available payment methods.</p>
      <p>E15 The system shall allow the user to select the payment method for order.</p>
      <p>E16 After delivery, the system may enable the users to enter their reviews and ratings.</p>
      <p>E17 Shipping time should be as fast as possible.</p>
      <p>E18 The system must report the available products, if the availability of these are are less than 10 percent the
system should show a pop-up.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Collection and Analysis</title>
      <p>To address the RQs, including RQ2 that requires a comparison with a rule-based NLP tool, we
perform the following steps:</p>
      <p>Automatic detection: We apply both QuARS and chatGPT to each document. The document
is given as input to QuARS in text format, while chatGPT is queried by asking: "Find the
ambiguities of the following software requirements document: &lt;list of requirements in text
format&gt;".</p>
      <p>QuARS returns the requirements that are considered ambiguous, along with the term or
expression that is an indicator of ambiguity and the defect class to which it refers. chatGPT has
a less structured and more variable response format, but basically indicates which requirements
are ambiguous and why.</p>
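      <p>The querying and answer post-processing steps can be sketched as follows. The prompt is the query quoted above; the answer parser is only an illustrative assumption, since chatGPT's response format varies between sessions and is less structured than QuARS output.</p>

```python
import re

def build_query(requirements):
    """Assemble the single query used in the experiments."""
    return ("Find the ambiguities of the following software requirements "
            "document: " + "\n".join(requirements))

def parse_answer(answer):
    """Collect (requirement id, reason) pairs from an answer containing
    lines such as 'Ambiguity in E3: ...' or 'E3: ...'. Hypothetical
    pattern: real answers are less structured and need manual review."""
    findings = {}
    for line in answer.splitlines():
        m = re.match(r"\s*(?:Ambiguity in\s+)?([A-Z]\d+)\s*[:.-]\s*(.+)", line)
        if m:
            findings[m.group(1)] = m.group(2).strip()
    return findings
```

      <p>In the experiments the answers were reviewed manually; an automatic parser like this one would only be a first normalisation step.</p>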
      <p>Review: The output of the tools is reviewed by the authors in a joint meeting, and each detected defect
is classified as a true ambiguity or a false positive. The classification derived at this stage is the one used
for data analysis in the following step.</p>
      <p>Assessment: The analysis is both quantitative, in terms of performance metrics, and
qualitative, to understand in detail what kind of defects are identified or ignored by the two tools.</p>
      <p>For the quantitative analysis, we use the following metrics, where TP is the number of true
positives, FP the number of false positives and FN the number of false negatives:</p>
      <p>precision = TP / (TP + FP)    recall = TP / (TP + FN)</p>
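      <p>Computed over sets of requirement identifiers, these metrics amount to comparing the set flagged by a tool against the manually validated set of truly ambiguous requirements. A minimal sketch (the identifiers are illustrative):</p>

```python
def metrics(detected, ambiguous):
    """Precision and recall of a tool's detections.
    detected: set of requirement ids flagged by the tool;
    ambiguous: set of requirement ids validated as truly ambiguous."""
    tp = len(detected.intersection(ambiguous))  # true positives
    fp = len(detected.difference(ambiguous))    # false positives
    fn = len(ambiguous.difference(detected))    # false negatives
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if ambiguous else 0.0
    return precision, recall
```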
      <sec id="sec-4-1">
        <title>4.1. Coffee machine</title>
        <p>As an initial case study we consider the coffee machine, which is short enough to allow us to
fully report the output of QuARS and the response to a query to chatGPT. We are well aware
that considering only one response is statistically irrelevant, but this example is mainly for
illustrative purposes. Let us first report the output of QuARS:</p>
        <p>The line number: 1. after inserting a suitable coin, the user shall choose a beverage and select the
amount of sugar.
is defective because it contains the wording: suitable</p>
        <p>The line number: 2. the machine shall offer, as beverages, coffee and cappuccino or tea.
is defective because it contains the wording: or</p>
        <p>The line number: 3. the machine shall always offer coffee.
is defective because it contains the wording: always</p>
        <p>The line number: 4. a ringtone possibly has to be played after beverage delivery.
is defective because it contains the wording: possibly</p>
        <p>The line number: 6. the british market requires tea and excludes any ring tone.
is defective because it contains the wording: any</p>
        <p>We then asked chatGPT, obtaining the following answer:</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results analysis and comparison</title>
        <p>We have manually analysed the requirements according to the classification of ambiguity sources
in Table 1, and then assessed the tools' outcome. Results are summarised in Table 5 and
commented below; performance results are not significant on this toy example.
1. "suitable coin" in C1 is an ambiguity, detected by both tools;
2. "or" in C2 is an ambiguity detected by both tools. In the same requirement there is a
coordination ambiguity, undetected by both tools;
3. "always" in C3 is a false positive, detected as an ambiguity by QuARS. ChatGPT returns an
indication of a possible contradiction, which might exist, but is not an ambiguity;
4. the fact that the ring tone is possibly played, in C4, is an ambiguity, and it is detected by
both tools;
5. in C6 QuARS finds "any", which is a false positive, while chatGPT detects an
incompleteness that actually exists, but is not an ambiguity.</p>
      </sec>
      <sec id="sec-4-2b">
        <title>4.3. E-shop</title>
        <p>Our second experiment involved the E-shop example: we performed a manual analysis, an
analysis with QuARS, and queried chatGPT twice, on different days. For space reasons, we
do not report the whole outcomes, but only the found indicators and the kind of defect, in Table 6.
Performance values are in Table 7 and show that the performance of chatGPT can be highly
variable, which was expected, but also that it can be compared with that of a settled, rule-based
tool. It is interesting to note that chatGPT was able to detect a hidden ambiguity in E3 that
was not found by manual analysis (note also that we have been working on this case study
for some time and had never noticed the problem):</p>
        <p>Ambiguity in E3: It is not specified if the pop-up will be displayed when a single product
is not found or multiple products.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Smart Home</title>
        <p>Our third experiment involved the smart home example: we performed an analysis with QuARS,
and queried chatGPT as usual. We encountered a limitation of chatGPT: it does not accept
documents of this length as input and returns an error. To get around the problem, we divided
the document into two parts and had them analysed separately, then merged the results. Since
we are looking for sources of ambiguity, which do not depend on the joint analysis of multiple
requirements, we claim that this partition-based solution is acceptable. Each part has been
analysed twice, on different days.</p>
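        <p>The partition-based workaround can be sketched as follows. The character budget used here is an assumed stand-in for chatGPT's real, token-based input limit, and requirements are never split across parts.</p>

```python
def split_document(requirements, max_chars=4000):
    """Split a list of requirements into parts that fit an input budget,
    keeping each requirement whole. max_chars is an assumed proxy for
    the token limit."""
    parts, current, size = [], [], 0
    for req in requirements:
        if current and size + len(req) > max_chars:
            parts.append(current)   # flush the part that is full
            current, size = [], 0
        current.append(req)
        size += len(req) + 1        # +1 for the newline separator
    if current:
        parts.append(current)
    return parts
```

        <p>Each part can then be queried separately and the per-part findings merged, as done for the smart home document.</p>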
        <p>This document returned many false positives, both with QuARS and chatGPT, and chatGPT
has a very low recall (Table 7). For space reasons, we do not present the tools' output but only
the performance measures. With regard to qualitative analysis, chatGPT found the following
defect, not found by QuARS, which is worth noting since it reveals an incompleteness of the
QuARS dictionaries, which do not contain the term "compatible":</p>
        <p>Ambiguity in 2.2.5, as it states "The system shall be compatible with a centralized HVAC</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Library</title>
        <p>The last document considered is Library, which is slightly smaller in size than Smart Home. We
analysed the document with QuARS and then with chatGPT five times, on five different days. In the
table named GPT_QuARS_library in https://github.com/Vibe-NLP/RequirementsForValidation
we list all the defects found. The table is truncated because, all five times we queried chatGPT,
although it did not report length errors, it only found defects in the first 38 requirements. We
therefore decided to consider this document fragment for the performance measurements,
which are shown in Table 8. In the GPT_QuARS_library table, for each analysis, we show
each defect reported, labelling it directly as a false positive (fp) or a true positive (amb). In the
adjacent column we report: for QuARS, which indicators were considered false positives or
true ambiguities; for chatGPT, a fragment of the response, if significant.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.6. Threats to validity</title>
        <p>We have used precision and recall as metrics to compare the tools. The human intervention in
the review and assessment steps, returning the number of true/false positives and false negatives,
is a threat to construct validity, and the involvement of the authors in these phases is also a
threat to internal validity. With regard to external validity, we have presented a preliminary
study, and the quantitative comparison is limited to three case studies, to two compared tools,
to a single kind of query to chatGPT, and to a few chat sessions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>The findings from the experiments allow us to give an answer, albeit preliminary, to the RQs:
RQ1 chatGPT can be used to detect ambiguities in requirements by simply asking: "Find the
ambiguities of the following software requirements document: &lt;list of requirements in text
format&gt;". We note that chatGPT does not process long requirements documents: either
it returns an error or it provides a partial answer. Since ambiguity detection does not
depend on processing the document as a whole, it is possible to break the requirements
document into simpler parts and analyze the pieces separately.</p>
      <p>RQ2 ChatGPT's performance results vary between chat sessions with the bot, especially recall;
precision, on the other hand, is more stable and comparable to that of a rule-based NLP
tool. Running several sessions with the same question improves recall. For example,
when taking the union of the 5 responses obtained from the chatbot for the Library case study,
we get the following performance: precision = 0.51 (28/55), recall = 0.55 (12/22).</p>
      <p>Validity threats can be mitigated in future work by involving third-party reviewers and
measuring the level of agreement between them, and by increasing the number of documents and
querying chatGPT with different queries.</p>
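      <p>The union of several sessions mentioned in RQ2 amounts to merging the sets of requirements flagged in each session and recomputing the metrics on the merged set. A small sketch, with illustrative identifiers:</p>

```python
def union_metrics(sessions, ambiguous):
    """Merge the detections of several chat sessions and recompute
    precision and recall against the validated ambiguous set.
    sessions: list of sets of flagged requirement ids."""
    merged = set().union(*sessions) if sessions else set()
    tp = len(merged.intersection(ambiguous))
    precision = tp / len(merged) if merged else 0.0
    recall = tp / len(ambiguous) if ambiguous else 0.0
    return merged, precision, recall
```

      <p>Since every session can only add detections, recall never decreases under the union, while precision may drop as false positives accumulate.</p>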
      <p>Future work can further develop the analysis presented here along several dimensions:
• Assess the coverage by the GPT-3 language model of the technical slang used in requirements;
• Exploit ChatGPT's ability to rationalise and explain ambiguity;
• Ask ChatGPT more focused questions, addressing the various classes of ambiguity
separately;
• Develop the analysis with additional documents and evaluate the hypothesis that slicing
a requirements document for chatGPT does not influence its results;
• We have seen that chatGPT is able to detect defects, such as incompleteness and
inconsistency, that traditional NLP tools cannot identify, or can identify only with difficulty and
after domain-focused training. A future study may be devoted to specifically measuring
the performance of chatGPT in finding these classes of defects in requirements. Positive
results in this respect could lead to the use of chatGPT to complement a rule-based tool
to automatically detect these important quality criteria.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research has been partially supported by the MIUR, Italy project PRIN 2017 FTXR7S
"ITMaTTerS" (Methods and Tools for Trustworthy Smart Systems).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[6] A. Fantechi, S. Gnesi, L. Semini, VIBE: looking for variability in ambiguous requirements, J. Syst. Softw. 195 (2023).</p>
      <p>[7] M. Arrabito, A. Fantechi, S. Gnesi, L. Semini, An experience with the application of three NLP tools for the analysis of natural language requirements, in: Proc. of Quality of Information and Communications Technology - 13th Int. Conference, QUATIC, volume 1266 of Communications in Computer and Information Science, Springer, 2020, pp. 488-498.</p>
      <p>[8] J. Kasser, TIGER-PRO. URL: www.therightrequirement.com.</p>
      <p>[9] V. Ambriola, V. Gervasi, Processing natural language requirements, in: Int. Conference on Automated Software Engineering, ASE, Nov. 2-5, IEEE Computer Society, 1997, pp. 36-45.</p>
      <p>[10] O. Kenney, M. Cooper, Automating requirement quality standards with QVscribe, in: NLP4RE'20, co-located with the 26th Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2584 of CEUR Workshop Proc., CEUR-WS.org, 2020.</p>
      <p>[11] H. Femmer, Requirements quality defect detection with the Qualicen requirements scout, in: NLP4RE'18, co-located with the 23rd Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2075 of CEUR Workshop Proceedings, 2018.</p>
      <p>[12] S. F. Tjong, D. M. Berry, The design of SREE - a prototype potential ambiguity finder for requirements specifications and lessons learned, in: International Working Conference on Requirements Engineering: Foundation for Software Quality, volume 7830 of LNCS, Springer, Essen, Germany, 2013, pp. 80-95.</p>
      <p>[13] The Reuse Company, RAT. URL: www.reusecompany.com/rat-authoring-tools.</p>
      <p>[14] G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003) 51-89.</p>
      <p>[15] D. Berry, E. Kamsties, M. Krieger, From contract drafting to software specification: Linguistic sources of ambiguity - a handbook, version 1.0 (2003).</p>
      <p>[16] V. Gervasi, A. Ferrari, D. Zowghi, P. Spoletini, Ambiguity in requirements engineering: Towards a unifying framework, in: From Software Engineering to Formal Methods and Tools, and Back - Essays Dedicated to Stefania Gnesi on the Occasion of Her 65th Birthday, volume 11865 of Lecture Notes in Computer Science, Springer, 2019, pp. 191-210.</p>
      <p>[17] INCOSE, Guide for Writing Requirements, TechGuideWR2019Soft V3, 2019.</p>
      <p>[18] S. Gnesi, G. Lami, G. Trentanni, An automatic tool for the analysis of natural language requirements, Computer Systems: Science &amp; Engineering 20 (2005).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. B. Brown, B. Mann, et al., Language models are few-shot learners, in: 33rd Annual Conference on Neural Information Processing Systems, Dec. 6-12, 2020.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. Dale, GPT-3: what's it good for?, Nat. Lang. Eng. 27 (2021) 113-118.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] IBM, Engineering Requirements Quality Assistant (RQA). URL: www.ibm.com/products/requirements-quality-assistant.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Fantechi, S. Gnesi, S. Livi, L. Semini, A spaCy-based tool for extracting variability from NL requirements, in: M. Mousavi, P. Schobbens (Eds.), SPLC '21: 25th ACM Int. Systems and Software Product Line Conference, Leicester, UK, Sept. 6-11, ACM, 2021, pp. 32-35.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Fantechi, A. Ferrari, S. Gnesi, L. Semini, Requirement Engineering of Software Product Lines: Extracting Variability Using NLP, in: 26th IEEE International Requirements Engineering Conference, RE 2018, Banff, Canada, August 20-24, 2018, IEEE, 2018, pp. 418-423.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>