<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Barcelona, Catalunya, Spain, April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>a Preliminary Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Fantechi</string-name>
          <email>alessandro.fantechi@unifi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Gnesi</string-name>
          <email>stefania.gnesi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Semini</string-name>
          <email>laura.semini@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Firenze</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISTI-CNR, Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>17</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>With the rapid advances of AI-based tools, the question of whether to use such tools or conventional rule-based tools often arises in many application domains. In this paper, we address this question when considering the issue of ambiguity in requirements documents. For this purpose, we consider GPT-3, the third-generation Generative Pretrained Transformer language model developed by OpenAI, and we compare its ambiguity detection capability with that of a publicly available rule-based NLP tool on a few example requirements documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Ambiguity detection in requirements</kwd>
        <kwd>chatGPT</kwd>
        <kwd>rule-based NLP tools</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Since GPT-3 is among the largest language models constructed to date, we decided it was worth
evaluating its ability to analyze software requirements, and comparing its performance against
that of a traditional rule-based NLP tool.</p>
      <p>
        In this paper, we present a first step in this direction: we compared, on a few
example requirements documents, the ambiguity detection ability of chatGPT with that of a
publicly available rule-based NLP tool, QuARS, which we already used in previous work for
ambiguity and variability detection in requirements [
        <xref ref-type="bibr" rid="ref5">5, 6, 7</xref>
        ].
      </p>
      <p>The experiments described below aim at giving a first answer to the following research
questions: RQ1 Can chatGPT be used to detect ambiguities in requirements? RQ2 How does
the performance of chatGPT for ambiguity detection compare to that of a rule-based NLP tool?</p>
      <p>The scope of the experiments is limited to four requirements documents and to a single query
asked to chatGPT; however, since chatGPT returns different answers when the same question
is asked again, we have run each query a few times.</p>
      <p>Section 2 briefly introduces the issue of ambiguity detection in requirements, and the two
different detection approaches of the two tools. Section 3 describes the example requirements
documents used as a benchmark. The analysis of the data generated by the experiments in view
of the research questions is addressed in Section 4. Final sections on threats to validity, lessons
learned and conclusions follow.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Ambiguity detection</title>
      <p>Software requirements are normally expressed informally through natural language sentences,
which are potentially ambiguous, and this ambiguity is a known source of problems in the later
stages of software development. In the requirement engineering community, many tools have
been developed to help the analyst in detecting ambiguous requirements.</p>
      <sec id="sec-2-1">
        <title>2.1. Rule based NLP tools for ambiguity detection</title>
        <p>In the last decades, several tools (e.g. [8, 9, 10, 11, 12, 13]) have been developed that address the
automated analysis of requirements documents by means of Natural Language Processing (NLP)
techniques [14] with the purpose of detecting ambiguities in them. This kind of analysis is aimed at
identifying typical natural language defects, especially focusing on ambiguity sources. We list
in Table 1 the most common sources of ambiguity, with a classification inspired by [15, 16, 17].</p>
        <p>As a representative of these NLP tools, in this work we apply QuARS (Quality Analyzer
for Requirement Specifications), developed in our lab [18], which shows a good performance
when compared with similar tools [7]. QuARS performs an automatic linguistic analysis of a
requirements document in plain text format, according to the deterministic rules defined by a
given quality model. Its output indicates the defective requirements and highlights the words
that reveal the defect. The defect identification process includes lexical and syntactical analysis,
while semantic analysis is not supported.</p>
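        <p>As an illustration of this kind of lexical analysis, here is a minimal sketch of a QuARS-style detector in Python. The defect classes and indicator terms are illustrative assumptions inspired by Table 1, not the actual QuARS quality model or dictionaries.</p>

```python
# Sketch of a QuARS-style lexical defect detector (illustrative only):
# each defect class maps to a dictionary of indicator terms, and any
# requirement containing an indicator is flagged, together with the
# wording that triggered the defect.
INDICATORS = {
    "vagueness": {"suitable", "adequate", "possibly"},
    "optionality": {"or", "and/or", "any"},
    "quantifiers": {"always", "never", "all"},
}

def detect_defects(requirements):
    """Return (line number, indicator, defect class) for each match."""
    findings = []
    for lineno, text in enumerate(requirements, start=1):
        # crude tokenization: lowercase and strip basic punctuation
        words = text.lower().replace(",", " ").replace(".", " ").split()
        for defect_class, terms in INDICATORS.items():
            for term in terms:
                if term in words:
                    findings.append((lineno, term, defect_class))
    return findings
```

        <p>Such a detector is deterministic and explainable: like QuARS, it reports the exact wording that reveals the defect, but its evidence is purely lexical.</p>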
      </sec>
      <sec id="sec-2-2">
        <title>2.2. chatGPT for ambiguity detection</title>
        <p>As an AI large language model (LLM), chatGPT does not use rules, in the traditional sense, to
detect ambiguities. Instead, it uses training data and algorithms to generate an answer. LLMs
are such complex algorithms that it is arduous, if not infeasible, to know exactly how and why
the model returns a particular result (lack of explainability and transparency), and it is rare
to get the same answer twice (non-reproducibility). These are well-known issues that need to
be considered when switching from rule-based approaches to LLMs, particularly if there is a
need to guarantee a quality level of the requirements. The purpose of this work, however, is to
investigate whether chatGPT has reasonable performance in ambiguity detection compared
with rule-based tools, such that it could be a useful tool in software development, alone or
in combination with rule-based tools. To the best of our knowledge, there is no documentation
or literature so far on the ambiguity detection capabilities of chatGPT.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data preparation</title>
      <p>To perform our experiments we have used two simple requirements documents introduced in
previous papers, and two third-party requirements documents (all documents are available at
https://github.com/Vibe-NLP/RequirementsForValidation):</p>
      <p>Coffee machine, that gives a few requirements of an automatic coffee vending machine;</p>
      <p>E-shop, that describes a simple online shopping system;</p>
      <p>Library, that describes the requirements for the System Administration Module of an urban
library system;</p>
      <p>DigitalHome, that specifies the requirements for developing a domotic system.</p>
      <p>In Table 2 we summarise some characteristics of the considered documents. In Tables 3 and 4
we present the requirements of the coffee machine and E-shop, respectively.</p>
      <p>C1 After inserting a suitable coin, the user shall choose a beverage and select the amount of sugar.</p>
      <p>C2 The machine shall offer, as beverages, Coffee and Cappuccino or Tea.</p>
      <p>C3 The machine shall always offer coffee.</p>
      <p>C4 A ringtone possibly has to be played after beverage delivery.</p>
      <p>C5 After the beverage is taken, the machine returns idle.</p>
      <p>C6 The British market requires tea and excludes any ring tone.</p>
      <p>E1 The system shall enable the user to enter the search text on the screen.</p>
      <p>E2 The system shall display all the matching products based on the search.</p>
      <p>E3 The system possibly notifies with a pop-up the user when no matching product is found on the search.</p>
      <p>E4 The system shall allow a user to create his profile and set his credentials.</p>
      <p>E5 The system shall authenticate user credentials to enter the profile.</p>
      <p>E6 The system shall display the list of active orders and/or the list of completed orders in the customer profile.</p>
      <p>E7 The system shall maintain customer email information as a required part of customer profile.</p>
      <p>E8 The system shall send an order confirmation to the user through email.</p>
      <p>E9 The system shall allow an user to add and remove products in the shopping cart.</p>
      <p>E10 The system shall display various shipping methods.</p>
      <p>E11 The order shall be shipped to the client address or, if the shipping to store service is available, to an associated
store.</p>
      <p>E12 The system shall enable the user to select the shipping method.</p>
      <p>E13 The system may display the current tracking information about the order.</p>
      <p>E14 The system shall display the available payment methods.</p>
      <p>E15 The system shall allow the user to select the payment method for order.</p>
      <p>E16 After delivery, the system may enable the users to enter their reviews and ratings.</p>
      <p>E17 Shipping time should be as fast as possible.</p>
      <p>E18 The system must report the available products, if the availability of these are are less than 10 percent the
system should show a pop-up.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Collection and Analysis</title>
      <p>To address the RQs, including RQ2 that requires a comparison with a rule-based NLP tool, we
perform the following steps:</p>
      <p>Automatic detection: We apply both QuARS and chatGPT to each document. The document
is given as input to QuARS in text format, while chatGPT is queried by asking: "Find the
ambiguities of the following software requirements document: &lt;list of requirements in text
format&gt;".</p>
      <p>QuARS returns the requirements that are considered ambiguous, along with the term or
expression that is an indicator of ambiguity and the defect class to which it refers. chatGPT has
a less structured and more variable response format, but basically indicates which requirements
are ambiguous and why.</p>
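      <p>The querying and answer post-processing steps can be sketched as follows. The prompt is the query quoted above; the answer parser is only an illustrative assumption, since chatGPT's response format varies between sessions and is less structured than QuARS output.</p>

```python
import re

def build_query(requirements):
    """Assemble the single query used in the experiments."""
    return ("Find the ambiguities of the following software requirements "
            "document: " + "\n".join(requirements))

def parse_answer(answer):
    """Collect (requirement id, reason) pairs from an answer containing
    lines such as 'Ambiguity in E3: ...' or 'E3: ...'. Hypothetical
    pattern: real answers are less structured and need manual review."""
    findings = {}
    for line in answer.splitlines():
        m = re.match(r"\s*(?:Ambiguity in\s+)?([A-Z]\d+)\s*[:.-]\s*(.+)", line)
        if m:
            findings[m.group(1)] = m.group(2).strip()
    return findings
```

      <p>In the experiments the answers were reviewed manually; an automatic parser like this one would only be a first normalisation step.</p>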
      <p>Review: The output of the tools is reviewed by the authors in a joint meeting, and each detected defect
is classified as a true ambiguity or a false positive. The classification derived at this stage is the one used
for data analysis in the following step.</p>
      <p>Assessment: The analysis is both quantitative, in terms of performance metrics, and
qualitative, to understand in detail what kind of defects are identified or ignored by the two tools.</p>
      <p>For the quantitative analysis, we use the following metrics, where TP is the number of true
positives, FP the number of false positives and FN the number of false negatives:</p>
      <p>precision = TP / (TP + FP)    recall = TP / (TP + FN)</p>
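      <p>Computed over sets of requirement identifiers, these metrics amount to comparing the set flagged by a tool against the manually validated set of truly ambiguous requirements. A minimal sketch (the identifiers are illustrative):</p>

```python
def metrics(detected, ambiguous):
    """Precision and recall of a tool's detections.
    detected: set of requirement ids flagged by the tool;
    ambiguous: set of requirement ids validated as truly ambiguous."""
    tp = len(detected.intersection(ambiguous))  # true positives
    fp = len(detected.difference(ambiguous))    # false positives
    fn = len(ambiguous.difference(detected))    # false negatives
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if ambiguous else 0.0
    return precision, recall
```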
      <sec id="sec-4-1">
        <title>4.1. Coffee machine</title>
        <p>As an initial case study we consider the coffee machine, which is short enough to allow us to
fully report the output of QuARS and the response to a query to chatGPT. We are well aware
that considering only one response is statistically irrelevant, but this example is mainly for
illustrative purposes. Let us first report the output of QuARS:</p>
        <p>The line number: 1. after inserting a suitable coin, the user shall choose a beverage and select the
amount of sugar.
is defective because it contains the wording: suitable</p>
        <p>The line number: 2. the machine shall offer, as beverages, coffee and cappuccino or tea.
is defective because it contains the wording: or</p>
        <p>The line number: 3. the machine shall always offer coffee.
is defective because it contains the wording: always</p>
        <p>The line number: 4. a ringtone possibly has to be played after beverage delivery.
is defective because it contains the wording: possibly</p>
        <p>The line number: 6. the british market requires tea and excludes any ring tone.
is defective because it contains the wording: any</p>
        <p>We then asked chatGPT, obtaining the following answer:</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results analysis and comparison</title>
        <p>We have manually analysed the requirements according to the classification of ambiguity sources
in Table 1, and then assessed the tools' outcome. Results are summarised in Table 5 and
commented below; performance results are not significant on this toy example.
1. "suitable coin" in C1 is an ambiguity, detected by both tools;
2. "or" in C2 is an ambiguity detected by both tools. In the same requirement there is a
coordination ambiguity, undetected by both tools;
3. "always" in C3 is a false positive, detected as an ambiguity by QuARS. ChatGPT returns an
indication of a possible contradiction, which might exist, but is not an ambiguity;
4. the fact that the ring tone is possibly played, in C4, is an ambiguity, and it is detected by
both tools;
5. in C6 QuARS finds "any", which is a false positive, while chatGPT detects an
incompleteness that actually exists, but is not an ambiguity.</p>
      </sec>
      <sec id="sec-4-2b">
        <title>4.3. E-shop</title>
        <p>Our second experiment involved the E-shop example: we performed a manual analysis, an
analysis with QuARS, and queried chatGPT twice, on different days. For space reasons, we
do not report the whole outcomes, but only the found indicators and the kind of defect, in Table 6.
Performance values are in Table 7 and show that the performance of chatGPT can be highly
variable, which was expected, but also that it can be compared with that of a settled, rule-based
tool. It is interesting to note that chatGPT was able to detect a hidden ambiguity in E3 that
was not found by manual analysis (note also that we have been working on this case study
for some time and had never noticed the problem):</p>
        <p>Ambiguity in E3: It is not specified if the pop-up will be displayed when a single product
is not found or multiple products.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Smart Home</title>
        <p>Our third experiment involved the smart home example: we performed an analysis with QuARS,
and queried chatGPT as usual. We encountered a limitation of chatGPT: it does not accept
documents of this length as input and returns an error. To get around the problem, we divided
the document into two parts and had them analysed separately, then merged the results. Since
we are looking for sources of ambiguity, which do not depend on the joint analysis of multiple
requirements, we claim that this partition-based solution is acceptable. Each part has been
analysed twice, on different days.</p>
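        <p>The partition-based workaround can be sketched as follows. The character budget used here is an assumed stand-in for chatGPT's real, token-based input limit, and requirements are never split across parts.</p>

```python
def split_document(requirements, max_chars=4000):
    """Split a list of requirements into parts that fit an input budget,
    keeping each requirement whole. max_chars is an assumed proxy for
    the token limit."""
    parts, current, size = [], [], 0
    for req in requirements:
        if current and size + len(req) > max_chars:
            parts.append(current)   # flush the part that is full
            current, size = [], 0
        current.append(req)
        size += len(req) + 1        # +1 for the newline separator
    if current:
        parts.append(current)
    return parts
```

        <p>Each part can then be queried separately and the per-part findings merged, as done for the smart home document.</p>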
        <p>This document returned many false positives, both with QuARS and chatGPT, and chatGPT
has a very low recall (Table 7). For space reasons, we do not present the tools' output but only
the performance measures. With regard to qualitative analysis, chatGPT found the following
defect, not found by QuARS, which is worth noting since it reveals an incompleteness of the
QuARS dictionaries, which do not contain the term "compatible":</p>
        <p>Ambiguity in 2.2.5, as it states "The system shall be compatible with a centralized HVAC</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Library</title>
        <p>The last document considered is Library, which is slightly smaller in size than Smart Home. We
analysed the document with QuARS and then with chatGPT five times, on five different days. In the
table named GPT_QuARS_library in https://github.com/Vibe-NLP/RequirementsForValidation
we list all the defects found. The table is truncated because, all five times we queried chatGPT,
although it did not report length errors, it only found defects in the first 38 requirements. We
therefore decided to consider this document fragment for the performance measurements,
which are shown in Table 8. In the GPT_QuARS_library table, for each analysis, we show
each defect reported, labelling it directly as a false positive (fp) or a true positive (amb). In the
adjacent column we report: for QuARS, which indicators were considered false positives or
true ambiguities; for chatGPT, a fragment of the response, if significant.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.6. Threats to validity</title>
        <p>We have used precision and recall as metrics to compare the tools. The human intervention in
the review and assessment steps, returning the number of true/false positives and false negatives,
is a threat to construct validity, and the involvement of the authors in these phases is also a
threat to internal validity. With regard to external validity, we have presented a preliminary
study, and the quantitative comparison is limited to three case studies, to two compared tools,
to a single kind of query to chatGPT, and to a few chat sessions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>The findings from the experiments allow us to give an answer, albeit preliminary, to the RQs:
RQ1 chatGPT can be used to detect ambiguities in requirements by simply asking: "Find the
ambiguities of the following software requirements document: &lt;list of requirements in text
format&gt;". We note that chatGPT does not process long requirements documents: either
it returns an error or it provides a partial answer. Since ambiguity detection does not
depend on processing the document as a whole, it is possible to break the requirements
document into simpler parts and analyze the pieces separately.</p>
      <p>RQ2 ChatGPT's performance results vary between chat sessions with the bot, especially recall;
precision, on the other hand, is more stable and comparable to that of a rule-based NLP
tool. Running several sessions with the same question improves recall. For example,
when taking the union of the 5 responses obtained from the chatbot for the Library case study,
we get the following performance: precision = 0.51 (28/55), recall = 0.55 (12/22).</p>
      <p>Validity threats can be mitigated in future work by involving third-party reviewers and
measuring the level of agreement between them, and by increasing the number of documents and
querying chatGPT with different queries.</p>
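      <p>The union of several sessions mentioned in RQ2 amounts to merging the sets of requirements flagged in each session and recomputing the metrics on the merged set. A small sketch, with illustrative identifiers:</p>

```python
def union_metrics(sessions, ambiguous):
    """Merge the detections of several chat sessions and recompute
    precision and recall against the validated ambiguous set.
    sessions: list of sets of flagged requirement ids."""
    merged = set().union(*sessions) if sessions else set()
    tp = len(merged.intersection(ambiguous))
    precision = tp / len(merged) if merged else 0.0
    recall = tp / len(ambiguous) if ambiguous else 0.0
    return merged, precision, recall
```

      <p>Since every session can only add detections, recall never decreases under the union, while precision may drop as false positives accumulate.</p>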
      <p>Future work can further develop the analysis presented here along several dimensions:
• Assess the coverage by the GPT-3 language model of the technical slang used in requirements;
• Exploit ChatGPT's ability to rationalise and explain ambiguity;
• Ask ChatGPT more focused questions, addressing the various classes of ambiguity
separately;
• Develop the analysis with additional documents and evaluate the hypothesis that slicing
a requirements document for chatGPT does not influence its results;
• We have seen that chatGPT is able to detect defects, such as incompleteness and
inconsistency, that traditional NLP tools cannot identify, or can identify only with difficulty and
after domain-focused training. A future study may be devoted to specifically measuring
the performance of chatGPT in finding these classes of defects in requirements. Positive
results in this respect could lead to the use of chatGPT to complement a rule-based tool
to automatically detect these important quality criteria.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research has been partially supported by the MIUR, Italy project PRIN 2017 FTXR7S
"ITMaTTerS" (Methods and Tools for Trustworthy Smart Systems).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[6] A. Fantechi, S. Gnesi, L. Semini, VIBE: looking for variability in ambiguous requirements, J. Syst. Softw. 195 (2023).</p>
      <p>[7] M. Arrabito, A. Fantechi, S. Gnesi, L. Semini, An experience with the application of three NLP tools for the analysis of natural language requirements, in: Proc. of Quality of Information and Communications Technology - 13th Int. Conference, QUATIC, volume 1266 of Communications in Computer and Information Science, Springer, 2020, pp. 488-498.</p>
      <p>[8] J. Kasser, TIGER-PRO. URL: www.therightrequirement.com.</p>
      <p>[9] V. Ambriola, V. Gervasi, Processing natural language requirements, in: Int. Conference on Automated Software Engineering, ASE, Nov. 2-5, IEEE Computer Society, 1997, pp. 36-45.</p>
      <p>[10] O. Kenney, M. Cooper, Automating requirement quality standards with QVscribe, in: NLP4RE'20, co-located with the 26th Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2584 of CEUR Workshop Proc., CEUR-WS.org, 2020.</p>
      <p>[11] H. Femmer, Requirements quality defect detection with the Qualicen requirements scout, in: NLP4RE'18, co-located with the 23rd Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2075 of CEUR Workshop Proceedings, 2018.</p>
      <p>[12] S. F. Tjong, D. M. Berry, The design of SREE - a prototype potential ambiguity finder for requirements specifications and lessons learned, in: International Working Conference on Requirements Engineering: Foundation for Software Quality, volume 7830 of LNCS, Springer, Essen, Germany, 2013, pp. 80-95.</p>
      <p>[13] The Reuse Company, RAT. URL: www.reusecompany.com/rat-authoring-tools.</p>
      <p>[14] G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003) 51-89.</p>
      <p>[15] D. Berry, E. Kamsties, M. Krieger, From contract drafting to software specification: Linguistic sources of ambiguity - a handbook, version 1.0 (2003).</p>
      <p>[16] V. Gervasi, A. Ferrari, D. Zowghi, P. Spoletini, Ambiguity in requirements engineering: Towards a unifying framework, in: From Software Engineering to Formal Methods and Tools, and Back - Essays Dedicated to Stefania Gnesi on the Occasion of Her 65th Birthday, volume 11865 of Lecture Notes in Computer Science, Springer, 2019, pp. 191-210.</p>
      <p>[17] INCOSE, Guide for Writing Requirements, TechGuideWR2019Soft V3, 2019.</p>
      <p>[18] S. Gnesi, G. Lami, G. Trentanni, An automatic tool for the analysis of natural language requirements, Computer Systems: Science &amp; Engineering 20 (2005).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. B. Brown, B. Mann, et al., Language models are few-shot learners, in: 33rd Annual Conference on Neural Information Processing Systems, Dec. 6-12, 2020.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. Dale, GPT-3: what's it good for?, Nat. Lang. Eng. 27 (2021) 113-118.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] IBM, Engineering Requirements Quality Assistant (RQA). URL: www.ibm.com/products/requirements-quality-assistant.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Fantechi, S. Gnesi, S. Livi, L. Semini, A spaCy-based tool for extracting variability from NL requirements, in: M. Mousavi, P. Schobbens (Eds.), SPLC '21: 25th ACM Int. Systems and Software Product Line Conference, Leicester, UK, Sept. 6-11, ACM, 2021, pp. 32-35.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Fantechi, A. Ferrari, S. Gnesi, L. Semini, Requirement Engineering of Software Product Lines: Extracting Variability Using NLP, in: 26th IEEE International Requirements Engineering Conference, RE 2018, Banff, Canada, August 20-24, 2018, IEEE, 2018, pp. 418-423.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>