Rule-based NLP vs ChatGPT in Ambiguity Detection, a Preliminary Study

Alessandro Fantechi (alessandro.fantechi@unifi.it, ORCID 0000-0002-4648-4667)
Dip. di Ingegneria dell'Informazione, Università di Firenze
Istituto di Scienza e Tecnologie dell'Informazione "A.Faedo", Consiglio Nazionale delle Ricerche, ISTI-CNR, Pisa

Stefania Gnesi (stefania.gnesi@isti.cnr.it, ORCID 0000-0002-0139-0421)
Istituto di Scienza e Tecnologie dell'Informazione "A.Faedo", Consiglio Nazionale delle Ricerche, ISTI-CNR, Pisa

Laura Semini (laura.semini@unipi.it, ORCID 0000-0001-8774-2346)
Istituto di Scienza e Tecnologie dell'Informazione "A.Faedo", Consiglio Nazionale delle Ricerche, ISTI-CNR, Pisa
Dipartimento di Informatica, Università di Pisa

Keywords: Ambiguity detection in requirements, chatGPT, rule-based NLP tools

ISSN 1613-0073

With the rapid advances of AI-based tools, the question of whether to use such tools or conventional rule-based tools arises in many application domains. In this paper, we address this question for the detection of ambiguity in requirements documents. For this purpose, we consider GPT-3, the third generation of the Generative Pretrained Transformer language model developed by OpenAI, and we compare its ambiguity detection capability with that of a publicly available rule-based NLP tool on a few example requirements documents.

Introduction

GPT-3 is the third generation of the Generative Pretrained Transformer language model developed by OpenAI. It is an autoregressive language model and the largest language model constructed to date. Given sufficient data, GPT-3 can solve a wide range of tasks, such as translation or text generation, without any fine-tuning for the specific task [1,2]. chatGPT is a GPT-3 based conversational chatbot that has gained popularity in recent months. It is designed to respond to questions and provide information in a conversational manner, using specific training to handle conversational text and generate natural and coherent responses.

Until now, attempts to define AI-based tools for analyzing software requirements have faced the well-known lack of a corpus of annotated requirements documents on which to train the models. Some existing NLP tools harness machine learning for the linguistic analysis of natural language, supported by the very large amount of example data available to train the learning model, and integrate this AI-based language analysis with a rule-based system for ambiguity search in requirements; they cannot, however, be considered AI tools [3,4].

Since GPT-3 is the largest language model constructed to date, we decided it was worth evaluating its ability to analyze software requirements and comparing its performance against a traditional rule-based NLP tool.

In this paper, we present a first step in this direction: on a few example requirements documents, we compared the ambiguity detection ability of chatGPT with that of a publicly available rule-based NLP tool, QuARS, which we already used in previous work for ambiguity and variability detection in requirements [5,6,7].

The experiments described below aim at giving a first answer to the following research questions:

RQ1 Can chatGPT be used to detect ambiguities in requirements?

RQ2 How does the chatGPT performance for ambiguity detection compare to a rule-based NLP tool?

The scope of the experiments is limited to four requirements documents and to a single query asked to chatGPT; however, since chatGPT returns different answers when the same question is asked again, we have run each query a few times.

Section 2 briefly introduces the issue of ambiguity detection in requirements, and the two different detection approaches of the two tools. Section 3 describes the example requirements documents used as a benchmark. The analysis of the data generated by the experiments in view of the research questions is addressed in Section 4. Final sections on threats to validity, lessons learned and conclusions follow.

Ambiguity detection

Software requirements are normally expressed informally through natural language sentences, which are potentially ambiguous, and this ambiguity is a known source of problems in the later stages of software development. In the requirements engineering community, many tools have been developed to help the analyst detect ambiguous requirements.

Rule based NLP tools for ambiguity detection

In the last decades some tools (e.g. [8,9,10,11,12,13]) have been defined that address the automated analysis of requirements documents by means of Natural Language Processing (NLP) tools [14] with the purpose of detecting ambiguities in them. This kind of analysis is aimed at identifying typical natural language defects, especially focusing on ambiguity sources. We list in Table 1 the most common sources of ambiguity, with a classification inspired by [15,16,17].

As a representative of these NLP tools, in this work we apply QuARS (Quality Analyzer for Requirement Specifications), developed in our lab [18], which shows good performance when compared with similar tools [7]. QuARS performs an automatic linguistic analysis of a requirements document in plain text format, according to the deterministic rules defined by a given quality model. Its output indicates the defective requirements and highlights the words that reveal the defect. The defect identification process includes lexical and syntactical analysis, while semantic analysis is not supported.

Table 1 (continued): Ambiguity classes and indicators.

| Ambiguity class | Description | Indicators |
|---|---|---|
| Underspecification | occurs when the sentence contains terms that need to be instantiated or qualified | information, interface, attack, button, channel, component, procedure, process, report, session,... |
| Passive voice | occurs when the subject of the passive sentence is not revealed | auxiliary "to be" with a past participle and no agent specified (by) |
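The lexical part of the rule-based analysis described in this subsection can be illustrated with a minimal sketch. The dictionaries below are a small, hand-picked subset of the indicators appearing in Table 1 and Table 5; they are an assumption made for illustration and are not the actual QuARS dictionaries, and the function name `find_indicators` is hypothetical. QuARS also performs syntactical analysis, which this sketch does not attempt.

```python
import re

# Illustrative indicator dictionaries (assumption: a tiny subset of the
# indicators in Table 1 and Table 5, NOT the real QuARS dictionaries).
INDICATORS = {
    "vagueness": ["suitable", "various", "adequate", "easy", "clear"],
    "disjunction": ["and/or", "or"],
    "escape clause": ["possibly", "if possible", "if appropriate"],
    "weakness": ["may", "can", "could"],
    "quantifier": ["all", "always", "every", "any"],
}


def find_indicators(requirement: str) -> list[tuple[str, str]]:
    """Return (indicator, defect class) pairs found in one requirement."""
    text = requirement.lower()
    found = []
    for defect_class, terms in INDICATORS.items():
        for term in terms:
            # Whole-term match: "or" must not fire inside "orders" or
            # inside "and/or", and "all" must not fire inside "shall".
            pattern = r"(?<![\w/])" + re.escape(term) + r"(?![\w/])"
            if re.search(pattern, text):
                found.append((term, defect_class))
    return found


if __name__ == "__main__":
    c2 = "The machine shall offer, as beverages, Coffee and Cappuccino or Tea."
    print(find_indicators(c2))  # [('or', 'disjunction')]
```

Because the rules are deterministic, the same input always yields the same findings, which is the reproducibility property that distinguishes this family of tools from an LLM.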

chatGPT for ambiguity detection

As an AI large language model (LLM), chatGPT doesn't use rules to detect ambiguities in the traditional sense. Instead, it uses training data and algorithms to generate an answer. LLMs are such complex algorithms that it is arduous, if not infeasible, to know exactly how and why the model returns a particular result (lack of explainability and transparency) and it is rare to get the same answer twice (nonreproducibility). These are well-known issues that need to be considered when switching from rule-based approaches to LLMs, particularly if there is a need to guarantee a quality level of the requirements. The purpose of this work, however, is to investigate whether chatGPT has reasonable performance in ambiguity detection compared with rule-based tools, such that it would make it a useful tool in software development, alone or in combination with rule-based tools. To the best of our knowledge, there is no documentation or literature so far on the ambiguity detection capabilities of chatGPT.

Data preparation

For our experiment we used two simple requirements documents introduced in previous papers, and two third-party requirements documents¹:

• Coffee machine, which gives a few requirements of an automatic coffee vending machine;

• E-shop, which describes a simple online shopping system;

• Library, which describes the requirements for the System Administration Module of an urban library system;

• DigitalHome, which specifies the requirements for developing a domotic system.

The E-shop requirements are the following:

E1 The system shall enable the user to enter the search text on the screen.
E2 The system shall display all the matching products based on the search.
E3 The system possibly notifies with a pop-up the user when no matching product is found on the search.
E4 The system shall allow a user to create his profile and set his credentials.
E5 The system shall authenticate user credentials to enter the profile.
E6 The system shall display the list of active orders and/or the list of completed orders in the customer profile.
E7 The system shall maintain customer email information as a required part of customer profile.
E8 The system shall send an order confirmation to the user through email.
E9 The system shall allow an user to add and remove products in the shopping cart.
E10 The system shall display various shipping methods.
E11 The order shall be shipped to the client address or, if the shipping to store service is available, to an associated store.
E12 The system shall enable the user to select the shipping method.
E13 The system may display the current tracking information about the order.
E14 The system shall display the available payment methods.
E15 The system shall allow the user to select the payment method for order.
E16 After delivery, the system may enable the users to enter their reviews and ratings.
E17 Shipping time should be as fast as possible.
E18 The system must report the available products, if the availability of these are are less than 10 percent the system should show a pop-up.

In Table 2 we summarise some characteristics of the considered documents. In Tables 3 and 4 we present the requirements of the coffee machine and E-shop, respectively.

Data Collection and Analysis

To address the RQs, including RQ2 that requires a comparison with a rule based NLP tool, we perform the following steps:

Automatic detection: We apply both QuARS and chatGPT to each document. The document is given as input to QuARS in text format while chatGPT is queried by asking: "Find the ambiguities of the following software requirements document: <list of requirements in text format>".

QuARS returns the requirements that are considered ambiguous, along with the term or expression that is an indicator of ambiguity and the defect class to which it refers. chatGPT has a less structured and more variable response format, but basically indicates which requirements are ambiguous and why.

Review: The output of the tools is reviewed by the authors in a joint meeting, and each reported defect is classified as a true ambiguity or as a false positive. The classification derived at this stage is the one used for data analysis in the following step.

Assessment: The analysis is both quantitative, in terms of performance metrics, and qualitative, to understand in detail what kind of defects are identified or ignored by the two tools.

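For the quantitative analysis, we use precision and recall, where $tp$ is the number of true positives, $fp$ the number of false positives and $fn$ the number of false negatives:

$$\mathit{precision} = \frac{tp}{tp + fp} \qquad \mathit{recall} = \frac{tp}{tp + fn}$$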
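The precision and recall metrics used in this paper can be computed from the reviewed counts as in the following minimal sketch. The function names are ours, and returning 0.0 when a denominator is zero is a convention we assume for illustration. The example figures are those reported in Table 7 for QuARS on the E-shop (8 true positives among 9 reported defects, 9 true ambiguities).

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported defects that are true ambiguities."""
    reported = tp + fp
    # Assumption: define precision as 0.0 when nothing is reported.
    return tp / reported if reported else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true ambiguities that were reported."""
    actual = tp + fn
    # Assumption: define recall as 0.0 when there is nothing to find.
    return tp / actual if actual else 0.0


if __name__ == "__main__":
    # QuARS on the E-shop (Table 7): tp = 8, fp = 1, fn = 1.
    print(round(precision(8, 1), 2), round(recall(8, 1), 2))  # 0.89 0.89
```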

Coffee machine

As an initial case study we consider the coffee machine, which is short enough to allow us to fully report the output of QuARS and the response to a query to chatGPT. We are well aware that a single response has no statistical significance, but this example is mainly for illustrative purposes. Let us first report the output of QuARS:

Then we asked chatGPT, obtaining the following answer:

1. It is not clear what is meant by "a suitable coin." Is there a specific type of coin that the machine will accept, or will it accept any coin? Is this a requirement or just a possibility?

The statement "The machine

Results analysis and comparison

We have manually analysed the requirements according to the classification of ambiguity sources in Table 1, and then assessed the tools' outcome. Results are summarised in Table 5 and commented below; performance results are non-significant in this toy example.

1. "suitable coin" in C1 is an ambiguity, detected by both tools;
2. "or" in C2 is an ambiguity detected by both tools. In the same requirement there is a coordination ambiguity, undetected by the tools;
3. "always" in C3 is a false positive, detected as an ambiguity by QuARS. ChatGPT returns an indication of a possible contradiction, which might exist, but is not an ambiguity;
4. the fact that the ring tone is possibly played, in C4, is an ambiguity and it is detected by both tools;
5. in C6 QuARS finds "any", which is a false positive, while chatGPT detects an incompleteness that actually exists, but is not an ambiguity.

E-shop

Our second experiment involved the E-shop example: we performed a manual analysis, an analysis with QuARS, and queried chatGPT twice, on different days. For space reasons, we do not report the full outputs but only the indicators found and the kind of defect, in Table 6.

Performance values are in Table 7 and show that the performance of chatGPT can be highly variable, which was expected, but also that it can be comparable with that of an established rule-based tool. It is interesting to note that chatGPT was able to detect a hidden ambiguity in E3 that was not found by manual analysis (note also that we have been working on this case study for some time and had never noticed the problem):

Smart Home

Our third experiment involved the smart home example: we performed an analysis with QuARS, and queried chatGPT as usual. We encountered a limitation of chatGPT: it does not accept documents of this length as input and returns an error. To get around the problem, we divided the document into two parts and had them analysed separately, then merged the results. Since we are looking for sources of ambiguity, which do not depend on the joint analysis of multiple requirements, we claim that this partition-based solution is acceptable. Each part has been analysed twice, on different days. This document returned many false positives, both with QuARS and chatGPT, and chatGPT has a very low recall (Table 7). For space reasons, we do not present the tools' output but only the performance measures. With regard to qualitative analysis, chatGPT found the following defect not found by QuARS, which is worth noting since it reveals an incompleteness of the QuARS dictionaries, which do not contain the term "compatible".

Ambiguity in 2.2.5, as it states "The system shall be compatible with a centralized HVAC

QuARS output for the coffee machine:

The line number: 1. after inserting a suitable coin, the user shall choose a beverage and select the amount of sugar. is defective because it contains the wording: suitable
The line number: 2. the machine shall offer, as beverages, coffee and cappuccino or tea. is defective because it contains the wording: or
The line number: 3. the machine shall always offer coffee. is defective because it contains the wording: always
The line number: 4. a ring tone possibly has to be played after beverage delivery. is defective because it contains the wording: possibly
The line number: 6. the british market requires tea and excludes any ring tone. is defective because it contains the wording: any

chatGPT answer for the coffee machine (continued):

shall offer, as beverages, Coffee and Cappuccino or Tea" is ambiguous because it is not clear whether the machine will offer all three options (Coffee, Cappuccino, and Tea) or only one of the options (Coffee, Cappuccino, or Tea).

shall always offer coffee" seems to contradict the previous statement about the available beverage options.

4. It is not clear what is meant by "a ring tone possibly has to be played after beverage delivery."

chatGPT answer for the E-shop, requirement E3:

Ambiguity in E3: It is not specified if the pop-up will be displayed when a single product is not found or multiple products.

Table 1: Ambiguity classes and indicators.

| Ambiguity class | Description | Indicators |
|---|---|---|
| Homonymy and polysemy | occur when a term can have different meanings, having different (homonymy) or one (polysemy) etymology | some examples are: bank, can, bat... (homonymies), left, right, fall, minute, ... (polysemies) |
| Analytical, attachment, coordination | occur when a sentence admits more than one grammatical structure, and different structures have different meanings | syntactic analysis: the sentence admits two or more syntactic trees |
| Anaphora | occurs when an element of a sentence depends for its reference on another, antecedent, element and it is not clear to which antecedent it refers | relative and demonstrative pronouns: that, which, their, it, them, they, both,... |
| Vagueness | occurs when it is not possible to interpret a sentence in an unequivocal way | clear, easy, strong, good, bad, adequate, tall, short, various, completed, similar, similarly, accordingly,... |
| Comparatives & superlatives | occurs when the term of comparison or the universe of discourse are missing | better, easier, worst, faster, bigger, biggest,... |
| Disjunctions | occurs when a sentence admits different models in which the first, the second or both disjuncts are true | or, and/or,... |
| Escape clauses | occurs when a sentence admits different models, containing or not the object of the escape clause | case, possibly, if possible, if appropriate, among others, as a minimum, when required, ... |
| Weakness | occurs when the sentence contains weak verbs | may, can, could,... |
| Quantifiers | in presence of quantifiers, ambiguities are due to the scope or to the universe of quantification | a, all, always, every, any, nothing,... |

Table 2: Characteristics of the requirement documents: number of requirements, number of words, authorship and characteristic of the system to be.

| Document | reqs | words | issued by | characteristics |
|---|---|---|---|---|
| Coffee machine | 6 | 63 | authors | toy example |
| E-shop | 18 | 263 | authors | toy example |
| Library | 94 | 1815 | company | information system |
| DigitalHome | 112 | 1121 | academia | control system |

Table 3: Coffee-machine requirements

C1 After inserting a suitable coin, the user shall choose a beverage and select the amount of sugar.
C2 The machine shall offer, as beverages, Coffee and Cappuccino or Tea.
C3 The machine shall always offer coffee.
C4 A ringtone possibly has to be played after beverage delivery.
C5 After the beverage is taken, the machine returns idle.
C6 The British market requires tea and excludes any ring tone.

Table 4: E-shop requirements (E1 to E18).

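Table 5: Coffee machine case study. We report the indicator found with its defect class.

| Req | QuARS indicator | QuARS defect | chatGPT indicator | chatGPT defect | Manual analysis indicator |
|---|---|---|---|---|---|
| C1 | suitable | vagueness | suitable | vagueness | suitable |
| C2 | or | disjunction | or | ambiguous disjunction | or |
| | - | - | - | - | coord. ambiguity |
| C3 | always | quantification | always | contradiction | - |
| C4 | possibly | optionality | possibly | optionality | possibly |
| | - | - | - | - | passive voice |
| C6 | any | quantification | - | incompleteness | - |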

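Table 6: E-shop case study. All indicators found are true positives unless labeled as false positives (fp).

| Req | QuARS indicator | QuARS defect | chatGPT 1 indicator | chatGPT 1 defect | chatGPT 2 indicator | chatGPT 2 defect | Manual analysis indicator |
|---|---|---|---|---|---|---|---|
| E2 | all | quantif. (fp) | - | - | - | - | - |
| E3 | possibly | optionality | possibly | optionality | - | - | possibly |
| | - | - | - | - | no | quantif. | - |
| E6 | and/or | optionality | and/or | ambig. disj. | and/or | ambig. disj. | and/or |
| E10 | various | vagueness | - | - | - | - | various |
| E11 | or | optionality | or | ambig. disj. | - | - | or |
| | - | - | - | - | associated | vague (fp) | - |
| E13 | may | weakness | may | weakness | - | - | may |
| | - | - | - | - | current | unclear (fp) | - |
| E16 | may | weakness | may | weakness | - | - | may |
| E17 | should | weakness | should | replace by shall | - | - | should |
| | - | - | - | - | as fast as possible | subjective | as fast as possible |
| E18 | should | weakness | should | replace by shall | - | - | should |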

Table 7: E-shop and Smart Home: performance measures.

| Document | QuARS precision | QuARS recall | chatGPT1 precision | chatGPT1 recall | chatGPT2 precision | chatGPT2 recall |
|---|---|---|---|---|---|---|
| eshop | 0,89 (8/9) | 0,89 (8/9) | 1 (7/7) | 0,78 (7/9) | 0,6 (3/5) | 0,33 (3/9) |
| smart_home | 0,24 (17/70) | 0,77 (17/22) | 0,28 (3/14) | 0,14 (3/22) | 0,17 (2/12) | 0,09 (2/22) |

¹ All documents are available at https://github.com/Vibe-NLP/RequirementsForValidation.

Acknowledgments

The research has been partially supported by the MIUR, Italy project PRIN 2017 FTXR7S "IT-MaTTerS" (Methods and Tools for Trustworthy Smart Systems).

Library

The last document considered is Library, which is slightly smaller than Smart Home. We analysed the document with QuARS and then with chatGPT five times, on five different days. In the table named GPT_QuARS_library in https://github.com/Vibe-NLP/RequirementsForValidation we list all the defects found. The table is truncated because, in all five queries, chatGPT did not report length errors but only found defects in the first 38 requirements. We therefore decided to use this document fragment for the performance measurements, which are shown in Table 8. In the GPT_QuARS_library table, for each analysis, we show each defect reported, labelling it directly as false positive (fp) or true positive (amb). In the adjacent column we report: for QuARS, which indicators were considered false positives or true ambiguities; for chatGPT, a fragment of the response, if significant.

Threats to validity

We have used precision and recall as metrics to compare the tools. The human intervention in the review and assessment steps, which determines the number of true/false positives and false negatives, is a threat to construct validity, and the involvement of the authors in these phases is also a threat to internal validity. With regard to external validity, we have presented a preliminary study, and the quantitative comparison is limited to three case studies, two compared tools, a single kind of query to chatGPT and a few chat sessions.

Conclusions and Future Work

The findings from the experiments allow us to give an answer, albeit preliminary, to the RQs:

RQ1 chatGPT can be used to detect ambiguities in requirements by simply asking: "Find the ambiguities of the following software requirements document: <list of requirements in text format>". We note that chatGPT does not process long requirement documents: either it returns an error or it provides a partial answer. Since ambiguity detection does not depend on processing the document as a whole, it is possible to break the requirements document into smaller parts and analyze the pieces separately.

RQ2 ChatGPT's performance results vary between chat sessions with the bot, especially recall; precision, on the other hand, is more stable and comparable to that of a rule-based NLP tool. Running several sessions with the same question improves recall. For example, taking the union of the 5 responses obtained from the chatbot for the library case study, we obtain the following performance: precision = 0,51 (28/55), recall = 0,55 (12/22).
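The effect of merging several chat sessions, mentioned in the answer to RQ2, can be illustrated with a minimal sketch. It assumes each session is summarised as a set of (requirement, indicator) findings and that the reviewed true ambiguities are available as a set of the same shape; all data below are made up for illustration and are not taken from the case studies. The sketch counts distinct findings only, which is a simplification of the counting used in the paper.

```python
def precision_recall(reported: set, gold: set) -> tuple[float, float]:
    """Precision and recall of a set of findings against the reviewed set."""
    tp = len(reported & gold)
    p = tp / len(reported) if reported else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r


def union_of_sessions(sessions: list[set]) -> set:
    """Merge the findings of several chat sessions into one set."""
    merged = set()
    for findings in sessions:
        merged |= findings
    return merged


if __name__ == "__main__":
    # Hypothetical data: four true ambiguities and two sessions.
    gold = {("R1", "or"), ("R2", "may"), ("R3", "various"), ("R4", "possibly")}
    s1 = {("R1", "or"), ("R2", "may")}
    s2 = {("R2", "may"), ("R3", "various"), ("R5", "current")}

    print(precision_recall(s1, gold))                           # (1.0, 0.5)
    print(precision_recall(union_of_sessions([s1, s2]), gold))  # (0.75, 0.75)
```

In this toy case the union raises recall from 0.5 to 0.75, at the cost of also accumulating the false positive reported in the second session.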

Validity threats can be mitigated in future work by involving third-party reviewers and measuring the level of agreement between them and by increasing the number of documents and querying chatGPT with different queries. Future work can further develop the analysis presented here along several dimensions:

• Assess the coverage by GPT-3 language model of the technical slang used in requirements;

• Exploit ChatGPT's ability to rationalise and explain ambiguity;

• Ask ChatGPT more focused questions, addressing the various classes of ambiguity separately;

• Develop the analysis with additional documents and evaluate the hypothesis that slicing a requirements document for chatGPT does not influence its results;

• We have seen that chatGPT is able to detect defects, such as incompleteness and inconsistency, that traditional NLP tools cannot identify or can identify with difficulty and after domain-focused training. A future study may be devoted to specifically measuring the performance of chatGPT in finding these classes of defects in requirements. Positive results in this respect could lead to the use of chatGPT to complement a rule-based tool to automatically detect these important quality criteria.
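The slicing of a long requirements document mentioned in the list above, combined with the query used in this study, could be sketched as follows. The word budget `max_words` and the function names are hypothetical: in our experiment the Smart Home document was simply divided into two parts by hand, and we do not know the actual input limit of chatGPT. The sketch only builds the query texts and does not call any chatbot API.

```python
# Query text used in this study; the list of requirements is appended to it.
PROMPT = "Find the ambiguities of the following software requirements document:"


def slice_requirements(requirements: list[str], max_words: int) -> list[list[str]]:
    """Group consecutive requirements into parts of at most max_words words.

    A requirement is never split; one that is longer than max_words
    ends up in a part of its own.
    """
    parts, current, count = [], [], 0
    for req in requirements:
        n = len(req.split())
        if current and count + n > max_words:
            parts.append(current)
            current, count = [], 0
        current.append(req)
        count += n
    if current:
        parts.append(current)
    return parts


def build_queries(requirements: list[str], max_words: int) -> list[str]:
    """Build one query per slice, each with the same prompt."""
    return [
        PROMPT + "\n" + "\n".join(part)
        for part in slice_requirements(requirements, max_words)
    ]
```

Keeping each requirement whole reflects the assumption, stated above, that the sources of ambiguity considered here do not depend on the joint analysis of multiple requirements.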

References

T. B. Brown, B. Mann, et al., Language models are few-shot learners, in: Annual Conference on Neural Information Processing Systems 2020, Dec. 6-12, 2020.

R. Dale, GPT-3: what's it good for?, Nat. Lang. Eng. 27 (2021).

A. Fantechi, S. Gnesi, S. Livi, L. Semini, A spaCy-based tool for extracting variability from NL requirements, in: M. Mousavi, P. Schobbens (Eds.), SPLC '21: 25th ACM Int. Systems and Software Product Line Conference, Leicester, UK, Sept. 6-11, 2021, ACM, 2021.

A. Fantechi, A. Ferrari, S. Gnesi, L. Semini, Requirement Engineering of Software Product Lines: Extracting Variability Using NLP, in: 26th IEEE International Requirements Engineering Conference, Banff, Canada, August 20-24, 2018, IEEE, 2018.

A. Fantechi, S. Gnesi, L. Semini, VIBE: looking for variability in ambiguous requirements, J. Syst. Softw. 195 (2023).

M. Arrabito, A. Fantechi, S. Gnesi, L. Semini, An experience with the application of three NLP tools for the analysis of natural language requirements, in: Proc. of Quality of Information and Communications Technology - 13th Int. Conference, QUATIC, volume 1266 of Communications in Computer and Information Science, Springer, 2020.

J. Kasser, Tiger-Pro, www.therightrequirement.com.

V. Ambriola, V. Gervasi, Processing natural language requirements, in: Int. Conference on Automated Software Engineering, ASE, Nov. 2-5, 1997, IEEE Computer Society, 1997.

O. Kenney, M. Cooper, Automating requirement quality standards with QVscribe, in: NLP4RE'20, co-located with the 26th Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2584 of CEUR Workshop Proc., CEUR-WS, 2020.

H. Femmer, Requirements quality defect detection with the Qualicen requirements scout, in: NLP4RE'18, co-located with the 23rd Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ), volume 2075 of CEUR Workshop Proceedings, 2018.

S. F. Tjong, D. M. Berry, The design of SREE - a prototype potential ambiguity finder for requirements specifications and lessons learned, in: International Working Conference on Requirements Engineering: Foundation for Software Quality, Essen, Germany, volume 7830, Springer, 2013.

The Reuse Company, RAT, www.reusecompany.com/rat-authoring-tools.

G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003).

D. Berry, E. Kamsties, M. Krieger, From contract drafting to software specification: Linguistic sources of ambiguity - a handbook, version 1.0, 2003.

V. Gervasi, A. Ferrari, D. Zowghi, P. Spoletini, Ambiguity in requirements engineering: Towards a unifying framework, in: From Software Engineering to Formal Methods and Tools, and Back - Essays Dedicated to Stefania Gnesi on the Occasion of Her 65th Birthday, volume 11865 of Lecture Notes in Computer Science, Springer, 2019.

INCOSE, Guide for Writing Requirements, version 3, 2019.

S. Gnesi, G. Lami, G. Trentanni, An automatic tool for the analysis of natural language requirements, Computer Systems: Science & Engineering 20 (2005).