<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Rule-based NLP vs ChatGPT in Ambiguity Detection, a Preliminary Study</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Fantechi</surname></persName>
							<email>alessandro.fantechi@unifi.it</email>
							<affiliation key="aff0">
								<orgName type="department">Dip. di Ingegneria dell&apos;Informazione</orgName>
								<orgName type="institution">Università di Firenze</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Istituto di Scienza e Tecnologie dell&apos;Informazione &quot;A.Faedo&quot;</orgName>
								<orgName type="institution" key="instit1">Consiglio Nazionale delle Ricerche</orgName>
								<orgName type="institution" key="instit2">ISTI-CNR</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stefania</forename><surname>Gnesi</surname></persName>
							<email>stefania.gnesi@isti.cnr.it</email>
							<affiliation key="aff1">
								<orgName type="department">Istituto di Scienza e Tecnologie dell&apos;Informazione &quot;A.Faedo&quot;</orgName>
								<orgName type="institution" key="instit1">Consiglio Nazionale delle Ricerche</orgName>
								<orgName type="institution" key="instit2">ISTI-CNR</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Laura</forename><surname>Semini</surname></persName>
							<email>laura.semini@unipi.it</email>
							<affiliation key="aff1">
								<orgName type="department">Istituto di Scienza e Tecnologie dell&apos;Informazione &quot;A.Faedo&quot;</orgName>
								<orgName type="institution" key="instit1">Consiglio Nazionale delle Ricerche</orgName>
								<orgName type="institution" key="instit2">ISTI-CNR</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Dipartimento di Informatica</orgName>
								<orgName type="institution">Università di Pisa</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Rule-based NLP vs ChatGPT in Ambiguity Detection, a Preliminary Study</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EED21D3920A230083D3202FD923D5A49</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Ambiguity detection in requirements</term>
					<term>chatGPT</term>
					<term>rule-based NLP tools</term>
					<term>ORCID 0000-0002-4648-4667 (A. Fantechi)</term>
					<term>ORCID 0000-0002-0139-0421 (S. Gnesi)</term>
					<term>ORCID 0000-0001-8774-2346 (L. Semini)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rapid advances of AI-based tools, the question of whether to use such tools or conventional rule-based tools arises in many application domains. In this paper, we address this question for the detection of ambiguity in requirements documents. For this purpose, we consider GPT-3, the third generation of the Generative Pretrained Transformer language model developed by OpenAI, and we compare its ambiguity detection capability with that of a publicly available rule-based NLP tool on a few example requirements documents.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>GPT-3 is the third generation of the Generative Pretrained Transformer language model developed by OpenAI. It is an autoregressive language model and one of the largest language models constructed to date. Trained on a sufficiently large amount of data, GPT-3 can address a wide range of tasks, such as translation or text generation, without any task-specific fine-tuning <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. chatGPT is a GPT-3 based conversational chatbot that has gained popularity in recent months. It is designed to respond to questions and provide information in a conversational manner, using specific training to handle conversational text and generate natural and coherent responses.</p><p>Until now, attempts to define AI-based tools for analyzing software requirements have faced the well-known lack of a corpus of annotated requirements documents on which to train the models. Some existing NLP tools use machine learning for the linguistic analysis of natural language, supported by the very large amount of example data available to train the learning model, and integrate this AI-based language analysis with a rule-based system for ambiguity search in requirements; however, they cannot be considered AI tools <ref type="bibr">[3,</ref><ref type="bibr" target="#b2">4]</ref>.</p><p>Since GPT-3 is one of the largest language models constructed to date, we decided it was worth evaluating its ability to analyze software requirements and comparing its performance against a traditional rule-based NLP tool.</p><p>In this paper, we present a first step in this direction: on a few example requirements documents, we compare the ambiguity detection ability of chatGPT with that of a publicly available rule-based NLP tool, QuARS, which we used in previous work for ambiguity and variability detection in requirements <ref type="bibr" target="#b3">[5,</ref><ref type="bibr" target="#b4">6,</ref><ref type="bibr" target="#b5">7]</ref>.</p><p>The experiments described below aim at giving a first answer to the following research questions:</p><p>RQ1 Can chatGPT be used to detect ambiguities in requirements?</p><p>RQ2 How does the performance of chatGPT in ambiguity detection compare to that of a rule-based NLP tool?</p><p>The scope of the experiments is limited to four requirements documents and to a single query asked to chatGPT; however, since chatGPT returns different answers when the same question is asked again, we have run each query a few times.</p><p>Section 2 briefly introduces the issue of ambiguity detection in requirements, and the two different detection approaches of the two tools. Section 3 describes the example requirements documents used as a benchmark. The analysis of the data generated by the experiments in view of the research questions is addressed in Section 4. Final sections on threats to validity, lessons learned and conclusions follow.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Ambiguity detection</head><p>Software requirements are normally expressed informally through natural language sentences, which are potentially ambiguous, and this ambiguity is a known source of problems in the later stages of software development. In the requirements engineering community, many tools have been developed to help the analyst detect ambiguous requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Rule based NLP tools for ambiguity detection</head><p>In recent decades, several tools (e.g. <ref type="bibr" target="#b6">[8,</ref><ref type="bibr" target="#b7">9,</ref><ref type="bibr" target="#b8">10,</ref><ref type="bibr" target="#b9">11,</ref><ref type="bibr" target="#b10">12,</ref><ref type="bibr" target="#b11">13]</ref>) have been developed that analyse requirements documents automatically by means of Natural Language Processing (NLP) techniques <ref type="bibr" target="#b12">[14]</ref>, with the purpose of detecting ambiguities. This kind of analysis is aimed at identifying typical natural language defects, especially focusing on ambiguity sources. We list in Table <ref type="table" target="#tab_0">1</ref> the most common sources of ambiguity, with a classification inspired by <ref type="bibr" target="#b13">[15,</ref><ref type="bibr" target="#b14">16,</ref><ref type="bibr" target="#b15">17]</ref>.</p><p>As a representative of these NLP tools, in this work we apply QuARS (Quality Analyzer for Requirement Specifications), developed in our lab <ref type="bibr" target="#b16">[18]</ref>, which shows a good performance when compared with similar tools <ref type="bibr" target="#b5">[7]</ref>. QuARS performs an automatic linguistic analysis of a requirements document in plain text format, according to the deterministic rules defined by a given quality model. Its output indicates the defective requirements and highlights the words that reveal the defect. The defect identification process includes lexical and syntactical analysis, while semantic analysis is not supported.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Underspecification</head><p>occurs when the sentence contains terms that need to be instantiated or qualified. Indicators: information, interface, attack, button, channel, component, procedure, process, report, session,...</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Passive voice</head><p>occurs when the subject of the passive sentence is not revealed. Indicators: auxiliary to be with a past participle and no agent specified (by).</p></div>
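<div xmlns="http://www.tei-c.org/ns/1.0"><p>The lexical part of such a rule-based analysis can be illustrated with a minimal sketch. It is not the QuARS implementation: the dictionaries below are a small, hypothetical excerpt of the indicators listed in Table 1, find_indicators is an illustrative name, and syntactic analysis is left out entirely. The sketch only shows how a deterministic dictionary lookup flags a requirement and reports the word that reveals the defect.</p>

```python
import re

# Hypothetical, heavily abridged indicator dictionaries (excerpt of Table 1).
# A real quality model would use larger, curated dictionaries.
INDICATORS = {
    "vagueness": ["clear", "easy", "adequate", "various"],
    "disjunction": ["or", "and/or"],
    "escape clause": ["possibly", "if possible", "if appropriate"],
    "weakness": ["may", "can", "could"],
    "quantifier": ["all", "always", "every", "any"],
}


def find_indicators(requirement):
    """Return (defect class, indicator) pairs found in one requirement."""
    # Tokenise into lowercase words; "and/or" is kept as a single token.
    tokens = re.findall(r"[a-z]+(?:/[a-z]+)*", requirement.lower())
    # Pad with spaces so that only whole words or phrases can match,
    # e.g. "or" does not match inside "order" or "and/or".
    padded = " " + " ".join(tokens) + " "
    hits = []
    for defect_class, terms in INDICATORS.items():
        for term in terms:
            if " " + term + " " in padded:
                hits.append((defect_class, term))
    return hits


print(find_indicators("The system may display the current tracking information about the order."))
# [('weakness', 'may')]
```

<p>Because the lookup is purely lexical, the same input always yields the same output, but a term that is missing from the dictionaries is never reported.</p></div>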
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">chatGPT for ambiguity detection</head><p>As an AI large language model (LLM), chatGPT doesn't use rules to detect ambiguities in the traditional sense. Instead, it uses training data and algorithms to generate an answer. LLMs are such complex algorithms that it is arduous, if not infeasible, to know exactly how and why the model returns a particular result (lack of explainability and transparency) and it is rare to get the same answer twice (nonreproducibility). These are well-known issues that need to be considered when switching from rule-based approaches to LLMs, particularly if there is a need to guarantee a quality level of the requirements. The purpose of this work, however, is to investigate whether chatGPT has reasonable performance in ambiguity detection compared with rule-based tools, such that it would make it a useful tool in software development, alone or in combination with rule-based tools. To the best of our knowledge, there is no documentation or literature so far on the ambiguity detection capabilities of chatGPT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data preparation</head><p>To perform our experiments we have used two simple requirements documents introduced in previous papers, and two third-party requirements documents <ref type="foot" target="#foot_0">1</ref>:</p><p>Coffee machine, which gives a few requirements for an automatic coffee vending machine;</p><p>E-shop, which describes a simple online shopping system;</p><p>Library, which describes the requirements for the System Administration Module of an urban library system;</p><p>DigitalHome, which specifies the requirements for developing a domotic system.</p><p>In Table <ref type="table" target="#tab_1">2</ref> we summarise some characteristics of the considered documents. In Tables <ref type="table" target="#tab_3">3 and 4</ref> we present the requirements of the coffee machine and E-shop, respectively.</p><p>E-shop requirements (Table 4):</p><p>E1 The system shall enable the user to enter the search text on the screen.</p><p>E2 The system shall display all the matching products based on the search.</p><p>E3 The system possibly notifies with a pop-up the user when no matching product is found on the search.</p><p>E4 The system shall allow a user to create his profile and set his credentials.</p><p>E5 The system shall authenticate user credentials to enter the profile.</p><p>E6 The system shall display the list of active orders and/or the list of completed orders in the customer profile.</p><p>E7 The system shall maintain customer email information as a required part of customer profile.</p><p>E8 The system shall send an order confirmation to the user through email.</p><p>E9 The system shall allow an user to add and remove products in the shopping cart.</p><p>E10 The system shall display various shipping methods.</p><p>E11 The order shall be shipped to the client address or, if the shipping to store service is available, to an associated store.</p><p>E12 The system shall enable the user to select the shipping method.</p><p>E13 The system may display the current tracking information about the order.</p><p>E14 The system shall display the available payment methods.</p><p>E15 The system shall allow the user to select the payment method for order.</p><p>E16 After delivery, the system may enable the users to enter their reviews and ratings.</p><p>E17 Shipping time should be as fast as possible.</p><p>E18 The system must report the available products, if the availability of these are are less than 10 percent the system should show a pop-up.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Data Collection and Analysis</head><p>To address the RQs, including RQ2 that requires a comparison with a rule based NLP tool, we perform the following steps:</p><p>Automatic detection: We apply both QuARS and chatGPT to each document. The document is given as input to QuARS in text format while chatGPT is queried by asking: "Find the ambiguities of the following software requirements document: &lt;list of requirements in text format&gt;".</p><p>QuARS returns the requirements that are considered ambiguous, along with the term or expression that is an indicator of ambiguity and the defect class to which it refers. chatGPT has a less structured and more variable response format, but basically indicates which requirements are ambiguous and why.</p><p>Review: The output of the tools is reviewed by the authors in a joint meeting, and each reported defect is classified as a true ambiguity or a false positive. The classification derived at this stage is the one used for data analysis in the following step.</p><p>Assessment: The analysis is both quantitative, in terms of performance metrics, and qualitative, to understand in detail what kind of defects are identified or ignored by the two tools.</p><p>For the quantitative analysis, we use the following metrics, where 𝑡𝑝 is the number of true positives, 𝑓𝑝 the number of false positives and 𝑓𝑛 the number of false negatives:</p><formula xml:id="formula_0">\mathit{precision} = \frac{tp}{tp + fp} = \frac{|\text{found ambiguities} \cap \text{true ambiguities}|}{|\text{found ambiguities}|} \qquad \mathit{recall} = \frac{tp}{tp + fn} = \frac{|\text{found ambiguities} \cap \text{true ambiguities}|}{|\text{true ambiguities}|}</formula></div>
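<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked illustration of the two metrics, the following sketch computes precision and recall from sets of (requirement, indicator) pairs. The function name precision_recall and the convention of returning 0.0 for an empty set are assumptions made for this example. The data are the QuARS findings and the manual analysis for the E-shop document, transcribed from Table 6.</p>

```python
def precision_recall(found, true):
    """Return (precision, recall) of the reported defects against the true ambiguities."""
    found, true = set(found), set(true)
    tp = len(found.intersection(true))  # reported and genuinely ambiguous
    fp = len(found - true)              # reported but not ambiguous
    fn = len(true - found)              # ambiguous but not reported
    # Assumption of this sketch: an empty set yields 0.0 instead of an error.
    precision = tp / (tp + fp) if found else 0.0
    recall = tp / (tp + fn) if true else 0.0
    return precision, recall


# Manual analysis of the E-shop document (last column of Table 6).
manual = {
    ("E3", "possibly"), ("E6", "and/or"), ("E10", "various"),
    ("E11", "or"), ("E13", "may"), ("E16", "may"),
    ("E17", "should"), ("E17", "as fast as possible"), ("E18", "should"),
}
# QuARS findings for the same document; ("E2", "all") is the false positive.
quars = {
    ("E2", "all"), ("E3", "possibly"), ("E6", "and/or"),
    ("E10", "various"), ("E11", "or"), ("E13", "may"),
    ("E16", "may"), ("E17", "should"), ("E18", "should"),
}

precision, recall = precision_recall(quars, manual)
print(round(precision, 2), round(recall, 2))  # 0.89 0.89
```

<p>With eight true positives, one false positive and one false negative, both values equal 8/9, in agreement with the QuARS figures reported for the E-shop document.</p></div>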
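<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Coffee machine</head><p>As an initial case study we consider the coffee machine, which is short enough to allow us to fully report the output of QuARS and the response to a query to chatGPT. We are well aware that considering only one response is statistically irrelevant, but this example is mainly for illustrative purposes. Let us first report the output of QuARS:</p><p>The line number: 1. after inserting a suitable coin, the user shall choose a beverage and select the amount of sugar. is defective because it contains the wording: suitable</p><p>The line number: 2. the machine shall offer, as beverages, coffee and cappuccino or tea. is defective because it contains the wording: or</p><p>The line number: 3. the machine shall always offer coffee. is defective because it contains the wording: always</p><p>The line number: 4. a ringtone possibly has to be played after beverage delivery. is defective because it contains the wording: possibly</p><p>The line number: 6. the british market requires tea and excludes any ring tone. is defective because it contains the wording: any</p><p>Then we have asked chatGPT obtaining the following answer:</p><p>1. It is not clear what is meant by "a suitable coin." Is there a specific type of coin that the machine will accept, or will it accept any coin?</p><p>2. The statement "The machine shall offer, as beverages, Coffee and Cappuccino or Tea" is ambiguous because it is not clear whether the machine will offer all three options (Coffee, Cappuccino, and Tea) or only one of the options (Coffee, Cappuccino, or Tea).</p><p>3. The statement "The machine shall always offer coffee" seems to contradict the previous statement about the available beverage options.</p><p>4. It is not clear what is meant by "a ringtone possibly has to be played after beverage delivery." Is this a requirement or just a possibility?</p></div>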
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results analysis and comparison</head><p>We have manually analysed the requirements according to the classification of ambiguity sources in Table <ref type="table" target="#tab_0">1</ref>, and then assessed the tools' outcome. Results are summarised in Table <ref type="table" target="#tab_4">5</ref> and commented below; performance results are not significant in this toy example.</p><p>1. "suitable coin" in C1 is an ambiguity, detected by both tools;</p><p>2. "or" in C2 is an ambiguity, detected by both tools. In the same requirement there is a coordination ambiguity, undetected by the tools;</p><p>3. "always" in C3 is a false positive, detected as an ambiguity by QuARS. ChatGPT returns an indication of a possible contradiction, which might exist, but is not an ambiguity;</p><p>4. the fact that the ring tone is possibly played, in C4, is an ambiguity and it is detected by both tools;</p><p>5. in C6 QuARS finds "any", which is a false positive, while chatGPT detects an incompleteness that actually exists, but is not an ambiguity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">E-shop</head><p>Our second experiment involved the E-shop example: we performed a manual analysis, an analysis with QuARS, and queried chatGPT twice, on different days. For space reasons, we do not report the whole outcomes but only the indicators found and the kind of defect, in Table <ref type="table" target="#tab_6">6</ref>.</p><p>Performance values are in Table <ref type="table" target="#tab_7">7</ref> and show that the performance of chatGPT can be highly variable, which was expected, but also that it can be comparable with that of an established rule-based tool. It is interesting to note that chatGPT was able to detect a hidden ambiguity in E3 that was not found by manual analysis (note also that we have been working on this case study for some time and had never noticed the problem):</p><p>Ambiguity in E3: It is not specified if the pop-up will be displayed when a single product is not found or multiple products.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Smart Home</head><p>Our third experiment involved the smart home example (the DigitalHome document): we performed an analysis with QuARS, and queried chatGPT as usual. We encountered a limitation of chatGPT: it does not accept documents of this length as input and returns an error. To get around the problem, we divided the document into two parts and had them analysed separately, then merged the results. Since we are looking for sources of ambiguity, which do not depend on the joint analysis of multiple requirements, we claim that this partition-based solution is acceptable. Each part has been analysed twice, on different days. This document returned many false positives, both with QuARS and chatGPT, and chatGPT has a very low recall (Table <ref type="table" target="#tab_7">7</ref>). For space reasons, we do not present the tools' output but only the performance measures. With regard to qualitative analysis, chatGPT found the following defect not found by QuARS, which is worth noting since it reveals an incompleteness of the QuARS dictionaries, which do not contain the term compatible.</p><p>Ambiguity in 2.2.5, as it states "The system shall be compatible with a centralized HVAC</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>The line number: 1. after inserting a suitable coin, the user shall choose a beverage and select the amount of sugar. is defective because it contains the wording: suitable The line number: 2. the machine shall offer, as beverages, coffee and cappuccino or tea. is defective because it contains the wording: or The line number: 3. the machine shall always offer coffee. is defective because it contains the wording: always The line number: 4. a ringtone possibly has to be played after beverage delivery. is defective because it contains the wording: possibly The line number: 6. the british market requires tea and excludes any ring tone. is defective because it contains the wording: any</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>2. The statement "The machine shall offer, as beverages, Coffee and Cappuccino or Tea" is ambiguous because it is not clear whether the machine will offer all three options (Coffee, Cappuccino, and Tea) or only one of the options (Coffee, Cappuccino, or Tea).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>3. The statement "The machine shall always offer coffee" seems to contradict the previous statement about the available beverage options. 4. It is not clear what is meant by "a ringtone possibly has to be played after beverage delivery."</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Ambiguity in E3: It is not specified if the pop-up will be displayed when a single product is not found or multiple products.</figDesc></figure>
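<div xmlns="http://www.tei-c.org/ns/1.0"><p>The partition-based workaround used for the Smart Home document can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the split is by number of requirements rather than by any length limit of chatGPT, and the step that actually submits a query to the chatbot is deliberately not shown. The query text is the one reported in Section 4.</p>

```python
PROMPT = "Find the ambiguities of the following software requirements document: "


def split_requirements(requirements, parts=2):
    """Split a list of requirements into at most `parts` contiguous chunks."""
    if not requirements:
        return []
    size = -(-len(requirements) // parts)  # ceiling division
    return [requirements[i:i + size] for i in range(0, len(requirements), size)]


def build_queries(requirements, parts=2):
    """Build one query per chunk, each carrying the same question."""
    chunks = split_requirements(requirements, parts)
    return [PROMPT + "\n".join(chunk) for chunk in chunks]


def merge_findings(findings_per_chunk):
    """Concatenate the findings obtained for each chunk, in document order."""
    merged = []
    for findings in findings_per_chunk:
        merged.extend(findings)
    return merged


requirements = ["R1 ...", "R2 ...", "R3 ...", "R4 ...", "R5 ..."]  # placeholder texts
print(split_requirements(requirements))
# [['R1 ...', 'R2 ...', 'R3 ...'], ['R4 ...', 'R5 ...']]
```

<p>Merging by simple concatenation is adequate only under the assumption stated in the text, namely that the ambiguity sources considered do not depend on the joint analysis of multiple requirements.</p></div>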
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Ambiguity classes and indicators.</figDesc><table><row><cell>Ambiguity classes</cell><cell>Definition</cell><cell>Indicators</cell></row><row><cell>Homonymy and polysemy</cell><cell>occur when a term can have different meanings, having different (homonymy) or one (polysemy) etymology</cell><cell>some examples are: bank, can, bat... (homonymies), left, right, fall, minute, ... (polysemies)</cell></row><row><cell>Analytical, attachment, coordination</cell><cell>occur when a sentence admits more than one grammatical structure, and different structures have different meanings</cell><cell>syntactic analysis: the sentence admits two or more syntactic trees</cell></row><row><cell>Anaphora</cell><cell>occurs when an element of a sentence depends for its reference on another, antecedent, element and it is not clear to which antecedent it refers</cell><cell>relative and demonstrative pronouns: that, which, their, it, them, they, both,...</cell></row><row><cell>Vagueness</cell><cell>occurs when it is not possible to interpret a sentence in an unequivocal way</cell><cell>clear, easy, strong, good, bad, adequate, tall, short, various, completed, similar, similarly, accordingly,...</cell></row><row><cell>Comparatives &amp; superlatives</cell><cell>occurs when the term of comparison or the universe of discourse are missing</cell><cell>better, easier, worst, faster, bigger, biggest,...</cell></row><row><cell>Disjunctions</cell><cell>occurs when a sentence admits different models in which the first, the second or both disjuncts are true</cell><cell>or, and/or,...</cell></row><row><cell>Escape clauses</cell><cell>occurs when a sentence admits different models, containing or not the object of the escape clause</cell><cell>case, possibly, if possible, if appropriate, among others, as a minimum, when required, ...</cell></row><row><cell>Weakness</cell><cell>occurs when the sentence contains weak verbs</cell><cell>may, can, could,...</cell></row><row><cell>Quantifiers</cell><cell>in presence of quantifiers, ambiguities are due to the scope or to the universe of quantification</cell><cell>a, all, always, every, any, nothing,...</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Characteristics of the requirement documents: number of requirements, number of words, authorship and characteristic of the system to be.</figDesc><table><row><cell></cell><cell>reqs</cell><cell>words</cell><cell>issued by</cell><cell>characteristics</cell></row><row><cell>Coffee machine</cell><cell>6</cell><cell>63</cell><cell>authors</cell><cell>toy example</cell></row><row><cell>E-shop</cell><cell>18</cell><cell>263</cell><cell>authors</cell><cell>toy example</cell></row><row><cell>Library</cell><cell>94</cell><cell>1815</cell><cell>company</cell><cell>information system</cell></row><row><cell>DigitalHome</cell><cell>112</cell><cell>1121</cell><cell>academia</cell><cell>control system</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Coffee-machine requirements</figDesc><table><row><cell>C1</cell><cell>After inserting a suitable coin, the user shall choose a beverage and select the amount of sugar.</cell></row><row><cell>C2</cell><cell>The machine shall offer, as beverages, Coffee and Cappuccino or Tea.</cell></row><row><cell>C3</cell><cell>The machine shall always offer coffee.</cell></row><row><cell>C4</cell><cell>A ringtone possibly has to be played after beverage delivery.</cell></row><row><cell>C5</cell><cell>After the beverage is taken, the machine returns idle.</cell></row><row><cell>C6</cell><cell>The British market requires tea and excludes any ring tone.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc></figDesc><table><row><cell>E-shop requirements</cell></row><row><cell>E1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Coffee machine case study. We report the indicator found with its defect class.</figDesc><table><row><cell></cell><cell cols="2">QuARS</cell><cell cols="2">chatGPT</cell><cell>Manual analysis</cell></row><row><cell>Req</cell><cell>Indicator</cell><cell>Defect</cell><cell>Indicator</cell><cell>Defect</cell><cell>Indicator</cell></row><row><cell>C1</cell><cell>suitable</cell><cell>vagueness</cell><cell>suitable</cell><cell>vagueness</cell><cell>suitable</cell></row><row><cell>C2</cell><cell>or</cell><cell>disjunction</cell><cell>or</cell><cell>ambiguous disjunction</cell><cell>or</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>coord. ambiguity</cell></row><row><cell>C3</cell><cell>always</cell><cell>quantification</cell><cell>always</cell><cell>contradiction</cell><cell>-</cell></row><row><cell>C4</cell><cell>possibly</cell><cell>optionality</cell><cell>possibly</cell><cell>optionality</cell><cell>possibly</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>passive voice</cell></row><row><cell>C6</cell><cell>any</cell><cell>quantification</cell><cell>-</cell><cell>incompleteness</cell><cell>-</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6</head><label>6</label><figDesc>E-shop case study. All indicators found are true positives unless labeled as false positives (fp).</figDesc><table><row><cell></cell><cell cols="2">QuARS</cell><cell cols="2">chatGPT 1</cell><cell cols="2">chatGPT 2</cell><cell>Manual analysis</cell></row><row><cell>Req</cell><cell>Indicator</cell><cell>Defect</cell><cell>Indicator</cell><cell>Defect</cell><cell>Indicator</cell><cell>Defect</cell><cell>Indicator</cell></row><row><cell>E2</cell><cell>all</cell><cell>quantif. (fp)</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell>E3</cell><cell>possibly</cell><cell>optionality</cell><cell>possibly</cell><cell>optionality</cell><cell>-</cell><cell>-</cell><cell>possibly</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>no</cell><cell>quantif.</cell><cell>-</cell></row><row><cell>E6</cell><cell>and/or</cell><cell>optionality</cell><cell>and/or</cell><cell>ambig. disj.</cell><cell>and/or</cell><cell>ambig. disj.</cell><cell>and/or</cell></row><row><cell>E10</cell><cell>various</cell><cell>vagueness</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>various</cell></row><row><cell>E11</cell><cell>or</cell><cell>optionality</cell><cell>or</cell><cell>ambig. disj.</cell><cell>-</cell><cell>-</cell><cell>or</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>associated</cell><cell>vague (fp)</cell><cell>-</cell></row><row><cell>E13</cell><cell>may</cell><cell>weakness</cell><cell>may</cell><cell>weakness</cell><cell>-</cell><cell>-</cell><cell>may</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>current</cell><cell>unclear (fp)</cell><cell>-</cell></row><row><cell>E16</cell><cell>may</cell><cell>weakness</cell><cell>may</cell><cell>weakness</cell><cell>-</cell><cell>-</cell><cell>may</cell></row><row><cell>E17</cell><cell>should</cell><cell>weakness</cell><cell>should</cell><cell>replace by shall</cell><cell>-</cell><cell>-</cell><cell>should</cell></row><row><cell></cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>as fast as possible</cell><cell>subjective</cell><cell>as fast as possible</cell></row><row><cell>E18</cell><cell>should</cell><cell>weakness</cell><cell>should</cell><cell>replace by shall</cell><cell>-</cell><cell>-</cell><cell>should</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 7</head><label>7</label><figDesc>E-shop and Smart Home: performance measures.</figDesc><table><row><cell></cell><cell cols="2">QuARS</cell><cell cols="2">chatGPT1</cell><cell cols="2">chatGPT2</cell></row><row><cell></cell><cell>precision</cell><cell>recall</cell><cell>precision</cell><cell>recall</cell><cell>precision</cell><cell>recall</cell></row><row><cell>eshop</cell><cell>0,89 (8/9)</cell><cell>0,89 (8/9)</cell><cell>1 (7/7)</cell><cell>0,78 (7/9)</cell><cell>0,6 (3/5)</cell><cell>0,33 (3/9)</cell></row><row><cell>smart_home</cell><cell>0,24 (17/70)</cell><cell>0,77 (17/22)</cell><cell>0,28 (3/14)</cell><cell>0,14 (3/22)</cell><cell>0,17 (2/12)</cell><cell>0,09 (2/22)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">All documents are available at https://github.com/Vibe-NLP/RequirementsForValidation.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The research has been partially supported by the MIUR, Italy project PRIN 2017 FTXR7S ''IT-MaTTerS'' (Methods and Tools for Trustworthy Smart Systems).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Library</head><p>The last document considered is Library, which is slightly smaller in size than Smart Home. We analysed the document with QuARS and then with chatGPT five times, on five different days. In the table named GPT_QuARS_library in https://github.com/Vibe-NLP/RequirementsForValidation we list all the defects found. The table is truncated because, in all five sessions, chatGPT only found defects in the first 38 requirements, although it did not report length errors. We therefore decided to restrict the performance measurements to this document fragment; they are shown in Table <ref type="table">8</ref>. In the GPT_QuARS_library table, for each analysis, we show each defect reported, labelling it directly as false positive (fp) or true positive (amb). In the adjacent column we report: for QuARS, which indicators were considered false positives or true ambiguities; for chatGPT, a fragment of the response, if significant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Threats to validity</head><p>We have used precision and recall as metrics to compare the tools. Human intervention in the review and assessment steps, which produce the numbers of true positives, false positives and false negatives, is a threat to construct validity, and the involvement of the authors in these phases is also a threat to internal validity. With regard to external validity, we have presented a preliminary study: the quantitative comparison is limited to three case studies, two tools, a single kind of query to chatGPT, and a few chat sessions.</p></div>
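<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked illustration of the two metrics, the sketch below computes precision and recall from counts of true positives, false positives and false negatives, assuming the standard definitions. The function names are ours, and the counts are our reading of the QuARS e-shop entries of Table 7, "0,89 (8/9)", for both metrics.</p>
<p>
```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported defects that are true ambiguities: TP / (TP + FP)."""
    reported = true_positives + false_positives
    if reported == 0:
        raise ValueError("precision is undefined when nothing is reported")
    return true_positives / reported


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of true ambiguities that were reported: TP / (TP + FN)."""
    actual = true_positives + false_negatives
    if actual == 0:
        raise ValueError("recall is undefined when there are no true ambiguities")
    return true_positives / actual


# Counts assumed from the QuARS e-shop entries of Table 7, "0,89 (8/9)":
# 9 defects reported, 8 of them true; 9 true ambiguities, 8 of them found.
quars_eshop_precision = precision(true_positives=8, false_positives=1)
quars_eshop_recall = recall(true_positives=8, false_negatives=1)

print(f"precision = {quars_eshop_precision:.2f}")  # 0.89
print(f"recall    = {quars_eshop_recall:.2f}")     # 0.89
```
</p>
<p>Both values depend on the manual labelling of each reported defect, which is why the review step is listed above as a threat to validity.</p></div>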
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>The findings from this experience allow us to give an answer, albeit preliminary, to the RQs:</p><p>RQ1 chatGPT can be used to detect ambiguities in requirements by simply asking: "Find the ambiguities of the following software requirements document:&lt;list of requirements in text format&gt;". We note that chatGPT does not process long requirement documents: either it returns an error or it provides a partial answer. Since ambiguity detection does not depend on processing the document as a whole, it is possible to break the requirements document into smaller parts and analyse the pieces separately.</p><p>RQ2 ChatGPT's performance varies between chat sessions, especially in recall; precision, on the other hand, is more stable and comparable to that of a rule-based NLP tool. Running several sessions with the same question improves recall. For example, taking the union of the 5 responses obtained from the chatbot for the Library case study gives the following performance: precision = 0,51 (28/55), recall = 0,55 (12/22).</p><p>Validity threats can be mitigated in future work by involving third-party reviewers and measuring the level of agreement between them, and by increasing the number of documents and querying chatGPT with different queries. 
Future work can further develop the analysis presented here along several dimensions:</p><p>• Assess the coverage by the GPT-3 language model of the technical jargon used in requirements;</p><p>• Exploit ChatGPT's ability to rationalise and explain ambiguity;</p><p>• Ask ChatGPT more focused questions, addressing the various classes of ambiguity separately;</p><p>• Extend the analysis to additional documents and evaluate the hypothesis that slicing a requirements document for chatGPT does not influence its results;</p><p>• Measure specifically the performance of chatGPT in finding defects such as incompleteness and inconsistency. We have seen that chatGPT is able to detect these defects, which traditional NLP tools cannot identify, or can identify only with difficulty and after domain-focused training. Positive results in this respect could lead to the use of chatGPT to complement a rule-based tool in automatically checking these important quality criteria.</p></div>			</div>
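			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The chunking suggested under RQ1 and the union of sessions mentioned under RQ2 can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names, the chunk size, the requirement identifiers and the session outcomes are all hypothetical, and no call to chatGPT is made. Only the query string is taken from the text.</p>
<p>
```python
from typing import Iterable, List, Set

# Query wording quoted from the paper; all other names and data are illustrative.
QUERY = "Find the ambiguities of the following software requirements document:"


def split_into_chunks(requirements: List[str], chunk_size: int) -> List[List[str]]:
    """Split the requirements into consecutive chunks of at most chunk_size items."""
    if not chunk_size > 0:
        raise ValueError("chunk_size must be positive")
    return [
        requirements[start:start + chunk_size]
        for start in range(0, len(requirements), chunk_size)
    ]


def build_prompt(chunk: List[str]) -> str:
    """Prepend the query to one chunk, one requirement per line."""
    return QUERY + "\n" + "\n".join(chunk)


def union_of_sessions(sessions: Iterable[Set[str]]) -> Set[str]:
    """Requirement ids flagged as ambiguous in at least one session."""
    flagged: Set[str] = set()
    for session in sessions:
        flagged.update(session)
    return flagged


# Hypothetical document of seven requirements, sent in chunks of three.
requirements = [f"R{i}: placeholder requirement text" for i in range(1, 8)]
prompts = [build_prompt(chunk) for chunk in split_into_chunks(requirements, 3)]

# Hypothetical outcomes of three sessions and of the manual review.
sessions = [{"R1"}, {"R4", "R6"}, {"R2", "R5"}]
truly_ambiguous = {"R1", "R4", "R5"}

flagged = union_of_sessions(sessions)
true_positives = len(flagged.intersection(truly_ambiguous))
union_precision = true_positives / len(flagged)
union_recall = true_positives / len(truly_ambiguous)

print(len(prompts))                     # 3
print(sorted(flagged))                  # ['R1', 'R2', 'R4', 'R5', 'R6']
print(union_precision, union_recall)    # 0.6 1.0
```
</p>
<p>In this made-up example each session alone has a recall of 1/3, while the union reaches 1.0 at a precision of 0.6, which mirrors the observation that repeating the question improves recall.</p></div>
			</div>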
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual Conference on Neural Information Processing Systems 2020</title>
				<imprint>
			<date type="published" when="2020-12-06">Dec. 6-12, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">GPT-3: what&apos;s it good for?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dale</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat. Lang. Eng</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="113" to="118" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A spaCy-based tool for extracting variability from NL requirements</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fantechi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gnesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Livi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Semini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SPLC &apos;21: 25th ACM Int. Systems and Software Product Line Conference</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Mousavi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Schobbens</surname></persName>
		</editor>
		<meeting><address><addrLine>Leicester, UK</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021-09-06">Sept. 6-11, 2021</date>
			<biblScope unit="page" from="32" to="35" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Requirement Engineering of Software Product Lines: Extracting Variability Using NLP</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fantechi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gnesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Semini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">26th IEEE International Requirements Engineering Conference 2018</title>
				<meeting><address><addrLine>Banff, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018-08-20">August 20-24, 2018</date>
			<biblScope unit="page" from="418" to="423" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">VIBE: looking for variability in ambiguous requirements</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fantechi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gnesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Semini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Syst. Softw</title>
		<imprint>
			<biblScope unit="volume">195</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An experience with the application of three NLP tools for the analysis of natural language requirements</title>
		<author>
			<persName><forename type="first">M</forename><surname>Arrabito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fantechi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gnesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Semini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of Quality of Information and Communications Technology - 13th Int. Conference, QUATIC</title>
		<title level="s">Communications in Computer and Information Science</title>
		<meeting>Quality of Information and Communications Technology - 13th Int. Conference, QUATIC</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">1266</biblScope>
			<biblScope unit="page" from="488" to="498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Tiger-Pro</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kasser</surname></persName>
		</author>
		<ptr target="www.therightrequirement.com" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Processing natural language requirements</title>
		<author>
			<persName><forename type="first">V</forename><surname>Ambriola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gervasi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Int. Conference on Automated Software Engineering, ASE</title>
				<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="1997-11-02">Nov. 2-5, 1997</date>
			<biblScope unit="page" from="36" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automating requirement quality standards with QVscribe</title>
		<author>
			<persName><forename type="first">O</forename><surname>Kenney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cooper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NLP4RE&apos;20, co-located with the 26th Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ)</title>
		<title level="s">CEUR Workshop Proc.</title>
		<imprint>
			<publisher>CEUR-WS.org</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2584</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Requirements quality defect detection with the Qualicen requirements scout</title>
		<author>
			<persName><forename type="first">H</forename><surname>Femmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NLP4RE&apos;18, co-located with the 23rd Int. Conf. on Requirements Engineering: Foundation for Software Quality (REFSQ)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">2075</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The design of SREE - a prototype potential ambiguity finder for requirements specifications and lessons learned</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F</forename><surname>Tjong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Berry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Working Conference on Requirements Engineering: Foundation for Software Quality</title>
				<meeting><address><addrLine>Essen, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">7830</biblScope>
			<biblScope unit="page" from="80" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">RAT</title>
		<author>
			<persName><forename type="first">R</forename><surname>Company</surname></persName>
		</author>
		<ptr target="www.reusecompany.com/rat-authoring-tools" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Natural language processing</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Chowdhury</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annu. Rev. Inf. Sci. Technol</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="51" to="89" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">From contract drafting to software specification: Linguistic sources of ambiguity - a handbook version 1.0</title>
		<author>
			<persName><forename type="first">D</forename><surname>Berry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamsties</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krieger</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Ambiguity in requirements engineering: Towards a unifying framework</title>
		<author>
			<persName><forename type="first">V</forename><surname>Gervasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zowghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Spoletini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">From Software Engineering to Formal Methods and Tools, and Back - Essays Dedicated to Stefania Gnesi on the Occasion of Her 65th Birthday</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">11865</biblScope>
			<biblScope unit="page" from="191" to="210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<author>
			<persName><surname>INCOSE</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Guide for Writing Requirements</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">3</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An automatic tool for the analysis of natural language requirements</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gnesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trentanni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Systems: Science &amp; Engineering</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
