<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Comparing general purpose pre-trained Word and Sentence embeddings for Requirements Classification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Federico</forename><surname>Cruciani</surname></persName>
							<email>f.cruciani@ulster.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Ulster University</orgName>
								<address>
									<addrLine>2-24 York Street</addrLine>
									<postCode>BT15 1AP</postCode>
									<settlement>Belfast</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Samuel</forename><surname>Moore</surname></persName>
							<email>s.moore2@ulster.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Ulster University</orgName>
								<address>
									<addrLine>2-24 York Street</addrLine>
									<postCode>BT15 1AP</postCode>
									<settlement>Belfast</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chris</forename><surname>Nugent</surname></persName>
							<email>cd.nugent@ulster.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Ulster University</orgName>
								<address>
									<addrLine>2-24 York Street</addrLine>
									<postCode>BT15 1AP</postCode>
									<settlement>Belfast</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Comparing general purpose pre-trained Word and Sentence embeddings for Requirements Classification</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8F3439D60D8F27F19F0EE6B0D10CDBF4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Requirements Engineering</term>
					<term>NLP</term>
					<term>Large Language Models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recent evolution of NLP has enriched the set of DL-based approaches to include a number of general-purpose Large Language Models (LLMs). While the new models have proven useful for generic text handling, their applicability to domain-specific NLP tasks remains doubtful, particularly because of the limited amount of data available in certain domains, such as Requirements Engineering. In this study, different pre-trained embeddings were tested in three requirements classification tasks, in search of a trade-off between accuracy and computational complexity. The best F1-score results were obtained with BERT (90.36% and 84.23%), with DistilBERT identified as the optimal trade-off (90.28% and 82.61%).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Natural Language Processing (NLP) is an area of Machine Learning (ML) which aims to learn, understand, and generate human language content. More specifically, NLP is a set of techniques capable of representing written text at several levels of linguistic analysis, with the goal of achieving near human-like levels of language processing for a given task or application <ref type="bibr" target="#b0">[1]</ref>. The maturity that Large Language Models (LLMs) have reached in the past five years is having an enormous impact on Deep Learning (DL) based approaches for NLP. While, on the one hand, these models have made it possible to address previously unattainable NLP tasks, the performance of such general-purpose models in domain-specific contexts still poses some major challenges <ref type="bibr" target="#b0">[1]</ref>. In particular, when applying NLP to domain-specific tasks, the amount of available text is usually extremely limited, and the semantic representation of words used in a different context might be misleading <ref type="bibr" target="#b0">[1]</ref>. Consequently, the research community has looked at fine-tuning pre-trained LLMs <ref type="bibr" target="#b1">[2]</ref>. While fine-tuning is a valid method, a limited amount of data hinders this approach.</p><p>Requirements Engineering (RE) is one such area where NLP approaches can help to improve processes. Requirements, within software development, are largely expressed in natural language <ref type="bibr" target="#b0">[1]</ref>. The correct and accurate statement of requirements is essential for the development of high-quality software that meets the expectations of customers and end-users. Given the importance of this part of the software development lifecycle, it is necessary to ensure that requirements are stated clearly, adhere to quality criteria, are appropriately classified, and are free from errors. 
While it is possible, and often the norm, to carry out requirements engineering processes manually, automated NLP approaches stand to offer significant improvements to the process <ref type="bibr" target="#b0">[1]</ref>. Within requirements engineering, there are several areas where NLP approaches may be employed, including: elicitation, quality analysis, error detection, category classification, and traceability <ref type="bibr" target="#b0">[1]</ref>. Although LLMs offer a general-purpose approach to language modelling, they were not trained on the task of classifying requirements. In order to classify requirements into a given set of categories, a model must be trained to detect these categories, which is not typically the case for LLMs. As such, in order to develop an NLP solution for requirements classification, it is necessary to expose a model to a range of requirements and their associated categories during training, thereby requiring the development of a task-specific language model. This can be achieved either by fine-tuning the LLM on the specific task of requirements classification, as done in <ref type="bibr" target="#b1">[2]</ref>, or by using the LLM to provide a semantic representation of the requirement and combining it with more traditional classifiers <ref type="bibr" target="#b2">[3]</ref>. The first solution is more resource-intensive, while the second is more efficient. To our knowledge, the literature does not provide a systematic comparison between different LLMs in this second scenario.</p><p>This paper seeks to assess the effectiveness of LLMs in accurately classifying requirements into their respective categories. 
In doing so, this paper considered pre-trained language models and evaluated their ability to create semantic representations from requirements specifications.</p><p>The contribution of this work can be summarized as follows:</p><p>• a comparison of the semantic representational power of available pre-trained LLMs with application to RE; • an exploratory study optimizing the trade-off between computational resources and accuracy.</p><p>The remainder of this paper is organized as follows. Section 2 summarizes the state of the art and related work in NLP tasks for RE. Section 3 describes the experiment design, the research questions, and the evaluation methodology. Results and discussion are reported in Sections 4 and 5, respectively. Finally, conclusions are drawn in Section 6.</p></div>
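The second scenario mentioned above, a frozen pre-trained embedding providing the semantic representation with a small separately trained classifier on top, can be sketched in a few lines. This is a minimal stand-in, not the study's implementation: the 4-dimensional "pre-trained" word vectors and the nearest-centroid classifier below are toy illustrations; in the study the vectors come from models such as GloVe or BERT and the classifier is an MLP.

```python
# Toy stand-in for a pre-trained word-embedding table (frozen, never retrained).
TOY_VECTORS = {
    "system":  [0.9, 0.1, 0.0, 0.2],
    "shall":   [0.8, 0.2, 0.1, 0.1],
    "respond": [0.1, 0.9, 0.3, 0.0],
    "seconds": [0.0, 0.8, 0.4, 0.1],
    "display": [0.7, 0.0, 0.1, 0.6],
    "report":  [0.6, 0.1, 0.0, 0.7],
}

def embed(sentence):
    """Mean-pool word vectors into a fixed-size sentence representation."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0] * 4
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def train_centroids(examples):
    """Train a classifier on top of the frozen embeddings (here: class centroids)."""
    by_class = {}
    for text, label in examples:
        by_class.setdefault(label, []).append(embed(text))
    return {lbl: [sum(c) / len(vs) for c in zip(*vs)] for lbl, vs in by_class.items()}

def classify(centroids, text):
    """Assign the class whose centroid is closest to the sentence embedding."""
    v = embed(text)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Two labelled requirements: functional (F) and performance (PE).
train = [
    ("The system shall display the report", "F"),
    ("The system shall respond within seconds", "PE"),
]
model = train_centroids(train)
```

Only the classifier on top is trained; the embedding table itself stays fixed, which is what makes the approach cheap compared with fine-tuning.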
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Among NLP tasks in RE, requirement classification is one of the most common <ref type="bibr" target="#b0">[1]</ref>. In most cases, classification is applied to the binary case, discriminating between functional (F) and non-functional (NF) requirements <ref type="bibr" target="#b3">[4]</ref>. Other studies have also considered specific classes of NF requirements (e.g., usability, security) <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. In the earliest examples of requirements classification <ref type="bibr" target="#b4">[5]</ref>, sets of keywords obtained from manually labeled requirements were used to classify unseen data. More recently, studies have started to explore the use of ML and DL approaches. In <ref type="bibr" target="#b5">[6]</ref>, BERT <ref type="bibr" target="#b7">[8]</ref> was used in combination with a Graph Attention Network (GAT) and a Multilayer Perceptron (MLP) classifier. The method was compared with other ML approaches (including Naive Bayes and Random Forest (RF)) in two classification tasks: (i) the binary case of F vs NF requirements, and (ii) detecting four types of NF requirements. For an in-depth literature review on ML for requirement classification, readers can refer to <ref type="bibr" target="#b0">[1]</ref>, which provides a holistic overview of the progress of NLP for performing RE tasks. 
In contrast, this paper focuses on identifying the optimal representational power and analysing the computing resources required to achieve the task effectively.</p><p>Similar to <ref type="bibr" target="#b1">[2]</ref>, in this work we evaluated the use of BERT <ref type="bibr" target="#b7">[8]</ref> for requirement classification, extending the comparison to include other embeddings, such as GloVe <ref type="bibr" target="#b8">[9]</ref>, DistilBERT <ref type="bibr" target="#b9">[10]</ref>, SBERT <ref type="bibr" target="#b10">[11]</ref> and the Universal Sentence Encoder (USE) <ref type="bibr" target="#b11">[12]</ref>. It should be noted that, unlike other studies such as <ref type="bibr" target="#b1">[2]</ref>, we did not retrain or fine-tune the models used for the embeddings, but simply aimed at comparing them on a set of classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The experiment</head><p>The experiment compared different pre-trained models for word and sentence embeddings to verify their suitability for requirements classification. Evaluation was conducted on the PROMISE-NFR dataset <ref type="bibr" target="#b12">[13]</ref>. The dataset includes a set of 625 functional and non-functional requirements. The non-functional requirements set includes 11 different subclasses. Table <ref type="table" target="#tab_0">1</ref> summarizes all 12 classes and the number of requirements, sentences, and words available per class. Despite being fairly balanced for the binary F/NF case, the dataset presents a great challenge in terms of class imbalance when all 12 classes are included (some consisting of only 1-20 requirements).</p><p>The word embeddings used in the experiments were GloVe <ref type="bibr" target="#b8">[9]</ref>, BERT <ref type="bibr" target="#b7">[8]</ref>, and DistilBERT <ref type="bibr" target="#b9">[10]</ref>, with dimensions of 300, 768, and 768, respectively. The sentence embeddings were SBERT <ref type="bibr" target="#b10">[11]</ref> and the Universal Sentence Encoder (USE) <ref type="bibr" target="#b11">[12]</ref>, with sizes 384<ref type="foot" target="#foot_0">1</ref> and 512, respectively.</p><p>As illustrated in Fig. <ref type="figure">1</ref>, the aim was to train a small classifier (&lt;10M parameters) on a domain-specific context with a limited amount of data, relying on pre-trained language models for the semantic representation of words/sentences. Task 1 compares these general-purpose embeddings in a setting without extreme imbalance. Tasks 2 and 3 evaluate the impact of class imbalance in the more complex cases of distinguishing 6 and 12 classes. The 6 classes of Task 2 were chosen as the most frequent classes (comprising at least 50 sentences and 500 words), as indicated in Table <ref type="table" target="#tab_0">1</ref>. 
The experiment aimed at answering the following research questions: RQ 1 Which embedding provides better accuracy in requirement classification? RQ 2 Which embedding provides the best trade-off between accuracy and model complexity?</p></div>
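A quick sanity check of the small-classifier budget mentioned above (under 10M trainable parameters). The embedding dimensions (300, 768, 384, 512) are those reported in the text; the hidden-layer sizes below are hypothetical placeholders, since the exact MLP structures are not restated here.

```python
def mlp_params(layer_sizes):
    """Trainable parameters (weights + biases) of a fully connected MLP,
    given its layer sizes, e.g. [768, 256, 64, 6] for a 6-class task."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Input size varies with the embedding; hidden sizes 256/64 are assumptions.
budgets = {
    "GloVe": mlp_params([300, 256, 64, 6]),
    "BERT":  mlp_params([768, 256, 64, 6]),
    "SBERT": mlp_params([384, 256, 64, 6]),
    "USE":   mlp_params([512, 256, 64, 6]),
}

# Even with the largest (768-dim) input, such a classifier is far below
# the 10M-parameter budget stated in the experiment description.
assert all(10_000_000 > p for p in budgets.values())
```

The check makes the point of the setup concrete: the trainable head is tiny, so nearly all the representational power must come from the frozen pre-trained embedding.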
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Evaluation Methodology</head><p>The dataset was split into training and test sets (70% and 30%, respectively), with the training set further split into training and validation sets (10%). The splits were stratified, i.e., preserving the class proportions, while preventing sentences from the same requirement from being separated. The evaluation was done using a 5-fold procedure comparing the embeddings with the two MLP structures on the three tasks. Models were trained with early stopping, using a patience of 25 epochs and saving only the model with the highest accuracy on the validation set. The data imbalance was handled using a weighted loss. Finally, a k-Nearest Neighbors (kNN) classifier was used as a baseline. Since kNN makes direct use of the embeddings for classification, and because of its non-parametric nature, it is a suitable approach to measure how well the embeddings obtained from pre-trained models can be used directly to classify requirements.</p></div>
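The role of the kNN baseline can be sketched as follows: because kNN classifies a new requirement directly from the similarity between its embedding and the labelled training embeddings, with no trained parameters, it measures the raw representational power of each pre-trained model. The toy vectors, the cosine metric, and k=3 below are illustrative assumptions, not the study's exact configuration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_predict(train, query, k=3):
    """Majority vote among the k training embeddings most similar to the query."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], query), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy 2-D "embeddings": F requirements cluster near one axis, NF near the other.
train = [
    ([1.0, 0.1], "F"),  ([0.9, 0.2], "F"),  ([0.8, 0.0], "F"),
    ([0.1, 1.0], "NF"), ([0.0, 0.9], "NF"), ([0.2, 0.8], "NF"),
]
```

If the pre-trained embedding separates the classes well, even this parameter-free classifier performs well, which is exactly what makes it a fair baseline for the trained MLPs.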
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>In the experiment, results covering all combinations of the large and small MLP classifiers were calculated on the three tasks. For the sake of conciseness, only some combinations are reported. Additional results, including confusion matrices, are available in the published repository. Table <ref type="table" target="#tab_1">2</ref> reports results obtained using the large MLP structure on Task 1 and the small MLP architecture on Task 2. Additional combinations were tested considering the cased and uncased versions of the pre-trained BERT and DistilBERT models. Table <ref type="table" target="#tab_2">3</ref> reports results obtained with the best performing model on Task 2, the small MLP architecture using the uncased version of BERT. Fig. <ref type="figure" target="#fig_2">2</ref> illustrates the confusion matrix and the normalized confusion matrix obtained in Task 2 using BERT uncased and the small MLP.</p><p>Table <ref type="table" target="#tab_3">4</ref> summarizes the results obtained in Task 3, including the baseline results using kNN as a classifier. Table <ref type="table" target="#tab_4">5</ref> reports precision, recall, and F1-score values for all classes obtained with the best performing combination on Task 3. Fig. <ref type="figure" target="#fig_3">3</ref> illustrates the normalized confusion matrices obtained with the different embeddings on Task 3. Finally, Fig. <ref type="figure" target="#fig_4">4</ref> summarizes the macro-average F1 score obtained with all the embeddings on Task 2 and Task 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>Results on Task 1 highlight BERT and DistilBERT as the best performing embeddings (RQ1); however, the results obtained with the other embeddings are comparable and might be considered for resource-constrained cases. In particular, SBERT is the fastest model (after GloVe), generating vectors of 384 dimensions, which also reduces the complexity of the final classifier.</p><p>Similarly, GloVe embeddings are obtained by a simple lookup in a dictionary data structure and are 300-dimensional. GloVe, however, is exposed to out-of-dictionary (OOD) words, which limits its applicability when such words are present. Likewise, in Tasks 2 and 3, BERT and DistilBERT were the best performing models, with DistilBERT being a good candidate to reduce the computational overhead of BERT without detrimental effects on the accuracy performance (RQ2). No major differences were observed between the small and the large MLP classifiers, possibly due to the limited size of the dataset, which does not make it possible to maximize the benefit of a classifier with a higher number of trainable parameters. The lack of data is further exacerbated in the case of sentence embeddings, with the MLP trained on fewer data points. The comparison with the baseline highlighted how, despite the limited amount of data, training an MLP classifier outperforms the baseline kNN approach of using the embeddings directly to classify new data. The worst baseline results were obtained with GloVe, possibly attributable to out-of-dictionary words. MLP classifiers trained on GloVe vectors, however, appear to reduce the gap, leading to results comparable with SBERT and USE.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Limitations</head><p>Construct Validity Standard evaluation metrics were used as macro-averages to prevent majority classes from masking less represented ones. All mandatory steps of the ECSER pipeline for evaluating classifiers were performed <ref type="bibr" target="#b13">[14]</ref>. Optional steps, e.g., significance tests, were not performed due to the preliminary nature of this work.</p><p>Internal Validity One major factor affecting internal validity is the correctness of the annotation of the dataset, which authors have questioned in the past. Nevertheless, it still represents the most widely used dataset for requirements classification, facilitating comparison with previous work. Since this type of ML task typically includes a high degree of randomness, 5-fold cross-validation was used to calculate the results.</p><p>External Validity The dataset includes requirements written by students, which may not be representative of industrial requirements, and the evaluation of the language models is limited to the three examined tasks. Different results may be obtained when other classification schemes are used, or when other types of requirements-related information (e.g., user stories or app reviews) are adopted. Concerning the coverage of possible pre-trained embeddings, we considered a representative set of basic and deep learning-based ones, not limited to those derived from BERT. Therefore, we argue that our analysis can be considered representative of the usage of different embeddings for requirements classification. As for the classification algorithm, we used two MLP structures and a kNN as a baseline. Different results may be obtained when using other classifiers (e.g., SVM, Naive Bayes).</p></div>
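The macro-averaging choice discussed above can be made concrete with a small example: each class contributes equally to a macro-average, so a poorly classified rare class is not masked by a well-classified majority class, whereas a weighted (support-proportional) average is dominated by the majority class. The per-class F1 values below are invented for illustration; the class sizes mirror the Functional (255) and Portability (1) requirement counts of Table 1.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores: every class counts equally."""
    return sum(per_class_f1.values()) / len(per_class_f1)

def weighted_f1(per_class_f1, support):
    """Support-weighted mean of per-class F1 scores."""
    total = sum(support.values())
    return sum(f1 * support[c] for c, f1 in per_class_f1.items()) / total

f1 = {"F": 0.90, "PO": 0.10}       # majority class good, rare class poor (toy values)
support = {"F": 255, "PO": 1}      # requirement counts as in Table 1

# macro_f1(f1) averages 0.90 and 0.10 equally, exposing the weak rare class;
# weighted_f1(f1, support) is pulled almost entirely toward the majority class.
```

This is why the construct-validity discussion relies on macro-averages when comparing embeddings on the imbalanced 12-class task.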
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This paper reported on the evaluation of pre-trained embeddings for RE. Some of the most common embeddings were tested under the same conditions on a public dataset. The results identify BERT and its smaller variant DistilBERT as the best performing embeddings, with the latter offering an optimal trade-off between accuracy and model complexity. GloVe and SBERT, despite a slightly lower accuracy, were found to be the fastest at prediction time and could be suitable for cases in which time is a key factor or for resource-constrained environments. Future work will aim to extend the evaluation to additional datasets to verify the validity of these results on different RE tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :Task 1</head><label>11</label><figDesc>Figure 1: In the experiment, different pre-trained models were used for extracting word/sentence embeddings without fine-tuning. The obtained embeddings were used to train an ad-hoc classifier model.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Confusion matrices obtained with BERT uncased (a) and the normalized version (b).</figDesc><graphic coords="6,149.30,89.17,145.84,127.61" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Confusion matrices obtained with BERT (a), and DistilBERT (b) on task 3.</figDesc><graphic coords="6,149.30,426.70,145.84,127.61" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Macro F1 for all embeddings including the baseline kNN and the large MLP on Task 2 (a) and Task 3 (b).</figDesc><graphic coords="7,128.47,306.33,166.68,125.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 PROMISE</head><label>1</label><figDesc></figDesc><table><row><cell>-NFR Dataset</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Class</cell><cell cols="3">Requirements #Sentences #Words *</cell></row><row><cell>Functional (F) +</cell><cell>255</cell><cell>272</cell><cell>4996</cell></row><row><cell>Availability (A)</cell><cell>21</cell><cell>29</cell><cell>432</cell></row><row><cell>Fault Tolerance (FT)</cell><cell>10</cell><cell>11</cell><cell>176</cell></row><row><cell>Legal (L)</cell><cell>13</cell><cell>15</cell><cell>215</cell></row><row><cell>Look &amp; Feel (LF) +</cell><cell>38</cell><cell>51</cell><cell>749</cell></row><row><cell>Maintainability (M)</cell><cell>17</cell><cell>25</cell><cell>476</cell></row><row><cell>Operational (O) +</cell><cell>62</cell><cell>89</cell><cell>1231</cell></row><row><cell>Performance (PE) +</cell><cell>54</cell><cell>69</cell><cell>1207</cell></row><row><cell>Portability (PO)</cell><cell>1</cell><cell>2</cell><cell>25</cell></row><row><cell>Scalability (SC)</cell><cell>21</cell><cell>27</cell><cell>402</cell></row><row><cell>Security (SE) +</cell><cell>66</cell><cell>86</cell><cell>1265</cell></row><row><cell>Usability (U) +</cell><cell>67</cell><cell>96</cell><cell>1508</cell></row><row><cell>Total</cell><cell>625</cell><cell>772</cell><cell>12682</cell></row></table><note>* approximate number of words, + Used as Most frequent class</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Classification Report (Task 1 &amp; 2) using the large and small MLP respectively.</figDesc><table><row><cell></cell><cell>Precision *</cell><cell>Recall *</cell><cell>F1-score *</cell><cell>F1-Score +</cell></row><row><cell></cell><cell cols="4">Task 1 Task 2 Task 1 Task 2 Task 1 Task 2 Task 1 Task 2</cell></row><row><cell>Glove</cell><cell cols="4">0.9110 0.7552 0.8520 0.7351 0.8709 0.7441 0.8840 0.7793</cell></row><row><cell>BERT</cell><cell cols="4">0.9010 0.8373 0.8900 0.7743 0.8949 0.7994 0.9036 0.8423</cell></row><row><cell>DistilBERT</cell><cell cols="4">0.9076 0.8043 0.8836 0.7645 0.8934 0.7807 0.9028 0.8261</cell></row><row><cell>SBERT</cell><cell cols="4">0.8709 0.7455 0.8797 0.7682 0.8748 0.7549 0.8837 0.7833</cell></row><row><cell>USE</cell><cell cols="4">0.8606 0.7484 0.8841 0.7666 0.8669 0.7550 0.8743 0.7948</cell></row></table><note>* Macro-average, + Weighted</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results with the small MLP architecture trained using BERT (uncased) embeddings (Task 2).</figDesc><table><row><cell>Class</cell><cell cols="3">Precision Recall F1-Score</cell></row><row><cell>Functional (F)</cell><cell>0.8490</cell><cell>0.9449</cell><cell>0.8944</cell></row><row><cell>Look &amp; Feel (LF)</cell><cell>0.7273</cell><cell>0.5818</cell><cell>0.6465</cell></row><row><cell>Operational (O)</cell><cell>0.8500</cell><cell>0.8500</cell><cell>0.8500</cell></row><row><cell>Performance (PE)</cell><cell>0.9138</cell><cell>0.6543</cell><cell>0.7626</cell></row><row><cell>Security (SE)</cell><cell>0.8318</cell><cell>0.8318</cell><cell>0.8318</cell></row><row><cell>Usability (US)</cell><cell>0.8525</cell><cell>0.8062</cell><cell>0.8287</cell></row><row><cell>macro avg</cell><cell>0.8374</cell><cell>0.7782</cell><cell>0.8023</cell></row><row><cell>weighted avg</cell><cell>0.8456</cell><cell>0.8454</cell><cell>0.8416</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Classification Report all classes (Task 3 using Large MLP)</figDesc><table><row><cell cols="2">Precision *</cell><cell cols="2">Recall *</cell><cell cols="2">F1-score *</cell><cell cols="2">F1-Score +</cell></row><row><cell>MLP</cell><cell>kNN</cell><cell>MLP</cell><cell>kNN</cell><cell>MLP</cell><cell>kNN</cell><cell>MLP</cell><cell>kNN</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Results with the MLP architecture trained using BERT uncased embeddings.</figDesc><table><row><cell>Class</cell><cell cols="3">Precision Recall F1-Score</cell></row><row><cell>Availability (A)</cell><cell>0.6486</cell><cell>0.6857</cell><cell>0.6667</cell></row><row><cell>Functional (F)</cell><cell>0.7914</cell><cell>0.9397</cell><cell>0.8592</cell></row><row><cell>Fault Tolerance (FT)</cell><cell>0.8000</cell><cell>0.1739</cell><cell>0.2857</cell></row><row><cell>Legal (L)</cell><cell>0.8000</cell><cell>0.2353</cell><cell>0.3636</cell></row><row><cell>Look &amp; Feel (LF)</cell><cell>0.6852</cell><cell>0.5873</cell><cell>0.6325</cell></row><row><cell>Maintainability (MN)</cell><cell>0.4800</cell><cell>0.4286</cell><cell>0.4528</cell></row><row><cell>Operational (O)</cell><cell>0.7363</cell><cell>0.6505</cell><cell>0.6907</cell></row><row><cell>Performance (PE)</cell><cell>0.8404</cell><cell>0.7980</cell><cell>0.8187</cell></row><row><cell>Scalability (SC)</cell><cell>0.7143</cell><cell>0.6944</cell><cell>0.7042</cell></row><row><cell>Security (SE)</cell><cell>0.6947</cell><cell>0.8148</cell><cell>0.7500</cell></row><row><cell>Usability (US)</cell><cell>0.7840</cell><cell>0.7000</cell><cell>0.7396</cell></row><row><cell>macro avg</cell><cell>0.7250</cell><cell>0.6098</cell><cell>0.6331</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">SBERT was used with the all-MiniLM-L6-v2 model. See https://www.sbert.net/docs/pretrained_models.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The complete source code is available at: https://github.com/fcruciani/reqclass</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">PricewaterhouseCoopers LLP, a limited liability partnership incorporated in England with its registered office at 1 Embankment Place, London WC2N 6RH</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research is supported by the ARC (Advanced Research Engineering Centre) project, funded by PwC <ref type="bibr" target="#b2">3</ref> and Invest Northern Ireland.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Natural language processing for requirements engineering: A systematic mapping study</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Alhoshan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Letsholo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Ajagbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E.-V</forename><surname>Chioasca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Batista-Navarro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Norbert: Transfer learning for requirements classification</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Keim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koziolek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Tichy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 28th International Requirements Engineering Conference (RE), IEEE</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="169" to="179" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">To tune or not to tune? adapting pretrained representations to diverse tasks</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</title>
				<meeting>the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="7" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Nonfunctional requirements as qualities, with a spice of ontology</title>
		<author>
			<persName><forename type="first">F.-L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Horkoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mylopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Guizzardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Guizzardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Borgida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 22nd International Requirements Engineering Conference (RE)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="293" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automated classification of non-functional requirements</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cleland-Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Settimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Solc</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Requirements Engineering</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="103" to="120" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic requirements classification based on graph attention network</title>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="30080" to="30090" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An end-to-end deep learning system for requirements classification using recurrent neural networks</title>
		<author>
			<persName><forename type="first">O</forename><surname>AlDhafer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mahmood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">147</biblScope>
			<biblScope unit="page">106877</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</title>
				<meeting>the 2014 conference on empirical methods in natural language processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.01108</idno>
		<title level="m">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.10084</idno>
		<title level="m">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Limtiaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>John</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guajardo-Cespedes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.11175</idno>
		<title level="m">Universal sentence encoder</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Cleland-Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mazrouee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Port</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.268542</idno>
		<ptr target="https://doi.org/10.5281/zenodo.268542" />
		<title level="m">Promise-nfr dataset</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Evaluating classifiers in SE research: the ECSER pipeline and two replication studies</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dell'Anna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">B</forename><surname>Aydemir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Empirical Software Engineering</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="1" to="40" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
