Automated Requirements Demarcation using Large Language Models: An Empirical Study

Kaishuo Wang, Feier Zhang and Mehrdad Sabetzadeh
School of Electrical Engineering and Computer Science (EECS), University of Ottawa, 800 King Edward Ave, Ottawa, ON, K1N 6N5, Canada
kwang126@uottawa.ca (K. Wang); fzhan081@uottawa.ca (F. Zhang); msabetza@uottawa.ca (M. Sabetzadeh)

Abstract
Requirements demarcation is concerned with the identification of software and systems requirements in technical natural-language specifications. Manually distinguishing requirements from non-requirements is a labour-intensive task and is prone to omissions and errors. Building a reliable and accurate automated method for requirements demarcation is therefore important. This research evaluates auto-encoder large language models [1], auto-regressive large language models [2], few-shot learning, and ensembling methods for accurate demarcation of requirements. The specific architectures considered are DeBERTa, Llama2, and few-shot learning with RoBERTa. Our work empirically compares these approaches and determines which one yields the best accuracy. Our experimental results show that DeBERTa yields the best performance. Llama2 requires more computational resources and training time than DeBERTa, while also trailing it by approximately 1% across the different metrics. This result suggests that auto-encoder models may be more suitable for requirements demarcation than auto-regressive models. Further, we observe that the few-shot learning approach has the worst performance among the alternatives considered. Finally, we find that ensembling leads to only minor performance improvements compared to a single model. We make all the artifacts developed as part of this research available online: https://github.com/KaishuoWang/Automated-Requirements-Classification.

Keywords
requirements demarcation, natural language processing, ensemble learning, transformers, empirical evaluation

1. Introduction

Distinguishing requirements from non-requirements in technical documents is a difficult and labour-intensive task [3]. Manually classifying text across thousands of pages to identify requirements is also prone to omissions and errors. Automating requirements demarcation, defined as the task of distinguishing requirements from non-requirement sentences [4], is thus important for streamlining downstream analysis, traceability, risk identification, and testing [3]. Table 1 shows examples of sentences classified as requirements and non-requirements.

With the recent emergence of large language models, an interesting question arises as to whether one can improve upon the accuracy of existing natural language processing techniques for requirements demarcation.
Table 1
Examples of Requirements and Non-requirements

Text | Label
Upon installation, a DigitalHome user account shall be established. | Requirement
NPAC SMS shall provide post-collection audit analysis tools that can produce detailed reports on data items relating to system intrusions. | Requirement
NANC Version 1.6, released on 11/12/97, contains changes from the NANC FRS Version 1.5. | Non-requirement
User selects option to associate file types with editors (ALT 1). | Non-requirement

This paper reports on an empirical examination comparing several alternative technologies. We evaluate two state-of-the-art models, DeBERTa and Llama2, as well as few-shot learning with RoBERTa. We also examine an ensemble approach combining these methods to study whether demarcation accuracy over complex documents can be further improved. Based on our experimental results, we observe that DeBERTa slightly outperforms our replication of Bashir et al.'s approach [3]. The accuracy results reported by Bashir et al. are nonetheless slightly higher than those of our replication, most likely due to slight differences in hyperparameter optimization; consequently, our accuracy results for DeBERTa are slightly below those reported by Bashir et al. In view of this finding, we hypothesize that DeBERTa will likely outperform Bashir et al.'s approach if its hyperparameters are optimized according to the process used by Bashir et al., details of which we did not have and could not exactly replicate. As for Llama2, we observe that the model has lower performance than DeBERTa, which could be attributed to the small size of the training data, the lack of context, and the absence of a training data selection strategy. We further observe that few-shot learning has a notable performance deficit compared to both DeBERTa and Llama2. In addition, we find that simply integrating the predictions of each approach using their normalized F1 scores does not lead to a significant improvement in performance. More complex ensemble approaches should thus be considered in the future.

2. Background and Related Work

Automating the demarcation of requirements has seen considerable interest in the field of software engineering. This section presents a brief overview of requirements demarcation methods alongside their enabling techniques, highlighting the evolution from traditional machine learning approaches to the latest advancements in deep learning and transformer-based models.

Early requirements demarcation methods, e.g., the work by Abualhaija et al. [4], use traditional machine learning algorithms such as Support Vector Machines (SVM), Logistic Regression (LR), and Naive Bayes (NB). These methods require careful feature engineering and show limitations in handling the linguistic nuances and complexities inherent in requirements text.

Long Short-Term Memory (LSTM) [5] networks revolutionized deep learning for sequential data. LSTMs have gating mechanisms that allow them to retain or forget information, reducing issues like vanishing gradients. This enables models to process longer sequences while capturing context. Newer word embeddings like FastText (FT) [6] and Global Vectors (GloVe) [7] have been used as well, with FT aggregating subword n-grams for better morphology understanding and GloVe using co-occurrence statistics to capture semantic relationships. Overall, LSTMs allow sequential modelling of longer texts, while word embeddings like FT and GloVe enable more semantic understanding. In addition, Winkler et al. [8] proposed a method that utilizes convolutional neural networks for automated requirements classification. Their method achieved 0.73 in accuracy and 0.89 in recall on a real-world automotive requirements specification.

The emergence of Transformer-based models marks an important milestone in NLP. Transformers like BERT [1] have brought about major advances through self-attention and bidirectional context modelling. Rather than processing text linearly, self-attention weighs the significance of each word relative to all others, learning context more effectively. BERT specifically introduces bidirectionality, understanding context based on the surrounding text on both sides of a word. This enables capturing nuances and relationships that were previously difficult to model. BERT uses WordPiece tokenization [9] to break words into subword units, allowing it to handle unseen words by representing them as known subwords. Therefore, BERT and its variants generally perform well at understanding context and complex dependencies, which are key attributes for accurately demarcating the content of complex requirements documents.

Bashir et al. [3] propose few-shot learning with sentence transformers [10], utilizing the SetFit framework [11] to fine-tune various pre-trained Sentence Transformer models for the task of demarcating requirements. This process involves a dual-step training approach: initially, the Sentence Transformer model is fine-tuned on a limited dataset using a contrastive training method; subsequently, a Logistic Regression model is trained to act as a classification head on the embeddings generated by the fine-tuned Sentence Transformer. The evaluation of this model entails generating sentence embeddings for unseen examples and then predicting the class label with the Logistic Regression model. In Bashir et al.'s work, the bert-base-uncased pipeline obtained the best performance, with a macro-average F1 score of 0.83 on the Dronology dataset [12]. This pipeline is used as our baseline for comparison with our approaches.
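To make this baseline concrete, the sketch below outlines the dual-step SetFit procedure using the setfit library; the checkpoint, example sentences, and hyperparameters are illustrative placeholders rather than the exact configuration of Bashir et al., and the older SetFitTrainer API is assumed.

```python
# Minimal sketch of the two-step SetFit baseline: contrastive fine-tuning of a
# sentence-transformer body, followed by fitting a logistic-regression head.
# Checkpoint, sample sentences, and hyperparameters are illustrative only.
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer  # older setfit API; newer versions expose Trainer

train_ds = Dataset.from_dict({
    "text": [
        "Upon installation, a DigitalHome user account shall be established.",
        "NANC Version 1.6 contains changes from the NANC FRS Version 1.5.",
    ],
    "label": [1, 0],  # 1 = requirement, 0 = non-requirement
})

model = SetFitModel.from_pretrained("bert-base-uncased")  # sentence-transformer body + LR head

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,  # step 1: contrastive fine-tuning of the body
    num_iterations=20,                # number of contrastive pairs generated per example
    batch_size=16,
)
trainer.train()                       # step 2 (fitting the LR head) happens inside train()

print(model.predict(["The system shall log all intrusion attempts."]))
```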
3. Alternative Approaches

Figure 1 provides an overview of the alternative models we examine in this paper. Recognizing the potential for synergy, we consider two ensemble systems among the alternatives. The first ensemble system integrates the strengths of DeBERTa and Llama2, while the second leverages the strengths of all three models. This approach not only allows us to compare the three methods separately but also lets us investigate potential enhancements brought about by ensembling.

[Figure 1: Techniques Explored in This Paper for Requirements Demarcation.]

3.1. DeBERTa

DeBERTa [13] improves upon the BERT architecture in several key ways. DeBERTa's defining feature is its disentangled attention mechanism. Unlike BERT, which processes content and position information jointly within a single self-attention mechanism, DeBERTa separates the two. This allows the model to learn the representations of word positions and content more effectively, giving it a more refined understanding of the meaning that word order and position contribute to a sentence. Such a capability is especially critical in requirements demarcation tasks, where the position of terms can dramatically change their meaning and, in turn, the classification. In addition, DeBERTa introduces an Enhanced Mask Decoder (EMD). BERT adds absolute position encodings directly in the input layer. In contrast, DeBERTa captures only relative position information within the Transformer layers and applies absolute position information later, right before predicting masked tokens. As a result, the EMD is better at predicting the original tokens at masked positions because it uses contextual information more efficiently. Given these architectural improvements and DeBERTa's demonstrated superior performance over BERT on various natural language processing benchmarks, including text classification, it is interesting to examine DeBERTa as a potentially more effective alternative to BERT.
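As a minimal illustration of how DeBERTa can be applied to requirements demarcation as binary sequence classification, the sketch below adds a two-label classification head via the HuggingFace Transformers library; the checkpoint name and training arguments are assumptions for illustration, not the exact configuration used in our experiments (described in Section 4).

```python
# Minimal sketch: fine-tuning DeBERTa for binary requirements demarcation.
# Checkpoint and hyperparameters are illustrative, not our exact configuration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/deberta-v3-large"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # adds a randomly initialized classification head

ds = Dataset.from_dict({
    "text": ["NPAC SMS shall provide post-collection audit analysis tools.",
             "User selects option to associate file types with editors (ALT 1)."],
    "label": [1, 0],  # 1 = requirement, 0 = non-requirement
})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="deberta-demarcation", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=ds,  # head and backbone weights are updated via backpropagation
)
trainer.train()
```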
3.2. Llama2

Llama2 [2] is a transformer-based auto-regressive large language model. This type of model is often applied to text generation tasks, such as machine translation, generative question-answering systems, and virtual assistants. In this project, we aim to examine the performance of this model on a text classification task (i.e., requirements demarcation) and compare it with BERT and other machine learning-based classification methods. Llama2 has a substantially larger number of parameters than BERT variants. The version we use in this project contains 7 billion parameters, compared to 345 million and 110 million parameters in DeBERTa-large and BERT-base, respectively. With more parameters, Llama2 is better at generalizing knowledge learned from large amounts of training data to new tasks. More parameters may also give the model a deeper understanding of context and greater robustness to the ambiguities common in natural language. In addition, Llama2 employs multi-task learning, jointly training on different natural language understanding tasks. In other words, the Llama2 model is not only trained on a large text corpus but is also fine-tuned for multiple natural language understanding tasks. This enables the model to learn from diverse and rich data sources and improves its generalization capabilities. These characteristics present an opportunity for a potentially more accurate and streamlined approach to classification tasks.

3.3. Few-shot Learning

Few-shot learning is a machine learning technique that has emerged in recent years and enables a model to learn from a small number of samples. Compared to fully supervised machine learning, few-shot learning requires only a small number of data points to learn the information in the data and generalize it to new tasks, making it useful when the amount of labelled data is small. Since engineering specification documents are usually confidential information within a company or organization, we hypothesize that applying few-shot learning to the task of requirements demarcation can alleviate the problem of small data volumes. One recent development in the field of few-shot learning is SetFit [11], an efficient and prompt-free framework for the few-shot fine-tuning of Sentence Transformer models. Compared to other few-shot learning methods, SetFit requires no prompts at all and can generate rich embeddings directly from text examples. In addition, SetFit does not require large models like GPT or Llama to achieve high accuracy, which means that it requires fewer computational resources and less training time.

3.4. Ensemble System

Ensemble learning is a machine learning technique that combines the predictions of two or more models. Compared to using single models, the ensemble technique exploits the predictive power of the individual models to achieve more accurate predictions and better performance. Moreover, the ensemble technique improves robustness by reducing the spread, or dispersion, of the predictions and of model performance. To ensemble the predictions of DeBERTa, Llama2, and few-shot learning, we use the normalized macro F1 score of each model as its weight and multiply the weights with the corresponding predictions to obtain the final prediction P:

P = w_1 × P_1 + w_2 × P_2 + w_3 × P_3    (1)

where w_1, w_2, and w_3 are the weights of each model and P_1, P_2, and P_3 are the predictions of each model.
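A minimal sketch of this weighting scheme is given below, assuming that each model outputs per-class probabilities; the probability values are placeholders, and the weights are derived from illustrative macro F1 scores (here taken from Table 3) purely for demonstration.

```python
# Minimal sketch of Equation (1): weighted soft voting over three models'
# per-class probabilities, with weights given by normalized macro F1 scores.
# The probability arrays below are illustrative placeholders.
import numpy as np

# Per-class probabilities for two sentences: columns = [non-requirement, requirement]
p_deberta = np.array([[0.10, 0.90], [0.80, 0.20]])
p_llama2  = np.array([[0.30, 0.70], [0.60, 0.40]])
p_fewshot = np.array([[0.45, 0.55], [0.55, 0.45]])

macro_f1 = np.array([0.9131, 0.8970, 0.7620])  # macro F1 of each model (cf. Table 3)
weights = macro_f1 / macro_f1.sum()            # normalize so the weights sum to 1

ensemble_probs = (weights[0] * p_deberta
                  + weights[1] * p_llama2
                  + weights[2] * p_fewshot)    # Equation (1)
predictions = ensemble_probs.argmax(axis=1)    # 1 = requirement, 0 = non-requirement
print(predictions)
```

For the two-model ensemble discussed in Section 5.3, the same computation applies with two equal weights of 0.5.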
4. Implementation

For this project, we utilize PyTorch and the HuggingFace Transformers library. During the fine-tuning of DeBERTa, we add a classification head on top of DeBERTa. This is a common way of fine-tuning, in which the weights of the network, including the classification head, are updated via backpropagation. However, this method requires substantial computing resources and time, so it is not suitable for fine-tuning the more complex Llama2 model. Therefore, for fine-tuning Llama2, we use Low-Rank Adaptation (LoRA) [14], an efficient fine-tuning method that has been proven effective. LoRA avoids catastrophic forgetting, a phenomenon in which knowledge of a pre-trained model is lost during fine-tuning, by training two smaller matrices whose product approximates the update to the weight matrices of the large pre-trained language model. This method greatly reduces the number of trainable parameters, enabling us to keep the computational resources and time required for fine-tuning within an acceptable range.

For few-shot learning, we use the SetFit [11] framework and the RoBERTa-large model. We fine-tuned the few-shot learning model with different numbers of labelled examples per label, namely 8, 16, 24, and 32. The model fine-tuned with 24 labelled data points per class yielded the best performance. Therefore, we use 24 randomly selected samples from each label (requirement or non-requirement) for fine-tuning the few-shot learning model. We fine-tuned the DeBERTa and few-shot learning models using a single Nvidia V100 16GB GPU, and the Llama2 model was fine-tuned using a single Nvidia A100 40GB GPU on Colab.
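The sketch below illustrates how LoRA adapters can be attached to a Llama2 sequence-classification model with the HuggingFace peft library; the checkpoint, target modules, and LoRA hyperparameters are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch: parameter-efficient fine-tuning of Llama2 for requirements
# demarcation with LoRA (via peft). Checkpoint and hyperparameters are illustrative.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"          # gated checkpoint; access must be requested
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama2 has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,            # wrap the model for sequence classification
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices and the head are trainable
# The wrapped model can then be trained with the standard transformers Trainer.
```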
5. Empirical Evaluation

5.1. Research Questions

The goals of our evaluation are three-fold. First, we aim to investigate whether there is sufficient rationale for replacing auto-encoder models such as BERT and DeBERTa with more complex auto-regressive models like Llama2 for the task of requirements demarcation. Settling this question necessitates an examination of the trade-off between the potentially increased accuracy of the more complex models and the less resource-intensive nature of earlier auto-encoder models. In this comparison, BERT is used as the baseline against which DeBERTa and Llama2 are compared, since BERT yielded the best performance in [3]. Second, we seek to study the performance of few-shot learning for requirements demarcation to enable a more thorough comparison with DeBERTa and Llama2. Third, in order to enhance the reliability of automation as a replacement for manual work, we aim to analyze whether the accuracy of requirements demarcation can be further improved by utilizing ensemble learning techniques. To achieve these objectives, we pose the following research questions (RQs):

RQ1: Which model architecture (DeBERTa, Llama2, or BERT) yields the most accurate requirements demarcation results?
RQ2: Can few-shot learning achieve on-par performance with state-of-the-art approaches?
RQ3: Can ensembling improve upon individual models?

To answer the research questions, we designed two sets of experiments. In the first set, we utilize stratified five-fold cross-validation on the Dronology dataset (discussed in Section 5.2) with DeBERTa, Llama2, and a RoBERTa-based few-shot learning model to provide a fair and direct comparison with the work of Bashir et al. [3]. In the second set, we fine-tune DeBERTa, Llama2, and the same RoBERTa-based few-shot learning model on a dataset combining the Dronology and PURE datasets (discussed in Section 5.2) and compare their performance.

5.2. Description of Dataset

The dataset we use for our evaluation combines two existing labelled datasets: the Dronology dataset from [12] and a manually extracted and labelled dataset derived from the PURE dataset by Ivanov et al. [15]. The Dronology dataset consists of 398 entries of various classes, including components, design definitions, sub-tasks, and requirements. This dataset was processed and labelled by Bashir et al. [3]. They first labelled all non-requirement classes as non-requirement and then deleted 19 entries without text. After processing, the dataset contains 99 requirements and 280 non-requirements. To mitigate imbalance, the authors use stratified 5-fold cross-validation.

The second dataset was labelled by Ivanov et al. [15], based on the PURE dataset developed by Ferrari et al. [16]. They manually extracted 7,745 sentences from 30 of the 79 natural-language requirements documents, of which 4,145 sentences are requirements and 3,600 are non-requirements. To improve labelling accuracy, they employed a manual labelling process performed by a subject-matter expert and independently verified by additional experts; any disputed data was removed. We consider this dataset a measure against imbalanced and small amounts of training data.

For the first set of experiments, discussed in Section 5.1, we use the Dronology dataset from the replication package provided by Bashir et al. [3], where the dataset was partitioned into five subsets using 5-fold cross-validation. For the second set of experiments, we merged the subsets to form a single dataset and then combined this merged set with the PURE dataset to enlarge the dataset size and address label imbalance. By combining these two datasets, we obtain 8,124 sentences, comprising 4,244 requirements and 3,880 non-requirements with an average length of 134 words. Of these, we randomly selected 30% as the testing set (2,438 sentences), while the remaining 70% was used for fine-tuning (5,686 sentences).

5.3. Results and Discussion

Tables 2 and 3 show our experimental results for all the approaches. The Weighted Avg. and Macro Avg. columns report the weighted average and macro average of each evaluation metric. The Training Time column indicates the total training time for each approach. We do not report training times for the two ensemble systems because no training is necessary for these systems. The results for BERT are obtained from our replication of bert-base-uncased on the Dronology dataset using stratified five-fold cross-validation and the same configuration used by Bashir et al. [3], to the extent that we could recreate the configuration based on the paper.

Table 2
Experimental Results for Accuracy and Training Time on the Dronology Dataset

Model | Accuracy | Weighted Avg. (Precision / Recall / F1) | Macro Avg. (Precision / Recall / F1) | Training Time
BERT* | 0.8487 | 0.8680 / 0.8487 / 0.8527 | 0.8065 / 0.8418 / 0.8161 | 0:11:29
DeBERTa | 0.8701 | 0.8703 / 0.8701 / 0.8689 | 0.8410 / 0.8205 / 0.8287 | 1:07:08
Llama2 | 0.7108 | 0.6941 / 0.7108 / 0.7002 | 0.6099 / 0.5994 / 0.6041 | 1:08:17
Few-shot Learning | 0.6794 | 0.7828 / 0.6794 / 0.6955 | 0.6736 / 0.7133 / 0.6520 | 0:26:41

* Results from our replication of bert-base-uncased on the Dronology dataset using the same hyperparameters reported in [3].
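For concreteness, the sketch below shows how the stratified five-fold protocol and the weighted and macro averages reported in Tables 2 and 3 can be computed with scikit-learn; train_and_predict is a hypothetical placeholder standing in for fine-tuning any of the models above.

```python
# Minimal sketch of the evaluation protocol: stratified 5-fold cross-validation
# with weighted and macro averages of precision, recall, and F1 (scikit-learn).
# `train_and_predict` is a hypothetical placeholder for any of the fine-tuned models.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold

def evaluate(texts, labels, train_and_predict, n_splits=5, seed=42):
    texts, labels = np.array(texts), np.array(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        preds = train_and_predict(texts[train_idx], labels[train_idx], texts[test_idx])
        row = {"accuracy": accuracy_score(labels[test_idx], preds)}
        for avg in ("weighted", "macro"):
            p, r, f1, _ = precision_recall_fscore_support(
                labels[test_idx], preds, average=avg, zero_division=0)
            row.update({f"{avg}_precision": p, f"{avg}_recall": r, f"{avg}_f1": f1})
        fold_scores.append(row)
    # Average each metric over the five folds
    return {k: float(np.mean([s[k] for s in fold_scores])) for k in fold_scores[0]}
```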
RQ1: As shown in Table 2, where we utilized stratified five-fold cross-validation to evaluate all the models, DeBERTa achieved the best performance with a macro F1 score of 82.87%. This raises the prospect that DeBERTa would outperform BERT if one could achieve the same level of performance as Bashir et al. have observed with BERT.

Table 3
Experimental Results for Accuracy and Training Time on the Dronology and PURE Datasets

Model | Accuracy | Weighted Avg. (Precision / Recall / F1) | Macro Avg. (Precision / Recall / F1) | Training Time
DeBERTa | 0.9135 | 0.9135 / 0.9135 / 0.9134 | 0.9135 / 0.9128 / 0.9131 | 0:20:43
Llama2 | 0.8970 | 0.8971 / 0.8970 / 0.8971 | 0.8969 / 0.8971 / 0.8970 | 2:24:08
Few-shot Learning | 0.7621 | 0.7628 / 0.7621 / 0.7622 | 0.7622 / 0.7625 / 0.7620 | 0:06:01
DeBERTa + Llama2 | 0.9479 | 0.9711 / 0.9479 / 0.9553 | 0.8931 / 0.9121 / 0.8953 | N/A
DeBERTa + Llama2 + Few-shot Learning | 0.9286 | 0.9637 / 0.9286 / 0.9398 | 0.8736 / 0.8867 / 0.8707 | N/A

In relation to Llama2 and few-shot learning, we note that these models show lower performance than BERT. This deficit can be attributed to several factors. First, the small size of the training set, with only 304 instances per fold, might be insufficient for these models to effectively learn the task, particularly for Llama2, which typically performs better with larger datasets. Second, the lack of context in the data can be detrimental for requirements classification tasks, as context plays an important role in determining whether a sentence is a requirement or not. Without sufficient context, models may struggle to distinguish between neutral sentences and actual requirements, leading to performance degradation. Third, the training data selection strategy used in the few-shot learning approach could be suboptimal if the selected instances are not representative of the entire dataset or do not cover the diversity of classes and patterns present in the data. In future work, to mitigate these issues, one could consider solutions that increase the size of the training set, incorporate context information, and employ more sophisticated training data selection strategies for few-shot learning, such as clustering-based techniques, manual curation, or active learning approaches.

Regarding training time, compared to BERT, which took 11 minutes to train, DeBERTa and Llama2 required significantly more training time, at 1 hour and 7 minutes and 1 hour and 8 minutes, respectively. Therefore, while DeBERTa exhibits better performance, the substantially longer training time suggests that careful consideration needs to be given to the accuracy versus time trade-off.

RQ2: The few-shot learning method exhibited lower performance in both sets of experiments, suggesting that it is unable to match the capabilities of the other models we examined for the requirements demarcation task. However, considering that it requires only 0.8% of the training data (48 examples used by few-shot learning compared to 5,686 used by DeBERTa and Llama2) and has a relatively short training time compared to the other models, we believe it is a viable option to explore when one has to cope with very small training data.
RQ3: We first constructed an ensemble system by combining the predictions of DeBERTa and Llama2, each weighted at 0.5. As shown in Table 3, ensembling DeBERTa and Llama2 improves upon the individual models in terms of overall accuracy (94.79%) and weighted scores. The ensemble system's high weighted F1 of 95.53% indicates its ability to correctly classify the majority-class instances. However, it has lower macro scores (precision, recall, and F1) than the individual DeBERTa and Llama2 models, indicating that the weaknesses of DeBERTa and Llama2 in handling minority classes may be amplified in the ensemble. Finally, we note that adding the few-shot learning model to the ensemble did not provide improvements, potentially due to its lower individual performance. The second ensemble system, which includes DeBERTa, Llama2, and few-shot learning, shows a slightly lower accuracy of 92.86% compared to the first ensemble system, along with lower weighted and macro scores.

Overall Conclusion. In conclusion, our experiments demonstrated that DeBERTa outperformed the other models, albeit requiring a longer training duration compared to BERT. As the volume of training data increased, Llama2 and few-shot learning exhibited performance closer to that of DeBERTa. Moreover, the results revealed that while the ensemble systems generally enhanced overall performance, they faced challenges in accurately classifying minority classes. Moving forward, it is crucial to carefully account for class imbalance and to evaluate alternative ensemble architectures to address this particular issue.

5.4. Threats to Validity

Internal Validity. Due to resource limitations, our hyperparameter tuning was not exhaustive, meaning there could still be undiscovered model configurations that improve performance. Furthermore, despite the labelled data having been preprocessed and deemed to be of high quality, it is possible that there remain unusual, anomalous, or improperly labelled examples within the combined requirements dataset. Such data outliers and noise could skew model performance. In future work, robustly detecting and removing possible labelling errors through outlier analysis and visual data profiling could increase data integrity.

External Validity. While our evaluation yielded rather conclusive results on our dataset, the question of whether these findings would generalize to different datasets and varied criteria for requirements demarcation necessitates further case studies.

Construct Validity. Currently, our models only classify text into binary requirement and non-requirement categories. However, real-world specifications contain a diverse array of semantic types, such as constraints, assumptions, and metadata descriptors. In complex documents, many such nuanced labels can coexist. By training and evaluating solely on a binary classification task, our models may not capture the full richness within specifications. Expanding the label set beyond binary categories could better measure model capabilities and effectiveness for realistic applications.

6. Conclusion

This paper benchmarked two large pre-trained language models, DeBERTa and Llama2, as well as a RoBERTa-based few-shot learning model, for automatically demarcating software and systems requirements in technical specifications. In future work, we plan to create broader training data covering more domains to address the limitations in domain transfer.
Furthermore, the integration method used in the ensemble system could be further improved to better take advantage of the complementary traits of DeBERTa and Llama2.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[3] S. Bashir, M. Abbas, M. Saadatmand, E. P. Enoiu, M. Bohlin, P. Lindberg, Requirement or not, that is the question: A case from the railway industry, in: International Working Conference on Requirements Engineering: Foundation for Software Quality, Springer, 2023, pp. 105–121.
[4] S. Abualhaija, C. Arora, M. Sabetzadeh, L. C. Briand, E. Vaz, A machine learning-based approach for demarcating requirements in textual specifications, in: 2019 IEEE 27th International Requirements Engineering Conference (RE), IEEE, 2019, pp. 51–62.
[5] R. C. Staudemeyer, E. R. Morris, Understanding LSTM – a tutorial into long short-term memory recurrent neural networks, arXiv preprint arXiv:1909.09586 (2019).
[6] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[7] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[8] J. Winkler, A. Vogelsang, Automatic classification of requirements based on convolutional neural networks, in: 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), IEEE, 2016, pp. 39–45.
[9] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[10] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019).
[11] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, arXiv preprint arXiv:2209.11055 (2022).
[12] J. Cleland-Huang, M. Vierhauser, S. Bayley, Dronology: An incubator for cyber-physical system research, arXiv preprint arXiv:1804.02423 (2018).
[13] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, arXiv preprint arXiv:2006.03654 (2020).
[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[15] V. Ivanov, A. Sadovykh, A. Naumchev, A. Bagnato, K. Yakovlev, Extracting software requirements from unstructured documents, in: International Conference on Analysis of Images, Social Networks and Texts, Springer, 2021, pp. 17–29.
[16] A. Ferrari, G. O. Spagnolo, S. Gnesi, PURE: A dataset of public requirements documents, in: 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE, 2017, pp. 502–505.