Comparing general purpose pre-trained Word and
Sentence embeddings for Requirements Classification
Federico Cruciania , Samuel Moorea and Chris Nugenta
a
    School of Computing, Ulster University, 2-24 York Street, Belfast, BT15 1AP, United Kingdom


                                         Abstract
                                         The recent evolution of NLP has enriched the set of DL-based approaches to include a number of general-
                                         purpose Large Language Models (LLMs). Whereas new models have been proven useful for generic text
                                         handling, their applicability to domain-specific NLP tasks still remains doubtful, particularly because of
                                         the limited amount of dataset available in certain domains, such as Requirements Engineering. In this
                                         study, different pre-trained embeddings were tested in three requirements classification tasks, in search
                                         of a tradeoff between accuracy and computational complexity. The best F1-score results were obtained
                                         with BERT (90.36% and 84.23%), with DistilBERT identified as optimal tradeoff (90.28% and 82.61%).

                                         Keywords
                                         Requirements Engineering, NLP, Large Language Models


1. Introduction
Natural Language Processing (NLP) is an area of Machine Learning (ML) which aims to learn,
understand and generate human language content. More specifically, NLP is a set of techniques
which are capable of representing written text at several levels of linguistic analysis with the
goal of achieving near human-like levels of language processing for a given task or application
[1]. The maturity level of Large Language Models (LLMs) reached in the past five years is having
an enormous impact on Deep Learning (DL) based approaches for NLP. While, on the one hand,
these models have made it possible to address previously unattainable NLP tasks, the ability of
such general-purpose models on domain-specific contexts still poses some major challenges [1].
In particular, when applying NLP to domain-specific tasks, the amount of available text is
usually extremely limited, and the semantic representation of words when used in a different
context might be misleading [1]. Consequently, the research community looked at finetuning
pre-trained LLM [2]. Whereas finetuning is a valid method, cases with limited amount of data
hinder this approach.
   Requirements Engineering (RE) is one such area where NLP approaches can help to improve
processes. Requirements, within software development, are largely expressed in natural lan-
guage [1]. The correct and accurate statement of requirements is essential for the development
In: A. Ferrari, B. Penzenstadler, I. Hadar, S. Oyedeji, S. Abualhaija, A. Vogelsang, G. Deshpande, A. Rachmann, J. Gulden,
A. Wohlgemuth, A. Hess, S. Fricker, R. Guizzardi, J. Horkoff, A. Perini, A. Susi, O. Karras, A. Moreira, F. Dalpiaz, P.
Spoletini, D. Amyot. Joint Proceedings of REFSQ-2023 Workshops, Doctoral Symposium, Posters & Tools Track, and
Journal Early Feedback Track. Co-located with REFSQ 2023. Barcelona, Catalunya, Spain, April 17, 2023.
$ f.cruciani@ulster.ac.uk (F. Cruciani); s.moore2@ulster.ac.uk (S. Moore); cd.nugent@ulster.ac.uk (C. Nugent)
 0000-0002-1870-0203 (F. Cruciani); 0000-0003-3205-3310 (S. Moore); 0000-0003-0882-7902 (C. Nugent)
                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
of high-quality software which meets the expectations of customers and end-users. Given the
importance of this part of the software development lifecycle, it is necessary to ensure that re-
quirements are stated clearly, adhere to quality criteria, are appropriately classified, and are free
from errors. While it is possible, and often the norm, to carry out the requirements engineering
processes manually, automated NLP approaches stand to offer significant improvement to the
process [1]. Within requirements engineering, there are several areas where NLP approaches
may be employed, including; elicitation, quality analysis, error detection, category classification,
and traceability [1]. Although LLMs offer a general purpose approach to language modelling,
they were not trained on the task of classifying requirements. In order to classify requirements
into a given set of categories, a model must be trained to detect these categories, which is not
typically the case for LLMs. As such, in order to develop an NLP solution for requirements
classification, it is necessary to expose a model to a range of requirements and their associ-
ated categories during training, thereby requiring the development of a task-specific language
model. This can be achieved either by fine-tuning the LLM on the specific task of requirements
classification, as done in [2], or by using the LLM to provide a semantic representation of the
requirement, and combining it with more traditional classifiers [3]. The first solution is more
resource-consuming, while the second one is more efficient. To our knowledge, the literature
does not provide a systematic comparison between different LLMs in this second scenario.
   This paper seeks to assess the effectiveness of LLMs in accurately classifying requirements
into their respective categories. In doing so, this paper considered pre-trained language models
and evaluated their ability to create semantic representations from requirements specifications.
   The contribution of this work can be summarized as follows:

    • comparison of the semantic representational power from available pre-trained LLM with
      application to RE.
    • an explorative study trying to optimize the tradeoff between computational resources
      and accuracy

  The remainder of this paper is organized as follows. Section 2 summarizes the state of the art
and related work in NLP tasks for RE. Section 3 describes the experiment design, the research
questions, and the evaluation methodology. Results and discussion are reported in Sections 4
and 5 respectively. Finally, conclusions are drawn in Section 6.


2. Related Work
Among NLP tasks in RE, requirement classification is one of the most common [1]. In most
cases, classification is applied to the binary case discriminating between functional (F) and
non-functional requirements (NF) [4]. Other studies have also considered specific classes of
NF requirements (e.g., usability, security)[5, 2, 6, 7]. In the earliest examples of requirements
classification [5], sets of keywords obtained from manually labeled requirements were used to
classify unseen data. More recently, studies started to explore the use of ML and DL approaches.
In [6], BERT [8] was used in combination with a Graph Attention Network (GAT) and an
Multilayer perceptron (MLP) classifier. The method was compared with other ML approaches,
(including Naive Bayes, Random Forest (RF)) in two classification tasks: (i) the binary case F vs
Table 1
PROMISE-NFR Dataset
                               Class             Requirements      #Sentences     #Words*
                          Functional (F)+              255             272          4996
                     Availability (A)                   21              29           432
                   Fault Tolerance (FT)                 10              11           176
                         Legal (L)                      13              15           215
                      Look & Feel (LF)+                38              51            749
                   Maintainability (M)                  17              25           476
                      Operational (O)+                 62              89           1231
                     Performance (PE)+                 54              69           1207
                     Portability (PO)                    1               2            25
                     Scalability (SC)                   21              27           402
                          Security (SE)+               66              86           1265
                          Usability (U)+               67              96           1508
                               Total                 625           772         12682
                      *
                          approximate number of words, + Used as Most frequent class


NF requirements, and (ii) for detecting four types of NF requirements. For an in-depth literature
review on ML for requirement classification readers can refer to [1], which provides a holistic
overview of the progress of NLP for performing RE tasks. On the other hand, this paper is more
focused on identifying optimum representational power and analysing the computing resources
required to achieve the task effectively.
   Similar to [2], in this work we evaluated the use of BERT [8] for requirement classification
extending the comparison to include other embeddings, such as GloVE [9], DIstilBERT [10],
SBERT [11] and Universal Sentence Encoder (USE) [12]. It should be noted that, unlike other
studies like [2], we did not retrain or fine-tuned the models used for embeddings, but simply
aimed at comparing them in some classification tasks.


3. The experiment
The experiment aimed at comparing different pre-trained models for word and sentence embed-
dings to verify their suitability for requirements classification. Evaluation was conducted on
the PROMISE-NFR dataset [13]. The dataset includes a set of 625 functional and non-functional
requirements. The non-functional requirements set includes 11 different subclasses. Table 1
summarizes all the 12 classes, the number of requirements, sentences and words available per
class. Despite being fairly balanced for the binary case F/NF requirements, the dataset presents
a great challenge in terms of class imbalance when including all 12 classes (some consisting
only of 1-20 requirements).
   The word embeddings used in the experiments were GloVE [9], BERT [8], DIstilBERT [10],
with dimension of 300, 768 and 768 respectively. The sentence embeddings were SBERT [11]
and Universal Sentence Encoder (USE) [12] with size 3841 and 512 respectively.
   1
       SBERT was used with the all-MiniLM-L6-v2 model. See https://www.sbert.net/docs/pretrained_models.html
  As illustrated in Fig. 1, the aim was to train a small size classifier (<10M parameters) on a
domain-specific context with limited amount of data, relying on pre-trained language models
for semantic representation of words/sentences.


Figure 1: In the experiment, different pre-trained models were used for extracting word/sentence
embeddings without fine-tuning. Obtained embedding were used to train an ad-hoc classifier model.


   All embeddings were tested under the same conditions, using two MLP structures (≃100k
parameters and 2M parameters)2 . The MLP models were implemented using dense layers, with
ReLU activation function and Stochastic Gradient Descent (SGD) optimizer. Learning rate values
of 0.1 and 0.01 (default) were used. The embeddings and the MLP structures were evaluated in
three different tasks:

Task 1 Binary classification Functional / Non Functional Requirements

Task 2 Classification of most frequent classes

Task 3 Classification of all classes (except portability)

Task 1 compares these general-purpose embeddings without an extreme imbalance. Tasks 2 and
3 allow the evaluation to be made on the impact of class imbalance in the more complex cases
distinguishing 6 and 12 classes. The 6 classes of Task 2 were chosen as the most frequent classes
(comprised of at least 50 sentences and 500 words) as indicated in Table 1. The experiment
aimed at answering the following research questions:

RQ 1 Which embedding provides better accuracy in requirement classification?

RQ 2 Which embedding provides the best trade-off between accuracy and model’s complexity?

3.1. Evaluation Methodology
The dataset was split into train and test set, 70% and 30% respectively. With the training set
further split into train and validation (10%). The splits were done using the stratify options, i.e.
preserving class imbalance, and preventing from separating sentences appearing in the same
requirement. The evaluation was done using a 5-fold procedure comparing the embeddings
with the two MLP structures on the three tasks. Models were trained with early stopping using a
patience of 25 epochs and saving only the models with highest accuracy in the validation set. The
data imbalance was handled using weighted loss. Finally, a k-Nearest Neighbors (kNN) classifier
was used as a baseline. Since kNN makes direct use of embeddings for classification, and
because of its non-parametric nature, it is therefore a suitable approach to measure how well the
embeddings obtained from pre-trained models could be used directly to classify requirements.


    2
        The complete source code is available at: https://github.com/fcruciani/reqclass
                                                           h
Table 2
Classification Report (Task 1 & 2) using the large and small MLP respectively.
                                  Precision*           Recall*            F1-score*           F1-Score+
                               Task 1 Task 2       Task 1 Task 2       Task 1 Task 2       Task 1 Task 2
       Glove                  0.9110     0.7552   0.8520    0.7351     0.8709     0.7441   0.8840   0.7793
        BERT                  0.9010     0.8373   0.8900    0.7743     0.8949     0.7994   0.9036   0.8423
     DistilBERT               0.9076     0.8043   0.8836    0.7645     0.8934     0.7807   0.9028   0.8261
       SBERT                  0.8709     0.7455   0.8797    0.7682     0.8748     0.7549   0.8837   0.7833
         USE                  0.8606     0.7484   0.8841    0.7666     0.8669     0.7550   0.8743   0.7948
     *
         Macro-average, + Weighted


Table 3
Results with the small MLP architecture trained using BERT (uncased) embeddings (Task 2).
                                          Class         Precision    Recall     F1-Score
                                       Functional (F)    0.8490      0.9449      0.8944
                                     Look & Feel (LF)    0.7273      0.5818      0.6465
                                      Operational (O)    0.8500      0.8500      0.8500
                                     Performance (PE)    0.9138      0.6543      0.7626
                                       Security (SE)     0.8318      0.8318      0.8318
                                       Usability (US)    0.8525      0.8062      0.8287
                                       macro avg         0.8374      0.7782      0.8023
                                      weighted avg       0.8456      0.8454      0.8416


4. Results
In the experiment, results covering all combinations between the large and small MLP classi-
fiers were calculated on the three tasks. For the sake of conciseness only some combinations
are reported. Additional results including confusion matrices are available in the published
repository. Table 2 reports results obtained using the large MLP structure on Task 1 and the
small MLP architecture on Task 2.
   Additional combinations were tested considering the cased and uncased versions of pre-
trained BERT and DistilBERT models. Table 3 report results obtained with the best performing
model on Task 2, the small MLP architecture using the uncased version of BERT. Fig. 2 illustrates
the confusion matrix and the normalized confusion matrix obtained in Task 2 using BERT
uncased and the small MLP.
   Table 4 summarizes results obtained in Task 3, including the baseline results using kNN as a
classifier. Table 5 reports precision, recall and f-score values for all classes obtained with the
best performing combination on Task 3. Fig. 3 illustrates the normalized confusion matrices
obtained with the different embeddings on Task 3. Finally, Fig. 4 summarizes the macro-average
F1 score obtained with all the embeddings on Task 2 and Task 3.
                                                (a)                                                 (b)

Figure 2: Confusion matrices obtained with BERT uncased (a) and the normalized version (b).


Table 4
Classification Report all classes (Task 3 using Large MLP)
                                    Precision*                         Recall*                F1-score*                 F1-Score+
                                 MLP             kNN              MLP                kNN    MLP             kNN       MLP       kNN

     Glove                    0.5941          (0.5760)         0.5781          (0.3128)    0.5751         (0.3583)   0.6553   (0.4851)
      BERT                    0.7585          (0.7640)         0.5920          (0.5065)    0.6148         (0.5470)   0.7363   (0.6798)
   DistilBERT                 0.7417          (0.7347)         0.6065          (0.5027)    0.6401         (0.5544)   0.7415   (0.6760)
     SBERT                    0.6220          (0.6492)         0.5661          (0.5301)    0.5777         (0.5462)   0.6660   (0.6572)
       USE                    0.6221          (0.5890)         0.5472          (0.5113)    0.5615         (0.5216)   0.6860   (0.6583)
   *
       Macro-average + Weighted - in () are the baseline values obtained using kNN


                                                (a)                                                 (b)

Figure 3: Confusion matrices obtained with BERT (a), and DistilBERT (b) on task 3.


5. Discussion
Results on Task 1 highlight BERT and DistilBert as the best performing embeddings (RQ1),
however results obtained with the other embeddings are comparable and might be considered
for resource constrained cases. In particular, SBERT is the fastest model (except for GloVe) for
Table 5
Results with the MLP architecture trained using BERT uncased embeddings.
                                 Class         Precision   Recall   F1-Score
                         Availability (A)       0.6486     0.6857    0.6667
                          Functional (F)        0.7914     0.9397    0.8592
                       Fault Tolerance (FT)     0.8000     0.1739    0.2857
                             Legal (L)          0.8000     0.2353    0.3636
                        Look & Feel (LF)        0.6852     0.5873    0.6325
                       Maintainability (MN)     0.4800     0.4286    0.4528
                         Operational (O)        0.7363     0.6505    0.6907
                        Performance (PE)        0.8404     0.7980    0.8187
                         Scalability (SC)       0.7143     0.6944    0.7042
                          Security (SE)         0.6947     0.8148    0.7500
                          Usability (US)        0.7840     0.7000    0.7396
                            macro avg           0.7250     0.6098    0.6331


                           (a)                                      (b)

Figure 4: Macro F1 for all embeddings including the baseline kNN and the large MLP on Task 2 (a) and
Task 3 (b).


generating vectors of 384 dimensions, which also reduces the complexity of the final classifier.
Similarly, GloVe embeddings are obtained by simple lookup on a dictionary data structure
and are 300 dimensions embeddings. GloVe, however, is exposed to out-of-dictionary (OOD)
words limiting its working ability in the presence of OOD words. Similarly, in Task 2 and 3,
BERT and DistilBERT were the best performing models, with DistilBERT a good candidate to
reduce the computational overhead of BERT without causing detrimental effects on the accuracy
performance (RQ2). No major differences were observed when using the small and the large
MLP classifiers, possibly due to the limited size of the dataset that does not allow to maximize
the benefit of using a classifier with a higher number of trainable parameters. The lack of data
is further exacerbated in the case of sentence embeddings with the MLP trained on fewer data
points. The comparison with the baseline highlighted how, despite the limited amount of data,
training an MLP classifier outperforms the baseline kNN approach of using embeddings to
classify new data. The worst performing baseline results were obtained with GloVe, possibly
attributable to out-of-dictionary words. MLP classifiers trained on GloVe vectors, however,
appear to reduce the gap, leading to results comparable with SBERT and USE.

5.1. Limitations
Construct Validity Standard evaluation metrics were used as macro-averages to prevent majority
classes from masking less represented ones. All mandatory steps of the ECSER pipeline for
evaluating classifiers were performed [14]. Optional steps, e.g. significance tests, were not
performed due to the preliminary nature of this work.
   Internal Validity One major factor affecting internal validity is the correctness of the anno-
tation of the dataset that authors have questioned in the past. Nevertheless, it still represents
the most widely used dataset for requirements classification, facilitating the comparison with
previous work. Since this type of ML tasks typically include a high degree of randomness, 5-fold
cross-validation was used to calculate results.
   External Validity The dataset includes requirements written by students, which may not be
representative of industrial requirements and the evaluation of the language models is limited to
the three examined tasks. Different results may be obtained when other classification schemes
are used, or other types of requirements-related information (e.g., user stories, or app reviews)
are adopted. Concerning the coverage of possible pre-trained embeddings, we have considered
a representative set of basic and deep learning-based ones, not only limited to those derived
from BERT. Therefore, we argue that our analysis can be considered representative of the usage
of different embeddings for requirements classification. As for the classification algorithm, we
use two MLP structures and a kNN as a baseline. Different results may be obtained when using
other classifiers (e.g., SVM, Naive Bayes).


6. Conclusion
This paper reported on the evaluation of pre-trained embeddings for RE. Some of the most
common embeddings were tested under the same circumstances on a public dataset. Results
obtained identify BERT and its smaller variant DistilBERT as the best performing embeddings,
with the latter being an optimal tradeoff between accuracy and model complexity. GloVE and
SBERT despite a slightly lower accuracy were found to be the fastest in prediction time and could
be suitable for cases in which time represents a key factor or resource constrained environments.
Future work will aim to extend the evaluation on additional datasets to verify the validity of
these results on different RE tasks and datasets.


Acknowledgments
This research is supported by the ARC (Advanced Research Engineering Centre) project, funded
by PwC3 and Invest Northern Ireland.


    3
      PricewaterhouseCoopers LLP a limited liability partnership incorporated in England with its registered office
office at 1 Embankment Place, London WC2N 6RH
References
 [1] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-
     Navarro, Natural language processing for requirements engineering: A systematic mapping
     study, ACM Computing Surveys (CSUR) 54 (2021) 1–41.
 [2] T. Hey, J. Keim, A. Koziolek, W. F. Tichy, Norbert: Transfer learning for requirements
     classification, in: 2020 IEEE 28th International Requirements Engineering Conference
     (RE), IEEE, 2020, pp. 169–179.
 [3] M. E. Peters, S. Ruder, N. A. Smith, To tune or not to tune? adapting pretrained represen-
     tations to diverse tasks, in: Proceedings of the 4th Workshop on Representation Learning
     for NLP (RepL4NLP-2019), 2019, p. 7–14.
 [4] F.-L. Li, J. Horkoff, J. Mylopoulos, R. S. Guizzardi, G. Guizzardi, A. Borgida, L. Liu, Non-
     functional requirements as qualities, with a spice of ontology, in: 2014 IEEE 22nd Interna-
     tional Requirements Engineering Conference (RE), IEEE, 2014, pp. 293–302.
 [5] J. Cleland-Huang, R. Settimi, X. Zou, P. Solc, Automated classification of non-functional
     requirements, Requirements engineering 12 (2007) 103–120.
 [6] G. Li, C. Zheng, M. Li, H. Wang, Automatic requirements classification based on graph
     attention network, IEEE Access 10 (2022) 30080–30090.
 [7] O. AlDhafer, I. Ahmad, S. Mahmood, An end-to-end deep learning system for requirements
     classification using recurrent neural networks, Information and Software Technology 147
     (2022) 106877.
 [8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [9] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in:
     Proceedings of the 2014 conference on empirical methods in natural language processing
     (EMNLP), 2014, pp. 1532–1543.
[10] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[11] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     arXiv preprint arXiv:1908.10084 (2019).
[12] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-
     Cespedes, S. Yuan, C. Tar, et al., Universal sentence encoder, arXiv preprint
     arXiv:1803.11175 (2018).
[13] H. L. J. Cleland-Huang, S. Mazrouee, D. Port, Promise-nfr dataset, https://doi.org/10.5281/
     zenodo.268542, 1007.
[14] D. Dell’Anna, F. B. Aydemir, F. Dalpiaz, Evaluating classifiers in se research: the ecser
     pipeline and two replication studies, Empirical Software Engineering 28 (2023) 1–40.