<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An approach to identifying functional and non-functional requirements in IT-project management using deep learning models</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Khmelnytskyi National University</institution>
          ,
          <addr-line>Khmelnytskyi</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska str., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bejaia</institution>
          ,
          <addr-line>06000, Bejaia</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Zilina University</institution>
          ,
          <addr-line>Univerzitná 8215, 010 26 Žilina</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The accurate identification of functional (FR) and non-functional (NFR) requirements is critical for software success, yet unsupervised methods using general-purpose language models often suffer from a “semantic gap,” leading to poor clustering performance. In this work, we propose a domain adaptation approach for the RoBERTa model to enhance unsupervised requirements clustering. We investigate the effectiveness of standard fine-tuning compared to contrastive learning and conduct a detailed analysis of layer-wise informativeness. Our results demonstrate that domain adaptation significantly improves clustering quality, increasing the F1-Score from ≈ 0.55 to ≈ 0.81, with the final 12th layer providing the optimal representation. Furthermore, while both fine-tuning methods achieve similar accuracy, contrastive learning proves superior in identifying anomalous requirements, effectively isolating ambiguity as outliers. We conclude that domain adaptation effectively bridges the semantic gap, offering a robust tool for automating requirements analysis in IT project management.</p>
      </abstract>
      <kwd-group>
        <kwd>Software requirements classification</kwd>
        <kwd>unsupervised machine learning</kwd>
        <kwd>clustering</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>domain adaptation</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>requirements ambiguity detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In modern software development, the accurate and effective identification of functional and
non-functional requirements is critically important for the success of any project. Incorrect or incomplete
classification of these requirements in the early stages can lead to significant financial losses, product
release delays, and ultimately, the creation of software that does not meet user expectations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Therefore, the relevance of automating this process using deep learning methods is rapidly growing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Furthermore, in the current era characterized by global digitalization, this software engineering task
is closely correlated with global humanitarian and environmental goals. The solution to the problem
of identifying functional and non-functional requirements using deep learning correlates with the
Sustainable Development Goals (SDGs) adopted by the UN, not directly, but as a fundamental accelerator
tool that increases the efficiency of creating solutions to achieve these goals [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This can be traced
through the following logical connection: Effective Software Development → High-Quality and Reliable
Software → Scalable Solutions for SDGs. In the context of specific goals, this has a direct impact
on Goal 9 (Industry, Innovation and Infrastructure). This goal is directly related to building resilient
infrastructure and fostering innovation. Automating requirements analysis using deep learning is itself
an innovation in the software development process. It allows for the creation of more complex and
reliable software systems (e.g., for managing smart cities, logistics, industrial processes) faster and
at a lower cost. This strengthens the technological infrastructure and the innovative potential of the
economy. Automating the identification of requirements with deep learning is not just a technical
improvement. It is a strategic step that enables the creation of higher-quality, more reliable, and safer
software tools. These very tools are the foundation for developing innovative solutions aimed at
overcoming the global challenges formulated in the Sustainable Development Goals.
      </p>
      <p>
        Functional requirements define what a system should do, describing its specific functions and behavior.
In contrast, non-functional requirements establish how the system should perform these functions,
defining its attributes such as performance, security, reliability, and usability. Understanding this
difference is fundamental to building a quality software product [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Traditionally, the analysis and classification of software requirements are performed manually by
business analysts and system engineers. This approach, though time-tested, has a number of significant
drawbacks: it is labor-intensive and costly, subject to human error, prone to incomplete identification and
conflicting requirements, and struggles to keep pace with the dynamic nature of requirements. This is where Deep
Learning and Natural Language Processing (NLP) technologies play a key role [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These methods allow
for the creation of models capable of automatically analyzing and classifying textual information with
high accuracy. The main advantages of using deep learning for identifying functional and non-functional
requirements include: automation and speed, increased accuracy, objectivity, and early detection of
problems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The application of methods such as convolutional and recurrent neural networks allows for effective
work with textual data, identifying keywords, phrases, and semantic connections that indicate whether
a requirement belongs to the functional or non-functional type [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Thus, the relevance of identifying
functional and non-functional requirements by means of deep learning is driven by the urgent need
to increase the efficiency, accuracy, and speed of the software development process. Automating this
critically important stage not only saves time and resources but also significantly enhances the quality
of the final product, ensuring it meets both functional expectations and requirements for performance,
security, and reliability.
      </p>
      <p>The main contribution of this research is a comprehensive analysis of approaches to the unsupervised
clustering of software requirements based on the domain adaptation of language models. The work
quantitatively demonstrates that fine-tuning is a critical step that significantly improves clustering
quality compared to using the base model. The research includes a deep analysis of the informativeness
of each layer of the fine-tuned model, based on which the best way to form vector representations is
determined, and a conclusion is drawn about the optimality of using the final layer. Furthermore, by
comparing different fine-tuning methods, the research shows that contrastive learning is an effective
tool for the automatic identification of semantically ambiguous and anomalous requirements, which is
important for improving their quality in the early stages of development.</p>
      <p>The rest of the paper is structured as follows. Section 2 presents an analysis of related work in the
field of software requirements classification. Section 3 provides a detailed description of the proposed
research methodology. Section 4 presents and discusses the results of the conducted experiments.
Finally, Section 5 formulates the general conclusions of the work and outlines directions for future
research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Recently, there has been a rapid development of deep learning methods for the automatic classification
of software requirements (SR) into functional (FR) and non-functional (NFR). This process is critical
for successful development, as NFRs define the quality attributes of a system, such as performance,
security, and reliability [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The following is an overview of key modern approaches, their advantages,
disadvantages, and existing research gaps.
      </p>
      <p>
        Research in this field has gone through several stages of deep learning architecture development.
Initially, CNN and RNN architectures (including their variants LSTM and GRU) were actively used for
requirements classification. CNN-based models effectively extracted local features (keywords and phrases),
while RNN/LSTM models were proficient at capturing sequential dependencies in the requirement
text [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Often, these architectures were combined to leverage the advantages of both approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
The advantages of these approaches include a significant improvement in accuracy compared to
traditional machine learning methods (e.g., SVM or Naive Bayes) and the ability to automatically learn
features, which eliminates the need for manual feature engineering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Disadvantages include limited
contextual understanding, as these models struggled with long and complex sentences due to the
“vanishing gradient” problem and their limited ability to consider the full context of a requirement.
Additionally, the sequential nature of RNNs slows down the training process and prevents effective
parallelization.
      </p>
      <p>
        The emergence of transformer architectures, particularly BERT (Bidirectional Encoder
Representations from Transformers), was revolutionary for natural language processing and, specifically, for
requirements classification. The mechanism of fine-tuning pre-trained models such as BERT, RoBERTa,
and DeBERTa for the specific task of FR and NFR classification has become widely adopted [
        <xref ref-type="bibr" rid="ref11 ref12 ref2">2, 11, 12</xref>
        ].
Approaches like NoRBERT and PRCBERT demonstrate the highest accuracy to date on standard datasets
such as PROMISE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The advantages of this approach are: (1) Deep contextual understanding –
thanks to the attention mechanism and bidirectional training, BERT considers the context of each word
from both sides, allowing for a better understanding of the requirement’s semantics [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; (2) Transfer
Learning – BERT is pre-trained on vast amounts of general-language texts, which allows it to achieve
high accuracy even when fine-tuned on relatively small datasets of requirements [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and (3) Higher
accuracy – BERT-based models consistently demonstrate the best results, achieving F1-scores of over
90% in binary classification tasks (FR/NFR) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The disadvantages include: (1) High computational requirements – fine-tuning and using large
transformer models require significant computational resources (GPUs); (2) The “black box” problem –
the complexity of the architecture makes it difficult to interpret why the model made a particular decision,
which is critical for responsible systems [
        <xref ref-type="bibr" rid="ref14">14</xref>
] and (3) Sensitivity to data quality – the effectiveness of the
models heavily depends on the quality and size of the data on which fine-tuning is performed (existing
datasets, such as PROMISE, are relatively small and may contain outdated requirements [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>
        The latest step in the evolution of these approaches has been the application of Large Language
Models (LLMs), such as the latest generation models (e.g., the GPT-4 series, Google Gemini, Anthropic
Claude). Unlike BERT-like architectures, which are predominantly used for classification, LLMs are
capable of performing tasks in a “zero-shot” or “few-shot” manner, as well as generating requirements
or answering questions about them [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, their use is associated with challenges such as a
propensity for “hallucinations” (generating false information) and the high cost of use via API [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Table 1 summarizes the advantages and disadvantages of the existing approaches.
      </p>
      <p>
        Despite the impressive results of transformer models, a major gap in current research lies in the
so-called “semantic gap.” Most works rely on standard models pre-trained on general-purpose text
corpora (e.g., Wikipedia, news), which fail to capture the specific semantics and nuances of terminology
inherent to the field of software engineering [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. For example, the word “performance” in a general
context might refer to an artist’s show, whereas in software requirements, it has a clear technical
meaning (response time, resource usage). Terms like “scalability,” “maintainability,” and “security” have
complex, multifaceted meanings that cannot be fully grasped without specialized knowledge. A model
that does not “understand” the intricacies of the subject domain is forced to rely only on superficial
syntactic patterns, which reduces its reliability [<xref ref-type="bibr" rid="ref19">19</xref>].
      </p>
      <p>
        Promising directions for solving this problem include: (1) creating domain-specific language models
– developing corresponding models by pre-training or further fine-tuning existing transformers on vast
amounts of texts from the software engineering domain (scientific articles, documentation, real-project
requirements, Stack Overflow discussions) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]; (2) Integration with Knowledge Graphs – combining
text embeddings with embeddings obtained from knowledge graphs that formally describe software
engineering concepts and the relationships between them (e.g., the hierarchy of NFR types according to
the ISO 25010 standard) [<xref ref-type="bibr" rid="ref20">20</xref>] and (3) Structure-aware Embeddings – developing methods that consider
not only the text of a requirement but also its structure or relationship with other requirements,
simultaneously learning representations of words and the structure of the corpus itself [<xref ref-type="bibr" rid="ref20">20</xref>]. Thus, the
future breakthrough in the accuracy and reliability of automatic requirements classification lies not so
much in creating new architectures as in enriching existing models with domain knowledge through
more informative and contextually-aware embeddings.
      </p>
      <p>Of all the listed directions, the most direct and common method in practice for creating a
domain-specific model is fine-tuning an existing powerful general-purpose language model on relevant data. This
approach, known as domain adaptation, allows for the specialization of the model to the nuances of a
specific industry without requiring training from scratch. However, while most research applies domain
adaptation to create supervised classifiers, which require labels for each new dataset, in real-world
scenarios, data are often unlabeled. Therefore, an important research question arises: how effective is
domain adaptation for improving the quality of the vector representations themselves so they can be
used for unsupervised clustering? Such an approach not only allows for the automation of processing
new, unlabeled requirements but also for the investigation of their natural semantic structure.</p>
      <p>To address this problem and improve the quality of unsupervised approaches, the goal of the research
is formulated: to improve the quality of software requirement clustering through the domain adaptation
of a pre-trained language model and the subsequent analysis of its internal representations. To achieve
this goal, the following tasks are formulated: (1) to compare the quality of clustering before and
after fine-tuning; (2) to conduct an analysis of the informativeness of the fine-tuned model’s layers
to determine the most informative layers and the best way to aggregate them; (3) to compare the
effectiveness of standard fine-tuning and contrastive learning as tools for identifying and characterizing
groups of semantically ambiguous requirements.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>This section details the approach developed for the automatic unsupervised clustering of software
requirements. The core of this research is a comparative analysis that aims to quantitatively evaluate
the impact of a language model’s domain adaptation on the separability of functional (FR) and
non-functional (NFR) requirements classes in the vector space. The proposed approach is a sequential
process that is applied to both the base and the fine-tuned models. The entire process can be divided
into the following stages: generation of multi-layer vector representations, combination of layers
and formation of the final vector, clustering, and final quality evaluation. The general scheme of this
approach, illustrating the sequence of key stages, is depicted in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental design and model preparation</title>
        <p>To convert textual requirements into vector representations, RoBERTa (A Robustly Optimized BERT
Approach) [<xref ref-type="bibr" rid="ref21">21</xref>] was chosen as the base architecture. This model is an ideological successor to BERT
and was selected for its advantages proven in the original research. Specifically, RoBERTa was trained
on a much larger dataset and used more advanced training methods (e.g., dynamic masking without
the Next Sentence Prediction task), which generally leads to the creation of higher-quality and more
semantically rich embeddings. In this study, its base version, roberta-base, is used.</p>
        <p>The subsequent experiment is based on a comparative analysis of the effectiveness of two domain
adaptation approaches for this model compared to its base state. The first approach is standard
fine-tuning for classification, where the model is adapted on a separate domain-specific set of labeled data to
solve the binary FR/NFR classification task. Additionally, a more advanced method of domain adaptation
is investigated – contrastive learning [<xref ref-type="bibr" rid="ref22">22</xref>]. Its goal is not to train the model to predict class labels, but
to directly optimize the structure of the vector space. For this, pairs of requirements are formed from
the training data (“positive” – from the same class, and “negative” – from different classes), and the
model learns to minimize the distance between similar ones and maximize it between different ones. It
is expected that such an approach will force similar requirements to group into denser and more clearly
separated clusters. Thus, the subsequent stages of the method are applied in parallel to three states of
the model (base, fine-tuned for classification, and contrastively fine-tuned) for an objective comparison
of their effectiveness.</p>
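        <p>For illustration, a minimal sketch of how such contrastive pairs and a margin-based contrastive loss can be implemented in PyTorch is given below. The pair-sampling ratio, the margin value, and the function names are illustrative assumptions rather than the exact training configuration used in this study.</p>
        <preformat><![CDATA[
import random
import torch.nn.functional as F

def make_pairs(texts, labels, n_pairs=1000, seed=42):
    """Sample "positive" pairs (same class) and "negative" pairs (different classes)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:                                  # positive pair
            cls = rng.choice(list(by_class))
            a, b = rng.sample(by_class[cls], 2)
            pairs.append((a, b, 1))
        else:                                                   # negative pair
            c1, c2 = rng.sample(list(by_class), 2)
            pairs.append((rng.choice(by_class[c1]), rng.choice(by_class[c2]), 0))
    return pairs

def contrastive_loss(emb_a, emb_b, same, margin=0.5):
    """Pull same-class embeddings together and push different-class ones apart."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)              # cosine distance
    same = same.float()
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()
]]></preformat>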
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Input corpora and data preparation</title>
        <p>The experiment uses two separate, publicly available datasets to avoid any overlap between training
and testing data. For fine-tuning and obtaining a domain-adapted model, the PROMISE_exp dataset [<xref ref-type="bibr" rid="ref23">23</xref>]
is used. This corpus is an extension of the well-known PROMISE dataset and consists of 969 software
requirements collected from 49 different projects. The dataset is relatively balanced, containing 444
functional (FR) and 525 non-functional (NFR) requirements, making it a quality source for training the
model to distinguish between these two classes.</p>
        <p>For the main clustering experiment and final evaluation, a different dataset is used – the
“FR_NFR_dataset” [<xref ref-type="bibr" rid="ref24">24</xref>]. This dataset contains a total of 6118 software requirements (3964 FR and
2154 NFR), collected from existing repositories and open-source project documentation. It is important
to note that the class labels are not used during the feature extraction and clustering stages; they serve
exclusively for the final evaluation of the unsupervised algorithm’s performance. Since the original
dataset is imbalanced (the number of FRs significantly exceeds NFRs), which could lead to a bias in
the clustering results toward the dominant class, a balancing procedure was performed. To ensure
the objectivity of the experiment, a balanced subset was created from the original corpus. From each
class, 1000 requirements were randomly selected. Thus, the final experimental corpus used in this
work consists of 2000 software requirements, containing 1000 functional and 1000 non-functional
examples. This approach ensures that the evaluation of clustering quality will not be skewed due to
class imbalance.</p>
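        <p>Under the assumption that the corpus is loaded as a table with a text column and an FR/NFR label column (the file name and column names below are hypothetical), the balancing step can be reproduced with a simple stratified random subsample:</p>
        <preformat><![CDATA[
import pandas as pd

# Hypothetical layout: "text" holds the requirement, "label" holds "FR" or "NFR".
df = pd.read_csv("fr_nfr_dataset.csv")

# Randomly draw 1000 requirements per class to obtain the balanced 2000-item corpus.
balanced = (
    df.groupby("label", group_keys=False)
      .sample(n=1000, random_state=42)
      .reset_index(drop=True)
)
print(balanced["label"].value_counts())
]]></preformat>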
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Method steps</title>
      </sec>
      <sec id="sec-3-4">
        <title>Step 1: Generation of multilayer vector representations</title>
        <p>The first stage of the process is the generation of multilayer vector representations (embeddings) from the prepared textual requirements.
For this, the pre-trained language model roberta-base [<xref ref-type="bibr" rid="ref21">21</xref>] from the Hugging Face Transformers library
is used. The representation generation process begins with tokenization, which converts each textual
requirement into a standardized sequence of numerical tokens. An important detail of this process is
that the tokenizer automatically adds a special service token [CLS] (Classification) to the beginning of
each sequence.</p>
        <p>The tokenized texts are fed into the input of the corresponding model version (base or fine-tuned),
which operates exclusively in inference mode for this stage. This means that during the generation of
representations, the model’s weights are not changed. A key feature of the approach is that outputs
are collected from the entire depth of the architecture for analysis: from the initial embedding layer
(layer 0) and from each of the 12 transformer encoder layers (layers 1-12). This process of multilayer
extraction is schematically depicted in Figure 2.</p>
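        <p>A condensed sketch of this extraction step with the Hugging Face Transformers API is shown below; passing output_hidden_states=True makes the model return the embedding layer plus all 12 encoder layers, and position 0 of every sequence corresponds to the classification token. The example sentences are invented for illustration.</p>
        <preformat><![CDATA[
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()  # inference mode only: the weights are not updated at this stage

requirements = [
    "The system shall export reports as PDF.",
    "The response time must not exceed 2 seconds.",
]

with torch.no_grad():
    batch = tokenizer(requirements, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch)

# hidden_states is a tuple of 13 tensors: layer 0 (embeddings) plus encoder layers 1-12.
# Position 0 of each sequence is the <s> token, which plays the role of [CLS] in RoBERTa.
cls_per_layer = [h[:, 0, :] for h in outputs.hidden_states]   # 13 tensors of shape (batch, 768)
]]></preformat>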
        <p>Thus, for each requirement in the corpus, a set of vector representations is obtained at the output of
this stage, which includes a separate vector for the [CLS] token from each layer. These representations,
which encode information at different levels of abstraction, will serve as the basis for the next stage.</p>
        <p>Step 2: Layer Aggregation and Vector Formation. After generating multilayer representations, a
systematic study is conducted to find the best way to use them. The study consists of two consecutive
stages.</p>
        <p>1. Individual analysis of layer informativeness. At this stage, the contribution of each individual
layer to the quality of clustering is evaluated. For this, the [CLS] vector extracted from each of the 12
encoder layers of the respective model (base or fine-tuned) is used for clustering separately, and the
quality of the resulting partition is evaluated using a set of internal and external metrics. The purpose
of this step is to create a detailed “informativeness profile” of the model, which allows for visualizing
and quantitatively assessing at which levels of the architecture (lower, middle, or upper) the most
useful semantic information for distinguishing between functional and non-functional requirements is
contained. The results of this analysis provide the rationale for selecting strategies in the next stage.</p>
        <p>2. Testing the hypothesis on the effectiveness of combining the best layers. Based on the results of the
previous stage, which show that the highest informativeness is concentrated in the upper layers of
the model, a hypothesis is tested as to whether combining these best layers can improve the result
and surpass the performance of the best single layer. For this, the strategy of aggregating the last four
layers (from 9th to 12th), which are the most informative according to the analysis, is investigated. Two
methods of aggregating the [CLS] vectors are applied:
• Averaging produces a generalized 768-dimensional vector.</p>
        <p>• Concatenation produces a feature-rich 3072-dimensional representation.</p>
        <p>The results obtained in these stages are compared with the baseline approach, which is the best result
from the individual analysis stage, i.e., the use of a single, most effective layer.</p>
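        <p>Assuming the per-layer [CLS] vectors from Step 1 are available as a list of 13 tensors of shape (batch, 768), the two aggregation strategies and the single-layer baseline can be sketched as follows:</p>
        <preformat><![CDATA[
import torch

# cls_per_layer: list of 13 tensors (embedding layer + 12 encoder layers), each (batch, 768)
last_four = cls_per_layer[9:13]                      # encoder layers 9-12

# Strategy 1: averaging -> a generalized 768-dimensional vector per requirement
avg_vec = torch.stack(last_four, dim=0).mean(dim=0)  # (batch, 768)

# Strategy 2: concatenation -> a feature-rich 3072-dimensional representation
cat_vec = torch.cat(last_four, dim=-1)               # (batch, 3072)

# Baseline: the single most informative layer (the final, 12th encoder layer)
single_vec = cls_per_layer[12]                       # (batch, 768)
]]></preformat>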
        <p>Step 3: Clustering of Vector Representations. After a summary vector has been formed for each
requirement, the next step is their grouping. Before being fed into the clustering algorithms, the vectors
undergo standardization (StandardScaler). The study utilizes two clustering algorithms, each aimed
at solving a separate task. For the main task of binary partitioning into FR/NFR classes, the K-Means
algorithm with the parameter k = 2 is applied. This choice allows for the forced partitioning of
the entire corpus into two clusters, which is a necessary condition for the subsequent calculation of
classification quality metrics, such as the F1-Score. In parallel, to solve the task of identifying ambiguous
and anomalous requirements, the density-based algorithm HDBSCAN [<xref ref-type="bibr" rid="ref25">25</xref>] is used. The key advantage
of this method is its ability not only to find clusters of arbitrary shape but also to identify “outliers”
– data points that do not belong to any dense grouping. In the context of this work, these “outliers”
are interpreted as anomalous or atypical requirements, which are separated into a distinct group for
further analysis.</p>
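        <p>A sketch of this clustering step is given below. It assumes scikit-learn version 1.3 or newer, which provides an HDBSCAN implementation, and reuses the single-layer vectors from the previous sketches; the minimum cluster size is an illustrative value.</p>
        <preformat><![CDATA[
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, HDBSCAN

X = single_vec.numpy()                    # summary vectors, shape (n_requirements, 768)
X = StandardScaler().fit_transform(X)     # standardization before clustering

# Binary FR/NFR partition: force exactly two clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Density-based clustering: points labelled -1 do not belong to any dense group (outliers)
hdbscan_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)
n_outliers = int((hdbscan_labels == -1).sum())
print(f"Requirements flagged as anomalous by HDBSCAN: {n_outliers}")
]]></preformat>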
        <p>Step 4: Evaluation of Clustering Quality. The final stage involves an objective evaluation of
the quality of the obtained clusters to determine which vector formation strategy is the most
effective. The evaluation is performed using both internal and external metrics, which together provide a
comprehensive view of the partition quality.</p>
        <p>Internal evaluation is conducted without using ground truth labels and shows the structural properties
of the clusters themselves. For this, the Silhouette Score [<xref ref-type="bibr" rid="ref26">26</xref>] is applied. This metric quantitatively
evaluates how well each object fits its cluster, based on two criteria: cohesion – the measure of how
similar an object is to other objects in its own cluster, and separation – the measure of how different it
is from objects in other clusters. The coefficient’s value ranges from -1 to +1, where: a value close to
+1 indicates that the clusters are dense and well-separated from each other; a value around 0 means
that the clusters overlap; and a value close to -1 suggests that the points have likely been assigned to
the wrong clusters. The overall Silhouette Score for the entire dataset is the average value across all
points. In scientific practice, values in the 0.5 – 0.7 range are considered to be an indication of a justified,
well-defined cluster structure, while values above 0.7 indicate a strongly pronounced structure.</p>
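        <p>For reference, the per-object Silhouette value underlying this score follows the standard definition, where a(i) is the mean distance from object i to the other objects of its own cluster (cohesion) and b(i) is the smallest mean distance from i to the objects of any other cluster (separation):</p>
        <disp-formula><tex-math><![CDATA[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} ]]></tex-math></disp-formula>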
        <p>External evaluation uses the ground truth class labels (FR/NFR) to determine how well the clustering
result corresponds to the reference distribution. Since the labels assigned by the clustering algorithm
(e.g., “cluster 0” and “cluster 1”) are arbitrary, a mapping step is performed before calculating the metrics.
Each cluster is assigned the class label that is dominant among the points in that cluster. For example, if
95% of the requirements in “cluster 0” have the true label “FR,” the entire cluster receives the label “FR.”
After this, the task is treated as a binary classification problem, and the F1-Score and its components are
used for its evaluation: Precision shows what fraction of the items named “functional” by the clustering
algorithm are indeed so. Recall shows what fraction of all “functional” requirements from the entire
dataset the clustering algorithm was able to find and place in the corresponding cluster. The F1-Score
is the harmonic mean of precision and recall, which makes it an integral indicator of quality. In the
research results, the F1-Score will be the key metric for comparing the effectiveness of different layer
aggregation strategies. A high F1-Score will indicate that a certain strategy allows for the creation of
vector representations that are not only well-separated in space but also that this separation corresponds
to the human, semantic division into functional and non-functional requirements.</p>
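        <p>The cluster-to-label mapping and the external metrics described above can be computed roughly as shown below; y_true denotes the ground-truth FR/NFR labels that are kept aside for evaluation only, and the variable names reuse those from the earlier sketches.</p>
        <preformat><![CDATA[
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def map_clusters_to_labels(cluster_ids, y_true):
    """Assign to every cluster the ground-truth class that dominates among its members."""
    y_pred = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        values, counts = np.unique(y_true[members], return_counts=True)
        y_pred[members] = values[np.argmax(counts)]
    return y_pred

y_true = balanced["label"].map({"NFR": 0, "FR": 1}).to_numpy()
y_pred = map_clusters_to_labels(kmeans_labels, y_true)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-Score: ", f1_score(y_true, y_pred))
]]></preformat>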
      </sec>
      <sec id="sec-3-5">
        <title>3.4. Experimental environment</title>
        <p>The experiments were implemented in the Python programming language (version 3.x). The Hugging
Face Transformers library (version 4.x) based on the PyTorch framework (version 2.x) was used for
working with language models (roberta-base). The implementation of clustering algorithms (K-Means,
HDBSCAN), the principal component analysis (PCA) method, and the calculation of quality metrics
(Silhouette Score, F1-Score) were performed using the Scikit-learn library (version 1.x). All computations
were conducted in the Google Colaboratory cloud environment using an NVIDIA Tesla T4 graphics
processing unit (GPU).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <p>This section presents the results of the experimental study, along with their comparative analysis and
discussion. The presentation of the results follows the logic of the research: first, the effectiveness of
the base model is evaluated, followed by the domain-adapted model, after which a deep analysis of its
layers’ informativeness and an error analysis are conducted.</p>
      <sec id="sec-4-1">
        <title>4.1. Effectiveness of clustering with the base model</title>
        <p>The first stage of the research evaluated the quality of unsupervised clustering using embeddings
generated by the “raw,” general-purpose roberta-base model without any fine-tuning. Despite testing
various aggregation and pooling strategies, all of them showed extremely low results, only slightly
exceeding the level of random guessing. The best stable result was obtained using the [CLS] vector from
the 12th layer and the K-Means algorithm, which achieved an F1-Score of ≈ 0.55 . A visual analysis of
this experiment (Figure 3) demonstrates the key problem: although the K-Means algorithm formally
partitions the data into two clusters (Figure 3a), this division does not correspond to the true class
distribution (Figure 3b), as they are heavily intermingled in the vector space. This is confirmed both by
the error analysis (Figure 3c) and the extremely low Truth Silhouette Score (≈ 0.015).</p>
        <p>The model’s inability to form semantically meaningful clusters is further confirmed by the results of
the density-based algorithm HDBSCAN. When applied, about half of all requirements were classified as
“noise” (outliers), indicating the absence of any dense, pronounced structure in the data. These results
are a direct empirical confirmation of the “semantic gap.” The embeddings from the general-purpose
model proved to be unsuitable for the unsupervised separation of software requirements, which justifies
the need for domain adaptation of the model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Impact of domain adaptation on clustering quality</title>
        <p>The second stage of the research evaluated the impact of prior domain adaptation (fine-tuning) of the
model. The results demonstrated a radical improvement across all key metrics. The best strategy, as
shown by further analysis, was the use of the embedding from the final, 12th layer of the fine-tuned
model. With this strategy, an F1-Score of 0.81 was achieved, and the silhouette score for the true
labels (Truth Silhouette Score) increased from a near-zero value (0.015) to 0.37. Figure 4 presents a
visualization of the results for the best configuration (using the 12th layer of the fine-tuned model).</p>
        <p>A comparison of the results before and after fine-tuning clearly demonstrates the effectiveness of
domain adaptation. While the base model at best achieved an F1-Score of only ≈ 0.55 , which barely
exceeds a random level, the fine-tuned model shows a stable result of F1-Score ≈ 0.81 . This quantitative
leap is also reflected in the visual analysis. In contrast to the completely mixed structure of the classes
in the base model (Figure 3), the visualization for the fine-tuned model (Figure 4b) shows the formation
of two distinct, spatially separated groups. The high F1-Score (≈ 0.81 ) confirms that the structure
found by the unsupervised algorithm corresponds with high accuracy to the actual semantic division
of requirements. The error analysis (Figure 4c) additionally shows that misclassifications (red dots)
are predominantly concentrated at the boundary between the two clusters, indicating that the model
makes mistakes only on the most semantically close, “borderline” examples. Thus, it can be concluded
that domain adaptation is an effective method for overcoming the “semantic gap,” as it allows for the
creation of vector representations that form a high-quality, semantically meaningful, and spatially
separated structure suitable for subsequent unsupervised clustering.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Analysis of the informativeness of the fine-tuned model’s layers</title>
        <p>To understand which layers of the fine-tuned model contribute the most to classification quality,
an individual analysis (“profiling”) of them was conducted. The experiment consisted of sequential
clustering using the [CLS] vector taken separately from each of the 12 encoder layers. The results,
presented in Figure 5, demonstrate a clear trend: the quality of clustering (as measured by the F1-Score)
steadily increases with the layer level. While the lower and middle layers show low effectiveness, a
sharp increase in quality is observed starting from the 9th layer, reaching its maximum at the final, 12th
layer (F1-Score ≈ 0.81).</p>
        <p>In the next step, based on these data, a hypothesis was tested as to whether a combination of the
best layers (9-12) could surpass the result of the best single layer. The experiment showed that the
strategy of aggregating the last four layers (by both averaging and concatenation) achieves an F1-Score
that is statistically identical to the result of the 12th layer alone. The obtained results allow for two
important conclusions. Firstly, the fine-tuning process effectively “specializes” the upper layers of the
model (9-12) for the task of distinguishing FR and NFR, which is confirmed by the consistent growth of
their informativeness. Secondly, a “plateau” effect is observed. The experimental results show that the
final, 12th layer aggregates all the necessary semantic information for classification so completely that
complicating the model by aggregating several layers (even the strongest ones from 9-11) does not lead
to an improvement in the F1-Score. Therefore, using the [CLS] vector from only the 12th layer is the
most justified and efective approach.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Identification of ambiguous requirements</title>
        <p>In addition to the quantitative quality assessment, a deep analysis of the ability of different fine-tuning
approaches to identify groups of semantically ambiguous requirements was conducted. The purpose of
this analysis was to compare how standard classification fine-tuning and the more advanced contrastive
learning handle the task of detecting “problematic” requirements. As shown in the previous sections,
standard fine-tuning for classification achieves a high F1-Score; however, the error analysis (Figure 4c)
shows that the “incorrect” predictions are located in a zone of strong class overlap, and the density-based
algorithm HDBSCAN does not identify a significant number of “outliers.”</p>
        <p>The situation changes radically with the application of contrastive learning. Although the final
F1-Score remains at the same high level (≈ 0.80 ), the structure of the vector space undergoes qualitative
changes. Contrastive learning forces the model to form very dense and compact clusters for typical FR
and NFR, “pushing out” all intermediate and anomalous cases into the space between them. As a result,
as shown in Figure 6, the HDBSCAN algorithm now easily identifies a large and clearly defined group
of 129 “outliers,” which are located precisely in the “zone of semantic ambiguity” between the two main
classes.</p>
        <p>The comparative analysis shows that although both fine-tuning methods yield similar final accuracy,
contrastive learning is a significantly more powerful tool for data quality analysis and the identification
of problematic requirements. Unlike standard fine-tuning, which “hides” ambiguous cases within the
main classes, contrastive learning explicitly isolates them as outliers. Thus, it can be concluded that for
the task of not just simple classification, but of deep analysis and improvement of requirements quality,
preference should be given to contrastive learning. It allows for the creation of an automated pipeline
for detecting anomalous and ambiguous requirements, which are the first candidates for review and
clarification by a business analyst, and can significantly improve the quality of specifications in the
early stages of development.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Comparison with the supervised approach</title>
        <p>In the final stage of the research, an “upper bound” of quality that could be achieved on this test corpus
with the fine-tuned model was established. For this, the same fine-tuned roberta-base model was used
as a full-fledged supervised classifier with its classification head. As a result of direct classification on
the test set (2000 requirements), an F1-Score of 0.82 was achieved. This result serves as a benchmark
for the maximum effectiveness of this model and data. Comparing this figure with the best result from
unsupervised clustering (F1-Score ≈ 0.81 ) leads to an extremely important conclusion. The gap between
the supervised and unsupervised approaches is minimal, at only ≈ 1% . This indicates that domain
adaptation (fine-tuning) created such a high-quality and semantically structured vector space that even
a simple unsupervised algorithm, like K-Means, can find the boundary between classes within it almost
as well as a classifier specifically trained for this purpose. Although the supervised approach remains
the “gold standard,” a properly adapted language model can serve as a powerful tool for high-quality
unsupervised classification of new, unlabeled software requirements, which is a very valuable result for
practical application.</p>
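        <p>For completeness, a minimal sketch of how such a supervised benchmark can be obtained with a fine-tuned model and its classification head is given below; the checkpoint path is a placeholder, and the single-batch tokenization is simplified for illustration.</p>
        <preformat><![CDATA[
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "./roberta-base-fr-nfr"   # hypothetical path to the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
classifier = AutoModelForSequenceClassification.from_pretrained(ckpt)
classifier.eval()

with torch.no_grad():
    batch = tokenizer(list(balanced["text"]), padding=True, truncation=True,
                      return_tensors="pt")
    logits = classifier(**batch).logits

# Direct supervised predictions, to be compared with y_true via f1_score as above
supervised_pred = logits.argmax(dim=-1).numpy()
]]></preformat>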
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work addressed the problem of low efficiency in the unsupervised clustering of software
requirements into functional (FR) and non-functional (NFR) types when using general-purpose language
models. The research aimed to determine how effectively domain adaptation via fine-tuning could
improve the quality of vector representations to solve this task. To achieve this goal, a comparative
analysis of two states of the roberta-base model was conducted: a base model and a model fine-tuned on
domain-specific data. The results showed that fine-tuning is a critically important step, which increased
the clustering quality from an F1-Score of ≈ 0.55 to ≈ 0.81 . Further analysis of the fine-tuned model’s
layers revealed that using the embedding from the final, 12th layer is the optimal strategy, and more
complex aggregation methods do not lead to improved results. Furthermore, it was established that the
quality of unsupervised clustering on the fine-tuned model (F1 ≈ 0.81 ) approaches the “upper bound”
set by the supervised classifier (F1 ≈ 0.82).</p>
      <p>The main contribution of the work is a comparative analysis of two fine-tuning paradigms. It has been
shown that while standard classification fine-tuning and contrastive learning yield similar classification
accuracy, contrastive learning is a significantly more powerful tool for data quality analysis and the
identification of problematic requirements. Unlike the standard approach, which “masks” ambiguous
cases, contrastive learning allows for their explicit isolation as “outliers,” which is valuable for practical
application in requirements engineering. The solution to the problem of identifying requirements
using deep learning also correlates with the Sustainable Development Goals (SDGs), specifically Goal
9 (Industry, Innovation and Infrastructure), by increasing the efficiency of creating reliable software
solutions.</p>
      <p>The limitations of this study include the use of a single language model architecture (roberta-base) and
specific datasets for training and testing. Furthermore, the work was conducted only on requirements
written in English. The most promising direction for future research is enriching the model with
linguistic information, for instance, through Multi-Task Learning by adding an auxiliary task of
part-of-speech tagging. Future work may also include testing the proposed approaches on other language
model architectures and for other natural languages.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors employed Generative AI tools to revise the final version
of the manuscript. Specifically, Gemini 2.5 Pro and Grammarly were utilized to improve grammar,
spelling, and overall readability. After using these tools, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <article-title>Identifying Functional and Non-functional Software Requirements from User App Reviews and Requirements Artifacts, Master's thesis</article-title>
          , Montclair State University, Montclair, NJ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , G. Xu,
          <article-title>Software requirements classification with deep learning: A bibliometric review</article-title>
          ,
          <source>Journal of Systems and Software</source>
          (
          <year>2024</year>
          ). Preprint.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Krak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kondratiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radiuk</surname>
          </string-name>
          ,
          <article-title>Analysis of deep learning methods in adaptation to the small data problem solving</article-title>
          , in: S.
          <string-name>
            <surname>Babichev</surname>
          </string-name>
          , V. Lytvynenko (Eds.),
          <source>Proceedings of the International Scientific Conference "Intellectual Systems of Decision Making and Problem of Computational Intelligence" (ISDMCI</source>
          <year>2022</year>
          ), volume
          <volume>149</volume>
          <source>of Lecture Notes on Data Engineering and Communications Technologies</source>
          , Springer, Cham,
          <year>2023</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>352</lpage>
          . doi:10.1007/978-3-031-16203-9_20.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dias Canedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Cordeiro</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <article-title>Software requirements classification using machine learning algorithms</article-title>
          ,
          <source>IEEE Latin America Transactions</source>
          <volume>22</volume>
          (
          <year>2020</year>
          )
          <fpage>1057</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Baskoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Andrahsmara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R. P.</given-names>
            <surname>Darnoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Tofan</surname>
          </string-name>
          ,
          <article-title>A systematic comparison of software requirements classification</article-title>
          ,
          <source>Telkomnika</source>
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>184</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <article-title>A deep learning framework for non-functional requirement classification</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nakirijja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Samikannu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L. D.</given-names>
            <surname>Bulay-og</surname>
          </string-name>
          ,
          <article-title>Enhancing software requirements classification: Integrating recurrent neural networks and natural language processing for managing structural complexity</article-title>
          ,
          <source>International Journal of Intelligent Systems and Applications in Engineering</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>3110</fpage>
          -
          <lpage>3121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Agilitest</surname>
          </string-name>
          , How to test
          <source>non functional requirements?</source>
          ,
          <year>2025</year>
          . URL: https://www.agilitest.com/cards/nfr-to-test.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Khayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jamasb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Akbari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shamsinejadbabaki</surname>
          </string-name>
          ,
          <article-title>Deep learning methods for software requirement classification: A performance study on the pure dataset</article-title>
          ,
          <source>in: Proceedings of the International Conference on Software Engineering</source>
          , Shiraz, Iran,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <article-title>Hybrid deep learning approach for nonfunctional software requirements classifications</article-title>
          , in: 2023 International Conference on Communication, Computing and
          <string-name>
            <given-names>Digital</given-names>
            <surname>Systems (C-CODE)</surname>
          </string-name>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Prcbert:
          <article-title>Prompt learning for requirement classification using bertbased pretrained language models</article-title>
          ,
          <source>in: 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22)</source>
          , Rochester, MI, USA,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3551349.3560417.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yucalar</surname>
          </string-name>
          ,
          <article-title>Developing an advanced software requirements classification model using bert: An empirical evaluation study on newly generated turkish data</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>11127</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Applying the bert algorithm for software requirements classification</article-title>
          ,
          <source>in: Applying the Bert Algorithm for Software Requirements Classification</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          . doi:10.13140/RG.2.2.33317.68320.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsimenidis</surname>
          </string-name>
          ,
          <article-title>Limitations of deep neural networks: A discussion of G. Marcus' critical appraisal of deep learning</article-title>
          , arXiv preprint arXiv:2012.15754 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Almonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Boominathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          ,
          <article-title>Automated non-functional requirements generation in software engineering with large language models: A comparative study</article-title>
          ,
          <source>in: CASCON</source>
          <year>2025</year>
          ,
          <article-title>Pennsylvania</article-title>
          , USA,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guirguis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Basta</surname>
          </string-name>
          ,
          <article-title>Enhancing software requirements engineering with language models and prompting techniques: Insights from the current research and future directions, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume
          <volume>4</volume>
          : Student Research Workshop), Vienna, Austria,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . URL: https://openreview.net/pdf?id=0cxmpFiTsY.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klymenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Akan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          ,
          <article-title>The language of engineering: Training a domain-specific word embedding model for engineering</article-title>
          ,
          <source>in: Proceedings of the 2021 3rd International Conference on Management Science and Industrial Engineering</source>
          , Osaka, Japan, 2021, pp. 8–12. URL: https://dl.acm.org/doi/10.1145/3460824.3460826.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Phatak, How to choose the right embedding model?, 2021. URL: https://medium.com/nerd-for-tech/how-to-choose-the-right-embedding-model-487deee9d4d7.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Tiger Data, General-purpose vs. domain-specific embedding models, 2024. URL: https://www.tigerdata.com/blog/general-purpose-vs-domain-specific-embedding-models.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. Lassner, S. Brandl, A. Baillot, S. Nakajima, Domain-specific word embeddings with structure prediction, Transactions of the Association for Computational Linguistics 11 (2023) 320–335.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online and Punta Cana, Dominican Republic, 2021, pp. 6894–6910. doi:10.18653/v1/2021.emnlp-main.552.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Lima, V. Valle, E. Costa, F. Lira, B. Gadelha, Software engineering repositories: Expanding the PROMISE database, in: Proceedings of the XXXIII Brazilian Symposium on Software Engineering (SBES '19), Salvador, Brazil, 2019, pp. 427–436. doi:10.1145/3350768.3350776.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. Sonali, S. Thamada, FR_NFR_dataset, 2024.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] R. J. G. B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Berlin, Heidelberg, 2013, pp. 160–172.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>