<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Barcelona, Catalunya, Spain, April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>User Stories to Domain Models: Recommending Relationships between Entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxim Bragilovski</string-name>
          <email>maximbr@post.bgu.ac.il</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabiano Dalpiaz</string-name>
          <email>f.dalpiaz@uu.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnon Sturm</string-name>
          <email>sturm@bgu.ac.il</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information and Computing Sciences, Utrecht University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>17</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>User stories are a common notation for expressing requirements, especially in agile development projects. While user stories provide a detailed account of the functional requirements, they fail to deliver a holistic view of the domain. As such, they can be complemented with domain models that not only help gain this comprehensive view, but also serve as a basis for model-driven development. We focus on the task of recommending relationships between entities in a domain model, assuming that these entities were previously extracted from a user story collection either manually or through an automated tool. We investigate whether an approach based on supervised machine learning can recommend essential relationships in a domain model more accurately than state-of-the-art rule-based methods. Based on a collection of datasets that we manually labeled and a set of 32 features we engineered, we train a machine learning model by using a random forest classifier. The results indicate that our approach has higher precision and F1-score than the baseline rule-based methods. Our findings provide preliminary evidence of the suitability of using machine learning to support the development of domain models, especially in recommending relationships between related entities.</p>
      </abstract>
      <kwd-group>
        <kwd>Requirements Engineering</kwd>
        <kwd>Conceptual Modeling</kwd>
        <kwd>Domain Models</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Model Derivation</kwd>
        <kwd>Entities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        User stories are a widespread notation for expressing functional requirements from the
perspective of a user [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite their popularity and simplicity, each user story describes an
individual feature of the system, thereby making it hard for an analyst to obtain a holistic view
of the system domain. As a solution, researchers have investigated the automated and manual
derivation of different types of conceptual or domain models from user stories [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        A conceptual model is a graphical representation of static phenomena (such as entities and
relationships) as well as dynamic phenomena (such as events and processes) in some domain [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Conceptual models can be used to illustrate the functionality of a system, such as use case
diagrams. Furthermore, they may be used to provide a holistic view of the main entities and
relationships that appear in the requirements [
        <xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>
        ]. These models can be used as a basis for
identifying ambiguities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], for analyzing qualities such as security and privacy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and as a
starting point for model-driven engineering.
      </p>
      <p>
        Conceptual and domain model development is a challenging activity, which requires the
identification of the important concepts (in the case of a structural model, entities) and their
relationships. To do so, it is important to distinguish between the essential concepts in a domain
and the secondary ones. Furthermore, the resources used to develop the conceptual model (i.e.,
the requirements) make use of ambiguous terms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, as the complexity of the system
increases, it becomes more time consuming for humans to derive these models.
      </p>
      <p>
        To address the challenges of developing conceptual and domain models, several solutions
exist, including guidelines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and automatic approaches [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The existing automated methods
are rule-based; this limits their effectiveness to those linguistic patterns that the researchers
encoded into the rules. In contrast, methods that rely on guidelines for humans [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are time
consuming and do not achieve perfect accuracy either.
      </p>
      <p>In our research agenda, we aim to build machine and deep learning models for deriving a
domain model from a collection of user stories. A domain model should contain the entities
and relationships that represent the domain of the system that implements the user stories.
This model can serve as a basis for model-driven development, e.g., via low-code development
platforms. Thus, the automated derivation could increase the usefulness of user stories by
reducing the gap between requirements and the following development activities.</p>
      <p>
        In this paper, we present initial results on the automated derivation of a conceptual model.
As the current automated state-of-the-art method, the Visual Narrator [
        <xref ref-type="bibr" rid="ref2">2</xref>
], is more effective
at identifying entities than relationships, we choose the relationship identification task as our
first research step. We propose a machine-learning-based model that recommends essential
relationships between the entities that are derived from a set of user stories. Our research
question is as follows: Does a machine-learning-based approach outperform rule-based
state-of-the-art methods for identifying relationships between the entities extracted from user stories?
      </p>
      <p>The results reported in this paper positively answer that question and demonstrate the
advantages of using machine learning for the task at hand. In particular, we make the following
contributions: (i) we describe a novel approach, based on 32 features, for recommending essential
relationships using a machine learning model; and (ii) we compare our machine learning model
to current automated models.</p>
      <p>Paper organization. In Section 2, we discuss the background and related studies. In Section 3,
we present the research method and we describe our proposed approach. In Section 4, we report
on the preliminary results and discuss the limitations. Finally, in Section 5, we conclude and
set plans for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Deriving conceptual models automatically from natural language requirements has been a
research topic for quite some time [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Even so, despite Mike Cohn’s book on user stories [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
that contributed to the popularity of user stories, it was only in 2016 that Robeer et al. [
        <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
        ] made
the first major attempt at extracting conceptual models from user stories.
      </p>
      <p>Since then, research about deriving models from user stories started to emerge. In this section,
we review related studies by referring, when applicable, to the different types of models that
are extracted, the method through which the model is derived, the experimental setting and
datasets, the metrics, and the performance that was achieved.</p>
      <p>
        Elallaoui et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use part-of-speech tagging to identify whether certain keywords should
represent entities or relationships, and this information is used to generate use case diagrams.
Their approach is evaluated via precision and recall. They compare the outcomes with models
that were created manually from the WebCompany dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The results demonstrate that
their plugin has acceptable precision and recall for detecting actors, and high results (above
0.85 for both metrics) for detecting use cases and relationships. While they derive a use case
diagram, we are interested in generating domain models that require a holistic view.
      </p>
      <p>
        Similarly, the recent studies that extract class diagrams automatically also rely on the
part-of-speech tagging of terms within user stories. Lucassen et al. [
        <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
        ] propose an automated
approach, based on the Visual Narrator tool, for extracting structural conceptual models (i.e.,
class diagrams) from a set of user stories. The Visual Narrator was used to generate conceptual
models from user stories based on 11 out of 23 identified heuristics from the literature. Using
precision, recall and F1-score metrics, they determined whether their tool was successful in
identifying entities and relationships compared to gold-standard models that were created by
the authors of the paper. The approach achieved good precision (97%) and recall (98%), with a
lower bound of 88% recall and 92% precision. These results, however, are obtained by assessing
the tool’s performance against a human execution of the algorithm, rather than against models
that are created by humans based on their own rationale.
      </p>
      <p>
        Typically, automated model derivation from user stories is done using rule-based methods
based on natural language processing heuristics. Although these works achieved good precision
and recall despite limitations of user stories like ambiguity [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], they cannot be perfectly
accurate due to the variety of linguistic patterns that natural language allows. Furthermore,
they are limited to the lexicon they identified and cannot perform the abstraction process that
is crucial to conceptual models.
      </p>
      <p>
        Approaches that derive domain models from other formats of requirements also exist. The
most relevant work is that by Arora and colleagues [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who use heuristics to create a first
version of a domain model and then apply active learning to remove superfluous elements. We
also use machine learning, but rather than pruning elements, we focus on enriching a model by
suggesting essential relationships.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Method and Proposed Approach</title>
      <p>
        The task of this research – illustrated by the mock-up of Figure 1 – consists of recommending
relationships among the entities in a domain model. We assume these entities have been
extracted previously from a collection of user stories, either manually [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or through an automated
tool such as the Visual Narrator [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given a collection of user stories (in the figure, regarding
Planning Poker), selected entities, and a probability threshold, the tool suggests relationships
whose probability to exist is higher than the set threshold, and then visualizes the resulting
domain model with those relationships.
      </p>
      <p>
        To develop our ML technique for such a tool, we followed a common machine learning
method [
        <xref ref-type="bibr" rid="ref14">14</xref>
], which consists of five steps. Dataset preparation, based on which we created
a gold standard model for each set of user stories. Feature engineering, where features are
created to facilitate the recommendation of relationships between two entities. Baselines and
alternatives selection, in order to compare our method to the current state-of-the-art approaches.
Choice of a machine learning algorithm from the current state-of-the-art families of machine
learning models (e.g., decision trees, random forests). Metrics selection, to determine against
which criteria we compare the performance of the different approaches.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Data preparation</title>
        <p>
          As there is no benchmark dataset of user stories with an associated class diagram, we developed
such a dataset. Indeed, the existing gold standards for the datasets used with the Visual
Narrator [
          <xref ref-type="bibr" rid="ref2">2</xref>
] are not suitable, as the identified relationships are meant for navigating through the
user stories rather than for representing the domain. Therefore, we first selected 7 sets of user
stories from an online collection of user stories data sets [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Next, for each set of stories, we
developed a conceptual model. During this process, we had to answer the following questions:
(1) What entities are of interest to find relationships between? (2) What are the relationships
that we want the model to recommend? Table 1 shows descriptive information about the seven
datasets.
        </p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Entities Extraction</title>
          <p>
            The first step is the identification of the entities from the user stories. To do so, we first used
the Visual Narrator tool [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] with its default parameters. Since the entities that the Visual
Narrator returns include both domain terms as well as technical concepts that would not be
part of a domain model, we manually filtered its outputs by retaining only those entities that
we considered to be part of the domain, thereby excluding technical terms that pertain to the
solution. We acknowledge that some entities may have been overlooked because of this filtering.
However, as this paper focuses on detecting the relationships between pre-defined entities, the
omission of entities should not affect our analysis of the recommended relationships between
the existing entities.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. A gold standard for relationships between entities</title>
          <p>After extracting the entities from the set of user stories, we developed a dataset that contains all
the possible relationships that might exist (i.e., all pairs of entities). As a next step, each author
of this paper tagged each relationship independently as follows:
1. Essential: it is required in the domain model for implementing the user stories in the
collection;
2. Optional: it may or may not exist because of the existence of other relationships;
3. Unnecessary: it should not be part of the domain model.</p>
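          <p>The enumeration of candidate relationships described above can be sketched as follows; a minimal illustration (the entity names are invented for the example) in which every unordered pair of extracted entities becomes a candidate to be tagged:</p>

```python
from itertools import combinations

def candidate_relationships(entities):
    """Return all unordered pairs of entities as candidate relationships."""
    return list(combinations(sorted(entities), 2))

# Hypothetical entities extracted from a user story collection.
entities = ["game", "player", "estimate", "session"]
pairs = candidate_relationships(entities)
# 4 entities yield C(4, 2) = 6 candidate pairs to tag.
```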
          <p>
            Next, we measured the inter-rater agreement using the Fleiss Kappa [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], which is a statistical
measure specifically designed to handle categorical data and to handle more than two raters. We
checked the agreement in two ways: (i) binary, where we consider only strong disagreements if
the three tags include at least one essential and at least one unnecessary; and (ii) multi-class,
where we consider disagreements even when considering the optional class.
          </p>
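          <p>Fleiss' Kappa over the three raters' tags can be computed as in the following self-contained sketch (not the authors' script); each item is represented by the count of raters per category:</p>

```python
def fleiss_kappa(tables):
    """Fleiss' kappa for a list of per-item category counts.

    tables[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(tables)
    n_raters = sum(tables[0])
    # Observed agreement per item, averaged over all items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in tables
    ) / n_items
    # Chance agreement from the marginal category proportions.
    n_categories = len(tables[0])
    p_j = [sum(row[j] for row in tables) / (n_items * n_raters)
           for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three raters, two classes (essential / not essential), four relationships.
kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [3, 0]])
```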
          <p>
            Afterward, we held a discussion that eventually led to the gold standards which can be found
in [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. We decided that the gold standard should include relationships on which we have a
high agreement. This is intended to minimize the chance of false positives in our gold standard.
Because of our high agreement of more than 0.6, we chose the binary classification as the
gold-standard, as this improves our identification of true positives: essential relationships (class
1) and all others (class 0).
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature engineering</title>
        <p>We engineered a set of features to characterize each pair of entities (e_1, e_2), which the ML
model uses to learn which relationships are essential and which are unnecessary. Based on
this, the trained ML model can recommend essential relationships for unseen pairs of entities.</p>
        <p>
          We engineered the features based on rules for relationship identification [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as well as on
additional insights we gained after exploring the data. We denote sim as a function that calculates
the similarity between two words or sentences. sim_g (global similarity) implements that function
using a pre-trained model from NLTK, and sim_l (local similarity) implements that function using
a word2vec model from gensim.
        </p>
        <p>
          The user story datasets were used as our corpus to train the gensim model, which resulted
in an embedding vector for each entity that can be used to calculate cosine similarity (the
code to train the model appears in our online appendix [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]). We represent the dataset as
follows: D = {(r_0, x_0, y_0), ..., (r_n, x_n, y_n)}, where r_i = (e_i1, e_i2) is a relationship between two
entities e_i1 and e_i2, x_i is a vector of the features’ values, and y_i is the target label (essential or
unnecessary relationship). We also denote D′ = {r′_0, ..., r′_m} as an external dataset that contains
all the relationships r′ = (e′_1, e′_2) between two entities from different existing domain model
repositories (we used the ModelSet repository [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]). The similarity between two relationships
r = (e_1, e_2) and r′ = (e′_1, e′_2) is calculated as follows:
rel_sim_k(r, r′) = (sim_k(e_1, e′_1) + sim_k(e_2, e′_2)) / 2 (1)
where k denotes the way the similarity sim is calculated: k ∈ {g, l}. Each r = (e_1, e_2) ∈ D is
associated with a vector of values for the engineered features listed in Table 2.
        </p>
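        <p>Equation 1 can be illustrated with a small sketch; cosine similarity stands in for the sim function here, and the entity vectors are invented stand-ins for the NLTK/gensim embeddings:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rel_sim(r, r_prime, sim=cosine):
    """Equation 1: average the pairwise entity similarities of two relationships."""
    (e1, e2), (f1, f2) = r, r_prime
    return (sim(e1, f1) + sim(e2, f2)) / 2

# Toy embeddings (invented for the example).
game, match = [1.0, 0.0], [1.0, 0.0]    # identical vectors -> similarity 1
player, user = [0.0, 1.0], [0.0, 1.0]   # identical vectors -> similarity 1
score = rel_sim((game, player), (match, user))
```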
        <p>These features were assigned based on the following rationale. Each of Features 1–3
considers external sources: we search in ModelSet and define features with the similarity value of
those relationships that have the highest similarity with the examined relationship, applying
Equation 1. We expect that if other models link entities that are similar to ours, our approach
will also recommend a relationship. Each of Features 4–9 performs a similar analysis, but based
on each individual entity. Feature 10 calculates the average of Features 1–3. Each of Features
11–18 characterizes individual entities by counting how many times an entity appears in the
user stories (11–12) and whether it appears in the role, action, or benefit part of at least one
user story (13–18). Each of Features 19–20 calculates the similarity between the entities using
gensim and NLTK. Feature 21 counts the number of user stories in which both entities co-occur.
Each of Features 22–23 performs a similar calculation but considers only co-occurrences where
at most 3 or 5 words, respectively, appear between the entities. Each of Features 24–27 is a
binary value that is true when both entities are identified as either subject or object in at least
one user story. Feature 28 is true if there is a user story where an ‘and’ or an ‘or’ word appears
between the two entities. Feature 29 (common_friends) counts the different nouns that appear
both in a user story where e_1 occurs and in a user story where e_2 occurs. Features 30–32
normalize the number of user stories where at least one of the entities occurs over the number
of user stories where both occur.</p>
        <p>Table 2 lists the features used by our machine-learning model for a relationship
r = (e_1, e_2) ∈ D.</p>
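        <p>Features 21–23 and the normalization of Features 30–32 can be sketched as follows; a simplified illustration (whitespace tokenization, single-word entities) rather than the authors' actual feature extractor:</p>

```python
def cooccurrence_features(stories, e1, e2, windows=(3, 5)):
    """Feature 21: stories where e1 and e2 co-occur; Features 22-23: co-occurrences
    within a window of at most w words; Features 30-32 (simplified): ratio of
    stories with both entities to stories with at least one of them."""
    both = 0
    near_counts = {w: 0 for w in windows}
    at_least_one = 0
    for story in stories:
        tokens = story.lower().split()
        if e1 in tokens or e2 in tokens:
            at_least_one += 1
        if e1 in tokens and e2 in tokens:
            both += 1
            # Number of words strictly between the two entities.
            gap = abs(tokens.index(e1) - tokens.index(e2)) - 1
            for w in windows:
                if gap <= w:
                    near_counts[w] += 1
    ratio = both / at_least_one if at_least_one else 0.0
    return both, near_counts, ratio

# Invented mini user-story corpus.
stories = [
    "as a player i want to join a game",
    "as a player i want to leave the session",
    "as a moderator i want to start a game",
]
both, near, ratio = cooccurrence_features(stories, "player", "game")
```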
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Settings</title>
        <p>
          To select relevant machine learning models, we distinguish between two types of models:
shallow and deep. Shallow models such as decision trees are better suited for small, structured
datasets. In contrast, deep models are better suited for large NLP and vision datasets. We do
not select deep models because NLP-for-RE tasks like ours rely on small datasets [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Thus, we
opt for a shallow model in the form of Random Forest (RF), a state-of-the-art technique that
achieved the best results in some software-engineering-related tasks [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>
          We report the results in terms of the commonly used metrics of precision, recall, and F1 score.
Our statistical analysis, however, focuses only on precision and F1 score. We choose precision
as we assume it might be more helpful to humans than recall in a recommendation scenario
like the one sketched in Figure 1, where having a smaller set of essential links without many
unnecessary relationships creates less noise for the analyst than having all the essentials with
many unnecessary ones. We also analyze the F1-score because it balances both precision and
recall, thereby penalizing recommendations that provide a too limited number of essential links.
We acknowledge that these are preliminary metrics that we use for an early assessment of
our approach; future work should determine the most suitable metric based on an in-practice
analysis of the impact of different types of errors [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
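          <p>The three metrics follow directly from the confusion counts; this is the standard formulation, not code from the paper:</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 6 essential relationships recommended correctly, 2 spurious, 4 missed.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```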
        <p>
          We compare the performance of the RF classifier against the Visual Narrator [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and a naive
approach in which an essential relationship is suggested every time two entities appear in the
same user story. As the Visual Narrator did not identify all the entities in the gold standard
model, we omitted these entities from the evaluation. This is done since we are only assessing
the ability to predict relationships between entities. Also, we defined the threshold that the
ML model uses to discriminate between the two classes: essential or unnecessary. Since the RF
classifier returns a probability of a relationship being essential and the dataset is unbalanced, it
is not reasonable to set the threshold to 0.5. After checking several thresholds, we found that a
threshold of 0.8 provides reliable results.
        </p>
        <p>We evaluate the performance of the RF classifier, the Visual Narrator, and the Naive approach
using the seven datasets presented in Table 1. We apply the leave-one-out evaluation method:
all datasets except one are used for training the model, and we report the performance on the
remaining dataset.</p>
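        <p>The leave-one-out protocol over the datasets, combined with the 0.8 probability threshold, can be sketched as follows; the probability scores are invented stand-ins for the Random Forest outputs, which are not reimplemented here:</p>

```python
def leave_one_out(datasets):
    """Yield (held_out_name, train_samples, test_samples) for each dataset."""
    for name in datasets:
        train = [s for other, samples in datasets.items()
                 if other != name for s in samples]
        yield name, train, datasets[name]

def recommend(probabilities, threshold=0.8):
    """Keep only relationships whose predicted probability exceeds the threshold."""
    return [rel for rel, p in probabilities if p > threshold]

# Invented probabilities standing in for the RF classifier's outputs.
scores = [(("player", "game"), 0.93), (("player", "point"), 0.41)]
essential = recommend(scores)
```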
        <p>To compare whether the differences in the metrics are significant between the approaches
(independent variables), we selected F1-score and precision (dependent variables) as the metrics for
the significance check. We set the following null hypotheses:
• The three F1-score/precision values of the Naive approach, the Visual Narrator, and the RF
classifier are the same (H0-F1-score and H0-Precision).</p>
        <p>
          The experiment materials can be found in an online appendix [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Validation</title>
      <p>In this section, we report on the results of the preliminary validation we conducted according
to the method described in Section 3.3.</p>
      <sec id="sec-4-1">
        <title>4.1. Descriptive Statistics</title>
        <p>Table 3 presents the results of the experiment. The Dataset column refers to the set of user
stories. We report on the results of the three alternatives: the Naive, Visual Narrator, and RF
classifier. For each alternative, we present the precision, recall, and F1-score. The bottom
row of the table represents the macro-average for each column. The numbers in bold indicate
the best results of the F1-score for a given user story dataset.</p>
        <p>In most datasets, using the RF classifier leads to better F1-scores. Particularly, it achieved
superior results in 4 out of 7 datasets. The RF classifier achieved an average F1-score of 0.589, the
Naive approach achieved 0.565, and the Visual Narrator only achieved 0.266. In addition, we observe
that the RF classifier achieves better precision than the other alternatives in 6 out of 7 datasets.</p>
        <p>We conducted statistical tests with α = 0.05 to determine if the differences are statistically
significant. We applied the Friedman test [22], a non-parametric statistical test, to compare more
than two methods. We found statistically significant differences among the approaches,
with p = 0.01 for both F1-score and precision. Therefore, we can reject the H0-F1-score and
H0-Precision hypotheses. To check which alternative is better, we applied Nemenyi’s post-hoc
test [23], and we calculated effect size using Cohen’s d. We found that: (1) the RF classifier is
statistically better than the Visual Narrator (p = 0.042 and p = 0.02) with effect sizes of 1.564
and 1.591; (2) there is no statistically significant difference between the RF classifier and Naive
(p = 0.9 and p = 0.608) with effect sizes of 0.042 and 0.997; (3) there is no statistically significant
difference between Naive and the Visual Narrator (p = 0.111 and p = 0.191) with effect sizes of
1.518 and 0.877 for F1-score and precision, respectively.</p>
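        <p>The Friedman test ranks the approaches within each dataset and compares rank sums; the following is a minimal tie-free sketch of the test statistic (the paper presumably used a statistics library, and tie correction is omitted here):</p>

```python
def friedman_statistic(scores):
    """Friedman chi-square for scores[dataset][method], assuming no ties.

    Higher scores get lower (better) ranks;
    chi2 = 12 * sum(R_j^2) / (N k (k+1)) - 3 N (k+1).
    """
    n = len(scores)       # number of datasets (blocks)
    k = len(scores[0])    # number of compared methods
    rank_sums = [0.0] * k
    for row in scores:
        # Rank methods within this dataset: best score gets rank 1.
        order = sorted(range(k), key=lambda j: -row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Invented F1-scores per dataset for (Naive, Visual Narrator, RF classifier).
f1_scores = [[0.55, 0.30, 0.60], [0.50, 0.25, 0.65], [0.58, 0.20, 0.62]]
chi2 = friedman_statistic(f1_scores)
```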
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Discussion and Limitations</title>
        <p>
          The results answer our research question positively, as they indicate that using an ML-based
model (the RF classifier) for the relationship recommendation task leads to higher F1-score and
precision than the rule-based alternatives (Naive and Visual Narrator). Furthermore, the RF
classifier also returns the probability of each relationship occurring, providing extra information
for the user to make the final decision. The preliminary results require additional validation,
such as defining the most suitable metrics by analyzing the relative impact of Type 1 and Type
2 errors [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], by estimating human-achievable performance, and by assessing the necessary effort
(time). We could not estimate the human-achievable performance on our datasets, as we were
already familiar with some of them from previous research, and due to the iterative approach
used to construct the gold standard. Lastly, the selection of the datasets may be biased;
although they differ in the number of samples (pairs of entities) and in the distribution of features
and classes, we need to experiment with other datasets to draw more robust conclusions.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>We have presented an ML-based model for recommending relationships between the entities of
conceptual models that are derived from a set of user stories.</p>
      <p>Rule-based approaches and guidelines were suggested for deriving conceptual models from
user stories. They achieve good accuracy in recognizing entities but fall short in finding
relationships between these entities. Here, we provide initial evidence that an ML-based approach
improves the current state-of-the-art methods for recommending relationships between entities.</p>
      <p>This work calls for further improvements. The ML-based models can be extended to suggest
a complete conceptual model (entities, attributes, and relationships), and a
better evaluation could compare the tool’s performance with that of human analysts.</p>
      <p>[22] D. W. Zimmerman, B. D. Zumbo, Relative power of the Wilcoxon test, the Friedman test,
and repeated-measures ANOVA on ranks, The Journal of Experimental Education 62 (1993) 75–86.</p>
      <p>[23] D. G. Pereira, A. Afonso, F. M. Medeiros, Overview of Friedman’s test and post-hoc analysis,
Communications in Statistics-Simulation and Computation 44 (2015) 2636–2653.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. E. van der</given-names>
            <surname>Werf</surname>
          </string-name>
          , S. Brinkkemper,
          <article-title>The use and effectiveness of user stories in practice</article-title>
          ,
          <source>in: Proc. of REFSQ</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Robeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. E. van der</given-names>
            <surname>Werf</surname>
          </string-name>
          , S. Brinkkemper,
          <article-title>Extracting Conceptual Models from User Stories with Visual Narrator</article-title>
          ,
          <source>Requirements Engineering</source>
          <volume>22</volume>
          (
          <year>2017</year>
          )
          <fpage>339</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sturm</surname>
          </string-name>
          ,
          <article-title>On deriving conceptual models from user requirements: An empirical study</article-title>
          ,
          <source>Inf. Softw. Technol</source>
          .
          <volume>131</volume>
          (
          <year>2021</year>
          )
          <fpage>106484</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Research commentary: Information systems and conceptual modeling - a research agenda</article-title>
          ,
          <source>Information Systems Research</source>
          <volume>13</volume>
          (
          <year>2002</year>
          )
          <fpage>363</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabetzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nejati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <article-title>An Active Learning Approach for Improving the Accuracy of Automated Domain Model Extraction</article-title>
          ,
          <source>ACM TOSEM</source>
          <volume>28</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          , I. van der Schalk,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brinkkemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Aydemir</surname>
          </string-name>
          , G. Lucassen,
          <article-title>Detecting Terminological Ambiguity in User Stories: Tool and Experiment</article-title>
          ,
          <source>Information and Software Technology</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. X.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goknil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Shar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pastore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaame</surname>
          </string-name>
          ,
          <article-title>Modeling Security and Privacy Requirements: a Use Case-Driven Approach</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>100</volume>
          (
          <year>2018</year>
          )
          <fpage>165</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bragilovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sturm</surname>
          </string-name>
          ,
          <article-title>Guided derivation of conceptual models from user stories: A controlled experiment</article-title>
          ,
          <source>in: Proc. of REFSQ</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Labiche</surname>
          </string-name>
          ,
          <article-title>aToucan: An Automated Framework to Derive UML Analysis Models from Use Case Models</article-title>
          ,
          <source>ACM TOSEM</source>
          <volume>24</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <article-title>User stories applied: For agile software development</article-title>
          , Addison-Wesley Professional,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Robeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. E.</given-names>
            <surname>van der Werf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          , S. Brinkkemper,
          <article-title>Automated extraction of conceptual models from user stories via NLP</article-title>
          , in: RE, IEEE,
          <year>2016</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elallaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nafil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Touahni</surname>
          </string-name>
          ,
          <article-title>Automatic transformation of user stories into UML use case diagrams using NLP techniques</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>130</volume>
          (
          <year>2018</year>
          )
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Amna</surname>
          </string-name>
          , G. Poels,
          <article-title>Ambiguity in user stories: A systematic literature review</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>145</volume>
          (
          <year>2022</year>
          )
          <fpage>106824</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          ,
          <article-title>Understanding machine learning:</article-title>
          <source>From theory to algorithms</source>
          , Cambridge University Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <article-title>Requirements data sets (user stories)</article-title>
          ,
          <source>Mendeley Data, V1, doi: 10.17632/7zbk8zsd8y.1</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Fleiss</surname>
          </string-name>
          ,
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .,
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          (
          <year>1971</year>
          )
          <fpage>378</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bragilovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sturm</surname>
          </string-name>
          ,
          <article-title>Experimental material - from user stories to domain models: Recommending relationships between entities</article-title>
          ,
          <source>Mendeley Data, V1, doi: 10.17632/tvjyw4pzsk.1</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. A. H.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. Cánovas</given-names>
            <surname>Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Cuadrado</surname>
          </string-name>
          ,
          <article-title>Modelset: a dataset for machine learning in model-driven engineering</article-title>
          ,
          <source>Soft. and Systems Modeling</source>
          <volume>21</volume>
          (
          <year>2022</year>
          )
          <fpage>967</fpage>
          -
          <lpage>986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Franch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Palomares</surname>
          </string-name>
          ,
          <article-title>Natural Language Processing for Requirements Engineering: The Best Is Yet to Come</article-title>
          ,
          <source>IEEE Software</source>
          <volume>35</volume>
          (
          <year>2018</year>
          )
          <fpage>115</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Falessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cleland-Huang</surname>
          </string-name>
          ,
          <article-title>Leveraging historical associations between requirements and source code to identify impacted classes</article-title>
          ,
          <source>IEEE TSE</source>
          <volume>46</volume>
          (
          <year>2018</year>
          )
          <fpage>420</fpage>
          -
          <lpage>441</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Berry</surname>
          </string-name>
          ,
          <article-title>Empirical evaluation of tools for hairy requirements engineering tasks</article-title>
          ,
          <source>Empirical Software Engineering</source>
          <volume>26</volume>
          (
          <year>2021</year>
          )
          <fpage>111</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] D. W. Zimmerman, B. D. Zumbo, Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks, The Journal of Experimental Education 62 (1993) 75-86.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] D. G. Pereira, A. Afonso, F. M. Medeiros, Overview of Friedman’s test and post-hoc analysis, Communications in Statistics-Simulation and Computation 44 (2015) 2636-2653.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>