<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">From User Stories to Domain Models: Recommending Relationships between Entities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Maxim</forename><surname>Bragilovski</surname></persName>
							<email>maximbr@post.bgu.ac.il</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Information Systems Engineering</orgName>
								<orgName type="institution">Ben-Gurion University of the Negev</orgName>
								<address>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabiano</forename><surname>Dalpiaz</surname></persName>
							<email>f.dalpiaz@uu.nl</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Information and Computing Sciences</orgName>
								<orgName type="institution">Utrecht University</orgName>
								<address>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arnon</forename><surname>Sturm</surname></persName>
							<email>sturm@bgu.ac.il</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Software and Information Systems Engineering</orgName>
								<orgName type="institution">Ben-Gurion University of the Negev</orgName>
								<address>
									<country key="IL">Israel</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">From User Stories to Domain Models: Recommending Relationships between Entities</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">44B9C36D7D8004F2F961968E3183938B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Requirements Engineering</term>
					<term>Conceptual Modeling</term>
					<term>Domain Models</term>
					<term>Machine Learning</term>
					<term>Model Derivation</term>
					<term>ORCID 0000-0002-4778-7897 (M. Bragilovski)</term>
					<term>ORCID 0000-0003-4480-3887 (F. Dalpiaz)</term>
					<term>ORCID 0000-0002-4021-7752 (A. Sturm)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>User stories are a common notation for expressing requirements, especially in agile development projects. While user stories provide a detailed account of the functional requirements, they fail to deliver a holistic view of the domain. As such, they can be complemented with domain models that not only help gain this comprehensive view, but also serve as a basis for model-driven development. We focus on the task of recommending relationships between entities in a domain model, assuming that these entities were previously extracted from a user story collection either manually or through an automated tool. We investigate whether an approach based on supervised machine learning can recommend essential relationships in a domain model more accurately than state-of-the-art rule-based methods. Based on a collection of datasets that we manually labeled and a set of 32 features we engineered, we train a machine learning model using a random forest classifier. The results indicate that our approach has higher precision and F1-score than the baseline rule-based methods. Our findings provide preliminary evidence of the suitability of using machine learning to support the development of domain models, especially in recommending relationships between related entities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>User stories are a widespread notation for expressing functional requirements from the perspective of a user <ref type="bibr" target="#b0">[1]</ref>. Despite their popularity and simplicity, each user story describes an individual feature of the system, thereby making it hard for an analyst to obtain a holistic view of the system domain. As a solution, researchers have investigated the automated and manual derivation of different types of conceptual or domain models from user stories <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>A conceptual model is a graphical representation of static phenomena (such as entities and relationships) as well as dynamic phenomena (such as events and processes) in some domain <ref type="bibr" target="#b3">[4]</ref>. Conceptual models, such as use case diagrams, can be used to illustrate the functionality of a system. Furthermore, they may be used to provide a holistic view of the main entities and relationships that appear in the requirements <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b1">2]</ref>. These models can be used as a basis for identifying ambiguities <ref type="bibr" target="#b5">[6]</ref>, for analyzing qualities such as security and privacy <ref type="bibr" target="#b6">[7]</ref>, and as a starting point for model-driven engineering.</p><p>Conceptual and domain model development is a challenging activity, which requires the identification of the important concepts (in the case of a structural model, entities) and their relationships. To do so, it is important to distinguish between the essential concepts in a domain and the secondary ones. Furthermore, the resources used to develop the conceptual model (i.e., the requirements) make use of ambiguous terms <ref type="bibr" target="#b5">[6]</ref>. 
Moreover, as the complexity of the system increases, it becomes more time-consuming for humans to derive these models.</p><p>To address the challenges of developing conceptual and domain models, several solutions exist, including guidelines <ref type="bibr" target="#b7">[8]</ref> and automatic approaches <ref type="bibr" target="#b1">[2]</ref>. The existing automated methods are rule-based; this limits their effectiveness to those linguistic patterns that the researchers encoded into the rules. In contrast, methods that rely on guidelines for humans <ref type="bibr" target="#b7">[8]</ref> are time-consuming and do not achieve perfect accuracy either.</p><p>In our research agenda, we aim to build machine and deep learning models for deriving a domain model from a collection of user stories. A domain model should contain the entities and relationships that represent the domain of the system that implements the user stories. This model can serve as a basis for model-driven development, e.g., via low-code development platforms. Thus, the automated derivation could increase the usefulness of user stories by reducing the gap between requirements and the following development activities.</p><p>In this paper, we present initial results on the automated derivation of a conceptual model. As the current automated state-of-the-art method, the Visual Narrator <ref type="bibr" target="#b1">[2]</ref>, is more effective at identifying entities than relationships, we choose the relationship identification task as our first research step. We propose a machine learning-based model that recommends essential relationships between the entities that are derived from a set of user stories. 
Our research question is as follows: Does a machine-learning-based approach outperform rule-based state-of-the-art methods for identifying relationships between the entities extracted from user stories?</p><p>The results reported in this paper positively answer that question and demonstrate the advantages of using machine learning for the task at hand. In particular, we make the following contributions: (i) we describe a novel approach, based on 32 features, for recommending essential relationships using a machine learning model; and (ii) we compare our machine learning model to current automated models.</p><p>Paper organization. In Section 2, we discuss the background and related studies. In Section 3, we present the research method and we describe our proposed approach. In Section 4, we report on the preliminary results and we discuss the limitations. Finally, in Section 5, we conclude and set plans for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Deriving conceptual models automatically from natural language requirements has been a research topic for quite some time <ref type="bibr" target="#b8">[9]</ref>. Even though Mike Cohn's book on user stories <ref type="bibr" target="#b9">[10]</ref> contributed to their popularity, it was only in 2016 that Robeer et al. <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b1">2]</ref> performed the first major attempt at extracting conceptual models from user stories.</p><p>Since then, research on deriving models from user stories started to emerge. In this section, we review related studies by referring, when applicable, to the different types of models that are extracted, the method through which the model is derived, the experimental setting and datasets, the metrics, and the performance that was achieved.</p><p>Elallaoui et al. <ref type="bibr" target="#b11">[12]</ref> use part-of-speech tagging to identify whether certain keywords should represent entities or relationships, and this information is used to generate use case diagrams. Their approach is evaluated via precision and recall. They compare the outcomes with models that were created manually from the WebCompany dataset <ref type="bibr" target="#b10">[11]</ref>. The results demonstrate that their plugin has acceptable precision and recall for detecting actors, and high results (above 0.85 for both metrics) for detecting use cases and relationships. While they derive a use case diagram, we are interested in generating domain models that require a holistic view.</p><p>Similarly, the recent studies that extract class diagrams automatically also rely on the part-of-speech tagging of terms within user stories. Lucassen et al. 
<ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b1">2]</ref> propose an automated approach, based on the Visual Narrator tool, for extracting structural conceptual models (i.e., class diagrams) from a set of user stories. The Visual Narrator was used to generate conceptual models from user stories based on 11 out of 23 identified heuristics from the literature. Using precision, recall, and F1-score metrics, they determined whether their tool was successful in identifying entities and relationships compared to gold-standard models that were created by the authors of the paper. The approach achieved good precision (97%) and recall (98%), with a lower bound of 88% recall and 92% precision. These results, however, are obtained by assessing the tool's performance against a human execution of the algorithm, rather than against models that are created by humans based on their own rationale.</p><p>Typically, automated model derivation from user stories is done using rule-based methods based on natural language processing heuristics. Although these works achieved good precision and recall despite limitations of user stories like ambiguity <ref type="bibr" target="#b12">[13]</ref>, they cannot be perfectly accurate due to the variety of linguistic patterns that natural language allows. Furthermore, they are limited to the lexicon they identified and cannot perform the abstraction process that is crucial to conceptual models.</p><p>Approaches that derive domain models from other formats of requirements also exist. The most relevant work is that by Arora and colleagues <ref type="bibr" target="#b4">[5]</ref>, who use heuristics to create a first version of a domain model and then apply active learning to remove superfluous elements. We also use machine learning, but rather than pruning elements, we focus on enriching a model by suggesting essential relationships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Research Method and Proposed Approach</head><p>The task of this research -illustrated by the mock-up of Figure <ref type="figure" target="#fig_0">1</ref> -consists of recommending relationships among the entities in a domain model. We assume these entities have been extracted previously from a collection of user stories, either manually <ref type="bibr" target="#b7">[8]</ref> or through an automated tool such as the Visual Narrator <ref type="bibr" target="#b1">[2]</ref>. Given a collection of user stories (in the figure, regarding Planning Poker), selected entities, and a probability threshold, the tool suggests relationships whose probability of existing is higher than the set threshold, and then visualizes the resulting domain model with those relationships. To develop our ML technique for such a tool, we followed a common machine learning method <ref type="bibr" target="#b13">[14]</ref>, which consists of five steps. Dataset preparation, based on which we created a gold standard model for each set of user stories. Feature engineering, where features are created to facilitate the recommendation of relationships between two entities. Baselines and alternatives selection, in order to compare our method to the current state-of-the-art approaches. Choice of a machine learning algorithm from the current state-of-the-art families of machine learning models (e.g., decision trees, random forest). Metrics selection, to determine against which criteria we compare the performance of different approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data preparation</head><p>As there is no benchmark dataset of user stories with an associated class diagram, we developed such a dataset. Indeed, the existing gold standards for the datasets used with the Visual Narrator <ref type="bibr" target="#b1">[2]</ref> are not suitable, as the identified relationships are meant to navigate through the user stories, rather than for representing the domain. Therefore, we first selected 7 sets of user stories from an online collection of user story datasets <ref type="bibr" target="#b14">[15]</ref>. Next, for each set of stories, we developed a conceptual model. During this process, we had to answer the following questions:</p><p>(1) Between which entities are we interested in finding relationships? (2) Which relationships do we want the model to recommend? Table <ref type="table">1</ref> shows descriptive information about the seven datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Entities Extraction</head><p>The first step is the identification of the entities from the user stories. To do so, we first used the Visual Narrator tool <ref type="bibr" target="#b1">[2]</ref> with its default parameters. Since the entities that the Visual Narrator returns include both domain terms and technical concepts that would not be part of a domain model, we manually filtered its outputs by retaining only those entities that we considered to be part of the domain, thereby excluding technical terms that pertain to the solution. We acknowledge that some entities may have been overlooked because of this filtering. However, as this paper focuses on detecting the relationships between pre-defined entities, the omission of entities should not affect our analysis of the recommended relationships between the existing entities.</p></div>
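The filtering step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the entity names and the keep-list are hypothetical examples.

```python
# Illustrative sketch: filtering entities produced by the Visual Narrator
# against a manually curated list of domain terms (the keep-list stands in
# for the analyst's manual judgment).

def filter_entities(extracted, domain_terms):
    """Keep only extracted entities that the analyst marked as domain terms."""
    keep = {t.lower() for t in domain_terms}
    return [e for e in extracted if e.lower() in keep]

# Raw tool output often mixes domain terms with technical/solution terms.
raw = ["user", "database", "sprint", "button", "estimate"]
domain = ["user", "sprint", "estimate"]          # analyst-approved keep-list
print(filter_entities(raw, domain))              # ['user', 'sprint', 'estimate']
```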
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Descriptive information about the gold standards for the employed datasets, showing the number of user stories, entities, and relationships, the percentage of essential relationships (#𝑅𝑒𝑙 / ((#𝐸𝑛𝑡 × (#𝐸𝑛𝑡 − 1))/2)), and the percentage of entities that co-occur in at least one user story. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DataSet</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">A gold standard for relationships between entities</head><p>After extracting the entities from the set of user stories, we developed a dataset that contains all the possible relationships that might exist (i.e., all pairs of entities). As a next step, each author of this paper tagged each relationship independently as follows:</p><p>1. Essential: it is required in the domain model for implementing the user stories in the collection; 2. Optional: it may or may not exist because of the existence of other relationships; 3. Unnecessary: it should not be part of the domain model.</p><p>Next, we measured the inter-rater agreement using Fleiss' Kappa <ref type="bibr" target="#b15">[16]</ref>, a statistical measure designed for categorical data and for more than two raters. We checked the agreement in two ways: (i) binary, where we consider only strong disagreements, i.e., cases where the three tags include at least one essential and at least one unnecessary; and (ii) multi-class, where we consider disagreements even when considering the optional class.</p><p>Afterward, we held a discussion that eventually led to the gold standards, which can be found in <ref type="bibr" target="#b16">[17]</ref>. We decided that the gold standard should include only relationships on which we have a high agreement. This is intended to minimize the chance of false positives in our gold standard. Because our agreement exceeded 0.6, we chose the binary classification for the gold standard, as this improves our identification of true positives: essential relationships (class 1) and all others (class 0).</p></div>
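The inter-rater agreement measure used above can be computed with a short, self-contained implementation of Fleiss' kappa. This is an illustration, not the authors' code; rows are items (candidate relationships) and columns are per-category rating counts, e.g. [essential, unnecessary] under the binary scheme.

```python
# Minimal Fleiss' kappa over a count table: table[i][j] is the number of
# raters that assigned category j to item i.

def fleiss_kappa(table):
    n_items = len(table)
    n_raters = sum(table[0])                     # raters per item (constant)
    # Per-item observed agreement P_i
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in table]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three raters with perfect agreement on four relationships yields kappa = 1.0
table = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(round(fleiss_kappa(table), 3))             # 1.0
```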
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Feature engineering</head><p>We engineered a set of features ([𝑥 𝑖 ]) to characterize each pair of entities (𝑒 𝑖 1 , 𝑒 𝑖 2 ), which the ML model uses to learn which relationships are essential and which are unnecessary (𝑦 𝑖 ). Based on this, the trained ML model can recommend essential relationships on unseen pairs of entities.</p><p>We engineered features based on rules for relationship identification <ref type="bibr" target="#b1">[2]</ref> as well as on additional insights we gained after exploring the data. We denote sim as a function that calculates the similarity between two words or sentences. sim 𝑔 (global similarity) implements that function using a pre-trained model from nltk<ref type="foot" target="#foot_0">1</ref>, and sim 𝑙 (local similarity) implements it using a word2vec model from gensim<ref type="foot" target="#foot_1">2</ref>.</p><p>The user story datasets were used as our corpus to train the gensim model, which resulted in an embedding vector for each entity that can be used to calculate cosine similarity (the code to train the model appears in our online appendix <ref type="bibr" target="#b16">[17]</ref>). We represent the dataset as follows: 𝒟 = {(𝑟 0 , x 0 , 𝑦 0 ), ..., (𝑟 𝑛 , x n , 𝑦 𝑛 )} where 𝑟 𝑖 = (𝑒 𝑖 1 , 𝑒 𝑖 2 ) is a relationship between two entities 𝑒 𝑖 1 and 𝑒 𝑖 2 , x i is a vector of the features' values, and 𝑦 𝑖 is the target label (essential or unnecessary relationship). We also denote 𝒟 ′ = {𝑟 ′ 0 , ..., 𝑟 ′ 𝑛 } as an external dataset that contains all the relationships between two entities 𝑟 ′ 𝑖 = (𝑒 ′ 𝑖 1 , 𝑒 ′ 𝑖 2 ) from different existing domain model repositories (we used the ModelSet repository <ref type="bibr" target="#b17">[18]</ref><ref type="foot" target="#foot_2">3</ref>). 
The similarity between two relationships 𝑟 𝑖 = (𝑒 𝑖 1 , 𝑒 𝑖 2 ) and 𝑟 𝑗 = (𝑒 𝑗 1 , 𝑒 𝑗 2 ) is calculated as follows:</p><formula xml:id="formula_0">rel_sim 𝑥 (𝑟 𝑖 , 𝑟 𝑗 ) = (sim 𝑥 (𝑒 𝑖 1 , 𝑒 𝑗 1 ) + sim 𝑥 (𝑒 𝑖 2 , 𝑒 𝑗 2 )) / 2<label>(1)</label></formula><p>where 𝑥 indicates the way the similarity sim is calculated: 𝑥 ∈ {𝑔, 𝑙}. Each 𝑟 𝑖 = (𝑒 𝑖 1 , 𝑒 𝑖 2 ) ∈ 𝒟 is associated with a vector of values for the engineered features listed in Table <ref type="table" target="#tab_1">2</ref>. These features were defined based on the following rationale. Each of Features 1-3 considers external sources; we search in ModelSet and define features with the similarity value of those relationships that have the highest similarity with the examined relationship, applying Equation <ref type="formula" target="#formula_0">1</ref>. We expect that if other models link entities that are similar to ours, our approach will also recommend a relationship. Each of Features 4-9 does a similar analysis but based on each individual entity. Feature 10 calculates the average of Features 1-3. Each of Features 11-18 characterizes individual entities by counting how many times an entity appears in the user stories (Features 11-12) and whether it appears in the actor, action, or benefit part of at least one user story (Features 13-18). Each of Features 19-20 calculates the similarity between the entities using gensim and NLTK. Feature 21 determines the number of user stories where both entities co-occur. Each of Features 22-23 does a similar calculation but considers only co-occurrences where at most 3 or 5 words exist between the entities. 
Each of Features 24-27 is a binary value that is true when both entities are identified as either subject or object in at least one user story. Feature 28 is true if there is a user story where there is an 'and' or an 'or' word between the two entities. Feature 29 counts the nouns that appear in a user story that includes 𝑒 𝑖 1 and in a user story that includes 𝑒 𝑖 2 . Features 30-32 normalize the number of user stories where at least one entity occurs over the number of user stories where both occur. </p></div>
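To illustrate how Equation 1 combines the entity-level similarities, the following sketch uses a hand-rolled cosine similarity over toy embedding vectors. In the paper's setting, sim 𝑔 comes from nltk and sim 𝑙 from a gensim word2vec model trained on the user story corpus; the entity names and vectors below are hypothetical.

```python
# Sketch of Equation 1 (rel_sim) with a plain cosine similarity.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rel_sim(r_i, r_j, emb, sim=cosine):
    """Equation 1: the average of the two pairwise entity similarities."""
    (e_i1, e_i2), (e_j1, e_j2) = r_i, r_j
    return (sim(emb[e_i1], emb[e_j1]) + sim(emb[e_i2], emb[e_j2])) / 2

# Hypothetical embeddings: 'user'/'member' point one way, 'account'/'profile' another.
emb = {"user": (1.0, 0.0), "member": (1.0, 0.0),
       "account": (0.0, 1.0), "profile": (0.0, 1.0)}
print(rel_sim(("user", "account"), ("member", "profile"), emb))  # 1.0
```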
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation Settings</head><p>To select relevant machine learning models, we distinguish between two types of models: shallow and deep. Shallow models such as decision trees are better suited for small, structured datasets. In contrast, deep models are better suited for large NLP and vision datasets. We do not select deep models because NLP-for-RE tasks like ours rely on small datasets <ref type="bibr" target="#b18">[19]</ref>. Thus, we opt for a shallow model in the form of Random Forest (RF), a state-of-the-art technique that achieved the best results in some software-engineering-related tasks <ref type="bibr" target="#b19">[20]</ref>. We report the results in terms of the commonly used metrics of precision, recall, and F1-score. Our statistical analysis, however, focuses only on precision and F1-score. We choose precision as we assume it might be more helpful to humans than recall in a recommendation scenario like the one sketched in Figure <ref type="figure" target="#fig_0">1</ref>, where having a smaller set of essential links without many unnecessary relationships creates less noise for the analyst than having all the essentials with many unnecessary ones. We also analyze the F1-score because it balances both precision and recall, thereby penalizing recommendations that provide a too limited number of essential links. We acknowledge that these are preliminary metrics that we use for an early assessment of our approach; future work should determine the most suitable metric based on an in-practice analysis of the impact of different types of errors <ref type="bibr" target="#b20">[21]</ref>.</p><p>We compare the performance of the RF classifier against the Visual Narrator [2] and a naive approach in which an essential relationship is suggested every time two entities appear in the same user story. 
As the Visual Narrator did not identify all the entities in the gold standard model, we omitted these entities from the evaluation. We did so because we are only assessing the ability to predict relationships between entities. Also, we defined the threshold that the ML model uses to discriminate between the two classes: essential or unnecessary. Since the RF classifier returns a probability of a relationship being essential and the dataset is unbalanced, it is not reasonable to set the threshold to 0.5. After checking several thresholds, we found that a threshold of 0.8 provides reliable results. We evaluate the performance of the RF classifier, the Visual Narrator, and the Naive approach using the seven datasets presented in Table <ref type="table">1</ref>. We apply the leave-one-out evaluation method: all datasets except one are used for training the model, and we report the performance on the remaining dataset.</p><p>To test whether the differences in the metrics between the approaches (independent variable) are significant, we selected the F1-score and precision as dependent variables. We set the following (null) hypothesis:</p><p>• The three F1-scores/precision values of the naive approach, the Visual Narrator, and the RF classifier are the same (𝐻 0 EXP-F-score and 𝐻 0 EXP-Precision).</p><p>The experiment materials can be found in an online appendix <ref type="bibr" target="#b16">[17]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A mock-up that illustrates the task where this research fits: the recommendation of relationships between entities extracted from a collection of user stories.</figDesc><graphic coords="4,110.13,84.33,374.89,152.85" type="bitmap" /></figure>
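The threshold-based recommendation and the leave-one-dataset-out loop can be sketched with scikit-learn as follows. The feature matrix, labels, group assignment, and hyperparameters (e.g., n_estimators) are synthetic stand-ins, not the paper's data or configuration; only the 0.8 threshold comes from the text.

```python
# Sketch: leave-one-dataset-out evaluation of a random forest that recommends
# a relationship only when P(essential) exceeds a threshold.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_features, n_datasets = 140, 32, 7
X = rng.random((n, n_features))                  # stand-in for the 32 features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)        # synthetic "essential" labels
groups = np.repeat(np.arange(n_datasets), n // n_datasets)

threshold = 0.8                                  # the paper's chosen cut-off
for held_out in range(n_datasets):
    train, test = groups != held_out, groups == held_out
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train], y[train])
    proba = clf.predict_proba(X[test])[:, 1]     # P(relationship is essential)
    pred = (proba >= threshold).astype(int)      # recommend only confident links
    print(f"held-out dataset {held_out}: {pred.sum()} of {test.sum()} recommended")
```

Raising the threshold trades recall for precision, which matches the paper's preference for a small, low-noise set of recommended links.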
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Features used by our machine-learning model for a relationship 𝑟 𝑖 = (𝑒 𝑖 1 , 𝑒 𝑖 2 ) ∈ 𝒟 Ext._entity 𝑡 _sim 𝑔 _𝑘 For both entities in 𝑟 𝑖 , the global similarity with the corresponding entity in the 𝑘 most similar relationships in 𝒟 ′ : sim 𝑔 (𝑒 𝑖 𝑡 , 𝑒 ′ Ext._rel_sim 𝑔 _1 , Ext._rel_sim 𝑔 _2, Ext._rel_sim 𝑔 _3) 11-12 appear 𝑡 Number of appearances of 𝑒 𝑖 𝑡 in the user stories, for each 𝑡 ∈ {1, 2} 13-14 actor 𝑡 1 if 𝑒 𝑡 appears in the role part of at least one user story, otherwise 0 15-16 action 𝑡 1 if 𝑒 𝑡 appears in the action part of 1+ user story, otherwise 0 17-18 benefit 𝑡 1 if 𝑒 𝑡 appears in the benefit part of 1+ user story, otherwise 0 19-20 sim 𝑥 𝑠𝑖𝑚 𝑥 (𝑒 𝑖 1 , 𝑒 𝑖 2 ), for 𝑥 ∈ {𝑔, 𝑙} 21 both The number of user stories in which 𝑒 𝑖 1 and 𝑒 𝑖 2 co-occur 22-23 window 𝑧 The number of user stories where 𝑒 𝑖 1 and 𝑒 𝑖 2 co-occur with less than 𝑧 ∈ {3, 5} words in between them 24-27 sub/obj_sub/obj 1 if 𝑒 𝑖 1 is identified as {𝑠𝑢𝑏𝑗𝑒𝑐𝑡, 𝑜𝑏𝑗𝑒𝑐𝑡} and 𝑒 𝑖 2 is identified as {𝑠𝑢𝑏𝑗𝑒𝑐𝑡, 𝑜𝑏𝑗𝑒𝑐𝑡} in 1+ user story, otherwise 0 28 and_or_btw 1 if there is a user story where 'and' or 'or' appeared between 𝑒 𝑖 1 and 𝑒 𝑖 2 29 common_friends Number of different nouns that appear both in a user story where 𝑒 𝑖 1 occurs and in a user story where 𝑒 𝑖 2 occurs 30-32 both 𝑤 Number of user stories where 𝑤 ∈ {𝑒 𝑖 1 , 𝑒 𝑖 2 , 𝑒 𝑖 1 ∨ 𝑒 𝑖 2 } occur divided by feature 21</figDesc><table><row><cell>ID</cell><cell>Feature</cell><cell>Description</cell></row><row><cell>1-3</cell><cell>Ext._rel_sim 𝑔 _𝑘</cell><cell>The 𝑘 highest (𝑘 ∈ {1, 2, 3}) relationship similarity values between 𝑟 𝑖 and any of the relationships in ModelSet (𝑟 ′ 𝑗 ∈ 𝒟 ′ )</cell></row><row><cell>4-9</cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell>𝑗 𝑡 ) (𝑡 ∈</cell></row><row><cell></cell><cell></cell><cell>{1, 
2})</cell></row><row><cell>10</cell><cell>sim_avg_3_rel</cell><cell>average(</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://nltk.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://pypi.org/project/gensim/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://modelset.github.io/</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Preliminary Validation</head><p>In this section, we report on the results of the preliminary validation we conducted according to the method described in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Descriptive Statistics</head><p>Table <ref type="table">3</ref> presents the results of the experiment. The Dataset column refers to the set of user stories. We report on the results of the three alternatives: the Naive, Visual Narrator, and RF classifier. For each alternative, we present the precision, recall, and the F1-score. The bottom row of the table represents the macro-average for each column. The numbers in bold indicate the best F1-score results for a given user story dataset.</p><p>In most datasets, using the RF classifier leads to better F1-scores. In particular, it achieved superior results in 4 out of 7 datasets. The RF classifier achieved an average F1-score of 0.589, the Naive approach achieved 0.565, and the Visual Narrator only 0.266. In addition, we observe that the RF classifier has better precision than the other alternatives in 6 out of 7 datasets.</p><p>We conducted statistical tests with alpha = 0.05 to determine if the differences are statistically significant. We applied the Friedman test <ref type="bibr" target="#b21">[22]</ref>, a non-parametric statistical test, to compare more than two methods. We found statistically significant differences among the related approaches with 𝑝 = 0.01 for both F1-score and precision. Therefore, we can reject the 𝐻 0 EXP-F-score and 𝐻 0 EXP-Precision hypotheses. To check which alternative is better, we applied Nemenyi's post-hoc test <ref type="bibr" target="#b22">[23]</ref>, and we calculated effect size using Cohen's d. We found that: (1) the RF classifier is statistically better than the Visual Narrator (𝑝 = 0.042 and 𝑝 = 0.02) with effect sizes of 1.564 </p></div>
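The omnibus comparison above can be reproduced with scipy's Friedman test over the per-dataset scores of the three approaches. The score values below are made-up placeholders for illustration, not the paper's measurements.

```python
# Friedman test: one score per dataset (7 datasets) for each of the three
# approaches; rejects H0 (equal performance) when p falls below alpha = 0.05.
from scipy.stats import friedmanchisquare

naive = [0.55, 0.60, 0.52, 0.58, 0.57, 0.61, 0.53]
visual_narrator = [0.25, 0.30, 0.22, 0.28, 0.27, 0.31, 0.23]
random_forest = [0.60, 0.65, 0.57, 0.63, 0.62, 0.66, 0.58]

stat, p = friedmanchisquare(naive, visual_narrator, random_forest)
print(f"chi2={stat:.3f}, p={p:.4f}")
```

A significant omnibus result would then be followed by a post-hoc test (such as Nemenyi's) to identify which pairs of approaches differ.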
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Discussion and Limitations</head><p>The results positively answer our research question, as they indicate that using an ML-based model (RF classifier) for the relationship recommendation task leads to a higher F1-score and precision than the rule-based alternatives (Naive and Visual Narrator). Furthermore, the RF classifier also returns the probabilities of the recommended relationships, providing extra information for the user to make the final decision. The preliminary results require additional validation, such as defining the most suitable metrics by analyzing the relative impact of Type 1 and Type 2 errors <ref type="bibr" target="#b20">[21]</ref>, estimating the human-achievable performance, and assessing the necessary effort (time). We could not estimate the human-achievable performance on our datasets because we were already familiar with some of them from previous research and because the gold standard was constructed iteratively. Lastly, the selection of the datasets may be biased; although they differ in the number of samples (pairs of entities) and in the distribution of features and classes, we need to experiment with other datasets to draw more robust conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>We have presented an ML-based model for recommending relationships between the entities of conceptual models derived from a set of user stories. Rule-based approaches and guidelines have been proposed for deriving conceptual models from user stories; they achieve good accuracy in recognizing entities but fall short in finding the relationships between these entities. Here, we provide initial evidence that an ML-based approach improves on the current state-of-the-art methods for recommending relationships between entities.</p><p>This work calls for further improvements. The ML-based models can be extended to suggest a complete conceptual model (entities, attributes, and relationships), and the evaluation can be strengthened by comparing the tool's performance with that of human analysts.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The use and effectiveness of user stories in practice</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lucassen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M E</forename><surname>Van Der Werf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brinkkemper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of REFSQ</title>
				<meeting>of REFSQ</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="205" to="222" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Extracting Conceptual Models from User Stories with Visual Narrator</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lucassen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Robeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M E</forename><surname>Van Der Werf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brinkkemper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Requirements Engineering</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="339" to="358" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">On deriving conceptual models from user requirements: An empirical study</title>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sturm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">131</biblScope>
			<biblScope unit="page">106484</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Research commentary: Information systems and conceptual modeling - a research agenda</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Systems Research</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="363" to="376" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An Active Learning Approach for Improving the Accuracy of Automated Domain Model Extraction</title>
		<author>
			<persName><forename type="first">C</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sabetzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nejati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Briand</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TOSEM</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Detecting Terminological Ambiguity in User Stories: Tool and Experiment</title>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Van Der Schalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brinkkemper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">B</forename><surname>Aydemir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lucassen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Modeling Security and Privacy Req.: a Use Case-Driven Approach</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">X</forename><surname>Mai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goknil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Shar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pastore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Briand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaame</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">100</biblScope>
			<biblScope unit="page" from="165" to="182" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Guided derivation of conceptual models from user stories: A controlled experiment</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bragilovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sturm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of REFSQ</title>
				<meeting>of REFSQ</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="131" to="147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">aToucan: An Automated Framework to Derive UML Analysis Models from Use Case Models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Yue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Briand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Labiche</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TOSEM</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">User stories applied: For agile software development</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cohn</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
			<publisher>Addison-Wesley Professional</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Automated extraction of conceptual models from user stories via NLP</title>
		<author>
			<persName><forename type="first">M</forename><surname>Robeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lucassen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M E</forename><surname>Van Der Werf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brinkkemper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of RE</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="196" to="205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Automatic transformation of user stories into UML use case diagrams using NLP techniques</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elallaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Nafil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Touahni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procedia computer science</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="42" to="49" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Ambiguity in user stories: A systematic literature review</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Amna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Poels</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and Software Technology</title>
		<imprint>
			<biblScope unit="volume">145</biblScope>
			<biblScope unit="page">106824</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Shalev-Shwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ben-David</surname></persName>
		</author>
		<title level="m">Understanding machine learning: From theory to algorithms</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Requirements data sets (user stories)</title>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<idno type="DOI">10.17632/7zbk8zsd8y.1</idno>
	</analytic>
	<monogr>
		<title level="j">Mendeley Data</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Measuring nominal scale agreement among many raters</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Fleiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological bulletin</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="page">378</biblScope>
			<date type="published" when="1971">1971</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Bragilovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sturm</surname></persName>
		</author>
		<idno type="DOI">10.17632/tvjyw4pzsk.1</idno>
		<ptr target="http://dx.doi.org/10.17632/tvjyw4pzsk.1" />
		<title level="m">Experimental material -from user stories to domain models: Recommending relationships between entities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Modelset: a dataset for machine learning in model-driven engineering</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A H</forename><surname>López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Cánovas Izquierdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Cuadrado</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Software and Systems Modeling</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="967" to="986" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Natural Language Processing for Requirements Engineering: The Best Is Yet to Come</title>
		<author>
			<persName><forename type="first">F</forename><surname>Dalpiaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ferrari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Franch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Palomares</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="115" to="119" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Leveraging historical associations between requirements and source code to identify impacted classes</title>
		<author>
			<persName><forename type="first">D</forename><surname>Falessi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Roll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cleland-Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TSE</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="420" to="441" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Empirical evaluation of tools for hairy requirements engineering tasks</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Berry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Empirical Software Engineering</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page">111</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Zimmerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Zumbo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Experimental Education</title>
		<imprint>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="page" from="75" to="86" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Overview of Friedman&apos;s test and post-hoc analysis</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Afonso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Medeiros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications in Statistics-Simulation and Computation</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="2636" to="2653" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
