On Identifying Similar User Stories to Support Agile Estimation based on Historical Data

Aleksander Grzegorz Duszkiewicz 1,2, Jacob Glumby Sørensen 1, Niclas Johansen 2, Henry Edison 1 and Thiago Rocha Silva 1

1 The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Campusvej 55, Odense M, 5230, Denmark
2 Morningtrain ApS, Rugårdsvej 55A 1 tv, Odense C, 5000, Denmark

Abstract
Accurate and reliable effort and cost estimation is still a challenging activity for agile teams. It is argued that leveraging historical data on the actual time spent on similar past projects could be very helpful to support such an activity before companies embark upon a new project. In this position paper, we discuss our initial efforts towards a software tool that retrieves similar user stories from a database of past projects to support developers in estimating the effort and cost of developing similar new projects. In close collaboration with a Danish software development company, we developed a tool that employs Natural Language Processing (NLP) algorithms to find similar past user stories and retrieve the actual time spent on them. Developers can then use the actual implementation time of these stories as the new estimate, or use the value to support their decision on a different estimate. A preliminary evaluation of the tool's performance in terms of precision and recall showed that it is capable of finding similar stories fairly well, but also that different phrasings and wordings can have a high impact on similarity scores.

Keywords
User Stories, Agile Estimation, Natural Language Processing

Agil-ISE 2022: Intl. Workshop on Agile Methods for Information Systems Engineering, June 06, 2022, Leuven, Belgium
Email: alek@morningtrain.dk (A. G. Duszkiewicz); jacso18@student.sdu.dk (J. G. Sørensen); nj@morningtrain.dk (N. Johansen); hedis@mmmi.sdu.dk (H. Edison); thiago@mmmi.sdu.dk (T. R. Silva)
Web: https://www.sdu.dk/staff/hedis (H. Edison); https://www.sdu.dk/staff/trsi (T. R. Silva)
ORCID: 0000-0001-8961-4663 (T. R. Silva)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Effort and cost estimation is still one of the most critical aspects of agile software development. A fair amount of developers' time is spent on understanding, discussing, and trying to reach a consensus on an accurate estimate for user stories. Such an activity is even costlier for companies that need to budget a software project before even winning a software development contract. Besides that, estimates can easily be misjudged and development can take longer than first anticipated, which can lead to a loss on the project.

Leveraging historical data from similar past projects has been seen as a good strategy to overcome some of these challenges. Having access to the actual time spent to develop user stories in the past can be very helpful to support developers in estimating more accurately the effort required to develop similar user stories at present. In practice, however, it is not trivial to identify such similar stories from past projects, and many agile teams simply do not leverage this asset.
Despite fruitful previous research on Natural Language Processing (NLP) to support various activities in requirements engineering more broadly [1], and on user stories more specifically [2], there has been limited work on models and tools applying NLP to identify user story similarity in order to support the agile estimation process [3]. When it comes to commercial agile estimation tools, the closest one to support this process is DoWork.ai (https://dowork.ai), which leverages artificial intelligence to find similar tasks. When creating a new task, the tool searches for related historical ones and indicates how similar they are. It is not clear, however, which kind of NLP model it employs and how such a model has been trained.

In this position paper, we discuss our ongoing efforts towards the development of a software tool to support this process. The tool has been co-designed with a Danish software company following design science principles. It employs NLP to identify and retrieve similar user stories from past projects and suggests three levels of estimates (most likely, pessimistic, and optimistic) for the new user stories being estimated. Developers can then decide to adopt one of the three suggested estimates, or leverage this information to support the decision on a new estimate based on the particularities of the project at hand. We report our findings from a preliminary evaluation analysing the current performance of the tool in retrieving similar user stories based on text similarity. In the following sections, we briefly introduce the main elements of the tool and discuss both the preliminary results obtained so far and our next steps in this research.

2. The Proposed Software Tool

The proposed tool has been co-designed with Morningtrain (https://morningtrain.dk), a Danish medium-sized full-service digital agency that specializes in web design, programming, and online marketing. The company currently employs a weighted three-point estimation technique based on PERT [4] and would like to keep using it. The tool has been developed using React as the front-end framework and Laravel as the back-end framework. It is a standalone system that connects to Jira using OAuth 2.0 and Webhooks, providing two-way communication of project data. Other technologies used include MySQL, Redis, and the InertiaJS library for building a single-page application (SPA).

Our initial implementation strategy to find similar user stories was based on tagging. In order to lower the number of stories for the system to compare, we first implemented a tag system with the purpose of excluding stories with non-similar tags. Only stories that passed this first tag filtering would be submitted to further NLP analysis. The rationale was that this could help performance as the dataset grows. After experimenting with some data, however, we decided to abandon this strategy and submit all the stories in the dataset to NLP analysis, since we did not notice any relevant performance issues.

When performing NLP, we employed semantic analysis to score semantic similarity among user stories, as literal text matching did not produce good results. We implemented semantic similarity analysis using spaCy (https://spacy.io), which compares two texts through its similarity function, using cosine similarity to calculate a similarity score.
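As a minimal illustration of this step (not the tool's actual code), the snippet below compares two user stories with spaCy's similarity function; the model name en_core_web_md is an assumption, and the example pair is borrowed from the evaluation in Section 3.

```python
import spacy

# Any spaCy pipeline that ships with word vectors exposes the same API;
# en_core_web_md is used here purely as an example.
nlp = spacy.load("en_core_web_md")

story_a = nlp("As a customer, I want to add a product to my wish list, "
              "so that I can find the product next time I visit the page")
story_b = nlp("As a customer, I want to remove a product from my wish list, "
              "so that it does not appear on the list when I visit the page")

# Doc.similarity() returns the cosine similarity of the two document vectors.
print(f"similarity: {story_a.similarity(story_b):.2f}")
```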
While spaCy comes with a selection of standard models that can be used directly, when tested with our database the similarity scores were mostly in the high end regardless of the actual similarity. Faced with this issue, we started looking for a model that could better score semantic similarities of full-text sentences. This led us to opt for the Universal Sentence Encoder (USE) [5], a pre-trained model developed by Google. An open-source implementation of USE [6] made it possible to use the model together with spaCy. USE was trained on a large variety of text and produces high-dimensional vectors, which makes it a good choice for general-purpose tasks such as text classification, semantic similarity, and other natural language tasks. During our tests, the similarity scores with USE were closer to what we were expecting based on a manual analysis of the stories in the database.

The tool compares a new story against all other stories marked with the status "done" in the database. During the comparison process, the user stories are pre-processed in order to remove the template skeleton (i.e., the sub-strings "As a", "I want", and "so that"), since we observed this improves the parsing. The tool currently employs a similarity threshold of 60%, i.e., any user story with a similarity score of 60% (0.6) or above is returned to the user. We use the actual time spent on the highest-scoring user story returned as the most likely estimate for the new similar user story being estimated.

Figure 1 depicts the user interface where the suggested estimates are presented to the user. In the example, two user stories were found similar to the one being specified. The highest-scoring story has a 93% similarity to the one entered. The time spent on this story is used as the most likely estimate for the new user story being estimated. Following the three-point estimation technique employed by the company, the user is also presented with an optimistic and a pessimistic estimate. The optimistic and pessimistic values are calculated based on offset values entered into the system options. The user can choose to manually overwrite the estimates or modify the story's complexity level, which automatically changes the optimistic and pessimistic values. The company's rationale for using the three-point estimate is that these three values can provide a price range that a customer can expect a story to cost. This has significant business value for the company.

Figure 1: Similarity search results.

The user can also access the full list of similar stories (see Figure 2). From there, they can access details such as the match percentage, the title of the story, the user story description, and the time spent on the implementation of the user story in the past. These data can be used to check whether there are any other stories that are actually more similar according to human reasoning. The user can additionally select a specific story from the list of similar ones and access all the details of that story, such as an in-depth story description, technical description, user acceptance criteria, tests, and original estimates. The user can also choose to replace the most likely estimate for the new user story with the actual implementation time of any of the returned user stories.

Figure 2: Similar stories modal that shows user story details.
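To make the comparison step more concrete, the sketch below outlines the logic described above, assuming the spacy-universal-sentence-encoder package [6]. The function names, the story data structure, and the multiplicative offsets are illustrative assumptions of ours; the actual implementation lives in the Laravel back end and the offsets are whatever is configured in the system options.

```python
import re
import spacy_universal_sentence_encoder

# USE-backed spaCy pipeline provided by [6]; 'en_use_md' is one of its model names.
nlp = spacy_universal_sentence_encoder.load_model("en_use_md")

SIMILARITY_THRESHOLD = 0.6  # stories scoring below 60% are not returned

def strip_template(story: str) -> str:
    """Remove the user story skeleton ("As a", "I want", "so that")."""
    return re.sub(r"\b(As an?|I want( to)?|so that)\b", "", story,
                  flags=re.IGNORECASE).strip()

def find_similar(new_story: str, done_stories: list) -> list:
    """Score a new story against all stories marked as 'done'."""
    new_doc = nlp(strip_template(new_story))
    matches = []
    for story in done_stories:  # e.g. {"text": ..., "hours_spent": ...}
        score = new_doc.similarity(nlp(strip_template(story["text"])))
        if score >= SIMILARITY_THRESHOLD:
            matches.append({**story, "score": score})
    return sorted(matches, key=lambda m: m["score"], reverse=True)

def suggest_estimates(matches: list, optimistic=0.8, pessimistic=1.3) -> dict:
    """Most likely = time logged on the best match; the multiplicative
    offsets stand in for the values configured in the system options."""
    if not matches:
        return {}
    most_likely = matches[0]["hours_spent"]
    return {"optimistic": most_likely * optimistic,
            "most_likely": most_likely,
            "pessimistic": most_likely * pessimistic}
```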
This detailed view of similar stories aims to give the user the possibility of making an informed choice. Two user stories can be very similar, but bear different system-dependent challenges, which can have a deep impact on the time needed to implement the same feature.

Finally, we also implemented a Jira integration feature through which a project with all its underlying requirements and stories can be transferred to Jira. Two-way communication is established through Webhooks, i.e., once a project is exported, besides systematically uploading all local changes to Jira, we also listen for event changes within the project on Jira and reflect those changes back to the tool in order to always have the latest updates in our database.

3. Preliminary Evaluation

We performed a preliminary evaluation to determine the accuracy of USE in identifying similar user stories within the tool. The test setup included a database of five user stories from past projects in the company and eight additional stories that were either phrased differently but semantically similar to the original ones, or not semantically similar but using phrasings from the original ones. Each user story was then manually compared to the others by one of the researchers (169 combinations), and then independently reviewed by another researcher. A score of either high or low similarity was recorded for each pair. We started the study with a similarity threshold of 70% as we were aiming at closely related user stories. Each pair of stories was subsequently tested using our tool and the results compared to the manual analysis. The results were additionally cross-checked by two researchers. Table 1 summarizes the results of the evaluation.

Table 1
Tool's performance in the preliminary evaluation. TP refers to True Positives, FP to False Positives, FN to False Negatives, and TN to True Negatives.

Pairs (total)   TP   FP   FN   TN    Precision   Recall   F-measure
169             21    0    4   144        1.00     0.84        0.91

During the test, no false positives were found. There were, however, a few cases that came close to the 70% threshold. For example, the two following user stories scored a similarity of 62%:

"As a customer, I want to add a product to my wish list, so that I can find the product next time I visit the page"

"As a customer, I want to remove a product from my wish list, so that it does not appear on the list when I visit the page"

These sentences do bear some resemblance but do not involve the same operations from an implementation perspective. The sentences are therefore not similar in this context. This suggests that the algorithm is indifferent to the opposite verbs "add" and "remove", or at least that it does not significantly penalize these opposites. This is further underlined by the last phrase in each user story: in the first user story one should be able to find the product, while in the second it is implied that one should not be able to find the product on the list.

Conversely, four false negatives were found, all within the 60-70% range. An example is the following pair of user stories, which scored a similarity of 65%:

"As an administrator, I want to search for a specific student, so that I can find the student records"

"As a student, I want to look up my student records, so that I can get a look at the details"

The two user stories describe very similar features but are phrased differently. While 65% is a good score, it also suggests that there can be some pitfalls when using different words to describe what is likely the same thing.
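For reference, the precision, recall, and F-measure reported in Table 1 follow directly from the confusion-matrix counts; the short check below reproduces the reported values (rounded to two decimals).

```python
tp, fp, fn, tn = 21, 0, 4, 144           # counts from Table 1 (169 pairs in total)

precision = tp / (tp + fp)               # 21 / 21 = 1.00
recall = tp / (tp + fn)                  # 21 / 25 = 0.84
f_measure = 2 * precision * recall / (precision + recall)  # ~0.91

print(f"precision={precision:.2f}, recall={recall:.2f}, f-measure={f_measure:.2f}")
```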
When testing the tool, we found that replacing a context-giving keyword in one of the sentences with a similar word can have a significant impact on the overall sentence similarity score. Our tests showed that similar words that scored low when tested against each other in isolation also lowered the similarity score when used in sentences. As an example, "records" and "details" are probably two different ways of referring to students' personal data in this context. The similarity test of these two words alone returned a low similarity score of 32%, which in turn lowered the overall similarity score of the two sentences. This issue relates more broadly to near-synonymy in user stories, a topic investigated by Dalpiaz et al. [7].

As the system was designed to be a supporting tool for the user to make an informed decision, after this preliminary evaluation we decided to lower the tool's similarity threshold from 70% to 60% before moving to further evaluation. This broadens the spectrum of stories returned. It also means that, due to the complexity of semantic analysis, developers must still manually select the most fitting stories for their specific context.

4. Conclusion and Next Steps

In this position paper, we briefly introduced a software tool that employs NLP to retrieve information from past user stories in order to support developers through the agile estimation process. After the preliminary evaluation discussed in the paper, we considered whether it would be valuable to train a model specifically for the purpose of finding similarities between user stories. We did not do this originally due to the very limited data available for training. USE has been trained with data from multiple datasets, including the Stanford Natural Language Inference (SNLI) corpus [8], which consists of 570,000 sentence pairs. Another possibility to achieve better results would be replacing USE with other pre-trained models such as the InferSent model released by Facebook [9], which is also a general-purpose model.

Our next steps in this research include testing the use of different models and/or training our own. We have also planned an evaluation in a real setting, with developers from the company using the tool to estimate user stories for a new project. This would provide valuable insights regarding the role that this kind of tool may play in agile projects in industry. Another interesting future direction would be automating NLP routines to group user stories that are very similar but have different implementation times, and to calculate a mean implementation time that could be presented to the developer. Another NLP task that could be implemented is scanning the database for possible keywords to be added to a tag bank that can be used in searches.

References

[1] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing (NLP) for requirements engineering: A systematic mapping study, ACM Computing Surveys 54 (2021).
[2] I. K. Raharjana, D. Siahaan, C. Fatichah, User stories and natural language processing: A systematic literature review, IEEE Access 9 (2021) 53811–53826.
[3] B. Alsaadi, K. Saeedi, Data-driven effort estimation techniques of agile user stories: a systematic literature review, Artificial Intelligence Review (2022) 1–32.
[4] R. W. Miller, Schedule, cost, and profit control with PERT: A comprehensive guide for program management (1963).
[5] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder (2018). arXiv:1803.11175.
[6] M. Mensio, spaCy - Universal Sentence Encoder, 2021. URL: https://github.com/MartinoMensio/spacy-universal-sentence-encoder.
[7] F. Dalpiaz, I. Van Der Schalk, S. Brinkkemper, F. B. Aydemir, G. Lucassen, Detecting terminological ambiguity in user stories: Tool and experimentation, Information and Software Technology 110 (2019) 3–16.
[8] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2015.
[9] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data (2017). arXiv:1705.02364.