<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Identifying Similar User Stories to Support Agile Estimation based on Historical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksander Grzegorz Duszkiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacob Glumby Sørensen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niclas Johansen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henry Edison</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thiago Rocha Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Morningtrain ApS</institution>
          ,
          <addr-line>Rugårdsvej 55A 1 tv, Odense C, 5000</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Maersk Mc-Kinney Moller Institute, University of Southern Denmark</institution>
          ,
          <addr-line>Campusvej 55, Odense M, 5230</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <fpage>21</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Accurate and reliable effort and cost estimation is still a challenging activity for agile teams. It is argued that leveraging historical data regarding the actual time spent on similar past projects could be very helpful to support such an activity before companies embark upon a new project. In this position paper, we discuss our initial efforts towards a software tool that retrieves similar user stories from a database of past projects to support developers in estimating the effort and cost of developing similar new projects. In close collaboration with a Danish software development company, we developed a tool that employs Natural Language Processing (NLP) algorithms to find similar past user stories and retrieve the actual time spent on them. Developers can then adopt the actual implementation time of these stories as the new estimate or use the value to support their decision on a different estimate. Results of a preliminary evaluation of the tool's performance in terms of precision and recall showed that it is capable of finding similar stories fairly well, but also showed that different phrasings and wordings can have a high impact on similarity scores.</p>
      </abstract>
      <kwd-group>
        <kwd>User Stories</kwd>
        <kwd>Agile Estimation</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Effort and cost estimation is still one of the most critical aspects of agile software development.
A fair amount of developers’ time is spent on understanding, discussing, and trying to reach
a consensus on an accurate estimate for user stories. Such an activity is even costlier for
companies that need to budget a software project before even winning a software development
contract. Besides that, estimates can easily be misjudged and development can take longer
than first anticipated, which can lead to a loss on the project.</p>
      <p>
        Leveraging historical data from similar past projects has been seen as a good strategy to
overcome some of these challenges. Having access to the actual time spent on developing user
stories in the past can be very helpful to support developers in estimating more accurately the
effort required to develop similar user stories at present. In practice, however, it is not trivial to
identify such similar stories from past projects, and many agile teams simply do not leverage
this asset. Despite fruitful previous research on Natural Language Processing (NLP) to support
various activities in requirements engineering more broadly [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and on user stories more
specifically [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there has been limited work on models and tools applying NLP to identify
user story similarity in order to support the agile estimation process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When it comes to
commercial agile estimation tools, the closest one to support this process is DoWork.ai
(https://dowork.ai), which leverages artificial intelligence to find similar tasks. When creating
a new task, the tool searches for related historical tasks and signals how similar they are. It is
not clear, however, which kind of NLP model it employs and how such a model has been trained.
      </p>
      <p>In this position paper, we discuss our ongoing efforts towards the development of a software
tool to support this process. The tool has been co-designed with a Danish software company
following design science principles. It employs NLP to identify and retrieve similar user stories
from past projects and suggests three levels of estimates (most-likely, pessimistic, and optimistic)
for the new user stories being estimated. Developers can then decide to adopt one of the three
suggested estimates, or leverage this information to support their decision on a new estimate
based on the particularities of the project at hand. We report our findings from a preliminary
evaluation analysing the current performance of the tool in retrieving similar user stories based
on text similarity. In the following sections, we briefly introduce the main elements of the tool
and discuss both the preliminary results obtained so far and our next steps in this research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Proposed Software Tool</title>
      <p>
        The proposed tool has been co-designed with Morningtrain (https://morningtrain.dk), a
Danish medium-sized full-service digital agency that specializes in the fields of web design,
programming, and online marketing. The company currently employs a weighted three-point
estimation technique based on PERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and would like to keep using it.
      </p>
      <p>The tool has been developed using React as the front-end framework and Laravel as the
back-end framework. It is a standalone system that connects to Jira using OAuth 2.0 and
Webhooks, providing two-way communication of project data. Other technologies used include
MySQL, Redis, and InertiaJS for building the single-page application (SPA).</p>
      <p>Our initial implementation strategy to find similar user stories was based on tagging. In
order to reduce the number of stories for the system to compare, we first implemented a tag
system with the purpose of excluding stories with non-similar tags. Only stories that passed
this first tag filtering would be submitted to further NLP analysis. The rationale was that this
could help performance as the dataset grows. After experimenting with some data, however,
we decided to abandon this strategy and submit all the stories in the dataset to NLP analysis,
since we did not notice any relevant performance issue.</p>
      <p>
        When performing NLP, we employed semantic analysis to score semantic similarity among
user stories, as literal analysis of text did not produce good results. We implemented semantic
similarity analysis using spaCy (https://spacy.io), whose similarity function compares two
strings by calculating a cosine similarity score. While spaCy comes with a selection of standard
models that can be used directly, when tested with our database, the similarity scores were
mostly in the high end regardless of the actual similarity. Faced with this issue, we
started looking for a model that could better score semantic similarities of full-text sentences.
This led us to opt for USE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a pre-trained model developed by Google. An open-source
implementation of USE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] made it possible to use the model together with spaCy. USE was
trained on a large variety of texts and uses high-dimensional vectors, which makes it a good
choice for general-purpose natural language tasks, including text classification and semantic
similarity. During our tests, the similarity scores with USE were closer to what we were
expecting based on a manual analysis of the stories in the database.
      </p>
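      <p>As an illustration, the following minimal sketch scores two user stories with the open-source
spaCy wrapper for USE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; the model name and the example stories are illustrative and do not reproduce the exact
configuration of our tool.</p>
      <preformat>
import spacy_universal_sentence_encoder

# Load a spaCy pipeline backed by USE embeddings (model name is illustrative).
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

story_a = nlp('As a customer, I want to add a product to my wish list, '
              'so that I can find the product next time I visit the page')
story_b = nlp('As a customer, I want to remove a product from my wish list, '
              'so that it does not appear on the list when I visit the page')

# similarity() returns the cosine similarity of the two sentence embeddings.
print(story_a.similarity(story_b))
</preformat>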
      <p>The tool compares a new story against all other stories marked with the status of “done”
in the database. During the comparison process, the user stories are pre-processed in order
to remove the template skeleton (i.e., the sub-strings “As a”, “I want”, and “so that”), since we
observed that this improves the parsing. The tool currently employs a similarity threshold of 60%,
i.e., any user story with a similarity score of 60% (0.6) or above is returned to the user. We use the
actual time spent on the highest scoring user story returned as the most likely estimate for the
new similar user story being estimated.
Returning all stories above the threshold, rather than only the best match, aims to give the user
the possibility of making an informed choice. Two user stories can be very similar, but bear
different system-dependent challenges, which can have a deep
impact on the time needed to implement the same feature.</p>
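      <p>A minimal sketch of this matching step is shown below, assuming the “done” stories are
available as plain strings together with their actual implementation times; the function names and
the data layout are illustrative.</p>
      <preformat>
import spacy_universal_sentence_encoder

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
SIMILARITY_THRESHOLD = 0.6  # stories scoring at or above 60% are returned

def strip_template(story):
    """Remove the user-story template skeleton before comparison."""
    for fragment in ('As a', 'I want', 'so that'):
        story = story.replace(fragment, '')
    return story

def find_similar(new_story, done_stories):
    """Return (text, actual_hours, score) triples above the threshold, best first.

    done_stories holds (text, actual_hours) tuples for stories marked 'done'.
    The actual time of the highest-scoring match serves as the most-likely estimate.
    """
    new_doc = nlp(strip_template(new_story))
    matches = []
    for text, actual_hours in done_stories:
        score = new_doc.similarity(nlp(strip_template(text)))
        if score >= SIMILARITY_THRESHOLD:
            matches.append((text, actual_hours, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)
</preformat>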
      <p>Finally, we also implemented a Jira integration feature through which a project with all its
underlying requirements and stories can be transferred to Jira. Two-way communication is
established through Webhooks, i.e., once a project is exported, besides systematically uploading
all local changes to Jira, we also listen for change events within the project on Jira and reflect
those changes back to the tool, so that our database always holds the latest updates.</p>
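      <p>The tool implements this listener in Laravel; purely to illustrate the pattern, the following
Python sketch receives a Jira “issue updated” webhook and applies the change locally. The endpoint
path and the helper function are hypothetical.</p>
      <preformat>
from flask import Flask, request

app = Flask(__name__)

# Hypothetical endpoint registered as the webhook URL in Jira.
@app.route('/webhooks/jira', methods=['POST'])
def jira_webhook():
    event = request.get_json()
    # Jira webhook payloads carry the event name and the affected issue.
    if event.get('webhookEvent') == 'jira:issue_updated':
        issue = event['issue']
        update_local_story(issue['key'], issue['fields'])
    return ('', 204)

def update_local_story(key, fields):
    """Hypothetical helper: reflect the Jira-side change into the local database."""
    print('Updating story', key, '-', fields.get('summary'))
</preformat>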
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Evaluation</title>
      <p>We performed a preliminary evaluation to determine the accuracy of USE in identifying similar
user stories within the tool. The test setup included a database of five user stories from past
projects in the company and eight additional stories that were either phrased differently but
semantically similar to the original ones, or not semantically similar but using phrasings from
the original ones. Each user story was then manually compared to the others by one of the
researchers (169 combinations), and then independently reviewed by another researcher. A
score of either high or low similarity was recorded for each pair. We started the study with a
similarity threshold of 70%, as we were aiming at closely related user stories. Each pair of stories
was subsequently tested using our tool and the results were compared to the manual analysis.
The results were additionally cross-checked by two researchers.</p>
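      <p>Given such manually labelled pairs, precision and recall can be computed as in the following
sketch; the data layout is illustrative.</p>
      <preformat>
def precision_recall(pairs, threshold=0.7):
    """Score the tool's similarity output against the manual high/low labels.

    pairs holds (tool_score, manually_similar) tuples, e.g. (0.65, True).
    """
    predicted = [(score >= threshold, label) for score, label in pairs]
    tp = sum(1 for pred, label in predicted if pred and label)
    fp = sum(1 for pred, label in predicted if pred and not label)
    fn = sum(1 for pred, label in predicted if not pred and label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
</preformat>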
      <p>Table 1 summarizes the results of the evaluation. During the test, no false positives were
found. There were, however, a few cases that came close to the 70% threshold. For example, the
two following user stories scored a similarity of 62%:
“As a customer, I want to add a product to my wish list, so that I can find the product next
time I visit the page”
“As a customer, I want to remove a product from my wish list, so that it does not appear on
the list when I visit the page”</p>
      <p>These are sentences that do bear some resemblance but do not involve the same operations
from an implementation perspective. The sentences are therefore not similar in this context.
This suggests that the algorithm is indifferent to the opposite verbs “add” and “remove”, or at
least that it does not significantly penalize these opposites. This is further underlined by the last
clause of each user story: in the first one, the user should be able to find the product, while in
the second one it is implied that the product should no longer be found on the list.</p>
      <p>
        Conversely, four false negatives were found, all within the 60-70% range. An example is
given by the two following user stories, which scored a similarity of 65%:
“As an administrator, I want to search for a specific student, so that I can find the student
records”
“As a student, I want to look up my student records, so that I can get a look at the details”
The two user stories describe very similar features but are phrased differently. While 65% is a
good score, it also suggests that there can be pitfalls when different words are used to describe
what is likely the same thing. When testing the tool, we found that replacing a context-giving
keyword in one of the sentences with a similar word can have a significant impact on the overall
sentence similarity score. Our tests showed that similar words that scored low when tested
against each other in isolation also lowered the similarity score when used in sentences. As
an example, “records” and “details” are probably two different ways of referring to students’
personal data in this context. The similarity test of these two words alone returned a low
similarity score of 32%, which in turn lowered the overall similarity score of the two sentences.
This issue relates more broadly to near-synonymy in user stories, a topic investigated
by Dalpiaz et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
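      <p>This kind of word-level probe can be reproduced as follows; the model name and the example
sentences are illustrative.</p>
      <preformat>
import spacy_universal_sentence_encoder

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Near-synonyms compared in isolation, as with 'records' vs. 'details'.
print(nlp('records').similarity(nlp('details')))

# Swapping a single context-giving keyword shifts the sentence-level score.
base = nlp('As a student, I want to look up my student records')
variant = nlp('As a student, I want to look up my student details')
print(base.similarity(variant))
</preformat>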
      <p>As the system was designed to be a supporting tool for the user to make an informed decision,
after this preliminary evaluation we decided to lower the tool’s similarity threshold from 70%
to 60% before moving to further evaluation. This broadens the spectrum of stories returned. It
also means that, given the complexity of semantic analysis, developers must still manually select
the most fitting stories for their specific context.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Next Steps</title>
      <p>
        In this position paper, we briefly introduced a software tool that employs NLP to retrieve
information from past user stories in order to support developers through the agile estimation
process. After the preliminary evaluation discussed in the paper, we considered whether it
would be valuable to train a model specifically for the purpose of finding similarities between
user stories. We did not do so originally due to the very limited data available for training.
USE has been trained with data from multiple datasets, including the Stanford
Natural Language Inference (SNLI) corpus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which consists of 570,000 sentence pairs. Another
possibility to achieve better results would be replacing USE with other pre-trained models such as
the InferSent model released by Facebook [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is also a general-purpose model.
      </p>
      <p>Our next steps in this research include testing the use of different models and/or training
our own. We have also planned an evaluation in a real setting, with developers from the
company using the tool to estimate user stories for a new project. This would provide valuable
insights into the role that this kind of tool may play in agile projects in industry. Another
interesting future direction would be automating NLP routines to group very similar user stories
with different implementation times, and calculate a mean implementation time that could
be presented to the developer. Another NLP task that could be implemented is scanning the
database for possible keywords to be added to a tag bank that can be used in searches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Alhoshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Letsholo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ajagbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-V.</given-names>
            <surname>Chioasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
          ,
          <article-title>Natural language processing (NLP) for requirements engineering: A systematic mapping study</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Raharjana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Siahaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fatichah</surname>
          </string-name>
          ,
          <article-title>User stories and natural language processing: A systematic literature review</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>53811</fpage>
          -
          <lpage>53826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Alsaadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saeedi</surname>
          </string-name>
          ,
          <article-title>Data-driven effort estimation techniques of agile user stories: a systematic literature review</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <source>Schedule, Cost, and Profit Control with PERT: A Comprehensive Guide for Program Management</source>
          (
          <year>1963</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>St. John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guajardo-Cespedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Universal sentence encoder</article-title>
          (
          <year>2018</year>
          ). arXiv:1803.11175.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mensio</surname>
          </string-name>
          ,
          <source>spaCy - Universal Sentence Encoder</source>
          ,
          <year>2021</year>
          . URL: https://github.com/MartinoMensio/spacy-universal-sentence-encoder.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>van der Schalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brinkkemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Aydemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <article-title>Detecting terminological ambiguity in user stories: Tool and experimentation</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>110</volume>
          (
          <year>2019</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          ,
          in:
          <source>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          (
          <year>2017</year>
          ). arXiv:1705.02364.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>