<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Identifying Similar User Stories to Support Agile Estimation based on Historical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksander Grzegorz Duszkiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacob Glumby Sørensen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niclas Johansen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henry Edison</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thiago Rocha Silva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Morningtrain ApS</institution>
          ,
          <addr-line>Rugårdsvej 55A 1 tv, Odense C, 5000</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Maersk Mc-Kinney Moller Institute, University of Southern Denmark</institution>
          ,
          <addr-line>Campusvej 55, Odense M, 5230</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <fpage>21</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Accurate and reliable effort and cost estimation is still a challenging activity for agile teams. It is argued that leveraging historical data regarding the actual time spent on similar past projects could be very helpful to support such an activity before companies embark upon a new project. In this position paper, we discuss our initial efforts towards a software tool that retrieves similar user stories from a database of past projects to support developers in estimating the effort and cost of developing similar new projects. In close collaboration with a Danish software development company, we developed a tool that employs Natural Language Processing (NLP) algorithms to find similar past user stories and retrieve the actual time spent on them. Developers can then adopt the actual implementation time of these stories as the new estimate or use the value to support their decision on a different estimate. Results of a preliminary evaluation of the tool's performance in terms of precision and recall showed that it is capable of finding similar stories fairly well, but also showed that different phrasings and wordings can have a high impact on similarity scores.</p>
      </abstract>
      <kwd-group>
        <kwd>User Stories</kwd>
        <kwd>Agile Estimation</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Effort and cost estimation is still one of the most critical aspects of agile software development.
A fair amount of developers’ time is spent on understanding, discussing, and trying to reach
a consensus on an accurate estimate for user stories. Such an activity is even costlier for
companies that need to budget a software project before even winning a software development
contract. Besides that, estimates can easily be misjudged and development can take longer
than first anticipated, which can lead to a loss on the project.</p>
      <p>
        Leveraging historical data from similar past projects has been seen as a good strategy to
overcome some of these challenges. Having access to the actual time spent on developing user
stories in the past can be very helpful to support developers in estimating more accurately the
effort required to develop similar user stories at present. In practice, however, it is not trivial to
identify such similar stories from past projects, and many agile teams simply do not leverage
this asset. Despite fruitful previous research on Natural Language Processing (NLP) to support
various activities in requirements engineering more broadly [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and on user stories more
specifically [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], there has been limited work on models and tools applying NLP to identify
user story similarity in order to support the agile estimation process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When it comes to
commercial agile estimation tools, the closest one to support this process is DoWork.ai
(https://dowork.ai), which leverages artificial intelligence to find similar tasks. When creating
a new task, the tool searches for related historical tasks and signals how similar they are. It is
not clear, however, which kind of NLP model it employs and how such a model has been trained.
      </p>
      <p>In this position paper, we discuss our ongoing efforts towards the development of a software
tool to support this process. The tool has been co-designed with a Danish software company
following design science principles. It employs NLP to identify and retrieve similar user stories
from past projects and suggests three levels of estimates (most-likely, pessimistic, and optimistic)
for the new user stories being estimated. Developers can then decide to adopt one of the three
suggested estimates, or leverage this information to support their decision on a new estimate
based on the particularities of the project at hand. We report our findings from a preliminary
evaluation analysing the current performance of the tool in retrieving similar user stories based
on text similarity. In the following sections, we briefly introduce the main elements of the tool
and discuss both the preliminary results obtained so far and our next steps in this research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Proposed Software Tool</title>
      <p>
        The proposed tool has been co-designed with Morningtrain (https://morningtrain.dk), a
Danish medium-sized full-service digital agency that specializes in the fields of web design,
programming, and online marketing. The company currently employs a weighted three-point
estimation technique based on PERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and would like to keep using it.
      </p>
      <p>The tool has been developed using React as the front-end framework and Laravel as the
back-end framework. It is a standalone system that connects to Jira using OAuth 2.0 and
Webhooks, providing two-way communication of project data. Other technologies used include
MySQL, Redis, and InertiaJS for building the single-page application (SPA).</p>
      <p>Our initial implementation strategy to find similar user stories was based on tagging. In
order to reduce the number of stories for the system to compare, we first implemented a tag
system with the purpose of excluding stories with non-similar tags. Only stories that passed
this first tag filtering would be submitted to further NLP analysis. The rationale was that this
could help performance as the dataset grows. After experimenting with some data, however,
we decided to abandon this strategy and submit all the stories in the dataset to NLP analysis,
since we did not notice any relevant performance issue.</p>
      <p>
        When performing NLP, we employed semantic analysis to score semantic similarity among
user stories, as literal analysis of text did not produce good results. We implemented semantic
similarity analysis using spaCy (https://spacy.io), whose similarity function compares two
strings by calculating a cosine similarity score. While spaCy comes with a selection of standard
models that can be used directly, when tested with our database, the similarity scores were
mostly in the high end regardless of the actual similarity. Faced with this issue, we
started looking for a model that could better score semantic similarities of full-text sentences.
This led us to opt for USE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a pre-trained model developed by Google. An open-source
implementation of USE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] made it possible to use the model together with spaCy. USE was
trained on a large variety of texts and uses high-dimensional vectors, which makes it a good
choice for general-purpose natural language tasks, including text classification and semantic
similarity. During our tests, the similarity scores with USE were closer to what we were
expecting based on a manual analysis of the stories in the database.
      </p>
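      <p>As an illustration, the following minimal sketch scores two user stories with the open-source
spaCy wrapper for USE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; the model name and the example stories are illustrative and do not reproduce the exact
configuration of our tool.</p>
      <preformat>
import spacy_universal_sentence_encoder

# Load a spaCy pipeline backed by USE embeddings (model name is illustrative).
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

story_a = nlp('As a customer, I want to add a product to my wish list, '
              'so that I can find the product next time I visit the page')
story_b = nlp('As a customer, I want to remove a product from my wish list, '
              'so that it does not appear on the list when I visit the page')

# similarity() returns the cosine similarity of the two sentence embeddings.
print(story_a.similarity(story_b))
</preformat>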
      <p>The tool compares a new story against all other stories marked with the status of “done”
in the database. During the comparison process, the user stories are pre-processed in order
to remove the template skeleton (i.e., the sub-strings “As a”, “I want”, and “so that”), since we
observed that this improves the parsing. The tool currently employs a similarity threshold of 60%,
i.e., any user story with a similarity score of 60% (0.6) or above is returned to the user. We use the
actual time spent on the highest scoring user story returned as the most likely estimate for the
new similar user story being estimated.
Returning all stories above the threshold, rather than only the best match, aims to give the user
the possibility of making an informed choice. Two user stories can be very similar, but bear
different system-dependent challenges, which can have a deep
impact on the time needed to implement the same feature.</p>
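      <p>A minimal sketch of this matching step is shown below, assuming the “done” stories are
available as plain strings together with their actual implementation times; the function names and
the data layout are illustrative.</p>
      <preformat>
import spacy_universal_sentence_encoder

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
SIMILARITY_THRESHOLD = 0.6  # stories scoring at or above 60% are returned

def strip_template(story):
    """Remove the user-story template skeleton before comparison."""
    for fragment in ('As a', 'I want', 'so that'):
        story = story.replace(fragment, '')
    return story

def find_similar(new_story, done_stories):
    """Return (text, actual_hours, score) triples above the threshold, best first.

    done_stories holds (text, actual_hours) tuples for stories marked 'done'.
    The actual time of the highest-scoring match serves as the most-likely estimate.
    """
    new_doc = nlp(strip_template(new_story))
    matches = []
    for text, actual_hours in done_stories:
        score = new_doc.similarity(nlp(strip_template(text)))
        if score >= SIMILARITY_THRESHOLD:
            matches.append((text, actual_hours, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)
</preformat>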
      <p>Finally, we also implemented a Jira integration feature through which a project with all its
underlying requirements and stories can be transferred to Jira. Two-way communication is
established through Webhooks, i.e., once a project is exported, besides systematically uploading
all local changes to Jira, we also listen for change events within the project on Jira and reflect
those changes back to the tool, so that our database always holds the latest updates.</p>
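      <p>The tool implements this listener in Laravel; purely to illustrate the pattern, the following
Python sketch receives a Jira “issue updated” webhook and applies the change locally. The endpoint
path and the helper function are hypothetical.</p>
      <preformat>
from flask import Flask, request

app = Flask(__name__)

# Hypothetical endpoint registered as the webhook URL in Jira.
@app.route('/webhooks/jira', methods=['POST'])
def jira_webhook():
    event = request.get_json()
    # Jira webhook payloads carry the event name and the affected issue.
    if event.get('webhookEvent') == 'jira:issue_updated':
        issue = event['issue']
        update_local_story(issue['key'], issue['fields'])
    return ('', 204)

def update_local_story(key, fields):
    """Hypothetical helper: reflect the Jira-side change into the local database."""
    print('Updating story', key, '-', fields.get('summary'))
</preformat>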
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Evaluation</title>
      <p>We performed a preliminary evaluation to determine the accuracy of USE in identifying similar
user stories within the tool. The test setup included a database of five user stories from past
projects in the company and eight additional stories that were either phrased differently but
semantically similar to the original ones, or not semantically similar but using phrasings from
the original ones. Each user story was then manually compared to the others by one of the
researchers (169 combinations), and then independently reviewed by another researcher. A
score of either high or low similarity was recorded for each pair. We started the study with a
similarity threshold of 70%, as we were aiming at closely related user stories. Each pair of stories
was subsequently tested using our tool and the results were compared to the manual analysis.
The results were additionally cross-checked by two researchers.</p>
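      <p>Given such manually labelled pairs, precision and recall can be computed as in the following
sketch; the data layout is illustrative.</p>
      <preformat>
def precision_recall(pairs, threshold=0.7):
    """Score the tool's similarity output against the manual high/low labels.

    pairs holds (tool_score, manually_similar) tuples, e.g. (0.65, True).
    """
    predicted = [(score >= threshold, label) for score, label in pairs]
    tp = sum(1 for pred, label in predicted if pred and label)
    fp = sum(1 for pred, label in predicted if pred and not label)
    fn = sum(1 for pred, label in predicted if not pred and label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
</preformat>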
      <p>Table 1 summarizes the results of the evaluation. During the test, no false positives were
found. There were, however, a few cases that came close to the 70% threshold. For example, the
two following user stories scored a similarity of 62%:
“As a customer, I want to add a product to my wish list, so that I can find the product next
time I visit the page”
“As a customer, I want to remove a product from my wish list, so that it does not appear on
the list when I visit the page”</p>
      <p>These are sentences that do bear some resemblance but do not involve the same operations
from an implementation perspective. The sentences are therefore not similar in this context.
This suggests that the algorithm is indifferent to the opposite verbs “add” and “remove”, or at
least that it does not significantly penalize these opposites. This is further underlined by the last
clause of each user story: in the first one, the user should be able to find the product, while in
the second one it is implied that the product should no longer be found on the list.</p>
      <p>
        Conversely, four false negatives were found, all within the 60-70% range. An example is
given by the two following user stories, which scored a similarity of 65%:
“As an administrator, I want to search for a specific student, so that I can find the student
records”
“As a student, I want to look up my student records, so that I can get a look at the details”
The two user stories describe very similar features but are phrased differently. While 65% is a
good score, it also suggests that there can be pitfalls when different words are used to describe
what is likely the same thing. When testing the tool, we found that replacing a context-giving
keyword in one of the sentences with a similar word can have a significant impact on the overall
sentence similarity score. Our tests showed that similar words that scored low when tested
against each other in isolation also lowered the similarity score when used in sentences. As
an example, “records” and “details” are probably two different ways of referring to students’
personal data in this context. The similarity test of these two words alone returned a low
similarity score of 32%, which in turn lowered the overall similarity score of the two sentences.
This issue relates more broadly to near-synonymy in user stories, a topic investigated
by Dalpiaz et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
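      <p>This kind of word-level probe can be reproduced as follows; the model name and the example
sentences are illustrative.</p>
      <preformat>
import spacy_universal_sentence_encoder

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Near-synonyms compared in isolation, as with 'records' vs. 'details'.
print(nlp('records').similarity(nlp('details')))

# Swapping a single context-giving keyword shifts the sentence-level score.
base = nlp('As a student, I want to look up my student records')
variant = nlp('As a student, I want to look up my student details')
print(base.similarity(variant))
</preformat>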
      <p>As the system was designed to be a supporting tool for the user to make an informed decision,
after this preliminary evaluation we decided to lower the tool’s similarity threshold from 70%
to 60% before moving to further evaluation. This broadens the spectrum of stories returned. It
also means that, given the complexity of semantic analysis, developers must still manually select
the most fitting stories for their specific context.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Next Steps</title>
      <p>
        In this position paper, we briefly introduced a software tool that employs NLP to retrieve
information from past user stories in order to support developers through the agile estimation
process. After the preliminary evaluation discussed in the paper, we considered whether it
would be valuable to train a model specifically for the purpose of finding similarities between
user stories. We did not do so originally due to the very limited data available for training.
USE has been trained with data from multiple datasets, including the Stanford
Natural Language Inference (SNLI) corpus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which consists of 570,000 sentence pairs. Another
possibility to achieve better results would be replacing USE with other pre-trained models such as
the InferSent model released by Facebook [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is also a general-purpose model.
      </p>
      <p>Our next steps in this research include testing the use of different models and/or training
our own. We have also planned an evaluation in a real setting, with developers from the
company using the tool to estimate user stories for a new project. This would provide valuable
insights into the role that this kind of tool may play in agile projects in industry. Another
interesting future direction would be automating NLP routines to group very similar user stories
with different implementation times, and calculate a mean implementation time that could
be presented to the developer. Another NLP task that could be implemented is scanning the
database for possible keywords to be added to a tag bank that can be used in searches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Alhoshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Letsholo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ajagbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-V.</given-names>
            <surname>Chioasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
          ,
          <article-title>Natural language processing (NLP) for requirements engineering: A systematic mapping study</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Raharjana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Siahaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fatichah</surname>
          </string-name>
          ,
          <article-title>User stories and natural language processing: A systematic literature review</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>53811</fpage>
          -
          <lpage>53826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Alsaadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saeedi</surname>
          </string-name>
          ,
          <article-title>Data-driven effort estimation techniques of agile user stories: a systematic literature review</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <source>Schedule, Cost, and Profit Control with PERT: A Comprehensive Guide for Program Management</source>
          (
          <year>1963</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>St. John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guajardo-Cespedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          ,
          <article-title>Universal sentence encoder</article-title>
          (
          <year>2018</year>
          ). arXiv:1803.11175.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mensio</surname>
          </string-name>
          ,
          <source>spaCy - Universal Sentence Encoder</source>
          ,
          <year>2021</year>
          . URL: https://github.com/MartinoMensio/spacy-universal-sentence-encoder.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalpiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>van der Schalk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brinkkemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Aydemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucassen</surname>
          </string-name>
          ,
          <article-title>Detecting terminological ambiguity in user stories: Tool and experimentation</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>110</volume>
          (
          <year>2019</year>
          )
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          ,
          in:
          <source>Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          (
          <year>2017</year>
          ). arXiv:1705.02364.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>