ORSIM: Integrating existing software components to detect similar natural language requirements

Carlos Adrián Furnari1, Cristina Palomares1, Xavier Franch1
1 Universitat Politècnica de Catalunya, Spain

Abstract. [Context & motivation] Requirements Engineering (RE) is considered one of the most critical phases in software development. Within RE, interdependency detection and requirements reuse are areas that could be improved and that have been of interest to the research community. [Problem] Similarity detection is an activity that emerges in the context of natural language requirements and can be used for interdependency detection and requirements reuse. Although several software components exist to detect similar texts in English, creating the setup to test them is time-consuming and difficult. [Principal ideas/results] In this paper, we present ORSIM (OpenReq-Similarity), a tool that integrates different existing similarity detection components in the same platform. These components are Cortical, Gensim, ParallelDots, and Semilar. [Contribution] ORSIM enables requirements engineers to concentrate on evaluating and choosing the similarity detection component that best suits their data rather than worrying about the technical setup of these components.

Keywords: Similarity detection, Paraphrasing detection, Natural language processing, Requirements engineering1

1 Copyright 2018 for this paper by its authors. Copying permitted for private and academic purposes.

1 Introduction

Similarity detection (both from a syntactic and a semantic point of view) aims at detecting sentences that express approximately the same idea using different words [1, 2]. In Requirements Engineering (RE), similarity detection can be used both for the identification of interdependencies between requirements and for the reuse of knowledge related to requirements.

The relationship between similarity detection and interdependency detection is evident in the case where two requirements have almost the same formulation, since in this case we would have an OR interdependency between them. Imagine the requirements The user interface should use the Arial letter type and The user interface should use the Calibri letter type. It is clear that these two requirements are similar (except for the words Arial and Calibri) and that they cannot be used in the same system (since it is not possible to use two letter types for the whole user interface), so they are related by an OR interdependency (using the terminology proposed by Carlshamre [3]). However, even in cases that are not so syntactically similar, there are commonalities. As an example, if a requirement states that It shall be possible to filter personal data by name and address and another one states that The system shall be able to filter personal data by age, it would probably be wise to treat these two requirements at the same time to save development resources. This example can be considered an ICOST interdependency (i.e., a requirement affects the implementation of another requirement), again using the terminology proposed by Carlshamre [3].
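To make the syntactic notion of similarity concrete, the following minimal Python sketch (for illustration only; it is not part of ORSIM) scores the two letter-type requirements above with a token-level Jaccard similarity. Purely semantic overlaps, as in the filtering example, require the richer models integrated in ORSIM.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


r1 = "The user interface should use the Arial letter type"
r2 = "The user interface should use the Calibri letter type"
print(f"{jaccard_similarity(r1, r2):.2f}")  # ~0.78: the two texts differ in a single token
```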
In requirements reuse, similarity detection could be used to retrieve similar requirements from previous projects. From the identified similar requirements, knowledge about the metadata attributes could be reused (such as effort, risk or priority). In addition, it is possible to know whether the retrieved requirements caused problems in previous projects, so that their occurrence can be avoided in the current one. Similarity detection could also be used to retrieve similar requirements from a reusable requirements database—which contains relationships among the reusable requirements—and to propose to the user requirements that are related to the retrieved one.

The OpenReq project [4] aims at developing, evaluating, and transferring highly innovative methods, algorithms, and tools for community-driven RE in large and distributed software-intensive projects. Inside OpenReq, the detection of similar requirements is a cornerstone, since it is the basis for the dependency detection and requirements reuse functionalities that it will provide. Some of these functionalities are: discarding duplicate requirements, identifying related requirements in previous projects to save effort and reuse knowledge, and detecting incompatibilities to prevent errors. Specifically, these functionalities are of special importance for two of the use cases of OpenReq. One use case deals with bid projects of thousands of requirements. They want to identify similar and dependent requirements both in the same bid and in previous bids, and to reuse requirements knowledge from previous bids in the bid at hand. It is expected that this will reduce the cost of the bid phase and the requirements analysis phase by 10%. The other use case has a database with almost a hundred thousand requests (containing requirements and bugs). They are not so much interested in the effort-saving and reuse functionality, but in the identification of similar and dependent requests. It is estimated that at least 1000 requests per year could be managed better or avoided altogether. Of course, these are just estimations and they need to be validated during the project.

As there are several well-known components already developed to detect similar texts in English, the goal of OpenReq is not to reinvent the wheel, but to use the most adequate component in every context of use and to improve the results by applying pre-processing on the input and/or post-processing on the output of these components. However, creating the setup to test these different components is time-consuming and difficult, since most of them need different settings to work: they use different programming languages (such as Python and Java), some of them are available just as APIs while others are available as coding libraries, some of them need specific databases to store specific data (such as the pre-processing of texts) to speed up their computation while others can connect to any database, etc.

Therefore, the first goal of ORSIM (standing for OpenReq-SIMilarity) is to integrate different existing similarity detection components in the same environment, so that requirements engineers do not have to worry about the different setups and may concentrate on evaluating and choosing the component that best suits their data. The engineering behind ORSIM is a prerequisite for further systematic evaluations; with the ability to run such evaluations, ORSIM can aim to help stakeholders in more ambitious ways in the future.
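To illustrate the kind of integration ORSIM performs, the Python sketch below defines a single interface behind which heterogeneous components (local libraries as well as remote APIs) could be wrapped. The class and method names are hypothetical and do not correspond to ORSIM's actual code.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Tuple


class SimilarityComponent(ABC):
    """Uniform contract each wrapped component fulfils (hypothetical)."""

    @abstractmethod
    def rank(self, query: str, candidates: List[str]) -> List[Tuple[str, float]]:
        """Return the candidates sorted by decreasing similarity to the query."""


class PairwiseApiComponent(SimilarityComponent):
    """Adapter for a component that only scores one pair of texts at a time,
    e.g. a remote semantic-analysis API."""

    def __init__(self, score_pair: Callable[[str, str], float]):
        self._score_pair = score_pair  # e.g. an HTTP call to the component

    def rank(self, query, candidates):
        scored = [(text, self._score_pair(query, text)) for text in candidates]
        return sorted(scored, key=lambda item: item[1], reverse=True)
```

A client of such a platform would then interact only with rank(), regardless of whether the component underneath is a Python library, a Java library exposed through an API, or an external web service.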
The long-term goal of ORSIM is for the tool to learn which component and parameterization behaves best for data with specific characteristics (for instance, long versus short requirements, or requirements containing many uncommon domain terms, such as those found in avionics or rail systems), so that it not only provides stakeholders with a setup to evaluate the components, but also assists them in the parameterization and choice of the component that behaves best for their data. ORSIM is not only of interest to OpenReq, but to any other stakeholder of the RE community who needs to evaluate different similarity components or choose the one that provides the best results for their specific data.

In the following, we present an overview of the ORSIM tool (Section 2). Section 3 presents an initial evaluation of ORSIM and the integrated similarity detection components. Finally, we conclude the paper in Section 4.

2 ORSIM Overview

In this version of ORSIM, four similarity components have been included. These components were chosen after a literature review done by the authors, selecting the components used in the works reviewed. The four components are:

• Cortical [5]. It provides an API with a method that calculates the similarity between two texts using different algorithms, e.g. the Jaccard [2] and Cosine [2] distances.
• Gensim [6]. It provides a Python library with methods that allow loading a corpus (similar to a dictionary) and applying the TF-IDF [7], LSA [8], LDA [9] and RP [10] algorithms to obtain the similarity between a text and the corpus (a sketch of such a pipeline is shown later in this section).
• ParallelDots [11]. It provides a semantic analysis API that uses cosine similarity to compute the similarity between two given texts.
• Semilar [12]. It provides Java implementations of a series of algorithms to evaluate the semantic similarity between two texts, available both as an application and as a library. In OpenReq, we use the library, in order to encapsulate it in an API accessed by ORSIM. The Semilar library comes with various similarity methods such as LSA, LDA, BLEU, Meteor, Pointwise Mutual Information, and optimized methods based on Quadratic Assignment (a description of these methods can be found in [13]). Semilar also includes features for text pre-processing, such as a tokenizer, tagger, stemmer and parser, with a choice between different options (e.g., StanfordNLP [14] and OpenNLP [15]).

Each of these components is encapsulated in an individual API. This allows updating a component without affecting the other components.

Fig. 1. ORSIM's Use Case (UC) diagram
Fig. 2. ORSIM's usage screen example – Gensim component

To use ORSIM, the requirements engineer must upload a file with all the requirements to fill the database (UC 1 of Figure 1), or use a file containing requirements instead of the database (UC 2 of Figure 1). The tool also allows editing the requirements in the database (UCs 3, 4 and 5 of Figure 1). After that, from the main window, the requirements engineer can select one of the available components (UC 6 of Figure 1). When a component is chosen, the interface changes to show the parameters that are necessary for the component and the requirement that is to be compared (Figure 2 shows such a screen for the Gensim component). Next, the requirements engineer enters the desired values for the parameters, enters the requirement to be compared, and sends the comparison to the ORSIM server (UC 7 of Figure 1). When the server replies, the results are shown at the bottom of the screen, in decreasing order of similarity score (see the lowest part of Figure 2).
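As a sketch of the pipeline referred to in the Gensim bullet above, the following Python fragment builds a TF-IDF/LSI index over a small corpus of requirements and ranks the corpus against a query requirement in decreasing order of similarity score, mirroring how ORSIM presents its results. The corpus, the number of topics and the whitespace tokenization are illustrative choices only; this is not ORSIM's actual code.

```python
from gensim import corpora, models, similarities

# Illustrative corpus; in ORSIM this would come from the requirements database.
requirements = [
    "Local (at Jagodina Station) equipment for remote control of traffic devices",
    "The contractor must provide the spare parts that the employer requested",
    "It is necessary to install a video supervision system at the existing level crossings",
]

texts = [r.lower().split() for r in requirements]            # naive tokenization
dictionary = corpora.Dictionary(texts)                       # the "corpus dictionary"
bow = [dictionary.doc2bow(t) for t in texts]                 # bag-of-words vectors
tfidf = models.TfidfModel(bow)                               # TF-IDF weighting
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)  # LSA/LSI space
index = similarities.MatrixSimilarity(lsi[tfidf[bow]])       # similarity index

query = "Local (at Ćuprija Station) equipment for remote control of traffic devices"
query_vec = lsi[tfidf[dictionary.doc2bow(query.lower().split())]]

ranked = sorted(enumerate(index[query_vec]), key=lambda s: s[1], reverse=True)
for doc_id, score in ranked:                                 # decreasing similarity
    print(f"{score:.2f}  {requirements[doc_id]}")
```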
As can be seen in the example of Figure 2 (with the database containing requirements of one of the trials of OpenReq), the requirements engineer has entered the requirement “Local (at Ćuprija Station) equipment for remote control of traffic devices (indication, control and driver cards)”. The component returns the requirement “Local (at Jagodina Station) equipment for remote control of traffic devices (indication, control and driver cards)” as the most similar to the one entered.

3 Initial Evaluations

Initial tests were carried out during the study of the components and the development of ORSIM. In the following, we show the main results of these tests; although more tests were carried out, we present here only those that are representative. Table 1 contains the data of the tests, i.e., the requirement to be analysed (the one entered by the requirements engineer) and the requirement in the database that the domain expert manually selected as the most similar one. Table 2 shows the results of the tests (for the components that are parameterizable, we do not show all the parameterizations tested, but only the one that returned the best result). The column Pos in Table 2 refers to the position of the requirement identified manually as the most similar one in the list of results returned by the component (this list is ordered in decreasing order of similarity score). The database used, which contains 1137 requirements, belongs to one of the trials of OpenReq.

Test 1 compares very similar requirements in the database (only one word differs) and the results are very good with all the components. Tests 2 to 4 use requirements that have the same meaning as one identified in the database, but whose expression (i.e., syntax) is different. In Test 2, the results of Cortical decreased a lot compared to the first test, while the rest gave good results, with ParallelDots standing out as the best. In Test 3, in the case of Semilar, the expected best result has position 2, meaning that Semilar identifies another requirement as more similar than the expected one, namely: “The new telephone of level crossing, which is in working condition, must be placed at level crossing PBE3”. As can be seen, this requirement is not more similar than the one identified manually, so this is considered a bad result for Semilar. Cortical and Gensim also returned lower similarity scores, while ParallelDots continued to behave well. Finally, in Test 4, Gensim gave the best result, closely followed by ParallelDots, while the rest gave noticeably lower results.
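For clarity on how the Pos column is obtained, the short Python sketch below computes the 1-based position of the manually identified requirement within a component's ranked output. The function name and the example scores are ours, used only for illustration, and are not part of ORSIM.

```python
from typing import List, Tuple


def position_of_expected(ranked: List[Tuple[str, float]], expected: str) -> int:
    """1-based rank of the manually identified requirement in a result list
    that is already sorted by decreasing similarity score (the Pos column)."""
    for pos, (requirement, _score) in enumerate(ranked, start=1):
        if requirement == expected:
            return pos
    raise ValueError("expected requirement not found among the results")


# Illustrative scores: the expected requirement appears second, so Pos = 2.
results = [("The new telephone of level crossing ...", 0.71),
           ("For the purpose of railway traffic safety increase ...", 0.69)]
print(position_of_expected(results, "For the purpose of railway traffic safety increase ..."))
```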
Table 1. Test data

Test | Requirement entered by the user | Most similar requirement in database (identified manually)
1 | Local (at Ćuprija Station) equipment for remote control of traffic devices (indication, control and driver cards). | Local (at Jagodina Station) equipment for remote control of traffic devices (indication, control and driver cards).
2 | The contractor must provide the spare parts that the employer requested. The prices are specified in the spare parts list. | The Employer have the possibility to request, at the prices specified in the list of spare parts, and the contractor's obligation to deliver, a different number of spare parts specified in the list.
3 | To prevent accidents on the rail level crossing, a security camera system is required to be installed. | For the purpose of railway traffic safety increase, it is necessary to install video supervision system at the existing level crossings.
4 | The interlocking requirements includes a small maintenance costs, great availability, high reliability and long life duration. | Request of the Employer for interlocking is: high availability, high reliability and low maintenance costs. Long life of products through the use of modern technology.

Table 2. Test results
AWM = Abstract Word Metric, NA = Not Applicable, BF = Base Form, WWT = Word Weight Type
* Result obtained for cosine distance

Test | Component | Parameters | Similarity | Pos
1 | Cortical | NA | 99.99%* | 1
1 | Gensim | LDA, with stop words | 100% | 1
1 | ParallelDots | NA | 100% | 1
1 | Semilar | Lexical (BF=true) | 91.89% | 1
2 | Cortical | NA | 59.85%* | 1
2 | Gensim | LSI TF-IDF, without stop words | 91.30% | 1
2 | ParallelDots | NA | 97.82% | 1
2 | Semilar | Optimum (AWM=LCH, WWT=MIN/TEXTA) | 86.81% | 1
3 | Cortical | NA | 43.08%* | 1
3 | Gensim | LSI TF-IDF, without stop words | 75.28% | 1
3 | ParallelDots | NA | 97.80% | 1
3 | Semilar | Corley | 69.06% | 2
4 | Cortical | NA | 59.75%* | 1
4 | Gensim | LDA, with stop words | 99.03% | 1
4 | ParallelDots | NA | 98.20% | 1
4 | Semilar | Greedy (BF=false, AWM=LCH) | 59.45% | 1

4 Conclusions & Future Work

ORSIM allows the comparison of different similarity detection components (Cortical, Gensim, ParallelDots, and Semilar) in a quick and easy way, in order to know which one best suits the requirements at hand. To test the tool and the components, we performed a series of initial tests with a real requirements database. In our tests, ParallelDots was the component that returned the best results, but this could change depending on the database the requirements engineer is dealing with.

As future work, we want to identify which similarity detection component and parameterization behaves best in different situations. On the one hand, we want to identify the component and parameterization that provides the best results regardless of the set of requirements loaded into the component. To that end, we will conduct more tests using different datasets and we will further explore the parameters offered by Semilar and Gensim to determine whether a configuration other than the ones tested so far behaves better. On the other hand, taking into account that the final goal of ORSIM is to assist stakeholders in the choice of the similarity component and parameterization that will work best for their requirements, we want the tool to be able to learn which component and parameterization behaves better for data with specific characteristics. Additionally, we aim at integrating other similarity components such as DKPro [16], SenseClusters [17] or Scikit-Learn [18], to extend the possibilities of finding better results. In the longer term, we would like to make the ORSIM tool easy to extend with new components.

Acknowledgments

The work presented in this paper has been conducted within the scope of the Horizon 2020 project OpenReq, which is supported by the European Union under Grant Nr. 732463.

References

1. Lee, M.C., et al.: A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences. The Scientific World Journal (2014).
2. och Dag, J.N., et al.: Evaluating Automated Support for Requirements Similarity Analysis in Market-Driven Development. REFSQ (2001).
3. Carlshamre, P., et al.: An industrial survey of requirements interdependencies in software product release planning. IEEE International Symposium on Requirements Engineering (2001).
4. Intelligent Recommendation and Decision Technologies for Community-Driven Requirements Engineering (Horizon 2020 Project, https://www.openreq.org).
5. Cortical: http://www.cortical.io/. Last visited: January 22nd, 2018.
6. Gensim: https://radimrehurek.com/Gensim/index.html. Last visited: January 22nd, 2018.
7. Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. ICML (1997).
8. Foltz, P.W., et al.: The measurement of textual coherence with latent semantic analysis. Discourse Processes (1998).
9. Chen, Q., et al.: Short text classification based on LDA topic model. ICALIP (2016).
10. Lin, J., et al.: Dimensionality reduction by random projection and latent semantic indexing. SDM (2003).
11. ParallelDots: https://www.paralleldots.com/semantic-analysis. Last visited: January 22nd, 2018.
12. Semilar: http://deeptutor2.memphis.edu/Semilar-Web/index.jsp. Last visited: January 22nd, 2018.
13. Rus, V., et al.: SEMILAR: The Semantic Similarity Toolkit. ACL (2013).
14. StanfordNLP: https://nlp.stanford.edu/. Last visited: January 22nd, 2018.
15. OpenNLP: https://opennlp.apache.org/. Last visited: January 22nd, 2018.
16. DKPro: http://dkpro.github.io/. Last visited: January 22nd, 2018.
17. SenseClusters: http://senseclusters.sourceforge.net/. Last visited: January 22nd, 2018.
18. Scikit-Learn: http://scikit-learn.org/stable/. Last visited: January 22nd, 2018.