Research on NLP for RE at Università della Svizzera
                 Italiana (USI): A Report

                    Arianna Blasi                                              Mauro Pezzè
       Università della Svizzera italiana (USI)                 Università della Svizzera italiana (USI)
                 Lugano, Switzerland                                       Lugano, Switzerland
                 arianna.blasi@usi.ch                                     mauro.pezze@usi.ch
                           Alessandra Gorla                            Michael D. Ernst
                      IMDEA Software Institute                     University of Washington
                             Madrid, Spain                               Seattle, WA
                      alessandra.gorla@imdea.org                   mernst@cs.washington.edu


                                                       Abstract
                       We report on the activity of the Software Testing and Analysis Research
                       (STAR) laboratory of USI Università della Svizzera italiana on the
                       use of NLP to automatically generate test cases from documentation
                       in natural language. We first introduce the research contributions of
                       the group to contextualize the work related to NLP. We then summarize
                       our research techniques to automatically generate test oracles from
                       specifications expressed in terms of Javadoc tags using NLP. We conclude
                       by presenting the challenges of shifting the focus to more complex
                       software systems and to more complex artifacts in natural language.

1    Team Overview
This report describes research to generate inputs and oracles to automatically test software systems. Our research
uses Natural Language Processing (NLP) to automatically produce executable specifications from software
documentation written in natural language. This complements our other work on automatically generating test inputs
for applications with complex structured inputs [BDMP17], interactive and GUI-based applications [MPZ18],
concurrent and distributed software systems [TP18], and test cases for exercising software in the field [GMPP17].
   This report presents our work on the generation of test oracles, such as assertions for test cases. We focus on
the automatic generation of semantically relevant oracles, that is, oracles that can reveal failures due to semantic
mismatches with respect to software requirements. Such oracles are more powerful than simple implicit and
regression oracles [BHM+ 15] that are commonly generated by automatic testing tools such as Randoop [PLEB07]
and Evosuite [FA13].
   In 2009 we started investigating redundancy intrinsically present in software systems [CGP09, CGPP15],
and we designed a technique that exploits such redundancy to generate test oracles [CGG+ 14]. Along the
same line of research, we designed and experimented with techniques and prototype tools to automatically
identify semantically equivalent method calls in Java programs [GGM+ 14]. We then developed techniques
for automatically generating assertions from specifications expressed in natural language. We developed an
approach that exploits NLP to automatically infer executable specifications from Javadoc comments, and use
such executable specifications as test oracles [GGEP16, BGK+ 18]. In the next sections, we describe the results
that we have obtained so far, and our research plans to automatically generate semantically relevant oracles from
specifications in natural language with NLP.
Copyright © 2019 by the paper’s authors. Copying permitted for private and academic purposes.
2    Past Research on NLP for RE
Test cases can be generated from different sources of information [BP06, BHM+ 15]: requirements specifications
(black-box and model-based testing), source code (white-box testing), possible faults (fault-based testing), former
versions and similar code (regression and metamorphic testing). When generating test cases by exploiting
requirements specifications, the goal is to identify a finite set of test inputs that properly sample the execution
space (partition testing), and a set of assertions (oracles) that check the results of testing the software system.
Most of the approaches for automatically generating test cases proposed so far focus on generating test inputs
from formal and semi-formal specifications [PY07]. Relatively little work takes advantage of natural language
requirements, and the existing approaches rely on simple techniques: pattern matching to determine conditions
related to the nullness of parameters (Tan et al.’s @tComment [TMTL12]), and part-of-speech tagging combined
with pattern matching to generate simple pre- and post-conditions (Pandita et al.’s ALICS [PXZ+ 12]). Some approaches take advantage
of the simplifications induced by the structure of semi-formal specifications to generate test inputs. For instance,
Wang et al. automatically derive test cases from use case specifications [WPG+ 15]. Many techniques use NLP
to solve problems related to requirements quality, such as ambiguity [FDE+ 17], but these problems are not
specific to test oracles.
   We have investigated more powerful and effective approaches. As illustrated in the following simple example,
Javadoc tags indicate the scope of each specification (i.e., whether it is a pre-condition on a parameter or
a post-condition on the result of the method execution), and this simplifies the task of processing the information
for testing. However, Javadoc tags may predicate on program elements that are partially implicit, for instance
implicit subjects referring to parameters, and often use developers’ jargon, for instance "not null". Such features
pose a challenge for traditional natural language processing:

/**
 * Merges the arrays in input
 *
 * @param x the first array, not null
 * @param y the second array, not null
 * @return an array which is the result of the merge, empty if both arrays are empty
 * @throws IllegalArgumentException if either array is null
 */
public Object[] merge(Object[] x, Object[] y) throws IllegalArgumentException {...}
                                Listing 1: Sample Javadoc specification of a method

   The @param tags indicate the preconditions on the method input parameters, while the @return tag (at most one)
and the @throws tags (one for each exception that the method may throw) indicate the postconditions of the method
execution. The information expressed with Javadoc tags is useful to determine the correctness of the results of
the test executions, but it must first be translated into executable assertions that can act as test oracles.
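As a rough illustration of this structure, the following sketch groups the block tags of a Javadoc comment into preconditions (@param) and postconditions (@return, @throws). The names JavadocTagSplitter and split are hypothetical, and the regular expression assumes one tag per line; this is a simplification for illustration, not the way Toradocu or Jdoctor read comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups Javadoc block tags by their role in the specification. */
final class JavadocTagSplitter {

    // Matches a line such as " * @param x the first array, not null" (one tag per line assumed).
    private static final Pattern TAG_LINE =
            Pattern.compile("^\\s*\\*?\\s*@(param|return|throws)\\s+(.*\\S)\\s*$");

    /** Returns the tag texts keyed by "precondition" and "postcondition". */
    static Map<String, List<String>> split(String javadoc) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("precondition", new ArrayList<>());
        result.put("postcondition", new ArrayList<>());
        for (String line : javadoc.split("\\R")) {
            Matcher m = TAG_LINE.matcher(line);
            if (!m.matches()) {
                continue; // free text or comment delimiters
            }
            // @param constrains the inputs; @return and @throws describe the outcome.
            String role = m.group(1).equals("param") ? "precondition" : "postcondition";
            result.get(role).add("@" + m.group(1) + " " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String comment = String.join("\n",
                "/**",
                " * Merges the arrays in input",
                " *",
                " * @param x the first array, not null",
                " * @param y the second array, not null",
                " * @return an array which is the result of the merge, empty if both arrays are empty",
                " * @throws IllegalArgumentException if either array is null",
                " */");
        split(comment).forEach((role, tags) -> System.out.println(role + ": " + tags));
    }
}
```

Separating the tags by role first keeps the later translation step simple, since each group is checked at a different point of a test execution (before the call or after it).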
   For example, the translation of the specification expressed in the @param tags in Listing 1 is the following
executable assertion, which acts as a test oracle:
                                                   x != null && y != null

The translation of the @throws tag is:
                              (x == null || y == null) → java.lang.IllegalArgumentException

and the translation of the @return tag is:
                                    (x.length == 0 && y.length == 0) → result.length == 0
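To make concrete how such translations can serve as a test oracle, the sketch below checks a single execution of merge against the conditions above. The class MergeOracle, its check method, and the placeholder body of merge are assumptions made for illustration; Jdoctor produces conditions for a test generator and does not emit this hand-written checker.

```java
import java.util.function.BiFunction;

/** Hypothetical hand-written oracle built from the translations of Listing 1. */
final class MergeOracle {

    /** Placeholder implementation under test; plain concatenation is an assumption. */
    static Object[] merge(Object[] x, Object[] y) {
        if (x == null || y == null) {
            throw new IllegalArgumentException("arrays must not be null");
        }
        Object[] out = new Object[x.length + y.length];
        System.arraycopy(x, 0, out, 0, x.length);
        System.arraycopy(y, 0, out, x.length, y.length);
        return out;
    }

    /** Returns true if one execution of the method agrees with the documented behavior. */
    static boolean check(BiFunction<Object[], Object[], Object[]> method, Object[] x, Object[] y) {
        // Guard of the @throws translation; it is the negation of the @param condition.
        boolean expectException = (x == null || y == null);
        try {
            Object[] result = method.apply(x, y);
            if (expectException) {
                return false; // the documented exception was not thrown
            }
            // @return translation: (x.length == 0 && y.length == 0) -> result.length == 0
            boolean bothEmpty = x.length == 0 && y.length == 0;
            return !bothEmpty || (result != null && result.length == 0);
        } catch (IllegalArgumentException e) {
            // Design choice of this sketch: the exception is accepted only when documented.
            return expectException;
        }
    }

    public static void main(String[] args) {
        System.out.println(check(MergeOracle::merge, new Object[0], new Object[0]));
        System.out.println(check(MergeOracle::merge, null, new Object[] {"a"}));
        // A faulty variant that ignores the documented exception is flagged.
        System.out.println(check((x, y) -> new Object[0], null, null));
    }
}
```

In this example the @param condition and the guard of the @throws condition are complementary, so every input falls under exactly one of the two postconditions.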

   We designed and developed Toradocu [GGEP16], later extended to Jdoctor [BGK+ 18], a technique that
automatically infers executable assertions from comments in Javadoc tags expressed in natural language, as shown in
the example. The early Toradocu approach performs simple translations of exceptional postconditions. Jdoctor
extends Toradocu to all Javadoc tags and greatly improves its translation abilities, also supporting
semantic similarity for interpreting synonyms [KSKW15].
   Toradocu and Jdoctor use the Stanford Parser to produce a semantic graph for each sentence. First, they
preprocess the text in natural language to deal with the peculiarities of Javadoc comments, which are rarely complete
and grammatically sound English sentences. For example, most Javadoc specifications lack punctuation, many
have implicit subjects and verbs, and they often intermix mathematical or code notation with English. Also, different
types of tags need different preprocessing, and Toradocu and Jdoctor take this into account.
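The kind of preprocessing described above can be sketched for @param comments as follows: the fragment after the last comma is treated as the constraint, the parameter name becomes the explicit subject, an implicit verb is added, and the sentence is closed with a period. The class ParamCommentNormalizer and its rules are hypothetical simplifications chosen for this example; they are not the actual rules implemented in Toradocu or Jdoctor.

```java
/** Hypothetical preprocessing step: turns a terse @param comment into a full sentence. */
final class ParamCommentNormalizer {

    /**
     * Assumes the constraint is the text after the last comma, as in
     * "the first array, not null". Returns a sentence with an explicit subject
     * and verb, or an empty string if no constraint is found.
     */
    static String normalize(String paramName, String description) {
        int comma = description.lastIndexOf(',');
        if (comma < 0) {
            return ""; // no constraint fragment in this comment
        }
        String constraint = description.substring(comma + 1).trim();
        if (constraint.endsWith(".")) {
            constraint = constraint.substring(0, constraint.length() - 1).trim();
        }
        if (constraint.isEmpty()) {
            return "";
        }
        // Add the implicit verb unless the fragment already starts with one.
        boolean hasVerb = constraint.matches("(?i)^(is|are|must|should|may|cannot|can)\\b.*");
        return paramName + (hasVerb ? " " : " is ") + constraint + ".";
    }

    public static void main(String[] args) {
        System.out.println(normalize("x", "the first array, not null"));
        System.out.println(normalize("index", "the position, must be non-negative"));
    }
}
```

A sentence with an explicit subject, verb, and final punctuation is a more suitable input for a general-purpose parser than the original fragment.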
   The core idea of Toradocu and Jdoctor is to exploit information already present in the source code to produce
ready-to-use executable assertions. This approach does not require any other external intervention or effort from
developers. The latest experimental results, obtained by executing Jdoctor on 6 popular open source Java projects,
are encouraging: the tool achieves 92% recall and 83% precision on 829 translations [BGK+ 18]. Also, Jdoctor
assertions are officially integrated with Randoop [PLEB07], and in our evaluation they produce test cases that
raise fewer false alarms and reveal more defects.

3   Research Plan on NLP for RE
The results of our past work confirm the research hypothesis of our long term research plan: automatically
generating test inputs and oracles from requirements specifications given in natural language with NLP is feasible
and effective.
   In the short term, we plan to analyze free, unstructured text in Javadoc, in addition to the specific Javadoc tags
that we already support. Moreover, we aim to extract information beyond functional properties. We plan to focus
on temporal, security, performance and other non-functional properties. As an example, Figure 1 shows some
temporal properties about call protocols (see footnote 1). Such information is very useful for reducing the number
of false alarms that affect existing testing approaches.


               Figure 1: Javadoc documentation (free text part) of Apache Commons Collections
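A temporal property about a call protocol can be checked as a small state machine over the sequence of calls in a test. The sketch below uses an invented protocol (open, then any number of read calls, then close) that is not taken from Figure 1; the class CallProtocolOracle and its transition table are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical oracle for an invented call protocol: open, then read any number of times, then close. */
final class CallProtocolOracle {

    // For each state (the last call made), the calls that may legally follow.
    private static final Map<String, Set<String>> ALLOWED_NEXT = Map.of(
            "<start>", Set.<String>of("open"),
            "open", Set.<String>of("read", "close"),
            "read", Set.<String>of("read", "close"),
            "close", Set.<String>of());

    /** Returns true if the call sequence respects the protocol. */
    static boolean conforms(List<String> calls) {
        String state = "<start>";
        for (String call : calls) {
            Set<String> next = ALLOWED_NEXT.get(state);
            if (next == null || !next.contains(call)) {
                return false; // call not allowed in the current state
            }
            state = call;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(conforms(List.of("open", "read", "close")));
        System.out.println(conforms(List.of("read", "close")));
    }
}
```

A test generator could use such a check to discard call sequences that violate the documented protocol, instead of reporting the resulting failures as defects.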

   We will combine different approaches to interpret various properties expressed in natural language. We will
resort to Open Information Extraction to infer information from unstructured text [DCG13, FSE11, SBS+ 12],
and natural language parsing, pattern matching, semantic similarities and machine translation techniques to
match documentation with code elements.
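The simplest form of matching documentation with code elements is lexical: an identifier is split into words and compared with a phrase from the comment. The sketch below shows only this baseline, with hypothetical names (LexicalMatcher, words, match); semantic similarity and the other techniques listed above would be needed for phrases that do not literally mirror an identifier.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical baseline: matches a comment phrase to an identifier by its camelCase words. */
final class LexicalMatcher {

    /** Splits a camelCase identifier into lower-case words, e.g. "isEmpty" becomes "is empty". */
    static String words(String identifier) {
        return identifier.replaceAll("([a-z0-9])([A-Z])", "$1 $2").toLowerCase(Locale.ROOT);
    }

    /** Returns the first identifier whose words equal the normalized phrase, if any. */
    static Optional<String> match(String phrase, List<String> identifiers) {
        String normalized = phrase.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        return identifiers.stream()
                .filter(id -> words(id).equals(normalized))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> methods = List.of("size", "isEmpty", "containsAll");
        System.out.println(match("is empty", methods));
        System.out.println(match("has no elements", methods));
    }
}
```

The second query in main finds no match even though it describes the same property as isEmpty, which is the gap that synonym-aware matching is meant to close.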
   Our mid-term plan aims to extend Jdoctor to deal with information coming from other artifacts in natural
language, such as wikis, issue trackers, and community forums, which are commonly available for popular
applications. These artifacts do not have a scope as narrow as that of Javadoc comments, which refer
to a specific method or a specific class. However, they are still often partially structured, and thus we believe
that our techniques, if properly extended, can deal with them. The software engineering research community
has already produced some techniques that derive test artifacts from system requirements such as use-case
requirements [WPG+ 15, MPGB18].
   Our long term plan is to define and develop a set of techniques to automatically test human-centric software
systems. Such systems have key features that make them different from traditional software systems: First and
foremost, the user is an integral part of the system. Secondly, they often integrate different sensors and physical
devices. Lastly, they often rely on machine learning components that drive the decisions of the system based on
the observed inputs from sensors. In a nutshell, human-centric software systems can be seen as an evolution of
ultra-large software systems, also called systems of systems [MPS08].
   To deal with human-centric software systems we need to radically change the considered scenario and widen the
set of techniques that we plan to use, mostly because the expected behavior of the system may be hard to predict,
and it is seldom specified in the requirements. So far, we have studied the problem of automatically generating test
inputs and oracles for functional properties of program units (classes and methods) with deterministic behaviors.
   To address the problem of properly testing human-centric software systems, we need to move from functional
properties of software components with deterministic behavior to properties of subsystems with nondeterministic
behavior. Nondeterminism may derive from concurrency, and may be due to machine learning components that
act differently depending on the underlying model they use. Moreover, external physical sensors and users
involved in the system may increase the uncertainty of the expected behavior of the system.

1   https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/BagUtils.html

   Despite these challenges, we still plan to focus our analysis on natural language artifacts, and aim to infer the
missing information to test such systems. No matter how complex such systems may be, their requirements still
have to be expressed in some form, for example as classical user stories. We will investigate what kinds
of artifacts are most commonly used to document such systems. We will study different ways of contextualizing the
fragmented and incomplete information expressed in natural language to resolve ambiguities and incompleteness,
and we will exploit the inferred specifications to test these complex systems.

Acknowledgments
This work was partially supported by the Spanish project DETEST, by the Madrid Regional projects BLUETS
and MadridFlightOnChip, and by the Swiss project ASTERIx: Automatic System TEsting of inteRactive
software applIcations (SNF-200021 178742). This material is also based on research sponsored by DARPA under
agreement numbers FA8750-12-2-0107, FA8750-15-C-0010, and FA8750-16-2-0032. The U.S. Government is
authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation
thereon.

References
[BDMP17] Pietro Braione, Giovanni Denaro, Andrea Mattavelli, and Mauro Pezzè. Combining symbolic
         execution and search-based testing for programs with complex heap inputs. In Proceedings of the
         International Symposium on Software Testing and Analysis, ISSTA ’17, pages 90–101. ACM, 2017.

[BGK+ 18] Arianna Blasi, Alberto Goffi, Konstantin Kuznetsov, Alessandra Gorla, Michael D. Ernst, Mauro
          Pezzè, and Sergio Delgado Castellanos. Translating code comments to procedure specifications. In
          Proceedings of the International Symposium on Software Testing and Analysis, ISSTA ’18. ACM,
          2018.

[BHM+ 15] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem
          in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015.

[BP06]      Luciano Baresi and Mauro Pezzè. An introduction to software testing. Electronic Notes in Theoretical
            Computer Science, 148(1):89–111, 2006.

[CGG+ 14] Antonio Carzaniga, Alberto Goffi, Alessandra Gorla, Andrea Mattavelli, and Mauro Pezzè. Cross-
          checking oracles from intrinsic software redundancy. In Proceedings of the International Conference
          on Software Engineering, ICSE ’14, pages 931–942. ACM, 2014.

[CGP09]     Antonio Carzaniga, Alessandra Gorla, and Mauro Pezzè. Fault handling with software redundancy.
            In R. de Lemos, J. Fabre, C. Gacek, F. Gadducci, and M. ter Beek, editors, Architecting Dependable
            Systems VI, pages 148–171. Springer, 2009.

[CGPP15] Antonio Carzaniga, Alessandra Gorla, Nicolò Perino, and Mauro Pezzè. Automatic workarounds:
         Exploiting the intrinsic redundancy of web applications. ACM Transactions on Software Engineering
         and Methodology, 24(3):16, 2015.

[DCG13]     Luciano Del Corro and Rainer Gemulla. ClausIE: Clause-based open information extraction. In
            Proceedings of the International Conference on World Wide Web, WWW ’13, pages 355–366. ACM,
            2013.

[FA13]      Gordon Fraser and Andrea Arcuri. Whole test suite generation. IEEE Transactions on Software
            Engineering, 39(2):276–291, 2013.

[FDE+ 17] Alessio Ferrari, Felice Dell’Orletta, Andrea Esuli, Vincenzo Gervasi, and Stefania Gnesi. Natural
          language requirements processing: A 4D vision. IEEE Software, 34(6):28–35, 2017.
[FSE11]     Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information
            extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
            EMNLP ’11, pages 1535–1545. Association for Computational Linguistics, 2011.
[GGEP16] Alberto Goffi, Alessandra Gorla, Michael D. Ernst, and Mauro Pezzè. Automatic generation of oracles
         for exceptional behaviors. In Proceedings of the International Symposium on Software Testing and
         Analysis, ISSTA ’16, pages 213–224. ACM, 2016.
[GGM+ 14] Alberto Goffi, Alessandra Gorla, Andrea Mattavelli, Mauro Pezzè, and Paolo Tonella. Search-based
          synthesis of equivalent method sequences. In Proceedings of the ACM SIGSOFT International
          Symposium on Foundations of Software Engineering, FSE ’14, pages 366–376. ACM, 2014.
[GMPP17] Luca Gazzola, Leonardo Mariani, Fabrizio Pastore, and Mauro Pezzè. An exploratory study of
         field failures. In Proceedings of the International Symposium on Software Reliability Engineering,
         ISSRE ’17, 2017.
[KSKW15] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to
         document distances. In Proceedings of the International Conference on Machine Learning,
         ICML ’15, pages 957–966, 2015.
[MPGB18] Phu X. Mai, Fabrizio Pastore, Arda Goknil, and Lionel C. Briand. A natural language programming
         approach for requirements-based security testing. In Proceedings of the International Symposium on
         Software Reliability Engineering, ISSRE ’18, pages 58–69. IEEE Computer Society, 2018.
[MPS08]     Hausi Müller, Mauro Pezzè, and Mary Shaw. Visibility of control in adaptive systems. In ULSSIS
            ’08: Proceedings of the 2nd International Workshop on Ultra-Large-Scale Software-Intensive Systems,
            pages 23–26. ACM, 2008.
[MPZ18]     Leonardo Mariani, Mauro Pezzè, and Daniele Zuddas. Augusto: Exploiting popular functionalities
            for the generation of semantic GUI tests with oracles. In Proceedings of the International Conference
            on Software Engineering, ICSE ’18, pages 280–290, 2018.
[PLEB07] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random
         test generation. In Proceedings of the International Conference on Software Engineering, ICSE ’07,
         pages 75–84. ACM, 2007.
[PXZ+ 12] Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit Paradkar. Inferring
          method specifications from natural language API descriptions. In Proceedings of the International
          Conference on Software Engineering, ICSE ’12, pages 815–825. IEEE Computer Society, 2012.
[PY07]      Mauro Pezzè and Michal Young. Software Testing and Analysis: Process, Principles and Techniques.
            Wiley, 2007.
[SBS+ 12]   Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, et al. Open language learning for
            information extraction. In Proceedings of the Joint Conference on Empirical Methods in Natural
            Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, pages
            523–534. Association for Computational Linguistics, 2012.
[TMTL12] Shin Hwei Tan, Darko Marinov, Lin Tan, and Gary T. Leavens. @tComment: Testing Javadoc
         comments to detect comment-code inconsistencies. In Proceedings of the International Conference
         on Software Testing, Verification and Validation, ICST ’12, pages 260–269. IEEE Computer Society,
         2012.
[TP18]      Valerio Terragni and Mauro Pezzè. Effectiveness and challenges in generating concurrent tests for
            thread-safe classes. In Proceedings of the International Conference on Automated Software
            Engineering, ASE ’18. ACM, 2018.
[WPG+ 15] Chunhui Wang, Fabrizio Pastore, Arda Goknil, Lionel Briand, and Zohaib Iqbal. Automatic
          generation of system test cases from use case specifications. In Proceedings of the International Symposium
          on Software Testing and Analysis, ISSTA ’15, pages 385–396. ACM, 2015.