Which Semantics for Requirements Engineering:
                                from Shallow to Deep


        Roberto Garigliano              Dominic Perini                                                 Luisa Mich
 SenseGraph Ltd, United Kingdom SenseGraph Ltd, United Kingdom                                  University of Trento, Italy
 roberto_garigliano@hotmail.com    dominic.perini@gmail.com                                       luisa.mich@unitn.it


                                                                Abstract
                         Natural language processing has been proposed and applied to support a
                         variety of tasks in requirements engineering. While shallow semantics
                         addresses many of the challenges, automating requirements analysis further
                         requires a full understanding of textual requirements. To this end,
                         the next generation of natural language processing systems needs a deep
                         semantics, that is, a representation of the content independent of the surface
                         description, which represents hidden causal, spatial, temporal and modal
                         connections.


1 Introduction
Natural language processing (NLP) tools and systems have been applied to analysing requirements texts since the 1980s
[Abb83]. In software engineering the goal was to design programs from informal English descriptions. Since then, various
problems have shown the limits of existing technologies in reaching that objective.
  A subsequent wave of papers followed the arrival of object-oriented methods, suggesting the application of linguistic
rules to extract classes and objects from natural language problem statements, in order to develop conceptual models of
requirements [Boo94; Rum91; Bur95].
  Another area of application of linguistic tools in requirements engineering (RE) is related to the identification of
ambiguities in natural language requirements, to improve their quality [Kiy08; Tjo13; Fer16].
  More recent research projects focus on the extraction of requirements specifications from regulatory documents, to
design computer-based systems compliant with security and privacy laws (e.g., [Gov09]), a critical challenge in an
Internet-centred world. There are also proposals to analyse user-generated content in order to extract requirements (e.g.,
[Bäu17]) for the purpose of improving products or services as part of a management strategy exploiting textual reviews
available on the Web.
  Based on the experience gained in more than 30 years, the paper summarises the most relevant mistaken assumptions
regarding NLP in RE and argues that an NLP system able to fully understand the meaning of natural language
requirements needs a deep semantics. The main assumptions of the proposed deep semantics are given, and an example
of its implementation in a large domain-independent NLP system, SenseGraph, is described.


Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.


2 Mistaken assumptions in NLP and RE
Among the large number of papers published on NLP in RE, it is worth citing [Rya93], who back in 1993 advised “that
potential role of natural language processing in the requirements engineering process has been overstated in the past,
possibly due to fundamental misunderstandings of the requirements engineering process itself.” The author proposed
identifying activities where NLP could be usefully applied, underlining the difficulty of taking into account common
knowledge, that is, knowledge that is not (and could not be) explicitly given in requirements documents (e.g., who a
customer or a user is), as well as the difficulties inherent in requirements engineering for complex systems. In this way,
Ryan suggested taking into account the state of the art of the linguistic tools available at that time and the need to further
investigate how to exploit NLP in RE.
   Burg [Bur97] introduced a RE method which makes heavy use of linguistic instruments, suggesting the use of
semantics to avoid ambiguity and incompleteness problems in natural language texts. However, his approach deals with
the semantics at the level of single words, whose meaning is described using a formal representation.
   Many other projects have focused on the use of linguistic tools and systems in RE. An analysis of the literature
highlights two relevant and somewhat opposite attitudes:
    − An under-estimation of the complexity of natural language, which leads to a number of restrictions on the input,
       the grammar, the domain, etc.
    − A focus on very specific issues or tasks such as, for example, identifying entities, resolving anaphoric references,
       or looking for a given type of ambiguity.
   Both trends confirm the need for a new generation of NLP systems, able to deal with meaning in a way closer to
that understood by native speakers.
   In our first project with the NLP system LOLITA, which supported the generation of class and use case models, three
mistaken assumptions were identified [Mic94; Mic96; Mic02]:
    − There is an isomorphism between syntax and semantics, that is, between the role of the words in a sentence and
       their meaning. The correspondence is imperfect, so that, for example, rules that extract objects from nouns and
       methods from verbs may not work (nouns can be verbalised and verbs can be nominalised [Boo94]).
    − All the information is in the text. It is not: for example, common knowledge is taken for granted [Len90].
    − All the information in the text is useful. This is not true, as there may be redundancies, misinterpreted facts,
       ambiguities, etc. (e.g., [Gén13]).
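
   The first assumption can be illustrated with a deliberately naive extractor. The sketch below is not the method used
with LOLITA; it assumes a tiny hand-written part-of-speech lexicon (TOY_POS, hypothetical) and maps nouns to candidate
classes and verbs to candidate methods. Two sentences that a reader would take to describe the same situation then yield
different candidate models, because "payment" is a nominalised verb.

```python
# Toy part-of-speech lexicon (hypothetical, for illustration only).
TOY_POS = {
    "customer": "NOUN", "invoice": "NOUN", "payment": "NOUN",
    "pays": "VERB", "makes": "VERB",
}

def naive_model(sentence):
    """Map nouns to candidate classes and verbs to candidate methods."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    classes = sorted({w for w in words if TOY_POS.get(w) == "NOUN"})
    methods = sorted({w for w in words if TOY_POS.get(w) == "VERB"})
    return classes, methods

s1 = "The customer pays the invoice."
s2 = "The customer makes a payment of the invoice."

model_1 = naive_model(s1)
model_2 = naive_model(s2)
print(model_1)
print(model_2)
```

   In the second sentence the action of paying surfaces as a class and the only method found is the light verb "makes",
so a purely syntactic rule produces two different models for one meaning.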
   Experiments run with LOLITA, a large NLP system designed according to the principles of natural language
engineering [Bog95], to identify ambiguities in natural language requirements made it possible to introduce measures for
different kinds of ambiguity, but not to support their elimination [Mic00].
   Some may regard these as simplifying assumptions, and they are, provided that they are made consciously. If
the decision to use the simplifying assumptions is taken by a human who has read the documents and who understands
the client’s needs, then of course it is fine, but it defeats the objective of automatic processing. If it is taken by a machine,
it is highly dangerous, as the parts left out might be crucial. For example, assume that the requirement is “the system will be
kept going at all times, provided the temperature does not rise over X”. If the “provided” part is eliminated, the precision
is still 100%, but the result could be disastrous.
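
   The arithmetic behind this remark can be made explicit. The following sketch assumes a set-based definition of
precision and recall and a hypothetical decomposition of the requirement into two statements (the main clause and its
condition); neither is prescribed by the tools discussed here.

```python
def precision_recall(extracted, gold):
    """Set-based precision and recall over extracted statements."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold decomposition of the requirement.
gold = {
    "system is kept going at all times",
    "condition: temperature does not rise over X",
}
# A tool that silently drops the "provided ..." clause.
extracted = {"system is kept going at all times"}

precision, recall = precision_recall(extracted, gold)
print(precision, recall)
```

   Everything the tool reports is correct, so precision is unaffected; only recall reveals that the condition was lost.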
   On the other hand, the need for tools supporting requirements engineers in dealing with real natural language input,
without assuming that the requirements are available in a formal language or written in a controlled
language, was confirmed by a survey run in 1999 [Mic04]. At the same time, the state of the art of NLP tools and
their limitations prompted researchers to identify tasks in RE that could be addressed using lightweight linguistic tools in
semi-automatic approaches, in which manual activities are necessary to complete or supervise the tasks [Kiy08; Zen07].
Finally, the development of systems that analyse regulatory texts to extract compliance requirements [Zen15; Zen17]
further confirmed that better support for RE needs a full-fledged NLP system.
   This is the goal of SenseGraph, i.e. to implement a deep semantic analysis which is close to what some cognitive
science research indicates as the human internal representation [Joh06]. In our vision, SenseGraph can be fed NL
requirements documents with minimal human editing (e.g., to eliminate pictures, HTML tags, etc.) and extract a deep
semantic version which clearly shows the crucial causal and temporal links, and which could also be queried through NLP. In
this way requirements engineering activities that now have to be supervised and completed by humans could be further
automated, overcoming the limitations of linguistic analysis based on shallow semantics.


3 Deep semantics
Natural language texts are usually analysed in a sequential process, starting with lexical and structural elements, parsing
the text to identify the most suitable parse tree, and then applying more or less complex techniques to interpret the semantic
content, that is, to understand the meaning. This sequence of analyses neither yields a full understanding of the content nor
produces a representation of the meaning that is independent of the surface description.
   This paper proposes a deep internal semantic representation of the text, which attempts to describe its meaning in a
form that can be (even very) different from the original one and to extract information that is not explicitly given in the
text.
   The deep semantics provided by SenseGraph (http://www.sensegraph.com) is developed following a strongly
minimalist theory:
   − The basic event representation units required by the model are the simple subject-action-object or subject-action
      (for intransitive actions). These units take place in time-space, thus they need to be positioned either in ‘absolute’
      terms (e.g. an address such as via Garibaldi 31, Messina, or a specific time such as 10th January 2018) or in relative
      ones. More importantly from the point of view of discourse analysis, such basic units are also positioned with
      respect to one another, e.g. by space correlations (event A takes place 50 meters from event B) or time
      correlations (event A takes place before event B). Of course these are ‘absolute’ or ‘relative’ only in natural
      language terms, not in any scientific sense such as that of relativity theory.
   − Causal relationships are a subset of the temporal ones, which have the specific feature of forcing some sequence
      aspects (e.g. necessity, sufficiency etc.). The only other relationships between events are those that create the
      transition from a common world to a personal one. For example, in “Tom thinks that Mary is pretty”, the fact that
      “Tom thinks …” is in the common world, while the fact that “she is pretty” is true in Tom’s world. A similar thing
      happens in the relationship between message and content (e.g. “the article contains the story of the robbery”).
   − All other linguistic expressions must be reduced to this simple model, without losing any meaning that a
      competent reader can recover from the surface structure. Given the huge variety of surface forms, and the very
      limited set of primitives in the deep semantics, this transition requires a set of rules which are very complex, both
      as theory and as implementation. Most of the theory has been developed and a good deal of it already implemented.
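
   The primitives listed above can be sketched as a small data structure. The code below is only an illustration under
our own naming assumptions (Event, Discourse, world, add_link and before are hypothetical and do not reproduce the
SenseGraph internals): an event is a subject-action-object unit with optional position in space and time and a world in
which it holds, and causal links are stored as a special case of temporal links.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Event:
    """Basic unit: subject-action-object, or subject-action if intransitive."""
    subject: str
    action: str
    obj: Optional[object] = None   # an entity name, another Event, or None
    place: Optional[str] = None    # 'absolute' position in space, if known
    time: Optional[str] = None     # 'absolute' position in time, if known
    world: str = "common"          # the common world, or a personal one

@dataclass
class Discourse:
    events: list = field(default_factory=list)
    # Each link is (kind, first, second), with kind "before" or "causes".
    links: list = field(default_factory=list)

    def add_link(self, kind, first, second):
        self.links.append((kind, first, second))

    def before(self, first, second):
        # Causal links are a subset of temporal ones, so any stored
        # link from first to second implies temporal precedence.
        return any(a == first and b == second for _, a, b in self.links)

# "Tom thinks that Mary is pretty": the content holds in Tom's world only.
pretty = Event("Mary", "be", "pretty", world="Tom")
thinks = Event("Tom", "think", pretty)

a = Event("A", "occur", time="10th January 2018")
b = Event("B", "occur", place="via Garibaldi 31, Messina")
d = Discourse(events=[thinks, pretty, a, b])
d.add_link("causes", a, b)
print(d.before(a, b))
```

   Keeping the set of link kinds this small is what makes reasoning over the resulting graph simple: a query only has
to follow temporal, causal or world-transition links.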
   SenseGraph has been used, and preliminarily validated, in a national project and in a European project. In these
projects, the system was tasked with analysing texts from similar domains (terrorism for the national project and crime
for the European project), while the type of text was different (short, information-rich Reuters flash news in Syntesis;
long newspaper articles and blogs in LASIE). In both cases, the goal was to produce an analysis that helped
investigators by providing a clear representation of the information and the underlying structures. In reaching these
objectives, the deep semantic representation proved to be the key feature, since it made it possible to unify apparently
different entities and events and to connect them using implicit deep temporal, causal and spatial chains. It was also
essential in extracting motivations, likely actions, elements of planning and other mental structures.
   Deep semantics is particularly suited to requirements analysis, since it leads to very standardised representations of
texts which may appear very different on the surface. It also allows easy and efficient reasoning, because there are so few
types of links allowed between events. The following example illustrates how distant the original text and its surface
representation (parse or shallow semantics) are from its deep semantics. The input is the following sentence, which could
be part of a text used to create a database to store data about crimes for the police:

  (a) A 59-year-old man from York has been arrested on suspicion of murdering missing chef Claudia Lawrence.

   A shallow semantic analysis of the sentence cannot help in answering questions such as “Who arrested the man?”, or
more complex ones such as “Why was the man arrested?”, nor can the parse tree help more, because of its
dependency on the surface structure. In this sentence a lot of knowledge is implicit, but a reader would be able to
interpret it, understanding it as follows:
   Claudia Lawrence worked as a chef, then she disappeared, then she may have been murdered, then police suspected
that a man murdered her and so they arrested him. He had been in York before police arrested him, and was 59 years of
age when the police arrested him. Thus, deep semantics means that all the implicit information (e.g. events hidden inside
nouns such as suspicion, adjectives such as missing or roles such as chef) has to be extracted and organised in small
atomic units, which are then put together in the correct temporal and causal sequence.
   In SenseGraph, information is represented as objects and events, and for that sentence the system creates 4 objects and
14 events. The 4 objects are the concept of man, York (used in the event which describes the man’s position before the
arrest), Claudia Lawrence, and police, derived from the subject of the general event used to represent an arrest. Arrest is
an example of a general event marked as prototypical, which makes it possible to make police explicit as the subject of the
arrest. Other prototypes used to represent the meaning of the sentence are those of murder and suspicion. Police is also
used as the subject of the events “Police arrests a man”, “Police suspects a man murdering Claudia Lawrence” and “Police
suspects a man murdering Claudia Lawrence so they arrests him”. The last event represents the reason for the arrest, i.e.
the causal link between suspicion and arrest.
  The meaning of the last part of the sentence is represented by the following events: “Claudia Lawrence is a chef”,
“Claudia Lawrence disappears”, and “Claudia Lawrence works as Chef, before she disappears”. The murder, the
suspicion and the arrest are connected by the event “Police suspects a man murdering Claudia Lawrence so they arrests
him.” The other events are needed to represent the causal, spatial and temporal relations among the events in the original
sentence. Notice how the suspicion is real in the police ‘world’ (at least enough to cause the arrest), but only
‘hypothetical’ in the reporter’s world.
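
   Once the events and their links are explicit, the two questions raised above become simple look-ups. The sketch below
hand-encodes a simplified subset of the events for sentence (a); the identifiers e1 to e5, the tuple layout and the helper
functions who and why are our own assumptions for illustration and are not output of the system, which creates 14
events for this sentence.

```python
# Hand-encoded subset of the events for sentence (a).
# Identifiers e1..e5 are hypothetical labels, not system identifiers.
events = {
    "e1": ("Claudia Lawrence", "work as", "chef"),
    "e2": ("Claudia Lawrence", "disappear", None),
    "e3": ("man", "murder", "Claudia Lawrence"),  # holds in the police world only
    "e4": ("police", "suspect", "e3"),
    "e5": ("police", "arrest", "man"),
}
before = [("e1", "e2"), ("e2", "e4")]   # temporal links
causes = [("e4", "e5")]                 # causal links (also temporal)

def who(action, obj):
    """Subjects of every event with the given action and object."""
    return [s for s, a, o in events.values() if a == action and o == obj]

def why(event_id):
    """Events that are recorded as causes of the given event."""
    return [events[c] for c, e in causes if e == event_id]

print(who("arrest", "man"))   # "Who arrested the man?"
print(why("e5"))              # "Why was the man arrested?"
```

   Neither answer is available from the surface form of sentence (a), where the arresting agent is not mentioned and the
reason is hidden inside the noun "suspicion".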
  The following phrase:

  (b) Police have arrested a York man, aged 59, because they suspect him to be the murderer of Claudia Lawrence, the
      chef who has disappeared.

   has the same meaning for any competent native speaker as phrase (a), but it produces a completely different parse tree
and surface semantics. Our system, however, produces the same deep representation. It should be noted that there is a
large number of ways in which this same meaning could be expressed.
   Figure 1 illustrates an extract of the output produced by SenseGraph for the sentence: the event arrest, the object police
and the event created to make explicit the fact that the man is 59 years old when he is arrested by the police.
   Notice how the final analysis is rather distant from the original text, although, according to Mental Models Theory, it
is very close to how a native speaker would mentally visualise the story [Joh06]. The system presents this information in
an interactive graphical form, as well as in a textual one.


    * arrest/1: 109608 *
  universal_:
    Event - 74883 - rank: universal
    arrest/1 - 820 -
  subject_:
    police/2 - 172748 - rank: universal
  action_:
    arrest/4 - 823 -
  object_:
    man/2 - 79018 - rank: individual
  time_:
    present_ - 248575 -
  object_of:
    Event - 170891 - rank: individual
    Event - 172767 - rank: individual
    Event - 172779 - rank: individual
  **********************************
  event: Police arrests a man.

    * police/2: 172748 *
  generalisation_:
    police/2 - 171402 - rank: universal
  subject_of:
    arrest/1 - 109608 - rank: individual
    suspicion/1 - 172745 - rank: individual
  **********************************
  object: Police.

    * Event: 172767 *
  universal_:
    Event - 74883 - rank: universal
  subject_:
    Event - 79016 - rank: individual
  action_:
    during/2 - 61250 -
  object_:
    arrest/1 - 109608 - rank: individual
  **********************************
  event: A man having age 59 during police arresting him.

                      Figure 1 - Examples of events and objects used for representing deep semantics
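
   The three nodes of Figure 1 form a small graph that can be navigated through their numeric identifiers. The sketch
below transcribes the identifiers and links from the figure; the dictionary layout and the helper functions events_about
and subject_text are assumptions made for illustration and do not describe the actual storage format of the system.

```python
# Nodes transcribed from Figure 1; the dictionary layout is an assumption.
nodes = {
    109608: {"label": "arrest/1", "subject_": 172748, "action_": 823,
             "object_": 79018, "object_of": [170891, 172767, 172779],
             "text": "Police arrests a man."},
    172748: {"label": "police/2", "subject_of": [109608, 172745],
             "text": "Police."},
    172767: {"label": "Event", "subject_": 79016, "action_": 61250,
             "object_": 109608,
             "text": "A man having age 59 during police arresting him."},
}

def events_about(node_id):
    """Texts of the known events that take the given node as their object."""
    linked = nodes[node_id].get("object_of", [])
    return [nodes[i]["text"] for i in linked if i in nodes]

def subject_text(node_id):
    """Text of the subject node, or None if it is not in the extract."""
    subj = nodes[node_id].get("subject_")
    return nodes[subj]["text"] if subj in nodes else None

print(subject_text(109608))
print(events_about(109608))
```

   Only one of the three events listed under object_of for the arrest appears in the extract, so the other two identifiers
are skipped rather than guessed.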


4 Conclusions
The present goal is to provide a requirements analysis that can be easily understood and checked by a human, using
graphic displays and NLP query answering. Furthermore, because the internal representation is a well-formalised graph,
the analysis results could also be fed directly into the next stage of system development.
   The roadmap is as follows: increase the ability of the hand-crafted system; enlarge the set of gold model analyses
(i.e. results of system analysis improved by hand); use this set with a suitable fitness function [Kra17] to improve large-scale
testing of the system; improve the system itself by deploying genetic algorithms based on the existing system, the fitness
function and the gold models. All of these are being developed using parallel architectures (Erlang and serverless clouds).
At the same time, user-friendly interfaces are being developed.
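
   The roadmap does not fix a particular fitness function. As one possible choice, sketched here purely as an assumption,
a candidate version of the system could be scored by the mean F1 overlap between the event triples it produces and those
of the gold models; a genetic algorithm would then use this score to rank candidate variants. The names f1, fitness and
toy_analyser are hypothetical.

```python
def f1(predicted, gold):
    """F1 overlap between two sets of (subject, action, object) triples."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def fitness(analyse, gold_models):
    """Mean F1 of a candidate analyser over (text, gold triples) pairs."""
    scores = [f1(analyse(text), gold) for text, gold in gold_models]
    return sum(scores) / len(scores) if scores else 0.0

# One gold model, taken from the event shown in Figure 1.
gold_models = [
    ("Police arrests a man.", {("police", "arrest", "man")}),
]

def toy_analyser(text):
    # Stand-in for a candidate system: one correct and one spurious triple.
    return {("police", "arrest", "man"), ("man", "arrest", "police")}

score = fitness(toy_analyser, gold_models)
print(round(score, 3))
```

   A score of this kind penalises both missing and spurious events, which matters here because, as argued in Section 2,
high precision alone can hide the loss of a crucial condition.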


References
[Abb83] Abbott R.J. Program design by informal English descriptions. Comm. ACM 26(11): 882-894, 1983.
[Bäu17] Bäumer F.S., Dollmann M., Geierhos M. Studying software descriptions in SourceForge and app stores for a
        better understanding of real-life requirements. WAMA2017: 19-25, 2017. Doi: 10.1145/3121264.3121269
[Boo94] Booch G. Object-oriented analysis and design with applications. Benjamin/Cummings, 1994.
[Bog95] Boguraev B., Garigliano R., Tait J. Editorial. Journal of Natural Language Engineering 1(1):1–7, 1995.
[Bur95] Burg J.F.M., Van de Riet R.P. COLOR-X: Object modeling profits from linguistics. KB&KS95: 204-214,
        1995.
[Bur97] Burg J.F.M. Linguistic instruments in requirements engineering. IOS Press, 1997.
[Fer16] Ferrari A., Spoletini P., Gnesi S. Ambiguity and tacit knowledge in requirements elicitation interviews.
        Requirements Engineering 21(3):333-355, 2016.
[Gén13] Génova G., Fuentes J.M., Llorens J. et al. A framework to measure and improve the quality of textual
        requirements. Requirements Engineering 18(1): 25-41, 2013. Doi 10.1007/s00766-011-0134-z
[Gov09] Governatori G. (Ed.) Legal knowledge and information systems. IOS Press, 2009.
[Joh06] Johnson-Laird P. How we reason. Oxford University Press, USA, 2006.
[Kiy08] Kiyavitskaya N., Zeni N., Mich L., Berry D.M. Requirements for tools for ambiguity identification and
        measurement in natural language requirements specifications. Requirements Engineering 13(3):207-239, 2008.
[Kiy09] Kiyavitskaya N., Zeni N., Cordy J.R., Mich L., Mylopoulos J. Cerno: Lightweight tool support for semantic
        annotation of textual documents. Data & Knowledge Engineering 68(12):1470-1492, 2009.
[Kra17] Kramer O. Genetic Algorithm Essentials. Springer, 2017.
[Len90] Lenat D.B., Guha R.V., Pittman K., Pratt D., Shepherd M. Cyc: toward programs with common sense. Comm.
        ACM 33(8): 30-49, 1990. Doi: 10.1145/79173.79176
[Mic00] Mich L., Garigliano R. Ambiguity measures in requirements engineering. ICS2000, Beijing: Publishing House
        of Electronics Industry, pp. 39-48, 2000.
[Mic02] Mich L., Garigliano R. NL-OOPS: A requirements analysis tool based on natural language processing. Data
        Mining III, Southampton: WIT Press, pp. 321-330, 2002. Doi: 10.2495/DATA020321
[Mic04] Mich L., Franch M., Novi Inverardi P.L. Market research for requirements analysis using linguistic tools.
        Requirements Engineering 9(1):40-56, 2004. Doi: 10.1007/s00766-003-0179-8
[Mic94] Mich L., Garigliano R. A Linguistic approach to the development of object oriented systems using the natural
        language system LOLITA. ISOOMS 1994: 371-386, 1994.
[Mic96] Mich L. NL-OOPS: from natural language to object oriented requirements using the natural language
        processing system LOLITA. Natural Language Engineering 2(2):161-187, 1996.
[Rum91] Rumbaugh J., Blaha M., Premerlani W., Eddy F., Lorensen W.E. Object-oriented modeling and design.
        Englewood Cliffs, NJ, Prentice-hall, 1991.
[Rya93] Ryan K. The role of natural language in requirements engineering. ISRE: pp. 240–242, 1993.
[Tjo13] Tjong S.F., Berry D.M. The Design of SREE - A Prototype Potential Ambiguity Finder for Requirements
        Specifications and Lessons Learned. REFSQ2013: pp. 80-95, 2013.
[Zen07] Zeni N., Kiyavitskaya N., Mich L., Mylopoulos J., Cordy J.R. A lightweight approach to semantic annotation
        of research papers. NLDB2007: pp. 61-72, 2007. Doi: 10.1007/978-3-540-73351-5_6
[Zen15] Zeni N., Kiyavitskaya N., Mich L., Cordy J.R., Mylopoulos J. GaiusT: supporting the extraction of rights and
        obligations for regulatory compliance. Requirements Engineering 20(1):1-22, 2015.
[Zen17] Zeni N., Mich L., Mylopoulos J. Annotating legal documents with GaiusT 2.0. IJMSO 12(1):47-58, 2017.