FROM NATURAL LANGUAGE REQUIREMENTS TO A CONCEPTUAL MODEL

                           Christian Kop, Günther Fliedl, Heinrich C. Mayr

                                  Alpen-Adria Universität Klagenfurt
                               Applied Informatics/Application Engineering
                                Universitätsstrasse 65 – 67, 9020 Klagenfurt
                               (chris | guenther | heinrich)@ifit.uni-klu.ac.at

ABSTRACT

The literature describes in great detail how ER diagrams or UML class diagrams are derived from natural language sentences. It is normally assumed that there is a direct correspondence between natural language elements (e.g., words) and conceptual model elements. We do not strictly follow this assumption because of the complexity of natural language with its ambiguities and ellipses. Hence this paper proposes a stepwise generation of a conceptual model from natural language requirements sentences. Following the ideas of MDA, we assume that automatic transformation steps from the source model (in our case natural language) to the target conceptual model (e.g., a UML class diagram) make sense. In addition, we suggest that the designer should play an important part during the transformation. We furthermore propose an interlingua which helps to detect defects and provides traceability between sentences and model elements.

Index Terms – natural language processing, interlingua, conceptual modeling, defect detection

1. INTRODUCTION

In most cases requirements are presented on two levels: the level of end user needs and the level of the models of developers or requirements engineers. End user requirements are usually expressed in natural language; requirements handled by engineers are usually expressed through formal conceptual models. In many cases these diverging ways of representing knowledge are the main reason for misunderstandings between users and engineers concerning initial requirements. The discrepancy prevents the validation of requirements, which is an important step in the process of requirements engineering.

To handle such problems we proposed an intermediate level for requirements representation, an interlingua connecting the natural language level of the end user and the conceptual model level produced by engineers. The approach provides instruments for the representation of intermediate results and for the traceability between intermediate results and the original sentences. It supports automated mapping from natural language requirements to interlingua specifications and automated mapping from the interlingua representation to the conceptual models.

The linguistic processing step focuses on the transfer of written textual requirements to an interlingua, the so-called pre-design model. The “Klagenfurt Conceptual Pre-design Model (KCPM)” [6] provides a glossary and a graphical representation, and it is used as a basis for the mapping to the conceptual model (e.g., UML). We propose that the basic notions introduced in this interlingua should correspond to hypothetical basic linguistic categories like nouns, verbs, etc. Thus, the goal of the whole process, which is called NIBA (“Natürlichsprachliche Informationsbedarfsanalyse”, natural language information requirements analysis), is to automate the production of pre-design models by extracting their entries from the end user’s natural language requirements statements.

To enhance the mapping process, a specific framework for annotating natural language descriptions on different layers was developed.

The paper is structured as follows. The next section describes related work. The linguistic processing step is introduced in Section 3. Section 4 explains the interpretation step. Section 5 focuses on the interlingua and its possibilities. Section 6 gives an overview of the mapping to the conceptual model. The paper is summarized in Section 7.

2. RELATED WORK

The interpretation of natural language has a long tradition. Earlier approaches proposed heuristics; some of them are described in [3], [1], [8], [7]. Chen presented 11 rules to generate conceptual model elements (entity types and relationship types) from structured sentences. Excerpts of these rules are given in the following listing [3].
• (Rule 1) A common noun in English corresponds to an entity type.
• (Rule 2) A transitive verb in English corresponds to a relationship type in an ER diagram.
• (Rule 3) An adjective in English corresponds to an attribute of an entity in an ER diagram.
• (Rule 4) An adverb in English corresponds to an attribute of a relationship in an ER diagram.
• (Rule 5) If the sentence has the form “There are … X in Y”, then we can convert it into the equivalent form “Y has … X”.
• (Rule 7) If the sentence has the form “The X of Y is Z” and if Z is not a proper noun, we may treat X as an attribute of Y.

Abbott [1] used heuristics for the generation of program specifications. Parsing techniques were introduced in [2] and [11]. NL-OOPS [14] uses the LOLITA [15] natural language processing toolkit with an internal knowledge base to generate first-cut conceptual models. Meanwhile, tagging and chunking have become the state of the art for the linguistic step. In [13] an approach is described which uses part-of-speech tagging and morphological analysis for the generation of conceptual model element candidates. Additionally, an ontology (world model) was used to refine the candidates for the project-specific conceptual model (discourse model).

3. LINGUISTIC PROCESSING

The system handles the natural language processing of English requirements texts by producing chunked and semantically annotated text, which is prepared for the extraction of KCPM modeling notions in the interpretation stage of the project. In a first stage it accepts the tagged sentences which are produced by QTag [16]. This output is refined and certain structures are chunked together. Figure 1 in the appendix shows such a chunk tree representing the syntactic structure, including phrasal, feature-inheriting nodes.

This chunking output is processed by a modular system of linguistic subsystems including the following functions:
• The identification of compound nouns. We suppose that unclear compound boundaries are very often motivated by the ambiguity of complex terms, e.g., the implicit structure of compounds or other groups of words.
• The extraction and generation of inflectional word forms.
• The extraction of derivational morphological information.
• The identification of multi-word units and idiomatic expressions. This is made possible by dynamically extending linguistic knowledge inside the lexicon component.
• Verb subclass identification. The filtered verb classes are based on the NTMS system (“Natürlichkeitstheoretische Morphosyntax”) [4] included in the NIBA framework.

4. INTERPRETATION

4.1 General guidelines for interpretation
From the different approaches mentioned in the related work section, the following can be learned for the interpretation of natural language sentences:
• Common (individual) nouns are candidates for classes and attributes.
• An adjective and a noun together are candidates for specialized classes.
• Proper nouns are candidates for instance labels.
• A transitive verb is a candidate for a relationship type.
• The nouns related to the verbs are the involved classes of the relationship type.
• Prepositions can also be candidates for relationship types.

In other words, given a source language (e.g., natural language) and a “meta model” (i.e., the grammar description of the sentence) as well as a target language (e.g., a conceptual model and its meta model), certain instances of the source language can be mapped to instances of the target language. This is achieved by defining equivalences between syntactic structures of the source model and syntactic structures of the target model.

These general rules must be adapted to the specific situation (i.e., the annotated natural language). In our case the NTMS was used for annotating the natural language sentences with syntactic grammar information. Since the NTMS defines N0 as a noun and N3 as a noun phrase, a class can be derived from a noun (N0) or a noun phrase (N3), respectively. If we find a verb (V0) together with two noun phrases, then a relationship can be derived from such a pattern. Figure 1 in the appendix shows such an example.

Although these and other heuristics are commonly used, they cannot fully support the interpretation. The next section explains some difficulties of interpretation.
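
The N0/N3/V0 pattern described in Section 4.1 can be illustrated with a minimal Python sketch. The category labels N0, N3 and V0 are taken from the text; the flat list of chunks, the `Chunk` type and the function `derive_candidates` are assumptions made for illustration only and do not reproduce the NIBA implementation, which works on chunk trees.

```python
from dataclasses import dataclass

NOUN_CATEGORIES = ("N0", "N3")  # noun and noun phrase, as named in the text


@dataclass(frozen=True)
class Chunk:
    """Hypothetical flat representation of one annotated chunk."""
    category: str  # e.g. "N0", "N3" or "V0"
    text: str


def derive_candidates(chunks):
    """Return (class candidates, relationship candidates) for one sentence."""
    # Every noun or noun phrase is a class candidate (duplicates removed).
    classes = []
    for chunk in chunks:
        if chunk.category in NOUN_CATEGORIES and chunk.text not in classes:
            classes.append(chunk.text)

    # A verb with a noun (phrase) on each side is a relationship candidate.
    relationships = []
    for i, chunk in enumerate(chunks):
        if chunk.category != "V0":
            continue
        left = next((c.text for c in reversed(chunks[:i])
                     if c.category in NOUN_CATEGORIES), None)
        right = next((c.text for c in chunks[i + 1:]
                      if c.category in NOUN_CATEGORIES), None)
        if left is not None and right is not None:
            relationships.append((left, chunk.text, right))
    return classes, relationships


# Hypothetical chunked sentence: "Customer buys product"
sentence = [Chunk("N3", "customer"), Chunk("V0", "buys"), Chunk("N3", "product")]
classes, relationships = derive_candidates(sentence)
```

The results are only candidates: as the following section argues, the same syntactic pattern can be interpreted in different ways, so the designer still has to confirm or reject them.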
4.2 Problems of Interpretation
The problems of interpretation arise because the same syntactic structure of a phrase can be interpreted differently. A typical example is that the combination of an adjective and a noun can be seen as a specialization of that noun. It is also possible that the adjective together with the noun is the needed concept. Another problem is that it is not always possible to distinguish between a class and an attribute just by analyzing one single sentence. In the literature [11] the subject-predicate-object structure with the predicate “has” (e.g., X has Y) is interpreted as follows: the subject X is a class and the object Y is an attribute. However, in [9] it was shown that the verb “has” is very ambiguous.

Since mainly syntactic structures are analyzed and mapped to elements of the conceptual model, there is no guarantee that all the extracted elements are relevant for the target model. There is no guarantee that a model assembled only from the extracted elements will be complete or consistent. Even worse, if an arbitrary text is taken for analysis and interpretation, there is no guarantee that the intention of the customer matches the intentions of the designer.

4.3 Solution
As one possible solution it is necessary to give the designer the freedom to select those extracted model elements which seem to be necessary for the target model. Furthermore, it is necessary to introduce an interlingua. This interlingua presents the result of the extraction process to the designer, who can maintain and refine the results. Hence the model presented in the interlingua does not represent the final result or final conceptual model. It represents an intermediate result that must be discussed, refined and improved. A tool was implemented with which the designer can select necessary model elements and manage the elements in the model of the interlingua. This also includes a tool feature for the mapping from the interlingua to the conceptual model.

5. INTERLINGUA

5.1 Overview
According to the underlying paradigm of how a stakeholder perceives the “world”, two types of conceptual modeling approaches can be distinguished:
• Entity type and object oriented approaches.
• Fact oriented approaches.
In the first paradigm the “world” is seen as a world of objects which have properties. Therefore a clear distinction is made between objects (or object types) and their properties. Representatives of this paradigm are the classical ER approach and UML. Fact oriented approaches, on the other hand, see the “world” as a world of facts. Facts describe objects and their roles within a relationship. No distinction is made between objects and their properties. Every concept is treated equally in a first step. Representatives of this kind of paradigm are NIAM [7] and its successor ORM [5]. Both approaches have pros and cons. Object oriented approaches look very compact. In a typical object oriented class diagram, attributes are embedded in the class representation. No additional connections between classes and attributes, which would expand the diagram, are necessary. On the other hand, many revisions must be made if such a diagram is used too early in the design phase. As more information is collected, classes might become attributes and attributes might become classes. According to [5] this is a reason why fact oriented approaches are better suited to be used as an interlingua.

Since the interlingua is placed before the conceptual model, during an early phase of design, the fact oriented paradigm was preferred. Nevertheless, an easy transformation from the interlingua to a conceptual model like UML must also be provided, since UML is currently the standard for conceptual modeling. Hence the interlingua for conceptual modeling of structural aspects of an information system consists of the following basic notions:
• Thing type: Any notion which is important in a certain universe of discourse is treated as a thing type. Since attributes are not defined, notions like person name, course id etc. are also seen as thing types.
• Connection type: Connection types relate thing types to each other. Special connection types like generalization or aggregation can be defined.

The interlingua also aims to support all kinds of stakeholders (designers and end users). Therefore a graphical and a glossary based representation are used for the collection of requirements (see Figure 3 in the appendix for the graphical representation; the glossary representation is hidden).

5.2 Defect detection support
Besides providing a communication platform between stakeholders, the interlingua can also support the detection of structural inconsistencies and incompleteness. The simplest defect can be detected if the designer takes a look at the cardinality definitions of the connection types. As can easily be seen, all of these cardinality descriptions show “?..?”. This means that cardinalities could not be extracted from the textual description.

Another possibility is to count the number of connection types of a thing type. This is described in detail in [12]. With this strategy, centered thing types
can be detected (see Figure 4 in the appendix). The more connection types a thing type has, the more centered or important it is. Such centered thing types appear with a larger rectangle and in another color (e.g., green) than other thing types which seem to be less important. However, this does not necessarily reflect the end user's intention. Therefore this strategy is used to confront the end user with the result and to discuss it with the end user. For instance, if the end user wonders why certain thing types like course and professor are not so important (they appear in white and their rectangles are smaller than the rectangles for assistant or employee), then this can be a hint for a defect in the original specification.

If a mapping preview is made, then orphan classes [10] can be detected. Figure 5 shows such a case for the university example. In this case thing types like university, faculty, department, assistant, employee, professor, budget, ut8 and ut3 were detected to be class candidates. All the thing types which appear in white are currently candidates for attributes. Once again this is not the final result but a starting point for communication, discussion and refinement. As can be seen in Figure 5, professor, budget, faculty and university do not have any related attributes. Hence the mapping preview also gives hints for defects.

5.3 Traceability
Sentences from which thing types and connection types can be extracted are also stored as “Sources” in the interlingua model. If a thing type was extracted from a sentence, then a relation between the thing type and the sentence exists. The same holds for connection types.

6. MAPPING TO THE CONCEPTUAL MODEL

In order to guarantee the mapping to a conceptual model, rules are applied. These rules can be classified into
• Laws vs. proposals.
• Direct vs. indirect rules.
Laws are much stricter than proposals. If a mapping rule is a law, then a mapping to a certain target concept (e.g., class) cannot be ignored; otherwise the syntax of the conceptual target model will be incorrect. Proposals, on the other hand, only give hints. The syntax of the target model will not be wrong if these hints are ignored.
An indirect rule not only uses the semantic relationship to decide about the mapping but also information about previous mappings. For example, if a concept X is already mapped to an attribute and a concept Y is related to that attribute X, then an indirect rule for Y detects a mapping possibility (Y will become a class).
This mapping approach also applies meta-rules to resolve conflicts between the rules. An example of a meta-rule is: “Laws overrule proposals”.

7. CONCLUSION AND FUTURE WORK

In this paper an overview of a mapping process from natural language descriptions to a conceptual model was given. It was also described that such a process is not straightforward. Instead, the designer must handle a number of problems. As one possible solution the interlingua (KCPM) was introduced. This model gives the designer an overview of the output of natural language processing and provides some help to improve it. The designer is able to revise it without generating the UML target model. Different presentation techniques (e.g., graphical view and glossary view) make it possible to communicate with the end user.

In the future, we plan to find more possibilities to detect defects. These defect detection strategies should then be applied to the notions which were extracted from English or from German requirements sentences.

8. REFERENCES

[1] R.J. Abbott, “Program Design by Informal English Descriptions,” Communications of the ACM, Vol. 26, No. 11, pp. 882 – 894, 1983.

[2] E. Buchholz, H. Cyriaks, A. Düsterhöft, H. Mehlan, B. Thalheim, “Applying a Natural Language Dialogue Tool for Designing Databases,” International Workshop on Applications of Natural Language to Databases (NLDB’95), pp. 119 – 133, 1995.

[3] P. Chen, “English Sentence Structure and Entity Relationship Diagrams,” International Journal of Information Sciences, Vol. 29, pp. 127 – 149, 1983.

[4] G. Fliedl, Natürlichkeitstheoretische Morphosyntax – Aspekte der Theorie und Implementierung, Gunter Narr Verlag Tübingen, 1999.

[5] T. Halpin, “UML Data Models from an ORM Perspective Part 1,” Journal of Conceptual Modeling, 1998.

[6] H.C. Mayr, Ch. Kop, “A User Centered Approach to Requirements Modeling,” Proceedings Modellierung, Lecture Notes in Informatics LNI P-12, GI-Edition, pp. 75 – 86, 2002.

[7] G.M. Nijssen, T.A. Halpin, Conceptual Schema and Relational Database Design – A fact oriented approach, Prentice Hall Publishing Company, 1989.

[8] M. Saeki, H. Horai, H. Enomoto, “Software Development from Natural Language Specification,” Proceedings of the 11th International Conference on Software Engineering, pp. 64 – 73, 1989.

[9] V.C. Storey, “Understanding Semantic Relationships,” VLDB Journal, Vol. 2, pp. 455 – 488, 1993.

[10] B. Tauzovich, “An Expert System for Conceptual Data Modeling,” Proceedings of the 8th International Conference on Entity Relationship Approach, North Holland Publ. Company, pp. 205 – 220, 1989.

[11] A.M. Tjoa, L. Berger, “Transformation of Requirement Specification Expressed in Natural Language into an EER Model,” Proceedings of the 12th International Conference on Entity Relationship Approach, Springer Verlag, New York, pp. 127 – 149, 1993.

[12] Ch. Kop, “Visualizing Conceptual Schemas with their Sources and Progress,” International Journal on Advances in Software, Vol. 2 and 3, pp. 245 – 258, 2009.

[13] H.M. Harmain, R. Gaizauskas, “CM-Builder: An Automated NL-based Case Tool,” 15th IEEE International Conference on Automated Software Engineering (ASE’00), pp. 45 – 54, 2000.

[14] L. Mich, J. Mylopoulos, N. Zeni, “Improving the Quality of Conceptual Models with NLP Tools: An Experiment,” Technical Report DIT-02-0047, Dept. of Information and Communication Technology, Univ. of Trento, 2002.

[15] R. Garigliano, R. Morgan, M. Smith, “The LOLITA System as a Contents Scanning Tool,” Proceedings of the 13th International Conference Artificial Intelligence, Expert Systems, and Natural Language Processing, 1993.

[16] D. Tufis, O. Mason, “Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger,” Proceedings of the First International Conference on Language Resources & Evaluation (LREC), Granada (Spain), pp. 589 – 596, 1998.


APPENDIX


                                   Fig. 1. Tagged sentence with chunk tree
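[Figure: class diagram (class Customer with attributes customer no, name, address; class Product with attributes product id, name, price; association "buys") next to an ORM diagram (Customer has … Name; Customer buys / is bought by Product; further facts elided as …)]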

                         Fig. 2. Class diagram versus ORM diagram


            Fig. 3. Graphical representation of the interlingua (university example)
Fig. 4. Visualization of centered thing types
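
The counting strategy behind Fig. 4 (Section 5.2) can be sketched in Python as follows. The connection types listed here are hypothetical and are not read from the figure, and the fixed `threshold` is an assumption; the text only states that a thing type is more centered the more connection types it has.

```python
from collections import Counter


def connection_counts(connection_types):
    """Count for every thing type the connection types it takes part in."""
    counts = Counter()
    for first, second in connection_types:
        counts[first] += 1
        counts[second] += 1
    return counts


def centered_thing_types(connection_types, threshold):
    """Thing types with at least `threshold` connection types (assumed criterion)."""
    counts = connection_counts(connection_types)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical binary connection types, loosely following the university example
connections = [
    ("employee", "department"),
    ("employee", "assistant"),
    ("employee", "salary"),
    ("assistant", "course"),
    ("professor", "course"),
]
centered = centered_thing_types(connections, threshold=3)
```

With this data only employee reaches the threshold. If the end user expected professor or course to be central, the low counts would be a hint that the specification says too little about them.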


         Fig. 5. Mapping preview
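
A mapping preview as in Fig. 5 can be approximated by the following Python sketch. It applies the meta-rule “Laws overrule proposals” from Section 6 and then lists class candidates without related attribute candidates, which is how orphan classes are read here. The `RuleResult` type, the function names and the example data are assumptions for illustration; the behaviour for two conflicting laws is not specified in the text and is left arbitrary (the later one wins).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleResult:
    """Hypothetical outcome of one mapping rule for one thing type."""
    thing_type: str
    target: str    # "class" or "attribute"
    is_law: bool   # True for a law, False for a proposal


def resolve(results):
    """Decide one target per thing type; laws overrule proposals."""
    decided = {}
    # Stable sort: proposals are applied first, laws last and overwrite them.
    for result in sorted(results, key=lambda r: r.is_law):
        decided[result.thing_type] = result.target
    return decided


def orphan_classes(mapping, connection_types):
    """Class candidates that are not connected to any attribute candidate."""
    with_attribute = set()
    for first, second in connection_types:
        if mapping.get(first) == "class" and mapping.get(second) == "attribute":
            with_attribute.add(first)
        if mapping.get(second) == "class" and mapping.get(first) == "attribute":
            with_attribute.add(second)
    return sorted(name for name, target in mapping.items()
                  if target == "class" and name not in with_attribute)


# Hypothetical rule outcomes and connection types
results = [
    RuleResult("department", "attribute", is_law=False),
    RuleResult("department", "class", is_law=True),
    RuleResult("department name", "attribute", is_law=False),
    RuleResult("professor", "class", is_law=False),
]
connections = [("department", "department name"), ("professor", "department")]

mapping = resolve(results)
orphans = orphan_classes(mapping, connections)
```

Here the law maps department to a class despite the conflicting proposal, and professor is reported as an orphan because it has no related attribute candidate. Such a result is a starting point for discussion with the end user, not a final verdict.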