FROM NATURAL LANGUAGE REQUIREMENTS TO A CONCEPTUAL MODEL

Christian Kop, Günther Fliedl, Heinrich C. Mayr

Alpen-Adria Universität Klagenfurt
Applied Informatics/Application Engineering
Universitätsstrasse 65-67, 9020 Klagenfurt
(chris | guenther | heinrich)@ifit.uni-klu.ac.at

ABSTRACT

The literature describes in great detail how conceptual diagrams such as ER diagrams or UML class diagrams are derived from natural language sentences. It is normally assumed that there is a direct correspondence between natural language elements (e.g., words) and conceptual model elements. We do not strictly follow this assumption because of the complexity of natural language with its ambiguities and ellipses. Hence this paper proposes a stepwise generation of a conceptual model from natural language requirements sentences. Following the ideas of MDA, we assume that automatic transformation steps from the source model (in our case natural language) to the target conceptual model (e.g., a UML class diagram) make sense. In addition, we suggest that the designer should play an important part during the transformation. We furthermore propose to introduce an interlingua which helps to detect defects and provides traceability between sentences and model elements.

Index Terms – natural language processing, interlingua, conceptual modeling, defect detection

1. INTRODUCTION

In most cases requirements are presented on two levels: the level of end user needs and the level of the models of developers or requirements engineers. End user requirements are usually expressed in natural language; requirements handled by engineers are usually expressed through formal, conceptual models. In many cases this diverging way of representing knowledge is the main reason for misunderstandings between users and engineers concerning initial requirements. The discrepancy makes it impossible to validate requirements, which is an important step in the process of requirements engineering.

To handle such problems we proposed an intermediate level for requirements representation, an interlingua connecting the natural language level of the end user and the conceptual model level produced by engineers.

The approach provides instruments for the representation of intermediate results and for the traceability between intermediate results and the original sentences. It supports automated mapping from natural language requirements to interlingua specifications and automated mapping from the interlingua representation to the conceptual models.

The linguistic processing step focuses on the transfer of written textual requirements to an interlingua, the so-called Pre-design Model. The "Klagenfurt Conceptual Pre-design Model (KCPM)" [6] provides a glossary and a graphical representation, and it is used as a basis for the mapping to the conceptual model (e.g., UML). We propose that the basic notions introduced in this interlingua should correspond to hypothetical basic linguistic categories like nouns, verbs, etc. Thus, the goal of the whole process, which is called NIBA ("Natürlichsprachliche Informationsbedarfsanalyse", roughly "natural language information requirements analysis"), is to automate the production of pre-design models by extracting their entries from the end user's natural language requirements statements.

To enhance the mapping process, a specific framework for annotating natural language descriptions on different layers was developed.

The paper is structured as follows. The next section describes related work. The linguistic processing step is introduced in Section 3. Section 4 explains the interpretation step. Section 5 focuses on the interlingua and its possibilities. Section 6 gives an overview of the mapping to the conceptual model. The paper is summarized in Section 7.

2. RELATED WORK

The interpretation of natural language has a long tradition. Earlier approaches proposed heuristics; some of them are described in [3] [1] [8] [7]. Chen presented 11 rules to generate conceptual model elements (entity types and relationship types) from structured sentences. Excerpts of these rules are listed below [3].

• (Rule 1) A common noun in English corresponds to an entity type.
• (Rule 2) A transitive verb in English corresponds to a relationship type in an ER diagram.
• (Rule 3) An adjective in English corresponds to an attribute of an entity in an ER diagram.
• (Rule 4) An adverb in English corresponds to an attribute of a relationship in an ER diagram.
• (Rule 5) If the sentence has the form "There are … X in Y" then we can convert it into the equivalent form "Y has … X".
• (Rule 7) If the sentence has the form "The X of Y is Z" and if Z is not a proper noun, we may treat X as an attribute of Y.

Abbott [1] used heuristics for the generation of program specifications. Parsing techniques were introduced in [2] and [11]. NL-OOPS [14] uses the LOLITA [15] natural language processing toolkit with an internal knowledge base to generate first-cut conceptual models. Meanwhile, tagging and chunking have become the state of the art for the linguistic step. In [13] an approach is described which uses part-of-speech tagging and morphological analysis for the generation of conceptual model element candidates. Additionally, an ontology (world model) was used to refine the candidates for the project-specific conceptual model (discourse model).

3. LINGUISTIC PROCESSING

The system solves the task of natural language processing of English requirements texts by producing chunked and semantically annotated text, which is prepared for the extraction of KCPM modeling notions in the interpretation stage of the project. In a first stage it accepts the tagged sentences produced by QTag [16]. This output is refined and certain structures are chunked together. Figure 1 in the appendix shows such a chunk tree representing the syntactic structure, including phrasal, feature-inheriting nodes.

This chunking output is processed by a modular system of linguistic subsystems that includes the following functions:

• The identification of compound nouns. We suppose that unclear compound boundaries are very often motivated by the ambiguity of complex terms, e.g., the implicit structure of compounds or other groups of words.
• The extraction and generation of inflectional word forms.
• The extraction of derivational morphological information.
• The identification of multi-word units and idiomatic expressions. This is made possible by dynamically extending linguistic knowledge inside the lexicon component.
• Verb subclass identification. The filtered verb classes are based on the NTMS system ("Natürlichkeitstheoretische Morphosyntax") [4] included in the NIBA framework.

4. INTERPRETATION

4.1 General guidelines for interpretation

From the approaches mentioned in the related work section, the following guidelines can be derived for the interpretation of natural language sentences:

• Common (individual) nouns are candidates for classes and attributes.
• An adjective and a noun together are candidates for specialized classes.
• Proper nouns are candidates for instance labels.
• A transitive verb is a candidate for a relationship type.
• The nouns related to the verbs are the involved classes of the relationship type.
• Prepositions can also be candidates for relationship types.

In other words, given a source language (e.g., natural language) and a "meta model" (i.e., the grammar description of the sentence) as well as a target language (e.g., a conceptual model and its meta model), certain instances of the source language can be mapped to instances of the target language. This is achieved by defining equivalences between syntactic structures of the source model and syntactic structures of the target model.

These general rules must be adapted to the specific situation (i.e., the annotated natural language). In our case the NTMS was used for annotating the natural language sentences with syntactic grammar information. Since the NTMS defines N0 as a noun and N3 as a noun phrase, a class can be derived from a noun (N0) or a noun phrase (N3), respectively. If we find a verb (V0) together with two noun phrases, then a relationship can be derived from such a pattern. Figure 1 in the appendix shows such an example.

Although these and other heuristics are commonly used, they are not sufficient to support the interpretation. The next section explains some difficulties of interpretation.

4.2 Problems of Interpretation

The problems of interpretation arise because the same syntactic structure of a phrase can be interpreted differently. A typical example is that the combination of an adjective and a noun can be seen as a specialization of that noun. It is also possible that the adjective together with the noun is the needed concept. Another problem is that it is not always possible to distinguish between a class and an attribute just by analyzing one single sentence. In the literature [11] the subject-predicate-object structure with the predicate "has" (e.g., X has Y) is interpreted as follows: the subject X is a class and the object Y is an attribute. However, [9] showed that the verb "has" is very ambiguous.

Since mainly syntactic structures are analyzed and mapped to elements of the conceptual model, there is no guarantee that all the extracted elements are relevant for the target model. Nor is there a guarantee that a model assembled only from the extracted elements will be complete or consistent. Even worse, if an arbitrary text is taken for analysis and interpretation, there is no guarantee that the intention of the customer matches the intentions of the designer.

4.3 Solution

As one possible solution, the designer must be given the freedom to select those extracted model elements which seem to be necessary for the target model. Furthermore, it is necessary to introduce an interlingua. This interlingua presents the result of the extraction process to the designer, who can maintain and refine the results. Hence the model presented in the interlingua does not represent the final result or final conceptual model. It represents an intermediate result that must be discussed, refined and improved. A tool was implemented with which the designer can select necessary model elements and manage the elements in the interlingua model. It also includes a feature for the mapping from the interlingua to the conceptual model.

5. INTERLINGUA

5.1 Overview

According to the underlying paradigm of how a stakeholder perceives the "world", two types of conceptual modeling approaches can be distinguished:

• Entity type and object oriented approaches.
• Fact oriented approaches.

In the first paradigm the "world" is seen as a world of objects which have properties. Therefore a clear distinction is made between objects (or object types) and their properties. Representatives of this paradigm are the classical ER approach and UML. Fact oriented approaches, on the other hand, see the "world" as a world of facts. Facts describe objects and their roles within a relationship. No distinction is made between objects and their properties. Every concept is treated equally in a first step. Representatives of this paradigm are NIAM [7] and its successor ORM [5]. Both approaches have pros and cons. Object oriented approaches look very compact. In a typical object oriented class diagram, attributes are embedded in the class representation. No additional connections between classes and attributes are necessary which would expand the diagram. On the other hand, many revisions must be made if such a diagram is used too early in the design phase. As information is collected, classes might become attributes and attributes might become classes. According to [5] this is a reason why fact oriented approaches are better suited to be used as an interlingua.

Since the interlingua is placed before the conceptual model, in an early phase of design, the fact oriented paradigm was preferred. Nevertheless, an easy transformation from the interlingua to a conceptual model like UML must also be provided, since UML is currently the standard for conceptual modeling. Hence the interlingua for conceptual modeling of structural aspects of an information system consists of the following basic notions:

• Thing type: Any notion which is important in a certain universe of discourse is treated as a thing type. Since attributes are not defined, notions like person name, course id etc. are also seen as thing types.
• Connection type: Connection types relate thing types to each other. Special connection types like generalization or aggregation can be defined.

The interlingua also aims to support all kinds of stakeholders (designers and end users). Therefore a graphical and a glossary based representation are used for the collection of requirements (see Figure 3 in the appendix for the graphical representation; the glossary representation is hidden).

5.2 Defect detection support

Besides providing a communication platform between stakeholders, the interlingua can also support the detection of structural inconsistencies and incompleteness. The simplest defects can be detected if the designer looks at the cardinality definitions of the connection types. As can easily be seen, all of these cardinality descriptions read "?..?". This means that cardinalities could not be extracted from the textual description.

Another possibility is to count the number of connection types of a thing type. This is described in detail in [12]. With this strategy, centered thing types can be detected (see Figure 4 in the appendix). The more connection types a thing type has, the more centered or important it is. Such centered thing types appear with a bigger rectangle and in another color (e.g., green) than other thing types which seem to be less important. However, this does not necessarily reflect the end user's intention. Therefore this strategy is used to confront the end user with the result and to discuss it with them. For instance, if the end user wonders why certain thing types like course and professor are not so important (they appear in white and their rectangles are smaller than those of assistant or employee), then this can hint at a defect in the original specification.

If a mapping preview is made, then orphan classes [10] can be detected. Figure 5 shows such a case for the university example. In this case thing types like university, faculty, department, assistant, employee, professor, budget, ut8 and ut3 were detected to be class candidates. All the thing types which appear in white are currently candidates for attributes. Once again, this is not the final result but a starting point for communication, discussion and refinement. As can be seen in Figure 5, professor, budget, faculty and university do not have any related attributes. Hence the mapping preview also gives hints for defects.

5.3 Traceability

Sentences from which thing types and connection types can be extracted are also stored as "Sources" in the interlingua model. If a thing type was extracted from a sentence, then a relation between the thing type and the sentence exists. The same holds for connection types.

6. MAPPING TO THE CONCEPTUAL MODEL

In order to guarantee the mapping to a conceptual model, rules are applied. These rules can be classified into

• Laws vs. proposals.
• Direct vs. indirect rules.

Laws are much stricter than proposals. If a mapping rule is a law, then a mapping to a certain target concept (e.g., class) cannot be ignored; otherwise the syntax of the conceptual target model will be incorrect. Proposals, on the other hand, only give hints. The syntax of the target model will not be wrong if these hints are ignored.

An indirect rule uses not only the semantic relationship to decide about the mapping but also information about previous mappings. For example, if a concept X is already mapped to an attribute and a concept Y is related to that attribute X, then an indirect rule for Y detects a mapping possibility (Y will become a class).

This mapping approach also applies meta-rules to resolve conflicting situations between the rules. An example of a meta-rule is: "Laws overrule proposals".

7. CONCLUSION AND FUTURE WORK

This paper gave an overview of a mapping process from natural language descriptions to a conceptual model. It also described that such a process is not straightforward. Instead, the designer must handle problems. As one possible solution the interlingua (KCPM) was introduced. This model gives the designer an overview of the output of natural language processing and provides some help to improve it. The designer is able to revise it without generating the UML target model. Different presentation techniques (e.g., graphical view and glossary view) make it possible to communicate with the end user.

In future, it is planned to find more possibilities to detect defects. These defect detection strategies should then be applied to the notions which were extracted from English or from German requirements sentences.

8. REFERENCES

[1] R.J. Abbott, "Program Design by Informal English Descriptions," Communications of the ACM, Vol. 26, No. 11, pp. 882-894, 1983.

[2] E. Buchholz, H. Cyriaks, A. Düsterhöft, H. Mehlan, B. Thalheim, "Applying a Natural Language Dialogue Tool for Designing Databases," International Workshop on Applications of Natural Language to Databases (NLDB'95), pp. 119-133, 1995.

[3] P. Chen, "English Sentence Structure and Entity Relationship Diagrams," International Journal of Information Sciences, Vol. 29, pp. 127-149, 1983.

[4] G. Fliedl, Natürlichkeitstheoretische Morphosyntax – Aspekte der Theorie und Implementierung, Gunter Narr Verlag Tübingen, 1999.

[5] T. Halpin, "UML Data Models from an ORM Perspective Part 1," Journal of Conceptual Modeling, 1998.

[6] H.C. Mayr, Ch. Kop, "A User Centered Approach to Requirements Modeling," Proceedings Modellierung, Lecture Notes in Informatics LNI P-12, GI-Edition, pp. 75-86, 2002.

[7] G.M. Nijssen, T.A. Halpin, Conceptual Schema and Relational Database Design – A fact oriented approach, Prentice Hall Publishing Company, 1989.

[8] M. Saeki, H. Horai, H. Enomoto, "Software Development from Natural Language Specification," Proceedings of the 11th International Conference on Software Engineering, pp. 64-73, 1989.

[9] V.C. Storey, "Understanding Semantic Relationships," VLDB Journal, Vol. 2, pp. 455-488, 1993.

[10] B. Tauzovich, "An Expert System for Conceptual Data Modeling," Proceedings of the 8th International Conference on Entity Relationship Approach, North Holland Publ. Company, pp. 205-220, 1989.

[11] A.M. Tjoa, L. Berger, "Transformation of Requirement Specification Expressed in Natural Language into an EER Model," Proceedings of the 12th International Conference on Entity Relationship Approach, Springer Verlag, New York, pp. 127-149, 1993.

[12] Ch. Kop, "Visualizing Conceptual Schemas with their Sources and Progress," International Journal on Advances in Software, Vol. 2 and 3, pp. 245-258, 2009.

[13] H.M. Harmain, R. Gaizauskas, "CM-Builder: An Automated NL-based Case Tool," 15th IEEE International Conference on Automated Software Engineering (ASE'00), pp. 45-54, 2000.

[14] L. Mich, J. Mylopoulos, N. Zeni, "Improving the Quality of Conceptual Models with NLP Tools: An Experiment," Technical Report DIT-02-0047, Dept. of Information and Communication Technology, Univ. of Trento, 2002.

[15] R. Garigliano, R. Morgan, M. Smith, "The LOLITA System as a Contents Scanning Tool," Proceedings of the 13th International Conference Artificial Intelligence, Expert Systems, and Natural Language Processing, 1993.

[16] D. Tufis, O. Mason, "Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger," Proceedings of the First International Conference on Language Resources & Evaluation (LREC), Granada (Spain), pp. 589-596, 1998.

APPENDIX

Fig. 1. Tagged sentence with chunk tree

Fig. 2. Class diagram versus ORM diagram
[Figure content: a class diagram with the classes Customer (customer no, name, address, …) and Product (product id, name, price, …) related by "buys", shown next to an ORM diagram in which the same notions are connected through roles such as "has", "buys" and "is bought by".]

Fig. 3. Graphical representation of the interlingua (university example)

Fig. 4. Visualization of centered thing types

Fig. 5. Mapping preview
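
The interpretation guidelines of Section 4.1 can be illustrated with a minimal sketch. It is not the NIBA implementation: the tag names, the `Candidates` container and the `interpret` function are hypothetical, and the input is assumed to be a flat list of (word, tag) pairs rather than an NTMS chunk tree. The sketch covers only four of the guidelines (common nouns, adjective plus noun, proper nouns, and a verb between two nouns).

```python
from dataclasses import dataclass, field

# Hypothetical, simplified tag set. Only the ideas (common noun, proper noun,
# adjective, transitive verb) come from the guidelines; the tag names do not.
NOUN, PROPER_NOUN, ADJECTIVE, VERB = "NOUN", "PROPN", "ADJ", "VERB"


@dataclass
class Candidates:
    classes_or_attributes: list = field(default_factory=list)
    specialized_classes: list = field(default_factory=list)
    instance_labels: list = field(default_factory=list)
    relationships: list = field(default_factory=list)  # (noun, verb, noun)


def interpret(tagged_sentence):
    """Collect model element candidates from one tagged sentence.

    tagged_sentence is a list of (word, tag) pairs. Every result is only a
    candidate that a designer still has to confirm or reject.
    """
    result = Candidates()
    nouns_seen = []          # common nouns in sentence order
    pending = None           # (left noun, verb) waiting for the right-hand noun
    previous = (None, None)  # previous (word, tag) pair

    for word, tag in tagged_sentence:
        if tag == NOUN:
            # Common nouns are candidates for classes and attributes.
            result.classes_or_attributes.append(word)
            # Adjective directly followed by a noun: specialized class candidate.
            if previous[1] == ADJECTIVE:
                result.specialized_classes.append(f"{previous[0]} {word}")
            # A verb between two nouns: relationship type candidate.
            if pending is not None:
                result.relationships.append((pending[0], pending[1], word))
                pending = None
            nouns_seen.append(word)
        elif tag == PROPER_NOUN:
            # Proper nouns are candidates for instance labels.
            result.instance_labels.append(word)
        elif tag == VERB and nouns_seen:
            pending = (nouns_seen[-1], word)
        previous = (word, tag)
    return result


sentence = [("professor", NOUN), ("supervises", VERB),
            ("external", ADJECTIVE), ("student", NOUN)]
found = interpret(sentence)
print(found.relationships)        # [('professor', 'supervises', 'student')]
print(found.specialized_classes)  # ['external student']
```

Note that "external student" is reported as a specialized class candidate although, as Section 4.2 points out, the adjective and noun together might just as well be the needed concept itself. This is why the results are kept as candidates for the designer rather than mapped directly.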
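
The two simplest defect checks of Section 5.2 (cardinalities that could not be extracted, and counting the connection types of a thing type) can be sketched as follows. This is an illustration under assumptions, not the tool described in the paper or in [12]: the `ConnectionType` structure, the function names, the threshold for "centered" thing types, the cardinality notation other than "?..?", and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

UNKNOWN = "?..?"  # marker shown for cardinalities that could not be extracted


@dataclass(frozen=True)
class ConnectionType:
    name: str
    source: str                       # thing type on one end
    target: str                       # thing type on the other end
    source_cardinality: str = UNKNOWN
    target_cardinality: str = UNKNOWN


def missing_cardinalities(connections):
    """Names of connection types with at least one unknown cardinality."""
    return [c.name for c in connections
            if UNKNOWN in (c.source_cardinality, c.target_cardinality)]


def connection_counts(connections):
    """Count how many connection type ends each thing type takes part in."""
    counts = Counter()
    for c in connections:
        counts[c.source] += 1
        counts[c.target] += 1
    return counts


def centered_thing_types(connections, threshold):
    """Thing types whose connection count reaches the assumed threshold."""
    counts = connection_counts(connections)
    return sorted(t for t, n in counts.items() if n >= threshold)


# Invented sample data, loosely in the spirit of the university example.
model = [
    ConnectionType("works for", "employee", "department"),
    ConnectionType("has name", "employee", "person name", "1..1", "0..*"),
    ConnectionType("has salary", "employee", "salary"),
    ConnectionType("belongs to", "department", "faculty"),
]
print(missing_cardinalities(model))              # ['works for', 'has salary', 'belongs to']
print(centered_thing_types(model, threshold=3))  # ['employee']
```

A fixed threshold is only one way to turn the counts into a notion of "centered"; a visualization could instead scale the rectangle size with the count. Either way, the result is meant as a basis for discussion with the end user, not as a verdict.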