Evaluation and Improvement of Database Schemas: A transformation-based framework Jonathan Lemaitre Laboratory of Database Application Engineering - PReCISE research Center Faculty of Computer Science, University of Namur Rue Grandgagnage 21 - B-5000 Namur, Belgium jle@info.fundp.ac.be http://www.fundp.ac.be/precise Abstract. Data schemas are primary artefacts for the development and maintenance of data intensive software systems. As for the application code, one way to improve the quality of the models is to ensure that they comply with best design practices. In this paper, we present a research on data schemas quality, in which the quality evaluation process is based on the identification of specific schema constructs and their comparison with best practices. We provide an overview of a framework based on the use of semantics-preserving transformations as a way to compare and suggest improvements for the most significant best design practices. The paper also summarize the contribution of the approach and of the current achievements. Keywords: Database schema, schema evaluation, schema improvement, schema transformation. 1 Introduction Quality has become one of the main topics in software engineering. Research and industrial communities acknowledge that behind maintainability, efficiency, etc., lies the satisfaction of the users and the financial impacts. The question has been studied through many works during the last three decades. In the 90’s authors have already assessed the impact of poor quality and errors made during the modeling phase [1]. During the last few years quality of schemas became more and more important owing to the MDE approach that relies mainly on modeling and schema transformations. Looking at data schema, one can observe different approaches to deal with quality. A first approach regroups the quality model and framework. Quality models are mainly composed of definitions of quality characteristics [2, 3] while in addition frameworks define high level views of the quality and methodologies [4– 6]. Using quality models and frameworks, some authors propose metrics [7, 8]. Metrics give a numerical evaluation of (a specific aspect of) the quality. The metric approach deals with the evaluation of quality, most of the time, in a affordable way, but it does not allow one to precisely identify the source of a poor quality score. Finally, the last approach regroups proposals, that study very specific problems such as, for example, the normalization [9] and the impact of particular construct [10, 11]. These proposals are generally based on validated and intuitive concepts. They also offer direct means to improve the quality of models. Combined with general modeling conventions, they can define a set of commonly agreed best practices that are to ensure specific requirements such as expressiveness, maintainability, evolutivity, performance, etc. Both the metric approach and the study of specific problems have their ad- vantages and limitations. In this paper, we will describe an ongoing research in which defects and best practices are considered as a possible basis for the schema quality evaluation and improvement. Our approach relies on the identification of schema structures that have been defined to express specific types of facts of the application domain. This analysis is used for evaluating the schema by comparing its structural content to a reference frame and the requirements of the schema context. The improvement activity will modify the schema structure, using semantics-preserving transformations, in order to increase its compliance with the context while preserving its semantics. Our work address the following questions: – What can be the contribution of semantics-preserving transformations to the quality improvement of data models? – What kind of problems/defects could be addressed this way? – How to provide quality evaluation methods in order to assess the impact of transformations? In section 2, we summarize the general concepts used in our research. Those are about schema abstraction level and paradigm and schema transformation. In section 3, we present our motivations and illustrate the type of quality problems we are studying. Section 4 presents the basis of a framework for the evaluation and improvement of schema quality. In the section 5, we compare our approach to related existing proposals. Finally, the section 6 describes the results achieved so far and gives directions about future works. 2 Background In this section we briefly describe the main basic concepts used in our work. They are the abstraction levels and paradigms and the transformational ap- proach. The interested reader is referred to reference [12] for a more detailed description of these. A schema is a formal description of the information/data structures of a database, be it in construction or in use. It is positioned at a certain level of abstraction. Database engineering processes generally rely on a hierarchy of 3 abstraction levels, namely the conceptual, logical and physical levels. Such multi-level approach is currently called Model-Driven Engineering. A schema is also expressed in a specification language, based on a definite paradigm. Entity- relationship (ER) with its many variants, UML class diagrams, relational, object- relational, XML, IMS, standard files and even schema-less, are some of them. The database community calls them models (e.g., the relational model ), a term we will use in this paper. There is an agreement on which abstraction level a given paradigm best fits. For instance, the Entity-relationship model is as its best at the conceptual level while the object-relational model should be used at the logical level. Abstraction levels and paradigms define a two-dimension space in which an arbitrary schema can be located and evaluated. Any process that consists in deriving artefacts from other artefacts relies on such techniques as renaming, translating, restructuring, replacing, refining and abstracting, which basically are transformations. Most database engineer- ing processes can be formalized as chains of elementary models and data trans- formations that preserve some of their aspects. This approach is known as the transformational approach. In this paper, we will address the multiplicity of representations of a given concept by the use of specific transformations called semantics-preserving. Considering a transformation as a function, say g(c), de- fined on a set of constructs C, g is semantics-preserving or reversible iff there exists an inverse g’, defined on C’ such that, for each valid instance c of C, g(c) is a valid instance of C’ and c = g’(g(c)). 3 Motivations The quality of a schema can be seen through various aspects. One can deal with the visual quality [13], that relates to the graphical representation of models and properties of the schema objects (e.g. object position). Quality also encom- pass the syntactical correctness. The semantic quality refers to the correctness and completeness of the schema in regard to the application domain. Quality ad- dress problems such as the unsatisfiable constructs. These are syntactically valid constructs that cannot be instantiated because of the constraints it contains [14, 15]. One can also talk about quality in term of representation choice. In the pre- vious section, we introduced the notion of abstraction level and paradigm. They defined a set of requirements on the schema, implying that some constructs may be incongruous for representing some elements of the application domain. Those constructs cannot be called errors, as they are syntactically and semantically correct. Instead, we use the term defect. The figure 1 provides 2 examples of defects. The schema (a) contains an is- a relationship with one empty subtype. The use of the is-a relationship is not necessary, not to say unwise, if the subtype FORMER-CUSTOMER is not intended to evolve. Indeed, as FORMER-CUSTOMER has no role, nor attribute, it is said to be weakly specific and could be replaced by a simple boolean attribute of CUSTOMER. By doing such a transformation, we simplify a non-minimal schema. In the schema (b), one can observe an unusual construct coming from another abstraction level and paradigm (considering (b) to be an ER-like schema). In- deed, the entity types, together with their relationship types, form a complex but valid expression of a single many-to-many relationship type. Such construct is very common in legacy IMS databases. Considering schema (b) at the con- ceptual level, all objects in the schema, considered separately, belong to the good paradigm but their combination forms a construct influenced by another abstraction level and paradigm. (a) (b) CUSTOMER ORDER DETAIL PRODUCT CustID OrderID Quantity ProductID Name Date Name Address 1-1 1-1 Price id: CustID Rd1 Rd2 0-N 0-N 1-1 1-1 Ro 1-1 ORD-DET PRO-DET 1-1 Rp FORMER-CUSTOMER Fig. 1. Context-dependent defects The transformational approach allow us to deal with the quality through the identification of defects, considering their possible existing alternatives and to improve schemas in order to make them compliant with best practices. In the remaining of this section, we present defects identified for conceptual ER-like schema considering their understandability, which is one of the main re- quirement for conceptual schemas. The understandability refers to the efficiency with which a construct can express a type of fact of the application domain. For example, in order to represent a category of concepts A and its subcategories A1 and A2 , the best solution will be to use an is-a relationship. It is the most expressive construct for such type of facts. However, alternatives exists and ap- pear in schemas. A common alternative is the materialization of the is-a using one-to-one relationship types. The identified defects have been regrouped into 5 categories. A full descrip- tion can be found in [16]. Those categories are the non minimal, the insufficiently expressive, the abnormal, the irregular and the redundant constructs. For most categories, we will list the identified defects, but we can only detail some of them due to space restriction. The non minimal constructs refer to the schema simplicity, which means that a type of fact should be represented as simple as possible. Attribute entity types are entity types that represent by value or instance attributes coming from another entity type. In order to simplify the schema, they can be represented in this other entity type. Empty and unique subtypes can be use to represent a simple property of the supertype, but should rather be expressed through a indicator in the supertype. Other non minimal defects are: N-ary relationship types with a [1-1] or [0-1] role; compound attributes with only one component; one-to-one relationship types with mandatory roles; weakly specified subtypes; split existence constraints. Insufficiently expressive constructs relate to the expressivity of a schema. The expressivity requires that a construct express clearly and naturally its na- ture. Those defects regroup: relationship entity types; complex attributes; ref- erence attributes; existence constraints containing roles and attributes; implicit is-a relationships. We talk about relationship entity type when a many-to-many relationship type is expressed with an entity type and 2 one-to-many relation- ships. Obviously, the many-to-many relationship is more expressive. A reference attribute is an implicit reference to another entity, that can be replaced by an explicit relationship. Abnormal and irregular constructs decrease the foreseeable nature of a schema. Indeed, the user should not be surprise by the use of particular con- structs. Abnormal constructs are: degenerated structures; entity types with no attribute, nor role; foreign constructs. A structure is degenerated if its composi- tion is unjustified, e.g. a coexistence constraint with only one component or an is-a with a total constraint but only one subtype. Such defect should be solve by removing the unjustified element. Foreign constructs show a coloration com- ing from another paradigm and/or abstraction level (e.g. IMS,CODASYL), and may arise from a migration process. Those constructs will be replaced with the corresponding construct of the conceptual level. Irregular constructs introduce a notion of uniformity between the constructs representing the same type of facts. In order to remove this type of defects, a choice of representation has to be done and applied to the whole schema. Finally, the last category of defects concerns the redundant constructs. A schema should express a fact only once. Redundancies should be identified and removed in order to make the schema more understandable. These defects can be removed by applying semantics-preserving transforma- tions. Using the transformational approach allows us to apply similar reasoning with other abstraction levels, paradigms and quality characteristics. For exam- ple, at the logical level, the operational performance could be more interesting. The evolutivity of the schema may also be preferred at the conceptual level. To summary, the use of a construct C of a schema should be evaluated through three questions: Does C naturally belongs to this paradigm? Does C feel comfortable (so to speak) at this abstraction level? Does it best translate the intention of the designer? 4 Framework proposal The goal we have chosen to reach in this research is to design a quality evalua- tion and improvement framework for data schemas. In particular, we expect (1) to augment global quality evaluation approaches with metrics based on seman- tically rich structural patterns and (2) to associate with each structural pattern correction transformations, in order to improve schema quality considering con- text requirements. Such transformations can be either suggested or automatically applied. Especially, we are evaluating the use of semantics-preserving transfor- mations. The framework is based on the principle that these transformations allow the production of alternative structures. Among the many possible struc- tures, some of them, though correct, may not be considered best practices, while others may meet all requirements imposed by the context and therefore be con- sidered as best practices. A more complete definition of the framework can be found in [12]. The concept of semantic equivalence class In order to formalize our view of alternative structures, we defined the concept of equivalence class (EC ). We consider K, the collection of all the constructs of the GER that are pertinent in some engineering processes and a set of trans- formations T . Let us also consider a construct C from K and all the equivalent constructs that can be derived through the reversible transformations of T . All these constructs, together with C, form an equivalence class called ec(C). Since only reversible transformations have been applied, ∀C 0 ∈ ec(C), ec(C 0 ) = ec(C). We now define the function sec : K → (K × 2K ). sec(C) associates to each construct in K its semantic equivalence class (sec), an equivalence class in which the specific element C has been tagged. C is the intention of this equivalence class. sec(C) provides all the constructs a designer can introduce in a schema to express the semantics (the application domain fact type) of C, hence the name semantic equivalence class or sec. Among a SEC, we consider a best structure, that is, the most suitable struc- ture for expressing the modeling intention. Such structure can generally be con- sidered as the best practice of the SEC in term of expressiveness. However, as discussed previously, depending on the context the use of this structure is not the best solution. Context and SEC In order to understand the use of the SEC, we need to define in more detail the concept of context. The context of a schema S is a set of requirements defined by the intended use of S. S has been designed for the abstraction level A, according to the paradigm P and to meet the design criterion D. We call (A,P,D) the context of S. Given a construct C that can appear in schema S, a scoring function is assigned to sec(C) for a given context. As the SEC are defined independently of any model, we propose the concept of projection in order to take into account the model used in a particular context. The projection of a SEC for a model M provides the subset of all constructs of the SEC that comply with M. Generation and representation of the structures The application of transformations for generating the SEC should be considered carefully. The equivalence class of a construct C can be obtained by recursively applying the transformations of T until no new construct can be produced. However, this naive approach can lead to a very large (and, depending on T , possibly infinite) set of constructs of which only a small subset would be of interest. Appropriate meta-rules are necessary to keep the process into reasonable limits. Considering the is-a pattern, one can adopt a regularity of treatment meta-rule according to which each sub-category of a given category must be expressed in the same way. For example, a construct obtained by applying the upward inheritance transformation to one sub-category and the materialization transformation to another one would be rejected. Another example: when an entity type EA results from the transformation of an attribute A, the attribute(s) of the latter cannot be further transformed through the same transformation (figure 2). E E E IdE IdE 1-1 IdE 1-1 id: IdE E_A A id: IdE E_A 1-1 EA 1-1 EA 1-1 ... id: IdE A EA_A 1-1 EAA A Fig. 2. Infinite transformation of an attribute. Another important aspect of our framework is the use of generic definitions, or patterns, for representing the studied constructs. While the use of concrete examples, such as in the figure 1, for illustrating the considered defects is natural, it is completely unrealistic to attempt to list all of them without introducing a certain level of genericity. 5 Related works The normalization process proposed by Codd [9] relies on the use of transforma- tions in order to eliminate problematic functional dependencies. Compared with our framework, it deals with a no redundancies quality criterion. Burton and Weber [11] and Gemino and Wand [10] have studied particular constructs and their influence on specific qualities. Even though they did not explicitly refer to reversible transformations, our work seek to address similar problems. An im- portant basis of our work is the proposal of Assenova and Johanesson [17]. They have considered the use of reversible transformations for enhancing the quality of conceptual data schemas. However, they associated quality scores directly to transformations, while we consider it to be related to the construct itself. Finally, Kurtev [18] used the concept of transformation space for dealing with schema quality. Such space represents a transformation by its initial and resulting struc- tures and allows to link it with quality indicators. However, studied objects are atomic, while we consider semantically richer constructs. 6 Achievements and future works So far, we have identified about 20 SEC, through a schema review process. All SEC represent some of the most common modeling intentions, that can be regrouped into 4 main categories: concepts, properties, relationships and con- straints. Those categories are generally related to specific types of objects, e.g. entity types and tables for the concepts and attributes and columns for prop- erties. The SEC regroups different constructs richer, in term of semantics, than simple objects. Besides the SEC, specific types of defects have also been iden- tified and classified (e.g. non-minimal constructs, unexpressive constructs). The relation between SEC and those specific defects is currently studied. Using SEC for enhancing the quality of schemas requires the ability to com- pare the quality between constructs. We have considered different approaches: the standard empirical studies, the use of metrics and the evaluation by experts. Among them, we choose to evaluate the quality of structures through the opin- ions of experts, which seemed to us to be a good compromise between the cost and the validity of the evaluation. Experts will be asked to assess each construct, independently of the schemas in which it appears. The next step of this research is the complete quality evaluation of the SEC and their constructs. We should gather experts in order to obtain their quality rating. Such rating will allow us to produce construct-based metrics and will be the basis of the improvement process. The improvement method has to be defined in detail and carefully considering limits of our approach. For example, constructs belonging to different SEC could have common objects in a schema. Consequently, conflicts may appear between possible improvements. An important goal of our research is to ensure the usability of the framework. This cannot be realized without a tool support for the identification of the con- structs in a schema and the application of an improvement process. However, such tool has to be semi-automatic, due to limits of the approach. As the defi- nitions of the SEC constructs is generic and represent structural properties, the same patterns may appear in different SEC. In such case, the identification of the modeling intention between the different possibilities has to be done manually by the analyst. It remains to check the validity of the framework. Here, we wish to rely on teachers and students. (Last year) students form a realistic sample of designers of various skills, ranging from desperately inapt to experienced and ingenious. On the other hand, teachers are expected to be expert in evaluating the qual- ity of medium size schemas. Therefore, comparing and aligning academic and automated evaluations allow the tuning of the evaluation framework. These val- idation and alignment processes are still under investigation. References 1. Standish Group: Chaos: A recipe for success. Technical report, Standish Group International (1999) 2. ISO/IEC: ISO 9126-1:2001, Software engineering - Product quality, Part 1: Quality model. ISO/IEC (2001) 3. Davis, A., Overmyer, S., Jordan, K., Caruso, J., Dandashi, F., Dinh, A., Kincaid, G., Ledeboer, G., Reynolds, P., Sitaram, P., Ta, A., Theofanos, M.: Identifying and measuring quality in a software requirements specification. In: Proceedings of the First International Software Metrics Symposium. (1993) 141–152 4. Moody, D.L., Shanks, G.G.: Improving the quality of data models: empirical vali- dation of a quality management framework. Inf. Syst. 28(6) (2003) 619–650 5. Maes, A., Poels, G.: Evaluating quality of conceptual models based on user per- ceptions. In: International Conference on Conceptual Modeling - ER 2006, Tucson, Arizona (November 2006) 54–67 6. Krogstie, J.: Integrating the understanding of quality in requirements specification and conceptual modeling. SIGSOFT Softw. Eng. Notes 23(1) (1998) 86–91 7. Manso, M.E., Genero, M., Piattini, M.: No-redundant metrics for uml class diagram structural complexity. In: CAiSE. Volume 2681 of Lecture Notes in Computer Science., Springer (2003) 127–142 8. Si-Said Cherfi, S., Akoka, J., Comyn-Wattiau, I.: Perceived vs. measured quality of conceptual schemas: An experimental comparison. In: ER (Tutorials, Posters, Panels & Industrial Contributions), Australian Computer Society (2007) 185–190 9. Codd, E.F.: Normalized data structure: A brief tutorial. In: SIGFIDET Workshop, ACM (1971) 1–17 10. Gemino, A., Wand, Y.: Complexity and clarity in conceptual modeling: comparison of mandatory and optional properties. Data Knowl. Eng. 55(3) (2005) 301–326 11. Burton-Jones, A., Weber, R.: Understanding relationships with attributes in entity- relationship diagrams. In: ICIS ’99: Proc. of the 20th international conference on Information Systems, Atlanta, GA, USA (1999) 214–228 12. Lemaitre, J., Hainaut, J.L.: Transformation-based framework for the evaluation and improvement of database schemas. In: Proc. of the 22st International Confer- ence on Advanced Information Systems (CAISE’10, to appear). (2010) 13. Moody, D.: What makes a good diagram? improving the cognitive effectiveness of diagrams in is development. In Knapp, Magyar, eds.: Intl Conf on Information Systems Development, Budapest, Hungary, Springer (August 31-2 2006) 14. Boufares, F., Bennaceur, H.: Consistency problems in er-schemas for database systems. Inf. Sci. 163(4) (2004) 263–274 15. Damm, F.M., Hansen, B.S., Bruun, H.: On type checking in vdm and related consistency issues. In: VDM ’91: Proceedings of the 4th International Sympo- sium of VDM Europe on Formal Software Development-Volume I, London, UK, Springer-Verlag (1991) 45–62 16. Hainaut, J.L.: Bases de donnees. Dunod (2009) 17. Assenova, P., Johannesson, P.: Improving quality in conceptual modelling by the use of schema transformations. In: ER ’96: Proc. of the 15th International Confer- ence on Conceptual Modeling, London, UK, Springer-Verlag (1996) 277–291 18. Kurtev, I.: Adaptability of model transformations. PhD thesis, University of Twente, Enschede (2005) Acknowledgments This doctoral research is being led under the supervision of Pr. Jean-Luc Hainaut, Laboratory of Database Application Engineering - PReCISE Research Center, Faculty of Computer Science, University of Namur, Belgium.