An Approach to Multi-Domain Data Model Development Based on the Model-Driven Architecture and Ontologies Denis A. Nikiforov, Igor G. Lisikh, Ruslan L. Sivakov Centre of Information Technology, Ekaterinburg, Russia Denis.Nikiforov,Igor.Lisyih,Ruslan.Sivakov@centre-it.com Abstract. To date, there are many diverse data representation technologies (EDIFACT, XML, JSON, CSV, relational model, NoSQL). Transition to new technologies or the integration of information systems based on different techno- logical stacks is a complex and expensive process. Platform-independent models take an important role in this process. The structure of such a model is described in this article. However, given the data model has been created at the junction of different domains, it may be not enough. In such case, a one more step of abstrac- tion and a movement to the computation-independent model is required. The au- thors propose to create it in an ontological form. Keywords: ontology, model-driven architecture, platform-independent model, computation-independent model, data model 1 Introduction When developers create complex, heterogeneous, distributed information systems, they face a number of questions: which technologies for data storage and transfer to choose, how to ensure data-model consistency across different participants of infor- mation exchange, and how to simplify future maintenance of the system under devel- opment. In order to assist the developers of such information systems, the OMG con- sortium has developed the model-driven architecture [1]. This architecture considers an information system as a set of models and the development process is transformation of some models into others. The architecture is not a technical specification. It describes only basic principles. Specifically, it describes platform-dependent, platform-independent, computation-in- dependent models without governing the structures thereof. In other words, developers choose existing or create new modeling languages (metamodels) for each specific in- formation system. The NIEM specification [2] is an example of such a metamodel. It describes a platform-independent metamodel and a platform-dependent metamodel as well as the rules for transforming instances of the former into instances of the latter. We have also developed a similar platform-independent metamodel as well as the rules for transforming its instances into platform-dependent models (an XML schema and an ER model). 106 As a rule, the platform-independent model solves a considerable number of problems related to the development and maintenance of an information system without the need to provide the third level of abstraction (in the computation-independent model). How- ever, if information exchange is sufficiently complex and covers multiple domains, then the computation-independent model is required. The purpose hereof is to describe the problems occurring in the data-modeling pro- cess that could indicate the need for the computation-independent model. We also show that such a model is actually an ontology. We hope that our experience will help the developers of complex, heterogeneous, distributed information systems to take correct architectural decisions. We will try to answer the question whether and why ontologies are needed for data modeling. 2 The Platform-Dependent Data Model Let us consider the following example. Some organization produces shrimps at its aquafarm and sells these to another organization. If the organizations reside in different countries, then this process additionally involves customs, sanitary and veterinary, transport and other control authorities. All the stages of this process are carried out together with the storage, transfer, and processing of data on the consignor, consignee, forwarder, transport vehicle, cargo, and other objects (Fig. 1). Fig. 1. The process of delivering the commodity and its accompanying data The participants process data on the same objects. However, they can use different technologies for data storage (RDBMS, NoSQL, Excel) and transfer (XML, JSON, CSV). It means that, in the general case, every participant can have its own platform- dependent data models. For example, an applicant can file a customs declaration in electronic form in the XML format, whereas a customs authority can store the same data in a relational DB (Fig. 2). In this case, they need two platform-dependent data models: an XML schema and an ER model, respectively. In order to enable the customs authority to extract the data from the XML message and save the data in its DB, another model mapping XML elements to DB fields is required, and such a model always exists. Even if operators manually input details into the DB, this model exists in their minds or as text instructions. In the general case, the 107 number of mapping models is proportional to the square of the number of exchanging information systems. Obviously, the integration of such heterogeneous information sys- tems with different data storage, transfer and mapping models is rather labor-consum- ing and error-prone. All these models will have to be changed in case a data structure changes. Fig. 2. The example of platform-dependent models and metamodels 3 The Platform-Independent Data Model Platform-dependent models described above have rather different forms but, on the whole, consider more or less the same data. The development and maintenance of all these models can be significantly simplified if we abstract away from data-representa- tion differences between different platforms and create a single platform-independent model. If one needs to modify a data structure, corrections will have to be made only in this model and then platform-dependent models and mapping models will be auto- matically generated on its basis. Such an approach is described in the model-driven architecture [1]. An example of the platform-independent model is shown in Fig. 3. 108 Fig. 3. The example of the platform-independent data model In our approach, a UML [3] profile is used to develop platform-independent models (Fig. 4). The profile is based on ISO/IEC 11179 [4] and contains the following stereo- types. Classifier is any object of a data model. Namespace is the container for logically-related objects with unique names. Namespace Subset is the subset of related objects of the Namespace. Platform is the set of the Primitive Types of a certain platform. Primitive Type is the primitive data type of a certain platform. Base Type is the data type used to abstract away from the platform. It is to be imple- mented by exactly one primitive data type for every platform. 109 Fig. 4. The profile of the platform-independent data metamodel Element is the data unit with the designated definition, identifier, representation, and permitted values. Simple Element is the data element without properties. 110 Complex Element is the data element with properties. Simple Type is the set of permitted values of the Simple Element. Complex Type is the set of the Element’s properties that are also Elements, in their turn. Component is the Element’s property. Attribute is the context characteristic of the data type. There are two types of inheritance: extension and restriction. Every data element and type is described as a separate independent entity. This al- lows generating maximally-normalized platform-dependent models: XML schemas of the “Garden of Eden” pattern and relational models with the relations in the 6th normal form. If necessary, one can generate less normalized models. In other words, such a platform-independent model contains enough information to generate platform-de- pendent models on its basis with any required characteristics. Further, the metamodel allows describing, using the OCL language [5], business rules that can be converted into SQL or XPath expressions. Model transformation is implemented in [6] using the QVTo language [7], which detailed description falls beyond the scope hereof. On the one hand, data elements in the described platform-independent model are abstracted away from specific data-modeling languages. They are not XML elements, not entities, not relations but some syntax-neutral data units. On the other hand, the data elements can be re-used in different data structures. This means that the same properties of real-world objects are described using the same data element in all particular data sets (documents, messages). One can conclude that, in fact, delinking of data elements from specific data-representation languages and from their contexts makes the de- scribed model conceptual (Fig. 5). The next section shows why this is not the case, and what difficulties it can cause. Fig. 5. A conceptual model 4 The Computation-Independent Data Model The platform-independent data metamodel described above can considerably sim- plify the development and maintenance of information systems. However, if we attempt to create a single model for different participants (Fig. 1), then the following problems will be encountered. Firstly, the participants can use different dictionaries and code lists for semantically identical data elements. For instance, a customs authority can code commodities based on the Harmonized Commodity Description and Coding System, whereas a sanitary 111 and veterinary authority can use different classification codes of the products under control. Kinds of submitted documents for which every participant has its own code list is another example. Such semantically identical data elements can be combined in two ways: either to unify code lists or to accompany codes with references to the used code lists. Secondly, the participants can structure details of the same objects in different ways. For example, customs authority needs to know a commodity code, price, intended use of goods, packaging kind, and mass. Transport control authority is not interested in price and intended use of goods. Sanitary and veterinary control authority checks date of production, best before date, and age and taxon information of shrimps (Fig. 6). All these appropriate authorities keep under control transportation of one and the same ob- ject, i.e. shrimps, but we had to create separate data types for each kind of authority. It increases the size of the model and complicates its maintenance. We can try to harmo- nize these types by one of the three methods: extension (general characteristics are put in a base data type), restriction (all possible characteristics are put to a base data type, and usage of these characteristics in derived data types is restricted), or composition (each characteristic is described as independent object wherein composite data types refer to these global characteristics only). Fig. 6. Different details of one and the same object (shrimps) Thirdly, the participants use different terminologies. For example, a customs author- ity mainly uses the term “commodity”, a transport authority uses the term “cargo”, and a sanitary and veterinary authority uses the term “product” to designate the shrimps (Fig. 1). In spite of this, the same object is controlled by all the authorities, which means that only one term should correspond to this object in a single data model. The question is: which of the above terms? Some sources use these terms as synonyms. Other sources state that “a commodity is a product of labor made for sale.” On the other hand, for example, a land lot can be a commodity without being a product of labor. In its turn, a cargo can or cannot be a commodity or a product, and so on. Although the first and the second problems can be solved within the mentioned plat- form-independent model, they imply that this model is not as universal as was described above. It can contain several different types and elements in order to represent details of the same object. This fact complicates the integration of information systems and the maintenance of the model. 112 The third problem explicitly indicates a fundamental shortcoming of the described model. Despite abstracting away from particular data-representation forms (relational model, XML schema, et al.), our model still depends on the usage context, on the do- main. This is connected with the fact that the model describes particular sets of details of real-world objects (particular documents, messages), which can be different for differ- ent participants, rather than the objects themselves. A single model can be developed only if the context, documents, messages are disregarded and real-world objects are modeled. The goal of the process described in Fig. 1 is not to transfer the documents but to change the statuses, properties, and relations of the real-world objects. Docu- ments are only a tool to reach the process’s goal. If we abstract away from the docu- ments, a single computation-independent model for all the participants can be obtained. Fig. 7. The example of harmonization of different object’ details by means of ontology Upon reaching the ontological level of modeling, we can conclude that a commodity, a cargo, and a product are not the entity of the object (which data we transfer in the process) but the temporal roles that this object can play. A commodity is the role of the object that has a seller and a buyer. A cargo is the role of the object that has a consignor, 113 a consignee, and a forwarder. A product is the role of the object that has a producer and a consumer. If the difficulties described in this section occur in the data-modeling pro- cess, this means that the capacity of the used modeling language is insufficient and an ontology is required. An example of such ontology is presented in Fig. 7.With our approach; the ontology can include concepts of three types: classes, roles and events. Class describes an intrin- sic essence of the object. Role describes temporary characteristics or relations of the object which appear only in a particular context [8, 9]. Context can be presented by a class (for example, an employee is a role of a man appearing in the context of organi- zation) or by an event. Some authors extract one more kind of a context, i.e. process. [10]. However, we view processes as compound events [11]. The computation-inde- pendent model, firstly, does not have duplicate elements and types, and, secondly, does not change significantly with appearance of new messages or by change of existing documents or messages. 5 Similar Studies As early as in the 1970-ies, the need for conceptual data modeling was understood, which resulted in the development of ER models and IDEF1X. It is interesting that the conceptual level of the ER model contains minimum information on which basis a more particular and detailed logical model is further created. In other words, the conceptual model is considered as the first approximation to a logical and then physical model. However, in IDEF1X, a conceptual model is needed to integrate several external and internal data models. Moreover, the conceptual model contains a single, integral, suffi- ciently detailed representation of data rather than the minimum information or the first approximation to other models. Our approach based on the model-driven architecture is very similar to ER and IDEF1X: it also has 3 levels but slightly different ones. The first level is a computation- independent model describing real-world objects, the relations and properties thereof. In fact, this is a conceptual model but more detailed than the one of the ER model (in addition to entities and relations, it also contains properties) and less detailed than the one of IDEF1X (it does not indicate data types). The second level is a platform-inde- pendent model describing particular structures of documents, messages that contain data of real-world objects. This model is something between external and conceptual models as approached by IDEF1X. The third level is a platform-dependent model that is similar to the logical ER model but can also be non-relational. Another difference between our approach and the ER model and IDEF1X is that transformations of plat- form-independent models into platform-dependent ones are completely formalized and automated. Finally, our platform-independent model describes not only a data schema but also business rules in the OCL language. NIEM is another analog of our approach [2]. Similarly, it is based on the model- driven architecture, its specification considers platform-independent and platform-de- pendent models, transformation of one model into the other is described using the QVTo language, and the metamodel is implemented as a UML profile. There are no 114 other similarities, however. Firstly, our approach does not have a profile for platform- dependent models. There is no need for it as such a model is fully automatically gener- ated from the platform-independent model. If a platform-dependent model with other characteristics needs to be generated, then the transformation itself should be changed. Secondly, the platform-independent model of NIEM contains several kinds of complex types (objects, roles, associations, et al.). In our approach, such a division is done on the level of the computation-independent model that NIEM lacks. In contrast to our approach, NIEM does not allow describing business rules using the OCL language and transforming these into XPath expressions. CCTS is another analog of our approach [12]. Similarly, this one is based on ISO/IEC 11179 [4]. Data is modeled on two levels: core components and business- information entities that, for the purposes of this discussion, correspond to the compu- tation-independent and the platform-independent models used in our approach. The dif- ference is that, in CCTS, basic core components, association core components (and the corresponding business-information entities) cannot be reused. For this reason, only partially-normalized platform-dependent models can be generated from such a plat- form-independent model (the “Garden of Eden” and the 6th normal form are impossi- ble). The second distinction is that new aggregate core components (and aggregate busi- ness information entities) cannot be defined in CCTS by extension or restriction; only qualification can be used that is not supported by any platform that the authors hereof are aware of. The third distinction is that, despite the fact that CCTS allows describing business rules in a data model, business rules are described using the XPath language and supported only in XML schemas but, for example, cannot be transformed into SQL expressions. In our platform-independent model, business rules are described using the OCL language and can potentially be transformed into expressions in the language of any platform. Finally, Semantic Web gains rather significant popularity recently [13]. RDF and OWL allow to describe real-world objects conceptually and computation-inde- pendently. Moreover, several working groups developed data integration standards (ISO 15926, ISO 21127), intended on active usage of these technologies. Some authors propose to use RDF as a universal language for data exchange. [14], [15]. Other authors allow usage of XML or relational models for data exchange, but they propose to trans- form XML schemas and ER models to ontology [16], [17]. The latter, in turn, can be used for data integration. Our approach goes in the opposite direction. Firstly, we consider that it is not always possible to be limited to usage of RDF and triplestore in data transfer and storage, sometimes XML schemas or relational models are required. Secondly, computation- independent model describing real-world objects must be primal. A data modeler has to develop platform-independent model describing documents and messages on basis of it. This platform-independent model can be transformed into platform-dependent models (XML schemas, ER models). We found no one article with description of trans- formation of OWL model to ER model or XML schema. Only backward transfor- mations are available. However, for us, this direct transformation is of the most interest. 115 6 Conclusion and Follow-up Studies Data modelers (including authors of this article) often face an issue of motivation for the use of one or another data modeling approach: ontologies, model-driven archi- tecture, IDEF1X, ER. In this article, we tried to correlate these approaches. We also tried to specify problems which can testify a necessity of use of model-driven architec- ture and ontologies. Multi-level data models including the conceptual model were developed as early as in the 1970-ies. At large, neither ontologies nor the model-driven architecture intro- duces anything fundamentally new into this field. These are only steps on the way to generalize, unify existing ideas. Ontologies play exactly the same role as the classical conceptual ER models or IDEF1X in the data-modeling process. In the first case, the ontology is the first approximation to a more particular, detailed, contextual data model that describes the sets of details of real-world objects represented by documents, mes- sages rather than the objects themselves. In the second case, the ontology is a suffi- ciently detailed data model common for several participants. If only data-exchange schemas (documents, messages) or particular data-storage schemas need to be designed, then XML schemas, relational models, or other logical models are sufficient, there is no need to use ontologies. If there is a need to unify all these particular data schemas, then we should abstract away from the sets of objects’ details and start modeling the real-world objects themselves, and for this purpose, on- tologies can be used. The possibilities of the languages intended for data modeling on a logical (rather than conceptual) level are rather limited when there is a need to unify data structures using different code lists, slightly different sets of details, and different terminologies. Such languages allow modeling only particular sets of details of objects rather than the real-world objects themselves. It should also be noted that the developers of data models based on the model-driven architecture, as a rule, restrict themselves to platform-independent and platform-de- pendent models thus omitting computation-independent models. The present paper at- tempts to show that platform-dependent models can be well dispensed with as these can be automatically generated from platform-independent models. However, if the devel- oped model is rather complex, covers multiple domains, the computation-independent model (describing real-world objects rather than particular sets of details thereof) can considerably simplify the development and maintenance of information systems. Fur- thermore, the computation-independent data model is an ontology. The metamodel presented in this article has no essential novelty. As already stated, similar metamodels were discussed in [2], [12]. As a rule, these metamodels have only one target architecture. Our approach has a benefit: from one platform-independent model we form not only XML schemas but ER models as well. Secondly, we describe not only data schema but also business rules, and we do it in platform-independent OCL, not in a language of target architecture. Thirdly, we predicate that in some situa- tions one platform-independent model is not enough, and a computation-independent model is required. Currently, a metamodel is used for data modeling in several domains: customs con- trol, transport control, technical regulation, sanitary and veterinary control, and others. 116 Our approach to data modeling can be used with other types of B2G or G2G interaction. Developed platform-independent model has around 2,000 elements and types. Around 100 exchange structures were developed on the basis of this model. XML schemas and ER models are formed automatically from this model. Unfortunately, transformation rules [6] are too voluminous to present them in this article. It will be described in details in next articles. At this date, development of the computation-independent data model is in the initial stage. In next articles, we shall describe in details the model formation rules and rules for creation of platform-independent model on its basis. We shall also try to evaluate at what extent the computation-independent model will help to improve the quality of the platform-independent model. 7 References 1. Miller, J., Mukerji, J.: MDA Guide Version 1.0.1 (2003). http://www.omg.org/cgi- bin/doc?omg/03-06-01 2. Object Management Group: UML Profile for NIEM, version 1.0 (2014) 3. Object Management Group: OMG Unified Modeling Language, version 2.4.1 (2011) 4. ISO: ISO/IEC 11179-1:2004. Information technology – Metadata registries – Part 1: Frame- work (2004) 5. Object Management Group: Object Constraint Language, version 2.4 (2014) 6. Nikiforov, D.A.: UML Model to XML Schema 1.1 Transformation. doi:10.5281/ze- nodo.16151 (2013) 7. Object Management Group: Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification, version 1.1 (2011) 8. Pradel, M., Henriksson, J., Aßmann, U.: A good role model for ontologies: Collaborations. In: International Workshop on Semantic-Based Software Development (2007) 9. Henriksson, J., Pradel, M., Zschaler, S., and Pan, J. Z.: Ontology Design and Reuse with Conceptual Roles. In Proceedings of the 2nd International Conference on Web Reasoning and Rule Systems, RR ’08, pp. 104–118, Berlin, Heidelberg. Springer-Verlag (2008) 10. Mizoguchi, R., Kozaki, K., and Kitamura, Y.: Ontological Analyses of Roles. FedCSIS, pp. 489-496 (2012) 11. Partridge, C.: Business Objects: Re-Engineering for Re-Use, section 8.3.1.4. Butterworth Heinemann (1996) 12. UN/CEFACT: Core Components Technical Specification, version 3.0 (2009) 13. W3C: Semantic Web. http://www.w3.org/standards/semanticweb/ 14. Gorshkov, S. “Business Semantics” practice of application integration using Semantic Web technologies. In international data science conference on Analysis of Images, Social Net- works, and Texts (2013) 15. Booth, D., Dowling, C., Fry, C. E., Huff, S., Mandel, J.: RDF as a Universal Healthcare Exchange Language. In Semantic Technology and Business Conference San Francisco, CA (2013) 16. Bedini, I., Matheus, C., Patel-Schneider, P., Boran, A., Nguyen B.: Transforming XML schema to OWL using patterns. ICSC 2011 – 5th IEEE International Conference on Seman- tic Computing, Palo Alto, United States. pp.1-8 (2011) 17. Myroshnichenko, I., Murphy, M. C. Mapping ER Schemas to OWL Ontologies. In ICSC '09. IEEE International Conference on Semantic Computing (2009) 117