FAIR Data Based on Extensible Unifying Data Model Development

© Sergey Stupnikov, © Leonid Kalinichenko

Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia

sstupnikov@ipiran.ru

Abstract. Data sources within data infrastructures are nowadays quite heterogeneous: they are represented using very different data models, varying from the relational model to the NoSQL zoo of data models. A prerequisite for (meta)data interoperability, integration and reuse within a data infrastructure is the unification of source data models and their data manipulation languages. A unifying data model (called canonical) has to be chosen for the data infrastructure, every source data model has to be mapped into the canonical model, and the mappings have to be formalized and verified. The paper overviews data unification techniques that have been developed during recent years and discusses the application of these techniques for data integration within FAIR data infrastructures.

Keywords: FAIR data principles, unifying data model, data integration.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018.

1 Introduction

Data sources nowadays are quite heterogeneous: they are represented using very different data models. The variety of data models includes the traditional relational model and its object-relational extensions, array and graph-based models, semantic models like RDF and OWL, and models for semi-structured data like NoSQL, XML and JSON. These models also provide very different data manipulation and query languages for accessing and modifying data.

A prerequisite for (meta)data interoperability, integration and reuse within a data infrastructure is the unification of source data models and their data manipulation languages. A unifying data model (called canonical) has to be chosen for the data infrastructure. The canonical data model serves as the language for knowledge representation mentioned in the FAIR I1 principle: "(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation" [1][2]. Every source data model has to be mapped into the canonical model; the mapping can be accompanied by an extension of the canonical model if required. A mapping should be formalized and verified: a formal proof should be provided that the mapping preserves the semantics of the data structures and data manipulation operations of the source data model.

As the core of the canonical model some concrete data model like SQL (conforming to the ISO/ANSI SQL standard of 2011 or later) or RDF/RDF Schema with the SPARQL query language can be used. To cover the features of various source data models the canonical model has to be extensible. Examples of extensions are specific data structures (data types), compound operations or restrictions (dependencies). An extension is constructed for every source data model, and the canonical model is formed as the union of the core data model and all extensions.

Data unification techniques were extensively studied at FRC CSC RAS [3]. As the core of the canonical model a specific object-frame language with a broad range of modeling facilities was used [4]. Approaches for the mapping of different classes of source data models were developed: process models [5], semantic models [6][13], array [9] and graph-based [10] models, and some other kinds of NoSQL models [8]. Techniques for the verification of mappings were also developed [11][12]; they apply a formal language based on first-order logic and set theory and supported by automatic and interactive provers.

The techniques mentioned are proposed as a formal basis for (meta)data interoperability, integration and reuse within FAIR data infrastructures. Such infrastructures may combine virtual integration facilities (subject mediators) as well as data warehouses to integrate heterogeneous data sources in an interoperable way.

The rest of the paper is structured as follows: Section 2 overviews the data unification techniques that have been developed during recent years, and Section 3 discusses the application of these techniques for data integration within FAIR data infrastructures.
2 Data Model Unification

Various source data models and their data manipulation languages applied within a data infrastructure have to be unified in the frame of some canonical data model. The main principle of canonical model design (synthesis) for a data infrastructure is the extensibility of the canonical model kernel in a heterogeneous environment [3] that includes the various models used for the representation of the resources of the data infrastructure. The kernel of the canonical model is fixed (for instance, SQL or RDF).

A specific source data model R of the environment is said to be unified if it is mapped into the canonical model C [11][12]. This means the creation of an extension E of the canonical model kernel (note that the extension can be empty) and of a mapping M of the source model into the extended canonical one such that the source model refines the extended canonical one. Model refinement of C by R means that for any admissible specification (schema) r represented in R, its image M(r) in C under the mapping M is refined by the specification r. Such a refining mapping of models preserves the operations and information of a source model while mapping it into the canonical one; this preservation should be formally proven. The canonical model for the environment is synthesized as the union of the extensions constructed for all models of the environment.
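The unification condition can be stated compactly. The following schematic formalization is ours and does not appear in the cited papers; Spec(X) stands for the set of admissible specifications of a model X, C ⊕ E for the canonical kernel together with the extension E, and ⊑ for the refinement relation ("is refined by"):

  % schematic formalization (notation ours)
  M : \mathit{Spec}(R) \to \mathit{Spec}(C \oplus E)
  \forall r \in \mathit{Spec}(R) \,.\; M(r) \sqsubseteq r

That is, every admissible schema r of the source model must refine its image M(r), so that r preserves all operations and information prescribed by M(r).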
The following languages and formal methods are required to support data model mapping:
• a kernel of the canonical data model;
• formal methods allowing to describe data model syntax as well as semantic mappings (transformations) of one model into another;
• formal methods supporting verification of the refinement reached by the mapping.

Within the studies on data unification techniques at FRC CSC RAS, the SYNTHESIS language [4] was used as the kernel of the canonical data model. The SYNTHESIS language, a hybrid semistructured and object-oriented data model, includes the following distinguishing features: facilities for the definition of frames, abstract data types, classes and metaclasses, functions and processes, as well as logical formulae facilities applied for the description of constraints, queries, pre- and post-conditions of functions, and assertions related to processes. For the extension of the canonical model kernel, metaclasses, metaframes and parameterized constructions including assertions and generic data types were applied. The data unification techniques developed can also be adopted for other canonical data model kernels like SQL or RDF.

For the formalization of data model semantics and the verification of refinement the AMN (Abstract Machine Notation) language [14] was used. The language is supported by a technology and tools for proving refinement (the B-technology) [15]. AMN is based on first-order predicate logic and Zermelo-Fraenkel set theory and enables state space specifications and behavior specifications to be considered in an integrated way. The system state is specified by means of state variables and invariants over these variables; system behavior is specified by means of operations defined as generalized substitutions, a sort of predicate transformers. Refinement of AMN specifications is formalized as a set of refinement proof obligations, theorems of first-order logic. Speaking in terms of pre- and post-conditions of operations, refinement of AMN specifications generally means weakening the pre-conditions and strengthening the post-conditions of the corresponding operations included in these specifications. Proof requests are generated automatically and should be proven with the help of an automatic and interactive theorem prover [15].
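For illustration, the following is a minimal AMN sketch of ours (hypothetical, not a specification from [14][15]): a machine modeling the extent of a class, and a refinement of it whose operations have weakened (removed) pre-conditions.

  MACHINE ClassExtent
  SETS OBJ                           /* abstract universe of object identifiers */
  VARIABLES ext
  INVARIANT ext <: OBJ               /* state: the extent is a set of objects */
  INITIALISATION ext := {}
  OPERATIONS
    add(oo) = PRE oo : OBJ & oo /: ext THEN ext := ext \/ {oo} END;
    del(oo) = PRE oo : ext THEN ext := ext - {oo} END
  END

  REFINEMENT ClassExtentR
  REFINES ClassExtent
  VARIABLES exr
  INVARIANT exr <: OBJ & exr = ext   /* gluing invariant linking the state spaces */
  INITIALISATION exr := {}
  OPERATIONS
    add(oo) = BEGIN exr := exr \/ {oo} END;   /* pre-condition weakened to true */
    del(oo) = BEGIN exr := exr - {oo} END
  END

The proof obligations (for instance, that under the abstract pre-condition of add the refined operation re-establishes the gluing invariant) are generated and discharged by the prover.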
For the formal description of model syntax and transformations two approaches were developed and prototyped.

The first approach [11][12] is based on the metacompilation languages SDF (Syntax Definition Formalism) and ASF (Algebraic Specification Formalism). Tool support for these languages is provided by the Meta-Environment [16], which is based on term rewriting techniques. Data model syntax is represented using SDF in a version of extended Backus-Naur form. Data model transformations are defined as ASF language modules constituted by sets of functions; a function defines a transformation of a syntactic element of a source model into a syntactic element of the canonical model, and recursive calls of transformation functions are allowed. From the ASF definition the transformation program code (in the C language) is generated automatically by means of the Meta-Environment tools. The transformation obtained is used for mapping source model specifications into canonical model specifications.

The second approach [17] is based on the Model-Driven Architecture (MDA) [18] proposed by the Object Management Group. Data model abstract syntax, neglecting any syntactic sugar, is defined using the Ecore metamodel (an implementation of OMG's Essential Meta-Object Facility) used in the Eclipse Modeling Framework [19]. Concrete syntax of data models, binding syntactic sugar and abstract syntax, was formalized using the EMFText framework [20]. Data model transformations are defined using the ATLAS Transformation Language (ATL) [21], which combines declarative and imperative features. ATL transformation programs are composed of rules that define how source model elements are matched and navigated to create and initialize the elements of the target models. The type system of ATL is very close to the type system of the OMG Object Constraint Language.
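To give the flavor of such transformations, here is a minimal ATL-style module of ours; the metamodels Rel and Synth and all element names are hypothetical and far simpler than the reference schemas used in [17]. It maps relational tables and columns to classes and attributes of a canonical model:

  -- schematic ATL module: a hypothetical relational metamodel (Rel)
  -- is mapped to a hypothetical canonical metamodel (Synth)
  module Rel2Synth;
  create OUT : Synth from IN : Rel;

  -- declarative rule: every table is matched, a class is created for it
  rule Table2Class {
    from t : Rel!Table
    to   c : Synth!Class (
           name       <- t.name,
           attributes <- t.columns   -- resolved to the attributes created below
         )
  }

  rule Column2Attribute {
    from col : Rel!Column
    to   a   : Synth!Attribute (
           name <- col.name,
           type <- col.type
         )
  }

The binding attributes <- t.columns relies on ATL's implicit tracing: the source columns are resolved into the target attributes created for them by Column2Attribute.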
Using either of the approaches, the construction of a mapping of a source data model R into the canonical model C is divided into the following stages:
• formalization of the syntax and semantics of the models R and C (if the latter has not yet been defined);
• definition of reference schemas of the models R and C (if the latter has not yet been defined);
• integration of the reference schemas of the models R and C;
• creation of a required extension E of the canonical model C;
• construction of a transformation of the model R into the extended canonical model;
• verification of the refinement of the extended canonical model by the model R.

The reference schema of a data model is an abstract description containing concepts related to the constructs of the model and significant associations among these concepts. Both concepts and associations may be annotated by verbal definitions (looking like entries in an explanatory dictionary). In MDA terms, reference schemas are just metamodels conforming to the Ecore metamodel.

Formalization of data model semantics and verification of data model refinement can be performed in two ways.

In the first way, the formalization of data model semantics means the construction of transformations of source and canonical data model specifications into AMN specifications. So for any specification of a source data model the AMN specification expressing its semantics is generated automatically (for instance, in [11][12] the OWL Web Ontology Language [22] is considered as a source model and its semantics in AMN is illustrated by example); likewise, for any specification of the canonical data model the AMN specification expressing its semantics is generated automatically [23]. After that, the refinement of a canonical data model specification by a source data model specification is reduced to the refinement of their semantic AMN specifications and can be verified by the refinement theorem prover [15]. So verification of model refinement is realized over a set of source model specification samples.

In the second way, the semantics of a data model (source or canonical) as a whole is expressed by an AMN specification. For instance, in [9] AMN semantics for an array data model is defined, and in [10] AMN semantics for a graph data model; AMN semantics for the SYNTHESIS language as the canonical data model was provided as well [9][10]. The data structures used in data models are represented by variables in the AMN specifications, properties of the data structures are represented by AMN invariants, and typical operations of the data models are represented by AMN operations. The refinement of the AMN specification MC corresponding to the canonical data model C by the AMN specification MR corresponding to a source data model R should then be proved using the refinement theorem prover [15].
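In the spirit of the second way, the semantics of a toy array data model might be captured as follows (a sketch of ours, far simpler than the array model specification of [9]; all names are hypothetical): the state of the whole data model is held in a single variable, its properties in an invariant, and its typical operations in AMN operations.

  MACHINE ArrayModel
  SETS NAME; VALUE                   /* array names and cell values */
  VARIABLES arrays
  /* each named array is a partial map from indices to values */
  INVARIANT arrays : NAME +-> (NAT +-> VALUE)
  INITIALISATION arrays := {}
  OPERATIONS
    create(nn) = PRE nn : NAME & nn /: dom(arrays)
                 THEN arrays(nn) := {} END;
    store(nn, ii, vv) = PRE nn : dom(arrays) & ii : NAT & vv : VALUE
                        THEN arrays(nn) := arrays(nn) <+ {ii |-> vv} END;
    vv <-- fetch(nn, ii) = PRE nn : dom(arrays) & ii : dom(arrays(nn))
                           THEN vv := arrays(nn)(ii) END
  END

In contrast to the first way, the refinement of the specification MC by MR is then proved once for the pair of models rather than over a set of specification samples.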
Partial automation of the data unification techniques mentioned above was implemented within the Unifying Information Models Constructor (Model Unifier for short) [11][12]. The Unifier consists of the following main components:
• a tool for the formal description and correctness checking of model syntax and transformations (Meta-Environment, ATL Tools);
• Atelier B [15], supporting AMN and providing facilities for proving specification refinement;
• a model manager.

Meta-Environment, ATL Tools and Atelier B are third-party products. The model manager provides a graphical interface allowing an expert to search for, view and register data models and extensions of the canonical model, and to call specific components for generating templates, editing and integrating reference schemas, generating templates for translators of source models into the canonical one, translating source model specifications into AMN or into canonical specifications, and translating canonical specifications into AMN.

In recent years the data unification techniques were applied to a wide range of source data models. In [5] a canonical process model was synthesized for the environment of workflow patterns classified by W. M. P. van der Aalst; the canonical process model thus possesses the property of completeness with respect to a broad class of process models used in various Workflow Management Systems, as well as the languages used for process composition of Web services. In [11][12] the OWL Web Ontology Language was unified with the SYNTHESIS language, and in [6] OWL 2 QL was mapped into SYNTHESIS. In [7] the application of the canonical model synthesis methods to value inventive data models was discussed; the distinguishing feature of these data models is the inference of new, unknown values in the process of query answering. In [8] an approach to mapping different types of NoSQL models into the object model of the SYNTHESIS language used as the unifying data model was considered. In [9] unification of the array-based data model used in the SciDB DBMS was considered, and in [10] unification of an attributed graph data model; for both models verification using AMN specifications is provided. In [13] issues of the unification of RDF with the accompanying RDF Schema and SPARQL languages were discussed.

3 FAIR Data Based on Data Model Unification

The following levels of integration (from higher to lower) can be distinguished: data model integration (unification), schema matching and integration (metadata integration), and data integration proper. Usually the completion of integration on a higher level is a prerequisite for integration on a lower level. Obviously the highest level, i.e. data model unification, is a prerequisite for (meta)data interoperability, integration and reuse within FAIR data infrastructures, and the data model unification techniques overviewed in the previous section can be considered as a formal basis for achieving FAIRness of data.

Any level of integration makes data more FAIR: integrated data are much easier to find, access and reuse, and integrated data are also more interoperable than heterogeneous data stored in different data sources. The most mature level of integration is achieved within data integration systems like subject mediators or data warehouses.

Subject mediators implement virtual integration. User queries are defined in some unified data model, decomposed into sets of subqueries, and the subqueries are transferred to the heterogeneous data sources. Data sources are connected with a subject mediator via wrappers, which transform queries into the source data models and transform query answers from the source data models into the unified mediator data model. Query answers are transferred by the wrappers back to the mediator, combined, and sent to the users. One of the latest trends is the construction of subject mediators over data lakes [24].

Data warehouses implement materialized integration: all required data are extracted from the sources, transformed into the unified warehouse data model, and stored into a warehouse.
Any kind of integration system requires a unified data model, so one of the important issues to be resolved for data integration within FAIR data infrastructures is the choice of the canonical model kernel. Even the choice between SQL and RDF is difficult. On the one hand, SQL is supported by industrial standards, methods and technologies that have been evolving for decades. On the other hand, RDF is a W3C Recommendation supported by triplestore vendors; it is strongly connected with the OWL ontological framework, allows flexible evolution of data schemas, and provides logical inference in a native way, which is very important for knowledge bases.

To integrate heterogeneous data sources in an interoperable way, FAIR data infrastructures may support both mentioned kinds of data integration systems, and also combined data integration systems [25] in which data warehouses are considered as resources to be integrated within subject mediators. For all kinds of data integration systems the data model unification techniques can provide a formal basis.

Acknowledgments. The research is partially supported by the Russian Foundation for Basic Research, project 18-07-01434.

References

[1] Wilkinson, M. D. et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 (2016). DOI: 10.1038/sdata.2016.18
[2] Wilkinson, M. D.: Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Preprints (2016). https://doi.org/10.7287/peerj.preprints.2522v1
[3] Kalinichenko, L. A.: Canonical Model Development Techniques Aimed at Semantic Interoperability in the Heterogeneous World of Information Modeling. In: Knowledge and Model Driven Information Systems Engineering for Networked Organizations: Proc. of the CAiSE INTEROP Workshop, pp. 101-116. Riga: Riga Technical University (2004)
[4] Kalinichenko, L. A., Stupnikov, S. A., Martynov, D. O.: SYNTHESIS: a Language for Canonical Information Modeling and Mediator Definition for Problem Solving in Heterogeneous Information Resource Environments. Moscow: IPI RAN (2007). 171 p.
[5] Kalinichenko, L. A., Stupnikov, S. A., Zemtsov, N. A.: Extensible Canonical Process Model Synthesis Applying Formal Interpretation. In: Advances in Databases and Information Systems: Proc. of the East European Conference. LNCS 3631, pp. 183-198. Berlin-Heidelberg: Springer-Verlag (2005)
[6] Kalinichenko, L. A., Stupnikov, S. A.: OWL as Yet Another Data Model to be Integrated. In: Advances in Databases and Information Systems: Proc. II of the 15th East-European Conference, pp. 178-189. Vienna: Austrian Computer Society (2011)
[7] Kalinichenko, L. A., Stupnikov, S. A.: Synthesis of the Canonical Models for Database Integration Preserving Semantics of the Value Inventive Data Models. In: Advances in Databases and Information Systems: Proc. of the 16th East European Conference. LNCS 7503, pp. 223-239. Berlin-Heidelberg: Springer-Verlag (2012)
[8] Skvortsov, N. A.: Mapping of NoSQL Data Models to Object Specifications. In: Proc. of the 14th Russian Conference on Digital Libraries RCDL'2012. CEUR Workshop Proceedings 934:53-62 (2012)
[9] Stupnikov, S. A.: Unification of an Array Data Model for the Integration of Heterogeneous Information Resources. In: Proc. of the 14th Russian Conference on Digital Libraries RCDL'2012. CEUR Workshop Proceedings 934:42-52 (2012)
[10] Stupnikov, S. A.: Mapping of a Graph Data Model into an Object-Frame Canonical Information Model for the Development of Heterogeneous Information Resources Integration Systems. In: Proc. of the 15th Russian Conference on Digital Libraries RCDL'2013. CEUR Workshop Proceedings 1108:85-94 (2013)
[11] Zakharov, V. N., Kalinichenko, L. A., Sokolov, I. A., Stupnikov, S. A.: Development of Canonical Information Models for Integrated Information Systems. Informatics and Applications 1(2):15-38 (2007)
[12] Kalinichenko, L. A., Stupnikov, S. A.: Constructing of Mappings of Heterogeneous Information Models into the Canonical Models of Integrated Information Systems. In: Advances in Databases and Information Systems: Proc. of the 12th East-European Conference, pp. 106-122. Pori: Tampere University of Technology (2008)
[13] Skvortsov, N. A.: Mapping of RDF Data Model into the Canonical Model of Subject Mediators. In: Proc. of the 15th Russian Conference on Digital Libraries RCDL'2013. CEUR Workshop Proceedings 1108:95-101 (2013)
[14] Abrial, J.-R.: The B-Book: Assigning Programs to Meanings. Cambridge: Cambridge University Press (1996)
[15] Atelier B, the industrial tool to efficiently deploy the B Method. http://www.atelierb.eu/
[16] Van den Brand, M. G. J. et al.: The ASF+SDF Meta-Environment: a Component-Based Language Development Environment. In: Wilhelm, R. (ed.) Compiler Construction 2001, pp. 365-370. Springer (2001)
[17] Stupnikov, S. A., Kalinichenko, L. A.: Methods for Semi-automatic Construction of Information Models Transformations. In: Proc. of the 13th East-European Conference Advances in Databases and Information Systems, Workshop Model-Driven Architecture: Foundations, Practices and Implications (MDA), pp. 432-440. Riga: Riga Technical University (2009)
[18] Object Management Group: Model Driven Architecture (MDA). MDA Guide rev. 2.0. OMG Document ormsc/2014-06-01 (2014)
[19] Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling Framework, 2nd Edition. Addison-Wesley Professional (2008)
[20] EMFText Concrete Syntax Mapper. https://github.com/DevBoost/EMFText
[21] ATL, a model transformation technology. https://eclipse.org/atl/
[22] OWL Web Ontology Language Reference. W3C Recommendation. http://www.w3.org/TR/owl-ref/ (2004)
[23] Stupnikov, S. A.: A Semantic Transformation of the Canonical Information Model into a Formal Specification Language for the Refinement Verification. In: Proc. of the 12th Russian Conference on Digital Libraries RCDL'2010, pp. 383-391. Kazan: Kazan Federal University (2010)
[24] Hai, R., Quix, C., Zhou, C.: Query Rewriting for Heterogeneous Data Lakes. In: Benczúr, A., Thalheim, B., Horváth, T. (eds.) Advances in Databases and Information Systems. ADBIS 2018. LNCS 11019:35-49. Springer (2018)
[25] Stupnikov, S. A., Vovchenko, A. E.: Combined Virtual and Materialized Environment for Integration of Large Heterogeneous Data Collections. In: Proc. of the 16th Russian Conference on Digital Libraries RCDL 2014. CEUR Workshop Proceedings 1297:201-210 (2014)