FAIR Data Based on Extensible Unifying Data Model Development

© Sergey Stupnikov, © Leonid Kalinichenko

Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia

sstupnikov@ipiran.ru

Abstract. Data sources within data infrastructures are nowadays quite heterogeneous: they are represented using very different data models, varying from the relational model to the NoSQL zoo of data models. A prerequisite for (meta)data interoperability, integration and reuse within a data infrastructure is the unification of source data models and their data manipulation languages. A unifying data model (called canonical) has to be chosen for the data infrastructure, every source data model has to be mapped into the canonical model, and the mappings have to be formalized and verified. The paper overviews data unification techniques that have been developed during recent years and discusses the application of these techniques for data integration within FAIR data infrastructures.

Keywords: FAIR data principles, unifying data model, data integration.

Proceedings of the XX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2018), Moscow, Russia, October 9-12, 2018.

1 Introduction

Data sources nowadays are quite heterogeneous: they are represented using very different data models. The variety of data models includes the traditional relational model and its object-relational extensions, array and graph-based models, semantic models like RDF and OWL, and models for semi-structured data like NoSQL, XML and JSON. These models also provide very different data manipulation and query languages for accessing and modifying data.

A prerequisite for (meta)data interoperability, integration and reuse within a data infrastructure is the unification of source data models and their data manipulation languages. A unifying data model (called canonical) has to be chosen for the data infrastructure. The canonical data model serves as the language for knowledge representation mentioned in the FAIR I1 principle: "(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation" [1][2]. Every source data model has to be mapped into the canonical model; the mapping can be accompanied by an extension of the canonical model if required. A mapping should be formalized and verified: a formal proof should be provided that the mapping preserves the semantics of the data structures and data manipulation operations of the source data model.

As the core of the canonical model some concrete data model like SQL (conforming to the ISO/ANSI SQL standard of 2011 or later) or RDF/RDF Schema with the SPARQL query language can be used. To cover the features of various source data models the canonical model has to be extensible. Examples of extensions are specific data structures (data types), compound operations or restrictions (dependencies). An extension is constructed for every source data model, and the canonical model is formed as the union of the core data model and all extensions.

Data unification techniques were extensively studied at FRC CSC RAS [3]. As the core of the canonical model a specific object-frame language with a broad range of modeling facilities was used [4]. Approaches for the mapping of different classes of source data models were developed: process models [5], semantic models [6][13], array [9] and graph-based [10] models, and some other kinds of NoSQL models [8]. Techniques for the verification of mappings were also developed [11][12]; they apply a formal language based on first-order logic and set theory and supported by automatic and interactive provers.

The techniques mentioned are proposed as a formal basis for (meta)data interoperability, integration and reuse within FAIR data infrastructures. Such infrastructures may combine virtual integration facilities (subject mediators) as well as data warehouses to integrate heterogeneous data sources in an interoperable way.

The rest of the paper is structured as follows: Section 2 overviews the data unification techniques that have been developed during recent years, and Section 3 discusses the application of these techniques for data integration within FAIR data infrastructures.
2 Data Model Unification

Various source data models and their data manipulation languages applied within a data infrastructure have to be unified in the frame of some canonical data model. The main principle of canonical model design (synthesis) for a data infrastructure is the extensibility of the canonical model kernel in a heterogeneous environment [3] that includes the various models used for the representation of the resources of the data infrastructure. The kernel of the canonical model is fixed (for instance, SQL or RDF).

A specific source data model R of the environment is said to be unified if it is mapped into the canonical model C [11][12]. This means the creation of an extension E of the canonical model kernel (note that the extension can be empty) and of a mapping M of the source model into the extended canonical one such that the source model refines the extended canonical one. Model refinement of C by R means that for any admissible specification (schema) r represented in R, its image M(r) in C under the mapping M is refined by the specification r. Such a refining mapping of models preserves the operations and information of a source model while mapping it into the canonical one; this preservation should be formally proven. The canonical model for the environment is synthesized as the union of the extensions constructed for all models of the environment.
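The unification condition can be stated compactly. The following schematic formalization is ours and does not appear in the cited papers; Spec(X) stands for the set of admissible specifications of a model X, C ⊕ E for the canonical kernel together with the extension E, and ⊑ for the refinement relation ("is refined by"):

  % schematic formalization (notation ours)
  M : \mathit{Spec}(R) \to \mathit{Spec}(C \oplus E)
  \forall r \in \mathit{Spec}(R) \,.\; M(r) \sqsubseteq r

That is, every admissible schema r of the source model must refine its image M(r), so that r preserves all operations and information prescribed by M(r).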
The following languages and formal methods are required to support data model mapping:
• a kernel of the canonical data model;
• formal methods allowing to describe data model syntax as well as semantic mappings (transformations) of one model into another;
• formal methods supporting verification of the refinement reached by the mapping.

Within the studies on data unification techniques at FRC CSC RAS, the SYNTHESIS language [4] was used as the kernel of the canonical data model. The SYNTHESIS language, a hybrid semistructured and object-oriented data model, includes the following distinguishing features: facilities for the definition of frames, abstract data types, classes and metaclasses, functions and processes, as well as logical formulae facilities applied for the description of constraints, queries, pre- and post-conditions of functions, and assertions related to processes. For the extension of the canonical model kernel, metaclasses, metaframes and parameterized constructions including assertions and generic data types were applied. The data unification techniques developed can also be adopted for other canonical data model kernels like SQL or RDF.

For the formalization of data model semantics and the verification of refinement the AMN (Abstract Machine Notation) language [14] was used. The language is supported by a technology and tools for proving refinement (the B-technology) [15]. AMN is based on first-order predicate logic and Zermelo-Fraenkel set theory and enables state space specifications and behavior specifications to be considered in an integrated way. The system state is specified by means of state variables and invariants over these variables; system behavior is specified by means of operations defined as generalized substitutions, a sort of predicate transformers. Refinement of AMN specifications is formalized as a set of refinement proof obligations, theorems of first-order logic. Speaking in terms of pre- and post-conditions of operations, refinement of AMN specifications generally means weakening the pre-conditions and strengthening the post-conditions of the corresponding operations included in these specifications. Proof requests are generated automatically and should be proven with the help of an automatic and interactive theorem prover [15].
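For illustration, the following is a minimal AMN sketch of ours (hypothetical, not a specification from [14][15]): a machine modeling the extent of a class, and a refinement of it whose operations have weakened (removed) pre-conditions.

  MACHINE ClassExtent
  SETS OBJ                           /* abstract universe of object identifiers */
  VARIABLES ext
  INVARIANT ext <: OBJ               /* state: the extent is a set of objects */
  INITIALISATION ext := {}
  OPERATIONS
    add(oo) = PRE oo : OBJ & oo /: ext THEN ext := ext \/ {oo} END;
    del(oo) = PRE oo : ext THEN ext := ext - {oo} END
  END

  REFINEMENT ClassExtentR
  REFINES ClassExtent
  VARIABLES exr
  INVARIANT exr <: OBJ & exr = ext   /* gluing invariant linking the state spaces */
  INITIALISATION exr := {}
  OPERATIONS
    add(oo) = BEGIN exr := exr \/ {oo} END;   /* pre-condition weakened to true */
    del(oo) = BEGIN exr := exr - {oo} END
  END

The proof obligations (for instance, that under the abstract pre-condition of add the refined operation re-establishes the gluing invariant) are generated and discharged by the prover.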
For the formal description of model syntax and transformations two approaches were developed and prototyped.

The first approach [11][12] is based on the metacompilation languages SDF (Syntax Definition Formalism) and ASF (Algebraic Specification Formalism). Tool support for these languages is provided by the Meta-Environment [16], which is based on term rewriting techniques. Data model syntax is represented using SDF in a version of extended Backus-Naur form. Data model transformations are defined as ASF language modules constituted by sets of functions; a function defines a transformation of a syntactic element of a source model into a syntactic element of the canonical model, and recursive calls of transformation functions are allowed. From the ASF definition the transformation program code (in the C language) is generated automatically by means of the Meta-Environment tools. The transformation obtained is used for mapping source model specifications into canonical model specifications.

The second approach [17] is based on the Model-Driven Architecture (MDA) [18] proposed by the Object Management Group. Data model abstract syntax, neglecting any syntactic sugar, is defined using the Ecore metamodel (an implementation of OMG's Essential Meta-Object Facility) used in the Eclipse Modeling Framework [19]. Concrete syntax of data models, binding syntactic sugar and abstract syntax, was formalized using the EMFText framework [20]. Data model transformations are defined using the ATLAS Transformation Language (ATL) [21], which combines declarative and imperative features. ATL transformation programs are composed of rules that define how source model elements are matched and navigated to create and initialize the elements of the target models. The type system of ATL is very close to the type system of the OMG Object Constraint Language.
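To give the flavor of such transformations, here is a minimal ATL-style module of ours; the metamodels Rel and Synth and all element names are hypothetical and far simpler than the reference schemas used in [17]. It maps relational tables and columns to classes and attributes of a canonical model:

  -- schematic ATL module: a hypothetical relational metamodel (Rel)
  -- is mapped to a hypothetical canonical metamodel (Synth)
  module Rel2Synth;
  create OUT : Synth from IN : Rel;

  -- declarative rule: every table is matched, a class is created for it
  rule Table2Class {
    from t : Rel!Table
    to   c : Synth!Class (
           name       <- t.name,
           attributes <- t.columns   -- resolved to the attributes created below
         )
  }

  rule Column2Attribute {
    from col : Rel!Column
    to   a   : Synth!Attribute (
           name <- col.name,
           type <- col.type
         )
  }

The binding attributes <- t.columns relies on ATL's implicit tracing: the source columns are resolved into the target attributes created for them by Column2Attribute.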
Using either of the approaches, the construction of a mapping of a source data model R into the canonical model C is divided into the following stages:
• formalization of the syntax and semantics of the models R and C (if the latter has not yet been defined);
• definition of reference schemas of the models R and C (if the latter has not yet been defined);
• integration of the reference schemas of the models R and C;
• creation of a required extension E of the canonical model C;
• construction of a transformation of the model R into the extended canonical model;
• verification of the refinement of the extended canonical model by the model R.

The reference schema of a data model is an abstract description containing concepts related to the constructs of the model and significant associations among these concepts. Both concepts and associations may be annotated by verbal definitions (looking like entries in an explanatory dictionary). In MDA terms, reference schemas are just metamodels conforming to the Ecore metamodel.

Formalization of data model semantics and verification of data model refinement can be performed in two ways.

In the first way, the formalization of data model semantics means the construction of transformations of source and canonical data model specifications into AMN specifications. So for any specification of a source data model the AMN specification expressing its semantics is generated automatically (for instance, in [11][12] the OWL Web Ontology Language [22] is considered as a source model and its semantics in AMN is illustrated by example); likewise, for any specification of the canonical data model the AMN specification expressing its semantics is generated automatically [23]. After that, the refinement of a canonical data model specification by a source data model specification is reduced to the refinement of their semantic AMN specifications and can be verified by the refinement theorem prover [15]. So verification of model refinement is realized over a set of source model specification samples.

In the second way, the semantics of a data model (source or canonical) as a whole is expressed by an AMN specification. For instance, in [9] AMN semantics for an array data model is defined, and in [10] AMN semantics for a graph data model; AMN semantics for the SYNTHESIS language as the canonical data model was provided as well [9][10]. The data structures used in data models are represented by variables in the AMN specifications, properties of the data structures are represented by AMN invariants, and typical operations of the data models are represented by AMN operations. The refinement of the AMN specification MC corresponding to the canonical data model C by the AMN specification MR corresponding to a source data model R should then be proved using the refinement theorem prover [15].
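In the spirit of the second way, the semantics of a toy array data model might be captured as follows (a sketch of ours, far simpler than the array model specification of [9]; all names are hypothetical): the state of the whole data model is held in a single variable, its properties in an invariant, and its typical operations in AMN operations.

  MACHINE ArrayModel
  SETS NAME; VALUE                   /* array names and cell values */
  VARIABLES arrays
  /* each named array is a partial map from indices to values */
  INVARIANT arrays : NAME +-> (NAT +-> VALUE)
  INITIALISATION arrays := {}
  OPERATIONS
    create(nn) = PRE nn : NAME & nn /: dom(arrays)
                 THEN arrays(nn) := {} END;
    store(nn, ii, vv) = PRE nn : dom(arrays) & ii : NAT & vv : VALUE
                        THEN arrays(nn) := arrays(nn) <+ {ii |-> vv} END;
    vv <-- fetch(nn, ii) = PRE nn : dom(arrays) & ii : dom(arrays(nn))
                           THEN vv := arrays(nn)(ii) END
  END

In contrast to the first way, the refinement of the specification MC by MR is then proved once for the pair of models rather than over a set of specification samples.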
Partial automation of the data unification techniques mentioned above was implemented within the Unifying Information Models Constructor (Model Unifier for short) [11][12]. The Unifier consists of the following main components:
• a tool for the formal description and correctness checking of model syntax and transformations (Meta-Environment, ATL Tools);
• Atelier B [15], supporting AMN and providing facilities for proving specification refinement;
• a model manager.

Meta-Environment, ATL Tools and Atelier B are third-party products. The model manager provides a graphical interface allowing an expert to search for, view and register data models and extensions of the canonical model, and to call specific components for generating templates, editing and integrating reference schemas, generating templates for translators of source models into the canonical one, translating source model specifications into AMN or into canonical specifications, and translating canonical specifications into AMN.

In recent years the data unification techniques were applied to a wide range of source data models. In [5] a canonical process model was synthesized for the environment of workflow patterns classified by W. M. P. van der Aalst; the canonical process model thus possesses the property of completeness with respect to a broad class of process models used in various Workflow Management Systems, as well as the languages used for process composition of Web services. In [11][12] the OWL Web Ontology Language was unified with the SYNTHESIS language, and in [6] OWL 2 QL was mapped into SYNTHESIS. In [7] the application of the canonical model synthesis methods to value inventive data models was discussed; the distinguishing feature of these data models is the inference of new, unknown values in the process of query answering. In [8] an approach to mapping different types of NoSQL models into the object model of the SYNTHESIS language used as the unifying data model was considered. In [9] unification of the array-based data model used in the SciDB DBMS was considered, and in [10] unification of an attributed graph data model; for both models verification using AMN specifications is provided. In [13] issues of the unification of RDF with the accompanying RDF Schema and SPARQL languages were discussed.

3 FAIR Data Based on Data Model Unification

The following levels of integration (from higher to lower) can be distinguished: data model integration (unification), schema matching and integration (metadata integration), and data integration proper. Usually the completion of integration on a higher level is a prerequisite for integration on a lower level. Obviously the highest level, i.e. data model unification, is a prerequisite for (meta)data interoperability, integration and reuse within FAIR data infrastructures, and the data model unification techniques overviewed in the previous section can be considered as a formal basis for achieving FAIRness of data.

Any level of integration makes data more FAIR: integrated data are much easier to find, access and reuse, and integrated data are also more interoperable than heterogeneous data stored in different data sources. The most mature level of integration is achieved within data integration systems like subject mediators or data warehouses.

Subject mediators implement virtual integration. User queries are defined in some unified data model, decomposed into sets of subqueries, and the subqueries are transferred to the heterogeneous data sources. Data sources are connected with a subject mediator via wrappers, which transform queries into the source data models and transform query answers from the source data models into the unified mediator data model. Query answers are transferred by the wrappers back to the mediator, combined, and sent to the users. One of the latest trends is the construction of subject mediators over data lakes [24].

Data warehouses implement materialized integration: all required data are extracted from the sources, transformed into the unified warehouse data model, and stored into a warehouse.
Any kind of integration system requires a unified data model, so one of the important issues to be resolved for data integration within FAIR data infrastructures is the choice of the canonical model kernel. Even the choice between SQL and RDF is difficult. On the one hand, SQL is supported by industrial standards, methods and technologies that have been evolving for decades. On the other hand, RDF is a W3C Recommendation supported by triplestore vendors; it is strongly connected with the OWL ontological framework, allows flexible evolution of data schemas, and provides logical inference in a native way, which is very important for knowledge bases.

To integrate heterogeneous data sources in an interoperable way, FAIR data infrastructures may support both mentioned kinds of data integration systems, and also combined data integration systems [25] in which data warehouses are considered as resources to be integrated within subject mediators. For all kinds of data integration systems the data model unification techniques can provide a formal basis.

Acknowledgments. The research is partially supported by the Russian Foundation for Basic Research, project 18-07-01434.

References

[1] Wilkinson, M. D. et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 (2016). DOI: 10.1038/sdata.2016.18
[2] Wilkinson, M. D.: Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Preprints (2016). https://doi.org/10.7287/peerj.preprints.2522v1
[3] Kalinichenko, L. A.: Canonical Model Development Techniques Aimed at Semantic Interoperability in the Heterogeneous World of Information Modeling. In: Knowledge and Model Driven Information Systems Engineering for Networked Organizations: Proc. of the CAiSE INTEROP Workshop, pp. 101-116. Riga: Riga Technical University (2004)
[4] Kalinichenko, L. A., Stupnikov, S. A., Martynov, D. O.: SYNTHESIS: a Language for Canonical Information Modeling and Mediator Definition for Problem Solving in Heterogeneous Information Resource Environments. Moscow: IPI RAN (2007). 171 p.
[5] Kalinichenko, L. A., Stupnikov, S. A., Zemtsov, N. A.: Extensible Canonical Process Model Synthesis Applying Formal Interpretation. In: Advances in Databases and Information Systems: Proc. of the East European Conference. LNCS 3631, pp. 183-198. Berlin-Heidelberg: Springer-Verlag (2005)
[6] Kalinichenko, L. A., Stupnikov, S. A.: OWL as Yet Another Data Model to be Integrated. In: Advances in Databases and Information Systems: Proc. II of the 15th East-European Conference, pp. 178-189. Vienna: Austrian Computer Society (2011)
[7] Kalinichenko, L. A., Stupnikov, S. A.: Synthesis of the Canonical Models for Database Integration Preserving Semantics of the Value Inventive Data Models. In: Advances in Databases and Information Systems: Proc. of the 16th East European Conference. LNCS 7503, pp. 223-239. Berlin-Heidelberg: Springer-Verlag (2012)
[8] Skvortsov, N. A.: Mapping of NoSQL Data Models to Object Specifications. In: Proc. of the 14th Russian Conference on Digital Libraries RCDL'2012. CEUR Workshop Proceedings 934:53-62 (2012)
[9] Stupnikov, S. A.: Unification of an Array Data Model for the Integration of Heterogeneous Information Resources. In: Proc. of the 14th Russian Conference on Digital Libraries RCDL'2012. CEUR Workshop Proceedings 934:42-52 (2012)
[10] Stupnikov, S. A.: Mapping of a Graph Data Model into an Object-Frame Canonical Information Model for the Development of Heterogeneous Information Resources Integration Systems. In: Proc. of the 15th Russian Conference on Digital Libraries RCDL'2013. CEUR Workshop Proceedings 1108:85-94 (2013)
[11] Zakharov, V. N., Kalinichenko, L. A., Sokolov, I. A., Stupnikov, S. A.: Development of Canonical Information Models for Integrated Information Systems. Informatics and Applications 1(2):15-38 (2007)
[12] Kalinichenko, L. A., Stupnikov, S. A.: Constructing of Mappings of Heterogeneous Information Models into the Canonical Models of Integrated Information Systems. In: Advances in Databases and Information Systems: Proc. of the 12th East-European Conference, pp. 106-122. Pori: Tampere University of Technology (2008)
[13] Skvortsov, N. A.: Mapping of RDF Data Model into the Canonical Model of Subject Mediators. In: Proc. of the 15th Russian Conference on Digital Libraries RCDL'2013. CEUR Workshop Proceedings 1108:95-101 (2013)
[14] Abrial, J.-R.: The B-Book: Assigning Programs to Meanings. Cambridge: Cambridge University Press (1996)
[15] Atelier B, the industrial tool to efficiently deploy the B Method. http://www.atelierb.eu/
[16] Van den Brand, M. G. J. et al.: The ASF+SDF Meta-Environment: a Component-Based Language Development Environment. In: Wilhelm, R. (ed.) Compiler Construction 2001, pp. 365-370. Springer (2001)
[17] Stupnikov, S. A., Kalinichenko, L. A.: Methods for Semi-automatic Construction of Information Models Transformations. In: Proc. of the 13th East-European Conference Advances in Databases and Information Systems, Workshop Model-Driven Architecture: Foundations, Practices and Implications (MDA), pp. 432-440. Riga: Riga Technical University (2009)
[18] Object Management Group: Model Driven Architecture (MDA). MDA Guide rev. 2.0. OMG Document ormsc/2014-06-01 (2014)
[19] Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling Framework, 2nd Edition. Addison-Wesley Professional (2008)
[20] EMFText Concrete Syntax Mapper. https://github.com/DevBoost/EMFText
[21] ATL, a model transformation technology. https://eclipse.org/atl/
[22] OWL Web Ontology Language Reference. W3C Recommendation. http://www.w3.org/TR/owl-ref/ (2004)
[23] Stupnikov, S. A.: A Semantic Transformation of the Canonical Information Model into a Formal Specification Language for the Refinement Verification. In: Proc. of the 12th Russian Conference on Digital Libraries RCDL'2010, pp. 383-391. Kazan: Kazan Federal University (2010)
[24] Hai, R., Quix, C., Zhou, C.: Query Rewriting for Heterogeneous Data Lakes. In: Benczúr, A., Thalheim, B., Horváth, T. (eds.) Advances in Databases and Information Systems. ADBIS 2018. LNCS 11019:35-49. Springer (2018)
[25] Stupnikov, S. A., Vovchenko, A. E.: Combined Virtual and Materialized Environment for Integration of Large Heterogeneous Data Collections. In: Proc. of the 16th Russian Conference on Digital Libraries RCDL 2014. CEUR Workshop Proceedings 1297:201-210 (2014)