=Paper=
{{Paper
|id=Vol-2022/paper34
|storemode=property
|title=
On an Approach to Data Integration: Concept, Formal Foundations and Data Model

|pdfUrl=https://ceur-ws.org/Vol-2022/paper34.pdf
|volume=Vol-2022
|authors=Manuk G. Manukyan
|dblpUrl=https://dblp.org/rec/conf/rcdl/Manukyan17
}}
==
On an Approach to Data Integration: Concept, Formal Foundations and Data Model
==
<pdf width="1500px">https://ceur-ws.org/Vol-2022/paper34.pdf</pdf>
<pre>
      On an Approach to Data Integration: Concept, Formal
                 Foundations and Data Model
                                             © Manuk G. Manukyan
                                            Yerevan State University,
                                               Yerevan, Armenia
                                                   mgm@ysu.am
           Abstract. In the frame of an extensible canonical data model a formalization of data integration con-
    cept is proposed. We provide virtual and materialized integration of data as well as the possibility to support
    data cubes with hierarchical dimensions. The considered approach of formalization of data integration con-
    cept is based on the so-called content dictionaries. Namely, by means of these dictionaries we are formally
    defining basic concepts of database theory, metadata about these concepts, and the data integration concept.
    A computationally complete language is used to extract data from several sources, to create the materialized
    view, and to effectively organize queries on the multidimensional data.
    In memory of Garush Manukyan, my father.
           This work was supported by the RA MES State Committee of Science, in the frames of the research
    project N 15T-18350.
           Keywords: data integration, mediator, data warehouse, data cube, canonical data model, OPENMath,
    grid file, XML.

                                                                   which is based on the grid files [18] concept. We con-
1 Introduction                                                     sider the concept of grid files as one of the adequate
      The emergence of a new paradigm in science and               formalisms for effective management of big data. Effi-
various applications of information technology (IT) are            cient algorithms for storage and access of that directory
related to issues of big data handling [21]. The concept           are proposed in order to minimize memory usage and
of big data is relatively new and involves the growing             lookup operations complexities. Estimations of com-
role of data in all areas of human activity beginning              plexities for these algorithms are presented. In fact, the
with research and ending with innovative developments              concept of grid files allows to effectively organize que-
in business. Such data is difficult to process and analyze         ries on multidimensional data [5] and can be used for
using conventional database technologies. In this con-             efficient data cubes storage in data warehouses [13,19].
nection, the creation of new IT is expected in which               A prototype to support the considered dynamic indexa-
data becomes dominant for new approaches to concep-                tion scheme has been created and its performance was
tualization, organization, and implementation of sys-              compared with one of the most demanded NoSQL data-
tems to solve problems that were previously considered             bases [17].
extremely hard or, in some cases, impossible to solve.                   In this paper a formalization of the data integration
Unprecedented scale of development in the big data                 concept is proposed using a mechanism of the content
area and the U.S. and European programs related to big             dictionaries (similarly ontologies) of the OPENMath
data underscore the importance of this trend in IT.                [4]. Subjects of the formalization are the basic concepts
      In the above discussed context the problems of da-           of database theory, metadata about these concepts and
ta integration are very actual. Within of our approach to          the data integration concept. The result of the formaliza-
data integration an extensible canonical model has been            tion are a set of content dictionaries, constructed as
developed [16]. We have published a number of papers               XML DTDs on the base of OPENMath and are used to
that are devoted to the investigation of data virtual and          model the databases concepts. With this approach,
materialized data integration problems, for instance [15,          schema of an integrated database is an instance of con-
17]. Our approach to data integration is based on the              tent dictionary of the data integration concept. Within
works of the SYNTHESIS group (IPI RAS) [2, 9–12,                   the considered approach is provided virtual and materi-
22–25], who are pioneers in the area of justifiable data           alized integration of data as well as the possibility to
models mapping for heterogeneous databases integra-                support data cubes with hierarchical dimensions. Using
tion. To support materialized integration of data during           OPENMath as the kernel of the canonical data model
creation of a data warehouse a new dynamic index                   allows us to use a rich apparatus of computational
structure for multidimensional data was proposed [6]               mathematics for data analysis and management.
                                                                         The paper is organized as follows: Concept and
                                                                   formal foundations of the considered approach to data
Proceedings of the XIX International Conference                    integration are presented briefly in Section 2. Canonical
“Data Analytics and Management in Data Intensive                   data model and issues to support the data integration
Domains” (DAMDID/RCDL’2017), Moscow, Russia,
October 10–13, 2017


                                                             206
concept are considered in Section 3. The conclusion is             on the works of the SYNTHESIS group. According to
provided in Section 4.                                             the research of this group, each data model is defined by
                                                                   syntax and semantics of two languages, data definition
2 Brief Discussion on Data Integration Ap-                         language (DDL) and data manipulation language
proach                                                             (DML). They suggested the following principles of syn-
   The basis of our concept to data integration is based           thesis of the canonical model:
on the idea of integrating arbitrary data models. Based            •    Principle of axiomatic extension of data models
on this assumption our concept of data integration as-                 The canonical data model must be extensible. The
sumes:                                                             kernel of the canonical model is fixed. Kernel extension
    • applying extensible canonical model;                         is defined axiomatically. The extension of the canonical
    •    constructing justifiable data models mapping              data model is formed during the consideration of each
         for heterogeneous databases integration;                  new data model by adding new axioms to its DDL to
     • using content dictionaries.                                 define logical data dependencies of the source model in
                                                                   terms of the target model if necessary. The results of the
    Choosing the extensible canonical model as integra-
                                                                   extension should be equivalent to the source data mod-
tion model allows integrating arbitrary data sources. As
                                                                   el.
we allow integration of arbitrary data sources a necessi-
ty to check mapping correctness between data models                •   Principle of commutative mappings of data
arises. It is reached by formalization of data model con-              models
cepts by means of AMN machines [1] and using B-                       The main principle of mapping of an arbitrary re-
technology to prove correctness of these mappings.                 source data model into the target one (the canonical
     The content dictionaries are central to our concept           model) could be reached under the condition that the
of data integration and semantical information of differ-          diagram of DDL (schemas) mapping and the diagram of
ent types can be defined based on these dictionaries.              DML (operators) mapping are commutative.
The concept of content dictionaries allows us to extend                                           semantic
the canonical model by means of introducing new con-                                              function                   DB_CM
                                                                       SCH_CM
cepts in these dictionaries easily. In other words, canon-
ical model extension only is reduced to adding new
concepts and metadata about these concepts in content
dictionaries. Our concept to data integration is oriented
                                                                                    mapping


                                                                                                                             mapping
                                                                                                                             bijective
as virtual and materialized integration of data as well as
to support data cubes with hierarchical dimensions. It is
important that in all cases we use the same data model.
The considered data model is an advanced XML data
model which is a more flexible data model than rela-                   SCH_SM                       semantic                 DB_SM
tional or object-oriented data models. Among XML                                                    function
data models, a distinctive feature of our model is that
we use a computationally complete language for data                Figure 1 DDL mapping diagram
definition. An important feature of our concept is the
                                                                        In Figure 1 we used the following notations:
support of data warehouses on the base of a new dy-
                                                                   SCH_CM: Set of schemas of the canonical data model;
namic indexing scheme for multidimensional data. A
                                                                   SCH_SM:              Set of schemas of the source data model;
new index structure developed by us allows to organize
                                                                   DB_CM: Database of the canonical data model; DB_SM:
effectively OLAP-queries on multidimensional data and
                                                                   Database of the source model.
can be used for efficient data cubes storage in data
warehouses. Finally, the modern trends of the develop-                                        semantic
                                                                       OP_CM                  function         DB_CM                     DB_CM
ment of database systems lead to use of different divi-
sions of mathematics to data analysis. Within of our
concept to data integration, this leads to the use of cor-
                                                                                                               algorithmic


responding content dictionaries of the OPENMath.
                                                                                                               refinement
                                                                          mapping


2.1 Formal Foundations
   The above discussed concept to data integration is
based on the following formalisms:
                                                                        P_SM                    semantic       DB_SM                     DB_SM
       • canonical data model;                                                                  function
       • OPENMath objects;
       • multidimensional indexes;                                 Figure 2 DML mapping diagram
       • domain element calculus.
   Below we will consider these formalisms in detail.                 In Figure 2 we used the following notations: OP_CM:
As we noted, our approach to data integration is based             Set of operators of the canonical data model; P_SM: Set


                                                             207
of procedures in DML of the source model.                         used to assign formal and informal semantics to all
                                                                  symbols used in the OPENMath objects. A content dic-
•    Principle of synthesis of unified canonical data             tionary is a collection of related symbols encoded in
     model                                                        XML format. In other words, each content dictionary
    The canonical data model is synthesized as a union            defines symbols representing a concept from the specif-
of extensions.                                                    ic subject domain.

                                                                             attr


                                                                     book    type   app


                                                                          sequence app      attr


                                                                       OneOrMore attr       title type    string
Figure 3 Canonical data model

2.2 Mathematical Objects Representation                                   author    type string
                                                                  Figure 4 An example of compound object
    The OpenMath is a standard for representation of
the mathematical objects, allowing them to be ex-
                                                                  2.3 Dynamic Indexing Scheme for Multidimensional
changed between computer programs, stored in data-
                                                                  Data
bases, or published on the Web. The considered formal-
ism is oriented to represent semantic information and is              To support the materialized integration of data dur-
not intended to be used directly for presentation. Any            ing the creation of a data warehouse and to apply very
mathematical concept or fact is an example of mathe-              complex OLAP-queries on it a new dynamic index
matical object. The OpenMath objects are such repre-              structure for multidimensional data was developed (see
sentation of mathematical objects which assume an                 more details in [6]). The considered index structure is
XML interpretation.                                               based on the grid file concept. The grid file can be rep-
    Formally, an OpenMath object is a labeled tree                resented as if the space of points is partitioned into an
whose leaves are the basic OpenMath objects. The                  imaginary grid. The grid lines parallel to axis of each
compound objects are defined in terms of binding and              dimension divide the space into stripes. The number of
application of λ-calculus [8]. The type system is built           grid lines in different dimensions may vary, and there
on the basis of types that are defined by themselves and          may be different spacings between adjacent grid lines,
certain recursive rules, whereby the compound types are           even between lines in the same dimension. Intersections
built from simpler types. To build compound types the             of these stripes form cells which hold references to data
following type constructors are used:                             buckets containing records belonging to corresponding
                                                                  space partitions.
• Attribution. If v is a basic object variable and t is a
  typed object, then attribution (v, type t) is a typed               The weaknesses of the grid file formalism concept
  object. It denotes a variable with type t.                      are non-efficient memory usage by groups of cells re-
                                                                  ferring to the same data buckets and the possibility of
• Abstraction. If v is a basic object variable and t, A           having a large number of overflow blocks for each data
  are typed objects, then binding (λ, attribution (v,             buckets. In our approach, we made an attempt to elimi-
  type t), A) is a typed object.                                  nate these defects of the grid file. Firstly, we introduced
• Application. If F and A are typed objects, then ap-             the concept of the chunk: set of cells whose correspond-
    plication (F, A) is a typed object.                           ing records are stored in the same data bucket (repre-
  The OPENMath is implemented as an XML applica-                  sented by single memory cells with one pointer to the
tion. Its syntax is defined by syntactical rules of XML,          corresponding data buckets). Chunking technique is
its grammar is partially defined by its own DTD. Only             used to solve the problem of empty cells in the grid file.
syntactical validity of the OPENMath objects represen-
tation can be provided on the DTD level. To check se-
mantics, in addition to general rules inherited by XML
applications, the considered application defines new
syntactical rules. This is achieved by means of introduc-
tion of content dictionaries. Content dictionaries are


                                                            208
                                                                   Data                 2.4 Element Calculus
                                                                   buckets
            Y
                                                                                            In the frame of our approach to data integration as
                                                                                        integration model we consider an advanced XML data
                                                                                        model. In fact, data model defines the query language
                                                                                        [5]. Based on this, to give declarative queries a new
    w1                                                                                  query language (domain element calculus) [14] was
                                                                                        developed. A query to XML - database is a formula in
    w2                                                                                  element calculus language. To specify formulas a vari-
           u1                                                                           ant of the multisorted first order predicate logic lan-
                                                                              X
                     u2                                                                 guage is used. Notice that element calculus is developed
                           u3                                                           in the style of object calculus [10]. In addition, there is a
                                              v1      v2      v3                        possibility to give queries by means of λ-expressions.
                                                                                        Generally, we can combine the considered variants of
                                                                                        queries.
   Grid                                                Z
   partitions                                                                           3 Extensible Canonical Data Model
Figure 5 An example of 3-dimensional grid file                                              The canonical model kernel is an advanced XML
    Secondly, we consider each stripe as a linear hash                                  data model: a minor extension of the OPENMath to
table which allows increasing the number of buckets                                     support the concept of databases. The main difference
more slowly (for each stripe, the average number of                                     between our XML data model and analogous XML data
overflow blocks of chunks crossed by that stripe is less                                models (in particular, XML Schema) is that the concept
than one). By using this technique we essentially restrict                              of data types in our case is interpreted conventionally
the number of disk operations.                                                          (set of values, set of operations). More details about the
                                                                                        type system of the XML Schema can be found in [3]. A
                      Chunk              Imaginary divisions                            data model concept formalized on the kernel level is
                      s                                                                 referred to as kernel concept.

                                                                                        3.1 Kernel Concepts
 Stripes


                                                                                            In the frame of canonical data model we distinguish
                                                                                        basic and compound concepts. Formally, a kernel con-
                                                                                        cept is a labeled tree whose leaves are basic kernel con-
                                                                                        cepts. Examples of basic kernel concepts are constants,
                                                                                        variables, and symbols (for instance, reserved words).
                                                                                        The compound concepts are defined in terms of binding
                                                                                        and application of λ-calculus. The type system is built
                                                                                        analogously to that in OPENMath.

                          Data                       ...                                3.2 Extension Principle
                          buckets                                    Overflow               As we noted above the canonical data model must
                                                                     blocks             be extensible. The extension of the canonical model is
Figure 6 An example of 2-dimensional modified grid                                      formed during the consideration of each new data mod-
file                                                                                    el by adding new concepts to its DDL to define logical
     We perform comparison of directory size by our                                     data dependencies of the source model in terms of the
approach with two techniques for grid file organization                                 target model if necessary. Thus, the canonical model
proposed in [20]: MDH (multidimensional dynamic                                         extension assumes defining new symbols. The exten-
hashing) and MEH (multidimensional extendible hash-                                     sion result must be equivalent to the source data model.
ing). Directory sizes for both of these techniques are:                                 To apply a symbol on the canonical model level the
                𝟏𝟏                           𝒏𝒏−𝟏𝟏                                      following rule has been proposed:
𝑶𝑶 �𝒓𝒓𝟏𝟏+𝒔𝒔 � and 𝑶𝑶 �𝒓𝒓𝟏𝟏+𝒏𝒏𝒏𝒏−𝟏𝟏 � correspondingly, where r is
                                                                                                Concept        symbol ContextDefinition.
the total number of records, s is the block size and n is
                                                                                        For example, to support the concept of key of relational
the number of dimensions. In our case the directory                                     data model, we have expanded the canonical model
                                                       𝑛𝑛𝑛𝑛
size can be estimated as 𝑂𝑂 � �. Compared to MDH                                        with the symbol key. Let us consider a relational sche-
                             𝑠𝑠
and MEH techniques, the directory size in our approach                                  ma example:
           𝟏𝟏                   𝒏𝒏−𝟏𝟏
      𝒔𝒔𝒓𝒓 𝒔𝒔              𝒔𝒔𝒓𝒓𝒏𝒏𝒏𝒏−𝟏𝟏                                                                S = {S#, Sname, Status, City}.
is      and        times smaller correspondingly. We
     𝒏𝒏       𝒏𝒏                                                                            The equivalent definition of this schema by means
have implemented a data warehouse prototype based on                                    of extended kernel is considered below:
the proposed dynamic indexation scheme and compared                                     attribution (S, type TypeContext, constraint
its performance with MongoDB [26] (see in [17]).                                                     ConstraintContext)


                                                                                  209
TypeContext           application (sequence,                      the kernel attribution concept and has an attribute name.
                      ApplicationContext)                         By means of this concept we can model schemas of
ApplicationContext          attribution (S#, type int),           databases. The value of attribute name is the DB's
                    attribution (Sname, type string),             name. The content of element med is based on the ele-
                    attribution (Status, type int),               ments msch, wrapper, constraint and has an attribute
                    attribution (City, type string))              name. The value of this attribute is the mediator's name.
ConstraintContext         attribution (name, key S#).             The element msch is interpreted analogously to element
    It is essential that we use a computationally com-            dbsch. Only note that this element is used during mod-
plete language to define the context [14]. As a result of         elling schemas of a mediator. The content of elements
such approach, usage of new symbols in the DDL does               wrapper and constraint is based on the kernel applica-
not lead to any changes in the DDL parser.                        tion concept. By means of wrapper element mappings
                                                                  from source models into a canonical model are defined.
3.3 Semantic Level                                                The integrity constraints on the level of mediator are the
                                                                  values of the constraints elements. It is important that
     The canonical model is an XML application. Only              we are using a computationally complete language for
syntactical validity of the canonical model concepts              defining the mappings and integrity constraints. Below,
representation can be provided on the DTD level. To               an example of a mediator for an automobile company
check semantics the considered application defines new            database is adduced [5] which is an instance of a con-
syntactical rules. We define these syntactical rules in           tent dictionary of data integration concept. It is assumed
content dictionaries.                                             that the mediator with schema AutosMed = {SerialNo,
                                                                  Model, Color} is integrate two relational sources: Cars
3.4 Content Dictionaries                                          = {SerialNo, Model, Color} and Autos = {Serial, Mod-
    The content dictionary is the main formalism to de-           el}, Colors = {Serial, Color}.
fine semantical information about concepts of the ca-             <cd name = ‘dic’>
                                                                   <dbsch name = ‘Source1’>
nonical data model. In other words, content dictionaries            <omattr>
are used to assign formal and informal semantics to all              schema definition of Cars
concepts of the canonical data model. A content dic-                </omattr>
                                                                   </dbsch>
tionary is a collection of related symbols, encoded in             <dbsch name = ‘Source2’>
XML format and fixes the “meaning” of concepts inde-                <omattr>
pendently of the application. Three kinds of content                 schema definition of Autos
dictionaries are considered:                                        </omattr>
                                                                    <omattr>
• content dictionaries to define basic concepts (sym-                schema definition of Colors
  bols);                                                            </omattr>
                                                                   </dbsch>
• content dictionaries to define a signature of basic              <med name = ‘Example’>
  concepts (mathematical symbols) to check the se-                  <msch>
                                                                     <omattr>
  mantic validity of their representation;                            AutosMed: schema for mediator is defined
• content dictionary to define a data integration con-               </omattr>
    cept.                                                           </msch>
                                                                    <wrapper>
    Supporting the above considered content dictionar-               <oma>
ies assumes to develop corresponding DTDs. Instances                  <oms name = ‘convert_to_xml’ cd = ‘xml’/>
                                                                      <oma>
of such DTDs are XML documents. An instance of a                       <oms name = ‘union’ cd = ‘db’/>
DTD of a content dictionary of basic concepts is used to               <omv name = ‘Cars’/>
assign formal and informal semantics of those objects.                 <oma>
Finally, an instance of a DTD of a content dictionary of                <oms name = ‘join’ cd = ‘db’/>
                                                                        <omv name = ‘Autos’/>
a signature of basic concepts contains metainformation                  <omv name = ‘Colors’/>
about these concepts, and an instance of a DTD of a                    </oma>
content dictionary of a data integration concept is a                 </oma>
                                                                     </oma>
metadata for integrating databases.
                                                                    </wrapper>
                                                                   </med>
3.5 Data Integration Concept                                      </cd>

    In the frame of our approach to data integration we
                                                                      It is essential that, we use a computationally com-
consider virtual as well as materialized data integration
                                                                  plete language to model the mediator work.
issues within a canonical model. Therefore, we should
formalize the concepts of this subject area such as me-               Data warehouse. As we noted above the considered
diator, data warehouse and data cube. We are model-               approach to support data warehousing is based on the
ling these concepts by means of the following XML                 grid file concept and is interpreted by means of element
elements: dbsch, med, whse and cube.                              whse. This element is defined as kernel application
                                                                  concept and is based on the elements wsch, extractor,
    Mediator. The content of element dbsch is based on
                                                                  grid and has an attribute name. The value of this attrib-


                                                            210
ute is the name of the data warehouse. The element                 described by means of attribute name. Value of attribute
wsch is interpreted in the same way as the element msch            name is the dimension name. The creation of the data
for the mediator. The element extractor is defined as              cube requires generation of the power set (set of all sub-
kernel application concept and is used to extract data             set) of the aggregation attributes. To implement the
from source databases. The element grid is defined as              formal data cube concept in literature the CUBE opera-
kernel application concept and is based on the elements            tor is considered [7]. In addition to the CUBE operator
dim and chunk by which the grid file concept is mod-               in [7] the operator ROLLUP is produced as a special
elled. To model the concept of stripe of a grid file we            variety of the CUBE operator which produces the addi-
introduced an empty element stripe which is described              tional aggregated information only if they aggregate
by means of five attributes: ref_to_chunk, min_val,                over a tail of the sequence of grouping attributes. To
max_val, rec_cnt and chunk_cnt. The values of attrib-              support these operators we introduced cube and rollup
utes ref_to_chunk are pointers to chunks crossed by                symbols correspondingly. In this context, it is assumed
each stripe. By means of min_val (lower boundary) and              that all independent attributes are grouping attributes.
max_val (upper boundary) attributes we define "widths"             For some dimensions there are many degrees of granu-
of the stripes. The values of attributes rec_cnt and               larity that could be chosen for a grouping on that di-
chunk_cnt are the total number of records in a stripe and          mension. When the number of choices for grouping
the number of chunks that are crossed by it correspond-            along each dimension grows, it becomes non-effective
ingly. To model the concept chunk we introduced an                 to store the results of aggregating based on all the sub-
element chunk which is based on the empty element avg              sets of groupings. Thus, it becomes reasonable to intro-
and is described by means of four attributes: id of type           duce materialized views.
ID, qty, ref_to_db and ref_to_chunk. Values of attrib-                                All
utes ref_to_db and ref_to_chunk are pointers to data                                                                 All
blocks and other chunks, correspondingly. Value of
                                                                                                 Years
attribute qty is the number of different points of the
considered chunk for fixed dimension. Element avg is
                                                                                                                    State
described by means of two attributes: value and dim.
Values of value attributes are used during reorganiza-                                          Quarters
tion of the grid file and contain the average coordinates
of points, corresponding to records of the considered                                                                City
chunk, for each dimension. Value of attribute dim is the
name of the corresponding dimension. To model the                        Weeks                  Months
concept of dimension of a grid file we introduced an
element dim which is based on the empty element stripe                                Days                          Dealer
and has a single attribute name: i. e. the dimension
name.                                                              Figure 7 Examples of lattices partitions for time inter-
                                                                   vals and automobile dealers
    Data cube. Materialized integration of data assumes
the creation of data warehouses. Our approach to create                 Materialized views. A materialized view is the result
data warehouses is mainly oriented to support data cu-             of some query which is stored in the database, and
bes. Using data warehousing technologies in OLAP                   which does not contain all aggregated values. To model
applications is very important [5]. Firstly, the data              the materialized view concept we introduce an element
warehouse is a necessary tool to organize and centralize           mview which is interpreted by means of an element
corporate information in order to support OLAP queries             view, and the last is based on the kernel attribution con-
(source data are often distributed in heterogeneous                cept. When implementing the query over hierarchical
sources). Secondly, significant is the fact that OLAP              dimension, a problem to choose an effective material-
queries, which are very complex in nature and involve              ized view arises. In other words, if we have aggregated
large amounts of data, require too much time to perform            values regarding to granularity Months and Quarters
in a traditional transaction processing environment. To            then for aggregation regarding to granularity on Years it
model the data cube concept we introduced an element               will be effective to apply query over materialized view
cube which is interpreted by means of the following                with granularity Quarters. As in [5], we also consider
elements: felement, delement, fcube, rollup, mview and             the lattice (a partially ordered set) as a relevant con-
granularity. In typical OLAP applications, some collec-            struction to formalize the hierarchical dimension. The
tion of data called fact_table which represent events or           lattice nodes correspond to the units of the partitions of
objects of interest are used [5]. Usually, fact_table con-         a dimension. In general, the set of partitions of a dimen-
tains several attributes representing dimensions, and one          sion is a partially ordered set. We say that partition P1 is
or more dependent attributes that represent properties             precedes partition P2, written P1 ≤ P2 if and only if there
for the point as a whole. To model the fact_table con-             is a path from node P1 to node P2. Based on the lattices
cept we introduced an element felement which is based              for each dimension we can define a lattice for all the
on the kernel attribution concept. To model the concept            possible materialized views of a data cube which are
of dimension we introduced an element delement. This               created by means of grouping according to some parti-
element is based on the empty element element which is             tion in each dimension. Let V1 and V2 be views, then V1
                                                                   ≤ V2 if and only if for each dimension of V1 with parti-


                                                             211
tion P1 and analogous dimension of V2 with partition P2           <!ELEMENT wrapper (oma)>
holds P1 ≤ P2. Finally, let V be a view and Q be a query.         <!ELEMENT constraint (oma)>
                                                                  <!ATTLIST med name CDATA #REQUIRED>
We can implement this query over the considered view              <!ELEMENT whse (wsch,extractor,grid)>
if and only if V ≤ Q. To model the concept of hierar-             <!ELEMENT wsch (omattr)>
chical dimension we introduced an element granularity             <!ELEMENT extractor (oma)>
                                                                  <!ATTLIST whse name CDATA #REQUIRED>
which is based on an empty element partition, and the             <!ELEMENT grid (dim+,chunk+)>
latter is described by means of attribute name. The val-          <!ELEMENT dim (stripe)+>
ue of attribute name is the name of the granularity. Be-          <!ELEMENT stripe EMPTY>
low, an example of data cube for an automobile compa-             <!ELEMENT chunk (avg)+>
                                                                  <!ELEMENT avg EMPTY>
ny database is adduced [5] which is an instance of con-           <!ATTLIST dim name CDATA #REQUIRED>
tent dictionary of data integration concept. We consider          <!ATTLIST avg value CDATA #IMPLIED
Sales = {SerialNo, Dealer, Date, Price} as a data cube                          dim CDATA #REQUIRED>
                                                                  <!ATTLIST chunk id ID #REQUIRED
schema. The considered data cube is implemented on                        qty CDATA #REQUIRED
the base of materialized views and is based on three                      ref_to_db CDATA #REQUIRED
dimensions: Auto, Dealer and Date and has one de-                         ref_to_chunk IDREFS #IMPLIED>
pendent attribute: Value Set of partitions of dimension           <!ATTLIST stripe ref_to_chunk IDREFS #IMPLIED
                                                                                   min_val_CDATA #REQUIRED
Date form a partially ordered set. We are using two                                rec_cnt CDATA #REQUIRED
granularity elements to represent this set.                                        max_val_CDATA #REQUIRED
                                                                                   chunk_cnt CDATA #REQUIRED>
<cd name = ‘dic’>                                                 <!ELEMENT cube (felement,delement,mview?,
 ...                                                                              granularity*)>
 <cube name = ‘example’>                                          <!ELEMENT felement (omattr)>
  <felement>                                                      <!ELEMENT delement (element)+>
   <omattr>                                                       <!ELEMENT element EMPTY>
    schema definition of Sales                                    <!ATTLIST element name CDATA #REQUIRED>
   </omattr>                                                      <!ELEMENT mview (view)+>
  </felement>                                                     <!ELEMENT view (omattr)>
  <delement>                                                      <!ELEMENT granularity (partition)+>
   <element name = ‘Auto’/>                                       <!ELEMENT partition EMPTY>
   <element name = ‘Dealer’/>                                     <!ATTLIST view name CDATA #REQUIRED>
   <element name = ‘Date’/>                                       <!ATTLIST granularity name CDATA #REQUIRED>
  </delement>                                                     <!ATTLIST partition name CDATA #REQUIRED>
  <mview>
   <view name = ‘View1’>                                          Figure 8 DTD for formalization of the data integration
    <omattr>                                                      concept
     definition of materialized view Sales1
    </omattr>
   </view>
                                                                  4 Conclusion
   <view name = ‘View2’>
    <omattr>
                                                                        The data integration concept formalization prob-
     definition of materialized view Sales2                       lems were investigated. The outcome of this investiga-
    </omattr>                                                     tion is a definition language of integrable data, which is
   </view>                                                        based on the formalization of the data integration con-
  </mview>
  <granularity name = ‘Date’>                                     cept using a mechanism of the content dictionaries of
   <partition name = ‘days’/>                                     the OPENMath. Supporting the concept of data integra-
   <partition name = ’months’/>                                   tion is achieved by the creation of content dictionaries,
   <partition name = ‘quarters’/>                                 each of which contains formal definitions of concepts
   <partition name = ‘years’/>
  </granularity>                                                  of a specific area of databases.
  <granularity name = ‘Date’>                                           The data integration concept is represented as a set
   <partition name = ‘days’/>
   <partition name = ’weeks’/>
                                                                  of XML DTDs which are based on the OPENMath for-
  </granularity>                                                  malism. By means of such DTDs were formalized the
 </cube>                                                          basic concepts of database theory, metadata about these
</cd>                                                             concepts and the data integration concept. Within our
                                                                  approach to data integration, an integrated schema is
The detailed discussion of the issues connected with
                                                                  represented as an XML document which is an instance
applying the query language to integrated data is be-
                                                                  of an XML DTD of the data integration concept. Thus,
yond the topic of this paper. Below the XML-
                                                                  modelling of the integrated data based on the OPEN-
formalization of data integration concept is presented.
                                                                  Math formalism leads to the creation of the correspond-
                                                                  ing XML DTDs.
<!-- include dtd for extended OPENManth objects -->
                                                                        By means of the developed content dictionary of
<!ELEMENT cd (dbsch|med|whse|cube)*>                              the data integration concept we are modelling the medi-
<!ATTLIST cd name CDATA #REQUIRED>                                ator and the data warehouse concepts. The considered
<!ELEMENT dbsch (omattr)+>
<!ATTLIST dbsch name CDATA #REQUIRED>
                                                                  approach provides virtual and materialized integration
<!ELEMENT med (msch,wrapper,constraint*)>                         of data as well as the possibility to support data cubes
<!ELEMENT msch (omattr)>                                          with hierarchical dimensions. Within our concept of


                                                            212
data cube, the operators CUBE and ROLLUP are im-                       Models. In Proc. of the 16th East European Con-
plemented. If necessary, in data integrated schemas new                ference. LNCS 7503, pp. 223-239 (2012)
super-aggregate operators can be define. We use a com-            [13] Luo, C., Hou, W. C., Wang, C. F., Want H., Yu,
putationally complete language to create schemas of                    X.: Grid File for Efficient Data Cube Storage.
integrated data. Applying the query language to the                    Computers and their Applications, pp. 424-429
integrated data is generated a reduction problem. Sup-                 (2006)
porting the query language over such data requires addi-          [14] Manukyan, M. G.: Extensible Data Model. In
tional investigations.                                                 ADBIS’08, pp. 42-57 (2008)
     Finally, modern trends of the development of data-           [15] Manukyan, M. G., Gevorgyan, G. R.: An Ap-
base systems lead to the application of different divi-                proach to Information Integration Based on the
sions of mathematics to data analysis and management.                  AMN Formalism. In First Workshop on Pro-
In the frame of our approach to data integration, this                 gramming the Semantic Web. Available:
leads to the use of corresponding content dictionaries of              https://web.archive.org/web/20121226215425/http
the OPENMath.                                                          ://www.inf.puc-rio.br/~psw12/program.html,
                                                                       pp. 1-13 (2012)
References
                                                                  [16] Manukyan, M. G.: Canonical Data Model: Con-
 [1] Abrial, J.-R.: The B-Book: Assigning programs to                  struction Principles. In iiWAS’14, pp. 320-329,
     meaning. Cambridge University Press (1996)                        ACM (2014)
 [2] Briukhov, D. O., Vovchenko, A. E., Zakha-                    [17] Manukyan, M. G., Gevorgyan, G. R.: Canonical
     rov, V. N., Zhelenkova, O. P., Kalinichen-                        Data Model for Data Warehouse. In New Trends
     ko, L. A., Martynov, D. O., Skvortsov, N. A.,                     in Databases and Information Systems,
     Stupnikov, S. A.: The Middleware Architecture of                  Communications in Computer and Information
     the Subject Mediators for Problem Solving over a                  Science, 637, pp. 72-79 (2016)
     Set of Integrated Heterogeneous Distributed In-              [18] Nievergelt, J., Hinterberger, H.: The Grid File: An
     formation Resources in the Hybrid Grid-                           Adaptable, Symmetric, Multikey File Structure.
     Infrastructure of Virtual Observatories. Informat-                ACM Transaction on Database Systems, 9 (1),
     ics and Applications, 2 (1), pp. 2-34, (2008)                     pp. 38-71 (1984)
 [3] Date, C. J.: An Introduction to Database Systems.            [19] Papadopoulos, A. N., Manolopoulos, Y., The-
     Addison Wesley, USA (2004)                                        odoridis, Y., Tsoras, V.: Grid File (and family). In
 [4] Drawar, M.: OpenMath: An overview. ACM SIG-                       Encyclopedia of Database Systems, pp. 1279-1282
     SAM Bulletin, 34 (2), (2000)                                      (2009)
 [5] Garcia-Molina, H., Ullman, J., Widom, J.: Data-              [20] Regnier, M.: Analysis of Grid File Algorithms,
     base Systems: The Complete Book. Prentice Hall,                   BIT, 25 (2), pp. 335-358 (1985)
     USA (2009)                                                   [21] Sharma, S., Tim, U. S., Wong, J., Gadia, S., Shar-
 [6] Gevorgyan, G. R., Manukyan, M. G.: Effective                      ma, S.: A Brief Review on Leading Big Data
     Algorithms to Support Grid Files. RAU Bulletin,                   Models. Data Science Journal, (13), pp. 138-157,
     (2), pp. 22-38 (2015)                                             (2014). Doi: http/doi.org/10.2481/dsj.14-041
 [7] Gray, J., Bosworth, A., Layman, A., Pirahesh, H.:            [22] Stupnikov, S. A.: A Varifiable Mapping of a Mul-
     Data Cube: A Relational Aggregation Operator                      tidimensional Array Data Model into an Object
     Generalizing Group-By, Cross-Tab, and Sub-Tab.                    Data Model, Informatics and Applications, 7 (3),
     In ICDE, pp. 152-159 (1996)                                       pp. 22-34 (2013)
 [8] Hindley, J. R., Seldin, J. P.: Introduction to Com-          [23] Stupnikov, S. A, Vovchenko, A.: Combined Vir-
     binators and λ-Calculus. Cambridge University                     tual and Materialized Environment for Integration
     Press (1986)                                                      of Large Heterogeneous Data Collections. In Proc.
 [9] Kalinichenko, L. A.: Methods and Tools for                        of the RCDL 2014. CEUR Workshop Proceedings,
     Equivalent Data Model Mapping Construction. In                    1297, pp. 339-348 (2014)
     EDBT, pp. 92-119, Springer (1990)                            [24] Stupnikov, S. A, Miloslavskaya, N. G., Budz-
[10] Kalinichenko, L. A.: Integration of Heterogeneous                 ko, V.: Unification of Graph Data Models for Het-
     Semistructured Data Models in the Canonical One.                  erogeneous Security Information Resources' Inte-
     In RCDL, pp. 3-15 (1990)                                          gration. In Proc. of the Int. Conf. on Open and Big
[11] Kalinichenko, L. A., Stupnikov, S. A.: Construct-                 Data OBD 2015 (joint with 3rd Int. Conf. on Fu-
     ing of Mappings of Heterogeneous Information                      ture Internet of Things and Cloud, FiCloud 2015).
     Models into the Canonical Models of Integrated                    IEEE 2015, pp. 457-464 (2015)
     Information Systems. In Proc. of the 12th East-              [25] Zakharov, V. N., Kalinichenko, L. A., Sokolov, I. A.,
     European Conference, pp. 106-122 (2008)                           Stupnikov, S. A.: Development of Canonical Infor-
[12] Kalinichenko, L., Stupnikov, S.: Synthesis of the                 mation Models for Integrated Information Systems.
     Canonical Models for Database Integration Pre-                    Informatics and Applications, 1 (2), pp. 15-38 (2007)
     serving Semantics of the Value Inventive Data                [26] MongoDB. https://www.mongodb.org


                                                            213

</pre>