=Paper=
{{Paper
|id=Vol-3688/paper21
|storemode=property
|title=Formal Data Integration Models Development for Intelligent Electronic Commerce Systems
|pdfUrl=https://ceur-ws.org/Vol-3688/paper21.pdf
|volume=Vol-3688
|authors=Victoria Vysotska,Andrii Berko,Lyubomyr Chyrun,Sofia Chyrun,Olena Havrylyshyn,Oksana Smirnova,Nataliia Sokulska,Olena Sokhatska,Iryna Shakleina
|dblpUrl=https://dblp.org/rec/conf/colins/VysotskaBCCHSSS24
}}
==Formal Data Integration Models Development for Intelligent Electronic Commerce Systems==
Victoria Vysotska1, Andrii Berko1, Lyubomyr Chyrun2, Sofia Chyrun1, Olena
Havrylyshyn3, Oksana Smirnova1, Nataliia Sokulska4, Olena Sokhatska5 and Iryna
Shakleina1
1 Lviv Polytechnic National University, Stepan Bandera Street, 12, Lviv, 79013, Ukraine
2 Ivan Franko National University of Lviv, University Street, 1, Lviv, 79000, Ukraine
3 Ukrainian Academy of Printing, Pidholosko St., 19, Lviv, 79020, Ukraine
4 Hetman Petro Sahaidachnyi National Army Academy, Heroes of Maidan street, 32, Lviv, 79026, Ukraine
5 West Ukrainian National University, Lvivska Street, 11, Ternopil, 46004, Ukraine
Abstract
This paper studies the creation and application of information technology methods and tools for
electronic commerce across various subject areas and applications. It addresses the development of
mathematical models, solution methods and tools for integrating information resources and for
operating intelligent electronic commerce systems with the use of effective intelligent models.
Processes of modelling and designing business analytics tools for processing heterogeneous
information resources based on ontologies are described. To solve the problem, several scientific
tasks were performed. In particular, a classification of intelligent electronic commerce systems
and of means for processing heterogeneous distributed information resources of business analytics
is proposed, and a formal ontology-based model of intelligent electronic commerce systems is
developed, together with its components, a structural model of information resources, and methods
and algorithms for designing intelligent electronic commerce systems based on ontologies and the
integration of information resources.
Keywords
Intelligent system, electronic commerce, Data Integration, system model, process model
1. Introduction
Data integration has a fairly wide scope of practical application, in particular in areas such as
the construction of DS of various types and purposes, the development of corporate
management systems, information Web systems, electronic business systems, computer
monitoring, etc. The information resources of such systems provide for the simultaneous use of a
significant number of various forms, structures, content, methods of presentation and application
of data [1-2]. The purpose of developing the method of multi-level data integration is to build and
justify a single generalized approach to solving the given task and determining ways of
implementation that will ensure its interoperability and invariance to the nature, content,
specificity, and order of application of the integrated data. This is especially important in
operational integration processes, in which these data properties are often not predetermined
and may change during the integration procedures themselves.
COLINS-2024: 8th International Conference on Computational Linguistics and Intelligent Systems, April 12–13, 2024,
Lviv, Ukraine
victoria.a.vysotska@lpnu.ua (V. Vysotska); andrii.y.berko@lpnu.ua (A. Berko); Lyubomyr.Chyrun@lnu.edu.ua (L.
Chyrun); sofiia.chyrun.sa.2022@lpnu.ua (S. Chyrun); havrylyshynolena@gmail.com (O. Havrylyshyn);
oksana.y.smirnova@lpnu.ua (O. Smirnova); natalya.sokulska@gmail.com (N. Sokulska); o.sokhatska@wunu.edu.ua
(O. Sokhatska); iryna.o.shakleina@lpnu.ua (I. Shakleina)
0000-0001-6417-3689 (V. Vysotska); 0000-0001-8653-1520 (A. Berko); 0000-0003-3140-3788 0000-0002-9448-1751
(L. Chyrun); 0000-0002-2829-0164 (S. Chyrun); 0000-0001-7181-4421 (O. Havrylyshyn); 0000-0002-1314-0489
(O. Smirnova); 0000-0002-3425-5517 (N. Sokulska); 0000-0002-6535-549X (O. Sokhatska); 0000-0003-0809-1480
(I. Shakleina)
© 2024 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
The basis for solving the
problems of this section is the formal presentation of data as a system, the syntax, structure and
semantics of which elements are described with the help of special tools suitable for software
perception and processing. The main task of data integration is the formation of a complete and
consistent output set based on a set of disparate input data obtained from various sources. To
achieve this goal, the syntax, structure and semantics of the input data must be combined into a
single coordinated whole [3-4]. In solving this kind of problem, several difficulties arise, which
manifest themselves as conflicts and contradictions caused by inconsistencies in the local input
data [5-6]. At the level of data
syntax integration, the following contradictions arise [1-2]: ambiguity or contradiction of
alphabets, mismatch of data types and formats, and mismatch of syntactic constraints. At the level
of integration of data structures, the following are typical contradictions: inconsistency in
methods of defining data units, contradictions in the types and methods of building connections,
and a variety of ways to organize data [7]. The semantic component of the integration process is
one of the most important and complex, since the problems of syntax and structure, in general,
are solved at the technical and technological levels. Formation of an agreed interpretation of
integrated data is impossible without human participation, as well as the application of methods
and means of intelligent data processing. At the level of integration of semantics, conflict
situations [7] arise as a result of the following factors:
contradictions in the definition of concepts,
ambiguity or different readings of names,
use of incompatible metrics when forming data values,
contradictions in defining relationships between data,
contradictions of limitations and axioms of data interpretation,
ambiguous interpretation of data values.
Eliminating the listed contradictions and conflicts between input data is one of the tasks of the
data integration method. The multi-level data integration method is based on the multi-level data
model developed in the previous section and involves the decomposition of the overall process
into sub-processes of integrating data values, syntax, structure and semantics. A key element of
this approach is that integration can be carried out at the level of data meta-schemas, which
reduces the number of accesses to the data itself, whose volume can be significant. Integration
procedures are thereby transferred to the meta-level, where they operate on a formalized
description of the data rather than on the data. Similar principles of replacing operations on
information resources with operations on the metadata that specify them are used in the concept
of the "semantic web", which is part of the general concept of Web 2.0, data spaces [8] and DS of
the second generation [9].
The purpose of the method of multi-level data integration is to determine the principles,
composition and content of actions for the formation of the information resource of open
information systems and the order of their implementation. Since the object of application of the
method is information resources, it is advisable to organize the process of its development
according to a set of requirements [9-10], which are applied to the design processes of
information systems. The most suitable choice is the popular FURPS+ requirements model [11],
defined according to the RUP (Rational Unified Process) specifications and the IEEE Std
1233a-1998 [12] and IEEE Std 610.12-1990 [13] standards, which today are typical in the field of
creating information systems and their components. Such a model provides for the formulation
of basic and additional requirements for the final result of the development process. According
to the chosen approach, the main requirements that the method of multi-level data integration
must meet are formulated as follows [1-2].
1. Functionality is compliance of the functionality of the multi-level data integration method
with the requirements and needs of users of the final result.
2. Usability is the possibility of applying the method for implementation using an open
information system.
3. Reliability is the ability of the method to provide the appropriate level of quality indicators
of the results under the specified conditions during the time of its application.
4. Performance is the ratio between the level of costs for the implementation of the actions
provided for within the method and the weight of the results obtained.
5. Supportability is the ability of the method to be applied in all situations and conditions in
which the means of the corresponding information system function.
In addition to the set of basic requirements, the FURPS+ model provides for the formulation
of additional requirements, which, unlike the basic set of FURPS requirements, are not unified
and are formulated to reflect the specifics of the area and subject of the application. For the
method of multi-level data integration, focused on open information systems, additional
requirements are as follows [1-2].
6. Portability is the ability to move the tools that implement this method from one
application environment to another without rebuilding them.
7. Interoperability is the ability to jointly apply the method and means that implement it with
other methods and means of forming IRP (information resource processing) of open systems.
8. Unification is the use of typical concepts, objects and tools, and the formation of results
in accordance with the uniform requirements of IS (intelligent system).
The fulfilment of such a set of requirements aims to ensure the appropriate level of quality of
the method being developed and advantages over other known methods of data integration.
2. Related works
The problem of data syntax integration (syntactic integration) is fundamental to the
integration of other components of their general description. Solving the problems of building a
generalized structure and semantics of data is possible only based on a single agreed system of
notation. The concept of data syntax itself is complex and takes into account various aspects of its
representation in documents, DB, DS data repositories, etc. [14]. Taking this into account, the data
syntax is presented as a combination of three components G=⟨A, T, R⟩, where A is an alphabet, T
is a set of data types, and R is a set of syntactic restrictions [1-2, 15-18]. An alphabet defines a set
of symbols that are used to represent data values in a defined environment. As a rule, the alphabet
consists of letters, numbers, and special and service symbols. However, the definition of the
alphabet is influenced, in particular, by such factors as the localization of the data processing
environment to the language of the users, the nature of the tasks for which the data are used, the
peculiarities of the processes of their storage, transmission and processing, the specifics of
interpretation and the application of various data values. Along with traditional means of denoting
data, modern systems widely use graphics, sound, multimedia and other elements for their
display and processing, as well as data of complex and composite types, streaming and active data,
which creates additional difficulties in producing a single, consistent presentation of data [1-2].
The concept of data type is defined as the result of the classification of values according to the
methods of representation and processing [15-18]. Today, along with such classic types as
numerical, symbolic, logical, date-time, etc., specific types of data are widely used, which reflect
the peculiarities of their content, processing and application. These are, in particular, scalar
types such as "hyperlink", "currency", "object" and "locator", complex (aggregate) types such as
"array", "record", "set" and "XML document", object types, and user-defined data types. Such a
variety of data types, on the one hand, creates additional opportunities for the representation and
processing of information resources; on the other hand, it complicates the means of supporting
the data storage environment and the procedures for their joint application, transformation and
unification. Constraints, as an element of data syntax, are used to unify forms of data presentation
and to create values adequate to the concepts and values they represent. Syntax restrictions are set
in the form of quantitative indicators, dimensions, formats, templates, rules for forming values,
defining a subset of permissible characters, etc. Such restrictions can be defined both at the IS/IT
(information technology) level of data support and at the user level. Therefore, it is advisable to
decompose the data syntax integration problem into the problems of alphabet integration, type
integration, and constraint integration. The relationship between these tasks and the results of
their execution is shown in Fig. 1.
[Figure 1 diagram. Sub-processes and their results: integration of values → integrated set of values;
integration of the alphabet → integrated alphabet; integration of types → integrated set of types;
integration of constraints → integrated set of constraints; integration of data syntax → integrated dataset.]
Figure 1: Schematic of the data syntax integration process
According to this scheme, the syntax of the representation of values in the integrated data set, GI,
is presented as a combination of three components GI=⟨AI, TI, RI⟩, where AI=IA(A1, A2, …, AN) is the
alphabet of the integrated data set, formed by integrating the alphabets A1, A2, …, AN of the input
data sets; TI=IT(T1, T2, …, TN) is the set of data types used in the integrated set, obtained by
integrating the data types defined for the input data; RI=IR(R1, R2, …, RN) is the set of constraints
of the integrated data set, formed by integrating the constraints applied to the input data;
IA, IT, IR are the integration operators of alphabets, data types and constraints, respectively.
Each operator describes a mapping: IA maps the set of input alphabets into the global output
alphabet of the integrated data set, IT maps the local input sets of data types into the global
output set of data types, and IR maps the local input sets of syntactic constraints into the global
output set of syntactic constraints of the integrated data set [1-2, 15-18].
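As an illustration only, the three syntax integration operators can be sketched in Python under simplifying assumptions that are not part of the original model: the alphabet and the type set are integrated by plain set union, and the constraint set is reduced to a single maximum-length limit, of which the loosest one is kept. The names Syntax, integrate_alphabets, integrate_types, integrate_constraints and integrate_syntax are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syntax:
    """Data syntax as three components: alphabet, types, constraints."""
    alphabet: frozenset   # A: characters allowed in values
    types: frozenset      # T: names of the data types in use
    max_length: int       # R: reduced here to one illustrative limit


def integrate_alphabets(*alphabets):
    # Assumption: the alphabet operator is modelled as set union.
    return frozenset().union(*alphabets)


def integrate_types(*type_sets):
    # Assumption: the type operator is modelled as set union.
    return frozenset().union(*type_sets)


def integrate_constraints(*limits):
    # Assumption: keep the loosest limit so every input value stays valid.
    return max(limits)


def integrate_syntax(*syntaxes):
    """Combine the syntaxes of several input sets into one integrated syntax."""
    return Syntax(
        alphabet=integrate_alphabets(*(s.alphabet for s in syntaxes)),
        types=integrate_types(*(s.types for s in syntaxes)),
        max_length=integrate_constraints(*(s.max_length for s in syntaxes)),
    )


g1 = Syntax(frozenset("0123456789"), frozenset({"integer"}), 10)
g2 = Syntax(frozenset("abc"), frozenset({"string"}), 32)
gi = integrate_syntax(g1, g2)
```

Real alphabets, type systems and constraints usually conflict rather than simply merge, so in practice each operator would also need conflict resolution rules; the union is only the simplest consistent choice.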
3. Models and methods
3.1. Basic principles of the extended data integration model
Further development of the concept of modelling data integration processes is possible due to
the transition in the formal model from the concept of a scheme as an object of integration to the
concept of a data set. Each data set is a combination of a scheme, as some formalized description
of the composition and structure of data and a set of values (constants) formed according to the
requirements of the scheme. In this way, the formal objects of the model are a set of input (local)
data sets, an output (global) set of integrated data and a mapping that establishes correspondence
between the elements of the input and output sets (Fig. 2a). Formally, such a model is presented
as a triple of the form [1-2]: ⟨DSL, DSI, Map(DSL, DSI)⟩, where DSL={⟨Σi, Di⟩ | i=1,…,N} is a set of
local input data sets; Σi is the data scheme of the i-th input set, written in terms of the input
scheme description language LL; Di is a set of values (constants) formed from the characters of
the input alphabet AL; DSI=⟨ΣI, DI⟩ is the global output set of integrated data; ΣI is the scheme
of the global set of integrated data, written in terms of the output scheme description language
LI; DI is the set of values of the output data set, given by the symbols of the output alphabet AI;
Map(DSL, DSI) is the mapping of local input data into the global output set of integrated data [1-2]. The
fundamental difference between this model and the formal model of M. Lenzerini is the concept
of a global set of integrated data as a result of the integration process. At the same time, this set
can be formed both by moving the values of the input data into the global environment and by
mapping through virtual structures and data elements. In general, the proposed model
corresponds to real integration processes more closely than Lenzerini's formal model. Using
such a model, it is possible to formulate a sufficiently accurate and detailed formal description of
the main typical methods of data integration, such as consolidation, federalization, replication,
hybrid integration and collage [1-2].
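A minimal Python sketch of the extended model is given here for illustration. It assumes that a scheme is an ordered tuple of attribute names, that values are rows laid out according to that scheme, and that the mapping works row by row and matches attributes by name. The names DataSet, integrate and by_name are hypothetical and do not come from the model itself.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """A data set as a pair: a scheme and the values formed according to it."""
    scheme: tuple   # ordered attribute names
    values: list    # rows laid out according to the scheme


def by_name(scheme, row, global_scheme):
    # Illustrative mapping: match attributes by name, leave gaps as None.
    record = dict(zip(scheme, row))
    return tuple(record.get(attr) for attr in global_scheme)


def integrate(local_sets, global_scheme, mapping):
    """Build the global output set from local input sets using a mapping."""
    rows = []
    for data_set in local_sets:
        for row in data_set.values:
            rows.append(mapping(data_set.scheme, row, global_scheme))
    return DataSet(global_scheme, rows)


shop = DataSet(("sku", "price"), [("A1", 10.0)])
stock = DataSet(("sku", "qty"), [("A1", 3)])
ds_i = integrate([shop, stock], ("sku", "price", "qty"), by_name)
```

Here the values are copied into the global set; a virtual mapping, which the model also allows, would keep only the mapping and resolve rows on access.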
[Figure 2 diagram. (a) Data sources 1…N with data schemes 1…N are mapped to a global scheme of integrated
data and an integrated dataset. (b) Input sets ⟨Σi, Di⟩, i=1,…,N, pass through ETL into intermediate sets
⟨Σ*i, D*i⟩, which form the integrated set ⟨ΣI, DI⟩.]
Figure 2: Data integration according to the improved model (a) and data consolidation (b)
3.2. Modelling the data consolidation process
A feature of the data consolidation method is the application of data extraction, transformation
and loading (ETL) procedures as the basis of the data integration process. The result of
consolidation is a global set of integrated data with its own scheme, which generalizes the
composition and content of the schemes of the input sets. The process of data consolidation
according to the proposed generalized model is shown in Fig. 2b [1-2]. The formal model of the
data consolidation process has the form of a tuple
⟨{⟨Σi, Di⟩ | i=1,…,N}, ETL(⟨Σi, Di⟩, ⟨Σ*i, D*i⟩) | i=1,…,N, ⟨ΣI, DI⟩⟩,
where {⟨Σi, Di⟩ | i=1,…,N} is a set of input data sets, each of which is given by a scheme
Σi and a set of values Di; ⟨ΣI, DI⟩ is the global set of integrated data, with the scheme ΣI and the
set of values DI; ETL(⟨Σi, Di⟩, ⟨Σ*i, D*i⟩) is the mapping of input data sets into the output set by
applying extraction, transformation and loading procedures [1-2]. The key element of such a model
is the mapping that transforms each i-th input set of the form ⟨Σi, Di⟩ into an intermediate data
set of the form ⟨Σ*i, D*i⟩. The data set formed by such a transformation differs from the initial
one primarily in that its composition, scheme and format are built according to the requirements
of the global integrated data storage environment. The next step is to move the intermediate data
set to the global environment and merge its set of values with the set of values of the integrated
data set.
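The consolidation steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each data set is a dictionary with a "scheme" tuple and a list of row dictionaries, the transformation is reduced to renaming attributes into the terms of the global scheme, and the names etl and consolidate as well as the sample data are hypothetical.

```python
def etl(source, rename):
    """Map one input set to an intermediate set built for the global store."""
    extracted = list(source["values"])                      # extract
    scheme = tuple(rename[a] for a in source["scheme"] if a in rename)
    values = [
        {rename[k]: v for k, v in row.items() if k in rename}  # transform
        for row in extracted
    ]
    return {"scheme": scheme, "values": values}


def consolidate(sources, renames):
    """Load every intermediate set into one physically stored global set."""
    warehouse = {"scheme": (), "values": []}
    for source, rename in zip(sources, renames):
        part = etl(source, rename)
        for attr in part["scheme"]:
            # The global scheme generalizes the schemes of the inputs.
            if attr not in warehouse["scheme"]:
                warehouse["scheme"] += (attr,)
        warehouse["values"].extend(part["values"])          # load
    return warehouse


crm = {"scheme": ("id", "name"), "values": [{"id": 1, "name": "Ann"}]}
erp = {"scheme": ("cust_id", "total"), "values": [{"cust_id": 1, "total": 99}]}
renames = [
    {"id": "customer", "name": "name"},
    {"cust_id": "customer", "total": "total"},
]
warehouse = consolidate([crm, erp], renames)
```

The point of the sketch is that the intermediate set already follows the global naming before it is loaded, so the merge step only has to append values.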
3.3. Modelling the data federalization process
The method of data federalization differs in the way of forming a set of integrated values (Fig.
3a) [1-2]. Unlike consolidation, this method involves the formation of an integrated data set as
some virtual image based on a set of local data sets. When accessing the integrated data, the
corresponding image elements are implemented by substituting real values obtained from local
sources. In this way, the integration process is implemented only at the scheme level, using as
values the data placed at the local level. A formal model of data federalization can be represented as
an expression of the type
⟨{⟨Σi, Di⟩ | i=1,…,N}, View(Σi, Σ*i) | i=1,…,N, ⟨ΣI, DI⟩⟩, (1)
where {⟨Σi, Di⟩ | i=1,…,N} is a set of input data sets, each of which is given by a scheme Σi and
a set of values Di; ⟨ΣI, DI⟩ is the global set of integrated data, with the scheme ΣI and a virtual
set of values DI; View(Σi, Σ*i) is the mapping of the scheme of the input data set to the global
scheme of integrated data. The key principle of such a mapping is to describe a subset of the data
of the local input set in terms and composition that meet the requirements of the global scheme,
while the set of values D*i described by the new scheme Σ*i is a subset of the input local set
of values, D*i ⊆ Di. The result of the mapping is a global scheme of integrated data, formed as a
union of the mappings of local schemes [1-2]: ΣI=Σ*1 ∪ Σ*2 ∪ … ∪ Σ*N, where N is the number of
input local data sets. The set of values of the global output set of integrated data is formed as
a union of the projections of local data sets {D*i | i=1,…,N}, built according to the set of
schemes {Σ*i | i=1,…,N}, each of which is formed by the mapping View(Σi, Σ*i):
DI=D*1 ∪ D*2 ∪ … ∪ D*N.
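The virtual character of federalization can be illustrated with a short Python sketch. It assumes that a view is simply a list of attributes projected from a local source and that local rows are dictionaries; the classes View and FederatedSet and the sample data are hypothetical. The integrated set stores no values: they are read from the local sources each time it is accessed.

```python
class View:
    """Exposes selected attributes of a local set without copying its values."""

    def __init__(self, source_rows, attributes):
        self.source_rows = source_rows       # a reference, not a copy
        self.attributes = tuple(attributes)  # the projected local scheme

    def rows(self):
        for row in self.source_rows:
            yield {a: row[a] for a in self.attributes}


class FederatedSet:
    """Global set whose scheme is stored and whose values are virtual."""

    def __init__(self, views):
        self.views = views

    @property
    def scheme(self):
        # The global scheme is the union of the projected local schemes.
        seen = []
        for view in self.views:
            for attr in view.attributes:
                if attr not in seen:
                    seen.append(attr)
        return tuple(seen)

    def values(self):
        # Values are substituted from the local sources at access time.
        for view in self.views:
            yield from view.rows()


orders = [{"sku": "A1", "qty": 2, "note": "gift"}]
prices = [{"sku": "A1", "price": 10.0}]
fed = FederatedSet([View(orders, ["sku", "qty"]), View(prices, ["sku", "price"])])

# A later change in a local source is visible through the federated set.
orders.append({"sku": "B2", "qty": 1, "note": ""})
current = list(fed.values())
```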
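[Figure 3 diagram. (a) Federalization: local sets ⟨Σi, Di⟩ are projected to virtual sets ⟨Σ*i, D*i⟩,
i=1,…,N, which form ⟨ΣI, DI⟩. (b) Replication: local sets ⟨Σi, Di⟩, i=1,…,N, yield replicas ⟨Σ*j, Rj⟩,
j=1,…,M, which form ⟨ΣI, DI⟩.]
Figure 3: Data federalization (a) and data replication (b)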
3.4. Modelling the data replication process
Data integration using the replication method involves the formation of a certain mapping
(projection) of the local input data set according to a given mechanism, similar to the
federalization method. The fundamental difference is that the result of displaying input data is
not a virtual set of values, but some intermediate set of data that has its physical image formed
according to some scheme, as in the case of data consolidation. At the same time, the data set
created in this way, a replica, is not moved to a specially defined storage environment. The
resulting global set of integrated data is formed as a union of the set of replicas. The general
scheme of data integration by the replication method is shown in Fig. 3b [1-2]. The formal model
of the data integration process using the replication method can be described as follows [1-2]:
⟨{⟨Σi, Di⟩ | i=1,…,N}, Replicate(⟨Σi, Di⟩, ⟨Σ*j, Rj⟩) | i=1,…,N, ⟨ΣI, DI⟩⟩,
where {⟨Σi, Di⟩ | i=1,…,N} is a set of input data sets, each of which is given by a scheme Σi and
a set of values Di; N is the number of input local data sets; Replicate(⟨Σi, Di⟩, ⟨Σ*j, Rj⟩) is the
mapping of the input data set that forms a new set of values, a replica, which is a subset of the
set of values of this input set and is formed according to the replica scheme; the replica scheme
is a subset of the global integrated data scheme; j=1,…,M, where M is the number of replicas,
which may differ from the number of data sources, since one or more replicas can be formed from
one input local set; the result of the mapping is a data set of the form ⟨Σ*j, Rj⟩, where Σ*j is
the replica scheme and Rj is its set of values; ⟨ΣI, DI⟩ is the global set of integrated data,
with the scheme ΣI and the set of values DI, where the scheme ΣI is the union of the schemes of
all replicas, ΣI=Σ*1 ∪ Σ*2 ∪ … ∪ Σ*M, and the set of values DI is the union of the sets of replica
values, DI=R1 ∪ R2 ∪ … ∪ RM.
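For comparison with federalization, replication can be sketched in Python as follows. The sketch assumes that a replica is a physical copy of selected attributes and rows of one local source and that rows are dictionaries of simple values; the functions replicate and integrated_set and the sample data are hypothetical. Two replicas are built from a single source, so the number of replicas differs from the number of sources.

```python
def replicate(source_rows, attributes, predicate=lambda row: True):
    """Form a replica: a physical copy of part of one local input set."""
    return {
        "scheme": tuple(attributes),
        "values": [
            {a: row[a] for a in attributes}   # new dict, so values are copied
            for row in source_rows
            if predicate(row)
        ],
    }


def integrated_set(replicas):
    """Global set as the union of replica schemes and replica values."""
    scheme = []
    values = []
    for replica in replicas:
        for attr in replica["scheme"]:
            if attr not in scheme:
                scheme.append(attr)
        values.extend(replica["values"])
    return {"scheme": tuple(scheme), "values": values}


customers = [
    {"id": 1, "name": "Ann", "country": "UA"},
    {"id": 2, "name": "Bob", "country": "PL"},
]
r1 = replicate(customers, ["id", "name"])
r2 = replicate(customers, ["id", "country"], lambda row: row["country"] == "UA")

# Unlike a federated view, a replica does not follow later source changes.
customers[0]["name"] = "Anna"
d_i = integrated_set([r1, r2])
```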
3.5. Modelling the hybrid data integration process
A feature of data integration using the hybrid method (Fig. 4) is the combination of the
possibilities of the three methods described above – consolidation, federalization and replication
– in one process. In this case, the global output set of integrated data is formed as a heterogeneous
entity that combines several segments, each of which is formed based on different methods and
technologies [1-2]. In general, the hybrid integration model can be described by a tuple of the
form ⟨{⟨Σi, Di⟩ | i=1,…,N}, Mapi(Di, DI) | i=1,…,N, ⟨ΣI, DI⟩⟩, where {⟨Σi, Di⟩ | i=1,…,N} is a set
of input data sets, each of which is given by the scheme Σi and a set of values Di; N is the total
number of input local data sets. The set of input local data sets is divided into three subsets
according to the integration methods applied to them: ⟨ΣC, DC⟩ are the input local data sets to
which the data consolidation method is applied; ⟨ΣR, DR⟩ are those to which the data replication
method is applied; ⟨ΣF, DF⟩ are those to which the data federalization method is applied.
Mapi(Di, DI) is the mapping of the input local data set to the global set of integrated data; the
type of mapping differs between data sets, depending on the integration method applied
(consolidation, federalization or replication). ⟨ΣI, DI⟩ is the global set of integrated data,
with the scheme ΣI and the set of values DI, where the scheme ΣI is the union of the schemes
formed by the different integration methods, ΣI=Σ*C ∪ Σ*F ∪ Σ*R, where Σ*C is the data scheme
formed by the consolidation of input local data, Σ*F is the data scheme formed by federalization,
and Σ*R is the data scheme formed by replication. The set of values of the global output set of
integrated data is formed as a union of three segments [1-2]: DI=D*C ∪ D*F ∪ R*, where D*C is the
set of values formed by consolidation of input local data, D*F is the set of values formed by
federalization, and R* is the set of values formed by replication.
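The segmentation of the hybrid model can be illustrated by a small Python sketch. It is deliberately simplified and rests on assumptions: each source is tagged with the method applied to it, consolidated and replicated segments are materialized as copies, and the federated segment is read from its sources at access time. The function hybrid_integrate, the tag strings and the sample data are hypothetical.

```python
def hybrid_integrate(sources):
    """Return a reader over a global set built from three kinds of segments.

    sources: list of (method, rows) pairs, where method is one of
    'consolidate', 'replicate' or 'federate' (illustrative tags).
    """
    segments = {"consolidate": [], "replicate": [], "federate": []}
    for method, rows in sources:
        if method in ("consolidate", "replicate"):
            # Materialized segments: values are physically copied.
            segments[method].extend(dict(row) for row in rows)
        elif method == "federate":
            # Virtual segment: keep a reference to the local source.
            segments[method].append(rows)
        else:
            raise ValueError(f"unknown integration method: {method!r}")

    def read():
        # The global set of values is the union of the three segments.
        yield from segments["consolidate"]
        yield from segments["replicate"]
        for rows in segments["federate"]:
            yield from rows

    return read


catalog = [{"sku": "A1"}]
prices = [{"sku": "A1", "price": 10}]
stock = [{"sku": "A1", "qty": 3}]
read = hybrid_integrate(
    [("consolidate", catalog), ("replicate", prices), ("federate", stock)]
)

# Only the federated segment reflects changes made after integration.
stock.append({"sku": "B2", "qty": 0})
catalog.append({"sku": "C3"})
snapshot = list(read())
```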
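[Figure 4 diagram. Consolidated sources ⟨ΣC, DC⟩ pass through ETL into ⟨Σ*C, D*C⟩; replicated sources
⟨ΣR, DR⟩ pass through replication into ⟨Σ*R, R*⟩; federated sources ⟨ΣF, DF⟩ are mapped to ⟨Σ*F, D*F⟩;
the three segments form ⟨ΣI, DI⟩.]
Figure 4: Hybrid data integration process model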
3.6. Modelling the data collage process
Collage (mashup), as a method of integration, is most often used in Web systems to combine, in a
single presentation, data received from different sources that differ in form, structure and
methods of representation but are united by common content or application. The peculiarity of
collage is the absence of a permanent scheme of integrated data and the dynamic formation of a
set of values with each access to resources of this type. At the same time, the initial data are
combined in various ways, forming, as a result, arbitrarily structured hybrid data. The general
scheme of the data collage process is shown in Fig. 5a [1-2].
[Figure 5 diagram. (a) A mashup server selects parts ⟨Σ*i, D*i⟩ from the sources ⟨Σi, Di⟩, i=1,…,N, and
forms DI=D*1 ∪ D*2 ∪ … ∪ D*N with ΣI=Σ*1 ∪ Σ*2 ∪ … ∪ Σ*N. (b) Levels of integration: values, syntax,
structure, semantics.]
Figure 5: Model of the data collage process (a) and multilevel model of integration (b)
The formal model of the data collage process is described as a tuple:
⟨{⟨Σi, Di⟩ | i=1,…,N}, Mashupi(⟨Σi, Di⟩, ⟨Σ*i, D*i⟩), ⟨ΣI, DI⟩⟩, (2)
where {⟨Σi, Di⟩ | i=1,…,N} is the set of input local data sets, each with a scheme Σi and a set of
values Di; ⟨ΣI, DI⟩ is the global output set of integrated data; Mashupi(⟨Σi, Di⟩, ⟨Σ*i, D*i⟩) is a
mapping that forms a data collage element for further combining parts of the input local data sets
into a single view. In the collage process, some subset of values D*i, described by the scheme
Σ*i, is selected from each input local set. From these parts, by combining and superimposing
different types of data and forming a global scheme ΣI as a combination of the schemes Σ*i, a
single integrated data set ⟨ΣI, DI⟩ is formed for presentation to the user. The difference between
integration by collage and other methods is the absence of physical storage of integration results
and the dynamic formation of a global scheme of integrated data upon user request [1-2].
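The dynamic nature of a collage can be sketched in Python under simple assumptions: each source is a callable that returns its current rows, a selector lists the attributes taken from that source, and both the scheme and the values are rebuilt on every request. The function mashup, the source names and the sample rows are hypothetical.

```python
def mashup(sources, selectors):
    """Build a collage on request; nothing is stored between calls.

    sources:   name -> callable returning the current rows of that source
    selectors: name -> attributes to take from that source
    """
    scheme = []
    values = []
    for name, fetch in sources.items():
        keep = selectors[name]
        for row in fetch():
            # Select the part of each source that goes into the collage.
            values.append({a: row[a] for a in keep if a in row})
        for attr in keep:
            # The global scheme is formed on the fly from the parts.
            if attr not in scheme:
                scheme.append(attr)
    return tuple(scheme), values


sources = {
    "catalog": lambda: [{"sku": "A1", "title": "Lamp", "weight": 2}],
    "reviews": lambda: [{"sku": "A1", "stars": 5}],
}
selectors = {"catalog": ["sku", "title"], "reviews": ["sku", "stars"]}
scheme, values = mashup(sources, selectors)
```

Because the function returns a fresh scheme and value list on every call, two requests with different selectors can yield differently structured results from the same sources.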
3.7. Formal modelling of data integration processes and results
The analysis of the results of modelling data integration processes by various methods using the
extended formal model allows the following conclusions to be drawn [1-2]:
the extended formal model of data integration can be applied to model resource-centric
and schema-centric data integration by methods of consolidation, federalization, replication,
hybrid integration and collage. Therefore, the proposed model is invariant to the methods and
paradigms of data integration, which indicates its universality;
both the data itself in the form of a set of values (constants) and their formalized
description – a scheme – appear in the integration processes. Integration involves performing
a series of isomorphic transformations over the input schemas to form a global output schema
of integrated data and transformations of sets of values of input data to form a set of values of
the output set of integrated data;
in the process of integration, operations of moving, reformatting, selecting, projecting,
combining, superimposing, etc. are performed on the input data. As a result, new sets of data
are created, which differ from the input ones in composition, content, structure, presentation
and methods of application;
the listed features of integration processes are common to various integration methods
and paradigms, which suggests that a single generalized apparatus for describing data
integration processes can be created, independent of integration technologies,
subject area, content, purpose and order of application of integrated data.
The general conclusion regarding the modelling of data integration processes is that as a result
of integration, new data values, new forms and presentation formats, new data structures, new
content and new purpose of data are created [1-2]. Thus, data integration has technical, syntactic,
structural, semantic and pragmatic aspects. Each of these aspects involves the use of its own
methods and means of data description in integration processes, which allows the overall
integration process to be divided into several sub-processes, each implementing one of the
above-mentioned aspects. This is reflected in the generalized model of data integration processes, which
is proposed to be called a multi-level formal model of integration.
3.8. Multilevel data integration model
The results of the analysis of formal models of data integration using various methods show
that in the process of integration, significant transformations of the composition, content and
form of data occur. This means generating, based on input sets, a set of new final data that have
fundamentally different properties. This creates the basis for further development and
improvement of the formal model of the data integration process by introducing into its
composition elements that describe the main properties of the data and the order of their change.
According to the concept of presenting data as a formal system, the data form some formal
language that is used to denote a set of values and concepts from a certain subject area (SA) in
the environment of the information system [1-2]. The basis of the construction of language
structures is a certain
set of symbols - the alphabet. Mandatory and integral properties of data in such a data
representation are their syntax, semantics, and structure. At the same time, syntax is used to
determine the order of presentation of lexical constructions (constants), for the presentation of
real values, and the order of formation of new lexical units based on given ones. Semantics
provides an ordered and unified description of the ways of interpreting data, that is, it connects
them with the actual values that take place in the subject area, forming, due to this, the content of
the data and their pragmatics. With the help of the structure, the order of formation of data units,
their combination and arrangement is described. The structure, in turn, determines not only the
order of presentation and storage of data but also the methods of its processing and application
[1-2]. In the general case, the definition of an arbitrary data set DS forms a system of the form
DS=⟨D, G, S, H⟩, where D is a set of values that represent a set of concepts of some subject area, G
is a formalized representation of the data syntax, S is a formalized description of the data
structure, and H is a formalized representation of the data semantics. In this way, the formal
presentation of a data set as a tuple of the form ⟨D, Σ⟩, where D is a set of values and Σ is a
data scheme, is changed to a tuple of the form ⟨D, ⟨G, S, H⟩⟩, where ⟨G, S, H⟩ is the formal
presentation of the syntax, structure and semantics of the data in this set, which we will
hereafter call its meta-schema. A meta-schema extends the concept of a scheme by supplementing the
description of the structure and constraints of data with a formalized description of their syntax
and semantics. The introduction of the concept of a meta-schema makes it possible to build a much
broader and more detailed description of data properties in integration processes, compared to a
scheme. In general, the process of data integration involves several actions related to the
transformation of data and the formation of new data based on the initial ones. It is considered
as a sequence of actions involving matching, transformation, merging and filtering of data, and
aims to form a final data set DS based on a set of initial sets. It is formally represented by an
expression of the form [1-2]:
DS=I(DS1, DS2, …, DSN), (3)
where I is the data integration operator, DS1, DS2, …, DSN is the set of input initial data sets, and
N is the number of data sets participating in the integration process. In general, such data sets
may contain repeated values, i.e. [1-2]:
D1 ∩ D2 ∩ … ∩ DN ≠ ∅. (4)
Given the data model, which is based on the specification of their syntax, semantics and
structure, DS=⟨D, G, S, H⟩, the formal definition of the integration process can be reduced
to actions on these components, replacing each data set DSi with a detailed description of all
components of its definition as follows [1-2]:
⟨D, G, S, H⟩ = I(⟨Di, Gi, Si, Hi⟩ | i=1,…,N) = (5)
= I(⟨D1, G1, S1, H1⟩, ⟨D2, G2, S2, H2⟩, …, ⟨DN, GN, SN, HN⟩),
where ⟨Di, Gi, Si, Hi⟩, i=1,2, …, N is the detailed formal representation of the ith data set.
In this way, the problem of data integration can be decomposed into separate problems of data
value integration, syntax integration, structure integration, and semantic integration. The general
data integration operator I is presented as a combination I=⟨IV, IG, IS, IH⟩, where IV is the value
integration operator, IG is the syntax integration operator, IS is the data structure integration
operator, and IH is the semantics integration operator. At the same time, the integration process
will be decomposed into corresponding sub-processes, which can be described by a formal
scheme of the form [1-2]:
⟨D, G, S, H⟩ = ⟨IV(D1, D2, …, DN), IG(G1, G2, …, GN), IS(S1, S2, …, SN), IH(H1, H2, …, HN)⟩. (6)
The mutual relationship of these processes and their classification by levels are shown in Fig.
5b. According to such a scheme, each subsequent level of integration is based on the results of the
previous one. Thus, the semantic integration of data is possible only after the integration of their
structure, which, in turn, requires the construction of an integrated syntax that defines the
methods of data representation and the integrated set [1-2]. Representing data in
integration processes as a formal system makes it possible to develop and improve the theoretical
and conceptual foundations of data integration thanks to a higher level of abstraction and the possibility
of creating integration models that do not depend on the nature, content, subject area, methods
and technologies. As a result of the study of formal models of data integration using the methods
of consolidation, federalization, replication, collage, and the hybrid method, it was found that the
basic principles and concepts are common to all methods, which makes it possible to build a
unified approach and method to data integration that will generalize the methods known today.
In the process of integration, not just a mechanical combination of data is performed, but the
formation of new data, which has fundamentally new properties, differs from the input data in
syntax, structure, semantics and the order of application. This makes it possible to distinguish the
processes of integration of data values, their syntax, structure and semantics [1-2]. The model
developed in the way described above defines and substantiates the possibility of creating a
universal method of data integration, which summarizes the capabilities of currently known
approaches, and also creates an opportunity to move the integration processes from the
procedures for processing the actual data and their schemes to the procedures for manipulating
metadata that describe the properties and specifics of the data set that is the object of
integration.
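The decomposition of the integration operator into component operators can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name DataSet, the representation of every component as a frozenset, and the use of plain set union as a stand-in for each of IV, IG, IS and IH are assumptions made only for illustration. In the model itself the levels build on one another, whereas here they are computed independently.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class DataSet:
    """A data set DS = <D, G, S, H> (hypothetical representation)."""
    values: frozenset      # D: data values
    syntax: frozenset      # G: syntax elements
    structure: frozenset   # S: structure elements
    semantics: frozenset   # H: semantic elements


def _union(parts):
    """Stand-in component operator: plain set union (an assumption)."""
    return reduce(frozenset.union, parts, frozenset())


def integrate(datasets):
    """I = <IV, IG, IS, IH>: integrate each component separately."""
    return DataSet(
        values=_union(ds.values for ds in datasets),        # IV
        syntax=_union(ds.syntax for ds in datasets),        # IG
        structure=_union(ds.structure for ds in datasets),  # IS
        semantics=_union(ds.semantics for ds in datasets),  # IH
    )
```

A real component operator would also resolve conflicts between the inputs; the union only shows where each operator plugs in.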
4. Experiments, results and discussion
4.1. Syntactic integration of data
4.1.1. Integration of alphabets
The integration of alphabets at the stage of designing a unified integrated data processing
environment consists in creating a consistent set of symbols for representing values from the
resulting data set, the integrated alphabet AI, such that for each input alphabet Ai,
which is used to represent the values of the input data set Di (i=1,2, …, N), there is a unique mapping
αi: Ai→AI, which matches each symbol σi(Ai) of the input alphabet of the ith data set with a symbol
σ(AI) of the integrated alphabet. The following relationships between input and integrated alphabets
are possible (Fig. 6) [1-2]:
• the input alphabet is a subset of the integrated alphabet and has no intersections with
other input alphabets (A1);
• the input alphabet is a subset of the integrated alphabet and has a non-empty intersection
with another alphabet that is a subset of the integrated alphabet (A5);
• the input alphabet is a subset of the integrated alphabet and has a non-empty intersection
with another alphabet that has a partial intersection with the integrated alphabet (A2);
• the input alphabet has a non-empty intersection with the integrated alphabet and an
alphabet that is a subset of the integrated alphabet (A3);
• the input alphabet is not a subset of the integrated alphabet and has a non-empty
intersection with another input alphabet, which, in turn, has a partial intersection with the
integrated alphabet (A4);
• the input alphabet is not a subset of the integrated alphabet and does not have non-empty
intersections with another input alphabet (A6);
• the input alphabet is not a subset of the integrated alphabet, but at the same time has a
non-empty intersection with other input alphabets (A7, A8).
[Diagram: the integrated alphabet AI and the input alphabets A1–A8]
Figure 6: Diagram of the ratio of input and integrated alphabets
We present the process of building an integrated alphabet as a sequence of solving
interconnected problems according to the following scheme [1-2, 15-18].
1. Let A0 be some initial set of symbols of the integrated alphabet.
2. For each of the input alphabets Ai, i=1,2,…, N, the relation Ai ⊆ A0 is checked. If it is fulfilled,
then all characters of the alphabet Ai used to represent the values of the data set Di are also
acceptable for representing the corresponding values in the integrated data set D. Therefore,
it can be assumed that the input data can be included in the integrated set without changing
the form of their presentation.
3. In this case, the phenomenon of polysemy of symbols is possible. We call symbols
polysemous if they have the same shape but reflect different meanings, for example, the Ukrainian
letter "І", the Latin letter "I" and the Roman numeral "І" (1), or the Latin letters A-F, which are used
to represent both letters and digits in the hexadecimal number system. Such a phenomenon
may cause an ambiguous interpretation of data values and their content in the future. The
problem of polysemous characters has the following solutions:
• banning the use of the same symbols to denote different concepts: this method involves
defining a single image for all symbols that have the same shape; this option is possible
when the symbols are used for formal values
that do not have an additional (phonetic, lexical or substantive) interpretation (for example, the
uniform use of Latin and Cyrillic letters of matching shape in car registration
numbers); in this case, problems are possible when interpreting, reading or pronouncing data
values;
• the use of polysemous symbols without restrictions: each of the symbols that have the
same shape retains its own method of application; in this case, the problem of polysemy of
symbols of the integrated alphabet is not solved in the process of integration, but is transferred
to the level of data application;
• replacement of identically shaped symbols with an alternative image (transliteration); this
transformation eliminates the polysemy of characters without narrowing the
possibilities of data presentation.
4. If the set of characters of the input alphabet is not a subset of the integrated alphabet, i.e. Ai ⊄
A0, then it is divided into two subsets: Ai1 = Ai ∩ A0 and Ai2 = Ai \ A0. The first includes symbols
that are elements of the integrated alphabet; the process of integration in this case is described
above. The set Ai2 includes characters of the input alphabet that are not elements of the current
state of the integrated alphabet A0. In such a situation, character polymorphism is possible. We
consider symbols polymorphic if they differ in shape but are used to
denote the same concepts. For example, upper-case and lower-case letters in words represent the same
sounds, and numerical values can be represented in different number systems, using Arabic or Roman
numerals or letters. The appearance of polymorphic symbols in alphabets is a possible
cause of ambiguous perception and interpretation of data values during their processing. The
following solutions to the problem of processing polymorphic symbols are
possible [1-2, 15-18]:
• replacing polymorphic symbols with homomorphic images, i.e. bringing symbols of
different shapes to a single form, for example, using only uppercase or lowercase letters, or only
Arabic numerals in place of the equivalent Roman numerals;
• parallel application of polymorphic images of synonymous symbols without restrictions,
in which case their interpretation will depend on the context and application of data values;
• creating a custom interpretation and rules for using polymorphic symbols: this path requires
a detailed analysis of their properties, but significantly expands the capabilities
of the integrated alphabet in terms of displaying data, for example, proper names begin with
capital letters, operators or operations are denoted by special symbols, Arabic numerals are
used to represent quantitative values and Roman numerals ordinal ones, etc.
Regarding the set of symbols Ai2 = Ai \ A0, the following options are possible:
• prohibiting the use of symbols from the set Ai2 to represent integrated data;
• expansion of the integrated alphabet by including the set of symbols Ai2 in it,
with the formation of the next version A1 = A0 ∪ Ai2;
• transliteration, i.e. replacing symbols that are not elements of the integrated alphabet with
symbols from the alphabet A0.
5. As a result of iterative repetition of the described sequence of actions, an integrated
alphabet AI=IA(A1, A2, …, AN) is formed, which defines a set of symbols for presenting data
values in an integrated set.
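The iterative scheme above can be sketched in Python as follows. This is an illustrative reading of steps 1-5 under stated assumptions, not the authors' algorithm: the function name integrate_alphabets, the policy parameter with the values "forbid", "extend" and "transliterate", and the translit dictionary are hypothetical names introduced here. Polysemy and polymorphism of symbols are not resolved by this sketch.

```python
def integrate_alphabets(initial, inputs, policy="extend", translit=None):
    """Build an integrated alphabet from A0 and the input alphabets A1..AN.

    policy decides what happens to symbols outside the current alphabet:
    "forbid" drops them, "extend" adds them, "transliterate" maps them
    through `translit` (symbols without a usable mapping are dropped).
    Returns the integrated alphabet and one symbol mapping per input.
    """
    translit = translit or {}
    integrated = set(initial)              # current state A0
    mappings = []
    for alphabet in inputs:
        known = alphabet & integrated      # Ai1 = Ai ∩ A0
        unknown = alphabet - integrated    # Ai2 = Ai \ A0
        alpha = {s: s for s in known}      # representable without change
        if policy == "extend":
            integrated |= unknown          # next version: A0 ∪ Ai2
            alpha.update({s: s for s in unknown})
        elif policy == "transliterate":
            for s in unknown:
                target = translit.get(s)
                if target in integrated:   # only map onto existing symbols
                    alpha[s] = target
        elif policy != "forbid":
            raise ValueError(f"unknown policy: {policy}")
        mappings.append(alpha)
    return integrated, mappings
```

For example, with A0 = {A, B, C} and the "transliterate" policy, a Cyrillic "а" in an input alphabet can be mapped onto the Latin "A" without extending the integrated alphabet.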
4.1.2. Integration of data types
The integration of data types in the construction of the output integrated set consists in the
formation of a set of data types TI, such that for each of the types applied in the input data sets
there is a mapping τ: Ti → TI, which establishes a one-to-one correspondence between the data
types t(Ti) applied in the input set Di (i=1,2, …, N) and the data types t(TI) used in the
integrated data set DI. The mutual relationship of different sets of data types in the process of
integration is shown in Fig. 7. As can be seen from the diagram, similar to the process of
integration of alphabets, input sets of data types may or may not have full or partial intersections
with the integrated, and may or may not have mutual intersections with each other. The process
of forming an integrated set of types involves the sequential execution of such actions [1-2, 15-
18].
1. Let T0={ t1(T0), t2(T0), …, tn(T0)} be some initial set of types that are defined in the integrated
data set forming the initial state of the TI type set.
2. For each set of input data Di (i=1,2, …, N), check the relation Ti ⊆ T0, which determines the
agreement of the types of the input set with the types of the integrated data set DI.
3. Fulfilment of this condition does not guarantee that only types allowed for use in the
integrated set are used to represent the data of the input set Di, since there is a possibility
of type polysemy. We call types polysemous if they have the same designation but differ
in implementation. For example, a value of the "date/time" type can be
represented by both numeric and character values, the "text" type represents character strings
in some applications and notes in others, values of the logical type are represented
as numeric or bit values, etc. Discrepancies of this nature are a potential source of errors
in data processing, ambiguous interpretation and incorrect results. This kind of
inconsistency of data types has, in particular, the following solutions [1-2]:
[Diagram: the integrated set of data types TI and the input sets of types T1–T8]
Figure 7: Ratio of input and integrated data type sets
• replacement of homonymous data types with new ones, which by definition do not
coincide with others;
• reformatting the values of the input data set to the format of the corresponding types
defined in the integrated data set.
4. If the set of data types Ti, which are applied in the input set Di, exceeds the set of data types
of the integrated set D, i.e. Ti ⊄ T0, this indicates the presence in the input set of data
belonging to types that are not valid data types of the original integrated set.
Therefore, the data types of the input set are divided into two subsets:
• a subset of types Ti1 = Ti ∩ T0, which are included in the set of types of the integrated data
set;
• a subset of types Ti2 = Ti \ T0, which are not included in the set of types of the integrated
data set.
For the subset of types Ti1, the type-matching procedure is as described above. In the case
when the data of some input set Di belong to types that are not supported in the original
integrated data set DI, the following variants of further transformations are possible [1-2, 15-18]:
• expansion of the set of data types of the integrated set by supplementing it with a subset
Ti2 of the data types of the set Di, which is implemented by constructing the next version of the
set of permissible data types of the integrated set T1 = T0 ∪ Ti2;
• conversion of data from the format of the types of the set Ti2 to the corresponding types
from the set T0, that is, replacing each data value of the type t(Ti2) ∈ Ti2 with a similar value
represented according to the requirements of the type t(T0) ∈ T0.
5. Another contradiction in the processes of integration of data types is the occurrence of
polymorphism of data types in the integrated set and input sets. We call data types
polymorphic if they differ in form of representation but are identical in interpretation (for
example, the REAL and FLOAT types, or the BOOLEAN and LOGICAL types). In this case, situations may
arise in which data of the same actual type are incompatible with each other when
actions are performed on them, which, in turn, is a potential cause of errors and
contradictions in the data. There are two possible ways to resolve this contradiction [1-2,
15-18]:
• bringing polymorphic types to a single definition by removing
types that duplicate others;
• compatible application of all possible options for defining data types by creating
additional means of maintaining the polymorphism of data types and coordinating them.
The first way is easier to implement, while the second expands the possibilities for describing
and manipulating data in an integrated set.
6. The result of the steps described above is a generalized and agreed list of data types that
are used in determining the units of the integrated set TI =IT(T1, T2, …, TN) [1-2, 15-18].
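A minimal Python sketch of the type-integration steps is given below, under explicit assumptions: data types are represented by their names, polymorphic types are collapsed through a caller-supplied aliases dictionary (for example FLOAT to REAL), and the function name integrate_types and its parameters are hypothetical. Because aliases map several input names onto one integrated name, the mapping here is many-to-one, and type polysemy (same name, different implementation) is not detected.

```python
def integrate_types(initial, inputs, aliases=None, extend=True):
    """Build an integrated type set from T0 and the input type sets T1..TN.

    aliases maps a polymorphic type name to the canonical name kept in
    the integrated set; names without an alias map to themselves. With
    extend=False, unsupported types are reported in `rejected` so that
    the caller can convert the affected values instead.
    """
    aliases = aliases or {}
    integrated = {aliases.get(t, t) for t in initial}
    mappings, rejected = [], []
    for types in inputs:
        tau, missing = {}, set()
        for t in types:
            canonical = aliases.get(t, t)   # collapse polymorphic names
            if canonical in integrated:     # Ti1: already supported
                tau[t] = canonical
            elif extend:                    # Ti2: extend T0 with the type
                integrated.add(canonical)
                tau[t] = canonical
            else:                           # Ti2: left for value conversion
                missing.add(t)
        mappings.append(tau)
        rejected.append(missing)
    return integrated, mappings, rejected
```

The extend flag corresponds to the two variants described for the subset Ti2: expanding the integrated set of types, or leaving the types out so that their values are converted to types from T0.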
4.1.3. Integration of syntactic data constraints
The integration of restrictions, which are used when forming data values of some input sets,
involves the formation of such a set of restrictions RI=(r1(RI), r2(RI), …, rm(RI)) that for each
restriction r(Ri) ∈ Ri, applied to some input data set Di (i=1,2, …, N), there is a one-to-one
correspondence given by the mapping ρ: r(Ri) → r(RI). However, unlike alphabets and data types,
restrictions are not free elements: they are formulated and applied only to specific data types,
categories or values. Therefore, each restriction that is applied to a certain data set Di is defined
as a condition of the form r(Ri, t(Ti), Dji), which is determined by its membership in a set
of restrictions Ri, its binding to a certain data type t(Ti), and its scope, which is some subset of the
data set, Dji ⊆ Di. Therefore, the problem of integrating the set of constraints of the input data sets into
a single set of constraints of the integrated set can be solved only after performing the integration
of the alphabet and data types. The ratio of the sets of input data constraints and the integrated
data set is shown in Fig. 8. The input sets of constraints can have partial intersections with each
other, be subsets of each other, be completely independent, be fully or partially part of the
integrated set formed by their integration, or not have an intersection and not be part of it. The
general sequence of the process of integration of input syntactic constraints, the purpose of which
is to create a single, consistent and complete set of constraints applied to values from the
integrated data set, is presented in the form of a scheme of actions [1-2, 15-18].
1. Let R0=(r1(R0), r2(R0), …, rk(R0)) be the initial set of constraints of some integrated data set D.
2. For each of the sets of restrictions Ri of the input data sets Di (i=1,2,…, N), we check the
condition Ri ⊆ R0.
[Diagram: the integrated set of constraints RI and the input constraint sets R1–R8]
Figure 8: Diagram of the ratio of input and integrated sets of constraints
3. The fulfilment of this condition means that each of the constraints of the input data set
is present in the integrated set. But for the final determination of the possibility of
applying restrictions to integrated data, the following factors are additionally checked
[1-2, 15-18]:
• the presence among the data types of the integrated set of the types for which restrictions are
defined, i.e. for each of the restrictions rj(Ri) ∈ Ri there is tj(Ti) ∈ TI, where tj(Ti) is the data type
to which the restriction is applied and TI is the set of types of the integrated data set;
• the presence among the values of the integrated data set of the values for which
restrictions are defined, that is, for each of the restrictions r(Ri) ∈ Ri, Dji ⊆ DI holds,
where Dji is a subset of values of the data set Di to which the restriction is applied and DI is the
integrated data set.
The fulfilment of these requirements ensures the possibility of applying the restriction to
the data of the integrated set, and the lack of appropriate data types and/or values makes it
impossible to apply this restriction to the integrated data set.
4. If among the set of constraints Ri of some input set Di some are not constraints of the
integrated data set DI, i.e. Ri ⊄ R0, the set of constraints is divided into subsets Ri1 = Ri ∩ R0
and Ri2 = Ri \ R0.
5. The set of restrictions Ri1 is consistent with the set R0 and the process of its integration is
performed as described above, but for the set of restrictions Ri2 the following cases are
possible [1-2, 15-18]:
• restrictions from the set Ri2 are applied to values and/or data types that are not part of
the integrated set;
• restrictions from the set Ri2 are applied to the values and data types included in the
integrated set.
In the first case, each of the restrictions that does not have an object of application can be
removed without loss from the set of restrictions of the integrated set. In the second, the
procedure for matching the additional constraints Ri2 and the set of constraints R0 of the integrated
data set is applied. The reconciliation of these sets of restrictions is achieved by [1-2]:
• exclusion of the constraints Ri2 from further application in the set of constraints of the
integrated data set DI;
• transformation of restrictions included in the set Ri2 by replacing them with restrictions
from the set R0 that are equivalent in content and application, according to the principle that
each of the restrictions r(Ri2) ∈ Ri2 is matched with a restriction r1(R0) ∈ R0, which is defined
for types and values of the integrated data set;
• expansion of the set of constraints R0 of the integrated data set by supplementing it with
elements of the set of constraints Ri2 to form a new version R1 = R0 ∪ Ri2.
6. The result of performing a sequence of actions on the integration of syntactic restrictions
of input data sets is the formation of such a list of data presentation requirements that can
be applied to determine additional properties of data values from the integrated set.
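The applicability check and two of the reconciliation options (removal and expansion) can be sketched in Python. The sketch rests on assumptions that are not part of the source: a restriction is modelled as a named tuple (name, data type, scope), the scope Dji is approximated by the names of the data subsets it covers rather than by actual values, and the function name integrate_constraints is hypothetical. Replacement of a restriction by an equivalent one from R0 is not covered.

```python
from typing import NamedTuple


class Constraint(NamedTuple):
    name: str          # identifier of the restriction r
    data_type: str     # type t the restriction is bound to
    scope: frozenset   # names of the data subsets it applies to


def integrate_constraints(initial, inputs, integrated_types, integrated_scopes):
    """Merge the constraint sets R1..RN into R0, keeping applicable ones."""
    integrated = set(initial)
    dropped = []
    for constraints in inputs:
        for r in constraints - integrated:      # Ri2 = Ri \ R0
            applicable = (r.data_type in integrated_types
                          and r.scope <= integrated_scopes)
            if applicable:
                integrated.add(r)               # expansion: R0 ∪ {r}
            else:
                dropped.append(r)               # no object of application
    return integrated, dropped
```

The sketch presupposes that the integrated types and data subsets are already known, which reflects the requirement that constraints be integrated only after alphabets and data types.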
4.1.4. Procedure and requirements for syntactic data integration
By performing a sequence of actions on the integration of alphabets, data types and syntactic
constraints according to the scheme described above, a complete and consistent set of elements
of the integrated data syntax GI is formed, which is used as a method and means of displaying
integrated data obtained as a result of the processes of data extraction, transformation and
loading in DS, as well as in their dynamic integration in operational systems. At the same time,
the problem of detecting and eliminating contradictions between the local
syntaxes of the input data sets to be integrated is solved. The general order of syntactic integration
is described by the following sequence of steps [1-2].
Step 1. Constructing an integrated alphabet as a complete and consistent set of characters to
represent data values in the original integrated set. Performing this step involves the
implementation of the following procedures.
1. Detection and elimination of contradictions of input alphabets in the process of syntax
integration. This is done according to the following rules.
a. Elimination of character polymorphism: the alphabet of the resulting integrated data
set cannot contain different characters with the same interpretation. Such a rule for
matching the representation of the symbols of the alphabets Ai and Aj is described by
an expression of the form
Alph1(Ai, Aj): ∄ α∈Ai, β∈Aj, α≠β | inti(α) = intj(β), (7)
where Alph1 is the rule identifier; α, β are symbols of the input alphabets Ai and Aj; inti(α),
intj(β) are symbol interpretation functions.
b. Elimination of character polysemy: the output alphabet of the resulting integrated
data set cannot contain the same characters with different interpretations. The rule for
matching the interpretation of the symbols of the alphabets Ai and Aj is described by
an expression of the form
Alph2(Ai, Aj): ∄ α∈Ai∩Aj | inti(α) ≠ intj(α), (8)
where Alph2 is the rule identifier; α is a symbol included simultaneously in the input alphabets
Ai and Aj; inti(α), intj(α) are functions for interpreting symbols in the alphabets Ai and Aj.
2. Construction of the integrated alphabet AI by combining agreed local input alphabets
according to the rules defined in clause 1:
AI = IA(A1, A2, …, AN) = A1 ∪ A2 ∪ … ∪ AN | (9)
Alph1(A1, A2, …, AN) = true ∧ Alph2(A1, A2, …, AN) = true,
where IA is the alphabet integration operator; A1, A2, …, AN is a set of input local alphabets;
Alph1(A1, A2, …, AN), Alph2(A1, A2, …, AN) are the rules for matching alphabets defined in clauses
1. a) and 1. b).
Step 2. Construction of a single, consistent list of data types that are used in the original
integrated set. The process of integrating data types involves the following actions.
1. Identification and elimination of inconsistencies in data typing methods from input sets.
The following rules apply to this.
a. Elimination of polymorphism of types: the data types used in the original integrated
data set cannot contain different types that have the same interpretation. Such a
matching rule for sets of data types Ti and Tj is described by an expression of the form
Type1(Ti, Tj): ∄ t1∈Ti, t2∈Tj, t1≠t2 | inti(t1) = intj(t2), (10)
where Type1 is the rule identifier; t1 and t2 are data types of input resources, which are
included in the sets of types Ti and Tj, respectively; inti(t1), intj(t2) are interpretations of types t1
and t2, respectively, in sets Ti and Tj.
b. Elimination of polysemy of types: among the types used for data typing of
the resulting integrated data set, there cannot be identically defined types with
different interpretations. The rule for matching the interpretation of sets of types Ti
and Tj is described by an expression of the form
Type2(Ti, Tj): ∄ t∈Ti∩Tj | inti(t) ≠ intj(t),
where Type2 is the rule identifier; t is a type included simultaneously in the input local
sets of types Ti and Tj; inti(t), intj(t) are type interpretation functions, respectively, in
the sets Ti and Tj.
2. Construction of a set of types of the original integrated resource TI by harmonizing and
combining local input sets of types according to the rules defined in clause 1:
TI = IT(T1, T2, …, TN) = T1 ∪ T2 ∪ … ∪ TN | (11)
Type1(T1, T2, …, TN) = true ∧ Type2(T1, T2, …, TN) = true,
where IT is the integration operator of input sets of data types; T1, T2, …, TN are sets of input
local data types; Type1(T1, T2, …, TN), Type2(T1, T2, …, TN) are the rules for matching data types
defined in clauses 1. a) and 1. b).
Step 3. Formation of a single consistent set of syntactic constraints by merging and matching
local constraint sets of input datasets.
1. Detection and elimination of contradictions in the syntactic constraint sets of input
datasets. The following rules apply to this.
a. Elimination of polymorphism of constraints: in the set of
syntactic constraints, which are applied in the resulting integrated data set,
there cannot be different constraints that have the same interpretation. Such a
rule for matching the sets of constraints Ri and Rj is described by the expression
Restrict1(Ri, Rj): ∄ r1∈Ri, r2∈Rj, r1≠r2 | inti(r1) = intj(r2), (12)
where Restrict1 is the rule identifier; r1, r2 are syntactic restrictions applied to input resources,
which are included in the sets of restrictions Ri and Rj, respectively; inti(r1), intj(r2) are
interpretations of the syntactic constraints r1 and r2 in the sets Ri and Rj.
b. Elimination of polysemy of constraints: in the set of syntactic constraints,
which are applied to the data of the resulting integrated set, there cannot be
identically defined constraints that have different interpretations. The rule for
matching the interpretation of the sets of constraints Ri and Rj is described by the expression
Restrict2(Ri, Rj): ∄ r∈Ri∩Rj | inti(r) ≠ intj(r), (13)
where Restrict2 is the rule identifier; r is a syntactic restriction that is simultaneously included
in the input local sets of restrictions Ri and Rj; inti(r), intj(r) are functions for interpreting
constraints in the sets Ri and Rj.
2. Construction of a set of syntactic restrictions of the original integrated resource RI by
harmonizing and combining local input sets of constraints according to the rules defined in
clause 1:
RI = IR(R1, R2, …, RN) = R1 ∪ R2 ∪ … ∪ RN | (14)
Restrict1(R1, R2, …, RN) = true ∧ Restrict2(R1, R2, …, RN) = true,
where IR is the syntactic constraint integration operator; R1, R2, …, RN are sets of input local
syntactic constraints; Restrict1(R1, R2, …, RN), Restrict2(R1, R2, …, RN) are the syntactic restriction
matching rules defined, respectively, in clauses 1. a) and 1. b).
Step 4. Constructing an output syntax for representing data in an integrated set. The output
integrated data syntax GI is formed based on the integrated consistent alphabet AI, the integrated
consistent set of data types TI and the integrated output set of syntactic constraints RI:
GI = ⟨AI, TI, RI⟩. (15)
The syntax formed in this way provides a correct, consistent and unambiguous representation
of the data values in the data set, which is created as a result of their integration.
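The two kinds of matching rules used in Steps 1-3 (no different elements with the same interpretation, no identical elements with different interpretations) have the same shape for alphabets, data types and constraints, so they can be sketched once in Python. The sketch assumes that each interpretation function is given as a dictionary from an element to its interpretation; the function names are hypothetical and the code only detects violations, it does not resolve them.

```python
def polymorphic_pairs(int_i, int_j):
    """Pairs of different elements that share one interpretation."""
    return {(a, b)
            for a, ia in int_i.items()
            for b, ib in int_j.items()
            if a != b and ia == ib}


def polysemous_elements(int_i, int_j):
    """Elements present in both sets but interpreted differently."""
    return {e for e in int_i.keys() & int_j.keys() if int_i[e] != int_j[e]}


def consistent(int_i, int_j):
    """True when neither kind of contradiction occurs for the two sets."""
    return (not polymorphic_pairs(int_i, int_j)
            and not polysemous_elements(int_i, int_j))
```

For instance, a Latin alphabet and a Roman-numeral alphabet that both contain "I" are polysemous with respect to each other, while Roman and Arabic numerals form polymorphic pairs such as ("I", "1").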
4.2. Structural data integration
4.2.1. General principles of integration of data structures
The problems of creating and maintaining heterogeneous structures of integrated information
resources, in general, go beyond the functional capabilities of traditional data storage
environments implemented by DB servers and DBMS [15-19]. Today, many other abstractions
and management methods are known, which either confirm their suitability or are removed from
the management environment of integrated heterogeneous content [1-2]. A comprehensive
solution to the problems of management and application of integrated content, which includes
both structured (relational data) and loosely structured data, provides tools that implement
technologies for the joint processing of such resources as structured data, text, spatial, temporal,
visual, multimedia data, procedural data, triggers, streams and data queues, imprecise and fuzzy
data [20-23]. The heterogeneity of input data to be integrated extends to the diversity of their
structures (Fig. 9). Modern IS uses data of various levels and forms of structuring. Along with the
structured data stored in the DB, the information resources of open IS contain so-called non-
relational data, in particular, weakly structured (semi-structured) data, data without a prior
description of the structure (self-structured), stream data, procedural data, etc. [23]. Building a
single agreed description of the structure of disparate data is one of the tasks performed in the
process of their integration. A general description of the structure of the integrated data set is
given as [1-2, 15-18]:
CI = ⟨R, NR1, NR2, …, NRk, JR, JN, JRN⟩, (16)
where CI is a description of the structure of the integrated information resource; R is a
description of the structure of the relational component, which is formed by structured data
presented in the form of database tables; NR1, NR2, …, NRk are descriptions of non-relational
components of various types; JR is a set of connections between relational elements; JN is a set of
connections between non-relational elements; JRN is a set of relations between relational and non-
relational elements.
[Diagram: the relational component R (tables R1, R2, …, Rn) alongside the non-relational component NR (elements NR1, NR2, …, NRm)]
Figure 9: The general structure of the information resource of open IS [15-18]
The integration of data structures (structural integration) is defined as the process of
a coordinated combination of structured (relational) data stored in databases and non-relational
data stored in formats other than DB. The relational component in this sense is the central
element of integration since modern DBs and DBMSs [15-18] provide a sufficiently wide range of
opportunities for joint and coordinated processing of not only structured data, but also
information resources specified by other data methods [1-2].
4.2.2. Models of structural integration
The main problems of DB integration with other types of data and directions and principles of
their solution are defined in [21-23]. Typical approaches to the integration of structured
relational, weakly structured/self-structured data are described by models [1-2, 15-18].
1. Integration of structured data with loosely structured (documentary, textual, spatial,
temporal, visual and multimedia) data. Modern database management systems largely
provide solutions to such tasks through the use of special data types (temporal types of
symbolic and binary objects, generated types, etc.) and the "XML document" data type (Fig.
10). Values of these types are integrated into tables and supplement the list of elementary
values in descriptions of entities and facts. The structure of relational database tables, in
which loosely structured data is stored together with relational data, is described as [1-2,
15-18]: R(A1, A2, …, Ak, X1, X2, …, Xm), where A1, A2, …, Ak are table columns that represent
scalar values of traditional and special types; X1, X2, …, Xm are columns that depict weakly
structured values [15-18].
2. Integration of DB and procedural data. This model integrates the actual data
stored in databases with a set of object data types and the methods that encapsulate
them. In this case, each column of the table (Fig. 11a) can be represented by a pair
(A, M), where A is a column of the table and M is the set of methods associated with this
column. The structure of such a table is described by the expression [1-2, 15-18]:
R((A1, M1), (A2, M2), …, (Ak, Mk)).
[Figure: a record with scalar fields a1, a2, ..., an and XML components XML1, ..., XMLm]
Figure 10: Structure of integrated records with XML components [15-18]
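The first model, R(A1, A2, …, Ak, X1, X2, …, Xm), can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the table name product, its columns and the sample XML are hypothetical, and SQLite has no native "XML document" type, so the weakly structured column is stored as TEXT and parsed on the application side.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical table: id and name play the role of scalar columns A1, A2;
# spec_xml plays the role of a weakly structured column X1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, spec_xml TEXT)"
)
conn.execute(
    "INSERT INTO product VALUES (?, ?, ?)",
    (1, "lamp", "<spec><color>white</color><watts>40</watts></spec>"),
)

# A single query returns the relational and the XML part of the record together.
row = conn.execute(
    "SELECT id, name, spec_xml FROM product WHERE id = 1"
).fetchone()

# The XML component is interpreted only after it has been read from the table.
spec = ET.fromstring(row[2])
color = spec.findtext("color")
watts = int(spec.findtext("watts"))
```

A DBMS with a built-in XML type would let the query itself address elements inside the X columns; the sketch only shows that both kinds of values live in one record.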
3. Integration of triggers and data processing procedures into databases. The use of such
elements implements the concept of an active database. Such a DB stores,
together with the values, a description of the rules and actions that are
performed when the state of the database changes. Tables that include traditional data
and active elements are described by a structure model of the form R(A1, A2, …,
Ak, T1, T2, …, Tm), where A1, A2, …, Ak are table columns that represent ordinary typed values
and T1, T2, …, Tm is a set of triggers that describe the actions associated with changing the
state of the table (Fig. 11b).
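[Figure: a) columns A1, A2, ..., An paired with method sets M1, ..., Mn; b) columns A1, A2, ..., An with attached triggers T1, ..., Tn]
Figure 11: Table structure a) with procedural data and b) with triggers [15-18]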
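The two structures in Fig. 11 can be sketched in Python as follows. This is a hedged illustration, not the method of [15-18]: the column names, the 20% rate in with_vat, the price_log table and the trigger name log_price_change are all hypothetical. Part (a) models the pairs (A, M) as a mapping from a column to its set of methods; part (b) uses an SQLite trigger as an active element T of the table.

```python
import sqlite3

# (a) R((A1, M1), ..., (Ak, Mk)): each column A carries a set of methods M.
columns = {
    "name": {"display": str.title},
    # Hypothetical 20% surcharge, computed on integer cents to stay exact.
    "price_cents": {"with_vat": lambda cents: cents * 120 // 100},
}

def call_method(row, column, method):
    """Apply a method from the column's method set to that column's value."""
    return columns[column][method](row[column])

record = {"name": "desk lamp", "price_cents": 2000}
shown_name = call_method(record, "name", "display")
gross = call_method(record, "price_cents", "with_vat")

# (b) R(A1, ..., Ak, T1, ..., Tm): a trigger fires when the table state changes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price_cents INTEGER);
CREATE TABLE price_log (product_id INTEGER, old_price INTEGER, new_price INTEGER);
CREATE TRIGGER log_price_change
AFTER UPDATE OF price_cents ON product
WHEN OLD.price_cents <> NEW.price_cents
BEGIN
    INSERT INTO price_log VALUES (OLD.id, OLD.price_cents, NEW.price_cents);
END;
""")
conn.execute("INSERT INTO product VALUES (1, 'desk lamp', 2000)")
conn.execute("UPDATE product SET price_cents = 2500 WHERE id = 1")
log_rows = conn.execute("SELECT * FROM price_log").fetchall()
```

In part (b) the application issues only the UPDATE; the log entry is produced by the database itself, which is the behaviour the active database concept describes.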
4. Integration of static data with data streams and queues. Stream data is a set of values that
is not stored on the system media but exists only while this data is being used.
Examples of streaming data include monitoring data, stock exchange information and news
broadcast in a standardized format. A data stream is formed as a result of executing queries
or of forwarding or selecting data. A queue is a special type of data stream in which each unit
is assigned an ordinal position. The structure of the data stream S at a certain time t is
described by an expression of the form [1-2, 15-18]:
$S_t = \langle S(R_t),\ S(X_t),\ S(S_t'),\ S(W_t) \rangle$, (17)
where S(Rt) is the structure of a set of values obtained as a result of selection operations from
relational database tables; S(Xt) is the structure of a set of data obtained as a result of selection
from sets of weakly structured data; S(St`) is the structure of a set of data obtained as a result of
selection from other data streams; S(Wt) is the structure of a set of data obtained from web
resources. The result of integrating the static and stream components is a semi-dynamic structure
(Fig. 12), which combines data stored in databases with data formed as a time-varying
stream [1-2].
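The four components of the stream structure in (17) can be sketched in Python as a simple record type. This is only an illustrative sketch: the class StreamStructure, its field names, the helper as_queue and the sample values are hypothetical and are not taken from [1-2, 15-18].

```python
from dataclasses import dataclass
from itertools import chain
from typing import Any, Iterable, Iterator, Tuple

@dataclass(frozen=True)
class StreamStructure:
    """Snapshot of a data stream at time t, one field per component of (17)."""
    relational: Tuple[Any, ...]         # S(Rt): selected from relational tables
    weakly_structured: Tuple[Any, ...]  # S(Xt): selected from weakly structured data
    other_streams: Tuple[Any, ...]      # S(St`): selected from other data streams
    web: Tuple[Any, ...]                # S(Wt): obtained from web resources

    def units(self) -> Iterator[Any]:
        """Yield the units of all four components as one stream."""
        return chain(
            self.relational, self.weakly_structured, self.other_streams, self.web
        )

def as_queue(stream: Iterable[Any]) -> Iterator[Tuple[int, Any]]:
    """Turn a stream into a queue by giving each unit an ordinal position."""
    return enumerate(stream, start=1)

snapshot = StreamStructure(
    relational=("order#1",),
    weakly_structured=("<note>gift</note>",),
    other_streams=(),
    web=("rate:41.2",),
)
queued = list(as_queue(snapshot.units()))
```

The values exist only in the snapshot object and are not written to storage, which mirrors the description of stream data as existing only while it is being used.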
The structure of the integrated set, which combines relational and stream data CI, is described
as CI=