=Paper=
{{Paper
|id=Vol-355/paper-4
|storemode=property
|title=Co-evolution model for data sources and views
|pdfUrl=https://ceur-ws.org/Vol-355/dolnik.pdf
|volume=Vol-355
|dblpUrl=https://dblp.org/rec/conf/syrcodis/DolnikS08
}}
==Co-evolution model for data sources and views==
Co-evolution model for data sources and views ∗
c Alexander Dolnik Sergei Syntylskiy
Saint-Petersburg State University
The Faculty of Mathematics and Mechanics
{alexander.dolnik, ssyntulsky}@gmail.com
Abstract warehouse is proposed in Section 4. Following in Sec-
tion 5 we present this model as applied to our example.
ETL process evolution is investigated below. A We conclude with directions for future work in Section
model-driven approach to templates and ETL 6.
process evolution problem is developed. We
suppose that the ETL process evolution prob- 2 Problem definition
lem is mainly a problem of a low abstraction
level. So the definition of ETL process based on Let us consider an analytical information system used in
a conceptual model is a principal step towards a number of state regulating institutions. It is intended
effective ETL evolution. Our approach seems for automation of document flow between a regulating
to be scalable, robust and simpler in use com- institution and subjects under regulation and analysis of
pared to existing ETL evolution frameworks collected data.
and tools. A natural monopoly regulation is an important eco-
nomical function of the state. The regulation involves
an intensive document flow between a regulating insti-
1 Introduction tution and organizations under control. Today it is a
From day to day new applications of data warehouses highly-automated process based on a set of formalized
are discovered. A data warehouse is integrated, time- templates. The templates are filled by actual data of orga-
varying, subject-oriented, non-volatile collections of nizations, for example, expenses or production volume.
data that is used primarily in organizational decision Then the data from templates are loaded into the regulat-
making. ing institution data warehouse.
But sometimes world is changed by various causes. The basic source of data for the system are templates
Some of these causes are consequence of following fac- filled by organizations under regulation. Templates are
tors: juristical, industrial and technological, financial, forms for data input implemented as MS Excel tables.
analytical and many others. The template is affirmed by the regulating institution and
So we need to support ability for data warehouse then published. The template has following characteris-
model evolution. tics:
Extract-transformation-load (ETL) process connects
data warehouse with external data sources. ETL pro- • list of organization properties included into the tem-
cess and data warehouse are to change when data source plate, organizations under regulation must give val-
schema has changed. Sometimes this changes may ues of this properties;
produce difficulties for developers and administrators. • periodicity of data collection using this template: if
There are a lot of frameworks (for example, [11], rails, the template must be filled repeatedly, organizations
ARKTOS II, Scriptella ), but framework solutions may are obliged to give new data on a regular basis, for
be hard to support and costly to implement. Usually example, yearly or quarterly.
developer divides an ETL process in several stages. A
stage can be implemented using special tools and meth- See an example of a template on Figure 1.
ods. Such decoupling makes the whole process more ro- When the template is published, officers of the ad-
bust. justable organizations are responsible for filling the tem-
We concentrate on an user-friendly and clear design plate. The filled templates are sent back to the regulating
of the ETL process for considered scope of a problem. institution. They are validated automatically and, if nec-
The rest of this paper is organized as follows. In Sec- essary, manually, then or are accepted, else are sent back
tion 2 problem definition and motivation example are for filling again. For each template a table in the reg-
presented. The related works are observed in Section 3. ulating institution database is generated automatically.
A co-evolution model for data sources and views of data Such tables contain a column for each property of the
∗ This work was partially supported by RFBR (grant 07-07- corresponding template. Data of templates are loaded
00268a). into these tables: properties values, template metadata
Proceedings of the Spring Young Researcher’s Colloquium
containing name of organization, region and time period.
on Database and Information Systems, Saint-Petersburg, Rus- After templates data are loaded into a database they are
sia, 2008 used by various applications for viewing and analyzing
Property Value Property Value
1. Lubricants 1. Lubricants
2. Salary 1.1 Fuel
3. Repair 1.2 Diesel
4. Depreciation 1.3 Other
5. Nonproduction expenses: 2. Salary of production
5.1 Credit expenses employees
5.2 Training expenses 3. Social expenses
... 4. Repair expenses
10. Necessary Gross Receipt 5. Depreciation
... 6. Nonproduction expenses:
15. Product 1 6.1 Credit exps
16. Product 2 6.2 Training exps
... 6.3 Salary of
nonproduction exps
Figure 1: Template v1 ...
10. Necessary Gross Receipt
...
of the data, building reports and analytical cubes. The
15. Product 1
templates change rather frequently. This changes con-
16. Product 2 1
cern both structure of collected parameters (usually col-
17. Product 2 2
lected data expands) and names of the parameters already
included in a template. For example, the template shown ...
on Figure 1, can evolve to one shown on Figure 2 – in the
Figure 2: Template v2
new version expenses on lubricants are detailed. The ex-
penses on the salary are separated into salary of employ-
ees who are participating and not participating in man- as rental costs of administrative office, on various
ufacture. The production of kind 2 is divided into two kinds of production. In this article we treat this pro-
subspecies – 2 1 and 2 2. cedure as a black box. This view, probably, must
Besides the structure of a template the period of data not change when template evolves.
collection can change. Changes of template are treated
as a combination of deactivation of old template and The example of transition from one version of template
adding new ones. So the old table in which data were to another has been considered. Three views using tem-
collected using previous version of template does not plates’ property values should be differently processed:
change, and a table for new template is added. It is ob- • new columns should be added to representation
vious, that though the structure of template evolves, ap- ”Organization expenses”,
plications and views based on previous versions of tem-
plates should work in the same way with new versions • calculation of production dynamics will become a
of templates where it is possible. Thus, the problem little more complicated for data of second version
is offering a method of automatic (or semi-automatic) template due to reduction of data collection period
co-evolution of views and templates. Some views con- and splitting of production 2 on subtypes,
structed on the basis of example templates are listed be- • the view, containing the information on the effec-
low: tiveness, does not depend on the evolution of pat-
1. The view ”Organization expenses” contains de- terns.
tailed information on structure of expenses of the In a Figure 3 the schema of a system’s part under con-
organizations in the accounting periods. For this sideration is given. In the figure connections between
view inclusion of the greatest possible set of param- properties, views and templates are shown as arrows.
eters is desirable. Thus the situation in which value Aforementioned example illustrates typical problems
of some parameters for some periods is not known that ETL process developer faces with. Today there is
must be considered. This view should contain all no general solution for this problem. Practice shows that
parameters concerning the expenses even included there is not ”silver bullet”. Developers usually avoid us-
only in one version of the template. ing these frameworks because of their complexity. Often
evolution problems can be solved by using correct and
2. The dynamic of production 2 is in the whole state. well considered conceptual design of schemas and ETL
This view should display value of annual production processes.
2 of the first version of the template and the produc- The main aim of this paper is to develop evolution
tion values sum of kinds 2 1 and 2 2 summed for model for aforementioned case and demonstrate how it
four quarters from second version. can help to construct flexible and stable ETL process.
3. The effectiveness of product 1 in various regions.
Calculation of a product effectiveness of volumes
3 Related work
and expenses is a complex procedure. There are Some methods are developed for ETL processes and data
various ways to divide constant expenses, such warehouse design and modelling (for example [5]). In
Figure 4: Scheme of Common Conceptual Model
Figure 3: Scheme of Conceptual Level. Motivation Ex- • common area conceptual model represents an ETL
ample process model (mapping templates into data ware-
work [5] framework for data description and ETL pro- house).
cess design by using extend of traditional UML is sug- This construction depicted on Figure 4.
gested. Moreover, OMG defines MOF-based standard
Then we describe common area conceptual model
for modelling warehouses (CWM, [2]). Instead of OMG
metadata in natural way using the EER model suggested
modelling methods EER or ontology-based languages
by Bernhard Thalheim in [3, 4, 12]. Schema of the whole
can be used. Approach suggested in [8] divides concep-
system is depicted on figure 3.
tual and logical levels: the conceptual level (ontology
Now we have got common area conceptual model in
defined in OWL SemanticWeb) is used for describing ex-
EER model terms. A lot of drag-and-drop tools exist
tern data sources and data warehouses, the logical level is
for easy (visual) mapping concepts in the common area
used for describing ETL process. Logical level is defined
conceptual model to concepts in templates conceptual
with declarative language LDL++ ([13]). This language
schema. The second task is to describe mathematical
was chosen because of its expressiveness (for example,
conditions and equations for the common area concep-
supported external function call).
tual model. Mathematical conditions and equations in-
In work [10] semi-automatic framework is developed.
clude following elements:
The main idea is to define a number of template blocks
and build a process of them. • Aggregation functions. There are three types of ag-
The main problem of aforementioned frameworks is gregation functions: distributive, algebraic, holistic.
absence of universal approach to connecting conceptual Distributive and algebraic aggregation functions are
level with logical. Another evolution method is sug- investigated in Thalheim’s work in [3, 4]. A struc-
gested in [6]. This method is based on ETL process graph tural recursion is used for their description. Holistic
construction. Each graph node corresponds to trans- functions are under investigation. But this class of
formation model element and is annotated with What- aggegation funsctions is not mentioned in article.
If policies, containing instructions for changing graph
structure. • Group operations. They will be described in section
However, this approach is difficult to implement, be- 4.1).
cause of enormous amount of additional information re-
quired in nontrivial cases. • Time constraints. They will be described in section
In article [1] the authors propose technical solution of 4.2.
problem in question applied to analytical cubes. How-
ever it does not cover the whole ETL process evolution. • Transformations. Properties of transformation func-
It is possible to describe evolution methods based on tions can be found in work [4]. Details explanation
axioms [7, 9]. Actually axiomatic approach can be used is not included into this article.
to formalize a part of the problem under consideration,
but fails to solve it as a whole. The third question is to map concepts in the common
area conceptual model to the data warehouse conceptual
schema. We developed high-level view declaration lan-
4 Common conceptual model guage for this purpose (will be described in section 4.1).
In this section we suggest model of views and templates This language must be stable to changes in templates
co-evolution. First of all we divide each ETL process set. Also it must be flexible for describing transformation
into three levels: with wide diapason of different templates. We suppose
engine for looking through templates set and generating
• templates conceptual schema represents a set of in- SQL code for templates based on our view declaration.
put data sources schemas;
4.1 View declaration
• data warehouse conceptual schema represents a data
warehouse schema that necessary for data mining or Our model has the following basic and extended mod-
for support decision making applications; elling constructs:
1. Set of entities, actually in our case it is enough to generate Mapping ::= properties properties,
consider only one entity – Organization. [hierarchical type], [order by], [where]
hierarchical type ::= ((level[modif iers2 ])∗ , scope)+
2. Set of simple properties, each property has name
time ::= (period[modif iers2 ])∗ timeset
and domain.
properties ::=
3. Groups, group is a complex property that contains (property[modif iers1 ][{properties}])∗
another groups or simple properties. All properties modifiers1 ::= ([show][sumignore , sumundef ined ])
contained in a group (immediately or transitively) modifiers2 ::= mandatory
must have the same type.
Time can be represented as a hierarchical type. A
4. Constraints. Constraints can be defined for each generate Mapping is parsed to SQL code by finding
group or property. A corresponding logical oper- appropriate templates, aggregation, sorting and filtering.
ator can be defined for each type. A set of logical Detail parser description is omitted in this work.
formulas using this operator can define the integrity At the section 5 our example will be described in de-
constraints which are valid for each instance of the tail using aforementioned BNF.
type. The next section we will discuss temporal aspects of
our model.
5. Operations, defined for each type.
4.2 Temporal properties for templates and views
Typical (in relational databases) views are created by
execution some query like: Suggested procedure of views change or creation con-
tains the following steps:
create view name (projection variables) as
select projection expression 1. The User selects parameters of the Organization
from database sub-schema from the list of available parameters, including both
where selection condition directly collected, and calculated;
group by expression for grouping
2. Using given list of parameters and the list of pos-
having selection among groups
sible periods of data collection (year, quarter or
order by order within the view;
month) program constructs pairs of time sets and
But this is not convenient for our purposes. We need periods such as all necessary parameters are col-
to develop new view based on our model. Following ex- lected with this period within given time set;
tensions of EER model must be taken into account: data 3. The User selects the period and the time interval,
temporality, hierarchical types, schema modularity, cal- being a subset of the set constructed for the period
culated values. These extensions will be described below on the previous step;
in details. Generally we have auxiliary schema A. See
[12] for details of auxiliary schema construction method. 4. Program automatically creates or updates view, in-
New view is defined on top of an EER schema by cluding data on parameters chosen on step 1 with
the period and the restrictions set by the user on step
• a schema V = {S1 ; . . . ; Sm }, where Si is a target
3.
type,
In the following part of the current section we intro-
• an auxiliary schema A mentioned above and
duce method allowing to construct pairs of periods and
• a query q : D × A → V, where D is a given temporal constraints using temporal annotations of tem-
database. plates. These annotations are automatically created as a
result of start and stop of data collection. This method
Generalized view schema suggested by Thalheim is implements step 2 of the procedure. For each template
depicted below: it is possible to define a set of pairs, containing period
generate Mapping : Vars −→ output structure of data and time when data were being collected with
from database types this period. The period of data collection is a year, a
where selection condition quarter or a month. The time when data were being col-
represent using general presentation style lected with given period can be represented as union of
& Abstraction (Modularity, measure, precision) several intervals. Thus borders of intervals should repre-
& Orders within the presentation & Points of view sent points, multiple to the period of data collection. For
& Hierarchical representations & Separation example, if data are collected yearly, the interval should
& Temporality begin and end on the first of January.
browsing definition condition & Navigation It must be noticed, that as the considered periods of
functions Search functions & Export functions data collection form a hierarchy. It is possible to treat
& Input functions & Session functions time intervals as disjunctive. For example, some time
& Marking functions the template data are being collected simultaneously with
the periods equal to month and to quarter. If suddenly it
We simplified this generalized view schema for our appears quarterly collections can be ignored. However
needs. And special language for construction views has such situation is hardly probable in practice.
been developed. It has the following Backus-Naur nota- We use following notation: Hold(t, p) means time set
tion (BNF): when data of template t is collected with period p. So,
because periods are hierarchically ordered, if interval p Consequently using mentioned above tech-
is shorter than q then statement Hold(t, p) ⊂ Hold(t, q) niques(group definition, aggregation function dec-
holds. V ars(x) means set of properties, used by x, laration, hierarchical types and temporal properties
where x is a template or view. So the time set of view definitions) it is possible to achieve view schema con-
v for period p can be calculated using the following for- struction and data evolutions with minimal efforts from
mula: developers and database administrators.
Now we can describe extended view as it was pro-
Hold(v, p) = ∧u∈V ars(v) ∨{t|u∈V ars(t)} Hold(t, p). posed in Section 4.1 with our auxiliary schema A.
Thus, a method of calculating temporal characteristics 5 Real example
of view on the basis of information about templates data
collection is introduced. We defined two types of summarize functions:
sumundef ined , sumignore . The sumundef ined function
will be ”undefined” if at least one parameter is equal
4.3 Constraints and hierarchical types
to ”null” or ”undefined”. The sumignore function will
As it was mentioned above hierarchical types and summarize values, but ”null” and ”undefined” values
schema modularity (groups) must be constructed. Firstly are ignored. In view declaration we explain how to use
formal model need to be determined on subject domain. groups by binding them with summarize functions. We
Then using this formal model we will be able to check need to declare data hierarchy:
correctness of our construction.
Thereto, following hierarchical data types are intro- date = {partitionquarter , partitionyear },
duced.
It can be defined on base types, but with partitionquarter ⊆ partitionyear
the following extensions. Base type B = For example, there are four organizations: ORG1,
(Dom(B), Op(B), P red(B), Υ) is extended with ORG2, ORG3, ORG4. Each organization uses its own
predicate set P red(B) and constraint set Υ. Predicates template. The first organization ORG1 collects data of
P red(B) define a number of equivalence relations eq on SocExp and RepExp expenses every quarter. The second
domain Dom(B). Each of these equivalence relations organization ORG2 collects data of SocExp and RepExp
define a partition Πeq of the domain into equivalence expenses every year. The third organization ORG3 col-
classes. For each equivalence class c of partition Πeq we lects data of RepExp expenses only but every quarter.
introduce a name nc . This partition with named classes The fourth organization ORG4 does not collect any ex-
can be denoted by Π∗ . penses group data.
There are two trivial named partitions that only relate Senior analyst want to obtain detail summary of orga-
elements to themselves is denoted by ⊥∗ and that con- nizations expenses in 2001-2002 years.
sists of {Dom(B)} is denoted >∗ . Using our framework he has to write the query like:
Equivalence relations and partitions may be ordered.
The canonical order of partitions on Dom(B) relates two generate Mapping ”View Organization expenses” :
partitions Π∗ , Π0∗ . We define Π∗ Π0∗ if and only Expenses group(show sumignore as SumE),
if for all (c, nc ) from Π∗ there exists one and only one date (Quarter, Year mandatory), [2001 :
Org,
element (c0 , nc0 ) from Π0∗ such that c ⊆ c0 . 2002]
If it is necessary, we can also consider non-classical Order by Org, date
orderings such as the majority order choice
m that relates
The following table will be produced:
two named partitions. In our example canonical order is
enough.
For instance, we can define types hierarchy for time Org Year Quarter SocExp RepExp SumE
and volume types. ORG1 2001 1 23 3 26
According group definition we can declare type hier- ORG1 2001 2 24 3 27
archy for whole group. ORG1 2001 3 NULL 7 7
The next step is to add elementary evolution transfor- ORG1 2001 4 23 10 33
mation into group operations: ORG1 2001 ALL 70 23 93
ORG2 2001 ALL 80 13 93
• adding new property/group; ORG3 2001 1 UnDef 5 5
ORG3 2001 2 UnDef 7 7
• deleting useless property/group; ORG3 2001 3 UnDef 6 6
ORG3 2001 4 UnDef 6 6
• transferring property up/down over group hierarchy. ORG3 2001 ALL UnDef 24 24
Before executing operation model constraints are to Upon supposition that before 2002 year organization
be checked. Constraints may be manually added or auto- ORG1 used to gather data every year. In 2002 year or-
matically obtained from other constraints. ganization ORG1 started to collect data every quarter.
For example, we may define constraints for intro- Also in 2003 year organization ORG1 splitted quantity
duced groups operations. According to the work [3] of Prod2 into two parameters: Prod2.1 and Prod2.2.
when aggregation functions are defined for group the Group Product2 group evolved in common
group operations became restricted in natural way. area conceptual model. At first it had contained
only Prod2 parameter. Then Prod2 was split- 6 Conclusion
ted into Prod2.1 and Prod2.2. And equation
We developed a conceptual model for considered data
”sumignore (P rod2.1, P rod2.1) = P rod2” was
warehouse metadata and used this model to define ETL
added into mathematical conditions of common area
process. Definition of ETL process based on conceptual
conceptual model.
model is more abstract than sql-based one, proper ab-
The following query helps to construct summary re-
straction level helped us to keep off many problems both
port for ORG1 in 2001-2003 years:
with schema evolution and consistency maintainency.
generate Mapping ”View Quantity of product 2” : Suggested model is based on EER model developed by
Product2 group(show sumignore as Prod2), Bernhard Thalheim. This method allows to escape ver-
Org, date (Quarter, sioning and damping evolution. A method developed in
Year mandatory), this paper has been applied to data warehouse of natural
[2001 : 2003 years]
Order by date, Org monopoly regulating institution.
The following table will be produced:
References
Org Year Quarter Prod2 Prod2.1 Prod2.2 [1] Jim Gray, Surajit Chaudhuri, Adam Bosworth, An-
ORG1 2001 ALL 80 UnDef UnDef drew Layman, Don Reichart, Murali Venkatrao,
ORG1 2002 1 10 UnDef UnDef Frank Pellow, and Hamid Pirahesh. Data cube: A
ORG1 2002 2 8 UnDef UnDef relational aggregation operator generalizing group-
ORG1 2002 3 15 UnDef UnDef by, cross-tab, and sub-totals. Data Min. Knowl.
ORG1 2002 4 10 UnDef UnDef Discov., 1(1):29–53, 1997.
ORG1 2002 ALL 43 UnDef UnDef
ORG1 2003 1 13 3 10 [2] OMG group. Common Warehouse Meta-
ORG1 2003 2 14 7 7 model(CWM) Specification, 2001), ee = .
ORG1 2003 3 14 8 6
ORG1 2003 4 9 7 2 [3] Hans-J. Lenz and Bernhard Thalheim. Olap
ORG1 2003 ALL 50 25 25 databases and aggregation functions. In SSDBM
... ... ... ... ... ... ’01: Proceedings of the Thirteenth International
Conference on Scientific and Statistical Database
The following example requires to define hierarchy of Management, page 91, Washington, DC, USA,
organization region and black box functions for calcula- 2001. IEEE Computer Society.
tion average effectiveness in a region. Value of the black
box function EffectCalcFunc for a region equals aver- [4] Hans-Joachim Lenz and Bernhard Thalheim. Olap
age value of product cost
expenses fractions for each organization
schemata for correct applications. In TEAA, pages
in the region. 99–113, 2005.
View constructions has became more complex in cal- [5] Sergio Lujan-Mora, Panos Vassiliadis, and Juan
culations but declaration remains clear and short. Trujillo. Data mapping diagrams for data ware-
generate Mapping ”Effectiveness of organizations in house design with uml.
regions” : [6] George Papastefanatos, Panos Vassiliadis, Alkis
EffectCalcFunc P roduct Cost, Expenses group Simitsis, and Yannis Vassiliou. What-if analysis for
Org(region mandatory), data warehouse evolution. In DaWaK, pages 23–33,
date (Quarter, Year mandatory), [2001 : 2002] 2007.
Order by region, date
[7] Randel J. Peters and M. Tamer Ozsu. An axiomatic
The following table will be produced: model of dynamic schema evolution in objectbase
systems. ACM Trans. Database Syst., 22(1):75–
region Year Quarter EffectCalcFunc
114, 1997.
SPb 2001 1 34,23
SPb 2001 2 32,53 [8] Timos K. Sellis and Alkis Simitsis. Etl workflows:
SPb 2001 3 35,67 From formal specification to optimization. In AD-
SPb 2001 4 30,00 BIS, pages 1–11, 2007.
SPb 2001 ALL 33,34
SPb 2002 ALL 31,89 [9] A. Simanovsky. Evolution of schema of xml-
LenObl 2001 1 23,05 documents stored in a relational database. In
LenObl 2001 2 18,17 J Barzdins, editor, Proc. of the Baltic DBIS’2004,
LenObl 2001 3 24,67 volume 672, pages 192–204, Riga, Latvia, June
LenObl 2001 4 23,45 2004. Scientific Papers University of Latvia.
LenObl 2001 ALL 22,13
LenObl 2002 1 25,34 [10] Alkis Simitsis. Mapping conceptual to logical mod-
LenObl 2002 2 24,10 els for etl processes. In DOLAP ’05: Proceedings
LenObl 2002 3 25,07 of the 8th ACM international workshop on Data
LenObl 2002 4 26,08 warehousing and OLAP, pages 67–76, New York,
LenObl 2002 ALL 25,63 NY, USA, 2005. ACM.
[11] Darja Solodovnikova. Data warehouse evolution
framework. In Proc. of the SYRCoDIS’2007,
Moscow, Russia, June 2007.
[12] Bernhard Thalheim. Entity-Relationship Modeling,
Foundations of Database Technology. Splinger,
2000. XII, 627 pp. 160 figs., Hardcover.
[13] Carlo Zaniolo. LDL ++ Tutorial, 1998), ee =
http://www.cs.ucla.edu/ldl/tutorial/ldlcourse.html,.