=Paper=
{{Paper
|id=Vol-19/paper-6
|storemode=property
|title=Analysis and Design of Data Warehouses
|pdfUrl=https://ceur-ws.org/Vol-19/paper5.pdf
|volume=Vol-19
|dblpUrl=https://dblp.org/rec/conf/dmdw/Schouten99
}}
==Analysis and Design of Data Warehouses==
https://ceur-ws.org/Vol-19/paper5.pdf
Analysis and design of data warehouses
Han Schouten
Information Systems Dept.
Technical University - Delft, The Netherlands
han.schouten@is.its.tudelft.nl
Abstract 1 Introduction
With the large-scale introduction of the data
warehouse concept, a new phenomenon has ap- At least two different types of computer applications can
peared in the field of information systems devel- be distinguished in the information management of a
opment. Facts in a data warehouse – as opposed company: on-line transaction processing (OLTP) and on-
to those in an operational database – mainly rep- line analytical processing (OLAP). The first supports the
resent immutable, aggregated or otherwise de- primary processes of the company with ordinary transac-
rived, historical information. The aggregation tion processing systems. The latter concerns management
level and specific layout of management infor- information about and control over primary processes by
mation reports often cannot be specified on be- means of management information systems. [Inmon93]
forehand. Therefore, a data warehouse must be gives several valid reasons for the physical separation of
designed in such a way, that it provides optimal the databases and the applications in these two areas.
support for aggregation on the fly and for navi-
gation through aggregation hierarchies, that it al- A database filled with facts derived and aggregated from
lows easy access to time series and that it enables an ordinary operational database for the sole purpose of
reporting in any desired layout. policy making is called a data warehouse. The facts in a
data warehouse may represent key performance indicators
This publication describes the outline of two
to be used in a balanced score card approach to evaluate
complementary methods for the analysis of data
warehouse relations; one simple and the other ad- the operation of a business. There are sound methods for
vanced. The simple method exploits the knowl- the analysis and design of ordinary transaction processing
edge contained in an ordinary relational schema. systems. A comparable method for the development of
The advanced method is based on the analysis of management information systems remains to be invented.
derivation rules. Subsequently, the design of data Inmon deals with many phenomena related to data ware-
warehouses based on these methods is investi- house design, but leaves the ‘how’ of it completely un-
gated. Special attention has been given to the ac- touched. This article is an attempt to devise a proper way
tuality of data warehouses that contain historical of thinking and working to achieve this goal.
information, to the transitivity of derivations, to
the navigation through aggregation hierarchies When developing a transaction processing system, nor-
via so-called drill paths and to the maintenance of mally a distinction is made between the analysis of the
various aggregation levels within a single data ideas of the people in and around an organisation and the
warehouse relation. design of the way in which the operation of these ideas is
supported with information technology. This distinction
_______________
minimises the complexity of information analysis and
The copyright of this paper belongs to the paper’s authors. Permission to copy guarantees generally applicable and reusable definitions.
without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage.
The information analysis that always precedes the design
of information systems results in definitions for elemen-
Proceedings of the International Workshop on Design and tary concepts, such as facts, constraints, derivations,
Management of Data Warehouses (DMDW'99) norms and prescriptions for desirable human behaviour.
Heidelberg, Germany, 14. - 15. 6. 1999
These concepts represent the whole communicable expe-
rience of people in and around an organisation. These
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.) concepts form the very basis of any conceivable applica-
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/ tion in the form of an information system. The definition
Han Schouten 5-1
of these concepts is totally independent of some specific sary and sufficient for the analysis and formulation of
application. The designer of an information system deter- derivation rules.
mines the eventual form in which the facts and the guid-
The next chapter provides two theories and two methods
ance for the behaviour of human workers will be pre-
for the analysis of data warehouse relations. The first is
sented. The prescriptive qualities of a method for the
our workhorse that produces the bulk of our data ware-
analysis and the design of information systems guarantee
house definitions. The second is our fancy horse and con-
reproducibility, verifiability and accountability.
cerns the analysis of derivation rules in conceptual terms,
Management information – as opposed to operational in- i.e. independent of the shape in which the eventual deri-
formation – is always aggregated and, therefore, derived vation will be presented. Armed with the result of a fully-
information. Updating management information is rarely fledged information analysis, a data warehouse can be
necessary. Redundantly storing facts is, therefore, admis- designed. The design consists in the structure of the data
sible and sometimes – for better performance – even nec- warehouse. We will devote special attention to evolving
essary. Redundancy often concerns the registration of objects in data warehouses, to the transitivity of deriva-
management information at various aggregation levels. tion, to drill paths and to aggregation levels. In the chapter
Well-specified information requirements will often trans- called ‘Discussion’ various potentials and impracticalities
late into moderate levels of redundancy. On the other of the approach presented in this article will be consid-
hand, the automated support for policy development in its ered.
broadest sense often results in considerable, yet necessary
levels of redundancy and extraordinarily large databases. 2 The analysis of derivations
In this sense, the structure of a data warehouse more or
less depends of its intended usage. 2.1 Concepts
The information in a data warehouse can be presented in a Natural language oriented information analysis consists in
countless variety of different reports. Facts of various a set of prescriptions for the decomposition of natural
types can be plotted one against the other. When such language sentences into phrases and the classification and
facts occur within some aggregation hierarchy, policy qualification of these phrases. Analysing sentences such
analysts can pass through this hierarchy from the highest as:
to the lowest level (drill-down). At a given aggregation
level, the analyst can investigate each aggregation class
individually. The data warehouse design should, therefore, (1) the person with identification number 12038
optimally facilitate drill-down. is resident of The Netherlands
(2) the person with identification number 12039
The distinction between analysis and design equally ap- is resident of Belgium
plies to the design of management information systems.
During analysis, more attention than with ordinary infor-
mation systems will have to be devoted to derivation produces the following fully classified and qualified re-
rules. In a data warehouse, each relation represents a col- sult:
lection of derived facts grouped in a particular way. The
analysis consists in the detection of derivable facts and the
applicable derivations. The design consists in grouping ST1 Nationality
SFT1 is resident of
derivable facts into data warehouse relations. The theory
of grouping has been documented in [Bakema94]. In the (1) 12038 The Netherlands
present article, the way of grouping derived facts into da- (2) 12039 Belgium
tawarehouse relations is necessarily somewhat less ortho-
dox. Analysis and design according to the prescriptions ST2 Person
described in this article yield an optimally structured data SFT2
warehouse. What we consider optimal, largely depends of identifies a person
the intended usage: standard report or broad analysis. (3) 12038
(4) 12039
Success in the analysis of management information sys- OFT2 the person with
in charge. A precise and complete prescription for infor-
mation analysis is given in [Nijssen94]. Full treatment of ST3 Country
SFT2
this subject exceeds the scope of this article. Apart from identifies a country
detailing the notions of concept, identity and meaning, we (5) The Netherlands
will not dwell on these matters. These notions are neces- (6) Belgium
OFT3
Han Schouten 5-2
NFT1
NT1 identification
Identification number <…> CT2 Person
ST2 Person
NT2 Country name SFT2
NFT2a the country name <…> identifies a person
NFT2b <…> (3/3) 12038
(4/4) 12039
OFT2 the person with
The prescription that yields this result is documented in
[Nijssen94] and [Schouten94]. Moreover, [Nijssen94]
ST3 Man
provides prescriptions for recognition and modelling of SFT3
various types of constraints and cases of generalisation, identifies a person that is a man
specialisation and synonymy. Although the method may (7/4) 12039
help to recognise and analyse prescriptions for desired OFT3 the man with
human behaviour, it does not provide any prescription for
it. In this article, modelling derivation rules will receive ST4 Woman
full attention. Only if all prescriptions are followed me- SFT4
ticulously, the analysis result will be reproducible and identifies a person that is a woman
may survive the severest scrutiny. (8/3) 12038
OFT4 the woman with
Apparently, a sentence of the type ‘Person’ can be ex-
pressed in two different ways: as a sentence formulation ST5 Person
according to ‘SFT2’ and as an object formulation ac- SFT5
cording to ‘OFT2’. The same observation applies to sen- identifies a person
tences of the type ‘Country’ and the formulations ‘SFT3’ (9/4) 1482.63.152
and ‘OFT3’. Many people wonder if objects are facts and OFT5 the person with
the other way around. Indeed, a sentence and the reference
to an object embedded in a sentence are just different
manifestations of a single concept in someone’s mind. CT3 Country
A human being is capable of communicating part of what ST6 Country
he is aware of. The unit of information in the human con- SFT6
science is called a concept. A concept is indivisible, has identifies a country
(10/5) The Netherlands
an identity of its own and has at least one meaning. By
(11/6) Belgium
necessity, information analysis restricts itself to communi- OFT6
cable concepts and consists in the recognition of the iden-
tity, all meanings and all manifestations of concepts that CT4 Point in time
are worth considering in a given context. Among the ra-
ST7 Now
tional and communicable concepts actually we only find
SFT7
facts and other tenets. A fact can be communicated as an identifies the current point in time
assertion. A tenet can be communicated as a general rule (12/7) 18-03-1999 14:12:35
expressed in some natural language or otherwise. A tenet
ST8 Reference date
always concerns facts of types that have been recognised
SFT8
as relevant to a given context. identifies a point in time
We can expand our experience with the notion that a per- that is a reference date
(13/8) 01-01-1999
son is either a man or a woman and with the notion that (14/9) 01-02-1999
we can also identify every Dutch citizen with a so-called (15/10) 01-03-1999
‘Social Fiscal Number’: the SoFi-number. Moreover, we (16/11) 01-04-1999
know the date of birth and possibly the date of death of a
person. This produces the following result: CT5 Birth
ST9 Birth
SFT9 is born on
CT1 Nationality
(17/12) 12038 10-12-1956
ST1 Nationality (18/13) 12039 03-01-1934
SFT1 is resident of
CT6 Death
(1/1) 12038 Belgium
(2/2) 12039 The Netherlands ST10 Death
Han Schouten 5-3
simple conjunction represents a constraint, a derivation, a
SFT10 has deceased on norm, or a prescription of desired human behaviour.
(19/14) 12039
We may expand our consciousness, i.e. conceive new
CT7 Senior population size concepts, by personal observation, by reasoning or by
ST11 Senior population size communication with other persons. One person does not
SFT11 the male Dutch senior population at have any means of access to the consciousness of a second
is person other than the second’s reaction to what the first
has communicated. Only if the second person’s behaviour
(20/18) 01-01-1999 0 is consistent with the first’s expectation on basis of what
(21/19) 01-02-1999 1
he has communicated, the latter may conclude that the
(22/20) 01-03-1999 0
other person has understood the concept communicated as
NT1 Identification number (Categorical) intended.
NFT1 identification number <…> In this sense, information analysis is not more than a way
to specify tenets and a set of sophisticated behavioural
NT2 Country name (Categorical)
experiments to test tenets against the experiences of the
NFT2a the country name <…>
NFT2b <…> domain experts involved. A composite concept defines a
tenet correctly and completely if the composite concept –
NT3 Time value (Interval) given any set of valid composing concepts – is valid in the
NFT3a the time value <…> opinion of every domain expert involved.
NFT3b <…>
2.2 Derived Concepts
NT4 SoFi-number
NFT4 the SoFi-number <…> The expectation that some set of known, independent con-
cepts invariably leads to one particular result is expressed
NT5 Number with a derivation rule. Every conceivable derivation ac-
NFT5 a number of <…> cording to this rule yields – precisely and completely – the
identity and the specific meaning of a single derived con-
The broadest class of comparable concepts is called a cept. This section provides a notation and a method to
concept type. In a role definition like , CT3 identifies the type of concepts fulfilling the document our tenets with regard to derivations.
role ‘country’. OFT6 identifies the class of similarly
formed objects as well as the class of concepts with the 2.2.1 A Notation for Rules
same meaning. In the sample sentences above, the number Any data warehouse embodies information that has been
in front of the slash identifies the sentence; the number derived from some operational information source. Given
following the slash identifies the concept. The sentences a conceptual information model of the source, it is often
3, 7 and 9 are three different manifestations of concept 3. not particularly difficult to specify the derivation rule with
Likewise, the sentences 4 and 8 are manifestations of con- some derived concept type. Any formalism for logical
cept 4. Belgians don’t have a SoFi-number. Consequently, expressions can be used for this purpose: the Peano Rus-
such manifestations don’t occur. The meanings of sen- sell notation, existential graphs, Prolog, or SQL. Person-
tences 3 and 9 do not differ necessarily. The meaning of ally, I prefer a notation derived from the existential graphs
sentence 7 is more specific than that of sentences 3 and 9. of Charles Sanders Peirce [Hartsh60]. [Creasy89] has
The conjunction of several concepts is conceivable if been the first to link this formalism to natural language
every individual concept in the conjunction is valid. Con- oriented information analysis under the acronym of
sequently, every conjunction of concepts is divisible and ENIAM. My own contribution exists in a profound theo-
derives its validity from the validity of the concepts in- retical and methodological foundation of the analysis of
volved. Primarily, information analysis restricts itself to rules as part of natural language oriented information
elementary assertive sentences. Composite sentences are analysis [Schouten99]. Existential graphs provide the
dealt with only after all elementary sentences have been minimally required basic concepts (identification, nega-
analysed. A compound sentence that is a simple conjunc- tion, conjunction), are theoretically sound and are practi-
tion is divided in its elementary components; each indi- cally indispensable.
vidual component is then treated as an ordinary elemen- The logical expression in figure 1 counts the number of
tary sentence. Whenever the nature of composition is persons that are male Dutch citizens that have reached the
more than a simple conjunction, such as an implication or age of 65 years and are alive at a given reference date and
a disjunction, we have recognised a tenet that must be creates a concept of the type ‘Senior population size’ ac-
treated as such. Every composite sentence that is not a cordingly. This expression is indivisible and hence repre-
Han Schouten 5-4
sents an elementary concept. This is a tiny yet representa- planation of this figure follows below
tive example of aggregation in a data warehouse. An ex-
C : C o u n try
c o u n try n a m e
c o u n try n a m e T h e N e th er-
la n d s
P T: P o in t in tim e
N : N a tio n a lity
N ow
p e rso n c o u n try tim e v a lu e
P : P erso n
D : D e a th
P e rso n R e fe ren ce d a te
id e n tific a tio n p e rs o n tim e v a lu e tim e v a lu e
B : B irth
M an
ag e lim it
id e n tfic a tio n p e rs o n tim e v a lu e 2 1 6 5 ja a r
S P : S e n io r p o p u latio n siz e
c o u n try a g e lim it tim e v a lu e num ber
Figure 1: The derivation of the number of male Dutch citizens that have reached
the age of 65 years and that are alive at a particular date
The white background with the shadow on which the In an existential graph, a dark grey ellipse symbolises all
graph is portrayed is the sheet of assertion. Every asser- conceivable concepts of the given type that satisfy the
tion placed on the sheet of assertion is true by definition. conditions imposed. An alias and a concept type name
Every concept that is placed on it is valid by definition. accompany each concept type symbol. The alias symbol-
Every combination of concepts that is placed on the sheet ises any specific concept that can be instantiated under
of assertion represents a valid conjunction. this concept type symbol. A white rectangle within the
ellipse symbolises the role of a concept or the role of a
The rectangles with rounded corners that are alternately
name or a value. Several contiguous roles symbolise a
coloured light grey and white, are called cuts and sym-
sentence type with a meaning. An individual concept may
bolise the negation of every expression that is placed
have several meanings. In such a case, the concept type
within its boundary. If a cut is placed directly on the sheet
symbol shows as many sentence types with their names as
of assertion, this means that it is true that anything placed
there are meanings. A role can be lexical or non-lexical,
within this cut is not true. Two cuts can never intersect.
Han Schouten 5-5
depending of its referent being a simple value without representing ‘The Netherlands’ fulfilling the role ‘country’
identity, or a concept having its own, inalienable identity. must be valid. This person must also be 65 years old or
older, i.e. there must be a concept of the type ‘Birth’ with
A lexical role can be compared with some other role or
this person fulfilling the role ‘person’ and the value in the
some other value expression such as a computation, a user
role ‘time value’ differing at least 65 years with the refer-
defined value or a literal constant. Whenever a concept is
ence date. Also, this person should be alive at the refer-
being created, its lexical roles receive their values from
ence date, i.e. there should not be a valid concept of the
other roles and value expressions. A dashed line with an
type ‘Death’ with this person fulfilling the role ‘person’
operator symbol on it symbolises a comparison. A dashed
that has a value in the role ‘time value’ smaller than or
line with an arrowhead on it symbolises an assignment.
equal to the reference date. If all these conditions are met,
The arrangement of two small squares and a pentagon
nothing can stop the creation of a concept of the type
with a minus sign symbolises a subtraction. The small
‘Senior population size’. The role ‘country’ of this con-
squares symbolise the arguments. The number in each
cept receives its fulfilling concept from the concept in-
square defines the order of evaluation. The pentagon sym-
stantiated under the concept type ‘Country’. The role age
bolises the operator as well as the result of the computa-
limit receives its value directly from the user-defined
tion. In this case, the result of the subtraction is compared
value. The role ‘time value’ receives its value from the
with the user-defined value of ’65 years’.
reference date. The role ‘number’ receives its value from
There are two types of user-defined values: mandatory the number of persons counted. If the user-defined value
and optional. A mandatory user-defined value is depicted ‘country name’ has not been specified, a concept of type
black. An optional user-defined value is depicted white. A ‘SP: Senior population size’ will be derived for each de-
mandatory user-defined value is only valid when it is pro- fined concept of the type ‘C: Country’.
vided with a significant value. Since a mandatory user-
defined value is a prerequisite for the derivation, its sym- 3 Analysing management information
bol is always placed on an odd cut. An optional user- systems
defined value is always true and does not influence the
evaluation of a logical expression in any way. Hence, it We may become aware of a derivation in several ways.
may be placed anywhere in an existential graph. Apart from the organisation structure and prescriptions for
desirable human behaviour, the implementation of organ-
The concept that fulfils a non-lexical role can be identi- isational objectives produces a set of criteria for the de-
fied; the concept that is going to fulfil a non-lexical role gree in which the organisation fulfils its objectives. Often
can be instantiated. A solid line symbolises the identifica- an organisation will produce reports at regular intervals
tion of a concept. A solid line with an arrowhead symbol- that present management information. Whenever the or-
ises the instantiation of a role with some concept. ganisation structure and the standard reports provide in-
The dark grey hexagon labelled ‘Senior population size’ sufficient inspiration, we can survey the corporate infor-
symbolises the concept to be created as soon as some mation model for the presence of countable or measurable
situation enables its creation. items.
We should interpret an existential graph from the outside In management science, key performance indicators (KPI)
inwards. The outer cut postulates that it is not true that a are popular for measuring the degree in which an organi-
significant age limit has been specified and that there is a sation lives up to its critical success factors (CSF) [Kap-
valid point in time ‘Now’ that is equal to some other valid lan96]. For a service company, the elapsed time between
point in time ‘Reference date’ and that the expression the notification of a defect and its repair is a KPI in the
within the second cut would not be true. This is logically light of a CSC customer’s satisfaction. Twenty four hour
equivalent to saying: “If the age limit has been specified availability, is another. A data warehouse must provide an
and the reference date arrives, then ”. The derivation itself is also a ne- given point in time. The interpretation of a set of KPI’s in
gation within a negation and can be read as an implica- the light of some CSC, can be expressed as a norm.
tion: “If , then create a concept of the type ‘SP: Senior robust, fundamental way to analyse its derivation. The
population size’. In this derivation, not the mere existence simple way exists in defining elementary data warehouse
of some concept of the type ‘P: Person’ is tested, but the relations merely on basis of weak functional dependen-
number of all concepts that obey the conditions imposed. cies. The fundamental approach defines the way in which
A concept of the type ‘Person’ counts if it is a man, i.e. if some KPI can be derived with the help of existential
it represents an element in the population of the sentence graphs. One way does not exclude the other, but each is
type ‘Man’. Moreover, this person should be a Dutch citi- best suited to its own particular field of application. Be-
zen, i.e. a concept of the type ‘Nationality’ with this per-
son fulfilling the role ‘person’ and a concept ‘Country’
Han Schouten 5-6
cause of their complementary nature both approaches will Attributes of the absolute, ratio or interval scale type must
receive attention in this article. often be cast to some range of discrete intervals first. The
latter necessarily represents some ordinal scale. If a data
warehouse contains historical data, the cast of a timestamp
3.1 The easy way
to some ordinal scale of time periods often provides a
There is an easy method to capture the bulk of a data most suitable classifying attribute.
warehouse from readily available data in operational data-
The set of weak functional dependencies is not depleted
bases, provided, there exists a well-conceived relational
with the non-key attributes within the same relation. If
data model. First, this model must be transformed into a
some KPI weakly depends on a foreign key, we may ex-
graphical form that suits our purposes. When some for-
tend this dependency to the primary key that it references
eign key in a table or relation references another, the rela-
and to every attribute that functionally depends on it.
tion that contains the reference is called child and the re-
Following foreign key references is – in the diagram
lation that is referred to is called parent. When some rela-
shown above – literally a way of look-up. Look-up legiti-
tion is a specialisation of another, the first is called sub-
mately provides additional classifying attributes. Search-
type and the second super type. The subtype relation itself
ing in the opposite direction is called drill-down. It is ab-
may be the super type of another set of subtypes. Prefera-
solutely inconceivable, that some attribute would depend
bly, we draw this type of diagram as follows:
of attributes reached by drilling down.
Conceptually, every instance of a relation has its own in-
alienable identity. In the case of an instance of a super
Parent type, however, every applicable subtype relation provides
attribute values that specifically belong to the meaning
associated with that subtype and to that particular instance
Sub 1
of the super type. Therefore, if some KPI weakly depends
on the attributes in a particular super type or subtype rela-
Super Sub 2 Sub 2.1 tion, it weakly depends on all attributes of all applicable
subtypes of the super-most relation and on all parents of
these super- and subtypes.
Sub 3
The selection of classifying attributes determines the ap-
plicability of a data warehouse. Insufficient selectivity
C hild results in a bulky data warehouse that will be difficult to
manage and hard to understand. If we restrict ourselves
too much we may loose opportunities. Irrespective of how
Figure 2. Search paths in a relational model. many classifying attributes we take into consideration, the
combination of all classifying attributes with one depend-
In this diagram, the relation ‘Sub 2’ has one parent rela- ent attribute precisely represents one elementary func-
tion ‘Parent’ and one child relation ‘Child’. The relation tional relation. The chapter on the design of data ware-
‘Sub 2’ is one of the three subtypes of the relation ‘Super’ houses explains how such elementary relations can be
and has itself one subtype ‘Sub 2.1’. grouped into useful data warehouse relations.
An attribute of a relation that represents a KPI is almost 3.2 A more advanced approach
always some magnitude of the absolute, ratio or interval
scale type. Every attribute of a relation is either part of the Some KPI’s don’t exist as such in an operational database
primary key to that relation, or functionally depends on it. and must be derived. Sometimes, the logical expression
If a relation has an alternate key, then every dependent for such a derivation is simple. Often, however, its analy-
non-key attribute also fully functionally depends on the sis and verification requires huge efforts. Then, a perfect
alternate key. Apart from functional dependencies of pri- way of working and modelling pays for itself.
mary and alternate keys, we may assume a weak func- An organisation invariably derives the criteria for its per-
tional dependency of some non-key attribute of every formance from facts obtained from its ‘shop floor’. Do-
other attribute within the same relation. As far as the latter main experts can decide which concepts assist in achiev-
are worth considering, we call these classifying attributes. ing a particular derivation result and how. Analysts can
Very often, the strength of the dependency of some KPI help to find the proper logical expression to describe the
on classifying attributes will be the subject of data analy- derivation.
sis on a data warehouse.
Existential graphs have proven to be a tremendous help in
Attributes that belong to the categorical or ordinal scale formulating and verifying the exact nature of our beliefs.
type lend themselves perfectly as classifying attributes.
Han Schouten 5-7
In an existential graph that expresses our belief concern- tion prevents information redundancy and that is an es-
ing the derivability of some concept, the concepts that sential instrument for the maintenance of information in-
trigger the derivation always appear in conjunction in the tegrity. Moreover, clustering improves query perform-
first cut. The concepts that provide classifying variables ance.
and the derived concept always appear in conjunction in
Data warehouse design leaves us greater freedom. A data
the second cut. Every classifying concept will fulfil an
warehouse derives its consistency mainly from the con-
independent role in the concept to be derived. A simple
sistency of the source data. Customarily, data in a ware-
conjunction may not produce the required result in the
house are never changed. Consequently, the fifth normal
case of optional dependent concepts, such as ‘Death’.
form is no longer crucial to information integrity. Better
Here, the non-existence of a concept of this type is re-
performance can be obtained by:
quired and a simple negation suffices. In other cases an
embedded implication, equivalence or non-equivalence is 1. Cluster elementary concepts that represent dependen-
called for. cies of KPI’s on similar ranges of classifying attrib-
utes;
4 Designing management information 2. Aggregating KPI’s in a clustered relation over all val-
systems ues of a classifying attribute or combinations of these.
To enhance understanding and usage of a management
4.1 Design considerations
information system, the correspondence between roles in
A management information system provides for the deri- the data warehouse and roles in the operational database
vation of management information obtained from the or- must be known and documented.
dinary course of affairs in a company, its storage and its
use. The logical expression that documents a derivation 4.2 Designing the data warehouse
rule is the precipitation of a tenet concerning this course
of affairs. A logical expression is itself a composite, yet The analysis of derivation rules may have resulted in a
irreducible concept. We may state the following: rather large number of logical data warehouse relations.
Irrespective of the way in which these relations have come
A composite concept defines a belief correctly and about, these mainly represent functions, i.e. dependencies
completely if this is – without any reservation – con- of a single, possibly derived variable on a set of classify-
ceivable itself in the experience of an arbitrary do- ing variables. The purpose of data warehouse design is, to
main expert given an arbitrary set of composing ele- cluster as many as possible elementary, functional rela-
mentary concepts that are conceivable themselves. tionships into the smallest possible number of composite
We may expect the same quality of the implementation of data warehouse relations.
any belief in a management information system. All op- Mainly, there are two reasons for clustering elementary
erations performed on operational data during and after data warehouse relation into one composite relation:
loading it into a data warehouse, should fully maintain the
logical coherence that we have recognised during analysis. 1. The sets of classifying attributes are equal;
Deviation from this principle would damage the credibil- 2. The set of classifying attributes of one data ware-
ity of the system and could seriously endanger the man- house relation is a subset of the other.
agement of a company.
The first reason is the ordinary clustering of elementary
A conceptual information model including derivation relationships according to similarities in primary keys.
rules describes conceivable concepts in their most ele- Very often, a derivation rule is a function: the derivation
mentary form. In this form, no consideration whatsoever applied to a particular combination of independent classi-
has been implied of the way in which concepts will be fying variable values produces a single dependent variable
stored or presented, let alone a way that guarantees opti- value. By definition, the instantiation of a function is an
mal performance. elementary sentence. Elementary sentences sharing the
The relational model provides an optimal strategy for same set of independent variables can be clustered into a
storage and retrieval of information. This model clusters a composite relation. In a data warehouse, the set of inde-
variety of elementary concepts, in which another concept pendent classifying attributes represents its primary key
fulfils a unique role, around this concept and stores it as and identifies each object residing in it.
such. Such a cluster of concepts is called a tuple or row. A The second reason can be understood by considering the
relational database only provides lexical references to semantics of existential graphs. If a role of a concept in an
concepts. The notion of concept type only exists in the existential graph is unbound, its referent does not influ-
name of the relation, tuple type or table that acts as a ence its evaluation in any way. If such a role is connected
container for similar tuples. This way of storing informa- to an optional connector via a line of comparison and that
Han Schouten 5-8
connector is not provided with some significant value, with the name of the derived attribute, e.g., ‘Number of
then this role does not influence the evaluation process men’ and ‘Number of Women’. The second solution re-
either. If the dependent variable in a data warehouse rela- quires a semantically equivalent schema transformation.
tion represents an aggregation (sum, average, etc.) of Originally, the gender that was implied in the specialisa-
some variable, an undefined classifying variable results in tions ‘Man’ and ‘Woman’ of the concept type ‘Person’.
the aggregation over all values that it assumes in that rela- Now, the gender will be expressed explicitly in a binary
tion. concept ‘Gender’ with a role ‘person’ and a role ‘expres-
sion of gender’ with two possible values ‘male’ and ‘fe-
Grouping is not allowed if two elementary relations repre-
male’. This solution is demonstrated in figure 3. The latter
sent derivations that logically exclude each other. This
being more elegant, we will henceforth assume this solu-
may be the case, if both derivations concern different spe-
tion. If the gender is not specified, all persons that meet
cialisations of the same concept type. E.g., the relation
the other conditions imposed will be counted.
that represents the result of counting the number of male
persons like the one in figure 1 can not be grouped with a In a data warehouse relation, an unspecified classifying
similar relation for women, just like that. We must estab- variable means that each aggregation has been performed
lish, first of all, that either two different units of counting over all values of that variable. In this respect, a data
C : C o u n try
c o u n t ry n a m e
c o u n tr y n a m e
T h e N e th e r la n d s
P : P e rs o n N : N a tio n a lity
p erso n c o u n try
G : G ender
exp. of gender
p erso n exp. of gender
P e rs o n M a le
id e n tific a tio n
B : B irth
a g e lim it
p erso n ti m e v a l u e 2 1
6 5 ja a r
P T : P o in t in tim e
D : D e a th
R e fe re n c e d a te
p erso n t im e v a lu e tim e v a lu e
S P : S e n io r m a le p o p u la tio n s iz e
N ow
c o u n tr y ex p . o f g en d er a g e lim i t tim e v a lu e num ber t im e v a lu e
Figure 3. A semantic equivalent of the derivation in figure 1.
are concerned, or that we must add a classifying attribute warehouse relation differs noticeably from its relational
‘Gender’ to the composite data warehouse relation. With counterpart in OLTP. The relational model according to
the first solution, the nature of the population is interlaced [Codd90] prohibits undefined values for primary key at-
Han Schouten 5-9
tributes. An undefined independent variable value in a
data warehouse relation has an unambiguous, well-
L iv in g p erso n s
understood meaning and is entirely admissible. 'T h e N ethe rla nd s' (co un try na m e ) (nu m be r o f p erso n s)
So far, we have recognised two sources of elementary data 'M a le' (exp re ssio n o f ge nd er) (a vera g e in co m e )
warehouse relations: weak functional dependencies in
'6 5 years' (ag e low e r lim it)
operational databases and derived concepts produced by
the invocation of a derivation rule. Irrespective of its '1 99 6 -0 1 -01 '
(referen ce d ate )
source, we may visualise a data warehouse relation as a
black box with connectors; connectors symbolising classi-
fying variables to the left of the black box and connectors
symbolising dependent variables to its right. In this way Figure 6. The number of persons and their income
we can rapidly investigate large numbers of different sce-
narios. Considering the meaning of unbound roles in aggrega-
tions, we unite all elementary relations in which inde-
Figure 4 depicts the same functional dependency as the pendent variables depend on fully identical sets of inde-
one represented by the fact type derived in figure 3, now pendent classifying variables, or subsets thereof, into
as a ‘black box’. composite relations. The result is a set of data warehouse
relation in which the combination of all independent vari-
ables represents the key to several dependent variables.
'T h e N ethe rla nd s'
(c ou n try na m e )
L iv in g p erso n s
(n u m b er o f pe rson s)
Because of the degree of freedom we experience in clus-
tering elementary data warehouse relations into composite
'M a le' (e xp re ssio n o f ge nd er) ones and because every conceivable aggregation level can
'6 5 ye ars'
be maintained within a single relation, a data warehouse
(a g e lo w er lim it)
usually consists of a few relations only.
'1 9 96 -01 -01 ' (referen ce da te )
4.3 Transitive derivability
In a transitive derivation, at least one independent variable
Figure 4. Counting the number of living persons is itself derived. An average is always a transitively de-
rived variable; the sum and the number of observations
Figure 5 shows an elementary relation that expresses the from which the average are calculated both represent de-
dependency of the average income of living persons on a rived variables. When the object counted is the same as
similar set of classifying variables. the object of summation, a single black box suffices.
However, when the object counted differs from the object
summed up, e.g. the tax administration dividing the total
L iv in g p erso n s
amount spent for verifying tax declarations in some region
'T h e N ethe rla nd s' (co un try na m e ) (a vera g e in co m e )
by the number of taxable persons living in that region in
'M a le' (exp re ssio n o f ge nd er)
order to obtain some measure of efficiency, two black
boxes that share some or all of the independent variable
'6 5 years' (ag e low e r lim it) with equal values are required. Such derivations demand a
'1 99 6 -0 1 -01 ' particular order of evaluation. Figure 7 shows such a sce-
(referen ce d ate )
nario.
'U trec ht' E x p en se s
(reg io n na m e) (to ta l am o un t)
Figure 5. Calculating the average income 'In co m e '
(typ e o f ta x)
Given an identical set of independent variables, the black 1
Ta x a b le p e rso n s 2
boxes in figures 4 and 5 can be united within a single (n um b er o f p erso ns)
:
black box, as is shown in figure 6.
(e fficien c y ra te)
Figure 7. Transitive derivation.
Han Schouten 5-10
4.4 Evolving data warehouse objects the same tuple represents the aggregate of all individual
dependent variables that occur in conjunction with a de-
If the history of the evolution of concepts in a database is
fined value of the classifying variable at a lower aggrega-
registered, timeline attributes represent classifying vari-
tion level. In this way, all aggregation levels of a data
ables that can be treated accordingly. Only dependent
warehouse object – ranging from the most elementary
variables that have been modified between the last and the
level to the highest aggregation level – can be unambigu-
before last database extraction, will be considered for up-
ously maintained within a single data warehouse relation.
dating the data warehouse. To prevent counting un-
changed dependent values more than once, only those An aggregated value is best associated with several other
dependent variables that have changed may be registered derived values that describe the original population and
in the newly created data warehouse tuple. that facilitate further aggregation, whenever needed. The
number of observations from which the aggregate has
Otherwise, if no history is maintained in the database, the
been derived must be known. Also maintaining the lowest
moment of database extraction may provide a suitable
and the highest observed value and the median of all ob-
timestamp. Here, the specific dependency of a dependent
servations may come in handy; after derivation such in-
variable from its classifying variables in the database must
formation may get lost. Next to that, the sum and the sum
be compared with its equivalent in the data warehouse.
of squares of all observations must be stored at least. In
Series of changes between two successive extractions will
this way, the mean, the variance and the standard devia-
go unnoticed, however. The classifying variable in this
tion can always be calculated. When more recent obser-
case can be a simple timestamp, or a derived status vari-
vations are added to a data warehouse relation, the new set
able that expresses a stage in the lifecycle of the data
of aggregates can be computed without having to recur to
warehouse object at the moment of extraction. Here also,
the original values.
only modified dependent variables may be registered, in
order to prevent counting the others more than once. At the level of elementary observations, by definition, the
number of observations equals one and the sum equals the
Selecting several related data warehouse objects, e.g., to
value observed.
calculate the elapsed time between two successive extrac-
tions, may be difficult in some data analysis applications
and near impossible in others. If such a variable is rele- 5. Conclusions
vant, it is good practice to calculate it at the moment of This article describes the outline of an approach for the
extraction and store it as a dependent variable in the data analysis and design of data warehouses. As such, a data
warehouse object concerned. warehouse is the database of a management information
system. Where [Inmon93] just explains what must be
4.5 Classifying variables and Drill paths done, this article explains how this can be achieved. The
Some data warehouse relation may be the result of looking approach in this article aims at procedural precision, com-
up classifying attributes at several successive levels of pleteness and prescriptiveness. As an outline it may not be
parent relations. If the foreign key attributes have been perfect in every respect. At its very best, it is a guide for
included as classifying attributes, aggregation of depend- practical application that leaves ample room for further
ent variables over those foreign key attributes, provides research and improvement. An educationally sound publi-
clean sums, counts, averages and so on for a particular cation about this subject would rather fill a book and
class within the enveloping parent category. would not fit into an article such as this.
Such a hierarchy of embracing categories represents a The analysis and design of a data warehouse based on
useful means to traverse a data warehouse relation from principles derived from information theory offers several
the highest aggregation level to the lowest, or to perform remarkable and possibly new perspectives. A data ware-
drill down. Such hierarchies are called drill paths. house relation may be considered as a set of grouped ele-
mentary functions sharing the same set of classifying vari-
If no other provisions are made, the nature and origin of ables or a subset thereof. The largest set of classifying
classifying variables in a data warehouse relation are lost. variables serves as the primary key to the data warehouse
Therefore, in a data warehouse, the knowledge that sev- relation. If a classifying variable has an undefined value,
eral classifying variables represent a drill path, must be all dependent variables represent aggregates of original
maintained explicitly. observations over all tuples that feature a defined value of
that classifying variable. In this respect, a data warehouse
4.6 Aggregation levels in the data warehouse relation differs essentially from a relation according to
The designer of a data warehouse can define a single data [Codd90] in which undefined primary key attributes are
warehouse relation that may hold every conceivable ag- inadmissible.
gregation level. An undefined classifying variable in a Every aggregation procedure can be described completely
data warehouse tuple means that every derived variable in and correctly with a logical expression – possibly in the
Han Schouten 5-11
form of an existential graph – provided, the identity and References
meaning of the concepts involved is known and well un-
derstood. Apart from the derived concept, an existential [Nijssen94] Nijssen G.M., Schouten H. Matemataal
graph that represents a derivation rule shows concepts that voor bedrijfseconomie en bedrijfskunde.
provide either classifying variables or dependent variables PNA Publishing, Beutenaken, 1994. ISBN
or concepts that link these together. The derived concept 90-5540-002-5.
usually features one role, whose value is the result of [Hartsh60] Hartshorne C., Weiss P. (editors). Col-
counting, accumulation, averaging, or any other computa- lected papers of Charles Sanders Peirce.
tion and that functionally depends on the conjunction of Volume IV: The simplest mathematics.
all classifying variables distinguished in the derivation. Book II: Existential graphs. Pages 293 -
The excellent qualities of existential graphs such as sim- 464. Harvard University Press, Cambridge,
plicity, readability and logical completeness make them
Massachusetts, USA, 1960.
indispensable for the analysis of derivation rules.
[Creasy89] Creasy P. ENIAM – A more complete con-
The special meaning associated with undefined classifying
ceptual schema language.
variables in data warehouse relations offers the designer
Proceedings of the fifteenth international
several degrees of freedom. He may sensibly combine
conference on very large databases, Am-
several aggregation levels of a data warehouse object in a
sterdam. (1989).
single data warehouse relation. This helps to prevent a
combinatory explosion of conceivable aggregation levels. [Schouten94] Schouten H. The rules of the game – How
Likewise, successive stages in the life cycle of a data to formally specify the NIAM information
warehouse object along with all relevant data can be reg- analysis method. Working papers of the
istered and exploited in a single data warehouse relation. Second International NIAM Conference,
Transitive derivability occurs and requires attention. The Albuquerque, New Mexico, USA, 1994.
data warehouse designer is compelled to consider a par- [Inmon93] Inmon W.H. Building the Data Ware-
ticular order of derivation. house. John Wiley & Sons, Inc.1993. ISBN
The fancy horse and the workhorse portrayed in this arti- 0-471-56960-7.
cle have proven to reinforce each other rather than com- [Bakema94] Bakema G.P., Zwart J.P.C., Lek H. van
pete with one another. Looking-up classifying variables in der. Fully communication oriented NIAM.
a hierarchy of enveloping relations provides a valuable Working papers of the Second International
first scan. The rule-based approach helps us to identify NIAM Conference, Albuquerque, New
and investigate more intricate derivations. Mexico, USA, 1994.
[Codd90] Codd E.F. The relational model for data-
base management – Version 2. Addison-
Wesley Publishing Company, Inc., 1990.
ISBN 0-201-14192-2.
[Kaplan96] Kaplan R.S., Norton D.P. Using the bal-
anced scorecard as a strategic management
system. Harvard Business Review, January-
February 1996.
[Schouten99] Schouten, H. A repository for existential
graphs. Journal of Conceptual Modeling,
April 1999. http://www.inconcept.com/jcm
Han Schouten 5-12