Multidimensional Ontologies for Contextual Quality
            Data Specification and Extraction
                                       Mostafa Milani⋆
                        Carleton University, School of Computer Science
                                        Ottawa, Canada
                               mmilani@scs.carleton.ca

        Abstract. Data quality is context-dependent. That is, the quality of data cannot
        be assessed without contextual knowledge about the production or the use of
        data. As expected, context-based data quality assessment requires a formal model
        of context. Accordingly, we propose a model of context that addresses quality
        concerns that are related to the production and use of data.
             Here we follow and extend a context model for the assessment of the quality
        of a database instance that was proposed in a previous work [1]. In that frame-
        work, the context takes the form of a possibly virtual database or data integration
        system into which a database instance under quality assessment is mapped, for
        additional analysis and processing, enabling quality data extraction. In this work
        we extend contexts with dimensions, and by doing so, we make possible a multi-
        dimensional data quality assessment. Multidimensional contexts are represented
        as ontologies written in Datalog±. We use this language for representing dimen-
        sional constraints, and dimensional rules, and also for doing query answering
        based on dimensional navigation, which becomes an important auxiliary activity
        in the assessment of data. We show ideas and mechanisms by means of examples.

1 Introduction
In a previous work [1], a model of context for data quality assessment was proposed.
In that work, the assessment of a database D is performed by putting D in context or,
more precisely, by mapping it into a context C, which is represented as another database,
or as a database schema with partial information, or, more generally, as a virtual data
integration system [8].
    The quality of data in D is determined through additional processing of the data
within the context. Data processing in the context leads to possible several quality
versions of D, forming a class Dq of intended, clean versions of D. The quality of
D is measured in terms of how much D departs from (its quality versions in) Dq :
dist (D, Dq ). We may want to assess the quality of answers to a query Q posed to in-
stance D or to obtain “quality answers” from D. This can be done appealing to the class
Dq of intended clean versions of D. For assessment, the set of query answers to Q from
D can be compared with the certain answers for Q, from each of the instances in Dq
[7]. The certain answers become what we call the clean answers to Q from D [1].
Problem Statement and Relevance: An important contextual element was not con-
sidered in [1]: dimensions. In practice, dimensions are naturally associated to contexts.
Here, in order to capture general dimensional aspects of data for inclusion in contexts,
we take advantage of and start from the Hurtado-Mendelzon (HM) multidimensional
⋆
    Supervisor: Leopoldo Bertossi (bertossi@scs.carleton.ca)
data model [6], whose inception was mainly motivated by data warehouses (DWH) and
OLAP applications.
     In Example 1, we illustrate the intuition behind multidimensional context for quality
specification and extraction. We assume, according to the HM model, that a dimension
consists of a finite set of categories related to each other by a partial order.
Example 1. The relational table Measurements (Table 1) shows body temperatures of
patients in an institution. A doctor wants to know “The body temperatures of Tom Waits
for September 5 taken around noon with a thermometer of brand B1” (as he expected).
It is possible that a nurse, unaware of this requirement, used a thermometer of brand B2,
storing the data in Measurements. In this case, not all the measurements in the table are
up to the expected quality. However, table Measurements alone does not discriminate
between intended values (those taken with brand B1) and the others.
     For assessing the quality of the data in Measurements according to the doctor’s qual-
ity requirement, extra contextual information about the thermometers in use may help.
In this case, the table PatientWard stores the wards of the patients in an institution on
different days. The relation is linked to the Hospital and Time dimensions (Fig. 1, mid-
dle, bottom) meaning that the ward and day attributes of the relation take values from
the members of the Ward and the Day categories in the Hospital and Time dimensions
resp.                                                          Table 1. Measurements
     Furthermore, the institution has a guideline
                                                               Time         Patient Value
prescribing that: “Temperature measurement for
patients in a standard care unit have to be taken 1 Sep/5-12:10 Tom Waits 38.2
with thermometers of Brand B1”. It can be used 2 Sep/6-11:50 Tom Waits 37.1
for data quality assessment when combined with 3 Sep/7-12:15 Tom Waits 37.7
the table PatientUnit (Fig. 1, middle, top), which
                                                         4 Sep/9-12:00 Tom Waits      37.0
is linked to the Unit category, and whose data are
(at least partially) generated from PatientWard by 5 Sep/6-11:05 Lou Reed 37.5
moving data upward through dimension Hospital 6 Sep/5-12:05 Lou Reed 38.0
(Fig. 1, left), from category Ward to category Unit.


                       Fig. 1. An extended multidimensional model

                                           2
    According to the guideline, it is now possible to conclude that, on days when Tom
Waits was in the standard care unit, his temperature values were taken with the ex-
pected thermometer: for patients in wards W1 or W2 a thermometer of brand B1 was
used. These “clean data” —in relation to the doctor’s expectations— appear in relation
Measurements q .
    Elaborating on this example, there could be a constraint involving the Unit category
in the Hospital dimension: “No patient in intensive care unit at any time after August
/2005”. This constraint could be applied on the process of moving data upward from
PatientWard to PatientUnit.                                                           
    Example 1 illustrates the necessity of a formal data model to represent multidimen-
sional context (e.g. including relations linked to dimensions, constraints and rules on
these relations).
Proposed Solutions: In this work, we extend the HM model by adding categorical
relations associated to categories, at different levels of the dimension hierarchies, pos-
sibly to more than one dimension, i.e. PatientWard and PatientUnit (think of general-
ized fact tables as found in data warehouses). It also include dimensional constraints
and dimensional rules, which could be treated both as dimensional integrity constraints
on categorical relations that involve values from dimension categories. However, di-
mensional constraints are intended to be used as denial constraints that forbid certain
combinations of values, whereas the dimensional rules are intended to be used for data
completion, to generate data through their enforcement via dimensional navigation.
    Categorical relations may be incomplete, and new data can be generated for them,
which will be enabled through rules (tgds) of a Datalog± dimensional ontology. The
previous example shows data generation via upward navigation while downward nav-
igation may also be useful (cf. [13]). Our approach to multidimensional contexts will
support both.
    We propose an ontological representation in Datalog± [3, 4] of the extended HM
model, and also mechanisms for data quality assessment based on query answering
(QA) from the ontology via dimensional navigation. The idea is that a query to the
ontology triggers dimensional navigation and the creation of missing data, in possible
upward and downward directions, and on multiple dimensions. Datalog± supports data
generation through the ontological rules. A multidimensional context —corresponding
to the formalization of the extension of HM— becomes a Datalog± ontology, M, that
belongs to an interesting syntactic classes of programs (called weakly-sticky [5]), for
which some results are known. This allows us to give a semantics to our ontologies, and
apply some established and new algorithms for QA.
    The proposed multidimensional context forms a weakly-sticky ontology. This im-
plies that quality query answering and quality data extraction require an algorithm for
QA from such ontologies and its optimization and implementation. The weakly-sticky
class of ontologies is a well-established member of the family of Datalog± ontologies
with polynomial time data complexity for QA [5]. However, a practical QA algorithm
for this member of Datalog± is still missing in the literature. Here, we propose a practi-
cal QA algorithm that runs in polynomial time and it is applicable for some members of
the Datalog± family, including weakly-sticky ontologies. We also study the magic-sets
optimization techniques and show that they are applicable on the QA algorithm.

                                            3
2 Work Done So Far                        S                          C                       S#
We enhance the approach                                     /.5,1% C                         !#
                                    !
to data quality specification
and extraction in [1] (Fig-                       S’
                                     !                   ’                                       #
ure 2), by adding dimen-                                "                              #       !

                                    …
                                              "                            P          "
sions to contexts. In this                         0".70%1,/       #$%&"'()*+,-".%',/
case, the context contains                                                                     0
                                                                                                 #
                                     0
a generic multidimensional
                                                 E                   67
ontology, the shaded M in $0-,+)%//,//1,0'                                            #$%&"'()4,+/"30

Fig. 3, aka. “core ontol-                                   ,2',+0%&)/3$+.,/

ogy” (described next). This
ontology can be extended,              Fig. 2. A context for data quality assessment
within the context, with additional rules and constraints that depend on specific data
quality concerns (cf. Section 2.3).

2.1 The Extended MD Model and Datalog±
We extend the HM model introducing categorical relations, each of them having a rela-
tional schema with a name, and attributes, some of which are categorical and the other,
non-categorical. The former take values that are members of a dimension category. The
latter take values from an arbitrary domain. Categorical relations have to be logically
connected to dimensions. For this we use a Datalog± ontology M, which has a re-
lational schema SM , an instance DM , and a set ΣM of dimensional rules, and a set
κM of constraints. Here, SM = K ∪ O ∪ R, with K a set of unary category predi-
cates, O a set of parent-child predicates, capturing <-relationships for pairs of adjacent
categories, and R a set of categorical predicates, say R(C1 , . . . ; N1 , . . .), where, to
highlight, categorical and non-categorical attributes (Ci s vs. Nj s) are separated by “;”.
Example 2. Categorical relation PatientWard (Ward ,Day;Patient ) in Fig. 1 has cat-
egorical attributes Ward and Day, connected to the Hospital and Time dimensions, resp.
Patient is non-categorical. Ward (·), Unit (·) ∈ K; O contains, e.g. a binary predicate
connecting Ward to Unit; and R contains, e.g. PatientWard.                           

The (extensional) data, DM , associated to the ontology M’s schema are the complete
extensions for categories in K and predicates in O that come from the dimension in-
stances. The categorical relations (with predicates in R) may contain partial data, i.e.
they may be incomplete. They can belong to instance I in Fig. 3. Dimensional rules in
ΣM are those in (c) below; and constraints in κM , those in (a) and (b).
(a) Referential constraints between categorical attributes and categories as negative
    constraint (NC):1 (R ∈ R, K ∈ K; ē, ā are categorical, non-categorical, resp.;
    e ∈ ē)
                                        ⊥ ← R(ē; ā), ¬K(e).                                     (1)


 1
     An alternative and more problematic approach, may use tgds between categorical attributes
     and categories, making it possible to generate elements in categories or categorical attributes.


                                                  4
(b) Additional dimensional constraints, as egds or NCs: (Ri ∈ R, Dj ∈ O, and x, x′
    stand both for either categorical or non-categorical attributes in the body of (2))
             x = x′ ← R1 (ē1 ; ā1 ), ..., Rn (ēn ; ān ), D1 (e1 , e′1 ), ..., Dm (em , e′m ).        (2)
                   ⊥ ←     R1 (ē1 ; ā1 ), ..., Rn (ēn ; ān ), D1 (e1 , e′1 ), ..., Dm (em , e′m ).   (3)
                                          ±
(c) Dimensional rules as Datalog tgds:

       ∃āz Rk (ēk ; āk ) ← R1 (ē1 ; ā1 ), ..., Rn (ēn ; ān ), D1 (e1 , e′1 ), ..., Dm (em , e′m ). (4)

    Here, āz ⊆ āk , ēk ⊆ ē1 ∪ ... ∪ ēn ∪ {e1 , ..., em , e′1 , ..., e′m }, āk r āz ⊆ ā1 ∪ ... ∪
    ān ; and repeated variables in bodies are only in positions of categorical attributes
    (in the categorical relations Ri (ēi ; āi )), and attributes in parent-child predicates
    Dj (ej , e′j ). Value invention is only on non-categorical attributes (we will consider
    relaxing this later on).
With rule (4) (an example is (7) below), the possibility of doing dimensional navigation
is captured by joins between categorical predicates, e.g. Ri (ēi ; āi ), ..., Rj (ēj ; āj ) in the
body, and parent-child predicates, e.g. Dn (en , e′n ), ..., Dm (em , e′m ).
    Rule (4) allows navigation in both upward
and downward directions. The direction of                                             C
navigation is determined by the level of cate-                     M schema C
gorical attributes that participate in the join
                                                             S’
in the body. Assuming the join is between                    Ri’
                                                                    categorical
                                                                     relations
                                                                                           I
                              ′
Ri (ēi ; āi ) and Dn (en , en ), upward naviga-   a  i                                              aiq
tion is enabled when e′n ∈ ēi (i.e. e′n appears
                                                         nicknames
                                                                                         P
                                                                     dimensions
in Ri (ēi ; āi )) and en ∈ ēk (i.e en appears in                               quality predicates
the head). On the other hand, if en occurs in
Ri and e′n occurs in Rk , then downward nav-             Fig. 3. A multidimensional context
                                     ′
igation is enabled, from en to en .

Example 3. (example 2 cont.) The categorical attribute Unit in categorical relation
PatientUnit takes values from the Unit category. We use a constraint of the form (1):
⊥ ← PatientUnit (u, d; p), ¬Unit (u).                                             (5)
   For the constraint in Example 1 requiring “No patient was in intensive care unit
during the time after August 2005”, we use a dimensional constraint of the form (3):
               ⊥ ← [PatientWard (w, d; p), UnitWard(Intensive, w),                                       (6)
                               MonthDay(August/2005, d)].

   The following dimensional rules of the form (4) capture how data in PatientWard
and WorkingSchedules generate data for PatientUnit and Shifts, resp.:
   PatientUnit (u, d; p) ← PatientWard (w, d; p), UnitWard (u, w ).                                      (7)
   ∃z Shifts(w, d; n, z ) ← WorkingSchedules(u, d; n, t ), UnitWard (u, w ).                             (8)
    In (7), dimension navigation is enabled by the join between PatientWard and Unit-
Ward. The rule generates data for PatientUnit (at a the higher level of Unit) from Pa-
tientWard (at the lower level of Ward) via upward navigation.

                                                    5
    Rule (8) captures a guideline that states: “If a nurse works in a unit on a specific day,
he/she has shifts in every ward of that unit on the same day”. The rule is expressed using
two additional categorical relations, WorkingSchedules and Shifts, that store schedules
of nurses in units and shifts of nurses in wards, resp. Downward navigation is performed
by generating data for Shifts (at the level of Ward) from WorkingSchedules (at the level
of Unit). In this case, the schemas of the two categorical relations do not match. So, the
existential variable z represents missing data for the shift attribute.                    

2.2 Properties of MD Datalog± Ontologies
Here, we first establish the membership of our MD ontologies, M (cf. Section 2.1) of
a class of the Datalog± family. Membership is determined by the set ΣM of its tgds.
Next, we analyze the role of the constraints in κM , in particular, of the set ǫM of egds.
Proposition 1. MD ontologies are weakly-sticky Datalog± programs.                         
The proof and a review of weakly-sticky Datalog± [5], can be found in [13] (an extended
version of [11]). A consequence of this result is that QA from ΣM is in polynomial-time
in data complexity [5].
    The complexity stays the same if we add NCs, of the forms (1) and (3), because
they can be checked through the conjunctive queries in their bodies [5]. However, com-
bining the egds in ǫM with ΣM could change things, and, in principle, even lead to
undecidability of QA [2]. Separability[5] of egds and tgds is a semantic condition that
guarantees complexity of query answering still remains the same even after adding
egds. In [13] (Proposition 2), we show a syntactic condition for the dimensional edgs
that implies the separability.
2.3 MD Contexts for Quality Data
We now show in general how to use a MD context, C, containing MD ontologies for
quality data specification and extraction wrt. a database instance D for schema S. Here,
Context C, as shown in Fig. 3, contains:
1. Nickname predicates R′ ∈ S ′ for predicates R of original schema S. In this case, the
R′ have the same extensions as in D, producing a material or virtual instance D′ within
C. For example, Measurements ′ ∈ S ′ is a nickname predicate for Measurements ∈ S,
whose initial contents (in D) is under quality assessment.
2. The core MD ontology, M. We assume that application dependent guidelines and
constraints are all represented as components of M.
    In our running example, PatientUnit, PatientWard , WorkingSchedules and Wor -
k ingT imes are categorical relations. UnitWard , DayTime are parent-child relations
in dimensions, Hospital and Time, resp. The followings are dimensional rules of ΣM :
   WorkingTimes(u, t; n, y) ← WorkingSchedules(u, d; n, y), DayTime(d, t).
  PatientUnit(u, t; p) ← PatientWard (w, d; p),DayTime(d, t),UnitWard (u, w).(9)
3. The set of quality predicates, P, with their definitions in non-recursive Datalog (pos-
sibly with negation, not ), in terms of categorical predicates in R and built-in predicates.
A quality predicate reflects an application dependent specific quality concern.
    Now, TakenByNurse and TakenWithTherm are quality predicates with defini-
tions on top of M, addressing quality concerns about the nurses and the thermometers:

                                             6
  TakenByNurse(t, p, n, y) ← WorkingTimes (u, t; n, y), PatientUnit(u, t; p). (10)
  TakenWithTherm(t, p, b) ← PatientUnit (u, t; p), u = Standard, b = B1.               (11)
                                                                           q
    Furthermore, and not strictly inside context C, there are predicates R1 , ..., Rn ∈ S q ,
                                                                                    q

the quality versions of R1 , ..., Rn ∈ S. They are defined through quality data extraction
rules written in non-recursive Datalog, in terms of nickname predicates (in S ′ ), categor-
ical predicates (in R), and the quality predicates (in P), and built-in predicates. Their
definitions (the αqi in Fig. 3) impose conditions corresponding to user’s data quality
profiles, and their extensions form the quality data (instance).
    The quality version of Measurements is Measurement q ∈ S q , with the following
definition, which captures the intended, clean contents of the former:
     Measurement q (t, p, v)←Measurement ′ (t, p, v), TakenByNurse(t, p, n, y), (12)
                                 TakenWithTherm(t, p, b), b = B1, y = certified.
     Quality data can be obtained from the interaction between the original source D
and the context C, in particular using the MD ontology M. For that, queries have to be
posed to the context, in terms of predicates S q , the quality versions of those of D. A
query could be as direct as asking, e.g. about the contents of predicate Measurement q
above, or a conjunctive query involving predicates S q .
     For example, this is the initial query asking for (quality) values for Tom Waits’ tem-
perature: Q(t, v) :Measurements(t, Tom Waits, v)∧Sep5-11:45 ≤ t ≤ Sep5-12:15,
which, in order to be answered, has to be first rewritten into, Qq (t, v) : Measurements q
(t , Tom Waits, v ) ∧ Sep5-11:45 ≤ t ≤ Sep5-12:15.
To answer this query, first (12) can be used, obtaining a contextual query: QC (t, v) :
Measurement ′ (t, p, v) ∧TakenByNurse(t, p, n, certified) ∧TakenWithTherm (t,
p, B1) ∧ p = Tom Waits ∧ Sep/5-11:45 ≤ t ≤ Sep/5-12:15.
     This query will in turn, use the contents for Measurement ′ coming from D, and the
quality predicate definitions (10) and (11), eventually leading to a conjunctive query
expressed in terms of Measurement ′ and MD predicates only, namely: QM (t, v) :
Measurement ′ (t, p, v) ∧ WorkingTimes (u, t; n, y) ∧ PatientUnit (u, t; p) ∧ u =
Standard ∧ y = certified ∧p = Tom Waits ∧Sep/5-11:45 ≤ t ≤ Sep/5-12:15.
     At this point, QA from a weakly-sticky ontology has to be performed. We know
that this can be done in polynomial time in data. However, there is still a need for
practical QA algorithms. Doing this goes beyond the scope of this paper. In Section 2.4,
we describe some ideas from [12] on the development and optimization of such an
algorithm.

2.4 Query Answering on the MD Ontology
We proposed a conjunctive query answering algorithm for weakly sticky programs to
be also applied to our MD ontology. The algorithm is based on the concepts of parsimo-
nious chase (pChase) and freezing nulls, as used for QA with shy Datalog, a fragment
of Datalog∃ [9].
    At a pChase step, a new atom is added only if a homomorphic atom is not already
in the chase. Freezing a null is promoting it to a constant (and keeping it as such in
subsequent chase steps). So, it cannot take (other) values under homomorphisms, which

                                             7
may create new pChase steps. Resumption of the pChase means freezing all nulls, and
continuing pChase until no more pChase steps are applicable.
    Query answering with shy programs has a first phase where the pChase runs until
termination. In a second phase, the pChase iteratively resumes for a number of times
that depends on the number of distinct ∃-variables in the query. This second phase
is required to properly deal with joins in the query. Our algorithm for weakly sticky
programs is similar, it has the same two phases, but a pChase step is modified: after
every application of a pChase step that generates nulls, the latter that appear in positions
with finite ranks are immediately frozen. The algorithm runs in polynomial-time in data.

3 Work Still to Be Done
We are investigating several extensions of the current work. Some of them are as fol-
lows: (1) Uncertain downward-navigation when tgds allow existentials on categorical
attributes. A parent in a category may have multiple children in the next lower category.
Under the assumption of complete categorical data, we know it is one of them, but
not which one, (2) Using the MD ontologies to fully capture the taxonomy-based data
model [10], (3) We may relax the assumption on complete categorical data. This brings
many new issues and problems that require investigation; from query answering to the
maintenance of structural semantic constraints, such as strictness and homogeneity, on
the HM model and our extension of it.
References
 [1] L. Bertossi, F. Rizzolo and J. Lei. Data Quality is Context Dependent. Proc. VLDB WS
     (BIRTE’10), Springer LNBIP 48, 2011, pp. 52-67.
 [2] A. Cali, D. Lembo and R. Rosati. On the Decidability and Complexity of Query Answering
     over Inconsistent and Incomplete Databases. Proc. PODS, 2003, pp. 260-271.
 [3] A. Cali, G. Gottlob and T. Lukasiewicz. Datalog± : A Unified Approach to Ontologies and
     Integrity Constraints. Proc. ICDT, 2009, pp. 14-30.
 [4] A. Cali, G. Gottlob, T. Lukasiewicz, B. Marnette and A. Pieris. Datalog± : A Family of
     Logical Knowledge Representation and Query Languages for New Applications. Proc.
     LICS, 2010, pp. 228-242.
 [5] A. Cali, G. Gottlob and A. Pieris. Towards More Expressive Ontology Languages: The
     Query Answering Problem. Artificial Intelligence, 2012, 193:87-128.
 [6] C. Hurtado and A. Mendelzon. OLAP Dimension Constraints. Proc. PODs, 2002, pp.
     169-179.
 [7] T. Imielinski and W. Lipski Incomplete Information in Relational Databases. Journal of
     the ACM, 1984, 31(4):761-791.
 [8] M. Lenzerini. Data Integration: A Theoretical Perspective. Proc. PODS, 2002, pp. 233-246.
 [9] N. Leone, M. Manna, G. Terracina, and P. Veltri. Efficiently Computable Datalog∃ Pro-
     grams. Proc. KR, 2012, pp. 1323.
[10] D. Martinenghi and R. Torlone. Querying Context-Aware Databases. Proc. FQAS, 2009,
     pp. 76-87.
[11] M. Milani, L. Bertossi and S. Ariyan. Extending Contexts with Ontologies for Multidimen-
     sional Data Quality Assessment. Proc. ICDEW (DESWeb), 2014, pp. 242 - 247.
[12] M. Milani and L. Bertossi. Tractable Query Answering and Optimization for Extensions of
     Weakly-Sticky Datalog±. Proc. AMW, 2015.
[13] M. Milani and L. Bertossi. Ontology-Based Multidimensional Contexts with Applications
     to Quality Data Specification and Extraction. Submitted, Accepted in RuleML, 2015.


                                              8