The Ontological Multidimensional Data Model
(extended abstract)

Leopoldo Bertossi* and Mostafa Milani**

* Carleton Univ., School of Computer Science, Canada. bertossi@scs.carleton.ca
** McMaster Univ., Dept. Computing and Software, Canada. mmilani@mcmaster.ca

Abstract. We briefly present OMD, a model of multidimensional data that uses ontologies written in Datalog±, an extension of the classical declarative language Datalog for relational databases.

We present the Ontological Multidimensional Data Model (OMD) as an ontological, Datalog±-based [3] extension of the Hurtado-Mendelzon (HM) model for multidimensional data [5]. For lack of space, we use a running example to illustrate the main elements of an OMD model.

[Figure 1: the Hospital dimension with categories Ward, Unit, Institution, AllHospital (bottom to top); the Temporal dimension with categories Time, Day, Month, Year, AllTemporal (bottom to top); and the categorical relations WorkingSchedules and Shifts shown next, linked to the categories Unit, Ward and Day. The figure also carries the labels η, σ1 and σ2. A "?" stands for a null value.]

WorkingSchedules
Unit       Day          Nurse  Specialization
Terminal   Sep/5/2016   Cathy  Cardiac Care
Intensive  Nov/12/2016  Alan   Critical Care
Standard   Sep/6/2016   Helen  ?
Intensive  Aug/21/2016  Sara   ?

Shifts
Ward  Day          Nurse  Shift
W4    Sep/5/2016   Cathy  Noon
W1    Sep/6/2016   Helen  Morning
W3    Nov/12/2016  Alan   Evening
W3    Aug/21/2016  Sara   Noon
W2    Sep/6/2016   Helen  ?

Fig. 1. An OMD model with categorical relations, dimensional rules, and constraints

An OMD model has a database schema R^M = ℋ ∪ R^c, where ℋ is a relational schema with multiple dimensions, with sets 𝒦 of unary category predicates and sets ℒ of binary child-parent predicates; and R^c is a set of categorical predicates.

Example: Figure 1 shows the Hospital and Temporal dimensions. An instance of the former has the category extensions AllHospital = {allHospital}, Institution = {H1, H2}, Unit = {standard, intensive, terminal}, and Ward = {W1, W2, W3, W4}. 𝒦 contains predicates Ward(·), Unit(·), Institution(·), etc. Instance D^H gives them extensions, e.g. Ward = {W1, W2, W3, W4}. ℒ contains, e.g. WardUnit(·, ·), with extension: WardUnit = {(W1, standard), (W2, standard), (W3, intensive), (W4, terminal)}. In the middle of Figure 1, categorical relations are associated to dimension categories. □

Attributes of categorical predicates are either categorical, whose values are members of dimension categories, or non-categorical, taking values from arbitrary domains. Categorical predicates are represented in the form R(C_1, ..., C_m; N_1, ..., N_n), with categorical attributes before ";" and non-categorical attributes after it.

The extensional data, i.e. the instance for the schema R^M, is I^M = D^H ∪ I^c, where D^H is a complete instance for the dimensional subschema ℋ containing the category and child-parent predicates; and sub-instance I^c contains possibly partial, incomplete extensions for the categorical predicates, i.e. those in R^c.

Schema R^M comes with basic, application-independent semantic constraints, listed below.

1. Dimensional child-parent predicates must take their values from categories. Accordingly, if child-parent predicate P ∈ ℒ is associated to category predicates K, K′ ∈ 𝒦, in this order, we introduce inclusion dependencies (IDs) as Datalog± negative constraints (ncs): P(x, x′), ¬K(x) → ⊥, and P(x, x′), ¬K′(x′) → ⊥. (The ⊥ symbol denotes an always false propositional atom.) We do not represent them as Datalog± tuple-generating dependencies (tgds) P(x, x′) → K(x), etc., because we reserve tgds for possibly incomplete predicates (in their RHSs).

2. Key constraints on dimensional child-parent predicates P ∈ ℒ, as equality-generating dependencies (egds): P(x, x_1), P(x, x_2) → x_1 = x_2.

3. The connections between categorical attributes and the category predicates are specified by means of ncs. For a categorical predicate R whose categorical attribute x ∈ x̄ takes values in category K, we have the nc R(x̄; ȳ), ¬K(x) → ⊥.

Example: The categorical attributes Unit and Day of categorical predicate WorkingSchedules(Unit, Day; Nurse, Specialization) in R^c are connected to the Hospital and Temporal dimensions, resp., as captured by the IDs WorkingSchedules[1] ⊆ Unit[1], and WorkingSchedules[2] ⊆ Day[1].
The former is written in Datalog± as WorkingSchedules(u, d; n, t), ¬Unit(u) → ⊥. For the Hospital dimension, one of the IDs for predicate WardUnit is WardUnit[2] ⊆ Unit[1], which is expressed by the nc: WardUnit(w, u), ¬Unit(u) → ⊥. The key constraint of WardUnit is captured by the egd: WardUnit(w, u), WardUnit(w, u′) → u = u′. □

The OMD model allows us to build multidimensional ontologies, O^M. In addition to an instance I^M for a schema R^M, they include the set Ω^M of basic constraints as in 1.-3. above, a set Σ^M of dimensional rules (those in 4. below), and a set κ^M of dimensional constraints (in 5. below). The latter two are application-dependent, and they are expressed in the Datalog± language associated to schema R^M.

4. Dimensional rules as Datalog± tgds:

  R_1(x̄_1; ȳ_1), ..., R_n(x̄_n; ȳ_n), P_1(x_1, x′_1), ..., P_m(x_m, x′_m) → ∃ȳ′ R_k(x̄_k; ȳ).

Here, the R_i are categorical predicates, the P_j are child-parent predicates, ȳ′ ⊆ ȳ, x̄_k ⊆ x̄_1 ∪ ... ∪ x̄_n ∪ {x_1, ..., x_m, x′_1, ..., x′_m}, and ȳ ∖ ȳ′ ⊆ ȳ_1 ∪ ... ∪ ȳ_n. Repeated variables in bodies (join variables) appear only in categorical positions in categorical relations and in child-parent predicates. Existential variables appear only in non-categorical attributes.

5. Dimensional constraints, as egds or ncs:

  R_1(x̄_1; ȳ_1), ..., R_n(x̄_n; ȳ_n), P_1(x_1, x′_1), ..., P_m(x_m, x′_m) → z = z′, and
  R_1(x̄_1; ȳ_1), ..., R_n(x̄_n; ȳ_n), P_1(x_1, x′_1), ..., P_m(x_m, x′_m) → ⊥.

Here, R_i ∈ R^c, P_j ∈ ℒ, and z, z′ ∈ ⋃ x̄_i ∪ ⋃ ȳ_j. Some of the lists in the bodies may be empty, i.e. n = 0 or m = 0, which also allows us to represent classical constraints on categorical relations, e.g. keys or FDs.

Example: The left-hand side of Figure 1 shows dimensional constraint η on categorical relation WorkingSchedules, which is linked to the Temporal dimension via the Day category. It says: “No personnel was working in the Intensive care unit in January”, i.e.

  η: WorkingSchedules(intensive, d; n, s), DayMonth(d, jan) → ⊥.
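The basic constraints of 1.-3. can be checked mechanically on the example's Hospital dimension. The following is a minimal Python sketch, not part of the OMD proposal itself: the function names nc_violations and key_violations are hypothetical, relations are encoded as sets of tuples, and constants are lowercased as in the rules.

```python
# Extensions from the running example (Hospital dimension); constants lowercased.
ward = {"W1", "W2", "W3", "W4"}
unit = {"standard", "intensive", "terminal"}
ward_unit = {("W1", "standard"), ("W2", "standard"),
             ("W3", "intensive"), ("W4", "terminal")}
# Extensional (non-null) WorkingSchedules tuples: (unit, day, nurse, specialization).
working_schedules = {
    ("terminal", "sep/5/2016", "cathy", "cardiac care"),
    ("intensive", "nov/12/2016", "alan", "critical care"),
}

def nc_violations(relation, position, category):
    """Tuples violating an nc  R(..x..), not K(x) -> false  at `position`."""
    return {t for t in relation if t[position] not in category}

def key_violations(child_parent):
    """Children violating the egd  P(x, x1), P(x, x2) -> x1 = x2."""
    parents = {}
    for child, parent in child_parent:
        parents.setdefault(child, set()).add(parent)
    return {child for child, ps in parents.items() if len(ps) > 1}

checks = {
    "WardUnit[1] in Ward": nc_violations(ward_unit, 0, ward),
    "WardUnit[2] in Unit": nc_violations(ward_unit, 1, unit),
    "WorkingSchedules[1] in Unit": nc_violations(working_schedules, 0, unit),
    "WardUnit key": key_violations(ward_unit),
}
for name, violations in checks.items():
    print(name, "OK" if not violations else violations)
```

On this instance every check comes back empty. Adding a pair such as (W1, intensive) to WardUnit would make key_violations report W1, since that ward would then roll up to two units.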

Dimensional tgd σ1 in Figure 1, given by

  σ1: Shifts(w, d; n, s), WardUnit(w, u) → ∃t WorkingSchedules(u, d; n, t),

says that “If a nurse has shifts in a ward on a specific day, he/she has a working schedule in the unit of that ward on the same day”. Applying σ1 generates, from the Shifts relation, new tuples for relation WorkingSchedules, with null values for the Specialization attribute, due to the existential variable. Existential rules like this one (and also egds and ncs) make us depart from classical Datalog, taking us into Datalog±. Relation WorkingSchedules may be incomplete, and new, possibly virtual, entries can be produced for it, e.g. the shaded ones showing Helen and Sara working for the Standard and Intensive units, resp. This is done by upward-navigation and data propagation through the dimension hierarchy.

Constraint η is expected to be satisfied both by the initial extensional tuples for WorkingSchedules and by its tuples generated through σ1, i.e. by its non-shaded and shaded tuples in Figure 1, resp. In this example, η is satisfied. Notice that WorkingSchedules refers to the Day category of the Temporal dimension, whereas η involves the Month category. Hence, checking η requires upward-navigation through the Temporal dimension. The Hospital dimension is also involved in the satisfaction of η: the tgd σ1 may generate new tuples for WorkingSchedules, by upward-navigation from Ward to Unit.

Furthermore, we have an additional tgd σ2 that can be used with WorkingSchedules to generate data for categorical relation Shifts (the shaded tuple in it is one of them):

  σ2: WorkingSchedules(u, d; n, t), WardUnit(w, u) → ∃s Shifts(w, d; n, s).

It reflects the institutional guideline stating that “If a nurse works in a unit on a specific day, he/she has shifts in every ward of that unit on the same day”. Accordingly, σ2 relies on downward-navigation for tuple generation, from the Unit category level down to the Ward category level.
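
To make the upward propagation through σ1 and the check of η concrete, here is a small Python sketch over the extensional tuples of Figure 1. Several things in it are assumptions made only for illustration: the helper names, the encoding of labeled nulls as strings "_null1", "_null2", ..., the choice to skip a rule application when the head is already satisfied by an existing tuple, and the function month_of, which stands in for the DayMonth extension (not given in the text) by reading the month off the day label.

```python
from itertools import count

# Extensional tuples from Figure 1, constants lowercased as in the rules.
shifts = {
    ("W4", "sep/5/2016", "cathy", "noon"),
    ("W1", "sep/6/2016", "helen", "morning"),
    ("W3", "nov/12/2016", "alan", "evening"),
    ("W3", "aug/21/2016", "sara", "noon"),
}
working_schedules = {
    ("terminal", "sep/5/2016", "cathy", "cardiac care"),
    ("intensive", "nov/12/2016", "alan", "critical care"),
}
# WardUnit as a mapping; the key constraint guarantees one unit per ward.
ward_unit = {"W1": "standard", "W2": "standard",
             "W3": "intensive", "W4": "terminal"}

_fresh = count(1)  # source of labeled nulls: "_null1", "_null2", ...

def apply_sigma1(shifts, ward_unit, schedules):
    """Apply sigma1 once to every Shifts tuple; return only the new tuples."""
    generated = set()
    for w, d, n, _s in sorted(shifts):
        u = ward_unit[w]  # upward navigation: Ward -> Unit
        known = schedules | generated
        # Assumption of this sketch: skip heads that are already satisfied.
        if not any(t[:3] == (u, d, n) for t in known):
            generated.add((u, d, n, f"_null{next(_fresh)}"))
    return generated

def month_of(day):
    """Hypothetical stand-in for DayMonth: 'sep/5/2016' -> 'sep'."""
    return day.split("/")[0]

def eta_violations(schedules):
    """Tuples matching the body of eta: intensive unit on a January day."""
    return {t for t in schedules
            if t[0] == "intensive" and month_of(t[1]) == "jan"}

new_tuples = apply_sigma1(shifts, ward_unit, working_schedules)
all_schedules = working_schedules | new_tuples
print(sorted(t[:3] for t in new_tuples))
print(eta_violations(all_schedules))  # empty set: eta holds on this instance
```

The two generated tuples are those for Helen in the standard unit and Sara in the intensive unit, each with a fresh null in the Specialization position, and no tuple matches the body of η.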

If we have a categorical relation Therm(Ward, Thertype; Nurse), with Ward and Thertype categorical attributes (the latter for an Instrument dimension), the following is an egd saying that “All thermometers in a unit are of the same type”:

  Therm(w, t; n), Therm(w′, t′; n′), WardUnit(w, u), WardUnit(w′, u) → t = t′.

Notice that our ontological language allows us to impose a condition at the Unit level without having Unit as an attribute in the categorical relation. The existential variables in dimensional rules, such as t and s in σ1 and σ2, resp., make up for the missing non-categorical attributes Specialization and Shift in WorkingSchedules and Shifts, resp. □

Dimensional tgds can be used for upward- or downward-navigation (or data generation) depending on the joins in the body. A one-step direction is determined by the difference of levels of the dimension categories appearing (as attributes) in the joins. Multi-step navigation, between a category and an ancestor or descendant category, can be captured through a chain of joins with adjacent child-parent dimensional predicates in the body of a tgd, e.g. propagating doctors at the ward level all the way up to the institution level:

  WardDoc(ward; na, sp), WardUnit(ward, unit), UnitInst(unit, ins) → HospDoc(ins; na, sp).

Example: Rule σ2 supports downward tuple-generation. When enforcing it on a tuple WorkingSchedules(u, d; n, t), via category member u (for Unit), a tuple for Shifts is generated for each child w of u in the Ward category for which the body of σ2 is true. For example, chasing σ2 with the third tuple in WorkingSchedules generates two new tuples in Shifts: Shifts(W2, sep/6/2016, helen, ζ) and Shifts(W1, sep/6/2016, helen, ζ′), with fresh nulls ζ and ζ′. The latter tuple is not shown in Figure 1, because it is dominated by the existing tuple Shifts(W1, sep/6/2016, helen, morning) in Shifts.
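
The chase step just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: the names apply_sigma2, dominated and is_null are hypothetical, the fresh nulls ζ and ζ′ are written "_z1" and "_z2", and the null in Helen's WorkingSchedules tuple is written "?" as in Figure 1. The last lines collect the wards in which Helen has a shift on Sep/6/2016 once old and new tuples are put together.

```python
from itertools import count

ward_unit = {("W1", "standard"), ("W2", "standard"),
             ("W3", "intensive"), ("W4", "terminal")}
shifts = {
    ("W4", "sep/5/2016", "cathy", "noon"),
    ("W1", "sep/6/2016", "helen", "morning"),
    ("W3", "nov/12/2016", "alan", "evening"),
    ("W3", "aug/21/2016", "sara", "noon"),
}
# Third WorkingSchedules tuple; "?" marks its null Specialization value.
helen_schedule = ("standard", "sep/6/2016", "helen", "?")

_fresh = count(1)  # source of fresh labeled nulls "_z1", "_z2", ...

def is_null(value):
    return value.startswith("_z")

def apply_sigma2(schedule, ward_unit):
    """One Shifts tuple per child ward of the schedule's unit (downward navigation)."""
    u, d, n, _t = schedule
    return [(w, d, n, f"_z{next(_fresh)}")
            for w, parent in sorted(ward_unit) if parent == u]

def dominated(t, relation):
    """True if some tuple in `relation` agrees with t on all non-null positions."""
    return any(all(is_null(a) or a == b for a, b in zip(t, e))
               for e in relation)

new_shifts = apply_sigma2(helen_schedule, ward_unit)
kept = [t for t in new_shifts if not dominated(t, shifts)]

# Wards where Helen has a shift on Sep/6/2016, over old and new tuples.
wards = sorted({w for w, d, n, _s in shifts | set(new_shifts)
                if (d, n) == ("sep/6/2016", "helen")})
print([t[0] for t in new_shifts])  # ['W1', 'W2']
print([t[0] for t in kept])        # ['W2']
print(wards)                       # ['W1', 'W2']
```

Filtering out dominated tuples changes what is displayed, not the wards that are returned: W1 is already supported by the extensional tuple with shift morning.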

With the old and new tuples we obtain the answers to the query about Helen’s wards on Sep/6/2016: Q′(w): ∃s Shifts(w, sep/6/2016, helen, s). They are W1 and W2.

In contrast, the join between Shifts and WardUnit in σ1 enables upward-navigation, and the generation of only one tuple for WorkingSchedules from each tuple in Shifts, because each Ward member has at most one Unit parent. □

We can see that the OMD data model is an ontological model that goes far beyond classical multidimensional data models. For example, the HM model [5], which is subsumed by OMD, does not include general tgds, egds, or ncs. Starting from our relational reconstruction of the HM model, all these elements, plus the data and queries, are seamlessly integrated into a uniform logico-relational framework. OMD supports general, possibly incomplete categorical relations, and not only complete “fact tables” linked to base (or bottom) categories.

Furthermore, the constraints considered in the HM model are specific to the dimensional structure of data; most prominently, they are meant to guarantee summarizability (i.e. correct aggregation, avoiding double-counting). Specifically, we find constraints enforcing strictness and homogeneity [5]. The former requires that every category element rolls up to a single element in a parent category, which in OMD can be expressed by egds. The latter requires that category elements have parent elements in parent categories, which in OMD can be expressed by tgds. (Cf. [10, sec. 4.3] for more details.)

The OMD model enables ontology-based data access (OBDA) [6] and allows for the tight integration of conceptual models (e.g. an ER model expressed in logical terms) and the relational model of data, while representing and using dimensional structures and data. Cf. [7, 2] for applications of the OMD model to quality data specification and extraction.

The ontologies of the OMD model have good computational properties [2, 7].
Actually, they belong to the class of weakly-sticky Datalog± programs [4], for which conjunctive query answering (CQA) can be done in polynomial time in the size of the data. Algorithms for CQA have been proposed [8, 9], as well as optimizations thereof [8] based on magic-sets techniques [1].

Acknowledgements: Research supported by NSERC Discovery Grant #06148.

References

[1] M. Alviano, N. Leone, M. Manna, G. Terracina, and P. Veltri. Magic-Sets for Datalog with Existential Quantifiers. Proc. Datalog 2.0, Springer LNCS 7494, 2012, pp. 31-43.
[2] L. Bertossi and M. Milani. Ontological Multidimensional Data Models and Contextual Data Quality. Journal submission, 2017. Posted as CoRR arXiv paper cs.DB/1704.00115.
[3] A. Calì, G. Gottlob, and T. Lukasiewicz. Datalog±: A Unified Approach to Ontologies and Integrity Constraints. Proc. ICDT, 2009, pp. 14-30.
[4] A. Calì, G. Gottlob, and A. Pieris. Towards more Expressive Ontology Languages: The Query Answering Problem. Artificial Intelligence, 2012, 193:87-128.
[5] C. Hurtado and A. Mendelzon. OLAP Dimension Constraints. Proc. PODS, 2002, pp. 169-179.
[6] M. Lenzerini. Ontology-Based Data Management. Proc. AMW 2012, CEUR Proceedings, Vol. 866, pp. 12-15.
[7] M. Milani and L. Bertossi. Ontology-Based Multidimensional Contexts with Applications to Quality Data Specification and Extraction. Proc. RuleML, Springer LNCS 9202, 2015, pp. 277-293.
[8] M. Milani and L. Bertossi. Extending Weakly-Sticky Datalog±: Query-Answering Tractability and Optimizations. Proc. RR, Springer LNCS 9898, 2016, pp. 128-143.
[9] M. Milani, L. Bertossi, and A. Calì. A Hybrid Approach to Query Answering under Expressive Datalog±. Proc. RR, Springer LNCS 9898, 2016, pp. 144-158.
[10] M. Milani. Multidimensional Ontologies for Contextual Quality Data Specification and Extraction. PhD Thesis, Carleton University, January 2017. http://people.scs.carleton.ca/~bertossi/papers/mostafaFinal.pdf