Context based Data Quality Rules for Multidimensional Data

Camila Sanz
Instituto de Computación, Facultad de Ingeniería, Universidad de la República
Supervised by Adriana Marotta
csanz@fing.edu.uy (C. Sanz)

Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022, Sydney, Australia. Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, http://ceur-ws.org), ISSN 1613-0073.

Abstract

Data quality evaluation and improvement is an important asset in every system, particularly in systems whose aim is to analyse data, such as those based on multidimensional data models. The main approach to data quality found in the literature is fitness for use, which means that data quality cannot be evaluated or improved without taking context into account. Evaluating data quality over systems with a multidimensional model is clearly context dependent. However, the solutions found in the literature for context-based data quality management are not general enough, so the problem needs to be redefined for every particular case. In this PhD proposal we aim to reach a formal definition of each of these concepts and of their interactions. As a result, given a particular multidimensional model and its context, it would be possible to generate a set of data quality rules in a simple way.

Keywords: data quality, multidimensional data, context

1. Introduction

Data Quality (DQ) is a multifaceted concept, since there are many aspects that can be taken into account when trying to define and measure the quality of data. These aspects are called DQ dimensions, while DQ metrics are defined in order to measure them [1].

Data Warehouse (DW) systems are decision-oriented information systems and, as such, a fundamental asset for decision making. DWs are populated with data extracted from heterogeneous sources, which is transformed to be queried and analyzed from a multidimensional perspective, allowing aggregations by different criteria. The multidimensional data model is typically used for designing DWs and for doing analysis on top of them. The main concepts of this model are dimensions, hierarchies, facts and cubes, together with, as an essential tool, a set of multidimensional operations that allow navigating and aggregating data.

In these systems DQ is an unavoidable issue, since it is compromised at different moments of the DW lifecycle, such as ETL and multidimensional operations. Specific DQ problems appear due to the characteristics of the multidimensional model described above. DQ management allows DQ improvement when it is possible, and also DQ awareness by the user, ensuring that decision making is not biased by poor quality data.

There is consensus in the literature about the importance of considering context in DQ management. The well-known fitness for use approach has been widely adopted [1], accepting that DQ cannot be evaluated or improved while ignoring information about the context where the data will be used. In the case of DWs, context can be useful for compensating for missing data, correcting errors, detecting inconsistencies, and many other quality-related tasks.

As DQ is context dependent, DQ dimensions and metrics are specific to each domain and use case; therefore, solutions are highly dependent on each particular case. Formalization is needed to provide an abstraction level that gives generality to solutions, allowing them to be instantiated for each particular case.

Although there has been some progress in research on context-oriented DQ for DW, we believe that there is still a deep gap to be closed before arriving at well-formalized, integral and robust solutions. Few works propose formalizations for these concepts, and they do not address DQ as an integral discipline that includes the management of DQ dimensions and metrics and differentiates the tasks in DQ management, mainly evaluation and improvement.

This work is a step towards the formalization of DW, DQ and context in a general way, so that context-oriented DQ in DW can be managed for particular cases in a robust and systematic way. Among DQ dimensions, we focus on consistency, accuracy and completeness [1], as they illustrate very common DW quality problems.

The rest of the document is structured as follows: in section 2 we mention some works related to DW, DQ and context, focusing on existing formalizations; in section 3 we present the PhD proposal in terms of the problem and the solution approach; and in section 4 we conclude and mention the next steps to be followed.

2. Related Work

As said before, DQ is context dependent [2, 3, 4, 5, 6, 7], as it is perceived differently according to the application domain of the data, the user or even the location in which it is being used. For this reason, the context becomes an essential part of the definition of DQ. On its own, data context is an ambiguous concept and in general it is specifically defined for each particular application. Commonly, it refers to user and location aspects [8, 9, 4, 10], but many other aspects can be considered for its definition.

The rest of this section focuses on the formalization of context and on the solutions for DQ over multidimensional data using context, as these are the main aspects addressed in our work.

Context Formalization. Considering works that propose formal models for contexts, two interesting approaches were found: using ontologies [11, 12, 13, 14, 15] and using first order predicates [16].

When using ontologies, some works present formal specifications that are entirely domain dependent. In [12], an access control mechanism is built from the context and the user profile, both modeled using ontologies. The main drawback of this approach is that it only considers the user context and that it is proposed for a specific domain. In [11] the context is specified in a more general way. The authors identify components that may belong to any context: people, activity, location and computational entity. Each specific domain is modeled with a particular ontology that is merged with the identified components. A similar approach is presented in [15], where a context model is built from different elements that should be present, such as the local context or the surrounding context. Each of these concepts is later mapped to domain ontologies in order to contextualize DWs. Also based on the idea of mappings, in [14] domain ontologies are formalized and used to give context to a particular user, who is modeled using an ontology. To do so, a formal mapping between both ontologies is proposed. Finally, with the aim of obtaining context from an ontology, a mathematical model is proposed in [13]. The authors' approach is to calculate the distance between different concepts in an ontology in order to determine the context of particular data.

In [16] first order predicates are used to formalize the context. However, the authors do not present a general context formalization that can be instantiated for particular cases. As in [12], the main problem with the approach is the lack of generality.

Context-based DQ over Multidimensional Data. Existing work about contexts, DW and DQ in general is analyzed in [17], where an exhaustive literature review is presented. The authors show that, although there are many works that consider the context for DQ, and many that consider the context for DW, very few address the problem of managing DQ in DW considering the context (i.e., relating the three issues simultaneously).

In [18] a model in which DQ is addressed in the ETL process is presented. Domain ontologies are used to model the process, and business rules are mapped to those ontologies through different quality metrics. Even though it is not explicitly stated, both ontologies and business rules are considered as context.

Both [19] and [20] are based on [21], where Hurtado and Mendelzon's multidimensional model for DW is presented, and make minor adaptations to it according to the specific needs of each work. Although the main goal of both is context-based DQ evaluation, their use of the multidimensional data model differs.

In [19], the multidimensional model is used to model a part of the context that includes relationships between its components. To give context to relational databases, the authors adapt the model of [21] to a relational schema and combine it with quality predicates.

On the other hand, in [20] quality evaluation over a multidimensional model is specified with logical rules that include context. In this case, the specification presented in [21] is adapted to be used in the definition of the rules.

Although a lot of work has been done to formalize the context for a dataset, two main aspects remain unsolved: the context formalizations are not general enough to be instantiated for different particular cases, and there is very little work on formally specifying the context for DQ management in DW. Our work proposes a general formalization of a DW and its context that enables instantiation for any particular case and, on top of this, it proposes the definition and formalization of context-aware DQ rules for evaluation and improvement of the DW quality.

Our work shares some aspects with many of the works presented above. We draw particular inspiration from the multidimensional model proposed in [19] and the quality rules idea presented in [20]. We find especially relevant the mapping ideas presented in [14, 15] and the idea of determining the context of a dataset within a particular ontology presented in [13].

3. Thesis Proposal

This section presents the research problem addressed by this work and the solution approach. First, we illustrate the problem with an example; then we state the problems to solve in a general way; and finally we present the main aspects of our approach through specific parts of the solution and examples of our proposal, trying to cover the whole picture of the proposed solution.

Figure 1: Data Quality Problems

3.1. Running Example

DQ in DW systems involves typical DQ management over data attributes, but also includes problems related to the results of multidimensional operations. To illustrate both types of problems we use an example, whose main concepts are shown in Figure 1.

The example refers to a Sales DW, implemented in a relational star schema, which consists of a fact table Sales, related to many dimension tables with data about books, authors, cities, dates, etc. Figure 1 shows the Sales Fact Table and the Book Dimension Table. Additionally, the figure presents the conceptual representation of one hierarchy of the Book dimension. This hierarchy is composed of three categories, named Book, Subgenre and Genre. We consider all hierarchies to be homogeneous [21], i.e., every member of a category has exactly one parent in the category above.

Different DQ problems may arise in this system, such as the ones shown in Figure 1:

DQ problems in attribute data. Rectangle 1 shows both an inconsistency between the attributes language_id and language_name and a semantic accuracy problem, because “Harry Potter and the Sorcerer’s Stone” is written in English. Rectangle 2 shows a syntactic accuracy problem: it should say “High Fantasy” instead of “High Fantsy”.

Summarizability problem. Rectangle 3 shows a summarizability problem [22] over the Book dimension. Ideally, a roll-up from Book to Genre and the composition of the roll-up operations from Book to Subgenre and from Subgenre to Genre should return the same result. However, due to a DQ problem they may not. In the Book Dimension Table, the book with id 1 does not have a value in the subgenre attribute. This means that a roll-up from Book to Subgenre will lose information, and some sales in the Sales Fact Table will not be considered when the roll-up from Subgenre to Genre takes place. However, when the roll-up is done directly from Book to Genre, no information is lost.

3.2. Research Problem

The research problem addressed by this work is the definition of formal rules for DQ assessment and improvement for a DW, taking the context into account.

In order to tackle this problem, we state the following sub-problems to be solved: (i) formal definitions for both DW and context, which allow the instantiation of any particular DW or context, (ii) definition of a mechanism for the interaction between DW and context, enabling the use of different formal languages to represent each one, (iii) definition and formalization of DQ assessment and improvement rules for the DQ dimensions accuracy, consistency and completeness [1], and (iv) implementation of the solution, integrating all the components in a single system.

In order to test and validate the solution, a real use case consisting of a particular DW and its context should be designed and implemented. Afterwards, metrics and cleaning tasks for consistency, accuracy and completeness should be implemented as instantiations of the proposed DQ rules. Finally, we should compare the results obtained with our solution against the results obtained with an analogous solution that does not consider context in DQ evaluation and improvement.

3.3. Approach

We use the formalization presented by Hurtado and Mendelzon [21] to formalize the DW, making some minor extensions and modifications in order to adapt the model to our goals.

The context is modeled through domain ontologies: given an OWL ontology $O$, we consider its classes, named $Cl = \{Cl_1, \dots, Cl_c\}$; its object properties, named $OP = \{OP_1, \dots, OP_{op}\}$, where $dom(OP_j)$ and $range(OP_j)$ are the domain and range of $OP_j$; and its data properties $DP = \{DP_1, \dots, DP_{dp}\}$, where $dom(DP_j)$ is a class and $range(DP_j)$ is a data type.

Mappings are defined as a mechanism for the interaction between DW and context (sub-problem (ii) of the previous section). They are ternary relations, where the first argument is the DW element, the second argument is the ontology element and the third is a Boolean that indicates whether the mapping is total, which means that both the element of the DW and the element of the context represent the same real world entity. We introduce as an example the definition of mappings for Dimensions and Categories.

Dimensions: $MapDim \subseteq \{\mathcal{S}[1], \dots, \mathcal{S}[n]\} \times Cl \times \{true, false\}$ maps DW dimensions, using the specification taken from [21], to ontology classes.

Categories: $MapCat \subseteq (\mathcal{C}_1 \cup \dots \cup \mathcal{C}_n) \times (Cl \cup DP) \times \{true, false\}$ maps DW categories, using the specification from [21], either to ontology classes or to ontology data properties. If a category is mapped to a data property $dp$, then there must exist a mapping between the class $dom(dp)$ and either the dimension to which the category belongs or another related category of the same dimension.

Running Example. The ontologies chosen to give context to the Book dimension presented in section 3.1 are “The British National Library” ontology (http://www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf) and “The Book Vocabulary Metadata” ontology (http://www.ebusiness-unibw.org/ontologies/opdm/book.html). Both ontologies represent information about books and other aspects related to them, such as authors or languages.

Figure 2: Book Dimension Mapping Example

Figure 2 shows the ontologies that are mapped to the Book dimension. For example, from the “British National Library” ontology we map the $Book$ category to the bibo:Book class; this mapping is formalized as $MapCat(book, bibo{:}Book, true)$. In this case the mapping is total because both the category Book and the ontology class bibo:Book represent a book in the real world. This connection between both ontologies is represented in Figure 2 by the dotted line.

Mappings are fundamentally used to define the context of interest. Once the DW elements are located in the chosen ontologies, the context can be defined as any part of the ontologies that includes them. This means that the context can be either minimal, including only the mapped classes and the ones related to them, or extended, in which case it includes more classes and consequently more information.

Rules for DQ metrics are defined considering the DW model, the context and the mappings. A set of rules for syntactic accuracy of the category $name$ of the Book dimension, according to the property dct:title of the “British National Library” ontology, is presented in equations 1 and 2, where in the predicate $SyntAcc(b, n)$, $b$ is a particular book and $n \in \{0, 1\}$ is the result of the metric.

$$b \in Book\ Dimension \wedge b.name \in range(dct{:}title) \rightarrow SyntAcc(b, 1) \tag{1}$$

$$b \in Book\ Dimension \wedge b.name \notin range(dct{:}title) \rightarrow SyntAcc(b, 0) \tag{2}$$

The complete formalizations of the concepts presented above are implemented in Python, using pyDatalog (https://pypi.org/project/pyDatalog/) for managing the DW model and DQ rules and Owlready2 (https://pypi.org/project/Owlready2/) for managing ontologies.

4. Conclusions and Next Steps

The main strategy of our approach is based on the use of ontologies and Datalog and on the interaction between them, such that their reasoning power can be exploited for DQ rules. To the best of our knowledge, this approach has not been used before for this kind of solution.

So far we have completed first versions of the literature review, of a formalization of the DW based on the model of [21], and of a definition and formalization of the context based on domain ontologies. Following these first steps we proposed a mapping between the DW and the context and, along with it, a way of managing the context scope, that is, of determining how much of the domain is taken into account to give context to a DW. We worked on the implementation of each of the formalized solutions. Finally, we defined and implemented a simple DQ metric for the syntactic accuracy dimension in order to test the viability of the proposed solution.

The main focus of our ongoing work is to reach a level of abstraction in the formalization of the DW, the context and their interactions that makes it possible to evaluate certain DQ dimensions for any DW in any context. Currently, we are working on the definition and formalization of complex and generic DQ rules for consistency, completeness and accuracy.

Next steps will concentrate on the implementation of the complete solution based on the proposed formalizations, which will allow the definition of any DW, context and set of DQ rules, as well as on the application of the solution to a real case study.

References

[1] C. Batini, M. Scannapieco, Data and Information Quality, Data-Centric Systems and Applications, Springer International Publishing, Cham, 2016. URL: http://link.springer.com/10.1007/978-3-319-24106-7. doi:10.1007/978-3-319-24106-7.
[2] L. Bertossi, F. Rizzolo, L. Jiang, Data Quality Is Context Dependent, in: Enabling Real-Time Business Intelligence, Lecture Notes in Business Information Processing, Springer, Berlin, 2010, pp. 52–67. URL: https://link.springer.com/chapter/10.1007/978-3-642-22970-1_5. doi:10.1007/978-3-642-22970-1_5.
[3] M. Helfert, O. Foley, A Context Aware Information Quality Framework, in: 2009 Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology, 2009, pp. 187–193. doi:10.1109/COINFO.2009.65.
[4] A. L. McNab, D. A. Ladd, Information Quality: The Importance of Context and Trade-Offs, in: 2014 47th Hawaii International Conference on System Sciences, 2014, pp. 3525–3532. doi:10.1109/HICSS.2014.439.
[5] D. M. Strong, Y. W. Lee, R. Y. Wang, Data Quality in Context, Commun. ACM 40 (1997) 103–110. URL: http://doi.acm.org/10.1145/253769.253804. doi:10.1145/253769.253804.
[6] F. Serra, Handling Context in Data Quality Management, in: L. Bellatreche (Ed.), ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium, Communications in Computer and Information Science, Springer International Publishing, Cham, 2020, pp. 362–367. doi:10.1007/978-3-030-55814-7_32.
[7] W. Fan, Data quality: From theory to practice, SIGMOD Rec. 44 (2015) 7–18. URL: https://doi.org/10.1145/2854006.2854008. doi:10.1145/2854006.2854008.
[8] P. Dourish, What we talk about when we talk about context, Personal and Ubiquitous Computing 8 (2004) 19–30. URL: https://link.springer.com/article/10.1007/s00779-003-0253-8. doi:10.1007/s00779-003-0253-8.
[9] G. D. Abowd, A. K. Dey, P. J. Brown, N. Davies, M. Smith, P. Steggles, Towards a Better Understanding of Context and Context-Awareness, in: Handheld and Ubiquitous Computing, Lecture Notes in Computer Science, Springer, Berlin, 1999, pp. 304–307. URL: https://link.springer.com/chapter/10.1007/3-540-48157-5_29. doi:10.1007/3-540-48157-5_29.
[10] Y. W. Lee, Crafting Rules: Context-Reflective Data Quality Problem Solving, Journal of Management Information Systems 20 (2003) 93–119. URL: http://dx.doi.org/10.1080/07421222.2003.11045770. doi:10.1080/07421222.2003.11045770.
[11] X. H. Wang, D. Q. Zhang, T. Gu, H. K. Pung, Ontology based context modeling and reasoning using OWL, in: IEEE Annual Conference on Pervasive Computing and Communications Workshops, 2004. Proceedings of the Second, 2004, pp. 18–22. doi:10.1109/PERCOMW.2004.1276898.
[12] V. Luna, R. Quintero, M. Torres, M. Moreno-Ibarra, G. Guzmán, I. Escamilla, An ontology-based approach for representing the interaction process between user profile and its context for collaborative learning environments, Computers in Human Behavior 51 (2015) 1387–1394.
[13] C. A. Yeung, H. Leung, Formalizing typicality of objects and context-sensitivity in ontologies, in: Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, AAMAS ’06, Assoc. for Computing Machinery, NY, USA, 2006, pp. 946–948. URL: https://doi.org/10.1145/1160633.1160801. doi:10.1145/1160633.1160801.
[14] N. Hernandez, J. Mothe, C. Chrisment, D. Egret, Modeling context through domain ontologies, Information Retrieval 10 (2007) 143–172. URL: https://doi.org/10.1007/s10791-006-9018-0. doi:10.1007/s10791-006-9018-0.
[15] O. Barkat, S. Khouri, L. Bellatreche, N. Boustia, Bridging context and data warehouses through ontologies, in: Proceedings of the Symposium on Applied Computing, SAC ’17, Association for Computing Machinery, NY, USA, 2017, pp. 336–341. URL: https://doi.org/10.1145/3019612.3019838. doi:10.1145/3019612.3019838.
[16] A. Ranganathan, R. H. Campbell, An infrastructure for context-awareness based on first order logic, Personal and Ubiquitous Computing 7 (2003) 353–364.
[17] F. Serra, A. Marotta, Context-based Data Quality Metrics in Data Warehouse Systems, CLEI Electronic Journal 20 (2017) 3:1–3:23. URL: https://www.clei.org/cleiej/index.php/cleiej/article/view/22. doi:10.19153/cleiej.20.2.3.
[18] S. Abdellaoui, L. Bellatreche, F. Nader, A quality-driven approach for building heterogeneous distributed databases: The case of data warehouses, in: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2016, pp. 631–638. doi:10.1109/CCGrid.2016.79.
[19] L. Bertossi, M. Milani, Ontological Multidimensional Data Models and Contextual Data Quality, J. Data and Information Quality 9 (2018) 14:1–14:36.
[20] A. Marotta, A. Vaisman, Rule-Based Multidimensional Data Quality Assessment Using Contexts, in: Big Data Analytics and Knowledge Discovery, Lecture Notes in Computer Science, Springer, Cham, 2016, pp. 299–313.
[21] C. A. Hurtado, A. O. Mendelzon, OLAP Dimension Constraints, in: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, NY, USA, 2002, pp. 169–179.
[22] C. A. Hurtado, A. O. Mendelzon, Reasoning about Summarizability in Heterogeneous Multidimensional Schemas, in: J. Van den Bussche, V. Vianu (Eds.), Database Theory – ICDT 2001, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2001, pp. 375–389. doi:10.1007/3-540-44503-X_24.
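
Illustrative sketch. As an illustration of the running example of section 3.1 and of equations 1 and 2, the following minimal Python sketch evaluates the syntactic accuracy rules over a small Book dimension and compares the two roll-up paths from Book to Genre. It is not the implementation described in the paper: plain Python sets and dictionaries stand in for pyDatalog facts and for an ontology loaded with Owlready2, and every row, title, amount and helper name (BOOKS, SALES, ONTOLOGY_RANGES, mapped_element, synt_acc, rollup_direct, rollup_via_subgenre) is hypothetical. The mapping of the category name to dct:title is likewise an assumption made for the sketch; only MapCat(book, bibo:Book, true) is stated in the text.

```python
from collections import defaultdict

# Hypothetical Book dimension rows (illustrative values, not Figure 1's data).
BOOKS = [
    {"id": 1, "name": "Harry Potter and the Sorcerer's Stone",
     "subgenre": None, "genre": "Fantasy"},
    {"id": 2, "name": "The Hobit",  # deliberately misspelled title
     "subgenre": "High Fantasy", "genre": "Fantasy"},
    {"id": 3, "name": "Dune",
     "subgenre": "Space Opera", "genre": "Science Fiction"},
]
# Hypothetical Sales fact rows: (book_id, amount).
SALES = [(1, 10), (2, 5), (3, 7), (1, 3)]

# Stand-in for the context: the values in range(dct:title).
ONTOLOGY_RANGES = {
    "dct:title": {"Harry Potter and the Sorcerer's Stone", "The Hobbit", "Dune"},
}

# MapCat as (DW category, ontology element, is_total) triples.
# The second triple is an assumption made for this sketch.
MAP_CAT = {("Book", "bibo:Book", True), ("name", "dct:title", True)}


def mapped_element(category):
    """Return the ontology element that MapCat associates with a category."""
    for cat, element, _is_total in MAP_CAT:
        if cat == category:
            return element
    raise KeyError(category)


def synt_acc(book):
    """Equations 1 and 2: 1 if b.name is in range(dct:title), otherwise 0."""
    title_range = ONTOLOGY_RANGES[mapped_element("name")]
    return 1 if book["name"] in title_range else 0


def rollup_direct(books, sales):
    """Aggregate sales from Book directly to Genre."""
    genre_of = {b["id"]: b["genre"] for b in books}
    totals = defaultdict(int)
    for book_id, amount in sales:
        totals[genre_of[book_id]] += amount
    return dict(totals)


def rollup_via_subgenre(books, sales):
    """Aggregate sales from Book to Subgenre, then from Subgenre to Genre."""
    subgenre_of = {b["id"]: b["subgenre"] for b in books}
    genre_of_subgenre = {b["subgenre"]: b["genre"]
                         for b in books if b["subgenre"] is not None}
    by_subgenre = defaultdict(int)
    for book_id, amount in sales:
        subgenre = subgenre_of[book_id]
        if subgenre is not None:  # books without a subgenre are lost here
            by_subgenre[subgenre] += amount
    totals = defaultdict(int)
    for subgenre, amount in by_subgenre.items():
        totals[genre_of_subgenre[subgenre]] += amount
    return dict(totals)


synt_acc_facts = {b["id"]: synt_acc(b) for b in BOOKS}
direct = rollup_direct(BOOKS, SALES)
composed = rollup_via_subgenre(BOOKS, SALES)
print(synt_acc_facts)
print(direct, composed, direct == composed)
```

With these hypothetical rows, the misspelled title is the only one that receives 0 from the metric, and the composed roll-up reports fewer Fantasy sales than the direct one because the sales of the book without a subgenre are dropped at the Subgenre level, which is the summarizability problem described in section 3.1.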