
Bridging the gap between micro and macro data: Ontologies to the rescue

Domenico Lembo

Maurizio Lenzerini

Antonella Poggi

Roberta Radini

Michele Riccio

Valerio Santarelli

Sapienza Università di Roma

Istituto Nazionale di Statistica (ISTAT)

OBDA Systems s.r.l.

We describe a new methodology for modeling aggregate data and explicitly connecting them to the individual-level data from which aggregates are generated. The approach makes use of OWL2 ontologies that formalize both the application domain and multidimensional constructs, such as data cubes, measures, dimensions, and hierarchies. This contribution stems from a collaboration among ISTAT, Sapienza University of Rome, and OBDA Systems, within the project INTERSTAT.

Keywords: Multidimensional Data Modeling, Ontologies, Data Warehousing

1. Introduction

increment or decrement of the level of aggregation, called roll-up and drill-down, respectively, or the selection of a portion of events in the multidimensional space, called slice-and-dice.

In this paper we propose a new methodology for modeling and manipulating aggregate data, based on OWL2 ontologies that provide a rigorous formalization of both the application domain and the multidimensional model. The overall ontology that we devise makes explicit the way in which macro-data are obtained from micro-data, by exploiting views over the domain ontology, which are first-class citizens in our model. Data cubes and hierarchies are indeed seen as constructed from the (SPARQL) queries associated with the views, which allow cube dimensions, cube measures, and hierarchy levels to be instantiated from the answers to such queries. This is a distinguishing feature of our approach: other models for multidimensional data (e.g., [3, 4]) do not formalize this aspect, and methodologies for data warehouse design do not provide declarative means to specify the connection between micro- and macro-data. This connection is usually hidden in ETL procedures, and is thus difficult to understand and reconstruct, e.g., for data provenance and/or lineage.

We remark that our ontology is equipped with a tailored higher-order semantics, in the spirit of [5]. This paves the way for developing advanced reasoning services, such as processing queries that mix meta-level categories (such as cubes or hierarchies) and domain elements. This aspect is particularly interesting considering also the possibility of linking the ontology to both micro-data repositories and aggregate data sets through mappings, thus extending the Ontology-based Data Management (OBDM) [6] paradigm to the presence of multidimensional data. Finally, we point out that our approach enables the understanding and integration of aggregate data, possibly produced independently by different organizations.

2. The INTERSTAT project

Since 2015, ISTAT has developed the Integrated System of Statistical Registers (ISSR), a set of registers organized according to the following themes: socio-demographic, territorial, and institutional. Data of the different registers have been integrated and made interoperable through ontologies. The application of the OBDM methodology to the ISSR has made it possible to create a single conceptual access point to the above-mentioned information assets, and more generally to obtain data governance of the entire micro-data system of the Institute [7, 8, 9]. Analogous needs also emerge for the management of macro-data interoperability. Macro-data represent the core business of statistical institutes and are often at the basis of a significant part of the information services of many public bodies, which collect such data from various institutional or private producers. The INTERSTAT project studies how to achieve interoperability of statistical data from a cross-border and cross-domain point of view. In this respect, one of the findings of the project is that the relationship between micro- and macro-data should be clearly formalized. To this aim, leveraging the experience of ISSR, within the project we realized an overall ontology that makes such a relationship explicit, thus allowing for the harmonization of aggregated and non-aggregated data.

The project carried out three pilots to test semantic interoperability methods on different topics and with respect to data producers with distinct information purposes. The final goal of each pilot is to create synthetic indicators to support public decision-makers.

3. Illustrating the approach

We illustrate our approach through an example that refers to the “School for You” (S4Y) pilot of INTERSTAT. This pilot concerns the integration, from various sources, of data related to school attendance in Italy and France, for the construction of comparative indicators on the population of students by order of study. After presenting the domain ontology, which provides a conceptual representation of the domain of interest, we describe how, starting from the ontology, we have defined a set of views and then, based on such views, a set of hierarchies and cubes, to support analytical tasks regarding specific phenomena. The final result is an overall ontology providing a unified view over the domain of interest (i.e., the representation of micro-data) and the domain of views, hierarchies, and cubes defined for its analysis (i.e., the conceptualization of the macro-data).

The domain ontology. The ontology describes the domain of interest in terms of concepts (also known as classes), roles (also known as object properties), and attributes (also known as data properties). Specifically, it represents persons (concept Person) in terms of some of their features, such as sex, birth date, citizenship, and residence. It then describes students (concept Student), which are persons who have a student id and attend schools in some scholastic years. In particular, the school attendance by students (concept Student_attendance) is characterized by the study subject area and the school complex (concept School_complex) where the student is registered. Each school complex can be a public or private institution, and is described in terms of its identifying code, its name, and the Enumeration Area where it is located (concept EA), which belongs to a local administrative unit (concept LAU). Finally, each local administrative unit is part of a territorial unit (concept NUTS3).

We point out that, since a person's citizenship and residence may vary over the years and we are interested in keeping track of the variations, we represent the status of a person at the beginning of each year by associating the person with an instance of the concept Person_status, which is characterized by the attributes year and citizenship, and by the Enumeration Area where the person resides (cf. role resides_in).

Definition of the views. As an example, we here discuss analyses over aggregate data about school attendance of Italian students, both male and female, since 2015. All relevant information is scattered through the ontology. In fact, our analyses involve several domain concepts, playing the role of multiple statistical units, namely the attendance, the students, and the school addresses, which are related to each other through a specific set of conditions. This is one of the circumstances in which, in our approach, we resort to views, which allow us to formally capture, through a query over the domain ontology, all relevant data of multiple statistical units. Indeed, once appropriate views are defined, we can use them to specify aggregates for analytical tasks. Note that, as we will see in the next subsection, views over the ontology are also used to define hierarchies, which allow navigating data through different aggregation levels.

We thus define the view Attendance as follows:

View Attendance(id,year,sex,s_code,s_ea) as
a conjunctive query over the domain ontology with five target variables, whose body is a conjunction of atoms over the ontology predicates, restricted to Italian citizenship and to the condition year > 2015.

The target variables of the query defining the view are in one-to-one correspondence with the view attributes (e.g., year, sex). These attributes refer to all and only the data of interest for a specific set of investigations. For instance, the Attendance view includes neither the study subject area nor the type of institution of the school, which are not needed for the analyses at hand.

The extension of the view, computed over the ontology under the certain answers semantics [11], is shown in Fig. 2.
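As a rough illustration of how a view extension such as that of Attendance can be obtained, the following sketch evaluates a small conjunctive query over explicit toy facts. All predicate names and data values here are hypothetical, citizenship is attached directly to the student rather than to a Person_status instance, and no ontology reasoning is performed, so this is plain query evaluation and not certain answers computation.

```python
# Toy facts; predicate names and values are hypothetical, not the ontology's vocabulary.
attends = {("a1", "s1"), ("a2", "s2"), ("a3", "s1"), ("a4", "s3")}  # (attendance, student)
att_year = {"a1": 2020, "a2": 2014, "a3": 2021, "a4": 2020}        # attendance -> scholastic year
att_school = {"a1": "sc1", "a2": "sc1", "a3": "sc2", "a4": "sc2"}  # attendance -> school code
school_ea = {"sc1": 201, "sc2": 32}                                 # school code -> enumeration area
sex = {"s1": "male", "s2": "female", "s3": "female"}
citizenship = {"s1": "ITA", "s2": "ITA", "s3": "FRA"}               # simplification: no Person_status


def attendance_view():
    """Return tuples (id, year, sex, s_code, s_ea) for Italian students after 2015."""
    rows = set()
    for att, student in attends:
        year = att_year[att]
        # Selection conditions of the query: citizenship constant and year comparison.
        if year > 2015 and citizenship[student] == "ITA":
            school = att_school[att]
            rows.add((student, year, sex[student], school, school_ea[school]))
    return rows


extension = attendance_view()
```

Each tuple in the result has one component per view attribute, mirroring the one-to-one correspondence between target variables and attributes.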

Hierarchies. As already mentioned, analytical tasks often require aggregating data at different levels of granularity. This is achieved by means of built-in or customized hierarchies. In our approach, hierarchies are defined in terms of the domain ontology, by exploiting views. In more detail, we define a hierarchy h by specifying its intension as the set of pairs of nodes (i.e., the hierarchy levels) constituting the edges of a Directed Acyclic Graph (DAG), where each edge is associated with a binary view. Intuitively, the extension of the hierarchy is another DAG whose edges are the pairs of values in the view extensions.

[Fig. 3(a): extension of enumToLocal, with columns eArea and locUnit and rows (201, l1), (204, l1), (32, l2), (47, l3).]
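The distinction between the intension and the extension of a hierarchy can be sketched in a few lines of Python. The level and view names anticipate the territorial example discussed next; the enumToLocal pairs are those reported for its extension, while the localToTerr pairs (t1, t2) are hypothetical. The helper assumes that each level has a single outgoing edge (a chain), which is a simplification of the general DAG case.

```python
# Intension: edges between levels, each labelled with the binary view that populates it.
HSPACE_INTENSION = [
    ("eArea", "enumToLocal", "locUnit"),
    ("locUnit", "localToTerr", "terrUnit"),
]

# View extensions: pairs (child value, parent value).
VIEW_EXTENSIONS = {
    "enumToLocal": {(201, "l1"), (204, "l1"), (32, "l2"), (47, "l3")},
    "localToTerr": {("l1", "t1"), ("l2", "t1"), ("l3", "t2")},  # hypothetical values
}


def hierarchy_extension(intension, views):
    """Extension of the hierarchy: the union of the value pairs of its edge views."""
    edges = set()
    for _src, view, _dst in intension:
        edges |= views[view]
    return edges


def ancestors_at(value, from_level, to_level, intension, views):
    """Map a value at from_level to its ancestors at to_level (chain-shaped hierarchies only)."""
    current, level = {value}, from_level
    while level != to_level:
        step = next(((v, dst) for src, v, dst in intension if src == level), None)
        if step is None:
            raise ValueError(f"no edge leaves level {level!r}")
        view, level = step
        current = {parent for child, parent in views[view] if child in current}
    return current


hspace_edges = hierarchy_extension(HSPACE_INTENSION, VIEW_EXTENSIONS)
```

For instance, `ancestors_at(201, "eArea", "terrUnit", HSPACE_INTENSION, VIEW_EXTENSIONS)` follows both edges and returns the territorial unit of enumeration area 201 under the hypothetical data above.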

Turning to our example, in order to navigate data aggregated on the basis of different levels of territorial partitioning, we first define two views, enumToLocal(eArea,locUnit) and localToTerr(lUnit,terrUnit). The former consists of all pairs (x, y) such that x is an enumeration area belonging to the local unit y, and the latter of all pairs (x, y) such that x is a local unit belonging to the territorial unit y. Each view is defined by a query with a single binary atom, of the form (x, y) :- r(x, y), where r is the role of the domain ontology relating enumeration areas to local administrative units for enumToLocal, and local administrative units to NUTS3 territorial units for localToTerr. The extensions of enumToLocal and localToTerr are shown in Fig. 3.

By exploiting the views above, we define the hierarchy named HSpace as follows:

Hierarchy HSpace with edges
{ (eArea,enumToLocal,locUnit), (locUnit,localToTerr,terrUnit) }

where eArea, locUnit, terrUnit are the three nodes of the hierarchy HSpace, whose intension is the DAG graphically depicted in Fig. 4a, and whose extension is the DAG depicted in Fig. 4b.

Base Data Cube. Suppose that the primary events of interest for our analysis refer to the number of Italian students who attended a school in Italy since 2015, per scholastic year, sex, and school address. We thus define a base data cube, named BDC1. Its definition specifies (i) that BDC1 is defined on the view Attendance, (ii) that it has dimensions scholYear from (the view attribute) year, sex from sex, and location from s_ea with hierarchy HSpace, and, finally, (iii) that it counts the number of tuples in the view having the same values for year, sex, and s_ea; this measure is named qty (the operator used to compute it is count()).

Given the extension of Attendance in Fig. 2, BDC1 is instantiated by the observed events shown in Fig. 5a.
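The instantiation of a base data cube from a view extension amounts to a group-and-count over the view tuples. The sketch below illustrates this under stated assumptions: the Attendance tuples are hypothetical and do not reproduce Fig. 2, and the function name is introduced only for illustration.

```python
from collections import Counter

# Hypothetical Attendance tuples: (id, year, sex, s_code, s_ea).
ATTENDANCE = [
    ("s1", 2020, "male", "sc1", 201),
    ("s2", 2020, "male", "sc1", 201),
    ("s3", 2020, "male", "sc2", 204),
    ("s4", 2017, "female", "sc3", 32),
]


def base_data_cube(view_rows):
    """Count view tuples per (scholYear, sex, location); id and s_code are projected out."""
    counts = Counter((year, sex, s_ea) for _id, year, sex, _code, s_ea in view_rows)
    return dict(counts)


# Each key is an observed event; each value is the measure qty.
bdc1 = base_data_cube(ATTENDANCE)
```

The projection of id and s_code in the comprehension corresponds to the fact that the cube keeps only the view components that play a role as dimensions.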

Note that BDC1 projects out some components of the view Attendance, in particular those that do not play any role in the data cube. As said, views are typically designed for a set of analyses, and indeed Attendance is at the basis of the definition of other cubes.

Derived Data Cubes. Once we have defined a base data cube, we may want to represent derived data cubes obtained from the base one (or from other derived data cubes) by applying OLAP operators such as Roll-up, Drill-down, and Slice and Dice. For lack of space, we next illustrate only the case of Roll-up.

A data cube is obtained from another through a roll-up by specifying the desired aggregation level along one or more (hierarchies associated with) dimensions. For example, suppose that we want to apply the roll-up operator to the (base) data cube BDC1, to get the data cube DDC1 reporting the number of Italian students who attended a school in Italy since 2015, per sex and per territorial unit. To this aim we use the following specification:

Data Cube DDC1 on cube BDC1
Roll-up on dimension sex, location at node terrUnit of hierarchy HSpace
with measures sum(qty) as qty

The above definition states that DDC1 is the result of applying the Roll-up operator to BDC1, towards the terrUnit node of the hierarchy HSpace, and eliminating the scholYear dimension, which does not appear among the dimensions of DDC1 (note that here we are “rolling-up” the entire degenerate dimension scholYear [1]).

Given the extension of HSpace in Fig. 4b and of BDC1 in Fig. 5a, the extension of DDC1 is that shown in Fig. 5b.
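The roll-up just described can be sketched as a re-aggregation of the base cube along the hierarchy. In the code below, the cube contents and the localToTerr pairs are hypothetical, the two hierarchy edges are stored as dictionaries (which assumes each value has exactly one parent), and the measure is re-aggregated by summing qty, under the assumption that counts are combined by addition when moving to a coarser level.

```python
from collections import defaultdict

# Hierarchy extension as child -> parent maps (assumes a single parent per value).
ENUM_TO_LOCAL = {201: "l1", 204: "l1", 32: "l2", 47: "l3"}
LOCAL_TO_TERR = {"l1": "t1", "l2": "t1", "l3": "t2"}  # hypothetical values

# Hypothetical base cube: (scholYear, sex, location) -> qty.
BDC1 = {
    (2020, "male", 201): 2,
    (2021, "male", 204): 1,
    (2017, "female", 32): 3,
    (2018, "female", 47): 1,
}


def roll_up_to_terr_unit(cube):
    """Drop scholYear and lift location from eArea to terrUnit, summing qty."""
    rolled = defaultdict(int)
    for (_year, sex, e_area), qty in cube.items():
        terr_unit = LOCAL_TO_TERR[ENUM_TO_LOCAL[e_area]]
        rolled[(sex, terr_unit)] += qty
    return dict(rolled)


ddc1 = roll_up_to_terr_unit(BDC1)
```

Since the roll-up only regroups events, the total of qty over the derived cube equals the total over the base cube, which gives a simple sanity check on any implementation.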

4. Conclusion

Our work is currently focused on the development of services to support both design-time and run-time activities related to the production, distribution, and integration of aggregate data. Such services are defined according to a formal semantics that extends the Metamodeling Semantics proposed in [5]. This semantics allows reasoning over the various representation layers of the overall ontology we realized, i.e., the meta-level formalizing the multidimensional model, the actual data cubes designed for the analysis of the business trends, the domain ontology, and the views bridging it to the cubes. A fundamental service in this scenario is query answering. This service is indeed at the basis of several more complex functionalities, such as the integration of aggregate data sets, possibly acquired from external sources and suitably linked to the ontology through mappings as in OBDM, and the production and publishing of linked open data. Interestingly, queries in our framework may smoothly combine elements belonging to the various levels of the ontology. This allows us, for instance, to pose a query with two target variables ranging over cubes, which asks for all pairs of cubes that are based on disjoint views (i.e., views without common answers), and can thus be considered incomparable. Notice that this querying ability goes beyond those of current systems that manage aggregate information. Our efforts are thus concentrated on the implementation of software components, possibly integrated in the OBDM tool Mastro [12], that realize the above idea.
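To give a concrete flavor of such a query over cubes and views, the following sketch finds pairs of cubes whose underlying views have no answers in common. It is only an approximation under stated assumptions: the cube and view names other than BDC1 and Attendance are hypothetical, disjointness is checked on explicit toy extensions rather than derived by reasoning over the ontology, and unordered pairs are reported.

```python
from itertools import combinations

# Hypothetical view extensions (sets of answer tuples).
VIEW_EXTENSION = {
    "Attendance": {("s1", 2020), ("s2", 2021)},
    "Enrolment": {("s2", 2021), ("s3", 2019)},  # hypothetical view
    "Staff": {("t1", 2020)},                    # hypothetical view
}

# Meta-level information: the view each cube is based on (BDC2, BDC3 are hypothetical).
CUBE_VIEW = {"BDC1": "Attendance", "BDC2": "Enrolment", "BDC3": "Staff"}


def cubes_on_disjoint_views(cube_view, view_extension):
    """Return the unordered pairs of cubes whose views share no answers."""
    return [
        (c1, c2)
        for c1, c2 in combinations(sorted(cube_view), 2)
        if view_extension[cube_view[c1]].isdisjoint(view_extension[cube_view[c2]])
    ]


incomparable = cubes_on_disjoint_views(CUBE_VIEW, VIEW_EXTENSION)
```

The function mixes meta-level elements (cubes and the views they are based on) with data-level content (the view answers), which is the kind of combination the framework is meant to support natively.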

[1] Kimball, Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3 ed., Wiley, 2013.

[2] Golfarelli, Rizzi, Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill, 2009.

[3] Cyganiak, Reynolds, The RDF Data Cube Vocabulary, W3C Recommendation, W3C, 2014. Available at https://www.w3.org/TR/vocab-data-cube/.

[4] The official site for the SDMX community, https://sdmx.org/, 2023.

[5] Lenzerini, Lepore, Poggi, Metamodeling and metaquerying in OWL 2 QL, AIJ 292 (2021) 103432.

[6] Lenzerini, Managing data through the lens of an ontology, AI Magazine 39 (2018) 65-74.

[7] Aracri, A. M. Bianco, Scannapieco, Lepore, Radini, Santarelli, L'uso delle ontologie per la governance dei dati del sir, in: Ital-IA, 2019.

[8] R. M. Aracri, Bianco, Radini, Scannapieco, Tosco, Using ontologies for official statistics: The istat experience, in: Practi-o-Web, 2017.

[9] Radini, Scannapieco, G. Garofalo, The italian integrated system of statistical registers: On the design of an ontology-based data integration architecture, in: NTTS, 2017.

[10] Lembo, Santarelli, D. F. Savo, G. De Giacomo, Graphol: A graphical language for ontology modeling equivalent to OWL 2, Future Internet 14 (2022).

[11] Calvanese, G. De Giacomo, Lembo, Lenzerini, Rosati, Tractable reasoning and efficient query answering in description logics: The DL-Lite family, J. of Automated Reasoning 39 (2007) 385-429.

[12] Mastro - The OBDM Engine, https://obdm.obdasystems.com/mastro/, 2023.