<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the gap between micro and macro data: Ontologies to the rescue</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico Lembo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurizio Lenzerini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonella Poggi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberta Radini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Riccio</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Santarelli</string-name>
        </contrib>
        <aff>Sapienza Università di Roma</aff>
        <aff>Istituto Nazionale di Statistica (ISTAT)</aff>
        <aff>OBDA Systems s.r.l.</aff>
      </contrib-group>
      <abstract>
        <p>We describe a new methodology for modeling aggregate data and explicitly connecting them to the individual-level data from which aggregates are generated. The approach makes use of OWL2 ontologies that formalize both the application domain and multidimensional constructs, such as data cubes, measures, dimensions, and hierarchies. This contribution stems from a collaboration among ISTAT, Sapienza University of Rome, and OBDA Systems, within the project INTERSTAT.</p>
      </abstract>
      <kwd-group>
        <kwd>Multidimensional Data Modeling</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Data Warehousing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>increment or decrement of the level of aggregation, called roll-up and drill-down, respectively,
or the selection of a portion of events in the multidimensional space, called slice-and-dice.</p>
      <p>
        In this paper we propose a new methodology for modeling and manipulating aggregate data,
which is based on the use of OWL2 ontologies that provide a rigorous formalization of both
the application domain and the multidimensional model. The overall ontology that we devise
makes explicit the way in which macro-data are obtained from micro-data, by exploiting
views over the domain ontology, which are first-class citizens in our model. Data cubes and
hierarchies are indeed seen as constructed from the (SPARQL) queries associated with the views,
which allow cube dimensions, cube measures, and hierarchy levels to be instantiated from the
answers to such queries. This is a distinguishing feature of our approach, considering that other
models for multidimensional data (e.g., [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]) do not formalize this aspect, and methodologies
for data warehouse design do not provide declarative means to specify the connection between
micro- and macro-data. This connection is usually hidden in ETL procedures, and is thus difficult to
understand and reconstruct, e.g., for data provenance and/or lineage.
      </p>
      <p>
        We remark that our ontology is equipped with a tailored higher-order semantics, in the spirit
of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This paves the way for developing advanced reasoning services, such as processing
queries that mix meta-level categories (such as cubes or hierarchies) and domain elements.
Such an aspect is particularly interesting considering also the possibility of linking the ontology
to both micro-data repositories and aggregate data sets through mappings, thus extending the
Ontology-based Data Management (OBDM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] paradigm to the presence of multidimensional
data. Finally, we point out that our approach enables the understanding and integration of
aggregate data, possibly produced independently by different organizations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The INTERSTAT project</title>
      <p>
        Since 2015, ISTAT has developed the Integrated System of Statistical Registers (ISSR), which is a
set of registers distinguished according to the following themes: socio-demographic, territorial,
and institutional. Data from the different registers have been integrated and made interoperable
through ontologies. The application of the OBDM methodology to the ISSR has made it possible
to create a single conceptual access point to the above-mentioned information assets, and more
generally to obtain data governance of the entire micro-data system of the Institute [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ].
Analogous needs also emerge for the management of macro-data interoperability.
Macro-data represent the core business of statistical institutes and are often at the basis of a
significant part of the information services of many public bodies, which often collect such
data from various institutional or private producers. The INTERSTAT project studies how to
achieve interoperability of statistical data from a cross-border and cross-domain point of view.
In this respect, one of the findings of the project is that the relationship between micro- and
macro-data should be clearly formalized. To this aim, leveraging the experience of ISSR, within
the project we realized an overall ontology that makes this relationship explicit, thus allowing
for the harmonization of aggregated and non-aggregated data.
      </p>
      <p>The project carried out three pilots to test semantic interoperability methods on different
topics and with respect to data producers with distinct information purposes. The final goal of
each pilot is to create synthetic indicators to support public decision-makers.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Illustrating the approach</title>
      <p>We illustrate our approach through an example that refers to the “School for You” (S4Y) pilot
of INTERSTAT. This pilot concerns the integration, from various sources, of data related
to school attendance in Italy and France, for the construction of comparative indicators on
the population of students by order of study. After presenting the domain ontology, which
provides a conceptual representation of the domain of interest, we describe how, starting from
the ontology, we have defined a set of views and then, based on such views, a set of hierarchies
and cubes, to support analytical tasks regarding specific phenomena. The final result is an
overall ontology providing a unified view over the domain of interest (i.e., the representation
of micro-data) and the domain of views, hierarchies, and cubes defined for its analysis (i.e., the
conceptualization of the macro-data).</p>
      <p>The domain ontology. The ontology describes the domain of interest in terms of concepts
(also known as classes), roles (also known as object properties), and attributes (also known as
data properties). Specifically, it represents persons (concept Person) in terms of some of their
features, such as sex, birth date, citizenship, and residence. It then describes students (concept
Student), which are persons who have a student id and attend schools in some scholastic
years. In particular, the school attendance by students (concept Student_attendance) is
characterized by the study subject area and the school complex (concept School_complex)
where the student is registered. Each school complex can be a public or private institution, and
is described in terms of its identifying code, name, and Enumeration Area where it is located
(concept EA), which belongs to a local administrative unit (concept LAU). Finally, each local
administrative unit is part of a territorial unit (concept NUTS3).</p>
      <p>We point out that, since a person's citizenship and residence may vary over the years
and we are interested in keeping track of such variations, we represent the status
of a person at the beginning of each year by associating with the person an instance of the concept
Person_status, which is characterized by the attributes year and citizenship, and by
the Enumeration Area where the person resides (cf. role resides_in).</p>
      <p>Definition of the views. As an example, we here discuss analyses over aggregate data about
school attendance of Italian students, both male and female, since 2015. All relevant information
is scattered through the ontology. In fact, our analyses involve several domain concepts,
playing the role of multiple statistical units, namely the attendances, the students, and the school
addresses, which are related to each other through a specific set of conditions. This is one of the
circumstances in which, in our approach, we resort to views, which allow us to formally capture,
through a query over the domain ontology, all relevant data of multiple statistical units. Indeed,
once appropriate views are defined, we can use them to specify aggregates for analytical tasks.
Note that, as we will see below, views over the ontology are also used to define
hierarchies that allow navigating data through different aggregation levels.</p>
      <p>We thus define the view Attendance as follows:
View Attendance(id,year,sex,s_code,s_ea) as
q(…) :- …, … &gt; 2015
The body of the query is a conjunction of atoms over the domain ontology, followed by a comparison
that restricts the answers to years after 2015. The target variables of the query are in one-to-one correspondence
with the view attributes (e.g., year, sex). These attributes refer to all and only the data of interest
for a specific set of investigations. For instance, the Attendance view includes neither the
study subject area nor the type of institution of the school, which are not needed for the analyses at hand.</p>
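      <p>To make the role of the view concrete, the following minimal Python sketch evaluates a query of this kind over a handful of explicit facts. It is only an illustration under stated assumptions: the dictionaries sex, citizenship, and school_ea, the list attendances, and all their values are hypothetical, and the sketch evaluates the conditions directly over stored facts, without the ontology reasoning that the approach relies on.</p>
      <preformat><![CDATA[
```python
# Toy micro-data; every value and field name below is made up for illustration.
sex = {"s1": "male", "s2": "female", "s3": "female"}

# Citizenship of a student at the beginning of a given year (cf. Person_status).
citizenship = {("s1", 2015): "IT", ("s1", 2020): "IT",
               ("s2", 2017): "IT", ("s3", 2018): "FR"}

# School attendances as (student id, scholastic year, school code).
attendances = [("s1", 2015, "sc1"), ("s1", 2020, "sc1"),
               ("s2", 2017, "sc2"), ("s3", 2018, "sc2")]

# Enumeration area where each school complex is located.
school_ea = {"sc1": 201, "sc2": 32}


def attendance_view():
    """Return the tuples (id, year, sex, s_code, s_ea) of the view."""
    rows = []
    for sid, year, s_code in attendances:
        # Keep Italian students only, and years after 2015.
        if year > 2015 and citizenship.get((sid, year)) == "IT":
            rows.append((sid, year, sex[sid], s_code, school_ea[s_code]))
    return rows


view = attendance_view()
for row in view:
    print(row)
```
]]></preformat>
      <p>Each returned tuple has one component per view attribute, which mirrors the one-to-one correspondence between target variables and view attributes.</p>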
      <p>
        The extension of the view, evaluated over the ontology under the certain-answers semantics [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
is shown in Fig. 2.
      </p>
      <p>Hierarchies. As already mentioned, analytical tasks often require aggregating data at different
levels of granularity. This is achieved by means of built-in or customized hierarchies. In our
approach, hierarchies are defined in terms of the domain ontology, by exploiting views. In more
detail, we define a hierarchy h by specifying its intension as the set of pairs of nodes (i.e., the
hierarchy levels) constituting the edges of a Directed Acyclic Graph (DAG), where each edge is
associated with a binary view. Intuitively, the extension of the hierarchy is another DAG whose
edges are the pairs of values in the view extensions.</p>
      <table-wrap>
        <caption><p>Extension of the view enumToLocal (Fig. 3a).</p></caption>
        <table>
          <thead>
            <tr><th>eArea</th><th>locUnit</th></tr>
          </thead>
          <tbody>
            <tr><td>201</td><td>l1</td></tr>
            <tr><td>204</td><td>l1</td></tr>
            <tr><td>32</td><td>l2</td></tr>
            <tr><td>47</td><td>l3</td></tr>
          </tbody>
        </table>
      </table-wrap>
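      <p>The notion of hierarchy just described can be sketched in a few lines of Python. The intension is a list of edges labeled by view names, and the extension is obtained from the extensions of those views. The eArea/locUnit pairs are those listed for enumToLocal; the node terrUnit, the view name localToTerr, and its pairs are hypothetical values used here only to obtain a second level, and the helper names extension and climb are illustrative.</p>
      <preformat><![CDATA[
```python
# Intension of the hierarchy: edges (lower node, view name, upper node).
HSPACE = [("eArea", "enumToLocal", "locUnit"),
          ("locUnit", "localToTerr", "terrUnit")]

# Extensions of the binary views that label the edges.
VIEWS = {
    "enumToLocal": [(201, "l1"), (204, "l1"), (32, "l2"), (47, "l3")],
    "localToTerr": [("l1", "t1"), ("l2", "t1"), ("l3", "t2")],  # hypothetical
}


def extension(hierarchy, views):
    """Edges of the extension DAG, each linking two (node, value) members."""
    return [((low, a), (high, b))
            for low, view, high in hierarchy
            for a, b in views[view]]


def climb(value, start, target, hierarchy, views):
    """Values of node `target` reachable from `value` of node `start`."""
    reached, frontier = set(), [(start, value)]
    while frontier:
        node, member = frontier.pop()
        if node == target:
            reached.add(member)
            continue
        # Follow every edge leaving the current node (the graph is acyclic).
        for low, view, high in hierarchy:
            if low == node:
                frontier.extend((high, b) for a, b in views[view] if a == member)
    return reached


edges = extension(HSPACE, VIEWS)
print(len(edges))
print(climb(201, "eArea", "terrUnit", HSPACE, VIEWS))
```
]]></preformat>
      <p>Keeping intension and extension separate means that the same hierarchy definition can be re-instantiated whenever the answers to the underlying views change.</p>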
      <p>Turning to our example, in order to navigate data aggregated on the
basis of different levels of territorial partitioning, we first define two views, enumToLocal and
localToTerr, consisting of all pairs (x, y) such that x is an enumeration area belonging to the
local unit y, and of all pairs (x, y) such that x is a local unit belonging to the territorial unit y,
respectively:
View enumToLocal(eArea,locUnit) as
q(x, y) :- …_…(x, y)
View localToTerr(lUnit,terrUnit) as
q(x, y) :- …_…3(x, y)</p>
      <p>The extensions of enumToLocal and localToTerr are shown in Fig. 3.
By exploiting the views above, we define the hierarchy named HSpace as follows:</p>
      <sec id="sec-3-1">
        <title>Hierarchy HSpace with edges</title>
        <p>{ (eArea,enumToLocal,locUnit),
(locUnit,localToTerr,terrUnit) }</p>
        <p>Here eArea, locUnit, and terrUnit are the three nodes of the hierarchy HSpace, whose intension is
the DAG graphically depicted in Fig. 4a, and whose extension is the DAG depicted in Fig. 4b.
        </p>
        <p>Base Data Cube. Suppose that the primary events of interest for our analysis refer to the
number of Italian students who attended a school in Italy since 2015, per scholastic year, sex,
and school address. We thus define a base data cube, named BDC1, whose definition specifies
that (i) BDC1 is defined on the view Attendance, (ii) it has
dimensions scholYear from (the view attribute) year, sex from sex, and location from s_ea
with hierarchy HSpace, and, finally, (iii) it counts the number of tuples in the view having
the same values for year, sex, and s_ea; this measure is named qty (the operator used to
compute it is count()).</p>
        <p>Given the extension of Attendance in Fig. 2, BDC1 is instantiated by the observed events
shown in Fig. 5a.</p>
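        <p>As a rough illustration of how a base data cube is instantiated from a view, the Python sketch below groups the tuples of a view by the attributes chosen as dimensions and counts them. The rows of attendance are hypothetical and do not reproduce Fig. 2; the function name base_cube and the tuple layout are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Attributes of the view, in positional order.
COLUMNS = ("id", "year", "sex", "s_code", "s_ea")

# Hypothetical extension of the view (not the data of Fig. 2).
attendance = [
    ("s1", 2020, "male", "sc1", 201),
    ("s4", 2020, "male", "sc3", 201),
    ("s2", 2017, "female", "sc2", 32),
    ("s5", 2018, "female", "sc2", 32),
]


def base_cube(rows, dimensions):
    """Count the view tuples that share the same values on the dimensions.

    `dimensions` maps each cube dimension to the view attribute it is
    taken from; the resulting counts play the role of the measure qty.
    """
    positions = [COLUMNS.index(attribute) for attribute in dimensions.values()]
    return Counter(tuple(row[i] for i in positions) for row in rows)


bdc1 = base_cube(attendance,
                 {"scholYear": "year", "sex": "sex", "location": "s_ea"})
for event, qty in bdc1.items():
    print(event, qty)
```
]]></preformat>
        <p>The attributes id and s_code do not appear in the keys of the result, which corresponds to projecting out the view components that play no role in the cube.</p>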
        <p>Note that BDC1 projects out some components of the view Attendance, in particular those
that do not play any role in the data cube. As said, views are typically designed for a set of
analyses, and indeed Attendance is at the basis of the definition of other cubes.</p>
        <p>Derived Data Cubes. Once we have defined a base data cube, we may want to represent
derived data cubes obtained from the base one (or from other derived data cubes) by applying
OLAP operators such as Roll-up, Drill-down, and Slice-and-Dice. For lack of space, we next
illustrate only the case of Roll-up.</p>
        <p>A data cube is obtained from another through a roll-up by specifying the desired aggregation
level along one or more (hierarchies associated with) dimensions. For example, suppose that we
want to apply the roll-up operator to the (base) data cube BDC1, to get the data cube DDC1
reporting the number of Italian students who attended a school in Italy since 2015, per sex and
per territorial unit. To this aim we use the following specification:</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Cube DDC1 on cube BDC1</title>
        <p>Roll-up on dimension
sex
location at node terrUnit of hierarchy HSpace
with measures sum(qty) as qty</p>
        <p>
          The above definition states that DDC1 is the result of applying the Roll-up operator to BDC1,
towards the terrUnit node of the hierarchy HSpace, and by eliminating the scholYear dimension,
which does not appear among the dimensions of DDC1 (note that here we are “rolling up” the
entire degenerate dimension scholYear [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]).
        </p>
        <p>Given the extension of HSpace in Fig. 4b and of BDC1 in Fig. 5a, the extension of DDC1 is
that shown in Fig. 5b.</p>
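        <p>The roll-up described above can be sketched in Python as follows. The cube bdc1, its qty values, and the pairs in local_to_terr are hypothetical and do not reproduce Fig. 5; the sketch also assumes that every enumeration area has exactly one local unit and every local unit exactly one territorial unit, and that the measure is re-aggregated by summation.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical base cube: (scholYear, sex, location) -> qty.
bdc1 = {
    (2020, "male", 201): 2,
    (2021, "male", 204): 1,
    (2017, "female", 32): 3,
    (2018, "female", 47): 1,
}

# Hierarchy extension, one parent per member (an assumption of this sketch).
enum_to_local = {201: "l1", 204: "l1", 32: "l2", 47: "l3"}
local_to_terr = {"l1": "t1", "l2": "t1", "l3": "t2"}  # hypothetical


def roll_up(cube):
    """Drop scholYear, lift location to terrUnit, and sum qty."""
    result = defaultdict(int)
    for (_schol_year, sex, e_area), qty in cube.items():
        terr_unit = local_to_terr[enum_to_local[e_area]]
        result[(sex, terr_unit)] += qty
    return dict(result)


ddc1 = roll_up(bdc1)
for event, qty in ddc1.items():
    print(event, qty)
```
]]></preformat>
        <p>Since scholYear is simply ignored when building the new keys, events that differ only in the scholastic year collapse into the same cell of the derived cube.</p>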
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>
        Our work is currently focused on the development of services to support both design- and
run-time activities related to the production, distribution, and integration of aggregate data. Such
services are defined according to a formal semantics that extends the Metamodeling Semantics
proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This semantics allows reasoning over the various representation layers of the
overall ontology we realized, i.e., the meta-level formalizing the multidimensional model, the
actual data cubes designed for the analysis of the business trends, the domain ontology, and the
views bridging it to the cubes. A fundamental service in this scenario is query answering. Such a
service is indeed at the basis of several more complex functionalities, such as the integration of
aggregate data sets, possibly acquired from external sources and suitably linked to the ontology
through mappings as in OBDM, and the production and publishing of linked open data. Interestingly,
queries in our framework may smoothly combine elements belonging to the various
levels of the ontology. This allows, for instance, posing a query such as the following:
q(x, y) :- …(x), …(y), …_…(x, v1), …_…(y, v2), …(v1, v2)
In words, the above query asks for all pairs of cubes that are based on disjoint views (i.e.,
views without common answers), and can thus be considered incomparable. Notice that this
querying ability goes beyond those of current systems that manage aggregate information. Our
efforts are thus concentrated on the implementation of software components, possibly integrated
in the OBDM tool Mastro [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], that realize the above idea.
      </p>
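      <p>The intent of such a meta-level query can be illustrated with a small Python sketch that, given which view each cube is based on and the answers to each view, returns the pairs of cubes whose views share no answer. All names and data below (the cubes BDC2 and BDC3, the view Enrolment, and the answer sets) are hypothetical, and disjointness is checked here only on the stored answers, which is a simplification with respect to query answering over the ontology.</p>
      <preformat><![CDATA[
```python
# Hypothetical meta-level facts: the view each cube is based on.
based_on = {"BDC1": "Attendance", "BDC2": "Attendance", "BDC3": "Enrolment"}

# Hypothetical answers to each view.
view_answers = {
    "Attendance": {("s1", 2020), ("s2", 2017)},
    "Enrolment": {("s9", 2019)},
}


def incomparable_cubes():
    """Ordered pairs of cubes that are based on views without common answers."""
    return sorted(
        (x, y)
        for x in based_on
        for y in based_on
        if view_answers[based_on[x]].isdisjoint(view_answers[based_on[y]])
    )


pairs = incomparable_cubes()
for pair in pairs:
    print(pair)
```
]]></preformat>
      <p>The function mixes meta-level elements (cubes and views) with domain-level data (the view answers), which is the kind of combination the query above is meant to express.</p>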
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kimball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <source>The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling</source>
          , 3 ed., Wiley,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <source>Data Warehouse Design: Modern Principles and Methodologies</source>
          , McGraw-Hill,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <source>The RDF Data Cube Vocabulary</source>
          , W3C Recommendation, W3C,
          <year>2014</year>
          . Available at https://www.w3.org/TR/vocab-data-cube/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>The official site for the SDMX community</article-title>
          , https://sdmx.org/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lepore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <article-title>Metamodeling and metaquerying in OWL 2 QL</article-title>
          , AIJ
          <volume>292</volume>
          (
          <year>2021</year>
          )
          <fpage>103432</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <article-title>Managing data through the lens of an ontology</article-title>
          ,
          <source>AI Magazine</source>
          <volume>39</volume>
          (
          <year>2018</year>
          )
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aracri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lepore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Radini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Santarelli</surname>
          </string-name>
          ,
          <article-title>L'uso delle ontologie per la governance dei dati del sir</article-title>
          , in:
          <source>Ital-IA</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Aracri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Radini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tosco</surname>
          </string-name>
          ,
          <article-title>Using ontologies for official statistics: The Istat experience</article-title>
          , in:
          <source>Practi-o-Web</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Radini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Garofalo</surname>
          </string-name>
          ,
          <article-title>The Italian integrated system of statistical registers: On the design of an ontology-based data integration architecture</article-title>
          , in:
          <source>NTTS</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Santarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Savo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <article-title>Graphol: A graphical language for ontology modeling equivalent to OWL 2</article-title>
          ,
          <source>Future Internet</source>
          <volume>14</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <article-title>Tractable reasoning and efficient query answering in description logics: The DL-Lite family</article-title>
          ,
          <source>J. of Automated Reasoning</source>
          <volume>39</volume>
          (
          <year>2007</year>
          )
          <fpage>385</fpage>
          -
          <lpage>429</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>Mastro - The OBDM Engine</source>
          , https://obdm.obdasystems.com/mastro/,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>