Multi-Granular Schemas for Data Integration

Multi-Granular Schemas for Data Integration MAndreaRodríguez arodriguez@inf.udec.cl Universidad de Concepción

Edmundo Larenas 215 4070409 Concepción Chile, Chile

LoretoBravo lbravo@inf.udec.cl Universidad de Concepción

Edmundo Larenas 215 4070409 Concepción Chile, Chile

Multi-Granular Schemas for Data Integration 3AC48C4CB1996921EF3918C056FAEFC6 GROBID - A machine learning software for extracting information from scholarly documents

Data can contain information at different levels of granularities but this metadata is generally left implicit in the data model. If we want to take advantage of different levels of granularity when integrating data, we need to first extend database schemas to include granularity information. In this article we (i) provide a multi-granular domain schema that is used in the formalization of database schemas so that each attribute is assigned a certain granularity; and (ii) explore the issues when integrating data at different granularities and suggest possible global schemas and instances.

Introduction

The notion of granularity is relevant to many aspects of data representation, and its formalization is fundamental for data integration. Granularity relates to data quality since it introduces bounds to the level of detail in which data is represented. Despite its relevance, granularity is usually implicit in a data model. This rises consequence for data integration, since no useful information can be extracted from data that may be semantically related but that are represented at different granularities.

As a motivating example, consider the integration of two different databases that store information about earthquakes occurred in different populated areas of the world. One of the databases stores the location of earthquakes by the epicenter's geographic coordinates, stores the time of occurrence by day and hour, and stores the magnitude in the Richter scale. The second database stores the location of earthquakes by the name of the administrative district that contains the epicenter, stores the time of occurrence by day, hour and minute, and stores the magnitude in the same scale. To solve this problem, we first need a database schema language that can specify both databases using different granularities. Then, the integration of both databases is not trivial. Although both databases store the magnitude in the same scale, they store temporal and spatial information at different granularities. We then need to formalization the schema of integration to be able to materialize the integrated database.

Although there exist different studies that model and query data at different granularities [1][2][3], none of them have addressed the integration of database schemas that differ in the granularity to represent different attributes. Unlike other studies, we do not focus on a particular domain, instead, we provide a general definition of schema integration, where granularity can be applied to any of the attribute domains. Even more, unlike classical approaches in data warehousing, data is not necessarily stored at the finest level of granularity upon which aggregation functions derive data at other coarser levels. In summary, the main contributions of this work are twofold. We start by formalizing a multi-granular domain schema and multi-granular database schema, in which attributes take values from domains that can be represented at different granularities. A second contribution is the formalization of a global schema given source multi-granular schemas and data. The goal is to obtain a global schema that provides at least the same information as the source data. Future work is the materialization of the database instance of the global schema.

The organization of the paper is as follows. In Section 2 we introduce our formalization of domain schemas and databases with multiple granularities. In Section 3 we study the problem of integrating two instances with data at different levels of granularity under some restrictions and we propose two possible global schemas and instances to integrate this data. Related work is presented in Section 4 to highlight the novelty of the work. Finally, Section 5 provides a discussion about more general settings.

Domain schemas and Databases

A domain schema is a tuple Ψ = (U, , I, ρ, τ ), where (i) U is the domain associated with Ψ , (ii) is the set of granularity identifiers (or labels), (iii) I is the set of granule identifiers (or labels), (iv) ρ is a function ρ : I → 2 U that maps granules identifiers to subsets of the domain, and (v) τ is a function

→ 2 I such that for all G ∈ if i, j ∈ τ (G) then ρ(i) ∩ ρ(j) = ∅.

To simplify the presentation we will asume that for i, j ∈ I, i = j iff ρ(i) = ρ(j), and that for

G 1 , G 2 ∈ , G 1 = G 2 iff τ (G 1 ) = τ (G 2 ).

Given a domain schema Ψ = (U, , I, ρ, τ ) and granularities

G 1 , G 2 ∈ : (i) G 1 is finer or equal than G 2 , denoted G 1 Ψ G 2 , iff for all i ∈ τ (G 1 ) there exists j ∈ τ (G 2 ) such that ρ(i) ⊆ ρ(j); (ii) G 1 aggregates to G 2 , denoted G 1 Ψ G 2 , iff for all j ∈ τ (G 2 ) there exists i 1 , . . . , i n ∈ τ (G 1 ) such that ρ(i 1 ) ∪ ... ∪ ρ(i n ) = ρ(j); and (iii) G 1 is a partition of G 2 , denoted by G 1 Ψ G 2 , if G 1 Ψ G 2 and G 1 ≺ Ψ G 2 .

When the domain schema is clear from the context we will use ≺, and .

A particular instance of a domain schema is the sets of granularities defined over the time domain. A time domain is a pair (T ; ≤), where T is a non-empty set of time instants, and ≤ is a total order in T [4][5][6]. By saying that T is a set of time instants we do not impose the idea that the time is only discrete. Time domain is continuous (dense) if for all t < t there exists t such that t < t < t . The next example illustrates the case of multiple time granularities for representing academic activities of universities. In this setting the set of granule identifiers is an ordered set of index values.

Example 1. Consider the following domain schema Ψ time = (U time , time , I time , ρ time , τ time ), with time = {month, term, year} and the elements in I time , as well as functions ρ time and τ time are represented graphically in Figure 1. The only relations that hold among these granularities are month year and term ≺ year.

Another instance of a domain schema is the domain representing foodstuffs. Different granularities to represent a food product can be defined using classification criteria such as group, calorie, and brand. This is illustrated in the following example.

ρ(Jan12) ρ(Feb12) ρ(Mar12) ρ(Apr12) ρ(May12) ρ(Jun12) ρ(Jul12) ρ(Aug12)ρ(Sep12) ρ(Oct12) ρ(Nov12) ρ(Dic12) ρ(winter12) ρ(ρ prod (Soprole) = {P1, P2} ρ prod (Nestle) = {P3, P4} ρ prod (Alpura) = {P5, P6} ρ prod (Bimbo) = {P7, P8} ρ prod (Baker ) = {P9, P10} ρ prod (Low ) = {P1, P2, P5} ρ prod (Medium) = {P4, P7, P9, P10} ρ prod (High) = {P8, P6} ρ prod (Dairies) = {P1, P2, P3, P5, P8} ρ prod (Bread ) = {P7, P9} ρ prod (Meat) = {P4}

The relationships between granularities in Ψ P rod are: P roduct Brand, P roduct Calorie and P roduct Group.

The previous example resembles data warehousing (DW). In an homogeneous and strict DWs, if a category A rolls-up to a category B then A B. Unlike datawarehousing, a domain schema enables to store data at different levels of granularities and not only at the finest granularity.

Databases. A database schema is a tuple Σ = (M, R, Dom), where: (a) M is a set of domain schema, (b) R is a set of relational schemas and (c) Dom is a function that, given a relation R ∈ R and an attribute A ∈ R, returns a tuple (Ψ RA , G RA ) where

Ψ RA = (U RA , RA , I RA , ρ RA , τ RA ) ∈ M and G RA ∈ RA .

Intuitively, Dom returns the domain schema and granularity associated with attribute A ∈ R.

To simplify the presentation of results, we will assume that there are no two Ψ 1 , Ψ 2 ∈ M that refer to the same domain U. Function schemaOfDomain(M, U) returns the domain schema in M defined over domain U.

A database instance D of a schema Σ is a finite collection of ground atoms of the form R(c 1 , . . . , c i , . . . , c l ), where (a) R(B 1 , . . . , B i , . . . , B l ) ∈ R, and (b) every

c i with i ∈ [1, l] is such that c i ∈ τ RBi (G RBi ) where Dom(R, B i ) = (Ψ RBi , G RBi ) and Ψ RBi = (U RBi , RBi , I RBi , ρ RBi , τ RBi ).

Database Integration

We want to explore the challenges that arise when merging data with different levels of granularity. For example, given several data sources with data at different granularity levels, which is the best granularity to be used in the global schema? At what level of granularity should the data be stored in the merged instances? How do integrity constraints affect this process? What type of query language can we use to deal with this different levels of granularity?

In a first stage we decided to restrict to a simple base case with two source databases to be integrated and a global schema that share the same set of relations in R and have no integrity constraints. In this way, each relation in a source is mapped to the same relation in the global schema. In this setting, we will want to define the granularity associated with every attribute in the global schema and the level of granularity at which the data should be stored.

Schemas

Σ 1 = (M 1 , R 1 , Dom 1 ) and Σ 2 = (M 2 , R 2 , Dom 2 ) are domain com- patible if M 1 = M 2 , R 1 = R 2 and for every R ∈ R 1 and every A ∈ R 1 , there exists Ψ , G 1 and G 2 such that Dom 1 (R, A) = (Ψ, G 1 ) and Dom 2 (R, A) = (Ψ, G 2 )

. This is, the two schemas share the same set of domain schemas and relational schemas and every attribute is associated to the same domain schema (even if it is at different levels of granularity) in the different schemas.

In this article we concentrate on the integration of databases defined over domain compatible schemas. In this setting, we will consider two cases: Domain Invariant Integration and Finest Domain Integration. In the former, we require global schema to share the same set of domains M as the sources. In the latter setting, we allow M G to differ from the sources but require it to contain the finest granularity schema that generalizes the domains in the sources.

Domain Invariant Integration

Given two domain compatible source databases with schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ) and instances D 1 and D 2 , respectively, we want to define a global schema Σ G = (M, R, Dom G ) such that at least the following conditions hold: (A.1) For every R ∈ R and every

A ∈ R, if Dom 1 (R, A) = Dom 2 (R, A) then Dom G (R, A) = Dom 1 (R, A) = Dom 2 (R, A). (A.2) For every R ∈ R and every A ∈ R, if Dom 1 (R, A) = (Ψ, G 1 ), Dom 2 (R, A) = (Ψ, G 2 ) and Dom G (R, A) = (Ψ, G G ), then G 1 G G and G 2 G G .

The first condition will ensure that if two attributes have the same granularity, the attribute in the global schema will have the same granularity. On the other hand, if they are different, the second condition ensures that the granularity in the global schema is coarser or equal than both of them.

In order to find the global schema that provides the finest granularity for each attribute such that these properties are satisfied, we define the join operator. Given a domain schema Ψ = (U, , I, ρ, τ ) and G 1 , G 2 ∈ , the Join Operator is

Join(Ψ, G 1 , G 2 ) = G if G ∈ , G 1 G, G 2 G, ∃G ∈ (G 1 G , G 2 G , G ≺ G)

⊥ otherwise Thus, the join operator of G 1 and G 2 returns the finest granularity in such that it subsumes both G 1 and G 2 ; if that granularity does not exist, it return ⊥.

Definition 1. Given schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ) let the join schema Σ 1 Σ 2 = (M, R, Dom Σ1 Σ2

) where for every R ∈ R and A ∈ R:

Dom Σ1 Σ2 (R, A) =    G if G = Join(Ψ, G 1 , G 2 ) = ⊥, Dom 1 (R, A) = (Ψ, G 1 ) and Dom 2 (R, A) = (Ψ, G 2 ) ⊥

otherwise If any of this granularities is ⊥, then there is no join schema that allows the integration of the databases under the domains in M.

The join schema of Σ 1 and Σ 2 is a good candidate for a global schema since it satisfies conditions (A.1) and (A.2). Its main drawback is the fact that it does not always exist.

Example 3. Consider the database schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ) where Ψ P rod = (U prod , prod , I prod , ρ prod , τ prod ) ∈ M, R = {Diet(product, amount)}, Dom 1 (Diet, product) = (Ψ prod , Group) and Dom 2 (Diet, product) = (Ψ prod , Calorie). There is no G ∈ prod such that Calorie G and Group G. Therefore, Join(Ψ prod , Group, Calorie) = ⊥ and the join schema Σ 1 Σ 2 does not exist. Now, let Σ 3 = (M, R, Dom 3 ) with Dom 3 (Diet, product) = (Ψ prod ,Product). In this case, Join(Ψ prod , Product, Group) = Group and, therefore, Σ 1 Σ 3 = (M, R, Dom Σ1 Σ3 ), where Dom Σ1 Σ3 (Diet, product) = (Ψ prod , Group). Now, in order to define the global instance given a global schema, we need to ensure that the data in the global schema are the same as the one contained in the sources but probably at a higher level of granularity. For example, we can have a month in the source and the associated year in the global schema.

D G over the schema Σ G is such that R(c 1 , . . . , c i , . . . , c l ) ∈ D G if

and only if:

There exists R(B 1 , . . . , B i , . . . , B l ) ∈ R, -For every i ∈ [1, l], c i ∈ τ (G RBi ) where Dom Σ G (R, B i ) = (Ψ RBi , G RBi ). -There exists R(c s 1 , . . . , c s i , . . . , c s l ) ∈ D 1 or R(c s 1 , . . . , c s i , . . . , c s l ) ∈ D 1 such that for every i ∈ [1, l], it holds that ρ RBi (c s i ) ⊆ ρ RBi (c i ) where Dom Σ G (R, B i ) = (Ψ RBi , G RBi ).

It is easy to see that if the join schema exists, then there will exist a unique global instance for the join schema.

Since sometimes there is no join schema to use as a global schema, in the next section we study the consequences of lifting the restriction of forcing the global schema to use the same set of domains M as the sources.

Finest Domain Integration

When integrating two granularities G 1 and G 2 , it might be the case that its associated domain schema has no granularity which is coarser than both of them. To solve this limitation, we consider in this section the possibility of adding to the domain schema the finest granularity that is coarser than both G 1 and G 2 . More formally, given two domain compatible schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ), we want to find a global schema Σ G = (M G , R, Dom G ) for which the following conditions hold: (B.1) For every Ψ = (U, , I, ρ, τ ) ∈ M there is a

Ψ G = (U G , G , I G , ρ G , τ G ) ∈ M G such that U G = U; ⊆ G ; I ⊆ I G ; for every i ∈ I, ρ(i) = ρ G (i); and for every G ∈ , τ (G) = τ G (G)

. Furthermore, only domains with this property belong to

M G . (B.2) For every R ∈ R and every A ∈ R, if Dom 1 (R, A) = Dom 2 (R, A) then Dom G (R, A) = Dom 1 (R, A) = Dom 2 (R, A). (B.3) For every R ∈ R and every A ∈ R, if Dom 1 (R, A) = (Ψ, G 1 ), Dom 2 (R, A) = (Ψ, G 2 ) and Dom G (R, A) = (Ψ G , G RAi ), with Ψ = (U, , I, ρ, τ ) and Ψ G = (U G , G , I G , ρ G , τ G ), then U = U G , G 1 Ψ G G RAi and G 2 Ψ G G RAi .

The first conditions ensures that M is an extension of M, this is, it contains a domain for the same set of U as M, and to each of them, it can only add new indices and new granularities without modifying the ones that already exist in M. Condition (B.2) ensures that if two attributes have the same granularity in the sources, the attribute in the global schema will not be modified. On the other hand, if they are different, condition (B.3) requires that for each pair of granularities to merge, there exists a domain schema Ψ G that extends the domain schema Ψ by adding a new granularity that is coarser than both of them. As before, we would like the global schema to use the finest granularity for each attribute such that these conditions hold. In order to achieve this, we first need to pair up the granularities and indices that will need to be merged.

Definition 3. Given two domain compatible schemasΣ 1 = (M, R, Dom 1 ), Σ 2 = (M, R, Dom 2 ) and Ψ = (U, , I, ρ, τ ) ∈ M, the set of granularities to merge of Ψ for Σ 1 and Σ 2 is GtoMerge(Ψ, Σ 1 , Σ 2 ) = {{G 1 , G 2 } | R ∈ R, A ∈ R, Dom 1 (R, A) = (Ψ, G 1 ) and Dom 2 (R, A) = (Ψ, G 2 )}.

Also, we will denote the set of attributes that share a specific pair of granularities by

Att(Σ 1 , Σ 2 , {G 1 , G 2 }) = {(R, A) | Ψ ∈ M, Dom i (R, A) = (Ψ, G 1 ), Dom j (R, A) = (Ψ, G 2 ) and (i, j) ∈ {(1, 2), (2, 1)}}. Let ℘(Ψ, G 1 , G 2 ) be the maximal partition of τ (G 1 ) ∪ τ (G 2

) such that for every distinct sets S 1 , S 2 ∈ ℘(Ψ, G 1 , G 2 ) and every i ∈ S 1 and j ∈ S 2 , ρ(i) ∩ ρ(j) = ∅.

Taking into consideration the pairs of granularities that need to be merged, we define the new domain schema that can include new granularities that merge the granularities that need to be integrated. Definition 4. Given two domain compatible schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ) and Ψ = (U, , I, ρ, τ ) ∈ M, the merged domain schema of Ψ for Σ 1 and Σ 2 is Merged (Ψ, Σ 1 , Σ 2 ) = (U, , I , ρ , τ ) where:

U = U, 2. ⊇ 3. I ⊇ I 4. For every i ∈ I, ρ (i) = ρ(i); 5. For every G ∈ , τ (G) = τ (G); 6. For every {G 1 , G 2 } ∈ GtoMerge(Ψ, Σ 1 , Σ 2 ), there exists G ∈ such that G 1

G and G 2 G and that it is not possible to define a granularity G" over U such that G" ≺ G , G 1 G" and G 1 G". 7. There is no other schema Ψ " = (U", ", I", ρ", τ ") that satisfies conditions (1) to (6) and such that " and/or I" I .

By condition (1) the merged domain schema will be defined over the same sources domain schema. Next, by conditions ( 2)-( 5), the merged schema will contain at least the same granularity labels and indices as Ψ with the same meaning. Condition (6) ensures that for each pair of granularities G 1 and G 2 that need to be merged, there is another granularity (not necessarily new) for which both of them are finer, and which is also the finest one that satisfies this requirement.

Ψ prod = Merged (Ψ prod , Group, Calories) = (U prod , prod ∪ {GroupCalories}, I prod ∪ {δ 1 }, ρ ⊗ , τ ⊗ )

, where ρ ⊗ (G) = ρ prod (G) for every G ∈ prod , ρ ⊗ (δ 1 ) = {P 1 , P 2 , P 3 , P 5 , P 6 , P 8 }, τ ⊗ (i) = τ prod (i) for every i ∈ I prod , τ ⊗ (GroupCalories) = {δ 1 , Medium}. Thus, the merged schema

Σ 1 ⊗ Σ 2 = (M ⊗ , R, Dom ⊗ ) where M ⊗ = {Ψ prod } and Dom ⊗ (Diet, product) = (Ψ prod , GroupCalories).

On the other hand, when integrating schemas Σ 1 and Σ 3 the join and the merge schema coincide.

A good property of the merged schema is that it always exists and is unique up to renaming of granularities in ⊗ and indices en I ⊗ . Also, when merging schemas Σ 1 and Σ 2 , for every domain schema Ψ in them, and for every G 1 , G 2 ∈ Ψ , if the join schema exists and is

Join(Ψ, G 1 , G 2 ) = G J , then for Ψ = Merged (Ψ, G 1 , G 2 ) it holds that the Join(Ψ , G 1 , G 2 ) = G ≺ Ψ G J . Thus, schema Σ 1 ⊗Σ 2 can contain finer grained attributes than Σ 1 Σ 2 .

A drawback of the merged schema is that the new granularity names (in ⊗ ) and indices (in I ⊗ ) will be less intuitive. Indeed, in Example 4 the index δ 1 and the granularity name GroupCalories are less intuitive than the indices and granularity names already in the domain schema before the merge.

Computing the Merged Schema. The merged schema of two domain compatible schemas Σ 1 and Σ 2 can be computed by Algorithm 1 called MergedSchema which relies on the following property of the merged schema: for each pair of granularities that need to be merged, the merged schema contains a new granularity G that contains one index for each partition in ℘(Ψ, G 1 , G 2 ) and the ρ of each of this indices will contain exactly the union of the ρ of the indices in the corresponding partition. Relying on this observation, the subroutine MergedGranularity (see Algorithm 2) returns granularity G and the new domain schema which now contains this new granularity G . Algorithm MergedSchema calls MergedGranularity for all the combinations of G 1 and G 2 that need to be merged. It might be the case that some of the new granularities and indices added in lines 4-5 of the Algorithm will result in non-minimal schemas, in the sense that there might be two indices i and j in a domain schema for which ρ(i) = ρ(j), also there might be granularities G 1 and G 2 which contain the same indices. Method CleanUp (Algorithm 3) removes all these indices and granularities that are not needed.

To achieve this, it first replaces any new index in G by one in the original schema which maps to the same elements in U if it exists. Next, it makes sure that there is no other granularity with the same indices. After the CleanUp, Algorithm MergeSchema in lines 8-9 associates to each merged attribute its new granularity through function DOm ⊗ . Finally, in line 10 the merged schema for Σ 1 and Σ 2 is returned.

al. [6] extends ODMG type system [10] to handle spatio-temporal data with specific types to define spatio-temporal properties at multiple granularities. They provide geometric converse operators to implement changes on granularity. Similarly, Bertino et al. [1] extends object-relational model based on OpenGis specifications described in SQL3 to represent temporal dimension in this model and the multi-representation of spatio-temporal granularities.

From another perspective, a schema model for datawarehousing defines dimensions over fact tables. From fact tables, data is aggregated along a dimension forming a hierarchy of finer-than relationships of granularity. As Iftikhar and Pedersen already highlighted [2,3], current models cannot store data at different levels of granularity, since they require all dimensions attributes to be given concrete values. They propose extensions to current models that include a time dimension granularity defined as a single hierarchy. Thus, fact data is associated with a time dimension in a particular granularity. The work in [3] can be seen as a materialization of our multi-granular database schema. Unlike the work in [3], however, our model is applicable to not only temporal granularity but also spatial or semantic granularity of a conceptual classification.

Our work relates to the problem described in [11], which characterizes the derivation problem for summary data. The idea is to compare summary data in statistical databases based on common but not identical classification criterion. This classification criterion can be seen as a granularity of atomic data, where each class corresponds to a granularity index that maps to a portion of the underlying domain (granule). The problem is then to integrate different summarized data and extract useful information from them. Unlike the datawarehouse context, summary datasets may not have atomic data from which summarized data is obtained. Data is stored in terms of different classification criterion, which is similar to our problem of having datasets at different level of granularities without the atomic facts of the underlying domain.

Despite previous work, to the best of our knowledge, there is no work that formalizes granularity in a more general relational context. We want here to formalize granularity not only for time-depending attributes, but also for attributes defined by different classification criteria. Our approach could then be extended to deal with particularities of each attribute domain. Like the work in [11], we also want to be able to extract useful information from different datasets whose data are represented at different granularities.

Discussion

In this section we discuss the issues that arise if we consider constraints or allow instances where each attribute can contain different levels of granularity.

Integrity Constraints. In the previous sections, we have considered that the global instance contains all the tuples of the data sources, where some of their attributes may have been modified to a coarser level of granularity. If we now consider that the schema contains integrity constraints, new issues arise, which are illustrated by the following example.

Example 5. Consider different databases that store information about relevant touristic landmarks. Each source database contains the predicate LandMark (name, location) and functional dependency name → location. Attribute location may be defined over a domain at different levels of granularities. For example, while a data source may store a location in terms of provinces, another data source stores it in terms of countries.

Since we have a functional dependency constraints, an integrated database will not store two tuples in LandMark with the same value in attribute name. A global schema will need to be such that the functional dependency constraints must be satisfied. Tuples with the same name will need to be analyzed and, when possible, the conversion of attributes will need to be applied.

We say that a conflict of data integration exists when a data stored in one database cannot be converted to the data in another database. For example, a database stores tuple (Aconcagua mountain, Argentina) and another stores (Aconcagua mountain, Chile). In this example, both tuples are at the same granularity of countries and, therefore, they are inconsistent because they should have the same value in attribute location. An inconsistency can also happen when storing the location at different levels of granularity. For example, a database stores tuple (Aconcagua mountain, San Juan Province) using a granularity of provinces and another stores (Aconcagua mountain, Chile) using a granularity of countries. For this example, there is no way to convert the value of location in one tuple into the value of the other tuple.

Although values of attribute location can be different in two databases for the same value of attribute name, there are cases of no conflict. We call this situation synergy as in [11]. For example, a database stores tuple (Aconcagua mountain, San Juan Province) and another stores (Aconcagua mountain, Argentina). In such case, we would expect to keep the tuples with the fine granularity which is not in conflict with the coarse granularity.

Instances with several granularities in the same attribute. One of the limitations of how global instances have been defined is that we loose some of the information when the global schema uses a coarser granularity in any of the attributes. For example, consider the domain schema Ψ P rod and an attribute A with granularity Product in the source and Calories in the global schema. If when merging we replace every product value in attribute A by its associated Low , Medium and High, we will be loosing some information that may turn to be useful. This is why we could consider a modification of the definition of database instance to allow for each attribute to contain indices in or below the granularity of the attribute. More formally, a database instance D of a schema Σ could alternatively be defined as a finite collection of ground atoms of the form R(c 1 , . . . , c i , . . . , c l ), where (a) R(B 1 , . . . , B i , . . . , B l ) ∈ R, and (b) every c i is such that c i ∈ I, there exists g ∈ τ RBi (G RBi ) for which ρ RBi (c i ) ⊆ ρ RBi (g) where Dom(R, B i ) = (Ψ RBi , G RBi ).

With this alternative definition, when merging instances D 1 over Σ 1 and D 2 over Σ 2 we can use the global instance D G = D 1 ∪ D 2 over either the join or merge schema. The problem now becomes the way in which we can query this type of instances and how answers should be displayed. A query language, like SQL, could be augmented to request the level of granularity at which we want the answers to be displayed.

Definition 2 .2Given instances D 1 and D 2 defined respectively over schemas Σ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ), and a global schema Σ G , a global instance

Definition 5 .5Given two domain compatible schemasΣ 1 = (M, R, Dom 1 ) and Σ 2 = (M, R, Dom 2 ), let the merged schema of Σ 1 and Σ 2 be Σ 1 ⊗Σ 2 = (M ⊗ , R, Dom ⊗ ) where -M ⊗ = {Ψ | Ψ = Merged (Ψ, Σ 1 , Σ 2 ) and Ψ ∈ M}; and -For every R ∈ R and A ∈ R, Dom ⊗ (R, A)=(Ψ ⊗ ,Join(Ψ , G 1 , G 2 )) where Dom 1 (R, A) = (Ψ, G 1 ), Dom 2 (R, A) = (Ψ, G 2 ) and Ψ = Merged (Ψ, Σ 1 , Σ 2 ).The merged schema is a good candidate to be a global schema since it satisfies properties (B.1) to (B.3). Example 4. (example 3 continued) When merging Σ 1 and Σ 2 we need to compute

Algorithm 1 : 3 Ψ 5 ( 6 G1356MergedSchema Input : Domain compatible schemas Σ1 = (M, R, Dom1) and Σ2 = (M, R, Dom2) Output : Merged domain schemas Σ1 ⊗Σ2 1 M⊗ ← ∅ 2 for (U, , I, ρ, τ ) ∈ M do ← (U, , I, ρ, τ ) 4 for {G1, G2} ∈ GtoMerge(Ψ, Σ1, Σ2) do / * Add to Ψ a granularity that merges G1 and G2 * / G , Ψ ) ← MergedGranularity(Ψ , G1, G2) / * Removes equivalent indices and granularities that could have been added in lines homomorphic indices and granularities * / ← CleanUp(Ψ , G ) 7 M⊗ ← M⊗ ∪ {Ψ } 8 for (R, A) ∈ Att(Σ1, Σ2, {G1, G2}) do 9 Dom⊗(R, A) = G 10 return (M⊗, R, Dom⊗);

Consider a product domain schema Ψ prod = (U prod , prod , I prod , ρ prod , τ prod ), where:ρ(2012)yearρ(spring-summer12)fall12)termmonthTime DomainFig. 1. Instance a of a domain schema over the time domain to represent academic activities ofuniversitiesExample 2. U prod = {P1, P2, P3, P4, P5, . . . , P10}prod = {P roduct, Group, Calorie, Brand}I prod = {P1, . . . , P10, Diaries, Bread , Meat, Low , Medium, High, Soprole, . . . , Baker }τ prod (P roduct) = {P1, P2, P3, P4, P5, . . . , P10}τ prod (Group) = {Diaries, Bread , Meat}τ prod (Calorie) = {Low , Medium, High}τ prod (Brand) = {Soprole, Nestle, Alpura, Bimbo, Baker }ρ prod (Pi) = {Pi} for i ∈ [1, 10]

Acknowledgements We would like to thank reviewers for their valuable comments that help to improve this paper. This work is funded by Bicentenario Program PSD 57. Andrea Rodríguez is also partially funded by Fondef D09I1185.

Algorithm 2: MergedGranularity

Input : Schema Domain Ψ = (U, , I, ρ, τ ) and granularities G1, G2 ∈ Output : (G , Ψ ) where G is a granularity that minimally merges G1 and G2 and Ψ extends Ψ to consider it 1 Ψ ← (U, , I, ρ, τ ) 2 G ← fresh granularity label not already in 3 τ (G ) ← ∅ 4 for S ∈ ℘(Ψ, G1, G2) do 5 i ← fresh index not already in

Related work

The concept of granularity is not new. Important advances have been done in the formalization of granularity in the temporal and spatial domains, since their applications need abstraction mechanisms for handling domains that are continuous in nature. The formalization of temporal granularity by Bettini et. al. [4,5] has been basis for several different studies that explore temporal and spatial granularity. The work by Worboys [7,8] provides theoretical foundation, using notions similar to rough set theory, to define imprecision of spatial data at different granularities. For conceptual modeling, Khatri et al. [9] propose an annotation-based spatio-temporal conceptual model that accounts for the semantic related to spatial and temporal granularity. The work by Camossi et

Multi-granular spatio-temporal object models: concepts and research directions EBertino ECamossi MBertolotto ICOODB'09 Springer-Verlag 2010 Schema design alternatives for multi-granular data warehousing NIftikhar TBPedersen DEXA'10 Springer-Verlag 2010 Gradual data aggregation in multi-granular fact tables on resource-constrained systems NIftikhar TBPedersen KES'10 Springer-Verlag 2010 A general framework for time granularity and its application to temporal reasoning CBettini XSWang SJajodia Annals of Mathematics and Artificial Intelligence 22 1-2 1998 A glossary of time granularity concepts CBettini CEDyreson WSEvans RTSnodgrass XSWang Temporal Databases, Dagstuhl 1997 A multigranular object-oriented framework supporting spatio-temporal granularity conversions ECamossi MBertolotto EBertino International Journal of Geographical Information Science 20 5 2006 Imprecision in finite resolution spatial data MFWorboys GeoInformatica 2 3 1998 Computation with imprecise geospatial data MFWorboys Computers, Environment, and Urban Systems 22 2 1998 Supporting user-defined granularities in a spatiotemporal conceptual model VKhatri SRam RTSnodgrass GMO'brien Annals of Mathematics and Artificial Intelligence 36 1-2 2002 The Object Data Standard: ODMG 3 RGCattell DKBarry <author> <persName><forename type="first">Morgan</forename><surname>Kaufmann</surname></persName> </author> <imprint> <date type="published" when="2000">2000</date> </imprint> </monogr> </biblStruct> <biblStruct xml:id="b11"> <analytic> <title level="a" type="main">The derivation problem for summary data FMMalvestuto SIGMOD Conference ACM Press 1988