=Paper= {{Paper |id=None |storemode=property |title=Multi-Granular Schemas for Data Integration |pdfUrl=https://ceur-ws.org/Vol-866/paper10.pdf |volume=Vol-866 |dblpUrl=https://dblp.org/rec/conf/amw/RodriguezB12 }} ==Multi-Granular Schemas for Data Integration== https://ceur-ws.org/Vol-866/paper10.pdf
        Multi-Granular Schemas for Data Integration

                         M. Andrea Rodrı́guez and Loreto Bravo

                           Universidad de Concepción, Chile
                     Edmundo Larenas 215, 4070409 Concepción, Chile
                       {arodriguez,lbravo}@inf.udec.cl



       Abstract. Data can contain information at different levels of granularities but
       this metadata is generally left implicit in the data model. If we want to take ad-
       vantage of different levels of granularity when integrating data, we need to first
       extend database schemas to include granularity information. In this article we
       (i) provide a multi-granular domain schema that is used in the formalization of
       database schemas so that each attribute is assigned a certain granularity; and (ii)
       explore the issues when integrating data at different granularities and suggest
       possible global schemas and instances.


1   Introduction
The notion of granularity is relevant to many aspects of data representation, and its
formalization is fundamental for data integration. Granularity relates to data quality
since it introduces bounds to the level of detail in which data is represented. Despite
its relevance, granularity is usually implicit in a data model. This rises consequence for
data integration, since no useful information can be extracted from data that may be
semantically related but that are represented at different granularities.
     As a motivating example, consider the integration of two different databases that
store information about earthquakes occurred in different populated areas of the world.
One of the databases stores the location of earthquakes by the epicenter’s geographic
coordinates, stores the time of occurrence by day and hour, and stores the magnitude in
the Richter scale. The second database stores the location of earthquakes by the name
of the administrative district that contains the epicenter, stores the time of occurrence by
day, hour and minute, and stores the magnitude in the same scale. To solve this prob-
lem, we first need a database schema language that can specify both databases using
different granularities. Then, the integration of both databases is not trivial. Although
both databases store the magnitude in the same scale, they store temporal and spatial
information at different granularities. We then need to formalization the schema of in-
tegration to be able to materialize the integrated database.
     Although there exist different studies that model and query data at different gran-
ularities [1–3], none of them have addressed the integration of database schemas that
differ in the granularity to represent different attributes. Unlike other studies, we do
not focus on a particular domain, instead, we provide a general definition of schema
integration, where granularity can be applied to any of the attribute domains. Even
more, unlike classical approaches in data warehousing, data is not necessarily stored
at the finest level of granularity upon which aggregation functions derive data at other




                                             142
coarser levels. In summary, the main contributions of this work are twofold. We start
by formalizing a multi-granular domain schema and multi-granular database schema,
in which attributes take values from domains that can be represented at different gran-
ularities. A second contribution is the formalization of a global schema given source
multi-granular schemas and data. The goal is to obtain a global schema that provides at
least the same information as the source data. Future work is the materialization of the
database instance of the global schema.
    The organization of the paper is as follows. In Section 2 we introduce our formal-
ization of domain schemas and databases with multiple granularities. In Section 3 we
study the problem of integrating two instances with data at different levels of granular-
ity under some restrictions and we propose two possible global schemas and instances
to integrate this data. Related work is presented in Section 4 to highlight the novelty of
the work. Finally, Section 5 provides a discussion about more general settings.


2   Domain schemas and Databases

A domain schema is a tuple Ψ = (U, `, I, ρ, τ ), where (i) U is the domain associated
with Ψ , (ii) ` is the set of granularity identifiers (or labels), (iii) I is the set of granule
identifiers (or labels), (iv) ρ is a function ρ : I → 2U that maps granules identifiers
to subsets of the domain, and (v) τ is a function ` → 2I such that for all G ∈ ` if
i, j ∈ τ (G) then ρ(i) ∩ ρ(j) = ∅. To simplify the presentation we will asume that for
i, j ∈ I, i 6= j iff ρ(i) 6= ρ(j), and that for G1 , G2 ∈ `, G1 6= G2 iff τ (G1 ) 6= τ (G2 ).
     Given a domain schema Ψ = (U, `, I, ρ, τ ) and granularities G1 , G2 ∈ `: (i) G1 is
finer or equal than G2 , denoted G1 Ψ G2 , iff for all i ∈ τ (G1 ) there exists j ∈ τ (G2 )
such that ρ(i) ⊆ ρ(j); (ii) G1 aggregates to G2 , denoted G1 /Ψ G2 , iff for all j ∈ τ (G2 )
there exists i1 , . . . , in ∈ τ (G1 ) such that ρ(i1 ) ∪ ... ∪ ρ(in ) = ρ(j); and (iii) G1 is a
partition of G2 , denoted by G1 EΨ G2 , if G1 /Ψ G2 and G1 ≺Ψ G2 . When the domain
schema is clear from the context we will use ≺, / and E.
     A particular instance of a domain schema is the sets of granularities defined over
the time domain. A time domain is a pair (T ; ≤), where T is a non-empty set of time
instants, and ≤ is a total order in T [4–6]. By saying that T is a set of time instants we
do not impose the idea that the time is only discrete. Time domain is continuous (dense)
if for all t < t0 there exists t00 such that t < t00 < t0 . The next example illustrates the
case of multiple time granularities for representing academic activities of universities.
In this setting the set of granule identifiers is an ordered set of index values.

Example 1. Consider the following domain schema Ψtime = (Utime , `time , Itime ,
ρtime , τtime ), with `time = {month, term, year} and the elements in Itime , as well
as functions ρtime and τtime are represented graphically in Figure 1. The only relations
that hold among these granularities are month E year and term ≺ year.                 

Another instance of a domain schema is the domain representing foodstuffs. Different
granularities to represent a food product can be defined using classification criteria such
as group, calorie, and brand. This is illustrated in the following example.




                                             143
                                      ρ(2012)
                                                                                                               year

           ρ(winter12)                ρ(spring-summer12)                          ρ(fall12)
                                                                                                               term

  ρ(Jan12) ρ(Feb12) ρ(Mar12) ρ(Apr12) ρ(May12) ρ(Jun12) ρ(Jul12) ρ(Aug12)ρ(Sep12) ρ(Oct12) ρ(Nov12) ρ(Dic12)
                                                                                                               month



                                                  Time Domain



Fig. 1. Instance a of a domain schema over the time domain to represent academic activities of
universities


Example 2. Consider a product domain schema Ψprod = (Uprod , `prod , Iprod , ρprod ,
τprod ), where:
Uprod = {P1 , P2 , P3 , P4 , P5 , . . . , P10 }
`prod = {P roduct, Group, Calorie, Brand}
Iprod = {P1 , . . . , P10 , Diaries, Bread , Meat, Low , Medium, High, Soprole, . . . , Baker }
τprod (P roduct) = {P1 , P2 , P3 , P4 , P5 , . . . , P10 }
τprod (Group) = {Diaries, Bread , Meat}
τprod (Calorie) = {Low , Medium, High}
τprod (Brand) = {Soprole, Nestle, Alpura, Bimbo, Baker }
ρprod (Pi ) = {Pi } for i ∈ [1, 10]
ρprod (Soprole) = {P1 , P2 }
ρprod (Nestle) = {P3 , P4 }
ρprod (Alpura) = {P5 , P6 }
ρprod (Bimbo) = {P7 , P8 }
ρprod (Baker ) = {P9 , P10 }
ρprod (Low ) = {P1 , P2 , P5 }
ρprod (Medium) = {P4 , P7 , P9 , P10 }
ρprod (High) = {P8 , P6 }
ρprod (Dairies) = {P1 , P2 , P3 , P5 , P8 }
ρprod (Bread ) = {P7 , P9 }
ρprod (Meat) = {P4 }
The relationships between granularities in ΨP rod are: P roduct E Brand, P roduct E
Calorie and P roduct / Group.                                                                   

The previous example resembles data warehousing (DW). In an homogeneous and strict
DWs, if a category A rolls-up to a category B then A E B. Unlike datawarehousing, a
domain schema enables to store data at different levels of granularities and not only at
the finest granularity.

Databases. A database schema is a tuple Σ = (M, R, Dom), where: (a) M is a set
of domain schema, (b) R is a set of relational schemas and (c) Dom is a function that,
given a relation R ∈ R and an attribute A ∈ R, returns a tuple (ΨRA , GRA ) where




                                                           144
ΨRA = (URA , `RA , IRA , ρRA , τRA ) ∈ M and GRA ∈ `RA . Intuitively, Dom returns
the domain schema and granularity associated with attribute A ∈ R.
    To simplify the presentation of results, we will assume that there are no two Ψ1 , Ψ2 ∈
M that refer to the same domain U. Function schemaOfDomain(M, U) returns the do-
main schema in M defined over domain U.
    A database instance D of a schema Σ is a finite collection of ground atoms of the
form R(c1 , . . . , ci , . . . , cl ), where (a) R(B1 , . . . , Bi , . . . , Bl ) ∈ R, and (b) every ci
with i ∈ [1, l] is such that ci ∈ τRBi (GRBi ) where Dom(R, Bi ) = (ΨRBi , GRBi ) and
ΨRBi = (URBi , `RBi , IRBi , ρRBi , τRBi ).


3     Database Integration

We want to explore the challenges that arise when merging data with different levels of
granularity. For example, given several data sources with data at different granularity
levels, which is the best granularity to be used in the global schema? At what level
of granularity should the data be stored in the merged instances? How do integrity
constraints affect this process? What type of query language can we use to deal with
this different levels of granularity?
     In a first stage we decided to restrict to a simple base case with two source databases
to be integrated and a global schema that share the same set of relations in R and have no
integrity constraints. In this way, each relation in a source is mapped to the same relation
in the global schema. In this setting, we will want to define the granularity associated
with every attribute in the global schema and the level of granularity at which the data
should be stored.
     Schemas Σ1 = (M1 , R1 , Dom1 ) and Σ2 = (M2 , R2 , Dom2 ) are domain com-
patible if M1 = M2 , R1 = R2 and for every R ∈ R1 and every A ∈ R1 , there exists
Ψ , G1 and G2 such that Dom1 (R, A) = (Ψ, G1 ) and Dom2 (R, A) = (Ψ, G2 ). This
is, the two schemas share the same set of domain schemas and relational schemas and
every attribute is associated to the same domain schema (even if it is at different levels
of granularity) in the different schemas.
     In this article we concentrate on the integration of databases defined over domain
compatible schemas. In this setting, we will consider two cases: Domain Invariant In-
tegration and Finest Domain Integration. In the former, we require global schema to
share the same set of domains M as the sources. In the latter setting, we allow MG
to differ from the sources but require it to contain the finest granularity schema that
generalizes the domains in the sources.


3.1   Domain Invariant Integration

Given two domain compatible source databases with schemas Σ1 = (M, R, Dom1 )
and Σ2 = (M, R, Dom2 ) and instances D1 and D2 , respectively, we want to define a
global schema ΣG = (M, R, DomG ) such that at least the following conditions hold:
(A.1) For every R ∈ R and every A ∈ R, if Dom1 (R, A) = Dom2 (R, A) then
       DomG (R, A) = Dom1 (R, A) = Dom2 (R, A).




                                                145
 (A.2) For every R ∈ R and every A ∈ R, if Dom1 (R, A) = (Ψ, G1 ), Dom2 (R, A) =
        (Ψ, G2 ) and DomG (R, A) = (Ψ, GG ), then G1  GG and G2  GG .
The first condition will ensure that if two attributes have the same granularity, the at-
tribute in the global schema will have the same granularity. On the other hand, if they
are different, the second condition ensures that the granularity in the global schema is
coarser or equal than both of them.
    In order to find the global schema that provides the finest granularity for each at-
tribute such that these properties are satisfied, we define the join operator. Given a do-
main schema Ψ = (U,    `, I, ρ, τ ) and G1 , G2 ∈ `, the Join Operator is
                        G if G ∈ `, G1  G, G2  G, 6 ∃G0 ∈ `(G1  G0 , G2  G0 , G0 ≺ G)
Join(Ψ, G1 , G2 ) =
                        ⊥ otherwise
Thus, the join operator of G1 and G2 returns the finest granularity in ` such that it sub-
sumes both G1 and G2 ; if that granularity does not exist, it return ⊥.

Definition 1. Given schemas Σ1 = (M, R, Dom1 ) and Σ2 = (M, R, Dom2 ) let the
join schema Σ1 o n Σ2 = (M,  R, DomΣ1n     oΣ2 ) where for every R ∈ R and A ∈ R:
                            G       if G = Join(Ψ, G1 , G2 ) 6= ⊥, Dom1 (R, A) = (Ψ, G1 )
    DomΣ1n  oΣ2 (R, A) =              and Dom2 (R, A) = (Ψ, G2 )
                               ⊥     otherwise
                            
If any of this granularities is ⊥, then there is no join schema that allows the integration
of the databases under the domains in M.                                                 

The join schema of Σ1 and Σ2 is a good candidate for a global schema since it satisfies
conditions (A.1) and (A.2). Its main drawback is the fact that it does not always exist.

Example 3. Consider the database schemas Σ1 = (M, R, Dom1 ) and Σ2 = (M, R,
Dom2 ) where ΨP rod = (Uprod , `prod , Iprod , ρprod , τprod ) ∈ M, R = {Diet(product,
amount)}, Dom1 (Diet, product) = (Ψprod , Group) and Dom2 (Diet, product) =
(Ψprod , Calorie). There is no G ∈ `prod such that Calorie  G and Group  G.
Therefore, Join(Ψprod , Group, Calorie) = ⊥ and the join schema Σ1 o        n Σ2 does not
exist. Now, let Σ3 = (M, R, Dom3 ) with Dom3(Diet, product) = (Ψprod ,Product).
In this case, Join(Ψprod , Product, Group) = Group and, therefore, Σ1 o      n Σ3 = (M,
R, DomΣ1n   oΣ3 ), where Dom  Σ1n
                                oΣ3 (Diet, product)    =  (Ψ prod , Group).            

Now, in order to define the global instance given a global schema, we need to ensure
that the data in the global schema are the same as the one contained in the sources
but probably at a higher level of granularity. For example, we can have a month in the
source and the associated year in the global schema.

Definition 2. Given instances D1 and D2 defined respectively over schemas Σ1 =
(M, R, Dom1 ) and Σ2 = (M, R, Dom2 ), and a global schema ΣG , a global instance
DG over the schema ΣG is such that R(c1 , . . . , ci , . . . , cl ) ∈ DG if and only if:
 – There exists R(B1 , . . . , Bi , . . . , Bl ) ∈ R,
 – For every i ∈ [1, l], ci ∈ τ (GRBi ) where DomΣG (R, Bi ) = (ΨRBi , GRBi ).
 – There exists R(cs1 , . . . , csi , . . . , csl ) ∈ D1 or R(cs1 , . . . , csi , . . . , csl ) ∈ D1 such that
    for every i ∈ [1, l], it holds that ρRBi (csi ) ⊆ ρRBi (ci ) where DomΣG (R, Bi ) =
    (ΨRBi , GRBi ).                                                                                         




                                                   146
It is easy to see that if the join schema exists, then there will exist a unique global
instance for the join schema.
     Since sometimes there is no join schema to use as a global schema, in the next
section we study the consequences of lifting the restriction of forcing the global schema
to use the same set of domains M as the sources.

3.2   Finest Domain Integration
When integrating two granularities G1 and G2 , it might be the case that its associated
domain schema has no granularity which is coarser than both of them. To solve this
limitation, we consider in this section the possibility of adding to the domain schema the
finest granularity that is coarser than both G1 and G2 . More formally, given two domain
compatible schemas Σ1 = (M, R, Dom1 ) and Σ2 = (M, R, Dom2 ), we want to find
a global schema ΣG = (MG , R, DomG ) for which the following conditions hold:
 (B.1) For every Ψ = (U, `, I, ρ, τ ) ∈ M there is a ΨG = (UG , `G , IG , ρG , τG ) ∈ MG
        such that UG = U; ` ⊆ `G ; I ⊆ IG ; for every i ∈ I, ρ(i) = ρG (i); and for every
        G ∈ `, τ (G) = τG (G). Furthermore, only domains with this property belong to
        MG .
 (B.2) For every R ∈ R and every A ∈ R, if Dom1 (R, A) = Dom2 (R, A) then
        DomG (R, A) = Dom1 (R, A) = Dom2 (R, A).
 (B.3) For every R ∈ R and every A ∈ R, if Dom1 (R, A) = (Ψ, G1 ), Dom2 (R, A) =
        (Ψ, G2 ) and DomG (R, A) = (ΨG , GRAi ), with Ψ = (U, `, I, ρ, τ ) and ΨG =
        (UG , `G , IG , ρG , τG ), then U = UG , G1 ΨG GRAi and G2 ΨG GRAi .

The first conditions ensures that M0 is an extension of M, this is, it contains a domain
for the same set of U as M, and to each of them, it can only add new indices and
new granularities without modifying the ones that already exist in M. Condition (B.2)
ensures that if two attributes have the same granularity in the sources, the attribute in the
global schema will not be modified. On the other hand, if they are different, condition
(B.3) requires that for each pair of granularities to merge, there exists a domain schema
ΨG that extends the domain schema Ψ by adding a new granularity that is coarser than
both of them. As before, we would like the global schema to use the finest granularity
for each attribute such that these conditions hold. In order to achieve this, we first need
to pair up the granularities and indices that will need to be merged.
Definition 3. Given two domain compatible schemas Σ1 = (M, R, Dom1 ), Σ2 =
(M, R, Dom2 ) and Ψ = (U, `, I, ρ, τ ) ∈ M, the set of granularities to merge of Ψ for
Σ1 and Σ2 is GtoMerge(Ψ, Σ1 , Σ2 ) = {{G1 , G2 } | R ∈ R, A ∈ R, Dom1 (R, A) =
(Ψ, G1 ) and Dom2 (R, A) = (Ψ, G2 )}. Also, we will denote the set of attributes that
share a specific pair of granularities by Att(Σ1 , Σ2 , {G1 , G2 }) = {(R, A) | Ψ ∈ M,
Domi (R, A) = (Ψ, G1 ), Domj (R, A) = (Ψ, G2 ) and (i, j) ∈ {(1, 2), (2, 1)}}.
    Let ℘(Ψ, G1 , G2 ) be the maximal partition of τ (G1 ) ∪ τ (G2 ) such that for every
distinct sets S1 , S2 ∈ ℘(Ψ, G1 , G2 ) and every i ∈ S1 and j ∈ S2 , ρ(i) ∩ ρ(j) = ∅. 
Taking into consideration the pairs of granularities that need to be merged, we define
the new domain schema that can include new granularities that merge the granularities
that need to be integrated.




                                           147
Definition 4. Given two domain compatible schemas Σ1 = (M, R, Dom1 ) and Σ2 =
(M, R, Dom2 ) and Ψ = (U, `, I, ρ, τ ) ∈ M, the merged domain schema of Ψ for Σ1
and Σ2 is Merged (Ψ, Σ1 , Σ2 ) = (U,0 `0 , I 0 , ρ0 , τ 0 ) where:
 1. U 0 = U,
 2. `0 ⊇ `
 3. I 0 ⊇ I
 4. For every i ∈ I, ρ0 (i) = ρ(i);
 5. For every G ∈ `, τ 0 (G) = τ (G);
 6. For every {G1 , G2 } ∈ GtoMerge(Ψ, Σ1 , Σ2 ), there exists G0 ∈ `0 such that G1 
    G0 and G2  G0 and that it is not possible to define a granularity G” over U such
    that G” ≺ G0 , G1  G” and G1  G”.
 7. There is no other schema Ψ ” = (U”, `”, I”, ρ”, τ ”) that satisfies conditions (1) to
    (6) and such that `” ( `0 and/or I” ( I 0 .                                        

By condition (1) the merged domain schema will be defined over the same sources
domain schema. Next, by conditions (2)-(5), the merged schema will contain at least
the same granularity labels and indices as Ψ with the same meaning. Condition (6)
ensures that for each pair of granularities G1 and G2 that need to be merged, there is
another granularity (not necessarily new) for which both of them are finer, and which is
also the finest one that satisfies this requirement.

Definition 5. Given two domain compatible schemas Σ1 = (M, R, Dom1 ) and Σ2 =
(M, R, Dom2 ), let the merged schema of Σ1 and Σ2 be Σ1 ⊗Σ2 = (M⊗ , R, Dom⊗ )
where
 – M⊗ = {Ψ 0 | Ψ 0 = Merged (Ψ, Σ1 , Σ2 ) and Ψ ∈ M}; and
 – For every R ∈ R and A ∈ R, Dom⊗ (R, A)=(Ψ⊗ ,Join(Ψ 0, G1, G2 )) where Dom1 (R,
    A) = (Ψ, G1 ), Dom2 (R, A) = (Ψ, G2 ) and Ψ 0 = Merged (Ψ, Σ1 , Σ2 ).      

The merged schema is a good candidate to be a global schema since it satisfies proper-
ties (B.1) to (B.3).

Example 4. (example 3 continued) When merging Σ1 and Σ2 we need to compute
  0
Ψprod   = Merged (Ψprod , Group, Calories) = (Uprod , `prod ∪ {GroupCalories}, Iprod
∪ {δ1 }, ρ⊗ , τ⊗ ), where ρ⊗ (G) = ρprod (G) for every G ∈ `prod , ρ⊗ (δ1 ) = {P1 , P2 ,
P3 , P5 , P6 , P8 }, τ⊗ (i) = τprod (i) for every i ∈ Iprod , τ⊗ (GroupCalories) = {δ1 ,
Medium}. Thus, the merged schema Σ1 ⊗ Σ2 = (M⊗ , R, Dom⊗ ) where M⊗ =
    0                                          0
{Ψprod   } and Dom⊗ (Diet, product) = (Ψprod       , GroupCalories).
      On the other hand, when integrating schemas Σ1 and Σ3 the join and the merge
schema coincide.                                                                     

A good property of the merged schema is that it always exists and is unique up to
renaming of granularities in `⊗ and indices en I⊗ . Also, when merging schemas Σ1
and Σ2 , for every domain schema Ψ in them, and for every G1 , G2 ∈ `Ψ , if the join
schema exists and is Join(Ψ, G1 , G2 ) = GJ , then for Ψ 0 = Merged (Ψ, G1 , G2 ) it
holds that the Join(Ψ 0 , G1 , G2 ) = G0 ≺Ψ 0 GJ . Thus, schema Σ1 ⊗Σ2 can contain
finer grained attributes than Σ1 o
                                 n Σ2 . A drawback of the merged schema is that the new
granularity names (in `⊗ ) and indices (in I⊗ ) will be less intuitive. Indeed, in Example 4




                                           148
 Algorithm 1: MergedSchema
   Input : Domain compatible schemas Σ1 = (M, R, Dom1 ) and Σ2 = (M, R, Dom2 )
   Output : Merged domain schemas Σ1 ⊗Σ2
 1 M⊗ ← ∅
 2 for (U, `, I, ρ, τ ) ∈ M do
 3      Ψ 0 ← (U, `, I, ρ, τ )
 4      for {G1 , G2 } ∈ GtoMerge(Ψ, Σ1 , Σ2 ) do
            /* Add to Ψ a granularity that merges G1 and G2                */
 5          (G0 , Ψ 0 ) ← MergedGranularity(Ψ 0 , G1 , G2 )
            /* Removes equivalent indices and granularities that
                 could have been added in lines homomorphic indices
                 and granularities                                         */
 6          G0 ← CleanUp(Ψ 0 , G0 )
 7          M⊗ ← M⊗ ∪ {Ψ 0 }
 8          for (R, A) ∈ Att(Σ1 , Σ2 , {G1 , G2 }) do
 9               Dom⊗ (R, A) = G0

10   return (M⊗ , R, Dom⊗ );




the index δ1 and the granularity name GroupCalories are less intuitive than the indices
and granularity names already in the domain schema before the merge.


Computing the Merged Schema. The merged schema of two domain compatible
schemas Σ1 and Σ2 can be computed by Algorithm 1 called MergedSchema which
relies on the following property of the merged schema: for each pair of granularities
that need to be merged, the merged schema contains a new granularity G0 that contains
one index for each partition in ℘(Ψ, G1 , G2 ) and the ρ of each of this indices will con-
tain exactly the union of the ρ of the indices in the corresponding partition. Relying
on this observation, the subroutine MergedGranularity (see Algorithm 2) returns gran-
ularity G0 and the new domain schema which now contains this new granularity G0 .
Algorithm MergedSchema calls MergedGranularity for all the combinations of G1 and
G2 that need to be merged. It might be the case that some of the new granularities and
indices added in lines 4-5 of the Algorithm will result in non-minimal schemas, in the
sense that there might be two indices i and j in a domain schema for which ρ(i) = ρ(j),
also there might be granularities G1 and G2 which contain the same indices. Method
CleanUp (Algorithm 3) removes all these indices and granularities that are not needed.
To achieve this, it first replaces any new index in G0 by one in the original schema
which maps to the same elements in U if it exists. Next, it makes sure that there is no
other granularity with the same indices. After the CleanUp, Algorithm MergeSchema
in lines 8-9 associates to each merged attribute its new granularity through function
DOm⊗ . Finally, in line 10 the merged schema for Σ1 and Σ2 is returned.




                                          149
    Algorithm 2: MergedGranularity
      Input : Schema Domain Ψ = (U, `, I, ρ, τ ) and granularities G1 , G2 ∈ `
      Output : (G0 , Ψ 0 ) where G0 is a granularity that minimally merges G1 and G2 and Ψ 0
                 extends Ψ to consider it
        0
    1 Ψ ← (U, `, I, ρ, τ )
         0
    2 G ← fresh granularity label not already in `
           0
    3 τ (G ) ← ∅
    4 for S ∈ ℘(Ψ, G1 , G2 ) do
    5      i ← fresh index not already in I
    6      I←I∪     S{i}
    7      ρ(i) ← j∈S ρ(j)
    8      τ (G0 ) ← τ (G0 ) ∪ {i}
    9      ` ← ` ∪ {G0 }
10      return (G0 , Ψ 0 );


    Algorithm 3: CleanUp
      Input : Schema Domain Ψ = (U, `, I, ρ, τ ) and a granularities G ∈ `
      Output : A granularity G0 equivalent to G
                   0
    1 for i ∈ τ (G ) do
    2      for j ∈ (I r τ (G0 )) do
               /* Condition can be satisfied only once for each i                              */
    3          if ρ(i) = ρ(j) then
    4                G0 ← (G0 r {i}) ∪ {j}
    5                I ← I r {i}

   GtoReturn ← G0
    6
                       0
 7 for Gi ∈ (` r {G }) do
       /* Condition can be satisfied only once.                                                */
 8     if τ (Gi ) = τ (G0 ) then
 9           GtoReturn ← Gi
10           ` ← (` r {G0 })

11      return GtoReturn;



4       Related work

The concept of granularity is not new. Important advances have been done in the for-
malization of granularity in the temporal and spatial domains, since their applications
need abstraction mechanisms for handling domains that are continuous in nature. The
formalization of temporal granularity by Bettini et. al. [4, 5] has been basis for several
different studies that explore temporal and spatial granularity. The work by Worboys [7,
8] provides theoretical foundation, using notions similar to rough set theory, to define
imprecision of spatial data at different granularities. For conceptual modeling, Khatri
et al. [9] propose an annotation-based spatio-temporal conceptual model that accounts
for the semantic related to spatial and temporal granularity. The work by Camossi et




                                              150
al. [6] extends ODMG type system [10] to handle spatio-temporal data with specific
types to define spatio-temporal properties at multiple granularities. They provide ge-
ometric converse operators to implement changes on granularity. Similarly, Bertino et
al. [1] extends object-relational model based on OpenGis specifications described in
SQL3 to represent temporal dimension in this model and the multi-representation of
spatio-temporal granularities.
     From another perspective, a schema model for datawarehousing defines dimensions
over fact tables. From fact tables, data is aggregated along a dimension forming a hier-
archy of finer-than relationships of granularity. As Iftikhar and Pedersen already high-
lighted [2, 3], current models cannot store data at different levels of granularity, since
they require all dimensions attributes to be given concrete values. They propose exten-
sions to current models that include a time dimension granularity defined as a single
hierarchy. Thus, fact data is associated with a time dimension in a particular granu-
larity. The work in [3] can be seen as a materialization of our multi-granular database
schema. Unlike the work in [3], however, our model is applicable to not only temporal
granularity but also spatial or semantic granularity of a conceptual classification.
     Our work relates to the problem described in [11], which characterizes the deriva-
tion problem for summary data. The idea is to compare summary data in statistical
databases based on common but not identical classification criterion. This classification
criterion can be seen as a granularity of atomic data, where each class corresponds to a
granularity index that maps to a portion of the underlying domain (granule). The prob-
lem is then to integrate different summarized data and extract useful information from
them. Unlike the datawarehouse context, summary datasets may not have atomic data
from which summarized data is obtained. Data is stored in terms of different classifi-
cation criterion, which is similar to our problem of having datasets at different level of
granularities without the atomic facts of the underlying domain.
     Despite previous work, to the best of our knowledge, there is no work that formal-
izes granularity in a more general relational context. We want here to formalize granu-
larity not only for time-depending attributes, but also for attributes defined by different
classification criteria. Our approach could then be extended to deal with particularities
of each attribute domain. Like the work in [11], we also want to be able to extract useful
information from different datasets whose data are represented at different granularities.


5   Discussion
In this section we discuss the issues that arise if we consider constraints or allow in-
stances where each attribute can contain different levels of granularity.

Integrity Constraints. In the previous sections, we have considered that the global
instance contains all the tuples of the data sources, where some of their attributes may
have been modified to a coarser level of granularity. If we now consider that the schema
contains integrity constraints, new issues arise, which are illustrated by the following
example.
Example 5. Consider different databases that store information about relevant touristic
landmarks. Each source database contains the predicate LandMark (name, location)




                                          151
and functional dependency name → location. Attribute location may be defined over
a domain at different levels of granularities. For example, while a data source may store
a location in terms of provinces, another data source stores it in terms of countries.
     Since we have a functional dependency constraints, an integrated database will not
store two tuples in LandMark with the same value in attribute name. A global schema
will need to be such that the functional dependency constraints must be satisfied. Tuples
with the same name will need to be analyzed and, when possible, the conversion of
attributes will need to be applied.
    We say that a conflict of data integration exists when a data stored in one database
cannot be converted to the data in another database. For example, a database stores tuple
(Aconcagua mountain, Argentina) and another stores (Aconcagua mountain, Chile). In
this example, both tuples are at the same granularity of countries and, therefore, they
are inconsistent because they should have the same value in attribute location. An in-
consistency can also happen when storing the location at different levels of granularity.
For example, a database stores tuple (Aconcagua mountain, San Juan Province) using a
granularity of provinces and another stores (Aconcagua mountain, Chile) using a gran-
ularity of countries. For this example, there is no way to convert the value of location
in one tuple into the value of the other tuple.
    Although values of attribute location can be different in two databases for the same
value of attribute name, there are cases of no conflict. We call this situation synergy as
in [11]. For example, a database stores tuple (Aconcagua mountain, San Juan Province)
and another stores (Aconcagua mountain, Argentina). In such case, we would expect
to keep the tuples with the fine granularity which is not in conflict with the coarse
granularity.                                                                            


Instances with several granularities in the same attribute. One of the limitations
of how global instances have been defined is that we loose some of the information
when the global schema uses a coarser granularity in any of the attributes. For example,
consider the domain schema ΨP rod and an attribute A with granularity Product in the
source and Calories in the global schema. If when merging we replace every product
value in attribute A by its associated Low , Medium and High, we will be loosing some
information that may turn to be useful.
    This is why we could consider a modification of the definition of database instance
to allow for each attribute to contain indices in or below the granularity of the attribute.
More formally, a database instance D of a schema Σ could alternatively be defined
as a finite collection of ground atoms of the form R(c1 , . . . , ci , . . . , cl ), where (a)
R(B1 , . . . , Bi , . . . , Bl ) ∈ R, and (b) every ci is such that ci ∈ I, there exists g ∈
τRBi (GRBi ) for which ρRBi (ci ) ⊆ ρRBi (g) where Dom(R, Bi ) = (ΨRBi , GRBi ).
   With this alternative definition, when merging instances D1 over Σ1 and D2 over
Σ2 we can use the global instance DG = D1 ∪ D2 over either the join or merge schema.
The problem now becomes the way in which we can query this type of instances and
how answers should be displayed. A query language, like SQL, could be augmented to
request the level of granularity at which we want the answers to be displayed.




                                            152
Acknowledgements We would like to thank reviewers for their valuable comments
that help to improve this paper. This work is funded by Bicentenario Program PSD 57.
Andrea Rodrı́guez is also partially funded by Fondef D09I1185.


References
 1. Bertino, E., Camossi, E., Bertolotto, M.: Multi-granular spatio-temporal object models: con-
    cepts and research directions. In: ICOODB’09, Springer-Verlag (2010) 132–148
 2. Iftikhar, N., Pedersen, T.B.: Schema design alternatives for multi-granular data warehousing.
    In: DEXA’10, Springer-Verlag (2010) 111–125
 3. Iftikhar, N., Pedersen, T.B.: Gradual data aggregation in multi-granular fact tables on
    resource-constrained systems. In: KES’10, Springer-Verlag (2010) 349–358
 4. Bettini, C., Wang, X.S., Jajodia, S.: A general framework for time granularity and its ap-
    plication to temporal reasoning. Annals of Mathematics and Artificial Intelligence 22(1-2)
    (1998) 29–58
 5. Bettini, C., Dyreson, C.E., Evans, W.S., Snodgrass, R.T., Wang, X.S.: A glossary of time
    granularity concepts. In: Temporal Databases, Dagstuhl. (1997) 406–413
 6. Camossi, E., Bertolotto, M., Bertino, E.: A multigranular object-oriented framework sup-
    porting spatio-temporal granularity conversions. International Journal of Geographical In-
    formation Science 20(5) (2006) 511–534
 7. Worboys, M.F.: Imprecision in finite resolution spatial data. GeoInformatica 2(3) (1998)
    257–279
 8. Worboys, M.F.: Computation with imprecise geospatial data. Computers, Environment, and
    Urban Systems 22(2) (1998) 85–106
 9. Khatri, V., Ram, S., Snodgrass, R.T., O’Brien, G.M.: Supporting user-defined granularities
    in a spatiotemporal conceptual model. Annals of Mathematics and Artificial Intelligence
    36(1-2) (2002) 195–232
10. Cattell, R.G., Barry, D.K.: The Object Data Standard: ODMG 3.0. Morgan Kaufmann (2000)
11. Malvestuto, F.M.: The derivation problem for summary data. In: SIGMOD Conference,
    ACM Press (1988) 82–89




                                             153