=Paper=
{{Paper
|id=Vol-363/paper-5
|storemode=property
|title=Reuse of a repository of conceptual schemas in a large scale project
|pdfUrl=https://ceur-ws.org/Vol-363/paper5.pdf
|volume=Vol-363
|dblpUrl=https://dblp.org/rec/conf/emmsad/BatiniG05
}}
==Reuse of a repository of conceptual schemas in a large scale project==
<pdf width="1500px">https://ceur-ws.org/Vol-363/paper5.pdf</pdf>
<pre>
    Reuse of a repository of conceptual schemas in a large
                        scale project

                               Carlo Batini1, Riccardo Grosso2
                               1 University of Milano Bicocca

                     Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy
                              {batini}@disco.unimib.it
                                        2 CSI-Piemonte,

                          Corso Unione Sovietica 216, Torino, Italy
                              {Riccardo.Grosso}@csi.it


       Abstract. Large amounts of data are managed by organizations, available to be
       viewed and analysed from multiple perspectives, which becomes a fundamental
       resource to the effectiveness of the organizations. An organization can achieve
       full benefit from the available information by managing its data resource,
       through the planning of its exploitation and its maintenance. The concept of
       data repository fulfils these requirements, due to the fact that it contains the de-
       scription of all types of data produced, managed, maintained and exchanged in
       an organization. This paper describes an experience of the use of an existing
       repository of conceptual schema, representing a wide amount of entities of in-
       terest for Central Public administration, in order to produce the corresponding
       repository of the administrations located in a region. Several heuristics are de-
       scribed and experiments are reported.


1    Structure of Italian Public Administration and previous
experiences of conceptual schema Repositories

The goal of this paper is to describe an experience with a repository of conceptual
schemas, related to Central Italian public administration, in order to build a first ver-
sion of the corresponding repository of conceptual schemas of the local public ad-
ministration located in one of the 21 regions of Italy. Due to limited available re-
sources, several approximate techniques have been applied, that allow fast prototyp-
ing of the local repository, to be refined by domain expert, resulting in a resource
consumption one order of magnitude lower than that with a traditional process.
   The Italian Government’s policy, in the past few years, similarly to many other
governments in the world, has been to improve the quality of services to the citizen,
by gradually improving services provided by information systems and databases of its
agencies. However, in the past the lack of co-operation among the departments led to
the establishment of heterogeneous and isolated systems. As a result, two main prob-
lems have arisen: duplicated and inconsistent information; and difficult data access.
   Moreover, the Government efficiency depends on the sharing of information be-
tween administrations, due to the fact that many of them are usually involved in the
same procedures, while using different, overlapped, heterogeneous databases.
   Therefore, in the long term, a crucial aspect for the overall project is to design a
cooperation architecture that allows both central and local administration to share
information in such a way as to be able to provide services to citizens and businesses
on the basis of the “one stop shop” paradigm. A crucial aspect of such cooperation
architecture is the data architecture: data have to be interchanged in an interoperable
format, all the administration assign the same meaning to the same data, achieving
database integration in the long term; this will enable the spread of information within
government branches, a more easily accessible working environment, an increased
quality of information management, and an improved state-wide decision making.
   One of the first activities performed in the last decade, with the goal of designing a
suitable data architecture, has been the project of building an inventory of existing
information systems operating within the Central Public Administration in Italy. The
activity was performed over about 500 data bases, whose logical schemas through
reverse engineering activities were translated into Entity Relationship schemas.
   In order to achieve cooperation among central and local administrations, it is now
the moment to design a data architecture that covers both types of administrations,
and, consequently, it is necessary to develop a similar repository for local administra-
tions. For this reason, several regional administrations are now designing their own
data architecture. The most advanced organizational context among local administra-
tions in a region is when they are coordinated by a regional agency, that provides
services to all or at least to the majority of them. This is the situation of administra-
tions of the Piedmont region, where such central agency exists, and is CSI Piemonte.
But also in such a fortunate context, only logical relational schemas are available as
input to the process of construction of the local repository. So, a methodology and
tools are needed that allow approximate production of conceptual schemas to be ar-
ranged in a repository. This paper describes such a methodology and the experience
so far in applying this to the context of the Piedmont Public Administrations.
   The paper is organized as follows. In section 2 we discuss the structure of the Cen-
tral Administration Repository and we recall the methodology for its construction. In
section 3 we describe knowledge available for the design of the local PA repository.
In section 4 we provide the methodology for building, starting from the central re-
pository and local logical schemas, a first draft version of the local repository. Sec-
tion 5 discusses experiences and future research work.


2. The Structure of the Central PA Repository

In this paper, a repository is defined as a set of conceptual schemas, each describing
all the information managed by an organisation area within the information system
considered. The data repository referenced in this paper uses the Entity Relationship
model to represent conceptual schemas. However a simple set of schemas does not
display the relationships among schemas of different areas; the repository has to be
organised in a more complex structure, through the use of the structuring primitives.
   The primitives are: refinements, views; integration. Refinements allow the descrip-
tion of the same reality at different levels of abstraction. This mechanism is funda-
mental for a data repository, since it helps the user to perceive a complex reality step
by step, going from a more abstract level to a local one. Views are descriptions of
fragments of a schema. They allow users to focus their attention just on the part of a
complex reality of interest to them. Integration is the mechanism by which it is possi-
ble to build a global description of data managed by an organisation area starting
from local schemas. By jointly using these structuring primitives we obtain a reposi-
tory of schemas. Each column of the repository represents an organisation unit while
each row stands for a different abstraction level. The left column contains the
schemes resulting from the integration of all the other schemes belonging to the same
row (views of the integrated schema). In fig. 1 we show an example of repository,
where the Production, Sales, Department Schemas are represented at different re-
finement levels respectively in the second, third and fourth column, while the Com-
pany schema in the first column is the result of their integration.
   In practice, when the repository is populated at the bottom level by hundreds of
schemas, as in the case that we will examine, it is unfeasible to manage the three
structuring primitives, and the view primitive is sacrificed. Furthermore, the integra-
tion/abstraction structuring mechanism is iterated, producing a sparsely populated
repository such as the one symbolically represented in fig. 2, where, for instance,
schema S123 results from the integration/abstraction of schemas S1, S2, and S3.

                        Company                                                                   Production                                                  Sales                     Department
                                                                                                                                                                                        structure
                 DEP            D-E                    EMP-DATA


                                                                                                DEP      D-E                                                          EMP-DATA
                                                                                                                      EMP-DATA
               Man
                                                         Acq
                                                                                                                                                                                        DEP     D-E    EMP-DATA
                                                                                              ProducT                                                                      Acq


                                                       ORD-DATA
                 ITEM                 In                                      view                                                       ITEM                In       ORD-DATA
                                                                                                ITEM


                                                  refinements
                                                                                                                         CITY

      DEPART            Man           EMPLOYEE           Born      CITY
                                                                                                                         Born

                                                                                         DEPART                EMPLOYEE                                     EMPL
                               SElLER

                                                                                                                                                                                         DEP     Man     EMP
                                 Acq                                                         Product
                                                                                                                                                           SELLER
                                                                              view                                                                          Acq                                         Born

          ITEM          In      ORDER             Of           PURCH
                                                                                              ITEM                                   ITEM             In    ORD     Of           PUR                    CITY


     FLOOR                                                                                                                 CITY
                                                       refinements
          In                               Head                                                                Head
                                                                                                                          Born                             EMP
                        Gest          EMPLOYEE                                       DEP                       EMP
   DEPARTM                                                Born         CITY
                        Lav.                                                                                                                                                            FLOOR
                                                                                                                                                           SELLER
                                                                                                                                                                                          In
                                                                                     Prod.                                                                                                               Head
                               SELLER
                                                                                                                CLERK    ENGIN                              Acq                         DEP     Man      EMP
      Man                                               ENGINEER
                                           CLERK
                                 of
                                                                                                                                  ITEM           In         ORD       Of          PUR                    Born
                                                                                     ITEM
      ITEM                                        Of                                                                                     Loc..
                        In      ORDER                           PURCH                                                                                                                                   CITY
                 Loc                                                                                                                 WARE
   Type                                                                          WARR

   WARR        WARE


                                                        integration


                                                                              Fig. 1. An example of repository
        SI12345678


                     SI123                           SI456                          SI78


                              S1   S2     S3                   S4         S5   S6          S7   S8


                             Fig. 2. A fragment of repository

   The repository structure described previously has been adopted for representing
the conceptual content of a wide amount of conceptual schemas related to the most
relevant databases of Italian central public administration in an integrated structure.
   In order to build the whole repository, a methodology has been adopted, described
in [1], [2]. About 200 person months were needed to produce the 500 basic concep-
tual schemas of the repository, while about 24 person months were needed to produce
the 55 abstract schemas of the upper part (approximately 2 weeks per schema, both
for basic and for abstract schemas). In figure 3 the schema at the top level of the re-
pository is shown.


3 The Repository of                Piedmont Local Administrations: basic
knowledge available

In this section we describe in more detail the knowledge available for the design of
the Piedmont Local public administration (LPA) repository and the assumptions that
have been made in the activity.


                                                                           Place

                                   Property


                        Document                     Subject


                                        Individual         Legal person


                 Fig. 3. The schema at the top level of the repository

   A first relevant input available for the process is the Central Public Administration
(CPA) Repository of schemas, made of basic and abstract schemas. A second input
concerns Piedmont databases. Piedmont local public administration are centrally
served by a unique consortium, CSI Piemonte, that created in the last years approxi-
mately 450 databases of 12 main local administrations, whose logical schemas are
documented in terms of: relational database schemas, tables (approximately 17.000),
textual descriptions of tables, referential integrity constraints defined among tables,
attributes, definitions of attributes, identifiers. A very thin conceptual documentation
has been created, that concern so called “supertypes of attributes” and “supertypes of
relations”, corresponding to generalization abstractions of a few attributes and tables
(about 10%) defined in the logical schemas. They have not been used so far in the
process.
   The basic sources of knowledge available for the production of the LPA reposi-
tory, as results from the above discussion, are very rich, but characterized by two
significant heterogeneities: the conceptual documentation concerns central admini-
stration, while for local Piedmont administration the prevalent documentation con-
cerns logical schemas.
   A second relevant condition of our activity has concerned budget constraints; for
the first year of the project we had only one person year available, so less than one
tenth of the resources that were available for the construction of the central reposi-
tory. So, in conceiving the methodology for the LPA repository production, we made
a few significant assumptions, and used heuristics and approximate reasoning, in
order to reduce human intervention as much as possible.
   A first assumption we made has been that, while basic schemas of the CPA reposi-
tory and the LPA repository may probably differ, due to the different functions
among central and local administrations, the similarity should be much higher among
the abstract schemas of the CPA repository and basic + abstract schemas of the LPA
repository.
   In consequence of the above assumption and resource constraints, we decided to
use in some steps of the methodology a more manageable knowledge base than the
500 central basic schemas + the 50 abstract schemas. Such schemas can be repre-
sented in terms of a much more dense conceptual structure, that corresponds to the
generalization hierarchies that have at their top level the five concepts defined in the
schema of fig. 3, and having at lower levels the concepts in more refined abstract
schemas and basic schemas, obtained applying top down the refinements along the
integration/abstraction hierarchy. We show in fig. 4 a fragment of one of the hierar-
chies, the one referring to individuals.

Individual
        Employment
                Unemployed
                …
                Retired
                        State pension retired
                        …..
                        Disability retired
        Education
        ……
                Fig. 4. A fragment of the Individual generalization hierarchy

   So, a second idea we implemented has been to use, besides the basic schemas and
the abstract schemas, the five generalization hierarchies of Individual, Legal Person,
Property, Document, Place.
   As a consequence of the above assumptions, constraints and choices, the inputs to
the methodological process, shown in fig. 5, have been:

    1.     The CPA Repository of 550 basic + abstract schemas
    2.     The five CPA Generalization hierarchies
    3.     The logical schemas of the 450 local PA databases.

               Central Public Administration                                 Local Public
                                                                             Administration


 Conceptual
 schemas                                  Abstract          Generalization
                                                            hierachies of
                                          schemas                             Repository of local
                                                            -Individual
                                                                             Conceptual schemas
                                                            -Legal person
                                                            -Document
                                                            -Place
                                                            -Property
                  Basic schemas

 Logical
 schemas


Fig. 5. Input knowledge for the production of the Repository of local conceptual schemas


4 The methodology for the construction of the local repository

In this paper, for reasons of space, we present only the methodology for building the
basic schemas (its extension to abstract schemas is briefly discussed in Section 6).
Each step is described with a common documentation frame, describing the inputs to
the step, the procedure, and in some cases, when relevant, the outputs of the step. An
example is provided, related to a logical schema concerning grant monitoring of in-
dustrial business activities.

Step 1. Extract entities
Inputs: Central PA generalization hierarchies of concepts, one Local PA logical
schema
   Names of entities in hierarchies are compared with names and description of each
table, and set of attributes of the logical schema. The comparison function makes use
presently of a simple distance function among the different strings. The entities and
corresponding frequency of matching are sorted, and a threshold is fixed: all the enti-
ties with frequency over the threshold are selected, resulting in a first draft schema
made only of entities. The output is a draft schema made of disconnected entities.
Step 2. Add generalizations
Inputs: the draft schema obtained in the previous step and the four CPA generaliza-
tion hierarchies.
   Visit the generalization hierarchies and add to the draft schema subset relationships
present in hierarchies, defined among the entities in the draft schema.

Step 3. Extract relationships
Inputs: the draft schema + all the basic schemas in the CPA repository
   Entities of the draft schema are pairwise compared with all the basic schemas in
the CPA repository. For each pair of entities E21 and E2 several types of relation-
ships are extracted by the basic schemas:
    a. relationships defined exactly on E1 and E2;
    b. relationships corresponding to chains of relationships defined among pairs
          E1-Ei; Ei-Ei+1; …; Ei+j-E2;
    c. relationships defined among entities E1* and E2* corresponding to ancestors
          of E1 and E2 in the four generalization hierarchies.
   Relationships collected in steps a and c are sorted according to the frequency of
names. Here we have several possibilities:
      a. The most frequent name is chosen as the name of the relationship
      b. The name is assigned by the domain expert.

Step 4. Check the schema with referential integrity constraints defined among logical
tables
Input: the draft schema + constraints defined in tables
   For each referential integrity constraint defined among two tables T1 and T2 in the
logical schema, it is checked whether T1 and/or T2 have been already selected as
entities in the draft schema, and in case added as new entities. Furthermore, it is
checked whether a relationship is defined among the entities, and in case added.

Step 5. Domain expert check of the draft schema and construction of the final
schema
Input: the draft schema
   In this step the schema produced by the semi automated process is examined by the
knowledge domain expert that may add new concepts, cancel existing concepts, or
else modify some concepts.
   Since step 5 is performed after addition of relationships and entities resulting from
integrity constraints, it may happen that too many concepts have been added, and the
manual check of the domain expert leads to delete concepts. Sometimes new concepts
are added, resulting in an enriched schema whose kernel is the initial schema. More
frequently schemas obtained after integrity constraints check and after domain expert
check coincide. The output is the: final schema
   We show in fig. 6 the schemas obtained as a result of the execution of steps 1 to 5
of the methodology in our case study. In this case, schemas obtained after integrity
constraints check and after domain expert check coincide, and, consequently, are not
distinguished in the figure.
                                                                                                                    Subject

                                         Subject
                  Document                                                                      Document
                                                                                                                   Individual        Legal person
                                       Individual        Legal person

                  Registry               Italian                                              Registry act
                    act                                    Business
                                         citizen                                                                     Italian
                                                                                                                                      Business
                                                                                                                     citizen

                             Step 1: Create entities                                                   Step 2: Add generalizations


                                                                                                     Subject


                                                                                                    Individual     Legal person        Grant
                                   Subject


                                                                                                     Italian                                    Canceled
                                                                                                                    Business
                                  Individual           Legal person                                  citizen                                     grant
          Document
                                                                                      Document
                                                                                                                        Project             Paid off
         Registry act               Italian                                                                             budget               grant
                                                        Business
                                    citizen                                          Registry act
                                                                                                                                  Awarded
                                                                                                                                   grant
                                                                        Concession                                   Procedure
                                                                           rule                      Source


                        Step 3: Add relationships                                          Step 4-5: Add integrity constraints
                                                                                           And check with the domain expert


                                               Fig. 6. Schemas obtained after steps 1-5


5. Experiments

In the present stage of the project we experimented with the above methodology in
three different matters: businesses, health care, regional territory, and nine related
areas. The total number of tables of the nine databases is approximately 350, corre-
sponding to 2% of the total. We were interested in measuring two relevant qualities
of the process:

    1.     the correctness of the conceptual schema with respect to the “true” one, i.e.
           the schema that could be obtained directly by the domain expert through a
           traditional analysis or else a reverse engineering activity. Correctness is
           measured with an approximate indirect metrics, corresponding to the per-
           centage of new/deleted concepts in the schema produced by the expert at the
           end of step 5 with respect to concepts produced in the semi automatic steps
           1-4.
    2.     the completeness of the conceptual schema with respect to the corresponding
           reengineered logical schema. Completeness is measured by the percentage of
           tables that are catched in steps 1-5, in comparison with the total number of
           tables, after excluding tables not carrying relevant information, such as re-
           dundant tables, tables of codes, etc.
    Table 1 summarizes main results of experiments. Concerning correctness, in gen-
eral the schemas obtained after step 4: check with integrity constraints, and after step
5: domain expert check are very similar, i.e. domain experts tend to confirm and con-
sider complete entities and relationships added in the previous step; the overall fig.
for the nine experiments results in more than 80% of concepts common to the two
types of schemas. We see also that the add constraints step introduces approximately
30% of new concepts in comparison with the extract entities step. Consequently the
joint application of the Central PA knowledge and Local PA knowledge reveals ef-
fective. These are, in our opinion, encouraging results, considering the highly heuris-
tic nature of the methodology.
    Concerning completeness, results are less reassuring. On the average, only 50% of
tables are catched. This value changes significantly in the different areas. Further-
more, as was to be expected, completeness decreases significantly when the referen-
tial integrity constraints are not documented or partially documented, resulting in
lower quality (completeness) conceptual schema when the input schema is character-
ized by poor documentation. Apart the quality of the documentation, another cause of
reduced completeness is the static nature of generalization hierarchies used in step 1,
and the unequal semantic richness in representing related top level concepts. For
instance, in the initial Subject hierarchy, 20 concepts represent individuals, while
only 3 represent legal persons. An improvement we are presently applying concerns
their incremental update with abstract concepts, possibly generated in step 5. Such
enriched hierarchies are progressively reconciled and brought near to hierarchies
characteristic of local administrations, resulting in a corresponding more effective
selection mechanism.

Table 1. Experiments results


        Step                   # of tables extracted        % of tables ex-
                                                            tracted
        Create entities     172                             30
        Add constraints     219                             41
        Domain       expert 275                             51
        check

   A final comment on resources. The amount of resources spent in the experiments
has been on the whole 30 person/days, corresponding to 3 person/day per schema.
About 30% of time has been spent in steps 1-4, and 60% of time has been spent in the
manual check. So, the domain expert has been engaged for 2 days per schema; we
have to add to this variable cost a fixed cost of a 3 days course. We may expect a
greater efficiency as long as the activity proceeds, and fix in 1 person day the average
final due effort, significantly lower than the typical 2-3 person/weeks needed for
traditional design of one schema.
6. Concluding remarks

The problem addressed in this paper, and the related conceptual tools, are not new in
the literature.
   Repositories of conceptual schemas are proposed in several application areas (e.g.
biosciences [9], reuse [3]). In [7] a solution and methodology are presented for re-
verse engineering of legacy databases using formal method-based techniques. Reposi-
tories of ontologies are proposed in several papers. The alignment and integration of
ontologies is investigated in [4], [5], where information integration is enabled by
having a precisely defined common terminology. A set of tools and services is pro-
posed to support the process of achieving consensus on such a common shared on-
tologies by geographically distributed groups. Users can quickly assemble a new
ontology from a library of modules.
   Repositories of ontologies for public sector organizations are proposed in [6], [8].
The repository is used in [6] in a system supporting organizational activity by for-
malizing, sharing and preserving operational experience and knowledge for future
use.
   What seems new in our approach as regards the above mentioned papers is the ab-
straction/integration primitive adopted for structuring the repository and the attention
devoted to feasibility aspects and resource constraints, and the consequent heuristic
strategy. On the other side, we are conscious that our conceptual model is less power-
ful than ontology based models. A complete comparison with existing approaches is
out of the scope of this paper.
   We are now analyzing lessons learned and improving the methodology. First, we
are extending the methodology to the production of abstract schemas in the reposi-
tory. This step may effectively use the results of previous steps 1-5. In fact, the initial
schema obtained after steps 1-3 inherits high level abstract knowledge from the CPA
Repository and basic knowledge from the LPA logical schemas, while the enriched
schema obtained in steps 4-5 encapsulates basic knowledge from the LPA logical
schemas. We may conjecture that the initial schema is a candidate for abstract schema
for the upper levels of the LPA repository, while the enriched schema, being a more
detailed description representing a logical schema, populates the basic level of the
repository. So, we may conceive two possible strategies for the repository update
step.
   In the first strategy, starting from the initial schema and the enriched schema we
first complete the “local” repository of abstract schemas corresponding to the en-
riched schema; we then integrate the local repository with the actual one: it may hap-
pen that we have to update, due to similarities between concepts, the abstract schemas
of the actual repository, or else add new schemas, autonomous with respect to the
previous ones.
   In the second strategy the new repository is obtained through abstrac-
tion/integration activities on the actual LPA repository and the initial and refined
schemas.
   The first strategy is probably more effective when the actual LPA repository and
the new schema represent very different knowledge, while the second strategy has the
advantage of natively using the structuring paradigm of the repository, the abstrac-
tion/integration operation. We are currently experimenting with the two strategies,
and other possible strategies, such as building small homogeneous repositories and
then integrating them to obtain a larger repository.
   We are also investigating new techniques that use more complex similarity meas-
ures in matching between generalization hierarchies and logical schemas. Further-
more, since some of the local PA schemas (and corresponding hierarchies) have been
independently developed, especially in the regional territory area, we are using such
schemas as training examples to tune semiautomatic steps of the methodology and
similarity measures adopted.


References

1. C. Batini, G. Di Battista, G. Santucci - Structuring primitives for a dictionary of entity
   relationship data schemas -IEEE Transactions on Software Engineering vol.19 no.4 April
   1993.
2. C. Batini, S. Castano, V. De Antonellis, M.G. Fugini, B. Pernici - Analysis of an Inventory
   of Information systems in the Public Administration - Requirements Engineering, 1996, 47-
   62
3. C. Batini, S: Castano, B. Pernici - Tutorial on Reuse Methodologies and Tools - Entity
   Relationships International Conference, Cottbus, Germany, 1996.
4. Jonathan DiLeo, Timothy Jacobs, and Scott DeLoach. Integrating Ontologies into Multi-
   agent Systems Engineering. Fourth International Bi-Conference Workshop on Agent-
   Oriented Information Systems (AOIS-2002). 15-16 July 2002, Bologna (Italy).
5. Adam Farquhar, Richard Fikes, Wanda Pratt, James Rice - Collaborative Ontology Con-
   struction for Information Integration -Knowledge Systems Laboratory Department of Com-
   puter Science, KSL-95-63, August 1995.
6. F. Fonseca, C. Davis, G. Camara – Bridging Ontologies and Conceptual schemas in Geo-
   graphic Information systems – Geoinformatica 7:4, pp. 355 – 378, 2003.
7. J. Perez, I. Ramos, J: Cubel, F. Dominguez, A. Boronat, J. Carsì Data reverse
   engineering of Legacy Databases to object oriented conceptual schemas - Elec-
   tronic Notes in Theoretical Computer Sceince 74 No. 4, 2002.
8. R. Slota, M. Majewska, M. Dziewierz, K Krawczyk, M. Laclavik, Z. Balogh, L. Hluchy, J.
   Kitowski, S. Lambert Ontology Assisted Access to Document Repositories for Public Sec-
   tor Organizations, PPAM 2003, Czestochowa, Poland
9. Taxonomic Databases Working Group on Biodiversity Informatics 2004, Taxonomic Data-
   bases Working Group Annual Meeting, 11-17 October 2004 - University of Canterbury,
   Christchurch, New Zealand.

</pre>