=Paper=
{{Paper
|id=Vol-363/paper-5
|storemode=property
|title=Reuse of a repository of conceptual schemas in a large scale project
|pdfUrl=https://ceur-ws.org/Vol-363/paper5.pdf
|volume=Vol-363
|dblpUrl=https://dblp.org/rec/conf/emmsad/BatiniG05
}}
==Reuse of a repository of conceptual schemas in a large scale project==
Reuse of a repository of conceptual schemas in a large
scale project
Carlo Batini1, Riccardo Grosso2
1 University of Milano Bicocca
Via Bicocca degli Arcimboldi 8, 20126 Milano, Italy
{batini}@disco.unimib.it
2 CSI-Piemonte,
Corso Unione Sovietica 216, Torino, Italy
{Riccardo.Grosso}@csi.it
Abstract. Large amounts of data are managed by organizations, available to be
viewed and analysed from multiple perspectives, which becomes a fundamental
resource to the effectiveness of the organizations. An organization can achieve
full benefit from the available information by managing its data resource,
through the planning of its exploitation and its maintenance. The concept of
data repository fulfils these requirements, due to the fact that it contains the de-
scription of all types of data produced, managed, maintained and exchanged in
an organization. This paper describes an experience of the use of an existing
repository of conceptual schema, representing a wide amount of entities of in-
terest for Central Public administration, in order to produce the corresponding
repository of the administrations located in a region. Several heuristics are de-
scribed and experiments are reported.
1 Structure of Italian Public Administration and previous
experiences of conceptual schema Repositories
The goal of this paper is to describe an experience with a repository of conceptual
schemas, related to Central Italian public administration, in order to build a first ver-
sion of the corresponding repository of conceptual schemas of the local public ad-
ministration located in one of the 21 regions of Italy. Due to limited available re-
sources, several approximate techniques have been applied, that allow fast prototyp-
ing of the local repository, to be refined by domain expert, resulting in a resource
consumption one order of magnitude lower than that with a traditional process.
The Italian Government’s policy, in the past few years, similarly to many other
governments in the world, has been to improve the quality of services to the citizen,
by gradually improving services provided by information systems and databases of its
agencies. However, in the past the lack of co-operation among the departments led to
the establishment of heterogeneous and isolated systems. As a result, two main prob-
lems have arisen: duplicated and inconsistent information; and difficult data access.
Moreover, the Government efficiency depends on the sharing of information be-
tween administrations, due to the fact that many of them are usually involved in the
same procedures, while using different, overlapped, heterogeneous databases.
Therefore, in the long term, a crucial aspect for the overall project is to design a
cooperation architecture that allows both central and local administration to share
information in such a way as to be able to provide services to citizens and businesses
on the basis of the “one stop shop” paradigm. A crucial aspect of such cooperation
architecture is the data architecture: data have to be interchanged in an interoperable
format, all the administration assign the same meaning to the same data, achieving
database integration in the long term; this will enable the spread of information within
government branches, a more easily accessible working environment, an increased
quality of information management, and an improved state-wide decision making.
One of the first activities performed in the last decade, with the goal of designing a
suitable data architecture, has been the project of building an inventory of existing
information systems operating within the Central Public Administration in Italy. The
activity was performed over about 500 data bases, whose logical schemas through
reverse engineering activities were translated into Entity Relationship schemas.
In order to achieve cooperation among central and local administrations, it is now
the moment to design a data architecture that covers both types of administrations,
and, consequently, it is necessary to develop a similar repository for local administra-
tions. For this reason, several regional administrations are now designing their own
data architecture. The most advanced organizational context among local administra-
tions in a region is when they are coordinated by a regional agency, that provides
services to all or at least to the majority of them. This is the situation of administra-
tions of the Piedmont region, where such central agency exists, and is CSI Piemonte.
But also in such a fortunate context, only logical relational schemas are available as
input to the process of construction of the local repository. So, a methodology and
tools are needed that allow approximate production of conceptual schemas to be ar-
ranged in a repository. This paper describes such a methodology and the experience
so far in applying this to the context of the Piedmont Public Administrations.
The paper is organized as follows. In section 2 we discuss the structure of the Cen-
tral Administration Repository and we recall the methodology for its construction. In
section 3 we describe knowledge available for the design of the local PA repository.
In section 4 we provide the methodology for building, starting from the central re-
pository and local logical schemas, a first draft version of the local repository. Sec-
tion 5 discusses experiences and future research work.
2. The Structure of the Central PA Repository
In this paper, a repository is defined as a set of conceptual schemas, each describing
all the information managed by an organisation area within the information system
considered. The data repository referenced in this paper uses the Entity Relationship
model to represent conceptual schemas. However a simple set of schemas does not
display the relationships among schemas of different areas; the repository has to be
organised in a more complex structure, through the use of the structuring primitives.
The primitives are: refinements, views; integration. Refinements allow the descrip-
tion of the same reality at different levels of abstraction. This mechanism is funda-
mental for a data repository, since it helps the user to perceive a complex reality step
by step, going from a more abstract level to a local one. Views are descriptions of
fragments of a schema. They allow users to focus their attention just on the part of a
complex reality of interest to them. Integration is the mechanism by which it is possi-
ble to build a global description of data managed by an organisation area starting
from local schemas. By jointly using these structuring primitives we obtain a reposi-
tory of schemas. Each column of the repository represents an organisation unit while
each row stands for a different abstraction level. The left column contains the
schemes resulting from the integration of all the other schemes belonging to the same
row (views of the integrated schema). In fig. 1 we show an example of repository,
where the Production, Sales, Department Schemas are represented at different re-
finement levels respectively in the second, third and fourth column, while the Com-
pany schema in the first column is the result of their integration.
In practice, when the repository is populated at the bottom level by hundreds of
schemas, as in the case that we will examine, it is unfeasible to manage the three
structuring primitives, and the view primitive is sacrificed. Furthermore, the integra-
tion/abstraction structuring mechanism is iterated, producing a sparsely populated
repository such as the one symbolically represented in fig. 2, where, for instance,
schema S123 results from the integration/abstraction of schemas S1, S2, and S3.
Company Production Sales Department
structure
DEP D-E EMP-DATA
DEP D-E EMP-DATA
EMP-DATA
Man
Acq
DEP D-E EMP-DATA
ProducT Acq
ORD-DATA
ITEM In view ITEM In ORD-DATA
ITEM
refinements
CITY
DEPART Man EMPLOYEE Born CITY
Born
DEPART EMPLOYEE EMPL
SElLER
DEP Man EMP
Acq Product
SELLER
view Acq Born
ITEM In ORDER Of PURCH
ITEM ITEM In ORD Of PUR CITY
FLOOR CITY
refinements
In Head Head
Born EMP
Gest EMPLOYEE DEP EMP
DEPARTM Born CITY
Lav. FLOOR
SELLER
In
Prod. Head
SELLER
CLERK ENGIN Acq DEP Man EMP
Man ENGINEER
CLERK
of
ITEM In ORD Of PUR Born
ITEM
ITEM Of Loc..
In ORDER PURCH CITY
Loc WARE
Type WARR
WARR WARE
integration
Fig. 1. An example of repository
SI12345678
SI123 SI456 SI78
S1 S2 S3 S4 S5 S6 S7 S8
Fig. 2. A fragment of repository
The repository structure described previously has been adopted for representing
the conceptual content of a wide amount of conceptual schemas related to the most
relevant databases of Italian central public administration in an integrated structure.
In order to build the whole repository, a methodology has been adopted, described
in [1], [2]. About 200 person months were needed to produce the 500 basic concep-
tual schemas of the repository, while about 24 person months were needed to produce
the 55 abstract schemas of the upper part (approximately 2 weeks per schema, both
for basic and for abstract schemas). In figure 3 the schema at the top level of the re-
pository is shown.
3 The Repository of Piedmont Local Administrations: basic
knowledge available
In this section we describe in more detail the knowledge available for the design of
the Piedmont Local public administration (LPA) repository and the assumptions that
have been made in the activity.
Place
Property
Document Subject
Individual Legal person
Fig. 3. The schema at the top level of the repository
A first relevant input available for the process is the Central Public Administration
(CPA) Repository of schemas, made of basic and abstract schemas. A second input
concerns Piedmont databases. Piedmont local public administration are centrally
served by a unique consortium, CSI Piemonte, that created in the last years approxi-
mately 450 databases of 12 main local administrations, whose logical schemas are
documented in terms of: relational database schemas, tables (approximately 17.000),
textual descriptions of tables, referential integrity constraints defined among tables,
attributes, definitions of attributes, identifiers. A very thin conceptual documentation
has been created, that concern so called “supertypes of attributes” and “supertypes of
relations”, corresponding to generalization abstractions of a few attributes and tables
(about 10%) defined in the logical schemas. They have not been used so far in the
process.
The basic sources of knowledge available for the production of the LPA reposi-
tory, as results from the above discussion, are very rich, but characterized by two
significant heterogeneities: the conceptual documentation concerns central admini-
stration, while for local Piedmont administration the prevalent documentation con-
cerns logical schemas.
A second relevant condition of our activity has concerned budget constraints; for
the first year of the project we had only one person year available, so less than one
tenth of the resources that were available for the construction of the central reposi-
tory. So, in conceiving the methodology for the LPA repository production, we made
a few significant assumptions, and used heuristics and approximate reasoning, in
order to reduce human intervention as much as possible.
A first assumption we made has been that, while basic schemas of the CPA reposi-
tory and the LPA repository may probably differ, due to the different functions
among central and local administrations, the similarity should be much higher among
the abstract schemas of the CPA repository and basic + abstract schemas of the LPA
repository.
In consequence of the above assumption and resource constraints, we decided to
use in some steps of the methodology a more manageable knowledge base than the
500 central basic schemas + the 50 abstract schemas. Such schemas can be repre-
sented in terms of a much more dense conceptual structure, that corresponds to the
generalization hierarchies that have at their top level the five concepts defined in the
schema of fig. 3, and having at lower levels the concepts in more refined abstract
schemas and basic schemas, obtained applying top down the refinements along the
integration/abstraction hierarchy. We show in fig. 4 a fragment of one of the hierar-
chies, the one referring to individuals.
Individual
Employment
Unemployed
…
Retired
State pension retired
…..
Disability retired
Education
……
Fig. 4. A fragment of the Individual generalization hierarchy
So, a second idea we implemented has been to use, besides the basic schemas and
the abstract schemas, the five generalization hierarchies of Individual, Legal Person,
Property, Document, Place.
As a consequence of the above assumptions, constraints and choices, the inputs to
the methodological process, shown in fig. 5, have been:
1. The CPA Repository of 550 basic + abstract schemas
2. The five CPA Generalization hierarchies
3. The logical schemas of the 450 local PA databases.
Central Public Administration Local Public
Administration
Conceptual
schemas Abstract Generalization
hierachies of
schemas Repository of local
-Individual
Conceptual schemas
-Legal person
-Document
-Place
-Property
Basic schemas
Logical
schemas
Fig. 5. Input knowledge for the production of the Repository of local conceptual schemas
4 The methodology for the construction of the local repository
In this paper, for reasons of space, we present only the methodology for building the
basic schemas (its extension to abstract schemas is briefly discussed in Section 6).
Each step is described with a common documentation frame, describing the inputs to
the step, the procedure, and in some cases, when relevant, the outputs of the step. An
example is provided, related to a logical schema concerning grant monitoring of in-
dustrial business activities.
Step 1. Extract entities
Inputs: Central PA generalization hierarchies of concepts, one Local PA logical
schema
Names of entities in hierarchies are compared with names and description of each
table, and set of attributes of the logical schema. The comparison function makes use
presently of a simple distance function among the different strings. The entities and
corresponding frequency of matching are sorted, and a threshold is fixed: all the enti-
ties with frequency over the threshold are selected, resulting in a first draft schema
made only of entities. The output is a draft schema made of disconnected entities.
Step 2. Add generalizations
Inputs: the draft schema obtained in the previous step and the four CPA generaliza-
tion hierarchies.
Visit the generalization hierarchies and add to the draft schema subset relationships
present in hierarchies, defined among the entities in the draft schema.
Step 3. Extract relationships
Inputs: the draft schema + all the basic schemas in the CPA repository
Entities of the draft schema are pairwise compared with all the basic schemas in
the CPA repository. For each pair of entities E21 and E2 several types of relation-
ships are extracted by the basic schemas:
a. relationships defined exactly on E1 and E2;
b. relationships corresponding to chains of relationships defined among pairs
E1-Ei; Ei-Ei+1; …; Ei+j-E2;
c. relationships defined among entities E1* and E2* corresponding to ancestors
of E1 and E2 in the four generalization hierarchies.
Relationships collected in steps a and c are sorted according to the frequency of
names. Here we have several possibilities:
a. The most frequent name is chosen as the name of the relationship
b. The name is assigned by the domain expert.
Step 4. Check the schema with referential integrity constraints defined among logical
tables
Input: the draft schema + constraints defined in tables
For each referential integrity constraint defined among two tables T1 and T2 in the
logical schema, it is checked whether T1 and/or T2 have been already selected as
entities in the draft schema, and in case added as new entities. Furthermore, it is
checked whether a relationship is defined among the entities, and in case added.
Step 5. Domain expert check of the draft schema and construction of the final
schema
Input: the draft schema
In this step the schema produced by the semi automated process is examined by the
knowledge domain expert that may add new concepts, cancel existing concepts, or
else modify some concepts.
Since step 5 is performed after addition of relationships and entities resulting from
integrity constraints, it may happen that too many concepts have been added, and the
manual check of the domain expert leads to delete concepts. Sometimes new concepts
are added, resulting in an enriched schema whose kernel is the initial schema. More
frequently schemas obtained after integrity constraints check and after domain expert
check coincide. The output is the: final schema
We show in fig. 6 the schemas obtained as a result of the execution of steps 1 to 5
of the methodology in our case study. In this case, schemas obtained after integrity
constraints check and after domain expert check coincide, and, consequently, are not
distinguished in the figure.
Subject
Subject
Document Document
Individual Legal person
Individual Legal person
Registry Italian Registry act
act Business
citizen Italian
Business
citizen
Step 1: Create entities Step 2: Add generalizations
Subject
Individual Legal person Grant
Subject
Italian Canceled
Business
Individual Legal person citizen grant
Document
Document
Project Paid off
Registry act Italian budget grant
Business
citizen Registry act
Awarded
grant
Concession Procedure
rule Source
Step 3: Add relationships Step 4-5: Add integrity constraints
And check with the domain expert
Fig. 6. Schemas obtained after steps 1-5
5. Experiments
In the present stage of the project we experimented with the above methodology in
three different matters: businesses, health care, regional territory, and nine related
areas. The total number of tables of the nine databases is approximately 350, corre-
sponding to 2% of the total. We were interested in measuring two relevant qualities
of the process:
1. the correctness of the conceptual schema with respect to the “true” one, i.e.
the schema that could be obtained directly by the domain expert through a
traditional analysis or else a reverse engineering activity. Correctness is
measured with an approximate indirect metrics, corresponding to the per-
centage of new/deleted concepts in the schema produced by the expert at the
end of step 5 with respect to concepts produced in the semi automatic steps
1-4.
2. the completeness of the conceptual schema with respect to the corresponding
reengineered logical schema. Completeness is measured by the percentage of
tables that are catched in steps 1-5, in comparison with the total number of
tables, after excluding tables not carrying relevant information, such as re-
dundant tables, tables of codes, etc.
Table 1 summarizes main results of experiments. Concerning correctness, in gen-
eral the schemas obtained after step 4: check with integrity constraints, and after step
5: domain expert check are very similar, i.e. domain experts tend to confirm and con-
sider complete entities and relationships added in the previous step; the overall fig.
for the nine experiments results in more than 80% of concepts common to the two
types of schemas. We see also that the add constraints step introduces approximately
30% of new concepts in comparison with the extract entities step. Consequently the
joint application of the Central PA knowledge and Local PA knowledge reveals ef-
fective. These are, in our opinion, encouraging results, considering the highly heuris-
tic nature of the methodology.
Concerning completeness, results are less reassuring. On the average, only 50% of
tables are catched. This value changes significantly in the different areas. Further-
more, as was to be expected, completeness decreases significantly when the referen-
tial integrity constraints are not documented or partially documented, resulting in
lower quality (completeness) conceptual schema when the input schema is character-
ized by poor documentation. Apart the quality of the documentation, another cause of
reduced completeness is the static nature of generalization hierarchies used in step 1,
and the unequal semantic richness in representing related top level concepts. For
instance, in the initial Subject hierarchy, 20 concepts represent individuals, while
only 3 represent legal persons. An improvement we are presently applying concerns
their incremental update with abstract concepts, possibly generated in step 5. Such
enriched hierarchies are progressively reconciled and brought near to hierarchies
characteristic of local administrations, resulting in a corresponding more effective
selection mechanism.
Table 1. Experiments results
Step # of tables extracted % of tables ex-
tracted
Create entities 172 30
Add constraints 219 41
Domain expert 275 51
check
A final comment on resources. The amount of resources spent in the experiments
has been on the whole 30 person/days, corresponding to 3 person/day per schema.
About 30% of time has been spent in steps 1-4, and 60% of time has been spent in the
manual check. So, the domain expert has been engaged for 2 days per schema; we
have to add to this variable cost a fixed cost of a 3 days course. We may expect a
greater efficiency as long as the activity proceeds, and fix in 1 person day the average
final due effort, significantly lower than the typical 2-3 person/weeks needed for
traditional design of one schema.
6. Concluding remarks
The problem addressed in this paper, and the related conceptual tools, are not new in
the literature.
Repositories of conceptual schemas are proposed in several application areas (e.g.
biosciences [9], reuse [3]). In [7] a solution and methodology are presented for re-
verse engineering of legacy databases using formal method-based techniques. Reposi-
tories of ontologies are proposed in several papers. The alignment and integration of
ontologies is investigated in [4], [5], where information integration is enabled by
having a precisely defined common terminology. A set of tools and services is pro-
posed to support the process of achieving consensus on such a common shared on-
tologies by geographically distributed groups. Users can quickly assemble a new
ontology from a library of modules.
Repositories of ontologies for public sector organizations are proposed in [6], [8].
The repository is used in [6] in a system supporting organizational activity by for-
malizing, sharing and preserving operational experience and knowledge for future
use.
What seems new in our approach as regards the above mentioned papers is the ab-
straction/integration primitive adopted for structuring the repository and the attention
devoted to feasibility aspects and resource constraints, and the consequent heuristic
strategy. On the other side, we are conscious that our conceptual model is less power-
ful than ontology based models. A complete comparison with existing approaches is
out of the scope of this paper.
We are now analyzing lessons learned and improving the methodology. First, we
are extending the methodology to the production of abstract schemas in the reposi-
tory. This step may effectively use the results of previous steps 1-5. In fact, the initial
schema obtained after steps 1-3 inherits high level abstract knowledge from the CPA
Repository and basic knowledge from the LPA logical schemas, while the enriched
schema obtained in steps 4-5 encapsulates basic knowledge from the LPA logical
schemas. We may conjecture that the initial schema is a candidate for abstract schema
for the upper levels of the LPA repository, while the enriched schema, being a more
detailed description representing a logical schema, populates the basic level of the
repository. So, we may conceive two possible strategies for the repository update
step.
In the first strategy, starting from the initial schema and the enriched schema we
first complete the “local” repository of abstract schemas corresponding to the en-
riched schema; we then integrate the local repository with the actual one: it may hap-
pen that we have to update, due to similarities between concepts, the abstract schemas
of the actual repository, or else add new schemas, autonomous with respect to the
previous ones.
In the second strategy the new repository is obtained through abstrac-
tion/integration activities on the actual LPA repository and the initial and refined
schemas.
The first strategy is probably more effective when the actual LPA repository and
the new schema represent very different knowledge, while the second strategy has the
advantage of natively using the structuring paradigm of the repository, the abstrac-
tion/integration operation. We are currently experimenting with the two strategies,
and other possible strategies, such as building small homogeneous repositories and
then integrating them to obtain a larger repository.
We are also investigating new techniques that use more complex similarity meas-
ures in matching between generalization hierarchies and logical schemas. Further-
more, since some of the local PA schemas (and corresponding hierarchies) have been
independently developed, especially in the regional territory area, we are using such
schemas as training examples to tune semiautomatic steps of the methodology and
similarity measures adopted.
References
1. C. Batini, G. Di Battista, G. Santucci - Structuring primitives for a dictionary of entity
relationship data schemas -IEEE Transactions on Software Engineering vol.19 no.4 April
1993.
2. C. Batini, S. Castano, V. De Antonellis, M.G. Fugini, B. Pernici - Analysis of an Inventory
of Information systems in the Public Administration - Requirements Engineering, 1996, 47-
62
3. C. Batini, S: Castano, B. Pernici - Tutorial on Reuse Methodologies and Tools - Entity
Relationships International Conference, Cottbus, Germany, 1996.
4. Jonathan DiLeo, Timothy Jacobs, and Scott DeLoach. Integrating Ontologies into Multi-
agent Systems Engineering. Fourth International Bi-Conference Workshop on Agent-
Oriented Information Systems (AOIS-2002). 15-16 July 2002, Bologna (Italy).
5. Adam Farquhar, Richard Fikes, Wanda Pratt, James Rice - Collaborative Ontology Con-
struction for Information Integration -Knowledge Systems Laboratory Department of Com-
puter Science, KSL-95-63, August 1995.
6. F. Fonseca, C. Davis, G. Camara – Bridging Ontologies and Conceptual schemas in Geo-
graphic Information systems – Geoinformatica 7:4, pp. 355 – 378, 2003.
7. J. Perez, I. Ramos, J: Cubel, F. Dominguez, A. Boronat, J. Carsì Data reverse
engineering of Legacy Databases to object oriented conceptual schemas - Elec-
tronic Notes in Theoretical Computer Sceince 74 No. 4, 2002.
8. R. Slota, M. Majewska, M. Dziewierz, K Krawczyk, M. Laclavik, Z. Balogh, L. Hluchy, J.
Kitowski, S. Lambert Ontology Assisted Access to Document Repositories for Public Sec-
tor Organizations, PPAM 2003, Czestochowa, Poland
9. Taxonomic Databases Working Group on Biodiversity Informatics 2004, Taxonomic Data-
bases Working Group Annual Meeting, 11-17 October 2004 - University of Canterbury,
Christchurch, New Zealand.