Effective method for large scale ontology matching
Gayo Diallo
Univ. Bordeaux, ISPED-LESIM
146 rue Léo Saignat
F-33000 Bordeaux
Gayo.Diallo@isped.u-bordeaux2.fr

Mouhamadou Ba
Univ. Bordeaux, ISPED-LESIM
146 rue Léo Saignat
F-33000 Bordeaux
Mouhamadou.Ba@isped.u-bordeaux2.fr


ABSTRACT
Nowadays, we are facing a proliferation of heterogeneous biomedical data sources accessible
through various knowledge-based applications. These data are annotated by increasingly large
and widely disseminated knowledge organization systems, ranging from simple terminologies and
structured vocabularies to very formal ontologies. In order to solve the interoperability
issue which arises from the heterogeneity of these ontologies, an alignment task is usually
performed. However, while a significant effort has been undertaken to provide tools that
automatically align ontologies containing hundreds of entities, little attention has been
paid to the matching of large ontologies, which are common in the life sciences domain. We
present in this paper ServOMap, a fast and efficient high-precision system able to match
ontologies containing hundreds of thousands of entities. The system participated in the 2012
edition of the Ontology Alignment Evaluation Initiative campaign and achieved very good
performance, ranking among the top three systems for the Large Biomedical Ontologies Track.

Categories and Subject Descriptors
I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods -
representation languages; H.3.1 [Information Storage And Retrieval]: Content Analysis and
Indexing - Indexing methods, Thesauruses; J.3 [Life And Medical Sciences]: Medical
information systems.

General Terms
Algorithms, Performance, Design.

Keywords
Life Sciences Ontology Matching, Ontology Repository, Semantic Interoperability

1. INTRODUCTION
     With the wide adoption of Semantic Web technologies, the increasing availability of
knowledge-based applications in the life sciences domain raises the issue of finding possible
correspondences between the underlying knowledge organization systems (KOS). Indeed, various
terminologies, structured vocabularies and ontologies are used for annotating data, and the
linked open data initiative is increasing this activity. One of the key roles played by these
KOS is to support data exchange based not only on a common syntax but also on shared
semantics. This role makes them a central component of the Semantic Web and of the emerging
e-science and e-health infrastructure.
     These KOS, which are developed independently at the discretion of the various projects,
are heterogeneous in nature. Moreover, they are becoming more complex, larger and
multilingual. For instance, the Systematized Nomenclature of Medicine - Clinical Terms
(SNOMED-CT), a multiaxial, hierarchical classification system that is used by physicians and
other health care providers for encoding clinical health information, contains more than
300,000 concepts which are regularly evolving. Each concept is sometimes designated by
several synonymous terms. Another example is the International Classification of Diseases
(ICD), the World Health Organization (WHO) standard diagnostic tool for epidemiology, health
management and clinical purposes, which is used to monitor the incidence and prevalence of
diseases and other health problems. The current ICD-10 version contains more than 12,000
concepts designated with terms in 43 different languages including English, Spanish and
French.
     In many cases, there is a need to establish mappings between these different KOS in
order to make the systems that use them interoperable. For instance, the EU-ADR project (1)
developed a computerized system that exploits data from eight European healthcare databases
and electronic health records for the early detection of adverse drug reactions (ADR). As
these databases use different medical terminologies (ICD9, ICD10, Read Code, ICPC) to encode
their data, mappings are needed to translate a query posed to the global system into queries
understandable by the different data sources. Performing manual mappings between all the
mentioned resources is not feasible in a reasonable time. Generally speaking, data
integration and the semantic browsing of information (2) are areas where ontology matching is
usually performed.
     There is, therefore, a crucial need for tools which are able to compute correspondences
between entities of different KOS quickly and automatically, and which can scale to large
ontologies and mapping sets. There is also a need for tools that support applications based
on multiple ontologies.
     Regarding the first issue, a significant effort has been made in the ontology
alignment/matching domain (3), and the Ontology Alignment Evaluation Initiative campaign has
played an important role (4). In this context, it was noticed during the 2011.5 edition of
this campaign that only a few systems, including GOMMA (5) and LogMap (6), were able to match
the whole Foundational Model of Anatomy (FMA) and the National Cancer Institute (NCI)
Thesaurus with a good F-measure in a reasonable time.
     Regarding the second issue, several initiatives have aimed to provide systems that
facilitate access to multiple and various knowledge artifacts within the semantic web
infrastructure (e.g. Swoogle (7), Watson (8), Ontology Lookup Service (OLS) (9) and the
BioPortal initiative (10)). However, they follow a centralized approach. Embedding them in an
application is not easy as they are not designed for that purpose.
     The work described in this paper falls within the above-mentioned research area and
presents the ServOMap approach, a large scale ontology matching system which is able to deal
with large ontologies associated with multilingual terminologies. ServOMap deals with
ontologies described in the RDF(S)^1 and OWL^2 W3C standard languages. It relies on the ServO
Ontology Repository (OR) system (11) (12), which is able to manage multiple KOS and provides
indexing and retrieving features. Thanks to the use of the ServO OR, ServOMap follows
Information Retrieval (IR) based techniques for computing similarity between entities.
Contrary to most of the existing large scale matching systems, it is an ontology matching
system that does not rely on background knowledge.
     From now on, an ontology repository is an index that can be maintained in memory or in
the file system and which stores a “representation” of several KOS which are later used for
performing some meta-operations, including searching for similarity between entities. The
notion of ontology repository described here differs from the notion represented by systems
such as OWLIM (13) and more generally Ontology-Based Database systems (14) and RDF
repositories such as Sesame (15). It is more related to the work described in (16).
     The rest of the paper is structured as follows. In section 2 we briefly outline the
ServO OR on which ServOMap relies and we present its main features. In section 3 we detail
the ServOMap ontology matching approach and describe the different steps of similarity
computing. We present in section 4 the evaluation performed on the Large BioMedical dataset
provided by the 2012 edition of the OAEI campaign. We conclude in section 5 and give some
perspectives for future work.

2. Background on the ServO Ontology Repository
     ServO is a system which provides a decentralized ontology repository for managing
heterogeneous knowledge resources (11). Its design principle is guided by the analogy that
can be made between the retrieval of semantic resources available within an ontology and
traditional information retrieval (IR) techniques over a corpus of documents. ServO provides
an OR and the functionalities that can be embedded within a knowledge-based application for
accessing the managed ontologies.
     It provides functionalities to meet the following set of requirements:
     •    allowing decentralized repositories to be built, maintained and made to communicate
with each other
     •    providing the ability to dynamically index a set of ontologies in a single
repository that can be later updated as needed
     •    overcoming the differences between the languages used for describing ontologies

Figure 1: The ServO Kernel and Business Components (meta operations)

     Thus, the approach adopted is based on the adaptation of tested and validated IR
methods, and the following choices have been made (figure 1). First, a common meta-model is
defined for representing any ontology regardless of its language or format. This meta-model
is instantiated by processing the input ontology with the JENA framework (17). Then, an
Ontology Processing and Loading module is designed and implemented. Finally, an Ontology
Indexing Module (OIM) and an Ontology Retrieving Module (ORM) are designed.
     The OIM and the ORM use the high-performance scalable information retrieval library
Apache Lucene^3. These components are detailed in (11).

Figure 2: Overview of the ServOMap approach

^1 http://www.w3.org/TR/rdf-schema/
^2 http://www.w3.org/TR/owl-features/
^3 http://lucene.apache.org

     The model for the OR defines the two main functionalities of the repository: indexing
and retrieving resources according to some criteria. An indexing and retrieval model
specifies how
documents and queries must be represented. It also details the retrieval function to be used.
Moreover, it determines the notion of relevance. The relevance can be binary (the case of the
Boolean model) or continuous (a ranked list of results).
     ServO allows querying the repository by combining Boolean terms (i.e. the labels of the
entities) and both datatype and object properties. This allows several concepts from
different ontologies to be compared on a structured basis. Following the functionality
offered by the Lucene API, we adopted an approach which combines both the Boolean and the
Vector Space Model (VSM) of IR to compute the relevance between the queries and the entities
of the ontologies within the repository.
     In the VSM, each document or query is represented by a vector in a space where each
dimension is associated with an indexing term. The similarity between the query q and the
concept c is computed as (11):

     \[ \mathrm{score}(q,c) = \mathrm{coord}(q,c) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \text{ in } c) \cdot \mathrm{icf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,c) \bigr) \]

     Where:
     •    tf(t in c) correlates to the term's frequency, defined as the number of times term
t appears in the currently scored concept c. tf(t in c) = √frequency
     •    icf(t) stands for Inverse Concept Frequency. This value correlates to the inverse
of ConceptFreq (the number of concepts in which the term t appears).
     •    coord(q,c) is a score factor based on how many of the query terms are found in the
specified concept.
     •    queryNorm(q) is a normalizing factor that attempts to make scores from different
queries (or even different indexes) comparable.
     •    t.getBoost() is a search time boost of term t in the query q as specified in the
query text.
     •    norm(t,c) encapsulates a few (indexing time) boost and length factors such as
Concept boost and Field boost.

     Finally, the different functionalities offered by the ServO OR are:
     •    Mapping users' query terms to concepts from previously indexed ontologies
(Term2Concept)
     •    Ontology matching and semantic similarity computing between entities of different
ontologies (ServOMap)
     •    Ontology searching in order to provide a KOS or a set of KOS suitable for a
particular task (ServOSearch)
     •    Change detection between different versions of the same KOS (ServOChangeDetect).
     In the following section, we detail the ServOMap ontology matching process, which is
based on the use of the ServO OR.

3. Large scale ontology matching with ServOMap
In this section, we detail the overall process that ServOMap follows for computing similarity
between entities of two given ontologies and, more generally, two given knowledge
organization systems. The approach is depicted in Figure 2. There are five steps, which are
described below.

3.1 Computing Ontology Metrics
     The first step after parsing and loading the input ontologies is to compute a set of
metrics that are later used as parameters for the system. These metrics include, for any
input ontology: the average number of sub-concepts per concept, the different languages used
to denote entity labels or annotations, the most frequent terms within the ontology, the
longest set of synonymous labels used to describe a concept, etc. Some metrics are necessary
for optimizing the use of the Lucene backend.

3.2 Lexical and Contextual Indexing
     As we have already pointed out, ServOMap relies on IR techniques for ontology matching.
Therefore, an ontology is seen as a corpus of documents to process. Each entity (concepts,
and properties including both object properties and datatype properties) is a document to
process.
     To do so, ServOMap constructs an inverted index (an ontology repository) from the input
ontologies. For each ontology, ServOMap uses the Ontology Processing Module of ServO to
retrieve all entities (concepts and properties). Then, according to the parameters computed
during the previous step (Computing Ontology Metrics), a dynamic generation of entity
descriptions is performed. This process is dynamic as each entity is described according to
the features it holds. Thus, some concepts may have synonyms in several languages or may have
comments, whereas others may only have English terms. Likewise, some concepts may have
declared properties (either object properties or datatype properties), etc. During the
dynamic description process, the labels retrieved from a concept are passed to a set of
filters: stop word removal, normalization (upper case to lower case), punctuation removal,
completion of labels by the permutations of their terms, and so on. It is also possible to
indicate whether ServOMap uses label stemming or not. Moreover, the words of a term can be
concatenated as in Table 1.

Table 1: Example of available fields within the index and their term counts for the
Foundational Model of Anatomy ontology

Field Name       Term Counts   Example
dDomain                   15   spatialassocirelat
dRange                     5   string
directLabelCEn       152,088   accessorilobarvein veinaccessorilobar veinlobaraccessori
directNameC           78,884   accessorilobarvein
directNameP               52   percentag
uri                   79,042   http://bioontology.org/#Accessory_lobar_vein

     Table 1 gives an example of available fields and their term counts within the index for
the Foundational Model of Anatomy ontology (FMA). Term counts are provided by the Lucene
backend. FMA contains 79,042 entities, of which 78,884 are concepts. As we can see, the value
of the dDomain field (the domain of a property) is spatialassocirelat, which is the processed
form of the term “spatial association relation”. The concept with id #Accessory_lobar_vein
has as directLabelCEn the set {accessorilobarvein veinaccessorilobar veinlobaraccessori} for
“Accessory lobar vein” and its permutations. All spaces are removed within labels.
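The label filters listed in section 3.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ServOMap's actual code: the function name describe_label, the stop-word list, the max_terms cut-off and the pluggable stem argument are all hypothetical. The sketch also emits every permutation of a label's words, whereas Table 1 lists only three variants for “Accessory lobar vein”, so the real system presumably generates fewer.

```python
import itertools
import string
from typing import Callable, List

# Illustrative stop-word list (assumption; the paper does not give one).
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def describe_label(
    label: str,
    stem: Callable[[str], str] = lambda word: word,  # identity = no stemming
    max_terms: int = 4,  # hypothetical guard against factorial blow-up
) -> List[str]:
    """Return the index tokens generated for one label.

    Filters applied, in order: lower-casing, punctuation removal,
    stop-word removal, optional stemming, then concatenation of the
    words in every order (spaces are dropped).
    """
    words = label.lower().translate(_PUNCT_TO_SPACE).split()
    words = [stem(w) for w in words if w not in STOP_WORDS]
    if not words:
        return []
    if len(words) > max_terms:
        # Too many words: keep only the original word order.
        return ["".join(words)]
    tokens: List[str] = []
    for perm in itertools.permutations(words):
        token = "".join(perm)
        if token not in tokens:  # repeated words yield duplicate tokens
            tokens.append(token)
    return tokens


def toy_stem(word: str) -> str:
    """Stand-in for a real stemmer: only rewrites a trailing 'y' as 'i'."""
    return word[:-1] + "i" if word.endswith("y") else word
```

Calling describe_label("Accessory lobar vein", stem=toy_stem) produces tokens of the same shape as the directLabelCEn example in Table 1; a real deployment would plug in a proper stemmer instead of toy_stem.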
     In ServOMap we make the assumption that two similar concepts are likely to have similar
surrounding concepts. Thus, the description of a concept is completed by contextual
descriptions. The first one is the SubConcept strategy, where a concept is completed by the
information about all its sub-concepts. The second is the SupConcept strategy, where each
concept is completed by the description of its super-concepts. The third one is the
SibConcept strategy. In this case the description of a concept is completed by the
description of all its siblings.
     A flag is used to indicate whether both input ontologies have to be indexed or only the
smallest one. This flag is exploited later during the similarity computing phase.

3.3 Compute lexical based similarity
     After the indexing phase, ServOMap proceeds to the lexical based similarity computing.
This step relies on the Ontology Retrieval Module of the ServO Ontology Repository and uses
the similarity function described in section 2.
     Depending on the flag indicating the indexed ontologies, the Ontology Processing Module
is called to retrieve the concepts used for searching over the built index. Thus, if both
input ontologies are indexed, the first one, say O1, is used as the search ontology over the
index I2 built from the second ontology. Conversely, the ontology O2 is used to search over
the index I1 of the first ontology. If the flag indicates that only one ontology is indexed,
then ServOMap performs only a one-way search.
     As in the lexical and contextual indexing phase, a dynamic generation of the entity
description is performed for any entity used to search the index. A Boolean query is
constructed with all the available fields of the entity (label, comments, properties, etc.).
Please note that the same string processing is performed for all the components of the entity
in order to have the same level of description as in the indexing phase.
     Again, ServOMap relies on the ServO OR. Each Boolean query, represented as a vector of
terms, is searched over the index. A ranked list of entities is retrieved. ServOMap keeps as
a possible mapping the couple made of the searched entity and the entity having the highest
similarity (vectorial similarity) with it. It can happen that several entities have the same
similarity with the searched entity. In this case, in order to keep the most relevant one,
the local names of the entities are compared using the Levenshtein distance.
     At the end of this process, a first set of mappings between the two ontologies is made
available.

3.4 Compute context-based similarity
     The mappings computed previously are usually considered high-precision mappings. Indeed,
as it is almost a strict equality that is used between the entities to compare, and only the
direct description is used, each mapping is likely to be correct. However, this high
precision is offset by a relatively low recall. As the objective is to return as many
mappings as possible, there is a need to complete the set of mappings obtained previously.
     To do so, a context-based similarity is computed. The idea is based on the assumption
that when two entities are similar, there is a good chance that the concepts that surround
them are also similar. Here, by surrounding concepts we mean super-concepts, sub-concepts and
sibling concepts. Thus, in the context-based similarity, the description of a concept is
based on the strategies outlined previously (i.e. SubConcept, SupConcept, SibConcept). This
contextual strategy is applied only to concepts and not to properties. It is also restricted
to the concepts that have not yet been mapped to any other concept. This is again based on
the assumption that if two concepts are mapped by the previous strategy, the mapping is
likely to be correct.
     The same process as previously is followed for dynamically generating the description of
the concepts. The resulting query is sent to the index to retrieve the possible mappings. The
same process is repeated for SubConcept, SupConcept and SibConcept.
     After the complete process, we have three sets of mappings, one per strategy. These
three sets are then combined and duplicate mappings are removed.
     As our approach is mainly lexical, we realized during our experiments that this strategy
generates a lot of noise. We then defined a refinement strategy to select the best mappings
among the set obtained during the context-based mapping. This strategy is briefly described
in the following section.

3.5 Refinement strategy for context-based mappings
     During the refinement of the context-based mappings, we try to keep only the couples
that do not contradict the mappings already found by the lexical-based step. Again, this is
based on the assumption that the lexical-based similarity is highly accurate. In order to
filter the results provided by the SubConcept, SibConcept and SupConcept strategies, we use
the Refinement_SubSupSib algorithm, illustrated in figure 3. In this figure, ContextM is the
set of mappings provided by the context-based

Algo Refinement_SubSupSib
input:  vector ContextM, LexicalM
output: vector CleanContextM
Begin
  For each couple (C1, C2) in ContextM
    If C1 OR C2 exists in LexicalM Then
      1. If (C1 is LexMappedWith Sup(C2) or Sub(C2)) Or
            (C2 is LexMappedWith Sup(C1) or Sub(C1)) Then
           removeCouple(C1, C2)
      2. If C1 is LexMappedWith Sib(C2) Then
           removeCouple(C1, C2)
      3. If C2 is LexMappedWith Sib(C1) Then
           removeCouple(C1, C2)
      4. If Sub(C1) isMappedWith (Sib(C2) OR Sup(C2)) Then
           removeCouple(C1, C2)
      5. If Sup(C1) isMappedWith (Sib(C2) OR Sub(C2)) Then
           removeCouple(C1, C2)
      Do 4. and 5. for C2
    EndIf
  EndFor
  return CleanContextM
End
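The Refinement_SubSupSib pseudocode can be read as the following Python sketch. It is one possible interpretation, not the authors' implementation: the Hierarchy class and the function name refine_context_mappings are hypothetical, "isMappedWith" in rules 4 and 5 is assumed to mean "mapped in LexicalM", and a mapping is assumed to be a pair (concept of O1, concept of O2).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (concept of O1, concept of O2)


@dataclass
class Hierarchy:
    """Neighbourhood lookups for one ontology (hypothetical helper)."""
    sub: Dict[str, Set[str]] = field(default_factory=dict)
    sup: Dict[str, Set[str]] = field(default_factory=dict)
    sib: Dict[str, Set[str]] = field(default_factory=dict)

    def subs(self, c: str) -> Set[str]:
        return self.sub.get(c, set())

    def sups(self, c: str) -> Set[str]:
        return self.sup.get(c, set())

    def sibs(self, c: str) -> Set[str]:
        return self.sib.get(c, set())


def refine_context_mappings(
    context_m: Iterable[Pair],
    lexical_m: Iterable[Pair],
    o1: Hierarchy,
    o2: Hierarchy,
) -> List[Pair]:
    """Drop context-based couples that contradict the lexical mappings."""
    lex = set(lexical_m)
    lex_sources = {a for a, _ in lex}
    lex_targets = {b for _, b in lex}

    def linked(xs: Set[str], ys: Set[str]) -> bool:
        # True if some x (from O1) is lexically mapped to some y (from O2).
        return any((x, y) in lex for x in xs for y in ys)

    clean: List[Pair] = []
    for c1, c2 in context_m:
        # Guard: only couples touching a lexically mapped concept are checked.
        if c1 in lex_sources or c2 in lex_targets:
            contradicts = (
                # Rule 1
                linked({c1}, o2.sups(c2) | o2.subs(c2))
                or linked(o1.sups(c1) | o1.subs(c1), {c2})
                # Rules 2 and 3
                or linked({c1}, o2.sibs(c2))
                or linked(o1.sibs(c1), {c2})
                # Rules 4 and 5
                or linked(o1.subs(c1), o2.sibs(c2) | o2.sups(c2))
                or linked(o1.sups(c1), o2.sibs(c2) | o2.subs(c2))
                # Rules 4 and 5 applied symmetrically to C2
                or linked(o1.sibs(c1) | o1.sups(c1), o2.subs(c2))
                or linked(o1.sibs(c1) | o1.subs(c1), o2.sups(c2))
            )
            if contradicts:
                continue
        clean.append((c1, c2))
    return clean
```

For example, if A is lexically mapped to B and B1 is a sub-concept of B, a context-based couple (A, B1) is discarded by rule 1, while couples that involve no lexically mapped concept pass through unchanged. The nested scan in linked is kept for readability; an index from each concept to its lexical partners would be the natural optimization for large ontologies.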
strategy; LexicalM is the set of mappings computed by the lexical-based strategy. The idea is
to avoid keeping a couple obtained from the context-based similarity when one of its entries
has already been mapped to another concept during the lexical process. This strategy takes
the worst case into account; it allows several unwanted mappings to be removed while
increasing the recall at the same time. However, it generates noise, and the precision
obtained with the lexical-based mappings is then reduced.

Figure 3: Refinement strategy. If (C1, C2) is obtained from the lexical mapping, all the
context-based mappings which contradict (C1, C2) are removed

3.6 Processing Disjoint Concepts
     Some knowledge organization systems are described in formal languages allowing the
expression of complex axioms and constraints. In particular, declared disjoint concepts can
be found in certain KOS. As our approach is mainly based on the lexical description of the
features of entities, it is possible to find two concepts that are lexically similar while
they are semantically declared as disjoint. In order to avoid such a situation, we take these
cases into account during both the indexing and retrieving phases.

Figure 4: Strategy for processing disjoint concepts

     Let us assume that C1 and C2 are two disjoint OWL concepts belonging to an ontology O1,
and C3 and C4 two other disjoint concepts belonging to the ontology O2 (figure 4). In order
to compute the similarity between C1 and C3, we proceed as follows:
     •    If it is O2 which is indexed, then C3 must have a field Disjoint_Concept which
contains all the generated description terms of C4. ServOMap proceeds inversely if O1 is
indexed.
     •    During the similarity computing phase, when the score between C1 and C3 is
computed, the query is built so that no term from the Disjoint_Concept field of C1 (i.e. C2)
appears in the generated description of C3. Similarly, no term from the Disjoint_Concept
field of C3 (i.e. C4) appears in the generated description of C1. Thus, we ensure a set of
mappings that is coherent with respect to disjointness.
     In the following section we present the evaluation of ServOMap that has been performed
on a set of various datasets.

4. Evaluation
     In this section, we report the performance achieved by our system on the large
biomedical track of the OAEI 2012 campaign. To do so, we first describe OAEI and the dataset
that has been used in our evaluation.

4.1 The Ontology Alignment Evaluation Initiative
     The Ontology Alignment Evaluation Initiative, known as the OAEI campaign, is an
international campaign for the systematic evaluation of ontology matching systems. A matching
system is defined by OAEI as a software program capable of finding correspondences (called
alignments) between the vocabularies of a given set of input ontologies (3). The campaign
started in 2004 and is mainly motivated by the need to establish a consensus for the
evaluation of the ever increasing number of methods available for schema matching or ontology
integration. It is usually associated with the Ontology Matching (OM) Workshop of the
International Semantic Web Conference (ISWC).
     For the 2012 edition^4 of the campaign there were 23 participating systems for six
entity matching problems and three others for the instance matching problem. This edition
aimed at automated evaluation to a large extent, with new test sets that have been made
available. This is the case of the Large Biomedical Ontologies track, referred to as
LargeBio, described in the next section.
     The SEALS platform (18) is used for the automated evaluation of all the systems. The
SEALS project is dedicated to the evaluation of semantic web technologies. It created a
platform^5 for easing this evaluation, organizing evaluation campaigns, and building the
community of tool providers and tool users around this evaluation activity. The different
participating systems are wrapped according to the SEALS specification before being uploaded
to the platform. The overall process for the OAEI 2012 campaign using this platform is
described on the campaign web site^6.

^4 http://oaei.ontologymatching.org/2012/
^5 http://www.seals-project.eu/
^6 http://oaei.ontologymatching.org/2012/seals-eval.html

4.2 The OAEI 2012 LargeBio dataset
     The LargeBio track is one of the most challenging tasks in terms of scalability and
complexity. The ontologies in this dataset are semantically rich and contain tens of
thousands of classes. Indeed, the track consists of finding alignments between the
Foundational Model of Anatomy (FMA), which contains 78,989 concepts, SNOMED-CT, which contains
306,591 concepts, and the National Cancer Institute Thesaurus (NCI), which contains 66,724
concepts.
     The FMA is a domain ontology that represents a coherent body of explicit declarative
knowledge about human anatomy. It is integrated in the distributed framework of the Anatomy
Information System developed and maintained by the Structural Informatics Group at the
University of Washington. It is concerned with the representation of classes or types and
relationships necessary for the symbolic representation of the
phenotypic structure of the human body in a form that is understandable to humans and
is also navigable, parseable and interpretable by machine-based systems.

SNOMED CT is a clinical healthcare terminology which provides a core general
terminology for the electronic health record (EHR) and currently contains more than
311,000 active concepts with unique meanings and formal logic-based definitions
organized into hierarchies. It is owned, maintained and distributed by the
International Health Terminology Standards Development Organisation (IHTSDO).

The NCI Thesaurus covers vocabulary for clinical care, translational and basic
research, and public information and administrative activities. It provides reference
terminology for many systems of the National Cancer Institute of the US National
Institutes of Health and for other systems.

The LargeBio track consisted of three matching problems: FMA-NCI, FMA-SNOMED and
SNOMED-NCI. Each matching problem is divided into three tasks involving different
fragments of the considered ontologies, i.e. a small fragment of the ontologies, a big
fragment and the whole ontologies. This leads to 9 sub-tasks. The 2009AA version of the
Unified Medical Language System (UMLS) Metathesaurus is used as the basis for the track
reference alignments (19).

4.3 The configurations used for ServOMap
As ServOMap is highly flexible, it participated in the campaign with two
configurations. They differ in the parameters that are used to tune the matching
process. These parameters are shown in Table II. The first version of the system, which
we refer to as ServOMap-lt, uses the same processing technique for the terms of the
entities being matched regardless of their language (English, French, etc.). In
addition, it takes only concepts into account, contrary to the second version, which we
refer to as ServOMap. Also, only one of the input ontologies is indexed with
ServOMap-lt, the second one being used for searching over the index. Finally,
ServOMap-lt uses stemming techniques for the labels and performs 1:n mappings, while
ServOMap takes into account only 1:1 mappings and does not use stemming. The two
versions are freely available for download online⁷.

⁷ http://code.google.com/p/servo/

   TABLE II.       PARAMETERS USED TO TUNE THE TWO VERSIONS OF THE SYSTEM

     Parameter                      ServOMap-lt                   ServOMap
     Terms processing               The same for all languages    According to the language of the labels
     Entities taken into account    Only concepts                 All entities
     Ontologies indexed             One                           Both
     Searching strategy             One way                       Two ways
     Stemming                       Yes                           No
     Arity of the mappings          1:n                           1:1

4.4 Results
The evaluation was performed on a server with 16 CPUs and 15 GB of allocated RAM. 15 out
of 23 participating systems/configurations were able to cope with at least one of the
tasks of the LargeBio track matching problems.

   TABLE III.      SERVOMAP-LT PERFORMANCE ON THE LARGEBIO DATASET

     Task          Precision   Recall   F1-measure   Time (s)
     FMA-NCI       0.931       0.8      0.86         366
     FMA-SNOMED    0.956       0.60     0.802        790
     SNOMED-NCI    0.875       0.593    0.706        1,248
     AVERAGE       0.890       0.699    0.780        2,405

The performance of the two versions of the ServOMap system is shown in Tables III and
IV. We have averaged the results obtained over all the sub-tasks (small, big, and
whole), so each matching problem (FMA-NCI, FMA-SNOMED, SNOMED-NCI) is presented in one
row. The last row gives the average over the entire LargeBio track, and the last column
gives the total computation times. We refer the reader to the OAEI 2012 LargeBio web
page for the complete results of the evaluation⁸.

⁸ http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/2012/results2012.html

   TABLE IV.       SERVOMAP PERFORMANCE ON THE LARGEBIO DATASET

     Task          Precision   Recall   F1-measure   Time (s)
     FMA-NCI       0.945       0.747    0.834        327
     FMA-SNOMED    0.953       0.656    0.777        893
     SNOMED-NCI    0.901       0.554    0.687        1,089
     AVERAGE       0.903       0.657    0.758        2,310

The best precision is obtained for the FMA-SNOMED matching problem with 95.6% and 95.3%
for ServOMap-lt and
ServOMap respectively. The best recall is obtained for the FMA-NCI matching problem:
ServOMap-lt obtained 80% while ServOMap obtained 74.7%. On average, ServOMap-lt
provides the best recall (69.9%) while ServOMap achieves the best precision (90.3%).
Clearly, these results show that ServOMap-lt benefited from 1:n mappings by providing
more correspondences that can be found in the reference alignment. However, this
decreased its precision. Another explanation of the lower precision is the use of
stemming techniques, which leads to grouping under the same index entry different
labels having the same stem. In contrast, ServOMap, thanks to the 1:1 mapping strategy,
was able to provide the most precise correspondences, but with a lower recall.

From the computation time point of view, the SNOMED-NCI task was the longest to
complete, with 1,248 seconds (20.8 min) and 1,089 seconds (18.15 min) for ServOMap-lt
and ServOMap, respectively. In contrast, the FMA-NCI matching problem was the fastest to
complete: ServOMap-lt performed the task in 366 seconds (6.1 min) while ServOMap
finished in 327 seconds (5.45 min). These results are in line with the size of the
ontologies to match. SNOMED-NCI is the largest task to process in terms of involved
entities.

Now let us compare our system to the other participating systems which completed the
LargeBio track. Based on the official OAEI results, we present the summary of the top-8
systems in Table V. According to these figures, ServOMap-lt provided the best results
in terms of F-measure and precision for the FMA-SNOMED task, while ServOMap generated
the most precise mappings when all the tasks are averaged, with 90.3%. ServOMap-lt
finished overall second in terms of F-measure with 78%, closely behind the YAM++ system
(78.2%) (20). For the computation times, ServOMap finished the entire 9 tasks in 2,310
seconds (38.5 min), in second position behind the LogMaplt system (711 seconds) (14),
while YAM++ completed them in 18 hours. We mention that the GOMMA, YAM++ and LogMap
systems use different kinds of background knowledge. LogMap uses normalisations and
spelling variants from the UMLS Lexicon, while YAM++ uses the general purpose
background knowledge provided by WordNet and GOMMA reuses mappings from FMA-UMLS and
NCI-UMLS.

Please note that the last column of Table V (Incoherence) reports the number of
unsatisfiabilities when reasoning using the HermiT reasoner with the input ontologies
together with the computed mappings. The logic assessment of computed mappings is not
yet implemented in ServOMap. LogMap was the system which provided the cleanest
mappings.

    TABLE V.        SUMMARY RESULTS OF THE TOP 8 SYSTEMS OF THE LARGEBIO TRACK

5. Conclusion and Perspectives
We have presented in this paper the main component of the ServO Ontology Repository and
detailed its ServOMap component for large scale ontology matching. We have reported the
performance obtained by this component on the LargeBio track during the 2012 edition of
the OAEI campaign. The two versions of ServOMap achieved very good results both in
terms of F-measure and computation times, finishing among the top 3 systems and
providing mappings with the best precision. We notice, however, that so far our
approach relies heavily on the richness of the description of the input ontologies,
which is usually the case in the life sciences domain. The efficiency is reduced for
KOS whose mappings must be based more on the structure.

There is room for improvement in this research work. First, we plan to improve the
algorithm used for filtering out the mappings provided by the context-based matching in
order to increase recall without reducing precision. ServOMap does not use any external
resource in the similarity computing process. We intend to use the UMLS resource to
better discard wrong mappings for the ontologies present in this resource. Moreover,
the current version does not take into account the mapping of two ontologies described
in two different languages, for instance an ontology with terms in English to be
compared with an ontology with terms in German. An improvement of the system is then to
implement cross-lingual ontology matching. Finally, we plan to introduce logic
assessment of computed mappings (21) and to implement a user-friendly interface.

6. Acknowledgment
We thank the organizers of the OAEI evaluation campaigns for providing us with the test
data and the SEALS infrastructure, and the LargeBio track organizers for their valuable
feedback.

7. References
1.  Avillach P, Mougin F, Joubert M, Thiessard F, Pariente A, Dufour J-C, et al. A
semantic approach for the homogeneous identification of events in eight patient
databases: a contribution to the European eu-ADR project. Stud Health Technol Inform.
2009;150:190–4.
2.  Diallo G, Khelif K, Corby O, Kostkova P, Madle G. Semantic Browsing of a Domain
Specific Resources: The Corese-NeLI Framework. Web Intelligence/IAT Workshops. 2008.
p. 50–4.
3.  Shvaiko P, Euzenat J. Ten Challenges for Ontology Matching. In: Meersman R, Tari Z,
editors. On the Move to Meaningful Internet Systems: OTM 2008 [Internet]. Springer
Berlin / Heidelberg; 2008. p. 1164–82. Available from:
http://dx.doi.org/10.1007/978-3-540-88873-4_18
4.  Euzenat J, Meilicke C, Stuckenschmidt H, Shvaiko P, Santos CT dos. Ontology
Alignment Evaluation Initiative: Six Years of Experience. J. Data Semantics.
2011;15:158–92.
5.  Kirsten T, Gross A, Hartung M, Rahm E. GOMMA: a component-based infrastructure for
managing and analyzing life science ontologies and their evolution. Journal of
Biomedical Semantics. 2011;2(1):6.
6.  Ruiz EJ, Grau BC, Zhou Y, Horrocks I. Large-scale Interactive Ontology Matching:
Algorithms and Implementation.
Proceedings of the 20th European Conference on Artificial Intelligence (ECAI). IOS
Press; 2012. p. 444–9.
7.  Finin T, Peng Y, Scott R, Joel C, Joshi SA, Reddivari P, et al. Swoogle: A search
and metadata engine for the semantic web. In Proceedings of the Thirteenth ACM
Conference on Information and Knowledge Management. ACM Press; 2004. p. 652–9.
8.  d’Aquin M, Motta E, Sabou M, Angeletou S, Gridinoc L, Lopez V, et al. Toward a New
Generation of Semantic Web Applications. IEEE Intelligent Systems. 2008;23:20–8.
9.  Côté RG, Jones P, Apweiler R, Hermjakob H. The Ontology Lookup Service, a
lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics.
2006;7:97.
10. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, et al. BioPortal: ontologies
and integrated data resources at the click of a mouse. Nucleic Acids Research. 2009
May 29;37(Web Server):W170–W173.
11. Diallo G. Efficient Building of Local Repository of Distributed Ontologies. IEEE;
2011 [cited 2012 Oct 6]. p. 159–66. Available from:
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6120644
12. Diallo G. Towards decentralized and cooperative repositories of distributed
ontologies. Proceedings of SWAT4LS 2011. 2011. p. 8–9.
13. Kiryakov A, Damova M. The Semantic Web: Semantic Repositories. Semantic Web
Handbook. Springer Verlag, Heidelberg Germany. 2011.
14. Fankam C, Jean S, Pierra G, Bellatreche L, Ait-Ameur Y. Towards Connecting Database
Applications to Ontologies. IEEE Computer Society, Conference Publishing Service; 2009.
p. 131–7.
15. Schenk S, Petrak J. Sesame RDF Repository Extensions for Remote Querying.
Znalosti2008 [Internet]. 2008. Available from:
http://znalosti2008.fiit.stuba.sk/download/articles/znalosti2008-Schenk.pdf
16. Ghoula N, Falquet G. Towards an ontology based large repository for managing
heterogeneous knowledge resources. E-LKR’12. 2012.
17. Carroll JJ, Dickinson I, Dollin C, Reynolds D, Seaborne A, Wilkinson K. Jena:
implementing the semantic web recommendations. Proceedings of the 13th international
World Wide Web conference on Alternate track papers & posters [Internet]. New York, NY,
USA: ACM; 2004. p. 74–83. Available from: http://doi.acm.org/10.1145/1013367.1013381
18. Esteban-Gutiérrez M, García-Castro R, Gómez-Pérez A. Executing Evaluations over
Semantic Technologies using the SEALS Platform. IWEST 2010. 2010.
19. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical
terminology. Nucleic Acids Research. 2004;32(Database-Issue):267–70.
20. Ngo D, Bellahsene Z. YAM++ : A Multi-strategy Based Approach for Ontology Matching
Task. In: ten Teije A, Völker J, Handschuh S, Stuckenschmidt H, d’Aquin M, Nikolov A,
et al., editors. EKAW [Internet]. Springer; 2012. p. 421–5. Available from:
http://dblp.uni-trier.de/db/conf/ekaw/ekaw2012.html#NgoB12
21. Meilicke C, Stuckenschmidt H, Sváb-Zamazal O. A Reasoning-Based Support Tool for
Ontology Mapping Evaluation. ESWC. 2009. p. 878–82.
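
As an illustration of the strategy for processing disjoint concepts (Figure 4), the
following minimal sketch shows how a candidate pair can be rejected when the
description of one concept contains a term stored in the Disjoint_Concept field of the
other. It is only a sketch under stated assumptions: the Concept class, the function
names and the Jaccard score are hypothetical and are not taken from ServOMap, which
builds a query over an index rather than comparing term sets in memory.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Concept:
    """A concept reduced to the two term sets used in this sketch (hypothetical structure)."""
    name: str
    terms: Set[str]  # generated description terms of the concept
    # Terms of the concepts declared disjoint with this one (the Disjoint_Concept field).
    disjoint_terms: Set[str] = field(default_factory=set)


def violates_disjointness(a: Concept, b: Concept) -> bool:
    """True if one description uses a term from the other's Disjoint_Concept field."""
    return bool(a.disjoint_terms & b.terms) or bool(b.disjoint_terms & a.terms)


def lexical_score(a: Concept, b: Concept) -> float:
    """Jaccard overlap of description terms; a stand-in for the index-based score."""
    union = a.terms | b.terms
    return len(a.terms & b.terms) / len(union) if union else 0.0


def similarity(a: Concept, b: Concept) -> float:
    """Lexical score, forced to zero when the pair conflicts with a disjointness axiom."""
    if violates_disjointness(a, b):
        return 0.0
    return lexical_score(a, b)


# O1: C1 is declared disjoint with C2, so C1 carries the terms of C2.
c1 = Concept("C1", {"artery"}, disjoint_terms={"vein"})
# O2: C3 is declared disjoint with C4, so C3 carries the terms of C4.
c3 = Concept("C3", {"artery", "vessel"}, disjoint_terms={"nerve"})
# A lexically close candidate whose description mentions a term of C2.
cx = Concept("CX", {"artery", "vein"})

print(similarity(c1, c3))  # pair kept, scored lexically
print(similarity(c1, cx))  # pair rejected because of the disjointness conflict
```

In this toy setting the pair (C1, C3) keeps its lexical score, whereas (C1, CX) is
discarded even though the two descriptions share a term.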
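
The F1-measure values reported in Table IV can be related to precision and recall
through the usual harmonic mean, F1 = 2PR / (P + R). The sketch below assumes this
standard definition, which the text does not spell out, and recomputes F1 from the
precision and recall of each ServOMap row. Since the reported figures are averages over
the sub-tasks, small deviations from the recomputed values are to be expected.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-measure) for ServOMap, copied from Table IV.
TABLE_IV = {
    "FMA-NCI": (0.945, 0.747, 0.834),
    "FMA-SNOMED": (0.953, 0.656, 0.777),
    "SNOMED-NCI": (0.901, 0.554, 0.687),
    "AVERAGE": (0.903, 0.657, 0.758),
}


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between the recomputed and the reported F1-measure."""
    return max(abs(f1(p, r) - reported) for p, r, reported in rows.values())


for task, (p, r, reported) in TABLE_IV.items():
    print(f"{task}: recomputed {f1(p, r):.3f}, reported {reported:.3f}")
print(f"largest deviation: {max_deviation(TABLE_IV):.4f}")
```

For every row the recomputed value stays within 0.005 of the reported one.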
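
The effect of the arity of the mappings (1:n for ServOMap-lt, 1:1 for ServOMap) on
precision and recall can be illustrated with a small sketch. It assumes a list of scored
candidate correspondences and a greedy best-first selection for the 1:1 case. This
selection rule, the threshold and the concept identifiers are hypothetical choices made
for illustration only; the text does not describe how ServOMap derives its 1:1
mappings.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (source concept, target concept, score)


def one_to_n(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Keep every candidate above the threshold; a source may map to several targets."""
    return [c for c in candidates if c[2] >= threshold]


def one_to_one(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Greedy best-first selection: each source and each target is used at most once."""
    used_sources, used_targets, kept = set(), set(), []
    for source, target, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score < threshold:
            break  # candidates are sorted, so all remaining scores are lower
        if source in used_sources or target in used_targets:
            continue
        kept.append((source, target, score))
        used_sources.add(source)
        used_targets.add(target)
    return kept


# Hypothetical candidates between two ontologies.
candidates = [
    ("FMA:Heart", "NCI:Heart", 0.95),
    ("FMA:Heart", "NCI:Cardiac_Muscle", 0.70),
    ("FMA:Lung", "NCI:Lung", 0.90),
    ("FMA:Lung", "NCI:Heart", 0.40),
]

print(one_to_n(candidates, threshold=0.5))
print(one_to_one(candidates, threshold=0.5))
```

The 1:n selection returns more correspondences, which may raise recall at the cost of
precision, while the 1:1 selection keeps only the best-scoring correspondence per
concept.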