=Paper= {{Paper |id=None |storemode=property |title=Cross-Lingual Ontology Mapping and Its Use on the Multilingual Semantic Web |pdfUrl=https://ceur-ws.org/Vol-571/paper3.pdf |volume=Vol-571 |dblpUrl=https://dblp.org/rec/conf/www/FuBO10 }} ==Cross-Lingual Ontology Mapping and Its Use on the Multilingual Semantic Web== https://ceur-ws.org/Vol-571/paper3.pdf
        Cross-Lingual Ontology Mapping and Its Use on the
                   Multilingual Semantic Web
                                         Bo Fu, Rob Brennan, Declan O’Sullivan
                  Knowledge and Data Engineering Group, School of Computer Science and Statistics,
                               Trinity College Dublin, College Green, Dublin 2, Ireland
                                  {bofu, rob.brennan, declan.osullivan}@scss.tcd.ie

ABSTRACT                                                                    personalised querying of multilingual knowledge repositories are
Ontology-based knowledge management systems enable the                      presented in section 4. An overview of the initial implementation
automatic discovery, sharing and reuse of structured data sources           of the proposed framework is given in section 5. Section 6
on the semantic web. With the emergence of multilingual                     presents an experiment that engages the integrated framework in a
ontologies, accessing knowledge across natural language barriers            mapping scenario that involves ontologies labelled in English and
has become a pressing issue for the multilingual semantic web. In           Chinese, and discusses the evaluation results and findings from
this paper, a semantic-oriented cross-lingual ontology mapping              this experiment. Finally, work in progress is outlined in section 7.
(SOCOM) framework is proposed to enhance interoperability of
ontology-based systems that involve multilingual knowledge                  2. STATE OF THE ART
repositories. The contribution of cross-lingual ontology mapping            Current CLOM strategies can be grouped into five categories,
is demonstrated in two use case scenarios. In addition, the notion          namely manual processing, corpus-based approach, instance-
of appropriate ontology label translation, as employed by the               based approach, linguistic enrichment of ontologies and the two-
SOCOM framework, is examined in a cross-lingual ontology                    step generic approach. A costly manual CLOM process is
mapping experiment involving ontologies with a similar domain               documented in [13], where the English version of the
of interest but labelled in English and Chinese respectively.               AGROVOC 1 thesaurus is mapped to the Chinese Agriculture
Preliminary evaluation results indicate the promise of the cross-           Thesaurus. Given large and complex ontologies, such an approach
lingual mapping approach used in the SOCOM framework, and                   would be infeasible. Ngai et al. [16] propose a corpus-based
suggest that the integrated appropriate ontology label translation          approach to align the English thesaurus WordNet 2 and the
mechanism is effective in the facilitation of monolingual matching          Chinese thesaurus HowNet3. As bilingual corpora are not always
techniques in cross-lingual ontology mapping scenarios.                     available to domain-specific ontologies, it is difficult to apply
                                                                            their approach in practice. The instance-based approach proposed
Keywords                                                                    by Wang et al. [24] generates matching correspondences based on
Cross-Lingual Ontology Mapping; Appropriate Ontology Label                  the analysis of instance similarities. It requires rich sets of
Translation; Matching Assessment Feedback; Querying of                      instances embedded in ontologies, which is a condition that may
Multilingual Knowledge Repositories.                                        not always be satisfied in the ontology development process.
                                                                            Pazienza & Stellato propose a linguistically motivated mapping
                                                                            method [17], advocating a linguistic-driven approach in the
1. INTRODUCTION                                                             ontology development process that generates enriched ontologies
The promise of the semantic web is that of a new way to organise,           with human-readable linguistic resources. To facilitate this
present and search information that is based on meaning and not             linguistic enrichment process, a plug-in for the Protégé4 editor –
just text. Ontologies are explicit and formal specifications of             OntoLing 5 was also developed [18]. Linguistically enriched
conceptualisations of domains of interest [11], and thus are at the         ontologies may offer strong evidence when generating matching
heart of semantic web technologies such as semantic search [8]              correspondences. However, as such enrichment is not currently
and ontology-based information extraction [2]. As knowledge and             standardised, it is difficult to apply the proposed solution.
knowledge representations are not restricted to the usage of a                  Trojahn et al. [23] present a multilingual ontology mapping
particular natural language, multilinguality is increasingly evident        framework, where ontology labels are first represented with
in ontologies as a result. Ontology-based applications therefore            collections of phrases in the target natural language. Matches are
must be able to work with ontologies that are labelled in diverse           then generated using specialized monolingual matching agents
natural languages. One way to realise this is by means of cross-            that use various techniques (e.g. structure-based matching
lingual ontology mapping (CLOM).                                            algorithms, lexicon-based matching algorithms and so on).
    In this paper, a summary of current CLOM approaches is                  However, as Shvaiko & Euzenat state in [20], “despite the many
presented in section 2. A semantic-oriented cross-lingual ontology          component matching solutions that have been developed so far,
mapping (SOCOM) framework that aims to facilitate mapping                   there is no integrated solution that is a clear success”. Often
tasks carried out in multilingual environments is proposed and              various techniques are combined in order to generate high quality
discussed in section 3. To illustrate possible applications of the          matching results [12], searching for globally accepted matches
SOCOM framework on the multilingual semantic web, two use
case scenarios including cross-language document retrieval and

 Copyright is held by the author/owner(s).
 WWW 2010, April 26-30, 2010, Raleigh, North Carolina, USA.

1 http://aims.fao.org/website/AGROVOC-Thesaurus/sub
2 http://wordnet.princeton.edu
3 http://www.keenage.com/html/e_index.html
4 http://protege.stanford.edu
5 http://art.uniroma2.it/software/OntoLing




can lead to a limited matching scope. In 2008, an OAEI6 test case            position in the ontology structure. Thus, the ontology rendering
that involves the mapping of web directories written in English              process should not modify the position of a node, because doing
and Japanese was designed. Only one participant – the RiMOM                  so would effectively alter the semantics of the original ontology.
tool – was able to submit results [26], by using a Japanese-
English dictionary to translate labels from the Japanese web
directory into English first, before applying monolingual matching
procedures. This highlights the difficulty of exercising current
monolingual matching techniques in CLOM scenarios.
    Trojahn et al.’s framework and RiMOM’s approach both
employ a generic two-step method, where ontology labels are
translated into the target natural language first and monolingual
matching techniques are applied next. The translation process
occurs in isolation from the mapping activity, and takes place
independently of the semantics in the concerned ontologies. As a
result, inadequate and/or synonymic translations can introduce
“noise” into the subsequent matching step, where matches may be
missed by matching techniques that rely solely on the
discovery of lexical similarities. This observation is further
examined in [9], where strong evidence indicates that to enhance
the performance of existing monolingual matching techniques in
CLOM scenarios, appropriate ontology label translation is key to
the generation of high quality matching results. This notion of
selecting appropriate ontology label translations in the given
mapping context is the focus of the SOCOM framework and the
evaluation shown in this paper.
    Notable work in the field of (semi-)automatic ontology label
translation conducted by Espinoza et al. [7] introduces the
LabelTranslator tool, which is designed to assist humans during
the ontology localisation process. As the user selects the labels of
an ontology one at a time, a ranked list of suggested translations is
presented for each label. The user finally decides which suggested
translation is the best one to localise the given ontology. The
ontology rendition process of the SOCOM framework presented in
this paper differs from the LabelTranslator tool in its input, output
and design purpose. Firstly, our rendering process takes formally
defined ontologies (i.e. in RDF/OWL format) as input, rather than
the individual labels within an ontology.
Secondly, it outputs formally defined ontologies labelled in the
                                                                                  Figure 1. SOCOM Framework Workflow Overview
target natural language, but not lists of ranked translation
suggestions. Lastly, our rendering process is designed to facilitate             In contrast to the generic approach, where the translation of
further machine processing (more precisely, existing monolingual             ontology labels takes place in isolation from the ontologies
ontology matching techniques), whereas the LabelTranslator tool              concerned, the SOCOM framework is semantic-oriented and aims
aims to assist humans.                                                       to identify the most appropriate translation for a given label. To
                                                                             achieve this, firstly, suitable translation tools are selected at the
                                                                             translator selection point to generate candidate translations. This
3. THE SOCOM FRAMEWORK                                                       selection process is influenced by the knowledge domain of the
Given ontologies O1 and O2 (see Figure 1) that are labelled in               concerned ontologies. For general knowledge representations, off-
different natural languages, O1 is first transformed by the SOCOM            the-shelf machine translation (MT) tools or thesauri can be
framework into an equivalent of itself through the ontology                  applied. For specific domains such as the medical field,
rendering process as O1'. O1' contains all the original semantics of         specialised translation media are more appropriate. Secondly, to
O1 but is labelled in the natural language that is used by O2. O1' is        identify the most appropriate translation for a label among its
then matched to O2 using monolingual matchers to generate                    candidate translations, the appropriate translation selection
candidate matches, which are then reviewed by the matching                   process is performed. This selection process is under the influence
assessment mechanism in order to establish the final mappings.               of several information sources including the source ontology
    Ontology renditions are achieved by structuring the translated           semantics, the target ontology semantics, the mapping intent, the
ontology labels in the same way as the original ontology O1, and             operating domain, the time constraints, the resource constraints,
assigning these translation labels to new namespaces to create               the user and finally the matching assessment result feedback.
well-formed resource URIs in O1' (for more details, please see               These influences are explained next.
[9]). Note that the structure of O1 is not changed during this                   The semantics defined in O1 can indicate the context that a to-
process because, as Giunchiglia et al. [10] point out, the conceptualisation  be-translated label is used in. Given a certain position of the node
of a particular ontology node is captured by its label and its               with this label, the labels of its surrounding nodes (referred to as
                                                                             surrounding semantics in this paper) can be retrieved and studied.
6 http://oaei.ontologymatching.org                                            For example, for a class node, its surrounding semantics can be




represented by the labels of its super/sub/sibling-classes. For a             class Summary contains: {BookChapter, Reference}, and the
property node, its surrounding semantics can be represented by                surrounding semantics of the class Abstract would include:
the labels of the resources which this property restricts. For an             {Mathematics, Applied}. Using string comparison techniques, one
individual, the surrounding semantics can be characterised by the             can determine that the strings in the surroundings of the target
label of the class it belongs to. Depending on the granularity of             class Summary are more similar to those of the source class.
the given ontologies in a mapping scenario, an ontological                    Summary therefore would be the appropriate translation in such a
resource’s surrounding semantics should be modelled with                      case. Note that the SOCOM framework is concerned with
flexibility. For example, if the ontologies are rich in structure,            searching for appropriate translations (from a mapping point of
immediate surrounding resource labels (e.g. direct super/sub                  view) but not necessarily the most linguistically correct
relations) alone can form the content of the surrounding                      translations (from a natural language processing point of view),
semantics. If the ontologies are rich in instances, where the                 because our motivation for translating ontology labels is so that
immediate surrounding label (e.g. the class an instance belongs to)           the ontologies can be best mapped8. This should not be confused
alone is too weak to convey the instance’s context of use, indirectly         with translating labels for the purpose of ontology localisation,
related resource labels (e.g. all super/sub classes declared in the           where labels of an ontology are translated so that it is “adapted to
ontology) should be included in the surrounding semantics. The goal of        a particular language and culture” [21].
obtaining surrounding semantics of a given resource is to provide
the translation selection process with additional indications of the
context a resource is used in7.
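The collection of surrounding semantics for a class node can be sketched as follows. This is a minimal illustration under stated assumptions, not the SOCOM implementation: the `ToyOntology` class, its method names, the `indirect` switch and the `translate_context` helper are all hypothetical, and only class nodes are covered (properties and individuals are omitted). The labels are those of the source ontology in the Figure 2c example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOntology:
    """Hypothetical stand-in for a class hierarchy: class label -> super-class label."""
    parent_of: dict = field(default_factory=dict)

    def supers(self, cls):
        parent = self.parent_of.get(cls)
        return set() if parent is None else {parent}

    def subs(self, cls):
        return {c for c, p in self.parent_of.items() if p == cls}

    def siblings(self, cls):
        return {s for p in self.supers(cls) for s in self.subs(p)} - {cls}

    def closure(self, cls, step):
        """All labels reachable from cls by repeatedly applying step (supers or subs)."""
        seen, frontier = set(), set(step(cls))
        while frontier:
            node = frontier.pop()
            if node not in seen:
                seen.add(node)
                frontier |= step(node)
        return seen


def surrounding_semantics(onto, cls, indirect=False):
    """Labels of the super-, sub- and sibling-classes of cls.

    With indirect=True, all declared ancestors and descendants are added,
    for ontologies whose immediate neighbourhood gives too little context.
    """
    context = onto.supers(cls) | onto.subs(cls) | onto.siblings(cls)
    if indirect:
        context |= onto.closure(cls, onto.supers) | onto.closure(cls, onto.subs)
    return context


def translate_context(context, candidates):
    """Replace each surrounding label by all of its candidate translations."""
    return {t for label in context for t in candidates.get(label, [])}


# Source ontology fragment and candidate translations from the Figure 2c example.
source = ToyOntology({"摘要": "出版物", "章节": "出版物", "书籍": "出版物"})
candidates = {
    "出版物": ["publication", "printing"],
    "章节": ["chapter", "section"],
    "书籍": ["book", "literature"],
}
context = surrounding_semantics(source, "摘要")
translated = translate_context(context, candidates)
```

Keeping the direct and indirect neighbourhoods behind one switch reflects the point made above that the surrounding semantics should be modelled flexibly, depending on how rich the given ontologies are in structure or in instances.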
     As O1 is transformed so that it can be best mapped to O2, the
semantics defined in O2 therefore can act as broad translation
selection rules. When several translation candidates are all
linguistically correct for a label in O1, the most appropriate
translation is the one that is most semantically similar to what is
used in O2. An example of appropriate ontology label translation

is shown in Figure 2, where the source ontology is labelled in
Chinese and is mapped to an English target ontology. The class
摘要 from the source ontology has translation candidates abstract
and summary. To determine the most appropriate translation, the                      Figure 2. Examples of Appropriate Label Translation
defined semantics of the target ontology can influence the
translation selection process. To understand how this is possible,                In addition to using the embedded semantics of the given
consider three scenarios. Figure 2a demonstrates a situation where            ontologies, task intention can also influence the outcome of the
a class named Summary exists in the target ontology. In this case,            translation selection process as it captures some of the mapping
Summary would be considered as more appropriate than abstract                 motives. Consider a CLOM scenario where the user is not
since it is the exact label used by the target ontology. Figure 2b            comfortable with all the natural languages involved, and would
illustrates another scenario where the target ontology contains a             like to test just how meaningful/useful it is to map the given
class named Sum. From a thesaurus or a dictionary, one can learn              ontologies. In such a case, the selection of translation candidates
that Sum is a synonym of summary, therefore, instead of using                 need not be very sophisticated, thus results returned from off-the-
either abstract or summary, Sum will be chosen as the appropriate             shelf MT tools can be acceptable. The domain of the ontologies is
translation in this case. Figure 2c shows a third scenario where              another influence on the translation selection process. For
both Abstract and Summary exist in the target ontology, the                   example, if O1 and O2 are domain representations where each one
                                                                              is associated with collections of documents in different natural
appropriate translation is then concluded by studying the
surrounding semantics. The source class 摘要 has a super-class 出版物          languages, lists of frequently used words in these documents can
(with translation candidates publication and printing), two                   be collected. The translation candidate that is ranked highest on
sibling-classes 章节 (with translation candidates chapter and
                                                                              these lists would be deemed as the most appropriate translation.
section) and 书籍 (with translation candidates book and
                                                                              Moreover, time constraints can influence the translation selection
                                                                              process. If the mappings must be conducted dynamically such as
literature). Its surrounding semantics therefore include:                     the work presented in [5], the translation selection consequently
{publication, printing, chapter, section, book, literature}.                  must be fast, where it might not make use of all the resources that
Similarly, in the target ontology, the surrounding semantics of the           are available to it. On the other hand, not all of the
                                                                              aforementioned resources will be available in every CLOM
7                                                                             scenario. Resource constraints therefore can have an impact on
    The generation of surrounding semantics presented in this paper
    does not attempt to estimate the semantic relatedness between             the outcome of the translation selection process. Furthermore,
    concepts, it is a procedure performed within readily defined              users, at times, can have the expertise that is not obtained by the
    ontologies in a cross-lingual ontology mapping scenario that              system, and should influence the translation selection process
    aims to gather the context of use for a particular resource in the        when necessary. Lastly, matching result feedback can influence
    given ontologies. Though one might assume that the SOCOM                  the future selection of appropriate translations (discussed next).
    framework would work best when ontologies with similar
    granularity are presented, this however, is not a requirement of
                                                                              8
    the framework. As already mentioned, the surrounding                          Note that the appropriate ontology label translation mechanism
    semantics are modelled with flexibility, where indirectly related             presented in this paper does not attempt to disambiguate word
    concepts in the ontology would be collected as long as the                    senses, as the appropriateness of a translation is highly restricted
    surrounding well illustrates the context of use for a particular              to the specific mapping scenarios, thus it is not a form of natural
    ontological resource.                                                         language processing technique.




    Once O1' is generated, various monolingual matching                   mapping scenario. Hence alternative solutions are needed. The
techniques can be applied to create matches between O1' and O2.           SOCOM framework presented in this paper can contribute
The selection of these monolingual matchers depends on the                towards this need. Its contribution can be demonstrated through
feedback generated from the mapping result assessment.                    two use cases as shown in Figures 3 & 4.
Assessment feedback can be implicit (i.e. pseudo feedback) or                 User generated content such as forums often contains
explicit. Pseudo feedback is obtained automatically, where the            discussions on how to solve particular technical problems, and a
system assumes matches that meet certain criteria are correct. For        large amount of content of this type is written in English.
example, “correct” results may be assumed to be the ones that             Consider a scenario illustrated in Figure 3, where the user whose
have confidence levels of at least 0.5. The precision of the              preferred natural language is Portuguese is searching for help on a
matches generated can then be calculated for each matching                forum site, but the query in Portuguese is returning no satisfactory
algorithm used, which will allow the ranking of these algorithms.         results. Let us assume that the user also speaks English as a
The ranking of the MT sources can also be determined upon                 second language and would like to receive relevant documents
establishment of the usage of each MT source (i.e. as percentages)        that are written in English instead. To achieve this, domain
among the “correct” matches. Based on these rankings, the top             ontologies in Portuguese and English can be extracted based on
performing MT tools and matching algorithms can then be                   text presented in the documents using, for example, Alani et al.’s
selected for the future executions of the SOCOM framework.                approach [1]. Mappings can then be generated pre-runtime using
Explicit feedback is generated from users and is more reliable            the SOCOM framework between the Portuguese ontology and the
than pseudo feedback, which can aid the mapping process in the            English ontology, and stored as RDF triples. At run time, once a
same way as discussed above.                                              query is issued in Portuguese, it is first transformed using, for example,
    Matching assessment feedback allows insights into how the             Lopez et al.’s method [14] to associate it with a concept in the
correct mappings are generated, in particular, which translation          Portuguese domain ontology. This Portuguese concept’s
tool(s) and matching algorithm(s) are most suitable in the                corresponding English concept(s) can then be obtained by looking
specified CLOM scenario. Such feedback in turn could influence            it up in the mapping triplestore. Once the system establishes
the future selection of appropriate label translations and the            which English concepts to explore further, their associated
monolingual matching techniques to use. Finally, the feedback             documents in English can be retrieved.
should be influenced by the selection rationale employed during
the translation selection process and the monolingual matching
process. Such rationale can be captured as metadata as part of the
mapping process and include information such as the influence
sources used, translation tools used, monolingual matching
techniques used, similarity measures of semantic surroundings
and so on. The use of matching assessment feedback addresses
one of the scalability issues that arise. Consider a mapping
scenario where the concerned ontologies contain thousands of
entities. One way to rapidly generate mapping results and improve
mapping quality dynamically is to use pseudo feedback: for an
initial batch of mapping tasks (e.g. the first 100), assume that the
matches satisfying certain criteria are correct, detect how they were
generated, and keep using the same techniques for the remaining
mapping tasks.
This assessment process can also be recursive where the system is
adjusted for every few mapping tasks. Finally, explicit feedback
involves users in the mapping process, which contributes towards
addressing one of the challenges, namely user involvement in
ontology matching as identified by Shvaiko & Euzenat in [20].                   Figure 3. SOCOM Enabled Cross-Language Document
                                                                                                   Retrieval
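The pseudo feedback step described in section 3 can be sketched as follows. This is a minimal illustration rather than the framework's implementation: the `CandidateMatch` record, the `pseudo_feedback` function and the matcher and MT source names are hypothetical. Only the 0.5 confidence threshold, the per-algorithm precision and the percentage usage of each MT source among the assumed-correct matches come from the description in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMatch:
    matcher: str       # monolingual matching algorithm that produced the match
    mt_source: str     # translation source behind the translated label
    confidence: float  # confidence level reported by the matcher


def pseudo_feedback(matches, threshold=0.5):
    """Rank matchers and MT sources, assuming matches at or above threshold are correct."""
    generated = Counter(m.matcher for m in matches)
    accepted = [m for m in matches if m.confidence >= threshold]
    accepted_by_matcher = Counter(m.matcher for m in accepted)

    # Precision per matcher: assumed-correct matches over all matches it generated.
    precision = {name: accepted_by_matcher[name] / total
                 for name, total in generated.items()}

    # Usage of each MT source, as a percentage of the assumed-correct matches.
    mt_counts = Counter(m.mt_source for m in accepted)
    mt_usage = {src: 100.0 * n / len(accepted) for src, n in mt_counts.items()}

    matcher_ranking = sorted(precision, key=precision.get, reverse=True)
    mt_ranking = sorted(mt_usage, key=mt_usage.get, reverse=True)
    return precision, mt_usage, matcher_ranking, mt_ranking


# Hypothetical candidate matches from two matchers and two MT sources.
matches = [
    CandidateMatch("matcher_a", "mt_x", 0.9),
    CandidateMatch("matcher_a", "mt_y", 0.7),
    CandidateMatch("matcher_a", "mt_x", 0.3),
    CandidateMatch("matcher_a", "mt_x", 0.6),
    CandidateMatch("matcher_b", "mt_x", 0.8),
    CandidateMatch("matcher_b", "mt_y", 0.2),
    CandidateMatch("matcher_b", "mt_x", 0.4),
    CandidateMatch("matcher_b", "mt_y", 0.1),
]
precision, mt_usage, matcher_ranking, mt_ranking = pseudo_feedback(matches)
```

Because the "correct" set is only assumed, these figures are relative to the chosen criteria; with explicit user feedback the same ranking logic could be applied to the matches that users confirm.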
4. USE CASES                                                                  Personalisation can also be enhanced with the integration of
The notion of using conceptual frameworks such as thesauri and            the SOCOM framework in scenarios such as the one shown in
ontologies in search systems [6] [4] for improved information             Figure 4, where a user is bi/multi-lingual and would like to
access [19] and enhanced user experiences [22] is well researched         receive documents in a restricted knowledge domain in various
in the information retrieval (IR) and the cross-lingual IR (CLIR)         natural languages as long as they are relevant. To achieve this,
community. However, the use of ontology mapping as a technique            ontology-based user models9 containing knowledge such as user
to aid the search functions in IR has been relatively limited. The        interests and language preferences can be generated pre-runtime
most advanced work of using ontology alignment in CLIR, to the            using approaches such as [3]. Similar to the previous scenario,
best of our knowledge, is Zhang et al.’s statistical approach             domain ontologies labelled in different natural languages can be
presented in [25], which does not involve translations of ontology        obtained from sets of documents. In Figure 4, knowledge
labels. To apply statistical analyses such as latent semantic             representations in English, French, German and Spanish are
indexing, singular value decomposition, directed acyclic graphs           obtained in ontological form. Mappings of the user model and the
and maximal common subgraph on document collections, parallel             various domain ontologies can then be generated using the
corpora must be generated beforehand. However, this often is an
expensive requirement and may not always be satisfied. Also, by           9
applying statistical techniques only, such an approach ignores the            User modelling is a well researched area particularly in adaptive
existing semantic knowledge within the given ontologies in a                  hypermedia and personalised search systems, however, this is
                                                                              outside the scope of this paper.




SOCOM framework. At run time, a user query is transformed to                translations are stored in a translation repository, whereas the
be associated with a concept or concepts in the user model. By              synonyms are stored in a lexicon repository. Both repositories are
looking up in the mapping triplestore, the matched concepts in              stored in the eXist16 1.0rc database.
various knowledge repositories (the German and the Spanish                       The appropriate translation selection process invokes the
knowledge repositories in the case of Figure 4) can be obtained,            repositories in the database via the XML:DB 17 1.0 API, to
which will then lead to the retrieval of relevant documents in              compare each candidate translation of a given source label to what
different natural languages.                                                is stored in the lexicon repository. An overview of this appropriate
                                                                            translation selection process can be seen in Figure 5. If a one-to-
                                                                            one match (note that the match found in the lexicon repository can
                                                                            be either a target label used in O2, or a synonym of a target label
                                                                            that is used in O2) is found, the (matched target label or the
                                                                            matched synonym’s corresponding) target label is selected as the
                                                                            appropriate translation. If one-to-many matches (i.e. when several
                                                                            target labels and/or synonyms in the lexicon repository are
                                                                            matched) are found, the surrounding semantics (see section 3) of
                                                                            the matched target labels are collected and compared to the
                                                                            surrounding semantics of the source label in question. Using a
                                                                            space/case-insensitive edit distance string comparison algorithm
                                                                            based on Nerbonne et al.’s method [15], the target label with
                                                                            surrounding semantics that are most similar to those of the source
                                                                            resource is chosen as the most appropriate translation. If no match
                                                                            is found in the lexicon repository, for each candidate translation, a
                                                                            set of interpretative keywords is generated to illustrate the
                                                                            meaning of this candidate. This is achieved by querying
                                                                            Wikipedia 18 via the Yahoo Term Extraction Tool 19 . Using the
     Figure 4. Personalised Querying of Multilingual Knowledge              same customised string comparison algorithm, the candidate with
                     Repositories with SOCOM                                keywords that are most similar to the source label’s surrounding
                                                                            semantics is deemed as the most appropriate translation.
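The three cases of the translation selection process described above (one-to-one match, one-to-many matches, no match) can be sketched roughly as follows. This is a minimal illustration, not the SOCOM implementation: it uses plain Levenshtein distance as a stand-in for the customised algorithm based on [15], it assumes the surrounding semantics of the source label have already been rendered in the target language, and all function and parameter names (`select_translation`, `lexicon`, `source_context`, `target_context`, `keywords`) are hypothetical.

```python
def normalise(label):
    """Lower-case a label and drop spaces and underscores (space/case-insensitive)."""
    return "".join(ch for ch in label.lower() if ch not in " _")


def edit_distance(a, b):
    """Plain Levenshtein distance; a stand-in for the paper's customised variant."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1] derived from the normalised edit distance."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def set_similarity(xs, ys):
    """Average, over xs, of the best similarity each item reaches against ys."""
    if not xs or not ys:
        return 0.0
    return sum(max(similarity(x, y) for y in ys) for x in xs) / len(xs)


def select_translation(candidates, lexicon, source_context, target_context, keywords):
    """Pick one translation for a source label.

    candidates     -- candidate translations returned by the MT tools
    lexicon        -- maps each target label or synonym to its target label in O2
    source_context -- surrounding semantics of the source label (assumed translated)
    target_context -- maps a target label to its surrounding semantics
    keywords       -- maps a candidate to its interpretative keywords
    """
    norm_lexicon = {normalise(entry): target for entry, target in lexicon.items()}
    matched = []
    for candidate in candidates:
        target = norm_lexicon.get(normalise(candidate))
        if target is not None and target not in matched:
            matched.append(target)
    if len(matched) == 1:   # one-to-one match: use the matched target label
        return matched[0]
    if matched:             # one-to-many: compare surrounding semantics
        return max(matched,
                   key=lambda t: set_similarity(source_context, target_context.get(t, [])))
    # no match: compare interpretative keywords with the source semantics
    return max(candidates,
               key=lambda c: set_similarity(source_context, keywords.get(c, [])))
```

For instance, with a lexicon containing `Associate_Professor`, the candidate `associate professor` resolves through the one-to-one branch because the comparison ignores case, spaces and underscores.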
5. IMPLEMENTATION
To examine the soundness of the appropriate ontology label
translation selection process proposed in the SOCOM framework,
an initial implementation of the proposal has been completed that
uses just the semantics within the given ontologies in a CLOM
scenario. This light-weight translation selection process (i.e. one
that includes semantics in O1 and semantics in O2, but excludes
the six other influence sources as shown in Figure 1) is the focus
of the implementation and the evaluation presented in this paper.
    This initial SOCOM implementation integrates the Jena 2.5.5
Framework10 to parse the formally defined input ontologies. To
collect candidate translations for ontology labels in O1, the
GoogleTranslate11 0.5 API and the WindowsLive12 translator are
used 13 . Synonyms of ontology labels in O2 are generated by
querying WordNet14 2.0 via the RiTa15 API. Ontology labels are
often concatenated to create well-formed URIs (as white spaces
                                                                                 Figure 5. Overview of the Appropriate Ontology Label
are not allowed), e.g. a concept associate professor can be
                                                                                             Translation Selection Process
labelled as AssociateProfessor in the ontology. As the integrated
MT tools cannot process such concatenated labels, they are split                Once appropriate translations are identified for each label in
into sequences of their constituent words before being passed to            O1, given the original source ontology structure, O1' is generated
the MT tools. This is achieved by recognising concatenation                 using the Jena Framework. Finally, O1' is matched to O2 to
patterns. In the previous example, white spaces are inserted before         generate candidate matches via the Alignment API20 version 3.6.
each capital letter found other than the first one. The candidate
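The label-splitting step described above can be illustrated with a short sketch. The capital-letter rule follows the text; treating underscores as a second concatenation pattern is an assumption made here for illustration, and the function name `split_label` is hypothetical.

```python
import re


def split_label(label):
    """Split a concatenated ontology label into its constituent words.

    A space is inserted before every capital letter except the first
    character of the label. Underscores are also treated as separators
    (an assumption; the paper only mentions "concatenation patterns").
    """
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", label.replace("_", " "))
    return " ".join(spaced.split())  # collapse any repeated spaces


print(split_label("AssociateProfessor"))
```

Note that this literal rule would also split acronyms letter by letter, so a real implementation would likely need additional patterns.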
                                                                            6. EVALUATION
10                                                                          To evaluate the effectiveness of the integrated appropriate
   http://jena.sourceforge.net
11                                                                          translation selection process, this initial implementation of the
   http://code.google.com/p/google-api-translate-java
12                                                                          SOCOM framework is engaged in a CLOM experiment that
   http://www.windowslivetranslator.com/Default.aspx
13
   One could use a dictionary/thesaurus here, however, as the
   appropriate ontology label translation selection process in the          16
                                                                               http://exist.sourceforge.net
   SOCOM framework is not a word sense disambiguation                       17
                                                                               http://xmldb-org.sourceforge.net/index.html
   mechanism (see section 3), off-the-shelf MT tools are efficient to       18
                                                                               http://www.wikipedia.org
   collect candidate translations.                                          19
                                                                               http://developer.yahoo.com/search/content/V1/
14
   http://wordnet.princeton.edu                                                termExtraction.html
15                                                                          20
   http://www.rednoise.org/rita                                                http://alignapi.gforge.inria.fr




involves ontologies labelled in Chinese and English describing the research community
domain, against a baseline system – the generic approach, where labels are translated
in isolation using just the GoogleTranslate 0.5 API and matches are generated using the
Alignment API version 3.6 21 (see [9] for more technical details of the implementation
of the generic approach).

6.1 Experimental Setup
Figure 6 gives an overview of the experiment. A Chinese ontology, CSWRC22, is created
manually by a group of domain experts (excluding the authors of this paper) based on
the English SWRC23 ontology. It contains 54 classes, 44 object properties and 30 data
type properties. This Chinese ontology is matched to the English ISWC24 ontology
(containing 33 classes, 18 object properties, 17 data type properties and 50 instances)
using the generic approach and the SOCOM approach, generating results M-G and M-S
respectively.

     Figure 6. Cross-Lingual Ontology Mapping Experiments

    As the CSWRC ontology is formally and semantically equivalent (with the same
structured concepts but labelled in Chinese) to the SWRC ontology, a reliable gold
standard (referred to as Std. in Figure 6) can be generated as the matches found
between the SWRC ontology and the ISWC ontology using the Alignment API25. By comparing
results M-G and M-S to Std., this experimental design aims to find out which approach
can generate higher quality matching results when the ontologies concerned are labelled
in different natural languages and have different structures.

6.2 Experimental Results
Precision and recall26 scores of M-G and M-S are calculated, see Figure 7, where a
match is considered correct as long as the identified pair of corresponding resources
is included in the gold standard Std., regardless of its confidence level.

Legend (Figure 7 & Table 1):
1   NameAndPropertyAlignment    5   SMOANameAlignment
2   StructSubsDistAlignment     6   SubsDistNameAlignment
3   ClassStructAlignment        7   EditDistNameAlignment
4   NameEqAlignment             8   StringDistAlignment

[Figure 7 bar charts: x-axis, matching algorithms 1 to 8; y-axis, score from 0.00 to 1.00;
series, Generic Approach and SOCOM Approach]
(a) Precision: Generic Avg. = 0.5914, SOCOM Avg. = 0.6100
(b) Recall: Generic Avg. = 0.4561, SOCOM Avg. = 0.5067

     Figure 7. Overview of Precision and Recall when Disregarding Confidence Levels

    Figure 7a shows that, except for the NameEqAlignment and StringDistAlignment
algorithms, all matching methods achieve equal or higher precision when using the SOCOM
approach. These two algorithms employ strict string comparison techniques, where no
dissimilarity between two labels is overlooked. Though this is a desirable
characteristic at times, in this particular experimental setting some matches are
missing from Std. For example, when using the StringDistAlignment algorithm, the gold
standard was unable to establish a match between the class AssociateProfessor (in SWRC)
and the class Associate_Professor (in ISWC) because these labels are not identical,
although this would have been a sound match if a human were involved or if
preprocessing were undertaken. When the SOCOM approach is used to match CSWRC to ISWC,
the most appropriate translation for the class 副教授 (associate professor) in the
source ontology was determined to be Associate_Professor, since this exact English
label was used in the target ontology. Consequently, a match with a confidence level of
1.00 between the two was generated in M-S. However, as this correspondence was not
included in Std., the result is deemed incorrect. Similar circumstances led to the
lower precision scores of the SOCOM approach in cases that involve the NameEqAlignment
and the StringDistAlignment algorithms. Nevertheless, on average, with a precision
score of 0.61, the SOCOM approach generated more correct matching results than the
generic approach overall. Furthermore, at an average recall score of 0.5067 (see Figure
7b), the SOCOM approach demonstrates that its correct results are always more complete
than those generated by the generic approach.
    As precision and recall each measure one aspect of the match quality, f-measure
scores are calculated to indicate the overall

21
   The Alignment API 3.6 contains eight matching algorithms, namely
   NameAndPropertyAlignment, StructSubsDistAlignment, ClassStructAlignment,
   NameEqAlignment, SMOANameAlignment, SubsDistNameAlignment, EditDistNameAlignment
   and StringDistAlignment. For each correspondence found, a matching relationship is
   given and is accompanied by a confidence measure that ranges between 0 (not
   confident) and 1 (confident).
22
   http://www.scss.tcd.ie/~bofu/SOCOMExperimentJuly2009/Ontologies/CSWRC.owl
23
   http://ontoware.org/frs/download.php/298/swrc_v0.3.owl
24
   http://annotation.semanticweb.org/iswc/iswc.owl
25
   Based on the assumption that the CSWRC ontology is equivalent to the SWRC ontology,
   this experimental design aims to validate whether matches generated using the exact
   same matching algorithms would result in the same or highly similar corresponding
   concepts.
26
   Given a gold standard with R matching results and an evaluation set containing X
   results, if N of them are correct based on the gold standard, then for this
   evaluation set precision = N/X, recall = N/R and f-measure =
   2/(1/precision + 1/recall).
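The definitions in footnote 26 can be illustrated with a small sketch. Matches are represented here as (source, target) pairs, which is an assumption for illustration; the function name `evaluate` is hypothetical and not part of the Alignment API.

```python
def evaluate(found, gold):
    """Return (precision, recall, f_measure) for a set of matches.

    found -- iterable of (source, target) pairs in the evaluation set (X results)
    gold  -- iterable of (source, target) pairs in the gold standard (R results)
    """
    found, gold = set(found), set(gold)
    correct = len(found & gold)                       # N
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 / (1 / precision + 1 / recall)      # harmonic mean
    return precision, recall, f_measure


# Toy example: 4 results found, 3 in the gold standard, 2 of them correct.
found = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
gold = [("a", "x"), ("b", "y"), ("e", "v")]
print(evaluate(found, gold))
```

The pairs in the example are made up for illustration and are not taken from the experiment.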




quality27. Table 1 shows that the SOCOM approach generated results of at least equal
quality compared to the generic approach. In fact, the majority of algorithms were able
to generate higher quality matches when using the SOCOM approach, leading to an average
f-measure score of 0.5460. The differences in the two approaches’ f-measure scores
(when they exist) range from 1.9 percentage points (when using the
NameAndPropertyAlignment algorithm) to 11.4 percentage points (when using the
EditDistNameAlignment algorithm). Additionally, when using the SOCOM approach, bigger
differences in f-measure can be seen in lexicon-based algorithms. Such a finding
indicates that appropriate ontology label translation in the SOCOM framework
contributes positively to the enhanced performance of matching algorithms, particularly
those that are lexicon-based.

   Table 1. F-measure Scores when Disregarding Confidence Levels

                        Algorithm   Generic     SOCOM
                        1           .5233       .5421
                        2           .4574       .4574
                        3           .4651       .4884
                        4           .6000       .6667
                        5           .5020       .5714
                        6           .5039       .5039
                        7           .3571       .4714
                        8           .6000       .6667
                        Avg.        .5011       .5460

    So far, the confidence levels of matching results have not been taken into account.
To include this aspect in the evaluation, confidence means of the correct matches and
their standard deviations are calculated. The mean is the average confidence of the
correct matches found in a set of matching results, where the higher it is, the better
the results. The standard deviation is a measure of dispersion, where the greater it
is, the greater the spread in the confidence levels. Higher quality matching results
therefore are those with higher means and lower standard deviations. On average, when
using the SOCOM framework, the confidence mean is 0.7105, whereas a lower mean of
0.6970 is found with the generic approach. The standard deviation when using the SOCOM
framework is 0.2134, which is lower than the 0.2161 found with the generic approach.
These findings indicate that matches generated using the SOCOM approach are of higher
quality, because they are not only more confident but also less dispersed.
    Moreover, average precision, recall and f-measure scores are collected at various
thresholds. These scores are calculated as the conditions a correct result must satisfy
are tightened, i.e. a matching result is only considered correct when it is included in
the gold standard and it has a confidence level of at least 0.25, 0.50, 0.75 or 1.00.
An overview of the trends is shown in Figure 8. As the requirement for a correct
matching result becomes stricter, the precision (Figure 8a) and recall (Figure 8b)
scores both decline, leading to a similar decreasing trend in the f-measure (Figure 8c)
scores. The differences in the recall scores of the two approaches are greater than the
differences in their precision scores. This finding suggests that the matches generated
using the two approaches may appear similar in their correctness, but the ones
generated by the SOCOM approach are more complete. Overall, the SOCOM approach always
has higher precision, recall and f-measure scores than the generic approach no matter
what the threshold is28. This finding further confirms that the matches generated using
the SOCOM approach are of higher quality.

[Figure 8 line charts: (a) Precision Trend, (b) Recall Trend, (c) F-Measure Trend;
x-axis, threshold from 0 to 1; y-axis, score from 0.0 to 1.0; series, Generic and SOCOM]

     Figure 8. Trend Overview in Average Precision, Recall and F-Measure

    Lastly, one can argue that the differences in the f-measure scores found between
the generic and the SOCOM approach are rather small and therefore can be ignored. To
validate the difference (if it exists) between the two approaches, paired t-tests are
carried out on the f-measure scores collected across various thresholds, and a p-value
of 0.001 is found. At a significance level of α=0.05, it can be concluded that the
difference in f-measure scores is statistically significant, meaning that the SOCOM
approach generated higher quality matches than the generic approach.
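The paired t-test mentioned above can be sketched as follows. The per-threshold f-measure scores used in the paper are not listed, so the sketch only shows how the test statistic is formed from two paired score lists; the function name `paired_t_statistic` is hypothetical. Turning the statistic into a p-value requires the t distribution with n-1 degrees of freedom (from a statistics table or library), which is left out here.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples.

    first, second -- equally long sequences of scores, where first[i] and
                     second[i] were measured under the same condition
                     (e.g. the same threshold).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    spread = statistics.stdev(diffs)          # sample standard deviation
    if spread == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A large positive t with few degrees of freedom corresponds to a small p-value, which is the situation the paragraph above reports.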
27
     Note that neither precision nor recall alone is a measurement of the overall
     quality of a set of matching results, as the former is a measure for correctness
     and the latter is a measure for completeness. One can be sacrificed for the
     optimisation of the other, for example, when operating in the medical domain,
     recall may be sacrificed in order to achieve high precision; when merging
     ontologies, the opposite may be desired.
28
     Dotted lines of the generic and the SOCOM approach shown in Figure 8 are almost
     parallel to one another; this may be in part a result of the engineering approach
     deployed in the experiment (i.e. using the same tools in the implementation for
     both approaches). Further research, however, is needed to confirm the validity of
     this speculation.
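The confidence-aware part of the evaluation in section 6.2 can be sketched as below. Two readings of the text are assumed and should be treated as such: the standard deviation is taken as the population standard deviation, and at a given threshold only the count of correct results shrinks while the sizes of the evaluation set and gold standard stay fixed (which is consistent with both precision and recall declining as the threshold rises). The names `confidence_stats` and `scores_at_threshold` are hypothetical.

```python
import statistics


def confidence_stats(matches, gold):
    """Mean and standard deviation of the confidence of correct matches.

    matches -- dict mapping (source, target) pairs to a confidence in [0, 1]
    gold    -- set of (source, target) pairs in the gold standard
    """
    confidences = [c for pair, c in matches.items() if pair in gold]
    if not confidences:
        return 0.0, 0.0
    return statistics.mean(confidences), statistics.pstdev(confidences)


def scores_at_threshold(matches, gold, threshold):
    """Precision, recall and f-measure when a correct match must also
    have a confidence of at least `threshold`."""
    correct = sum(1 for pair, c in matches.items()
                  if pair in gold and c >= threshold)
    precision = correct / len(matches) if matches else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example data, not taken from the experiment.
matches = {("a", "x"): 1.0, ("b", "y"): 0.5, ("c", "z"): 0.9}
gold = {("a", "x"), ("b", "y"), ("d", "w")}
for threshold in (0.25, 0.50, 0.75, 1.00):
    print(threshold, scores_at_threshold(matches, gold, threshold))
```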




7. CONCLUSIONS & FUTURE WORK
A semantic-oriented framework for cross-lingual ontology mapping is presented and
evaluated in this paper. Preliminary evaluation results of an early prototype
implementation illustrate the effectiveness of the integrated appropriate ontology
label translation mechanism, and indicate a promising outlook for applying CLOM
techniques in multilingual ontology-based applications. The findings also suggest that
a fully implemented SOCOM framework – i.e. one that integrates all the influence
factors (discussed in section 2) – would be even more effective in the generation of
high quality matches in CLOM scenarios.
    The implementation of such a comprehensive SOCOM framework is currently on-going.
It is planned to be evaluated using the benchmark datasets from the OAEI 2009 campaign,
engaging the proposed framework in the mapping of ontologies that are written in very
similar natural languages, namely English and French. In addition, the SOCOM framework
is to be embedded in a demonstrator cross-language document retrieval system as part of
the Centre for Next Generation Localisation, which involves several Irish academic
institutions and a consortium of multi-national industrial partners aiming to develop
novel localisation techniques for commercial applications.

8. ACKNOWLEDGMENT
This research is partially supported by Science Foundation Ireland (Grant 07/CE/11142)
as part of the Centre for Next Generation Localisation (http://www.cngl.ie) at Trinity
College Dublin.

9. REFERENCES
[1] Alani H., Kim S., Millard D. E., Weal M. J., Hall W., Lewis P. H., Shadbolt N. R..
    Automatic ontology-based knowledge extraction from Web documents. IEEE Intelligent
    Systems 18, 1, 14-21, Jan. 2003
[2] Buitelaar P., Cimiano P., Frank A., Hartung M., Racioppa S.. Ontology-based
    information extraction and integration from heterogeneous data sources.
    International Journal of Human Computer Studies, 66, 11, 759-788, Nov. 2008
[3] Cantador I., Fernández M., Vallet D., Castells P., Picault J., Ribière M.. A
    multi-purpose ontology-based approach for personalised content filtering and
    retrieval. Advances in Semantic Media Adaptation and Personalization. Studies in
    Computational Intelligence, vol. 93, 25-51, 2008
[4] Castells P., Fernández M., Vallet D.. An adaptation of the vector-Space model for
    ontology-based information retrieval. IEEE Transactions on Knowledge and Data
    Engineering 19(2), Special Issue on Knowledge and Data Engineering in the Semantic
    Web Era, 261-272, Feb. 2007
[5] Conroy C., Brennan R., O’Sullivan D., Lewis D.. User evaluation study of a tagging
    approach to semantic mapping. In Proceedings of ESWC, 623-637, 2009
[6] De Luca E. W., Eul M., Nürnberger A.. Multilingual query-reformulation using an
    RDF-OWL EuroWordNet representation. In Proceedings of the Workshop on Improving Web
    Retrieval for Non-English Queries (iNEWS07), at SIGIR 2007, ISBN 978-84-690-6978-3,
    55-61, 2007
[7] Espinoza M., Gómez-Pérez A., Mena E.. LabelTranslator – a tool to automatically
    localize an ontology. In Proceedings of ESWC, 792-796, 2008
[8] Fernandez M., Lopez V., Sabou M., Uren V., Vallet D., Motta E., Castells P..
    Semantic search meets the Web. In Proceedings of IEEE ICSC, 253-260, 2008
[9] Fu B., Brennan R., O’Sullivan D.. Cross-lingual ontology mapping – an investigation
    of the impact of machine translation. In Proceedings of ASWC, LNCS 5926, 1-15, 2009
[10] Giunchiglia F., Yatskevich M., Shvaiko P.. Semantic matching: algorithms and
    implementation. Journal on Data Semantics, vol. IX, 1-38, 2007
[11] Gruber T.. A translation approach to portable ontologies. Knowledge Acquisition
    5(2):199-220, 1993
[12] Li J., Tang J., Li Y., Luo Q.. RiMOM: A dynamic multistrategy ontology alignment
    framework. IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 8,
    1218-1232, 2009
[13] Liang A. C., Sini M.. Mapping AGROVOC and the Chinese agricultural thesaurus:
    definitions, tools, procedures. New Review of Hypermedia and Multimedia, 12:1,
    51-62, 2006
[14] Lopez V., Uren V., Motta E., Pasin M.. AquaLog: an ontology-driven question
    answering system for organizational semantic intranets. Web Semantics. 5, 2,
    72-105, Jun. 2007
[15] Nerbonne J., Heeringa W., Kleiweg P.. Edit distance and dialect proximity. Time
    Warps, String Edits and Macromolecules: The Theory and Practice of Sequence
    Comparison, 2nd ed. CSLI, Stanford, v-xv, 1999
[16] Ngai G., Carpuat M., Fung P.. Identifying concepts across languages: a first step
    towards a corpus-based approach to automatic ontology alignment. In Proceedings of
    the 19th International Conference on Computational Linguistics, vol.1, 1-7, 2002
[17] Pazienza M., Stellato A.. Linguistically motivated ontology mapping for the
    Semantic Web. In Proceedings of the 2nd Italian Semantic Web Workshop, 14-16, 2005
[18] Pazienza M. T., Stellato A.. Exploiting linguistic resources for building
    linguistically motivated ontologies in the Semantic Web. In Proceedings of OntoLex
    Workshop, 2006
[19] Shuang L., Fang L., Clement Y., Weiyi M.. An effective approach to document
    retrieval via utilizing WordNet and recognizing phrases. 27th Annual international
    ACM SIGIR Conference on Research and Development in information Retrieval, 266-272,
    ACM Press, 2004
[20] Shvaiko P., Euzenat J.. Ten challenges for ontology matching. In Proceedings of
    ODBASE, 1164-1182, 2008
[21] Suárez-Figueroa M. C., Gómez-Pérez A.. First attempt towards a standard glossary of
    ontology engineering terminology. In Proceedings of the 8th International
    Conference on Terminology and Knowledge Engineering (TKE'08), 2008
[22] Stamou S., Ntoulas A.. Search personalization through query and page topical
    analysis. User Modeling and User-Adapted Interaction 19, 1-2, 5-33, Feb. 2009
[23] Trojahn C., Quaresma P., Vieira R.. A framework for multilingual ontology mapping.
    In Proceedings of LREC, 1034-1037, 2008
[24] Wang S., Englebienne G., Schlobach S.. Learning concept mappings from instance
    similarity. In Proceedings of ISWC, 339-355, 2008
[25] Zhang L., Wu G., Xu Y., Li W., Zhong Y.. Multilingual collection retrieving via
    ontology alignment. In Proceedings of ICADL 2004, LNCS 3334, 510-514,
    Springer-Verlag, 2004
[26] Zhang X., Zhong Q., Li J., Tang J., Xie G., Li H.. RiMOM results for OAEI 2008. In
    Proceedings of the OM Workshop, 182-189, 2008



