Towards Browsing Distant Metadata Using Semantic Signatures

Andrew Choi (aschoi@sfu.ca) and Marek Hatala (mhatala@sfu.ca)
School of Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada

ABSTRACT
In this paper we describe a lightweight ontology mediation method that allows users to send semantic queries to distant data repositories to browse learning object metadata. In a collaborative E-learning community, member data repositories may use different ontologies to control the vocabularies describing topics in learning resources, which hinders searching for learning resources based on local ontological concepts. Using WordNet, we develop a toolkit that indexes ontological concepts with WordNet senses for semantic browsing, in order to integrate information across a distributed learning community. The effectiveness of the toolkit was validated with real-world data in a specific domain, namely E-learning metadata.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information integration, retrieval models, search process

General Terms
Algorithms, Management, Experimentation, Verification

Keywords
Semantic Retrieval, Data Integration, Ontology Mediation

INTRODUCTION
With the advance of the Internet and the rapid development of E-learning, more and more institutions are joining distributed learning networks that allow users to access resources from different learning repositories. This creates pressure on institutions to provide an efficient way to organize the huge volume of materials located in different repositories, according to a consistent concept classification, in order to answer distributed retrieval requests. Currently, the use of metadata and ontologies to formalize the semantics of concepts in the E-learning domain does not completely resolve the problem of interoperability in a federated environment. This is because metadata in different repositories are very often annotated with concepts defined by ontologies specific to their organizations or communities, which makes finding information based on a local conceptual framework difficult. Organizations with different backgrounds and target audiences may use different terms with similar semantics to define and describe two similar learning resources. In addition to ontological differences, linguistic variation in metadata values and the inconsistent use of metadata standards across a learning network make direct keyword querying ineffective for discovering conceptually similar metadata.

PROBLEM DESCRIPTION
The primary objective of this research is to explore the use of semantic signatures expressed in WordNet senses to provide mediation between different ontologies and thereby enhance concept retrieval. Consider a scenario in which learner L1, associated with repository R1, is looking for learning resources on the topic of how to find a good bass musical instrument. L1 sends the request "search for bass" to remote repositories R2 and R3 in an E-learning network. However, the returned results are mixed with many irrelevant resources related to catching a bass (the fish). Such a problem occurs frequently when concepts are defined by different domain ontologies whose vocabularies carry different intended meanings. Imagine another case in which the same learner L1 sends a distributed request for learning resources on the topic "advanced databases". In the remote repositories, the topic is annotated with the concept "database systems II"; that is to say, it is labelled differently.
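The two failure scenarios above can be sketched in a few lines. The following is an illustrative Python sketch (not the paper's code); the concept labels and sense identifiers are hypothetical, chosen only to show why plain label matching fails while matching on shared sense identifiers can succeed.

```python
# Each repository annotates topics with its own labels, but labels can be
# grounded in shared word-sense identifiers (e.g. WordNet synset IDs).
local_topics = {
    "advanced databases": {"database.n.01", "advanced.a.01"},
}
remote_topics = {
    "database systems II": {"database.n.01", "system.n.01"},
    "bass fishing":        {"bass.n.07", "fishing.n.01"},   # the fish sense
    "bass guitar basics":  {"bass.n.02", "guitar.n.01"},    # the music sense
}

def label_match(query, topics):
    # Plain label matching: only exact string equality counts.
    return [t for t in topics if t == query]

def sense_match(query_senses, topics):
    # Rank remote topics by the number of shared sense identifiers.
    scored = [(len(query_senses & s), t) for t, s in topics.items()]
    return [t for score, t in sorted(scored, reverse=True) if score > 0]

print(label_match("advanced databases", remote_topics))   # no hits at all
print(sense_match(local_topics["advanced databases"], remote_topics))
# 'database systems II' is found because it shares the database sense
```

Label matching returns nothing for "advanced databases", while sense-level matching recovers the semantically equivalent concept despite the different label.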
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission by the copyright owners. Copyright 2005.

Therefore, in a concept-based label-matching search, learning resources defined by the concept "database systems II" will not be returned for the request "advanced databases", even though the two concepts are semantically equivalent.

From these simple scenarios, one can easily see that without a proper semantic mapping between the ontologies of heterogeneous data sources, it is still challenging to find learning resources based on a local conceptual definition, even when an ontology defines the vocabulary used to describe the learning resource metadata.

OVERVIEW OF ONTOLOGY MAPPING
Semantic or ontology mapping can be described as a mapping task that identifies common concepts and establishes semantic relationships between heterogeneous data models in the same domain of discourse [1]. Since semantics is mostly defined by ontological constructs in modern knowledge systems, we use the term semantic mapping interchangeably with ontology mapping in this discussion. According to [10], an ontology mapping between two ontologies O1 = (C1, A1) and O2 = (C2, A2) can be expressed as a function f: C1 → C2 that semantically relates concepts of C1 to concepts of C2 such that A2 |= f(A1), i.e. all interpretations that satisfy the axioms of O2 also satisfy the translated axioms of O1. For example, if the concept agent (C1) is defined in O1 by a set of properties and axioms (ignoring other attributes and cardinality for the sake of simplicity), it may be possible to map it to a concept representative (C2) defined in O2 with its own set of properties and axioms. This mapping assumes that all semantic interpretations of C1 will be respected by C2 in the domain of discourse when logical inference operations are executed on C2.

REVIEW OF OTHER APPROACHES
This section presents a brief overview of two approaches to semantic mapping: GLUE and MAFRA. The former is a system that employs machine-learning techniques, using multiple probabilistic learners, to find ontology mappings; the latter uses a declarative representation of mappings as instances of a mapping ontology that defines bridging axioms encoding transformation rules.

Given two domain ontologies, GLUE claims to find, for each concept in one ontology, the most similar concept in the other [7]. A number of features distinguish GLUE from other mapping systems. First, unlike many mapping systems that incorporate only a single similarity function to determine whether two concepts are semantically related, GLUE utilizes multiple similarity functions to measure the closeness of two concepts based on the purpose of the mapping. The intuition behind multiple similarity functions is to use the mapping requirement to relax or limit the choice of corresponding concepts. For instance, depending on the requirements of the application, the task of mapping the concept "associate professor" can be satisfied by the similarity criteria "exact", "most-specific-parent" or "most-general-child" to find "senior lecturer", "academic staff" or "John Cunningham" respectively. This gives GLUE flexibility in finding semantic mappings between ontologies. Second, GLUE applies a multi-strategy learning approach to combine information discovered by different classifiers during training. The classification process is divided into two phases. First, a set of base classifiers is trained to classify instances of concepts on different attributes with different algorithms. Then the predictions of these base classifiers, weighted according to their importance to overall accuracy, are combined by a meta-learner, and the final classification is determined by the meta-learner's result. For instance, one base learner can exploit the frequency of words in the name property using a Naïve Bayes learning technique, while another uses pattern matching on another property with decision tree induction; the meta-learner then gathers all the results to form the final prediction. By using multiple classifiers, GLUE intends to increase the accuracy of the overall prediction. Third, GLUE incorporates relaxation-labelling techniques into the matching process to exploit neighbouring nodes. Generally, relaxation labelling iteratively makes use of neighbouring features, domain constraints and heuristic knowledge to assign labels to nodes.

MAFRA (MApping FRAmework) is another ontology mapping methodology, one that prescribes "all phases of the ontology mapping process, including analysis, specification, representation, execution and evolution" [14]. It takes a declarative approach to ontology mapping by creating a Semantic Bridging Ontology (SBO) that contains all concept mappings and the associated transformation rule information. In this model, given a source and a target ontology, domain experts examine and analyze the class definitions, properties, relations and attributes to determine the corresponding mapping and transformation method; the accumulated information is then encoded as concept instances in the SBO. The SBO therefore serves as an upper ontology governing the mapping and transformation between the two ontologies. Each concept in the SBO has five dimensions: Entity, Cardinality, Structural, Constraint and Transformation. During ontology mapping, a software agent inspects the values from the two given ontologies along these dimensions and executes the transformation process when the constraints are satisfied.

Some recent approaches, such as the INRIA alignment API (http://co4.inrialpes.fr/align/index.html), build a set of alignment APIs on top of the OWL API, with built-in WordNet functions, for ontology alignment and the generation and transformation of axioms. However, the details of how WordNet is used to generate the alignments are not well documented in the published literature.

WORDNET
WordNet is a widely recognized online lexical reference system, developed at Princeton University, whose design is inspired by "current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synsets (synonym sets), each representing one underlying lexical concept" [2]. Synsets are interlinked via relationships such as synonymy and antonymy, hypernymy and hyponymy (Subclass-Of and Superclass-Of), and meronymy and holonymy (Part-Of and Has-A) [3]. Each synset has a unique identifier (ID) and a specific definition. A synset may consist of a single element, or it may have many elements all describing the same concept; each element in a synset is synonymous with every other element in that synset. For example, the synset {World Wide Web, WWW, Web} represents the concept of a computer network consisting of a collection of internet sites; in this context, 'World Wide Web', 'WWW' and 'Web' are all semantically equivalent. Where a single word has multiple meanings (polysemy), multiple separate and potentially unrelated synsets contain the same word. For instance, the word 'Web' has several meanings defined in WordNet, including computer network, entanglement, and simply spider web.

Figure 1. Semantic Signature Generation Framework
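The synset structure described above can be modelled with a small data type. The following is our own toy model, not WordNet's actual API; the sense identifiers, lemma sets and hypernym links are illustrative stand-ins for the real database entries.

```python
# A toy model of WordNet-style synsets, showing how polysemy yields
# several unrelated synsets that all contain the same word.
from dataclasses import dataclass

@dataclass(frozen=True)
class Synset:
    sid: str            # unique synset identifier
    lemmas: frozenset   # synonymous terms describing one concept
    gloss: str          # definition
    hypernym: str = ""  # ID of the immediate parent (is-a) synset, if any

SYNSETS = {
    "web.n.01": Synset("web.n.01", frozenset({"World Wide Web", "WWW", "Web"}),
                       "a collection of internet sites", "computer_network.n.01"),
    "web.n.02": Synset("web.n.02", frozenset({"web", "entanglement"}),
                       "an intricate trap", ""),
    "web.n.03": Synset("web.n.03", frozenset({"web", "spider web"}),
                       "a structure spun by spiders", ""),
}

def senses_of(word):
    """All synsets whose lemma set contains the word (case-insensitive)."""
    w = word.lower()
    return [s for s in SYNSETS.values()
            if any(w == lemma.lower() for lemma in s.lemmas)]

print([s.sid for s in senses_of("web")])   # three unrelated senses of 'web'
```

Looking up "web" returns three distinct synsets, which is exactly the ambiguity the sense selection strategy later in the paper has to resolve.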
represents the concept of computer network consisting of a The generation of a semantic signature for a class of collection of internet sites. In this context, 'World Wide metadata is divided into three distinct phases. In the rest of Web', ‘WWW’ and 'Web' are all semantically equivalent. this section, the general architecture of the methodology is For cases where a single word has multiple meanings described while each phase is discussed in detail and as (polysemy), multiple separate and potentially unrelated well as illustrated with examples. synsets will contain the same word. For instance, the word ‘Web’ can have 7 multiple meanings defined in WordNet System Design and Architecture as computer network, entanglement, simply spider web and The methodology for creating semantic signature relies etc. heavily on the assumptions that the aggregates of all semantic information from metadata records of a particular OUR APPROACH class are a good representation of the concept for that class. To help distributed learning repositories to organize and In fact, the metadata record is an instance of a concept in manage their metadata in compliance with a global the ontological framework. Moreover, the methodology semantic view, we create a semantic mapping strategy assumes that semantic information of a class can be using WordNet as a mediator to provide word sense approximated by a set of important word senses from all disambiguation and to generate semantic signature each metadata records. Besides, semantic word senses specific to representing learning resource category. the context can be found based on important terms Semantic signature in the categorical browsing context can extracted from metadata through WordNet. Finally yet be defined as a logical grouping of representational word importantly, it assumes that the local semantic signature for senses for a class of metadata. 
In essence, it is a semantic a class of metadata is similar to signatures for metadata of representation of a class label with important WordNet semantically equivalent concepts in distant repositories. senses regarding context. To formalize the concept of The methodology uses k-Nearest Neighbour (kNN) search semantic signature, it can be written as follows: algorithm to classify semantically relevant concepts in distant repositories based on local semantic signatures [11]. The instances (metadata) of concepts in local repository serve as the training dataset. Based on semantic features of the local metadata, semantic signatures for each class of where Sig (c ) = semantic signature for class c concepts are formed. To find semantically relevant DSj = set of document senses for class c concepts in distant repositories, a distance function is BSdi = set of best sense in document dj defined and used to measure closeness between the query signature and semantic signatures for concepts in distant T = all keywords in document dj repositories. Eventually, k most similar concepts to the Fav = selection function to find best sense query signature will be retrieved from remote repositories. WS(t) = set of WordNet sense for term ti Figure 1 shows the four phases of the semantic signature generation framework. In the Word Extraction To briefly explain, semantic signature of a class of phase representative features are extracted from each metadata is built from a set of important document senses metadata document. The Document Preprocessing phase from all documents (metadata records) belonging to a eliminates all irrelevant information as well as all non-noun particular class. In turn, document senses are generated words. In the Document Vector Sensitization phase all the 12 representative keywords are used as seeds to find the multiple word senses retrieved. For example, the word corresponding word senses from WordNet. 
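The signature definition can be sketched directly in code. This is a minimal illustration under stated assumptions: `best_sense` is a hypothetical stand-in for the paper's Fav selection function (a real implementation would use the S3 strategy described later), and the sense inventory is toy data.

```python
# Sig(c): the signature of a class is aggregated from the best sense
# chosen for each keyword of each metadata document in the class.

def best_sense(term, wordnet_senses, context):
    # Hypothetical stand-in for Fav: pick the candidate sense whose
    # synonym/gloss terms overlap most with the document's other keywords.
    candidates = wordnet_senses.get(term, [])
    if not candidates:
        return None
    return max(candidates, key=lambda s: len(s[1] & context))

def class_signature(documents, wordnet_senses):
    """Sig(c) = union over documents d of { Fav(WS(t)) for each keyword t in d }."""
    signature = set()
    for keywords in documents:
        context = set(keywords)
        for term in keywords:
            s = best_sense(term, wordnet_senses, context - {term})
            if s is not None:
                signature.add(s[0])     # keep the chosen sense ID
    return signature

# Toy sense inventory: each sense is an ID paired with its related terms.
WS = {
    "bass":   [("bass.n.02", {"music", "instrument", "guitar"}),
               ("bass.n.07", {"fish", "catch"})],
    "guitar": [("guitar.n.01", {"music", "instrument"})],
}
docs = [["bass", "guitar"], ["guitar"]]
print(class_signature(docs, WS))   # the music sense of 'bass' wins
```

Because "guitar" appears in the same document, the music sense of "bass" is selected over the fish sense, and the class signature becomes a set of disambiguated sense IDs rather than raw keywords.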
Finally, in the Sense Selection phase, the best word sense is selected among all retrieved senses to represent each word term.

Signature Generation in Action

Phase I: Word Extraction
First, the input metadata are transformed to comply with the IEEE LOM standard (http://ieeeltsc.org/wg12LOM/lomDescription) using an XML transformer. Then the content of the <Title> and <Description> elements is extracted to represent the whole metadata document. This presumes that the content of these two elements carries enough weight, as cue phrases, to represent the whole document [4]. This view seems reasonable for learning object metadata, because other elements such as publication date, ISBN or format do not carry good semantic information for identifying the category of the metadata.

Phase II: Document Preprocessing
The condensed metadata, containing only the <Title> and <Description> elements, are cleaned to remove all stopwords, punctuation, numerical values and irregular symbols. Next, all non-noun words are removed using a part-of-speech tagger, except for some commonly used phrasal words which carry specific meaning. For example, the word "artificial" in the phrase "artificial intelligence" is preserved to retain the special meaning of the binary phrase in computer science. The reason this approach uses only nouns as base keywords is explained in [5], which argues that long phrases are not easily disambiguated compared to single or binary word terms; previous experiments demonstrated in [6] show that the accuracy of using a phrase as a distinguishing feature for document classification is in effect lower. On the other hand, it has been shown that noun terms carry the most salient expression to serve as distinguishing features for text classification [7].

Phase III: Document Vector Sensitization
Assuming all irrelevant information has been eliminated, the physical metadata documents are projected onto the vector space model; the document vector becomes a logical representation of the physical metadata record. Then, using the TFIDF weighting scheme, we select the most significant terms across all document vectors to represent a category of metadata [12]. After that, each word term with a TFIDF score above a threshold is sent to WordNet to retrieve its corresponding word senses and their definitions. The threshold is determined by a trial-and-error approach with a test run. A single word term can have multiple retrieved word senses; for example, the word "search" maps to several WordNet senses. The mapping information of a single noun term can therefore be denoted by a triple construct <T, S, D>, where T is the original word term, S is a synset of T and D is the definition of that synset; when a noun term maps to multiple senses, there are multiple triples. Taking the word term "search" as an example, after sensitization it is replaced by one such triple per retrieved sense, each carrying the term's TFIDF score (0.623101 in this example). The triple construct substitutes for the original word term in the master document vector. Recall, however, that a single word term can map to several different word senses through WordNet, and each word sense is represented by a synset that may contain multiple synonymous terms. Because of this, the length of the document vector in the word-sense representation grows considerably. This problem is addressed in the next phase.

Phase IV: Sense Selection Strategy (S3)
This is the last, and the most crucial, phase of the method. It chooses, among all word senses retrieved from WordNet, the best one to represent each word term. As stated, a word term can map to multiple WordNet senses; in such a case, the dimensionality of the vector grows significantly after the sensitization procedure. Imagine that the word term "light" maps to 15 WordNet noun senses ("visible light", "light source", "luminosity", "lighting", etc.); the growth ratio is 15 times in this case. Such high dimensionality not only negatively affects the efficiency of the similarity computation but, more seriously, many of the senses are noise that does not carry the actual meaning of the word in the context of the document. Including irrelevant senses distorts the semantic representation of the signature and lowers the accuracy of the similarity calculation when finding similar classes of metadata by signature matching. Furthermore, from the semantic knowledge standpoint, WordNet senses provide only the lexical information of a word term, not the contextual information needed to determine its meaning in a specific context [8]. Without that, the semantic signature is just a bigger collection of keywords and is of little use in identifying classes of metadata by the semantic relevance of the signature. Therefore, it is necessary to reduce the dimension and select only the sense that conveys the main idea of the word in the current context. To select the best sense representing a word term, a contextual Sense Selection Strategy (S3) is applied to the retrieved word senses. The strategy is based on the assumption that the local contextual information of a document serves as a good hint as to which sense best represents the actual meaning of the word term. The S3 approach can be summarized in the following algorithm:

Steps of the algorithm (calculate the best senses for class C1):

    For each metadata document D ∈ C1
      For each word term T1 ∈ D, get the list of synsets
        For each synset Syn1 of the word term T1
          For each sense term Si ∈ Syn1
            1. Compute the associative frequency af of Si against the other
               senses Sk ∈ Synk, Synk ⊆ Tk and T1 ≠ Tk
               1.1 Find the sense Sl with the highest score Max(af)
               1.2 If Max(af) < 1 then go to 2; otherwise stop and return Sl
            2. Compute the associative frequency af of Si against the k-order
               parent senses PSk ∈ P(Synk), P(Synk) ⊆ Tk and T1 ≠ Tk
               2.1 Find the sense Sp with the highest score Max(af)
               2.2 If Max(af) < 1 then go to 3; otherwise stop and return Sp
            3. Return the most popular sense Sw offered by WordNet
          Return the best sense to represent word term T1
    Aggregate the senses of all important word terms to represent the
    signature of document D

Figure 2. Computing the associative frequency between the immediate parent sense and other word senses (Strategy 2: a word term t1 in the document vector is generalized to its parent synset, which is compared against the senses of the other terms).

The algorithm works in the following way. For each word sense of a word term, it first computes the associative frequency (af) of each sense term in a synset against the sense terms in the synsets of the other word terms in the same document.

The rationale behind sequencing the three strategies is the observation and hypothesis that the local context is the most specific and relevant candidate to provide contextual meaning for a word sense. A word sense for a particular term can therefore most likely be disambiguated by the other local senses (Strategy 1). If it cannot be resolved by Strategy 1, the algorithm compares the immediate parent sense with the other word senses to check whether the parent sense is a frequently co-occurring sense for the underlying word term (Strategy 2). At last, the most popular sense is adopted to represent the semantic meaning of the word term when the two strategies above cannot resolve its ambiguity (Strategy 3). Following the above procedure, a set of senses becomes the semantic signature of a document.
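The three-stage selection can be sketched compactly. This is our reading of the algorithm, with a deliberately simplified data model: each sense is an ID plus a set of related terms, the sense lists are assumed to be ordered by WordNet popularity, and the hypernym table is a plain dictionary.

```python
# Sketch of S3: Strategy 1 scores each candidate synset against the other
# terms' senses in the document; Strategy 2 retries with the immediate
# (k = 1) hypernym; Strategy 3 falls back to the most popular sense.

def assoc_freq(synset_terms, other_sense_terms):
    # How many of the candidate's terms appear among the other senses.
    return sum(1 for t in synset_terms if t in other_sense_terms)

def select_sense(term, senses, parents, document_terms):
    candidates = senses[term]                  # assumed ordered by popularity
    others = set()
    for other in document_terms:
        if other != term:
            for _, terms in senses.get(other, []):
                others |= set(terms)
    # Strategy 1: disambiguate directly against the local context.
    scored = [(assoc_freq(terms, others), sid) for sid, terms in candidates]
    best = max(scored)
    if best[0] >= 1:
        return best[1]
    # Strategy 2: generalize each candidate to its immediate parent and retry.
    scored = [(assoc_freq(parents.get(sid, set()), others), sid)
              for sid, _ in candidates]
    best = max(scored)
    if best[0] >= 1:
        return best[1]
    # Strategy 3: the most popular sense wins.
    return candidates[0][0]

SENSES = {
    "bass":   [("bass.n.07", {"fish", "catch"}),          # most popular first
               ("bass.n.02", {"music", "instrument"})],
    "guitar": [("guitar.n.01", {"music", "instrument"})],
}
print(select_sense("bass", SENSES, {}, ["bass", "guitar"]))  # bass.n.02
```

Even though the fish sense of "bass" is the most popular, Strategy 1 picks the music sense because its terms co-occur with the senses of "guitar" in the same document; only when both context strategies fail does popularity decide.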
From this, the most frequently co-occurring word sense is used as the semantic representation of the word term.

Next, if the word sense of a word term cannot be discriminated by Strategy 1, the algorithm generalizes the word term to its k-order parent senses. In this approach the value of k is 1; hence it generalizes to the immediate parent word sense. Referring to Figure 2, Strategy 2 uses the immediate parent sense to compute the associative frequency against the senses of the other word terms in the document vector. In this example, the word term t1 is rolled up to its immediate parent through the hypernym (is-a) relation in the WordNet hierarchy, and the parent's synset is then used to calculate the associative frequency against the other word senses. Unlike other generalization approaches [7, 13], we generalize a sense to its most specific parent only. The reason for using immediate parent senses (k = 1) to compute the associative frequency is given in [9]: the most specific parent in a hierarchical terminology has the highest distinctive power for classifying the topic. Intuitively, if a word sense is generalized to a parent of higher order than k = 1, the generalized sense may be too general, become incoherent with the local context, and turn into noise when used to classify metadata.

Finally, as arranged by WordNet, the word senses retrieved for a particular word form a partially ordered set ranked by popularity in English usage. If the previous two strategies cannot find the best sense to represent the word term, the most popular sense offered by WordNet is adopted in Strategy 3.

In order to generate the final semantic signature for a class of documents referring to a particular concept, the TFIDF scheme is applied again to each word sense in all document signatures for that class. Based on the scores, the most relevant senses for characterizing the class of metadata are aggregated to form the final signature of the class.

Concept browsing in heterogeneous ontologies
In our application, the generated semantic signatures are used to index the actual classes of metadata for fast distributed browsing. We developed a tool called the Signature Generation Indexer (SGI) that supports the methodology described in the previous section. Focusing on efficiency, SGI is designed to allow repository operators to produce semantic signatures for classes of learning object metadata easily, without tedious human interaction or complicated implementation.

The ultimate goal is to achieve semantic search over E-learning topics defined by heterogeneous ontologies in a federated network. In a collaborative learning environment, users expect to be able to access all the learning resources within the learning network. To fulfill this expectation, it is important to assume that all participant repositories in the collaborative network employ the same strategy to index learning resource metadata with WordNet semantic signatures.

In this way, when a user launches a query by selecting a specific topic (concept) from the local ontology (e.g. via a user interface), the corresponding semantic signature representing the topic is retrieved from the local database. The signature is then sent across the network to the participating learning repositories. The query, in the form of a semantic signature, is the input to the Similarity Calculator in each distant repository, which computes the similarity of signatures in each of the learning repositories. The similarity calculator uses the cosine similarity function, so the more matching elements two signatures share, the higher the score.

Figure 3. Integrated process of semantic-based browsing of metadata
Figure 5. Dataset distributions into training and testing data (the master dataset of 2235 records is split across the local, remote1 and remote2 repositories)
In calculating the similarity score, different weights are assigned to the senses from <Title> and <Description>: a match on a title sense contributes more to the overall score than a match on a description sense.

In order to ensure the global accuracy of the result, the results from the participating remote repositories are merged and sorted in descending order of cosine similarity score. The top k (k = 5) metadata topics are then offered as the answer to the local query. The overall operation of semantic-based browsing of learning resource metadata is shown in Figure 3.

IMPLEMENTATION
The SGI is implemented in the C# programming language. The current version is a desktop application, but it can easily be extended to a web service. The goal of SGI is to integrate signature generation, document indexing and browsing capability. The signature indexes are stored in an inverted index database (e.g. MS Access). The similarity calculator is a separate module, also implemented in C#, connected to the index database. Figure 4 shows the browsing interface of SGI, illustrating how a distant concept is searched semantically.

Figure 4. Browsing interface of SGI

EVALUATION
In order to test the hypothesis that semantic signatures enable distributed semantic browsing and improve retrieval relevance, we simulated distributed concept retrieval and compared the results with the traditional keyword-based and label-matching methods. To replicate the distributed repositories of a collaborative E-learning network, three independent databases were set up. As shown in Figure 5, they are called "local", "remote1" and "remote2", where local denotes the local data source and remote1 and remote2 simulate distant data sources. A single master set of metadata in 8 different categories was distributed into the three simulated repositories, evenly in number and at random. The metadata were transformed to conform to the IEEE LOM format.

After the distribution, the local database contains the metadata that represent the training data for the classifier. During the training phase, the kNN classifier uses the instances of the local metadata to learn the features that identify the class of the metadata. It starts by extracting keyword terms from each category of metadata and projecting them into the vector space model. Then, after running through the signature generation module, each category of metadata is represented and indexed in the database by a semantic signature.

The datasets in remote1 and remote2 are controlled to model potentially different ontological classifications in a distributed environment. To simulate the effect of varied concept labelling, the original 8 categories of metadata are expanded to 14 categories in remote1. The 6 derived categories are labelled with class names different from those of their source categories and are described with metadata taken from the source categories; each newly derived category contains metadata belonging to the same class. To illustrate, part of the metadata from the category "computing science" is distributed to the derived categories "technology" and "engineering" in remote1, so the metadata for the concept "computing science" is now grouped into "computing science", "technology" and "engineering". Essentially, this simulates the situation where a concept such as "computing science" could be categorized under concepts like "technology" and "engineering" in a different ontology. The same distribution principle is applied to the remote2 database, which includes 13 categories, of which 7 are derived.

Table 1. Source and Category of Metadata
Category          | Source                                                           | No. of records
Accounting        | Business Source Premier Publications                             | 382
Biology           | Biological and Agricultural Index, BioMed Central Online Journals | 315
Computing Citeseer 320 Science Similar to the local database, each category of the metadata American Economic Association’s in remote1 and remote2 is mapped to a semantic signature Economics 353 electronic database in WordNet senses and stored in the local database as an Educational Resource Information Education 307 index. To test semantic-based search, semantic signature Center representing a local concept is sent to query the remote Geography Geobase 237 Mathematics arXiv.org, MathSciNet 157 repositories. The semantic similarity is compared between Psychology PsycINFO, ERIC 164 the query signature and the distant signature based on the similarity function. Finally, the result of the k most similar based retrieval and label-matching retrieval. concept signatures from the remote databases are studied based on the relevance metric. As oppose to the classic or traditional keywords-based representation, semantic-based indexing with WordNet Dataset senses can include more lexicon information than simple Since there is no publicly available dataset of learning syntactic approach. This implies that more features will be resources metadata, the experiment metadata were acquired added to the class signature representation. Since more through a number of different sources. Table 1 shows the features are added, that may also mean that more noise is category of metadata acquired and their respective sources. included as well. In total, 2235 metadata subdivided into the 8 different Intuitively, the increased relevance of retrieval can be categories are acquired. The dataset is partitioned into attributed to the expansion of features in class training and testing groups. As mentioned, the local representation. However, different from what we expected, database stores the training dataset while remote1 and the precision does not decreased. It is suspected that due to remote2 store the testing dataset. 
All metadata are known with their class labels. Metadata are distributed randomly, using the Microsoft Excel random number generator, into the training and testing groups. After distribution, the local database contains 667 training records, while remote1 and remote2 together contain 1568 testing records.

Results
In order to gauge the effectiveness of the proposed mediation method between different E-learning ontologies, three standard metrics for information retrieval are used in the evaluation of the system performance: Recall, Precision and F-measure. Table 2 shows that the use of semantic signatures consistently improves retrieval relevance in terms of recall and precision. In all categories, signature-based retrieval matches or outperforms both keyword-based retrieval and label-matching retrieval.

As opposed to the classic keyword-based representation, semantic-based indexing with WordNet senses can include more lexical information than a purely syntactic approach. This implies that more features are added to the class signature representation. Since more features are added, more noise may be included as well. Intuitively, the increased relevance of retrieval can be attributed to this expansion of features in the class representation. However, contrary to what we expected, precision did not decrease. It is suspected that, owing to the relatively small size of the dataset and 1-k hypernym generalization, the senses included in the signature are 'good' in terms of classification. Therefore, combined with a good contextual sense selection strategy, WordNet as a mediator can provide a source for ambiguity resolution and semantic information for the process of semantic browsing. Coupled with that, the selection of the kNN algorithm as the classifier also contributes to the performance of the system.

kNN is an instance-based classifier. The performance of instance-based classifiers depends more on the sufficiency of the training set than that of other machine learning classification algorithms. Thus, it is a disadvantage for kNN to have a small dataset for training and testing. A smaller training set implies that more terms or term combinations important for content identification may be missing from the training sample documents. This negatively affects the performance of a classifier. Nevertheless, the ontology-guided (e.g. WordNet) approach seems to somewhat reduce the negative influence of this problem.
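A minimal sketch of the kNN step discussed above: a query signature is classified by majority vote among its k most similar training signatures. Cosine similarity over sparse sense-weight vectors is again assumed as the similarity measure, and the training data and sense identifiers are illustrative.

```python
import math
from collections import Counter

def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse sense-weight signatures."""
    dot = sum(w * sig_b.get(s, 0.0) for s, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query_sig, training, k=3):
    """Classify a query signature by majority vote among the k training
    signatures most similar to it. `training` is a list of (label, signature)."""
    neighbours = sorted(training,
                        key=lambda item: cosine(query_sig, item[1]),
                        reverse=True)[:k]
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]
```

Because the vote is taken over individual training instances, a sparse training set directly limits which senses can ever be matched — the weakness the text attributes to instance-based classifiers.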
The replacement of child concepts with a parent concept through the hypernym relationship appears able to discover a near-optimal concept set without adversely affecting performance. Therefore, an important term which resides low in the concept hierarchy may be mapped to a parent concept and included in the signature for class comparison, even if this term is not included in the training set.

Table 2. Comparison of precision, recall and F-measure on concept retrieval

          Precision        Recall           F-Measure
Category  S    K    L      S     K    L     S    K     L
Acc       1    0.6  0.5    1     0.75 0.5   1    0.6   0.5
Bio       0.6  0.6  0.5    0.75  0.6  0.5   0.6  0.6   0.5
CS        1    0.5  0.3    1     0.5  0.3   1    0.5   0.3
Econ      1    1    0.6    1     0.75 0.6   1    0.6   0.6
Educ      0.6  0.5  0.5    0.75  0.75 0.5   0.6  0.45  0.5
Geo       0.6  0.5  0.5    0.75  0.5  0.5   0.6  0.5   0.5
Math      1    0.3  0.6    0.6   0.5  0.6   0.7  0.36  0.6
Psy       1    0.3  0.3    0.6   0.6  0.3   0.7  0.4   0.3

S = Signature-based retrieval, K = Keywords-based, L = Label-matching

DISCUSSION
The improvement in concept retrieval from using semantic signatures is not uniform across categories. For example, the improvement in the retrieval of "Psychology" and "Accounting" metadata is greater than that for "Biology" and "Geography". We believe that for some classes of metadata, such as "Biology", which are characterised by a set of specific keywords, the use of semantic signatures does not add extra useful information to the representation model to help classify the metadata. On the other hand, using 1-k hypernym generalization on such a highly specialized domain may in fact introduce more noise and reduce the matching possibility in similarity calculations.
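The 1-k hypernym generalization discussed above can be sketched as follows: each sense in a signature is replaced by its ancestor k levels up the hypernym chain, so that sibling senses (and their weights) merge into a common parent. The toy child-to-parent map below stands in for WordNet's hypernym relation; all identifiers are illustrative.

```python
# Toy hypernym hierarchy standing in for WordNet (child -> parent).
HYPERNYM = {
    "oscilloscope.n.01": "instrument.n.01",
    "voltmeter.n.01":    "instrument.n.01",
    "instrument.n.01":   "device.n.01",
    "device.n.01":       "artifact.n.01",
}

def generalize(sense, k=1):
    """Replace a sense with its ancestor k levels up the hypernym chain."""
    for _ in range(k):
        sense = HYPERNYM.get(sense, sense)  # stop if already at the root
    return sense

def generalize_signature(signature, k=1):
    """Merge sibling senses by mapping each to its k-th hypernym,
    accumulating their weights under the shared parent."""
    generalized = {}
    for sense, weight in signature.items():
        parent = generalize(sense, k)
        generalized[parent] = generalized.get(parent, 0.0) + weight
    return generalized
```

Merging siblings is what lets a term absent from the training set still contribute: its parent sense may already be in the class signature. It is also the double-edged effect the discussion notes, since in a highly specialized domain the merged parent can blur distinctions that the specific child senses carried.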
In addition, with a small dataset, over-fitting of the classification model may also result. Therefore, further experimentation and analysis are needed to fully understand the impact of WordNet signatures with sense generalization on the classification of metadata.

CONCLUSION
This project offers two important contributions. First, it gives a new light-weight semantic (ontology) mapping approach that enables cross-platform concept browsing in a federated network. Unlike many current practices in semantic mapping, which either require intensive user involvement to provide mapping information or resort to complicated heuristic or rule-based machine learning approaches, this work shows an effective automatic mapping protocol that allows federated concept browsing with semantic signatures. This is evident from the experimental results, which establish the merit of using WordNet to provide semantic knowledge for metadata classification in the domain of E-learning. The merits include the provision of a semantic representation of categorical data and increased semantic relevance in categorical browsing.
Second, by using immediate parent sense generalization during the sense selection process, the approach not only successfully reduces the dimensionality of the semantic signature, but more importantly introduces flexibility into sense selection and increases the opportunity to find a better sense without compromising the relevance of the search results. This creates an incentive to explore the use of other sense selection strategies.