<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Browsing Distant Metadata Using Semantic Signatures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Choi</string-name>
          <email>aschoi@sfu.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Hatala</string-name>
          <email>mhatala@sfu.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Simon Fraser University, School of Interactive Arts and Technology</institution>
          ,
          <addr-line>Surrey, BC</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <fpage>10</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this document, we describe a lightweight ontology mediation method that allows users to send semantic queries to distant data repositories to browse for learning object metadata. In a collaborative E-learning community, member data repositories might use different ontologies to control the vocabularies describing topics in learning resources. This can hinder the search for learning resources based on local ontological concepts. Using WordNet, we develop a toolkit that indexes ontological concepts with WordNet senses for semantic browsing in order to integrate information in a distributed learning community. The effectiveness of the toolkit was validated with real-world data in a specific domain, namely E-learning metadata.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Retrieval</kwd>
        <kwd>Data Integration</kwd>
        <kwd>Ontology Mediation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Copyright 2005</p>
      <p>requests. Currently, the use of metadata and ontologies to
formalize the semantics of concepts in the E-learning domain
does not completely resolve the problem of interoperability
in a federated environment. This is because metadata in
different repositories are very often annotated with
concepts defined by different ontologies specific to their
organizations or communities, which makes finding
information based on a local conceptual framework
difficult. Organizations with different
backgrounds and target audiences may use different terms
with similar semantics to describe two similar
learning resources. In addition to ontological differences,
linguistic variations in metadata values and the lack of a common
metadata standard across the learning network sometimes make direct
keyword querying ineffective for discovering
conceptually similar metadata.</p>
    </sec>
    <sec id="sec-2">
      <title>PROBLEM DESCRIPTION</title>
      <p>The primary objective of this research is to explore the use
of semantic signatures expressed in WordNet senses to
provide mediation between different ontologies in order to
enhance concept retrieval. Consider a scenario in which a
learner L1, associated with repository R1, is looking for
learning resources on how to find a
good bass (the musical instrument). L1 sends the request
“search for bass” to remote repositories R2 and R3
in an E-learning network. However, the
returned results are mixed with many irrelevant
resources related to catching a bass (i.e. the fish). Such a
problem occurs frequently when concepts are defined
by different domain ontologies with different sets of
vocabularies carrying different intended meanings. Imagine
another case in which the same learner L1 sends out a
distributed request for learning resources on the topic
“advanced databases”. In the remote repositories the topic is annotated with the
concept “database systems II”; that
is, it is labelled differently. Therefore, in a
concept-based label-matching search, learning resources defined by
the concept “database systems II” will not be returned for
the request “advanced databases”, even though the two
concepts are semantically equivalent.</p>
      <p>From these simple scenarios, one can easily see that
without a proper semantic mapping between the ontologies of
heterogeneous data sources, it remains challenging to find learning resources
based on the local conceptual definition, even when an ontology
defines the vocabulary used to describe the metadata of learning
resources.</p>
    </sec>
    <sec id="sec-3">
      <title>OVERVIEW OF ONTOLOGY MAPPING</title>
      <p>
        Semantic or ontology mapping can be described as a
mapping task that identifies common concepts and
establishes semantic relationships between heterogeneous
data models in the same domain of discourse [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since
semantics is mostly defined by ontological constructs in
modern knowledge systems, we will use the term semantic
mapping interchangeably with ontology mapping in this
discussion. According to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ontology mapping from
an ontology O1 = (C1, A1) to an ontology O2 = (C2, A2) can be expressed as a
function f: C1→C2 that semantically relates concept C1 to
concept C2 such that A2 |= f(A1), meaning that all interpretations
that satisfy the axioms in O2 also satisfy the axioms in O1 as translated by f. For
example, if the concept agent (C1) is defined in O1 by a set
of properties such as &lt;broker, travel agent and officer&gt;
with axioms such as &lt;part-of agency, is-a individual, is-a
organization and type-of communicator&gt; (ignoring other
attributes and cardinality for the sake of simplicity), it is
possible to map it to a concept representative (C2) defined
in O2 with a set of properties such as &lt;government agent,
client, spokesperson and advisor&gt; and axioms such
as &lt;part-of government, is-a person, and is-a expert&gt;. This
assumes that all the semantic interpretations of C1 are
respected by C2 in the domain of discourse when executing
logical inference operations on C2.
      </p>
    </sec>
    <sec id="sec-4">
      <title>REVIEW OF OTHER APPROACHES</title>
      <p>
        This section presents a brief overview of two approaches
to semantic mapping: GLUE and MAFRA. The former is a system that employs
machine-learning techniques to find ontology mappings
using multiple probabilistic learners, while the
latter uses a declarative representation of mappings as
instances in a mapping ontology that defines bridging axioms
to encode transformation rules. Given two domain
ontologies, GLUE claims to find, for each concept in one ontology,
the most similar concept in the other ontology [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A
number of features distinguish GLUE from other similar
mapping systems. First, unlike many mapping systems that
incorporate only a single similarity function to determine whether
two concepts are semantically related, GLUE utilizes
multiple similarity functions to measure the closeness of
two concepts, depending on the purpose of the mapping. The
intuition behind the multiple similarity functions is to take
advantage of the mapping requirement to relax or limit the
choice of corresponding concepts. For instance, depending on
the requirement of the application, the task of mapping the
concept “associate professor” can be satisfied by the similarity
criteria “exact”, “most-specific-parent” or
“most-general-child” to find “senior lecturer”,
“academic staff” or “John Cunningham” respectively. This
gives GLUE flexibility in finding semantic mappings between
ontologies. Second, GLUE applies a multi-strategy learning
approach that combines the information discovered by different
classifiers during the training process. This approach
divides the classification process into two phases. First, a
set of base classifiers is developed to classify instances of
concepts on different attributes with different algorithms.
Then, the predictions of these base classifiers, weighted
according to their importance for overall
accuracy, are combined to form a meta-learner. Finally, the
classification is determined by the result from the
meta-learner. For instance, one base learner can exploit the
frequency of words in the name property using a Naïve
Bayes learning technique, while another base learner can
use pattern matching on another property using a Decision
Tree Induction technique. In the end, the meta-learner
gathers all the results to form the final prediction. By using
multiple classifiers, GLUE aims to increase the accuracy
of the overall prediction. Third, GLUE incorporates relaxation
labelling techniques into the matching process to boost the
matching opportunity based on features of the
neighbouring nodes. Generally, relaxation labelling
iteratively makes use of neighbouring features, domain
constraints and heuristic knowledge to assign the label of
the target node.
      </p>
      <p>
        MAFRA (Mapping FRAmework) is another ontology
mapping methodology that prescribes “all phases of the
ontology mapping process, including analysis,
specification, representation, execution and evolution”
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It uses the declarative representation approach to
ontology mapping by creating a Semantic Bridging
Ontology (SBO) that contains all concept mappings and
associated transformation rule information. In this model,
given two ontologies (source and target), domain
experts are required to examine and analyze the class definitions,
properties, relations and attributes to determine the
corresponding mapping and transformation method. Then,
all accumulated information is encoded into concepts
in the SBO. The SBO therefore serves as an upper ontology that
governs the mapping and transformation between the two
ontologies. Each concept in the SBO consists of five
dimensions: Entity, Cardinality, Structural,
Constraint and Transformation. During the process of
ontology mapping, a software agent inspects the values
from the two given ontologies under these dimensions and
executes the transformation process when the constraints are
satisfied.
      </p>
      <p>Some recent approaches, such as INRIA<sup>1</sup>, make use of the OWL
API to build a set of alignment APIs with built-in WordNet
functions for the purpose of ontology alignment or axiom
generation and transformation. However, the details of how
WordNet is used to generate the alignments are not well
documented in the published literature.<fn id="fn1"><label>1</label><p>http://co4.inrialpes.fr/align/index.html</p></fn></p>
    </sec>
    <sec id="sec-5">
      <title>WORDNET</title>
      <p>
        WordNet is a widely recognized online lexical reference
system, developed at Princeton University, whose design is
inspired by “current psycholinguistic theories of human
lexical memory. English nouns, verbs, adjectives and
adverbs are organized into synsets (synonym sets), each
representing one underlying lexical concept that is
semantically identical to each other” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Synsets are
interlinked via relationships such as synonymy and
antonymy, hypernymy and hyponymy (Subclass-Of and
Superclass-Of), meronymy and holonymy (Part-Of and
Has-a) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Each synset has a unique identifier (ID) and a
specific definition. A synset may consist of only a single
element, or it may have many elements all describing the
same concept. Each element in a particular synset's list is
synonymous with all other elements in that synset. For
example, the synset {World Wide Web, WWW, Web}
represents the concept of a computer network consisting of a
collection of internet sites. In this context, “World Wide
Web”, “WWW” and “Web” are all semantically equivalent.
Where a single word has multiple meanings
(polysemy), multiple separate and potentially unrelated
synsets contain the same word. For instance, the word
“Web” has 7 meanings defined in WordNet,
including computer network, entanglement and spider web.
      </p>
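      <p>The synset and polysemy ideas above can be illustrated with a minimal sketch. It does not call WordNet: the dictionary TOY_SYNSETS, its identifiers and its glosses are hypothetical stand-ins chosen for illustration, and the helper names senses_of and are_synonyms are assumptions of this sketch.</p>
      <preformat>
```python
# Toy stand-in for a WordNet lookup; IDs and glosses are illustrative only.
TOY_SYNSETS = {
    "web.n.01": {"lemmas": ["web", "World Wide Web", "WWW"],
                 "gloss": "computer network consisting of a collection of internet sites"},
    "web.n.02": {"lemmas": ["web", "entanglement"],
                 "gloss": "an intricate trap that entangles or ensnares"},
    "web.n.03": {"lemmas": ["web", "spider web"],
                 "gloss": "a web spun by a spider"},
}


def senses_of(word):
    """Return the IDs of every synset that lists `word` as an element."""
    target = word.lower()
    return sorted(
        sid for sid, syn in TOY_SYNSETS.items()
        if target in (lemma.lower() for lemma in syn["lemmas"])
    )


def are_synonyms(a, b):
    """Two words are synonymous in this toy model if they share a synset."""
    return bool(set(senses_of(a)).intersection(senses_of(b)))
```
      </preformat>
      <p>In this toy inventory, the polysemous word “Web” belongs to three synsets, while “WWW” and “World Wide Web” share exactly one and are therefore treated as synonyms.</p>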
    </sec>
    <sec id="sec-6">
      <title>OUR APPROACH</title>
      <p>To help distributed learning repositories organize and
manage their metadata in compliance with a global
semantic view, we create a semantic mapping strategy
that uses WordNet as a mediator to provide word sense
disambiguation and to generate semantic signatures, each
representing a learning resource category.</p>
      <p>A semantic signature in the categorical browsing context can
be defined as a logical grouping of representational word
senses for a class of metadata. In essence, it is a semantic
representation of a class label by the WordNet
senses that are important in its context. The concept of
semantic signature is formalized with the following notation:</p>
      <p>Sig(c) = semantic signature for class c;
DSj = set of document senses for class c;
BSdj = set of best senses in document dj;
T = all keywords in document dj;
Fav = selection function to find the best sense;
WS(ti) = set of WordNet senses for term ti.</p>
      <p>To explain briefly, the semantic signature of a class of
metadata is built from a set of important document senses
from all documents (metadata records) belonging to that
class. In turn, document senses are generated
from a collection of the best WordNet senses for all
representational keywords of a particular document.
The generation of a semantic signature for a class of
metadata is divided into three distinct phases. In the rest of
this section, the general architecture of the methodology is
described, and each phase is discussed in detail and
illustrated with examples.</p>
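      <p>One possible reading of this notation is sketched below. It is an assumption that is merely consistent with the definitions and the prose explanation, not a reproduction of the original equation; in particular, it omits the TFIDF-based selection of the most important senses.</p>
      <preformat>
```latex
\[
  \mathrm{Sig}(c) \;=\; \bigcup_{d_j \in c} DS_j ,
  \qquad
  DS_j \;=\; BS_{d_j} \;=\; \bigl\{\, F_{av}\bigl(WS(t_i)\bigr) \;:\; t_i \in T \,\bigr\}
\]
```
      </preformat>
      <p>Read this way, the signature of a class is the union of the document senses of its metadata records, and the document senses of a record are the best senses selected by Fav from the WordNet senses of each of its keywords.</p>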
    </sec>
    <sec id="sec-7">
      <title>System Design and Architecture</title>
      <p>
        The methodology for creating semantic signatures relies
heavily on the assumption that the aggregate of all
semantic information from the metadata records of a particular
class is a good representation of the concept for that class.
In fact, a metadata record is an instance of a concept in
the ontological framework. Moreover, the methodology
assumes that the semantic information of a class can be
approximated by a set of important word senses from all
its metadata records. In addition, word senses specific to
the context can be found through WordNet based on important terms
extracted from the metadata. Finally,
it assumes that the local semantic signature for
a class of metadata is similar to the signatures for metadata of
semantically equivalent concepts in distant repositories.
The methodology uses the k-Nearest Neighbour (kNN) search
algorithm to classify semantically relevant concepts in
distant repositories based on local semantic signatures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
The instances (metadata) of concepts in the local repository
serve as the training dataset. Based on the semantic features of
the local metadata, a semantic signature is formed for each class of
concepts. To find semantically relevant
concepts in distant repositories, a distance function is
defined and used to measure the closeness between the query
signature and the semantic signatures for concepts in distant
repositories. Eventually, the k concepts most similar to the
query signature are retrieved from the remote repositories.
representative keywords are used as seeds to find the
corresponding word senses from WordNet. Finally, in the
Sense Selection phase several strategies are applied to
select the best word sense among all senses to
represent each word term.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Signature Generation in Action</title>
      <sec id="sec-8-1">
        <title>Phase I: Word Extraction</title>
        <p>
          First, the input metadata are transformed to comply with
the IEEE LOM standard<sup>2</sup> using an XML transformer. Then,
following an approach adapted from the Edmundsonian paradigm [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], content from the
&lt;Title&gt; and &lt;Description&gt; elements is extracted to
represent the whole metadata document. This presumes
that the content of these two elements carries enough
weight, as cue phrases, to represent the whole
document [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This view seems reasonable in the case of
learning object metadata because other elements like
publication date, ISBN or format do not bear good
semantic information to signify the category of the
metadata.
        </p>
      </sec>
      <sec id="sec-8-2">
        <title>Phase II: Document Preprocessing</title>
        <p>
          The condensed metadata, containing only the &lt;Title&gt; and
&lt;Description&gt; elements, are cleaned to
remove all stopwords, punctuation, numerical
values and irregular symbols. Next, all non-noun words are
removed using a part-of-speech tagger, except for some
commonly used phrasal words that carry a specific
meaning. For example, the word “artificial” in the phrase
“artificial intelligence” is preserved to retain the
special meaning of the binary phrase in that branch of
computer science. The reason why this approach uses only
nouns as base keywords is explained in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where it is
said that long phrases are not as easily disambiguated
as a single word term or a binary word term.
Using a phrase as a distinguishing feature for
document classification in effect lowers accuracy, as
demonstrated by previous experiments in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. On the other
hand, it has been shown that noun word terms
carry the most salient information and serve well as distinguishing
features for text classification [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
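        <p>The cleaning step can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list and the phrase whitelist are tiny hypothetical samples, the function name clean is not from the original toolkit, and the part-of-speech filtering of non-noun words is deliberately left out because it requires an external tagger.</p>
        <preformat>
```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "on", "is", "are"}
# Binary phrases kept as single terms (hypothetical whitelist).
PHRASES = {("artificial", "intelligence")}


def clean(text):
    """Drop punctuation, numbers, symbols and stopwords; keep whitelisted phrases."""
    # Splitting on non-alphanumerics removes punctuation and irregular symbols.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # isalpha() discards numerical values and mixed tokens such as "2nd".
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    terms, i = [], 0
    while i != len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            terms.append(" ".join(pair))  # preserve the binary phrase as one term
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms
```
        </preformat>
        <p>For a title such as “An Introduction to Artificial Intelligence: 2nd course (2005)”, the sketch keeps “introduction”, “artificial intelligence” and “course”.</p>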
      </sec>
      <sec id="sec-8-3">
        <title>Phase III: Document Vector Sensitization</title>
        <p>
          Once all irrelevant information has been
eliminated, the physical metadata documents are projected
onto the vector space model. The document vector
becomes a logical representation of the physical metadata
record. Then, using the TFIDF weighting scheme, we select
the most significant terms across all document vectors to
represent a category of metadata [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. After that, each word
term with a TFIDF score higher than the threshold is sent
to WordNet to retrieve the corresponding word senses and
their definitions. The threshold is determined by trial and error
with a test run. A single word term can have
multiple word senses retrieved. For example, the word
“search” can be mapped to the WordNet senses &lt;hunting,
hunt&gt;, &lt;lookup&gt; and &lt;investigation&gt;. Because of this, the
mapping information of a single noun word term is
denoted by a triple of the form &lt;T, S, D&gt;, where T
is the original word term, S is a synset of T and D is the
definition of that synset. When a noun term can be mapped to
multiple senses, there are multiple triples. Take the
word term “search” as an example. After sensitization,
one of its triples is &lt;search – {hunting, hunt} = “the activity of
looking thoroughly in order to find something or someone”
(TFIDF 0.623101)&gt;. The triple
replaces the original word term in the
master document vector. Because a
single word term may be mapped to several different
word senses through WordNet, and each word sense is
represented by a synset that may contain multiple
synonymous terms, the length of the
document vector expressed in word senses grows considerably.
This problem is addressed in the next phase.<fn id="fn2"><label>2</label><p>http://ieeeltsc.org/wg12LOM/lomDescription</p></fn>
        </p>
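        <p>The term selection step can be sketched as below. The paper does not state which TFIDF variant is used, so the formula here (relative term frequency times the natural logarithm of N/df) is an assumption, as are the function names tfidf and select_terms and the example threshold.</p>
        <preformat>
```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of non-empty token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores


def select_terms(docs, threshold):
    """Keep terms whose TFIDF score exceeds the threshold in at least one document."""
    kept = set()
    for doc_scores in tfidf(docs):
        kept.update(t for t, s in doc_scores.items() if s > threshold)
    return sorted(kept)
```
        </preformat>
        <p>A term that occurs in every document, such as a word shared by all records of a category, receives a score of zero under this variant and is never sent to WordNet.</p>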
      </sec>
      <sec id="sec-8-4">
        <title>Phase IV: Sense Selection Strategy (S3)</title>
        <p>
This is the last and most crucial phase of the method. It
chooses the best word sense among all word
senses retrieved from WordNet to represent the word term. As
stated, a word term can be mapped to multiple WordNet
senses. In such a case, the dimensionality of the vector
grows significantly after the sensitization procedure.
For example, the word term “light” can be mapped to 15
WordNet noun senses, such as “visible light”, “light source”,
“luminosity” and “lighting”. The growth ratio is 15 times
in this case. Such a high dimensionality not only negatively
affects the efficiency of the similarity computation but,
more seriously, most of these senses are noise that does not
carry the actual meaning of the word in the context of the
document. Including irrelevant senses distorts the
semantic representation of the signature and lowers the
accuracy of the similarity calculation when finding similar
classes of metadata using signature matching. On the other
hand, from the semantic knowledge standpoint, WordNet
senses only provide the lexical information of the word
term, but not the contextual information that determines how
the meaning is clarified in a specified context [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Without
that, the semantic signature is just a larger collection of
keywords and would be of little use in identifying the
classes of metadata based on the semantic relevance of the
signature. Therefore, it is necessary to reduce
the dimensionality and select only the sense that conveys the
main idea of the word in the current context. To select the
best sense representing a word term, a context-based
Sense Selection Strategy (S3) is applied to the retrieved word
senses. The strategy is based on the assumption that the
local contextual information of a document serves as a
good hint as to which sense best represents the actual meaning
of the word term. The S3 approach can be summarized
in the following algorithm:
        </p>
        <sec id="sec-8-3-2">
          <title>Steps of the algorithm (calculate the best senses for class C1)</title>
          <preformat>For each metadata document D ∈ C1
  Get the list of synsets for each word term T1 ∈ D
  For each synset Syn1 of the word term T1
    For each sense term Si ∈ Syn1
      1. Compute associative frequency af for Si to other senses Sk ∈ Synk, Synk ⊆ Tk and T1 ≠ Tk
         1.1 Find the sense Sl with highest score Max(af)
         1.2 If (Max(af) &lt; 1) then go to 2, otherwise stop and return Sl
      2. Compute associative frequency af for Si to k-order parent senses PSk ∈ P(Synk), P(Synk) ⊆ Tk and T1 ≠ Tk
         2.1 Find the sense Sp with highest score Max(af)
         2.2 If (Max(af) &lt; 1) then go to 3, otherwise stop and return Sp
      3. Return the most popular sense Sw offered by WordNet
  Return the best sense to represent word term T1
  Aggregate all senses from all important word terms to represent the signature of the document D</preformat>
          <p>
The algorithm works in the following way. For each word
sense of a word term, it first computes the associative
frequency (af) of each sense term in a synset with respect to the sense
terms in the synsets of the other word terms in the same
document. The most frequently occurring word sense is then
used as the semantic representation of the word
term (Strategy 1).</p>
          <p>
            Next, if the word sense of a word term cannot be
discriminated by Strategy 1, the algorithm generalizes the
word term to its k-order parent senses. In this approach,
the value of k is 1, so the term is generalized to its immediate
parent word sense. Referring to Figure 2, Strategy 2 uses
the immediate parent sense to compute the associative frequency
against the senses of the other word terms in the document
vector. In this example, the word term t1 is
rolled up to its immediate parent through the hypernym (is-a)
relation in the WordNet hierarchy. Then, the parent’s
synset is used to calculate the associative frequency with respect to
the word senses of the other word terms. Unlike other
generalization approaches [
            <xref ref-type="bibr" rid="ref13 ref7">7, 13</xref>
            ], we generalize the sense
to its most specific parent only. The reason for using
immediate parent senses (k=1) to compute the associative
frequency is given in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], where the most specific parent in
a hierarchical terminology is said to have higher discriminative power for
classifying the topic. The intuition is that if
a word sense is generalized to a parent sense of higher order
than k=1, the generalized sense may be too general,
become incoherent with the local context, and act as
noise when used to classify metadata.
          </p>
          <p>Finally, the word senses retrieved
from WordNet for a particular word form a partially ordered set
ranked by popularity in English usage. If the previous two
strategies cannot find the best sense to represent the word
term, the most popular sense offered by WordNet is
adopted in Strategy 3.</p>
          <p>The rationale behind sequencing the three strategies is based
on the observation and hypothesis that the local context is the
most specific and relevant source of contextual
meaning for a word sense. Therefore, a word sense
for a particular term can most likely be disambiguated by
other local senses (Strategy 1). If it cannot be resolved
by Strategy 1, the immediate parent sense is compared with
the other word senses to check whether the parent sense is a
frequently occurring sense for the underlying word term (Strategy 2).
Finally, the most popular sense is adopted to represent the
semantic meaning of a word term when the two strategies
above cannot resolve its ambiguity (Strategy 3).</p>
          <p>Following the above procedure, a set of senses becomes the
semantic signature of a document. In order to generate the
final semantic signature for a class of documents referring
to a particular concept, the TFIDF scheme is applied again to
each word sense in all document signatures for that
class. Based on the score, the most relevant senses for
characterizing the class of metadata are aggregated to form
the final signature for the class.</p>
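          <p>The three strategies can be sketched in a few lines. This is one interpretation under stated assumptions, not the SGI implementation: the sense inventory DOC_SENSES is hypothetical toy data rather than WordNet output, associative frequency is taken to be the number of times a sense term (or a parent sense term) occurs among the sense terms of the other word terms in the document, and all function names are illustrative.</p>
          <preformat>
```python
# Toy sense inventory for one document (hypothetical data, not real WordNet output).
# Each term maps to its candidate senses in WordNet popularity order.
DOC_SENSES = {
    "bass": [
        {"id": "bass/fish", "lemmas": ["bass", "sea bass"], "parent": ["fish"]},
        {"id": "bass/instrument", "lemmas": ["bass"],
         "parent": ["musical instrument", "instrument"]},
    ],
    "guitar": [
        {"id": "guitar", "lemmas": ["guitar"], "parent": ["stringed instrument"]},
    ],
    "instrument": [
        {"id": "instrument/music", "lemmas": ["musical instrument", "instrument"],
         "parent": ["device"]},
    ],
}


def assoc_freq(words, term, doc_senses):
    """Count occurrences of `words` among the sense terms of all other word terms."""
    others = [lemma
              for other, senses in doc_senses.items() if other != term
              for sense in senses
              for lemma in sense["lemmas"]]
    return sum(others.count(w) for w in words)


def pick(senses, key, term, doc_senses):
    """Return the sense whose `key` words score highest, or None if Max(af) is below 1."""
    scored = [(assoc_freq(s[key], term, doc_senses), s) for s in senses]
    top = max(score for score, _ in scored)
    if top >= 1:
        return next(s for score, s in scored if score == top)
    return None


def best_sense(term, doc_senses):
    senses = doc_senses[term]
    return (pick(senses, "lemmas", term, doc_senses)     # Strategy 1: local senses
            or pick(senses, "parent", term, doc_senses)  # Strategy 2: immediate parent (k = 1)
            or senses[0])                                # Strategy 3: most popular sense


def document_signature(doc_senses):
    """Aggregate the best sense of every word term into the document signature."""
    return [best_sense(term, doc_senses)["id"] for term in doc_senses]
```
          </preformat>
          <p>In the toy document, “bass” cannot be resolved by Strategy 1, but its instrument sense wins under Strategy 2 because its parent sense terms occur among the senses of “instrument”; “guitar” and “instrument” fall through to Strategy 3.</p>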
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Concept browsing in heterogeneous ontologies</title>
      <p>In our application, the generated semantic signatures are
used to index the actual classes of metadata for fast
distributed browsing. We developed a tool called Signature
Generation Indexer (SGI) that supports the methodology
described in the previous section. Designed for
efficiency, SGI allows repository
operators to produce semantic signatures for classes of
learning object metadata easily, without tedious human
interaction or complicated implementation.</p>
      <p>The ultimate goal is to achieve semantic search based on
E-learning topics defined by heterogeneous ontologies in a
federated network. In a collaborative learning environment,
users expect to be able to access all the learning resources
within the learning network. To meet this expectation, we
assume that all participating repositories in the
collaborative network employ the same strategy to index
learning resource metadata with WordNet semantic
signatures.</p>
      <p>In this way, when a user launches a query by selecting a
specific topic (concept) from the local ontology (e.g. via the
user interface), the corresponding semantic signature
representing the topic is retrieved from the local database. The
signature is then sent across the network to the participating
learning repositories. The query, in the form of a semantic
signature, is the input of the Similarity Calculator in each distant
repository, which computes
the similarity between the query signature and the signatures stored in that
repository. The Similarity Calculator uses the cosine
similarity function, so the more elements of the signatures match,
the higher the score. In calculating the
similarity score, different weights are assigned to senses
from &lt;Title&gt; and &lt;Description&gt;: a match on a
title sense contributes more to the overall score than a match on
a description sense.</p>
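      <p>A weighted cosine similarity of this kind can be sketched as follows. The weight values are hypothetical, since the text only states that title senses count more than description senses, and the names to_vector and cosine are assumptions of this sketch rather than parts of the Similarity Calculator.</p>
      <preformat>
```python
import math

# Hypothetical weights; the source only says title matches contribute more.
TITLE_WEIGHT, DESC_WEIGHT = 2.0, 1.0


def to_vector(title_senses, desc_senses):
    """Build a sparse {sense: weight} vector from title and description senses."""
    vec = {}
    for sense in desc_senses:
        vec[sense] = vec.get(sense, 0.0) + DESC_WEIGHT
    for sense in title_senses:
        vec[sense] = vec.get(sense, 0.0) + TITLE_WEIGHT
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(sense, 0.0) for sense, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```
      </preformat>
      <p>With these weights, a remote signature that shares a title sense with the query scores higher than one that shares only a description sense, all else being equal.</p>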
      <p>In order to ensure the global accuracy of the result, the results
from the participating remote repositories are merged and
sorted in descending order of cosine
similarity score. Then, the top k (k=5) topics of the
metadata are offered as the answer to the local query. The
overall operation of the semantic-based browsing of
learning resource metadata is shown in Figure 3.</p>
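      <p>The merge step can be sketched as below, assuming each remote repository returns a list of (concept, score) pairs. The function name merge_top_k and the input shape are assumptions made for this illustration.</p>
      <preformat>
```python
import heapq


def merge_top_k(results_by_repo, k=5):
    """Merge per-repository hits and return the k best as (score, repo, concept) tuples.

    results_by_repo maps a repository name to a list of (concept, score) pairs.
    """
    merged = [(score, repo, concept)
              for repo, hits in results_by_repo.items()
              for concept, score in hits]
    # nlargest returns the entries sorted in descending order of score.
    return heapq.nlargest(k, merged)
```
      </preformat>
      <p>Because the scores from all repositories are compared in a single list, a highly similar concept in one repository is not displaced by weaker matches from another.</p>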
    </sec>
    <sec id="sec-10">
      <title>IMPLEMENTATION</title>
      <p>SGI is implemented in the C# programming language.
The current version is a desktop application, but it can be
easily extended to a web service. The goal of SGI is to
integrate signature generation, document indexing and
browsing capabilities. The signature indexes are stored in an
inverted index database (e.g. MS Access). The similarity
calculator is a separate module, also implemented in C#
and connected to the index database. Figure 4 shows the
browsing interface of SGI and illustrates how distant
concepts are searched semantically.</p>
    </sec>
    <sec id="sec-11">
      <title>EVALUATION</title>
      <p>In order to test the hypothesis that semantic signatures
enable distributed semantic browsing and improve
relevance, we simulated distributed concept
retrieval and compared the results with the traditional
keyword-based and label-matching methods. To replicate
the distributed repositories in a collaborative E-learning
network, three independent databases were set up. As
shown in Figure 5, they are called “local”, “remote1” and
“remote2”, where local denotes the local data
source and both remote1 and remote2 simulate distant data
sources. A single master set of metadata in 8 different
categories is distributed randomly and evenly in number
among the three simulated repositories.</p>
      <p>The metadata have been transformed to conform to the
IEEE LOM format. After the distribution, the local
database contains the metadata that represents the set of the
training data for the classifier. During the training phase,
the kNN classifier uses the instance of the local metadata to
learn the features to identify the class of the metadata. It
starts by extracting keyword terms from each category of
metadata and projecting them into the vector space model.
Next, after running through the signature generation
module, each category of metadata is represented and
indexed by a semantic signature in the database.
The dataset in both remote1 and remote2 is controlled to
model potentially different ontological classifications in a
distributed environment. To simulate the effect of varied
concept labelling, the original 8 categories of metadata are
expanded to 14 categories in remote1. The 6 derived
categories are labelled with class names different from
those of their source categories and are populated with
metadata taken from those source categories. Each derived
category contains metadata belonging to a single source class.
To illustrate, part of the metadata from the category
“computing science” is distributed to the derived categories
“technology” and “engineering” in remote1. The metadata for
the concept “computing science” are thereby grouped under
“computing science”, “technology” and “engineering”.
Essentially, this simulates the situation in which the
concept “computing science” is categorized under concepts
such as “technology” and “engineering” in a different
ontology. The same distribution principle is applied to the
remote2 database, which includes 13 categories, of which 7
are derived categories.</p>
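      <p>The relabelling used to build the remote datasets can be sketched as follows. This is a minimal illustration and not the procedure actually used in the experiment: the function name derive_categories, the fraction of records moved, the fixed random seed and the round-robin assignment to derived labels are all assumptions made for the example.</p>
      <preformat preformat-type="code">
```python
import random


def derive_categories(records, source, derived, fraction, seed=0):
    """Move a fraction of one category's records under derived class names.

    records is a list of (record_id, label) pairs. Returns the relabelled
    records and a mapping from each derived label back to its source label.
    The fraction, seed and round-robin assignment are illustrative choices.
    """
    rng = random.Random(seed)
    source_ids = [rid for rid, label in records if label == source]
    moved = rng.sample(source_ids, int(len(source_ids) * fraction))
    # Assign the moved records to the derived labels in round-robin order.
    new_label = {rid: derived[i % len(derived)] for i, rid in enumerate(moved)}
    relabelled = [(rid, new_label.get(rid, label)) for rid, label in records]
    # Remember which source class each derived label came from.
    origin = {name: source for name in derived}
    return relabelled, origin


# Toy data: 8 "computing science" records and 4 "biology" records.
records = [(i, "computing science") for i in range(8)]
records += [(i, "biology") for i in range(8, 12)]
remote1, origin = derive_categories(
    records, "computing science", ["technology", "engineering"], fraction=0.5
)
```
</preformat>
      <p>Keeping the mapping from derived label to source label is what would allow a retrieved “technology” or “engineering” concept to be judged relevant to a “computing science” query.</p>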
      <p>As in the local database, each category of metadata
in remote1 and remote2 is mapped to a semantic signature
of WordNet senses and stored in the local database as an
index. To test semantic-based search, a semantic signature
representing a local concept is sent as a query to the remote
repositories. The query signature is compared with each
distant signature using the similarity function. Finally, the
k most similar concept signatures returned from the remote
databases are evaluated against the relevance metrics.</p>
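      <p>The query step can be sketched as a top-k ranking over signatures. The sketch below assumes that a signature is a mapping from sense identifier to weight and uses cosine similarity as a stand-in for the similarity function of the system; the sense identifiers, weights and function names are hypothetical.</p>
      <preformat preformat-type="code">
```python
import math


def cosine(sig_a, sig_b):
    """Cosine similarity between two signatures given as {sense: weight}."""
    shared = set(sig_a).intersection(sig_b)
    dot = sum(sig_a[s] * sig_b[s] for s in shared)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, remote_index, k):
    """Return the k remote concepts most similar to the query signature."""
    scored = [(cosine(query, sig), name) for name, sig in remote_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical signatures; plain words stand in for WordNet sense identifiers.
query = {"computer": 1.0, "science": 1.0}
remote_index = {
    "technology": {"computer": 1.0, "machine": 1.0},
    "engineering": {"computer": 0.5, "science": 0.5, "design": 1.0},
    "geology": {"rock": 1.0},
}
ranked = top_k(query, remote_index, k=2)
```
</preformat>
      <p>In this toy index the two remote concepts that share senses with the query are ranked ahead of “geology”, which shares none.</p>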
    </sec>
    <sec id="sec-12">
      <title>Dataset</title>
      <p>Since there is no publicly available dataset of learning
resource metadata, the experimental metadata were acquired
from a number of different sources. Table 1 shows the
categories of metadata acquired and their respective sources.
In total, 2235 metadata records in 8 different
categories were acquired. The dataset is partitioned into
training and testing groups. As mentioned, the local
database stores the training dataset while remote1 and
remote2 store the testing dataset. The class label of every
metadata record is known. Records are assigned randomly,
using the Microsoft Excel random number generator, to the
training and testing groups. After distribution, the local
database contains 667 training records while remote1 and
remote2 contain the remaining 1568 testing records.</p>
    </sec>
    <sec id="sec-13">
      <title>Results</title>
      <p>To gauge the effectiveness of the proposed
mediation method between different E-learning ontologies,
three standard information retrieval metrics are used to
evaluate system performance: Recall, Precision and
F-measure. Table 2 shows that the use of semantic
signatures consistently improves retrieval relevance in
terms of recall and precision. In all categories,
semantic-based retrieval outperforms both the keyword-based
and label-matching methods.</p>
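      <p>For reference, the three metrics can be computed from the set of retrieved concepts and the set of relevant concepts as sketched below. The balanced form of the F-measure (the harmonic mean of precision and recall) is assumed here, and the function name and example sets are illustrative.</p>
      <preformat preformat-type="code">
```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f_measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved.intersection(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced F-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Four concepts retrieved, three relevant, two in common.
precision, recall, f_measure = retrieval_metrics(
    {"a", "b", "c", "d"}, {"a", "b", "e"}
)
```
</preformat>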
      <p>As opposed to the traditional keyword-based
representation, semantic-based indexing with WordNet
senses can include more lexical information than a simple
syntactic approach. This implies that more features are
added to the class signature representation. Adding more
features may, however, also introduce more noise.</p>
      <p>Intuitively, the increased relevance of retrieval can be
attributed to the expansion of features in the class
representation. However, contrary to what we expected,
the precision does not decrease. We suspect that, owing to
the relatively small size of the dataset and the 1-k hypernym
generalization, the senses included in the signature are
‘good’ in terms of classification. Therefore, combined with
a good context-based sense selection strategy, WordNet
as a mediator can provide a source of ambiguity resolution
and semantic information for the process of semantic
browsing. In addition, the selection of the kNN
algorithm as the classifier also contributes to the
performance of the system.</p>
      <p>kNN is an instance-based classifier. The performance of
instance-based classifiers depends more heavily on the
sufficiency of the training set than that of other machine
learning classification algorithms. A small dataset for
training and testing is therefore a disadvantage for kNN.
A smaller training set implies that more terms or
term combinations important for content identification may
be missing from the training sample documents, which
negatively affects the performance of a classifier.
Nevertheless, the ontology-guided approach (here using
WordNet) seems to reduce the negative influence
of this problem somewhat. Replacing child concepts with
parent concepts through the hypernym relationship appears
to discover an optimum concept set without
adversely affecting performance. An important
term that resides low in the concept hierarchy may therefore
be mapped to a parent concept and included in the signature
for class comparison, even if the term itself is not included
in the training set.</p>
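      <p>The effect of hypernym generalization on unseen terms can be illustrated with a small sketch. The hypernym table below is a hand-made toy, not WordNet data, and the function name and the levels parameter (the number of steps taken up the hierarchy) are assumptions made for the example.</p>
      <preformat preformat-type="code">
```python
from collections import Counter

# Toy child-to-parent table standing in for WordNet hypernym links.
TOY_HYPERNYMS = {
    "compiler": "program",
    "interpreter": "program",
    "program": "software",
    "laptop": "computer",
}


def generalize(senses, hypernyms, levels=1):
    """Replace each sense by its ancestor `levels` steps up the hierarchy.

    Senses without a recorded parent are kept unchanged. The result counts
    how often each generalized sense occurs, which can serve as a signature.
    """
    signature = Counter()
    for sense in senses:
        current = sense
        for _ in range(levels):
            current = hypernyms.get(current, current)
        signature[current] += 1
    return signature


# "compiler" appears in training; "interpreter" appears only at test time.
training_signature = generalize(["compiler", "laptop"], TOY_HYPERNYMS)
test_signature = generalize(["interpreter"], TOY_HYPERNYMS)
shared = set(training_signature).intersection(test_signature)
```
</preformat>
      <p>Although “interpreter” never occurs in the training terms, both signatures contain the parent concept “program” after one level of generalization, so the two can still be matched.</p>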
    </sec>
    <sec id="sec-14">
      <title>DISCUSSION</title>
      <p>The improvement in concept retrieval from using semantic
signatures is not uniform across categories. For
example, the improvement in retrieval of “Psychology”
and “Accounting” metadata is greater than the improvement for
“Biology” and “Geology”. We believe that for classes
of metadata such as “Biology”, which are characterised by a set
of specific keywords, semantic signatures do
not add useful extra information to the representation
model to help in classifying metadata. Moreover,
using 1-k hypernym generalization on such a highly
specialized domain may in fact introduce noise that
reduces the likelihood of a match in similarity calculations.
In addition, with a small dataset, the classification model
may over-fit. Further
experimentation and analysis are therefore needed to fully
understand the impact of WordNet signatures with sense
generalization on the classification of metadata.</p>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>This project offers two important contributions. First, it
provides a new lightweight semantic (ontology) mapping
approach that enables cross-platform concept browsing in a
federated network. Unlike many current practices in
semantic mapping, which either require intensive user
involvement to provide mapping information or resort to
complicated heuristic or rule-based machine learning
approaches, this work shows an effective automatic mapping
protocol that allows federated concept browsing with
semantic signatures. The experimental results
establish the merit of using WordNet to provide
semantic knowledge for metadata classification in the
domain of E-learning. The merits include the provision of
a semantic representation of categorical data and increased
semantic relevance in categorical browsing.</p>
      <p>Using immediate parent sense generalization during the
sense selection process not only reduces the
dimension of the semantic signature but, more
importantly, introduces flexibility into sense selection and
increases the opportunity to find a better sense without
compromising the relevance of the search results. This
creates an incentive to explore other sense selection
strategies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Dhamankar</surname>
          </string-name>
          , Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos, “
          <article-title>iMap: Discovering Complex Semantic Matches between Database Schemas”</article-title>
          ,
          <source>Proceedings of the ACM SIGMOD Conference on Management of Data</source>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: An Online Lexical Database</article-title>
          ,
          <source>International Journal of Lexicography</source>
          (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Asuncion</given-names>
            <surname>Gomez-Perez</surname>
          </string-name>
          ,
          <source>Ontological Engineering with Examples from the areas of Knowledge Management, eCommerce and the Semantic Web</source>
          ,
          <publisher-name>Springer-Verlag London</publisher-name>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Edmundson</surname>
          </string-name>
          ,
          <article-title>New Methods in Automatic Extracting</article-title>
          ,
          <source>Journal of the ACM</source>
          (
          <year>1969</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ching Kang</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Xiaoshan Pan and Franz Kurfess, “
          <article-title>Ontology-based Semantic Classification of Unstructured Documents”</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Khaled M.</given-names>
            <surname>Hammouda</surname>
          </string-name>
          and Mohamed S. Kamel, “
          <article-title>Phrase-based Document Similarity Based on an Index Graph Model”</article-title>
          ,
          <source>Proceedings of IEEE International Conference on Data Mining</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>AnHai</given-names>
            <surname>Doan</surname>
          </string-name>
          , Jayant Madhavan, Pedro Domingos, and Alon Halevy “
          <article-title>Learning to map between ontologies on the semantic web”</article-title>
          ,
          <source>Proceedings of WWW2002 conference</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ching Kang</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Xiaoshan Pan and Franz Kurfess, “
          <article-title>Ontology-based Semantic Classification of Unstructured Documents”</article-title>
          ,
          <source>Proceedings of 1st International Workshop on Adaptive Multimedia Retrieval</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hans-Peter</given-names>
            <surname>Kriegel</surname>
          </string-name>
          and Matthias Schubert, “
          <article-title>Web Site Mining : A new way to spot Competitors, Customers and Suppliers in the World Wide Web”</article-title>
          ,
          <source>Proceedings of 4th International Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Kalfoglou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Schorlemmer</surname>
          </string-name>
          ,
          <article-title>Ontology mapping: the state of the art</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Mineau</surname>
          </string-name>
          , “
          <article-title>A simple KNN algorithm for text categorization”</article-title>
          ,
          <source>Proceedings of IEEE International Conference on Data Mining</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>McGill</surname>
          </string-name>
          ,
          <source>Introduction to Modern Information Retrieval</source>
          ,
          <publisher-name>McGraw-Hill</publisher-name>
          (
          <year>1983</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskevich</surname>
          </string-name>
          ,
          <article-title>Semantic matching</article-title>
          ,
          <source>In 1st European semantic web symposium (ESWS'04)</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maedche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Silva</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <article-title>"MAFRA - A MApping FRAmework for Distributed Ontologies"</article-title>
          ,
          <source>in EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web</source>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>250</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>