=Paper=
{{Paper
|id=Vol-156/paper-3
|storemode=property
|title=Towards Browsing Distant Metadata Using Semantic Signatures
|pdfUrl=https://ceur-ws.org/Vol-156/paper3.pdf
|volume=Vol-156
|dblpUrl=https://dblp.org/rec/conf/kcap/ChoiH05
}}
==Towards Browsing Distant Metadata Using Semantic Signatures==
Andrew Choi, Simon Fraser University, School of Interactive Arts and Technology, Surrey, BC, Canada — aschoi@sfu.ca
Marek Hatala, Simon Fraser University, School of Interactive Arts and Technology, Surrey, BC, Canada — mhatala@sfu.ca
ABSTRACT
In this document, we describe a lightweight ontology mediation method that allows users to send semantic queries to distant data repositories to browse for learning object metadata. In a collaborative E-learning community, member data repositories might use different ontologies to control the set of vocabularies describing topics in learning resources. This can hinder the search for learning resources based on local ontological concepts. Using WordNet, we develop a toolkit that indexes ontological concepts with WordNet senses for semantic browsing, in order to integrate information in a distributed learning community. The effectiveness of the toolkit was validated with real-world data in a specific domain, namely E-learning metadata.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information integration, retrieval models, search process

General Terms
Algorithms, Management, Experimentation, Verification

Keywords
Semantic Retrieval, Data Integration, Ontology Mediation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission by the copyright owners. Copyright 2005.

INTRODUCTION
With the advance of the Internet and the rapid development of E-learning, more and more institutions are joining to form distributed learning networks that allow users to access resources from different learning repositories. This creates pressure on institutions to provide an efficient way to organize a huge volume of materials located in different repositories, according to a consistent concept classification, in order to answer distributed retrieval requests. Currently, the use of metadata and ontologies to formalize the semantics of concepts in the E-learning domain does not completely resolve the problem of interoperability in a federated environment. This is because metadata in different repositories are very often annotated with concepts defined by different ontologies specific to their organizations or communities, which makes finding information based on a local conceptual framework difficult. Organizations with different backgrounds and target audiences may use different terms with similar semantics to define and describe two similar learning resources. In addition to ontological differences, linguistic variations in metadata values and the lack of a common metadata standard across the learning network make direct querying with keywords sometimes ineffective for discovering conceptually similar metadata.

PROBLEM DESCRIPTION
The primary objective of this research is to explore the use of semantic signatures expressed in WordNet senses to provide mediation between different ontologies in order to enhance concept retrieval. Consider a scenario in which a learner L1, associated with repository R1, is looking for learning resources on how to find a good bass, the musical instrument. L1 sends the request "search for bass" to remote repositories R2 and R3 in an E-learning network. However, the returned results are mixed with many irrelevant resources related to catching a bass, the fish. Such a problem occurs frequently when concepts are defined by different domain ontologies with different sets of vocabularies carrying different intended meanings. Imagine another case in which the same learner L1 sends a distributed request for learning resources on the topic "advanced databases". The topic is annotated with the concept "database systems II" in the remote repositories; that is to say, it is labelled differently. Therefore, in a concept-based label-matching search, learning resources defined by the concept "database systems II" will not be returned for the request "advanced databases", even though the two concepts are actually semantically equivalent.

From these simple scenarios, one can easily see that without a proper semantic mapping between the ontologies in heterogeneous data sources, even with an ontology to
define the vocabulary used to describe metadata on learning resources, it is still challenging to find learning resources based on the local conceptual definition.

OVERVIEW OF ONTOLOGY MAPPING
Semantic or ontology mapping can be described as a mapping task that identifies common concepts and establishes semantic relationships between heterogeneous data models in the same domain of discourse [1]. Since semantics is mostly defined by ontological constructs in modern knowledge systems, we will use the term semantic mapping interchangeably with ontology mapping in this discussion. According to [10], ontology mapping between two ontologies O1 = (C1, A1) and O2 = (C2, A2) can be expressed as a function f: C1 → C2 that semantically relates concepts of C1 to concepts of C2 such that A2 |= f(A1), i.e., all interpretations that satisfy the axioms in O2 also satisfy the translated axioms of O1. For example, if the concept agent (C1) is defined in O1 by a set of properties and axioms (ignoring other attributes and cardinality for the sake of simplicity), it is possible to map it to a concept representative (C2) defined in O2 with its own set of properties and axioms. This mapping assumes that all the semantic interpretations of C1 will be respected by C2 in the domain of discourse when executing logical inference operations on C2.

REVIEW OF OTHER APPROACHES
This section presents a brief overview of two approaches to semantic mapping: GLUE and MAFRA. The former is a system that employs machine-learning techniques to find ontology mappings using multiple probabilistic learners, while the latter uses a declarative representation of mappings as instances in a mapping ontology that defines bridging axioms to encode transformation rules. Given two domain ontologies, GLUE claims to find, for each concept in one ontology, the most similar concept in the other ontology [7]. A number of features distinguish GLUE from other similar mapping systems. First, unlike many mapping systems that incorporate only a single similarity function to determine whether two concepts are semantically related, GLUE utilizes multiple similarity functions to measure the closeness of two concepts based on the purpose of the mapping. The intuition behind the multiple similarity functions is to take advantage of the mapping requirement to relax or limit the choice of corresponding concepts. For instance, depending on the requirements of the application, the task of mapping the concept "associate professor" can be satisfied by the "exact", "most-specific-parent" or "most-general-child" similarity criteria to find "senior lecturer", "academic staff" or "John Cunningham" respectively. This gives GLUE the flexibility to find semantic mappings between ontologies. Second, GLUE applies a multi-strategy learning approach to use the information discovered by different classifiers during the training process. This approach divides the classification process into two phases. First, a set of base classifiers is developed to classify instances of concepts on different attributes with different algorithms. Then, the predictions of these base classifiers, assigned different weights representing their importance to overall accuracy, are combined to form a meta-learner. Finally, the classification is determined by the result from the meta-learner. For instance, one base learner can exploit the frequency of words in the name property using a Naïve Bayes learning technique, while another base learner can use pattern matching on another property using a Decision Tree induction technique. At the end, the meta-learner gathers all the results to form the final prediction. By using multiple classifiers, GLUE intends to increase the accuracy of the overall prediction. Third, GLUE incorporates label relaxation techniques into the matching process to exploit neighbouring nodes. Generally, relaxation labelling iteratively makes use of neighbouring features, domain constraints and heuristic knowledge to assign the label of a node.

MAFRA (MApping FRAmework) is another ontology mapping methodology, one that prescribes "all phases of the ontology mapping process, including analysis, specification, representation, execution and evolution" [14]. It uses a declarative representation approach to ontology mapping by creating a Semantic Bridging Ontology (SBO) that contains all concept mappings and the associated transformation rule information. In this model, given two ontologies (source and target), domain experts are required to examine and analyze the class definitions, properties, relations and attributes to determine the corresponding mapping and transformation methods. Then, all accumulated information is encoded into concepts in the SBO. The SBO therefore serves as an upper ontology that governs the mapping and transformation between the two ontologies. Each concept in the SBO consists of five dimensions: Entity, Cardinality, Structural, Constraint and Transformation. During the process of ontology mapping, a software agent inspects the values from the two given ontologies under these dimensions and executes the transformation process when the constraints are satisfied.

Some recent approaches, such as INRIA's alignment work (http://co4.inrialpes.fr/align/index.html), make use of the OWL API to build a set of alignment APIs with built-in WordNet functions for the purpose of ontology alignment or for the generation and transformation of axioms. However, the details of how WordNet is used to generate the alignments are not well documented in the published literature.
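The weighted combination of base classifiers in GLUE's multi-strategy learning can be sketched roughly as follows. This is a minimal illustration only, not GLUE's implementation: the classifier names, weights and scores below are invented for the example.

```python
# Sketch of a multi-strategy meta-learner: each base classifier scores
# candidate target concepts, and the meta-learner combines the scores
# by a weighted sum. All names and numbers here are illustrative.

def meta_learner(predictions, weights):
    """Combine per-classifier candidate scores by weighted sum and
    return the highest-scoring candidate concept."""
    combined = {}
    for classifier, scores in predictions.items():
        w = weights[classifier]
        for concept, score in scores.items():
            combined[concept] = combined.get(concept, 0.0) + w * score
    return max(combined, key=combined.get)

# Two hypothetical base learners: one scores candidates from word
# frequencies in the name property (a Naive Bayes-style signal), the
# other from a pattern match on another property (a decision-tree-style
# signal).
predictions = {
    "name_bayes":   {"senior lecturer": 0.7, "academic staff": 0.5},
    "pattern_tree": {"senior lecturer": 0.6, "academic staff": 0.8},
}
weights = {"name_bayes": 0.6, "pattern_tree": 0.4}

best = meta_learner(predictions, weights)
# weighted scores: senior lecturer 0.66, academic staff 0.62
```

Here the meta-learner picks "senior lecturer" because its weighted total (0.6·0.7 + 0.4·0.6 = 0.66) beats "academic staff" (0.62); changing the weights to favour the pattern-matching learner would flip the decision, which is the point of tuning weights to each classifier's reliability.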
WORDNET
WordNet is a widely recognized online lexical reference system, developed at Princeton University, whose design is inspired by "current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synsets (synonym sets), each representing one underlying lexical concept that is semantically identical to each other" [2]. Synsets are interlinked via relationships such as synonymy and antonymy, hypernymy and hyponymy (Subclass-Of and Superclass-Of), and meronymy and holonymy (Part-Of and Has-A) [3]. Each synset has a unique identifier (ID) and a specific definition. A synset may consist of only a single element, or it may have many elements, all describing the same concept. Each element in a particular synset's list is synonymous with all the other elements in that synset. For example, the synset {World Wide Web, WWW, Web} represents the concept of a computer network consisting of a collection of internet sites. In this context, 'World Wide Web', 'WWW' and 'Web' are all semantically equivalent. In cases where a single word has multiple meanings (polysemy), multiple separate and potentially unrelated synsets will contain the same word. For instance, the word 'Web' has seven meanings defined in WordNet, such as computer network, entanglement, and simply spider web.

OUR APPROACH
To help distributed learning repositories organize and manage their metadata in compliance with a global semantic view, we create a semantic mapping strategy that uses WordNet as a mediator to provide word sense disambiguation and to generate a semantic signature representing each learning resource category.

A semantic signature in the categorical browsing context can be defined as a logical grouping of representational word senses for a class of metadata. In essence, it is a semantic representation of a class label with important WordNet senses regarding context. Formally, the semantic signature can be written as follows:

  Sig(c) = ∪j DSj,  where DSj = BSdj = { Fav(WS(ti)) | ti ∈ T }

where
  Sig(c) = semantic signature for class c
  DSj = set of document senses for class c
  BSdj = set of best senses in document dj
  T = all keywords in document dj
  Fav = selection function to find the best sense
  WS(ti) = set of WordNet senses for term ti

To briefly explain, the semantic signature of a class of metadata is built from a set of important document senses from all documents (metadata records) belonging to a particular class. In turn, the document senses are generated from a collection of the best WordNet senses for all representational keywords of a particular document.

The generation of a semantic signature for a class of metadata is divided into four distinct phases. In the rest of this section, the general architecture of the methodology is described, and each phase is discussed in detail and illustrated with examples.

System Design and Architecture
The methodology for creating semantic signatures relies heavily on the assumption that the aggregate of all semantic information from the metadata records of a particular class is a good representation of the concept for that class. In fact, a metadata record is an instance of a concept in the ontological framework. Moreover, the methodology assumes that the semantic information of a class can be approximated by a set of important word senses from all metadata records. In addition, semantic word senses specific to the context can be found through WordNet, based on important terms extracted from the metadata. Finally, it assumes that the local semantic signature for a class of metadata is similar to the signatures for metadata of semantically equivalent concepts in distant repositories.

The methodology uses the k-Nearest Neighbour (kNN) search algorithm to classify semantically relevant concepts in distant repositories based on local semantic signatures [11]. The instances (metadata) of concepts in the local repository serve as the training dataset. Based on the semantic features of the local metadata, a semantic signature is formed for each class of concepts. To find semantically relevant concepts in distant repositories, a distance function is defined and used to measure the closeness between the query signature and the semantic signatures of concepts in distant repositories. Eventually, the k concepts most similar to the query signature are retrieved from the remote repositories.

Figure 1. Semantic Signature Generation Framework

Figure 1 shows the four phases of the semantic signature generation framework. In the Word Extraction phase, representative features are extracted from each metadata document. The Document Preprocessing phase eliminates all irrelevant information as well as all non-noun words. In the Document Vector Sensitization phase, all the
representative keywords are used as seeds to find the corresponding word senses from WordNet. Finally, in the Sense Selection phase, the best word sense is selected among all the senses to represent each word term.

Signature Generation in Action

Phase I: Word Extraction
First, the input metadata are transformed to comply with the IEEE LOM standard (http://ieeeltsc.org/wg12LOM/lomDescription) using an XML transformer. Then, the content of the title and description elements is extracted to represent the whole metadata document. This presumes that the content of these two elements carries enough weight as cue phrases to be able to represent the whole document [4]. This view seems reasonable in the case of learning object metadata because other elements, such as publication date, ISBN or format, do not bear good semantic information to signify the category of the metadata.

Phase II: Document Preprocessing
The condensed metadata, with only the title and description elements, are subjected to cleaning to remove all stopwords, punctuation, numerical values and irregular symbols. Next, all non-noun words are removed using a part-of-speech tagger, except some commonly used phrasal words which carry specific meaning. For example, the word "artificial" in the phrase "artificial intelligence" is preserved to retain the special meaning of the binary phrase in the branch of computer science. The reason this approach uses only nouns as base keywords is explained in [5], where it is noted that long phrases are not easily disambiguated compared to a single word term or a binary word term. Previous experiments demonstrated in [6] show that the accuracy of using a phrase as a distinguishing feature for document classification is in effect lower. On the other hand, it has been shown that noun word terms carry the most salient expression to serve as distinguishing features for text classification [7].

Phase III: Document Vector Sensitization
Once all irrelevant information has been eliminated, the physical metadata documents are projected onto the vector space model. The document vector becomes a logical representation of the physical metadata record. Then, using the TFIDF weighting scheme, we select the most significant terms across all document vectors to represent a category of metadata [12]. After that, each word term with a TFIDF score higher than a threshold is sent to WordNet to retrieve the corresponding word senses and their definitions. The threshold is determined by a trial-and-error approach with a test run. A single word term can have multiple retrieved word senses. For example, the word "search" can be mapped to several WordNet senses. Because of this, the mapping information of a single noun word term is denoted by a triple construct of the form ⟨T, S, D⟩, where T is the original word term, S is the synset of T and D is the definition of T. When a noun term can be mapped to multiple senses, there will be multiple triples. Take the word term "search" (TFIDF score 0.623101) as an example: after sensitization, it becomes a set of such triples. The triple construct format is used to substitute the original word term in the master document vector. Recall, however, that a single word term can be mapped to several different word senses through WordNet, and each word sense is represented by a synset which may have multiple synonymous terms. Because of this, the length of the document vector in word senses grows considerably. This problem is addressed in the next phase.

Phase IV: Sense Selection Strategy (S3)
This is the last and most crucial phase of the method. It chooses the best word sense among all the word senses retrieved from WordNet to represent the word term. As stated, a word term can be mapped to multiple WordNet senses, in which case the dimensionality of the vector grows significantly after the sensitization procedure. Imagine that the word term "light" can be mapped to 15 WordNet noun senses: "visible light", "light source", "luminosity", "lighting", etc. The growth ratio is 15 times in this case. Such a high dimensionality not only negatively affects the efficiency of the similarity computation but, more seriously, means that many of the senses are noise that does not carry the actual meaning of the word in the context of the document. Including irrelevant senses will distort the semantic representation of the signature and lower the accuracy of the similarity calculation when finding similar classes of metadata by signature matching. On the other hand, from the semantic knowledge standpoint, WordNet senses only provide the lexical information of a word term, not the contextual information that determines how its meaning is resolved in a specific context [8]. Without that, the semantic signature is just a bigger collection of keywords and would be of little use in identifying classes of metadata based on the semantic relevance of the signature. Therefore, it is necessary to find a way to reduce the dimensionality and select only the sense that conveys the main idea of the word in the current context. To select the best sense representing a word term, a contextual-based Sense Selection Strategy (S3) is applied to the retrieved word senses. The strategy is based on the assumption that the local contextual information of a document serves as a good hint as to which sense best represents the actual meaning of the word term. The S3 approach can be summarized in the following algorithm:
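Read alongside the formal steps below, the three-strategy cascade amounts to the following runnable sketch. It is illustrative only: the toy sense inventory stands in for WordNet, every name in it is invented, and the associative-frequency computation is simplified to counting synonym overlaps with the other terms' senses.

```python
# Illustrative sketch of the S3 cascade. TOY_SENSES is a stand-in for
# WordNet: each term maps to candidate senses ordered by popularity
# (most common first), each carrying its synonym set and its immediate
# (k=1) parent sense's synonyms. All entries are invented for the example.

TOY_SENSES = {
    "bass": [
        ("bass.fish", {"bass", "fish"}, {"seafood"}),
        ("bass.instrument", {"bass", "instrument"}, {"musical_instrument"}),
    ],
    "guitar": [
        ("guitar.instrument", {"guitar", "instrument"}, {"musical_instrument"}),
    ],
}

def associative_frequency(synonyms, other_terms):
    """Simplified af: count synonym overlaps with the other terms' senses."""
    return sum(
        len(synonyms & syns)
        for term in other_terms
        for _, syns, _ in TOY_SENSES[term]
    )

def select_best_sense(term, document_terms):
    others = [t for t in document_terms if t != term and t in TOY_SENSES]
    candidates = TOY_SENSES[term]
    # Strategy 1: score each candidate sense against the local context.
    best_af, best_sid = max(
        (associative_frequency(syns, others), sid)
        for sid, syns, _ in candidates
    )
    if best_af >= 1:
        return best_sid
    # Strategy 2: generalize to the immediate parent sense and rescore.
    best_af, best_sid = max(
        (associative_frequency(parent, others), sid)
        for sid, _, parent in candidates
    )
    if best_af >= 1:
        return best_sid
    # Strategy 3: fall back to the most popular sense.
    return candidates[0][0]

# In a document about instruments, "bass" resolves to its instrument
# sense; alone, it falls back to the most popular (fish) sense.
print(select_best_sense("bass", ["bass", "guitar"]))  # bass.instrument
```

With "guitar" in the same document, Strategy 1 already disambiguates "bass" to the instrument sense; with no co-occurring terms, the cascade falls through to Strategy 3 and returns the most popular sense, exactly as in the fallback step of the algorithm.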
Steps of the algorithm (calculate the best senses for class C1):

For each metadata document D ∈ C1
  Get the list of synsets for each word term T1 ∈ D
  For each synset Syn1 of the word term T1
    For each sense term Si ∈ Syn1
      1. Compute the associative frequency af of Si to the other senses Sk ∈ Synk, Synk ⊆ Tk and T1 ≠ Tk
         1.1 Find the sense Sl with the highest score Max(af)
         1.2 If Max(af) < 1 then go to 2; otherwise stop and return Sl
      2. Compute the associative frequency af of Si to the k-order parent senses PSk ∈ P(Synk), P(Synk) ⊆ Tk and T1 ≠ Tk
         2.1 Find the sense Sp with the highest score Max(af)
         2.2 If Max(af) < 1 then go to 3; otherwise stop and return Sp
      3. Return the most popular sense Sw offered by WordNet
  Return the best sense to represent word term T1
Aggregate all senses from all important word terms to represent the signature of the document D

Figure 2. Computing the associative frequency between the immediate parent sense and the other word senses (Strategy 2 generalizes a term t1 of the document vector t1 … tn to its parent's set of WordNet senses S1, S2, …, Sk)

The algorithm works in the following way. For each word sense of a word term, it first computes the associative frequency (af) of each sense term in a synset to the sense terms in the synsets of the other word terms in the same document. From this, the most frequently occurring word sense is used as the semantic representation of the word term.

Next, if the word sense of a word term cannot be discriminated by Strategy 1, the algorithm generalizes the word term to its k-order parent senses. In this approach, the value of k is 1; hence, it generalizes to the immediate parent word sense. Referring to Figure 2, Strategy 2 uses the immediate parent sense to compute the associative frequency against the senses of the other word terms in the document vector. As such, in this example the word term t1 is rolled up to its immediate parent through the hypernym (is-a) relation in the WordNet hierarchy. Then, the parent's synset is used to calculate the associative frequency to the word senses of the other word terms. Unlike other generalization approaches [7, 13], we generalize a sense to its most specific parent only. The reason for using the immediate parent senses (k=1) to compute the associative frequency is given in [9], where it is shown that the most specific parent in a hierarchical terminology has a higher distinctive power to classify the topic. This follows the intuition that if a word sense is generalized to a parent sense of higher order than k=1, the generalized sense may be too general, become incoherent with the local context, and turn into noise when used to classify metadata.

Finally, as arranged by WordNet, the word senses retrieved for a particular word form a partially ordered set ranked by popularity in English usage. If the previous two strategies cannot find the best sense to represent the word term, then the most popular sense offered by WordNet is adopted in Strategy 3.

The rationale behind sequencing the three strategies is the observation, and hypothesis, that the local context is the most specific and relevant candidate to provide contextual meaning for a word term's sense. Therefore, a word sense for a particular term can most likely be disambiguated by the other local senses (Strategy 1). If it cannot be resolved by Strategy 1, the algorithm compares the immediate parent sense to the other word senses to check whether the parent sense is a frequently occurring sense for the underlying word term (Strategy 2). At last, the most popular sense is adopted to represent the semantic meaning of a word term when the two strategies above cannot resolve its ambiguity (Strategy 3).

Following the above procedure, a set of senses becomes the semantic signature of a document. To generate the final semantic signature for a class of documents referring to a particular concept, the TFIDF scheme is applied again to each word sense in all document signatures for that class. Based on the scores, the most relevant senses for characterizing the class of metadata are aggregated to form the final signature for the class.

Concept browsing in heterogeneous ontologies
In our application, the generated semantic signatures are used to index the actual classes of metadata for fast distributed browsing. We developed a tool called the Signature Generation Indexer (SGI) that supports the methodology described in the previous section. Focusing on efficiency, SGI is designed to allow repository operators to produce semantic signatures for classes of learning object metadata easily, without tedious human interaction or complicated implementation.

The ultimate goal is to achieve semantic search based on E-learning topics defined by heterogeneous ontologies in a federated network. In a collaborative learning environment, users expect to be able to access all the learning resources within the learning network. To fulfil this expectation, it is important to assume that all participating repositories in the collaborative network employ the same strategy to index learning resource metadata with WordNet semantic signatures.

In this way, when a user launches a query by selecting a specific topic (concept) from the local ontology (e.g. via a user interface), the corresponding semantic signature representing the topic is retrieved from the local database. The signature is then sent across the network to participating
learning repositories. The query, in the form of a semantic signature, is the input to the Similarity Calculator in the distant repositories. The Similarity Calculator computes the similarity of signatures in each of the learning repositories. It uses the cosine similarity function, so the more matching elements in the signature, the higher the score. In calculating the similarity score, different weights are assigned to senses from the title and description elements, with a match on a title sense contributing more to the overall score than one from the description tag.

To ensure the global accuracy of the result, the results from the participating remote repositories are merged and sorted in descending order of cosine similarity score. Then, the top k (k=5) topics of the metadata are offered as the answer to the local query. The overall operation of the semantic-based browsing of learning resource metadata is shown in Figure 3.

Figure 3. Integrated process of semantic-based browsing of metadata

IMPLEMENTATION
The SGI is implemented in the C# programming language. The current version is a desktop application, but it can easily be extended to a web service. The goal of SGI is to integrate signature generation, document indexing and browsing capability. The signature indexes are stored in an inverted index database (e.g. MS Access). The similarity calculator is a separate module, also implemented in C#, connected to the index database. Figure 4 shows the browsing interface of SGI, which illustrates how to search distant concepts semantically.

Figure 4. Browsing interface of SGI

EVALUATION
In order to test the hypothesis that semantic signatures enable distributed semantic browsing and improve relevance, we simulated distributed concept retrieval and compared the results with traditional keyword-based and label-matching methods. To replicate the distributed repositories of a collaborative E-learning network, three independent databases were set up. As shown in Figure 5, they are called "local", "remote1" and "remote2", where local denotes a local data source and both remote1 and remote2 simulate distant data sources. A single master set of metadata in 8 different categories is distributed evenly in number, and randomly, into the three simulated repositories.

Figure 5. Dataset distribution into training and testing data (the master dataset of 2235 records is split into a training dataset in the local repository and testing datasets in remote1 and remote2, which feed the classifier)

The metadata have been transformed to conform to the IEEE LOM format. After the distribution, the local database contains the metadata that constitutes the training data for the classifier. During the training phase, the kNN classifier uses the instances of the local metadata to learn the features that identify the class of the metadata. It starts by extracting keyword terms from each category of metadata and projecting them into the vector space model. Next, after running through the signature generation module, each category of metadata is represented and indexed by a semantic signature in the database.

The datasets in remote1 and remote2 are controlled to model the situation of potentially different ontological classifications in a distributed environment. To simulate the effect of varied concept labelling, the original 8 categories of metadata are expanded to 14 categories in remote1. The 6 derived categories are labelled with class names different from those of their respective sources and described with metadata taken out of the source categories. Each newly derived category contains metadata belonging to the same class. To illustrate, part of the metadata from the category "computing science" is distributed to the derived categories "technology" and "engineering" in remote1. Thereby, the metadata for the concept "computing science" is now grouped
into "computing science", "technology" and "engineering". Essentially, this simulates the situation where a concept such as "computing science" could be categorized differently, into concepts like "technology" and "engineering", in a different ontology. The same distribution principle is applied to the remote2 database, which includes 13 categories, of which 7 are derived categories.

As in the local database, each category of metadata in remote1 and remote2 is mapped to a semantic signature in WordNet senses and stored in the local database as an index. To test semantic-based search, a semantic signature representing a local concept is sent to query the remote repositories. The semantic similarity between the query signature and each distant signature is computed with the similarity function. Finally, the k concept signatures from the remote databases most similar to the query are studied using the relevance metrics.

Table 1. Source and Category of Metadata

  Category           Source                                                         No. of records
  Accounting         Business Source Premier Publications                           382
  Biology            Biological and Agricultural Index, BioMed Central
                     Online Journals                                                315
  Computing Science  Citeseer                                                       320
  Economics          American Economic Association's electronic database            353
  Education          Educational Resource Information Center                        307
  Geography          Geobase                                                        237
  Mathematics        arXiv.org, MathSciNet                                          157
  Psychology         PsycINFO, ERIC                                                 164

Dataset
Since there is no publicly available dataset of learning resource metadata, the experimental metadata were acquired from a number of different sources. Table 1 shows the categories of metadata acquired and their respective sources. In total, 2235 metadata records subdivided into the 8 different categories were acquired. The dataset is partitioned into training and testing groups. As mentioned, the local database stores the training dataset while remote1 and remote2 store the testing dataset. All metadata are known with their class labels. The metadata are distributed randomly, using the Microsoft Excel random generator, into the training and testing groups. After distribution, the local database contains 667 training records while remote1 and remote2 together contain 1568 testing records.

Results
In order to gauge the effectiveness of the proposed mediation method between different E-learning ontologies, three standard information retrieval metrics are used in the evaluation of the system performance: Recall, Precision and F-measure. Table 2 shows that the use of semantic signatures consistently improves retrieval relevance in terms of recall and precision. In all categories, semantic-based retrieval outperforms both keyword-based retrieval and label-matching retrieval.

Table 2. Comparison on precision, recall and F-measure on

As opposed to the classic or traditional keyword-based representation, semantic-based indexing with WordNet senses can include more lexical information than a simple syntactic approach. This implies that more features will be added to the class signature representation. Since more features are added, this may also mean that more noise is included.

Intuitively, the increased relevance of retrieval can be attributed to the expansion of features in the class representation. However, contrary to what we expected, the precision has not decreased. We suspect that, due to the relatively small size of the dataset and the k=1 hypernym generalization, the senses included in the signature are 'good' in terms of classification. Therefore, combined with a good contextual-based sense selection strategy, WordNet as a mediator can provide a source for ambiguity resolution and semantic information for the process of semantic browsing. Coupled with that, the selection of the kNN algorithm as the classifier also contributes to the performance of the system.

kNN is an instance-based classifier. The performance of instance-based classifiers is more dependent on the sufficiency of the training set than that of other machine learning classification algorithms. Thus, it is a disadvantage for kNN to have a small dataset for training and testing. A smaller training set implies that more terms or term combinations important for content identification may be missing from the training sample documents. This
concept retrieval negatively affects the performance of a classifier.
Cate Precision Recall F-Measure Nevertheless, the ontology (e.g. WordNet) guided
gory S K L S K L S K L approach seems to somewhat reduce the negative influence
of this problem. The replacement of child concepts with
Acc 1 0.6 0.5 1 0.75 0.5 1 0.6 0.5
Bio 0.6 0.6 0.5 0.75 0.6 0.5 0.6 0.6 0.5
parent concept through hypernym relationship appears to
CS 1 0.5 0.3 1 0.5 0.3 1 0.5 0.3 be able to discover an optimum concept set without
Econ 1 1 0.6 1 0.75 0.6 1 0.6 0.6 adversely affecting performance. Therefore, an important
Educ 0.6 0.5 0.5 0.75 0.75 0.5 0.6 0.45 0.5 term, which resides low in the concept hierarchy may be
Geo 0.6 0.5 0.5 0.75 0.5 0.5 0.6 0.5 0.5 mapped to a parent concept and included in the signature
Math 1 0.3 0.6 0.6 0.5 0.6 0.7 0.36 0.6
for class comparison, even if this term is not included in the
Psy 1 0.3 0.3 0.6 0.6 0.3 0.7 0.4 0.3
S = Signature-based retrieval, K = Keywords-based, L = Label-matching training set.
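The parent-concept replacement just described can be sketched in a few
lines. This is an illustrative sketch only: the hypernym map and sense
names below are toy stand-ins for WordNet's hierarchy, not the system's
actual data.

```python
# Illustrative one-level hypernym generalization: each sense in a class
# signature is replaced by its immediate parent sense (when one exists),
# so rare sibling senses merge and can still match at the parent level.
# The HYPERNYM map and sense names are hypothetical examples.
from collections import Counter

HYPERNYM = {  # child sense -> immediate parent sense (toy data)
    "poodle.n.01": "dog.n.01",
    "beagle.n.01": "dog.n.01",
    "dog.n.01": "canine.n.02",
}

def generalize(signature, hypernym=HYPERNYM):
    """Map each sense to its immediate hypernym if known, merging the
    weights of sibling senses that share the same parent."""
    out = Counter()
    for sense, weight in signature.items():
        out[hypernym.get(sense, sense)] += weight
    return dict(out)

sig = {"poodle.n.01": 2, "beagle.n.01": 1, "cat.n.01": 1}
print(generalize(sig))  # {'dog.n.01': 3, 'cat.n.01': 1}
```

Note how the two dog breeds collapse into a single parent sense with
combined weight: this is the dimensionality reduction and sibling-level
matching the discussion refers to.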
DISCUSSION
The improvement in concept retrieval from using semantic signatures is
not uniform across categories. For example, the improvement on retrieval
of "Psychology" and "Accounting" metadata is larger than the improvement
on "Biology" and "Geography". We believe that for some classes of
metadata, such as "Biology", which are characterised by a set of specific
keywords, semantic signatures do not add extra useful information to the
representation model to help classify the metadata. Moreover, applying
1-k hypernym generalization to such a highly specialized domain may in
fact introduce more noise and reduce the matching possibility in
similarity calculations. In addition, with a small dataset, over-fitting
of the classification model may also result. Further experimentation and
analysis are therefore needed to fully understand the impact of WordNet
signatures with sense generalization on the classification of metadata.

CONCLUSION
This project offers two important contributions. First, it provides a new
lightweight semantic (ontology) mapping approach that enables
cross-platform concept browsing in a federated network. Unlike many
current practices in semantic mapping, which either require intensive
user involvement to provide mapping information or resort to complicated
heuristic or rule-based machine learning approaches, this work
demonstrates an effective automatic mapping protocol that allows
federated concept browsing with semantic signatures. It is evident from
the experimental results that WordNet has merit as a provider of semantic
knowledge for metadata classification in the E-learning domain. The
merits include the provision of a semantic representation of categorical
data and increased semantic relevance in categorical browsing.

Second, using immediate-parent sense generalization during the sense
selection process not only reduces the dimensionality of the semantic
signature but, more importantly, introduces flexibility into sense
selection and increases the opportunity to find a better sense without
compromising the relevance of the search results. This creates an
incentive to explore other sense selection strategies.
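To make the signature-matching step concrete, the following minimal
sketch compares a local concept's semantic signature against remote
concept signatures with cosine similarity and returns the k most similar
remote concepts. All sense IDs, weights and concept names here are
invented for illustration; the paper's actual similarity function and
signatures may differ.

```python
# Illustrative signature matching: a signature is a weighted bag of
# WordNet sense IDs; remote concepts are ranked by cosine similarity
# to the local query signature and the top-k are returned.
import math

def cosine(a, b):
    """Cosine similarity between two sparse sense-weight dictionaries."""
    dot = sum(w * b.get(s, 0.0) for s, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_sig, remote_sigs, k=3):
    """Return the names of the k remote concepts most similar to the query."""
    ranked = sorted(remote_sigs.items(),
                    key=lambda item: cosine(query_sig, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy data: a local "computing science" signature queried against
# three remote concept signatures.
local = {"computer_science.n.01": 0.7, "technology.n.01": 0.3}
remote = {
    "technology":  {"technology.n.01": 0.8, "engineering.n.02": 0.2},
    "engineering": {"engineering.n.02": 0.9, "technology.n.01": 0.1},
    "biology":     {"biology.n.01": 1.0},
}
print(top_k(local, remote, k=2))  # ['technology', 'engineering']
```

Because signatures from different repositories are expressed in the same
WordNet sense space, no pairwise ontology alignment is needed: any two
signatures are directly comparable.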