Towards Browsing Distant Metadata Using Semantic Signatures

Andrew Choi (aschoi@sfu.ca) and Marek Hatala (mhatala@sfu.ca)
School of Interactive Arts and Technology, Simon Fraser University, Surrey, BC, Canada

ABSTRACT
In this paper we describe a lightweight ontology mediation method that allows users to send semantic queries to distant data repositories to browse learning object metadata. In a collaborative E-learning community, member data repositories may use different ontologies to control the vocabularies describing topics in learning resources, which hinders searching for learning resources based on local ontological concepts. Using WordNet, we develop a toolkit that indexes ontological concepts with WordNet senses for semantic browsing, in order to integrate information across a distributed learning community. The effectiveness of the toolkit was validated with real-world data in a specific domain, namely E-learning metadata.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information integration, retrieval models, search process

General Terms
Algorithms, Management, Experimentation, Verification

Keywords
Semantic Retrieval, Data Integration, Ontology Mediation

INTRODUCTION
With the advance of the Internet and the rapid development of E-learning, more and more institutions are joining distributed learning networks that allow users to access resources from different learning repositories. This creates pressure on institutions to provide an efficient way to organize the huge volume of materials located in different repositories, according to a consistent concept classification, in order to answer distributed retrieval requests. Currently, the use of metadata and ontologies to formalize the semantics of concepts in the E-learning domain does not completely resolve the problem of interoperability in a federated environment. This is because metadata in different repositories are very often annotated with concepts defined by ontologies specific to their organizations or communities, which makes finding information based on a local conceptual framework difficult. Organizations with different backgrounds and target audiences may use different terms with similar semantics to define and describe two similar learning resources. In addition to ontological differences, linguistic variation in metadata values and the inconsistent use of metadata standards across a learning network make direct keyword querying ineffective for discovering conceptually similar metadata.

PROBLEM DESCRIPTION
The primary objective of this research is to explore the use of semantic signatures expressed in WordNet senses to provide mediation between different ontologies and thereby enhance concept retrieval. Consider a scenario in which learner L1, associated with repository R1, is looking for learning resources on the topic of how to find a good bass musical instrument. L1 sends the request "search for bass" to remote repositories R2 and R3 in an E-learning network. However, the returned results are mixed with many irrelevant resources related to catching a bass (the fish). Such a problem occurs frequently when concepts are defined by different domain ontologies whose vocabularies carry different intended meanings. Imagine another case in which the same learner L1 sends a distributed request for learning resources on the topic "advanced databases". In the remote repositories, the topic is annotated with the concept "database systems II"; that is to say, it is labelled differently.
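The two failure scenarios above can be sketched in a few lines. The following is an illustrative Python sketch (not the paper's code); the concept labels and sense identifiers are hypothetical, chosen only to show why plain label matching fails while matching on shared sense identifiers can succeed.

```python
# Each repository annotates topics with its own labels, but labels can be
# grounded in shared word-sense identifiers (e.g. WordNet synset IDs).
local_topics = {
    "advanced databases": {"database.n.01", "advanced.a.01"},
}
remote_topics = {
    "database systems II": {"database.n.01", "system.n.01"},
    "bass fishing":        {"bass.n.07", "fishing.n.01"},   # the fish sense
    "bass guitar basics":  {"bass.n.02", "guitar.n.01"},    # the music sense
}

def label_match(query, topics):
    # Plain label matching: only exact string equality counts.
    return [t for t in topics if t == query]

def sense_match(query_senses, topics):
    # Rank remote topics by the number of shared sense identifiers.
    scored = [(len(query_senses & s), t) for t, s in topics.items()]
    return [t for score, t in sorted(scored, reverse=True) if score > 0]

print(label_match("advanced databases", remote_topics))   # no hits at all
print(sense_match(local_topics["advanced databases"], remote_topics))
# 'database systems II' is found because it shares the database sense
```

Label matching returns nothing for "advanced databases", while sense-level matching recovers the semantically equivalent concept despite the different label.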
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission by the copyright owners. Copyright 2005.

Therefore, in a concept-based label-matching search, learning resources defined by the concept "database systems II" will not be returned for the request "advanced databases", even though the two concepts are semantically equivalent.

From these simple scenarios, one can easily see that without a proper semantic mapping between the ontologies of heterogeneous data sources, it is still challenging to find learning resources based on a local conceptual definition, even when an ontology defines the vocabulary used to describe the learning resource metadata.

OVERVIEW OF ONTOLOGY MAPPING
Semantic or ontology mapping can be described as a mapping task that identifies common concepts and establishes semantic relationships between heterogeneous data models in the same domain of discourse [1]. Since semantics is mostly defined by ontological constructs in modern knowledge systems, we use the term semantic mapping interchangeably with ontology mapping in this discussion. According to [10], an ontology mapping between two ontologies O1 = (C1, A1) and O2 = (C2, A2) can be expressed as a function f: C1 → C2 that semantically relates concepts of C1 to concepts of C2 such that A2 |= f(A1), i.e. all interpretations that satisfy the axioms of O2 also satisfy the translated axioms of O1. For example, if the concept agent (C1) is defined in O1 by a set of properties and axioms (ignoring other attributes and cardinality for the sake of simplicity), it may be possible to map it to a concept representative (C2) defined in O2 with its own set of properties and axioms. This mapping assumes that all semantic interpretations of C1 will be respected by C2 in the domain of discourse when logical inference operations are executed on C2.

REVIEW OF OTHER APPROACHES
This section presents a brief overview of two approaches to semantic mapping: GLUE and MAFRA. The former is a system that employs machine-learning techniques, using multiple probabilistic learners, to find ontology mappings; the latter uses a declarative representation of mappings as instances of a mapping ontology that defines bridging axioms encoding transformation rules.

Given two domain ontologies, GLUE claims to find, for each concept in one ontology, the most similar concept in the other [7]. A number of features distinguish GLUE from other mapping systems. First, unlike many mapping systems that incorporate only a single similarity function to determine whether two concepts are semantically related, GLUE utilizes multiple similarity functions to measure the closeness of two concepts based on the purpose of the mapping. The intuition behind multiple similarity functions is to use the mapping requirement to relax or limit the choice of corresponding concepts. For instance, depending on the requirements of the application, the task of mapping the concept "associate professor" can be satisfied by the similarity criteria "exact", "most-specific-parent" or "most-general-child" to find "senior lecturer", "academic staff" or "John Cunningham" respectively. This gives GLUE flexibility in finding semantic mappings between ontologies. Second, GLUE applies a multi-strategy learning approach to combine information discovered by different classifiers during training. The classification process is divided into two phases. First, a set of base classifiers is trained to classify instances of concepts on different attributes with different algorithms. Then the predictions of these base classifiers, weighted according to their importance to overall accuracy, are combined by a meta-learner, and the final classification is determined by the meta-learner's result. For instance, one base learner can exploit the frequency of words in the name property using a Naïve Bayes learning technique, while another uses pattern matching on another property with decision tree induction; the meta-learner then gathers all the results to form the final prediction. By using multiple classifiers, GLUE intends to increase the accuracy of the overall prediction. Third, GLUE incorporates relaxation-labelling techniques into the matching process to exploit neighbouring nodes. Generally, relaxation labelling iteratively makes use of neighbouring features, domain constraints and heuristic knowledge to assign labels to nodes.

MAFRA (MApping FRAmework) is another ontology mapping methodology, one that prescribes "all phases of the ontology mapping process, including analysis, specification, representation, execution and evolution" [14]. It takes a declarative approach to ontology mapping by creating a Semantic Bridging Ontology (SBO) that contains all concept mappings and the associated transformation rule information. In this model, given a source and a target ontology, domain experts examine and analyze the class definitions, properties, relations and attributes to determine the corresponding mapping and transformation method; the accumulated information is then encoded as concept instances in the SBO. The SBO therefore serves as an upper ontology governing the mapping and transformation between the two ontologies. Each concept in the SBO has five dimensions: Entity, Cardinality, Structural, Constraint and Transformation. During ontology mapping, a software agent inspects the values from the two given ontologies along these dimensions and executes the transformation process when the constraints are satisfied.

Some recent approaches, such as the INRIA alignment API (http://co4.inrialpes.fr/align/index.html), build a set of alignment APIs on top of the OWL API, with built-in WordNet functions, for ontology alignment and the generation and transformation of axioms. However, the details of how WordNet is used to generate the alignments are not well documented in the published literature.

WORDNET
WordNet is a widely recognized online lexical reference system, developed at Princeton University, whose design is inspired by "current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synsets (synonym sets), each representing one underlying lexical concept" [2]. Synsets are interlinked via relationships such as synonymy and antonymy, hypernymy and hyponymy (Subclass-Of and Superclass-Of), and meronymy and holonymy (Part-Of and Has-A) [3]. Each synset has a unique identifier (ID) and a specific definition. A synset may consist of a single element, or it may have many elements all describing the same concept; each element in a synset is synonymous with every other element in that synset. For example, the synset {World Wide Web, WWW, Web} represents the concept of a computer network consisting of a collection of internet sites; in this context, 'World Wide Web', 'WWW' and 'Web' are all semantically equivalent. Where a single word has multiple meanings (polysemy), multiple separate and potentially unrelated synsets contain the same word. For instance, the word 'Web' has several meanings defined in WordNet, including computer network, entanglement, and simply spider web.

Figure 1. Semantic Signature Generation Framework
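The synset structure described above can be modelled with a small data type. The following is our own toy model, not WordNet's actual API; the sense identifiers, lemma sets and hypernym links are illustrative stand-ins for the real database entries.

```python
# A toy model of WordNet-style synsets, showing how polysemy yields
# several unrelated synsets that all contain the same word.
from dataclasses import dataclass

@dataclass(frozen=True)
class Synset:
    sid: str            # unique synset identifier
    lemmas: frozenset   # synonymous terms describing one concept
    gloss: str          # definition
    hypernym: str = ""  # ID of the immediate parent (is-a) synset, if any

SYNSETS = {
    "web.n.01": Synset("web.n.01", frozenset({"World Wide Web", "WWW", "Web"}),
                       "a collection of internet sites", "computer_network.n.01"),
    "web.n.02": Synset("web.n.02", frozenset({"web", "entanglement"}),
                       "an intricate trap", ""),
    "web.n.03": Synset("web.n.03", frozenset({"web", "spider web"}),
                       "a structure spun by spiders", ""),
}

def senses_of(word):
    """All synsets whose lemma set contains the word (case-insensitive)."""
    w = word.lower()
    return [s for s in SYNSETS.values()
            if any(w == lemma.lower() for lemma in s.lemmas)]

print([s.sid for s in senses_of("web")])   # three unrelated senses of 'web'
```

Looking up "web" returns three distinct synsets, which is exactly the ambiguity the sense selection strategy later in the paper has to resolve.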
represents the concept of computer network consisting of a The generation of a semantic signature for a class of collection of internet sites. In this context, 'World Wide metadata is divided into three distinct phases. In the rest of Web', ‘WWW’ and 'Web' are all semantically equivalent. this section, the general architecture of the methodology is For cases where a single word has multiple meanings described while each phase is discussed in detail and as (polysemy), multiple separate and potentially unrelated well as illustrated with examples. synsets will contain the same word. For instance, the word ‘Web’ can have 7 multiple meanings defined in WordNet System Design and Architecture as computer network, entanglement, simply spider web and The methodology for creating semantic signature relies etc. heavily on the assumptions that the aggregates of all semantic information from metadata records of a particular OUR APPROACH class are a good representation of the concept for that class. To help distributed learning repositories to organize and In fact, the metadata record is an instance of a concept in manage their metadata in compliance with a global the ontological framework. Moreover, the methodology semantic view, we create a semantic mapping strategy assumes that semantic information of a class can be using WordNet as a mediator to provide word sense approximated by a set of important word senses from all disambiguation and to generate semantic signature each metadata records. Besides, semantic word senses specific to representing learning resource category. the context can be found based on important terms Semantic signature in the categorical browsing context can extracted from metadata through WordNet. Finally yet be defined as a logical grouping of representational word importantly, it assumes that the local semantic signature for senses for a class of metadata. 
In essence, it is a semantic a class of metadata is similar to signatures for metadata of representation of a class label with important WordNet semantically equivalent concepts in distant repositories. senses regarding context. To formalize the concept of The methodology uses k-Nearest Neighbour (kNN) search semantic signature, it can be written as follows: algorithm to classify semantically relevant concepts in distant repositories based on local semantic signatures [11]. The instances (metadata) of concepts in local repository serve as the training dataset. Based on semantic features of the local metadata, semantic signatures for each class of where Sig (c ) = semantic signature for class c concepts are formed. To find semantically relevant DSj = set of document senses for class c concepts in distant repositories, a distance function is BSdi = set of best sense in document dj defined and used to measure closeness between the query signature and semantic signatures for concepts in distant T = all keywords in document dj repositories. Eventually, k most similar concepts to the Fav = selection function to find best sense query signature will be retrieved from remote repositories. WS(t) = set of WordNet sense for term ti Figure 1 shows the four phases of the semantic signature generation framework. In the Word Extraction To briefly explain, semantic signature of a class of phase representative features are extracted from each metadata is built from a set of important document senses metadata document. The Document Preprocessing phase from all documents (metadata records) belonging to a eliminates all irrelevant information as well as all non-noun particular class. In turn, document senses are generated words. In the Document Vector Sensitization phase all the 12 representative keywords are used as seeds to find the multiple word senses retrieved. For example, the word corresponding word senses from WordNet. 
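The signature definition can be sketched directly in code. This is a minimal illustration under stated assumptions: `best_sense` is a hypothetical stand-in for the paper's Fav selection function (a real implementation would use the S3 strategy described later), and the sense inventory is toy data.

```python
# Sig(c): the signature of a class is aggregated from the best sense
# chosen for each keyword of each metadata document in the class.

def best_sense(term, wordnet_senses, context):
    # Hypothetical stand-in for Fav: pick the candidate sense whose
    # synonym/gloss terms overlap most with the document's other keywords.
    candidates = wordnet_senses.get(term, [])
    if not candidates:
        return None
    return max(candidates, key=lambda s: len(s[1] & context))

def class_signature(documents, wordnet_senses):
    """Sig(c) = union over documents d of { Fav(WS(t)) for each keyword t in d }."""
    signature = set()
    for keywords in documents:
        context = set(keywords)
        for term in keywords:
            s = best_sense(term, wordnet_senses, context - {term})
            if s is not None:
                signature.add(s[0])     # keep the chosen sense ID
    return signature

# Toy sense inventory: each sense is an ID paired with its related terms.
WS = {
    "bass":   [("bass.n.02", {"music", "instrument", "guitar"}),
               ("bass.n.07", {"fish", "catch"})],
    "guitar": [("guitar.n.01", {"music", "instrument"})],
}
docs = [["bass", "guitar"], ["guitar"]]
print(class_signature(docs, WS))   # the music sense of 'bass' wins
```

Because "guitar" appears in the same document, the music sense of "bass" is selected over the fish sense, and the class signature becomes a set of disambiguated sense IDs rather than raw keywords.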
Finally, in the Sense Selection phase, the best word sense is selected among all retrieved senses to represent each word term.

Signature Generation in Action

Phase I: Word Extraction
First, the input metadata are transformed to comply with the IEEE LOM standard (http://ieeeltsc.org/wg12LOM/lomDescription) using an XML transformer. Then the content of the <Title> and <Description> elements is extracted to represent the whole metadata document. This presumes that the content of these two elements carries enough weight, as cue phrases, to represent the whole document [4]. This view seems reasonable for learning object metadata, because other elements such as publication date, ISBN or format do not carry good semantic information for identifying the category of the metadata.

Phase II: Document Preprocessing
The condensed metadata, containing only the <Title> and <Description> elements, are cleaned to remove all stopwords, punctuation, numerical values and irregular symbols. Next, all non-noun words are removed using a part-of-speech tagger, except for some commonly used phrasal words which carry specific meaning. For example, the word "artificial" in the phrase "artificial intelligence" is preserved to retain the special meaning of the binary phrase in computer science. The reason this approach uses only nouns as base keywords is explained in [5], which argues that long phrases are not easily disambiguated compared to single or binary word terms; previous experiments demonstrated in [6] show that the accuracy of using a phrase as a distinguishing feature for document classification is in effect lower. On the other hand, it has been shown that noun terms carry the most salient expression to serve as distinguishing features for text classification [7].

Phase III: Document Vector Sensitization
Assuming all irrelevant information has been eliminated, the physical metadata documents are projected onto the vector space model; the document vector becomes a logical representation of the physical metadata record. Then, using the TFIDF weighting scheme, we select the most significant terms across all document vectors to represent a category of metadata [12]. After that, each word term with a TFIDF score above a threshold is sent to WordNet to retrieve its corresponding word senses and their definitions. The threshold is determined by a trial-and-error approach with a test run. A single word term can have multiple retrieved word senses; for example, the word "search" maps to several WordNet senses. The mapping information of a single noun term can therefore be denoted by a triple construct <T, S, D>, where T is the original word term, S is a synset of T and D is the definition of that synset; when a noun term maps to multiple senses, there are multiple triples. Taking the word term "search" as an example, after sensitization it is replaced by one such triple per retrieved sense, each carrying the term's TFIDF score (0.623101 in this example). The triple construct substitutes for the original word term in the master document vector. Recall, however, that a single word term can map to several different word senses through WordNet, and each word sense is represented by a synset that may contain multiple synonymous terms. Because of this, the length of the document vector in the word-sense representation grows considerably. This problem is addressed in the next phase.

Phase IV: Sense Selection Strategy (S3)
This is the last, and the most crucial, phase of the method. It chooses, among all word senses retrieved from WordNet, the best one to represent each word term. As stated, a word term can map to multiple WordNet senses; in such a case, the dimensionality of the vector grows significantly after the sensitization procedure. Imagine that the word term "light" maps to 15 WordNet noun senses ("visible light", "light source", "luminosity", "lighting", etc.); the growth ratio is 15 times in this case. Such high dimensionality not only negatively affects the efficiency of the similarity computation but, more seriously, many of the senses are noise that does not carry the actual meaning of the word in the context of the document. Including irrelevant senses distorts the semantic representation of the signature and lowers the accuracy of the similarity calculation when finding similar classes of metadata by signature matching. Furthermore, from the semantic knowledge standpoint, WordNet senses provide only the lexical information of a word term, not the contextual information needed to determine its meaning in a specific context [8]. Without that, the semantic signature is just a bigger collection of keywords and is of little use in identifying classes of metadata by the semantic relevance of the signature. Therefore, it is necessary to reduce the dimension and select only the sense that conveys the main idea of the word in the current context. To select the best sense representing a word term, a contextual Sense Selection Strategy (S3) is applied to the retrieved word senses. The strategy is based on the assumption that the local contextual information of a document serves as a good hint as to which sense best represents the actual meaning of the word term. The S3 approach can be summarized in the following algorithm:

Steps of the algorithm (calculate the best senses for class C1):

    For each metadata document D ∈ C1
      For each word term T1 ∈ D, get the list of synsets
        For each synset Syn1 of the word term T1
          For each sense term Si ∈ Syn1
            1. Compute the associative frequency af of Si against the other
               senses Sk ∈ Synk, Synk ⊆ Tk and T1 ≠ Tk
               1.1 Find the sense Sl with the highest score Max(af)
               1.2 If Max(af) < 1 then go to 2; otherwise stop and return Sl
            2. Compute the associative frequency af of Si against the k-order
               parent senses PSk ∈ P(Synk), P(Synk) ⊆ Tk and T1 ≠ Tk
               2.1 Find the sense Sp with the highest score Max(af)
               2.2 If Max(af) < 1 then go to 3; otherwise stop and return Sp
            3. Return the most popular sense Sw offered by WordNet
          Return the best sense to represent word term T1
    Aggregate the senses of all important word terms to represent the
    signature of document D

Figure 2. Computing the associative frequency between the immediate parent sense and other word senses (Strategy 2: a word term t1 in the document vector is generalized to its parent synset, which is compared against the senses of the other terms).

The algorithm works in the following way. For each word sense of a word term, it first computes the associative frequency (af) of each sense term in a synset against the sense terms in the synsets of the other word terms in the same document.

The rationale behind sequencing the three strategies is the observation and hypothesis that the local context is the most specific and relevant candidate to provide contextual meaning for a word sense. A word sense for a particular term can therefore most likely be disambiguated by the other local senses (Strategy 1). If it cannot be resolved by Strategy 1, the algorithm compares the immediate parent sense with the other word senses to check whether the parent sense is a frequently co-occurring sense for the underlying word term (Strategy 2). At last, the most popular sense is adopted to represent the semantic meaning of the word term when the two strategies above cannot resolve its ambiguity (Strategy 3). Following the above procedure, a set of senses becomes the semantic signature of a document.
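The three-stage selection can be sketched compactly. This is our reading of the algorithm, with a deliberately simplified data model: each sense is an ID plus a set of related terms, the sense lists are assumed to be ordered by WordNet popularity, and the hypernym table is a plain dictionary.

```python
# Sketch of S3: Strategy 1 scores each candidate synset against the other
# terms' senses in the document; Strategy 2 retries with the immediate
# (k = 1) hypernym; Strategy 3 falls back to the most popular sense.

def assoc_freq(synset_terms, other_sense_terms):
    # How many of the candidate's terms appear among the other senses.
    return sum(1 for t in synset_terms if t in other_sense_terms)

def select_sense(term, senses, parents, document_terms):
    candidates = senses[term]                  # assumed ordered by popularity
    others = set()
    for other in document_terms:
        if other != term:
            for _, terms in senses.get(other, []):
                others |= set(terms)
    # Strategy 1: disambiguate directly against the local context.
    scored = [(assoc_freq(terms, others), sid) for sid, terms in candidates]
    best = max(scored)
    if best[0] >= 1:
        return best[1]
    # Strategy 2: generalize each candidate to its immediate parent and retry.
    scored = [(assoc_freq(parents.get(sid, set()), others), sid)
              for sid, _ in candidates]
    best = max(scored)
    if best[0] >= 1:
        return best[1]
    # Strategy 3: the most popular sense wins.
    return candidates[0][0]

SENSES = {
    "bass":   [("bass.n.07", {"fish", "catch"}),          # most popular first
               ("bass.n.02", {"music", "instrument"})],
    "guitar": [("guitar.n.01", {"music", "instrument"})],
}
print(select_sense("bass", SENSES, {}, ["bass", "guitar"]))  # bass.n.02
```

Even though the fish sense of "bass" is the most popular, Strategy 1 picks the music sense because its terms co-occur with the senses of "guitar" in the same document; only when both context strategies fail does popularity decide.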
From this, the most frequently co-occurring word sense is used as the semantic representation of the word term.

Next, if the word sense of a word term cannot be discriminated by Strategy 1, the algorithm generalizes the word term to its k-order parent senses. In this approach the value of k is 1; hence it generalizes to the immediate parent word sense. Referring to Figure 2, Strategy 2 uses the immediate parent sense to compute the associative frequency against the senses of the other word terms in the document vector. In this example, the word term t1 is rolled up to its immediate parent through the hypernym (is-a) relation in the WordNet hierarchy, and the parent's synset is then used to calculate the associative frequency against the other word senses. Unlike other generalization approaches [7, 13], we generalize a sense to its most specific parent only. The reason for using immediate parent senses (k = 1) to compute the associative frequency is given in [9]: the most specific parent in a hierarchical terminology has the highest distinctive power for classifying the topic. Intuitively, if a word sense is generalized to a parent of higher order than k = 1, the generalized sense may be too general, become incoherent with the local context, and turn into noise when used to classify metadata.

Finally, as arranged by WordNet, the word senses retrieved for a particular word form a partially ordered set ranked by popularity in English usage. If the previous two strategies cannot find the best sense to represent the word term, the most popular sense offered by WordNet is adopted in Strategy 3.

In order to generate the final semantic signature for a class of documents referring to a particular concept, the TFIDF scheme is applied again to each word sense in all document signatures for that class. Based on the scores, the most relevant senses for characterizing the class of metadata are aggregated to form the final signature of the class.

Concept browsing in heterogeneous ontologies
In our application, the generated semantic signatures are used to index the actual classes of metadata for fast distributed browsing. We developed a tool called the Signature Generation Indexer (SGI) that supports the methodology described in the previous section. Focusing on efficiency, SGI is designed to allow repository operators to produce semantic signatures for classes of learning object metadata easily, without tedious human interaction or complicated implementation.

The ultimate goal is to achieve semantic search over E-learning topics defined by heterogeneous ontologies in a federated network. In a collaborative learning environment, users expect to be able to access all the learning resources within the learning network. To fulfill this expectation, it is important to assume that all participant repositories in the collaborative network employ the same strategy to index learning resource metadata with WordNet semantic signatures.

In this way, when a user launches a query by selecting a specific topic (concept) from the local ontology (e.g. via a user interface), the corresponding semantic signature representing the topic is retrieved from the local database. The signature is then sent across the network to the participating learning repositories. The query, in the form of a semantic signature, is the input to the Similarity Calculator in each distant repository, which computes the similarity of signatures in each of the learning repositories. The similarity calculator uses the cosine similarity function, so the more matching elements two signatures share, the higher the score.

Figure 3. Integrated process of semantic-based browsing of metadata
Figure 5. Dataset distributions into training and testing data (the master dataset of 2235 records is split across the local, remote1 and remote2 repositories)
In calculating the similarity score, different weights are assigned to the senses from <Title> and <Description>: a match on a title sense contributes more to the overall score than a match on a description sense.

In order to ensure the global accuracy of the result, the results from the participating remote repositories are merged and sorted in descending order of cosine similarity score. The top k (k = 5) metadata topics are then offered as the answer to the local query. The overall operation of semantic-based browsing of learning resource metadata is shown in Figure 3.

IMPLEMENTATION
The SGI is implemented in the C# programming language. The current version is a desktop application, but it can easily be extended to a web service. The goal of SGI is to integrate signature generation, document indexing and browsing capability. The signature indexes are stored in an inverted index database (e.g. MS Access). The similarity calculator is a separate module, also implemented in C#, connected to the index database. Figure 4 shows the browsing interface of SGI, illustrating how a distant concept is searched semantically.

Figure 4. Browsing interface of SGI

EVALUATION
In order to test the hypothesis that semantic signatures enable distributed semantic browsing and improve retrieval relevance, we simulated distributed concept retrieval and compared the results with the traditional keyword-based and label-matching methods. To replicate the distributed repositories of a collaborative E-learning network, three independent databases were set up. As shown in Figure 5, they are called "local", "remote1" and "remote2", where local denotes the local data source and remote1 and remote2 simulate distant data sources. A single master set of metadata in 8 different categories was distributed into the three simulated repositories, evenly in number and at random. The metadata were transformed to conform to the IEEE LOM format.

After the distribution, the local database contains the metadata that represent the training data for the classifier. During the training phase, the kNN classifier uses the instances of the local metadata to learn the features that identify the class of the metadata. It starts by extracting keyword terms from each category of metadata and projecting them into the vector space model. Then, after running through the signature generation module, each category of metadata is represented and indexed in the database by a semantic signature.

The datasets in remote1 and remote2 are controlled to model potentially different ontological classifications in a distributed environment. To simulate the effect of varied concept labelling, the original 8 categories of metadata are expanded to 14 categories in remote1. The 6 derived categories are labelled with class names different from those of their source categories and are described with metadata taken from the source categories; each newly derived category contains metadata belonging to the same class. To illustrate, part of the metadata from the category "computing science" is distributed to the derived categories "technology" and "engineering" in remote1, so the metadata for the concept "computing science" is now grouped into "computing science", "technology" and "engineering". Essentially, this simulates the situation where a concept such as "computing science" could be categorized under concepts like "technology" and "engineering" in a different ontology. The same distribution principle is applied to the remote2 database, which includes 13 categories, of which 7 are derived.

Table 1. Source and Category of Metadata
Category          | Source                                                           | No. of records
Accounting        | Business Source Premier Publications                             | 382
Biology           | Biological and Agricultural Index, BioMed Central Online Journals | 315
Computing Citeseer 320 Science Similar to the local database, each category of the metadata American Economic Association’s in remote1 and remote2 is mapped to a semantic signature Economics 353 electronic database in WordNet senses and stored in the local database as an Educational Resource Information Education 307 index. To test semantic-based search, semantic signature Center representing a local concept is sent to query the remote Geography Geobase 237 Mathematics arXiv.org, MathSciNet 157 repositories. The semantic similarity is compared between Psychology PsycINFO, ERIC 164 the query signature and the distant signature based on the similarity function. Finally, the result of the k most similar based retrieval and label-matching retrieval. concept signatures from the remote databases are studied based on the relevance metric. As oppose to the classic or traditional keywords-based representation, semantic-based indexing with WordNet Dataset senses can include more lexicon information than simple Since there is no publicly available dataset of learning syntactic approach. This implies that more features will be resources metadata, the experiment metadata were acquired added to the class signature representation. Since more through a number of different sources. Table 1 shows the features are added, that may also mean that more noise is category of metadata acquired and their respective sources. included as well. In total, 2235 metadata subdivided into the 8 different Intuitively, the increased relevance of retrieval can be categories are acquired. The dataset is partitioned into attributed to the expansion of features in class training and testing groups. As mentioned, the local representation. However, different from what we expected, database stores the training dataset while remote1 and the precision does not decreased. It is suspected that due to remote2 store the testing dataset. 
All metadata are known with their class labels. Metadata are distributed randomly, using the Microsoft Excel random number generator, into the training and testing groups. After distribution, the local database contains 667 training records, while remote1 and remote2 together contain 1568 testing records.

Results
In order to gauge the effectiveness of the proposed mediation method between different E-learning ontologies, three standard metrics for information retrieval are used in the evaluation of the system performance: Recall, Precision and F-measure. Table 2 shows that the use of semantic signatures consistently improves retrieval relevance in terms of recall and precision. In all categories, signature-based retrieval matches or outperforms both keyword-based retrieval and label-matching retrieval.

As opposed to the classic keyword-based representation, semantic-based indexing with WordNet senses can include more lexical information than a purely syntactic approach. This implies that more features are added to the class signature representation. Since more features are added, more noise may be included as well. Intuitively, the increased relevance of retrieval can be attributed to this expansion of features in the class representation. However, contrary to what we expected, precision did not decrease. It is suspected that, owing to the relatively small size of the dataset and 1-k hypernym generalization, the senses included in the signature are 'good' in terms of classification. Therefore, combined with a good contextual sense selection strategy, WordNet as a mediator can provide a source for ambiguity resolution and semantic information for the process of semantic browsing. Coupled with that, the selection of the kNN algorithm as the classifier also contributes to the performance of the system.

kNN is an instance-based classifier. The performance of instance-based classifiers depends more on the sufficiency of the training set than that of other machine learning classification algorithms. Thus, it is a disadvantage for kNN to have a small dataset for training and testing. A smaller training set implies that more terms or term combinations important for content identification may be missing from the training sample documents. This negatively affects the performance of a classifier. Nevertheless, the ontology-guided (e.g. WordNet) approach seems to somewhat reduce the negative influence of this problem.
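A minimal sketch of the kNN step discussed above: a query signature is classified by majority vote among its k most similar training signatures. Cosine similarity over sparse sense-weight vectors is again assumed as the similarity measure, and the training data and sense identifiers are illustrative.

```python
import math
from collections import Counter

def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse sense-weight signatures."""
    dot = sum(w * sig_b.get(s, 0.0) for s, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query_sig, training, k=3):
    """Classify a query signature by majority vote among the k training
    signatures most similar to it. `training` is a list of (label, signature)."""
    neighbours = sorted(training,
                        key=lambda item: cosine(query_sig, item[1]),
                        reverse=True)[:k]
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]
```

Because the vote is taken over individual training instances, a sparse training set directly limits which senses can ever be matched — the weakness the text attributes to instance-based classifiers.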
The replacement of child concepts with a parent concept through the hypernym relationship appears able to discover a near-optimal concept set without adversely affecting performance. Therefore, an important term which resides low in the concept hierarchy may be mapped to a parent concept and included in the signature for class comparison, even if this term is not included in the training set.

Table 2. Comparison of precision, recall and F-measure on concept retrieval

          Precision        Recall           F-Measure
Category  S    K    L      S     K    L     S    K     L
Acc       1    0.6  0.5    1     0.75 0.5   1    0.6   0.5
Bio       0.6  0.6  0.5    0.75  0.6  0.5   0.6  0.6   0.5
CS        1    0.5  0.3    1     0.5  0.3   1    0.5   0.3
Econ      1    1    0.6    1     0.75 0.6   1    0.6   0.6
Educ      0.6  0.5  0.5    0.75  0.75 0.5   0.6  0.45  0.5
Geo       0.6  0.5  0.5    0.75  0.5  0.5   0.6  0.5   0.5
Math      1    0.3  0.6    0.6   0.5  0.6   0.7  0.36  0.6
Psy       1    0.3  0.3    0.6   0.6  0.3   0.7  0.4   0.3

S = Signature-based retrieval, K = Keywords-based, L = Label-matching

DISCUSSION
The improvement in concept retrieval from using semantic signatures is not uniform across categories. For example, the improvement in the retrieval of "Psychology" and "Accounting" metadata is greater than that for "Biology" and "Geography". We believe that for some classes of metadata, such as "Biology", which are characterised by a set of specific keywords, the use of semantic signatures does not add extra useful information to the representation model to help classify the metadata. On the other hand, using 1-k hypernym generalization on such a highly specialized domain may in fact introduce more noise and reduce the matching possibility in similarity calculations.
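The 1-k hypernym generalization discussed above can be sketched as follows: each sense in a signature is replaced by its ancestor k levels up the hypernym chain, so that sibling senses (and their weights) merge into a common parent. The toy child-to-parent map below stands in for WordNet's hypernym relation; all identifiers are illustrative.

```python
# Toy hypernym hierarchy standing in for WordNet (child -> parent).
HYPERNYM = {
    "oscilloscope.n.01": "instrument.n.01",
    "voltmeter.n.01":    "instrument.n.01",
    "instrument.n.01":   "device.n.01",
    "device.n.01":       "artifact.n.01",
}

def generalize(sense, k=1):
    """Replace a sense with its ancestor k levels up the hypernym chain."""
    for _ in range(k):
        sense = HYPERNYM.get(sense, sense)  # stop if already at the root
    return sense

def generalize_signature(signature, k=1):
    """Merge sibling senses by mapping each to its k-th hypernym,
    accumulating their weights under the shared parent."""
    generalized = {}
    for sense, weight in signature.items():
        parent = generalize(sense, k)
        generalized[parent] = generalized.get(parent, 0.0) + weight
    return generalized
```

Merging siblings is what lets a term absent from the training set still contribute: its parent sense may already be in the class signature. It is also the double-edged effect the discussion notes, since in a highly specialized domain the merged parent can blur distinctions that the specific child senses carried.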
In addition, with a small dataset, over-fitting of the classification model may also result. Therefore, further experimentation and analysis are needed to fully understand the impact of WordNet signatures with sense generalization on the classification of metadata.

CONCLUSION
This project offers two important contributions. First, it gives a new light-weight semantic (ontology) mapping approach that enables cross-platform concept browsing in a federated network. Unlike many current practices in semantic mapping, which either require intensive user involvement to provide mapping information or resort to complicated heuristic or rule-based machine learning approaches, this work shows an effective automatic mapping protocol that allows federated concept browsing with semantic signatures. This is evident from the experimental results, which establish the merit of using WordNet to provide semantic knowledge for metadata classification in the domain of E-learning. The merits include the provision of a semantic representation of categorical data and increased semantic relevance in categorical browsing.
Second, by using immediate parent sense generalization during the sense selection process, the approach not only successfully reduces the dimensionality of the semantic signature, but more importantly introduces flexibility into sense selection and increases the opportunity to find a better sense without compromising the relevance of the search results. This creates an incentive to explore the use of other sense selection strategies.