Demo: Enriching Text with RDF/OWL Encoded Senses

Delia Rusu, Tadej Štajner, Lorand Dali, Blaž Fortuna, Dunja Mladenić

Jožef Stefan Institute, Ljubljana, Slovenia
{delia.rusu, tadej.stajner, lorand.dali, blaz.fortuna, dunja.mladenic}@ijs.si

Abstract. This demo paper describes an extension of the Enrycher text enhancement system, which annotates words in context, from a text fragment, with RDF/OWL encoded senses from WordNet and OpenCyc. The extension is based on a general purpose disambiguation algorithm which takes advantage of the structure and/or content of knowledge resources, reaching state-of-the-art performance when compared to other knowledge-lean word sense disambiguation algorithms.

Keywords: RDF/OWL word sense representation.

1 Introduction

A variety of Semantic Web resources in the Linked Open Data (LOD) cloud can serve as knowledge bases for identifying word senses. Some are general, like DBpedia, W3C WordNet and OpenCyc, while others are domain specific, like the Gene Ontology. Moreover, these resources complement each other, as they span several domains, from music to chemistry and biology.

Enrycher [8] is a service-oriented natural language processing and information extraction framework. It annotates text at various levels, producing subject-predicate-object triplets (interesting statements) visually interconnected in a semantic graph representation, co-referenced named entities linked to DBpedia, Yago and OpenCyc, keywords, and DMOZ categories.

In this demo paper we present an extension of Enrycher¹, relying on a general purpose algorithm which can take advantage of several Semantic Web resources to disambiguate text. This extension annotates words in context with RDF/OWL encoded senses from WordNet [1] and OpenCyc².
Given an input text fragment, every word or collocation (word sequence) is annotated with the appropriate sense in context and linked to the associated RDF resources defining the sense, in both WordNet and OpenCyc. The motivation behind adding this extension is to provide richer disambiguated annotations of words that are not named entities, and to improve the semantic graph quality by merging nodes that refer to the same disambiguated concept.

¹ Demo video: http://marquis.ijs.si/delia/
² http://sw.opencyc.org/

Word sense disambiguation (WSD) is defined as identifying the meaning of words in a given context, and has become a prerequisite for several Semantic Web specific tasks like ontology mapping and reasoning. WSD techniques have been previously introduced to validate ontology mappings by analyzing the semantics of the ontological terms; they exploit ontological context, as well as information provided by WordNet. Aside from WordNet, Wikipedia has been used as a knowledge resource for building sense tagged corpora, which were then employed to train a classifier, obtaining promising results [4]. Wikipedia was also used to automatically extend WordNet with semantic relations (such as synonymy, antonymy, hyponymy, etc.) [6].

However, the existing disambiguation systems mainly retrieve WordNet senses that are not readily usable for Semantic Web applications. Our extension can be easily integrated in other applications that require WSD as a preprocessing step, as word senses are labeled with the corresponding disambiguated RDF/OWL resource. Moreover, we take advantage of ontologies to find word senses, and in future work we plan to add domain ontologies that can better disambiguate domain specific terminology.
The paper is structured as follows: Section 2 describes how the extension is integrated into Enrycher, Section 3 presents the disambiguation algorithm, and Section 4 concludes with the demo presentation and future work.

2 Enrycher Extension Integration

Our RDF/OWL word sense annotation extension of Enrycher relies on the Text Preprocessing component, which performs sentence splitting, tokenization, part-of-speech tagging and keyword extraction based on a bag-of-words model (see Fig. 1). Both WordNet 3.0 and OpenCyc are processed offline, in order to extract structure and content information. By structure we refer to the semantic relations: synonymy, hypernymy, etc. specific to WordNet, as well as the generalization, specialization, etc. relations encoded in OpenCyc. The content is given by the WordNet glosses and the OpenCyc comments, which provide descriptions of the word senses.

Fig. 1. Enrycher components and their dependencies.

Given an input text fragment, every word or collocation is annotated with the appropriate sense in context from the aforementioned knowledge resources. When available, the RDF resources from both WordNet and OpenCyc are linked. The following section elaborates on the proposed general purpose disambiguation algorithm.

3 Word/Collocation Annotation

We have implemented an unsupervised semantic knowledge based word sense disambiguation algorithm. It relies on the Viterbi algorithm for Hidden Markov Model (HMM) part-of-speech tagging [2], and an initial version was described in [7]. The Viterbi algorithm is a common decoding algorithm for HMMs, which was first applied to speech and language processing in the context of speech recognition. We have adapted the algorithm in order to determine, given the senses of words in a sentence, the best sequence of senses that disambiguates the sentence.
We start by looking for the senses of nouns, verbs, adjectives and adverbs in one of the two aforementioned knowledge resources. The sequence of observations $O = o_1 o_2 \dots o_T$ represents the $T$ words we disambiguate, while the set of states $Q = q_1 q_2 \dots q_N$ defines the $N$ senses for a given observation. The observation likelihoods $B = b_i(o_t)$ express the probability of an observation $o_t$ being generated from a state $i$. They are obtained by computing the cosine similarity between an ambiguous word description, as defined by the knowledge resource, and information provided by the context (at the level of the sentence, paragraph, etc.). The transition probability matrix $A = a_{11} a_{12} \dots a_{n1} \dots a_{nm}$ is determined by computing the semantic relatedness between the two senses in states $i$ and $j$, respectively. Several relatedness measures have been proposed in the literature, some relying on the structure of the knowledge resource, others on its content. We have implemented four such measures: one exploits the resource structure (Lexical Chains), while the other three take the resource content into account (Adapted Lesk, Vector and Vector Pairwise) [5].

Fig. 2. Disambiguating the phrase "Data mining algorithms" using the proposed algorithm.

We explain the algorithm with the aid of the example in Fig. 2. To disambiguate the phrase "Data mining algorithms" using WordNet 3.0 as a sense repository, we consider the senses of all words (the word "mining" having the sense of "excavating" or "minelaying"), and in addition the sense of the collocation "data mining" (data processing). We denote the part of speech and sense number in curly brackets. The edges are labeled with the state transitions. The collocation is modeled by copying the corresponding sense state and setting the transition between these two states to 1.0. Any of the sense states of the first word can be reached from the start state with equal probability.
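As a rough illustration of the decoding described in this section, the sketch below scores each candidate sense by the cosine similarity between its description and the context, scores transitions with a relatedness function, and keeps back pointers so the best sense sequence can be recovered. It is a minimal sketch under stated assumptions, not the Enrycher implementation: the function names, the smoothing constant `EPS`, the sense identifiers, the glosses and the relatedness values are all invented for illustration, the relatedness function is a lookup table standing in for the measures of [5], and collocation states are not modeled.

```python
import math
from collections import Counter

EPS = 1e-6  # assumed smoothing so a zero similarity does not erase a whole path


def bag(text):
    """Bag-of-words representation of a text (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def viterbi_senses(candidates, context, relatedness):
    """Return the best-scoring sense id for each word.

    candidates: one non-empty list per word of (sense_id, description) pairs.
    context: text surrounding the words (sentence, paragraph, ...).
    relatedness: function (sense_id, sense_id) -> score, used for transitions.
    """
    ctx = bag(context)
    # Every sense of the first word is equally reachable from the start state.
    first = candidates[0]
    scores = [(1.0 / len(first)) * (cosine(bag(d), ctx) + EPS) for _, d in first]
    back_pointers = []
    for t in range(1, len(candidates)):
        new_scores, pointers = [], []
        for sense_id, description in candidates[t]:
            emission = cosine(bag(description), ctx) + EPS
            # Score of arriving from each previous sense state.
            options = [scores[i] * (relatedness(prev_id, sense_id) + EPS)
                       for i, (prev_id, _) in enumerate(candidates[t - 1])]
            best = max(range(len(options)), key=options.__getitem__)
            new_scores.append(options[best] * emission)
            pointers.append(best)
        scores = new_scores
        back_pointers.append(pointers)
    # Trace back from the best final state.
    j = max(range(len(scores)), key=scores.__getitem__)
    path = [j]
    for pointers in reversed(back_pointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j][0] for t, j in enumerate(path)]


# Toy data: sense ids, glosses and relatedness values are made up.
candidates = [
    [("bank#n#1", "financial institution that accepts money"),
     ("bank#n#2", "sloping land beside river")],
    [("deposit#n#1", "money placed in an account"),
     ("deposit#n#2", "sediment left by river")],
]
REL = {("bank#n#1", "deposit#n#1"): 0.8, ("bank#n#2", "deposit#n#2"): 0.6}
result = viterbi_senses(candidates,
                        "she put money in the bank as a deposit",
                        lambda a, b: REL.get((a, b), 0.1))
print(result)  # ['bank#n#1', 'deposit#n#1']
```

Multiplying raw similarity scores keeps the sketch close to the textual description; a practical implementation would more likely work with log scores and a tokenizer that removes stop words.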
Once the final state is reached, we trace back to find the states with the highest associated scores.

We compared our system with others participating in the SemEval 2007 coarse grained all words English disambiguation task based on WordNet senses. Our system obtained precision, recall and F1 of 77.3, which is lower than the most frequent sense baseline of 78.9 but higher than the 77.0 of the best unsupervised disambiguation algorithm participating in the task (SUSSX-FR, based on parsing text and identifying the k nearest neighbors of each word [3]). We also evaluated OpenCyc using a labor-on-demand platform, asking people to determine the correct sense for a given word in context from a subset of OpenCyc sense definitions, obtaining an average F1 score of 37.55.

4 The Demo and Future Work

The demo will show how the implemented system's web interface annotates words/collocations in a given text fragment with RDF/OWL encoded senses from WordNet and OpenCyc. We are also going to show how to use the system output programmatically, through the LarKC (Large Knowledge Collider) platform, in order to build Semantic Web applications that rely on WSD.

As for future work, we plan to integrate other Semantic Web resources from LOD datasets, such as DBpedia, and to investigate the differences in disambiguation results when using distinct resources, as well as the potential for combining different resources in the same task. Additionally, we aim to apply our WSD algorithm to improve the semantic graphs generated by Enrycher.

References

1. Fellbaum, Ch. WordNet: An Electronic Lexical Database. MIT Press (1998).
2. Jurafsky, D., Martin, J. H. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall Series in Artificial Intelligence (2008).
3. Koeling, R., McCarthy, D. Sussx: WSD using Automatically Acquired Predominant Senses. In Proceedings of the 4th SemEval. pp. 314--317. Prague (2007).
4. Mihalcea, R. Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of the North American Chapter of the ACL (NAACL). Rochester, NY (2007).
5. Pedersen, T., Patwardhan, S., Michelizzi, J. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of NAACL. pp. 38--41. Boston, MA (2004).
6. Ponzetto, S. P., Navigli, R. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th ACL. pp. 1522--1531. Uppsala (2010).
7. Rusu, D., Fortuna, B., Mladenic, D. Improved Semantic Graphs with Word Sense Disambiguation. Poster. 8th ISWC. Washington, DC (2009).
8. Stajner, T., Rusu, D., Dali, L., Fortuna, B., Mladenic, D., Grobelnik, M. Enrycher: Service Oriented Text Enrichment. In Proceedings of the 12th Int. Multiconference Information Society. pp. 203--206. Ljubljana (2009).