Automatically Annotating Text with Linked Open Data

Delia Rusu, Blaž Fortuna, Dunja Mladenić
Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
name.surname@ijs.si

ABSTRACT
This paper presents and evaluates two existing word sense disambiguation approaches which are adapted to annotate text with several popular Linked Open Data datasets. One of the algorithms is based on relationships between resources, while the other one takes advantage of resource definitions provided by the datasets. The aim is to test their applicability when annotating text with resources from WordNet, OpenCyc and DBpedia. The experiments expose several shortcomings of the current approaches, which are mostly connected to overfitting the datasets. Based on the findings, we indicate future work directions regarding text annotation with Linked Open Data resources, which can bridge these shortcomings.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text analysis.

General Terms
Algorithms, Design.

Keywords
Linked Open Data, text annotation, word sense disambiguation.

1. INTRODUCTION
The Linked Open Data (LOD) cloud numbers over 200 datasets, more than double compared to a year ago, spanning domains such as media, geography, publications and life sciences, and incorporating several cross-domain datasets. This is an important source of structured data, which so far has been employed for building Linked Data browsers, search engines, and other domain-specific applications such as semantic tagging and rating [3].

In this paper, we are looking at ways to link structured and textual information on the Web by annotating text with LOD resources, and as such moving closer to better machine text understanding. As a scenario, one can consider news articles as a source of unstructured, yet up-to-date knowledge, which can be linked with the LOD, providing additional context for entities (e.g. people, organizations) and events described in the articles.

The task of annotating text with LOD resources is closely related to word sense disambiguation (WSD), defined in natural language processing as identifying the meaning of words in a given context. Three main approaches have emerged for determining word senses [14]: supervised, employing machine learning methods for training a classifier on sense-annotated data; unsupervised, relying on clustering of word occurrences; and knowledge-based, which exploits knowledge resources like dictionaries, ontologies or thesauri. Up to now, most of the effort has been directed at identifying WordNet senses for ambiguous words, using Wikipedia for building sense-tagged corpora or extending WordNet with relationships extracted from Wikipedia.

Another area related to annotating text with LOD resources is ontology matching (establishing mappings between two ontologies). The task is addressed by using the ontology content (e.g., concept labels and descriptions) [9] or, more recently, the ontology structure (relations between the ontology concepts) [7]. In text annotation, as addressed in this paper, instead of matching two ontologies we are dealing with the text to be annotated on one side and the ontology on the other side. The two main differences are the lack of ontology-like structure in text and the ambiguity of words, which often depends on the context.

Up to now, word sense disambiguation approaches have been mostly developed and tested using WordNet. In this work we investigate the applicability of two existing WSD approaches to annotating text with several other LOD datasets. Due to the scarcity of sense-annotated corpora (mainly available for WordNet), we consider two knowledge-based algorithms, which we adapt to the task of automatically annotating text with LOD. The first approach, based on the PageRank algorithm, relies on the relations between resources defined within a dataset, while the second, called Context Similarity, takes advantage of the human-readable description of a resource coupled with the local relationship structure. Our experiments are conducted on three LOD datasets (WordNet, OpenCyc and DBpedia), which contain the common features required by the above-mentioned WSD approaches: resource descriptions and a rich set of relationships between resources. The experimental results reveal the obstacles encountered when attempting to generalize the current state-of-the-art knowledge-based WSD approaches across LOD datasets.

The paper is structured as follows: Section 2 briefly lists related work, while Section 3 describes the algorithms for annotating text with Linked Open Data used in the paper. Section 4 elaborates on the specifics of the datasets that we used in our experiments, and Section 5 discusses the results obtained thus far. The last part of the paper is dedicated to conclusions and future work.

Copyright is held by the author/owner(s).
LDOW'11, March 29, 2011, Hyderabad, India.

2. RELATED WORK
Supervised approaches to WSD have recorded better results in past evaluation workshops than their unsupervised and knowledge-based counterparts. The best performing system [5] in the SemEval 2007 coarse-grained English all-words task [13] was a supervised approach based on the Support Vector Machine algorithm, trained on several corpora: an English-Chinese parallel corpus, SemCor (a subset of the Brown corpus, where words are syntactically and semantically tagged), and the Defence Science Organisation corpus.

Among the knowledge-based approaches, attempts have been made to exploit the structure of the concept network, either devising similarity measures (most of them implemented in the WordNet::Similarity package [16]) or taking advantage of the graph structure of the knowledge base. Patwardhan et al. [15] took the context of the word into account and performed disambiguation based on various similarity measures. Medelyan and Legg [10] used similarity and disjointness to disambiguate Wikipedia articles into concepts from the Cyc ontology. Previous attempts to annotate text with Cyc used its taxonomic knowledge as well as limited context (at the sentence level), or tried to formally describe the structure of the document, build hypotheses (interpretations) and reason based on them [6]. Mihalcea et al. [11] and Agirre and Soroa [1] adapt the PageRank algorithm to disambiguate words based on the structure of the WordNet graph.
Navigli and Velardi [12] propose the Structural Semantic Interconnections (SSI) algorithm, which further develops lexical chains (sequences of semantically related words) by encoding a context-free grammar of valid semantic interconnection patterns. Ponzetto and Navigli [17] have shown that, given a high-quality knowledge base (they enriched WordNet with relations from Wikipedia), simple knowledge-based approaches compete with state-of-the-art supervised approaches.

3. ALGORITHMS FOR ANNOTATING TEXT WITH LINKED OPEN DATA
The task of annotating text with Linked Open Data is defined as follows: given a word or an n-gram wa from a text fragment which is to be annotated, the aim is to identify, out of a set of candidate resources, the LOD resource that best matches the word/n-gram in its context (the text fragment). In our implementations we consider as candidate resources for wa all the resources defined by the dataset with wa as their rdfs:label.

In this paper we focus on three LOD datasets with cross-domain coverage: WordNet, OpenCyc and DBpedia. We exploit two characteristics of these datasets. Firstly, we take into account their structure, which is based on the relations between the resources. Secondly, where available, we consider the human-readable description of a resource, generally found under rdfs:comment. Based on these characteristics, we have adapted two text annotation algorithms: a structure-based one, PageRank, and a content-based one, Context Similarity.

3.1 PageRank
PageRank [4] is a well-known algorithm used for ranking the vertices in a graph representing the structure of Web pages. It has been previously applied to word sense disambiguation with WordNet [11, 1] by building a graph representing the text to disambiguate and identifying relationships between the vertices (which describe the words in the text fragment).

The LOD datasets also exhibit a graph structure based on the relationships between resources, e.g., between an instance and a class (rdf:type), between a class and its superclass (rdfs:subClassOf), and other dataset-specific relations such as the antonym, meronym and holonym relations in WordNet and the broader-term relation in OpenCyc. In order to apply the PageRank algorithm to a LOD dataset, we first build a graph of the dataset G(V,E) where the vertices represent all the dataset resources and the edges are the relationships between these resources. As a next step, we identify all the candidate resources for all the words belonging to the text fragment which are to be annotated. The main difference from the standard algorithm is the initialization step, which consists of setting the graph vertices to either 0, if the vertex does not represent a candidate resource, or 1/R, with R being the total number of candidate resources. The PageRank value for each vertex i (PR[V_i]) is computed using the formula:

    PR[V_i] = \frac{1 - D}{N} + D \sum_{V_j \in In(V_i)} \frac{PR[V_j]}{|Out(V_j)|}

where N stands for the total number of vertices in the graph, In(V_i) denotes the set of vertices linking to V_i, |Out(V_j)| is the number of outgoing edges of V_j, and the damping factor D is set to 0.85. The algorithm converges when the difference between the previous and current PageRank values for a vertex is below 10^-15 (the numerical error for double precision). Finally, we select the candidate resource with the highest PageRank score for each word wa.
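To make the adapted procedure concrete, below is a minimal Python sketch of the initialization and iteration described above. It assumes the dataset graph has already been loaded into adjacency lists; the function and variable names are illustrative, not part of any released implementation.

    def pagerank_annotate(out_links, candidates, d=0.85, eps=1e-15):
        # out_links: dict mapping every vertex (dataset resource) to the list
        # of vertices it links to; every vertex is assumed to appear as a key,
        # possibly with an empty list. candidates: non-empty set of candidate
        # resources for the words of the text fragment.
        vertices = list(out_links)
        n = len(vertices)
        # Initialization step from Section 3.1: 1/R for candidate vertices,
        # 0 for all other vertices.
        pr = {v: (1.0 / len(candidates) if v in candidates else 0.0)
              for v in vertices}
        # Precompute incoming links once.
        in_links = {v: [] for v in vertices}
        for v, outs in out_links.items():
            for w in outs:
                in_links[w].append(v)
        while True:
            new_pr = {v: (1.0 - d) / n
                         + d * sum(pr[u] / len(out_links[u])
                                   for u in in_links[v])
                      for v in vertices}
            # Converged once no vertex changed by more than the tolerance.
            if all(abs(new_pr[v] - pr[v]) < eps for v in vertices):
                return new_pr
            pr = new_pr

For each ambiguous word, the candidate resource with the highest score in the returned vector would then be selected as the annotation.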
3.2 Context Similarity
Many of the LOD datasets have textual definitions attached to resources: DBpedia, Freebase, OpenCyc, WordNet, etc. For example, in DBpedia this definition is the human-readable description of a resource found under rdfs:comment. Based on this observation, we adapted the Extended Gloss Overlaps method used in word sense disambiguation, introduced for WordNet by Banerjee and Pedersen [2]. The method scores each candidate resource based on the word overlap between the context around word wa and the human-readable descriptions of the candidate resource and its neighbouring resources (those directly connected to the resource under consideration). The candidate resource with the highest score for each word/n-gram is selected as the annotation. The context of wa is represented by the surrounding words in the text fragment, e.g. all words from the same sentence or paragraph. The overlap between a candidate resource and the word context is computed using cosine similarity (simcos), a standard text mining measure of the similarity between documents. It is defined between two bag-of-words vectors A and B as:

    sim_{cos}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}

We summarize the algorithm for computing the similarity between a word/n-gram wa and a resource in Figure 1. We start by determining the neighbourhood resources and the context of wa. Then, for each resource from the neighbourhood resources, we compute the cosine similarity between the bag-of-words representation of the resource definition (NR[i]) and the context of wa.

    ContextSimilarity(resource, wa) returns Similarity
        Similarity = 0
        // definitions of the resources directly connected to the candidate
        NR = GetNeighborhoodResources(resource)
        // surrounding words of wa (e.g. its sentence or paragraph)
        CW = GetContext(wa)
        for i = 1 to Size(NR) do
            // cosine similarity between one definition and the context
            CS = simcos(NR[i], CW)
            Similarity = Similarity + CS
        end for
        return Similarity

Figure 1 The Context Similarity algorithm. Given a candidate resource and a word (wa), the algorithm computes the similarity between the resource and the word's context.
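As a runnable counterpart to Figure 1, the following Python sketch implements the Context Similarity computation with a naive bag-of-words cosine similarity; the tokenizer and helper names are our illustrative assumptions, not the implementation used in the experiments.

    import math
    from collections import Counter

    def bag_of_words(text):
        # Naive whitespace tokenizer; a real system would normalize further
        # (stop-word removal, stemming, etc.).
        return Counter(text.lower().split())

    def sim_cos(a, b):
        # Cosine similarity between two bag-of-words vectors (Counters).
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def context_similarity(neighbour_definitions, context):
        # neighbour_definitions: human-readable descriptions of the
        # candidate's neighbourhood (NR in Figure 1);
        # context: the text surrounding wa (CW in Figure 1).
        cw = bag_of_words(context)
        return sum(sim_cos(bag_of_words(d), cw)
                   for d in neighbour_definitions)

The candidate resource whose neighbourhood yields the highest context_similarity score is chosen as the annotation.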
4. DATASETS
4.1 WordNet
WordNet (http://wordnet.princeton.edu/) is a lexical database of English, comprising nouns, verbs, adjectives and adverbs grouped into synsets. There are two RDF/OWL representations of WordNet in the LOD: WordNet (W3C) models WordNet version 2.0, while WordNet (VUA) models the latest version, 3.0.

In order to evaluate the aforementioned approaches, we considered two scenarios: in the first one we opted for a dataset used in the word sense disambiguation tasks, SemEval 2007 Task 7: Coarse-Grained English All-Words. In the second scenario we used crowdsourcing for evaluating the Context Similarity measure. The SemEval series of evaluation workshops is regarded as a framework for comparing state-of-the-art WSD systems. In Task 7 of SemEval 2007, the participating systems are provided with a corpus consisting of 5 texts annotated with WordNet 2.1 senses, drawn from sources like the Wall Street Journal (D1 to D3), a Wikipedia entry on computer programming (D4) and an excerpt from a fiction book (D5).

Out of the 2269 annotated words, 1591 are polysemous (have more than one meaning). The results of our experiments are summarized in Table 1. Precision, recall and the F measure are identical in this case, as the system annotates all words and yields exactly one annotation per word (full coverage makes precision equal recall). For WordNet, there is a most-frequent-sense baseline (which few knowledge-based systems outperform), obtained by always choosing the first sense of a word in WordNet. Notice that this first sense in WordNet is also the most frequent sense of that word, as measured in a corpus called SemCor. In all documents but D4, which is domain specific, the baseline performs better than the proposed approaches; the F measure of the baseline is as high as 85.60% for D1 and lowest for D5 at 74.20%; for D4 the F measure is 75.19%. The best supervised system obtained an F measure of 82.50%, while the best unsupervised system recorded 77.04%.

Table 1 WordNet evaluation results (F measure, in %) for the SemEval 2007 Task 7 corpus. CS is the Context Similarity algorithm, while PR is the PageRank algorithm.

         D1     D2     D3     D4     D5     All
    CS   75.27  77.84  72.80  77.84  72.46  75.50
    PR   74.19  74.67  73.60  73.71  71.01  73.51

In the second scenario, we used CrowdFlower (http://crowdflower.com/), a labor-on-demand platform where one can assign tasks to a number of non-expert workers. Workers were assigned the task of selecting the correct human-readable description of a resource (annotation) from a list of possible descriptions. The example below shows a marked word for annotation (painter) together with its possible WordNet resource descriptions, of which the second one represents the correct annotation. The fourth option (none of the above) can be chosen in case none of the candidate resources represents a correct annotation.

    It must have been about the same time when Fra Angelico was covering the walls of San Marco with his angel pictures, that a very different kind of {painter} was working the Carmine church in Florence.

    1. painter: a worker who is employed to cover objects with paint
    2. painter: an artist who paints
    3. catamount, cougar, felis_concolor, mountain_lion, painter: large American feline resembling a lion
    4. none of the above

This scenario serves as a performance baseline for WSD methods developed for WordNet, when evaluated using crowdsourcing. The evaluation was done on a corpus composed of 325 words from the SemEval 2007 Task 7 corpus, shown in Table 2. We used some of the words as control words, that is, their correct annotations were reported to the CrowdFlower system, which used them to filter out workers with bad performance.

Table 2 A subset of 325 words from the SemEval 2007 Task 7 corpus, composed of 95 control words and 230 test words, which was employed in the WordNet annotation task using CrowdFlower.

              D1   D2   D3   D4   D5   Total
    Control   11   14   20   34   16   95
    Test      30   41   56   71   32   230

Annotations from at least 5 workers were obtained for each word, and we accordingly distinguish two cases. In the aggregate case, the predominant annotation for a word was selected from the set of all annotations. In the non-aggregate case, each obtained annotation was considered separately.

The inter-annotator agreement is usually considered an upper bound for WSD systems [8]; in our experiments it ranges from 59.61% for D1 to 66.89% for D5. The results when using the Context Similarity measure are reported in Table 3. We observe an expected decrease in the F measure, as we are dealing with non-expert annotators, while the Context Similarity results are close to the inter-annotator agreement.

Table 3 WordNet evaluation results (F measure, in %) when using CrowdFlower, a labor-on-demand platform. As annotation algorithm, we used Context Similarity.

                    D1     D2     D3     D4     D5
    Aggregate       56.66  60.98  51.79  56.34  68.75
    Non-aggregate   67.57  61.72  57.80  59.78  69.35
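The aggregate case above amounts to a majority vote over the annotations collected for one word; a minimal sketch, with illustrative data:

    from collections import Counter

    def aggregate_annotation(worker_annotations):
        # worker_annotations: resource identifiers chosen by the (at least 5)
        # workers for one word; returns the predominant annotation.
        return Counter(worker_annotations).most_common(1)[0][0]

    # A hypothetical example with three workers:
    # aggregate_annotation(["painter#2", "painter#2", "painter#1"])
    # returns "painter#2"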
4.2 OpenCyc
OpenCyc (http://sw.opencyc.org/) is the open source version of Cyc, a common-sense knowledge base, covering the top 40% of the complete Cyc knowledge base. OpenCyc is also available as a downloadable OWL ontology, and in this paper we refer to the 2008 version.

For the first experiment on OpenCyc we extracted a subset of 177 words from the SemEval 2007 Task 7 corpus, as shown in Table 4. The correct annotations were provided by CrowdFlower workers, and the workers were selected using the WordNet control words, as described in the previous sub-section.

Table 4 A corpus used to evaluate OpenCyc annotations, comprised of 177 words. The table shows the distribution of the extracted words over the five documents.

              D1   D2   D3   D4   D5   Total
    OpenCyc   25   30   43   50   29   177

Table 5 OpenCyc evaluation results (F measure, in %) when using CrowdFlower. As annotation algorithm, we used Context Similarity.

                    D1     D2     D3     D4     D5
    Aggregate       24.00  36.67  27.91  42.00  34.48
    Non-aggregate   41.29  37.42  29.95  37.01  42.08

The obtained results are listed in Table 5. After looking at the results, our first assumption was that the non-expert annotators found it difficult to identify the correct OpenCyc annotations based on the human-readable resource definitions. For example, the word boy can be resolved in OpenCyc to the following three resources:

    1. The collection of all boys (juvenile male humans). A type of young animal and male person.
    2. The collection of male children, male kids about 12 years of age or less.
    3. (son PAR MALE) means that MALE is one of the sons (male children) of PAR. MALE could be a child of PAR by birth, by adoption, by marriage (e.g., if PAR had married a biological parent of MALE), or by some other social arrangement.

Therefore, in our second OpenCyc experiment (see Table 6), we extracted a similar subset comprised of 50 words from D3 with more than one candidate resource, which were manually annotated with OpenCyc resources by 2 expert annotators (A1 and A2). The inter-annotator agreement was 74.00%. The results turned out to be more or less the same as the ones obtained via crowdsourcing, and reflect the difficulty of directly transferring WSD algorithms from WordNet to OpenCyc.

Table 6 OpenCyc evaluation results (F measure, in %) based on manual annotations provided by A1 and A2.

                        A1     A2
    Context Similarity  24.00  32.00
    PageRank            32.00  48.00
    Random              22.00  26.00

4.3 DBpedia
The DBpedia (http://dbpedia.org/About) dataset is based on a cross-domain ontology with most concepts representing places, persons, works, species and organizations. The ontology was mostly extracted from infoboxes in Wikipedia. Each DBpedia resource is described by a label, a short and a long English abstract, a link to the corresponding Wikipedia page and, when available, a link to an image representation of the resource.
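For illustration, candidate resources and their short descriptions can be retrieved from the public DBpedia SPARQL endpoint. The sketch below assumes the SPARQLWrapper Python library; the query shape is ours and not taken from the paper.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def dbpedia_candidates(label, lang="en"):
        # Return (resource URI, rdfs:comment) pairs for resources whose
        # rdfs:label exactly matches the given surface form.
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?r ?comment WHERE {
              ?r rdfs:label "%s"@%s ;
                 rdfs:comment ?comment .
              FILTER (lang(?comment) = "%s")
            }""" % (label, lang, lang))
        sparql.setReturnFormat(JSON)
        rows = sparql.query().convert()["results"]["bindings"]
        return [(row["r"]["value"], row["comment"]["value"]) for row in rows]

Querying for the label Lead would surface the kind of candidate list discussed below (the chemical element, the Japanese hip-hop group, and so on), while meanings that exist only on a Wikipedia disambiguation page are missed.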
This is due to its dictionary-like The DBpedia4 dataset is based on a cross-domain ontology with nature and the fact that candidate resources correspond directly most concepts representing places, persons, work, species, and with the possible word meanings. On the other hand, OpenCyc organizations. The ontology was mostly extracted from infoboxes contains many resources or distinctions between resources which in Wikipedia. Each DBpedia resource is described by a label, a are important from the reasoning perspective (e.g. the three short and long English abstract, a link to corresponding Wikipedia candidate resources for the word boy) but are hard to page and a link to the image representation of the resource, when disambiguate by looking at the word and text alone. The available. differences between the three candidates would only become apparent when faced with distinct reasoning tasks, requiring Table 7 DBpedia evaluation results (F measure, in %) based various representations of the sentence at hand. This aspect alone on manual annotations provided by A1. can explain a large portion of the performance gap between WordNet and OpenCyc annotations. One possible solution is D3 relaxing the evaluation measures and allowing for more than one Context Similarity 17.86 possible annotation to be correct. Moreover, the annotation PageRank 21.43 algorithms need to assume that there is not always one correct Random 14.28 annotation; there can be more correct annotations or, as it is often the case with OpenCyc and DBpedia, none. For evaluation we randomly extracted a subset of words from D3 with more than one candidate resource, similar to the OpenCyc Secondly, although all three datasets share common features, experiments. These words were manually annotated with DBpedia these features are actually quite different due to dataset design. resources by one annotator and the results are shown in Table 7. For example, human-readable descriptions in all three cases are In the subset of D3 all but two of the words to disambiguate are written in very different genres and target different users. general, non-entity words. However, due to high emphasis of WordNet descriptions are written similar to dictionary entries, entities in DBpedia, often none of the candidate resources was DBpedia descriptions are, by definition, written like encyclopedia correct. For example, in the sentence: entries and OpenCyc descriptions are meant as documentation to the ontology engineer using it to model some world phenomena. In France Americans it seems have followed Malcolm Forbes's Similarly, relations in all three datasets have very distinct hot-air lead and taken to ballooning heady way. semantics, and the annotation methods developed or focused on so the correct annotation for the word lead would be: far either pay little attention to this or are largely overfitted to the few relations used in WordNet. Each of the datasets has its own With this pronunciation, ‘lead’ generally means ‘first’, ‘ahead’, vocabulary for determining the closeness of concepts. For or ‘guide’. example, OpenCyc uses relations such as nearestIsa, nearestIsaOfType or conceptuallyRelated. WordNet largely focuses on the closeness of concepts within one part of speech (e.g. nouns) having less relation types defined between different 4 http://dbpedia.org/About parts of speech. Both OpenCyc and DBpedia contain relations which mostly regard their infrastructure, (wikiPageUsesTemplate Conference on World Wide Web (WWW). 
Similarly, the relations in the three datasets have very distinct semantics, and the annotation methods developed so far either pay little attention to this or are largely overfitted to the few relations used in WordNet. Each of the datasets has its own vocabulary for determining the closeness of concepts. For example, OpenCyc uses relations such as nearestIsa, nearestIsaOfType or conceptuallyRelated. WordNet largely focuses on the closeness of concepts within one part of speech (e.g. nouns), with fewer relation types defined between different parts of speech. Both OpenCyc and DBpedia contain relations which mostly concern their infrastructure (wikiPageUsesTemplate is the most common relation in infobox triples) and which, when naively used, are not a good indicator of concept closeness (e.g. in the PageRank approach from Section 3). To overcome this drawback, annotation methods have to better exploit the rich relationship structure of LOD datasets and allow for an easy addition of new relations and datasets.

With the future evolution of the LOD, it would also be beneficial to introduce a model for defining lexical resources, which would be attached to the LOD resources. Currently, each resource can carry a label (rdfs:label) in one or more languages. It would be useful to assign more linguistic meta-data to these labels, such as part-of-speech and inflected forms (e.g. go, goes, going, went, and gone). Since such meta-data is generally expensive to build, tools for producing it (semi-)automatically would also be of great benefit.

6. CONCLUSION AND FUTURE WORK
In this paper we investigated the applicability of two common approaches, taken from the word sense disambiguation community, to annotating text with LOD datasets. One of the approaches relies on the dataset relationship structure and is based on the PageRank algorithm; the second one, called Context Similarity, takes advantage of the human-readable description of a resource as well as the neighbourhood relationships defined for that resource. These approaches were chosen based on the common characteristics of three datasets: WordNet, DBpedia and OpenCyc. The experimental findings revealed the shortcomings of current state-of-the-art word sense disambiguation methods when applied to different LOD datasets. In the discussion section we provided several possible explanations for these shortcomings, together with alternatives and solutions.

As far as future work is concerned, we plan to use the lessons learned in the presented experiments to further develop text annotation methods which offer better performance on datasets such as OpenCyc and DBpedia, and which can be transferred to other LOD datasets with a reasonable and predictable amount of effort.
7. ACKNOWLEDGMENTS
The research leading to these results has received funding from the Slovenian Research Agency and the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n°257790.

8. REFERENCES
[1] Agirre, E. and Soroa, A. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Athens, Greece.
[2] Banerjee, S. and Pedersen, T. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136-145. Mexico City, Mexico.
[3] Bizer, C., Heath, T. and Berners-Lee, T. 2009. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems.
[4] Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International Conference on World Wide Web (WWW). Brisbane, Australia.
[5] Chan, Y. S., Ng, H. T. and Zhong, Z. 2007. NUS-PT: Exploiting parallel texts for word sense disambiguation in the English all-words tasks. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval). Prague, Czech Republic.
[6] Curtis, J., Cabral, J. and Baxter, D. 2006. On the application of the Cyc ontology to word sense disambiguation. In Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference.
[7] David, J., Euzenat, J. and Zamazal, O. S. 2010. Ontology similarity in the alignment space. In Proceedings of the 9th International Semantic Web Conference (ISWC). Shanghai, China.
[8] Edmonds, P. and Kilgarriff, A. 2002. Introduction to the special issue on evaluating word sense disambiguation systems. Journal of Natural Language Engineering, 8(4). Cambridge University Press.
[9] Euzenat, J. and Shvaiko, P. 2007. Ontology Matching. Springer, Heidelberg, Germany.
[10] Medelyan, O. and Legg, C. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proceedings of the Wikipedia and AI Workshop at the AAAI 2008 Conference. Chicago, USA.
[11] Mihalcea, R., Tarau, P. and Figa, E. 2004. PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING). Geneva, Switzerland.
[12] Navigli, R. and Velardi, P. 2005. Structural semantic interconnections: A knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7), pp. 1075-1088.
[13] Navigli, R., Litkowski, K. C. and Hargraves, O. 2007. SemEval-2007 Task 07: Coarse-grained English all-words task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval). Prague, Czech Republic.
[14] Navigli, R. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2).
[15] Patwardhan, S., Banerjee, S. and Pedersen, T. 2007. UMND1: Unsupervised word sense disambiguation using contextual semantic relatedness. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval). Prague, Czech Republic.
[16] Pedersen, T., Patwardhan, S. and Michelizzi, J. 2004. WordNet::Similarity - Measuring the relatedness of concepts. In Proceedings of the Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL). Boston, USA.
[17] Ponzetto, S. P. and Navigli, R. 2010. Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1522-1531. Uppsala, Sweden.