UNSUPERVISED ENTITY CLASSIFICATION WITH WIKIPEDIA AND WORDNET

Tomáš Kliegr, UEP Prague, Knowledge Engineering Group, Czech Republic

ABSTRACT

The task of classifying entities appearing in textual annotations into an arbitrary set of classes has not been extensively researched, yet it is useful in multimedia retrieval. We propose an unsupervised algorithm which expresses entities and classes as Wordnet synsets and uses the Lin measure to classify them. Real-time hypernym discovery from Wikipedia is used to map uncommon entities to Wordnet. Further, this paper investigates the possibility of improving the performance by utilizing the global context with simulated annealing.

1. INTRODUCTION

Analysis of the textual annotations attached to various objects can provide useful information complementary to the results of analyzing the objects themselves. Annotations are typically short, but very informative due to the use of Named Entities (NEs). Although NEs have high information content, background knowledge is needed to resolve their meaning.

Named Entity Recognition (NER) is a long-established discipline which aims at classifying NEs into a predefined set of classes (typically PERSON, LOCATION, ORGANIZATION and MISC, as in the CoNLL task, www.cnts.ua.ac.be/conll/). Large labeled corpora available for this task are exploited by NER systems to learn statistical classification models. However, this approach cannot be used for generic entity classification due to the data acquisition bottleneck [1].

This paper presents a framework for unsupervised classification of named entities that utilizes background knowledge extracted from Wikipedia to overcome data sparsity and Wordnet similarity to perform the classification (Section 3). We also discuss ongoing work aimed at improving the results (Section 4).

2. RELATED RESEARCH

Treating classes as word categories makes entity classification a WSD problem [2], and thus a range of WSD algorithms can be directly applied. However, many WSD algorithms, including [2], are supervised, which is not desirable in entity classification [1]. Further, WSD algorithms typically search for a maximizing combination of word senses only within a local context; because of the combinatorial explosion, this context is usually limited to a window of several words before and after the entity. This is less suitable for the textual annotations of objects, which are often too short to contain a usable local context. However, we noticed in our past work [3, 4] that object annotations tend to share a common global context within the collection.

Our proposal is closer to the recent work [1], which proposes an unsupervised classification algorithm that uses context vectors automatically extracted from text to represent both entities and classes and assigns each entity to the class with which it has the highest similarity. Paper [1] shows that using pseudosyntactic dependencies is superior to word windows.

3. OUR FRAMEWORK

In our previous work, we addressed the problem of classifying entities into an arbitrary set of classes by introducing a framework that utilizes two algorithms: Targeted Hypernym Discovery (THD) and Semantic Concept Mapping (SCM).

Semantic Concept Mapping is an unsupervised algorithm which classifies each entity occurring in the annotation into one class; both entities and classes need to be expressed as Wordnet synsets. The winning class is the one with the highest Lin similarity sim_L to the entity:

    sim_L(c_1, c_2) = \frac{2 \cdot \log p(\mathrm{lso}(c_1, c_2))}{\log p(c_1) + \log p(c_2)}    (1)

where lso returns the lowest common subsumer of the two concepts in the hierarchy, -\log p(c) is the information content, and p(c) denotes the probability of encountering an instance of concept c. When an entity is not present in Wordnet, Targeted Hypernym Discovery is used to provide a hypernym for the entity.
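For illustration, the SCM decision in Eq. (1) can be sketched in a few lines of Python with NLTK's Wordnet interface and the Brown information-content counts; this is a minimal stand-in rather than the implementation used in our experiments, and the helper names and example words are illustrative only.

```python
# Minimal sketch of the SCM step: assign an entity synset to the class synset
# with the highest Lin similarity (Eq. 1). NLTK is used as a stand-in toolkit.
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # concept probabilities p(c) for the information content

def scm_classify(entity_synset, class_synsets):
    """Return (best class synset, similarity) for the given entity synset."""
    best_cls, best_sim = None, 0.0
    for cls in class_synsets:
        sim = entity_synset.lin_similarity(cls, brown_ic)
        if sim is not None and sim > best_sim:
            best_cls, best_sim = cls, sim
    return best_cls, best_sim

# Example: classify "violin" against a small pool of candidate classes.
entity = wn.synsets('violin', pos=wn.NOUN)[0]   # first-sense assumption (see Section 4)
classes = [wn.synsets(w, pos=wn.NOUN)[0] for w in ('instrument', 'animal', 'building')]
print(scm_classify(entity, classes))            # prints the winning class synset and its score
```

Note that the sketch takes the first Wordnet sense of each word; this first-sense assumption is exactly the simplification revisited in Section 4.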
Targeted Hypernym Discovery builds upon the large body of existing work on discovering hypernyms in text with lexico-syntactic patterns. It is called targeted because it does not extract all word-hypernym pairs, as most other approaches do, but only the most likely hypernym from the most suitable document. In our implementation (http://nb.vse.cz/~klit01/hypernym_discovery/), we use the GATE NLP text engineering framework (http://gate.ac.uk) to extract the first hypernym from the Wikipedia article defining the entity (see Figure 1). In our earlier work [5], we found Wikipedia to be a highly suitable and sustainable resource for hypernym discovery.

Fig. 1. Targeted Hypernym Discovery.
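The core extraction step can be illustrated with the following self-contained Python fragment, which applies a naive copula pattern to the first (defining) sentence of a Wikipedia article and falls back to the discovered hypernym when the entity itself is missing from Wordnet. This is a deliberately simplified stand-in for the GATE-based extractor of Figure 1; the regex, the function names and the example sentence are illustrative only.

```python
# Simplified illustration of THD: extract a hypernym from the defining sentence
# of a Wikipedia article with a copula ("... is a/an/the ...") pattern and map
# the entity to a Wordnet synset. The actual system uses a GATE pipeline that
# locates the head noun of the defining noun phrase; the regex below merely
# takes the first word after the article.
import re
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

COPULA = re.compile(r"\b(?:is|was)\s+(?:a|an|the)\s+([A-Za-z-]+)")

def first_hypernym(defining_sentence):
    """Return a hypernym candidate from a defining sentence, or None."""
    match = COPULA.search(defining_sentence)
    return match.group(1).lower() if match else None

def entity_to_synset(entity, defining_sentence):
    """Map an entity to a Wordnet synset, falling back to its THD hypernym."""
    senses = wn.synsets(entity, pos=wn.NOUN)
    if not senses:  # entity unknown to Wordnet -> substitute the discovered hypernym
        hypernym = first_hypernym(defining_sentence)
        senses = wn.synsets(hypernym, pos=wn.NOUN) if hypernym else []
    return senses[0] if senses else None  # first-sense assumption (see Section 4)

sentence = "Karlstejn is a castle founded in 1348 by Charles IV."
print(first_hypernym(sentence))                 # -> castle
print(entity_to_synset("Karlstejn", sentence))  # -> a Wordnet synset for 'castle'
```

The synset returned here is then classified by SCM exactly as in the previous sketch.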
3.1. Use case: SCM/THD in Image Classification

So far, we have performed experiments with THD in image relevance feedback [4] and image classification [3]. Our algorithm proceeds in a similar way as a human would if presented with an image annotation and a pool of possible concepts Ctc and asked to express what is probably in the image using only the concepts provided: first identify the likely objects in the image by parsing the annotation for entities (noun phrases); if an entity is not known, look it up in Wikipedia; for each entity, select the class with the highest semantic similarity.

INPUT: annotation ANOT, set of concepts Ctc
OUTPUT: set of concepts T, T ⊆ Ctc

T := ∅
NP := extractNounphrases(ANOT)
for all noun phrases np in NP do
    syn := mapToWordnetSynsetWithTHD(np)
    maxSim := 0, maxSimConc := {}
    for all c in Ctc do
        sim := wordnetSim(syn, c)
        if sim > maxSim then
            maxSim := sim, maxSimConc := c
        end if
    end for
    T := T ∪ {maxSimConc}
end for

Performance of SCM/THD alone was mediocre, with an accuracy of 27%, but combining its results with an image classifier (KAA) raised the accuracy to 55%, a relative improvement of 49% and 31% over the text-only and image-only baselines, respectively. (Text-only baseline: the concept with the highest confidence was selected as the image label; image-only baseline (KAA): the class associated with the segment with the highest ratio between the area of the segmented region and the whole image [3].)

4. ONGOING WORK: ADAPTED E-LESK

Close analysis of the experimental results showed that the misclassification error of SCM/THD can be attributed to the first-sense assumption, due to which the system maps the first hypernym found to its first Wordnet synset, and particularly to the poor performance of the Lin measure on Wordnet.

We suggest addressing both of these problems simultaneously with a variation of the Lesk algorithm [6], which uses simulated annealing to find the combination of word senses that maximizes the overall similarity of the dictionary definitions of the words in the sentence. Instead of a dictionary, we plan to use Wikipedia as the source of definitions for both classes and entities. The amount of data will be further increased by involving the hypernyms discovered by THD. We intend to evaluate the performance of this approach on the Fine-Grained Senseval Task.

5. CONCLUSIONS

Most NER systems use supervised techniques. However, as noted in [1], unsupervised algorithms are needed when the set of classes is larger and flexible. There is not much existing work in this area [2], as most research has been focusing on the typical NER task. We have proposed and implemented an unsupervised entity classification system. Further work will focus on substituting the currently used Lin measure, which relies on Wordnet relations, with a variation of the Lesk measure applied to definitions obtained from Wikipedia.

Acknowledgment: This paper would not be here without Krishna and Jan.

6. REFERENCES

[1] Philipp Cimiano and Johanna Völker, "Towards large-scale, open-domain and ontology-based named entity classification," in RANLP, 2005, pp. 166–172.

[2] Michael Fleischman and Eduard Hovy, "Fine grained classification of named entities," in COLING, 2002, ACL.

[3] Tomáš Kliegr, Krishna Chandramouli, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Combining captions and visual analysis for image concept classification," in MDM/KDD'08, 2008, ACM, to appear.

[4] Krishna Chandramouli, Tomáš Kliegr, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Query refinement and user relevance feedback for contextualized image retrieval," in VIE 08, 2008, to appear.

[5] Tomáš Kliegr, Krishna Chandramouli, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Wikipedia as the premiere source for targeted hypernym discovery," in WBBT/ECML'08, 2008, to appear.

[6] Jim Cowie, Joe Guthrie, and Louise Guthrie, "Lexical disambiguation using simulated annealing," in COLING, Morristown, NJ, USA, 1992, pp. 359–365, ACM.