UNSUPERVISED ENTITY CLASSIFICATION WITH WIKIPEDIA AND WORDNET

Tomáš Kliegr, UEP Prague, Knowledge Engineering Group, Czech Republic

ABSTRACT

The task of classifying entities appearing in textual annotations into an arbitrary set of classes has not been extensively researched, yet it is useful in multimedia retrieval. We propose an unsupervised algorithm which expresses entities and classes as Wordnet synsets and uses the Lin measure to classify them. Real-time hypernym discovery from Wikipedia is used to map uncommon entities to Wordnet. Further, this paper investigates the possibility of improving the performance by utilizing the global context with simulated annealing.

1. INTRODUCTION

Analysis of the textual annotations attached to various objects can provide useful information complementary to the results of analyzing the objects themselves. Annotations are typically short, but very informative due to the use of Named Entities (NEs). Although NEs have high information content, background knowledge is needed to resolve their meaning.

Named Entity Recognition (NER) is a long-established discipline which aims at classifying NEs into a predefined set of classes (typically PERSON, LOCATION, ORGANIZATION and MISC, as in the CoNLL task, www.cnts.ua.ac.be/conll/). Large labeled corpora available for this task are exploited by NER systems to learn statistical classification models. However, this approach cannot be used for generic entity classification due to the data acquisition bottleneck [1].

This paper presents a framework for unsupervised classification of named entities that utilizes background knowledge extracted from Wikipedia to overcome data sparsity and Wordnet similarity to perform the classification (Section 3). We also discuss ongoing work aimed at improving the results (Section 4).

2. RELATED RESEARCH

Treating classes as word categories makes entity classification a WSD problem [2], and thus a range of WSD algorithms can be directly applied. However, many WSD algorithms, including [2], are supervised, which is not desirable in entity classification [1]. Further, WSD algorithms typically search for a maximizing combination of word senses only within a local context; because of the combinatorial explosion, this context is usually limited to a window of several words before and after the entity. This is less suitable for the textual annotations of objects, which are often too short to contain a usable local context. However, we noticed in our past work [3, 4] that object annotations tend to share a common global context within the collection.

Our proposal is closer to the recent work [1], which proposes an unsupervised classification algorithm that uses context vectors automatically extracted from text to represent both entities and classes and assigns each entity to the class with which it has the highest similarity. Paper [1] shows that using pseudosyntactic dependencies is superior to word windows.

3. OUR FRAMEWORK

In our previous work, we addressed the problem of classifying entities into an arbitrary set of classes by introducing a framework that utilizes two algorithms: Targeted Hypernym Discovery (THD) and Semantic Concept Mapping (SCM).

Semantic Concept Mapping is an unsupervised algorithm which classifies each entity occurring in the annotation into one class; both entities and classes need to be expressed as Wordnet synsets. The winning class is the one with the highest Lin similarity sim_L to the entity:

    sim_L(c_1, c_2) = \frac{2 \cdot \log p(\mathrm{lso}(c_1, c_2))}{\log p(c_1) + \log p(c_2)}    (1)

where lso returns the lowest common subsumer of the two concepts in the hierarchy, -\log p(c) is the information content, and p(c) denotes the probability of encountering an instance of concept c. When an entity is not present in Wordnet, Targeted Hypernym Discovery is used to provide a hypernym for the entity.
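For illustration, the SCM decision in Eq. (1) can be sketched in a few lines of Python with NLTK's Wordnet interface and the Brown information-content counts; this is a minimal stand-in rather than the implementation used in our experiments, and the helper names and example words are illustrative only.

```python
# Minimal sketch of the SCM step: assign an entity synset to the class synset
# with the highest Lin similarity (Eq. 1). NLTK is used as a stand-in toolkit.
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # concept probabilities p(c) for the information content

def scm_classify(entity_synset, class_synsets):
    """Return (best class synset, similarity) for the given entity synset."""
    best_cls, best_sim = None, 0.0
    for cls in class_synsets:
        sim = entity_synset.lin_similarity(cls, brown_ic)
        if sim is not None and sim > best_sim:
            best_cls, best_sim = cls, sim
    return best_cls, best_sim

# Example: classify "violin" against a small pool of candidate classes.
entity = wn.synsets('violin', pos=wn.NOUN)[0]   # first-sense assumption (see Section 4)
classes = [wn.synsets(w, pos=wn.NOUN)[0] for w in ('instrument', 'animal', 'building')]
print(scm_classify(entity, classes))            # prints the winning class synset and its score
```

Note that the sketch takes the first Wordnet sense of each word; this first-sense assumption is exactly the simplification revisited in Section 4.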
Targeted Hypernym Discovery builds upon the large body of existing work on discovering hypernyms in text with lexico-syntactic patterns. It is called targeted because it does not extract all word-hypernym pairs, as most other approaches do, but only the most likely hypernym from the most suitable document. In our implementation (http://nb.vse.cz/~klit01/hypernym_discovery/), we use the GATE NLP text engineering framework (http://gate.ac.uk) to extract the first hypernym from the Wikipedia article defining the entity (see Figure 1). In our earlier work [5], we found Wikipedia to be a highly suitable and sustainable resource for hypernym discovery.

Fig. 1. Targeted Hypernym Discovery.
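The core extraction step can be illustrated with the following self-contained Python fragment, which applies a naive copula pattern to the first (defining) sentence of a Wikipedia article and falls back to the discovered hypernym when the entity itself is missing from Wordnet. This is a deliberately simplified stand-in for the GATE-based extractor of Figure 1; the regex, the function names and the example sentence are illustrative only.

```python
# Simplified illustration of THD: extract a hypernym from the defining sentence
# of a Wikipedia article with a copula ("... is a/an/the ...") pattern and map
# the entity to a Wordnet synset. The actual system uses a GATE pipeline that
# locates the head noun of the defining noun phrase; the regex below merely
# takes the first word after the article.
import re
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

COPULA = re.compile(r"\b(?:is|was)\s+(?:a|an|the)\s+([A-Za-z-]+)")

def first_hypernym(defining_sentence):
    """Return a hypernym candidate from a defining sentence, or None."""
    match = COPULA.search(defining_sentence)
    return match.group(1).lower() if match else None

def entity_to_synset(entity, defining_sentence):
    """Map an entity to a Wordnet synset, falling back to its THD hypernym."""
    senses = wn.synsets(entity, pos=wn.NOUN)
    if not senses:  # entity unknown to Wordnet -> substitute the discovered hypernym
        hypernym = first_hypernym(defining_sentence)
        senses = wn.synsets(hypernym, pos=wn.NOUN) if hypernym else []
    return senses[0] if senses else None  # first-sense assumption (see Section 4)

sentence = "Karlstejn is a castle founded in 1348 by Charles IV."
print(first_hypernym(sentence))                 # -> castle
print(entity_to_synset("Karlstejn", sentence))  # -> a Wordnet synset for 'castle'
```

The synset returned here is then classified by SCM exactly as in the previous sketch.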
3.1. Use case: SCM/THD in Image Classification

So far, we have performed experiments with THD in image relevance feedback [4] and image classification [3]. Our algorithm proceeds in a similar way as a human would if presented with an image annotation and a pool of possible concepts Ctc and asked to express what is probably in the image using only the concepts provided: first identify the likely objects in the image by parsing the annotation for entities (noun phrases); if an entity is not known, look it up in Wikipedia; for each entity, select the class with the highest semantic similarity.

INPUT: annotation ANOT, set of concepts Ctc
OUTPUT: set of concepts T, T ⊆ Ctc

T := ∅
NP := extractNounphrases(ANOT)
for all noun phrases np in NP do
    syn := mapToWordnetSynsetWithTHD(np)
    maxSim := 0, maxSimConc := {}
    for all c in Ctc do
        sim := wordnetSim(syn, c)
        if sim > maxSim then
            maxSim := sim, maxSimConc := c
        end if
    end for
    T := T ∪ {maxSimConc}
end for

Performance of SCM/THD alone was mediocre, with an accuracy of 27%, but combining its results with an image classifier (KAA) raised the accuracy to 55%, a relative improvement of 49% and 31% over the text-only and image-only baselines, respectively. (Text-only baseline: the concept with the highest confidence was selected as the image label; image-only baseline (KAA): the class associated with the segment with the highest ratio between the area of the segmented region and the whole image [3].)

4. ONGOING WORK: ADAPTED E-LESK

Close analysis of the experimental results showed that the misclassification error of SCM/THD can be attributed to the first-sense assumption, due to which the system maps the first hypernym found to its first Wordnet synset, and particularly to the poor performance of the Lin measure on Wordnet.

We suggest addressing both of these problems simultaneously with a variation of the Lesk algorithm [6], which uses simulated annealing to find the combination of word senses that maximizes the overall similarity of the dictionary definitions of the words in the sentence. Instead of a dictionary, we plan to use Wikipedia as the source of definitions for both classes and entities. The amount of data will be further increased by involving the hypernyms discovered by THD. We intend to evaluate the performance of this approach on the Fine-Grained Senseval Task.

5. CONCLUSIONS

Most NER systems use supervised techniques. However, as noted in [1], unsupervised algorithms are needed when the set of classes is larger and flexible. There is not much existing work in this area [2], as most research has been focusing on the typical NER task. We have proposed and implemented an unsupervised entity classification system. Further work will focus on substituting the currently used Lin measure, which relies on Wordnet relations, with a variation of the Lesk measure applied to definitions obtained from Wikipedia.

Acknowledgment: This paper would not be here without Krishna and Jan.

6. REFERENCES

[1] Philipp Cimiano and Johanna Völker, "Towards large-scale, open-domain and ontology-based named entity classification," in RANLP, 2005, pp. 166–172.

[2] Michael Fleischman and Eduard Hovy, "Fine grained classification of named entities," in COLING, 2002, ACL.

[3] Tomáš Kliegr, Krishna Chandramouli, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Combining captions and visual analysis for image concept classification," in MDM/KDD'08, 2008, ACM, to appear.

[4] Krishna Chandramouli, Tomáš Kliegr, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Query refinement and user relevance feedback for contextualized image retrieval," in VIE 08, 2008, to appear.

[5] Tomáš Kliegr, Krishna Chandramouli, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo, "Wikipedia as the premiere source for targeted hypernym discovery," in WBBT/ECML'08, 2008, to appear.

[6] Jim Cowie, Joe Guthrie, and Louise Guthrie, "Lexical disambiguation using simulated annealing," in COLING, Morristown, NJ, USA, 1992, pp. 359–365, ACM.