=Paper= {{Paper |id=None |storemode=property |title=Integrating Named Entities in a Semantic Search Engine |pdfUrl=https://ceur-ws.org/Vol-560/paper4.pdf |volume=Vol-560 |dblpUrl=https://dblp.org/rec/conf/iir/CaputoBS10 }} ==Integrating Named Entities in a Semantic Search Engine== https://ceur-ws.org/Vol-560/paper4.pdf
     Integrating Named Entities in a Semantic Search Engine ∗
                    [Extended Abstract]
                  Annalina Caputo                               Pierpaolo Basile               Giovanni Semeraro
                  University of Bari                            University of Bari                University of Bari
             Dept. of Computer Science                     Dept. of Computer Science         Dept. of Computer Science
                 via E. Orabona, 4                             via E. Orabona, 4                 via E. Orabona, 4
                      Bari, Italy                                   Bari, Italy                       Bari, Italy
               acaputo@di.uniba.it                           basilepp@di.uniba.it             semeraro@di.uniba.it

ABSTRACT
Traditional Information Retrieval (IR) systems are based on the bag-of-words representation. This approach retrieves relevant documents by lexical matching between query and document terms. Due to synonymy and polysemy, lexical methods produce imprecise or incomplete results. In this paper we present how named entities are integrated in SENSE (SEmantic N-levels Search Engine). SENSE is an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels which integrate (and not simply replace) the lexical level represented by keywords. Semantic levels provide information about word meanings, as described in a reference dictionary, and named entities. Our aim is to prove that named entities are useful to improve retrieval performance.

1. BACKGROUND AND MOTIVATION
In recent years a lot of attention has been devoted to Named Entities (NE) and their informative and discriminative power within documents. Due to the importance of research on NE, several sub-areas arose, such as entity detection and extraction, entity disambiguation and entity ranking. The typical information extraction task involving NE is Named Entity Recognition (NER). This task was first defined during the Message Understanding Conference (MUC) [4], and requires the identification and categorization of NE as entity names (for people and organizations), place names, temporal expressions and numerical expressions. Named Entities also play a key role in the Information Retrieval context. Indeed, a very common task in that research area is entity ranking, whose aim is to retrieve entities (rather than documents) that satisfy the user query. Most documents we deal with every day contain a lot of references to persons, dates, monetary values and places. Moreover, named entity terms are among the most frequently searched terms on the Web. Statistics on Yahoo’s top 10 search terms in 2008¹ showed that all ten search terms consist of named entity terms: six persons, one sport organization, one role-playing game, one fictional character and one TV show.

In this paper we propose a new way of exploiting named entities in Information Retrieval. Named entities mentioned in a document constitute an important part of its semantics. However, when named entities are considered alone they may fail to capture the semantics expressed in a document or in a user query. For that reason we adopt an IR model, called N-levels [2], able to capture semantic information in a text by exploiting word meanings, described in a reference dictionary (e.g. WordNet), and named entities. Thus, we propose an IR system, called SENSE (SEmantic N-levels Search Engine), which manages documents indexed at multiple separate levels: keywords, senses (word meanings) and entities (named entities). The system is able to combine keyword search with semantic information provided by the two other indexing levels. Finally, we present the development of the full-fledged entity level based on a novel model called Semantic Vectors.

∗ The full version appears in [3].
¹ http://buzz.yahoo.com/yearinreview2008/top10/

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR’10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

2. NAMED ENTITY LEVEL
Named entities are phrases that contain the names of persons, organizations, locations and, more generally, entities that can be identified by proper names. In order to identify named entities in a text, several methods can be applied, such as rule-based, dictionary-based or statistical ones. We adopted a statistical method exploiting YamCha², a generic open source text chunker useful for many NLP tasks. YamCha adopts a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), introduced by Vapnik in 1995. We trained YamCha using the dataset provided by the CoNLL-2003 organizers for the 2003 Shared Task [5]. The dataset contains entities extracted from the Reuters dataset. In particular, four types of entities are annotated: PERSON, LOCATION, ORGANIZATION and MISC, which contains entities that do not belong to the previous three categories. We extract entities from the CLEF 2008 collection [1]. The results of the entity recognition task are exported into a Lucene index. In detail, each document is split into two fields, HEADLINE and TEXT, in compliance with the document structure in CLEF. Each field contains the set of recognized entities and, for each entity, the number of occurrences.

² http://chasen.org/~taku/software/YamCha/

Building the entity level requires three steps:

     1. pre-processing and entity extraction: XML files
        provided by CLEF 2008 organizers are processed in order to extract entities. Named entities are stored in IOB2 format. In IOB2, words outside a named entity are tagged with O, while the first word in an entity is tagged with B-k (to begin class k), and further words receive the I-k tag, indicating that these words are inside the entity;

     2. entity indexing: entities extracted in the previous step are stored into an index using Lucene. The entity extraction procedure yields an entity-based vector space representation, called bag-of-entities (BoE). In this model an entity vector, rather than a word vector, corresponds to a document.

     3. Semantic Vector building: in this step semantic vectors are built by exploiting the Lucene index. The main idea behind models based on Semantic Vectors [6] is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space. The SemanticVectors package offers tools for indexing a collection of documents and their retrieval. It relies on Apache Lucene to create a basic term-document matrix. Then the Lucene API is exploited to create a Wordspace model from the term-document matrix, by using Random Projection to perform on-the-fly dimensionality reduction. This is a relevant point because it allows us to use the same entity index produced in step 2 to induce semantic vectors. A detailed discussion on Semantic Vectors can be found in [6], whilst a thorough explanation about the entity index can be found in [3].

3. EXPERIMENTAL SESSION
For the evaluation of the system effectiveness, we used the CLEF Ad Hoc WSD-Robust dataset derived from the English CLEF data, which comprises corpora from “Los Angeles Times” and “Glasgow Herald”, amounting to 166,726 documents and 160 topics in English and Spanish. The relevance judgments were taken from CLEF. The goal of the evaluation was to prove that the combination of three indexing levels outperforms a single level and, in particular, that adding the entity level increases the effectiveness of the search with respect to the keyword and meaning levels. To evaluate system effectiveness, different runs were performed by exploiting a single level at a time, or a combination of two or more levels. Each experiment is identified by the names of the levels used. To measure retrieval performance, we adopted Mean Average Precision (MAP) and Geometric Mean Average Precision (GMAP) calculated by trec_eval 0.8.1, a simple program supplied by the Text REtrieval Conference organizers³, on the basis of 1,000 retrieved items per request. Table 1 shows the results for each run, with an overview of the exploited features.

³ http://trec.nist.gov/trec_eval/

Table 1: Results of the performed experiments

    Run            MAP      GMAP
    Keyword (K)    0.192    0.041
    Meaning (M)    0.188    0.035
    K+M            0.220    0.057
    Entity (E)     0.134    0.006
    K+E            0.220    0.048
    M+E            0.228    0.054
    K+M+E          0.252    0.076

The results confirm our hypothesis: named entity recognition, in conjunction with an IR model capable of expressing semantics, can greatly improve the retrieval performance. If evaluated individually, the entity level does not yield satisfactory results. This result is due to the presence of topics in which no entity was recognized. Conversely, when search is performed by making use of multiple levels, the entity level is able to improve performance even on those (difficult) topics for which few relevant documents are returned. This result suggests that named entities play a key role in increasing the number of retrieved relevant results previously ignored. Specifically, considering the experiment K+M+E where we used all three levels, an improvement of 14.5% in MAP and 33.3% in GMAP was observed over K+M (from 0.220 to 0.252 and from 0.057 to 0.076, respectively). Generally speaking, we noted an overall improvement in all the experiments that used the entity level, compared to the equivalent experiments in which that level was not exploited.

4. REFERENCES
[1] E. Agirre, G. M. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2008: Ad Hoc Track Overview. In Working notes for the CLEF 2008 Workshop, 2008.
[2] P. Basile, A. Caputo, A. L. Gentile, M. Degemmis, P. Lops, and G. Semeraro. Enhancing Semantic Search using N-Levels Document Representation. In S. Bloehdorn, M. Grobelnik, P. Mika, and D. T. Tran, editors, Proceedings of the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008), Tenerife, Spain, June 2nd, 2008, volume 334 of CEUR Workshop Proceedings, pages 29–43. CEUR-WS.org, 2008.
[3] A. Caputo, P. Basile, and G. Semeraro. Boosting a semantic search engine by named entities. In J. Rauch, Z. W. Ras, P. Berka, and T. Elomaa, editors, ISMIS - Foundations of Intelligent Systems, 18th International Symposium, ISMIS 2009, Prague, Czech Republic, September 14-17, 2009. Proceedings, volume 5722 of Lecture Notes in Computer Science, pages 241–250. Springer, 2009.
[4] R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. In COLING, pages 466–471, 1996.
[5] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In W. Daelemans and M. Osborne, editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada, 2003.
[6] D. Widdows and K. Ferraro. Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), 2008.
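
The IOB2 encoding and the bag-of-entities representation described in Section 2 can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which relies on YamCha and Lucene. The function names `iob2_to_entities` and `bag_of_entities`, the short class labels (ORG, LOC) and the sample sentence are assumptions introduced only for illustration.

```python
from collections import Counter


def iob2_to_entities(tagged_tokens):
    """Group (token, tag) pairs in IOB2 format into (entity_text, class) pairs."""
    entities = []
    current_words, current_class = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A B-k tag always opens a new entity of class k.
            if current_words:
                entities.append((" ".join(current_words), current_class))
            current_words, current_class = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_class:
            # An I-k tag continues the entity opened by the preceding B-k.
            current_words.append(token)
        else:
            # O (or a stray I- tag) closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_class))
            current_words, current_class = [], None
    if current_words:
        entities.append((" ".join(current_words), current_class))
    return entities


def bag_of_entities(tagged_tokens):
    """Count entity occurrences, i.e. the per-field data an entity index would store."""
    return Counter(text for text, _ in iob2_to_entities(tagged_tokens))


# Hypothetical tagged text; the labels are illustrative, not the authors' output.
sample = [
    ("Los", "B-ORG"), ("Angeles", "I-ORG"), ("Times", "I-ORG"),
    ("reported", "O"), ("from", "O"),
    ("Glasgow", "B-LOC"), ("and", "O"), ("Bari", "B-LOC"),
    (".", "O"),
    ("Glasgow", "B-LOC"), ("won", "O"),
]

print(iob2_to_entities(sample))
print(bag_of_entities(sample))
```

Each document field (HEADLINE or TEXT) would yield one such counter, so that a document is represented by a vector of entity counts rather than word counts.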