=Paper=
{{Paper
|id=None
|storemode=property
|title=Integrating Named Entities in a Semantic Search Engine
|pdfUrl=https://ceur-ws.org/Vol-560/paper4.pdf
|volume=Vol-560
|dblpUrl=https://dblp.org/rec/conf/iir/CaputoBS10
}}
==Integrating Named Entities in a Semantic Search Engine==
''[Extended Abstract]. The full version appears in [3].''

Annalina Caputo (acaputo@di.uniba.it), Pierpaolo Basile (basilepp@di.uniba.it), Giovanni Semeraro (semeraro@di.uniba.it)

University of Bari, Dept. of Computer Science, via E. Orabona, 4, Bari, Italy

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

===ABSTRACT===
Traditional Information Retrieval (IR) systems are based on the bag-of-words representation. This approach retrieves relevant documents by lexical matching between query and document terms. Due to synonymy and polysemy, lexical methods produce imprecise or incomplete results. In this paper we present how named entities are integrated in SENSE (SEmantic N-levels Search Engine). SENSE is an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels which integrate (and do not simply replace) the lexical level represented by keywords. Semantic levels provide information about word meanings, as described in a reference dictionary, and about named entities. Our aim is to show that named entities are useful for improving retrieval performance.

===1. BACKGROUND AND MOTIVATION===
In recent years a lot of attention has been devoted to Named Entities (NE) and to their informative and discriminative power within documents. Due to the importance of research on NE, several sub-areas arose, such as entity detection and extraction, entity disambiguation and entity ranking. The typical information extraction task involving NE is Named Entity Recognition (NER). This task was first defined at the Message Understanding Conference (MUC) [4], and requires the identification and categorization of NE as entity names (for people and organizations), place names, temporal expressions and numerical expressions. Named entities also play a key role in the Information Retrieval context. Indeed, a very common task in that research area is entity ranking, whose aim is to retrieve entities (rather than documents) that satisfy the user query. Most documents we deal with every day contain a lot of references to persons, dates, monetary values and places. Moreover, named entity terms are among the most frequently searched terms on the Web. Statistics on Yahoo's top 10 search terms in 2008 (http://buzz.yahoo.com/yearinreview2008/top10/) showed that all ten search terms consist of named entity terms: six persons, one sport organization, one role-playing game, one fictional character and one TV show.

In this paper we propose a new way of exploiting named entities in Information Retrieval. Named entities mentioned in a document constitute an important part of its semantics. However, when named entities are considered alone they may fail to capture the semantics expressed in a document or in a user query. For that reason we adopt an IR model, called N-levels [2], able to capture semantic information in a text by exploiting word meanings, described in a reference dictionary (e.g. WordNet), and named entities. Thus, we propose an IR system, called SENSE (SEmantic N-levels Search Engine), which manages documents indexed at multiple separate levels: keywords, senses (word meanings) and entities (named entities). The system is able to combine keyword search with semantic information provided by the two other indexing levels. Finally, we present the development of the full-fledged entity level based on a novel model called Semantic Vectors.

===2. NAMED ENTITY LEVEL===
Named entities are phrases that contain the names of persons, organizations, locations and, more generally, entities that can be identified by proper names. In order to identify named entities in a text, several methods can be applied, such as rule-based, dictionary-based or statistical ones. We adopted a statistical method exploiting YamCha (http://chasen.org/~taku/software/YamCha/), a generic open source text chunker useful for many NLP tasks. YamCha adopts a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), introduced by Vapnik in 1995. We trained YamCha using the dataset provided by the CoNLL-2003 organization during the Shared Task 2003 [5]. The dataset contains entities extracted from the Reuters dataset. In particular, four types of entities are extracted: PERSON, LOCATION, ORGANIZATION and MISC, the last of which contains entities that do not belong to the previous three categories. We extract entities from the CLEF 2008 collection [1]. The results of the entity recognition task are exported into a Lucene index. In detail, each document is split into two fields, HEADLINE and TEXT, in compliance with the document structure in CLEF. Each field contains the set of the recognized entities and, for each entity, the number of occurrences. Building the entity level requires three steps:

# '''Pre-processing and entity extraction''': XML files provided by the CLEF 2008 organizers are processed in order to extract entities. Named entities are stored in IOB2 format. In IOB2, words outside a named entity are tagged with O, while the first word in the entity is tagged with B-k (to begin class k), and further words receive the I-k tag, indicating that these words are inside the entity.
# '''Entity indexing''': entities extracted in the previous step are stored into an index using Lucene. The entity extraction procedure yields an entity-based vector space representation, called bag-of-entities (BoE). In this model an entity vector, rather than a word vector, corresponds to a document.
# '''Semantic Vector building''': in this step semantic vectors are built by exploiting the Lucene index. The main idea behind models based on Semantic Vectors [6] is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space. The SemanticVectors package offers tools for indexing a collection of documents and for their retrieval. It relies on Apache Lucene to create a basic term-document matrix. Then the Lucene API is exploited to create a Wordspace model from the term-document matrix, by using Random Projection to perform on-the-fly dimensionality reduction. This is a relevant point because it allows us to use the same entity index produced in step 2 to induce semantic vectors.

A detailed discussion of Semantic Vectors can be found in [6], whilst a thorough explanation of the entity index can be found in [3].

===3. EXPERIMENTAL SESSION===
To evaluate the effectiveness of the system, we used the CLEF Ad Hoc WSD-Robust dataset derived from the English CLEF data, which comprises corpora from the "Los Angeles Times" and the "Glasgow Herald", amounting to 166,726 documents and 160 topics in English and Spanish. The relevance judgments were taken from CLEF. The goal of the evaluation was to show that the combination of three indexing levels outperforms a single level. In particular, we wanted to show that adding the entity level increases the effectiveness of the search with respect to the keyword and meaning levels. To this end, different runs were performed by exploiting a single level at a time, or a combination of two or more levels. Each experiment is identified by the names of the levels used. To measure retrieval performance, we adopted Mean Average Precision (MAP) and Geometric Mean Average Precision (GMAP), calculated by trec_eval 0.8.1, a simple program supplied by the Text REtrieval Conference organizers (http://trec.nist.gov/trec_eval/), on the basis of 1,000 retrieved items per request. Table 1 shows the results for each run, with an overview of the exploited features.

{| class="wikitable"
|+ Table 1: Results of the performed experiments
! Run !! MAP !! GMAP
|-
| Keyword (K) || 0.192 || 0.041
|-
| Meaning (M) || 0.188 || 0.035
|-
| K+M || 0.220 || 0.057
|-
| Entity (E) || 0.134 || 0.006
|-
| K+E || 0.220 || 0.048
|-
| M+E || 0.228 || 0.054
|-
| K+M+E || 0.252 || 0.076
|}

The results confirm our hypothesis: named entity recognition, in conjunction with an IR model capable of expressing semantics, can greatly improve retrieval performance. If evaluated individually, the entity level does not yield satisfactory results. This is due to the presence of topics in which no entity was recognized. Conversely, when search is performed by making use of multiple levels, the entity level is able to improve performance even on those (difficult) topics for which few relevant documents are returned. This result suggests that named entities play a key role in increasing the number of retrieved relevant results that were previously ignored. Specifically, in the K+M+E experiment, where we used all three levels, an improvement of 14.5% in MAP and 33.3% in GMAP was observed with respect to the K+M run (Table 1). Generally speaking, we noted an overall improvement in all the experiments that used the entity level, compared to the equivalent experiments in which that level was not exploited.

===4. REFERENCES===
* [1] E. Agirre, G. M. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2008: Ad Hoc Track Overview. In Working notes for the CLEF 2008 Workshop, 2008.
* [2] P. Basile, A. Caputo, A. L. Gentile, M. Degemmis, P. Lops, and G. Semeraro. Enhancing Semantic Search using N-Levels Document Representation. In S. Bloehdorn, M. Grobelnik, P. Mika, and D. T. Tran, editors, Proceedings of the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008), Tenerife, Spain, June 2nd, 2008, volume 334 of CEUR Workshop Proceedings, pages 29–43. CEUR-WS.org, 2008.
* [3] A. Caputo, P. Basile, and G. Semeraro. Boosting a semantic search engine by named entities. In J. Rauch, Z. W. Ras, P. Berka, and T. Elomaa, editors, ISMIS - Foundations of Intelligent Systems, 18th International Symposium, ISMIS 2009, Prague, Czech Republic, September 14-17, 2009. Proceedings, volume 5722 of Lecture Notes in Computer Science, pages 241–250. Springer, 2009.
* [4] R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. In COLING, pages 466–471, 1996.
* [5] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In W. Daelemans and M. Osborne, editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada, 2003.
* [6] D. Widdows and K. Ferraro. Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), 2008.