Semantic Vectors: an Information Retrieval scenario

Pierpaolo Basile, Annalina Caputo, Giovanni Semeraro
Dept. of Computer Science, University of Bari
Via E. Orabona, 4 - 70125 Bari (ITALY)
basilepp@di.uniba.it, acaputo@di.uniba.it, semeraro@di.uniba.it

ABSTRACT
In this paper we exploit Semantic Vectors to develop an IR system. The idea is to use semantic spaces built on terms and documents to overcome the problem of word ambiguity. Word ambiguity is a key issue for those systems which have access to textual information. Semantic Vectors are able to divide the usages of a word into different meanings, discriminating among word meanings based on information found in unannotated corpora. We provide an in vivo evaluation in an Information Retrieval scenario and we compare the proposed method with another one which exploits Word Sense Disambiguation (WSD). Contrary to sense discrimination, which is the task of discriminating among different meanings (not necessarily known a priori), WSD is the task of selecting a sense for a word from a set of predefined possibilities. The goal of the evaluation is to establish how Semantic Vectors affect the retrieval performance.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing methods, Linguistic processing; H.3.3 [Information Search and Retrieval]: Retrieval models, Search process

Keywords
Semantic Vectors, Information Retrieval, Word Sense Discrimination

1. BACKGROUND AND MOTIVATIONS
Ranked keyword search has been quite successful in the past, in spite of its obvious limits basically due to polysemy, the presence of multiple meanings for one word, and synonymy, multiple words having the same meaning. The result is that, due to synonymy, relevant documents can be missed if they do not contain the exact query keywords, while, due to polysemy, wrong documents could be deemed as relevant. These problems call for alternative methods that work not only at the lexical level of the documents, but also at the meaning level.

In the field of computational linguistics, a number of important research problems still remain unresolved. A specific challenge for computational linguistics is ambiguity: a word can be interpreted in more than one way, since it has more than one meaning. Ambiguity is usually not a problem for humans, hence it is not perceived as such. Conversely, for a computer ambiguity is one of the main problems encountered in the analysis and generation of natural languages. Two main strategies have been proposed to cope with ambiguity:

1. Word Sense Disambiguation: the task of selecting a sense for a word from a set of predefined possibilities; usually the so-called sense inventory (which provides, for each word, a list of all its possible meanings) comes from a dictionary or thesaurus.

2. Word Sense Discrimination: the task of dividing the usages of a word into different meanings, ignoring any particular existing sense inventory. The goal is to discriminate among word meanings based on information found in unannotated corpora.

The main difference between the two strategies is that disambiguation relies on a sense inventory, while discrimination exploits unannotated corpora. In the past years, several attempts were made to include sense disambiguation and discrimination techniques in IR systems. This is possible because discrimination and disambiguation are not an end in themselves, but rather "intermediate tasks" which contribute to more complex tasks such as information retrieval. This opens the possibility of an in vivo evaluation, where, rather than being evaluated in isolation, results are evaluated in terms of their contribution to the overall performance of a system designed for a particular application (e.g. Information Retrieval).

The goal of this paper is to present an IR system which exploits semantic spaces built on words and documents to overcome the problem of word ambiguity. Then we compare this system with another one which uses a Word Sense Disambiguation strategy. We evaluated the proposed system in the context of the CLEF 2009 Ad-Hoc Robust WSD task [2].

The paper is organized as follows: Section 2 presents the IR model involved in the evaluation, which embodies semantic vectors strategies. The evaluation and the results are reported in Section 3, while a brief discussion of the main works related to our research is in Section 4. Conclusions and future work close the paper.

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27-28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

2. AN IR SYSTEM BASED ON SEMANTIC VECTORS
Semantic Vectors are based on the WordSpace model [15]. This model relies on a vector space in which points are used to represent semantic concepts, such as words and documents. Using this strategy it is possible to build a vector space on both words and documents; these vector spaces can be exploited to develop an IR model, as described in the following. The main idea behind Semantic Vectors is that words are represented by points in a mathematical space, and words or documents with similar or related meanings are represented close to one another in that space. This provides us with an approach to perform sense discrimination. We adopt the Semantic Vectors package [18], which relies on a technique called Random Indexing (RI), introduced by Kanerva [13].

[Figure 1: Word vectors in word-space]
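To make the word-space intuition concrete, the following toy sketch measures closeness between word vectors with cosine similarity. It is our illustration only: the two labeled dimensions (LEGAL, SPORT) and all vector values are invented, whereas the dimensions of a real word space are built automatically from corpus co-occurrences and carry no labels.

```python
import math

# Toy word-space with two hand-picked context dimensions: [LEGAL, SPORT].
# All values are invented for illustration.
word_space = {
    "soccer": [0.1, 0.9],
    "law":    [0.9, 0.2],
    "court":  [0.7, 0.5],  # ambiguous: courts of law and tennis courts
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words end up closer (higher cosine) in the space.
print(cosine(word_space["soccer"], word_space["law"]))   # small angle cosine: low
print(cosine(word_space["court"], word_space["law"]))    # higher
```

In the same spirit as Figure 1, the angle between two word vectors expresses their degree of similarity; the ambiguous word sits between the two contexts.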
This makes it possible to build semantic vectors with no need to factorize the document-term or term-term matrix, because vectors are inferred using an incremental strategy. This method efficiently solves the problem of reducing dimensions, which is one of the key features used to uncover the "latent semantic dimensions" of a word distribution.

RI is based on the concept of Random Projection: the idea is that high-dimensional vectors chosen at random are "nearly orthogonal". This yields a result comparable to that of orthogonalization methods, such as Singular Value Decomposition, while saving computational resources. Specifically, RI creates semantic vectors in three steps:

1. a context vector is assigned to each document. This vector is sparse, high-dimensional and ternary, which means that its elements can take values in {-1, 0, 1}. The index vector contains a small number of randomly distributed non-zero elements, and the structure of this vector follows the hypothesis behind the concept of Random Projection;

2. context vectors are accumulated by analyzing the terms and the documents in which the terms occur. In particular, the semantic vector of each term is the sum of the context vectors of the documents which contain the term;

3. in the same way, the semantic vector of a document is the sum of the semantic vectors of the terms (created in step 2) which occur in the document.

The two spaces built on terms and documents have the same dimension. We can use vectors built on the word-space as query vectors and vectors built on the document-space as search vectors. Then, we can compute the similarity between word-space vectors and document-space vectors by means of the classical cosine similarity measure. In this way we implement an information retrieval model based on semantic vectors.

Figure 1 shows a word-space with only two dimensions. If those two dimensions refer respectively to LEGAL and SPORT contexts, we can note that the vector of the word soccer is closer to the SPORT context than to the LEGAL one; vice versa, the word law is closer to the LEGAL context. The angle between soccer and law represents the degree of similarity between the two words. It is important to emphasize that contexts in WordSpace have no tag: we know that each dimension is a context, but we cannot know the kind of context. If we consider the document-space rather than the word-space, documents semantically related will be represented closer in that space.

The Semantic Vectors package supplies tools for indexing a collection of documents and for their retrieval adopting the Random Indexing strategy. The package relies on Apache Lucene (http://lucene.apache.org/) to create a basic term-document matrix, then it uses the Lucene API to create both a word-space and a document-space from the term-document matrix, using Random Projection to perform dimensionality reduction without matrix factorization. In order to evaluate the Semantic Vectors model, we had to modify the standard Semantic Vectors package by adding some ad-hoc features to support our evaluation. In particular, documents are split in two fields, headline and title, and are not tokenized using the standard text analyzer in Lucene.

An important factor to take into account in a semantic-space model is the number of contexts, which sets the dimension of the context vectors. We evaluated Semantic Vectors using several values of reduced dimensions. Results of the evaluation are reported in Section 3.

3. EVALUATION
The goal of the evaluation was to establish how Semantic Vectors influence the retrieval performance. The system is evaluated in the context of an Information Retrieval (IR) task. We adopted the dataset used for the CLEF 2009 Ad-Hoc Robust WSD task [2]. The task organizers make available document collections (from the news domain) and topics which have been automatically tagged with word senses (synsets) from WordNet using several state-of-the-art disambiguation systems. Considering our goal, we exploit only the monolingual part of the task.

In particular, the Ad-Hoc WSD Robust task used existing CLEF news collections, but with WSD added. The dataset comprises corpora from the "Los Angeles Times" and the "Glasgow Herald", amounting to 169,477 documents, 160 test topics and 150 training topics. The WSD data were automatically added by systems from two leading research laboratories, UBC [1] and NUS [9]. Both systems returned word senses from the English WordNet, version 1.6. We used only the senses provided by NUS. Each term in the document is annotated with its senses and their respective scores, as assigned by the automatic WSD system. This kind of dataset supplies WordNet synsets that are useful for the development of search engines that rely on disambiguation.

Table 1: Semantic Vectors: results of the performed experiments

  Topic fields                   MAP
  TITLE                          0.0892
  TITLE+DESCRIPTION              0.2141
  TITLE+DESCRIPTION+NARRATIVE    0.2041

Table 2: Results of the performed experiments

  System     MAP      Imp.
  KEYWORD    0.3962   -
  MEANING    0.2930   -26.04%
  SENSE      0.4222   +6.56%
  SVbest     0.2141   -45.96%

In order to compare the IR system based on Semantic Vectors to other systems which cope with word ambiguity by means of methods based on Word Sense Disambiguation, we provide a baseline based on SENSE. SENSE (SEmantic N-levels Search Engine) is an IR system which relies on Word Sense Disambiguation. SENSE is based on the N-Levels model [5]. This model tries to overcome the limitations of the ranked keyword approach by introducing semantic levels, which integrate (and not simply replace) the lexical level represented by keywords. Semantic levels provide information about word meanings, as described in a reference dictionary or other semantic resources. SENSE is able to manage documents indexed at separate levels (keywords, word meanings, and so on), as well as to combine keyword search with the semantic information provided by the other indexing levels.
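The three-step Random Indexing construction described in Section 2 can be sketched as follows. This is a toy illustration under our own assumptions (three invented documents, small illustrative values for the dimension and the number of non-zero seeds); the actual Semantic Vectors package is a Java library built on Lucene, not this code.

```python
import math
import random

random.seed(0)
DIM, SEEDS = 100, 10  # dimensionality and non-zero entries per context vector (illustrative values)

def random_context_vector(dim=DIM, seeds=SEEDS):
    """Step 1: sparse ternary vector with a few randomly placed +1/-1 entries."""
    v = [0.0] * dim
    for i in random.sample(range(dim), seeds):
        v[i] = random.choice((-1.0, 1.0))
    return v

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented mini-collection.
docs = {
    "d1": "soccer player scored a goal".split(),
    "d2": "the court ruled on the new law".split(),
    "d3": "the soccer match went to court appeal".split(),
}

# Step 1: one random context vector per document.
context = {d: random_context_vector() for d in docs}

# Step 2: the semantic vector of a term is the sum of the context
# vectors of the documents in which the term occurs.
terms = {}
for d, words in docs.items():
    for w in set(words):
        terms[w] = add(terms.get(w, [0.0] * DIM), context[d])

# Step 3: the semantic vector of a document is the sum of the
# semantic vectors of its terms.
doc_vecs = {d: [0.0] * DIM for d in docs}
for d, words in docs.items():
    for w in words:
        doc_vecs[d] = add(doc_vecs[d], terms[w])

# Retrieval model: rank documents by cosine similarity between a
# word-space query vector and the document-space vectors.
ranking = sorted(docs, key=lambda d: cosine(terms["soccer"], doc_vecs[d]), reverse=True)
print(ranking)
```

Because random sparse ternary vectors are nearly orthogonal, documents that share terms with the query word accumulate overlapping context vectors and rank higher, without any matrix factorization.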
In particular, for each level:

1. a local scoring function is used in order to weigh the elements belonging to that level according to their informative power;

2. a local similarity function is used in order to compute document relevance by exploiting the above-mentioned scores.

Finally, a global ranking function is defined in order to combine the document relevance computed at each level. The SENSE search engine is described in [4], while the setup of SENSE in the context of CLEF 2009 is thoroughly described in [7].

In CLEF, queries are represented by topics, which are structured statements representing information needs. Each topic typically consists of three parts: a brief TITLE statement, a one-sentence DESCRIPTION, and a more complex NARRATIVE specifying the criteria for assessing relevance. All topics are available with and without WSD. Topics in English are disambiguated by both the UBC and NUS systems, yielding word senses from WordNet version 1.6.

We adopted as baseline the system which exploits only keywords during the indexing, identified by KEYWORD. Regarding disambiguation, we used the SENSE system adopting two strategies: the former, called MEANING, exploits only word meanings; the latter, called SENSE, uses two levels of document representation: keywords and word meanings combined.

The query for the KEYWORD system is built using the word stems in the TITLE and DESCRIPTION fields of the topics. All query terms are joined adopting the OR boolean clause. Regarding the MEANING system, each word in the TITLE and DESCRIPTION fields is expanded using the WordNet synsets provided by the WSD algorithm. More details regarding the evaluation of SENSE in CLEF 2009 are in [7]. The query for the SENSE system is built combining the strategies adopted for the KEYWORD and the MEANING systems. For all the runs we remove the stop words from both the index and the topics. In particular, we build a different stop word list for topics, in order to remove non-informative words such as find, reports and describe, which occur with high frequency in topics and are poorly discriminating.

In order to make results comparable, we use the same index built for the KEYWORD system to infer semantic vectors using the Semantic Vectors package, as described in Section 2. We need to tune two parameters in Semantic Vectors: the number of dimensions (the number of contexts) and the frequency threshold Tf (in this instance word frequency refers to word occurrences). The latter value is used to discard terms that have a frequency below Tf. After a tuning step, we set the dimension to 2000 and Tf to 10. Tuning is performed using the training topics provided by the CLEF organizers.

Queries for the Semantic Vectors model are built using several combinations of topic fields. Table 1 reports the results of the experiments using Semantic Vectors and different combinations of topic fields.

To compare the systems we use a single measure of performance: the Mean Average Precision (MAP), due to its good stability and discrimination capabilities. Given the Average Precision [8], that is the mean of the precision scores obtained after retrieving each relevant document, the MAP is computed as the sample mean of the Average Precision scores over all topics. Zero precision is assigned to unretrieved relevant documents.

Table 2 reports the results of each system involved in the experiment. The column Imp. shows the improvement with respect to the baseline KEYWORD. The system SVbest refers to the best result obtained by Semantic Vectors, reported in boldface in Table 1.

The main result of the evaluation is that MEANING works better than SVbest; in other words, disambiguation wins over discrimination. Another important observation is that the combination of keywords and word meanings, the SENSE system, obtains the best result. It is important to note that SVbest obtains a performance below the KEYWORD system, about 46% under the baseline. It is also important to underline that the keyword level implemented in SENSE uses a modified version of Apache Lucene which implements the Okapi BM25 model [14].

In the previous experiments we compared the performance of the Semantic Vectors-based IR system to SENSE. In the following, we describe a new kind of experiment in which we integrate Semantic Vectors as a new level in SENSE. The idea is to combine the results produced by Semantic Vectors with the results which come from both the keyword level and the word meaning level. Table 3 shows that the combination of the keyword level with Semantic Vectors outperforms the keyword level alone. Moreover, the combination of Semantic Vectors with the word meaning level achieves an interesting result: the combination is able to outperform the word meaning level alone. Finally, the combination of Semantic Vectors with SENSE (keyword level + word meaning level) obtains the best MAP, with an increase of about 6% with respect to KEYWORD. However, SV does not contribute to improve the effectiveness of SENSE; in fact, SENSE without SV (see Table 2) outperforms SV+SENSE.

Table 3: Results of the experiments: combination of Semantic Vectors with other levels

  System        MAP      Imp.
  SV+KEYWORD    0.4150   +4.74%
  SV+MEANING    0.3238   -18.27%
  SV+SENSE      0.4216   +6.41%

Analyzing the results query by query, we discovered that for some queries the Semantic Vectors-based IR system achieves a high improvement with respect to keyword search. This happens mainly when few relevant documents exist for a query. For example, query "10.2452/155-AH" has only three relevant documents. Both keyword and Semantic Vectors are able to retrieve all relevant documents for that query, but keyword achieves 0.1484 MAP, while for Semantic Vectors MAP grows to 0.7051. This suggests that Semantic Vectors are more accurate than keywords when few relevant documents exist for a query.

4. RELATED WORKS
The main motivation for focusing our attention on the evaluation of disambiguation or discrimination systems is the idea that ambiguity resolution can improve the performance of IR systems. Many strategies have been used to incorporate semantic information coming from electronic dictionaries into search paradigms.

Query expansion with WordNet has shown the potential to improve recall, as it allows matching relevant documents even if they do not contain the exact keywords in the query [17]. On the other hand, semantic similarity measures have the potential to redefine the similarity between a document and a user query [10]. The semantic similarity between concepts is useful to understand how similar the meanings of the concepts are. However, computing the degree of relevance of a document with respect to a query means computing the similarity among all the synsets of the document and all the synsets of the user query, thus the matching process could have very high computational costs.

In [12] the authors performed a shift of representation from a lexical space, where each dimension is represented by a term, towards a semantic space, where each dimension is represented by a concept expressed using WordNet synsets. Then, they applied the Vector Space Model to WordNet synsets. The realization of the semantic tf-idf model was rather simple, because it was sufficient to index the documents and the user query by using strings representing synsets. The retrieval phase is similar to the classic tf-idf model, with the only difference that matching is carried out between synsets.

Concerning discrimination methods, in [11] some experiments in an IR context adopting the LSI technique are reported. In particular, this method performs better than the canonical vector space model when queries and relevant documents do not share many words. In this case LSI takes advantage of the implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of the terms found in queries.

In order to show that the WordSpace model is an approach to ambiguity resolution that is beneficial in information retrieval, we summarize the experiment presented in [16]. This experiment evaluates sense-based retrieval, a modification of the standard vector space model in information retrieval. In word-based retrieval, documents and queries are represented as vectors in a multidimensional space in which each dimension corresponds to a word. In sense-based retrieval, documents and queries are also represented in a multidimensional space, but its dimensions are senses, not words. The evaluation shows that sense-based retrieval improved average precision by 7.4% when compared to word-based retrieval.

Regarding the evaluation of word sense disambiguation systems in the context of IR, it is important to cite SemEval-2007 task 1 [3]. This task is an application-driven one, where the application is a given cross-lingual information retrieval system. Participants disambiguate text by assigning WordNet synsets; then the system has to perform the expansion to other languages, the indexing of the expanded documents, and the retrieval for all the languages in batch. The retrieval results are taken as a measure of the effectiveness of the disambiguation. The CLEF 2009 Ad-Hoc Robust WSD task [2] is inspired by SemEval-2007 task 1.

Finally, this work is strongly related to [6], in which a first attempt to integrate Semantic Vectors in an IR system was performed.

5. CONCLUSIONS AND FUTURE WORK
We have evaluated Semantic Vectors in an information retrieval scenario. The IR system which we propose relies on semantic vectors to induce a WordSpace model exploited during the retrieval process. Moreover, we compared the proposed IR system with another one which exploits word sense disambiguation. The main outcome of this comparison is that disambiguation works better than discrimination. This is a counterintuitive result: one would expect discrimination to be better than disambiguation, since the former is able to infer the usages of a word directly from the documents, while disambiguation works on a fixed distinction of word meanings encoded in a sense inventory such as WordNet.

It is important to note that the dataset used for the evaluation depends on the method adopted to compute document relevance, in this case the pooling technique. This means that the results submitted by the groups participating in the previous ad hoc tasks are used to form a pool of documents for each topic by collecting the highly ranked documents. What we want to underline here is that the systems taken into account generally rely on keywords. This can produce relevance judgements that do not take into account evidence provided by other features, such as word meanings or context vectors. Moreover, distributional semantics methods, such as Semantic Vectors, do not provide a formal description of why two terms or documents are similar: the semantic associations derived by Semantic Vectors resemble the way humans estimate similarity between terms or documents. It is not clear whether current evaluation methods are able to detect these cognitive aspects typical of human thinking. More investigation on the strategy adopted for the evaluation is needed.
As future work we intend to exploit several discrimination methods, such as Latent Semantic Indexing and Hyperspace Analogue to Language.

6. REFERENCES
[1] E. Agirre and O. L. de Lacalle. UBC-ALM: Combining k-NN with SVD for WSD. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), Prague, Czech Republic, pages 341-345, 2007.

[2] E. Agirre, G. M. Di Nunzio, T. Mandl, and A. Otegi. CLEF 2009 Ad Hoc Track Overview: Robust-WSD Task. In Working Notes for the CLEF 2009 Workshop, 2009. http://clef-campaign.org/2009/working notes/agirre-robustWSDtask-paperCLEF2009.pdf.

[3] E. Agirre, B. Magnini, O. L. de Lacalle, A. Otegi, G. Rigau, and P. Vossen. SemEval-2007 Task 1: Evaluating WSD on Cross-Language Information Retrieval. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), Prague, Czech Republic, pages 7-12. ACL, 2007.

[4] P. Basile, A. Caputo, M. de Gemmis, A. L. Gentile, P. Lops, and G. Semeraro. Improving Ranked Keyword Search with SENSE: SEmantic N-levels Search Engine. Communications of SIWN (formerly: System and Information Sciences Notes), special issue on DART 2008, 5:39-45, August 2008. SIWN: The Systemics and Informatics World Network.

[5] P. Basile, A. Caputo, A. L. Gentile, M. Degemmis, P. Lops, and G. Semeraro. Enhancing Semantic Search using N-Levels Document Representation. In S. Bloehdorn, M. Grobelnik, P. Mika, and D. T. Tran, editors, Proceedings of the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008), Tenerife, Spain, June 2nd, 2008, volume 334 of CEUR Workshop Proceedings, pages 29-43. CEUR-WS.org, 2008.

[6] P. Basile, A. Caputo, and G. Semeraro. Exploiting Disambiguation and Discrimination in Information Retrieval Systems. In Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence and International Conference on Intelligent Agent Technology - Workshops, Milan, Italy, 15-18 September 2009, pages 539-542. IEEE, 2009.

[7] P. Basile, A. Caputo, and G. Semeraro. UNIBA-SENSE @ CLEF 2009: Robust WSD task. In Working Notes for the CLEF 2009 Workshop, 2009. http://clef-campaign.org/2009/working notes/basile-paperCLEF2009.pdf.

[8] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, New York, NY, USA, 2000. ACM.

[9] Y. S. Chan, H. T. Ng, and Z. Zhong. NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), Prague, Czech Republic, pages 253-256, 2007.

[10] C. Corley and R. Mihalcea. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13-18, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[11] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.

[12] J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING/ACL, pages 38-44, 1998.

[13] P. Kanerva. Sparse Distributed Memory. MIT Press, 1988.

[14] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 42-49, New York, NY, USA, 2004. ACM.

[15] M. Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, Faculty of Humanities, Department of Linguistics, 2006.

[16] H. Schütze and J. O. Pedersen. Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 161-175, 1995.

[17] E. M. Voorhees. Using WordNet for text retrieval. In WordNet: An Electronic Lexical Database, pages 285-304. The MIT Press, Cambridge (Mass.), 1998.

[18] D. Widdows and K. Ferraro. Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), 2008.