Natat in Cerebro: Intelligent Information Retrieval for "The Guillotine" Language Game∗ [Extended Abstract]

Pierpaolo Basile, Marco de Gemmis, Pasquale Lops, Giovanni Semeraro
Dept. of Computer Science, University of Bari, via E. Orabona 4, Bari, Italy
basilepp@di.uniba.it, degemmis@di.uniba.it, lops@di.uniba.it, semeraro@di.uniba.it

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

∗ The full version appears in [3].

ABSTRACT
This paper describes OTTHO (On the Tip of my THOught), a system designed for solving a language game called Guillotine. The rule of the game is simple: the player observes five words, generally unrelated to each other, and within one minute has to provide a sixth word semantically connected to the others. The system performs retrieval from several knowledge sources, such as a dictionary, a set of proverbs, and Wikipedia, to realize a knowledge infusion process. The main motivation for designing an artificial player for Guillotine is the challenge of providing the machine with the cultural and linguistic background knowledge that makes it similar to a human being, able to interpret natural language documents and to reason on their content. We believe that the approach presented in this work has great potential for other, more practical applications besides solving a language game.

1. BACKGROUND AND MOTIVATION
Words are popular features of many games, and they play a central role in language games, defined as games involving natural language in which word meanings play an important role. Language games draw their challenge and excitement from the richness and ambiguity of natural language. In this paper we present a system that tries to play the Guillotine game. The Guillotine is a language game played in a show on RAI, the Italian National Broadcasting Service, in which a player is given a set of five words (clues), each linked in some way to a specific word that represents the unique solution of the game. The player receives one word at a time and must choose between two proposed words: one is correct, the other is wrong. Each time the wrong word is chosen, the prize money is halved (hence the name Guillotine). The five words are generally unrelated to each other, but each of them is strongly related to the word representing the solution. Once the five clues are given, the player has one minute to provide the solution.

An example of the game follows: given the five words Capital, Pope, City, Colosseum, YellowAndRed, the solution is Rome, because Rome is the capital of Italy, the Pope lives in Rome, Rome is a city, the Colosseum is in Rome, and YellowAndRed is an alternative name for one of the Rome football teams. Often the solution is not so intuitive, and the player needs different knowledge sources in order to reason on the clues and find the correct word.

OTTHO (On the Tip of my THOught) tries to solve the final stage of the Guillotine game. We assume that the five words are provided at the same time, neglecting the initial phase of choosing the words, which only affects the amount of the initial prize.

2. OTTHO
Guillotine is a cultural and linguistic game; for this reason we need to define an extended knowledge base representing the cultural and linguistic background knowledge of the player. Next, we have to realize a reasoning mechanism able to retrieve the most appropriate pieces of knowledge needed to solve the game.

2.1 The Knowledge Sources
After a thorough analysis of the correlation between the clues and the solution, we chose to include the following knowledge sources, ranked according to the frequency with which they were helpful in finding the solution of the game:
1) Dictionary: the word representing the solution is contained in the description of a lemma or in some example phrases using that lemma;
2) Encyclopedia: as for the dictionary, the description of an article contains the solution, but in this case it is necessary to process a larger amount of more detailed information;
3) Proverbs and aphorisms: short pieces of text in which the solution occurs very close to the clues.

These sources need to be organized and processed in order to model relationships between words. The modeling process must cope with the different characteristics of the various knowledge sources, resulting in a set of different heuristics for building the whole model on which the reasoning mechanism is applied. Since we are interested in finding relationships between words, we decided to model each knowledge source using the set of correlations between terms occurring in that specific source (a proverb, a definition in a dictionary, etc.). Specifically, we used a term-term matrix containing the terms occurring in the modeled knowledge source, in which each cell contains a weight representing the degree of correlation between the term on the row and the one on the column. The computation of the weights is different for each type of knowledge source.

For the dictionary, we used the on-line De Mauro Paravia Italian dictionary¹, containing 160,000 lemmas. We obtained a lemma-term matrix containing weights representing the relationship between a lemma and the terms used to describe it. Because of the general lemma-definition organization of dictionary entries, we can fairly claim that the model is language-independent. Each Web page describing a lemma has been preprocessed in order to extract the most relevant information for computing the weights in the matrix. The text of each Web page is processed to skip the HTML tags, but the formatting information is preserved in order to give higher weights to terms formatted in bold or italic font. Stopwords are eliminated, and abbreviations used in the definition of the lemma are expanded. Weights in the matrix are computed using a classical tf-idf scheme, normalized with respect to the length of the definition in which the term occurs and the length of the entire dictionary. A detailed description of the heuristics for modeling the dictionary is reported in [5].

As for the dictionary, a tf-idf strategy has been used to define the weights in the term-term matrix modeling the knowledge source of proverbs, a collection of 1,600 proverbs gathered from the web².

The process of modeling Wikipedia is different from the one adopted for proverbs and the dictionary, due to the huge amount of information to be processed. We adopted a more scalable approach for processing Wikipedia entries, using models that represent concepts through vectors in a high-dimensional space, such as the Semantic Vectors or WordSpace models [4]. The core idea behind semantic vectors is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near one another in that space (the geometric metaphor of meaning). The semantic vectors model is grounded in a theory of meaning called the distributional hypothesis, according to which the meaning of a word is determined by the rules of its use in the context of ordinary and concrete language behavior. This means that words are semantically similar to the extent that they share contexts (surrounding words). If "beer" and "wine" frequently occur in the same context, say after "drink", the hypothesis states that they are semantically related or similar.

2.2 The Reasoning Mechanism
As reasoning mechanism for OTTHO, we adopt a spreading activation model [1], which has been used in other areas of Computer Science such as Information Retrieval [2]. The pure spreading activation model consists of a network data structure of nodes interconnected by links, which may be labeled and/or weighted and usually have directions. In the network for "The Guillotine" game, nodes represent words, while links denote associations between words obtained from the knowledge sources. Spreading in the network is triggered by the clues: their activation causes words with related meanings (as modeled in the knowledge sources) to become active. At the end of the weight propagation process, the most "active" words represent good candidates to be the solution of the game.

3. BEYOND THE GAME
The system could be used to implement an alternative paradigm for associative retrieval on collections of text documents [2], in which an initial indexing phase of documents can spread further "hidden" terms for retrieving other related documents. The identification of hidden terms might rely on the integration of specific pieces of knowledge relevant for the domain of interest. This might represent a valuable strategy for several domains, such as search engine advertising, in which customers' search terms (and interests) need to be matched with those of advertisers. Spreading activation can also be effectively combined with document retrieval for semantic desktop search.

4. REFERENCES
[1] A. M. Collins and E. F. Loftus. A Spreading Activation Theory of Semantic Processing. Psychological Review, 82(6):407–428, 1975.
[2] F. Crestani. Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 11(6):453–482, 1997.
[3] P. Lops, P. Basile, M. de Gemmis, and G. Semeraro. "Language Is the Skin of My Thought": Integrating Wikipedia and AI to Support a Guillotine Player. In AI*IA 2009, LNCS 5883, pages 324–333. Springer, 2009.
[4] M. Sahlgren. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. PhD thesis, Stockholm University, Department of Linguistics, 2006.
[5] G. Semeraro, P. Lops, P. Basile, and M. de Gemmis. On the Tip of My Thought: Playing the Guillotine Game. In IJCAI 2009, pages 1543–1548. Morgan Kaufmann, 2009.

¹ http://old.demauroparavia.it/
² http://web.tiscali.it/proverbiitaliani and http://giavelli.interfree.it/proverbi_ita.html
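To make the term-term modeling of Section 2.1 concrete, the following is a minimal sketch of building a weighted term-term matrix from a toy proverb collection. The proverbs, the idf-based weighting, and the length normalization are simplified placeholders of our choosing, not the exact heuristics of the system, which are detailed in [5].

```python
import math
from collections import Counter, defaultdict

# Toy proverb collection (hypothetical data; the real source is ~1,600 Italian proverbs).
proverbs = [
    ["red", "sky", "night", "sailor", "delight"],
    ["red", "sky", "morning", "sailor", "warning"],
    ["all", "roads", "lead", "rome"],
]

# Document frequency of each term, for the idf component.
df = Counter(t for p in proverbs for t in set(p))
n_docs = len(proverbs)

def idf(term):
    return math.log(n_docs / df[term])

# Term-term matrix: the weight of (row, col) grows with how often the two
# terms co-occur in a proverb, scaled by the column term's idf and
# normalized by proverb length (a rough analogue of a tf-idf scheme).
matrix = defaultdict(float)
for p in proverbs:
    for row in set(p):
        for col in set(p):
            if row != col:
                matrix[(row, col)] += idf(col) / len(p)
```

After the loop, `matrix[("red", "sky")]` holds a positive correlation weight, while pairs that never co-occur, such as `("red", "rome")`, stay at zero.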
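The distributional hypothesis discussed in Section 2.1 can be illustrated with a small sketch: context vectors are built from sentence-level co-occurrence counts and compared by cosine similarity. The corpus below is invented, and real WordSpace models [4] use dimensionality-reduced vectors rather than raw counts; this only shows the underlying idea that shared contexts imply semantic relatedness.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences) illustrating the distributional hypothesis.
corpus = [
    "we drink beer at the pub",
    "we drink wine at dinner",
    "they drink beer and wine",
    "the table is made of wood",
]

# Context vector of a word: counts of the words appearing in the same sentence.
def context_vector(word):
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            vec.update(t for t in tokens if t != word)
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because "beer" and "wine" share the "drink" context, `cosine(context_vector("beer"), context_vector("wine"))` comes out higher than the similarity between "beer" and "table".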
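The reasoning step of Section 2.2 can be sketched as a small spreading activation loop: clue nodes receive an initial activation, activation flows along weighted links with a decay factor for a few pulses, and the most active non-clue nodes are returned as candidate solutions. The word network and weights here are hand-made toy values, not those actually derived from the knowledge sources.

```python
# Hand-built word network (hypothetical weights) for the Rome example.
edges = {
    "capital":   {"rome": 0.9, "money": 0.4},
    "pope":      {"rome": 0.8, "church": 0.7},
    "colosseum": {"rome": 0.9, "ruins": 0.5},
}

def spread(clues, decay=0.8, pulses=2):
    activation = {c: 1.0 for c in clues}   # clues start fully active
    frontier = dict(activation)
    for _ in range(pulses):
        pulse = {}
        # Propagate each frontier node's activation to its neighbours,
        # attenuated by the link weight and the decay factor.
        for node, value in frontier.items():
            for neigh, w in edges.get(node, {}).items():
                pulse[neigh] = pulse.get(neigh, 0.0) + value * w * decay
        for node, value in pulse.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = pulse
    # Candidate solutions: the most active nodes that are not clues.
    return sorted((n for n in activation if n not in clues),
                  key=lambda n: activation[n], reverse=True)
```

With the clues {"capital", "pope", "colosseum"}, the node "rome" accumulates activation from all three clues and ranks first among the candidates.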