Natat in Cerebro: Intelligent Information Retrieval for "The Guillotine" Language Game∗ [Extended Abstract]

Pierpaolo Basile, Marco de Gemmis, Pasquale Lops, Giovanni Semeraro
Dept. of Computer Science, University of Bari, via E. Orabona 4, Bari, Italy
basilepp@di.uniba.it, degemmis@di.uniba.it, lops@di.uniba.it, semeraro@di.uniba.it

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

∗ The full version appears in [3].

ABSTRACT
This paper describes OTTHO (On the Tip of my THOught), a system designed for solving a language game called Guillotine. The rule of the game is simple: the player observes five words, generally unrelated to each other, and within one minute has to provide a sixth word semantically connected to the others. The system performs retrieval from several knowledge sources, such as a dictionary, a set of proverbs, and Wikipedia, to realize a knowledge infusion process. The main motivation for designing an artificial player for Guillotine is the challenge of providing the machine with the cultural and linguistic background knowledge that makes it similar to a human being, able to interpret natural language documents and to reason on their content. We believe that the approach presented in this work has great potential for other, more practical applications besides solving a language game.

1. BACKGROUND AND MOTIVATION
Words are popular features of many games, and they play a central role in language games, defined as games involving natural language in which word meanings play an important role. Language games draw their challenge and excitement from the richness and ambiguity of natural language. In this paper we present a system that tries to play the Guillotine game. The Guillotine is a language game played in a show on RAI, the Italian National Broadcasting Service, in which a player is given a set of five words (clues), each linked in some way to a specific word that represents the unique solution of the game. The player receives one word at a time and must choose between two proposed words: one is correct, the other is wrong. Each time the wrong word is chosen, the prize money is halved (hence the name Guillotine). The five words are generally unrelated to each other, but each of them is strongly related to the word representing the solution. Once the five clues are given, the player has one minute to provide the solution.

An example of the game follows: given the five words Capital, Pope, City, Colosseum, YellowAndRed, the solution is Rome, because Rome is the capital of Italy, the Pope lives in Rome, Rome is a city, the Colosseum is in Rome, and YellowAndRed is an alternative name for one of the Rome football teams. Often the solution is not so intuitive, and the player needs different knowledge sources in order to reason on the clues and find the correct word.

OTTHO (On the Tip of my THOught) tries to solve the final stage of the Guillotine game. We assume that the five words are provided at the same time, neglecting the initial phase of choosing the words, which only affects the amount of the initial prize.

2. OTTHO
Guillotine is a cultural and linguistic game; for this reason we need to define an extended knowledge base representing the cultural and linguistic background knowledge of the player. Next, we have to realize a reasoning mechanism able to retrieve the most appropriate pieces of knowledge needed to solve the game.

2.1 The Knowledge Sources
After a thorough analysis of the correlation between the clues and the solution, we chose to include the following knowledge sources, ranked according to the frequency with which they were helpful in finding the solution of the game:
1) Dictionary: the word representing the solution is contained in the description of a lemma or in some example phrases using that lemma;
2) Encyclopedia: as for the dictionary, the description of an article contains the solution, but in this case it is necessary to process a larger amount of more detailed information;
3) Proverbs and aphorisms: short pieces of text in which the solution occurs very close to the clues.

These sources need to be organized and processed in order to model relationships between words. The modeling process must cope with the different characteristics of the various knowledge sources, resulting in a set of different heuristics for building the whole model on which the reasoning mechanism is applied. Since we are interested in finding relationships between words, we decided to model each knowledge source using the set of correlations between terms occurring in that specific source (a proverb, a definition in a dictionary, etc.). Specifically, we used a term-term matrix containing the terms occurring in the modeled knowledge source, in which each cell contains a weight representing the degree of correlation between the term on the row and the one on the column. The computation of the weights is different for each type of knowledge source.

For the dictionary, we used the on-line De Mauro Paravia Italian dictionary¹, containing 160,000 lemmas. We obtained a lemma-term matrix containing weights representing the relationship between a lemma and the terms used to describe it. Because of the general lemma-definition organization of dictionary entries, we can fairly claim that the model is language-independent. Each Web page describing a lemma has been preprocessed in order to extract the most relevant information for computing the weights in the matrix. The text of each Web page is processed to skip the HTML tags, but the formatting information is preserved in order to give higher weights to terms formatted in bold or italic font. Stopwords are eliminated, and abbreviations used in the definition of the lemma are expanded. Weights in the matrix are computed using a classical tf-idf scheme, normalized with respect to the length of the definition in which the term occurs and the length of the entire dictionary. A detailed description of the heuristics for modeling the dictionary is reported in [5].

As for the dictionary, a tf-idf strategy has been used to define the weights in the term-term matrix modeling the knowledge source of proverbs, a collection of 1,600 proverbs gathered from the web².

The process of modeling Wikipedia is different from the one adopted for proverbs and the dictionary, due to the huge amount of information to be processed. We adopted a more scalable approach for processing Wikipedia entries, using models that represent concepts through vectors in a high-dimensional space, such as the Semantic Vectors or WordSpace models [4]. The core idea behind semantic vectors is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near one another in that space (the geometric metaphor of meaning). The semantic vectors model is grounded in a theory of meaning called the distributional hypothesis, according to which the meaning of a word is determined by the rules of its use in the context of ordinary and concrete language behavior. This means that words are semantically similar to the extent that they share contexts (surrounding words). If "beer" and "wine" frequently occur in the same context, say after "drink", the hypothesis states that they are semantically related or similar.

2.2 The Reasoning Mechanism
As reasoning mechanism for OTTHO, we adopt a spreading activation model [1], which has been used in other areas of Computer Science such as Information Retrieval [2]. The pure spreading activation model consists of a network data structure of nodes interconnected by links, which may be labeled and/or weighted and usually have directions. In the network for "The Guillotine" game, nodes represent words, while links denote associations between words obtained from the knowledge sources. Spreading in the network is triggered by the clues: their activation causes words with related meanings (as modeled in the knowledge sources) to become active. At the end of the weight propagation process, the most "active" words represent good candidates to be the solution of the game.

3. BEYOND THE GAME
The system could be used to implement an alternative paradigm for associative retrieval on collections of text documents [2], in which an initial indexing phase of documents can spread further "hidden" terms for retrieving other related documents. The identification of hidden terms might rely on the integration of specific pieces of knowledge relevant for the domain of interest. This might represent a valuable strategy for several domains, such as search engine advertising, in which customers' search terms (and interests) need to be matched with those of advertisers. Spreading activation can also be effectively combined with document retrieval for semantic desktop search.

4. REFERENCES
[1] A. M. Collins and E. F. Loftus. A Spreading Activation Theory of Semantic Processing. Psychological Review, 82(6):407–428, 1975.
[2] F. Crestani. Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 11(6):453–482, 1997.
[3] P. Lops, P. Basile, M. de Gemmis, and G. Semeraro. "Language Is the Skin of My Thought": Integrating Wikipedia and AI to Support a Guillotine Player. In AI*IA 2009, LNCS 5883, pages 324–333. Springer, 2009.
[4] M. Sahlgren. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. PhD thesis, Stockholm University, Department of Linguistics, 2006.
[5] G. Semeraro, P. Lops, P. Basile, and M. de Gemmis. On the Tip of My Thought: Playing the Guillotine Game. In IJCAI 2009, pages 1543–1548. Morgan Kaufmann, 2009.

¹ http://old.demauroparavia.it/
² http://web.tiscali.it/proverbiitaliani and http://giavelli.interfree.it/proverbi_ita.html
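To make the term-term modeling of Section 2.1 concrete, the following is a minimal sketch of building a weighted term-term matrix from a toy proverb collection. The proverbs, the idf-based weighting, and the length normalization are simplified placeholders of our choosing, not the exact heuristics of the system, which are detailed in [5].

```python
import math
from collections import Counter, defaultdict

# Toy proverb collection (hypothetical data; the real source is ~1,600 Italian proverbs).
proverbs = [
    ["red", "sky", "night", "sailor", "delight"],
    ["red", "sky", "morning", "sailor", "warning"],
    ["all", "roads", "lead", "rome"],
]

# Document frequency of each term, for the idf component.
df = Counter(t for p in proverbs for t in set(p))
n_docs = len(proverbs)

def idf(term):
    return math.log(n_docs / df[term])

# Term-term matrix: the weight of (row, col) grows with how often the two
# terms co-occur in a proverb, scaled by the column term's idf and
# normalized by proverb length (a rough analogue of a tf-idf scheme).
matrix = defaultdict(float)
for p in proverbs:
    for row in set(p):
        for col in set(p):
            if row != col:
                matrix[(row, col)] += idf(col) / len(p)
```

After the loop, `matrix[("red", "sky")]` holds a positive correlation weight, while pairs that never co-occur, such as `("red", "rome")`, stay at zero.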
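The distributional hypothesis discussed in Section 2.1 can be illustrated with a small sketch: context vectors are built from sentence-level co-occurrence counts and compared by cosine similarity. The corpus below is invented, and real WordSpace models [4] use dimensionality-reduced vectors rather than raw counts; this only shows the underlying idea that shared contexts imply semantic relatedness.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences) illustrating the distributional hypothesis.
corpus = [
    "we drink beer at the pub",
    "we drink wine at dinner",
    "they drink beer and wine",
    "the table is made of wood",
]

# Context vector of a word: counts of the words appearing in the same sentence.
def context_vector(word):
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            vec.update(t for t in tokens if t != word)
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because "beer" and "wine" share the "drink" context, `cosine(context_vector("beer"), context_vector("wine"))` comes out higher than the similarity between "beer" and "table".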
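The reasoning step of Section 2.2 can be sketched as a small spreading activation loop: clue nodes receive an initial activation, activation flows along weighted links with a decay factor for a few pulses, and the most active non-clue nodes are returned as candidate solutions. The word network and weights here are hand-made toy values, not those actually derived from the knowledge sources.

```python
# Hand-built word network (hypothetical weights) for the Rome example.
edges = {
    "capital":   {"rome": 0.9, "money": 0.4},
    "pope":      {"rome": 0.8, "church": 0.7},
    "colosseum": {"rome": 0.9, "ruins": 0.5},
}

def spread(clues, decay=0.8, pulses=2):
    activation = {c: 1.0 for c in clues}   # clues start fully active
    frontier = dict(activation)
    for _ in range(pulses):
        pulse = {}
        # Propagate each frontier node's activation to its neighbours,
        # attenuated by the link weight and the decay factor.
        for node, value in frontier.items():
            for neigh, w in edges.get(node, {}).items():
                pulse[neigh] = pulse.get(neigh, 0.0) + value * w * decay
        for node, value in pulse.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = pulse
    # Candidate solutions: the most active nodes that are not clues.
    return sorted((n for n in activation if n not in clues),
                  key=lambda n: activation[n], reverse=True)
```

With the clues {"capital", "pope", "colosseum"}, the node "rome" accumulates activation from all three clues and ranks first among the candidates.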