Overview of the EVALITA 2018 Solving language games (NLP4FUN) Task

    Pierpaolo Basile and Marco de Gemmis and Lucia Siciliani and Giovanni Semeraro
                             Department of Computer Science
                              Via E. Orabona, 4 - 70125 Bari
                               University of Bari Aldo Moro
                         {firstname.lastname}@uniba.it


                     Abstract                       language, and therefore have attracted the atten-
     English. This paper describes the first        tion of researchers in the fields of Artificial Intel-
     edition of the “Solving language games”        ligence and Natural Language Processing. For in-
     (NLP4FUN) task at the EVALITA 2018             stance, IBM Watson is a system which success-
     campaign. The task consists in design-         fully challenged human champions of Jeopardy!,
     ing an artificial player for “The Guillo-      a game in which contestants are presented with
     tine” (La Ghigliottina, in Italian), a chal-   clues in the form of answers, and must phrase their
     lenging language game which demands            responses in the form of a question (Ferrucci et
     knowledge covering a broad range of top-       al., 2010; Molino et al., 2015). Another popular
     ics. The game consists in finding a word       language game is solving crossword puzzles. The
     which is semantically correlated with a        first experience reported in the literature is Proverb
     set of 5 words called clues. Artificial        (Littman et al., 2002), that exploits large libraries
     players for that game can take advantage       of clues and solutions to past crossword puzzles.
     from the availability of open repositories     WebCrow is the first solver for Italian crosswords
     on the web, such as Wikipedia, that pro-       (Ernandes et al., 2008).
     vide the system with the cultural and lin-        The proposed task consists in designing a solver
     guistic background needed to find the so-      for “The Guillotine” (La Ghigliottina, in Italian)
     lution.                                        game. It is inspired by the final game of an Italian
     Italiano.      Questo lavoro descrive la       TV show called “L’eredità”. The game, broadcast
     prima edizione del task “Solving lan-          by Italian National TV, involves a single player,
     guage games” (NLP4FUN) task, pro-              who is given a set of five words - the clues - each
     posto durante la campagna di valutazione       linked in some way to a specific word that rep-
     EVALITA 2018. Il task consiste nella           resents the unique solution of the game. Words
     realizzazione di un giocatore artificiale      are unrelated to each other, but each of them has
     per “La Gigliottina”, un gioco linguistico     a hidden association with the solution. Once the
     molto sfidante, la cui soluzione richiede      clues are given, the player has one minute to find
     conoscenze in svariati campi. Il gioco         the solution. For example, given the five clues:
     consiste nel trovare una parola il cui sig-    sin, Newton, doctor, New York, bad, the solution
     nificato è correlato a quello di un insieme   is apple, because: the apple is the symbol of orig-
     di 5 parole, chiamate indizi. Un gioca-        inal sin in Christian theology; Newton discovered
     tore artificiale per questo task potrebbe      the gravity by means of an apple; “an apple a day
     sfruttare diverse sorgenti di conoscenza       keeps the doctor away” is a famous proverb; New
     disponibili online, come Wikipedia, che        York city is also called “the big apple”; and “one
     forniscano al sistema le conoscenze lin-       bad apple can spoil the whole bunch” is a popu-
     guistiche e culturali necessarie per ar-       lar phrase which figuratively means that the per-
     rivare alla soluzione.                         son doing wrong can have a negative influence on
                                                    those around him. “La Ghigliottina” is a chal-
                                                    lenging language game which demands knowl-
1    Motivation
                                                    edge covering a broad range of topics. Artificial
Language games draw their challenge and excite-     players for that game can take advantage from the
ment from the richness and ambiguity of natural     availability of open repositories on the web, such
as Wikipedia, that provide the system with the cul-              <clue>cane</clue>
tural and linguistic background needed to under-                 <clue>musica</clue>
stand clues (Basile et al., 2016; Semeraro et al.,               <clue>casa</clue>
2009; Semeraro et al., 2012).                                    <clue>pietra</clue>
   The task is part of EVALITA 2018, the pe-                     <solution>chiesa</solution>
riodic evaluation campaign of Natural Language                   <type>TV</type>
Processing (NLP) and speech tools for the Italian         </game>
language (Caselli et al., 2018).                          ...
   The paper is organized as follows: Section 2         </games>
reports details about the task, the dataset and the        The XML file consists of a root element games
evaluation protocol, while Section 3 describes the      which contains several game elements. Each game
systems participating in the task, and Section 4        has five clue elements and one solution. Moreover,
shows results.                                          the element type specifies the type of the game: TV
                                                        or boardgame.
2    Task Description: Dataset, Evaluation
                                                           The ranked list of solutions must be provided in
     Protocol and Measures
                                                        a single plain text file, according to the following
An instance of the game consists of a set of 5 clue     format:
words and 1 word given as the official solution for
                                                        id solution score rank time
that instance. We provided:
                                                           Values were separated by a whitespace charac-
    • a training set for the system development,        ter; time taken by the system to compute the list
      containing 315 instances of the game;             was also reported in milliseconds. An example of
                                                        a ranked list of solutions is reported below:
    • a test set for the evaluation, containing 105
      instances of the game.                            3fc953bd-... porta 0.978 1 3459
                                                        3fc953bd-... chiesa 0.932 2 3251
   In order to measure the performance of the par-
                                                        3fc953bd-... santo 0.897 3 4321
ticipants on games having different levels of diffi-
                                                        ...
culty, we provided instances taken both from the
                                                        3fc953bd-... carta 0.321 100 2343
TV game and from the official board game. In the
                                                        ...
training set, 204 instances (64.8%) came from the
TV game, 111 (35.2%) from the board game. In            2.2 Evaluation
the test set, 66 instances (62.9%) were collected
                                                        As evaluation measure, we adopt a weighted ver-
from the TV game, 39 (37.1%) from the board
                                                        sion of Mean Reciprocal Rank (MRR). Since time
game. In order to discourage participants from
                                                        is a critical factor in this game, the Reciprocal
cheating (e.g. finding the solution manually), in
                                                        Rank is weighted by a function which lowers the
the test set we included 300 fake games automat-
                                                        score based on the time taken by the computation.
ically created by us. Obviously, fake games were
                                                        In fact, in the TV game, the player has only one
not taken into account in the evaluation.
                                                        minute to provide the solution. Taking into ac-
   Any knowledge resource can be used to build
                                                        count these factors, the evaluation measure was:
an artificial player, except further instances of the
game. For each instance of the game, a ranked                         1 X 1         1 1
list of maximum 100 tentative solutions must be                                 max( , )                (1)
                                                                     |G| g∈G rg     tg 10
provided.
                                                           where G is the set of games and rg is the rank
2.1 Data Format
                                                        of the solution, while tg denotes the minutes taken
Both development and test set were provided in          by the system to give the tentative solutions. Sys-
XML format:                                             tems that took more than 10 minutes are equally
<games>                                                 penalized.
  <game>                                                   The evaluation was performed only on the 105
             <id>3fc953bd...</id>                       test games, for which we knew the correct solution
             <clue>uomo</clue>                          (results provided for fake games were excluded).
  We provided a separate ranking for TV and                       remarkable performance: MRR is very high, thus
boardgame, but the final ranking was computed on                  showing that the system is able to place the solu-
the the whole test set.                                           tion in the first positions of the ranking. We report,
                                                                  also, the standard MRR (M RR(std)) computed
3    Systems                                                      without taking into account the time. We notice
Twelve teams registered in the task, but only two                 that for UNIOR4NLP the value is equal to M RR:
of them actually submitted the results for the eval-              the system is able to provide the solution always in
uation. A short description of each system fol-                   the first minute, while the Squadrone system takes
lows:                                                             more time for solving games.
                                                                     Table 2 reports the results by game type (66 in-
UNIOR4FUN - The system described in (San-                         stances from the TV game and 39 instances from
   gati et al., 2018) is based on the idea that                   the boardgame). UNIOR4NLP shows similar re-
   clue words and corresponding solution are                      sults for both the game types, while the system
   often part of a multiword expression. There-                   proposed by Squadrone performs better on board
   fore, the system exploits six linguistic pat-                  games.
   terns1 that identify valid multiword expres-                      One possible explanation for this difference is
   sions connecting clue and solution pairs. The                  that board games are meant just for fun; they are
   core of the proposed solution is a set of freely               designed for the average player, whereas those
   available corpora and lexical resources built                  taken from the TV game are more difficult to solve
   by the authors, which are used to find poten-                  because they are intended to challenge the contes-
   tial solutions by computing mutual informa-                    tants of the show who try to win a money prize.
   tion.                                                          Therefore, TV games generally have very specific
                                                                  clues and require more extensive knowledge about
System by Luca Squadrone - In (Squadrone,
                                                                  world facts and particular topics to find the so-
    2018), the author proposed an algorithm
                                                                  lution than the average player has. As a conse-
    based on two steps. In the first one, for each
                                                                  quence, the UNIOR4NLP solution based on spe-
    clue of a game, a list of relevant keywords
                                                                  cific multiword expressions extracted from several
    is retrieved from linguistic corpora, so
                                                                  knowledge sources shows a more balanced perfor-
    that each clue is associated with keywords
                                                                  mance than the other system.
    representing the concepts having a relation
                                                                     However, despite the UNIOR4NLP system ob-
    with that clue. Then, words at the inter-
                                                                  tained remarkable results, very difficult games, re-
    section of the retrieved sets are considered
                                                                  quiring some kind of inference, are missed. For
    as candidate solutions. In the second step,
                                                                  example, for the following clues: uno, notte, la
    another knowledge source made of proverbs,
                                                                  trippa, auto, palazzo2 , the solution is portiere
    book and movie titles, word definitions, is
                                                                  (porter). In order to solve that game, two difficult
    exploited to count co-occurrences of clues
                                                                  inferences are needed:
    and candidate solutions.
                                                                     • uno is the number generally assigned to the
4    Results
                                                                       role of the goolkeeper (portiere) in football
                                                                       teams;
           Table 1: System results.
 System         MRR MRR (std)                       Solved           • “La Trippa” is the surname of “Antonio La
 UNIOR4NLP 0.6428 0.6428                            81.90%             Trippa”, a character of the Italian movie “Gli
 Squadrone      0.0134 0.0350                       25.71%             onorevoli”, whose job is the porter (portiere)
                                                                       of a building.
   Results of the evaluation in terms of M RR are                    We hope that in a further edition of this task par-
reported in Table 1. The best performance is ob-                  ticipants will take into account these kind of games
tained by the UNIOR4NLP team. They reached a                      in which the simple co-occurrence of words it is
    1
      We must underline that patterns are extracted from a set    not enough for solving the game. This is the most
of 100 games collected by authors. This is in contrast with the
                                                                     2
task guidelines; however, the games are not used for training          In English: one, night, “la trippa” (it was intended as a
the system.                                                       surname in this case), car, building
                               Table 2: System results for TV and boardgame
             System            MRR (TV) Solved (TV) MRR (board) Solved (board)
             UNIOR4NLP         0.6528        86.36%          0.6001         71.79%
             Squadrone         0.0068        25.75%          0.0245         25.64%


challenging aspect of this game. In order to com-         References
pare system performance by taking into account            Pierpaolo Basile, Marco de Gemmis, Pasquale Lops,
the different levels of difficulty of the games, we          and Giovanni Semeraro. 2016. Solving a complex
plan to annotate guillottines with this information          language game by using knowledge-based word as-
provided by human players. A deeper analysis of              sociations discovery. IEEE Transactions on Compu-
                                                             tational Intelligence and AI in Games, 8(1):13–26.
the results obtained by each system is provided in
the corresponding technical reports (Sangati et al.,      Tommaso Caselli, Nicole Novielli, Viviana Patti, and
2018; Squadrone, 2018).                                     Paolo Rosso. 2018. Evalita 2018: Overview of
                                                            the 6th evaluation campaign of natural language
   Finally, by looking at the statistics about the          processing and speech tools for italian. In Tom-
participation (12 registered teams, but only 2 of           maso Caselli, Nicole Novielli, Viviana Patti, and
                                                            Paolo Rosso, editors, Proceedings of Sixth Evalua-
them submitted the results), we conclude that the           tion Campaign of Natural Language Processing and
task is attractive but perhaps it is too hard to solve.     Speech Tools for Italian. Final Workshop (EVALITA
For further task editions, we plan to support the           2018), Turin, Italy. CEUR.org.
participants by providing pre-processed textual re-       Marco Ernandes, Giovanni Angelini, and Marco Gori.
sources useful for solving the task.                       2008. A web-based agent challenges human experts
                                                           on crosswords. AI Magazine, 29(1):77.
                                                          David Ferrucci, Eric Brown, Jennifer Chu-Carroll,
                                                            James Fan, David Gondek, Aditya A Kalyanpur,
5   Conclusions                                             Adam Lally, J William Murdock, Eric Nyberg, John
                                                            Prager, et al. 2010. Building watson: An overview
                                                            of the deepqa project. AI magazine, 31(3):59–79.
Language games draw their challenge and excite-
                                                          Michael L Littman, Greg A Keim, and Noam Shazeer.
ment from the richness and ambiguity of natural
                                                            2002. A probabilistic approach to solving crossword
language. This type of games are inconsistent with          puzzles. Artificial Intelligence, 134(1-2):23–55.
the closed world assumption: no fixed sets of rules
are sufficient to define the game play. The pro-          Piero Molino, Pasquale Lops, Giovanni Semeraro,
                                                             Marco de Gemmis, and Pierpaolo Basile. 2015.
posed task consisted in building an artificial player        Playing with knowledge: A virtual player for who
for a challenging language game which requires               wants to be a millionaire? that leverages ques-
from the player a strong linguistic and cultural             tion answering techniques. Artificial Intelligence,
background. The systems participating in the task            222:157–181.
were designed according to this idea: solving the         Federico Sangati, Antonio Pascucci, and Johanna
game strongly depends on the background knowl-              Monti. 2018. Exploiting Multiword Expressions
edge of the system. On the other hand, the results          to solve “La Ghigliottina”. In Tommaso Caselli,
                                                            Nicole Novielli, Viviana Patti, and Paolo Rosso, ed-
demonstrated that filling in the system with a solid        itors, Proceedings of the 6th evaluation campaign of
background knowledge is not enough to find the              Natural Language Processing and Speech tools for
solution, but strong NLP algorithms are required            Italian (EVALITA’18), Turin, Italy. CEUR.org.
to discover hidden correlation among words. In            Giovanni Semeraro, Pasquale Lops, Pierpaolo Basile,
fact, only the system based on specific linguistic          and Marco de Gemmis. 2009. On the Tip of My
patterns and multiword expressions was able to              Thought: Playing the Guillotine Game. In Craig
achieve high performance. Moreover, some games              Boutilier, editor, IJCAI 2009, Proceedings of the
                                                            21st International Joint Conference on Artificial In-
required a non-trivial inference step. For this kind        telligence, Pasadena, California, USA, July 11-17,
of games, systems must be equipped with deeper              2009, pages 1543–1548. Morgan Kaufmann.
reasoning capabilities. We hope that in further edi-
                                                          Giovanni Semeraro, Marco de Gemmis, Pasquale
tions of the task, participants will propose solu-          Lops, and Pierpaolo Basile. 2012. An artificial
tions that deal with this issue.                            player for a language game. IEEE Intelligent Sys-
                                                            tems, 27(5):36–43.
Luca Squadrone. 2018. Computer challenges guillo-
  tine: how an artificial player can solve a complex
  language TV game with web data analysis. In Tom-
  maso Caselli, Nicole Novielli, Viviana Patti, and
  Paolo Rosso, editors, Proceedings of the 6th evalua-
  tion campaign of Natural Language Processing and
  Speech tools for Italian (EVALITA’18), Turin, Italy.
  CEUR.org.