Overview of the EVALITA 2018 Solving language games (NLP4FUN) Task Pierpaolo Basile and Marco de Gemmis and Lucia Siciliani and Giovanni Semeraro Department of Computer Science Via E. Orabona, 4 - 70125 Bari University of Bari Aldo Moro {firstname.lastname}@uniba.it Abstract language, and therefore have attracted the atten- English. This paper describes the first tion of researchers in the fields of Artificial Intel- edition of the “Solving language games” ligence and Natural Language Processing. For in- (NLP4FUN) task at the EVALITA 2018 stance, IBM Watson is a system which success- campaign. The task consists in design- fully challenged human champions of Jeopardy!, ing an artificial player for “The Guillo- a game in which contestants are presented with tine” (La Ghigliottina, in Italian), a chal- clues in the form of answers, and must phrase their lenging language game which demands responses in the form of a question (Ferrucci et knowledge covering a broad range of top- al., 2010; Molino et al., 2015). Another popular ics. The game consists in finding a word language game is solving crossword puzzles. The which is semantically correlated with a first experience reported in the literature is Proverb set of 5 words called clues. Artificial (Littman et al., 2002), that exploits large libraries players for that game can take advantage of clues and solutions to past crossword puzzles. from the availability of open repositories WebCrow is the first solver for Italian crosswords on the web, such as Wikipedia, that pro- (Ernandes et al., 2008). vide the system with the cultural and lin- The proposed task consists in designing a solver guistic background needed to find the so- for “The Guillotine” (La Ghigliottina, in Italian) lution. game. It is inspired by the final game of an Italian Italiano. Questo lavoro descrive la TV show called “L’eredità”. The game, broadcast prima edizione del task “Solving lan- by Italian National TV, involves a single player, guage games” (NLP4FUN) task, pro- who is given a set of five words - the clues - each posto durante la campagna di valutazione linked in some way to a specific word that rep- EVALITA 2018. Il task consiste nella resents the unique solution of the game. Words realizzazione di un giocatore artificiale are unrelated to each other, but each of them has per “La Gigliottina”, un gioco linguistico a hidden association with the solution. Once the molto sfidante, la cui soluzione richiede clues are given, the player has one minute to find conoscenze in svariati campi. Il gioco the solution. For example, given the five clues: consiste nel trovare una parola il cui sig- sin, Newton, doctor, New York, bad, the solution nificato è correlato a quello di un insieme is apple, because: the apple is the symbol of orig- di 5 parole, chiamate indizi. Un gioca- inal sin in Christian theology; Newton discovered tore artificiale per questo task potrebbe the gravity by means of an apple; “an apple a day sfruttare diverse sorgenti di conoscenza keeps the doctor away” is a famous proverb; New disponibili online, come Wikipedia, che York city is also called “the big apple”; and “one forniscano al sistema le conoscenze lin- bad apple can spoil the whole bunch” is a popu- guistiche e culturali necessarie per ar- lar phrase which figuratively means that the per- rivare alla soluzione. son doing wrong can have a negative influence on those around him. “La Ghigliottina” is a chal- lenging language game which demands knowl- 1 Motivation edge covering a broad range of topics. Artificial Language games draw their challenge and excite- players for that game can take advantage from the ment from the richness and ambiguity of natural availability of open repositories on the web, such as Wikipedia, that provide the system with the cul- cane tural and linguistic background needed to under- musica stand clues (Basile et al., 2016; Semeraro et al., casa 2009; Semeraro et al., 2012). pietra The task is part of EVALITA 2018, the pe- chiesa riodic evaluation campaign of Natural Language TV Processing (NLP) and speech tools for the Italian language (Caselli et al., 2018). ... The paper is organized as follows: Section 2 reports details about the task, the dataset and the The XML file consists of a root element games evaluation protocol, while Section 3 describes the which contains several game elements. Each game systems participating in the task, and Section 4 has five clue elements and one solution. Moreover, shows results. the element type specifies the type of the game: TV or boardgame. 2 Task Description: Dataset, Evaluation The ranked list of solutions must be provided in Protocol and Measures a single plain text file, according to the following An instance of the game consists of a set of 5 clue format: words and 1 word given as the official solution for id solution score rank time that instance. We provided: Values were separated by a whitespace charac- • a training set for the system development, ter; time taken by the system to compute the list containing 315 instances of the game; was also reported in milliseconds. An example of a ranked list of solutions is reported below: • a test set for the evaluation, containing 105 instances of the game. 3fc953bd-... porta 0.978 1 3459 3fc953bd-... chiesa 0.932 2 3251 In order to measure the performance of the par- 3fc953bd-... santo 0.897 3 4321 ticipants on games having different levels of diffi- ... culty, we provided instances taken both from the 3fc953bd-... carta 0.321 100 2343 TV game and from the official board game. In the ... training set, 204 instances (64.8%) came from the TV game, 111 (35.2%) from the board game. In 2.2 Evaluation the test set, 66 instances (62.9%) were collected As evaluation measure, we adopt a weighted ver- from the TV game, 39 (37.1%) from the board sion of Mean Reciprocal Rank (MRR). Since time game. In order to discourage participants from is a critical factor in this game, the Reciprocal cheating (e.g. finding the solution manually), in Rank is weighted by a function which lowers the the test set we included 300 fake games automat- score based on the time taken by the computation. ically created by us. Obviously, fake games were In fact, in the TV game, the player has only one not taken into account in the evaluation. minute to provide the solution. Taking into ac- Any knowledge resource can be used to build count these factors, the evaluation measure was: an artificial player, except further instances of the game. For each instance of the game, a ranked 1 X 1 1 1 list of maximum 100 tentative solutions must be max( , ) (1) |G| g∈G rg tg 10 provided. where G is the set of games and rg is the rank 2.1 Data Format of the solution, while tg denotes the minutes taken Both development and test set were provided in by the system to give the tentative solutions. Sys- XML format: tems that took more than 10 minutes are equally penalized. The evaluation was performed only on the 105 3fc953bd... test games, for which we knew the correct solution uomo (results provided for fake games were excluded). We provided a separate ranking for TV and remarkable performance: MRR is very high, thus boardgame, but the final ranking was computed on showing that the system is able to place the solu- the the whole test set. tion in the first positions of the ranking. We report, also, the standard MRR (M RR(std)) computed 3 Systems without taking into account the time. We notice Twelve teams registered in the task, but only two that for UNIOR4NLP the value is equal to M RR: of them actually submitted the results for the eval- the system is able to provide the solution always in uation. A short description of each system fol- the first minute, while the Squadrone system takes lows: more time for solving games. Table 2 reports the results by game type (66 in- UNIOR4FUN - The system described in (San- stances from the TV game and 39 instances from gati et al., 2018) is based on the idea that the boardgame). UNIOR4NLP shows similar re- clue words and corresponding solution are sults for both the game types, while the system often part of a multiword expression. There- proposed by Squadrone performs better on board fore, the system exploits six linguistic pat- games. terns1 that identify valid multiword expres- One possible explanation for this difference is sions connecting clue and solution pairs. The that board games are meant just for fun; they are core of the proposed solution is a set of freely designed for the average player, whereas those available corpora and lexical resources built taken from the TV game are more difficult to solve by the authors, which are used to find poten- because they are intended to challenge the contes- tial solutions by computing mutual informa- tants of the show who try to win a money prize. tion. Therefore, TV games generally have very specific clues and require more extensive knowledge about System by Luca Squadrone - In (Squadrone, world facts and particular topics to find the so- 2018), the author proposed an algorithm lution than the average player has. As a conse- based on two steps. In the first one, for each quence, the UNIOR4NLP solution based on spe- clue of a game, a list of relevant keywords cific multiword expressions extracted from several is retrieved from linguistic corpora, so knowledge sources shows a more balanced perfor- that each clue is associated with keywords mance than the other system. representing the concepts having a relation However, despite the UNIOR4NLP system ob- with that clue. Then, words at the inter- tained remarkable results, very difficult games, re- section of the retrieved sets are considered quiring some kind of inference, are missed. For as candidate solutions. In the second step, example, for the following clues: uno, notte, la another knowledge source made of proverbs, trippa, auto, palazzo2 , the solution is portiere book and movie titles, word definitions, is (porter). In order to solve that game, two difficult exploited to count co-occurrences of clues inferences are needed: and candidate solutions. • uno is the number generally assigned to the 4 Results role of the goolkeeper (portiere) in football teams; Table 1: System results. System MRR MRR (std) Solved • “La Trippa” is the surname of “Antonio La UNIOR4NLP 0.6428 0.6428 81.90% Trippa”, a character of the Italian movie “Gli Squadrone 0.0134 0.0350 25.71% onorevoli”, whose job is the porter (portiere) of a building. Results of the evaluation in terms of M RR are We hope that in a further edition of this task par- reported in Table 1. The best performance is ob- ticipants will take into account these kind of games tained by the UNIOR4NLP team. They reached a in which the simple co-occurrence of words it is 1 We must underline that patterns are extracted from a set not enough for solving the game. This is the most of 100 games collected by authors. This is in contrast with the 2 task guidelines; however, the games are not used for training In English: one, night, “la trippa” (it was intended as a the system. surname in this case), car, building Table 2: System results for TV and boardgame System MRR (TV) Solved (TV) MRR (board) Solved (board) UNIOR4NLP 0.6528 86.36% 0.6001 71.79% Squadrone 0.0068 25.75% 0.0245 25.64% challenging aspect of this game. In order to com- References pare system performance by taking into account Pierpaolo Basile, Marco de Gemmis, Pasquale Lops, the different levels of difficulty of the games, we and Giovanni Semeraro. 2016. Solving a complex plan to annotate guillottines with this information language game by using knowledge-based word as- provided by human players. A deeper analysis of sociations discovery. IEEE Transactions on Compu- tational Intelligence and AI in Games, 8(1):13–26. the results obtained by each system is provided in the corresponding technical reports (Sangati et al., Tommaso Caselli, Nicole Novielli, Viviana Patti, and 2018; Squadrone, 2018). Paolo Rosso. 2018. Evalita 2018: Overview of the 6th evaluation campaign of natural language Finally, by looking at the statistics about the processing and speech tools for italian. In Tom- participation (12 registered teams, but only 2 of maso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of Sixth Evalua- them submitted the results), we conclude that the tion Campaign of Natural Language Processing and task is attractive but perhaps it is too hard to solve. Speech Tools for Italian. Final Workshop (EVALITA For further task editions, we plan to support the 2018), Turin, Italy. CEUR.org. participants by providing pre-processed textual re- Marco Ernandes, Giovanni Angelini, and Marco Gori. sources useful for solving the task. 2008. A web-based agent challenges human experts on crosswords. AI Magazine, 29(1):77. David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, 5 Conclusions Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. 2010. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79. Language games draw their challenge and excite- Michael L Littman, Greg A Keim, and Noam Shazeer. ment from the richness and ambiguity of natural 2002. A probabilistic approach to solving crossword language. This type of games are inconsistent with puzzles. Artificial Intelligence, 134(1-2):23–55. the closed world assumption: no fixed sets of rules are sufficient to define the game play. The pro- Piero Molino, Pasquale Lops, Giovanni Semeraro, Marco de Gemmis, and Pierpaolo Basile. 2015. posed task consisted in building an artificial player Playing with knowledge: A virtual player for who for a challenging language game which requires wants to be a millionaire? that leverages ques- from the player a strong linguistic and cultural tion answering techniques. Artificial Intelligence, background. The systems participating in the task 222:157–181. were designed according to this idea: solving the Federico Sangati, Antonio Pascucci, and Johanna game strongly depends on the background knowl- Monti. 2018. Exploiting Multiword Expressions edge of the system. On the other hand, the results to solve “La Ghigliottina”. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, ed- demonstrated that filling in the system with a solid itors, Proceedings of the 6th evaluation campaign of background knowledge is not enough to find the Natural Language Processing and Speech tools for solution, but strong NLP algorithms are required Italian (EVALITA’18), Turin, Italy. CEUR.org. to discover hidden correlation among words. In Giovanni Semeraro, Pasquale Lops, Pierpaolo Basile, fact, only the system based on specific linguistic and Marco de Gemmis. 2009. On the Tip of My patterns and multiword expressions was able to Thought: Playing the Guillotine Game. In Craig achieve high performance. Moreover, some games Boutilier, editor, IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial In- required a non-trivial inference step. For this kind telligence, Pasadena, California, USA, July 11-17, of games, systems must be equipped with deeper 2009, pages 1543–1548. Morgan Kaufmann. reasoning capabilities. We hope that in further edi- Giovanni Semeraro, Marco de Gemmis, Pasquale tions of the task, participants will propose solu- Lops, and Pierpaolo Basile. 2012. An artificial tions that deal with this issue. player for a language game. IEEE Intelligent Sys- tems, 27(5):36–43. Luca Squadrone. 2018. Computer challenges guillo- tine: how an artificial player can solve a complex language TV game with web data analysis. In Tom- maso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evalua- tion campaign of Natural Language Processing and Speech tools for Italian (EVALITA’18), Turin, Italy. CEUR.org.