=Paper=
{{Paper
|id=Vol-2263/paper011
|storemode=property
|title=Overview of the EVALITA 2018 Solving Language Games (NLP4FUN) Task
|pdfUrl=https://ceur-ws.org/Vol-2263/paper011.pdf
|volume=Vol-2263
|authors=Pierpaolo Basile,Marco de Gemmis,Lucia Siciliani,Giovanni Semeraro
|dblpUrl=https://dblp.org/rec/conf/evalita/BasileGSS18
}}
==Overview of the EVALITA 2018 Solving Language Games (NLP4FUN) Task==
Overview of the EVALITA 2018 Solving language games (NLP4FUN) Task
Pierpaolo Basile and Marco de Gemmis and Lucia Siciliani and Giovanni Semeraro
Department of Computer Science
Via E. Orabona, 4 - 70125 Bari
University of Bari Aldo Moro
{firstname.lastname}@uniba.it
Abstract language, and therefore have attracted the atten-
English. This paper describes the first tion of researchers in the fields of Artificial Intel-
edition of the “Solving language games” ligence and Natural Language Processing. For in-
(NLP4FUN) task at the EVALITA 2018 stance, IBM Watson is a system which success-
campaign. The task consists in design- fully challenged human champions of Jeopardy!,
ing an artificial player for “The Guillo- a game in which contestants are presented with
tine” (La Ghigliottina, in Italian), a chal- clues in the form of answers, and must phrase their
lenging language game which demands responses in the form of a question (Ferrucci et
knowledge covering a broad range of top- al., 2010; Molino et al., 2015). Another popular
ics. The game consists in finding a word language game is solving crossword puzzles. The
which is semantically correlated with a first experience reported in the literature is Proverb
set of 5 words called clues. Artificial (Littman et al., 2002), that exploits large libraries
players for that game can take advantage of clues and solutions to past crossword puzzles.
from the availability of open repositories WebCrow is the first solver for Italian crosswords
on the web, such as Wikipedia, that pro- (Ernandes et al., 2008).
vide the system with the cultural and lin- The proposed task consists in designing a solver
guistic background needed to find the so- for “The Guillotine” (La Ghigliottina, in Italian)
lution. game. It is inspired by the final game of an Italian
Italiano. Questo lavoro descrive la TV show called “L’eredità”. The game, broadcast
prima edizione del task “Solving lan- by Italian National TV, involves a single player,
guage games” (NLP4FUN) task, pro- who is given a set of five words - the clues - each
posto durante la campagna di valutazione linked in some way to a specific word that rep-
EVALITA 2018. Il task consiste nella resents the unique solution of the game. Words
realizzazione di un giocatore artificiale are unrelated to each other, but each of them has
per “La Gigliottina”, un gioco linguistico a hidden association with the solution. Once the
molto sfidante, la cui soluzione richiede clues are given, the player has one minute to find
conoscenze in svariati campi. Il gioco the solution. For example, given the five clues:
consiste nel trovare una parola il cui sig- sin, Newton, doctor, New York, bad, the solution
nificato è correlato a quello di un insieme is apple, because: the apple is the symbol of orig-
di 5 parole, chiamate indizi. Un gioca- inal sin in Christian theology; Newton discovered
tore artificiale per questo task potrebbe the gravity by means of an apple; “an apple a day
sfruttare diverse sorgenti di conoscenza keeps the doctor away” is a famous proverb; New
disponibili online, come Wikipedia, che York city is also called “the big apple”; and “one
forniscano al sistema le conoscenze lin- bad apple can spoil the whole bunch” is a popu-
guistiche e culturali necessarie per ar- lar phrase which figuratively means that the per-
rivare alla soluzione. son doing wrong can have a negative influence on
those around him. “La Ghigliottina” is a chal-
lenging language game which demands knowl-
1 Motivation
edge covering a broad range of topics. Artificial
Language games draw their challenge and excite- players for that game can take advantage from the
ment from the richness and ambiguity of natural availability of open repositories on the web, such
as Wikipedia, that provide the system with the cul- cane
tural and linguistic background needed to under- musica
stand clues (Basile et al., 2016; Semeraro et al., casa
2009; Semeraro et al., 2012). pietra
The task is part of EVALITA 2018, the pe- chiesa
riodic evaluation campaign of Natural Language TV
Processing (NLP) and speech tools for the Italian
language (Caselli et al., 2018). ...
The paper is organized as follows: Section 2
reports details about the task, the dataset and the The XML file consists of a root element games
evaluation protocol, while Section 3 describes the which contains several game elements. Each game
systems participating in the task, and Section 4 has five clue elements and one solution. Moreover,
shows results. the element type specifies the type of the game: TV
or boardgame.
2 Task Description: Dataset, Evaluation
The ranked list of solutions must be provided in
Protocol and Measures
a single plain text file, according to the following
An instance of the game consists of a set of 5 clue format:
words and 1 word given as the official solution for
id solution score rank time
that instance. We provided:
Values were separated by a whitespace charac-
• a training set for the system development, ter; time taken by the system to compute the list
containing 315 instances of the game; was also reported in milliseconds. An example of
a ranked list of solutions is reported below:
• a test set for the evaluation, containing 105
instances of the game. 3fc953bd-... porta 0.978 1 3459
3fc953bd-... chiesa 0.932 2 3251
In order to measure the performance of the par-
3fc953bd-... santo 0.897 3 4321
ticipants on games having different levels of diffi-
...
culty, we provided instances taken both from the
3fc953bd-... carta 0.321 100 2343
TV game and from the official board game. In the
...
training set, 204 instances (64.8%) came from the
TV game, 111 (35.2%) from the board game. In 2.2 Evaluation
the test set, 66 instances (62.9%) were collected
As evaluation measure, we adopt a weighted ver-
from the TV game, 39 (37.1%) from the board
sion of Mean Reciprocal Rank (MRR). Since time
game. In order to discourage participants from
is a critical factor in this game, the Reciprocal
cheating (e.g. finding the solution manually), in
Rank is weighted by a function which lowers the
the test set we included 300 fake games automat-
score based on the time taken by the computation.
ically created by us. Obviously, fake games were
In fact, in the TV game, the player has only one
not taken into account in the evaluation.
minute to provide the solution. Taking into ac-
Any knowledge resource can be used to build
count these factors, the evaluation measure was:
an artificial player, except further instances of the
game. For each instance of the game, a ranked 1 X 1 1 1
list of maximum 100 tentative solutions must be max( , ) (1)
|G| g∈G rg tg 10
provided.
where G is the set of games and rg is the rank
2.1 Data Format
of the solution, while tg denotes the minutes taken
Both development and test set were provided in by the system to give the tentative solutions. Sys-
XML format: tems that took more than 10 minutes are equally
penalized.
The evaluation was performed only on the 105
3fc953bd... test games, for which we knew the correct solution
uomo (results provided for fake games were excluded).
We provided a separate ranking for TV and remarkable performance: MRR is very high, thus
boardgame, but the final ranking was computed on showing that the system is able to place the solu-
the the whole test set. tion in the first positions of the ranking. We report,
also, the standard MRR (M RR(std)) computed
3 Systems without taking into account the time. We notice
Twelve teams registered in the task, but only two that for UNIOR4NLP the value is equal to M RR:
of them actually submitted the results for the eval- the system is able to provide the solution always in
uation. A short description of each system fol- the first minute, while the Squadrone system takes
lows: more time for solving games.
Table 2 reports the results by game type (66 in-
UNIOR4FUN - The system described in (San- stances from the TV game and 39 instances from
gati et al., 2018) is based on the idea that the boardgame). UNIOR4NLP shows similar re-
clue words and corresponding solution are sults for both the game types, while the system
often part of a multiword expression. There- proposed by Squadrone performs better on board
fore, the system exploits six linguistic pat- games.
terns1 that identify valid multiword expres- One possible explanation for this difference is
sions connecting clue and solution pairs. The that board games are meant just for fun; they are
core of the proposed solution is a set of freely designed for the average player, whereas those
available corpora and lexical resources built taken from the TV game are more difficult to solve
by the authors, which are used to find poten- because they are intended to challenge the contes-
tial solutions by computing mutual informa- tants of the show who try to win a money prize.
tion. Therefore, TV games generally have very specific
clues and require more extensive knowledge about
System by Luca Squadrone - In (Squadrone,
world facts and particular topics to find the so-
2018), the author proposed an algorithm
lution than the average player has. As a conse-
based on two steps. In the first one, for each
quence, the UNIOR4NLP solution based on spe-
clue of a game, a list of relevant keywords
cific multiword expressions extracted from several
is retrieved from linguistic corpora, so
knowledge sources shows a more balanced perfor-
that each clue is associated with keywords
mance than the other system.
representing the concepts having a relation
However, despite the UNIOR4NLP system ob-
with that clue. Then, words at the inter-
tained remarkable results, very difficult games, re-
section of the retrieved sets are considered
quiring some kind of inference, are missed. For
as candidate solutions. In the second step,
example, for the following clues: uno, notte, la
another knowledge source made of proverbs,
trippa, auto, palazzo2 , the solution is portiere
book and movie titles, word definitions, is
(porter). In order to solve that game, two difficult
exploited to count co-occurrences of clues
inferences are needed:
and candidate solutions.
• uno is the number generally assigned to the
4 Results
role of the goolkeeper (portiere) in football
teams;
Table 1: System results.
System MRR MRR (std) Solved • “La Trippa” is the surname of “Antonio La
UNIOR4NLP 0.6428 0.6428 81.90% Trippa”, a character of the Italian movie “Gli
Squadrone 0.0134 0.0350 25.71% onorevoli”, whose job is the porter (portiere)
of a building.
Results of the evaluation in terms of M RR are We hope that in a further edition of this task par-
reported in Table 1. The best performance is ob- ticipants will take into account these kind of games
tained by the UNIOR4NLP team. They reached a in which the simple co-occurrence of words it is
1
We must underline that patterns are extracted from a set not enough for solving the game. This is the most
of 100 games collected by authors. This is in contrast with the
2
task guidelines; however, the games are not used for training In English: one, night, “la trippa” (it was intended as a
the system. surname in this case), car, building
Table 2: System results for TV and boardgame
System MRR (TV) Solved (TV) MRR (board) Solved (board)
UNIOR4NLP 0.6528 86.36% 0.6001 71.79%
Squadrone 0.0068 25.75% 0.0245 25.64%
challenging aspect of this game. In order to com- References
pare system performance by taking into account Pierpaolo Basile, Marco de Gemmis, Pasquale Lops,
the different levels of difficulty of the games, we and Giovanni Semeraro. 2016. Solving a complex
plan to annotate guillottines with this information language game by using knowledge-based word as-
provided by human players. A deeper analysis of sociations discovery. IEEE Transactions on Compu-
tational Intelligence and AI in Games, 8(1):13–26.
the results obtained by each system is provided in
the corresponding technical reports (Sangati et al., Tommaso Caselli, Nicole Novielli, Viviana Patti, and
2018; Squadrone, 2018). Paolo Rosso. 2018. Evalita 2018: Overview of
the 6th evaluation campaign of natural language
Finally, by looking at the statistics about the processing and speech tools for italian. In Tom-
participation (12 registered teams, but only 2 of maso Caselli, Nicole Novielli, Viviana Patti, and
Paolo Rosso, editors, Proceedings of Sixth Evalua-
them submitted the results), we conclude that the tion Campaign of Natural Language Processing and
task is attractive but perhaps it is too hard to solve. Speech Tools for Italian. Final Workshop (EVALITA
For further task editions, we plan to support the 2018), Turin, Italy. CEUR.org.
participants by providing pre-processed textual re- Marco Ernandes, Giovanni Angelini, and Marco Gori.
sources useful for solving the task. 2008. A web-based agent challenges human experts
on crosswords. AI Magazine, 29(1):77.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll,
James Fan, David Gondek, Aditya A Kalyanpur,
5 Conclusions Adam Lally, J William Murdock, Eric Nyberg, John
Prager, et al. 2010. Building watson: An overview
of the deepqa project. AI magazine, 31(3):59–79.
Language games draw their challenge and excite-
Michael L Littman, Greg A Keim, and Noam Shazeer.
ment from the richness and ambiguity of natural
2002. A probabilistic approach to solving crossword
language. This type of games are inconsistent with puzzles. Artificial Intelligence, 134(1-2):23–55.
the closed world assumption: no fixed sets of rules
are sufficient to define the game play. The pro- Piero Molino, Pasquale Lops, Giovanni Semeraro,
Marco de Gemmis, and Pierpaolo Basile. 2015.
posed task consisted in building an artificial player Playing with knowledge: A virtual player for who
for a challenging language game which requires wants to be a millionaire? that leverages ques-
from the player a strong linguistic and cultural tion answering techniques. Artificial Intelligence,
background. The systems participating in the task 222:157–181.
were designed according to this idea: solving the Federico Sangati, Antonio Pascucci, and Johanna
game strongly depends on the background knowl- Monti. 2018. Exploiting Multiword Expressions
edge of the system. On the other hand, the results to solve “La Ghigliottina”. In Tommaso Caselli,
Nicole Novielli, Viviana Patti, and Paolo Rosso, ed-
demonstrated that filling in the system with a solid itors, Proceedings of the 6th evaluation campaign of
background knowledge is not enough to find the Natural Language Processing and Speech tools for
solution, but strong NLP algorithms are required Italian (EVALITA’18), Turin, Italy. CEUR.org.
to discover hidden correlation among words. In Giovanni Semeraro, Pasquale Lops, Pierpaolo Basile,
fact, only the system based on specific linguistic and Marco de Gemmis. 2009. On the Tip of My
patterns and multiword expressions was able to Thought: Playing the Guillotine Game. In Craig
achieve high performance. Moreover, some games Boutilier, editor, IJCAI 2009, Proceedings of the
21st International Joint Conference on Artificial In-
required a non-trivial inference step. For this kind telligence, Pasadena, California, USA, July 11-17,
of games, systems must be equipped with deeper 2009, pages 1543–1548. Morgan Kaufmann.
reasoning capabilities. We hope that in further edi-
Giovanni Semeraro, Marco de Gemmis, Pasquale
tions of the task, participants will propose solu- Lops, and Pierpaolo Basile. 2012. An artificial
tions that deal with this issue. player for a language game. IEEE Intelligent Sys-
tems, 27(5):36–43.
Luca Squadrone. 2018. Computer challenges guillo-
tine: how an artificial player can solve a complex
language TV game with web data analysis. In Tom-
maso Caselli, Nicole Novielli, Viviana Patti, and
Paolo Rosso, editors, Proceedings of the 6th evalua-
tion campaign of Natural Language Processing and
Speech tools for Italian (EVALITA’18), Turin, Italy.
CEUR.org.