=Paper=
{{Paper
|id=Vol-2263/paper045
|storemode=property
|title=Computer Challenges Guillotine: How an Artificial Player Can Solve a Complex Language TV game with Web Data Analysis
|pdfUrl=https://ceur-ws.org/Vol-2263/paper045.pdf
|volume=Vol-2263
|authors=Luca Squadrone
|dblpUrl=https://dblp.org/rec/conf/evalita/Squadrone18
}}
==Computer Challenges Guillotine: How an Artificial Player Can Solve a Complex Language TV game with Web Data Analysis==
Luca Squadrone
University of Rome Tor Vergata
Rome, Italy
luca.squadrone@yahoo.it
Abstract

English. This paper describes my attempt to build an artificial player for a very popular language game, called "The Guillotine", within the Evalita Challenge (Basile et al., 2018). I built this artificial player to investigate how far we can go using only resources available on the web and a simple matching algorithm. The resources used are Morph-it (Zanchetta and Baroni, 2005) and other online resources. The resolution algorithm is based on two steps: in the first step, it queries the knowledge base Morph-it with the five clues, downloads the results and performs various intersection operations between the five data sets; in the second step, it refines the results through other sources, such as the database of Italian proverbs and the IMDb. My artificial player identified the solution among the first 100 solutions proposed in 25% of cases. This is still far from systems like OTTHO (Semeraro et al., 2012), which obtained the solution in 68% of cases. However, their result was obtained with larger resources, and not only with a simple web analysis.

1 System description

I have used Morph-it (Zanchetta and Baroni, 2005) as my knowledge base, instead of building one, as it is free, easy to query and, above all, suitable for our purpose. Furthermore, it should not be underestimated that building a knowledge base from scratch involves an enormous amount of time.

After querying the knowledge base with the five clues, the results are downloaded and various intersection operations are performed between the five data sets. This procedure allows us to find all possible solutions and is the basis for choosing the solution.

Then, to find the final solution, the system verifies the existence of proverbs, aphorisms, movie or book titles, etc., that contain both the clue and the possible solution.

2 The memory of the system: Morph-it!

Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. It was designed by Marco Baroni and Eros Zanchetta. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer/morphological analyzer/morphological generator.

The main source of linguistic data was the "Repubblica" corpus (approximately 380 million tokens), from which lemmas were extracted and morphological information not present in the original corpus (e.g. gender) was inferred using distributional as well as morphological cues. With that information, inflected forms were then generated for all extracted lemmas.

Morph-it includes several corpora. A corpus (plural corpora), or text corpus, is a large and structured set of texts. To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of corpus annotation is part-of-speech tagging, or POS tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the text.

Since only some of these corpora are publicly available, I have chosen:

• La Repubblica, a corpus of Italian newspaper texts published between 1985 and 2000 (approximately 380M tokens);

• ItWac Complete, a corpus of Italian web pages crawled from Italian university websites.

These two corpora form our knowledge base and are combined to provide the widest and most comprehensive knowledge base possible.

Then, the following resources will be used to match the results found:

• Proverbs and aphorisms, from the Italian version of Wikiquote and from the database of Italian proverbs by the "Accademia della Crusca";

• A books database, crawled from the IBS web site;

• Film titles, crawled from the Internet Movie Database;

• Word definitions, from the HOEPLI Dictionary of the Italian language.

3 The algorithm

The basic idea of the algorithm¹ is to derive a set of words from Morph-it!, using only the 5 clues given as input. This will be the set of possible solutions.

Then the probability of each of these words being the actual solution is evaluated. This is done in a second phase, which verifies the existence of proverbs, aphorisms, movie or book titles, etc., that contain both the clue and the possible solution. The words with the most associations found are the solutions.

With the term "possible solutions" I will indicate a selection of words in which the solution is contained, while with the term "final solutions" I will indicate the 100 words chosen among all the possible solutions.

I would also like to point out that the execution speed of the algorithm depends on the speed of the internet connection, since I use an online knowledge base and different data must be downloaded each time.

Here is the pseudo code of the first part of the algorithm, which finds the possible solutions. It is illustrated in Figure 1:

1. The algorithm takes the 5 clues as input.

2. For each clue it executes two queries, one in each corpus: Repubblica and itWac Complete.

3. After downloading, the results of the queries are concatenated as text. At the end of this procedure, for each clue, the algorithm will have generated a single text containing all the concepts related to that clue.

4. Each of the five texts is transformed into a set of single words, yielding 5 sets of words, one for each clue.

5. The intersection of these sets (which I will call the "Final Set") is computed. The solution, semantically linked to all five clues, will in most cases be contained in the final set. This hypothesis will be verified later, in the next section.

¹ Source code: https://gitlab.com/osiast/computer-challenges-guillotine
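The five steps above can be sketched as follows. This is a minimal sketch, assuming the per-clue corpus queries (Repubblica, itWac Complete) have already been downloaded and concatenated into one string per clue; the names used here are illustrative, not from the paper's source code:

```python
# Sketch of phase one (steps 1-5 above): each clue's downloaded query
# results are treated as one text; the texts are split into word sets
# and intersected to obtain the "Final Set" of possible solutions.

def possible_solutions(texts):
    """Intersect the word sets derived from the five per-clue texts."""
    word_sets = [set(text.lower().split()) for text in texts]
    return set.intersection(*word_sets)  # the "Final Set"
```

In the full system, a stop-word list is then subtracted from this set, as described below.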
Figure 1: Process for finding possible solutions for a run of the game

6. Among the possible solutions there are many insignificant words, such as articles, conjunctions and prepositions. To delete them, I subtracted a set of stop words from the final set. Stop words are words which are filtered out before or after the processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. So, I built a special list of stop words specifically for this purpose, based on two online resources:

• A collection of stopwords on GitHub²

• A collection of stopwords from the Snowball site³

After the stopwords have been removed from the final set, I finally have the set of possible solutions.

The next step is to verify the existence of proverbs, aphorisms, movies or books (etc.) that contain both one clue and the possible solution. As already mentioned, the words with the most associations found are the final solutions.

So, along with each possible solution, I search for the clues within the following repositories:

• Proverbs and aphorisms, from two different resources: the Italian version of Wikiquote and the database of Italian proverbs by the "Accademia della Crusca";

• A books database, crawled from the IBS web site;

• Film titles, crawled from the Internet Movie Database;

• Word definitions, from the HOEPLI Dictionary of the Italian language.

Whenever the clue and the solution are found together, for example in the same proverb or film title, an additional weight of 0.2 is assigned to that solution. The weight can vary from 0 to 1; it indicates the probability that this is the solution of the game.

Here is the pseudo code of this second part of the algorithm:

1. For each of the 5 clues, download the proverbs, film and book titles, vocabulary definitions and aphorisms containing that clue, and put them together in one text. At this point we have 5 texts.

2. To all the possible solutions, assign the value 0 as weight.

3. Whenever one of the possible solutions is found in one of these texts, its weight increases by 0.2.

4. Finally, the first 100 solutions with the largest weight are taken, in descending order.

4 Test and Results

Testing the algorithm is a key step in the validation process of the proposed solution.

In the first test below, I run the algorithm on 315 instances of the game in order to evaluate its efficiency and study the results obtained. As already mentioned, each game is composed of five clues and a solution.

The second test concerns the "knowledge base": I measure how many clues, on average, it covers. This will be useful to indicate an upper bound on the efficiency that the algorithm cannot overstep.

4.1 Test the algorithm

To evaluate the efficiency of the algorithm, I tried it on 315 different games.⁴

² github.com/stopwords-iso/stopwords-it/blob/master/stopwords-it.txt
³ snowball.tartarus.org/algorithms/italian/stop.txt
⁴ The games used can be downloaded from this link: https://goo.gl/6FpK3p
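This evaluation loop can be sketched as follows; `solve_game` is a hypothetical stand-in for the full two-phase algorithm described above, and the game representation is an assumption for illustration:

```python
# Sketch of the evaluation over a list of downloaded games. Each game is
# a (clues, solution) pair. solve_game is a hypothetical stand-in for
# the full two-phase algorithm; it returns the ranked list of (at most
# 100) final solutions for the given clues.

def evaluate(games, solve_game):
    """Fraction of games whose solution appears among the final solutions."""
    solved = sum(1 for clues, solution in games
                 if solution in solve_game(clues))
    return solved / len(games)
```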
Results       Number   Percentage
Successful    242      76.83%
Fail          73       23.17%
Total games   315

Table 1: Results on 315 games

         MRR      Returned game   Test game   Solved
This     0.0134   400             105         27
Other    0.6428   405             105         86

Table 2: Comparison between systems

The query is limited to 8,000 lines per request, since I have observed statistically that this is a good compromise between resolution speed and the efficiency of the algorithm.

I downloaded the set of games and coded them into a list. Then I ran the algorithm on each game in the list and recorded the results.

The data has been processed to create Table 1. In over 76% of cases, the exact solution is found among the "possible solutions". This supports the hypothesis that "the solution, semantically linked to all five clues, in most cases will be contained in the final set".⁵

Regarding the final solutions, in 24.76% of cases the solution is among the first 100. I got a similar result with the test set, where I reached 25.7%.

Table 2 shows the distribution of the scores for the correct and missed solutions of our system on the full set of games in the test set, in comparison with another system.⁶
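Since the comparison in Table 2 rests on MRR (Mean Reciprocal Rank), here is the standard computation, shown on invented toy data rather than the paper's games:

```python
# Mean Reciprocal Rank: for each game, take 1/rank of the correct
# solution in the returned ranked list (0 if absent), then average over
# all games. A low MRR despite many solved games means the solution is
# found but ranked far from the top of the 100 proposed solutions.

def mean_reciprocal_rank(ranked_lists, solutions):
    total = 0.0
    for ranked, solution in zip(ranked_lists, solutions):
        if solution in ranked:
            total += 1.0 / (ranked.index(solution) + 1)  # ranks are 1-based
    return total / len(solutions)
```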
4.2 Test the knowledge base: does it always contain the solution?

To verify that the knowledge base is suitable for our purpose and that it always contains the solution, I performed tests on 1,575 clues, that is, all the ones I had.

In particular, I wanted to know whether the knowledge base contained a relationship between the solution and each of them. The basic idea was to look for each clue inside the corpora and then filter the results. For this task I used the NoSketch Engine, an open-source tool for performing corpus searches.

As already mentioned, I tested the 1,575 clues of the games, looking for them (one at a time) inside the corpora "Repubblica" and "ItWac Complete". The results were very positive: out of 1,575 clues, 1,519 were connected by one or more correspondences to the solution.

We can see that out of 315 games:

• 18% (56 of them) cannot be resolved due to the absence of 1 or more clues in the chosen knowledge base;

• only 5% (17 of them) cannot be resolved due to the limit set on the algorithm's queries.

This means that, without limiting the queries, I would find the solution at most 82% of the time. So this result shows that the limit of 8,000 lines per query penalized the efficiency of the algorithm by only 5%. This confirms that limiting the query to 8,000 rows is a good compromise.

5 Analysis of the results and future work

The algorithm's execution, both with the training set and with the test set, produced similar results. From the data obtained, we can notice that the percentage of solutions found in the first phase of the algorithm decreases in the second phase.

Furthermore, the value of the MRR is very low despite the solution being found 27 times out of 105.

The reason for these results is that the number of resources in the matching phase is limited. So the solution, despite being found, is often not among the first of the 100 proposed solutions in the output.

As future developments, we could improve the step of the algorithm that picks the solution from the possible solutions by increasing its resources, to provide more accurate results.

⁵ The full results can be viewed at this link: https://goo.gl/BdCee9
⁶ MRR (Mean Reciprocal Rank)

References

Pierpaolo Basile, Marco de Gemmis, Lucia Siciliani, and Giovanni Semeraro. 2018. Overview of the EVALITA 2018 Solving Language Games (NLP4FUN) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Giovanni Semeraro, Pasquale Lops, Marco de Gemmis, and Pierpaolo Basile. 2012. OTTHO: an artificial player for a complex language game. In Popularize Artificial Intelligence, Proceedings of the AI*IA Workshop and Prize for Celebrating 100th Anniversary of Alan Turing's Birth, Rome, Italy, June 15, 2012, pages 47–53.
Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).