=Paper=
{{Paper
|id=Vol-2263/paper045
|storemode=property
|title=Computer Challenges Guillotine: How an Artificial Player Can Solve a Complex Language TV game with Web Data Analysis
|pdfUrl=https://ceur-ws.org/Vol-2263/paper045.pdf
|volume=Vol-2263
|authors=Luca Squadrone
|dblpUrl=https://dblp.org/rec/conf/evalita/Squadrone18
}}
==Computer Challenges Guillotine: How an Artificial Player Can Solve a Complex Language TV game with Web Data Analysis==
Luca Squadrone
University of Rome Tor Vergata
Rome, Italy
luca.squadrone@yahoo.it
Abstract

English. This paper describes my attempt to build an artificial player for a very popular language game, called "The Guillotine", within the Evalita Challenge (Basile et al., 2018). I built this artificial player to investigate how far we can go using only resources available on the web and a simple matching algorithm. The resources used are Morph-it (Zanchetta and Baroni, 2005) and other online resources. The resolution algorithm is based on two steps: in the first step, it queries the knowledge base Morph-it with the five clues, downloads the results and performs various intersection operations between the five data sets; in the second step, it refines the results through other sources, such as the database of Italian proverbs and the IMDb. My artificial player identified the solution among the first 100 solutions proposed in 25% of cases. This is still far from systems like OTTHO (Semeraro et al., 2012), which obtained the solution in 68% of cases. However, their result was obtained with larger resources, and not only with a simple web analysis.

1 System description

I have used Morph-it (Zanchetta and Baroni, 2005) as my knowledge base, instead of building one, as it is free, easy to query and, above all, suitable for our purpose. Furthermore, it should not be underestimated that building a knowledge base from scratch involves an enormous amount of time.

After querying the knowledge base with the five clues, the results are downloaded and various intersection operations are performed between the five data sets. This procedure allows us to find all possible solutions and is the basis for choosing the solution.

Then, to find the final solution, the system verifies the existence of proverbs, aphorisms, movie or book titles, etc., that contain both the clue and the possible solution.

2 The memory of the system: Morph-it!

Morph-it! is a free morphological resource for the Italian language, a lexicon of inflected forms with their lemma and morphological features. It was designed by Marco Baroni and Eros Zanchetta. The lexicon currently contains 505,074 entries and 35,056 lemmas. Morph-it! can be used as a data source for a lemmatizer/morphological analyzer/morphological generator.

The main source of linguistic data was the "Repubblica" corpus (approximately 380 million tokens), from which lemmas were extracted and morphological information not present in the original corpus (e.g. gender) was inferred using distributional as well as morphological cues. With that information, inflected forms were then generated for all extracted lemmas.

Morph-it includes several corpora. A corpus (plural corpora), or text corpus, is a large and structured set of texts. To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of corpus annotation is part-of-speech tagging, or POS tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the text.

Since only some of these corpora are publicly available, I have chosen:

• La Repubblica, a corpus of Italian newspaper texts published between 1985 and 2000 (approximately 380M tokens);

• ItWac Complete, a corpus of Italian web pages crawled from Italian university websites.

These two corpora form our knowledge base and are combined to provide the widest and most comprehensive knowledge base possible.

Then, the following resources will be used to match the results found:

• Proverbs and aphorisms, from the Italian version of Wikiquote and from the database of Italian proverbs by the "Accademia della Crusca";

• A books database, crawled from the IBS web site;

• Film titles, crawled from the Internet Movie Database;

• Word definitions, from the HOEPLI Dictionary of the Italian language.

3 The algorithm

The basic idea of the algorithm¹ is to derive a set of words from Morph-it!, using only the 5 clues given as input. This will be the set of possible solutions.

Then the probability of each of these words being the actual solution is evaluated. This is done in a second phase, which verifies the existence of proverbs, aphorisms, movie or book titles, etc., that contain both the clue and the possible solution. The words with the most associations found are the solutions.

With the term "possible solutions" I will indicate a selection of words in which the solution is contained, while with the term "final solutions" I will indicate the 100 words chosen among all the possible solutions.

I would also like to point out that the execution speed of the algorithm depends on the speed of the internet connection, since I use an online knowledge base and different data must be downloaded each time.

Here is the pseudo code of the first part of the algorithm, which finds the possible solutions. It is illustrated in Figure 1:

1. The algorithm takes the 5 clues as input.

2. For each clue it executes two queries, one in each corpus: Repubblica and itWac Complete.

3. After downloading, the results of the queries are concatenated as text. At the end of this procedure, for each clue, the algorithm will have generated a single text containing all the concepts related to that clue.

4. Each of the five texts is transformed into a set of single words, yielding 5 sets of words, one for each clue.

5. The intersection of these sets (which I will call the "Final Set") is computed. The solution, semantically linked to all five clues, will in most cases be contained in the final set. This hypothesis will be verified later, in the next section.

¹ Source code: https://gitlab.com/osiast/computer-challenges-guillotine
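The five steps above can be sketched as follows. This is a minimal sketch, assuming the per-clue corpus queries (Repubblica, itWac Complete) have already been downloaded and concatenated into one string per clue; the names used here are illustrative, not from the paper's source code:

```python
# Sketch of phase one (steps 1-5 above): each clue's downloaded query
# results are treated as one text; the texts are split into word sets
# and intersected to obtain the "Final Set" of possible solutions.

def possible_solutions(texts):
    """Intersect the word sets derived from the five per-clue texts."""
    word_sets = [set(text.lower().split()) for text in texts]
    return set.intersection(*word_sets)  # the "Final Set"
```

In the full system, a stop-word list is then subtracted from this set, as described below.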
Figure 1: Process for finding possible solutions for a run of the game

6. Among the possible solutions there are many insignificant words, such as articles, conjunctions and prepositions. To delete them, I subtracted a set of stop words from the final set. Stop words are words which are filtered out before or after the processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. So, I built a special list of stop words specifically for this purpose, based on two online resources:

• A collection of stopwords on GitHub²

• A collection of stopwords from the Snowball site³

After the stopwords have been removed from the final set, I finally have the set of possible solutions.

The next step is to verify the existence of proverbs, aphorisms, movies or books (etc.) that contain both one clue and the possible solution. As already mentioned, the words with the most associations found are the final solutions.

So, along with each possible solution, I search for the clues within the following repositories:

• Proverbs and aphorisms, from two different resources: the Italian version of Wikiquote and the database of Italian proverbs by the "Accademia della Crusca";

• A books database, crawled from the IBS web site;

• Film titles, crawled from the Internet Movie Database;

• Word definitions, from the HOEPLI Dictionary of the Italian language.

Whenever the clue and the solution are found together, for example in the same proverb or film title, an additional weight of 0.2 is assigned to that solution. The weight can vary from 0 to 1; it indicates the probability that this is the solution of the game.

Here is the pseudo code of this second part of the algorithm:

1. For each of the 5 clues, download the proverbs, film and book titles, vocabulary definitions and aphorisms containing that clue, and put them together in one text. At this point we have 5 texts.

2. To all the possible solutions, assign the value 0 as weight.

3. Whenever one of the possible solutions is found in one of these texts, its weight increases by 0.2.

4. Finally, the first 100 solutions with the largest weight are taken, in descending order.

4 Test and Results

Testing the algorithm is a key step in the validation process of the proposed solution.

In the first test below, I run the algorithm on 315 instances of the game in order to evaluate its efficiency and study the results obtained. As already mentioned, each game is composed of five clues and a solution.

The second test concerns the "knowledge base": I measure how many clues, on average, it covers. This will be useful to indicate an upper bound on the efficiency that the algorithm cannot overstep.

4.1 Test the algorithm

To evaluate the efficiency of the algorithm, I tried it on 315 different games.⁴

² github.com/stopwords-iso/stopwords-it/blob/master/stopwords-it.txt
³ snowball.tartarus.org/algorithms/italian/stop.txt
⁴ The games used can be downloaded from this link: https://goo.gl/6FpK3p
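This evaluation loop can be sketched as follows; `solve_game` is a hypothetical stand-in for the full two-phase algorithm described above, and the game representation is an assumption for illustration:

```python
# Sketch of the evaluation over a list of downloaded games. Each game is
# a (clues, solution) pair. solve_game is a hypothetical stand-in for
# the full two-phase algorithm; it returns the ranked list of (at most
# 100) final solutions for the given clues.

def evaluate(games, solve_game):
    """Fraction of games whose solution appears among the final solutions."""
    solved = sum(1 for clues, solution in games
                 if solution in solve_game(clues))
    return solved / len(games)
```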
Results       Number   Percentage
Successful    242      76.83%
Fail          73       23.17%
Total games   315

Table 1: Results on 315 games

         MRR      Returned game   Test game   Solved
This     0.0134   400             105         27
Other    0.6428   405             105         86

Table 2: Comparison between systems

The query is limited to 8,000 lines per request, since I have observed statistically that this is a good compromise between resolution speed and the efficiency of the algorithm.

I downloaded the set of games and coded them into a list. Then I ran the algorithm on each game in the list and recorded the results.

The data has been processed to create Table 1. In over 76% of cases, the exact solution is found among the "possible solutions". This supports the hypothesis that "the solution, semantically linked to all five clues, in most cases will be contained in the final set".⁵

Regarding the final solutions, in 24.76% of cases the solution is among the first 100. I got a similar result with the test set, where I reached 25.7%.

Table 2 shows the distribution of the scores for the correct and missed solutions of our system on the full set of games in the test set, in comparison with another system.⁶
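Since the comparison in Table 2 rests on MRR (Mean Reciprocal Rank), here is the standard computation, shown on invented toy data rather than the paper's games:

```python
# Mean Reciprocal Rank: for each game, take 1/rank of the correct
# solution in the returned ranked list (0 if absent), then average over
# all games. A low MRR despite many solved games means the solution is
# found but ranked far from the top of the 100 proposed solutions.

def mean_reciprocal_rank(ranked_lists, solutions):
    total = 0.0
    for ranked, solution in zip(ranked_lists, solutions):
        if solution in ranked:
            total += 1.0 / (ranked.index(solution) + 1)  # ranks are 1-based
    return total / len(solutions)
```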
4.2 Test the knowledge base: does it always contain the solution?

To verify that the knowledge base is suitable for our purpose and that it always contains the solution, I performed tests on 1,575 clues, that is, all the ones I had.

In particular, I wanted to know whether the knowledge base contained a relationship between the solution and each of them. The basic idea was to look for each clue inside the corpora and then filter the results. For this task I used the NoSketch Engine, an open-source tool for performing corpus searches.

As already mentioned, I tested the 1,575 clues of the games, looking for them (one at a time) inside the corpora "Repubblica" and "ItWac Complete". The results were very positive: out of 1,575 clues, 1,519 were connected by one or more correspondences to the solution.

We can see that out of 315 games:

• 18% (56 of them) cannot be resolved due to the absence of 1 or more clues in the chosen knowledge base;

• only 5% (17 of them) cannot be resolved due to the limit set on the algorithm's queries.

This means that, without limiting the queries, I would find the solution at most 82% of the time. So this result shows that the limit of 8,000 lines per query penalized the efficiency of the algorithm by only 5%. This confirms that limiting the query to 8,000 rows is a good compromise.

5 Analysis of the results and future work

The algorithm's execution, both with the training set and with the test set, produced similar results. From the data obtained, we can notice that the percentage of solutions found in the first phase of the algorithm decreases in the second phase.

Furthermore, the value of the MRR is very low despite the solution being found 27 times out of 105.

The reason for these results is that the number of resources in the matching phase is limited. So the solution, despite being found, is often not among the first of the 100 proposed solutions in the output.

As future developments, we could improve the step of the algorithm that picks the solution from the possible solutions by increasing its resources, to provide more accurate results.

⁵ The full results can be viewed at this link: https://goo.gl/BdCee9
⁶ MRR (Mean Reciprocal Rank)

References

Pierpaolo Basile, Marco de Gemmis, Lucia Siciliani, and Giovanni Semeraro. 2018. Overview of the EVALITA 2018 Solving Language Games (NLP4FUN) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Giovanni Semeraro, Pasquale Lops, Marco de Gemmis, and Pierpaolo Basile. 2012. OTTHO: an artificial player for a complex language game. In Popularize Artificial Intelligence, Proceedings of the AI*IA Workshop and Prize for Celebrating 100th Anniversary of Alan Turing's Birth, Rome, Italy, June 15, 2012, pages 47–53.
Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).