=Paper=
{{Paper
|id=Vol-3033/paper72
|storemode=property
|title=A Multi-Strategy Approach to Crossword Clue Answer Retrieval and Ranking
|pdfUrl=https://ceur-ws.org/Vol-3033/paper72.pdf
|volume=Vol-3033
|authors=Andrea Zugarini,Marco Ernandes
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZugariniE21
}}
==A Multi-Strategy Approach to Crossword Clue Answer Retrieval and Ranking==
Andrea Zugarini(1,2), Marco Ernandes(1)
1. Expert.ai, Italy
2. DIISM, University of Siena, Italy
{azugarini,mernandes}@expert.ai

Abstract

English. Crossword clues represent an extremely challenging form of Question Answering, due to their intentional ambiguity. Databases of previously answered clues are a vital source for the retrieval of candidate answer lists in Automatic Crossword Puzzle (CP) resolution systems. In this paper, we exploit neural language representations for the retrieval and ranking of crossword clues and answers. We assess the performance of several embedding models, both static and contextual, on Italian and English CPs. Results indicate that embeddings usually outperform the baseline. Moreover, the use of embeddings for retrieval allows different ranking strategies, which turned out to be complementary and lead to better results when used in combination.

Italiano (translated). Crossword clues are a particularly complex form of Question Answering because of their intentional ambiguity. Automatic crossword solvers make extensive use of databases of previously answered clues. In this paper we propose the use of embeddings for the semantic search of clue-answer pairs in such databases. Performance is evaluated on crosswords in both Italian and English, comparing several types of embeddings, both contextual and static. The results suggest that semantic search outperforms the baseline. Moreover, the use of embeddings allows different retrieval strategies to be applied, which improve the quality of the results when used jointly.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Crossword Puzzle (CP) resolution is a popular game. As with almost any other human game, it is possible to tackle the problem automatically. CP solvers frame it as a constraint satisfaction task, where the goal is to maximize the probability of filling the grid with answers consistent with their clues and coherent with the puzzle scheme. These systems (Littman et al., 2002; Ernandes et al., 2005; Ginsberg, 2011) heavily rely on lists of candidate answers for each clue. Candidate quality is crucial to CP resolution: if the correct answer is not present in the candidate list, the crossword puzzle cannot be solved correctly, and even a poorly ranked correct answer can lead to a failure in filling the grid. Answer lists can come from multiple solvers, where each solver is typically specialized in solving different kinds of clues and/or exploits different sources of information. Such lists are mainly retrieved with two techniques: (1) querying the web with search engines using clue representations; (2) interrogating clue-answer databases that contain previously answered clues. In this work, we focus on the latter.

In the problem of candidate answer retrieval from clue-answer knowledge sources, answers are ranked according to the similarity between a query clue and the clues in the DB. The similarity is provided by the search engine, which assigns a score to each retrieved answer. Several approaches have been proposed to re-rank the candidate list by means of learning-to-rank strategies (Barlacchi et al., 2014a; Barlacchi et al., 2014b; Nicosia et al., 2015; Nicosia and Moschitti, 2016; Severyn et al., 2015).
These approaches require a training phase to learn how to rank, and differ mainly in the re-ranking model or strategy adopted. In particular, pre-trained distributed representations and neural networks are used for re-ranking clues in (Severyn et al., 2015).

Re-ranking answer candidates attempts to improve the quality of candidate lists, assuming that the correct answer already belongs to the list. Differently from previous work, we aim at directly retrieving richer lists of answer candidates from a clue-answer database. To do so, we exploit both static and contextual distributed representations to perform a semantic search on the DB. An embedding-based search extends the retrieval to semantically related clues that may be phrased differently. Moreover, it also allows us to map questions and answers into the same space, which opens the way to ranking answers directly by their similarity to the query clue. Our approach requires no training on CP data and can be applied with any pre-trained embedding model.

In summary, the contributions of this work are: (1) a semantic search approach to candidate answer retrieval in automatic crossword resolution; (2) two complementary retrieval methodologies (namely QC and QA) whose candidate answers, when combined (even naively), produce a better set of candidates; (3) a comparison between different pre-trained language representations (either static or contextual).

The paper is organized as follows. First, we describe distributed representations of language in Section 2. In Section 3, we present the two answer retrieval approaches proposed in this work. In Section 4 we outline the experiments in detail and discuss the obtained results. Finally, we draw our conclusions in Section 5.

2 Language Representations

Assigning meaningful representations to language is a long-standing problem. Since the inception of the first text mining solutions, the bag-of-words technique has been widely adopted as one of the standard approaches to text representation. Inverted indices and statistical weighting schemes (such as TF-IDF or BM25) are to this day commonly paired with bag-of-words, providing a scalable and effective approach to document retrieval. On the other hand, in the last decade we have witnessed tremendous progress in the field of Natural Language Processing. Much of the credit goes to the diffusion of distributed representations of words (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b; Collobert et al., 2011; Mikolov et al., 2018; Devlin et al., 2018) learned through Language Modeling related tasks on large corpora.

In general, the goal is to assign a fixed-length representation of size $d$, a.k.a. embedding, to a textual passage $s$, such that similar text passages - syntactically and/or semantically - are represented closely in that space. An embedding model $f_e$ is a function mapping $s$ to a $d$-dimensional vector, i.e. $f_e : s \rightarrow \mathbb{R}^d$. Since language is a composition of symbols (typically words), embedding models first tokenize the text and then process the tokens in order to compute the representation of the passage.

Nowadays there are many embedding models, and for some of them pre-trained embeddings are available in a plethora of languages (Yamada et al., 2020; Grave et al., 2018; Yang et al., 2019). Early methods like (Mikolov et al., 2013a) produce dense representations for single tokens - mainly words - so further processing is needed to obtain the actual representation of $s$ when $s$ is composed of multiple words. These kinds of embeddings are also referred to as static embeddings, since the representation of a token is always the same regardless of the context in which it appears. In (Mikolov et al., 2018), the authors extend (Mikolov et al., 2013a) by introducing n-gram and sub-word information, and in (Le and Mikolov, 2014) distributed representations are learned directly for sentences and documents.

Most of the proposed methods for contextual embeddings were based on recurrent neural language models (Melamud et al., 2016; Yang et al., 2019; Chidambaram et al., 2018; Mikolov et al., 2010; Marra et al., 2018; Peters et al., 2018), until the introduction of transformer architectures (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019), which are currently the state-of-the-art models. In the next section we discuss how such representations can be used to perform semantic search. In the experiments, we exploit some of these embedding models, both static and contextual.
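To make the static case concrete, the sketch below maps a passage to a fixed-length vector by averaging pre-trained word embeddings, which is also how multi-word texts are collapsed into a single vector in Section 3. It is a minimal illustration under our own assumptions, not the paper's code: the `vectors` lookup table and the whitespace tokenizer are stand-ins for any static model such as Word2Vec or FastText.

```python
import numpy as np

def embed_passage(text, vectors, dim):
    """f_e: s -> R^d for a static model: tokenize the passage and
    average the pre-trained embeddings of its known tokens.

    vectors: hypothetical token -> np.ndarray lookup (e.g. loaded
    from pre-trained Word2Vec/FastText vectors of dimension `dim`).
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:  # passage entirely out of vocabulary
        return np.zeros(dim)
    return np.mean(known, axis=0)
```

A contextual model such as the Universal Sentence Encoder would replace this function entirely, since it maps the whole sequence to a single vector in one step.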
3 Semantic Search

Traditional CP solvers rely on Similar Clue Retrieval mechanisms. The idea is to find possible answers from clues in the database that are similar to the given query. This is particularly effective for crosswords, since the same clues tend to be repeated over time, or may exhibit only small lexical variations. Retrieval of similar clues relies on search engines built on classical IR algorithms such as TF-IDF or BM25, representing clues in the database as documents to retrieve, given the target clue as query.

Here, instead, we retrieve and rank documents with semantic search. We propose two strategies, namely QC and QA. QC is analogous to classical similar clue retrieval systems, with the difference that text is represented with a dense representation. The approach retrieves and ranks from the DB the clues similar to the query and returns the answers associated with those clues. QA, instead, ranks the answers directly by computing the cosine similarity between the query and the answers themselves. Intuitively, the latter approach ranks well answers semantically correlated to the question itself, which is particularly useful for clues about synonyms. As we will show in Section 4, due to their different nature, the lists of candidates retrieved by the two approaches are strongly complementary. A sketch of the two approaches is outlined in Fig. 1. Let us describe them separately.

[Figure 1: Sketch of the two answer candidate retrieval approaches: QC (on the left), QA (on the right). In QC, ranking is based on the similarity between the query embedding and all the clues in the DB, while in QA the similarity is computed between the query and the answers in the database.]

3.1 Similar Clues Retrieval

We are given a query clue, which is a sequence of $n$ words $q := (w_1, \ldots, w_n)$, and a clue-answer DB $(C, A)$ constituted by $M$ clue-answer pairs, where $C$ and $A$ indicate the lists of all the clues and answers, respectively, while we denote a clue-answer pair as $(c, a)$. We assign a fixed-length representation $q_e \in \mathbb{R}^d$ to the query clue $q$, computed with an embedding model:

$$q_e = f_e(q). \quad (1)$$

For contextual embeddings $f_e$ is the model itself, since they work directly on the sequence, whereas for static embeddings we have to collapse the $n$ word representations into a single vector. For simplicity, we just average them. Analogously, each clue $c \in C$ is encoded as in Equation 1. Then, we measure the cosine similarity between the query and each clue:

$$\mathrm{score}(q, (c, a)) = \cos(q_e, f_e(c)), \quad (2)$$

where $\cos(\cdot, \cdot)$ denotes the cosine similarity. Thus, we obtain a similarity score for each clue-answer pair. To finally rank the answers, we average the scores of all clue-answer pairs sharing the same answer:

$$\mathrm{score}(q, a) = \frac{1}{|\mathcal{A}|} \sum_{a_k \in \mathcal{A}} \mathrm{score}(q, (c, a_k)), \quad (3)$$

where $\mathcal{A}$ indicates the set of clue-answer pairs whose answer $a_k$ is equal to $a$. All the answers are then ranked by this score. Since we know a priori the length of a query answer, candidates with incorrect lengths are filtered out. We refer to this approach as QC (Query-Clue).
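A minimal sketch of the QC scoring above, under the assumption that all embeddings are pre-computed and L2-normalized, so that cosine similarity reduces to a dot product; the variable names are illustrative, not from the paper.

```python
from collections import defaultdict
import numpy as np

def qc_rank(query_vec, clue_vecs, answers, answer_len):
    """QC (Query-Clue): score each clue-answer pair by cos(query, clue)
    (Eq. 2), average the scores of pairs sharing the same answer (Eq. 3),
    drop candidates of the wrong length, and rank by score.

    query_vec:  (d,) L2-normalized query clue embedding (Eq. 1).
    clue_vecs:  (M, d) L2-normalized embeddings of the M clues in the DB.
    answers:    list of M answer strings aligned with clue_vecs.
    answer_len: answer length known a priori from the grid.
    """
    sims = clue_vecs @ query_vec  # cosine similarities (normalized vectors)
    per_answer = defaultdict(list)
    for answer, sim in zip(answers, sims):
        per_answer[answer].append(sim)
    scores = {a: float(np.mean(s)) for a, s in per_answer.items()
              if len(a) == answer_len}  # length filter
    return sorted(scores, key=scores.get, reverse=True)
```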
3.2 Similar Answers Retrieval

Since we can map text into a fixed-length space, we can also rank by measuring the similarity between the query and the answer itself. The query is encoded exactly as in Equation 1. In this case, however, we only need the clue-answer DB to retrieve the set of unique answers, denoted as $A$. Similarly to Equation 2, we compute the cosine similarity between query and answer embeddings:

$$\mathrm{score}(q, a) = \cos(q_e, f_e(a)), \quad (4)$$

for each $a \in A$; then we rank as in QC. We call this approach QA (Query-Answer). It is important to remark that QA is only feasible with latent representations: traditional methods like TF-IDF are not suited because of the sparsity of their representations. Moreover, QA is somewhat an orthogonal strategy with respect to QC. We will see in Section 4 how even a trivial ensemble of QA and QC is beneficial to the performance.
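QA admits an even simpler sketch: the unique answers are embedded once, and ranking amounts to a single similarity computation (Eq. 4). As before, the L2-normalized vectors and the variable names are our illustrative assumptions.

```python
def qa_rank(query_vec, answer_vecs, unique_answers, answer_len):
    """QA (Query-Answer): rank the unique answers of the DB directly
    by cos(query, answer) (Eq. 4), keeping only length-consistent ones.

    answer_vecs: (|A|, d) L2-normalized embeddings of the unique answers,
    aligned with the `unique_answers` list.
    """
    sims = answer_vecs @ query_vec
    ranked = sorted(zip(unique_answers, sims), key=lambda pair: -pair[1])
    return [a for a, _ in ranked if len(a) == answer_len]
```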
4 Experiments

In the experiments we aim to prove the effectiveness of semantic search in retrieving accurate lists of candidate answers, and to show that the QA approach provides complementary information that can increase the coverage of the retrieval.

4.1 Experimental Setup

We considered three well-known embedding models for our experiments, two static (Word2Vec, FastText) and one contextual (Universal Sentence Encoder), briefly denoted as W2V, FT and USE, respectively. We exploited pre-trained models for all of them (Word2Vec, English: https://code.google.com/archive/p/word2vec/, Italian: https://wikipedia2vec.github.io/wikipedia2vec/; FastText: https://fasttext.cc/; USE: https://tfhub.dev/google/collections/universal-sentence-encoder/1). In the absence of an Italian USE model, for the Italian crossword database we used the multilingual version of USE, which was trained on 16 languages (Italian included). Embedding models are compared against TF-IDF, a typical text representation in document retrieval problems.

To measure performance, we used well-known metrics for retrieval systems. In particular, we considered Mean Hit at k (MH@k) and Mean Reciprocal Rank (MRR). Hit at k is 1 if the correct answer is within the first $k$ elements of the list, 0 otherwise. Hits at k are evaluated for $k \in \{1, 5, 20, 100\}$. MRR is defined as:

$$\mathrm{MRR} = \frac{1}{n} \sum_{q=1}^{n} \frac{1}{\mathrm{rank}(q)}.$$
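Both metrics depend only on the rank of the correct answer in each returned list; a minimal sketch, assuming 1-based ranks and an infinite rank when the answer was not retrieved at all:

```python
def mean_hit_at_k(ranks, k):
    """MH@k: fraction of queries whose correct answer is in the top k.
    `ranks` holds the 1-based rank of the correct answer per query
    (float('inf') when the answer was not retrieved)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR = (1/n) * sum over queries q of 1/rank(q)."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```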
4.2 Datasets

We consider two different clue-answer databases for our experimentation. In particular, experiments were carried out on two languages, Italian and English, respectively on the CWDB dataset (Barlacchi et al., 2014a) and on New York Times crosswords. We apply the same pre-processing pipeline to both corpora (a code sketch of these steps is given at the end of this subsection). (1) We discarded clue-answer pairs having answers with three characters or fewer, because they are typically about linguistic puzzles and are addressed differently in CP solvers. (2) Answers and clues containing special characters are removed. (3) Text is lower-cased and punctuation removed. (4) We kept only answers appearing in at least two clues.

English Crosswords. The data consist of a collection of clue-answer pairs for crossword puzzles published in the New York Times (https://www.nytimes.com/) in 1997 and 2005, previously collected in (Ernandes et al., 2008). Overall, there are about 61,000 clue-answer pair samples. Clues, answers and clue-answer pairs may occur multiple times. A clue is generally a short sentence, while answers usually consist of a single word, although there are cases of multi-word answers; in such cases the answer is a string made of multiple words without any word separator. After pre-processing we obtain a corpus of 31,808 pairs, in which 27,527 questions and 8,324 answers are unique.

Italian Crosswords. The clue-answer database for Italian was constructed from the CWDB v0.1 it corpus (https://ikernels-portal.disi.unitn.it/projects/webcrow) (Barlacchi et al., 2014a). We combined pairs from both the train and test splits, since we did not perform any training in our experiments, and we took care to omit each clue-answer pair itself during its own evaluation. From the original 62,011 pairs, 25,545 pairs remain after pre-processing, consisting of 5,813 unique answers and 16,970 unique questions.
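A hedged sketch of the four pre-processing steps above; the regular expression defining "special characters" is our assumption, since the paper does not specify it.

```python
import re
from collections import Counter

def preprocess(pairs):
    """Apply the four filtering steps to a list of (clue, answer) pairs."""
    # (1) drop pairs whose answer has three characters or fewer
    pairs = [(c, a) for c, a in pairs if len(a) > 3]
    # (2) remove pairs whose clue or answer contains special characters;
    #     which characters count as "special" is our assumption
    special = re.compile(r"[^\w\s'.,;:!?()-]")
    pairs = [(c, a) for c, a in pairs
             if not (special.search(c) or special.search(a))]
    # (3) lowercase and strip punctuation
    def clean(text):
        return re.sub(r"[^\w\s]", "", text.lower()).strip()
    pairs = [(clean(c), clean(a)) for c, a in pairs]
    # (4) keep only answers appearing in at least two clues
    counts = Counter(a for _, a in pairs)
    return [(c, a) for c, a in pairs if counts[a] >= 2]
```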
4.3 Results

All the results for Italian and English crosswords are reported in Tables 1 and 2, respectively.

| Model | Strategy | MH@1 | MH@5 | MH@20 | MH@100 | MRR |
|---|---|---|---|---|---|---|
| W2V | QA | **14.97** | **32.55** | **50.35** | **71.59** | **23.80** |
| FT | QA | 6.78 | 14.47 | 26.88 | 52.46 | 11.44 |
| USE | QA | 7.89 | 17.81 | 29.30 | 46.80 | 13.24 |
| TF-IDF | QC | **60.79** | 66.43 | 68.53 | 72.62 | 63.54 |
| W2V | QC | 52.34 | 64.75 | 72.58 | 82.66 | 58.26 |
| FT | QC | 23.50 | 34.13 | 45.94 | 64.09 | 29.05 |
| USE | QC | 60.69 | **70.93** | **76.81** | **84.70** | **65.57** |
| Ensemble(USE-W2V) | QC-QA | - | 73.59 | 82.39 | 91.22 | - |

Table 1: Evaluation of performance on CWDB Italian data. The best values of each column within each strategy (QC and QA) are marked in bold.

| Model | Strategy | MH@1 | MH@5 | MH@20 | MH@100 | MRR |
|---|---|---|---|---|---|---|
| W2V | QA | 7.58 | 17.27 | 27.78 | 42.62 | 12.66 |
| FT | QA | 7.72 | 17.35 | 27.29 | 43.42 | 12.75 |
| USE | QA | **8.63** | **19.69** | **30.01** | **45.17** | **14.25** |
| TF-IDF | QC | **26.15** | 37.62 | 44.09 | 49.54 | 31.46 |
| W2V | QC | 19.63 | 31.69 | 42.66 | 57.38 | 25.65 |
| FT | QC | 15.72 | 24.32 | 32.67 | 46.64 | 20.20 |
| USE | QC | 25.78 | **38.57** | **49.34** | **63.35** | **32.12** |
| Ensemble(USE-USE) | QC-QA | - | 41.40 | 54.34 | 69.00 | - |

Table 2: Evaluation of performance on English data. The best values of each column within each strategy (QC and QA) are marked in bold.

Several interesting insights can be drawn from them. First of all, contextual representations from the Universal Sentence Encoder are generally the most effective, especially on similar clue retrieval (QC), where both the query and the elements to rank are textual sequences. Nonetheless, Word2Vec embeddings work surprisingly well, outperforming FastText almost every time. Furthermore, they are the best performers on QA search in the Italian database. We believe the reason why Word2Vec outperforms USE on Italian QA is twofold. First, the advantage of contextual embeddings is less evident in the QA setup; indeed, USE brings fewer benefits on English QA as well. Second, USE is a multilingual model, so its embeddings are less specialized than those of Word2Vec, which was instead trained on Italian only.

When comparing semantic search models against the baseline (TF-IDF) - which is only possible in QC - we notice that static embeddings struggle to outperform it. Indeed, the sparse nature of TF-IDF induces crisp similarity scores: very high for clues sharing the same keywords, extremely low for all the rest. On the contrary, similarity scores are more blurred with dense embeddings. As a consequence, TF-IDF achieves high MH@1 and MH@5 scores (and a high MRR too). However, TF-IDF yields poorer coverage as the candidate list grows (MH@20 and MH@100). This behavior is also evident in Fig. 2, where we compare the cumulative distributions of ranks obtained with USE and TF-IDF. After the initial bump, the growth of TF-IDF hits is almost linear (i.e. random), whereas the Universal Sentence Encoder keeps growing significantly.

[Figure 2: Comparison between the cumulative distribution functions of rankings using the USE (blue) and TF-IDF (orange) methods on English crosswords.]

Ensembling QC and QA. Analyzing the results, we observed that the ranks from QA and QC had low levels of overlap. In the last line of Tables 1 and 2 we report the performance of a naive ensemble approach combining the QC and QA strategies. Given the limited overlap, we merge the two ranked lists by taking the first K/2 ranks from each strategy to compute MH@K, with K ∈ {5, 20, 100} (since K=5 is odd, we took the first three ranks from QC and the first two from QA). We chose the best embedding model for each strategy. Despite its simplicity and the large room for improvement, the ensemble significantly improved the performance in both languages. This suggests possible directions for further improving the retrieval components of CP solvers.
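The naive ensemble can be reproduced with a simple list merge; a sketch under the paper's rule (first K/2 candidates from each list, i.e. 3 from QC and 2 from QA when K=5). How duplicates shared by the two lists are handled is our assumption, since the paper does not specify it.

```python
def merge_top_k(qc_candidates, qa_candidates, k):
    """Naive QC-QA ensemble: take the first ceil(k/2) candidates from QC
    and the first floor(k/2) from QA, then deduplicate preserving order
    (the deduplication policy is an assumption, not from the paper)."""
    head = qc_candidates[:(k + 1) // 2] + qa_candidates[:k // 2]
    merged = []
    for answer in head:
        if answer not in merged:
            merged.append(answer)
    return merged

# e.g. merge_top_k(qc_list, qa_list, 5) -> 3 QC candidates + 2 QA candidates
```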
5 Conclusions

In this paper, we proposed two different semantic search strategies (QC and QA) for retrieving and ranking answer candidates for CP clues. We exploited pre-trained state-of-the-art embeddings, both static and contextual, to rank clue-answer pairs from databases. Embedding-based retrieval overcomes some of the limitations of inverted-index models, leading to higher-coverage rankings and enabling similar answer retrieval (QA). Finally, we observed that even a simple ensemble combining QC and QA is effective and improves the overall retrieval performance.

This opens further research directions, where learning-to-rank methods could be exploited to better combine candidate answer lists from complementary approaches like QC and QA.

Acknowledgments

We thank Nicola Landolfi and Marco Maggini for the great support and fruitful discussions.

References

Gianni Barlacchi, Massimo Nicosia, and Alessandro Moschitti. 2014a. Learning to rank answer candidates for automatic resolution of crossword puzzles. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 39–48.

Gianni Barlacchi, Massimo Nicosia, and Alessandro Moschitti. 2014b. A retrieval model for automatic resolution of crossword puzzles in Italian language. In The First Italian Conference on Computational Linguistics CLiC-it 2014, page 33.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Marco Ernandes, Giovanni Angelini, and Marco Gori. 2005. WebCrow: A web-based system for crossword solving. In AAAI, pages 1412–1417.

Marco Ernandes, Giovanni Angelini, and Marco Gori. 2008. A web-based agent challenges human experts on crosswords. AI Magazine, 29(1):77–77.

Matthew L. Ginsberg. 2011. Dr. Fill: Crosswords and an implemented solver for singly weighted CSPs. Journal of Artificial Intelligence Research, 42:851–886.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.

Michael L. Littman, Greg A. Keim, and Noam Shazeer. 2002. A probabilistic approach to solving crossword puzzles. Artificial Intelligence, 134(1-2):23–55.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Giuseppe Marra, Andrea Zugarini, Stefano Melacci, and Marco Maggini. 2018. An unsupervised character-aware neural approach to word and context representation learning. In International Conference on Artificial Neural Networks, pages 126–136. Springer.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Massimo Nicosia and Alessandro Moschitti. 2016. Crossword puzzle resolution in Italian using distributional models for clue similarity. In IIR.

Massimo Nicosia, Gianni Barlacchi, and Alessandro Moschitti. 2015. Learning to rank aggregated answers for crossword puzzles. In European Conference on Information Retrieval, pages 556–561. Springer.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Aliaksei Severyn, Massimo Nicosia, Gianni Barlacchi, and Alessandro Moschitti. 2015. Distributional neural networks for automatic resolution of crossword puzzles. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 199–204.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 23–30. Association for Computational Linguistics.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.