=Paper=
{{Paper
|id=Vol-3033/paper72
|storemode=property
|title=A Multi-Strategy Approach to Crossword Clue Answer Retrieval and Ranking
|pdfUrl=https://ceur-ws.org/Vol-3033/paper72.pdf
|volume=Vol-3033
|authors=Andrea Zugarini,Marco Ernandes
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZugariniE21
}}
==A Multi-Strategy Approach to Crossword Clue Answer Retrieval and Ranking==
Andrea Zugarini(1,2), Marco Ernandes(1)
1. Expert.ai, Italy
2. DIISM, University of Siena, Italy
{azugarini,mernandes}@expert.ai

Abstract

English. Crossword clues represent an extremely challenging form of Question Answering, due to their intentional ambiguity. Databases of previously answered clues are a vital source for the retrieval of candidate answer lists in Automatic Crossword Puzzle (CP) resolution systems. In this paper, we exploit neural language representations for the retrieval and ranking of crossword clues and answers. We assess the performance of several embedding models, both static and contextual, on Italian and English CPs. Results indicate that embeddings usually outperform the baseline. Moreover, the use of embeddings for retrieval allows different ranking strategies, which turned out to be complementary and lead to better results when used in combination.

Italiano (translated). Crossword clues are a particularly complex form of Question Answering because of their intentional ambiguity. Automatic crossword solvers make extensive use of databases of previously answered clues. In this paper we propose the use of embeddings for the semantic search of clue-answer pairs in such databases. Performance is evaluated on crosswords in both Italian and English, comparing several types of embeddings, both contextual and static. The results suggest that semantic search outperforms the baseline. Moreover, the use of embeddings allows different retrieval strategies to be applied, which improve the quality of the results when used jointly.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Crossword Puzzle (CP) resolution is a popular game. As with almost any other human game, it is possible to tackle the problem automatically. CP solvers frame it as a constraint satisfaction task, where the goal is to maximize the probability of filling the grid with answers consistent with their clues and coherent with the puzzle scheme. These systems (Littman et al., 2002; Ernandes et al., 2005; Ginsberg, 2011) heavily rely on lists of candidate answers for each clue. Candidate quality is crucial to CP resolution: if the correct answer is not present in the candidate list, the crossword puzzle cannot be solved correctly, and even a poorly ranked correct answer can lead to a failure in filling the grid. Answer lists can come from multiple solvers, where each solver is typically specialized in solving different kinds of clues and/or exploits different sources of information. Such lists are mainly retrieved with two techniques: (1) querying the web with search engines using clue representations; (2) interrogating clue-answer databases that contain previously answered clues. In this work, we focus on the latter.

In the problem of candidate answer retrieval from clue-answer knowledge sources, answers are ranked according to the similarity between a query clue and the clues in the DB. The similarity is provided by the search engine, which assigns a score to each retrieved answer. Several approaches have been proposed to re-rank the candidate list by means of learning-to-rank strategies (Barlacchi et al., 2014a; Barlacchi et al., 2014b; Nicosia et al., 2015; Nicosia and Moschitti, 2016; Severyn et al., 2015).
These approaches require a training phase to learn how to rank, and differ mainly in the re-ranking model or strategy adopted. In particular, pre-trained distributed representations and neural networks are used for re-ranking clues in (Severyn et al., 2015).

Re-ranking answer candidates attempts to improve the quality of candidate lists, assuming that the correct answer already belongs to the list. Differently from previous work, we aim at directly retrieving richer lists of answer candidates from a clue-answer database. To do so, we exploit both static and contextual distributed representations to perform a semantic search on the DB. An embedding-based search extends the retrieval to semantically related clues that may be phrased differently. Moreover, it also allows us to map questions and answers into the same space, which opens the way to ranking answers directly by their similarity to the query clue. Our approach requires no training on CP data and can be applied with any pre-trained embedding model.

In summary, the contributions of this work are: (1) a semantic search approach to candidate answer retrieval in automatic crossword resolution; (2) two complementary retrieval methodologies (namely QC and QA) whose candidate answers, when combined (even naively), produce a better set of candidates; (3) a comparison between different pre-trained language representations (either static or contextual).

The paper is organized as follows. First, we describe distributed representations of language in Section 2. In Section 3, we present the two answer retrieval approaches proposed in this work. In Section 4 we outline the experiments in detail and discuss the obtained results. Finally, we draw our conclusions in Section 5.

2 Language Representations

Assigning meaningful representations to language is a long-standing problem. Since the inception of the first text mining solutions, the bag-of-words technique has been widely adopted as one of the standard approaches to text representation. Inverted indices and statistical weighting schemes (such as TF-IDF or BM25) are to this day commonly paired with bag-of-words, providing a scalable and effective approach to document retrieval. On the other hand, in the last decade we have witnessed tremendous progress in the field of Natural Language Processing. Much of the credit goes to the diffusion of distributed representations of words (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b; Collobert et al., 2011; Mikolov et al., 2018; Devlin et al., 2018) learned through Language Modeling related tasks on large corpora.

In general, the goal is to assign a fixed-length representation of size $d$, a.k.a. embedding, to a textual passage $s$, such that similar text passages - syntactically and/or semantically - are represented closely in that space. An embedding model $f_e$ is a function mapping $s$ to a $d$-dimensional vector, i.e. $f_e : s \rightarrow \mathbb{R}^d$. Since language is a composition of symbols (typically words), embedding models first tokenize the text and then process the tokens in order to compute the representation of the passage.

Nowadays there are many embedding models, and for some of them pre-trained embeddings are available in a plethora of languages (Yamada et al., 2020; Grave et al., 2018; Yang et al., 2019). Early methods like (Mikolov et al., 2013a) produce dense representations for single tokens - mainly words - so further processing is needed to obtain the actual representation of $s$ when $s$ is composed of multiple words. These kinds of embeddings are also referred to as static embeddings, since the representation of a token is always the same regardless of the context in which it appears. In (Mikolov et al., 2018), the authors extend (Mikolov et al., 2013a) by introducing n-gram and sub-word information, and in (Le and Mikolov, 2014) distributed representations are learned directly for sentences and documents.

Most of the proposed methods for contextual embeddings were based on recurrent neural language models (Melamud et al., 2016; Yang et al., 2019; Chidambaram et al., 2018; Mikolov et al., 2010; Marra et al., 2018; Peters et al., 2018), until the introduction of transformer architectures (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019), which are currently the state-of-the-art models. In the next section we discuss how such representations can be used to perform semantic search. In the experiments, we exploit some of these embedding models, both static and contextual.
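To make the static case concrete, the sketch below maps a passage to a fixed-length vector by averaging pre-trained word embeddings, which is also how multi-word texts are collapsed into a single vector in Section 3. It is a minimal illustration under our own assumptions, not the paper's code: the `vectors` lookup table and the whitespace tokenizer are stand-ins for any static model such as Word2Vec or FastText.

```python
import numpy as np

def embed_passage(text, vectors, dim):
    """f_e: s -> R^d for a static model: tokenize the passage and
    average the pre-trained embeddings of its known tokens.

    vectors: hypothetical token -> np.ndarray lookup (e.g. loaded
    from pre-trained Word2Vec/FastText vectors of dimension `dim`).
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:  # passage entirely out of vocabulary
        return np.zeros(dim)
    return np.mean(known, axis=0)
```

A contextual model such as the Universal Sentence Encoder would replace this function entirely, since it maps the whole sequence to a single vector in one step.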
3 Semantic Search

Traditional CP solvers rely on Similar Clue Retrieval mechanisms. The idea is to find possible answers from clues in the database that are similar to the given query. This is particularly effective for crosswords, since the same clues tend to be repeated over time, or may exhibit only small lexical variations. Retrieval of similar clues relies on search engines built on classical IR algorithms such as TF-IDF or BM25, representing clues in the database as documents to retrieve, given the target clue as query.

Here, instead, we retrieve and rank documents with semantic search. We propose two strategies, namely QC and QA. QC is analogous to classical similar clue retrieval systems, with the difference that text is represented with a dense representation. The approach retrieves and ranks from the DB the clues similar to the query and returns the answers associated with those clues. QA, instead, ranks the answers directly by computing the cosine similarity between the query and the answers themselves. Intuitively, the latter approach ranks well answers semantically correlated to the question itself, which is particularly useful for clues about synonyms. As we will show in Section 4, due to their different nature, the lists of candidates retrieved by the two approaches are strongly complementary. A sketch of the two approaches is outlined in Fig. 1. Let us describe them separately.

[Figure 1: Sketch of the two answer candidate retrieval approaches: QC (on the left), QA (on the right). In QC, ranking is based on the similarity between the query embedding and all the clues in the DB, while in QA the similarity is computed between the query and the answers in the database.]

3.1 Similar Clues Retrieval

We are given a query clue, which is a sequence of $n$ words $q := (w_1, \ldots, w_n)$, and a clue-answer DB $(C, A)$ constituted by $M$ clue-answer pairs, where $C$ and $A$ indicate the lists of all the clues and answers, respectively, while we denote a clue-answer pair as $(c, a)$. We assign a fixed-length representation $q_e \in \mathbb{R}^d$ to the query clue $q$, computed with an embedding model:

$$q_e = f_e(q). \quad (1)$$

For contextual embeddings $f_e$ is the model itself, since they work directly on the sequence, whereas for static embeddings we have to collapse the $n$ word representations into a single vector. For simplicity, we just average them. Analogously, each clue $c \in C$ is encoded as in Equation 1. Then, we measure the cosine similarity between the query and each clue:

$$\mathrm{score}(q, (c, a)) = \cos(q_e, f_e(c)), \quad (2)$$

where $\cos(\cdot, \cdot)$ denotes the cosine similarity. Thus, we obtain a similarity score for each clue-answer pair. To finally rank the answers, we average the scores of all clue-answer pairs sharing the same answer:

$$\mathrm{score}(q, a) = \frac{1}{|\mathcal{A}|} \sum_{a_k \in \mathcal{A}} \mathrm{score}(q, (c, a_k)), \quad (3)$$

where $\mathcal{A}$ indicates the set of clue-answer pairs whose answer $a_k$ is equal to $a$. All the answers are then ranked by this score. Since we know a priori the length of a query answer, candidates with incorrect lengths are filtered out. We refer to this approach as QC (Query-Clue).
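A minimal sketch of the QC scoring above, under the assumption that all embeddings are pre-computed and L2-normalized, so that cosine similarity reduces to a dot product; the variable names are illustrative, not from the paper.

```python
from collections import defaultdict
import numpy as np

def qc_rank(query_vec, clue_vecs, answers, answer_len):
    """QC (Query-Clue): score each clue-answer pair by cos(query, clue)
    (Eq. 2), average the scores of pairs sharing the same answer (Eq. 3),
    drop candidates of the wrong length, and rank by score.

    query_vec:  (d,) L2-normalized query clue embedding (Eq. 1).
    clue_vecs:  (M, d) L2-normalized embeddings of the M clues in the DB.
    answers:    list of M answer strings aligned with clue_vecs.
    answer_len: answer length known a priori from the grid.
    """
    sims = clue_vecs @ query_vec  # cosine similarities (normalized vectors)
    per_answer = defaultdict(list)
    for answer, sim in zip(answers, sims):
        per_answer[answer].append(sim)
    scores = {a: float(np.mean(s)) for a, s in per_answer.items()
              if len(a) == answer_len}  # length filter
    return sorted(scores, key=scores.get, reverse=True)
```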
3.2 Similar Answers Retrieval

Since we can map text into a fixed-length space, we can also rank by measuring the similarity between the query and the answer itself. The query is encoded exactly as in Equation 1. In this case, however, we only need the clue-answer DB to retrieve the set of unique answers, denoted as $A$. Similarly to Equation 2, we compute the cosine similarity between query and answer embeddings:

$$\mathrm{score}(q, a) = \cos(q_e, f_e(a)), \quad (4)$$

for each $a \in A$; then we rank as in QC. We call this approach QA (Query-Answer). It is important to remark that QA is only feasible with latent representations: traditional methods like TF-IDF are not suited because of the sparsity of their representations. Moreover, QA is somewhat an orthogonal strategy with respect to QC. We will see in Section 4 how even a trivial ensemble of QA and QC is beneficial to the performance.
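QA admits an even simpler sketch: the unique answers are embedded once, and ranking amounts to a single similarity computation (Eq. 4). As before, the L2-normalized vectors and the variable names are our illustrative assumptions.

```python
def qa_rank(query_vec, answer_vecs, unique_answers, answer_len):
    """QA (Query-Answer): rank the unique answers of the DB directly
    by cos(query, answer) (Eq. 4), keeping only length-consistent ones.

    answer_vecs: (|A|, d) L2-normalized embeddings of the unique answers,
    aligned with the `unique_answers` list.
    """
    sims = answer_vecs @ query_vec
    ranked = sorted(zip(unique_answers, sims), key=lambda pair: -pair[1])
    return [a for a, _ in ranked if len(a) == answer_len]
```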
4 Experiments

In the experiments we aim to prove the effectiveness of semantic search in retrieving accurate lists of candidate answers, and to show that the QA approach provides complementary information that can increase the coverage of the retrieval.

4.1 Experimental Setup

We considered three well-known embedding models for our experiments, two static (Word2Vec, FastText) and one contextual (Universal Sentence Encoder), briefly denoted as W2V, FT and USE, respectively. We exploited pre-trained models for all of them (Word2Vec, English: https://code.google.com/archive/p/word2vec/, Italian: https://wikipedia2vec.github.io/wikipedia2vec/; FastText: https://fasttext.cc/; USE: https://tfhub.dev/google/collections/universal-sentence-encoder/1). In the absence of an Italian USE model, for the Italian crossword database we used the multilingual version of USE, which was trained on 16 languages (Italian included). Embedding models are compared against TF-IDF, a typical text representation in document retrieval problems.

To measure performance, we used well-known metrics for retrieval systems. In particular, we considered Mean Hit at k (MH@k) and Mean Reciprocal Rank (MRR). Hit at k is 1 if the correct answer is within the first $k$ elements of the list, 0 otherwise. Hits at k are evaluated for $k \in \{1, 5, 20, 100\}$. MRR is defined as:

$$\mathrm{MRR} = \frac{1}{n} \sum_{q=1}^{n} \frac{1}{\mathrm{rank}(q)}.$$
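Both metrics depend only on the rank of the correct answer in each returned list; a minimal sketch, assuming 1-based ranks and an infinite rank when the answer was not retrieved at all:

```python
def mean_hit_at_k(ranks, k):
    """MH@k: fraction of queries whose correct answer is in the top k.
    `ranks` holds the 1-based rank of the correct answer per query
    (float('inf') when the answer was not retrieved)."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR = (1/n) * sum over queries q of 1/rank(q)."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```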
4.2 Datasets

We consider two different clue-answer databases for our experimentation. In particular, experiments were carried out on two languages, Italian and English, respectively on the CWDB dataset (Barlacchi et al., 2014a) and on New York Times crosswords. We apply the same pre-processing pipeline to both corpora (a code sketch of these steps is given at the end of this subsection). (1) We discarded clue-answer pairs having answers with three characters or fewer, because they are typically about linguistic puzzles and are addressed differently in CP solvers. (2) Answers and clues containing special characters are removed. (3) Text is lower-cased and punctuation removed. (4) We kept only answers appearing in at least two clues.

English Crosswords. The data consist of a collection of clue-answer pairs for crossword puzzles published in the New York Times (https://www.nytimes.com/) in 1997 and 2005, previously collected in (Ernandes et al., 2008). Overall, there are about 61,000 clue-answer pair samples. Clues, answers and clue-answer pairs may occur multiple times. A clue is generally a short sentence, while answers usually consist of a single word, although there are cases of multi-word answers; in such cases the answer is a string made of multiple words without any word separator. After pre-processing we obtain a corpus of 31,808 pairs, in which 27,527 questions and 8,324 answers are unique.

Italian Crosswords. The clue-answer database for Italian was constructed from the CWDB v0.1 it corpus (https://ikernels-portal.disi.unitn.it/projects/webcrow) (Barlacchi et al., 2014a). We combined pairs from both the train and test splits, since we did not perform any training in our experiments, and we took care to omit each clue-answer pair itself during its own evaluation. From the original 62,011 pairs, 25,545 pairs remain after pre-processing, consisting of 5,813 unique answers and 16,970 unique questions.
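A hedged sketch of the four pre-processing steps above; the regular expression defining "special characters" is our assumption, since the paper does not specify it.

```python
import re
from collections import Counter

def preprocess(pairs):
    """Apply the four filtering steps to a list of (clue, answer) pairs."""
    # (1) drop pairs whose answer has three characters or fewer
    pairs = [(c, a) for c, a in pairs if len(a) > 3]
    # (2) remove pairs whose clue or answer contains special characters;
    #     which characters count as "special" is our assumption
    special = re.compile(r"[^\w\s'.,;:!?()-]")
    pairs = [(c, a) for c, a in pairs
             if not (special.search(c) or special.search(a))]
    # (3) lowercase and strip punctuation
    def clean(text):
        return re.sub(r"[^\w\s]", "", text.lower()).strip()
    pairs = [(clean(c), clean(a)) for c, a in pairs]
    # (4) keep only answers appearing in at least two clues
    counts = Counter(a for _, a in pairs)
    return [(c, a) for c, a in pairs if counts[a] >= 2]
```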
4.3 Results

All the results for Italian and English crosswords are reported in Tables 1 and 2, respectively.

| Model | Strategy | MH@1 | MH@5 | MH@20 | MH@100 | MRR |
|---|---|---|---|---|---|---|
| W2V | QA | **14.97** | **32.55** | **50.35** | **71.59** | **23.80** |
| FT | QA | 6.78 | 14.47 | 26.88 | 52.46 | 11.44 |
| USE | QA | 7.89 | 17.81 | 29.30 | 46.80 | 13.24 |
| TF-IDF | QC | **60.79** | 66.43 | 68.53 | 72.62 | 63.54 |
| W2V | QC | 52.34 | 64.75 | 72.58 | 82.66 | 58.26 |
| FT | QC | 23.50 | 34.13 | 45.94 | 64.09 | 29.05 |
| USE | QC | 60.69 | **70.93** | **76.81** | **84.70** | **65.57** |
| Ensemble(USE-W2V) | QC-QA | - | 73.59 | 82.39 | 91.22 | - |

Table 1: Evaluation of performance on CWDB Italian data. The best values of each column within each strategy (QC and QA) are marked in bold.

| Model | Strategy | MH@1 | MH@5 | MH@20 | MH@100 | MRR |
|---|---|---|---|---|---|---|
| W2V | QA | 7.58 | 17.27 | 27.78 | 42.62 | 12.66 |
| FT | QA | 7.72 | 17.35 | 27.29 | 43.42 | 12.75 |
| USE | QA | **8.63** | **19.69** | **30.01** | **45.17** | **14.25** |
| TF-IDF | QC | **26.15** | 37.62 | 44.09 | 49.54 | 31.46 |
| W2V | QC | 19.63 | 31.69 | 42.66 | 57.38 | 25.65 |
| FT | QC | 15.72 | 24.32 | 32.67 | 46.64 | 20.20 |
| USE | QC | 25.78 | **38.57** | **49.34** | **63.35** | **32.12** |
| Ensemble(USE-USE) | QC-QA | - | 41.40 | 54.34 | 69.00 | - |

Table 2: Evaluation of performance on English data. The best values of each column within each strategy (QC and QA) are marked in bold.

Several interesting insights can be drawn from them. First of all, contextual representations from the Universal Sentence Encoder are generally the most effective, especially on similar clue retrieval (QC), where both the query and the elements to rank are textual sequences. Nonetheless, Word2Vec embeddings work surprisingly well, outperforming FastText almost every time. Furthermore, they are the best performers on QA search in the Italian database. We believe the reason why Word2Vec outperforms USE on Italian QA is twofold. First, the advantage of contextual embeddings is less evident in the QA setup; indeed, USE brings fewer benefits on English QA as well. Second, USE is a multilingual model, so its embeddings are less specialized than those of Word2Vec, which was instead trained on Italian only.

When comparing semantic search models against the baseline (TF-IDF) - which is only possible in QC - we notice that static embeddings struggle to outperform it. Indeed, the sparse nature of TF-IDF induces crisp similarity scores: very high for clues sharing the same keywords, extremely low for all the rest. On the contrary, similarity scores are more blurred with dense embeddings. As a consequence, TF-IDF achieves high MH@1 and MH@5 scores (and a high MRR too). However, TF-IDF yields poorer coverage as the candidate list grows (MH@20 and MH@100). This behavior is also evident in Fig. 2, where we compare the cumulative distributions of ranks obtained with USE and TF-IDF. After the initial bump, the growth of TF-IDF hits is almost linear (i.e. random), whereas the Universal Sentence Encoder keeps growing significantly.

[Figure 2: Comparison between the cumulative distribution functions of rankings using the USE (blue) and TF-IDF (orange) methods on English crosswords.]

Ensembling QC and QA. Analyzing the results, we observed that the ranks from QA and QC had low levels of overlap. In the last line of Tables 1 and 2 we report the performance of a naive ensemble approach combining the QC and QA strategies. Given the limited overlap, we merge the two ranked lists by taking the first K/2 ranks from each strategy to compute MH@K, with K ∈ {5, 20, 100} (since K=5 is odd, we took the first three ranks from QC and the first two from QA). We chose the best embedding model for each strategy. Despite its simplicity and the large room for improvement, the ensemble significantly improved the performance in both languages. This suggests possible directions for further improving the retrieval components of CP solvers.
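The naive ensemble can be reproduced with a simple list merge; a sketch under the paper's rule (first K/2 candidates from each list, i.e. 3 from QC and 2 from QA when K=5). How duplicates shared by the two lists are handled is our assumption, since the paper does not specify it.

```python
def merge_top_k(qc_candidates, qa_candidates, k):
    """Naive QC-QA ensemble: take the first ceil(k/2) candidates from QC
    and the first floor(k/2) from QA, then deduplicate preserving order
    (the deduplication policy is an assumption, not from the paper)."""
    head = qc_candidates[:(k + 1) // 2] + qa_candidates[:k // 2]
    merged = []
    for answer in head:
        if answer not in merged:
            merged.append(answer)
    return merged

# e.g. merge_top_k(qc_list, qa_list, 5) -> 3 QC candidates + 2 QA candidates
```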
5 Conclusions

In this paper, we proposed two different semantic search strategies (QC and QA) for retrieving and ranking answer candidates for CP clues. We exploited pre-trained state-of-the-art embeddings, both static and contextual, to rank clue-answer pairs from databases. Embedding-based retrieval overcomes some of the limitations of inverted-index models, leading to higher-coverage rankings and enabling similar answer retrieval (QA). Finally, we observed that even a simple ensemble combining QC and QA is effective and improves the overall retrieval performance.

This opens further research directions, where learning-to-rank methods could be exploited to better combine candidate answer lists from complementary approaches like QC and QA.

Acknowledgments

We thank Nicola Landolfi and Marco Maggini for the great support and fruitful discussions.

References

Gianni Barlacchi, Massimo Nicosia, and Alessandro Moschitti. 2014a. Learning to rank answer candidates for automatic resolution of crossword puzzles. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 39–48.

Gianni Barlacchi, Massimo Nicosia, and Alessandro Moschitti. 2014b. A retrieval model for automatic resolution of crossword puzzles in Italian language. In The First Italian Conference on Computational Linguistics CLiC-it 2014, page 33.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning cross-lingual sentence representations via a multi-task dual-encoder model. arXiv preprint arXiv:1810.12836.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Marco Ernandes, Giovanni Angelini, and Marco Gori. 2005. WebCrow: A web-based system for crossword solving. In AAAI, pages 1412–1417.

Marco Ernandes, Giovanni Angelini, and Marco Gori. 2008. A web-based agent challenges human experts on crosswords. AI Magazine, 29(1):77–77.

Matthew L. Ginsberg. 2011. Dr. Fill: Crosswords and an implemented solver for singly weighted CSPs. Journal of Artificial Intelligence Research, 42:851–886.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.

Michael L. Littman, Greg A. Keim, and Noam Shazeer. 2002. A probabilistic approach to solving crossword puzzles. Artificial Intelligence, 134(1-2):23–55.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Giuseppe Marra, Andrea Zugarini, Stefano Melacci, and Marco Maggini. 2018. An unsupervised character-aware neural approach to word and context representation learning. In International Conference on Artificial Neural Networks, pages 126–136. Springer.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Massimo Nicosia and Alessandro Moschitti. 2016. Crossword puzzle resolution in Italian using distributional models for clue similarity. In IIR.

Massimo Nicosia, Gianni Barlacchi, and Alessandro Moschitti. 2015. Learning to rank aggregated answers for crossword puzzles. In European Conference on Information Retrieval, pages 556–561. Springer.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Aliaksei Severyn, Massimo Nicosia, Gianni Barlacchi, and Alessandro Moschitti. 2015. Distributional neural networks for automatic resolution of crossword puzzles. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 199–204.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 23–30. Association for Computational Linguistics.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.