<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Die Rätselrevolution: Automated German Crossword Solving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Röthenbacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Klede</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Ernandes</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bjoern M. Eskofier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Zanca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich-Alexander-Universität Erlangen-Nürnberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health</institution>
          ,
          <addr-line>Neuherberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>expert.ai</institution>
          ,
          <addr-line>Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Crossword puzzles are popular word games played in various languages around the world, with diverse styles across different countries. For this reason, automated crossword solvers designed for one language may not work well on others. In this paper, we extend WebCrow, an automatic crossword solver, to German, making it the first program for crossword solving in the German language. To address the lack of large clue-answer crossword pair data, WebCrow combines multiple modules, known as experts, which retrieve potential answers from various resources, including the web, knowledge graphs, and linguistic rules. The system is evaluated on a collection of crosswords from varied sources, of which it is able to solve 67% perfectly. Additional analysis reveals that while our solver achieved commendable results, puzzles with poorly constrained schemas and original clues still presented significant hurdles. These findings shed light on the complexity of the crossword-solving problem and emphasize the need for future research to address and overcome these particular challenges effectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Crossword Solving</kwd>
        <kwd>German Crosswords</kwd>
        <kwd>Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Automated crossword solving is a challenging problem in
Artificial Intelligence and Natural Language Processing
(NLP) research. Solving a puzzle requires multiple skills,
ranging from encyclopedic knowledge and linguistics to
reasoning and constraint satisfaction. In the past, several
automated crossword systems have been proposed for
English [1, 2, 3]. Despite the successful results achieved
by these approaches, they do not investigate crossword
resolution in other languages. All those methods heavily
rely on large databases of previously answered clues to
retrieve and rank answer candidates, which are sometimes
even re-ranked [4, 5, 6, 7]. The Berkeley Crossword Solver [3]
also makes use of multiple language models that have
been fine-tuned to segment answers into words and to
correct wrong letters. The need for such resources hinders
the application of these solutions to other languages.</p>
        <p>WebCrow [8, 9], instead, is a crossword solver that has
also been applied to Italian puzzles. Its architecture,
composed of multiple modules called experts,
facilitates portability to new languages. In this work, we
extend WebCrow to the German language. To the best of
our knowledge, we are the first to propose an automatic
solver for German crosswords.</p>
        <p>The paper is organized as follows. In Section 2, the
whole WebCrow architecture is described. Then, in
Section 3, we present the data gathered for German and its
usage by the WebCrow experts. Experiments are outlined
in Section 4. Finally, we summarize our conclusions and
directions for future work in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. WebCrow</title>
      <sec id="sec-2-1">
        <title>Web Search</title>
        <p>Web Search is the expert that characterizes WebCrow,
as the name suggests. This module
retrieves answer candidates by searching the web.
Differently from the CWDB or Knowledge Graphs, it allows
the retrieval of fresh information, which can be crucial
to solve some clues. For each clue, the web is queried
using the Bing APIs.1 The answer list can be built
upon the snippets and/or the full documents
returned by the search. All the frequent words extracted
from the documents are ranked according to their TF-IDF,
also taking into account the rank of the documents in
which they occur [8]. The IDF must be pre-computed on a
large collection of documents.</p>
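        <p>As an illustrative sketch only (not WebCrow's actual implementation), TF-IDF ranking of snippet words, with a hypothetical pre-computed IDF table and a simple rank-based document weight, could look as follows:</p>
```python
from collections import Counter

# Hypothetical IDF table; in practice it is pre-computed on a large corpus.
IDF = {"goethe": 3.2, "dichter": 2.5, "der": 0.1, "faust": 3.0}

def rank_candidates(snippets, answer_length):
    """Score words from web snippets by TF-IDF, weighting earlier documents more."""
    scores = Counter()
    for rank, text in enumerate(snippets):
        doc_weight = 1.0 / (1 + rank)  # earlier search results count more
        for raw in text.lower().split():
            word = "".join(ch for ch in raw if ch.isalpha())
            if len(word) == answer_length:
                scores[word] += doc_weight * IDF.get(word, 0.5)
    return [w for w, _ in scores.most_common()]

# Only words matching the slot length are kept as candidates.
print(rank_candidates(["Goethe schrieb Faust", "Der Dichter Goethe"], 6))  # ['goethe']
```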
        <p>Lexicon. The lexicon is a large vocabulary of German
words. For each answer to fill in the crossword, the
dictionary module returns all the words in the lexicon that
fit in length. The returned list is weighted by the n-gram
model used in the Implicit Module described below.</p>
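        <p>A minimal sketch of such a length-indexed lexicon lookup (illustrative only; the n-gram weighting is omitted here):</p>
```python
from collections import defaultdict

def build_index(words):
    """Index a lexicon by word length for constant-time slot lookups."""
    index = defaultdict(list)
    for w in words:
        index[len(w)].append(w)
    return index

lexicon = ["haus", "baum", "garten", "tuer"]
index = build_index(lexicon)
# All 4-letter candidates for a 4-cell slot:
print(index[4])  # ['haus', 'baum', 'tuer']
```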
        <sec id="sec-2-1-1">
          <title>2.2. Implicit Module</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Belief Propagation and Grid Filling</title>
          <p>The current version of WebCrow looks for a solution
that maximizes the expected value of clues answered
correctly. To compute the posterior probability p_c(t) for an
answer t to be in slot c, it uses belief propagation. These
probabilities then allow us to infer letter probabilities
p_{c,i}(l), which quantify the likelihood of a character l at
position i of the answer to clue c. They can be computed
as in Equation 1, where t(i) = l means that the i-th letter
of answer t is l:</p>
          <p>p_{c,i}(l) = ∑_{t : t(i) = l} p_c(t).   (1)</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>If a cell x in the grid is the i-th cell of a horizontal
clue h, and the j-th cell of a vertical clue v, then the
probability p_x(l) becomes:</p>
        <p>p_x(l) = η · p_{h,i}(l) · p_{v,j}(l),   (2)</p>
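        <p>The letter probabilities and their combination at a crossing cell can be sketched as follows (a toy illustration with made-up candidate lists, not WebCrow's code):</p>
```python
def letter_probs(candidates, i):
    """Probability of each letter at position i, summed over candidate answers."""
    probs = {}
    for answer, p in candidates.items():
        l = answer[i]
        probs[l] = probs.get(l, 0.0) + p
    return probs

def cell_probs(horiz, i, vert, j):
    """Combine horizontal and vertical letter distributions at a crossing cell,
    then normalize so the result is a probability mass function."""
    ph = letter_probs(horiz, i)
    pv = letter_probs(vert, j)
    joint = {l: ph[l] * pv.get(l, 0.0) for l in ph}
    z = sum(joint.values())  # the normalization factor plays the role of eta
    return {l: p / z for l, p in joint.items() if p > 0}

horiz = {"HAUS": 0.7, "MAUS": 0.3}   # candidates for the across clue
vert = {"HUND": 0.6, "MOND": 0.4}    # candidates for the down clue
print(cell_probs(horiz, 0, vert, 0))  # H and M share the crossing cell
```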
      </sec>
      <sec id="sec-2-3">
        <p>1 https://learn.microsoft.com/en-us/rest/api/cognitiveservices-bingsearch/bing-web-api-v7-reference</p>
      </sec>
      <sec id="sec-2-4">
        <title>CWDB</title>
        <p>A large collection of previously answered
questions is paramount for any crossword solver. The
CrossWord DataBase (CWDB) expert retrieves answers
from a database of clue-answer pairs. Retrieval and
ranking of the answers is based on the semantic similarity
between a clue and the clues in the CWDB. We follow the
approach in [10], using pre-trained encoders [11, 12, 13]
to embed both the clues in the database and the ones of
the crossword that is being solved.</p>
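        <p>The retrieval step can be sketched as a cosine-similarity search over clue embeddings; here a toy character-bigram encoder stands in for the pre-trained sentence encoders used in the paper:</p>
```python
import math

def toy_encode(text):
    """Stand-in for a pre-trained sentence encoder: character-bigram counts."""
    vec = {}
    t = text.lower()
    for a, b in zip(t, t[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Tiny CWDB: (clue, answer) pairs with pre-computed embeddings.
cwdb = [("deutscher dichter", "GOETHE"), ("hauptstadt italiens", "ROM")]
db = [(toy_encode(clue), ans) for clue, ans in cwdb]

def retrieve(clue, k=5):
    """Rank stored answers by similarity between their clue and the query clue."""
    q = toy_encode(clue)
    ranked = sorted(db, key=lambda e: cosine(q, e[0]), reverse=True)
    return [ans for _, ans in ranked[:k]]

print(retrieve("grosser deutscher dichter"))  # GOETHE ranked first
```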
      </sec>
      <sec id="sec-2-5">
        <title>Knowledge Graph</title>
        <p>Ontologies are a rich source of
linguistic and encyclopedic knowledge, which is frequently
required in crosswords. Knowledge Graphs (KGs)
contain structured information about concepts and language.</p>
        <p>The expert in WebCrow exploits KGs to collect a database
analogous to the CWDB in a straightforward way:
linguistic concepts in the knowledge graph are paired with
their definitions, similarly to the clue-answer pairs from
the CWDB. Following the same approach, an answer
retrieval step is then applied: the definitions are encoded
in a semantic representation, and candidates are ranked and
retrieved according to the semantic similarity between
the clue and the definition in the knowledge graph.</p>
        <p>[Figure: WebCrow pipeline. The expert modules (CWDB, Knowledge
Graph, Lexicon) feed their candidate lists into a Merging step;
Belief Propagation then performs clue answering, followed by
Grid Filling.]</p>
        <p>The factor η normalizes p_x(l) to be a probability
mass function. If the probability of a letter in a given
cell exceeds a certain threshold, the solver “freezes” it in
the solution and only allows answers that are consistent
with such a letter.</p>
        <p>A solution that maximizes the sum of the answer
probabilities p_c(t) is then found using Greedy Search, which
fills in the missing or incomplete answers by iteratively
selecting the most probable of the remaining answers
that fit into the missing cells.</p>
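        <p>A stripped-down sketch of this greedy loop (hypothetical data structures; the actual solver additionally handles letter freezing and backoff):</p>
```python
def greedy_fill(slots, candidates, crossings):
    """Iteratively commit the globally most probable answer that is
    consistent with letters already placed at crossing cells."""
    grid = {}      # cell id -> letter
    solution = {}  # slot id -> answer

    def consistent(slot, answer):
        return all(grid.get(cell, answer[i]) == answer[i]
                   for i, cell in enumerate(crossings[slot]))

    while len(solution) != len(slots):
        best = None
        for slot in slots:
            if slot in solution:
                continue
            for answer, p in candidates[slot]:
                if consistent(slot, answer) and (best is None or p > best[2]):
                    best = (slot, answer, p)
        if best is None:
            break  # no consistent candidate left for the remaining slots
        slot, answer = best[0], best[1]
        solution[slot] = answer
        for i, cell in enumerate(crossings[slot]):
            grid[cell] = answer[i]
    return solution

# 2x2 toy grid: across word in cells (0,0),(0,1); down word in cells (0,0),(1,0).
crossings = {"1A": [(0, 0), (0, 1)], "1D": [(0, 0), (1, 0)]}
candidates = {"1A": [("AB", 0.9), ("XY", 0.1)], "1D": [("AC", 0.8), ("BD", 0.2)]}
print(greedy_fill(["1A", "1D"], candidates, crossings))  # {'1A': 'AB', '1D': 'AC'}
```
        <p>After committing "AB", the candidate "BD" becomes inconsistent with the shared first cell, so the down slot falls back to "AC".</p>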
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Adapting WebCrow to German</title>
      <sec id="sec-3-0">
        <title>3.1. Data Gathering</title>
        <p>We collected two kinds of data: full crosswords, and a
large corpus of clue-answer pairs.</p>
        <p>Full Crosswords. Full crosswords were obtained from
various newspapers that offer crossword puzzles on their
websites. These include “Der Spiegel”, “Bild-Zeitung”,
“Focus”, “Frankfurter Allgemeine Zeitung” (FAZ) and
“Hamburger Abendblatt” (HA). The number of crosswords
from each source, as well as the crossword dimensions,
the frequency of their publication and the earliest
retrieved crossword, is listed in Table 1. With the exception
of FAZ, all the other crossword schemas are in the
so-called Swedish style, characterized by brief definitions
that go within the black cells. An example is shown in
Figure 1.</p>
        <p>The retrieved documents were split into a training,
validation, and test set. For each source except FAZ,
the training set included all crosswords published up to
and including Friday 30th September, 2022. For FAZ,
an earlier cut-off date (Wednesday 31st August, 2022)
was chosen to leave more crosswords for use in
validation. The validation set then includes the crosswords
that are not in the training set and that were published up to
Thursday 24th November, 2022. The crosswords for
the test set were published after that date; however, the daily
publications (every source except FAZ) were sub-sampled to
include a similar number from each source. Because some
crosswords, like the one from the New York Times, have
differing levels of difficulty for each day of the week [1],
we sampled every six crosswords, instead of every seven,
to make sure all days of the week were present in the test
set. This resulted in ten crosswords for FAZ and twelve
for each other source.</p>
        <p>Clue-answer pairs. We collected a large set of
clue-answer pairs by crawling two German online
crossword web sites: kreuzwortraetsel.de (KWRDE) and
kreuzwortraetsel-hilfe.com (KWRH). KWRDE was
downloaded between Friday 9th September, 2022 and
Saturday 1st October, 2022. Similarly, KWRH was
downloaded between Thursday 8th September, 2022 and
Tuesday 20th September, 2022. As shown in
Table 2, a total of 2.4 million unique clue-answer pairs
were obtained this way, containing 438 thousand unique
answers for German. To supplement the database, the
clues from the crosswords in the training set were
extracted and added to the database. The individual and
total contributions are shown in Table 2.</p>
        <p>Implicit Module. The tetra-grams in the Implicit
Module were generated from the answers in the CWDB. They
include start and end tokens “$” and “^” to allow for
specific n-grams at the beginnings and endings of words.
They are weighted based on their frequency in the
corpus.</p>
      </sec>
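        <p>The tetra-gram construction with boundary tokens can be sketched as follows (illustrative only, using the “$” and “^” tokens described above):</p>
```python
from collections import Counter

def tetragrams(answers):
    """Count 4-grams over answers padded with start token '$' and end token '^',
    then convert the counts to frequency weights."""
    counts = Counter()
    for word in answers:
        padded = "$$$" + word.upper() + "^^^"
        for i in range(len(padded) - 3):
            counts[padded[i:i + 4]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

probs = tetragrams(["HAUS", "MAUS"])
# '$$$H' occurs once; 'AUS^' occurs in both words, so it gets double the weight.
print(probs["$$$H"], probs["AUS^"])
```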
      <sec id="sec-3-1">
        <p>In the evaluation we measured both the end-to-end
performance of WebCrow and the answer retrieval
capabilities of each single module.</p>
        <sec id="sec-3-1-1">
          <title>3.2. Experts Adaptation</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>CWDB</title>
        <p>The 2.4 million clue-answer pairs crawled
were used to build the CWDB expert. After minor
preprocessing, clues were embedded with the multilingual
Universal Sentence Encoder (USE) [14], as in [10].</p>
      </sec>
      <sec id="sec-3-3">
        <p>Knowledge Graph. To extend the coverage of our
answer retrieval step, we made use of an additional
proprietary German ontology2. Overall, it contained about
943 thousand lemmas from various web sources like
encyclopedias. As for the CWDB, we retrieve answer
candidates based on semantic similarity [10]: we treat
each lemma as the candidate answer, and its definition is
embedded with USE.</p>
        <p>Web Search. To find relevant words, the web search
expert requires a database of document frequency values
for the most common words. It was obtained from an
online database of words and their frequencies in German
movie subtitles3. Candidate answers were retrieved only
from the web snippets.</p>
        <p>Lexicon. The lexicon was constructed from a freely
available online German dictionary4, after romanizing
umlauts (e.g. ä to ae) and excluding all non-ASCII and
non-alphabetical characters.</p>
        <p>2 from Expert.ai.
3 github.com/hermitdave/FrequencyWords/blob/master/content/2018/de/de_50k.txt
4 github.com/enz/german-wordlist/blob/master/words</p>
        <sec id="sec-3-3-0">
          <title>4.1. Answer Retrieval</title>
          <p>Although good answer ranking is clearly very important
for crossword resolution, it is even more essential to have
the target answer present in the candidates’ list. Indeed,
even poorly ranked target answers can be boosted during
belief propagation and grid filling, whereas a missing
answer in the list would hardly be recovered, inevitably
leading to errors or incomplete solutions. Hence, besides
the MRR, we also consider coverage as a performance
indicator of our experts.</p>
          <p>Results of the single experts are summarized in Table 3,
where we also report the quality after the merging and
belief propagation modules. As we can see, CWDB is the most
informative expert, with the highest MRR and
coverage. Web Search also achieves interesting MRR scores.
Lexicon and Knowledge Graph have poorer ranking
quality, but they both contribute to increasing the coverage,
which is the main purpose of those modules. This can
be observed from the coverage after Merging, where all
the lists are combined together. Clearly, there is a high
overlap in the experts’ lists; however, the union of all
of them reduces the number of missing target answers
by 0.7% absolute, almost a third of all the missing target
answers.</p>
          <p>From Table 3 we can also observe how belief
propagation significantly boosts the ranking quality, enormously
facilitating the grid-filling stage.</p>
        </sec>
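        <p>The two indicators, MRR and coverage, can be computed as in the following generic sketch (not the paper's evaluation code):</p>
```python
def mrr_and_coverage(ranked_lists, targets):
    """MRR: mean reciprocal rank of the target answer (0 when absent).
    Coverage: fraction of clues whose target appears in the candidate list."""
    rr_sum, hits = 0.0, 0
    for cands, target in zip(ranked_lists, targets):
        if target in cands:
            hits += 1
            rr_sum += 1.0 / (cands.index(target) + 1)
    n = len(targets)
    return rr_sum / n, hits / n

lists = [["ROM", "OSLO"], ["BERN", "WIEN"], ["KOELN"]]
gold = ["ROM", "WIEN", "BONN"]
mrr, cov = mrr_and_coverage(lists, gold)
print(mrr, cov)  # 0.5 0.666...
```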
        <sec id="sec-3-3-1">
          <title>4.2. Crossword Solving</title>
          <p>The whole system was assessed on German full
crosswords from the test set described in Section 3. We
measured the solving quality with three indicators: the
percentage of correctly inserted letters (OK letters), the
percentage of correctly inserted words (OK words), and the
fraction of perfectly solved crosswords (OK CWs). We
report in Table 4 those metrics aggregated per crossword
source. Overall, 39 out of 58 crosswords (about 67%)
were perfectly solved by German WebCrow. However,
there is a strong variance between different sources.
Smaller crossword grids, like the ones from “Bild”, were
always solved without errors. Similarly satisfying
performances were obtained on larger grids, including the
16x16 schema from HA. In contrast, WebCrow failed in
solving the FAZ puzzles, with surprisingly low results.
Only 13% of words and 21% of letters were correctly
answered in FAZ, far behind the near-perfect results on
crosswords from the other sources.</p>
          <p>Notably, the CWDB has little to no coverage of the FAZ
clues, which are typically very important to constrain a
large portion of the grid. Also, the style of the clues is
remarkably different: there are many wordplays and
linguistic games unusual in the rest of the data, making
them challenging even for humans. To delve further, we
also analyzed MRR and coverage after merging candidate
answers, divided by answer length, in Figure 3. We can
easily notice that both are significantly below the scores
reported in Table 3 for all the crosswords. In particular,
there is a non-negligible portion of long answers with
poor coverage and close-to-zero MRR. Such modest
retrieval quality, combined with a less constrained schema,
inevitably leads the system to fail in solving the crossword.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Performance on FAZ Crosswords. Differently from the
other sources, FAZ puzzles are not in the Swedish format.
Instead, they are characterized by grids populated with
many black cells that reduce the number of constraints
imposed on the solution. Moreover, the grid layout is
disposed in such a way that a significant percentage of
the answers are longer than ten characters (see Figure 3);
apart from those coming from the few FAZ crosswords in
training, these are not present in our CWDB.</p>
      <p>In this work we presented German WebCrow, the first
crossword puzzle solver for the German language. We
collected both a dataset of clue-answer pairs and a set of
German crosswords from different sources with various
formats and styles. WebCrow achieved near-perfect word
accuracy on Swedish-style crosswords, which proved
to be generally easy to solve, perfectly solving 39 out of
58 crosswords overall. However, our solver performed
poorly on FAZ crosswords. Those puzzles were extremely
challenging for multiple reasons, such as the poorly
constrained schemas, due to the presence of many black
cells, and the rich presence of sophisticated, original clues
involving articulated wordplays that formed words not
present in the candidate answer lists.</p>
      <p>Challenging puzzles like the ones in FAZ are a clear
example of how complex the problem is, and of why studying
crosswords in multiple languages and formats is
important for automated crossword solving. In the future we
plan to address the current limitations. In particular, we
plan to investigate the use of generative models to cope
with novel, unseen clues.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank Andreas Weber (raetsel4u.de) for allowing us
to reprint one of his crossword puzzles in Figure 1 of this
publication.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>