<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Die Rätselrevolution: Automated German Crossword Solving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Röthenbacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Klede</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Ernandes</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bjoern M. Eskofier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Zanca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich-Alexander-Universität Erlangen-Nürnberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health</institution>
          ,
          <addr-line>Neuherberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>expert.ai</institution>
          ,
          <addr-line>Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Crossword puzzles are popular word games played in various languages around the world, with diverse styles across different countries. For this reason, automated crossword solvers designed for one language may not work well on others. In this paper, we extend WebCrow, an automatic crossword solver, to German, making it the first program for crossword solving in the German language. To address the lack of large clue-answer crossword pair data, WebCrow combines multiple modules, known as experts, which retrieve potential answers from various resources, including the web, knowledge graphs, and linguistic rules. The system is evaluated on a collection of crosswords from varied sources, of which it is able to solve 67% perfectly. Additional analysis reveals that while our solver achieved commendable results, puzzles with poorly constrained schemas and original clues still presented significant hurdles. These findings shed light on the complexity of the crossword-solving problem and emphasize the need for future research to address and overcome these particular challenges effectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Crossword Solving</kwd>
        <kwd>German Crosswords</kwd>
        <kwd>Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Automated crossword solving is a challenging problem in
Artificial Intelligence and Natural Language Processing
(NLP) research. Solving a puzzle requires multiple skills,
ranging from encyclopedic knowledge and linguistics to
reasoning and constraint satisfaction. In the past, several
automated crossword systems have been proposed for
English [1, 2, 3]. Despite the successful results achieved
by these approaches, they do not investigate crossword
resolution in other languages. All those methods heavily
rely on large databases of previously answered clues to
retrieve and rank answer candidates, which are sometimes
even re-ranked [4, 5, 6, 7]. The Berkeley Crossword Solver [3]
also makes use of multiple language models that have
been fine-tuned to segment answers into words and to
correct wrong letters. The need for such resources hinders
the application of these solutions to other languages.</p>
        <p>WebCrow [8, 9], instead, is a crossword solver that has
also been applied to Italian puzzles. Its architecture,
composed of multiple modules called experts,
facilitates portability to new languages. In this work, we
extend WebCrow to the German language. To the best of
our knowledge, we are the first to propose an automatic
solver for German crosswords.</p>
        <p>The paper is organized as follows. In Section 2, the
whole WebCrow architecture is described. Then, in
Section 3, we present the data gathered for German and its
usage by the WebCrow experts. Experiments are outlined
in Section 4. Finally, we summarize our conclusions and
directions for future work in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. WebCrow</title>
      <sec id="sec-2-1">
        <title>Web Search</title>
        <p>Web Search is the expert that characterizes WebCrow,
as the name suggests. This module
retrieves answer candidates by searching the web.
Differently from the CWDB or Knowledge Graphs, it allows
the retrieval of fresh information, which can be crucial
to solve some clues. For each clue, the web is queried
using the Bing APIs.1 The answer list can be built
upon the snippets and/or the full documents
returned by the search. All the frequent words extracted
from the documents are ranked according to their TF-IDF,
also taking into account the rank of the documents in
which they occur [8]. The IDF must be pre-computed on a
large collection of documents.</p>
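        <p>As an illustrative sketch only (not WebCrow's actual implementation), TF-IDF ranking of snippet words, with a hypothetical pre-computed IDF table and a simple rank-based document weight, could look as follows:</p>
```python
from collections import Counter

# Hypothetical IDF table; in practice it is pre-computed on a large corpus.
IDF = {"goethe": 3.2, "dichter": 2.5, "der": 0.1, "faust": 3.0}

def rank_candidates(snippets, answer_length):
    """Score words from web snippets by TF-IDF, weighting earlier documents more."""
    scores = Counter()
    for rank, text in enumerate(snippets):
        doc_weight = 1.0 / (1 + rank)  # earlier search results count more
        for raw in text.lower().split():
            word = "".join(ch for ch in raw if ch.isalpha())
            if len(word) == answer_length:
                scores[word] += doc_weight * IDF.get(word, 0.5)
    return [w for w, _ in scores.most_common()]

# Only words matching the slot length are kept as candidates.
print(rank_candidates(["Goethe schrieb Faust", "Der Dichter Goethe"], 6))  # ['goethe']
```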
        <p>Lexicon. The lexicon is a large vocabulary of German
words. For each answer to fill in the crossword, the
dictionary module returns all the words in the lexicon that
fit in length. The returned list is weighted by the n-gram
model used in the Implicit Module described below.</p>
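        <p>A minimal sketch of such a length-indexed lexicon lookup (illustrative only; the n-gram weighting is omitted here):</p>
```python
from collections import defaultdict

def build_index(words):
    """Index a lexicon by word length for constant-time slot lookups."""
    index = defaultdict(list)
    for w in words:
        index[len(w)].append(w)
    return index

lexicon = ["haus", "baum", "garten", "tuer"]
index = build_index(lexicon)
# All 4-letter candidates for a 4-cell slot:
print(index[4])  # ['haus', 'baum', 'tuer']
```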
        <sec id="sec-2-1-1">
          <title>2.2. Implicit Module</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Belief Propagation and Grid Filling</title>
          <p>The current version of WebCrow looks for a solution
that maximizes the expected value of clues answered
correctly. To compute the posterior probability p_c(t) for an
answer t to be in slot c, it uses belief propagation. These
probabilities then allow us to infer letter probabilities
p_{c,i}(l), which quantify the likelihood of a character l at
position i of the answer to clue c. They can be computed
as in Equation 1, where t(i) = l means that the i-th letter
of answer t is l:</p>
          <p>p_{c,i}(l) = ∑_{t : t(i) = l} p_c(t).   (1)</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>If a cell x in the grid is the i-th cell of a horizontal
clue h, and the j-th cell of a vertical clue v, then the
probability p_x(l) becomes:</p>
        <p>p_x(l) = η · p_{h,i}(l) · p_{v,j}(l),   (2)</p>
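        <p>The letter probabilities and their combination at a crossing cell can be sketched as follows (a toy illustration with made-up candidate lists, not WebCrow's code):</p>
```python
def letter_probs(candidates, i):
    """Probability of each letter at position i, summed over candidate answers."""
    probs = {}
    for answer, p in candidates.items():
        l = answer[i]
        probs[l] = probs.get(l, 0.0) + p
    return probs

def cell_probs(horiz, i, vert, j):
    """Combine horizontal and vertical letter distributions at a crossing cell,
    then normalize so the result is a probability mass function."""
    ph = letter_probs(horiz, i)
    pv = letter_probs(vert, j)
    joint = {l: ph[l] * pv.get(l, 0.0) for l in ph}
    z = sum(joint.values())  # the normalization factor plays the role of eta
    return {l: p / z for l, p in joint.items() if p > 0}

horiz = {"HAUS": 0.7, "MAUS": 0.3}   # candidates for the across clue
vert = {"HUND": 0.6, "MOND": 0.4}    # candidates for the down clue
print(cell_probs(horiz, 0, vert, 0))  # H and M share the crossing cell
```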
      </sec>
      <sec id="sec-2-3">
        <p>1 https://learn.microsoft.com/en-us/rest/api/cognitiveservices-bingsearch/bing-web-api-v7-reference</p>
      </sec>
      <sec id="sec-2-4">
        <title>CWDB</title>
        <p>A large collection of previously answered
questions is paramount for any crossword solver. The
CrossWord DataBase (CWDB) expert retrieves answers
from a database of clue-answer pairs. Retrieval and
ranking of the answers is based on the semantic similarity
between a clue and the clues in the CWDB. We follow the
approach in [10], using pre-trained encoders [11, 12, 13]
to embed both the clues in the database and the ones of
the crossword that is being solved.</p>
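        <p>The retrieval step can be sketched as a cosine-similarity search over clue embeddings; here a toy character-bigram encoder stands in for the pre-trained sentence encoders used in the paper:</p>
```python
import math

def toy_encode(text):
    """Stand-in for a pre-trained sentence encoder: character-bigram counts."""
    vec = {}
    t = text.lower()
    for a, b in zip(t, t[1:]):
        vec[a + b] = vec.get(a + b, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Tiny CWDB: (clue, answer) pairs with pre-computed embeddings.
cwdb = [("deutscher dichter", "GOETHE"), ("hauptstadt italiens", "ROM")]
db = [(toy_encode(clue), ans) for clue, ans in cwdb]

def retrieve(clue, k=5):
    """Rank stored answers by similarity between their clue and the query clue."""
    q = toy_encode(clue)
    ranked = sorted(db, key=lambda e: cosine(q, e[0]), reverse=True)
    return [ans for _, ans in ranked[:k]]

print(retrieve("grosser deutscher dichter"))  # GOETHE ranked first
```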
      </sec>
      <sec id="sec-2-5">
        <title>Knowledge Graph</title>
        <p>Ontologies are a rich source of
linguistic and encyclopedic knowledge, which is frequently
required in crosswords. Knowledge Graphs (KGs)
contain structured information about concepts and language.</p>
        <p>The expert in WebCrow exploits KGs to collect a database
analogous to the CWDB in a straightforward way:
linguistic concepts in the knowledge graph are paired with
their definitions, similarly to the clue-answer pairs from
the CWDB. Following the same approach, an answer
retrieval step is then applied: the definitions are encoded
in a semantic representation, and candidates are ranked and
retrieved according to the semantic similarity between
the clue and the definition in the knowledge graph.</p>
        <p>[Figure: WebCrow pipeline. The expert modules (CWDB, Knowledge
Graph, Lexicon) feed their candidate lists into a Merging step;
Belief Propagation then performs clue answering, followed by
Grid Filling.]</p>
        <p>The factor η normalizes p_x(l) to be a probability
mass function. If the probability of a letter in a given
cell exceeds a certain threshold, the solver “freezes” it in
the solution and only allows answers that are consistent
with such a letter.</p>
        <p>A solution that maximizes the sum of the answer
probabilities p_c(t) is then found using Greedy Search, which
fills in the missing or incomplete answers by iteratively
selecting the most probable of the remaining answers
that fit into the missing cells.</p>
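        <p>A stripped-down sketch of this greedy loop (hypothetical data structures; the actual solver additionally handles letter freezing and backoff):</p>
```python
def greedy_fill(slots, candidates, crossings):
    """Iteratively commit the globally most probable answer that is
    consistent with letters already placed at crossing cells."""
    grid = {}      # cell id -> letter
    solution = {}  # slot id -> answer

    def consistent(slot, answer):
        return all(grid.get(cell, answer[i]) == answer[i]
                   for i, cell in enumerate(crossings[slot]))

    while len(solution) != len(slots):
        best = None
        for slot in slots:
            if slot in solution:
                continue
            for answer, p in candidates[slot]:
                if consistent(slot, answer) and (best is None or p > best[2]):
                    best = (slot, answer, p)
        if best is None:
            break  # no consistent candidate left for the remaining slots
        slot, answer = best[0], best[1]
        solution[slot] = answer
        for i, cell in enumerate(crossings[slot]):
            grid[cell] = answer[i]
    return solution

# 2x2 toy grid: across word in cells (0,0),(0,1); down word in cells (0,0),(1,0).
crossings = {"1A": [(0, 0), (0, 1)], "1D": [(0, 0), (1, 0)]}
candidates = {"1A": [("AB", 0.9), ("XY", 0.1)], "1D": [("AC", 0.8), ("BD", 0.2)]}
print(greedy_fill(["1A", "1D"], candidates, crossings))  # {'1A': 'AB', '1D': 'AC'}
```
        <p>After committing "AB", the candidate "BD" becomes inconsistent with the shared first cell, so the down slot falls back to "AC".</p>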
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Adapting WebCrow to German</title>
      <sec id="sec-3-0">
        <title>3.1. Data Gathering</title>
        <p>We collected two kinds of data: full crosswords, and a
large corpus of clue-answer pairs.</p>
        <p>Full Crosswords. Full crosswords were obtained from
various newspapers that offer crossword puzzles on their
websites. These include “Der Spiegel”, “Bild-Zeitung”,
“Focus”, “Frankfurter Allgemeine Zeitung” (FAZ) and
“Hamburger Abendblatt” (HA). The number of crosswords
from each source, as well as the crossword dimensions,
the frequency of their publication and the earliest
retrieved crossword, is listed in Table 1. With the exception
of FAZ, all the other crossword schemas are in the
so-called Swedish style, characterized by brief definitions
that go within the black cells. An example is shown in
Figure 1.</p>
        <p>The retrieved documents were split into a training,
validation, and test set. For each source except FAZ,
the training set included all crosswords published up to
and including Friday 30th September, 2022. For FAZ,
an earlier cut-off date (Wednesday 31st August, 2022)
was chosen to leave more crosswords for use in
validation. The validation set then includes the crosswords
that are not in the training set and that were published up to
Thursday 24th November, 2022. The crosswords for
the test set were published after that date; however, the daily
publications (every source except FAZ) were sub-sampled to
include a similar number from each source. Because some
crosswords, like the one from the New York Times, have
differing levels of difficulty for each day of the week [1],
we sampled every six crosswords, instead of every seven,
to make sure all days of the week were present in the test
set. This resulted in ten crosswords for FAZ and twelve
for each other source.</p>
        <p>Clue-answer pairs. We collected a large set of
clue-answer pairs by crawling two German online
crossword web sites: kreuzwortraetsel.de (KWRDE) and
kreuzwortraetsel-hilfe.com (KWRH). KWRDE was
downloaded between Friday 9th September, 2022 and
Saturday 1st October, 2022. Similarly, KWRH was
downloaded between Thursday 8th September, 2022 and
Tuesday 20th September, 2022. As shown in
Table 2, a total of 2.4 million unique clue-answer pairs
were obtained this way, containing 438 thousand unique
answers for German. To supplement the database, the
clues from the crosswords in the training set were
extracted and added to the database. The individual and
total contributions are shown in Table 2.</p>
        <p>Implicit Module. The tetra-grams in the Implicit
Module were generated from the answers in the CWDB. They
include start and end tokens “$” and “^” to allow for
specific n-grams at the beginnings and endings of words.
They are weighted based on their frequency in the
corpus.</p>
      </sec>
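        <p>The tetra-gram construction with boundary tokens can be sketched as follows (illustrative only, using the “$” and “^” tokens described above):</p>
```python
from collections import Counter

def tetragrams(answers):
    """Count 4-grams over answers padded with start token '$' and end token '^',
    then convert the counts to frequency weights."""
    counts = Counter()
    for word in answers:
        padded = "$$$" + word.upper() + "^^^"
        for i in range(len(padded) - 3):
            counts[padded[i:i + 4]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

probs = tetragrams(["HAUS", "MAUS"])
# '$$$H' occurs once; 'AUS^' occurs in both words, so it gets double the weight.
print(probs["$$$H"], probs["AUS^"])
```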
      <sec id="sec-3-1">
        <p>In the evaluation we measured both the end-to-end
performance of WebCrow and the answer retrieval
capabilities of each single module.</p>
        <sec id="sec-3-1-1">
          <title>3.2. Experts Adaptation</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>CWDB</title>
        <p>The 2.4 million clue-answer pairs crawled
were used to build the CWDB expert. After minor
preprocessing, clues were embedded with the multilingual
Universal Sentence Encoder (USE) [14], as in [10].</p>
      </sec>
      <sec id="sec-3-3">
        <p>Knowledge Graph. To extend the coverage of our
answer retrieval step, we made use of an additional
proprietary German ontology2. Overall, it contained about
943 thousand lemmas from various web sources like
encyclopedias. As for the CWDB, we retrieve answer
candidates based on semantic similarity [10]: we treat
each lemma as the candidate answer, and its definition is
embedded with USE.</p>
        <p>Web Search. To find relevant words, the web search
expert requires a database of document frequency values
for the most common words. It was obtained from an
online database of words and their frequencies in German
movie subtitles3. Candidate answers were retrieved only
from the web snippets.</p>
        <p>Lexicon. The lexicon was constructed from a freely
available online German dictionary4, after romanizing
umlauts (e.g. ä to ae) and excluding all non-ASCII and
non-alphabetical characters.</p>
        <p>2 from Expert.ai.
3 github.com/hermitdave/FrequencyWords/blob/master/content/2018/de/de_50k.txt
4 github.com/enz/german-wordlist/blob/master/words</p>
        <sec id="sec-3-3-0">
          <title>4.1. Answer Retrieval</title>
          <p>Although good answer ranking is clearly very important
for crossword resolution, it is even more essential to have
the target answer present in the candidates’ list. Indeed,
even poorly ranked target answers can be boosted during
belief propagation and grid filling, whereas a missing
answer in the list would hardly be recovered, inevitably
leading to errors or incomplete solutions. Hence, besides
the MRR, we also consider coverage as a performance
indicator of our experts.</p>
          <p>Results of the single experts are summarized in Table 3,
where we also report the quality after the merging and
belief propagation modules. As we can see, CWDB is the most
informative expert, with the highest MRR and
coverage. Web Search also achieves interesting MRR scores.
Lexicon and Knowledge Graph have poorer ranking
quality, but they both contribute to increasing the coverage,
which is the main purpose of those modules. This can
be observed from the coverage after Merging, where all
the lists are combined together. Clearly, there is a high
overlap in the experts’ lists; however, the union of all
of them reduces the number of missing target answers
by 0.7% absolute, almost a third of all the missing target
answers.</p>
          <p>From Table 3 we can also observe how belief
propagation significantly boosts the ranking quality, enormously
facilitating the grid-filling stage.</p>
        </sec>
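        <p>The two indicators, MRR and coverage, can be computed as in the following generic sketch (not the paper's evaluation code):</p>
```python
def mrr_and_coverage(ranked_lists, targets):
    """MRR: mean reciprocal rank of the target answer (0 when absent).
    Coverage: fraction of clues whose target appears in the candidate list."""
    rr_sum, hits = 0.0, 0
    for cands, target in zip(ranked_lists, targets):
        if target in cands:
            hits += 1
            rr_sum += 1.0 / (cands.index(target) + 1)
    n = len(targets)
    return rr_sum / n, hits / n

lists = [["ROM", "OSLO"], ["BERN", "WIEN"], ["KOELN"]]
gold = ["ROM", "WIEN", "BONN"]
mrr, cov = mrr_and_coverage(lists, gold)
print(mrr, cov)  # 0.5 0.666...
```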
        <sec id="sec-3-3-1">
          <title>4.2. Crossword Solving</title>
          <p>The whole system was assessed on German full
crosswords from the test set described in Section 3. We
measured the solving quality with three indicators: the
percentage of correctly inserted letters (OK letters), the
percentage of correctly inserted words (OK words), and the
fraction of perfectly solved crosswords (OK CWs). We
report in Table 4 those metrics aggregated per crossword
source. Overall, 39 out of 58 crosswords (about 67%)
were perfectly solved by German WebCrow. However,
there is a strong variance between different sources.
Smaller crossword grids, like the ones from “Bild”, were
always solved without errors. Similarly satisfying
performances were obtained on larger grids, including the
16x16 schema from HA. In contrast, WebCrow failed in
solving the FAZ puzzles, with surprisingly low results.
Only 13% of words and 21% of letters were correctly
answered in FAZ, far behind the near-perfect results on
crosswords from the other sources.</p>
          <p>Notably, the CWDB has little to no coverage of the FAZ
clues, which are typically very important to constrain a
large portion of the grid. Also, the style of the clues is
remarkably different: there are many wordplays and
linguistic games unusual in the rest of the data, making
them challenging even for humans. To delve further, we
also analyzed MRR and coverage after merging candidate
answers, divided by answer length, in Figure 3. We can
easily notice that both are significantly below the scores
reported in Table 3 for all the crosswords. In particular,
there is a non-negligible portion of long answers with
poor coverage and close-to-zero MRR. Such modest
retrieval quality, combined with a less constrained schema,
inevitably leads the system to fail in solving the crossword.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Performance on FAZ Crosswords. Differently from the
other sources, FAZ puzzles are not in the Swedish format.
Instead, they are characterized by grids populated with
many black cells that reduce the number of constraints
imposed on the solution. Moreover, the grid layout is
disposed in such a way that a significant percentage of
the answers are longer than ten characters (see Figure 3);
apart from those coming from the few FAZ crosswords in
training, these are not present in our CWDB.</p>
      <p>In this work we presented German WebCrow, the first
crossword puzzle solver for the German language. We
collected both a dataset of clue-answer pairs and a set of
German crosswords from different sources with various
formats and styles. WebCrow achieved near-perfect word
accuracy on Swedish-style crosswords, which proved
to be generally easy to solve, perfectly solving 39 out of
58 crosswords overall. However, our solver performed
poorly on FAZ crosswords. Those puzzles were extremely
challenging for multiple reasons, such as the poorly
constrained schemas, due to the presence of many black
cells, and the rich presence of sophisticated, original clues
involving articulated wordplays that formed words not
present in the candidate answer lists.</p>
      <p>Challenging puzzles like the ones in FAZ are a clear
example of how complex the problem is, and of why studying
crosswords in multiple languages and formats is
important for automated crossword solving. In the future we
plan to address the current limitations. In particular, we
plan to investigate the use of generative models to cope
with novel, unseen clues.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank Andreas Weber (raetsel4u.de) for allowing us
to reprint one of his crossword puzzles in Figure 1 of this
publication.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>