<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crossword Puzzle Resolution in Italian Using Distributional Models for Clue Similarity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Massimo Nicosia</string-name>
          <email>massimo.nicosia@unitn.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Moschitti</string-name>
          <email>amoschitti@qf.org.qa</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Leveraging previous knowledge is essential for the automatic resolution of Crossword Puzzles (CPs). Clues from a new crossword may have appeared in the past, verbatim or paraphrased, and thus we can extract similar clues using information retrieval (IR) techniques. The output of a search engine implementing the retrieval model can be refined using learning-to-rank techniques: the goal is to move the clues that have the same answer as the query clue to the top of the result list. The accuracy of a crossword solver heavily depends on the quality of these candidate lists. In previous work, the lists generated by an IR engine were reranked with a linear model by exploiting the multiple occurrences of an answer in such lists. In this paper, following our recent work on CP resolution for the English language, we create a labelled dataset for Italian, and propose (i) a set of reranking baselines and (ii) a neural reranking model based on distributed representations of clues and answers. Our neural model improves over our proposed baselines and the state of the art.</p>
      </abstract>
      <kwd-group>
        <kwd>distributional models</kwd>
        <kwd>information retrieval</kwd>
        <kwd>learning to rank</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatic solvers of CPs require accurate lists of candidate answers for finding
the correct solution to new clues. Candidate answers to a target clue can be found
by retrieving clues from past games that are similar to it. Indeed, the
retrieved clues may have the same answers as the target clue. Databases (DBs)
of previously solved CPs (CPDBs) are thus very useful, since clues are often
reused or reformulated for building new CPs.</p>
      <p>In this paper, we propose distributional models for reranking answer
candidate lists generated by an IR engine. We present a set of baselines that exploit
distributed representations of similar clues. Most importantly, (i) we build a
dataset for clue retrieval for Italian, composed of 46,270 clues with their
associated answers, and (ii) we evaluate an effective neural network model for
computing the similarity between clues. The presented dataset is an interesting
resource that we make available to the research community (http://ikernels-portal.disi.unitn.it/projects/webcrow/).</p>
    </sec>
    <sec id="sec-2">
      <title>Contributions</title>
      <p>
        To assess the effectiveness of our model, we compare it with the state-of-the-art reranking model
for Italian [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The experimental results in this paper demonstrate that:
- distributed representations are effective in encoding reranking clue pairs and
candidate answers;
- our neural network model is able to exploit the distributed representations
more effectively than the other baseline models; and
- our models can improve over a strong retrieval baseline and the previous
state-of-the-art system.</p>
      <sec id="sec-2-1">
        <title>Clue Reranking for Solving CPs</title>
        <p>In this section, we briefly present the ideas behind CP resolution systems
and the state-of-the-art models for reranking answer candidates.</p>
        <sec id="sec-2-1-1">
          <title>CP Solvers</title>
          <p>
            CP solvers are in many ways similar to question answering (QA) systems such
as IBM Watson [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Indeed, their goal is not different: in order to find the
correct answer for a given clue, candidate answers are generated and then scored
according to more or less sophisticated strategies [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. The main difference is the
grid-filling step of CP solvers, which is cast as a Probabilistic Constraint
Satisfaction Problem, e.g., [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. In this step, the squares of the crossword puzzle are
filled according to the crossword constraints. The possible combinations consider
words from dictionaries or from the lists of answer candidates. Such lists can be
generated by exploiting previously seen crossword puzzles or using subsystems
specialized on domain-specific knowledge (e.g., famous persons, places, movies).
          </p>
          <p>
            WebCrow is one of the best systems [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] for the resolution of CPs, and it
relies on the aforementioned domain-specialized subsystems. In addition,
it includes (i) a retrieval model for accessing clues stored in a database, (ii) a
search module for finding answers from the Web, and (iii) a simple NLP pipeline.
          </p>
          <p>Clearly, feeding the solver with high-quality answer lists (i.e., lists containing
the correct answers at the top) produces higher speed and accuracy in the
grid-filling task. For this reason, a competitive CP solver needs accurate rankers.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Similar Clue Retrieval and Reranking</title>
          <p>One important source of candidate answers is the DB of previously solved CPs.
A target clue for which we seek the answer is used to query an index containing
the clues of the DB. The list of candidate answers depends on the list of similar
clues returned by the search engine (SE). The target clue, the candidate clues and
their answers can be encoded into a machine learning model, and the answers can
be scored by rerankers. The goal of a reranker is to understand which candidate
clues are most similar to the target clue, and to move them to the top of the
candidate list, assuming their similarity indicates that they share the same answer.</p>
          <p>The reranking step is important because SEs often do not retrieve the correct
clues among the first results, i.e., the IR model is not able to capture the correct
semantics of the clues and their answers.</p>
          <p>
            In our work on Italian crosswords [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], we applied a logistic regression model
to score aggregated sets of candidate clues with the same answer.
          </p>
          <p>
            However, for English crosswords, we also (i) applied a pairwise reranking
model on structural representations of clues [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]; (ii) designed a reranking model
for aggregating the evidence coming from multiple occurrences of the same
answers in a candidate list [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]; and (iii) combined a support vector machine
reranking model with a deep neural network, which interestingly learns a similarity
matrix M from the labelled data [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
          <p>Motivated by the results in (iii), we applied the same distributional model to
Italian. Unfortunately, the model was not effective due to the small number of
available Italian clues. The same problem surfaces when using a small training
set of English clues. To solve the issue, we opted not to learn the similarity
matrix M from the data. Thus, we use a simpler neural network architecture
and we feed similarity information directly into the model.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Distributional Models for Reranking Similar Clues</title>
        <p>Previous methods for clue reranking use similarity features between clues based
on lexical matching or other distances between words computed on the Wikipedia
graph or the WordNet ontology.</p>
        <p>
          Treating words as atomic units has evident limitations, since this approach
ignores the context in which words appear. The idea that similar words tend to
occur in similar contexts has a long history in computational linguistics [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In
distributional semantics, a word is represented by a continuous vector, which is
the result of counting word statistics over a large corpus. In our work, we
take advantage of modern methods for computing distributed representations of
words and sentences, which may alleviate the semantic gap between clues, and
therefore induce better similarities.
        </p>
        <p>
          The neural network model for measuring the similarity between clues is
presented in Fig. 1. It is essentially a Multilayer Perceptron (MLP), which is
a simplification of our previous work [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 16, 15</xref>
          ]. Given the dimensionality d of
the word embeddings, the main components are:
(i) sentence matrices s_ci ∈ R^(d×|ci|) obtained by stacking the word vectors wj ∈ R^d
of the corresponding words wj from the input clues ci;
(ii) a distributional sentence model f : R^(d×|ci|) → R^d that maps the sentence
matrix of an input clue ci to a fixed-size vector representation x_ci of size d;
(iii) an input layer that is the concatenation of the fixed-size representations of
the target clue x_c1 and the similar clue x_c2, and a feature vector fv;
(iv) a sequence of fully-connected hidden layers that capture the interactions
between the distributed representations of the clues and the feature vector;
(v) a softmax layer that outputs probability scores reflecting how well the clues
match with each other.
        </p>
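        <p>The components (i)-(v) above can be sketched as a forward pass in plain Python. This is an illustrative toy implementation, not the trained model of the paper: the weights are random, the helper names (rand_layer, score_pair) are ours, and only the dimensions (d = 100, two 256-unit hidden layers, a 4-dimensional feature vector) follow the setup described later in the paper.</p>
        <preformat>
```python
import math
import random

random.seed(0)

def relu(v):
    # Element-wise rectified linear unit.
    return [max(0.0, x) for x in v]

def linear(W, b, v):
    # Affine map: returns W v + b, with W given as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def softmax(v):
    # Numerically stable softmax over the output scores.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def rand_layer(n_out, n_in):
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

d, n_feats, hidden = 100, 4, 256
W1, b1 = rand_layer(hidden, 2 * d + n_feats)  # input layer: x_c1 ++ x_c2 ++ fv
W2, b2 = rand_layer(hidden, hidden)           # second fully-connected layer
W3, b3 = rand_layer(2, hidden)                # 2 classes: similar / not similar

def score_pair(x_c1, x_c2, fv):
    x = x_c1 + x_c2 + fv                      # concatenation (component iii)
    h1 = relu(linear(W1, b1, x))              # hidden layers (component iv)
    h2 = relu(linear(W2, b2, h1))
    return softmax(linear(W3, b3, h2))        # probability scores (component v)

probs = score_pair([0.01] * d, [0.02] * d, [0.5, 0.3, 0.2, 0.1])
```
        </preformat>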
        <p>The choice of the sentence model plays a crucial role as the global
representation of a clue contains the relevant information that the next layers in the
network will use to compute the similarity between the clues.</p>
        <p>
          Recently, distributional sentence models in which f(s) is a
sequence of convolutional-pooling feature maps have shown state-of-the-art results
on many NLP tasks, e.g., [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ].
        </p>
        <p>Given the number of training instances in our dataset, we prefer to reduce
the number of learning parameters. For this reason, we opt for a simple sentence
model where f(s_ci) = (1/|ci|) Σ_j wj, i.e., the word vectors are averaged into a
single fixed-size vector x ∈ R^d. In addition, our preliminary experiments for the
English language revealed that this simpler model works just as well as more
complicated single- or multi-layer convolutional architectures.</p>
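        <p>The averaging sentence model amounts to a column-wise mean of the sentence matrix. A minimal sketch, using hypothetical 3-dimensional embeddings in place of the 100-dimensional word2vec vectors used in the paper:</p>
        <preformat>
```python
def average_vectors(word_vectors):
    # Neural bag-of-words: column-wise mean of the sentence matrix.
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[k] for v in word_vectors) / n for k in range(dim)]

# Toy 3-dimensional embeddings for a two-word clue (hypothetical values).
clue_matrix = [[1.0, 0.0, 2.0],
               [3.0, 4.0, 0.0]]
x = average_vectors(clue_matrix)
# x == [2.0, 2.0, 1.0]
```
        </preformat>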
      </sec>
      <sec id="sec-2-3">
        <title>Experiments</title>
        <p>In this section, we describe the experimental setting in which we evaluated our
models, and then present the results of the experiments.</p>
        <sec id="sec-2-3-1">
          <title>Experimental Setup</title>
          <p>Data. The corpus of Italian crosswords contains 46,270 clues in the Italian
language, with their associated answers, from La Settimana Enigmistica magazine,
La Repubblica newspaper and the Web. In the original dataset, some
clues contain words with hyphens in the middle. We normalize them by
removing the hyphens whenever the cleaned word is in a list of terms extracted
from the Italian Wiktionary (https://it.wiktionary.org/). In addition, we apply simple tokenization rules
to the clue definitions, in order to detach punctuation from the words. The
processed dataset contains 46,185 unique clue/answer pairs. The unique definitions
number 45,644, indicating that some definitions have multiple answer variations.</p>
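          <p>The two preprocessing steps can be sketched as follows; the tiny lexicon stands in for the term list extracted from the Italian Wiktionary, and the regex tokenizer is our simplification of the actual rules:</p>
          <preformat>
```python
import re

# Hypothetical mini-lexicon standing in for the Italian Wiktionary term list.
lexicon = {"catamarano", "precipitevolissimevolmente"}

def normalize_hyphens(token):
    # Remove in-word hyphens only when the cleaned form is a known word.
    cleaned = token.replace("-", "")
    return cleaned if cleaned.lower() in lexicon else token

def tokenize(definition):
    # Detach punctuation from words with a simple regex tokenizer.
    return re.findall(r"\w+|[^\w\s]", definition, re.UNICODE)

print(normalize_hyphens("cata-marano"))       # catamarano
print(tokenize("Una barca: il catamarano!"))
```
          </preformat>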
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data and Features</title>
      <p>We construct our training and test datasets by sampling a set of training
clues, and a disjoint set of test clues. We index only the training clues to prevent
test clues from appearing in the training lists, and thus to avoid dependencies
between training and test data. The indexing is performed with the Lucene
library (https://lucene.apache.org/core/). We enable the analyzer for Italian, which includes lowercasing, stemming
and removal of stopwords. We query the SE with each training clue, obtaining
related clues according to the BM25 retrieval model (we will refer to these clues
as candidate lists). We remove the first result, which is an exact match (the training
query clue is contained in the index), and retrieve only the clues whose answer
length matches the answer length of the query clue. The candidate lists that do
not contain the query answer in the first 10 positions are filtered out. Therefore,
our lists always contain a correct answer, and have a maximum of 10 results. The test
lists are constructed with the same process and constraints, by querying the
training index with the test clues. The only difference is that we do not remove
the first result, since the test clues are not present in the index.</p>
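      <p>The list-construction constraints above can be summarized in a short sketch. StubEngine and build_candidate_list are hypothetical helpers of ours; a real implementation would query the Lucene index and rank by BM25 instead:</p>
      <preformat>
```python
class StubEngine:
    # Hypothetical stand-in for the Lucene/BM25 index described above.
    def __init__(self, indexed):
        self.indexed = indexed
    def search(self, query):
        # A real engine would rank by BM25; the stub returns everything.
        return list(self.indexed)

def build_candidate_list(query_clue, query_answer, engine, is_test=False):
    results = engine.search(query_clue)
    if not is_test:
        # The training query clue itself is indexed: drop the exact match.
        results = [(c, a) for c, a in results if c != query_clue]
    # Keep only clues whose answer length matches the query answer length.
    results = [(c, a) for c, a in results if len(a) == len(query_answer)]
    results = results[:10]
    # Discard lists with no correct answer in the first 10 positions.
    if all(a != query_answer for _, a in results):
        return None
    return results

engine = StubEngine([("capitale d'italia", "roma"),
                     ("citta eterna", "roma"),
                     ("fiume di roma", "tevere")])
lst = build_candidate_list("capitale d'italia", "roma", engine)
# lst == [("citta eterna", "roma")]
```
      </preformat>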
      <p>Thus, our training and test instances consist of pairs of clues, i.e., a query clue
and a similar candidate clue. The training set used in the experiments contains
the results of 10,000 query clues, while the development and test sets contain the
results of 1,000 query clues each. These numbers reflect the experimental setup
of the previous state-of-the-art model for Italian.</p>
      <p>
        Features. Our models use distributed representations of clue definitions and
answers. Such representations are constructed from word embeddings [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
latter are learned by running the word2vec tool on the ItWaC corpus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a
crawl of the Italian web containing 1,585,620,279 tokens. We use the SkipGram
model trained with the hierarchical softmax algorithm. The dimensionality of
the embeddings is set to 100, the window size to 5, and words with frequency
less than 5 are filtered out.
      </p>
      <p>The clue definitions are mapped to fixed-size vectors by computing the
average of their word embeddings, an approach also known as neural bag-of-words. It
would be interesting to weight the word vectors by classical IR statistics
associated with the corresponding words, but in this work the vectors are unweighted.</p>
      <p>In addition to the clue vectors, we use a set of features for capturing the SE
result order, and the similarities between the distributed representations of clues
and answers.</p>
      <p>The reversed rank encodes the position of a candidate clue in the SE results.
The rank is a decreasing value that starts at 10 for the top clue.</p>
      <p>We also compute the cosine similarity of (i) the query and candidate clues,
(ii) the query clue and the candidate answer, (iii) the candidate clue and the
candidate answer. We obviously do not use the query answer since it is the gold
label during testing. Therefore, the additional feature vector has 4 dimensions.</p>
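      <p>Putting the reversed rank and the three cosine similarities together yields the 4-dimensional feature vector. A sketch with toy 2-dimensional vectors (the actual representations are 100-dimensional averaged embeddings), where the function names are ours:</p>
      <preformat>
```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu * nv != 0 else 0.0

def feature_vector(rank, x_query, x_cand_clue, x_cand_answer):
    # rank is the 1-based position of the candidate in the SE results.
    reversed_rank = 11 - rank  # 10 for the top result, decreasing
    return [reversed_rank,
            cosine(x_query, x_cand_clue),
            cosine(x_query, x_cand_answer),
            cosine(x_cand_clue, x_cand_answer)]

fv = feature_vector(1, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# fv == [10, 1.0, 0.0, 0.0]
```
      </preformat>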
      <sec id="sec-3-1">
        <title>Distributional Neural Network Model</title>
        <p>Our neural network model classifies pairs of query and candidate clues as similar or not. The input to the
model is a vector (of dimensionality 204) resulting from the concatenation of the
distributed representations of the query and candidate clues, together with the</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Training and Evaluation</title>
      <p>
        feature vector. We use two hidden layers of size 256 and adopt the ReLU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as
activation function. The model is regularized by applying dropout [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] on both
hidden layers. Dropout prevents the co-adaptation of hidden units by setting to
0 a proportion p of the latter, during the forward pass at training time. In our
case, we set p to 0.2.
      </p>
      <p>The network is trained using Stochastic Gradient Descent (SGD) with
shuffled mini-batches. The batch size is set to 16 examples. We train the model for
up to 100 epochs with early stopping, i.e., we stop when the Mean Average Precision
(MAP) on the development set has not increased for the last 7 epochs.</p>
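      <p>The early-stopping criterion can be sketched as follows. train_epoch and eval_map are hypothetical hooks standing in for one epoch of mini-batch SGD and for MAP evaluation on the development set; the simulated scores are illustrative only:</p>
      <preformat>
```python
def train_with_early_stopping(train_epoch, eval_map, max_epochs=100, patience=7):
    # Stop when dev MAP has not improved for `patience` consecutive epochs.
    best_map, best_epoch = -1.0, -1
    for epoch in range(max_epochs):
        train_epoch()
        dev_map = eval_map()
        if dev_map > best_map:
            best_map, best_epoch = dev_map, epoch
        elif epoch - best_epoch >= patience:
            break  # no dev-MAP improvement for `patience` epochs
    return best_map, best_epoch

# Simulated development MAP per epoch (illustrative values only).
dev_scores = iter([0.10, 0.50] + [0.40] * 50)
best, at = train_with_early_stopping(lambda: None, lambda: next(dev_scores))
# best == 0.5, reached at epoch 1; training stops 7 epochs later
```
      </preformat>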
      <p>Evaluation. To measure the impact of the baseline models and our neural
network model, we used well-known metrics for evaluating retrieval and QA
systems: REC-1@k (@1, @5), Mean Average Precision (MAP) and Mean Reciprocal
Rank (MRR). REC-1@k is the percentage of lists with a correct answer placed
at the first position. Given a set of query clues Q, MRR is computed as follows:
MRR = (1/|Q|) Σ_{q=1..|Q|} 1/rank(q),
where rank(q) is the position of the first correct answer in the candidate list.
MAP is the mean of the average precision scores for each query:
MAP = (1/|Q|) Σ_{q=1..|Q|} AveP(q).</p>
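      <p>Both metrics can be computed directly from the ranked candidate lists; a minimal sketch (function names are ours), where each list is scored against the gold answer of its query clue:</p>
      <preformat>
```python
def reciprocal_rank(ranked_answers, correct):
    for i, a in enumerate(ranked_answers, start=1):
        if a == correct:
            return 1.0 / i
    return 0.0

def average_precision(ranked_answers, correct):
    hits, precisions = 0, []
    for i, a in enumerate(ranked_answers, start=1):
        if a == correct:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def mrr(lists, gold):
    return sum(reciprocal_rank(l, a) for l, a in zip(lists, gold)) / len(lists)

def mean_average_precision(lists, gold):
    return sum(average_precision(l, a) for l, a in zip(lists, gold)) / len(lists)

lists = [["roma", "pisa"], ["bari", "roma", "pisa"]]
gold = ["roma", "roma"]
print(mrr(lists, gold))                     # 0.75
print(mean_average_precision(lists, gold))  # 0.75
```
      </preformat>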
      <sec id="sec-4-1">
        <title>Results</title>
        <p>
          The first section of Table 1 contains the measures reported in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], i.e., the previous
state-of-the-art model. MAP values are not reported in the original work. The
second section of the table contains the evaluation measures of our baselines and
neural network model.
        </p>
        <p>The WebCrow retrieval component establishes the matching between the
query clue and the clues in its index using the standard SQL search operator.
This explains its low performance compared to the other models.</p>
        <p>Our BM25 baseline is stronger due to the improved preprocessing of the
definitions in the dataset of Italian clues.</p>
        <p>The Cosine baseline scores each target and candidate clue pair by the cosine
similarity of the distributed representations of the clues, i.e., the vectors obtained
by applying f(s_ci) = (1/|ci|) Σ_j wj to the sentence matrices of the two clues. Then,
the pairs in each list are ordered by decreasing similarity.</p>
        <p>Our LR baseline is a Logistic Regression classifier trained on the input layer,
which is described in Section 3 together with the DNN neural network model.</p>
        <p>The results show the effectiveness of the distributed representations of clues
and their answers. Both supervised models benefit from this information, but
the neural network, with its non-linearities, is able to better exploit the features
fed to the models. With respect to the previous state-of-the-art model, the DNN
produces a 4.83% absolute and 5.95% relative improvement in MRR, and more
interestingly, a 7.38% absolute and 10.38% relative improvement in REC-1@1.
This translates into more answers that are correctly selected and promoted to the
top of the candidate answer list. Interestingly, the Cosine baseline is able to
improve over the search engine.</p>
        <p>
          The DNN model uses fewer features than the structural reranking models
previously developed for this task, and does not require computationally expensive
NLP for annotating the clues. As an interesting side note, we point out that the
performance of the IR baseline for Italian is aligned with the performance of
the IR baseline for English. This may suggest that, given enough training data,
our simple model could perform even better, and we could be able to train a
neural model with a learned similarity matrix M [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <sec id="sec-4-1-1">
          <title>Conclusions</title>
          <p>In this paper, we described the first distributional models for the retrieval of
similar clues for crossword solving. We showed that distributed representations are
effective for computing the similarity between clues, without involving
expensive NLP and feature extractors in the reranking system. Our models outperform
the previous state-of-the-art system for the presented task, showing a consistent
improvement across all the evaluation metrics.</p>
          <p>We have described a dataset of clues and answers for Italian that we make
available to the research community, together with our experimental models.</p>
          <p>In the future, we plan to gather additional clue/answer pairs for Italian, in
order to train more complex neural network models. Additionally, we will apply
the models for question to question similarity in a question answering setting.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Acknowledgements</title>
          <p>This work has been partially supported by the EC project CogNet, 671625
(H2020-ICT-2014-2, Research and Innovation action) and by an IBM Faculty
Award. The first author is supported by the Google European Doctoral
Fellowship 2015 in Statistical Natural Language Processing. Many thanks to the
anonymous reviewers for their valuable suggestions.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barlacchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicosia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning to rank answer candidates for automatic resolution of crossword puzzles</article-title>
          .
          <source>In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics (June</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Barlacchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicosia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A retrieval model for automatic resolution of crossword puzzles in Italian language</article-title>
          .
          <source>In: First Italian Conference on Computational Linguistics (CLiC-it)</source>
          , Pisa, 9-11 December (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferraresi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanchetta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The WaCky wide web: a collection of very large linguistically processed web-crawled corpora</article-title>
          .
          <source>Language Resources and Evaluation 43(3)</source>
          ,
          <volume>209</volume>
          –
          <fpage>226</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ernandes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angelini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>WebCrow: A web-based system for crossword solving</article-title>
          .
          <source>In: Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI)</source>
          . pp.
          <volume>1412</volume>
          –
          <fpage>1417</fpage>
          . Menlo Park, Calif., AAAI Press (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Chu-Carroll</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gondek</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nyberg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prager</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlaefer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Building Watson: An overview of the DeepQA project</article-title>
          .
          <source>AI Magazine</source>
          <volume>31</volume>
          (
          <issue>3</issue>
          ) (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Firth</surname>
            ,
            <given-names>J.R.:</given-names>
          </string-name>
          <article-title>A synopsis of linguistic theory 1930-55</article-title>
          .
          <source>Studies in Linguistic Analysis</source>
          , pp.
          <fpage>1</fpage>
          –
          <lpage>32</lpage>
          (
          <year>1957</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A convolutional neural network for modelling sentences</article-title>
          .
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (June</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <volume>1746</volume>
          –
          <fpage>1751</fpage>
          . Doha, Qatar (
          <year>October 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keim</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
          </string-name>
          , N.:
          <article-title>A probabilistic approach to solving crossword puzzles</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>134</volume>
          ,
          <fpage>23</fpage>
          –
          <fpage>55</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          . pp.
          <volume>3111</volume>
          –
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          . pp.
          <fpage>807</fpage>
          –
          <lpage>814</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nicosia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barlacchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning to rank aggregated answers for crossword puzzles</article-title>
          .
          <source>In: Advances in Information Retrieval - 37th European Conference on IR Research</source>
          , ECIR, Vienna, Austria. Proceedings. pp.
          <fpage>556</fpage>
          –
          <lpage>561</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pohl</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Heuristic search viewed as path finding in a graph</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>1</volume>
          (
          <issue>3-4</issue>
          ),
          <fpage>193</fpage>
          –
          <lpage>204</lpage>
          (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Severyn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning to rank short text pairs with convolutional deep neural networks</article-title>
          .
          <source>In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>373</fpage>
          –
          <lpage>382</lpage>
          . SIGIR '15,
          <publisher-name>ACM</publisher-name>
          , New York, NY, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Severyn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modeling relational information in question-answer pairs with convolutional neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1604.01178</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Severyn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicosia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barlacchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Distributional neural networks for automatic resolution of crossword puzzles</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (July
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          –
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>