=Paper= {{Paper |id=Vol-2410/paper28.pdf |storemode=property |title=Spelling Correction as a Foreign Language |pdfUrl=https://ceur-ws.org/Vol-2410/paper28.pdf |volume=Vol-2410 |authors=Yingbo Zhou,Utkarsh Porwal,Roberto Konow |dblpUrl=https://dblp.org/rec/conf/sigir/ZhouPK19 }} ==Spelling Correction as a Foreign Language== https://ceur-ws.org/Vol-2410/paper28.pdf
Spelling Correction as a Foreign Language

Yingbo Zhou∗ (eBay Inc, San Jose, California, yingbzhou@ebay.com)
Utkarsh Porwal (eBay Inc, San Jose, California, uporwal@ebay.com)
Roberto Konow (eBay Inc, San Jose, California, rkonow@ebay.com)

ABSTRACT

In this paper, we reformulated the spelling correction problem as a machine translation task under the encoder-decoder framework. This reformulation enabled us to use a single model for solving a problem that is traditionally formulated as learning a language model and an error model. This model employs multi-layer recurrent neural networks as an encoder and a decoder. We demonstrate the effectiveness of this model using an internal dataset, where the training data is automatically obtained from user logs. The model offers competitive performance as compared to the state of the art methods but does not require any feature engineering nor hand tuning between models.

KEYWORDS

Spell Correction; Encoder Decoder; Noisy Channel

ACM Reference Format:
Yingbo Zhou, Utkarsh Porwal, and Roberto Konow. 2019. Spelling Correction as a Foreign Language. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2019 eCom). ACM, New York, NY, USA, 4 pages.

1 INTRODUCTION

Having an automatic spelling correction service is crucial for any e-commerce search engine, as users often make spelling mistakes while issuing queries. A correct spelling correction not only reduces the user's mental load for the task, but also improves the quality of the search engine as it attempts to predict the user's intention. From a probabilistic perspective, let x̃ be the misspelled text that we observe; spelling correction seeks to uncover the true text x* = arg max_x P(x | x̃). Traditionally, the spelling correction problem has mostly been approached by using the noisy channel model [11]. The model consists of two parts: 1) a language model (or source model, i.e. P(x)) that represents the prior probability of the intended correct input text; and 2) an error model (or channel model, i.e. P(x̃ | x)) that represents the process by which the correct input text got corrupted into an incorrect misspelled text. The final correction is therefore obtained by using the Bayes rule, i.e. x* = arg max_x P(x)P(x̃ | x). There are several problems with this approach. First, we need two separate models, and the error in estimating one model affects the performance of the final output. Second, it is not easy to model the channel, since there are many sources of spelling mistakes, e.g. typing too fast, unintentional key strokes, and phonetic ambiguity, among others. Lastly, in certain contexts (e.g. in a search engine) it is not easy to obtain clean training data for the language model, as the input does not follow what is typical in natural language.

Since the goal is to get the text that maximizes P(x | x̃), can we directly model this conditional distribution instead? In this work, we explore this route, which bypasses the need to have multiple models and avoids accumulating errors from multiple sources. We achieve this by applying the sequence to sequence learning framework using recurrent neural networks [16] and reformulating the spelling correction problem as a neural machine translation problem, where the misspelled input is treated as a foreign language.

2 RELATED WORK

Spelling correction is used in a wide range of applications other than Web search [4] and e-commerce search, such as personal search in email [7] and healthcare inquiries [13]. However, the noisy channel model or its extensions remain a popular choice for designing large scale spelling correction systems. Gao et al. [6] proposed an extension of the noisy channel model where the language model was scaled to Web scale, along with a distributed infrastructure to facilitate such scaling; they also proposed a phrase based error model. Similarly, Whitelaw et al. [17] designed a large scale spelling correction and autocorrection system that does not require any manually annotated training data. They also designed their large scale system following the noisy channel model, where they extended one of the earliest error models, proposed by Brill et al. [3]. The spelling correction problem has also been formulated in several different novel ways. Li et al. [12] used Hidden Markov Models to model spelling errors in a unified framework. Likewise, Raaijmakers et al. [14] used a deep graphical model for spelling correction; they formulated spelling correction as a document retrieval problem where words are documents and, for a misspelled query, one has to retrieve the appropriate document. Eger et al. [5] formulated the spelling correction problem as a subproblem of the more general string-to-string translation problem. Their work is similar to ours in spirit but differs significantly in implementation detail. We formulate spelling correction as a machine translation task, and to the best of our knowledge no other study has been conducted doing the same.

∗ This work was done when the author worked at eBay.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR 2019 eCom, July 2019, Paris, France
© 2019 Copyright held by the owner/author(s).
Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

3 BACKGROUND AND PRELIMINARIES

The recurrent neural network (RNN) is a natural extension of the feed-forward neural network for modeling sequential data.


More formally, let (x_1, x_2, . . . , x_T), x_t ∈ R^d be the input; an RNN updates its internal recurrent hidden states by doing the following computation:

    h_t = ψ(h_{t-1}, x_t)    (1)

where ψ is a nonlinear function. Traditionally, in a standard RNN, ψ is implemented as an affine transformation followed by a pointwise nonlinearity, such as

    h_t = ψ(h_{t-1}, x_t) = tanh(W x_t + U h_{t-1} + b_h)

In addition, the RNN may also have outputs (y_1, y_2, . . . , y_T), y_t ∈ R^o, which can be calculated using another nonlinear function ϕ:

    y_t = ϕ(h_t, x_t)

From this recursion, the recurrent neural network naturally models the conditional probability P(y_t | x_1, . . . , x_t).

One problem with the standard RNN is that it is difficult for it to learn long term dependencies [2, 9], and therefore in practice more sophisticated functions ψ are often used to alleviate this problem. For example, the long short term memory (LSTM) [10] is one widely used recurrent unit that is designed to learn long term dependencies. A layer of LSTM consists of three gates and one memory cell, and the computation of the LSTM is as follows¹:

    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (2)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (3)
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (4)
    c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ tanh(W_c x_t + U_c h_{t-1} + b_c)    (5)
    h_t = o_t ⊙ tanh(c_t)    (6)

where W, U, and b represent the corresponding input-to-hidden weights, hidden-to-hidden weights, and biases respectively, σ(·) denotes the sigmoid function, and ⊙ is the elementwise product.

¹ Sometimes an additional weight matrix and vector are added to generate the output from h_t for the LSTM; we choose to stick with the original formulation for simplicity.
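To make equations (2)-(6) concrete, the following is a minimal numpy sketch of a single LSTM step; the toy dimensions, random initialization, and parameter names are illustrative assumptions, not the trained parameters of the model. Note that, with the coupled gate of equation (5), the input gate i_t of equation (2) does not enter the cell update.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        # One LSTM step following equations (2)-(6).
        i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # eq. (2), unused by eq. (5)
        o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # eq. (3)
        f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # eq. (4)
        c_t = f_t * c_prev + (1 - f_t) * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # eq. (5)
        h_t = o_t * np.tanh(c_t)  # eq. (6)
        return h_t, c_t

    # toy dimensions, purely illustrative
    d, n = 8, 16
    rng = np.random.default_rng(0)
    p = {k: rng.normal(size=(n, d)) for k in ("Wi", "Wo", "Wf", "Wc")}
    p.update({k: rng.normal(size=(n, n)) for k in ("Ui", "Uo", "Uf", "Uc")})
    p.update({k: np.zeros(n) for k in ("bi", "bo", "bf", "bc")})
    h, c = np.zeros(n), np.zeros(n)
    for x_t in rng.normal(size=(5, d)):  # a length-5 input sequence
        h, c = lstm_step(x_t, h, c, p)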
Another problem when using an RNN to solve sequence to sequence learning problems is that it is not clear what strategy to apply when the input and output sequences do not share the same length (i.e. for outputs we have T′ time steps, which may not equal T), which is the typical setting for this type of task. Sutskever et al. [16] propose to use an auto-encoder type of strategy, where the input sequence is encoded into a fixed length vector by using the last hidden state of the recurrent neural network, and the output sequence is then decoded from that vector. In more detail, let the input and output sequences have T and T′ time steps, and let f_e, f_d denote the encoding and decoding functions respectively; then the model tries to learn P(y_1, . . . , y_{T′} | x_1, . . . , x_T) by

    s ≜ f_e(x_1, . . . , x_T) = h_T    (7)
    y_t ≜ f_d(s, y_1, . . . , y_{t-1})    (8)

where f_e and f_d are implemented using multi-layer LSTMs.
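Below is a minimal sketch of this fixed-vector strategy (equations (7) and (8)), using a plain tanh recurrence in place of the multi-layer LSTMs and greedy argmax decoding; the parameter shapes, the embedding feedback, and the toy driver are illustrative assumptions rather than the actual implementation.

    import numpy as np

    def rnn_step(x_t, h_prev, W, U, b):
        # plain tanh recurrence standing in for the multi-layer LSTMs
        return np.tanh(W @ x_t + U @ h_prev + b)

    def encode(xs, W, U, b):
        # s = f_e(x_1, ..., x_T) = h_T, eq. (7)
        h = np.zeros(U.shape[0])
        for x_t in xs:
            h = rnn_step(x_t, h, W, U, b)
        return h

    def decode_greedy(s, dec, out_W, out_b, eos_id, emb, max_len=20):
        # y_t = f_d(s, y_1, ..., y_{t-1}), eq. (8), decoded greedily
        h, y_prev, ys = s, np.zeros(emb.shape[1]), []
        for _ in range(max_len):
            h = rnn_step(y_prev, h, *dec)
            y_id = int(np.argmax(out_W @ h + out_b))
            ys.append(y_id)
            if y_id == eos_id:
                break
            y_prev = emb[y_id]  # feed back the embedding of the previous output token
        return ys

    rng = np.random.default_rng(0)
    d, n, v = 8, 16, 30  # toy input dim, hidden dim, vocabulary size
    enc = (rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n))
    dec = (rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n))
    emb = rng.normal(size=(v, d))
    s = encode(rng.normal(size=(5, d)), *enc)
    ys = decode_greedy(s, dec, rng.normal(size=(v, n)), np.zeros(v), eos_id=0, emb=emb)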

4 SPELLING CORRECTION AS A FOREIGN LANGUAGE

It is easy to see that the spelling correction problem can be formulated as a sequence to sequence learning problem, as mentioned in section 3. In this sense, it is very similar to a machine translation problem, where the input is the misspelled text and the output is the corresponding correct spelling. One challenge for this formulation is that, unlike in the machine translation problem, where the vocabulary is large but still limited², in spelling correction the input vocabulary is potentially unbounded, which rules out the possibility of applying word based encoding for this problem. In addition, a large output vocabulary is a general challenge in neural network based machine translation models because of the large Softmax output matrix.

² The vocabulary is limited in the sense that the number of words is upper bounded, in general.

The input/output vocabulary problem can be solved by using a character based encoding scheme. Although this seems appropriate for encoding the input, the scheme puts an unnecessary burden on the decoder, since for a correction the decoder needs to learn the correct spelling of the word, word boundaries, etc. We choose the byte pair encoding (BPE) scheme [15], which strikes a balance between too large an output vocabulary and too much learning burden for the decoder. In this scheme, the vocabulary is built by recursively merging the most frequent pairs of strings, starting from characters, and the vocabulary size is controlled by the number of merging iterations.
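As an illustration of this merging procedure (not the actual implementation of Sennrich et al. [15]), here is a toy BPE learner; the example word list and the number of merges are made up.

    from collections import Counter

    def learn_bpe(words, num_merges):
        # Build a BPE vocabulary by repeatedly merging the most frequent adjacent symbol pair.
        corpus = Counter(tuple(w) for w in words)  # each word starts as a character sequence
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in corpus.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            new_corpus = Counter()
            for symbols, freq in corpus.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_corpus[tuple(out)] += freq
            corpus = new_corpus
        return merges

    # toy example: a few queries containing a misspelling and its correction
    print(learn_bpe(["iphone", "iphnoe", "iphone", "case", "cases"], num_merges=10))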
As shown in [1], encoding the whole input string into a single fixed length vector is not optimal, since it may not preserve all the information that is required for a successful decoding. Therefore, we introduce the attention mechanism from Bahdanau et al. [1] into this model. Formally, the attention model calculates a context vector c_i from the encoding states h_1, . . . , h_T and the decoder state s_{i-1} by

    c_i = Σ_{j=1}^{T} λ_{ij} h_j    (9)
    λ_{ij} = exp{a_{ij}} / Σ_{k=1}^{T} exp{a_{ik}}    (10)
    a_{ij} = tanh(W_s s_{i-1} + W_h h_j + b)    (11)

where W_s, W_h are the weight vectors for the alignment model, and b denotes the bias.
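The following numpy sketch computes one context vector from equations (9)-(11); since W_s and W_h are weight vectors here, each score a_ij is a scalar. All shapes and values below are toy assumptions, not parts of the trained model.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def context_vector(enc_states, s_prev, w_s, w_h, b):
        # a_ij (eq. 11), lambda_ij (eq. 10), and c_i (eq. 9)
        scores = np.array([np.tanh(w_s @ s_prev + w_h @ h_j + b) for h_j in enc_states])
        lam = softmax(scores)
        return lam @ enc_states, lam  # context vector and attention weights

    # toy example: T = 4 encoder states of size 6 and a decoder state of size 6
    rng = np.random.default_rng(1)
    H = rng.normal(size=(4, 6))
    c_i, lam = context_vector(H, rng.normal(size=6), rng.normal(size=6), rng.normal(size=6), 0.0)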
Now we are ready to introduce the full model for spelling correction. The model takes a sequence of inputs (characters or BPE encoded sub-words) x_1, . . . , x_T and outputs a sequence of BPE encoded sub-words y_1, . . . , y_{T′}. For each input token, the encoder learns a function f_e that maps it to its hidden representation h_t:

    h_t = f_e(h_{t-1}, x_t; θ_e)    (12)
    h_0 = 0    (13)

The attentional decoder first obtains the context vector c_t based on equation 10, and then learns a function f_d that decodes y_t from the context vector c_t:

    p(y_t | s_t) = softmax(W s_t + b_d)    (14)
    s_t = f_d(s_{t-1}, c_t; θ_d)    (15)
    s_0 = U h_T    (16)

where W and b_d are the output matrix and bias, and U is a matrix that makes sure the hidden states of the encoder are consistent with those of the decoder. In our implementation, both f_e and f_d are modeled using a multi-layer LSTM.
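Putting equations (12)-(16) together, the sketch below runs a single decoding step; tanh cells again stand in for the multi-layer LSTMs, the previous output embedding is fed to the decoder as in Figure 1, and all parameters are random stand-ins rather than trained weights.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def decode_step(enc_states, s_prev, y_prev_emb, p):
        # attention scores (eq. 11), weights (eq. 10), and context vector (eq. 9)
        scores = np.array([np.tanh(p["w_s"] @ s_prev + p["w_h"] @ h + p["b_a"]) for h in enc_states])
        c_t = softmax(scores) @ enc_states
        # recurrent update s_t = f_d(s_{t-1}, c_t) (eq. 15), with the previous output fed in as well
        s_t = np.tanh(p["Wd"] @ np.concatenate([c_t, y_prev_emb]) + p["Ud"] @ s_prev + p["bd"])
        # output distribution over BPE sub-words (eq. 14)
        return softmax(p["Wout"] @ s_t + p["bout"]), s_t

    rng = np.random.default_rng(2)
    enc_dim, dec_dim, emb_dim, vocab = 6, 6, 4, 50
    H = rng.normal(size=(5, enc_dim))  # encoder states h_1..h_T from eq. (12), with h_0 = 0
    p = {"w_s": rng.normal(size=dec_dim), "w_h": rng.normal(size=enc_dim), "b_a": 0.0,
         "Wd": rng.normal(size=(dec_dim, enc_dim + emb_dim)),
         "Ud": rng.normal(size=(dec_dim, dec_dim)), "bd": np.zeros(dec_dim),
         "Wout": rng.normal(size=(vocab, dec_dim)), "bout": np.zeros(vocab),
         "U0": rng.normal(size=(dec_dim, enc_dim))}
    s = p["U0"] @ H[-1]  # s_0 = U h_T, eq. (16)
    probs, s = decode_step(H, s, np.zeros(emb_dim), p)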


[Figure 1: diagram of the attentional encoder-decoder, with encoder inputs x_1 . . . x_T (plus EOS) and hidden states h_1 . . . h_T, an attention context C_i, and decoder states s_1 . . . s_{T′} producing outputs Y_1 . . . EOS.]

Figure 1: Encoder-Decoder with attention framework used for spelling correction. The encoder is a multi-layer recurrent neural network, and the first layer of the encoder is a bidirectional recurrent neural network. The attention model produces a context vector C_i based on all encoding hidden states h_i and the previous decoding state s_{i-1}. The decoder is a multi-layer recurrent neural network, and the decoding output Y_i depends both on the context vector c_i and the previous inputs y_1 . . . y_{i-1}.


As a whole, the end-to-end model is then trying to learn

    p(y_1, . . . , y_{T′} | x_1, . . . , x_T) = Π_{i=1}^{T′} p(y_i | x_1, . . . , x_T, y_{i-1})    (17)
                                              = Π_{i=1}^{T′} p(y_i | f_d(s_{i-1}, c_i); θ_d)    (18)

Notice that in equation 18 the context vector c_i is a function of the encoding function f_e, so the encoder is not left isolated. Since all components are smooth and differentiable, the model can easily be trained with gradient based methods to maximize the likelihood on the dataset.
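For concreteness, the snippet below evaluates the negative log-likelihood implied by equations (17) and (18) for one target sequence, which is the quantity a gradient based trainer would minimize; the per-step distributions here are dummy values rather than decoder outputs.

    import numpy as np

    def sequence_nll(step_probs, target_ids):
        # negative log of the product in eqs. (17)-(18): -sum_i log p(y_i | ...)
        return -sum(np.log(p[y]) for p, y in zip(step_probs, target_ids))

    # dummy example: T' = 3 decoding steps over a vocabulary of 5 sub-words
    rng = np.random.default_rng(3)
    probs = [p / p.sum() for p in rng.random(size=(3, 5))]
    loss = sequence_nll(probs, target_ids=[2, 0, 4])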

5 EXPERIMENTS

We test our model in the setting of correcting e-commerce queries. Unlike the machine translation problem, there are no public datasets for e-commerce spelling correction, and therefore we collect both training and evaluation data internally. For training data, we use the event logs that track user behavior on an e-commerce website. Our heuristic for finding potential spelling related queries is based on consecutive user actions in one search session. The hypothesis is that users will try to modify the search query until the search result is satisfactory with respect to the search intent, and from this sequence of actions on queries we can potentially extract misspelling and correctly spelled query pairs. Obviously, this includes a lot more diversity in query activities besides spelling mistakes, and thus additional filtering is required to obtain representative data for spelling correction. We use the same techniques as Hasan et al. [8]. Filtering multiple months of data from our data warehouse, we got about 70 million misspelling and spell correction pairs as our training data. For testing, we use the same dataset as in [8], which contains 4602 queries whose samples are labeled by humans.

We use beam search to obtain the final result from the model. The results are illustrated in Table 1: it is clear that, albeit much simpler, our RNN based model offers competitive performance as compared to the previous methods. It is interesting to note that the BPE based encoder and decoder performs the best. The better performance may be attributed to the shorter resulting sequences as compared to the character case, and possibly to more semantically meaningful segments from the sub-words as compared to the characters. Surprisingly, the character based decoder performs quite well considering the complexity of the learning task. This demonstrates the benefit of end-to-end training and the robustness of the framework.
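The beam search used here is the standard procedure; the following is a generic sketch, assuming a hypothetical next_token_probs(prefix) callable that wraps the trained decoder and returns a distribution over the output vocabulary.

    import numpy as np

    def beam_search(next_token_probs, eos_id, beam_size=5, max_len=30):
        # keep the beam_size best prefixes by accumulated log-probability
        beams = [([], 0.0)]
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == eos_id:
                    candidates.append((prefix, score))  # finished hypotheses carry over unchanged
                    continue
                probs = next_token_probs(prefix)
                for tok in np.argsort(probs)[-beam_size:]:
                    candidates.append((prefix + [int(tok)], score + float(np.log(probs[tok]))))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
            if all(p and p[-1] == eos_id for p, _ in beams):
                break
        return beams[0][0]

    # dummy usage: a uniform distribution in place of the real decoder
    best = beam_search(lambda prefix: np.full(6, 1.0 / 6), eos_id=0)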
Table 1: Results on the test dataset with various methods. C-2-C denotes that the model uses a character based encoder and decoder; W-2-W denotes that the model uses a BPE partial word based encoder and decoder; and C-2-W denotes that the model uses a character based encoder and a BPE partial word based decoder.

    Method              Accuracy
    Hasan et al. [8]    62.0%
    C-2-W RNN           59.9%
    W-2-W RNN           62.5%
    C-2-C RNN           55.1%

6 CONCLUSION

In this paper, we reformulated the spelling correction problem as a machine translation task under the encoder-decoder framework. The reformulation allowed us to use a single model for solving the problem, and the model can be trained end-to-end. We demonstrate the effectiveness of this model using an internal dataset, where the training data is automatically obtained from user logs.


Despite the simplicity of the model, it performed competitively as compared to the state of the art methods that require a lot of feature engineering and human intervention.

REFERENCES
 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
 [2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
 [3] Eric Brill and Robert C Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286–293.
 [4] Silviu Cucerzan and Eric Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 293–300.
 [5] Steffen Eger, Tim vor der Brück, and Alexander Mehler. 2016. A comparison of four character-level string-to-string translation models for (OCR) spelling error correction. The Prague Bulletin of Mathematical Linguistics 105, 1 (2016), 77–99.
 [6] Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 358–366.
 [7] Jai Gupta, Zhen Qin, Michael Bendersky, and Donald Metzler. 2019. Personalized Online Spell Correction for Personal Search. In The World Wide Web Conference (WWW '19). ACM, New York, NY, USA, 2785–2791. https://doi.org/10.1145/3308558.3313706
 [8] Sasa Hasan, Carmen Heger, and Saab Mansour. 2015. Spelling Correction of User Search Queries through Statistical Machine Translation. In EMNLP. 451–460.
 [9] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] Mark D Kernighan, Kenneth W Church, and William A Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the 13th Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics, 205–210.
[12] Yanen Li, Huizhong Duan, and ChengXiang Zhai. 2012. CloudSpeller: query spelling correction by using a unified hidden markov model with web-scale resources. In Proceedings of the 21st International Conference on World Wide Web. ACM, 561–562.
[13] Chris J Lu, Alan R Aronson, Sonya E Shooshan, and Dina Demner-Fushman. 2019. Spell checker for consumer language (CSpell). Journal of the American Medical Informatics Association 26, 3 (2019), 211–218. https://doi.org/10.1093/jamia/ocy171
[14] Stephan Raaijmakers. 2013. A deep graphical model for spelling correction. (2013).
[15] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[17] Casey Whitelaw, Ben Hutchinson, Grace Y Chung, and Gerard Ellis. 2009. Using the web for language independent spellchecking and autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 890–899.