Legal Query Reformulation using Deep Learning
                          Arunprasath Shankar                                                    Venkata Nagaraju Buddarapu
                               LexisNexis                                                                   LexisNexis
                             Raleigh, USA                                                                 Raleigh, USA
                  arunprasath.shankar@lexisnexis.com                                        venkatanagaraju.buddarapu@lexisnexis.com

ABSTRACT                                                                              achieves this by using either his own prior knowledge or through
Query reformulation is the process of iteratively modifying a query                   assisted tools. Legal query reformulation is not a trivial task, as
to improve the quality of search engine results. In recent years,                     legal search queries are often short, complex and lack context. Au-
the task of reformulating natural language (NL) queries has re-                       tomatic query correction systems are one of the most widely used
ceived considerable diligence from both industry and academic                         tools within NL applications. Search engines support users in this
communities. Traditionally, query reformulation has been mostly                       task in two ways, (a) explicitly by suggesting related queries or
approached by using the noisy channel model. Since legal queries                      query completions, or (b) implicitly by expanding the query to im-
are diverse and multi-faceted, these traditional approaches can-                      prove quality and recall of organic results. Successful reformula-
not effectively handle low frequency and out-of-vocabulary (OOV)                      tions must closely align with the original query both syntactically,
words. Motivated by these issues, we rethink the task of legal query                  as sequences of characters or words, and semantically, often in-
reformulation as a type of monolingual neural machine translation                     volving transparent taxonomic relations.
(NMT) problem, where the input (source) query is potentially er-                          From a probabilistic perspective, given an incorrect query q −
roneous and the output (target) query is its corrected form. We                       that we wish to reformulate to a correct query q + , we seek to
propose a unified and principled framework with multiple levels                       model q + = arg maxq P (q|q − ) where q is the original query or
of granularity. Specifically, (i) an encoder with character atten-                    ground truth. In a traditional noisy channel model [2], the model
tion which augments the subword representation; (ii) a decoder                        consists of two parts: (i) a language model i.e., P(q) that represent
with attentions that enable the representations from different lev-                   the prior probability of the intended correct input query; and (ii)
els of granularity to control the translation cooperatively and (iii) a               an error model i.e., P (q|q − ) that represent the process in which
semi-supervised methodology to extract and augment a large-scale                      the correct input query gets corrupted to its incorrect form. There
dataset of NL query pairs combining syntactic and semantic opera-                     are several drawbacks with this approach: (i) we need two sepa-
tions. We establish the effectiveness of our methodology using an                     rate models and the error in estimating one model would affect
internal dataset, where the training data is automatically obtained                   the performance of the final output, (ii) it is not easy to model the
from user query logs. We further demonstrate that training deep                       channel since there is a lot of sources for these mistakes e.g., typing
neural networks on additional data with synthesized errors can                        too fast, unintentional key stroke, phonetic ambiguity, etc. and (iii)
improve performance for translation.                                                  it is not easy to obtain clean training data for language model as
                                                                                      the input does not follow what is typical in NL. Since the end goal
1.    INTRODUCTION                                                                    is to get a query that maximizes P (q|q − ), can we directly model
Technology and innovation are transforming the legal profession                       this conditional distribution instead.
in manifold ways. The legal industry is undergoing significant dis-                       In this work, we explore this route, which by passes the need
ruption and arguably machine intelligence is most advancing in                        to have multiple models and avoid suffering errors from multiple
the area of discovery and search. Technologies are moving from                        sources. We achieve this by applying the encoder-decoder frame-
basic keyword searches to predictive coding, which involves algo-                     work combined with attention learning using recurrent neural net-
rithms that predict whether a document is relevant or not. In [1],                    works [3] and rethink the query correction problem as a NMT prob-
McGinnis et al. envisage two phases of technological changes in                       lem, where the incorrect input is treated as a foreign language.
this area. The first phase, expected to come in the next 10 years, in-                We also propose a semi-supervised methodology to perform er-
volves perfecting semantic search that will allow lawyers to input                    ror detection and construct a large-scale dataset for training and
NL queries to interfaces and systems responding to those queries                      experimentation. To the best of our knowledge, there is no prior
directly with relevant information. The second phase involves tech-                   research work on the idea of using attention learning in a legal
nology that is able to identify issues, given a set of facts and then                 setting. Also, our work is the first that uses neural machine trans-
suggest relevant authorities that might apply to those issues.                        lation based methodologies on user generated legal data for query
   Query reformulation or correction is a crucial component for                       reformulation. We demonstrate that NMT models can successfully
any service that requires users to type in NL, such as a search en-                   be applied to the task of query reformulation in a legal background
gine. It is an iterative process where a user reformulates (rewrites)                 and validate that the trained models are naturally capable of han-
queries to improve search results or gain new information. He                         dling orthographic errors and rare words, and can flexibly correct
                                                                                      a variety of error types. We further find that augmenting the net-
In: Proceedings of the Third Workshop on Automated Semantic Analysis of Informa-      work training data with queries containing synthesized errors can
tion in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.                 result in significant gains in performance.
©2019 Copyright held by the owner/author(s). Copyrighting permitted for private and
academic purposes.
Published at http:// ceur-ws.org.
2.    APPLICATION                                                           approaches and, second is the process of error detection (selecting
An “answer card” is a search feature, usually displayed in a box or         candidate queries that are incorrect) for training the reformulation
panel, that occurs above organic results and tries to directly answer       models. We will introduce related studies specific to these areas in
a question. Most answer boxes are primarily text, containing a rel-         the this section.
atively short answer and may provide limited information based
on user’s entered query. In legal context, the source can be a legal        3.1    Query Reformulation
question, profile search or can take many forms. Figure 1 shows a           3.1.1 Traditional Approaches: In [5], Xu et al. used top results
card that is an answer to a legal question “define motion to com-           retrieved by the original query to perform reformulation by expan-
pel” and Figure 2 shows a profile card for a judge search using the         sion. This method is popular and influenced by the initial ranking
query “who is judge antonin scalia ?”.                                      of results, however it cannot utilize user-generated data. Other ap-
                                                                            proaches focused on using user query logs to expand a query by
                                                                            means of clickthrough rate [6], co-occurrence in search sessions,
     Motion to compel                                                       or query similarity based on click-through graphs [7]. The advan-
     A motion to compel asks the court to order either the opposing         tage of these approaches is that user feedback is readily available
     party or a third party to take some action.
                                                                            in user query logs and can efficiently be precomputed. However,
                                                                            these approaches all face the problem of requiring some resource
                                                                            dependency on a specific search engine.
                       Figure 1: Answer Card                                3.1.2 Statistical Machine Translation: The task of machine cor-
   Vertical engines are search engines that specialize in different         rection is defined as correcting a N -character or word source query
types of search. Answer cards are usually backed by vertical search         S = s 1 , . . . , s N = s 1N into a M-character or word target query
engines that are tightly coupled with AI/NLP systems. These NLP             T = t 1 , . . . , t M = t 1M . Thus, the correction system can be defined
systems may be a machine/deep learning model(s) recognizing and             as a function F:
classifying entities that trigger and render respective cards based
on user’s query intent[4]. NLP models rely heavily on vocabulary                                          TD = F (S)                             (1)
and any OOV words contained in the query can have an adverse
                                                                            which returns a correction hypothesis TD given an input word S.
effect on the coverage or functioning of answer cards.
                                                                            Recently, several studies have been adopting data from user query
                                                                            logs as input to Statistical Machine Translation (SMT) for query re-
                                                                            formulation and expansion [8][9]. These methods treat user queries
                                Antonin Scalia                              as one language and the reformulated queries as another language.
                                                                            However, their major drawback is the difficulty of modeling cor-
                                                                            rections in different granularities, e.g., characters or words, which
                        Former Associate Justice
                          Supreme Court (U.S.)                              is necessary to reduce the rate of unknown words that are detri-
                                                                            mental to their proper functioning.
                     Born: March 11, 1936, Trenton, NJ
                                                                               Our approach differs in several ways. First, we consider full
                     Died: February 13, 2016                                query pairs as training data, and do not use single word tokens
                                                                            as a primary mode of operation. Second, we do not train explicit
                     Appointed by: Ronald Reagan
                                                                            error models P(w |s) for words w and observed corrections s, but
                                                                            use standard word/character based NMT models to derive more
                                                                            meaningful and accurate lexical translation models.
                        Figure 2: Profile Card
                                                                            3.1.3 Neural Machine Translation: Recently, the use of neu-
   The scope of this research is to develop a framework to detect           ral networks has delivered significant gains for mapping tasks be-
these erroneous queries and autocorrect them. The framework acts            tween pairs of sequences due to to their ability to learn a better
as an extra layer between the answer cards and NLP recognition              representation of the data and context. Recurrent Neural Networks
systems to facilitate a coherent mechanism for coping up with               (RNNs) are a set of neural networks for processing sequential data
missing information (vocabulary); which in turn enable us to re-            and modeling long-distance dependencies which is a common phe-
turn more appropriate cards. For e.g., for the above mentioned              nomenon in human language. NMT models [10] use RNNs and
answer card queries, lets say a user typed in “defin e motiom               learn to map from source language input to target language input
to compell” or “whi is jufge antonin scakia?” instead, our de-              via continuous-space intermediate representations effectively. We
ployed framework would fix (reformulate) the queries to its correct         use an encoder-decoder bound uni/bi-directional RNNs with an at-
form and display the cards successfully.                                    tention mechanism as the core component of our proposed system.
                                                                            [11]. The encoder maps the input query to a higher-level represen-
3.    BACKGROUND AND RELATED WORK                                           tation with a uni or bi-directional RNN architecture similar to that
There are two major areas related to our research and the proposed          of [12]. The decoder also employs a RNN that uses content-based
models. First is the work on the task of query reformulation in             attention mechanism [13] to attend to the encoded representation
information retrieval involving traditional, statistical and neural         and generate the output query one character at a time.
                                                                        2
3.1.4 Character Level Reasoning: Word-level NMT models are                   words (tokens). Table 1 shows top 10 ranked queries (left) and to-
poorly suited to handle OOV words due fixed vocabulary [14]. Re-             kens (right) at the end of query preparation process.
cent works have proposed workarounds for this problem. Since a
word is usually thought of as the basic unit of language commu-                     Rank            Query            Rank Token
nication [15], early NMT systems built these representations start-                   1      breach of contract        1   motion
ing from the word level [16]. Machine learning/NLP models typi-                       2      unjust enrichment         2   contract
cally limit vocabulary size due to the complexity of training [17].                   3     summary judgement          3   court
Therefore, they are unable to translate rare words, and it is a stan-                 4          res judicata          4   judgment
dard practice to replace OOV words with unknown (UNK) symbol.                         5              fraud             5   insurance
By doing so, useful information is discarded, resulting in systems                    6       motion to dismiss        6   evidence
that are not able to correct erroneous words. In our framework,                       7          conversion            7   property
we strive to circumvent this issue by using smaller units such as                     8     statute of limitations     8   attorney
subwords to address the problem of OOV words.                                         9          defamation            9   liability
                                                                                     10        civil procedure        10   statute
3.2    Error Detection                                                                     Table 1: Top 10 Queries and Tokens
In most translation systems, before any reformulation is carried
out, a detection process is conducted on the input query to extract          4.2       Query Representation
any potentially incorrect words. In [18], He et al. proposed a learn-        The first step in creating a model, is to choose a representation for
ing to rewrite framework that focuses on candidate ranking for               the input and output. For instance, a query can be represented in
error detection. A non-word error refers to a potentially incorrect          word-level, which is transferring the data with indices that refer to
word that does not exist in a given dictionary. Dictionary lookup            the words or in character-level, which is transferring the data with
is one of the basic techniques employed to compare input strings             indices that refer to a closed vocabulary of language specific char-
with the entries of a language resource, e.g., lexicon or corpus [19].       acters and symbols. However, in this work, we anticipate the need
Such a language resource must contain all inflected forms of the             to handle an exponentially large input space of tokens by choosing
words and it should be updated regularly. If a given word does not           a character-level query representation. Thus, model inputs are in-
exist in the language resource, it will be marked as a potentially           dividual characters which are a sequence of indices corresponding
incorrect word which is a huge disadvantage of this approach.                characters.
   Minimum edit distance is one of the most studied techniques
for error detection. It is based on counting edit operations, which                            Legal
are defined in most systems as insertion, deletion, substitution and                          Names

transposition. Jaro-Winkler [20], Wagner-Fischer [21], and Leven-                                              Data
                                                                                                            Augmentation
                                                                                Legal
shtein [22] are among the most famous edit distance algorithms.                Corpora
This approach has a main drawback, in that it penalizes all change                                                                  User Input
                                                                                               Error
operations in the same way, without taking into account the char-                            Detection

acter that is used in the change operation. In contrast to above               Query                        Training Pairs    NMT                Reformulated
                                                                                                                                                     Input
mentioned approaches, our error detection methodology combines                  Log

both syntactic and semantic attributes of a user query to extract in-                        FastText
                                                                                            Embeddings
correct non-word occurrences. Here, we establish a unified frame-
work (see Figure 3) that encompasses aspects like frequency dis-
tribution, query rank and subword representations for error detec-                           Figure 3: Proposed Framework
tion alongside NMT and attention learning.                                      We have defined specific characters like <START> and <END>
                                                                             in order to specify start and end of each query. This character will
                                                                             be later used in the training process as a criteria. Note that in
4. PROPOSED FRAMEWORK
                                                                             the character-level representation, we have also considered spaces,
4.1 Query Preparation                                                        certain diacritics and punctuations as unique characters, i.e., 40
For our experiments, we collected a total of ∼ 62M distinct queries          unique characters in all. Once a query is mapped into indices, the
grouped by frequency, collected over a period of 5 years. The queries        characters are embedded, i.e. each character is represented by a
under study included a considerable proportion of boolean queries            one-hot vector and is multiplied by a trainable matrix with the size
∼ 32M of them. We used a heuristic approach to filter search queries.        of the input X embeddings size. The embedding allows the model
Using a combination of regular expressions and strict constraints,           to group together characters that are similar for the task.
we excluded any query with boolean or special character connec-
tors; this left us with ∼ 29M NL queries.                                    4.3       Candidate Selection
   We used the punkt sentence tokenizer [23] to tokenize the queries         Our scoring heuristic calculates a sequence of feature scores for
into words. The tokens were further filtered down to remove noise            each token from the prepared vocabulary. The key motivation is
and short forms. Any token that is a digit and of length ⩾ 5 was             to separate these tokens into +ve (correct) and −ve (incorrect)
excluded from our study. The created vocabulary contained ∼ 1M               buckets. We achieve this by combing three distributional attributes.
                                                                         3
First, since our vocabulary is derived from user log queries, we                     is computed as follows:
can associate the individual tokens to the rank of its constituent                          
queries. Second, we can also affiliate a token to its individual fre-                        1, if (x ∈ ▽H ) ∨ (x ∈ ▽ J ) ∨ (x ∈ ▽E ) ∨ (x ∈ ▽C )
                                                                                            
                                                                                        ϕ=  1                                                    (5)
quency within the vocabulary. And third, by the membership of                                , otherwise
                                                                                            2
the tokens to selected legal lexicons. Our selected lexicons com-
prise of extracted vocabulary from headnotes (∼ 217K) derived                        4.3.5 Relevance Score: Finally, a relevance score (Ω) is com-
from US legal caselaw documents, legal names consisting of judge                     puted by combining all the scores derived from equations 2, 3, 4
(∼ 67K) and expert witness (∼ 388K) names derived legal master                       and 5 as follows:
                                                                                                       (      1 ) (             ) (       1 )
databases. We also use a secondary lexicon of common US english                             Ω= ϕ· 1−                · 1 − e −χ · 1 −               (6)
names (∼ 234K).                                                                                           e 2ξ + 1                      1+ζ
    Let Q be an ordered set of distinct legal queries derived from                   The relevance score expresses the relative importance of a token
user logs and ranked by frequency. Q ⇒ {q 1 , q 2 , . . . , qr } where r             and can be used to classify a word into +ve or −ve bucket. We es-
is the rank of the query; r ∈ R. Let R denote an ordered set of ranks;               tablish this separation by using a cut-off threshold (∼0.83) obtained
R ⇒ {1, 2, . . . , L}, where L is the length of Q i.e., total number of              via trial and error experiments.
distinct queries. Let ▽ ⇒ {x 1 , x 2 , . . . , xl } denote the vocabulary of
distinct tokens derived from Q where l is length of the individual                   4.4    Ranking Strategy using FastText
query. For any given token x in vocabulary ▽, it can belong to a                     Traditional approaches to representing rare and OOV words follow
subset of queries Q̂; Q̂ ⊆ Q. This implies that the token can mapped                 assigning a random vector [24] or binning all rare words into a new
to a closed set of ranks R̂; R̂ ⊆ R. E.g., the token discriminatory can              “UNK” word type. Another popular technique is to encode word
belong to multiple queries within Q, hence R̂ could be a closed set                  definitions with an external resource (Long et al., 2016) and train
e.g., {6, 37, 763, 23445}.                                                           at the morpheme level [25]. Creating semantic representations for
                                                                                     OOV words is a difficult NLP task. And word-based (Mikolov et
4.3.1 Mean Rank Score: Given R̂ ⊆ R, for a token x ∈ Q, R̂ can
                                                                                     al., 2013; aka word2vec)[26] approaches do not contribute well to
be defined as {r 1 , r 2 , . . . , rl }. We compute the mean rank score (ξ )
                                                                                     resolve this problem.
as follows:
                                    ∑
                                    l            ri                                                Token       Score         Token     Score
                                          1−
                                   i =1          L                       (2)                      roberds     0.9967        cbreach    0.9839
                             ξ =
                                          |R̂|                                                     robers     0.9962         fbreach   0.9771
                                                                                                  robergs     0.9913       onbreach 0.9684
where |R̂| is size of R̂.                                                                         robertus 0.9851           breachh 0.9683
4.3.2 Neighbor Rank Score: Given a token x 0 , it can co-occur                                   robertson 0.9797          bebreach 0.9659
alongside other tokens from the vocabulary contained to the queries                                robertt    0.9777         breachj   0.9657
under investigation. Let ▽  ˆ ⇒ {x 1 , x 2 , . . . , xl } denote the subset of                     rowert     0.9766        breachn 0.9643
tokens co-occurring alongside x. For every candidate token inside                                 rochfort 0.9758           ofbreach 0.9641
▽
ˆ , we have a corresponding mean rank score ξ derived from equa-                                   rogerts    0.9756       mcbreach 0.9627
tion 2. We restrict the window of co-occurrence to 2 neighbors, one                                robtert    0.9731       orbreach 0.9612
before and after the token. Let ξˆ be the set of corresponding mean                            Table 2: Top-10 Similar Words by fastText
rank scores; ξˆ ⇒ {x 1 , x 2 , . . . , xl } The neighbor rank score (χ ) is             Character-based embedding approaches are more suited to han-
computed as:                                                                         dle the OOV problem (Bojanowski et al., 2017; aka fastText)[27].
                                                                                     For mapping the +ve tokens to their corresponding variants (in-
                                      ∑
                                      l
                                        ξi                                           correct −ve counterparts), we used fastText [28]. fastText is a
                               χ=                                        (3)         character-based embedding approach that it is well-suited to pro-
                                                 ˆ
                                      i =1 | ξ |                                     duce embeddings for OOV words. It’s able to do this by learning
4.3.3 Frequency Score: Let ϵ denote the frequency of a token;                        vectors for character n-grams within the word and summing those
x ∈ ▽. Frequency score (ζ ) is a normalized value representing the                   vectors to produce the final vector or embedding for the word itself.
importance of frequency of occurrence of a token in the corpus. It                   For training, we set the embedding size to 100 and the training win-
is defined as follows:                                                               dow to 10. We also set alpha value to 0.025, min and max values of
                                                                                     n to 2 and 6 respectively. The embedding matrix was created using
                                           ϵ
                                ζ =                                      (4)         a batch size of 10K words in 100 epochs. We trained fastText em-
                                          |▽|                                        beddings on the 29M NL queries using a p3.2xlarge EC2 instance
4.3.4 Lexicon Membership Score: Let ▽H denote the vocabu-                            with 1 Tesla V 100 GPU. Table 2 shows top 10 most similar words
lary of tokens derived from headnotes. Let ▽ J denote the vocab-                     for the OOV +ve word “roberts” (left) and OOV −ve word “breach”.
ulary of tokens derived from judge master (judge name database)
and let ▽E denote tokens from expert witness database. Let ▽C                        4.5    Query Augmentation
indicate all the tokens from common valid US English names vo-                       Legal names like judge, expert witness, attorney names etc. are
cabulary. For any given token x, the lexicon membership score (ϕ)                    often misspelled by users during search. Since we were not able
                                                                                 4
to capture a considerable amount of these errors from query logs                and uses the hidden state for the next input character. Figure 4(a)
(real world data), we decided to synthetically augment these type               portrays a simple overview of the encoder.
of errors and generate queries for legal names algorithmically. For
                                                                                                                                            Previous
this purpose, we took all the names (+ve instances) from our legal                                                       Input
                                                                                                                                            Hidden

corpora and applied a variant of Needleman-Wunsch algorithm
to create its −ve counterparts.
                                                                                                                         Embedding
   The distance computation is identical to Levenshtein except that
character mistakes are given different weights depending on how
far two characters are on a standard keyboard layout (QWERTY).                                             Previous               ReLU
                                                                                        Input
The weights are assigned by projecting the keyboard characters                                             Hidden

onto a cartesian system. For e.g., A to S is given a mistake weight
of 0.4, while A to D is a 0.6. We generate query pairs using 4 types of                     Embedding
                                                                                                                                      GRU

transformation - addition, deletion, replacement and transposition.
Table 3 shows a few of the generated examples for augmentation.
                                                                                                     GRU                         Softmax


      Misspelled             Corrected                  Operation
      mchael lowy            michael lowy         addition                              Output                 Hidden      Output           Hidden

      david yamhamoto        david yamamoto       deletion                                       (a) Encoder                     (b) Decoder
      christibe ballard      christine ballard  replacement
      warren glciker         warren glicker    transposition                                        Figure 4: Encoder and Decoder
                    Table 3: Augmented Queries
                                                                                   Following [29], we use a bi-directional RNN with Gated Recur-
                                                                                rent Units (GRUs) to encode the source query:
4.6     Training Pairs                                                                               #»          (# »      #»)
                                                                                                     hi = GRU hi−1 , si ; Θ
In previous sections, we discussed the need to segregate the vocab-                                  »#          (» #     »# )               (8)
ulary into +ve and −ve tokens and the use of fastText character                                      hi = GRU hi−1 , si ; Θ
n-grams or subwords to capture word mappings (+ve to −ve). At
                                                                                where si is the i th source embedding, GRU is a gated recurrent unit,
the end of the candidate selection process, the entire vocabulary is            #»      »#
                                                                                Θ and Θ are the parameters of forward and backward GRU, respec-
broken into two groups of ∼180K +ve and ∼830K −ve tokens. At
                                                                                tively. The annotation of each source character qi− is obtained by
the end of ranking using fastText, we end up with around ∼2.4M
                                                                                concatenating the forward and backward hidden states:
legal term mapping and ∼50K legal name mapping. The legal name
                                                                                                                     #» 
mapping is expanded via the augmentation process elaborated in                                                »#»#      hi
the previous section to around ∼1.5M pairs. The legal term and                                                hi =  »#                       (9)
                                                                                                                      hi 
name mappings are then matched against the indexed 60M user
queries to create query pairs that will be used for training our
translation models.                                                             4.7.2 Decoder: The decoder is also built using a RNN. It takes the
                                                                                same representation of the input context along with the previous
4.7     Neural Machine Translation                                              hidden state and output character prediction, and generates a new
                                                                                hidden decoder state and a new output character prediction. The
Most NMT systems follow the encoder-decoder framework with
                                                                                decoder is a forward RNN (uni-directional) with GRUs predicting
attention mechanism proposed by Bahdanau et al. [29]. Given a
                                                                                the translation y character by character. This prediction takes the
source query q − = q 1− · · · qi− · · · q −  I and a target query q
                                                                    + =
  +         +        +
                                                                                form of a probability distribution over the entire output vocabulary
q 1 · · · q j · · · q J , we aim to directly model the translation proba-       (100 characters). At every step of decoding, the decoder is given
bility as:                                                                      an input character and hidden state. The initial input character is
                                   ∏
                                   J   (                 )                      the start of query character <START>, and the first hidden state
               P(q + |q − ; Θ) =     P q j |q +     −
                                              <j , q ; Θ
                                                                     (7)        is the context vector (the encoder’s last hidden state). Figure 4(b)
                                    1                                           illustrates a decoder and its associated states. The probability of
where Θ is a set of parameters and q +                                          generating the j th word y j is:
                                     <j is a sequence of previously                                                                   ψ j−1 
generated target characters.                                                                    (
                                                                                                              −
                                                                                                                    )           *.  ϕ  +/
                                                                                              P q+j |q +
                                                                                                       <j , q   ; θ   = softmax          j    (10)
4.7.1 Encoder: We use an RNN to architect an encoder. The En-                                                                    ,  κ j  -
coder outputs a value for every character from the input query.
The task of the encoder is to provide a representation of the input             where ψ j−1 is the character embedding of the j −1th target word, ϕ j
query. The input is a sequence of characters, for which the embed-              is the decoder’s hidden state of time j, and κ j is the context vector
ding matrix is constructed using fasttext[28]. Also to get the right            at time j. The state ϕ j is computed as:
                                                                                                              (     [        ]       )
context, we explore both uni and bi-directional RNNs here. For ev-                                                    ψ j−1
                                                                                                 ϕ j = GRU ϕ j−1 ,             ; Θϕ               (11)
ery input character the encoder outputs a vector and a hidden state,                                                     κj
                                                                            5
                                      S            C        A       K              I          A       <END>

                                                                                                               Input Character Embeddings


                                                                                                               Left-to-Right GRU


                                                                                                               Right-to-Left GRU


                                                                                                               Attention


                                                                                                               Input Context


                                                                                                               Hidden State


                                                                                                               Output Character Predictions


                                                                                                               Error


                                                                                                               Given Output Characters


                                                                                                               Output Character Embeddings


                         S            C            A        L       I              A         <END>


                                                       Figure 5: Encoder Decoder with Attention


4.8   Attention Mechanism                                                              4.9    Gated Recurrent Unit (GRU)
The attention mechanism was introduced to address the limitation                       In practice, a simple RNN is difficult to train properly due to the
of modeling long dependencies and the efficient usage of memory                        problems of the vanishing/exploding gradient as described in [30].
for computation. It intervenes as an intermediate layer between the                    Therefore, in this work, we utilize GRU (Cho et al., 2014 [11]) as an
encoder and the decoder, having the objective of capturing the in-                     improved version of simple RNN which can alleviate the gradient
formation from the sequence of tokens that are relevant to the con-                    problem. Along the lines of Long Short Term Memory (LSTM), in
tents of the query [13]. The mechanism is shown in Figure 5. The                       GRUs the forget and input gates are coupled into an update gate zt .
basic problem that the mechanism solves is that instead of forcing                     The advantage of GRUs over LSTMs is the smaller number of gates
the network to encode all parameters into one fixed-length vector,                     that makes them less memory as well as computationally intense,
it allows the network to make use of the input sequence.                               which is often a critical aspect for NMT.
    Key to our approach is the use of a character-based model with                        Given an input sequence (x1 , x2 , . . . , xN ), GRU can be adopted
an attention mechanism, which allows for orthographic errors to                        as an encoder to compute the corresponding sequence of hidden
be captured and avoids the OOV problem suffered by word-based                          state ht = (h1 , h2 , . . . , hN ) as:
NMT methods. Unlike the encoder-decoder model that uses the                                               zt = σ (Wx z xt + Uhz ht −1 )                 (15)
same context vector for every hidden state of the decoder, attention
computes the context vector c j as a weighted sum of the source                                           rt = σ (Wxr xt + Uhr ht −1 )                  (16)
annotations:                                                                                         H
                                                                                                     ht = tanh (Wxh xt + Ur h (r ⊗ ht −1 ))             (17)
                               ∑
                               I
                                            »#»#                                                        ht = (1 − zt ) ⊗ ht −1 + zt ⊙ H
                                                                                                                                      ht                (18)
                        κj =          ∆ji · hi                          (12)
                               i =1                                                    where σ is the sigmoid function and ⊗ is an element-wise multipli-
                                                                                       cation operator. zt , rt and H
                                                                                                                    ht are the update gate, reset gate and
where the attention weight ∆ji is computed as:
                                                                                       candidate activation, respectively. Wx z , Wxr , Wxh , Uhz , Uhr and
                                   ( )                                                 Ur h are related weight matrices.
                               exp Ξji
                    ∆ji = ∑          ( )                                (13)
                             I
                             i =1 exp Ξ ji
                                                                                       5. EXPERIMENTS
                                (                                                      5.1 Sequence Representation
                                               #»)
                 Ξji = α aT tanh βa ϕ j−1 + γa hi                       (14)           An input or output sequence for an error correction system can be
                                                                                       represented in different levels. At the character-level, a sequence
where α a , βa and γa are the weight matrices, and Ξji is the model                    is processed character by character. When searching for errors, hu-
                                »#»#
that scores how well ϕ j−1 and hi match.                                               mans often consider a bigger sequence of characters at word-level.
                                                                               6
Clause-level, phrase-level, sentence-level and text-level are other
                                                                                                   Input                                 Input
common representations for modeling NMT sequences. In our re-                                        10                                   100

search, we worked at character-level where each character in an                                 Embedding                             Embedding
input sequence is mapped to a real-valued number and its corre-                                  10 X 1024                            100 X 1024

sponding embedding. In order to model linguistic dependencies in                                  GRU                                    GRU
each sequence, we took a selected list of characters into account                               1024 tanh                              1024 tanh

including space. This enabled us to deal with different kinds of er-                           Repeat Vector                         Repeat Vector
                                                                                                10 X 1024                             100 X 1024
rors and a larger range of characters in each sequence. On the other
hand, the output of our models were also represented at the char-                                 GRU                                    GRU
                                                                                                1024 tanh                              1024 tanh
acter level.
                                                                                              Time Distributed                      Time Distributed
                                                                                             10 X 144964 ReLU                        100 X 40 ReLU

5.2    Selection Strategy                                                                   (a) Word-based                     (b) Char-based
In our experiments, for tuning and evaluation purposes we used
a gold reference annotation following a simple selection strategy.                       Figure 6: Architecture - NMT I and II
The strategy is used to check whether for a search query pair (x, y),
the left query x is erroneous, and if so, whether the similar candi-
date y is a correct correction or else provide the most likely cor-
rection. If x was not incorrect, we propagate query x to the gold                                   Input 1
                                                                                                      100
                                                                                                                                       Input 2
                                                                                                                                         100
reference y, i.e. for those cases, we have it identity as a true neg-
                                                                                                 Embedding 1                        Embedding 2
ative. For training our models, we split the ∼4M query pairs into                                 100 X 1024                         100 X 1024

three splits - train, dev and test in the ratio 99.995 : 0.005 : 0.005
                                                                                                    GRU 1                             GRU 2
respectively. This produces 20K queries for dev and test set each.                                 1024 tanh                         1024 tanh

The true negatives in these sets (i.e. entries that do not need to be                                            Time Distributed
corrected) account to about ∼2%. The validation of annotation was                                                 100 X 40 ReLU

mostly done via subject matter experts.
                                                                                            Figure 7: Architecture - NMT III
5.3    Model Architecture
In this section, we discuss about the various architectures experi-
mented for neural machine translation.                                                              Input 1                            Input 2
                                                                                                      100                                100


5.3.1 NMT I:. The first architecture we experimented with, for                                   Embedding 1                        Embedding 2
                                                                                                  100 X 1024                         100 X 1024
the task of NMT is a word-based model. The input layer uses a
dense vector representation of the vocabulary (144,964 words) de-                                   GRU 1
                                                                                                  100 X 1024
                                                                                                                                       GRU 2
                                                                                                                                     100 X 1024
rived from query pairs followed by an embedding layer of dimen-
                                                                                                                     Dot 1
sion 10X1024. The length of the sequence of the query is constrained                                               100 X 100
to a maximum fixed length of 10 words. This architecture uses a re-
                                                                                                                   Attention
peat vector along with 1024 GRU units. This is the only word-based                                                 100 X 100

architecture we experimented with for NMT.                                                                          Dot 2
                                                                                                                  100 X 1024

5.3.2 NMT II:. Architecture II follows a similar architecture as                                                 Concatenate
NMT I, except the word sequence is replaced with a character se-                                                 100 X 2048

quence. The sequence length here is constrained by two factors: (i)                                            Time Distributed 1
                                                                                                                  100 X 1024
the total number of characters that constitute the vocabulary (40
characters) which includes special characters and, (ii) the length                                             Time Distributed 2
                                                                                                                   100 X 40
of the sequence which is set to fixed maximum length of 100 char-
acters. Both architectures NMT I and II do not use attention and
use a single embedding at the character level. Figure 6 shows the                           Figure 8: Architecture - NMT IV
architectural layers for NMT models I and II.
                                                                             5.3.4 NMT IV. Model IV is the first setup to use an attention
5.3.3 NMT III. In the third architecture setup, we use two charac-           layer. It is similar to NMT III in using two character embeddings,
ter embedding layers which is different from NMT I and II architec-          one at the encoder and other at the decoder level. See Figure 8.
tures which uses only one embedding. NMT III also does not use
a repeat vector layer. This is a character-based only architecture           5.3.5 Bi-directional Architectures. We also experimented the
and does not use attention. Figure 7 illustrates the architecture of         previously discussed neural architectures (NMT I to IV) by replac-
NMT model III. The dimensions of embedding and time-distributed              ing uni-directional GRUs with bi-directional GRU units. This is
layers are similar to NMT II.                                                only applied to the encoder component of the architecture.
                                                                         7
5.4    Training                                                             6.2    GLEU
For our experiments, we used p3.16xlarge EC2 instance with 8                Recently, Napoles et al. ameliorated BLEU metric for evaluation
Tesla V 100 GPUs. Table 4 shows the training time (in minutes) for          of grammatical error correction systems and proposed the Gener-
the various architecture setups (100 epochs with a batch size of            alized Language Evaluation Understanding (GLEU ) [33][34]. For
512). The NMT models were implemented using TensorFlow.                     GLEU , the precision is modified to assign extra weight to the n-
                                                                            grams that are present in the reference and the hypothesis, but
                                                                            not those of the input. We have used the last update of the original
                                                                            implementation of the GLEU introduced in [34].
                       Model         Avg Time
                    NMT I (Uni)      87.41                                  6.3    Character n-gram F-score (CHRF )
                    NMT I (Bi)      126.32                                  CHRF calculates the sentence level character n-gram F -score as
                   NMT II (Uni)      95.76                                  described in Maja Popovic, 2015 [35]. It is shown to correlate very
                    NMT II (Bi)     163.73                                  well with human rankings of different machine translation outputs,
                   NMT III (Uni)    114.05                                  especially for morphologically rich targets. For our study, we use
                    NMT III (Bi)    239.81                                  CHRF 1 (standard F -score, β = 1) with uniform n-gram weights.
                   NMT IV (Uni)     168.25
                    NMT IV (Bi)     320.94                                  7.    RESULTS
                 Table 4: NMT - Training Time                               In the previous section, we presented the translation models for
                                                                            the task of query reformulation and the details of the models were
                                                                            explained. In this section, the results obtained from the models are
                                                                            discussed. Table 6 and 7 shows results of our correction models
6.    EVALUATION METRICS                                                    using the evaluation metrics discussed in the previous section.
This section presents evaluation methods used to evaluate the NMT
models. Although various methods can be used to evaluate a ma-              7.1    Baseline
chine translation system, for our use case of query reformulation,          We define pairs of source queries and their ground-truth annota-
evaluation is limited to a binary classification of predictions where       tions as the baseline of our NMT models. In this baseline, we as-
the matching elements are considered as true predictions and oth-           sume that none of our implemented models intervene in the task of
ers as false. However, a good evaluation must include more details          correction and only references are considered as correction. Simply
about this comparison. Over the years, a number of metrics have             saying, baseline is a model that makes no corrections on the input
been proposed for evaluation of query correction, each motivated            query. It enables us to interpret the performance of each model in
by weaknesses of previous metrics. There is no single best evalu-           comparison to the default results.
ation metric and the performance of a metric depends on the re-
search goals and application. Thus, we have evaluated our system            7.2    Winners
based on the most popular metrics to date.                                  As we expect, since the baseline system contains the ground-truth
   In classification tasks, accuracy is one of the most widely used         correction, the BLEU and the GLEU scores for the baseline system
performance measure. Accuracy corresponds to the ratio of cor-              have a maximum value 1.00. Using these metrics, the bi-directional
rectly classified inputs to the total number of inputs. One drawback        NMT model (model IV) with the attention layer shows higher scores
of this metric is that correction actions are completely ignored. In        in comparison to other models, i.e., F -score = 94.11%, BLEU = 0.9255
order to take the correction actions into scope, we use additional          and GLEU = 0.9256. Interestingly, CHRF scores of models III and
performance metrics like precision, recall and F-score. For our ex-         IV (bi-directional) are almost identical. The choice of metric also
periments, we use F 0 . 5 , since it places twice as much emphasis on       portrays few other insights, for example CHRF performs poorly
precision as on recall. This metric has also been used in CoNLL-            on the addition operation. Similarly, GLEU metric is a bad choice
2014 shared task [31]. We also use more modified metrics such as            for operations replacement and transposition while performs very
BLEU , GLEU and Character n-gram F-score (CHRF ) thus gaining               well on other type of operations.
better insight into translation system’s performance. These metrics
are explained in the following subsections.
                                                                                                     Uni                      Bi
6.1    BLEU                                                                         Model     P       R      F0.5     P       R      F0.5
The Bilingual Evaluation Understudy Score (BLEU ) is a widely pop-                 NMT I    83.76   85.37   84.08   83.91    81.01   83.31
ular metric for machine translation [32]. For our evaluations, in or-             NMT II    86.24   84.61   85.90   88.64    87.22   88.35
der to apply BLEU to individual queries, we used smoothed BLEU ,                  NMT III   91.17   80.04   90.74   92.77    93.86   92.98
whereby we add 1 to each of the n-gram counts before we calculate                 NMT IV    93.08   90.79   92.61   93.61    96.19   94.11
the n-gram precisions. This prevents any of the n-gram precisions
from being zero, and thus will result in non-zero values even when                   Table 6: NMT Results - Standard Metrics (%)
there are not any 4-gram matches.
                                                                        8
                                             Incorrect                                                  Correct
                          contact void due to barratry or champterty               contract void due to barratry or champerty
                       nondisclsoure unenforceable lackof consideration        nondisclosure unenforceable lack of consideration
                           summary judgment for breach o fcontract                 summary judgment for breach of contract
                                declaratry judgmentnd discovery                        declaratory judgment and discovery
                            business judgment rulegross negligencce               business judgment rule and gross negligence
                                  tennesee power of attorneyey                             tennessee power of attorney
                            collaterale stopp el amount of damages                     collateral stoppel amount of damages
                         agreement to remove fence statue of drauds              agreement to remove fence statute of frauds
                          purpose of motion inlimine are evidentary               purpose of motion in limine are evidentiary
                                officerwilliam alexsander karabelas                     officer william alexander karabelas
                            associate justise ruth bader ginsberg                     associate justice ruth bader ginsburg
                                officerwilliam alexsander karabelas                     officer william alexander karabelas
                                                 Table 5: Sample Results - Query Reformulation


                          Uni                             Bi                    [37] have been used for the task of error correction. We plan to
                                                                                explore these networks to read the input sequence multiple times
     Model    BLEU       GLEU      CHRF       BLEU       GLEU     CHRF
                                                                                in order to make an output and also update memory contents at
 NMT I        0.8253     0.8341     0.8601    0.8352     0.8476   0.8632        each step. We also plan to extend our work to question answering
NMT II        0.8305     0.8484     0.8519    0.8524     0.8742   0.8849        for grammar error correction and apply reformulation to boolean
NMT III       0.8645     0.8924     0.8734    0.8752     0.8994   0.9056        queries.
NMT IV        0.9024     0.9255     0.9145    0.9255     0.9256   0.9055
             Table 7: NMT Results - Modified Metrics
                                                                                REFERENCES
7.3    Limitations                                                               [1] R. Pearce and J. Mcginnis, “The great disruption: How machine intelligence will
                                                                                     transform the role of lawyers in the delivery of legal services,” Fordham Law
A few examples of source queries and its output corrections for                      Review, vol. 82, p. 3041, 05 2014.
some of our trained NMT models are illustrated in Table 5. The red               [2] M. D. Kernighan, K. W. Church, and W. A. Gale, “A spelling correction program
                                                                                     based on a noisy channel model,” in Proceedings of the 13th Conference on Compu-
tags refer to the incorrect tokens in the input and the green tags                   tational Linguistics - Volume 2, COLING ’90, (Stroudsburg, PA, USA), pp. 205–210,
are the correctly predicted tokens by our trained models. Looking                    Association for Computational Linguistics, 1990.
carefully at the distribution of incorrect prediction of correct in-             [3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural
                                                                                     networks,” in Advances in neural information processing systems, pp. 3104–3112,
put words, we can deduce that the models perform less sensibly                       2014.
when the size of sequence become gradually bigger. To prove this,                [4] S. Arunprasath and B. Venkata Nagaraju, “Deep ensemble learning for legal
we evaluated the models by limiting the sequences to a fixed size.                   query understanding,” in Proceedings of CIKM 2018 Workshop on Legal Data An-
                                                                                     alytics and Mining (LeDAM 2018), CEUR-WS.org, October 2018. To appear.
Figure 9 shows the attention heatmap for a sample query.                         [5] J. Xu and W. B. Croft, “Query expansion using local and global document anal-
                                                                                     ysis,” in Proceedings of the 19th Annual International ACM SIGIR Conference on
8.    CONCLUSION                                                                     Research and Development in Information Retrieval, SIGIR ’96, (New York, NY,
                                                                                     USA), pp. 4–11, ACM, 1996.
In this paper, we have investigated the potential of using character-            [6] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma, “Probabilistic query expansion using
                                                                                     query logs,” in Proceedings of the 11th International Conference on World Wide
level information and subword-based NMT models for the problem                       Web, WWW ’02, (New York, NY, USA), pp. 325–332, ACM, 2002.
of query reformulation. First, we extended the encoder with a char-              [7] B. M. Fonseca, P. Golgher, B. Pôssas, B. Ribeiro-Neto, and N. Ziviani, “Concept-
acter attention mechanism for learning better source side repre-                     based interactive query expansion,” in Proceedings of the 14th ACM International
                                                                                     Conference on Information and Knowledge Management, CIKM ’05, (New York,
sentations. Then, we incorporated information about source side                      NY, USA), pp. 696–703, ACM, 2005.
characters into the decoder with attention, so that the character-               [8] J. Gao and J.-Y. Nie, “Towards concept-based translation models using search
level information can cooperate with the word-level information                      logs for query expansion,” in Proceedings of the 21st ACM International Conference
                                                                                     on Information and Knowledge Management, CIKM ’12, (New York, NY, USA),
to better control the translation. Our experiments demonstrate the                   pp. 1:1–1:10, ACM, 2012.
effectiveness of our models and proves that both OOV and frequent                [9] S. Riezler, Y. Liu, and A. Vasserman, “Translating queries into snippets for im-
                                                                                     proved query expansion,” in COLING, 2008.
words benefit from the character-level information.                             [10] D. Britz, Q. V. Le, and R. Pryzant, “Effective domain mixing for neural machine
                                                                                     translation.,” in WMT (O. Bojar, C. Buck, R. Chatterjee, C. Federmann, Y. Graham,
9.    FUTURE WORK                                                                    B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, and J. Kreutzer, eds.), pp. 118–
                                                                                     126, Association for Computational Linguistics, 2017.
For future research, we plan to explore translation models with                 [11] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk,
action level, i.e., prevent over learning of models by not training                  and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for
                                                                                     statistical machine translation,” in Proceedings of the 2014 Conference on Empir-
them over correct input tokens (action =“OK”). Recently, reinforce-                  ical Methods in Natural Language Processing (EMNLP), (Doha, Qatar), pp. 1724–
ment learning techniques [36] and End-to-End Memory Networks                         1734, Association for Computational Linguistics, Oct. 2014.
                                                                           9
                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00        0.00   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.04
                            r d
                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00        0.00   0.00    0.00    0.00     0.00    0.01    0.04    0.72    0.01
                                                                                                                                                                                      0.8
                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00        0.00   0.00    0.00    0.00     0.01    0.11    0.83    0.03    0.00
                            e d w a


                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00        0.00   0.00    0.00    0.01     0.36    0.55    0.06    0.00    0.00
                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00        0.00   0.00    0.03    0.42     0.48    0.05    0.00    0.00    0.00
                                      0.01   0.01   0.01   0.00   0.00   0.00   0.00   0.01        0.01   0.01    0.52    0.36     0.04    0.01    0.00    0.00    0.00               0.6
                                      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.06        0.01   0.91    0.01    0.00     0.01    0.00    0.00    0.00    0.00
                     Corrected


                                      0.00   0.00   0.00   0.00   0.00   0.00   0.01   0.39        0.49   0.09    0.00    0.00     0.00    0.00    0.00    0.00    0.00
                            i n e


                                      0.00   0.00   0.00   0.00   0.01   0.00   0.02   0.51        0.28   0.17    0.01    0.00     0.00    0.00    0.00    0.00    0.00
                                                                                                                                                                                      0.4
                                      0.00   0.00   0.00   0.00   0.00   0.04   0.30   0.57        0.06   0.03    0.00    0.00     0.00    0.00    0.00    0.00    0.00
                                      0.00   0.00   0.00   0.00   0.01   0.84   0.11   0.01        0.01   0.00    0.00    0.00     0.01    0.00    0.00    0.00    0.00
                            t


                                      0.00   0.00   0.00   0.01   0.09   0.87   0.01   0.01        0.01   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.00
                            i s


                                      0.00   0.00   0.01   0.28   0.65   0.02   0.02   0.01        0.00   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.00               0.2
                                      0.00   0.00   0.02   0.83   0.13   0.00   0.00   0.00        0.00   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.00
                            r


                                      0.01   0.39   0.28   0.27   0.01   0.02   0.00   0.00        0.00   0.00    0.00    0.00     0.00    0.00    0.00    0.00    0.00
                            c h


                                      0.20   0.59   0.14   0.03   0.01   0.00   0.00   0.00        0.00   0.00    0.02    0.00     0.00    0.00    0.00    0.00    0.00
                                       c      h      r      i      s      t      i      b           e               e       d       w       w       a        r       d                0.0
                                                                                            Misspelled


                                                                  Figure 9: Attention Mechanism Heatmap


[12] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” CoRR,                 [26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word rep-
     vol. abs/1508.01211, 2015.                                                                           resentations in vector space,” CoRR, vol. abs/1301.3781, 2013.
[13] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly                      [27] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with
     learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.                                    subword information,” TACL, vol. 5, pp. 135–146, 2017.
[14] A. Graves, “Generating sequences with recurrent neural networks.,” CoRR,                        [28] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with
     vol. abs/1308.0850, 2013.                                                                            subword information,” Transactions of the Association for Computational Linguis-
[15] R. Jackendoff, Semantic Structures. Cambridge, MA: MIT Press, 1990.                                  tics, vol. 5, pp. 135–146, 2017.
[16] R. Weng, S. Huang, Z. Zheng, X.-Y. Dai, and J. Chen, “Neural machine transla-                   [29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
     tion with word predictions.,” in EMNLP (M. Palmer, R. Hwa, and S. Riedel, eds.),                     learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
     pp. 136–145, Association for Computational Linguistics, 2017.                                   [30] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with
[17] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, “On using very large target vocab-                     gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2,
     ulary for neural machine translation,” in Proceedings of the 53rd Annual Meeting                     pp. 157–166, 1994.
     of the Association for Computational Linguistics and the 7th International Joint                [31] H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant,
     Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1–10,                         “The conll-2014 shared task on grammatical error correction,” in Proceedings of
     Association for Computational Linguistics, 2015.                                                     the Eighteenth Conference on Computational Natural Language Learning: Shared
[18] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and Y. Chang, “Learning to rewrite                       Task, pp. 1–14, Association for Computational Linguistics, 2014.
     queries,” in Proceedings of the 25th ACM International on Conference on Informa-                [32] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic
     tion and Knowledge Management, CIKM ’16, (New York, NY, USA), pp. 1443–                              evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of
     1452, ACM, 2016.                                                                                     the Association for Computational Linguistics, 2002.
[19] K. Kukich, “Techniques for automatically correcting words in text,” ACM Com-                    [33] C. Napoles, K. Sakaguchi, M. Post, and J. Tetreault, “Ground truth for grammati-
     put. Surv., vol. 24, pp. 377–439, Dec. 1992.                                                         cal error correction metrics,” in Proceedings of the 53rd Annual Meeting of the As-
[20] W. E. Winkler, “String comparator metrics and enhanced decision rules in the                         sociation for Computational Linguistics and the 7th International Joint Conference
     fellegi-sunter model of record linkage,” in Proceedings of the Section on Survey                     on Natural Language Processing (Volume 2: Short Papers), pp. 588–593, Associa-
     Research, pp. 354–359, 1990.                                                                         tion for Computational Linguistics, 2015.
[21] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” J. ACM,              [34] C. Napoles, K. Sakaguchi, M. Post, and J. R. Tetreault, “GLEU without tuning,”
     vol. 21, pp. 168–173, Jan. 1974.                                                                     CoRR, vol. abs/1605.02592, 2016.
[22] V. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and                   [35] M. Popović, “chrF: character n-gram f-score for automatic MT evaluation,” in
     Reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.                                           Proceedings of the Tenth Workshop on Statistical Machine Translation, (Lisbon,
[23] E. Loper and S. Bird, “Nltk: The natural language toolkit,” in Proceedings of                        Portugal), pp. 392–395, Association for Computational Linguistics, Sept. 2015.
     the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural                   [36] K. Sakaguchi, M. Post, and B. Van Durme, “Grammatical error correction with
     Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02,                           neural reinforcement learning,” in Proceedings of the Eighth International Joint
     (Stroudsburg, PA, USA), pp. 63–70, Association for Computational Linguistics,                        Conference on Natural Language Processing (Volume 2: Short Papers), pp. 366–372,
     2002.                                                                                                Asian Federation of Natural Language Processing, 2017.
[24] B. Dhingra, H. Liu, R. Salakhutdinov, and W. W. Cohen, “A comparative study of                  [37] S. Sukhbaatar, a. Szlam, J. Weston, and R. Fergus, “End-to-end memory net-
     word embeddings for reading comprehension,” CoRR, vol. abs/1703.00993, 2017.                         works,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D.
[25] T. Luong, R. Socher, and C. Manning, “Better word representations with recur-                        Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2440–2448, Curran
     sive neural networks for morphology,” in Proceedings of the Seventeenth Confer-                      Associates, Inc., 2015.
     ence on Computational Natural Language Learning, pp. 104–113, Association for
     Computational Linguistics, 2013.


                                                                                              10