=Paper=
{{Paper
|id=Vol-2385/paper4
|storemode=property
|title=Legal Query Reformulation using Deep Learning
|pdfUrl=https://ceur-ws.org/Vol-2385/paper4.pdf
|volume=Vol-2385
|authors=Arunprasath Shankar,Venkata Nagaraju Buddarapu
|dblpUrl=https://dblp.org/rec/conf/icail/ShankarB19
}}
==Legal Query Reformulation using Deep Learning==
Arunprasath Shankar (LexisNexis, Raleigh, USA, arunprasath.shankar@lexisnexis.com) and Venkata Nagaraju Buddarapu (LexisNexis, Raleigh, USA, venkatanagaraju.buddarapu@lexisnexis.com)

ABSTRACT

Query reformulation is the process of iteratively modifying a query to improve the quality of search engine results. In recent years, the task of reformulating natural language (NL) queries has received considerable attention from both industry and academic communities. Traditionally, query reformulation has mostly been approached using the noisy channel model. Since legal queries are diverse and multi-faceted, these traditional approaches cannot effectively handle low-frequency and out-of-vocabulary (OOV) words. Motivated by these issues, we rethink the task of legal query reformulation as a type of monolingual neural machine translation (NMT) problem, where the input (source) query is potentially erroneous and the output (target) query is its corrected form. We propose a unified and principled framework with multiple levels of granularity. Specifically: (i) an encoder with character attention which augments the subword representation; (ii) a decoder with attentions that enable the representations from different levels of granularity to control the translation cooperatively; and (iii) a semi-supervised methodology to extract and augment a large-scale dataset of NL query pairs combining syntactic and semantic operations. We establish the effectiveness of our methodology using an internal dataset, where the training data is automatically obtained from user query logs. We further demonstrate that training deep neural networks on additional data with synthesized errors can improve translation performance.

1. INTRODUCTION

Technology and innovation are transforming the legal profession in manifold ways. The legal industry is undergoing significant disruption, and machine intelligence is arguably advancing most in the area of discovery and search. Technologies are moving from basic keyword searches to predictive coding, which involves algorithms that predict whether a document is relevant or not. In [1], McGinnis et al. envisage two phases of technological change in this area. The first phase, expected to come in the next 10 years, involves perfecting semantic search that will allow lawyers to input NL queries to interfaces, with systems responding to those queries directly with relevant information. The second phase involves technology that is able to identify issues, given a set of facts, and then suggest relevant authorities that might apply to those issues.

Query reformulation or correction is a crucial component of any service that requires users to type in NL, such as a search engine. It is an iterative process in which a user reformulates (rewrites) queries to improve search results or gain new information, using either prior knowledge or assisted tools. Legal query reformulation is not a trivial task, as legal search queries are often short, complex and lacking in context. Automatic query correction systems are among the most widely used tools within NL applications. Search engines support users in this task in two ways: (a) explicitly, by suggesting related queries or query completions, or (b) implicitly, by expanding the query to improve the quality and recall of organic results. Successful reformulations must closely align with the original query both syntactically, as sequences of characters or words, and semantically, often involving transparent taxonomic relations.

From a probabilistic perspective, given an incorrect query q− that we wish to reformulate to a correct query q+, we seek to model q+ = argmax_q P(q | q−), where q is the original query or ground truth. In a traditional noisy channel model [2], the model consists of two parts: (i) a language model, i.e., P(q), that represents the prior probability of the intended correct input query; and (ii) an error model, i.e., P(q− | q), that represents the process by which the correct input query gets corrupted into its incorrect form. There are several drawbacks to this approach: (i) we need two separate models, and error in estimating one model affects the performance of the final output; (ii) it is not easy to model the channel, since the mistakes have many sources, e.g., typing too fast, unintentional keystrokes, phonetic ambiguity, etc.; and (iii) it is not easy to obtain clean training data for the language model, as the input does not follow what is typical in NL. Since the end goal is to get a query that maximizes P(q | q−), can we directly model this conditional distribution instead?

In this work, we explore this route, which bypasses the need for multiple models and avoids compounding errors from multiple sources. We achieve this by applying the encoder-decoder framework combined with attention learning using recurrent neural networks [3], rethinking the query correction problem as an NMT problem in which the incorrect input is treated as a foreign language. We also propose a semi-supervised methodology to perform error detection and construct a large-scale dataset for training and experimentation. To the best of our knowledge, there is no prior research on the idea of using attention learning in a legal setting. Ours is also the first work that uses neural machine translation based methodologies on user-generated legal data for query reformulation. We demonstrate that NMT models can successfully be applied to the task of query reformulation in a legal setting, and validate that the trained models are naturally capable of handling orthographic errors and rare words, and can flexibly correct a variety of error types.
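The contrast between the noisy channel decomposition and direct conditional modeling can be made concrete with a toy candidate reranker. Everything here, including the candidate list, the scoring callables and their names, is an illustrative assumption rather than the system described in this paper:

```python
def noisy_channel_correct(query, candidates, lm_logprob, error_logprob):
    # Classic noisy channel: argmax_q P(q) * P(q-|q), combining two
    # separately trained models whose estimation errors can compound.
    return max(candidates, key=lambda q: lm_logprob(q) + error_logprob(query, q))

def direct_correct(query, candidates, cond_logprob):
    # Direct conditional modeling: argmax_q P(q|q-) with a single model,
    # the route the NMT formulation takes.
    return max(candidates, key=lambda q: cond_logprob(q, query))
```

The two functions return the same kind of answer; the difference is purely in how many models must be estimated and kept consistent.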
We further find that augmenting the network training data with queries containing synthesized errors can result in significant gains in performance.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. ©2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.

2. APPLICATION

An “answer card” is a search feature, usually displayed in a box or panel, that appears above organic results and tries to directly answer a question. Most answer boxes are primarily text, containing a relatively short answer, and may provide limited information based on the user's entered query. In a legal context, the source can be a legal question, a profile search, or many other forms. Figure 1 shows a card that answers the legal question “define motion to compel” and Figure 2 shows a profile card for a judge search using the query “who is judge antonin scalia ?”.

Figure 1: Answer Card (example card: “Motion to compel: A motion to compel asks the court to order either the opposing party or a third party to take some action.”)

Vertical engines are search engines that specialize in different types of search. Answer cards are usually backed by vertical search engines that are tightly coupled with AI/NLP systems. These NLP systems may be machine/deep learning models recognizing and classifying entities that trigger and render the respective cards based on the user's query intent [4]. NLP models rely heavily on vocabulary, and any OOV words contained in the query can have an adverse effect on the coverage or functioning of answer cards.

Figure 2: Profile Card (example card: “Antonin Scalia: Former Associate Justice, Supreme Court (U.S.); Born: March 11, 1936, Trenton, NJ; Died: February 13, 2016; Appointed by: Ronald Reagan”)

The scope of this research is to develop a framework to detect these erroneous queries and autocorrect them. The framework acts as an extra layer between the answer cards and the NLP recognition systems, providing a coherent mechanism for coping with missing information (vocabulary), which in turn enables us to return more appropriate cards. For example, for the above-mentioned answer card queries, suppose a user instead typed “defin e motiom to compell” or “whi is jufge antonin scakia?”; our deployed framework would fix (reformulate) the queries to their correct form and display the cards successfully.

3. BACKGROUND AND RELATED WORK

There are two major areas related to our research and the proposed models: first, work on the task of query reformulation in information retrieval, involving traditional, statistical and neural approaches; and second, the process of error detection (selecting candidate queries that are incorrect) for training the reformulation models. We introduce related studies specific to these areas in this section.

3.1 Query Reformulation

3.1.1 Traditional Approaches: In [5], Xu et al. used the top results retrieved by the original query to perform reformulation by expansion. This method is popular and influenced by the initial ranking of results; however, it cannot utilize user-generated data. Other approaches focused on using user query logs to expand a query by means of clickthrough rate [6], co-occurrence in search sessions, or query similarity based on click-through graphs [7]. The advantage of these approaches is that user feedback is readily available in user query logs and can efficiently be precomputed. However, these approaches all share the problem of a resource dependency on a specific search engine.

3.1.2 Statistical Machine Translation: The task of machine correction is defined as correcting an N-character (or word) source query S = s_1, ..., s_N into an M-character (or word) target query T = t_1, ..., t_M. Thus, the correction system can be defined as a function F:

  T_D = F(S)    (1)

which returns a correction hypothesis T_D given an input query S. Recently, several studies have adopted data from user query logs as input to Statistical Machine Translation (SMT) for query reformulation and expansion [8][9]. These methods treat user queries as one language and the reformulated queries as another. However, their major drawback is the difficulty of modeling corrections at different granularities, e.g., characters or words, which is necessary to reduce the rate of unknown words that are detrimental to their proper functioning. Our approach differs in several ways. First, we consider full query pairs as training data, and do not use single word tokens as a primary mode of operation. Second, we do not train explicit error models P(w|s) for words w and observed corrections s, but use standard word/character based NMT models to derive more meaningful and accurate lexical translation models.

3.1.3 Neural Machine Translation: Recently, the use of neural networks has delivered significant gains for mapping tasks between pairs of sequences, owing to their ability to learn a better representation of the data and context. Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data and modeling long-distance dependencies, a common phenomenon in human language. NMT models [10] use RNNs to learn to map from source language input to target language output via continuous-space intermediate representations. We use an encoder-decoder with uni/bi-directional RNNs and an attention mechanism as the core component of our proposed system [11]. The encoder maps the input query to a higher-level representation with a uni- or bi-directional RNN architecture similar to that of [12]. The decoder also employs an RNN that uses a content-based attention mechanism [13] to attend to the encoded representation and generate the output query one character at a time.
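At inference time, the correction function F of equation (1) amounts to autoregressive decoding from a trained model. A minimal greedy sketch is shown below; the `step` callable stands in for an arbitrary trained decoder, and the `<s>`/`</s>` marker strings are hypothetical, not the paper's actual symbols:

```python
def greedy_decode(source, step, sos="<s>", eos="</s>", max_len=100):
    """Greedy character-by-character decoding, realizing T_D = F(S).

    `step(source, prefix)` is assumed to return a dict mapping each
    candidate next character to its conditional probability given the
    source query and the decoded prefix."""
    prefix = [sos]
    while len(prefix) < max_len:          # max_len guards against non-terminating models
        probs = step(source, prefix)
        best = max(probs, key=probs.get)  # greedy: take the single most likely character
        if best == eos:
            break
        prefix.append(best)
    return "".join(prefix[1:])            # strip the start marker
```

Beam search would replace the single `best` pick with a set of scored partial hypotheses, but the overall loop shape is the same.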
3.1.4 Character Level Reasoning: Word-level NMT models are poorly suited to handling OOV words due to their fixed vocabulary [14]. Recent works have proposed workarounds for this problem. Since a word is usually thought of as the basic unit of language communication [15], early NMT systems built their representations starting from the word level [16]. Machine learning/NLP models typically limit vocabulary size due to the complexity of training [17]. Therefore, they are unable to translate rare words, and it is standard practice to replace OOV words with an unknown (UNK) symbol. By doing so, useful information is discarded, resulting in systems that are unable to correct erroneous words. In our framework, we strive to circumvent this issue by using smaller units, such as subwords, to address the problem of OOV words.

3.2 Error Detection

In most translation systems, before any reformulation is carried out, a detection process is conducted on the input query to extract any potentially incorrect words. In [18], He et al. proposed a learning-to-rewrite framework that focuses on candidate ranking for error detection. A non-word error refers to a potentially incorrect word that does not exist in a given dictionary. Dictionary lookup is one of the basic techniques employed to compare input strings with the entries of a language resource, e.g., a lexicon or corpus [19]. Such a language resource must contain all inflected forms of the words and must be updated regularly; if a given word does not exist in the resource, it is marked as potentially incorrect, which is a major disadvantage of this approach.

Minimum edit distance is one of the most studied techniques for error detection. It is based on counting edit operations, which are defined in most systems as insertion, deletion, substitution and transposition. Jaro-Winkler [20], Wagner-Fischer [21] and Levenshtein [22] are among the most famous edit distance algorithms. This approach has a main drawback, in that it penalizes all change operations in the same way, without taking into account the character used in the change operation. In contrast to the above-mentioned approaches, our error detection methodology combines both syntactic and semantic attributes of a user query to extract incorrect non-word occurrences. Here, we establish a unified framework (see Figure 3) that encompasses aspects like frequency distribution, query rank and subword representations for error detection, alongside NMT and attention learning.

Figure 3: Proposed Framework (user query logs, legal corpora and legal name lists feed error detection and data augmentation; the resulting training pairs, together with fastText embeddings, train the NMT model that reformulates user input)

4. PROPOSED FRAMEWORK

4.1 Query Preparation

For our experiments, we collected a total of ∼62M distinct queries grouped by frequency, gathered over a period of 5 years. The queries under study included a considerable proportion of boolean queries, ∼32M of them. We used a heuristic approach to filter search queries: using a combination of regular expressions and strict constraints, we excluded any query with boolean or special character connectors, which left us with ∼29M NL queries. We used the punkt sentence tokenizer [23] to tokenize the queries into words. The tokens were further filtered to remove noise and short forms; any digit-only token of length ⩾ 5 was excluded from our study. The created vocabulary contained ∼1M words (tokens). Table 1 shows the top 10 ranked queries (left) and tokens (right) at the end of the query preparation process.

  Rank  Query                    Rank  Token
  1     breach of contract       1     motion
  2     unjust enrichment        2     contract
  3     summary judgement        3     court
  4     res judicata             4     judgment
  5     fraud                    5     insurance
  6     motion to dismiss        6     evidence
  7     conversion               7     property
  8     statute of limitations   8     attorney
  9     defamation               9     liability
  10    civil procedure          10    statute

Table 1: Top 10 Queries and Tokens

4.2 Query Representation

The first step in creating a model is to choose a representation for the input and output. For instance, a query can be represented at the word level, where the data is transferred as indices referring to words, or at the character level, where the data is transferred as indices referring to a closed vocabulary of language-specific characters and symbols. In this work, we anticipate the need to handle an exponentially large input space of tokens by choosing a character-level query representation. Thus, model inputs are individual characters, i.e., a sequence of indices corresponding to characters. We defined special characters to mark the start and end of each query; these are later used in the training process as a criterion. Note that in the character-level representation we also considered spaces, certain diacritics and punctuation as unique characters, i.e., 40 unique characters in all. Once a query is mapped into indices, the characters are embedded: each character is represented by a one-hot vector and multiplied by a trainable matrix of size input × embedding size. The embedding allows the model to group together characters that are similar for the task.

4.3 Candidate Selection

Our scoring heuristic calculates a sequence of feature scores for each token in the prepared vocabulary. The key motivation is to separate these tokens into +ve (correct) and −ve (incorrect) buckets. We achieve this by combining three distributional attributes.
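The character-level representation in Section 4.2 can be sketched concretely. The marker symbols and the exact 40-character inventory are not recoverable from the text, so the vocabulary below is an illustrative assumption:

```python
# Toy character inventory with start/end markers; the paper's actual
# inventory (40 characters including diacritics) is assumed, not known.
CHARS = ["<s>", "</s>"] + list("abcdefghijklmnopqrstuvwxyz0123456789 .?-")
CHAR2IDX = {c: i for i, c in enumerate(CHARS)}

def encode_query(query, max_len=100):
    """Map a query to a fixed-length sequence of character indices,
    wrapped in start/end markers and padded with the end marker."""
    ids = [CHAR2IDX["<s>"]]
    ids += [CHAR2IDX[c] for c in query.lower() if c in CHAR2IDX]
    ids.append(CHAR2IDX["</s>"])
    ids += [CHAR2IDX["</s>"]] * (max_len - len(ids))  # pad to fixed length
    return ids[:max_len]
```

Each index would then select a row of the trainable embedding matrix, which is what "one-hot vector multiplied by a trainable matrix" amounts to in practice.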
First, since our vocabulary is derived from user log queries, we can associate each token with the ranks of its constituent queries. Second, we can associate a token with its individual frequency within the vocabulary. And third, we can use the membership of a token in selected legal lexicons. Our selected lexicons comprise vocabulary extracted from headnotes (∼217K) derived from US legal caselaw documents, and legal names consisting of judge (∼67K) and expert witness (∼388K) names derived from legal master databases. We also use a secondary lexicon of common US English names (∼234K).

Let Q be an ordered set of distinct legal queries derived from user logs and ranked by frequency, Q ⇒ {q_1, q_2, ..., q_r}, where r is the rank of the query, r ∈ R. Let R denote an ordered set of ranks, R ⇒ {1, 2, ..., L}, where L is the length of Q, i.e., the total number of distinct queries. Let ▽ ⇒ {x_1, x_2, ..., x_l} denote the vocabulary of distinct tokens derived from Q, where l is the length of the individual query. Any given token x in vocabulary ▽ can belong to a subset of queries Q̂ ⊆ Q. This implies that the token can be mapped to a closed set of ranks R̂ ⊆ R. E.g., the token "discriminatory" can belong to multiple queries within Q, hence R̂ could be a closed set such as {6, 37, 763, 23445}.

4.3.1 Mean Rank Score: Given R̂ ⊆ R for a token x, R̂ can be written as {r_1, r_2, ..., r_l}. We compute the mean rank score (ξ) as follows:

  ξ = ( Σ_{i=1..l} (1 − r_i / L) ) / |R̂|    (2)

where |R̂| is the size of R̂.

4.3.2 Neighbor Rank Score: A given token x can co-occur alongside other tokens from the vocabulary within the queries under investigation. Let ▽̂ ⇒ {x_1, x_2, ..., x_l} denote the subset of tokens co-occurring alongside x. For every candidate token inside ▽̂ we have a corresponding mean rank score ξ derived from equation 2. We restrict the window of co-occurrence to 2 neighbors, one before and one after the token. Let ξ̂ be the set of corresponding mean rank scores. The neighbor rank score (χ) is computed as:

  χ = ( Σ_{i=1..l} ξ_i ) / |ξ̂|    (3)

4.3.3 Frequency Score: Let ϵ denote the frequency of a token x ∈ ▽. The frequency score (ζ) is a normalized value representing the importance of a token's frequency of occurrence in the corpus. It is defined as:

  ζ = ϵ / |▽|    (4)

4.3.4 Lexicon Membership Score: Let ▽H denote the vocabulary of tokens derived from headnotes, ▽J the vocabulary derived from the judge master (judge name database), ▽E the tokens from the expert witness database, and ▽C all tokens from the common valid US English names vocabulary. For any given token x, the lexicon membership score (ϕ) is computed as follows:

  ϕ = 1 if (x ∈ ▽H) ∨ (x ∈ ▽J) ∨ (x ∈ ▽E) ∨ (x ∈ ▽C); otherwise ϕ = 1/2    (5)

4.3.5 Relevance Score: Finally, a relevance score (Ω) is computed by combining the scores derived from equations 2, 3, 4 and 5:

  Ω = ϕ · (1 − 1 / (e^{2ξ} + 1)) · (1 − e^{−χ}) · (1 − 1 / (1 + ζ))    (6)

The relevance score expresses the relative importance of a token and can be used to classify a word into the +ve or −ve bucket. We establish this separation using a cut-off threshold (∼0.83) obtained via trial-and-error experiments.

4.4 Ranking Strategy using FastText

Traditional approaches to representing rare and OOV words assign a random vector [24] or bin all rare words into a new "UNK" word type. Another popular technique is to encode word definitions with an external resource (Long et al., 2016) and train at the morpheme level [25]. Creating semantic representations for OOV words is a difficult NLP task, and word-based approaches (Mikolov et al., 2013; aka word2vec) [26] do not contribute much to resolving this problem. Character-based embedding approaches are better suited to handling the OOV problem (Bojanowski et al., 2017; aka fastText) [27]. For mapping the +ve tokens to their corresponding variants (incorrect −ve counterparts), we used fastText [28]. fastText is a character-based embedding approach that is well suited to producing embeddings for OOV words. It does this by learning vectors for character n-grams within the word and summing those vectors to produce the final vector (embedding) for the word itself. For training, we set the embedding size to 100 and the training window to 10. We also set the alpha value to 0.025, and the min and max values of n to 2 and 6, respectively. The embedding matrix was created using a batch size of 10K words over 100 epochs. We trained fastText embeddings on the 29M NL queries using a p3.2xlarge EC2 instance with 1 Tesla V100 GPU. Table 2 shows the top 10 most similar words for the OOV +ve word "roberts" (left) and the OOV −ve word "breach" (right).

  Token      Score    Token     Score
  roberds    0.9967   cbreach   0.9839
  robers     0.9962   fbreach   0.9771
  robergs    0.9913   onbreach  0.9684
  robertus   0.9851   breachh   0.9683
  robertson  0.9797   bebreach  0.9659
  robertt    0.9777   breachj   0.9657
  rowert     0.9766   breachn   0.9643
  rochfort   0.9758   ofbreach  0.9641
  rogerts    0.9756   mcbreach  0.9627
  robtert    0.9731   orbreach  0.9612

Table 2: Top-10 Similar Words by fastText

4.5 Query Augmentation

Legal names such as judge, expert witness and attorney names are often misspelled by users during search. Since we were not able to capture a considerable amount of these errors from query logs (real-world data), we decided to synthetically augment these types of errors and generate queries for legal names algorithmically. For this purpose, we took all the names (+ve instances) from our legal corpora and applied a variant of the Needleman-Wunsch algorithm to create their −ve counterparts.
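Read literally, equations (2)-(6) chain four bounded factors into one relevance score. The sketch below is an assumed reconstruction; the exact normalizations in the typeset formulas are hard to recover, and every function name here is our own:

```python
import math

def mean_rank_score(ranks, L):
    # eq (2): average of rank-normalized scores over the queries
    # containing the token (low rank = frequent query = high score)
    return sum(1 - r / L for r in ranks) / len(ranks)

def neighbor_rank_score(neighbor_xis):
    # eq (3): mean of the co-occurring neighbors' mean rank scores
    return sum(neighbor_xis) / len(neighbor_xis)

def frequency_score(freq, vocab_size):
    # eq (4): token frequency normalized by vocabulary size
    return freq / vocab_size

def lexicon_membership(token, lexicons):
    # eq (5): 1 if the token appears in any legal/name lexicon, else 1/2
    return 1.0 if any(token in lex for lex in lexicons) else 0.5

def relevance(phi, xi, chi, zeta):
    # eq (6): product of bounded factors, each in (0, 1)
    return (phi
            * (1 - 1 / (math.exp(2 * xi) + 1))
            * (1 - math.exp(-chi))
            * (1 - 1 / (1 + zeta)))
```

Because every factor lies in (0, 1), the combined Ω is also bounded, which makes the single cut-off threshold (∼0.83 in the paper) meaningful across tokens.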
The distance computation is identical to Levenshtein, except that character mistakes are given different weights depending on how far apart two characters are on a standard keyboard layout (QWERTY). The weights are assigned by projecting the keyboard characters onto a cartesian system. For example, A to S is given a mistake weight of 0.4, while A to D is given 0.6. We generate query pairs using 4 types of transformation: addition, deletion, replacement and transposition. Table 3 shows a few of the generated examples for augmentation.

  Misspelled         Corrected          Operation
  mchael lowy        michael lowy       addition
  david yamhamoto    david yamamoto     deletion
  christibe ballard  christine ballard  replacement
  warren glciker     warren glicker     transposition

Table 3: Augmented Queries

4.6 Training Pairs

In previous sections, we discussed the need to segregate the vocabulary into +ve and −ve tokens and the use of fastText character n-grams (subwords) to capture word mappings (+ve to −ve). At the end of the candidate selection process, the entire vocabulary is broken into two groups of ∼180K +ve and ∼830K −ve tokens. At the end of ranking using fastText, we end up with around ∼2.4M legal term mappings and ∼50K legal name mappings. The legal name mappings are expanded via the augmentation process elaborated in the previous section to around ∼1.5M pairs. The legal term and name mappings are then matched against the indexed 60M user queries to create query pairs for training our translation models.

4.7 Neural Machine Translation

Most NMT systems follow the encoder-decoder framework with attention mechanism proposed by Bahdanau et al. [29]. Given a source query q− = q_1− ... q_i− ... q_I− and a target query q+ = q_1+ ... q_j+ ... q_J+, we aim to directly model the translation probability as:

  P(q+ | q−; Θ) = Π_{j=1..J} P(q_j+ | q_{<j}+, q−; Θ)    (7)

4.7.1 Encoder: The encoder reads the input query one character at a time, updating its hidden state and using that hidden state for the next input character. Figure 4(a) portrays a simple overview of the encoder.

Figure 4: Encoder and Decoder ((a) encoder: input and previous hidden state → embedding → GRU → output hidden state; (b) decoder: input and previous hidden state → embedding → GRU → ReLU → softmax → output)

Following [29], we use a bi-directional RNN with Gated Recurrent Units (GRUs) to encode the source query:

  →h_i = GRU(→h_{i−1}, s_i; →Θ)
  ←h_i = GRU(←h_{i+1}, s_i; ←Θ)    (8)

where s_i is the i-th source embedding, GRU is a gated recurrent unit, and →Θ and ←Θ are the parameters of the forward and backward GRU, respectively. The annotation of each source character q_i− is obtained by concatenating the forward and backward hidden states:

  h_i = [→h_i ; ←h_i]    (9)

4.7.2 Decoder: The decoder is also built using an RNN. It takes the representation of the input context along with the previous hidden state and output character prediction, and generates a new hidden decoder state and a new output character prediction. The decoder is a forward (uni-directional) RNN with GRUs predicting the translation y character by character. This prediction takes the form of a probability distribution over the entire output vocabulary (100 characters). At every step of decoding, the decoder is given an input character and a hidden state; the initial input character is the start-of-query character.

Figure 5: Encoder Decoder with Attention (input character embeddings feed left-to-right and right-to-left GRUs; attention over the input context produces the hidden state and the output character predictions, e.g., "S C A L I A")

4.8 Attention Mechanism

The attention mechanism was introduced to address the limitations of modeling long dependencies and the efficient usage of memory for computation. It intervenes as an intermediate layer between the encoder and the decoder, with the objective of capturing the information from the sequence of tokens that is relevant to the contents of the query [13]. The mechanism is shown in Figure 5. The basic problem the mechanism solves is that instead of forcing the network to encode all the information into one fixed-length vector, it allows the network to make use of the entire input sequence. Key to our approach is the use of a character-based model with an attention mechanism, which allows orthographic errors to be captured and avoids the OOV problem suffered by word-based NMT methods. Unlike the plain encoder-decoder model, which uses the same context vector for every hidden state of the decoder, attention computes the context vector c_j as a weighted sum of the source annotations:

  c_j = Σ_{i=1..I} Δ_ji · h_i    (12)

where the attention weight Δ_ji is computed as:

  Δ_ji = exp(Ξ_ji) / Σ_{i=1..I} exp(Ξ_ji)    (13)

  Ξ_ji = α_a^T tanh(β_a ϕ_{j−1} + γ_a h_i)    (14)

where α_a, β_a and γ_a are weight matrices, and Ξ_ji is the model that scores how well ϕ_{j−1} and h_i match.

4.9 Gated Recurrent Unit (GRU)

In practice, a simple RNN is difficult to train properly due to the vanishing/exploding gradient problems described in [30]. Therefore, in this work we utilize the GRU (Cho et al., 2014 [11]), an improved version of the simple RNN that can alleviate the gradient problem. Along the lines of the Long Short-Term Memory (LSTM), in GRUs the forget and input gates are coupled into an update gate z_t. The advantage of GRUs over LSTMs is the smaller number of gates, which makes them less memory- and compute-intensive, often a critical aspect for NMT.

Given an input sequence (x_1, x_2, ..., x_N), a GRU can be adopted as an encoder to compute the corresponding sequence of hidden states (h_1, h_2, ..., h_N) as:

  z_t = σ(W_xz x_t + U_hz h_{t−1})    (15)
  r_t = σ(W_xr x_t + U_hr h_{t−1})    (16)
  h̃_t = tanh(W_xh x_t + U_rh (r_t ⊗ h_{t−1}))    (17)
  h_t = (1 − z_t) ⊗ h_{t−1} + z_t ⊗ h̃_t    (18)

where σ is the sigmoid function and ⊗ is an element-wise multiplication operator. z_t, r_t and h̃_t are the update gate, reset gate and candidate activation, respectively. W_xz, W_xr, W_xh, U_hz, U_hr and U_rh are the related weight matrices.
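Equations (15)-(18) translate nearly line-for-line into array code. A minimal NumPy sketch of one GRU step and an encoder loop (bias terms omitted, matching the equations as printed; weight names mirror the matrices above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following equations (15)-(18).

    params holds the weight matrices W_xz, W_xr, W_xh (input-to-hidden)
    and U_hz, U_hr, U_rh (hidden-to-hidden)."""
    z = sigmoid(params["W_xz"] @ x_t + params["U_hz"] @ h_prev)              # update gate (15)
    r = sigmoid(params["W_xr"] @ x_t + params["U_hr"] @ h_prev)              # reset gate (16)
    h_tilde = np.tanh(params["W_xh"] @ x_t + params["U_rh"] @ (r * h_prev))  # candidate (17)
    return (1 - z) * h_prev + z * h_tilde                                    # new state (18)

def gru_encode(xs, h0, params):
    """Run the GRU over a sequence of embeddings, returning all hidden states."""
    hs, h = [], h0
    for x_t in xs:
        h = gru_step(x_t, h, params)
        hs.append(h)
    return hs
```

Because each new state is a convex combination of the previous state and a tanh-bounded candidate, hidden states stay bounded, which is part of why GRUs train more stably than plain RNNs.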
EXPERIMENTS ( 5.1 Sequence Representation #») Ξji = α aT tanh βa ϕ j−1 + γa hi (14) An input or output sequence for an error correction system can be represented in different levels. At the character-level, a sequence where α a , βa and γa are the weight matrices, and Ξji is the model is processed character by character. When searching for errors, hu- »#»# that scores how well ϕ j−1 and hi match. mans often consider a bigger sequence of characters at word-level. 6 Clause-level, phrase-level, sentence-level and text-level are other Input Input common representations for modeling NMT sequences. In our re- 10 100 search, we worked at character-level where each character in an Embedding Embedding input sequence is mapped to a real-valued number and its corre- 10 X 1024 100 X 1024 sponding embedding. In order to model linguistic dependencies in GRU GRU each sequence, we took a selected list of characters into account 1024 tanh 1024 tanh including space. This enabled us to deal with different kinds of er- Repeat Vector Repeat Vector 10 X 1024 100 X 1024 rors and a larger range of characters in each sequence. On the other hand, the output of our models were also represented at the char- GRU GRU 1024 tanh 1024 tanh acter level. Time Distributed Time Distributed 10 X 144964 ReLU 100 X 40 ReLU 5.2 Selection Strategy (a) Word-based (b) Char-based In our experiments, for tuning and evaluation purposes we used a gold reference annotation following a simple selection strategy. Figure 6: Architecture - NMT I and II The strategy is used to check whether for a search query pair (x, y), the left query x is erroneous, and if so, whether the similar candi- date y is a correct correction or else provide the most likely cor- rection. If x was not incorrect, we propagate query x to the gold Input 1 100 Input 2 100 reference y, i.e. for those cases, we have it identity as a true neg- Embedding 1 Embedding 2 ative. 
For training our models, we split the ∼4M query pairs into three splits - train, dev and test - in the ratio 99.995 : 0.005 : 0.005 respectively. This produces 20K queries each for the dev and test sets. The true negatives in these sets (i.e. entries that do not need to be corrected) account for about ∼2%. The validation of the annotation was mostly done by subject matter experts.

5.3 Model Architecture

In this section, we discuss the various architectures we experimented with for neural machine translation.

5.3.1 NMT I. The first architecture we experimented with for the task of NMT is a word-based model. The input layer uses a dense vector representation of the vocabulary (144,964 words) derived from query pairs, followed by an embedding layer of dimension 10 × 1024. The length of the query sequence is constrained to a fixed maximum of 10 words. This architecture uses a repeat vector along with 1024 GRU units. It is the only word-based architecture we experimented with for NMT.

5.3.2 NMT II. Architecture II follows a similar design to NMT I, except the word sequence is replaced with a character sequence. The sequence length here is constrained by two factors: (i) the total number of characters that constitute the vocabulary (40 characters), which includes special characters, and (ii) the length of the sequence, which is set to a fixed maximum of 100 characters. Neither NMT I nor NMT II uses attention, and each uses a single embedding layer. Figure 6 shows the architectural layers for NMT models I and II.

5.3.3 NMT III. In the third architecture setup, we use two character embedding layers, which differs from the NMT I and II architectures that use only one embedding. NMT III also does not use a repeat vector layer. This is a character-based-only architecture and does not use attention. Figure 7 illustrates the architecture of NMT model III. The dimensions of the embedding and time-distributed layers are similar to NMT II.

Figure 7: Architecture - NMT III. Input 1 / Input 2 (100) → Embedding 1 / Embedding 2 (100 × 1024) → GRU 1 / GRU 2 (1024, tanh) → Time Distributed (100 × 40, ReLU).

5.3.4 NMT IV. Model IV is the first setup to use an attention layer. It is similar to NMT III in using two character embeddings, one at the encoder and the other at the decoder level. See Figure 8.

Figure 8: Architecture - NMT IV. Input 1 / Input 2 (100) → Embedding 1 / Embedding 2 (100 × 1024) → GRU 1 / GRU 2 (100 × 1024) → Dot 1 (100 × 100) → Attention (100 × 100) → Dot 2 (100 × 1024) → Concatenate (100 × 2048) → Time Distributed 1 (100 × 1024) → Time Distributed 2 (100 × 40).

5.3.5 Bi-directional Architectures. We also experimented with the previously discussed neural architectures (NMT I to IV) by replacing uni-directional GRUs with bi-directional GRU units. This change is only applied to the encoder component of the architecture.

5.4 Training

For our experiments, we used a p3.16xlarge EC2 instance with 8 Tesla V100 GPUs. Table 4 shows the training time (in minutes) for the various architecture setups (100 epochs with a batch size of 512). The NMT models were implemented using TensorFlow.

Model          Avg Time
NMT I (Uni)    87.41
NMT I (Bi)     126.32
NMT II (Uni)   95.76
NMT II (Bi)    163.73
NMT III (Uni)  114.05
NMT III (Bi)   239.81
NMT IV (Uni)   168.25
NMT IV (Bi)    320.94

Table 4: NMT - Training Time

6. EVALUATION METRICS

This section presents the methods used to evaluate the NMT models. Although various methods can be used to evaluate a machine translation system, for our use case of query reformulation, evaluation reduces to a binary classification of predictions, where matching elements are considered true predictions and all others false. However, a good evaluation must include more detail about this comparison. Over the years, a number of metrics have been proposed for the evaluation of query correction, each motivated by weaknesses of previous metrics. There is no single best evaluation metric, and the performance of a metric depends on the research goals and application. Thus, we have evaluated our system based on the most popular metrics to date.

In classification tasks, accuracy is one of the most widely used performance measures. Accuracy corresponds to the ratio of correctly classified inputs to the total number of inputs. One drawback of this metric is that correction actions are completely ignored. In order to bring the correction actions into scope, we use additional performance metrics: precision, recall and F-score. For our experiments, we use F_0.5, since it places twice as much emphasis on precision as on recall. This metric was also used in the CoNLL-2014 shared task [31]. We also use modified metrics such as BLEU, GLEU and the character n-gram F-score (CHRF), thus gaining better insight into the translation system's performance. These metrics are explained in the following subsections.

6.1 BLEU

The Bilingual Evaluation Understudy score (BLEU) is a widely popular metric for machine translation [32]. For our evaluations, in order to apply BLEU to individual queries, we used smoothed BLEU, whereby we add 1 to each of the n-gram counts before we calculate the n-gram precisions. This prevents any of the n-gram precisions from being zero, and thus yields non-zero values even when there are no 4-gram matches.

6.2 GLEU

Recently, Napoles et al. adapted the BLEU metric for the evaluation of grammatical error correction systems and proposed the Generalized Language Evaluation Understanding metric (GLEU) [33][34]. For GLEU, the precision is modified to assign extra weight to the n-grams that are present in the reference and the hypothesis, but not those of the input. We used the last update of the original implementation of GLEU introduced in [34].

6.3 Character n-gram F-score (CHRF)

CHRF calculates the sentence-level character n-gram F-score as described in Maja Popović, 2015 [35]. It has been shown to correlate very well with human rankings of different machine translation outputs, especially for morphologically rich target languages. For our study, we use CHRF1 (standard F-score, β = 1) with uniform n-gram weights.

7. RESULTS

In the previous section, we presented the translation models for the task of query reformulation and explained their details. In this section, we discuss the results obtained from the models. Tables 6 and 7 show the results of our correction models using the evaluation metrics discussed in the previous section.

7.1 Baseline

We define pairs of source queries and their ground-truth annotations as the baseline for our NMT models. In this baseline, we assume that none of our implemented models intervenes in the correction task and that only the references are considered as corrections. Simply put, the baseline is a model that makes no corrections on the input query. It enables us to interpret the performance of each model in comparison to the default results.

7.2 Winners

As expected, since the baseline system contains the ground-truth correction, the BLEU and GLEU scores for the baseline system have a maximum value of 1.00. Using these metrics, the bi-directional NMT model with the attention layer (model IV) shows higher scores in comparison to the other models, i.e., F-score = 94.11%, BLEU = 0.9255 and GLEU = 0.9256. Interestingly, the CHRF scores of models III and IV (bi-directional) are almost identical. The choice of metric also reveals a few other insights: for example, CHRF performs poorly on the addition operation. Similarly, GLEU is a bad choice for the replacement and transposition operations, while it performs very well on the other types of operations.

               Uni                      Bi
Model      P      R      F0.5       P      R      F0.5
NMT I      83.76  85.37  84.08      83.91  81.01  83.31
NMT II     86.24  84.61  85.90      88.64  87.22  88.35
NMT III    91.17  80.04  90.74      92.77  93.86  92.98
NMT IV     93.08  90.79  92.61      93.61  96.19  94.11

Table 6: NMT Results - Standard Metrics (%)
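The add-one smoothing described in Section 6.1 can be sketched with a minimal sentence-level BLEU implementation. This is an illustrative sketch under one common reading of that smoothing (1 added to the numerator and denominator of each n-gram precision, uniform weights up to 4-grams), not the exact evaluation code used in our experiments:

```python
from collections import Counter
import math

def smoothed_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n-gram counts,
    so short queries without any 4-gram match still score above zero.
    Sequences are lists of tokens (words or, as in our setup, characters)."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped n-gram matches, smoothed by adding 1 to both counts.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        log_prec_sum += math.log((matches + 1) / (total + 1))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    # Uniform weights 1/max_n over the n-gram orders.
    return bp * math.exp(log_prec_sum / max_n)
```

A perfect correction scores 1.0, while a near-miss such as `smoothed_bleu(list("contarct"), list("contract"))` scores strictly between 0 and 1 rather than collapsing to 0 on a missing 4-gram.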
Incorrect → Correct
contact void due to barratry or champterty → contract void due to barratry or champerty
nondisclsoure unenforceable lackof consideration → nondisclosure unenforceable lack of consideration
summary judgment for breach o fcontract → summary judgment for breach of contract
declaratry judgmentnd discovery → declaratory judgment and discovery
business judgment rulegross negligencce → business judgment rule and gross negligence
tennesee power of attorneyey → tennessee power of attorney
collaterale stopp el amount of damages → collateral stoppel amount of damages
agreement to remove fence statue of drauds → agreement to remove fence statute of frauds
purpose of motion inlimine are evidentary → purpose of motion in limine are evidentiary
officerwilliam alexsander karabelas → officer william alexander karabelas
associate justise ruth bader ginsberg → associate justice ruth bader ginsburg

Table 5: Sample Results - Query Reformulation

               Uni                        Bi
Model      BLEU    GLEU    CHRF       BLEU    GLEU    CHRF
NMT I      0.8253  0.8341  0.8601     0.8352  0.8476  0.8632
NMT II     0.8305  0.8484  0.8519     0.8524  0.8742  0.8849
NMT III    0.8645  0.8924  0.8734     0.8752  0.8994  0.9056
NMT IV     0.9024  0.9255  0.9145     0.9255  0.9256  0.9055

Table 7: NMT Results - Modified Metrics

7.3 Limitations

A few examples of source queries and their output corrections for some of our trained NMT models are illustrated in Table 5. The red tags refer to the incorrect tokens in the input, and the green tags are the tokens correctly predicted by our trained models. Looking carefully at the distribution of incorrect predictions of correct input words, we can deduce that the models perform less sensibly as the sequence gradually becomes longer. To verify this, we evaluated the models by limiting the sequences to a fixed size. Figure 9 shows the attention heatmap for a sample query.

8. CONCLUSION

In this paper, we have investigated the potential of using character-level information and subword-based NMT models for the problem of query reformulation. First, we extended the encoder with a character attention mechanism for learning better source-side representations. Then, we incorporated information about source-side characters into the decoder with attention, so that the character-level information can cooperate with the word-level information to better control the translation. Our experiments demonstrate the effectiveness of our models and prove that both OOV and frequent words benefit from the character-level information.

9. FUTURE WORK

For future research, we plan to explore translation models at the action level, i.e., prevent over-learning of the models by not training them on correct input tokens (action = "OK"). Recently, reinforcement learning techniques [36] and End-to-End Memory Networks [37] have been used for the task of error correction. We plan to explore these networks to read the input sequence multiple times in order to produce an output and also to update memory contents at each step. We also plan to extend our work to question answering for grammar error correction and to apply reformulation to boolean queries.

REFERENCES

[1] R. Pearce and J. Mcginnis, "The great disruption: How machine intelligence will transform the role of lawyers in the delivery of legal services," Fordham Law Review, vol. 82, p. 3041, 05 2014.
[2] M. D. Kernighan, K. W. Church, and W. A. Gale, "A spelling correction program based on a noisy channel model," in Proceedings of the 13th Conference on Computational Linguistics - Volume 2, COLING '90, (Stroudsburg, PA, USA), pp. 205–210, Association for Computational Linguistics, 1990.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
[4] S. Arunprasath and B. Venkata Nagaraju, "Deep ensemble learning for legal query understanding," in Proceedings of CIKM 2018 Workshop on Legal Data Analytics and Mining (LeDAM 2018), CEUR-WS.org, October 2018. To appear.
[5] J. Xu and W. B. Croft, "Query expansion using local and global document analysis," in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, (New York, NY, USA), pp. 4–11, ACM, 1996.
[6] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma, "Probabilistic query expansion using query logs," in Proceedings of the 11th International Conference on World Wide Web, WWW '02, (New York, NY, USA), pp. 325–332, ACM, 2002.
[7] B. M. Fonseca, P. Golgher, B. Pôssas, B. Ribeiro-Neto, and N. Ziviani, "Concept-based interactive query expansion," in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM '05, (New York, NY, USA), pp. 696–703, ACM, 2005.
[8] J. Gao and J.-Y. Nie, "Towards concept-based translation models using search logs for query expansion," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, (New York, NY, USA), pp. 1:1–1:10, ACM, 2012.
[9] S. Riezler, Y. Liu, and A. Vasserman, "Translating queries into snippets for improved query expansion," in COLING, 2008.
[10] D. Britz, Q. V. Le, and R. Pryzant, "Effective domain mixing for neural machine translation," in WMT (O. Bojar, C. Buck, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, and J. Kreutzer, eds.), pp. 118–126, Association for Computational Linguistics, 2017.
[11] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Doha, Qatar), pp. 1724–1734, Association for Computational Linguistics, Oct. 2014.
Figure 9: Attention Mechanism Heatmap (attention weights between the misspelled input characters on the x-axis and the corrected output characters on the y-axis for a sample query).

[12] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," CoRR, vol. abs/1508.01211, 2015.
[13] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[14] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.
[15] R. Jackendoff, Semantic Structures. Cambridge, MA: MIT Press, 1990.
[16] R. Weng, S. Huang, Z. Zheng, X.-Y. Dai, and J. Chen, "Neural machine translation with word predictions," in EMNLP (M. Palmer, R. Hwa, and S. Riedel, eds.), pp. 136–145, Association for Computational Linguistics, 2017.
[17] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1–10, Association for Computational Linguistics, 2015.
[18] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and Y. Chang, "Learning to rewrite queries," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, (New York, NY, USA), pp. 1443–1452, ACM, 2016.
[19] K. Kukich, "Techniques for automatically correcting words in text," ACM Comput. Surv., vol. 24, pp. 377–439, Dec. 1992.
[20] W. E. Winkler, "String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage," in Proceedings of the Section on Survey Research, pp. 354–359, 1990.
[21] R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," J. ACM, vol. 21, pp. 168–173, Jan. 1974.
[22] V. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[23] E. Loper and S. Bird, "NLTK: The natural language toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, (Stroudsburg, PA, USA), pp. 63–70, Association for Computational Linguistics, 2002.
[24] B. Dhingra, H. Liu, R. Salakhutdinov, and W. W. Cohen, "A comparative study of word embeddings for reading comprehension," CoRR, vol. abs/1703.00993, 2017.
[25] T. Luong, R. Socher, and C. Manning, "Better word representations with recursive neural networks for morphology," in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113, Association for Computational Linguistics, 2013.
[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[27] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," TACL, vol. 5, pp. 135–146, 2017.
[28] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[29] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[30] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[31] H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant, "The CoNLL-2014 shared task on grammatical error correction," in Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–14, Association for Computational Linguistics, 2014.
[32] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[33] C. Napoles, K. Sakaguchi, M. Post, and J. Tetreault, "Ground truth for grammatical error correction metrics," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 588–593, Association for Computational Linguistics, 2015.
[34] C. Napoles, K. Sakaguchi, M. Post, and J. R. Tetreault, "GLEU without tuning," CoRR, vol. abs/1605.02592, 2016.
[35] M. Popović, "chrF: character n-gram F-score for automatic MT evaluation," in Proceedings of the Tenth Workshop on Statistical Machine Translation, (Lisbon, Portugal), pp. 392–395, Association for Computational Linguistics, Sept. 2015.
[36] K. Sakaguchi, M. Post, and B. Van Durme, "Grammatical error correction with neural reinforcement learning," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 366–372, Asian Federation of Natural Language Processing, 2017.
[37] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2440–2448, Curran Associates, Inc., 2015.