Legal Query Reformulation using Deep Learning

Arunprasath Shankar
LexisNexis, Raleigh, USA
arunprasath.shankar@lexisnexis.com

Venkata Nagaraju Buddarapu
LexisNexis, Raleigh, USA
venkatanagaraju.buddarapu@lexisnexis.com

ABSTRACT
Query reformulation is the process of iteratively modifying a query to improve the quality of search engine results. In recent years, the task of reformulating natural language (NL) queries has received considerable attention from both industry and academic communities. Traditionally, query reformulation has mostly been approached using the noisy channel model. Since legal queries are diverse and multi-faceted, these traditional approaches cannot effectively handle low-frequency and out-of-vocabulary (OOV) words. Motivated by these issues, we rethink the task of legal query reformulation as a type of monolingual neural machine translation (NMT) problem, where the input (source) query is potentially erroneous and the output (target) query is its corrected form. We propose a unified and principled framework with multiple levels of granularity. Specifically, (i) an encoder with character attention which augments the subword representation; (ii) a decoder with attentions that enable the representations from different levels of granularity to control the translation cooperatively; and (iii) a semi-supervised methodology to extract and augment a large-scale dataset of NL query pairs combining syntactic and semantic operations. We establish the effectiveness of our methodology using an internal dataset, where the training data is automatically obtained from user query logs. We further demonstrate that training deep neural networks on additional data with synthesized errors can improve performance for translation.

1. INTRODUCTION
Technology and innovation are transforming the legal profession in manifold ways. The legal industry is undergoing significant disruption, and arguably machine intelligence is advancing most in the area of discovery and search. Technologies are moving from basic keyword searches to predictive coding, which involves algorithms that predict whether a document is relevant or not. In [1], McGinnis et al. envisage two phases of technological change in this area. The first phase, expected to come in the next 10 years, involves perfecting semantic search that will allow lawyers to input NL queries to interfaces, with systems responding to those queries directly with relevant information. The second phase involves technology that is able to identify issues, given a set of facts, and then suggest relevant authorities that might apply to those issues.

Query reformulation or correction is a crucial component for any service that requires users to type in NL, such as a search engine. It is an iterative process where a user reformulates (rewrites) queries to improve search results or gain new information. The user achieves this by using either his own prior knowledge or through assisted tools. Legal query reformulation is not a trivial task, as legal search queries are often short, complex and lacking in context. Automatic query correction systems are among the most widely used tools within NL applications. Search engines support users in this task in two ways: (a) explicitly, by suggesting related queries or query completions, or (b) implicitly, by expanding the query to improve quality and recall of organic results. Successful reformulations must closely align with the original query both syntactically, as sequences of characters or words, and semantically, often involving transparent taxonomic relations.

From a probabilistic perspective, given an incorrect query q− that we wish to reformulate to a correct query q+, we seek to model q+ = arg max_q P(q | q−), where q ranges over candidate corrections (the ground truth). In a traditional noisy channel model [2], the model consists of two parts: (i) a language model, i.e., P(q), that represents the prior probability of the intended correct input query; and (ii) an error model, i.e., P(q− | q), that represents the process by which the correct input query gets corrupted into its incorrect form. There are several drawbacks to this approach: (i) we need two separate models, and the error in estimating one model would affect the performance of the final output; (ii) it is not easy to model the channel, since there are many sources for these mistakes, e.g., typing too fast, unintentional keystrokes, phonetic ambiguity, etc.; and (iii) it is not easy to obtain clean training data for the language model, as the input does not follow what is typical in NL. Since the end goal is to get a query that maximizes P(q | q−), can we directly model this conditional distribution instead?
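To make the contrast concrete, the noisy channel objective of [2] factorizes via Bayes' rule into the two component models above, while the NMT view models the conditional directly. The following restatement uses the standard Bayes decomposition and is not quoted from the paper:

```latex
% Noisy channel: decode by combining a language model P(q)
% with an error (channel) model P(q^- \mid q).
q^+ = \arg\max_{q} P(q \mid q^-)
    = \arg\max_{q} \frac{P(q^- \mid q)\, P(q)}{P(q^-)}
    = \arg\max_{q} P(q^- \mid q)\, P(q)

% Direct (NMT) formulation: a single parametric model \Theta
% estimates the conditional distribution end to end.
q^+ = \arg\max_{q} P(q \mid q^-;\, \Theta)
```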
In this work, we explore this route, which bypasses the need to have multiple models and avoids suffering errors from multiple sources. We achieve this by applying the encoder-decoder framework combined with attention learning using recurrent neural networks [3], rethinking the query correction problem as an NMT problem where the incorrect input is treated as a foreign language. We also propose a semi-supervised methodology to perform error detection and construct a large-scale dataset for training and experimentation. To the best of our knowledge, there is no prior research on the idea of using attention learning in a legal setting. Also, our work is the first that uses neural machine translation based methodologies on user-generated legal data for query reformulation. We demonstrate that NMT models can successfully be applied to the task of query reformulation in a legal background and validate that the trained models are naturally capable of handling orthographic errors and rare words, and can flexibly correct a variety of error types. We further find that augmenting the network training data with queries containing synthesized errors can result in significant gains in performance.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. ©2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.

2. APPLICATION
An "answer card" is a search feature, usually displayed in a box or panel, that appears above organic results and tries to directly answer a question. Most answer boxes are primarily text, containing a relatively short answer, and may provide limited information based on the user's entered query. In a legal context, the source can be a legal question, a profile search, or can take many other forms. Figure 1 shows a card that answers the legal question "define motion to compel" and Figure 2 shows a profile card for a judge search using the query "who is judge antonin scalia?".

[Figure 1: Answer Card. A panel titled "Motion to compel" reading: "A motion to compel asks the court to order either the opposing party or a third party to take some action."]

Vertical engines are search engines that specialize in different types of search. Answer cards are usually backed by vertical search engines that are tightly coupled with AI/NLP systems. These NLP systems may be machine/deep learning models recognizing and classifying entities that trigger and render the respective cards based on the user's query intent [4]. NLP models rely heavily on vocabulary, and any OOV words contained in the query can have an adverse effect on the coverage or functioning of answer cards.

[Figure 2: Profile Card. A panel titled "Antonin Scalia" reading: Former Associate Justice, Supreme Court (U.S.); Born: March 11, 1936, Trenton, NJ; Died: February 13, 2016; Appointed by: Ronald Reagan.]

The scope of this research is to develop a framework to detect these erroneous queries and autocorrect them. The framework acts as an extra layer between the answer cards and the NLP recognition systems, providing a coherent mechanism for coping with missing information (vocabulary), which in turn enables us to return more appropriate cards. For example, for the above-mentioned answer card queries, suppose a user typed in "defin e motiom to compell" or "whi is jufge antonin scakia?" instead; our deployed framework would fix (reformulate) the queries to their correct forms and display the cards successfully.
3. BACKGROUND AND RELATED WORK
There are two major areas related to our research and the proposed models. First is the work on the task of query reformulation in information retrieval, involving traditional, statistical and neural approaches; second is the process of error detection (selecting candidate queries that are incorrect) for training the reformulation models. We introduce related studies specific to these areas in this section.

3.1 Query Reformulation
3.1.1 Traditional Approaches: In [5], Xu et al. used the top results retrieved by the original query to perform reformulation by expansion. This method is popular, but it is influenced by the initial ranking of results and cannot utilize user-generated data. Other approaches focused on using user query logs to expand a query by means of clickthrough rate [6], co-occurrence in search sessions, or query similarity based on click-through graphs [7]. The advantage of these approaches is that user feedback is readily available in user query logs and can efficiently be precomputed. However, these approaches all face the problem of requiring some resource dependency on a specific search engine.

3.1.2 Statistical Machine Translation: The task of machine correction is defined as correcting an N-character (or word) source query S = s_1, ..., s_N = s_1^N into an M-character (or word) target query T = t_1, ..., t_M = t_1^M. Thus, the correction system can be defined as a function F:

$$\hat{T} = F(S) \tag{1}$$

which returns a correction hypothesis T̂ given an input query S. Recently, several studies have adopted data from user query logs as input to Statistical Machine Translation (SMT) for query reformulation and expansion [8][9]. These methods treat user queries as one language and the reformulated queries as another. However, their major drawback is the difficulty of modeling corrections at different granularities, e.g., characters or words, which is necessary to reduce the rate of unknown words that are detrimental to their proper functioning.

Our approach differs in several ways. First, we consider full query pairs as training data, and do not use single word tokens as a primary mode of operation. Second, we do not train explicit error models P(w|s) for words w and observed corrections s, but use standard word/character based NMT models to derive more meaningful and accurate lexical translation models.

3.1.3 Neural Machine Translation: Recently, the use of neural networks has delivered significant gains for mapping tasks between pairs of sequences, owing to their ability to learn a better representation of the data and context. Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data and modeling long-distance dependencies, a common phenomenon in human language. NMT models [10] use RNNs and learn to map from source language input to target language output via continuous-space intermediate representations. We use an encoder-decoder architecture built on uni/bi-directional RNNs with an attention mechanism as the core component of our proposed system [11]. The encoder maps the input query to a higher-level representation with a uni- or bi-directional RNN architecture similar to that of [12]. The decoder also employs an RNN that uses a content-based attention mechanism [13] to attend to the encoded representation and generate the output query one character at a time.
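For reference, the content-based attention of [13] is conventionally written as follows; this is the standard Bahdanau-style formulation, included here for clarity rather than taken from the paper, with s_{j-1} the previous decoder state and h_i the encoder annotations:

```latex
% Alignment energies between the previous decoder state and each
% encoder annotation, normalized into attention weights, then used
% to form the context vector c_j consumed at output step j.
e_{ji} = a(s_{j-1}, h_i), \qquad
\alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{I} \exp(e_{jk})}, \qquad
c_j = \sum_{i=1}^{I} \alpha_{ji}\, h_i
```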
Recurrent Neural Networks systems to facilitate a coherent mechanism for coping up with (RNNs) are a set of neural networks for processing sequential data missing information (vocabulary); which in turn enable us to re- and modeling long-distance dependencies which is a common phe- turn more appropriate cards. For e.g., for the above mentioned nomenon in human language. NMT models [10] use RNNs and answer card queries, lets say a user typed in “defin e motiom learn to map from source language input to target language input to compell” or “whi is jufge antonin scakia?” instead, our de- via continuous-space intermediate representations effectively. We ployed framework would fix (reformulate) the queries to its correct use an encoder-decoder bound uni/bi-directional RNNs with an at- form and display the cards successfully. tention mechanism as the core component of our proposed system. [11]. The encoder maps the input query to a higher-level represen- 3. BACKGROUND AND RELATED WORK tation with a uni or bi-directional RNN architecture similar to that There are two major areas related to our research and the proposed of [12]. The decoder also employs a RNN that uses content-based models. First is the work on the task of query reformulation in attention mechanism [13] to attend to the encoded representation information retrieval involving traditional, statistical and neural and generate the output query one character at a time. 2 3.1.4 Character Level Reasoning: Word-level NMT models are words (tokens). Table 1 shows top 10 ranked queries (left) and to- poorly suited to handle OOV words due fixed vocabulary [14]. Re- kens (right) at the end of query preparation process. cent works have proposed workarounds for this problem. Since a word is usually thought of as the basic unit of language commu- Rank Query Rank Token nication [15], early NMT systems built these representations start- 1 breach of contract 1 motion ing from the word level [16]. Machine learning/NLP models typi- 2 unjust enrichment 2 contract cally limit vocabulary size due to the complexity of training [17]. 3 summary judgement 3 court Therefore, they are unable to translate rare words, and it is a stan- 4 res judicata 4 judgment dard practice to replace OOV words with unknown (UNK) symbol. 5 fraud 5 insurance By doing so, useful information is discarded, resulting in systems 6 motion to dismiss 6 evidence that are not able to correct erroneous words. In our framework, 7 conversion 7 property we strive to circumvent this issue by using smaller units such as 8 statute of limitations 8 attorney subwords to address the problem of OOV words. 9 defamation 9 liability 10 civil procedure 10 statute 3.2 Error Detection Table 1: Top 10 Queries and Tokens In most translation systems, before any reformulation is carried out, a detection process is conducted on the input query to extract 4.2 Query Representation any potentially incorrect words. In [18], He et al. proposed a learn- The first step in creating a model, is to choose a representation for ing to rewrite framework that focuses on candidate ranking for the input and output. For instance, a query can be represented in error detection. A non-word error refers to a potentially incorrect word-level, which is transferring the data with indices that refer to word that does not exist in a given dictionary. 
4. PROPOSED FRAMEWORK
4.1 Query Preparation
For our experiments, we collected a total of ∼62M distinct queries grouped by frequency, gathered over a period of 5 years. The queries under study included a considerable proportion of boolean queries, ∼32M of them. We used a heuristic approach to filter search queries. Using a combination of regular expressions and strict constraints, we excluded any query with boolean or special-character connectors; this left us with ∼29M NL queries.

We used the punkt sentence tokenizer [23] to tokenize the queries into words. The tokens were further filtered to remove noise and short forms. Any token that is a digit sequence of length ⩾ 5 was excluded from our study. The created vocabulary contained ∼1M words (tokens). Table 1 shows the top 10 ranked queries (left) and tokens (right) at the end of the query preparation process.

Rank  Query                    Rank  Token
1     breach of contract       1     motion
2     unjust enrichment        2     contract
3     summary judgement        3     court
4     res judicata             4     judgment
5     fraud                    5     insurance
6     motion to dismiss        6     evidence
7     conversion               7     property
8     statute of limitations   8     attorney
9     defamation               9     liability
10    civil procedure          10    statute
Table 1: Top 10 Queries and Tokens

4.2 Query Representation
The first step in creating a model is to choose a representation for the input and output. For instance, a query can be represented at the word level, transferring the data as indices that refer to words, or at the character level, transferring the data as indices that refer to a closed vocabulary of language-specific characters and symbols. In this work, we anticipate the need to handle an exponentially large input space of tokens by choosing a character-level query representation. Thus, model inputs are individual characters, i.e., a sequence of indices corresponding to characters.

[Figure 3: Proposed Framework. User input and the query log feed error detection; legal names and legal corpora feed data augmentation; together with fastText embeddings, these produce the training pairs for the NMT model, which emits the reformulated input.]

We defined special characters to mark the start and end of each query; these characters are later used in the training process as stopping criteria. Note that in the character-level representation, we also considered spaces, certain diacritics and punctuation as unique characters, i.e., 40 unique characters in all. Once a query is mapped into indices, the characters are embedded, i.e., each character is represented by a one-hot vector and is multiplied by a trainable matrix of size input vocabulary × embedding size. The embedding allows the model to group together characters that are similar for the task.
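A minimal sketch of this character-level encoding follows; the 40-symbol vocabulary, the marker names and the NumPy lookup are illustrative assumptions, since the paper does not spell out the exact symbol set:

```python
import numpy as np

# Hypothetical closed character vocabulary: reserved markers plus
# lowercase letters, space and a few punctuation symbols.
CHARS = ["<go>", "<eos>", "<pad>"] + list("abcdefghijklmnopqrstuvwxyz '-?.")
CHAR2IDX = {c: i for i, c in enumerate(CHARS)}

def encode_query(query: str) -> list[int]:
    """Map a query to a sequence of character indices, framed by the
    start-of-query and end-of-query markers."""
    body = [CHAR2IDX[c] for c in query.lower() if c in CHAR2IDX]
    return [CHAR2IDX["<go>"]] + body + [CHAR2IDX["<eos>"]]

# One-hot row vectors times a trainable matrix give the embeddings;
# equivalently, the matrix is used as a lookup table.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(CHARS), 100))  # vocab x embedding size
ids = encode_query("define motion to compel")
vectors = embedding[ids]                        # (seq_len, 100)
print(len(ids), vectors.shape)
```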
4.3 Candidate Selection
Our scoring heuristic calculates a sequence of feature scores for each token from the prepared vocabulary. The key motivation is to separate these tokens into +ve (correct) and −ve (incorrect) buckets. We achieve this by combining three distributional attributes. First, since our vocabulary is derived from user log queries, we can associate each individual token with the ranks of its constituent queries. Second, we can also affiliate a token with its individual frequency within the vocabulary. And third, we use the membership of the tokens in selected legal lexicons. Our selected lexicons comprise extracted vocabulary from headnotes (∼217K) derived from US legal caselaw documents, and legal names consisting of judge (∼67K) and expert witness (∼388K) names derived from legal master databases. We also use a secondary lexicon of common US English names (∼234K).

Let Q be an ordered set of distinct legal queries derived from user logs and ranked by frequency, Q ⇒ {q_1, q_2, ..., q_r}, where r is the rank of a query, r ∈ R. Let R denote an ordered set of ranks, R ⇒ {1, 2, ..., L}, where L is the length of Q, i.e., the total number of distinct queries. Let ▽ ⇒ {x_1, x_2, ..., x_l} denote the vocabulary of distinct tokens derived from Q, where l is the length of the individual query. Any given token x in vocabulary ▽ can belong to a subset of queries Q̂, Q̂ ⊆ Q. This implies that the token can be mapped to a closed set of ranks R̂, R̂ ⊆ R. E.g., the token discriminatory can belong to multiple queries within Q, hence R̂ could be a closed set such as {6, 37, 763, 23445}.

4.3.1 Mean Rank Score: Given R̂ ⊆ R for a token x ∈ Q, R̂ can be written as {r_1, r_2, ..., r_l}. We compute the mean rank score (ξ) as follows:

$$\xi = \frac{1 - \frac{1}{L}\sum_{i=1}^{l} r_i}{|\hat{R}|} \tag{2}$$

where |R̂| is the size of R̂.

4.3.2 Neighbor Rank Score: A given token x can co-occur alongside other tokens from the vocabulary within the queries under investigation. Let ▽̂ ⇒ {x_1, x_2, ..., x_l} denote the subset of tokens co-occurring alongside x. For every candidate token inside ▽̂, we have a corresponding mean rank score ξ derived from equation 2. We restrict the window of co-occurrence to 2 neighbors, one before and one after the token. Let ξ̂ be the set of corresponding mean rank scores, ξ̂ ⇒ {ξ_1, ξ_2, ..., ξ_l}. The neighbor rank score (χ) is computed as:

$$\chi = \frac{\sum_{i=1}^{l} \xi_i}{|\hat{\xi}|} \tag{3}$$

4.3.3 Frequency Score: Let ϵ denote the frequency of a token x ∈ ▽. The frequency score (ζ) is a normalized value representing the importance of the frequency of occurrence of a token in the corpus. It is defined as follows:

$$\zeta = \frac{\epsilon}{|\triangledown|} \tag{4}$$

4.3.4 Lexicon Membership Score: Let ▽_H denote the vocabulary of tokens derived from headnotes, ▽_J the vocabulary of tokens derived from the judge master (judge name database), ▽_E the tokens from the expert witness database, and ▽_C all the tokens from the common valid US English names vocabulary. For any given token x, the lexicon membership score (ϕ) is computed as follows:

$$\phi = \begin{cases} 1, & \text{if } (x \in \triangledown_H) \vee (x \in \triangledown_J) \vee (x \in \triangledown_E) \vee (x \in \triangledown_C) \\ \frac{1}{2}, & \text{otherwise} \end{cases} \tag{5}$$

4.3.5 Relevance Score: Finally, a relevance score (Ω) is computed by combining all the scores derived from equations 2, 3, 4 and 5 as follows:

$$\Omega = \phi \cdot \left(1 - \frac{1}{e^{2\xi} + 1}\right) \cdot \left(1 - e^{-\chi}\right) \cdot \left(1 - \frac{1}{1 + \zeta}\right) \tag{6}$$

The relevance score expresses the relative importance of a token and can be used to classify a word into the +ve or −ve bucket. We establish this separation by using a cut-off threshold (∼0.83) obtained via trial-and-error experiments.
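The scoring pipeline of equations 2 to 6 can be sketched as below; the function names are ours, and the formulas follow the reconstruction above, so they should be checked against the originally typeset equations:

```python
import math

def mean_rank_score(ranks: list[int], L: int) -> float:
    """Equation 2: xi for a token occurring at query ranks `ranks`
    out of L distinct queries."""
    return (1 - sum(ranks) / L) / len(ranks)

def neighbor_rank_score(neighbor_xis: list[float]) -> float:
    """Equation 3: chi, the averaged mean-rank scores of tokens
    co-occurring within a +/-1 window around the token."""
    return sum(neighbor_xis) / len(neighbor_xis)

def frequency_score(freq: int, vocab_size: int) -> float:
    """Equation 4: zeta, token frequency normalized by vocabulary size."""
    return freq / vocab_size

def lexicon_membership_score(token: str, lexicons: list[set[str]]) -> float:
    """Equation 5: phi = 1 if the token is in any legal/name lexicon,
    else 1/2."""
    return 1.0 if any(token in lex for lex in lexicons) else 0.5

def relevance_score(phi: float, xi: float, chi: float, zeta: float) -> float:
    """Equation 6: combined relevance Omega."""
    return (phi
            * (1 - 1 / (math.exp(2 * xi) + 1))
            * (1 - math.exp(-chi))
            * (1 - 1 / (1 + zeta)))

# Tokens scoring above the cut-off (~0.83 in the paper) go to the
# +ve (correct) bucket, the rest to the -ve (incorrect) bucket.
```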
4.4 Ranking Strategy using FastText
Traditional approaches to representing rare and OOV words involve assigning a random vector [24] or binning all rare words into a new "UNK" word type. Another popular technique is to encode word definitions with an external resource (Long et al., 2016) and to train at the morpheme level [25]. Creating semantic representations for OOV words is a difficult NLP task, and word-based approaches (Mikolov et al., 2013; aka word2vec) [26] do not contribute well to resolving this problem.

Character-based embedding approaches are better suited to handle the OOV problem (Bojanowski et al., 2017; aka fastText) [27]. For mapping the +ve tokens to their corresponding variants (incorrect −ve counterparts), we used fastText [28]. fastText is a character-based embedding approach that is well suited to producing embeddings for OOV words. It is able to do this by learning vectors for character n-grams within the word and summing those vectors to produce the final vector, or embedding, for the word itself. For training, we set the embedding size to 100 and the training window to 10. We also set the alpha value to 0.025 and the min and max values of n to 2 and 6, respectively. The embedding matrix was created using a batch size of 10K words in 100 epochs. We trained fastText embeddings on the 29M NL queries using a p3.2xlarge EC2 instance with one Tesla V100 GPU. Table 2 shows the top 10 most similar words for the OOV +ve words "roberts" (left) and "breach" (right); the retrieved neighbors serve as candidate −ve variants.

Token      Score    Token     Score
roberds    0.9967   cbreach   0.9839
robers     0.9962   fbreach   0.9771
robergs    0.9913   onbreach  0.9684
robertus   0.9851   breachh   0.9683
robertson  0.9797   bebreach  0.9659
robertt    0.9777   breachj   0.9657
rowert     0.9766   breachn   0.9643
rochfort   0.9758   ofbreach  0.9641
rogerts    0.9756   mcbreach  0.9627
robtert    0.9731   orbreach  0.9612
Table 2: Top-10 Similar Words by fastText

4.5 Query Augmentation
Legal names such as judge, expert witness and attorney names are often misspelled by users during search. Since we were not able to capture a considerable number of these errors from the query logs (real-world data), we decided to synthetically augment these types of errors and generate queries for legal names algorithmically. For this purpose, we took all the names (+ve instances) from our legal corpora and applied a variant of the Needleman-Wunsch algorithm to create their −ve counterparts.

The distance computation is identical to Levenshtein except that character mistakes are given different weights depending on how far apart two characters are on a standard keyboard layout (QWERTY). The weights are assigned by projecting the keyboard characters onto a Cartesian coordinate system. For example, A to S is given a mistake weight of 0.4, while A to D is given 0.6. We generate query pairs using 4 types of transformation: addition, deletion, replacement and transposition. Table 3 shows a few of the generated examples for augmentation; a sketch of the keyboard-distance weighting follows the table.

Misspelled         Corrected          Operation
mchael lowy        michael lowy       addition
david yamhamoto    david yamamoto     deletion
christibe ballard  christine ballard  replacement
warren glciker     warren glicker     transposition
Table 3: Augmented Queries
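The keyboard-distance weighting can be sketched as follows; the key coordinates and the linear fit are our own assumptions, chosen so that the two published examples (A to S is 0.4, A to D is 0.6) come out right:

```python
import math

# Approximate QWERTY key positions projected onto a Cartesian grid;
# each key row is offset slightly, as on a physical keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (col + 0.3 * row, float(row))
           for row, keys in enumerate(ROWS)
           for col, ch in enumerate(keys)}

def substitution_weight(a: str, b: str) -> float:
    """Mistake weight that grows with physical key distance; the
    0.2 + 0.2*d line is fit to the paper's two examples."""
    if a == b:
        return 0.0
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return 0.2 + 0.2 * math.hypot(x1 - x2, y1 - y2)

print(round(substitution_weight("a", "s"), 2))  # 0.4 (adjacent keys)
print(round(substitution_weight("a", "d"), 2))  # 0.6 (two keys apart)
```

Such a weight would replace the unit substitution cost in an edit-distance computation like the sketch shown in Section 3.2, making errors on nearby keys cheaper than errors on distant ones.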
4.6 Training Pairs
In the previous sections, we discussed the need to segregate the vocabulary into +ve and −ve tokens and the use of fastText character n-grams, or subwords, to capture word mappings (+ve to −ve). At the end of the candidate selection process, the entire vocabulary is broken into two groups of ∼180K +ve and ∼830K −ve tokens. At the end of ranking using fastText, we end up with around ∼2.4M legal term mappings and ∼50K legal name mappings. The legal name mappings are expanded via the augmentation process elaborated in the previous section to around ∼1.5M pairs. The legal term and name mappings are then matched against the indexed 60M user queries to create the query pairs that are used for training our translation models.

4.7 Neural Machine Translation
Most NMT systems follow the encoder-decoder framework with the attention mechanism proposed by Bahdanau et al. [29]. Given a source query q− = q_1^− ... q_i^− ... q_I^− and a target query q+ = q_1^+ ... q_j^+ ... q_J^+, we aim to directly model the translation probability as:

$$P(q^+ \mid q^-; \Theta) = \prod_{j=1}^{J} P\left(q_j^+ \mid q_{<j}^+, q^-; \Theta\right) \tag{7}$$

4.7.1 Encoder: At each step, the encoder takes an input character and a hidden state, and uses the hidden state for the next input character. Figure 4(a) portrays a simple overview of the encoder.

[Figure 4: Encoder and Decoder. (a) Encoder: the input character passes through an embedding and, together with the previous hidden state, through a GRU to produce the output and next hidden state. (b) Decoder: the input character passes through an embedding, a ReLU and, together with the previous hidden state, through a GRU and a softmax to produce the output distribution and next hidden state.]

Following [29], we use a bi-directional RNN with Gated Recurrent Units (GRUs) to encode the source query:

$$\overrightarrow{h}_i = \mathrm{GRU}\left(\overrightarrow{h}_{i-1}, s_i; \overrightarrow{\Theta}\right), \qquad \overleftarrow{h}_i = \mathrm{GRU}\left(\overleftarrow{h}_{i-1}, s_i; \overleftarrow{\Theta}\right) \tag{8}$$

where s_i is the i-th source embedding, GRU is a gated recurrent unit, and Θ⃗ and Θ⃖ are the parameters of the forward and backward GRUs, respectively. The annotation of each source character q_i^− is obtained by concatenating the forward and backward hidden states:

$$h_i = \begin{bmatrix} \overrightarrow{h}_i \\ \overleftarrow{h}_i \end{bmatrix} \tag{9}$$

4.7.2 Decoder: The decoder is also built using an RNN. It takes the representation of the input context along with the previous hidden state and output character prediction, and generates a new hidden decoder state and a new output character prediction. The decoder is a forward (uni-directional) RNN with GRUs predicting the translation y character by character. This prediction takes the form of a probability distribution over the entire output vocabulary (100 characters). At every step of decoding, the decoder is given an input character and a hidden state. The initial input character is the start-of-query character, and the first hidden state is derived from the encoder.
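A compact PyTorch sketch of the architecture in Figure 4, with a bi-directional GRU encoder and a uni-directional GRU decoder; the layer sizes, the decoder initialization and the omission of the attention module are simplifications, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bi-directional GRU over character embeddings (eq. 8); the
    annotation h_i concatenates both directions (eq. 9)."""
    def __init__(self, vocab_size: int = 40, emb: int = 100, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hidden, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                    # (batch, seq)
        annotations, last = self.gru(self.embedding(char_ids))
        return annotations, last                    # (batch, seq, 2*hidden)

class Decoder(nn.Module):
    """Uni-directional GRU emitting one character distribution per
    step, as in Figure 4(b): embedding -> ReLU -> GRU -> softmax."""
    def __init__(self, vocab_size: int = 100, emb: int = 100, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, char_id, hidden):             # one decoding step
        x = torch.relu(self.embedding(char_id))     # (batch, 1, emb)
        output, hidden = self.gru(x, hidden)
        logits = self.out(output)                   # softmax applied in the loss
        return logits, hidden

# Usage: encode a query, then decode character by character starting
# from the start-of-query marker, feeding back each prediction.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, 40, (1, 12))                 # a fake 12-character query
annotations, _ = enc(src)
hidden = torch.zeros(1, 1, 256)                     # simplified decoder init
step = torch.tensor([[0]])                          # start-of-query index
logits, hidden = dec(step, hidden)
next_char = logits.argmax(dim=-1)                   # greedy prediction
```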