Deep Ensemble Learning for Legal Query Understanding

Arunprasath Shankar
LexisNexis, Raleigh, USA
arunprasath.shankar@lexisnexis.com

Venkata Nagaraju Buddarapu
LexisNexis, Raleigh, USA
venkatanagaraju.buddarapu@lexisnexis.com

Abstract

Legal query understanding is a complex problem that involves two natural language processing (NLP) tasks that need to be solved together: (i) identifying the intent of the user and (ii) recognizing entities within the queries. The problem equates to decomposing a legal query into its individual components and deciphering the underlying differences that can occur due to pragmatics. Identifying the desired intent and recognizing the correct entities helps us return relevant results to the user. Deep Neural Networks (DNNs) have recently achieved great success, surpassing traditional statistical approaches. In this work, we experiment with several DNN architectures for legal query intent classification and entity recognition. Deep neural architectures like Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM), Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) were applied and compared against one another, both individually and in combination. The models were also compared against machine learning (ML) and rule-based approaches. In this paper, we describe a methodology that integrates the posterior probabilities produced by the best DNN models into a stacked framework that combines the different predictors to improve prediction accuracy and F-measure for legal intent classification and entity recognition.

Copyright © CIKM 2018 for the individual papers by the papers' authors and for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the US legal services market approaching $500 bn in 2018, roughly 2% of US GDP [1], and substantial incentives to increase market share, maximizing user satisfaction with search results continues to be the primary focus of the legal search industry. Recently, there has been an upward trend of legal companies embracing vertical search engines, since they provide a specialized type of information service to satisfy a user's intent. Using a specialized user interface, a vertical search engine can return more relevant results than a general search engine for in-domain legal queries.

In practice, a user has to decide on a vertical search engine beforehand to satisfy his query intent. It would be convenient if a query intent identifier could be provided in a general search engine that could precisely predict whether a query should trigger a vertical search in a certain domain. Moreover, since a user's query may implicitly express more than one intent, it would be very helpful if a general search engine could detect all query intents, distribute them to the appropriate vertical search engines and effectively organize the results from the different vertical search engines to satisfy the user's need. Consequently, understanding query intent is crucial for providing better search results and thus improving the overall satisfaction of the user.

Understanding intent and identifying legal entities in a user's query can help legal search automatically route the query to the corresponding vertical search engines and obtain relevant content, greatly improving user satisfaction and providing better assistance to law researchers in supporting legal arguments and decisions.
Law researchers have to cope with a tremendous load of legal content, since sources of law originate from diversified sources such as the judicial branch, the legislative branch (statutes), legal reference books, journals, news etc. This makes legal search, and understanding the queries submitted to it, an important aspect of retrieving relevant documents to support one's argument.

Legal search is hard, as it demands writing complex queries to retrieve the desired content from information retrieval (IR) systems. Classifying legal queries and identifying domain-specific legal entities within them is even harder. For example, in the query "who is supreme court magistrate John Roberts and abortion law?", the word "magistrate" can be resolved to a judge entity when observed along with the context phrase "supreme court". Similarly, the phrase "abortion law" can be identified as a legal topic when seen alongside a supporting context. However, since we also observe the interrogative phrase "who is", we can safely assume that the intent of this query is a person or judge profile search.

Like general web queries, legal queries can be classified into one of three categories. (i) Do: the user wants to do something, like buy a product or service, e.g., buy law books, research guides, police/personal reports, real estate property records etc. (ii) Know: an informational query, where the user wants to learn about a subject, e.g., a law, statute or doctrine. Very often, single-word queries are classified at least partially as "Know" queries. (iii) Go: also known as a navigational query, where the user wants to go to a specific site scoped to a particular legal entity, e.g., a query where a user wants to see analytics for a specific legal entity like a judge or an expert witness. Our research in this work focuses only on "Go" type queries, e.g., a user wanting to see a particular judge profile with the query "judge John D. Roberts".

The process of finding named entities in a text and classifying them to a semantic type is called named entity recognition (NER). Legal NER is nearly always used in conjunction with intent classification systems. Given a query, the goal of the NER system described in this paper is two-fold: (i) segment the input into semantic chunks, and (ii) classify each chunk into a predefined set of semantic classes. For example, given the query "judge john d roberts", the desired output would be "judge" = O and "john d roberts" = PER. Here, the PER class represents a person or judge entity and the O class represents any non judge specific term.

The aim of this research is to improve the understanding of queries involved in legal research. In this paper, we explore three different approaches: (a) machine learning (ML) with feature engineering, (b) deep learning (DL) without any feature engineering and (c) ensembles of deep neural networks. We then perform quantitative evaluations in comparison to an already existing baseline model, a rule-based classification system in production. Finally, we show from our experimental results that a deep ensemble model significantly outperforms the other approaches for both intent classification and legal NER.

2. Background and Related Work

DL systems have dramatically improved the state of the art in several domains like NLP, computer vision and image processing. Various deep architectures and learning methods have been developed with distinct strengths and weaknesses in recent years. Deep ensemble learning is a learning paradigm in which ensembles of several neural networks show improved generalization capabilities that outperform those of single networks. Ensemble learning remains applicable to deep multi-layer neural networks [2, 3]. However, there is little work in the legal domain involving either deep learning or ensemble learning. How ensemble learning can be applied to various DNN architectures to achieve better results on legal tasks is the primary focus of this paper.

Most ML approaches to text understanding consist of tokenizing a string of characters into structures such as words, phrases, sentences or paragraphs, and then applying some statistical classification algorithm to the statistics of those structures [4]. These techniques work well when applied to a narrowly defined domain.
Typical queries submitted to legal search engines contain very short keyword phrases, which are generally insufficient to fully describe a user's information need. It is thus a challenging problem to classify millions of queries into predefined categories. A variety of related topical query classification problems have been investigated in the past [5, 6]. Most of them use statistical machine learning methods to train a classifier to predict the category of an input query. From the statistical learning perspective, in order to obtain a classifier that generalizes well to future unseen data, two conditions should be satisfied: a discriminative feature representation and sufficient training samples. However, for the problem of query intent classification, even though there are huge volumes of legal queries, both conditions are hard to meet due to the sparseness of query features coupled with the sparseness of labeled training data. In [5], Beitzel et al. attempted to solve this problem by augmenting the query with more features using external knowledge, such as search engine results, and achieved fair results.

DNNs have revolutionized the field of NLP. RNNs and CNNs, the two main types of DNN architectures, are widely explored for various NLP tasks. CNNs are considered good at extracting position-invariant features and RNNs at modeling units in sequence, and the state of the art on many NLP tasks often switches back and forth between the two. While CNNs take advantage of local coherence in the input to cut down on the number of weights, RNNs are used to process sequential data (often with LSTM cells). RNNs are also good at representing very large implicit internal structures that are difficult even to think about. In summary, the conventional wisdom is that RNNs should be used when the context is richer and there is more state information that needs to be captured.
This proposition has recently been challenged by CNNs, with the claim that finite state information of limited scope can be handled more efficiently by multiple convolution layers. We think both are true: one should not choose RNNs just for the sake of it; instead, more efficient deep CNNs should be tried in limited-context situations. But for more complex implicit mappings, where context and state information spans are much bigger, RNNs are the best and, at this point, almost the only tool.

There is not a lot of previous research involving DL in the legal domain for the problems of intent classification and NER. However, there are a handful of papers on DNN approaches for non-legal intent classification and general NER, and recently RNNs and CNNs have been applied to a variety of NLP tasks with varying degrees of success. Below, we discuss the evolution of various DNNs and their applications to NLP tasks.

In [7], Zhai et al. adopted RNNs as building blocks to learn desired representations from massive user click logs. The authors proposed a novel attention network that learns to assign attention scores to words within a sequence (query or ad). In [8], Hochreiter and Schmidhuber introduced the LSTM, which solved complex, artificial long-time-lag tasks that had never been solved by previous recurrent network algorithms. In [9], Chung et al. compared different types of recurrent units in RNNs, especially focusing on units that implement a gating mechanism, such as LSTM and GRU units. They evaluated these units on the tasks of polyphonic music modeling and speech signal modeling and showed that LSTMs and GRUs work better than traditional recurrent units.
In [10], Krizhevsky et al. applied deep CNNs to image classification on the ImageNet dataset and established new state-of-the-art results. In [11], Zhang and LeCun demonstrated that DL can be applied to text understanding from character-level inputs all the way up to abstract text concepts, using temporal CNNs. They applied CNNs to various large-scale datasets, including ontology classification, sentiment analysis and text categorization, and showed that temporal CNNs can achieve astonishing performance without any prior knowledge of the syntactic or semantic structure of a human language. With respect to intent classification, Hu et al. devised a methodology to identify query intent by mapping the query to a representation space backed by Wikipedia [12]. In [13], the authors applied CNNs to intent classification and achieved results on par with the state of the art.

There are also many recent works that combine RNNs with CNNs for different NLP tasks. For example, in [14], Chiu and Nichols presented a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering, and established a new state-of-the-art performance with an F1 score of 91.62% on the CoNLL-2003 data set. In [15], Limsopatham and Collier proposed a bidirectional LSTM (Bi-LSTM) to automatically learn orthographic features from tweets.

A handful of papers delve into legal applications using DNNs. In [16], Sugathadasa et al. proposed a system that includes a page-ranking graph network with TF-IDF to build document embeddings by creating a vector space for the legal domain, which can be trained using a doc2vec neural network model supporting incremental and extensive training for scalability. In [17], Nanda et al. proposed a hybrid model using LSTM and CNN which utilizes word embeddings trained on the Google News vectors, and evaluated the results on the COLIEE 2017 dataset. They demonstrated that the performance of the LSTM + CNN model was competitive with other textual entailment systems. Similarly, in [18], the author proposed a methodology employing DNNs and word2vec for retrieval of civil law articles.

For NER, Huang et al. proposed a variety of LSTM-based models, such as LSTM + CRF, and achieved state-of-the-art results [19]. Recently, in [20], Peters et al. introduced a new type of deep contextualized word representation that models both semantics and polysemy and improved the state of the art across six challenging NLP problems, including question answering, textual entailment, sentiment analysis and NER.

In this paper, we limit the scope of our research to judge queries, meaning queries that are meant to be a search for a judge profile. Given this scope, the intent classification task needs to classify a given query as a judge query or not. Once the query is identified as a judge query, the NER system that follows recognizes whether the query contains any person entities. The entities are then routed to the vertical searches that follow. The structure of the paper is as follows. Section 3 discusses the various experiments that were carried out to solve legal query understanding. Section 4 presents and analyzes the overall results. The paper ends with Section 5, which gives the conclusion and a discussion of future work.

3. Models and Experiments

The adoption of artificial intelligence (AI) technology is undoubtedly transforming the practice of law. Many in the legal profession are aware that using AI can greatly reduce time and costs while improving predictive measures. In the following subsections, we discuss data collection, augmentation and the training of the different DNN models for the two legal tasks we experimented with: (i) intent classification and (ii) legal entity recognition.

3.1 Data Collection

For the purpose of our experiments, we created data sets comprising different types of legal queries such as judge search, case law search, statutes/elements search, etc. For intent classification, since the classification task is binary, we labeled all judge queries as positive and the rest as negative. For NER, we used three labels: (i) O, used to tag any tokens that are not part of a judge name, (ii) B-PER, which denotes the beginning of a person (judge) name, and (iii) I-PER, which denotes the inside of a name. Table 1 portrays the different query types by volume used in our experiments. All of the collected data was labeled by our subject matter experts (SMEs). All of the data discussed here is proprietary to LexisNexis.

Type         | Example                                     | Volume %
Judge        | judge John D. Roberts                       | 34
Word Wheel   | Ohio Municipal Court, Bellefontaine         | 6
Query Log    | sexual harassment                           | 5
Statute      | statute /s limitations /s actual /s fraud   | 31
Elements Law | contract defense unconscionability elements | 20
Case Search  | Powers v. USAA                              | 4
Table 1: Data Types by Volume
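To make the labeling scheme concrete, the sketch below shows how a single judge query would look when tokenized and tagged under the O / B-PER / I-PER scheme just described. This is a minimal illustration only; whitespace tokenization and the label assignment are assumptions based on the example in Section 2, not the paper's actual preprocessing code.

```python
# Minimal illustration of the O / B-PER / I-PER tagging scheme from
# Section 3.1. Whitespace tokenization is an assumption; the paper
# does not describe its tokenizer.
query = "judge john d roberts"
tokens = query.split()
labels = ["O", "B-PER", "I-PER", "I-PER"]  # "judge" is not part of the name

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
# judge    O
# john     B-PER
# d        I-PER
# roberts  I-PER
```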
3.2 Data Augmentation and Balancing

The labeled data was good enough to get started with, but was highly imbalanced with respect to NER labels. In order to balance the data set, we augmented the data by (i) expanding queries by pattern using judge names from our proprietary judge master database, (ii) oversampling the under-represented patterns and (iii) undersampling the over-represented patterns. The top 10 patterns by frequency are shown in table 3; these 10 patterns alone contributed ≈78% of the labeled data. The balancing was carried out using two strategies. For oversampling, we pick a pattern and create synthetic data points (queries) by randomly substituting judge names/titles into the pattern being fitted. For undersampling, random data points are selected from the buckets of patterns and removed. The sampling process stops when there is an equal number of queries in all pattern buckets. Once the data was augmented and balanced, we split it into train, dev and test sets. Table 2 shows the data splits for both intent classification and NER.

      | Intent  | NER
Train | 505,760 | 1,052,000
Dev   | 20,000  | 20,000
Test  | 20,000  | 20,000
Table 2: Data Sets

Pattern                         | Count
B-PER I-PER I-PER               | 217,374
B-PER I-PER                     | 109,500
O B-PER I-PER I-PER             | 92,436
B-PER I-PER I-PER I-PER         | 73,668
O B-PER I-PER                   | 53,896
O B-PER I-PER I-PER I-PER I-PER | 48,157
B-PER I-PER I-PER I-PER         | 39,970
B-PER I-PER I-PER I-PER I-PER   | 33,474
O B-PER                         | 22,986
O O O B-PER I-PER I-PER I-PER O | 9,968
Table 3: Top 10 Patterns

3.3 Word Embedding

Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in NLP and IR. The embedding vectors are typically learned from term proximity in a large corpus and are used to accurately predict adjacent words for a given word or context [21]. For the purpose of training our NER DNN models, we created word vectors for a vocabulary of 580,614 terms to be used as the word embeddings in the embedding layer of our DNN models. The word vectors were created by training a word2vec continuous bag of words (CBOW) model on an AWS ml.p3.2xlarge instance with a single NVIDIA Tesla V100 GPU. As input to the model, 10 million queries obtained from user session logs, ordered by frequency, were used. The word2vec model was trained in 112.2 minutes for 100 epochs.
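The paper reports the training setup (CBOW, 10 million queries, 200-dimensional vectors, 100 epochs) but not the implementation. Below is a minimal sketch of how such a model could be trained with gensim; the input file name `queries.txt`, the window size and the min_count are assumptions not stated in the paper.

```python
from gensim.models import Word2Vec

# Hypothetical input: one user query per line, e.g. exported session logs.
with open("queries.txt", encoding="utf-8") as f:
    sentences = [line.strip().lower().split() for line in f if line.strip()]

# sg=0 selects CBOW; vector_size=200 matches the 200-dimensional embedding
# layers in Figures 1-7; epochs=100 follows the text. window and min_count
# are assumptions (not given in the paper).
model = Word2Vec(
    sentences,
    vector_size=200,   # gensim >= 4.0 API
    window=5,
    min_count=1,
    sg=0,
    epochs=100,
    workers=4,
)

model.wv.save_word2vec_format("query_vectors.txt")
print(len(model.wv), "words in vocabulary")
```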
3.4 Intent Classification

3.4.1 Baseline vs ML Approaches:

For intent classification, we first tried a few different ML classifiers before delving into deep learning. The ML models were compared against a rule-based query recognition system that was already established as the baseline to improve upon. Classifiers were selected from both the linear and non-linear categories. The classifiers tried and their corresponding results (F1 score) against the baseline are shown in table 4. Our main requirement when picking and implementing the classifiers was to minimize the overall number of false positives. This is because false positives in intent classification can intercept queries that are meant to be routed to a different vertical search engine, affecting overall customer satisfaction with the product. Amongst the linear classifiers, linear SVM performed slightly better than the logistic regression classifier. Amongst the non-linear category, the multi-layer perceptron performed best, beating the decision tree, AdaBoost and naive Bayes approaches.

Model                  | Dev F | Dev FP | Dev FN | Test F | Test FP | Test FN
Rule Engine (Baseline) | 98.66 | 57     | 209    | 98.69  | 52      | 208
Logistic Regression    | 98.31 | 137    | 141    | 98.48  | 132     | 120
Gaussian NB            | 89.79 | 1789   | 20     | 90.08  | 1749    | 20
AdaBoost               | 97.58 | 137    | 256    | 97.54  | 165     | 241
Decision Tree          | 95.75 | 25     | 647    | 95.96  | 25      | 623
Linear SVM             | 98.28 | 139    | 142    | 98.54  | 124     | 117
Multi-layer Perceptron | 98.36 | 107    | 160    | 98.57  | 119     | 117
Table 4: Intent Classification - Baseline vs ML

3.4.2 Feature Engineering for ML Classifiers:

There are different ways one can address a judge in legal taxonomy, such as "chief justice", "associate justice", "magistrate" etc. All of the different phrases used for addressing a judge were compiled into a bag-of-words representation. In addition to bag of words, all the ML classifiers use POS, gazetteer, word shape and orthographic features to represent the semantic and linguistic meaning of a query.

3.4.3 LSTM based Intent Classifier:

The LSTM, first described in [8], attempts to circumvent the vanishing gradient problem by separating the memory and output representation, and having each dimension of the current memory unit depend linearly on the memory unit of the previous time step. The DNN architecture we tried using LSTM is shown in figure 1. The embedding layer that feeds the LSTM layer is composed of a vocabulary of 49,701 words with an output dimension of 200. 100 hidden units are used for the LSTM layer. The flatten layer, which uses 200 hidden units, takes a tensor of any shape and transforms it into a one-dimensional tensor. Both uni- and bi-directional LSTMs were trained for the task.

Figure 1: LSTM for Intent Classification. Layer stack: Embedding [49701 x 200] → LSTM [100, tanh] → Flatten [200, tanh] → TimeDistributed [1, sigmoid]

3.4.4 CNN based Intent Classifier:

CNNs are built out of many layers of pattern recognizers stacked on top of each other. "Convolutional" is a way of saying that the machine looks at small parts of a query first rather than trying to account for the whole thing at once. Each successive layer combines information from these small parts to fill in the bigger picture and assemble complex patterns of meaning. After trying a few variants of LSTM architectures, we started experimenting with CNNs for intent classification. The general neural architecture we used for the CNN is shown in figure 2. The major difference compared to the LSTM models is that the CNNs use two dense layers after the flatten layer, whereas the LSTMs use only one. The 1D convolutional layer uses 32 filters with a kernel size of 8, and the max pooling layer uses 2 strides with a pool size of 2.

Figure 2: CNN for Intent Classification. Layer stack: Embedding [49701 x 200] → Conv1D [32, 8, relu] → MaxPooling1D [2, 2, valid] → Flatten [200, tanh] → TimeDistributed [10, relu] → TimeDistributed [1, sigmoid]
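For illustration, a minimal Keras sketch of the CNN classifier in figure 2 follows. The layer sizes are taken from the figure and table 7; the maximum query length, optimizer and loss are assumptions (the paper does not report them), and Dense layers stand in for the layers the figure labels TimeDistributed after flattening, since the tensor is one-dimensional at that point.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv1D, Dense, Embedding, Flatten, MaxPooling1D

VOCAB_SIZE = 49_701  # vocabulary size from figure 2
EMBED_DIM = 200      # embedding dimension from figure 2
MAX_LEN = 20         # assumed maximum query length (not stated in the paper)

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),                   # Embedding [49701 x 200]
    Conv1D(32, 8, padding="same", activation="relu"),   # Conv1D [32, 8, relu]
    MaxPooling1D(pool_size=2, strides=2),               # MaxPooling1D [2, 2]
    Flatten(),
    Dense(10, activation="relu"),                       # dense head of figure 2
    Dense(1, activation="sigmoid"),                     # judge / non-judge posterior
])

# Optimizer and loss are assumptions; binary cross-entropy matches the
# binary judge / non-judge intent task.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```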
3.4.5 Hybrid Models:

As a next step, the best performing models from both the LSTM and CNN pools were picked and combined into hybrid models for classification. We built two models for this experiment, LSTM + CNN and Bi-LSTM + CNN. The architectural diagram for the Bi-LSTM + CNN model is shown in figure 3.

Figure 3: Bi-LSTM + CNN for Intent Classification. Layer stack: Embedding [49701 x 200] → Conv1D [32, 8, relu] → MaxPooling1D [2, 2, valid] → Bidirectional LSTM [100, tanh] → Flatten [200, tanh] → TimeDistributed [10, relu] → TimeDistributed [1, sigmoid]

3.4.6 Deep Ensemble for Intent Classification:

Ensemble learning is a ML paradigm where multiple learners are trained to solve the same problem. In contrast to ordinary ML approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them. For intent classification, we use stacking, which applies several models to the original data. In stacking, rather than using an empirical formula for the weight function, a logistic regression model takes the outputs of every model and estimates the weights, in other words, determines which models perform well and which badly on the given input data.

To simplify the stacking procedure for ensemble learning, we perform a linear combination of the original class posterior probabilities produced by the best DNN models at the word level (see table 8). A set of parameters in the form of full matrices is associated with the linear combination; these are learned from training data consisting of the word-level posterior probabilities of the different models and the corresponding word-level target values (0 or 1). Figure 4 depicts the model architecture of the ensemble classifier for the task of intent classification.

Figure 4: Ensemble Classifier for Intent Classification. The five base models (CNN, LSTM, Bi-LSTM, LSTM + CNN, Bi-LSTM + CNN) produce hypotheses H1(x) through H5(x) for an input x, which the ensemble combines as Σ σi · Hi(x) to produce the output.

3.4.7 Training:

Table 5 shows the overall training statistics for the different DNN models deployed for intent classification. The models were trained on an AWS ml.p3.8xlarge instance with 4 NVIDIA Tesla V100 GPUs. Average time is measured in minutes.

Model         | Avg Time | Epochs | Batch Size
LSTM          | 75.54    | 50     | 1000
Bi-LSTM       | 93.72    | 30     | 1000
CNN           | 37.22    | 100    | 500
LSTM + CNN    | 84.21    | 50     | 1000
Bi-LSTM + CNN | 101.45   | 50     | 1000
Table 5: Train Statistics - Intent Classification

Architecture | Dropout | Hidden Units | Embedding | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
LSTM I       | False   | 50           | 200       | 99.74       | 25     | 27     | 99.72        | 14      | 42
LSTM II      | False   | 50           | 100       | 99.64       | 52     | 20     | 99.74        | 26      | 27
LSTM III     | False   | 50           | 200       | 99.77       | 23     | 24     | 99.79        | 17      | 25
LSTM IV      | True    | 50           | 200       | 99.66       | 54     | 15     | 99.78        | 24      | 21
Bi-LSTM I    | False   | 100          | 200       | 99.83       | 19     | 20     | 99.84        | 11      | 22
Bi-LSTM II   | True    | 100          | 200       | 99.72       | 38     | 18     | 99.76        | 26      | 23
Table 6: Intent Classification - LSTM

Architecture | Filters | Kernel | Padding | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
CNN I        | 32      | 16     | same    | 99.83       | 21     | 13     | 99.83        | 15      | 18
CNN II       | 32      | 8      | same    | 99.84       | 19     | 13     | 99.83        | 14      | 18
CNN III      | 64      | 8      | same    | 99.82       | 24     | 11     | 99.84        | 13      | 18
CNN IV       | 64      | 16     | same    | 99.78       | 28     | 15     | 99.89        | 15      | 15
CNN V        | 64      | 4      | same    | 99.81       | 31     | 9      | 99.83        | 20      | 14
Table 7: Intent Classification - CNN

Model               | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
LSTM III            | 99.77       | 23     | 24     | 99.79        | 17      | 25
Bi-LSTM I           | 99.81       | 19     | 20     | 99.84        | 11      | 22
CNN II              | 99.84       | 19     | 13     | 99.83        | 14      | 18
LSTM III + CNN II   | 99.81       | 26     | 14     | 99.90        | 10      | 11
Bi-LSTM I + CNN II  | 99.88       | 14     | 11     | 99.86        | 7       | 22
Ensemble (Top 5)    | 99.91       | 9      | 9      | 99.91        | 4       | 15
Table 8: Intent Classification - Winners & Ensemble
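As a sketch of the stacking procedure described in Section 3.4.6, the snippet below fits a logistic-regression combiner over the posterior probabilities of the five base models, as depicted in figure 4. The posteriors here are synthetic stand-ins; in practice each column would come from a trained DNN's predictions on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 0/1 intent labels for 1,000 queries and the posterior
# P(judge | query) from five base models (CNN, LSTM, Bi-LSTM, LSTM + CNN,
# Bi-LSTM + CNN). In practice, each column is a model's predict() output on
# a held-out set the base models were not trained on.
y = rng.integers(0, 2, size=1000)
posteriors = np.clip(y[:, None] * 0.8 + rng.normal(0.1, 0.15, size=(1000, 5)), 0.0, 1.0)

# The combiner learns the weights sigma_i of the linear combination
# sum_i sigma_i * H_i(x) shown in figure 4.
meta = LogisticRegression()
meta.fit(posteriors, y)
print("learned combination weights:", meta.coef_.ravel())
```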
3.5 Named Entity Recognition

3.5.1 Baseline:

The baseline model for NER is shown in table 12. It was previously established and is the same rule-based system that was set as the baseline for intent classification. Table 12 also shows metrics for a conditional random field (CRF) probabilistic model that was previously implemented. The CRF model was eventually discarded, since it was outperformed by most of the DNN models.

3.5.2 RNN based Named Entity Recognition:

For NER, we started with RNNs. The neural architecture for the RNN is shown in figure 5. For the embedding layer, we used pre-trained word2vec embeddings of dimensions (577,149 x 200). For the RNN layer, 200 hidden units were used. The time-distributed output layer has 3 units and uses a softmax function, since we have 3 classes in total (O, B-PER & I-PER). A dropout value of 0.2 was also used in this configuration.

Figure 5: RNN for NER. Layer stack: Embedding [577149 x 200] → Dropout [0.2] → RNN [200, tanh] → TimeDistributed [3, softmax]

3.5.3 LSTM based Named Entity Recognition:

LSTMs in general circumvent the vanishing gradient problem faced by RNNs. For NER, we used both uni- and bi-directional LSTMs. The architecture is similar to the one shown in figure 5, except that the RNN layer is replaced by an LSTM layer.

3.5.4 CNN based Named Entity Recognition:

For NER, we also experimented with a CNN. The neural architecture is shown in figure 6. The Conv1D layer consists of 128 filters with the kernel size set to 5. In contrast to the CNN used for intent classification, this architecture does not use a max pooling layer.

Figure 6: CNN for NER. Layer stack: Embedding [577149 x 200] → Conv1D [128, 5, relu] → TimeDistributed [3, softmax]

Architecture | Filters | Kernel | Padding | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
CNN I        | 128     | 5      | same    | 99.2707       | 99.2654    | 99.2665     | 99.0583        | 99.0511     | 99.0526
CNN II       | 128     | 5      | same    | 98.6557       | 98.6435    | 98.6468     | 98.9799        | 98.9654     | 98.9681
CNN III      | 64      | 5      | same    | 99.2632       | 99.2549    | 99.2564     | 99.0729        | 99.0605     | 99.0627
CNN IV       | 128     | 10     | same    | 99.3972       | 99.3967    | 99.3969     | 99.2193        | 99.2177     | 99.2181
Table 9: Named Entity Recognition - CNN

3.5.5 GRU based Named Entity Recognition:

More recently, gated recurrent units have been proposed [9] as a simplification of the LSTM that keeps the ability to retain information over long sequences. Unlike the LSTM, the GRU uses only two gates, has no separate memory units, and performs its linear interpolation in the hidden state. As part of our experiments, we replaced the RNN layer in the architecture shown in figure 5 with a GRU layer.

Architecture | Dropout | Hidden Units | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
RNN I        | 0.4     | 100          | 99.1119       | 99.1022    | 99.1042     | 98.8990        | 98.8876     | 98.8900
RNN II       | 0.4     | 100          | 99.1171       | 99.1063    | 99.1084     | 98.9171        | 98.9039     | 98.9065
RNN III      | 0.2     | 100          | 99.1750       | 99.1665    | 99.1682     | 98.9818        | 98.9711     | 98.9733
LSTM         | 0.4     | 100          | 99.1727       | 99.1624    | 99.1643     | 98.9894        | 98.9764     | 98.9788
Bi-LSTM      | 0.4     | 100          | 99.5344       | 99.5331    | 99.5334     | 99.3422        | 99.3397     | 99.3403
GRU          | 0.4     | 100          | 99.1256       | 99.1227    | 99.1235     | 98.8564        | 98.8523     | 98.8534
Bi-GRU       | 0.4     | 100          | 99.4734       | 99.4733    | 99.4734     | 99.2931        | 99.2929     | 99.2930
Table 10: Named Entity Recognition - Recurrent Neural Networks

3.5.6 Hybrid Models:

For NER, we also combined the models discussed above. The hybrid models we built are RNN + CNN, LSTM + CNN and GRU + CNN (both uni- and bi-directional). The architecture of Bi-LSTM + CNN is shown in figure 7.

Figure 7: Bi-LSTM + CNN for NER. Layer stack: Embedding [577149 x 200] → Conv1D [128, 5, relu] → Dropout [0.2] → Bidirectional LSTM [200, tanh] → TimeDistributed [3, softmax]

Model             | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
RNN III + CNN IV  | 99.3812       | 99.3794    | 99.3798     | 99.2027        | 99.1998     | 99.2635
LSTM + CNN IV     | 99.4036       | 99.3995    | 99.4002     | 99.2690        | 99.2624     | 99.2635
Bi-LSTM + CNN IV  | 99.5286       | 99.5262    | 99.5267     | 99.4043        | 99.4007     | 99.4013
GRU + CNN IV      | 99.4365       | 99.4323    | 99.4330     | 99.2528        | 99.2466     | 99.2477
Bi-GRU + CNN IV   | 99.5320       | 99.5294    | 99.5299     | 99.3829        | 99.3786     | 99.3793
Table 11: Named Entity Recognition - Hybrid Models
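A minimal Keras sketch of the Bi-LSTM + CNN tagger in figure 7 follows. Layer sizes mirror the figure; padding="same" keeps one output step per input token so the time-distributed softmax can emit a tag per token (consistent with the padding in table 9); the maximum query length, optimizer and loss are assumptions not reported in the paper.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Bidirectional, Conv1D, Dense, Dropout,
                                     Embedding, LSTM, TimeDistributed)

VOCAB_SIZE = 577_149  # pre-trained word2vec vocabulary (figure 7)
EMBED_DIM = 200
MAX_LEN = 20          # assumed maximum query length (not stated in the paper)
NUM_TAGS = 3          # O, B-PER, I-PER

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),                        # Embedding [577149 x 200]
    Conv1D(128, 5, padding="same", activation="relu"),       # Conv1D [128, 5, relu]
    Dropout(0.2),                                            # Dropout [0.2]
    Bidirectional(LSTM(200, return_sequences=True)),         # Bidirectional LSTM [200]
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),  # one tag per token
])

# Optimizer and loss are assumptions; sparse categorical cross-entropy
# fits integer-encoded O / B-PER / I-PER tags.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```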
3.5.7 Deep Ensemble for Named Entity Recognition:

To create an ensemble for NER, the DNN models were ranked by their F1 score. The top 5 models were then picked and stacked into an ensemble. An ensemble of the top 10 models was also tried and discarded, since it underperformed compared to the ensemble of the top 5. Figure 8 shows the architecture of the chosen ensemble model.

Figure 8: Ensemble Classifier for Named Entity Recognition. The five base models (CNN, Bi-LSTM, LSTM + CNN, Bi-LSTM + CNN, Bi-GRU + CNN) produce hypotheses H1(x) through H5(x) for an input x, which the ensemble combines as Σ σi · Hi(x) to produce the output.

Model                  | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
Rule Engine (Baseline) | 93.2794       | 92.9001    | 92.7009     | 94.0157        | 93.7872     | 93.6778
CRF                    | 92.1442       | 89.3441    | 91.3463     | 91.3421        | 90.0323     | 90.6341
CNN IV                 | 99.3972       | 99.3967    | 99.3969     | 99.2193        | 99.2177     | 99.2181
LSTM + CNN IV          | 99.4036       | 99.3995    | 99.4002     | 99.2692        | 99.2624     | 99.2635
Bi-LSTM                | 99.5344       | 99.5331    | 99.5334     | 99.3422        | 99.3397     | 99.3403
Bi-LSTM + CNN IV       | 99.5286       | 99.5262    | 99.5267     | 99.4043        | 99.4007     | 99.4013
Bi-GRU + CNN IV        | 99.5323       | 99.5294    | 99.5299     | 99.3829        | 99.3786     | 99.3793
Ensemble (Top 5)       | 99.5596       | 99.5577    | 99.5581     | 99.4193        | 99.4159     | 99.4165
Table 12: Named Entity Recognition - Winners & Ensemble

3.5.8 Training:

Per-model training times are shown in table 13. A batch here corresponds to a chunk of user input queries. The neural architectures were implemented using TensorFlow, Keras and scikit-learn.

Model         | Avg Time | Epochs | Batch Size
RNN           | 85.23    | 50     | 1000
LSTM          | 104.38   | 50     | 1000
Bi-LSTM       | 123.84   | 50     | 1000
GRU           | 94.20    | 50     | 1000
Bi-GRU        | 104.27   | 50     | 1000
CNN           | 72.02    | 100    | 500
RNN + CNN     | 97.93    | 50     | 1000
LSTM + CNN    | 134.81   | 50     | 1000
Bi-LSTM + CNN | 154.43   | 50     | 1000
GRU + CNN     | 115.24   | 50     | 1000
Bi-GRU + CNN  | 132.64   | 50     | 1000
Table 13: Train Statistics - Named Entity Recognition

4. Results

4.1 Evaluation Metrics

We use standard measures to evaluate the performance of our classifiers: precision, recall and F1-measure. Precision (P) is the proportion of actual positive class members among all predicted positive class members returned by our method. Recall (R) is the proportion of predicted positive members among all actual positive class members in the data. F1 is the harmonic mean of precision and recall, defined as F1 = 2PR/(P + R).
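As a quick worked example of these definitions, the helper below computes P, R and F1 from raw true-positive, false-positive and false-negative counts; the counts used here are illustrative, not taken from the tables.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """P, R and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)  # F1 = 2PR / (P + R)

# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=9_850, fp=4, fn=15)
print(f"P = {p:.4f}, R = {r:.4f}, F1 = {f1:.4f}")
```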
4.2 Best Performers

Empirical results for intent classification are shown in tables 6, 7 & 8. Results for NER are shown in tables 9, 10, 11 & 12. Based on the evaluation metrics, we can clearly see that the ensemble models outperform all other DNN models on both tasks, intent classification as well as NER, by a good margin. For intent classification, the ensemble model (top 5) in table 8 has the lowest counts of false positives and false negatives on both the dev and test data sets. It also has the highest F1 score, 99.91%, beating the baseline rule-based system by a margin of ≈1.5%. For NER, the ensemble model (top 5) in table 12 outperforms all other DNN models and beats the baseline model by a margin of ≈6%.

5. Conclusion and Future Work

Our results show that an ensemble model stacking different DNNs of varying architectures outperforms the individual DNNs for the tasks of legal intent classification and entity recognition. RNNs, LSTMs, GRUs and even CNNs all compress the necessary information of a source query into a fixed-length vector. This makes it difficult for the DNNs to cope with long queries, especially those that are longer than the queries in the training corpus. In the future, we plan to use attention within queries. Attention is the idea of freeing a DNN architecture from the fixed-length internal representation. The DNN models we trained are at the word level; in the future, we plan to expand the size of the training data and try DNN models at the character level. Moreover, since the differences in performance between the DNN models were rather small, we plan to run tests of statistical significance and error analysis to capture performance by pattern. Lastly, we also plan to look into the impact of data and covariate shifts on our models.

6. Acknowledgements

This research was supported by LexisNexis, Raleigh Technology Center, USA.

References

[1] http://www.legalexecutiveinstitute.com.
[2] L. Deng and J. C. Platt, "Ensemble Deep Learning for Speech Recognition," in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 1915-1919, 2014.
[3] X. Zhou, L. Xie, P. Zhang, and Y. Zhang, "An Ensemble of Deep Neural Networks for Object Tracking," in 2014 IEEE International Conference on Image Processing (ICIP), pp. 843-847, 2014.
[4] S. G. Soderland, "Building a Machine Learning based Text Understanding System," 2001.
[5] S. M. Beitzel, E. C. Jensen, O. Frieder, D. D. Lewis, A. Chowdhury, and A. Kolcz, "Improving Automatic Query Classification via Semi-supervised Learning," in Fifth IEEE International Conference on Data Mining (ICDM'05), 2005.
[6] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang, "Q2C@UST: Our Winning Solution to Query Classification in KDDCUP 2005," SIGKDD Explorations, vol. 7, pp. 100-110, 2005.
[7] S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang, "DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 1295-1304, 2016.
[8] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[9] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," CoRR, vol. abs/1412.3555, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, USA, pp. 1106-1114, 2012.
[11] X. Zhang and Y. LeCun, "Text Understanding from Scratch," CoRR, vol. abs/1502.01710, 2015.
[12] J. Hu, G. Wang, F. H. Lochovsky, J. Sun, and Z. Chen, "Understanding User's Query Intent with Wikipedia," in Proceedings of the 18th International Conference on World Wide Web (WWW 2009), Madrid, Spain, pp. 471-480, 2009.
[13] H. B. Hashemi, A. Asiaee, and R. Kraft, "Query Intent Detection using Convolutional Neural Networks," in WSDM QRUMS Workshop, 2016.
[14] J. P. C. Chiu and E. Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs," TACL, vol. 4, pp. 357-370, 2016.
[15] N. Limsopatham and N. Collier, "Bidirectional LSTM for Named Entity Recognition in Twitter Messages," in Proceedings of the 2nd Workshop on Noisy User-generated Text (NUT@COLING 2016), Osaka, Japan, pp. 145-152, 2016.
[16] K. Sugathadasa, B. Ayesha, N. de Silva, A. S. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning," CoRR, vol. abs/1805.10685, 2018.
[17] R. Nanda, K. J. Adebayo, L. D. Caro, G. Boella, and L. Robaldo, "Legal Information Retrieval using Topic Clustering and Neural Networks," in COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (held in conjunction with ICAIL 2017), King's College London, UK, pp. 68-78, 2017.
[18] A. H. N. Tran, "Applying Deep Neural Network to Retrieve Relevant Civil Law Articles," in Proceedings of the Student Research Workshop Associated with RANLP 2017, Varna, pp. 46-48, 2017.
[19] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF Models for Sequence Tagging," CoRR, vol. abs/1508.01991, 2015.
[20] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep Contextualized Word Representations," CoRR, vol. abs/1802.05365, 2018.
Dean, “Distributed Representations of Words and Classification in KDDCUP 2005,” SIGKDD Explorations, Phrases and their Compositionality.,” in NIPS (C. J. C. vol. 7, pp. 100–110, 2005. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 3111–3119, 2013. [7] S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang, “DeepIn- tent: Learning Attentions for Online Advertising with Re- current Neural Networks,” in Proceedings of the 22nd ACM