Deep Ensemble Learning for Legal Query Understanding

Arunprasath Shankar
LexisNexis, Raleigh, USA
arunprasath.shankar@lexisnexis.com

Venkata Nagaraju Buddarapu
LexisNexis, Raleigh, USA
venkatanagaraju.buddarapu@lexisnexis.com

Abstract

Legal query understanding is a complex problem that involves two natural language processing (NLP) tasks that need to be solved together: (i) identifying the intent of the user and (ii) recognizing entities within the queries. The problem equates to decomposing a legal query into its individual components and deciphering the underlying differences that can occur due to pragmatics. Identifying the desired intent and recognizing the correct entities helps us return relevant results to the user. Deep Neural Networks (DNNs) have recently achieved great success, surpassing traditional statistical approaches. In this work, we experiment with several DNN architectures for legal query intent classification and entity recognition. Deep neural architectures like Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM), Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) were applied and compared against one another, both individually and in combination. The models were also compared against machine learning (ML) and rule-based approaches. In this paper, we describe a methodology that integrates the posterior probabilities produced by the best DNN models into a stacked framework that combines the different predictors to improve prediction accuracy and F-measure for legal intent classification and entity recognition.

Copyright © CIKM 2018 for the individual papers by the papers' authors and for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the US legal services market approaching $500 bn in 2018, roughly 2% of US GDP [1], and substantial incentives to increase market share, maximizing user satisfaction with search results continues to be the primary focus of the legal search industry. Recently, there has been an upward trend of legal companies embracing vertical search engines, since they provide a specialized type of information service to satisfy a user's intent. Using a specialized user interface, a vertical search engine can return more relevant results than a general search engine for in-domain legal queries.

In practice, a user has to decide on a vertical search engine beforehand to satisfy his query intent. It would be convenient if a query intent identifier could be provided in a general search engine that could precisely predict whether a query should trigger a vertical search in a certain domain. Moreover, since a user's query may implicitly express more than one intent, it would be very helpful if a general search engine could detect all query intents, distribute them to the appropriate vertical search engines and effectively organize the results from the different vertical search engines to satisfy the user's need. Consequently, understanding query intent is crucial for providing better search results and thus improving the overall satisfaction of the user.

Understanding intent and identifying legal entities in a user's query can help legal search automatically route the query to the corresponding vertical search engines and obtain relevant content, greatly improving user satisfaction and providing better assistance to law researchers in supporting legal arguments and decisions.
Law researchers have to cope with a tremendous load of legal content, since sources of law originate from diversified sources such as the judicial branch, the legislative branch (statutes), legal reference books, journals, news etc. This makes legal search, and understanding the queries submitted to it, an important aspect of retrieving relevant documents to support one's argument.

Legal search is hard, as it demands writing complex queries to retrieve the desired content from information retrieval (IR) systems. Classifying legal queries and identifying domain-specific legal entities within them is even harder. For example, in the query "who is supreme court magistrate John Roberts and abortion law?", the word "magistrate" can be resolved to a judge entity when observed along with the context phrase "supreme court". Similarly, the phrase "abortion law" can be identified as a legal topic when seen alongside a supporting context. However, since we also observe the interrogative phrase "who is", we can safely assume that the intent of this query is a person or judge profile search.

Like general web queries, legal queries can be classified into one of three categories. (i) Do: the user wants to do something, like buy a product or service, e.g., buy law books, research guides, police/personal reports, real estate property records etc. (ii) Know: an informational query, where the user wants to learn about a subject, e.g., a law, statute or doctrine. Very often, single-word queries are classified at least partially as "Know" queries. (iii) Go: also known as a navigational query, where the user wants to go to a specific site scoped to a particular legal entity, e.g., a query where a user wants to see analytics for a specific legal entity like a judge or an expert witness. Our research in this work focuses only on "Go" type queries, e.g., a user wanting to see a particular judge profile with the query "judge John D. Roberts".

The process of finding named entities in a text and classifying them to a semantic type is called named entity recognition (NER). Legal NER is nearly always used in conjunction with intent classification systems. Given a query, the goal of the NER system described in this paper is two-fold: (i) segment the input into semantic chunks, and (ii) classify each chunk into a predefined set of semantic classes. For example, given the query "judge john d roberts", the desired output would be "judge" = O and "john d roberts" = PER. Here, the PER class represents a person or judge entity and the O class represents any non judge specific term.

The aim of this research is to improve the understanding of queries involved in legal research. In this paper, we explore three different approaches: (a) machine learning (ML) with feature engineering, (b) deep learning (DL) without any feature engineering and (c) ensembles of deep neural networks. We then perform quantitative evaluations in comparison to an already existing baseline model, a rule-based classification system in production. Finally, we show from our experimental results that a deep ensemble model significantly outperforms the other approaches for both intent classification and legal NER.

2. Background and Related Work

DL systems have dramatically improved the state of the art in several domains like NLP, computer vision and image processing. Various deep architectures and learning methods have been developed with distinct strengths and weaknesses in recent years. Deep ensemble learning is a learning paradigm in which ensembles of several neural networks show improved generalization capabilities that outperform those of single networks. Ensemble learning remains applicable to deep multi-layer neural networks [2, 3]. However, there is little work in the legal domain involving either deep learning or ensemble learning. How ensemble learning can be applied to various DNN architectures to achieve better results on legal tasks is the primary focus of this paper.

Most ML approaches to text understanding consist of tokenizing a string of characters into structures such as words, phrases, sentences or paragraphs, and then applying some statistical classification algorithm to the statistics of those structures [4]. These techniques work well when applied to a narrowly defined domain.
Typical queries submitted to legal search engines contain very short keyword phrases, which are generally insufficient to fully describe a user's information need. It is thus a challenging problem to classify millions of queries into predefined categories. A variety of related topical query classification problems have been investigated in the past [5, 6]. Most of them use statistical machine learning methods to train a classifier to predict the category of an input query. From the statistical learning perspective, in order to obtain a classifier that generalizes well to future unseen data, two conditions should be satisfied: a discriminative feature representation and sufficient training samples. However, for the problem of query intent classification, even though there are huge volumes of legal queries, both conditions are hard to meet due to the sparseness of query features coupled with the sparseness of labeled training data. In [5], Beitzel et al. attempted to solve this problem by augmenting the query with more features using external knowledge, such as search engine results, and achieved fair results.

DNNs have revolutionized the field of NLP. RNNs and CNNs, the two main types of DNN architectures, are widely explored for various NLP tasks. CNNs are considered good at extracting position-invariant features and RNNs at modeling units in sequence, and the state of the art on many NLP tasks often switches back and forth between the two. While CNNs take advantage of local coherence in the input to cut down on the number of weights, RNNs are used to process sequential data (often with LSTM cells). RNNs are also good at representing very large implicit internal structures that are difficult even to think about. In summary, the conventional wisdom is that RNNs should be used when the context is richer and there is more state information that needs to be captured.
This proposition has recently been challenged by CNNs, with the claim that finite state information of limited scope can be handled more efficiently by multiple convolution layers. We think both are true: one should not choose RNNs just for the sake of it; instead, more efficient deep CNNs should be tried in limited-context situations. But for more complex implicit mappings, where context and state information spans are much bigger, RNNs are the best and, at this point, almost the only tool.

There is not a lot of previous research involving DL in the legal domain for the problems of intent classification and NER. However, there are a handful of papers on DNN approaches for non-legal intent classification and general NER, and recently RNNs and CNNs have been applied to a variety of NLP tasks with varying degrees of success. Below, we discuss the evolution of various DNNs and their applications to NLP tasks.

In [7], Zhai et al. adopted RNNs as building blocks to learn desired representations from massive user click logs. The authors proposed a novel attention network that learns to assign attention scores to words within a sequence (query or ad). In [8], Hochreiter and Schmidhuber introduced the LSTM, which solved complex, artificial long-time-lag tasks that had never been solved by previous recurrent network algorithms. In [9], Chung et al. compared different types of recurrent units in RNNs, especially focusing on units that implement a gating mechanism, such as LSTM and GRU units. They evaluated these units on the tasks of polyphonic music modeling and speech signal modeling and showed that LSTMs and GRUs work better than traditional recurrent units.
In [10], Krizhevsky et al. applied deep CNNs to image classification on the ImageNet dataset and established new state-of-the-art results. In [11], Zhang and LeCun demonstrated that DL can be applied to text understanding from character-level inputs all the way up to abstract text concepts, using temporal CNNs. They applied CNNs to various large-scale datasets, including ontology classification, sentiment analysis and text categorization, and showed that temporal CNNs can achieve astonishing performance without any prior knowledge of the syntactic or semantic structure of a human language. With respect to intent classification, Hu et al. devised a methodology to identify query intent by mapping the query to a representation space backed by Wikipedia [12]. In [13], the authors applied CNNs to intent classification and achieved results on par with the state of the art.

There are also many recent works that combine RNNs with CNNs for different NLP tasks. For example, in [14], Chiu and Nichols presented a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering, and established a new state-of-the-art performance with an F1 score of 91.62% on the CoNLL-2003 data set. In [15], Limsopatham and Collier proposed a bidirectional LSTM (Bi-LSTM) to automatically learn orthographic features from tweets.

A handful of papers delve into legal applications using DNNs. In [16], Sugathadasa et al. proposed a system that includes a page-ranking graph network with TF-IDF to build document embeddings by creating a vector space for the legal domain, which can be trained using a doc2vec neural network model supporting incremental and extensive training for scalability. In [17], Nanda et al. proposed a hybrid model using LSTM and CNN which utilizes word embeddings trained on the Google News vectors, and evaluated the results on the COLIEE 2017 dataset. They demonstrated that the performance of the LSTM + CNN model was competitive with other textual entailment systems. Similarly, in [18], the author proposed a methodology employing DNNs and word2vec for retrieval of civil law articles.

For NER, Huang et al. proposed a variety of LSTM-based models, such as LSTM + CRF, and achieved state-of-the-art results [19]. Recently, in [20], Peters et al. introduced a new type of deep contextualized word representation that models both semantics and polysemy and improved the state of the art across six challenging NLP problems, including question answering, textual entailment, sentiment analysis and NER.

In this paper, we limit the scope of our research to judge queries, meaning queries that are meant to be a search for a judge profile. Given this scope, the intent classification task needs to classify a given query as a judge query or not. Once the query is identified as a judge query, the NER system that follows recognizes whether the query contains any person entities. The entities are then routed to the vertical searches that follow. The structure of the paper is as follows. Section 3 discusses the various experiments that were carried out to solve legal query understanding. Section 4 presents and analyzes the overall results. The paper ends with Section 5, which gives the conclusion and a discussion of future work.

3. Models and Experiments

The adoption of artificial intelligence (AI) technology is undoubtedly transforming the practice of law. Many in the legal profession are aware that using AI can greatly reduce time and costs while improving predictive measures. In the following subsections, we discuss data collection, augmentation and the training of the different DNN models for the two legal tasks we experimented with: (i) intent classification and (ii) legal entity recognition.

3.1 Data Collection

For the purpose of our experiments, we created data sets comprising different types of legal queries such as judge search, case law search, statutes/elements search, etc. For intent classification, since the classification task is binary, we labeled all judge queries as positive and the rest as negative. For NER, we used three labels: (i) O, used to tag any tokens that are not part of a judge name, (ii) B-PER, which denotes the beginning of a person (judge) name, and (iii) I-PER, which denotes the inside of a name. Table 1 portrays the different query types by volume used in our experiments. All of the collected data was labeled by our subject matter experts (SMEs). All of the data discussed here is proprietary to LexisNexis.

Type         | Example                                     | Volume %
Judge        | judge John D. Roberts                       | 34
Word Wheel   | Ohio Municipal Court, Bellefontaine         | 6
Query Log    | sexual harassment                           | 5
Statute      | statute /s limitations /s actual /s fraud   | 31
Elements Law | contract defense unconscionability elements | 20
Case Search  | Powers v. USAA                              | 4
Table 1: Data Types by Volume
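To make the labeling scheme concrete, the sketch below shows how a single judge query would look when tokenized and tagged under the O / B-PER / I-PER scheme just described. This is a minimal illustration only; whitespace tokenization and the label assignment are assumptions based on the example in Section 2, not the paper's actual preprocessing code.

```python
# Minimal illustration of the O / B-PER / I-PER tagging scheme from
# Section 3.1. Whitespace tokenization is an assumption; the paper
# does not describe its tokenizer.
query = "judge john d roberts"
tokens = query.split()
labels = ["O", "B-PER", "I-PER", "I-PER"]  # "judge" is not part of the name

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
# judge    O
# john     B-PER
# d        I-PER
# roberts  I-PER
```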
3.2 Data Augmentation and Balancing

The labeled data was good enough to get started with, but was highly imbalanced with respect to NER labels. In order to balance the data set, we augmented the data by (i) expanding queries by pattern using judge names from our proprietary judge master database, (ii) oversampling the under-represented patterns and (iii) undersampling the over-represented patterns. The top 10 patterns by frequency are shown in table 3; these 10 patterns alone contributed ≈78% of the labeled data. The balancing was carried out using two strategies. For oversampling, we pick a pattern and create synthetic data points (queries) by randomly substituting judge names/titles into the pattern being fitted. For undersampling, random data points are selected from the buckets of patterns and removed. The sampling process stops when there is an equal number of queries in all pattern buckets. Once the data was augmented and balanced, we split it into train, dev and test sets. Table 2 shows the data splits for both intent classification and NER.

      | Intent  | NER
Train | 505,760 | 1,052,000
Dev   | 20,000  | 20,000
Test  | 20,000  | 20,000
Table 2: Data Sets

Pattern                         | Count
B-PER I-PER I-PER               | 217,374
B-PER I-PER                     | 109,500
O B-PER I-PER I-PER             | 92,436
B-PER I-PER I-PER I-PER         | 73,668
O B-PER I-PER                   | 53,896
O B-PER I-PER I-PER I-PER I-PER | 48,157
B-PER I-PER I-PER I-PER         | 39,970
B-PER I-PER I-PER I-PER I-PER   | 33,474
O B-PER                         | 22,986
O O O B-PER I-PER I-PER I-PER O | 9,968
Table 3: Top 10 Patterns

3.3 Word Embedding

Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in NLP and IR. The embedding vectors are typically learned from term proximity in a large corpus and are used to accurately predict adjacent words for a given word or context [21]. For the purpose of training our NER DNN models, we created word vectors for a vocabulary of 580,614 terms to be used as the word embeddings in the embedding layer of our DNN models. The word vectors were created by training a word2vec continuous bag of words (CBOW) model on an AWS ml.p3.2xlarge instance with a single NVIDIA Tesla V100 GPU. As input to the model, 10 million queries obtained from user session logs, ordered by frequency, were used. The word2vec model was trained in 112.2 minutes for 100 epochs.
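The paper reports the training setup (CBOW, 10 million queries, 200-dimensional vectors, 100 epochs) but not the implementation. Below is a minimal sketch of how such a model could be trained with gensim; the input file name `queries.txt`, the window size and the min_count are assumptions not stated in the paper.

```python
from gensim.models import Word2Vec

# Hypothetical input: one user query per line, e.g. exported session logs.
with open("queries.txt", encoding="utf-8") as f:
    sentences = [line.strip().lower().split() for line in f if line.strip()]

# sg=0 selects CBOW; vector_size=200 matches the 200-dimensional embedding
# layers in Figures 1-7; epochs=100 follows the text. window and min_count
# are assumptions (not given in the paper).
model = Word2Vec(
    sentences,
    vector_size=200,   # gensim >= 4.0 API
    window=5,
    min_count=1,
    sg=0,
    epochs=100,
    workers=4,
)

model.wv.save_word2vec_format("query_vectors.txt")
print(len(model.wv), "words in vocabulary")
```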
3.4 Intent Classification

3.4.1 Baseline vs ML Approaches:

For intent classification, we first tried a few different ML classifiers before delving into deep learning. The ML models were compared against a rule-based query recognition system that was already established as the baseline to improve upon. Classifiers were selected from both the linear and non-linear categories. The classifiers tried and their corresponding results (F1 score) against the baseline are shown in table 4. Our main requirement when picking and implementing the classifiers was to minimize the overall number of false positives. This is because false positives in intent classification can intercept queries that are meant to be routed to a different vertical search engine, affecting overall customer satisfaction with the product. Amongst the linear classifiers, linear SVM performed slightly better than the logistic regression classifier. Amongst the non-linear category, the multi-layer perceptron performed best, beating the decision tree, AdaBoost and naive Bayes approaches.

Model                  | Dev F | Dev FP | Dev FN | Test F | Test FP | Test FN
Rule Engine (Baseline) | 98.66 | 57     | 209    | 98.69  | 52      | 208
Logistic Regression    | 98.31 | 137    | 141    | 98.48  | 132     | 120
Gaussian NB            | 89.79 | 1789   | 20     | 90.08  | 1749    | 20
AdaBoost               | 97.58 | 137    | 256    | 97.54  | 165     | 241
Decision Tree          | 95.75 | 25     | 647    | 95.96  | 25      | 623
Linear SVM             | 98.28 | 139    | 142    | 98.54  | 124     | 117
Multi-layer Perceptron | 98.36 | 107    | 160    | 98.57  | 119     | 117
Table 4: Intent Classification - Baseline vs ML

3.4.2 Feature Engineering for ML Classifiers:

There are different ways one can address a judge in legal taxonomy, such as "chief justice", "associate justice", "magistrate" etc. All of the different phrases used for addressing a judge were compiled into a bag-of-words representation. In addition to bag of words, all the ML classifiers use POS, gazetteer, word shape and orthographic features to represent the semantic and linguistic meaning of a query.

3.4.3 LSTM based Intent Classifier:

The LSTM, first described in [8], attempts to circumvent the vanishing gradient problem by separating the memory and output representation, and having each dimension of the current memory unit depend linearly on the memory unit of the previous time step. The DNN architecture we tried using LSTM is shown in figure 1. The embedding layer that feeds the LSTM layer is composed of a vocabulary of 49,701 words with an output dimension of 200. 100 hidden units are used for the LSTM layer. The flatten layer, which uses 200 hidden units, takes a tensor of any shape and transforms it into a one-dimensional tensor. Both uni- and bi-directional LSTMs were trained for the task.

Figure 1: LSTM for Intent Classification. Layer stack: Embedding [49701 x 200] → LSTM [100, tanh] → Flatten [200, tanh] → TimeDistributed [1, sigmoid]

3.4.4 CNN based Intent Classifier:

CNNs are built out of many layers of pattern recognizers stacked on top of each other. "Convolutional" is a way of saying that the machine looks at small parts of a query first rather than trying to account for the whole thing at once. Each successive layer combines information from these small parts to fill in the bigger picture and assemble complex patterns of meaning. After trying a few variants of LSTM architectures, we started experimenting with CNNs for intent classification. The general neural architecture we used for the CNN is shown in figure 2. The major difference compared to the LSTM models is that the CNNs use two dense layers after the flatten layer, whereas the LSTMs use only one. The 1D convolutional layer uses 32 filters with a kernel size of 8, and the max pooling layer uses 2 strides with a pool size of 2.

Figure 2: CNN for Intent Classification. Layer stack: Embedding [49701 x 200] → Conv1D [32, 8, relu] → MaxPooling1D [2, 2, valid] → Flatten [200, tanh] → TimeDistributed [10, relu] → TimeDistributed [1, sigmoid]
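For illustration, a minimal Keras sketch of the CNN classifier in figure 2 follows. The layer sizes are taken from the figure and table 7; the maximum query length, optimizer and loss are assumptions (the paper does not report them), and Dense layers stand in for the layers the figure labels TimeDistributed after flattening, since the tensor is one-dimensional at that point.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv1D, Dense, Embedding, Flatten, MaxPooling1D

VOCAB_SIZE = 49_701  # vocabulary size from figure 2
EMBED_DIM = 200      # embedding dimension from figure 2
MAX_LEN = 20         # assumed maximum query length (not stated in the paper)

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),                   # Embedding [49701 x 200]
    Conv1D(32, 8, padding="same", activation="relu"),   # Conv1D [32, 8, relu]
    MaxPooling1D(pool_size=2, strides=2),               # MaxPooling1D [2, 2]
    Flatten(),
    Dense(10, activation="relu"),                       # dense head of figure 2
    Dense(1, activation="sigmoid"),                     # judge / non-judge posterior
])

# Optimizer and loss are assumptions; binary cross-entropy matches the
# binary judge / non-judge intent task.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```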
3.4.5 Hybrid Models:

As a next step, the best performing models from both the LSTM and CNN pools were picked and combined into hybrid models for classification. We built two models for this experiment, LSTM + CNN and Bi-LSTM + CNN. The architectural diagram for the Bi-LSTM + CNN model is shown in figure 3.

Figure 3: Bi-LSTM + CNN for Intent Classification. Layer stack: Embedding [49701 x 200] → Conv1D [32, 8, relu] → MaxPooling1D [2, 2, valid] → Bidirectional LSTM [100, tanh] → Flatten [200, tanh] → TimeDistributed [10, relu] → TimeDistributed [1, sigmoid]

3.4.6 Deep Ensemble for Intent Classification:

Ensemble learning is a ML paradigm where multiple learners are trained to solve the same problem. In contrast to ordinary ML approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them. For intent classification, we use stacking, which applies several models to the original data. In stacking, rather than using an empirical formula for the weight function, a logistic regression model takes the outputs of every model and estimates the weights, in other words, determines which models perform well and which badly on the given input data.

To simplify the stacking procedure for ensemble learning, we perform a linear combination of the original class posterior probabilities produced by the best DNN models at the word level (see table 8). A set of parameters in the form of full matrices is associated with the linear combination; these are learned from training data consisting of the word-level posterior probabilities of the different models and the corresponding word-level target values (0 or 1). Figure 4 depicts the model architecture of the ensemble classifier for the task of intent classification.

Figure 4: Ensemble Classifier for Intent Classification. The five base models (CNN, LSTM, Bi-LSTM, LSTM + CNN, Bi-LSTM + CNN) produce hypotheses H1(x) through H5(x) for an input x, which the ensemble combines as Σ σi · Hi(x) to produce the output.

3.4.7 Training:

Table 5 shows the overall training statistics for the different DNN models deployed for intent classification. The models were trained on an AWS ml.p3.8xlarge instance with 4 NVIDIA Tesla V100 GPUs. Average time is measured in minutes.

Model         | Avg Time | Epochs | Batch Size
LSTM          | 75.54    | 50     | 1000
Bi-LSTM       | 93.72    | 30     | 1000
CNN           | 37.22    | 100    | 500
LSTM + CNN    | 84.21    | 50     | 1000
Bi-LSTM + CNN | 101.45   | 50     | 1000
Table 5: Train Statistics - Intent Classification

Architecture | Dropout | Hidden Units | Embedding | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
LSTM I       | False   | 50           | 200       | 99.74       | 25     | 27     | 99.72        | 14      | 42
LSTM II      | False   | 50           | 100       | 99.64       | 52     | 20     | 99.74        | 26      | 27
LSTM III     | False   | 50           | 200       | 99.77       | 23     | 24     | 99.79        | 17      | 25
LSTM IV      | True    | 50           | 200       | 99.66       | 54     | 15     | 99.78        | 24      | 21
Bi-LSTM I    | False   | 100          | 200       | 99.83       | 19     | 20     | 99.84        | 11      | 22
Bi-LSTM II   | True    | 100          | 200       | 99.72       | 38     | 18     | 99.76        | 26      | 23
Table 6: Intent Classification - LSTM

Architecture | Filters | Kernel | Padding | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
CNN I        | 32      | 16     | same    | 99.83       | 21     | 13     | 99.83        | 15      | 18
CNN II       | 32      | 8      | same    | 99.84       | 19     | 13     | 99.83        | 14      | 18
CNN III      | 64      | 8      | same    | 99.82       | 24     | 11     | 99.84        | 13      | 18
CNN IV       | 64      | 16     | same    | 99.78       | 28     | 15     | 99.89        | 15      | 15
CNN V        | 64      | 4      | same    | 99.81       | 31     | 9      | 99.83        | 20      | 14
Table 7: Intent Classification - CNN

Model               | Dev F-score | Dev FP | Dev FN | Test F-score | Test FP | Test FN
LSTM III            | 99.77       | 23     | 24     | 99.79        | 17      | 25
Bi-LSTM I           | 99.81       | 19     | 20     | 99.84        | 11      | 22
CNN II              | 99.84       | 19     | 13     | 99.83        | 14      | 18
LSTM III + CNN II   | 99.81       | 26     | 14     | 99.90        | 10      | 11
Bi-LSTM I + CNN II  | 99.88       | 14     | 11     | 99.86        | 7       | 22
Ensemble (Top 5)    | 99.91       | 9      | 9      | 99.91        | 4       | 15
Table 8: Intent Classification - Winners & Ensemble
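As a sketch of the stacking procedure described in Section 3.4.6, the snippet below fits a logistic-regression combiner over the posterior probabilities of the five base models, as depicted in figure 4. The posteriors here are synthetic stand-ins; in practice each column would come from a trained DNN's predictions on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 0/1 intent labels for 1,000 queries and the posterior
# P(judge | query) from five base models (CNN, LSTM, Bi-LSTM, LSTM + CNN,
# Bi-LSTM + CNN). In practice, each column is a model's predict() output on
# a held-out set the base models were not trained on.
y = rng.integers(0, 2, size=1000)
posteriors = np.clip(y[:, None] * 0.8 + rng.normal(0.1, 0.15, size=(1000, 5)), 0.0, 1.0)

# The combiner learns the weights sigma_i of the linear combination
# sum_i sigma_i * H_i(x) shown in figure 4.
meta = LogisticRegression()
meta.fit(posteriors, y)
print("learned combination weights:", meta.coef_.ravel())
```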
3.5 Named Entity Recognition

3.5.1 Baseline:

The baseline model for NER is shown in table 12. It was previously established and is the same rule-based system that was set as the baseline for intent classification. Table 12 also shows metrics for a conditional random field (CRF) probabilistic model that was previously implemented. The CRF model was eventually discarded, since it was outperformed by most of the DNN models.

3.5.2 RNN based Named Entity Recognition:

For NER, we started with RNNs. The neural architecture for the RNN is shown in figure 5. For the embedding layer, we used pre-trained word2vec embeddings of dimensions (577,149 x 200). For the RNN layer, 200 hidden units were used. The time-distributed output layer has 3 units and uses a softmax function, since we have 3 classes in total (O, B-PER & I-PER). A dropout value of 0.2 was also used in this configuration.

Figure 5: RNN for NER. Layer stack: Embedding [577149 x 200] → Dropout [0.2] → RNN [200, tanh] → TimeDistributed [3, softmax]

3.5.3 LSTM based Named Entity Recognition:

LSTMs in general circumvent the vanishing gradient problem faced by RNNs. For NER, we used both uni- and bi-directional LSTMs. The architecture is similar to the one shown in figure 5, except that the RNN layer is replaced by an LSTM layer.

3.5.4 CNN based Named Entity Recognition:

For NER, we also experimented with a CNN. The neural architecture is shown in figure 6. The Conv1D layer consists of 128 filters with the kernel size set to 5. In contrast to the CNN used for intent classification, this architecture does not use a max pooling layer.

Figure 6: CNN for NER. Layer stack: Embedding [577149 x 200] → Conv1D [128, 5, relu] → TimeDistributed [3, softmax]

Architecture | Filters | Kernel | Padding | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
CNN I        | 128     | 5      | same    | 99.2707       | 99.2654    | 99.2665     | 99.0583        | 99.0511     | 99.0526
CNN II       | 128     | 5      | same    | 98.6557       | 98.6435    | 98.6468     | 98.9799        | 98.9654     | 98.9681
CNN III      | 64      | 5      | same    | 99.2632       | 99.2549    | 99.2564     | 99.0729        | 99.0605     | 99.0627
CNN IV       | 128     | 10     | same    | 99.3972       | 99.3967    | 99.3969     | 99.2193        | 99.2177     | 99.2181
Table 9: Named Entity Recognition - CNN

3.5.5 GRU based Named Entity Recognition:

More recently, gated recurrent units have been proposed [9] as a simplification of the LSTM that keeps the ability to retain information over long sequences. Unlike the LSTM, the GRU uses only two gates, has no separate memory units, and performs its linear interpolation in the hidden state. As part of our experiments, we replaced the RNN layer in the architecture shown in figure 5 with a GRU layer.

Architecture | Dropout | Hidden Units | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
RNN I        | 0.4     | 100          | 99.1119       | 99.1022    | 99.1042     | 98.8990        | 98.8876     | 98.8900
RNN II       | 0.4     | 100          | 99.1171       | 99.1063    | 99.1084     | 98.9171        | 98.9039     | 98.9065
RNN III      | 0.2     | 100          | 99.1750       | 99.1665    | 99.1682     | 98.9818        | 98.9711     | 98.9733
LSTM         | 0.4     | 100          | 99.1727       | 99.1624    | 99.1643     | 98.9894        | 98.9764     | 98.9788
Bi-LSTM      | 0.4     | 100          | 99.5344       | 99.5331    | 99.5334     | 99.3422        | 99.3397     | 99.3403
GRU          | 0.4     | 100          | 99.1256       | 99.1227    | 99.1235     | 98.8564        | 98.8523     | 98.8534
Bi-GRU       | 0.4     | 100          | 99.4734       | 99.4733    | 99.4734     | 99.2931        | 99.2929     | 99.2930
Table 10: Named Entity Recognition - Recurrent Neural Networks

3.5.6 Hybrid Models:

For NER, we also combined the models discussed above. The hybrid models we built are RNN + CNN, LSTM + CNN and GRU + CNN (both uni- and bi-directional). The architecture of Bi-LSTM + CNN is shown in figure 7.

Figure 7: Bi-LSTM + CNN for NER. Layer stack: Embedding [577149 x 200] → Conv1D [128, 5, relu] → Dropout [0.2] → Bidirectional LSTM [200, tanh] → TimeDistributed [3, softmax]

Model             | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
RNN III + CNN IV  | 99.3812       | 99.3794    | 99.3798     | 99.2027        | 99.1998     | 99.2635
LSTM + CNN IV     | 99.4036       | 99.3995    | 99.4002     | 99.2690        | 99.2624     | 99.2635
Bi-LSTM + CNN IV  | 99.5286       | 99.5262    | 99.5267     | 99.4043        | 99.4007     | 99.4013
GRU + CNN IV      | 99.4365       | 99.4323    | 99.4330     | 99.2528        | 99.2466     | 99.2477
Bi-GRU + CNN IV   | 99.5320       | 99.5294    | 99.5299     | 99.3829        | 99.3786     | 99.3793
Table 11: Named Entity Recognition - Hybrid Models
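A minimal Keras sketch of the Bi-LSTM + CNN tagger in figure 7 follows. Layer sizes mirror the figure; padding="same" keeps one output step per input token so the time-distributed softmax can emit a tag per token (consistent with the padding in table 9); the maximum query length, optimizer and loss are assumptions not reported in the paper.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Bidirectional, Conv1D, Dense, Dropout,
                                     Embedding, LSTM, TimeDistributed)

VOCAB_SIZE = 577_149  # pre-trained word2vec vocabulary (figure 7)
EMBED_DIM = 200
MAX_LEN = 20          # assumed maximum query length (not stated in the paper)
NUM_TAGS = 3          # O, B-PER, I-PER

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),                        # Embedding [577149 x 200]
    Conv1D(128, 5, padding="same", activation="relu"),       # Conv1D [128, 5, relu]
    Dropout(0.2),                                            # Dropout [0.2]
    Bidirectional(LSTM(200, return_sequences=True)),         # Bidirectional LSTM [200]
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),  # one tag per token
])

# Optimizer and loss are assumptions; sparse categorical cross-entropy
# fits integer-encoded O / B-PER / I-PER tags.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```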
3.5.7 Deep Ensemble for Named Entity Recognition:

To create an ensemble for NER, the DNN models were ranked by their F1 score. The top 5 models were then picked and stacked into an ensemble. An ensemble of the top 10 models was also tried and discarded, since it underperformed compared to the ensemble of the top 5. Figure 8 shows the architecture of the chosen ensemble model.

Figure 8: Ensemble Classifier for Named Entity Recognition. The five base models (CNN, Bi-LSTM, LSTM + CNN, Bi-LSTM + CNN, Bi-GRU + CNN) produce hypotheses H1(x) through H5(x) for an input x, which the ensemble combines as Σ σi · Hi(x) to produce the output.

Model                  | Dev Precision | Dev Recall | Dev F-score | Test Precision | Test Recall | Test F-score
Rule Engine (Baseline) | 93.2794       | 92.9001    | 92.7009     | 94.0157        | 93.7872     | 93.6778
CRF                    | 92.1442       | 89.3441    | 91.3463     | 91.3421        | 90.0323     | 90.6341
CNN IV                 | 99.3972       | 99.3967    | 99.3969     | 99.2193        | 99.2177     | 99.2181
LSTM + CNN IV          | 99.4036       | 99.3995    | 99.4002     | 99.2692        | 99.2624     | 99.2635
Bi-LSTM                | 99.5344       | 99.5331    | 99.5334     | 99.3422        | 99.3397     | 99.3403
Bi-LSTM + CNN IV       | 99.5286       | 99.5262    | 99.5267     | 99.4043        | 99.4007     | 99.4013
Bi-GRU + CNN IV        | 99.5323       | 99.5294    | 99.5299     | 99.3829        | 99.3786     | 99.3793
Ensemble (Top 5)       | 99.5596       | 99.5577    | 99.5581     | 99.4193        | 99.4159     | 99.4165
Table 12: Named Entity Recognition - Winners & Ensemble

3.5.8 Training:

Per-model training times are shown in table 13. A batch here corresponds to a chunk of user input queries. The neural architectures were implemented using TensorFlow, Keras and scikit-learn.

Model         | Avg Time | Epochs | Batch Size
RNN           | 85.23    | 50     | 1000
LSTM          | 104.38   | 50     | 1000
Bi-LSTM       | 123.84   | 50     | 1000
GRU           | 94.20    | 50     | 1000
Bi-GRU        | 104.27   | 50     | 1000
CNN           | 72.02    | 100    | 500
RNN + CNN     | 97.93    | 50     | 1000
LSTM + CNN    | 134.81   | 50     | 1000
Bi-LSTM + CNN | 154.43   | 50     | 1000
GRU + CNN     | 115.24   | 50     | 1000
Bi-GRU + CNN  | 132.64   | 50     | 1000
Table 13: Train Statistics - Named Entity Recognition

4. Results

4.1 Evaluation Metrics

We use standard measures to evaluate the performance of our classifiers: precision, recall and F1-measure. Precision (P) is the proportion of actual positive class members among all predicted positive class members returned by our method. Recall (R) is the proportion of predicted positive members among all actual positive class members in the data. F1 is the harmonic mean of precision and recall, defined as F1 = 2PR/(P + R).
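As a quick worked example of these definitions, the helper below computes P, R and F1 from raw true-positive, false-positive and false-negative counts; the counts used here are illustrative, not taken from the tables.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """P, R and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)  # F1 = 2PR / (P + R)

# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=9_850, fp=4, fn=15)
print(f"P = {p:.4f}, R = {r:.4f}, F1 = {f1:.4f}")
```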
4.2 Best Performers

Empirical results for intent classification are shown in tables 6, 7 & 8. Results for NER are shown in tables 9, 10, 11 & 12. Based on the evaluation metrics, we can clearly see that the ensemble models outperform all other DNN models on both tasks, intent classification as well as NER, by a good margin. For intent classification, the ensemble model (top 5) in table 8 has the lowest counts of false positives and false negatives on both the dev and test data sets. It also has the highest F1 score, 99.91%, beating the baseline rule-based system by a margin of ≈1.5%. For NER, the ensemble model (top 5) in table 12 outperforms all other DNN models and beats the baseline model by a margin of ≈6%.

5. Conclusion and Future Work

Our results show that an ensemble model stacking different DNNs of varying architectures outperforms the individual DNNs for the tasks of legal intent classification and entity recognition. RNNs, LSTMs, GRUs and even CNNs all compress the necessary information of a source query into a fixed-length vector. This makes it difficult for the DNNs to cope with long queries, especially those that are longer than the queries in the training corpus. In the future, we plan to use attention within queries. Attention is the idea of freeing a DNN architecture from the fixed-length internal representation. The DNN models we trained are at the word level; in the future, we plan to expand the size of the training data and try DNN models at the character level. Moreover, since the differences in performance between the DNN models were rather small, we plan to run tests of statistical significance and error analysis to capture performance by pattern. Lastly, we also plan to look into the impact of data and covariate shifts on our models.

6. Acknowledgements

This research was supported by LexisNexis, Raleigh Technology Center, USA.

References

[1] http://www.legalexecutiveinstitute.com.
[2] L. Deng and J. C. Platt, "Ensemble Deep Learning for Speech Recognition," in INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 1915-1919, 2014.
[3] X. Zhou, L. Xie, P. Zhang, and Y. Zhang, "An Ensemble of Deep Neural Networks for Object Tracking," in 2014 IEEE International Conference on Image Processing (ICIP), pp. 843-847, 2014.
[4] S. G. Soderland, "Building a Machine Learning based Text Understanding System," 2001.
[5] S. M. Beitzel, E. C. Jensen, O. Frieder, D. D. Lewis, A. Chowdhury, and A. Kolcz, "Improving Automatic Query Classification via Semi-supervised Learning," in Fifth IEEE International Conference on Data Mining (ICDM'05), 2005.
[6] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang, "Q2C@UST: Our Winning Solution to Query Classification in KDDCUP 2005," SIGKDD Explorations, vol. 7, pp. 100-110, 2005.
[7] S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang, "DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 1295-1304, 2016.
[8] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[9] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," CoRR, vol. abs/1412.3555, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, USA, pp. 1106-1114, 2012.
[11] X. Zhang and Y. LeCun, "Text Understanding from Scratch," CoRR, vol. abs/1502.01710, 2015.
[12] J. Hu, G. Wang, F. H. Lochovsky, J. Sun, and Z. Chen, "Understanding User's Query Intent with Wikipedia," in Proceedings of the 18th International Conference on World Wide Web (WWW 2009), Madrid, Spain, pp. 471-480, 2009.
[13] H. B. Hashemi, A. Asiaee, and R. Kraft, "Query Intent Detection using Convolutional Neural Networks," in WSDM QRUMS Workshop, 2016.
[14] J. P. C. Chiu and E. Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs," TACL, vol. 4, pp. 357-370, 2016.
[15] N. Limsopatham and N. Collier, "Bidirectional LSTM for Named Entity Recognition in Twitter Messages," in Proceedings of the 2nd Workshop on Noisy User-generated Text (NUT@COLING 2016), Osaka, Japan, pp. 145-152, 2016.
[16] K. Sugathadasa, B. Ayesha, N. de Silva, A. S. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning," CoRR, vol. abs/1805.10685, 2018.
[17] R. Nanda, K. J. Adebayo, L. D. Caro, G. Boella, and L. Robaldo, "Legal Information Retrieval using Topic Clustering and Neural Networks," in COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (held in conjunction with ICAIL 2017), King's College London, UK, pp. 68-78, 2017.
[18] A. H. N. Tran, "Applying Deep Neural Network to Retrieve Relevant Civil Law Articles," in Proceedings of the Student Research Workshop Associated with RANLP 2017, Varna, pp. 46-48, 2017.
[19] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF Models for Sequence Tagging," CoRR, vol. abs/1508.01991, 2015.
[20] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep Contextualized Word Representations," CoRR, vol. abs/1802.05365, 2018.
Dean, “Distributed Representations of Words and Classification in KDDCUP 2005,” SIGKDD Explorations, Phrases and their Compositionality.,” in NIPS (C. J. C. vol. 7, pp. 100–110, 2005. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 3111–3119, 2013. [7] S. Zhai, K. Chang, R. Zhang, and Z. M. Zhang, “DeepIn- tent: Learning Attentions for Online Advertising with Re- current Neural Networks,” in Proceedings of the 22nd ACM