Data Shift in Legal AI Systems

Venkata Nagaraju Buddarapu, LexisNexis, Raleigh, USA — venkatanagaraju.buddarapu@lexisnexis.com
Arunprasath Shankar, LexisNexis, Raleigh, USA — arunprasath.shankar@lexisnexis.com

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. ©2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.

ABSTRACT

One of the fundamental assumptions behind any machine learning (ML) system is that the training data comes from the same distribution as the real-world data. In many real-world applications, however, including legal research, this assumption is violated. A scenario where training and test samples follow different input distributions is known as covariate shift, and this shift is often responsible for the deterioration in the predictive performance of machine learning systems. The motivation of this research is to study the effect of covariate shift on deep learning systems used in legal research. In this paper, we propose a unified framework to detect covariate shift impacting AI systems and formulate a strategy to adapt to this shift on a periodic basis. To our knowledge, our work is the first to apply data shift detection and adaptation techniques to deep learning systems involving high dimensional word embeddings. Through experiments and evaluations, we demonstrate that our framework can accurately detect data (covariate) shift on legal AI systems involving deep neural architectures.

[Figure 1: CALR Workflow — a web browser feeds an application layer, which coordinates with a service layer backed by an NLP system and a database layer.]

1. INTRODUCTION

CALR stands for Computer Assisted Legal Research, a mode of legal research that uses electronic databases comprising court documents, statutes, secondary materials, etc. Professional lawyers and paralegals rely on CALR applications for a precise understanding of the law and to serve the client's best interest. Search engines are a crucial component of legal research technology today, and their primary goal is to identify and retrieve the information needed to support legal decision making. When a user types the query "most cited cases by judge john d roberts", he strives to understand the cases most often cited by a judge and tries to anticipate the judge's behavior; this plays a crucial role in uplifting his legal research experience. Consequently, understanding a query's intent is essential for providing better search results, thus improving customers' overall satisfaction.

Understanding a query intent requires classifying legal queries and identifying domain-specific legal entities, which is a complex problem [1]. E.g., in the query "what are the opinions by judge john doe in civil cases dealing with dog bites ?", the word "judge" can be treated as a judge search when observed along with the context phrase "opinions by". The phrase "civil cases" can be identified as a practice area when seen alongside a supporting context, and similarly "dog bites" can be treated as keywords. However, since we also observe the interrogative phrase "what are", we can safely assume the topic of this query is about the opinions of a judge. Table 1 presents examples of legal queries with different intents.

Intent          | Example
case search     | marbury v. madison, 1803
judge           | judge john roberts
expert witness  | expert henry lee
definitions     | foreign corrupt practices act of 1977 ?
seminal cases   | seminal cases on murder
burden of proof | burden of proof for hearsay statement
doctrine        | what is assumed duty doctrine
elements        | elements of child abuse
statutes        | statute of limitations for mail fraud

Table 1: Query Intents

Identifying query intent is a classification problem, and the process of recognizing domain-specific entities is known as named entity recognition (NER), which also belongs to the classification family. In general, intent and entity recognition are two primary components of any natural language processing (NLP) system. Over the past decade, the field of NLP has heavily influenced the way legal search works, shifting discovery from purely keyword-based methodologies to more context-oriented NLP techniques.
Figure 1 depicts the workflow of a typical CALR application. The browser is the tool by which users provide input; the application layer coordinates user interactions with a service layer that triggers a search. The NLP system complements the service layer for query understanding by leveraging a database layer, which usually retrieves relevant information in the form of legal documents.

NLP systems are usually built using supervised approaches, a type of learning that uses a function to map a given input to an output; features are inferred from labeled data consisting of training examples. For example, given the query "what are the opinions by judge john doe in civil cases dealing with dog bites ?", the output label for intent classification is "judge". For the task of legal entity recognition, on the other hand, we need to construct two output labels: "john doe" as a judge entity and "civil cases" as a practice-area entity.
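As an illustration of these input-output pairs, the sketch below shows how a single query might be encoded for both tasks; the tag names JUDGE and PA (practice area) are our assumption, not the production label set.

    # Illustrative only: JUDGE and PA (practice area) are assumed tag names.
    query = ("what are the opinions by judge john doe "
             "in civil cases dealing with dog bites ?")

    # QIC: one output label for the whole query.
    intent_example = (query, "judge")

    # LER: one IOB tag per token.
    tokens = query.split()
    iob_tags = (["O"] * 6                  # what are the opinions by judge
                + ["B-JUDGE", "I-JUDGE"]   # john doe
                + ["O"]                    # in
                + ["B-PA", "I-PA"]         # civil cases
                + ["O"] * 5)               # dealing with dog bites ?
    assert len(tokens) == len(iob_tags)
    ler_example = (tokens, iob_tags)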
Our NLP system mainly consists of two models: (i) a model for identifying legal query intent, which we call the Query Intent Classifier (QIC), and (ii) a model for recognizing legal entities, called Legal Entity Recognition (LER). The QIC and LER model development cycles follow a standard machine learning development life cycle, as shown in Figure 2, and usually require application data as a prerequisite; in our case, the data is derived from user logs as mentioned previously. NLP systems usually reside within a learning environment and learn from the data collected within this environment. In general, any legal learning environment is comprised of users (lawyers and paralegals), a continually changing legal corpus, and constantly reformulated legal queries.

[Figure 2: QIC and LER Model Development Lifecycle — query logs and case law documents from the legal environment feed a machine learning development life cycle of data discovery, feature engineering, model development, and model scoring for the ML/DL models of the CALR application.]

Most machine learning methods assume the learning environment to be static, which is not the case with real-world applications such as CALR, email spam filters, stock market prediction systems, etc. Real-world applications, including legal systems, are mostly dynamic in nature and often incur distribution changes to their underlying data. This phenomenon is known as data shift in the machine learning arena, and such shifts usually result in performance degradation of NLP systems deployed as real-world applications.

The evolving nature of the legal environment demands continuous monitoring of, and adaptation to, data shifts in order to alleviate this performance degradation. Data shift has been receiving significant attention amongst the machine learning community in recent years. Dataset shift refers to the problem where training and real-world datasets follow different distributions; Section 2.1 contains formal definitions of these shifts. Since this problem may occur in many real-life scenarios, detecting and adapting to dataset shift is a vital research aspect of machine learning. This research aims to observe, detect, and adapt to covariate shift on deep learning models using high-dimensional word embeddings derived from a corpus of legal queries. We demonstrate the usefulness of adapting to covariate shift through incremental learning on deep neural models as a necessary step to ensure consistent model performance.

2. BACKGROUND KNOWLEDGE

Dataset shift research on machine learning classification algorithms is foreseen to become an even more difficult problem in non-stationary environments. In Section 2.1, we introduce the dataset shift definitions for classification problems from the extensive literature survey in [2]; Section 2.2 discusses the causes of dataset shift in general, and Section 2.3 the various analysis methods for covariate shift.

2.1 Data Shift Types

In this section, we explain the different classification and data shift types. In general, a classification problem is defined by:

• A set of features or covariates X
• A target (class) variable Y
• A joint distribution P(Y, X)

X → Y problems are those where the class label Y is predicted based on the values of the covariates X. Inversely, Y → X problems are those where the class label Y causally determines the values of the covariates X. By analyzing the relationship between X and Y, we can define three different types of data shift:

2.1.1 Covariate Shift: Covariate shift refers to changes in the distribution of the input variable X = {x1, x2, ..., xn}. Here x1, x2, ..., xn are called the covariates, and a distribution change in one or more of these covariates is termed covariate shift.

2.1.2 Prior Probability Shift: A distribution change in the class variable Y is referred to as prior probability shift; it appears only in Y → X problems.

2.1.3 Concept Shift: Concept shift occurs when the relationship between the input variable X and the class variable Y changes. Other shifts exist in theory, but we do not define them here since they emerge only sporadically. In general, dataset shift is a phenomenon that occurs when a new data distribution leads to a change in the distribution of a single feature, a combination of features, or class boundaries.
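Following the taxonomy of [2], and writing Ptr and Ptst for the training and test distributions, these three shifts can be stated compactly:

    Covariate shift (X → Y):          Ptr(Y|X) = Ptst(Y|X)  and  Ptr(X) ≠ Ptst(X)
    Prior probability shift (Y → X):  Ptr(X|Y) = Ptst(X|Y)  and  Ptr(Y) ≠ Ptst(Y)
    Concept shift:                    Ptr(Y|X) ≠ Ptst(Y|X)  with Ptr(X) = Ptst(X),
                                      or Ptr(X|Y) ≠ Ptst(X|Y) with Ptr(Y) = Ptst(Y)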
[Figure 3: Shift Model — the train phase combines the original train set with the monthly real-world set to build a word embedding feeding input, embedding, RNN, and output layers; the test phase classifies a query as belonging to the train set or to the real world.]

2.2 Data Shift Causes

Selection bias and non-stationary environments are the two primary causes of data shift. Selection bias occurs when a training set does not faithfully represent the real-world test set; a lack of randomness in training sample selection, improper samples, and biased sampling rules often contribute to it. Non-stationary environments, on the other hand, must deal with the dynamic nature of the data itself. Not handling dataset shift in a real-world application therefore yields a model overfitted to the training samples, and hence unreliable model predictions.

2.3 Covariate Shift Analysis

In this section, we give an overview of three well known covariate shift analysis methodologies.

2.3.1 Visualization: This methodology is the simplest, visualizing one covariate at a time. It requires humans to spot differences in covariate distributions using histograms.

2.3.2 Statistical Distance: In this type of analysis, methods involving statistical metrics such as the mean, variance, population stability index (PSI), Kullback-Leibler divergence, and the Kolmogorov-Smirnov test are used to detect shifts.

2.3.3 Uncertainty Quantification: This method fits a probabilistic model on the training data, so that every prediction on new data is associated with a confidence interval or uncertainty. Low uncertainty on new real-world data is taken to mean no shift; high uncertainty indicates a shift.

All the methodologies mentioned above share a common drawback: they are not suitable for the analysis of high dimensional features. In our work, we strive to overcome this drawback by defining a shift detection algorithm that captures performance degradation in real-world machine/deep learning systems, especially scoped towards legal data.
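As a minimal sketch of the statistical-distance analysis from Section 2.3.2, the snippet below runs a two-sample Kolmogorov-Smirnov test on each covariate separately; the 0.05 significance level is an assumed choice, not taken from our system.

    # Per-covariate two-sample KS test; alpha = 0.05 is an assumed threshold.
    import numpy as np
    from scipy.stats import ks_2samp

    def covariates_with_shift(train: np.ndarray, test: np.ndarray, alpha: float = 0.05):
        """Return indices of covariates whose train/test distributions differ.

        train, test: arrays of shape (n_samples, n_covariates).
        """
        shifted = []
        for j in range(train.shape[1]):
            statistic, p_value = ks_2samp(train[:, j], test[:, j])
            if p_value < alpha:  # reject "same distribution" for covariate j
                shifted.append(j)
        return shifted

With embedding dimensions in the hundreds, per-covariate tests like this multiply quickly, which is exactly the high-dimensionality drawback noted above and a motivation for the classifier-based detector introduced in Section 4.4.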
3. RELATED WORK

Data (covariate) shift is an area of machine learning that has been gaining popularity in recent years. In this section, we discuss some of the few existing works related to this sub-domain of AI research. Statistics-based identification methods have been widely adopted in several fields, recently yielding good results. For data shift, statistical methods such as the exponentially weighted moving average and Kolmogorov-Smirnov tests were used in [3][4][5] as detection methodologies, especially for time-series data and big-data online streaming applications. In [6], the authors discuss hierarchical hypothesis testing techniques for concept shift detection in streaming applications.

Most real-world deep learning applications need training, and the training phase itself usually faces an internal covariate shift. In [7], Ioffe and Szegedy proposed batch normalization of deep networks to reduce internal covariate shift in image classification. Word embeddings are considered the building blocks of NLP, and choosing the right embedding for a particular NLP task is always a matter of trial and error. In [8], the authors discuss various factors influencing a word embedding's stability, one of which is word frequency. An insufficient vocabulary affects word frequency, and arriving at a sufficiently representative real-world vocabulary is not a one-step process. In this paper, we demonstrate that continuously updating word embeddings to represent real-world data promotes the model's performance.

In [9], the researchers propose a novel minimax approach for regression problems under covariate shift. Non-stationary environments influence and change the machine learning development process; under covariate shift, standard model selection techniques such as cross-validation do not work as expected. Hence, an importance-weighted cross validation strategy was proposed in [10]. However, this method necessitates the presence of covariate shift during the development phase. Sample re-weighting and active learning are well-known methods for adapting to covariate shift. Sample re-weighting re-weights every training point in the learning process based on the probability of it being inside the training set. When adequate samples are available for the training set, active learning is adopted: it selects the test instances that most influence the learning process, hoping to reduce the uncertainty under covariate shift. Some earlier works [11][12] have discussed these approaches.

In this paper, we scope our research to handling covariate shift in word embeddings acquired from legal search queries, and we discuss an incremental learning approach for adapting legal AI systems to covariate shift. To our knowledge, this work is the first of its kind to apply data shift detection to word embeddings in deep learning applications, and the first to apply it to the legal domain.

4. THE PROPOSED FRAMEWORK

4.1 Current System

Deep learning (DL) systems learn representations of data with multiple levels of abstraction and are composed of several processing layers. These methods have dramatically improved the state-of-the-art in NLP, empowered by word embeddings. Learning a high dimensional dense representation for vocabulary terms, known as a word embedding, has recently attracted much attention in NLP and information retrieval tasks. The embedding vectors are typically learned from term proximity in a large corpus and are used to accurately predict adjacent word(s) given a word or context.

For the purpose of this study, we consider two NLP models which we had developed earlier: (i) a model for identifying legal query intent, namely the Query Intent Classifier (QIC), and (ii) a model for recognizing legal entities, termed Legal Entity Recognition (LER). Our DL models follow an architecture similar to the one described in Figure 3. They consist of four layers: input, embedding, recurrent neural network (RNN), and an output layer. The input layer receives a dense representation of the word vocabulary derived from the legal queries contained in user logs; the vocabulary is a diverse mixture of legal query types (intents), e.g., judge queries, case search, legal definitions, and the others shown in Table 1. For the embedding layer, we use pre-trained word embeddings trained via a word2vec [13] model using ∼1M queries derived from user logs. The RNN layer consists of bi-directional Long Short Term Memory (LSTM) units, primarily used for sequence-to-sequence learning. In the output layer, IOB tags [14] are used to generate the labels for LER. For intent classification, since the task is a multi-class problem, we grouped and labeled all of the queries into 4 classes: judge, expert witness, seminal cases, and other.
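A minimal TensorFlow/Keras sketch of this four-layer stack for QIC is shown below; the vocabulary size, embedding dimension, and LSTM width are assumed values, not the production configuration, and embedding_matrix stands in for the pre-trained word2vec vectors.

    # A sketch of the four-layer QIC architecture: embedding -> bi-LSTM -> softmax.
    # VOCAB_SIZE, EMBED_DIM, and the LSTM width of 128 are assumptions.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB_SIZE, EMBED_DIM, NUM_INTENTS = 50_000, 300, 4

    # Stand-in for the pre-trained word2vec matrix (all zeros here).
    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")

    qic = tf.keras.Sequential([
        layers.Embedding(
            VOCAB_SIZE, EMBED_DIM,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False),                             # frozen word2vec vectors
        layers.Bidirectional(layers.LSTM(128)),           # bi-directional LSTM encoder
        layers.Dense(NUM_INTENTS, activation="softmax"),  # judge / expert witness /
    ])                                                    #   seminal cases / other
    qic.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

For LER, the same stack would instead return per-token outputs (return_sequences=True in the LSTM) with a softmax over the IOB tag set at each position.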
4.2 The Problem

Legal data, in general, is both complex and diverse. User queries, and the word vocabularies extracted from these queries, change over time. This in turn changes the underlying word embeddings, which are usually the core components behind the AI system(s). A word's embedding is a vector that represents some aspect of its meaning and is generally trained on large, unlabeled corpora (in our work, legal queries). Any change in the word embeddings results in complications and inconsistencies within the feature weights that make up the embedding matrix, and this makes it harder to maintain a consistent prediction model whose behavior does not change frequently in production.

Our legal queries are both natural and un-natural: the natural queries are mostly synthesized (augmented by us), while the un-natural queries are derived from user logs, and users tend to type queries in the more un-natural format. E.g., the query "justice marshall abortion law 2017" is a very un-natural representation of language carrying multiple intents. More structured queries, such as boolean queries, are also un-natural in their representation.

Word embeddings are almost universally useful across a wide range of tasks, but the method has several limitations. Word embeddings are generally used for shallow language modeling tasks, so there is a limit to what they can capture. Unlike RNNs and other complex architectures, language models like word2vec have trouble capturing the meaning of combinations of words, negation, etc. Instead of training a model to map a single vector to each word, RNNs learn to map a vector to each word based on the entire sentence/surrounding context.

Another key limitation is that word embedding models do not take context into account. For instance, the word "lynch" has different meanings. According to California Penal Code 405a, "lynch" is defined as "the taking by means of a riot of any person from the lawful custody of any police officer." It also refers to killing someone without legal authority, usually by hanging; and "lynch law" refers to the punishment of presumed crimes or offenses, usually by death, without due process of law. This example illustrates that word embeddings built on a non-stationary legal vocabulary are susceptible to data shift. Therefore, it is necessary to develop strategies and techniques to overcome this issue. In the next section, we discuss the performance degradation of our DL models observed over 8 months of experimentation, showing a covariate shift.
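Before turning to that degradation, the drift described above can be made concrete with a small experiment: train word2vec separately on an "old" and a "new" sample of queries and compare a term's nearest neighbours. gensim is our choice here (the paper's embeddings were likewise trained with word2vec [13]); the query files are hypothetical.

    # Sketch: if the usage of a term such as "lynch" drifts between periods,
    # its embedding neighbourhood changes even though the token is unchanged.
    # queries_m0.txt / queries_m6.txt are hypothetical monthly query dumps.
    from gensim.models import Word2Vec  # gensim 4 API

    old_queries = [q.split() for q in open("queries_m0.txt")]
    new_queries = [q.split() for q in open("queries_m6.txt")]

    w2v_old = Word2Vec(old_queries, vector_size=100, min_count=1, seed=1)
    w2v_new = Word2Vec(new_queries, vector_size=100, min_count=1, seed=1)

    print(w2v_old.wv.most_similar("lynch", topn=5))
    print(w2v_new.wv.most_similar("lynch", topn=5))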
4.3 Performance Degradation

To evaluate our DL models, we chose the F1 score as the metric. In Table 2, m0 denotes the initial (0th) month, and Fm0 denotes the corresponding F1 scores of the QIC and LER models for that month. After development, the models were deployed to production in month m0, where QIC's Fm0 = 0.9344 and LER's Fm0 = 0.8773. These F1 scores set the baseline for performance comparisons, and ∆ represents the performance gain or loss relative to these baselines.

      |               QIC (Fm0 = 0.9344)      |               LER (Fm0 = 0.8773)
Month | P      | R      | Fm     | ∆         | P      | R      | Fm     | ∆
m1    | 0.9821 | 0.9687 | 0.9725 | +0.0381   | 0.9613 | 0.8958 | 0.9123 | +0.0350
m2    | 0.8194 | 0.7708 | 0.7726 | −0.1618   | 0.6291 | 0.6102 | 0.6195 | −0.2578
m3    | 0.9531 | 0.8541 | 0.8669 | −0.0675   | 0.9028 | 0.7149 | 0.7979 | −0.0794
m4    | 0.8750 | 0.8041 | 0.8380 | −0.0964   | 0.7821 | 0.6073 | 0.6837 | −0.1936
m5    | 0.9791 | 0.9583 | 0.9636 | +0.0292   | 0.9613 | 0.8958 | 0.9123 | +0.0350
m6    | 0.7777 | 0.6999 | 0.7042 | −0.2302   | 0.6962 | 0.6444 | 0.6599 | −0.2174
m7    | 0.8697 | 0.7916 | 0.7707 | −0.1637   | 0.7851 | 0.7185 | 0.7289 | −0.1484
m8    | 0.8779 | 0.8291 | 0.8396 | −0.0948   | 0.7248 | 0.6999 | 0.7018 | −0.1755

Table 2: Current System — Real World Performance Metrics (Monthly)

The initial performance observed during model development was good, but significant degradation was observed after month m0; the QIC and LER performance scores and the observed shifts (∆ values) for months m1 to m8 are shown in Table 2. Months m1 and m5 saw no significant performance degradation, while the remaining months {m2, m3, m4, m6, m7, m8} did. Month m6 witnessed the maximum QIC degradation (∆ = −0.2302) and month m2 the maximum LER degradation (∆ = −0.2578).

Although we achieved good baseline performance, the degradation after deployment prompted us to research its cause. Our analysis discovered that covariate shift in legal user queries influences the legal word embeddings, causing distribution changes; these include vocabulary differences such as new words or new part-of-speech patterns. There is not enough research on identifying word embedding distribution changes, i.e., covariate shift, in the legal (or any) environment. In this work, we propose a unique algorithm to detect covariate shift in legal queries, explained in the next section.

4.4 Algorithm

We discussed previously how our system suffers from covariate shift, and Table 2 showed the changes in F1 score, clearly indicating performance degradation. Our proposed algorithm detects the covariate shift in legal user queries that impacts the high dimensional word embeddings derived from them. This section first introduces the intuition behind the algorithm, then the notation used to define it, followed by the algorithm itself and its results.

4.4.1 Intuition: The core intuition behind the proposed algorithm is to detect covariate shift by classifying "new" real-world data (legal queries) as similar or different to the "old" training data. The algorithm builds a binary classifier over the combined dataset (proprietary to LexisNexis) of current training and monthly user queries, and then predicts the probability that a user query is a member of the training set. We assign membership labels, train and test, to the training (old) and real-world (new) data to create input-output pairs for the binary shift classifier. If the classifier achieves only limited accuracy, i.e., it cannot reliably separate new user queries from training queries, then the new real-world data and the training data are similar. If the word distributions of the real-world test queries shift significantly, the classifier correctly separates test queries from training queries, proving a distribution change in the covariates, i.e., the word embeddings.
4.4.2 Notation: Table 3 lists the symbols used to define the algorithm. In general, X denotes input variables and Y output variables; together they form the input to the binary shift classifier M∆. Along with the standard F1 score, an additional quality measure, the Matthews correlation coefficient Ψ, is used to assess M∆. The subscripts t and m denote the current training set and the current month respectively, and the ∆ superscript marks all symbols related to the shift model.

Symbol   | Usage
Xt       | Current training set of user queries
Yt       | Output labels for Xt
(X, Y)t  | Input-output pairs for Xt and Yt
Xm       | Current month's user queries
Ym       | Output labels for Xm
(X, Y)m  | Input-output pairs for Xm and Ym
Xtm      | Combined queries of Xt and Xm
Ytm      | Output labels for Xtm
(X, Y)tm | Input-output pairs for Xtm and Ytm
M∆       | Binary classifier for the shift model
F∆       | F1 score of M∆
Ψ        | Matthews correlation coefficient of M∆

Table 3: Algorithm Notations

Algorithm 1: CSD Algorithm
Input: Xt and Xm
Output: YES if covariate shift, NO otherwise
1 Assign target labels 0 to Yt and 1 to Ym
2 Combine inputs Xt and Xm along with their respective output labels Yt and Ym to create dataset (X, Y)tm
3 Perform the classic train-test split on dataset (X, Y)tm to create train set (X, Y)∆train and test set (X, Y)∆test
4 Train a word2vec model using both Xt and Xm
5 Using the word embeddings from step 4 as features, create a shift detection model M∆ trained on (X, Y)∆train and tested on (X, Y)∆test
6 Compute the F1 score F∆ and Ψ using the expected labels from step 1 and the predicted labels from step 5
7 If F∆ > 0.7 and Ψ > 0.2 then return YES, else return NO

4.4.3 Steps: Our proposed covariate shift detection (CSD) algorithm is illustrated in Algorithm 1 and contains 7 steps in total. The assignment of target (output) variables is performed in step 1, where 0 indicates a query from the current training set and 1 a query originating from the real-world test. Step 2 creates the input-output pairs (X, Y)tm from (Xt, Yt) and (Xm, Ym). We then perform a standard train-test split into 80% train queries (X, Y)∆train and 20% test queries (X, Y)∆test. To create word embeddings, a word2vec model is trained on the combined dataset Xtm.

The architecture of our proposed shift model is shown in Figure 3 and resembles the architecture of the previously discussed DL models (QIC and LER): (i) an input layer, (ii) an embedding layer in the form of a pre-trained word2vec embedding, (iii) an RNN layer consisting of LSTM units, and (iv) an output layer with a sigmoid function that outputs a binary membership (0 or 1). Tokenized input queries are passed to the input layer along with their output labels, from which the shift model learns to predict the membership of a given query. The next step in the algorithm is to train the classifier M∆ on (X, Y)∆train and test it on (X, Y)∆test. To determine covariate shift, we calculate the standard F1 test score F∆ and the shift score Ψ on the test set, using the expected labels assigned initially. If F∆ > 0.7 and Ψ > 0.2, the algorithm returns "YES", indicating covariate shift, and "NO" otherwise. We arrived at these thresholds by trial and error; they provide excellent coverage in detecting covariate shift for our application.
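The sketch below condenses CSD end to end; a logistic-regression membership classifier over averaged word2vec features stands in for our embedding-plus-LSTM shift model, while the labeling, split, and thresholds follow Algorithm 1.

    # Condensed CSD sketch. The logistic-regression membership classifier is a
    # simplification of the paper's embedding + LSTM shift model.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, matthews_corrcoef

    def query_vector(w2v, query):
        """Average the word2vec vectors of a query's tokens (zeros if none known)."""
        vecs = [w2v.wv[t] for t in query.split() if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    def csd(train_queries, month_queries):
        # Steps 1-2: label 0 = current training set, 1 = current month; combine.
        queries = train_queries + month_queries
        y = np.array([0] * len(train_queries) + [1] * len(month_queries))
        # Step 4: word2vec trained on the combined corpus Xtm.
        w2v = Word2Vec([q.split() for q in queries], vector_size=100, min_count=1)
        X = np.stack([query_vector(w2v, q) for q in queries])
        # Step 3: classic 80/20 train-test split.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
        # Step 5: fit the shift (membership) model and score held-out queries.
        m_delta = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        y_pred = m_delta.predict(X_te)
        # Steps 6-7: declare shift when both F1 and MCC clear their thresholds.
        f_delta, psi = f1_score(y_te, y_pred), matthews_corrcoef(y_te, y_pred)
        return "YES" if (f_delta > 0.7 and psi > 0.2) else "NO"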
4.4.4 Results: Using the proposed algorithm, we ran detection tests for the 8 months listed in Table 2; Table 4 outlines the results. It comprises the shift score Ψ based on the Matthews correlation coefficient, F∆ denoting the shift model's F1 score on the current combined set Xtm, and the column "Is Covariate Shift?" indicating whether covariate shift was detected.

Month | F∆   | Shift Score (Ψ) | Is Covariate Shift?
m1    | 0.5  | 0.01            | No
m2    | 0.96 | 0.22            | Yes
m3    | 0.53 | 0.17            | No
m4    | 0.87 | 0.32            | Yes
m5    | 0.51 | 0.02            | No
m6    | 0.92 | 0.38            | Yes
m7    | 0.89 | 0.26            | Yes
m8    | 0.98 | 0.54            | Yes

Table 4: Detection Test Results

Table 4 shows that the algorithm detected covariate shift accurately for all months except m3. Although performance decreased in month m3, shift detection failed there: its shift score Ψ = 0.17 did not meet our threshold.

To evaluate the performance of the algorithm itself, we considered recall R (in %) as the measuring metric. False positives merely trigger false alarms and have no effect on the system's performance, whereas false negatives result in performance degradation.

R = (Number of Shifts Detected / Total Number of Shifts) × 100    (1)

Month | Shift (Ψ) | TP | FN
m1    | 0.01      | ✓  | ✗
m2    | 0.22      | ✓  | ✗
m3    | 0.17      | ✗  | ✓
m4    | 0.32      | ✓  | ✗
m5    | 0.02      | ✓  | ✗
m6    | 0.38      | ✓  | ✗
m7    | 0.26      | ✓  | ✗
m8    | 0.54      | ✓  | ✗

Table 5: CSD Algorithm — Evaluation

For our shift detection, we achieved a recall of ∼80%, as shown in Table 5. CSD thus acts as an alarm, and is the first stage of the two-step procedure for learning under covariate shift. In the next section, we explain how we adapt to covariate shift through incremental learning.

4.5 Incremental Learning

This is the second stage of our proposed framework, which facilitates adaptation to covariate shift through incremental learning, a method where the model's knowledge base is updated continuously.

Month | Train Size (Xt)
m0    | 1,026,883
m1    | 1,026,883
m2    | 1,176,640
m3    | 2,081,806
m4    | 2,081,806
m5    | 2,327,126
m6    | 2,364,142
m7    | 2,466,806
m8    | 2,629,006

Table 6: Cumulative Training Size

For incremental learning, Xt is updated at each monthly iteration of CSD. Table 6 shows the cumulative size of the training data after each incremental learning step: the baseline training set holds 1,026,883 queries in the first month m0 and, through incremental learning, accumulates to 2,629,006 queries by the last month m8. We address covariate shift in legal queries by incremental retraining at each detection. To retrain our QIC and LER models, we used an expert system, a Java-based rule engine, with Subject Matter Experts (SMEs) producing the desired output labels Ym.

The performance improvements after retraining are displayed in Table 7. It contains two F1 scores: (i) the original score Fm from Table 2 before retraining, and (ii) the new score LFm after retraining. The original performance degradation ∆ is taken from Table 2, and the result of incremental learning is measured against the baseline, i.e., δ = LFm − Fm0.

      |           QIC (Fm0 = 0.9344)             |           LER (Fm0 = 0.8773)
Month | Fm     | ∆       | LFm    | δ           | Fm     | ∆       | LFm    | δ
m2    | 0.7726 | −0.1618 | 0.9493 | +0.0149     | 0.6195 | −0.2578 | 0.8964 | +0.0191
m3    | 0.8669 | −0.0675 | 0.9312 | −0.0032     | 0.7979 | −0.0794 | 0.8032 | −0.0741
m4    | 0.8380 | −0.0964 | 0.9589 | +0.0245     | 0.6837 | −0.1936 | 0.8790 | +0.0017
m6    | 0.7042 | −0.2302 | 0.9200 | −0.0144     | 0.6599 | −0.2174 | 0.8529 | −0.0244
m7    | 0.7707 | −0.1637 | 0.9221 | −0.0123     | 0.7289 | −0.1484 | 0.8400 | −0.0373
m8    | 0.8396 | −0.0948 | 0.9488 | +0.0144     | 0.7018 | −0.1755 | 0.8928 | +0.0155

Table 7: Results after incremental learning

Apart from overcoming the performance degradation, incremental learning improved the F1 score over the baseline in several months: QIC saw its highest improvement, δ = +0.0144, in m8, and LER its highest, δ = +0.0191, in m2. Our experimental results demonstrate the effectiveness of the proposed covariate shift detection and incremental learning strategy.
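The monthly detect-then-adapt loop can be summarized as below, reusing the csd sketch above; retrain_word2vec, retrain_qic_ler, and label_with_expert_system are hypothetical helpers standing in for the word2vec training, model retraining, and the SME-backed rule engine just described.

    # Monthly detect-then-adapt loop of the framework (helpers are hypothetical).
    def monthly_update(train_queries, train_labels, month_queries):
        if csd(train_queries, month_queries) == "YES":         # detection stage
            month_labels = label_with_expert_system(month_queries)
            train_queries = train_queries + month_queries      # accumulate Xt
            train_labels = train_labels + month_labels         # accumulate Yt
            w2v = retrain_word2vec(train_queries)              # refresh embeddings
            retrain_qic_ler(train_queries, train_labels, w2v)  # adaptation stage
        return train_queries, train_labels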
4.6 Environment

For training our DL and shift models, we used an AWS ml.p3.8xlarge instance with 4 NVIDIA Tesla V100 GPUs. The average training time across all models is ∼94 minutes for 100 epochs, and the training time for word2vec is ∼84 minutes for 10 epochs. We implemented our shift and DL models in TensorFlow [15], with a fixed batch size of 512 for the DL models.

5. PERFORMANCE METRICS

We use standard measures to evaluate the performance of our QIC, LER, and shift model classifiers: precision (P), recall (R), and the F1 measure. Precision is the proportion of predicted positive class members returned by our method that are actual positives, and recall is the proportion of actual positive class members in the data that are predicted positive. F1 = 2PR/(P + R) is the harmonic mean of precision and recall. We also use the Matthews correlation coefficient (MCC), a correlation coefficient between actual and expected predictions that varies between −1 and +1: −1 when actual and expected disagree entirely, +1 when there is a perfect match, and 0 for random predictions. Alongside the shift F1 score, the MCC Ψ serves as the shift score measuring the similarity between the training and test sets.
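For reference, with TP, TN, FP, and FN denoting the binary confusion matrix counts, the MCC used as Ψ has the standard closed form (the original text describes it only in words):

Ψ = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2)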
6. CONCLUSION AND FUTURE WORK

Learning strategies under covariate shift have been receiving significant research interest recently. In non-static environments such as the legal domain, learning methods need to employ unique learning strategies and covariate shift monitoring systems to acquire a greater capability to generalize. The framework proposed in this work belongs to the category of incremental learning under covariate shift for legal AI systems, and its core component is a covariate shift detection algorithm that detects shift in our deep learning models, QIC and LER. The results demonstrate the benefit of building a monitoring system for covariate shift detection, and of adapting to the shift through incremental learning. For future work, we plan to extend our current research by employing better strategies to reduce false negatives, and to apply a similar strategy to other data shifts such as prior probability and concept shift.

REFERENCES

[1] S. Arunprasath and B. Venkata Nagaraju, "Deep ensemble learning for legal query understanding," in Proceedings of the CIKM 2018 Workshop on Legal Data Analytics and Mining (LeDAM 2018), CEUR-WS.org, October 2018. To appear.
[2] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, "A unifying view on dataset shift in classification," Pattern Recognition, vol. 45, pp. 521–530, Jan. 2012.
[3] H. Raza, G. Prasad, and Y. Li, "Dataset shift detection in non-stationary environments using EWMA charts," in Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2013), pp. 3151–3156, Oct. 2013.
[4] D. Zhao, L. Bu, C. Alippi, and Q. Wei, "A Kolmogorov-Smirnov test to detect changes in stationarity in big data," IFAC-PapersOnLine, vol. 50, pp. 14260–14265, 2017.
[5] H. Raza, H. Cecotti, Y. Li, and G. Prasad, "Adaptive learning with covariate shift-detection for motor imagery-based brain–computer interface," Soft Computing, vol. 20, pp. 3085–3096, Aug. 2016.
[6] S. Yu, X. Wang, and J. C. Príncipe, "Request-and-reverify: Hierarchical hypothesis testing for concept drift detection with expensive labels," pp. 3033–3039, July 2018.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[8] L. Wendlandt, J. K. Kummerfeld, and R. Mihalcea, "Factors influencing the surprising instability of word embeddings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2092–2102, Association for Computational Linguistics, 2018.
[9] X. Chen, M. Monfort, A. Liu, and B. D. Ziebart, "Robust covariate shift regression," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (A. Gretton and C. C. Robert, eds.), vol. 51 of Proceedings of Machine Learning Research, Cadiz, Spain, pp. 1270–1279, PMLR, May 2016.
[10] M. Sugiyama, M. Krauledat, and K.-R. Müller, "Covariate shift adaptation by importance weighted cross validation," Journal of Machine Learning Research, vol. 8, pp. 985–1005, Dec. 2007.
[11] A. Liu and K. Asif, "Addressing covariate shift in active learning with adversarial prediction," ICML 2015 Workshop on Active Learning.
[12] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Bunau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, 2008.
[13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13), pp. 3111–3119, Curran Associates Inc., 2013.
[14] L. Ramshaw and M. Marcus, "Text chunking using transformation-based learning," in Third Workshop on Very Large Corpora, 1995.
[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), Berkeley, CA, USA, pp. 265–283, USENIX Association, 2016.