Data Shift in Legal AI Systems

Venkata Nagaraju Buddarapu, LexisNexis, Raleigh, USA — venkatanagaraju.buddarapu@lexisnexis.com
Arunprasath Shankar, LexisNexis, Raleigh, USA — arunprasath.shankar@lexisnexis.com

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. ©2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.

ABSTRACT

One of the fundamental assumptions behind any machine learning (ML) system is that the training data comes from the same distribution as the real-world data. In many real-world applications, however, including legal research, this assumption is violated. A scenario where training and test samples follow different input distributions is known as covariate shift, and this shift is often responsible for the deterioration in the predictive performance of machine learning systems. The motivation of this research is to study the effect of covariate shift on deep learning systems used in legal research. In this paper, we propose a unified framework to detect covariate shift impacting AI systems and formulate a strategy to adapt to this shift on a periodic basis. To our knowledge, our work is the first to apply data shift detection and adaptation techniques to deep learning systems involving high dimensional word embeddings. Through experiments and evaluations, we demonstrate that our framework can accurately detect data (covariate) shift on legal AI systems involving deep neural architectures.

[Figure 1: CALR Workflow — a web browser feeds an application layer, which coordinates with a service layer backed by an NLP system and a database layer.]

1. INTRODUCTION

CALR stands for Computer Assisted Legal Research, a mode of legal research that uses electronic databases comprising court documents, statutes, secondary materials, etc. Professional lawyers and paralegals rely on CALR applications for a precise understanding of the law and to serve the client's best interest. Search engines are a crucial component of legal research technology today, and their primary goal is to identify and retrieve the information needed to support legal decision making. When a user types the query "most cited cases by judge john d roberts", he strives to understand the cases most often cited by a judge and tries to anticipate the judge's behavior; this plays a crucial role in uplifting his legal research experience. Consequently, understanding a query's intent is essential for providing better search results, thus improving customers' overall satisfaction.

Understanding a query intent requires classifying legal queries and identifying domain-specific legal entities, which is a complex problem [1]. E.g., in the query "what are the opinions by judge john doe in civil cases dealing with dog bites ?", the word "judge" can be treated as a judge search when observed along with the context phrase "opinions by". The phrase "civil cases" can be identified as a practice area when seen alongside a supporting context, and similarly "dog bites" can be treated as keywords. However, since we also observe the interrogative phrase "what are", we can safely assume the topic of this query is about the opinions of a judge. Table 1 presents examples of legal queries with different intents.

Intent          | Example
case search     | marbury v. madison, 1803
judge           | judge john roberts
expert witness  | expert henry lee
definitions     | foreign corrupt practices act of 1977 ?
seminal cases   | seminal cases on murder
burden of proof | burden of proof for hearsay statement
doctrine        | what is assumed duty doctrine
elements        | elements of child abuse
statutes        | statute of limitations for mail fraud

Table 1: Query Intents

Identifying query intent is a classification problem, and the process of recognizing domain-specific entities is known as named entity recognition (NER), which also belongs to the classification family. In general, intent and entity recognition are two primary components of any natural language processing (NLP) system. Over the past decade, the field of NLP has heavily influenced the way legal search works, shifting discovery from purely keyword-based methodologies to more context-oriented NLP techniques.
Figure 1 depicts the workflow of a typical CALR application. The browser is the tool by which users provide input; the application layer coordinates user interactions with a service layer that triggers a search. The NLP system complements the service layer for query understanding by leveraging a database layer, which usually retrieves relevant information in the form of legal documents.

NLP systems are usually built using supervised approaches, a type of learning that uses a function to map a given input to an output; features are inferred from labeled data consisting of training examples. For example, given the query "what are the opinions by judge john doe in civil cases dealing with dog bites ?", the output label for intent classification is "judge". For the task of legal entity recognition, on the other hand, we need to construct two output labels: "john doe" as a judge entity and "civil cases" as a practice-area entity.
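As an illustration of these input-output pairs, the sketch below shows how a single query might be encoded for both tasks; the tag names JUDGE and PA (practice area) are our assumption, not the production label set.

    # Illustrative only: JUDGE and PA (practice area) are assumed tag names.
    query = ("what are the opinions by judge john doe "
             "in civil cases dealing with dog bites ?")

    # QIC: one output label for the whole query.
    intent_example = (query, "judge")

    # LER: one IOB tag per token.
    tokens = query.split()
    iob_tags = (["O"] * 6                  # what are the opinions by judge
                + ["B-JUDGE", "I-JUDGE"]   # john doe
                + ["O"]                    # in
                + ["B-PA", "I-PA"]         # civil cases
                + ["O"] * 5)               # dealing with dog bites ?
    assert len(tokens) == len(iob_tags)
    ler_example = (tokens, iob_tags)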
Our NLP system mainly consists of two models: (i) a model for identifying legal query intent, which we call the Query Intent Classifier (QIC), and (ii) a model for recognizing legal entities, called Legal Entity Recognition (LER). The QIC and LER model development cycles follow a standard machine learning development life cycle, as shown in Figure 2, and usually require application data as a prerequisite; in our case, the data is derived from user logs as mentioned previously. NLP systems usually reside within a learning environment and learn from the data collected within this environment. In general, any legal learning environment is comprised of users (lawyers and paralegals), a continually changing legal corpus, and constantly reformulated legal queries.

[Figure 2: QIC and LER Model Development Lifecycle — query logs and case law documents from the legal environment feed a machine learning development life cycle of data discovery, feature engineering, model development, and model scoring for the ML/DL models of the CALR application.]

Most machine learning methods assume the learning environment to be static, which is not the case with real-world applications such as CALR, email spam filters, stock market prediction systems, etc. Real-world applications, including legal systems, are mostly dynamic in nature and often incur distribution changes to their underlying data. This phenomenon is known as data shift in the machine learning arena, and such shifts usually result in performance degradation of NLP systems deployed as real-world applications.

The evolving nature of the legal environment demands continuous monitoring of, and adaptation to, data shifts in order to alleviate this performance degradation. Data shift has been receiving significant attention amongst the machine learning community in recent years. Dataset shift refers to the problem where training and real-world datasets follow different distributions; Section 2.1 contains formal definitions of these shifts. Since this problem may occur in many real-life scenarios, detecting and adapting to dataset shift is a vital research aspect of machine learning. This research aims to observe, detect, and adapt to covariate shift on deep learning models using high-dimensional word embeddings derived from a corpus of legal queries. We demonstrate the usefulness of adapting to covariate shift through incremental learning on deep neural models as a necessary step to ensure consistent model performance.

2. BACKGROUND KNOWLEDGE

Dataset shift research on machine learning classification algorithms is foreseen to become an even more difficult problem in non-stationary environments. In Section 2.1, we introduce the dataset shift definitions for classification problems from the extensive literature survey in [2]; Section 2.2 discusses the causes of dataset shift in general, and Section 2.3 the various analysis methods for covariate shift.

2.1 Data Shift Types

In this section, we explain the different classification and data shift types. In general, a classification problem is defined by:

• A set of features or covariates X
• A target (class) variable Y
• A joint distribution P(Y, X)

X → Y problems are those where the class label Y is predicted based on the values of the covariates X. Inversely, Y → X problems are those where the class label Y causally determines the values of the covariates X. By analyzing the relationship between X and Y, we can define three different types of data shift:

2.1.1 Covariate Shift: Covariate shift refers to changes in the distribution of the input variable X = {x1, x2, ..., xn}. Here x1, x2, ..., xn are called the covariates, and a distribution change in one or more of these covariates is termed covariate shift.

2.1.2 Prior Probability Shift: A distribution change in the class variable Y is referred to as prior probability shift; it appears only in Y → X problems.

2.1.3 Concept Shift: Concept shift occurs when the relationship between the input variable X and the class variable Y changes. Other shifts exist in theory, but we do not define them here since they emerge only sporadically. In general, dataset shift is a phenomenon that occurs when a new data distribution leads to a change in the distribution of a single feature, a combination of features, or class boundaries.
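Following the taxonomy of [2], and writing Ptr and Ptst for the training and test distributions, these three shifts can be stated compactly:

    Covariate shift (X → Y):          Ptr(Y|X) = Ptst(Y|X)  and  Ptr(X) ≠ Ptst(X)
    Prior probability shift (Y → X):  Ptr(X|Y) = Ptst(X|Y)  and  Ptr(Y) ≠ Ptst(Y)
    Concept shift:                    Ptr(Y|X) ≠ Ptst(Y|X)  with Ptr(X) = Ptst(X),
                                      or Ptr(X|Y) ≠ Ptst(X|Y) with Ptr(Y) = Ptst(Y)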
[Figure 3: Shift Model — the train phase combines the original train set with the monthly real-world set to build a word embedding feeding input, embedding, RNN, and output layers; the test phase classifies a query as belonging to the train set or to the real world.]

2.2 Data Shift Causes

Selection bias and non-stationary environments are the two primary causes of data shift. Selection bias occurs when a training set does not faithfully represent the real-world test set; a lack of randomness in training sample selection, improper samples, and biased sampling rules often contribute to it. Non-stationary environments, on the other hand, must deal with the dynamic nature of the data itself. Not handling dataset shift in a real-world application therefore yields a model overfitted to the training samples, and hence unreliable model predictions.

2.3 Covariate Shift Analysis

In this section, we give an overview of three well known covariate shift analysis methodologies.

2.3.1 Visualization: This methodology is the simplest, visualizing one covariate at a time. It requires humans to spot differences in covariate distributions using histograms.

2.3.2 Statistical Distance: In this type of analysis, methods involving statistical metrics such as the mean, variance, population stability index (PSI), Kullback-Leibler divergence, and the Kolmogorov-Smirnov test are used to detect shifts.

2.3.3 Uncertainty Quantification: This method fits a probabilistic model on the training data, so that every prediction on new data is associated with a confidence interval or uncertainty. Low uncertainty on new real-world data is taken to mean no shift; high uncertainty indicates a shift.

All the methodologies mentioned above share a common drawback: they are not suitable for the analysis of high dimensional features. In our work, we strive to overcome this drawback by defining a shift detection algorithm that captures performance degradation in real-world machine/deep learning systems, especially scoped towards legal data.
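As a minimal sketch of the statistical-distance analysis from Section 2.3.2, the snippet below runs a two-sample Kolmogorov-Smirnov test on each covariate separately; the 0.05 significance level is an assumed choice, not taken from our system.

    # Per-covariate two-sample KS test; alpha = 0.05 is an assumed threshold.
    import numpy as np
    from scipy.stats import ks_2samp

    def covariates_with_shift(train: np.ndarray, test: np.ndarray, alpha: float = 0.05):
        """Return indices of covariates whose train/test distributions differ.

        train, test: arrays of shape (n_samples, n_covariates).
        """
        shifted = []
        for j in range(train.shape[1]):
            statistic, p_value = ks_2samp(train[:, j], test[:, j])
            if p_value < alpha:  # reject "same distribution" for covariate j
                shifted.append(j)
        return shifted

With embedding dimensions in the hundreds, per-covariate tests like this multiply quickly, which is exactly the high-dimensionality drawback noted above and a motivation for the classifier-based detector introduced in Section 4.4.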
3. RELATED WORK

Data (covariate) shift is an area of machine learning that has been gaining popularity in recent years. In this section, we discuss some of the few existing works related to this sub-domain of AI research. Statistics-based identification methods have been widely adopted in several fields, recently yielding good results. For data shift, statistical methods such as the exponentially weighted moving average and Kolmogorov-Smirnov tests were used in [3][4][5] as detection methodologies, especially for time-series data and big-data online streaming applications. In [6], the authors discuss hierarchical hypothesis testing techniques for concept shift detection in streaming applications.

Most real-world deep learning applications need training, and the training phase itself usually faces an internal covariate shift. In [7], Ioffe and Szegedy proposed batch normalization of deep networks to reduce internal covariate shift in image classification. Word embeddings are considered the building blocks of NLP, and choosing the right embedding for a particular NLP task is always a matter of trial and error. In [8], the authors discuss various factors influencing a word embedding's stability, one of which is word frequency. An insufficient vocabulary affects word frequency, and arriving at a sufficiently representative real-world vocabulary is not a one-step process. In this paper, we demonstrate that continuously updating word embeddings to represent real-world data promotes the model's performance.

In [9], the researchers propose a novel minimax approach for regression problems under covariate shift. Non-stationary environments influence and change the machine learning development process; under covariate shift, standard model selection techniques such as cross-validation do not work as expected. Hence, an importance-weighted cross validation strategy was proposed in [10]. However, this method necessitates the presence of covariate shift during the development phase. Sample re-weighting and active learning are well-known methods for adapting to covariate shift. Sample re-weighting re-weights every training point in the learning process based on the probability of it being inside the training set. When adequate samples are available for the training set, active learning is adopted: it selects the test instances that most influence the learning process, hoping to reduce the uncertainty under covariate shift. Some earlier works [11][12] have discussed these approaches.

In this paper, we scope our research to handling covariate shift in word embeddings acquired from legal search queries, and we discuss an incremental learning approach for adapting legal AI systems to covariate shift. To our knowledge, this work is the first of its kind to apply data shift detection to word embeddings in deep learning applications, and the first to apply it to the legal domain.

4. THE PROPOSED FRAMEWORK

4.1 Current System

Deep learning (DL) systems learn representations of data with multiple levels of abstraction and are composed of several processing layers. These methods have dramatically improved the state-of-the-art in NLP, empowered by word embeddings. Learning a high dimensional dense representation for vocabulary terms, known as a word embedding, has recently attracted much attention in NLP and information retrieval tasks. The embedding vectors are typically learned from term proximity in a large corpus and are used to accurately predict adjacent word(s) given a word or context.

For the purpose of this study, we consider two NLP models which we had developed earlier: (i) a model for identifying legal query intent, namely the Query Intent Classifier (QIC), and (ii) a model for recognizing legal entities, termed Legal Entity Recognition (LER). Our DL models follow an architecture similar to the one described in Figure 3. They consist of four layers: input, embedding, recurrent neural network (RNN), and an output layer. The input layer receives a dense representation of the word vocabulary derived from the legal queries contained in user logs; the vocabulary is a diverse mixture of legal query types (intents), e.g., judge queries, case search, legal definitions, and the others shown in Table 1. For the embedding layer, we use pre-trained word embeddings trained via a word2vec [13] model using ∼1M queries derived from user logs. The RNN layer consists of bi-directional Long Short Term Memory (LSTM) units, primarily used for sequence-to-sequence learning. In the output layer, IOB tags [14] are used to generate the labels for LER. For intent classification, since the task is a multi-class problem, we grouped and labeled all of the queries into 4 classes: judge, expert witness, seminal cases, and other.
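A minimal TensorFlow/Keras sketch of this four-layer stack for QIC is shown below; the vocabulary size, embedding dimension, and LSTM width are assumed values, not the production configuration, and embedding_matrix stands in for the pre-trained word2vec vectors.

    # A sketch of the four-layer QIC architecture: embedding -> bi-LSTM -> softmax.
    # VOCAB_SIZE, EMBED_DIM, and the LSTM width of 128 are assumptions.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB_SIZE, EMBED_DIM, NUM_INTENTS = 50_000, 300, 4

    # Stand-in for the pre-trained word2vec matrix (all zeros here).
    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")

    qic = tf.keras.Sequential([
        layers.Embedding(
            VOCAB_SIZE, EMBED_DIM,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=False),                             # frozen word2vec vectors
        layers.Bidirectional(layers.LSTM(128)),           # bi-directional LSTM encoder
        layers.Dense(NUM_INTENTS, activation="softmax"),  # judge / expert witness /
    ])                                                    #   seminal cases / other
    qic.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

For LER, the same stack would instead return per-token outputs (return_sequences=True in the LSTM) with a softmax over the IOB tag set at each position.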
4.2 The Problem

Legal data, in general, is both complex and diverse. User queries, and the word vocabularies extracted from these queries, change over time. This in turn changes the underlying word embeddings, which are usually the core components behind the AI system(s). A word's embedding is a vector that represents some aspect of its meaning and is generally trained on large, unlabeled corpora (in our work, legal queries). Any change in the word embeddings results in complications and inconsistencies within the feature weights that make up the embedding matrix, and this makes it harder to maintain a consistent prediction model whose behavior does not change frequently in production.

Our legal queries are both natural and un-natural: the natural queries are mostly synthesized (augmented by us), while the un-natural queries are derived from user logs, and users tend to type queries in the more un-natural format. E.g., the query "justice marshall abortion law 2017" is a very un-natural representation of language carrying multiple intents. More structured queries, such as boolean queries, are also un-natural in their representation.

Word embeddings are almost universally useful across a wide range of tasks, but the method has several limitations. Word embeddings are generally used for shallow language modeling tasks, so there is a limit to what they can capture. Unlike RNNs and other complex architectures, language models like word2vec have trouble capturing the meaning of combinations of words, negation, etc. Instead of training a model to map a single vector to each word, RNNs learn to map a vector to each word based on the entire sentence/surrounding context.

Another key limitation is that word embedding models do not take context into account. For instance, the word "lynch" has different meanings. According to California Penal Code 405a, "lynch" is defined as "the taking by means of a riot of any person from the lawful custody of any police officer." It also refers to killing someone without legal authority, usually by hanging; and "lynch law" refers to the punishment of presumed crimes or offenses, usually by death, without due process of law. This example illustrates that word embeddings built on a non-stationary legal vocabulary are susceptible to data shift. Therefore, it is necessary to develop strategies and techniques to overcome this issue. In the next section, we discuss the performance degradation of our DL models observed over 8 months of experimentation, showing a covariate shift.
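Before turning to that degradation, the drift described above can be made concrete with a small experiment: train word2vec separately on an "old" and a "new" sample of queries and compare a term's nearest neighbours. gensim is our choice here (the paper's embeddings were likewise trained with word2vec [13]); the query files are hypothetical.

    # Sketch: if the usage of a term such as "lynch" drifts between periods,
    # its embedding neighbourhood changes even though the token is unchanged.
    # queries_m0.txt / queries_m6.txt are hypothetical monthly query dumps.
    from gensim.models import Word2Vec  # gensim 4 API

    old_queries = [q.split() for q in open("queries_m0.txt")]
    new_queries = [q.split() for q in open("queries_m6.txt")]

    w2v_old = Word2Vec(old_queries, vector_size=100, min_count=1, seed=1)
    w2v_new = Word2Vec(new_queries, vector_size=100, min_count=1, seed=1)

    print(w2v_old.wv.most_similar("lynch", topn=5))
    print(w2v_new.wv.most_similar("lynch", topn=5))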
4.3 Performance Degradation

To evaluate our DL models, we chose the F1 score as the metric. In Table 2, m0 denotes the initial (0th) month, and Fm0 denotes the corresponding F1 scores of the QIC and LER models for that month. After development, the models were deployed to production in month m0, where QIC's Fm0 = 0.9344 and LER's Fm0 = 0.8773. These F1 scores set the baseline for performance comparisons, and ∆ represents the performance gain or loss relative to these baselines.

      |               QIC (Fm0 = 0.9344)      |               LER (Fm0 = 0.8773)
Month | P      | R      | Fm     | ∆         | P      | R      | Fm     | ∆
m1    | 0.9821 | 0.9687 | 0.9725 | +0.0381   | 0.9613 | 0.8958 | 0.9123 | +0.0350
m2    | 0.8194 | 0.7708 | 0.7726 | −0.1618   | 0.6291 | 0.6102 | 0.6195 | −0.2578
m3    | 0.9531 | 0.8541 | 0.8669 | −0.0675   | 0.9028 | 0.7149 | 0.7979 | −0.0794
m4    | 0.8750 | 0.8041 | 0.8380 | −0.0964   | 0.7821 | 0.6073 | 0.6837 | −0.1936
m5    | 0.9791 | 0.9583 | 0.9636 | +0.0292   | 0.9613 | 0.8958 | 0.9123 | +0.0350
m6    | 0.7777 | 0.6999 | 0.7042 | −0.2302   | 0.6962 | 0.6444 | 0.6599 | −0.2174
m7    | 0.8697 | 0.7916 | 0.7707 | −0.1637   | 0.7851 | 0.7185 | 0.7289 | −0.1484
m8    | 0.8779 | 0.8291 | 0.8396 | −0.0948   | 0.7248 | 0.6999 | 0.7018 | −0.1755

Table 2: Current System — Real World Performance Metrics (Monthly)

The initial performance observed during model development was good, but significant degradation was observed after month m0; the QIC and LER performance scores and the observed shifts (∆ values) for months m1 to m8 are shown in Table 2. Months m1 and m5 saw no significant performance degradation, while the remaining months {m2, m3, m4, m6, m7, m8} did. Month m6 witnessed the maximum QIC degradation (∆ = −0.2302) and month m2 the maximum LER degradation (∆ = −0.2578).

Although we achieved good baseline performance, the degradation after deployment prompted us to research its cause. Our analysis discovered that covariate shift in legal user queries influences the legal word embeddings, causing distribution changes; these include vocabulary differences such as new words or new part-of-speech patterns. There is not enough research on identifying word embedding distribution changes, i.e., covariate shift, in the legal (or any) environment. In this work, we propose a unique algorithm to detect covariate shift in legal queries, explained in the next section.

4.4 Algorithm

We discussed previously how our system suffers from covariate shift, and Table 2 showed the changes in F1 score, clearly indicating performance degradation. Our proposed algorithm detects the covariate shift in legal user queries that impacts the high dimensional word embeddings derived from them. This section first introduces the intuition behind the algorithm, then the notation used to define it, followed by the algorithm itself and its results.

4.4.1 Intuition: The core intuition behind the proposed algorithm is to detect covariate shift by classifying "new" real-world data (legal queries) as similar or different to the "old" training data. The algorithm builds a binary classifier over the combined dataset (proprietary to LexisNexis) of current training and monthly user queries, and then predicts the probability that a user query is a member of the training set. We assign membership labels, train and test, to the training (old) and real-world (new) data to create input-output pairs for the binary shift classifier. If the classifier achieves only limited accuracy, i.e., it cannot reliably separate new user queries from training queries, then the new real-world data and the training data are similar. If the word distributions of the real-world test queries shift significantly, the classifier correctly separates test queries from training queries, proving a distribution change in the covariates, i.e., the word embeddings.
4.4.2 Notation: Table 3 lists the symbols used to define the algorithm. In general, X denotes input variables and Y output variables; together they form the input to the binary shift classifier M∆. Along with the standard F1 score, an additional quality measure, the Matthews correlation coefficient Ψ, is used to assess M∆. The subscripts t and m denote the current training set and the current month respectively, and the ∆ superscript marks all symbols related to the shift model.

Symbol   | Usage
Xt       | Current training set of user queries
Yt       | Output labels for Xt
(X, Y)t  | Input-output pairs for Xt and Yt
Xm       | Current month's user queries
Ym       | Output labels for Xm
(X, Y)m  | Input-output pairs for Xm and Ym
Xtm      | Combined queries of Xt and Xm
Ytm      | Output labels for Xtm
(X, Y)tm | Input-output pairs for Xtm and Ytm
M∆       | Binary classifier for the shift model
F∆       | F1 score of M∆
Ψ        | Matthews correlation coefficient of M∆

Table 3: Algorithm Notations

Algorithm 1: CSD Algorithm
Input: Xt and Xm
Output: YES if covariate shift, NO otherwise
1 Assign target labels 0 to Yt and 1 to Ym
2 Combine inputs Xt and Xm along with their respective output labels Yt and Ym to create dataset (X, Y)tm
3 Perform the classic train-test split on dataset (X, Y)tm to create train set (X, Y)∆train and test set (X, Y)∆test
4 Train a word2vec model using both Xt and Xm
5 Using the word embeddings from step 4 as features, create a shift detection model M∆ trained on (X, Y)∆train and tested on (X, Y)∆test
6 Compute the F1 score F∆ and Ψ using the expected labels from step 1 and the predicted labels from step 5
7 If F∆ > 0.7 and Ψ > 0.2 then return YES, else return NO

4.4.3 Steps: Our proposed covariate shift detection (CSD) algorithm is illustrated in Algorithm 1 and contains 7 steps in total. The assignment of target (output) variables is performed in step 1, where 0 indicates a query from the current training set and 1 a query originating from the real-world test. Step 2 creates the input-output pairs (X, Y)tm from (Xt, Yt) and (Xm, Ym). We then perform a standard train-test split into 80% train queries (X, Y)∆train and 20% test queries (X, Y)∆test. To create word embeddings, a word2vec model is trained on the combined dataset Xtm.

The architecture of our proposed shift model is shown in Figure 3 and resembles the architecture of the previously discussed DL models (QIC and LER): (i) an input layer, (ii) an embedding layer in the form of a pre-trained word2vec embedding, (iii) an RNN layer consisting of LSTM units, and (iv) an output layer with a sigmoid function that outputs a binary membership (0 or 1). Tokenized input queries are passed to the input layer along with their output labels, from which the shift model learns to predict the membership of a given query. The next step in the algorithm is to train the classifier M∆ on (X, Y)∆train and test it on (X, Y)∆test. To determine covariate shift, we calculate the standard F1 test score F∆ and the shift score Ψ on the test set, using the expected labels assigned initially. If F∆ > 0.7 and Ψ > 0.2, the algorithm returns "YES", indicating covariate shift, and "NO" otherwise. We arrived at these thresholds by trial and error; they provide excellent coverage in detecting covariate shift for our application.
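The sketch below condenses CSD end to end; a logistic-regression membership classifier over averaged word2vec features stands in for our embedding-plus-LSTM shift model, while the labeling, split, and thresholds follow Algorithm 1.

    # Condensed CSD sketch. The logistic-regression membership classifier is a
    # simplification of the paper's embedding + LSTM shift model.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, matthews_corrcoef

    def query_vector(w2v, query):
        """Average the word2vec vectors of a query's tokens (zeros if none known)."""
        vecs = [w2v.wv[t] for t in query.split() if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    def csd(train_queries, month_queries):
        # Steps 1-2: label 0 = current training set, 1 = current month; combine.
        queries = train_queries + month_queries
        y = np.array([0] * len(train_queries) + [1] * len(month_queries))
        # Step 4: word2vec trained on the combined corpus Xtm.
        w2v = Word2Vec([q.split() for q in queries], vector_size=100, min_count=1)
        X = np.stack([query_vector(w2v, q) for q in queries])
        # Step 3: classic 80/20 train-test split.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
        # Step 5: fit the shift (membership) model and score held-out queries.
        m_delta = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        y_pred = m_delta.predict(X_te)
        # Steps 6-7: declare shift when both F1 and MCC clear their thresholds.
        f_delta, psi = f1_score(y_te, y_pred), matthews_corrcoef(y_te, y_pred)
        return "YES" if (f_delta > 0.7 and psi > 0.2) else "NO"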
4.4.4 Results: Using the proposed algorithm, we ran detection tests for the 8 months listed in Table 2; Table 4 outlines the results. It comprises the shift score Ψ based on the Matthews correlation coefficient, F∆ denoting the shift model's F1 score on the current combined set Xtm, and the column "Is Covariate Shift?" indicating whether covariate shift was detected.

Month | F∆   | Shift Score (Ψ) | Is Covariate Shift?
m1    | 0.5  | 0.01            | No
m2    | 0.96 | 0.22            | Yes
m3    | 0.53 | 0.17            | No
m4    | 0.87 | 0.32            | Yes
m5    | 0.51 | 0.02            | No
m6    | 0.92 | 0.38            | Yes
m7    | 0.89 | 0.26            | Yes
m8    | 0.98 | 0.54            | Yes

Table 4: Detection Test Results

Table 4 shows that the algorithm detected covariate shift accurately for all months except m3. Although performance decreased in month m3, shift detection failed there: its shift score Ψ = 0.17 did not meet our threshold.

To evaluate the performance of the algorithm itself, we considered recall R (in %) as the measuring metric. False positives merely trigger false alarms and have no effect on the system's performance, whereas false negatives result in performance degradation.

R = (Number of Shifts Detected / Total Number of Shifts) × 100    (1)

Month | Shift (Ψ) | TP | FN
m1    | 0.01      | ✓  | ✗
m2    | 0.22      | ✓  | ✗
m3    | 0.17      | ✗  | ✓
m4    | 0.32      | ✓  | ✗
m5    | 0.02      | ✓  | ✗
m6    | 0.38      | ✓  | ✗
m7    | 0.26      | ✓  | ✗
m8    | 0.54      | ✓  | ✗

Table 5: CSD Algorithm — Evaluation

For our shift detection, we achieved a recall of ∼80%, as shown in Table 5. CSD thus acts as an alarm, and is the first stage of the two-step procedure for learning under covariate shift. In the next section, we explain how we adapt to covariate shift through incremental learning.

4.5 Incremental Learning

This is the second stage of our proposed framework, which facilitates adaptation to covariate shift through incremental learning, a method where the model's knowledge base is updated continuously.

Month | Train Size (Xt)
m0    | 1,026,883
m1    | 1,026,883
m2    | 1,176,640
m3    | 2,081,806
m4    | 2,081,806
m5    | 2,327,126
m6    | 2,364,142
m7    | 2,466,806
m8    | 2,629,006

Table 6: Cumulative Training Size

For incremental learning, Xt is updated at each monthly iteration of CSD. Table 6 shows the cumulative size of the training data after each incremental learning step: the baseline training set holds 1,026,883 queries in the first month m0 and, through incremental learning, accumulates to 2,629,006 queries by the last month m8. We address covariate shift in legal queries by incremental retraining at each detection. To retrain our QIC and LER models, we used an expert system, a Java-based rule engine, with Subject Matter Experts (SMEs) producing the desired output labels Ym.

The performance improvements after retraining are displayed in Table 7. It contains two F1 scores: (i) the original score Fm from Table 2 before retraining, and (ii) the new score LFm after retraining. The original performance degradation ∆ is taken from Table 2, and the result of incremental learning is measured against the baseline, i.e., δ = LFm − Fm0.

      |           QIC (Fm0 = 0.9344)             |           LER (Fm0 = 0.8773)
Month | Fm     | ∆       | LFm    | δ           | Fm     | ∆       | LFm    | δ
m2    | 0.7726 | −0.1618 | 0.9493 | +0.0149     | 0.6195 | −0.2578 | 0.8964 | +0.0191
m3    | 0.8669 | −0.0675 | 0.9312 | −0.0032     | 0.7979 | −0.0794 | 0.8032 | −0.0741
m4    | 0.8380 | −0.0964 | 0.9589 | +0.0245     | 0.6837 | −0.1936 | 0.8790 | +0.0017
m6    | 0.7042 | −0.2302 | 0.9200 | −0.0144     | 0.6599 | −0.2174 | 0.8529 | −0.0244
m7    | 0.7707 | −0.1637 | 0.9221 | −0.0123     | 0.7289 | −0.1484 | 0.8400 | −0.0373
m8    | 0.8396 | −0.0948 | 0.9488 | +0.0144     | 0.7018 | −0.1755 | 0.8928 | +0.0155

Table 7: Results after incremental learning

Apart from overcoming the performance degradation, incremental learning improved the F1 score over the baseline in several months: QIC saw its highest improvement, δ = +0.0144, in m8, and LER its highest, δ = +0.0191, in m2. Our experimental results demonstrate the effectiveness of the proposed covariate shift detection and incremental learning strategy.
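The monthly detect-then-adapt loop can be summarized as below, reusing the csd sketch above; retrain_word2vec, retrain_qic_ler, and label_with_expert_system are hypothetical helpers standing in for the word2vec training, model retraining, and the SME-backed rule engine just described.

    # Monthly detect-then-adapt loop of the framework (helpers are hypothetical).
    def monthly_update(train_queries, train_labels, month_queries):
        if csd(train_queries, month_queries) == "YES":         # detection stage
            month_labels = label_with_expert_system(month_queries)
            train_queries = train_queries + month_queries      # accumulate Xt
            train_labels = train_labels + month_labels         # accumulate Yt
            w2v = retrain_word2vec(train_queries)              # refresh embeddings
            retrain_qic_ler(train_queries, train_labels, w2v)  # adaptation stage
        return train_queries, train_labels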
4.6 Environment

For training our DL and shift models, we used an AWS ml.p3.8xlarge instance with 4 NVIDIA Tesla V100 GPUs. The average training time across all models is ∼94 minutes for 100 epochs, and the training time for word2vec is ∼84 minutes for 10 epochs. We implemented our shift and DL models in TensorFlow [15], with a fixed batch size of 512 for the DL models.

5. PERFORMANCE METRICS

We use standard measures to evaluate the performance of our QIC, LER, and shift model classifiers: precision (P), recall (R), and the F1 measure. Precision is the proportion of predicted positive class members returned by our method that are actual positives, and recall is the proportion of actual positive class members in the data that are predicted positive. F1 = 2PR/(P + R) is the harmonic mean of precision and recall. We also use the Matthews correlation coefficient (MCC), a correlation coefficient between actual and expected predictions that varies between −1 and +1: −1 when actual and expected disagree entirely, +1 when there is a perfect match, and 0 for random predictions. Alongside the shift F1 score, the MCC Ψ serves as the shift score measuring the similarity between the training and test sets.
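For reference, with TP, TN, FP, and FN denoting the binary confusion matrix counts, the MCC used as Ψ has the standard closed form (the original text describes it only in words):

Ψ = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (2)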
6. CONCLUSION AND FUTURE WORK

Learning strategies under covariate shift have been receiving significant research interest recently. In non-static environments such as the legal domain, learning methods need to employ unique learning strategies and covariate shift monitoring systems to acquire a greater capability to generalize. The framework proposed in this work belongs to the category of incremental learning under covariate shift for legal AI systems, and its core component is a covariate shift detection algorithm that detects shift in our deep learning models, QIC and LER. The results demonstrate the benefit of building a monitoring system for covariate shift detection, and of adapting to the shift through incremental learning. For future work, we plan to extend our current research by employing better strategies to reduce false negatives, and to apply a similar strategy to other data shifts such as prior probability and concept shift.

REFERENCES

[1] S. Arunprasath and B. Venkata Nagaraju, "Deep ensemble learning for legal query understanding," in Proceedings of the CIKM 2018 Workshop on Legal Data Analytics and Mining (LeDAM 2018), CEUR-WS.org, October 2018. To appear.
[2] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, "A unifying view on dataset shift in classification," Pattern Recognition, vol. 45, pp. 521–530, Jan. 2012.
[3] H. Raza, G. Prasad, and Y. Li, "Dataset shift detection in non-stationary environments using EWMA charts," in Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2013), pp. 3151–3156, Oct. 2013.
[4] D. Zhao, L. Bu, C. Alippi, and Q. Wei, "A Kolmogorov-Smirnov test to detect changes in stationarity in big data," IFAC-PapersOnLine, vol. 50, pp. 14260–14265, 2017.
[5] H. Raza, H. Cecotti, Y. Li, and G. Prasad, "Adaptive learning with covariate shift-detection for motor imagery-based brain–computer interface," Soft Computing, vol. 20, pp. 3085–3096, Aug. 2016.
[6] S. Yu, X. Wang, and J. C. Príncipe, "Request-and-reverify: Hierarchical hypothesis testing for concept drift detection with expensive labels," pp. 3033–3039, July 2018.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[8] L. Wendlandt, J. K. Kummerfeld, and R. Mihalcea, "Factors influencing the surprising instability of word embeddings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2092–2102, Association for Computational Linguistics, 2018.
[9] X. Chen, M. Monfort, A. Liu, and B. D. Ziebart, "Robust covariate shift regression," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (A. Gretton and C. C. Robert, eds.), vol. 51 of Proceedings of Machine Learning Research, Cadiz, Spain, pp. 1270–1279, PMLR, May 2016.
[10] M. Sugiyama, M. Krauledat, and K.-R. Müller, "Covariate shift adaptation by importance weighted cross validation," Journal of Machine Learning Research, vol. 8, pp. 985–1005, Dec. 2007.
[11] A. Liu and K. Asif, "Addressing covariate shift in active learning with adversarial prediction," ICML 2015 Workshop on Active Learning.
[12] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Bunau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, 2008.
[13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13), pp. 3111–3119, Curran Associates Inc., 2013.
[14] L. Ramshaw and M. Marcus, "Text chunking using transformation-based learning," in Third Workshop on Very Large Corpora, 1995.
[15] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), Berkeley, CA, USA, pp. 265–283, USENIX Association, 2016.