Leveraging the Transductive Nature of e-Discovery in Cost-Sensitive Technology-Assisted Review

Alessio Molinari
Istituto di Scienza e Tecnologia dell'Informazione
Consiglio Nazionale delle Ricerche
56124 Pisa, IT
E-mail: alessiomolinar@gmail.com

Abstract. MINECORE is a recently proposed algorithm for minimizing the expected costs of review for topical relevance (a.k.a. "responsiveness") and sensitivity (a.k.a. "privilege") in e-discovery. Given a set of documents that must be classified by both responsiveness and privilege, for each such document and for both classification criteria MINECORE determines whether the class assigned by an automated classifier should be manually reviewed or not. This determination is heavily dependent on the ("posterior") probabilities of class membership returned by the automated classifiers, on the costs of manually reviewing a document (for responsiveness, for privilege, or for both), and on the costs that different types of misclassification would bring about. We attempt to improve on MINECORE by leveraging the transductive nature of e-discovery, i.e., the fact that the set of documents that must be classified is finite and available at training time. This allows us to use EMQ, a well-known algorithm that attempts to improve the quality of the posterior probabilities of unlabelled documents in transductive settings, with the goal of improving the quality (a) of the posterior probabilities that are input to MINECORE, and thus (b) of MINECORE's output. We report experimental results obtained on a large (≈ 800K) dataset of textual documents.

1 What is e-discovery?

When a civil lawsuit is filed in the United States of America, the judge may request an involved party to produce to the other party any evidence relevant to the lawsuit. Upon receiving such a request, the producing party must search its electronically stored information for any document that might be relevant to the case, in order to disclose it to the other party; this task usually goes under the name of e-discovery. To do so, the producing party typically asks junior lawyers to review (i.e., annotate) the candidate documents for "responsiveness" (i.e., topical relevance to the case), and then asks senior lawyers to review for "privilege" (i.e., presence of sensitive content, which would allow the party to rightfully withhold the document) the documents that have been deemed responsive. As a consequence of this two-phase review, each document is classified into one of the three following classes:

– $c_P$ (which stands for "Produce"): the document has been deemed responsive and not privileged; as such, it should be produced to the receiving party;
– $c_L$ (which stands for "Log"): the document has been deemed responsive and privileged; as such, it should be entered into a "privilege log" (i.e., a repository open to inspection by the judge) and should not be disclosed to the receiving party;
– $c_W$ (which stands for "Withhold"): the document has been deemed not responsive, which means that the producing party should not produce it.
In such a scenario, the producing party may incur two types of cost, i.e., annotation costs (since lawyers need to be paid for their reviewing work) and misclassification costs (which may derive from choosing an inappropriate class – e.g., "Produce" instead of "Log" – for a document). Given the enormous number of digital documents that must undergo review in many practical cases, a need for automating the review process has arisen. Several techniques for technology-assisted review (TAR), which combine techniques from information retrieval and machine learning, have thus been proposed in the last ten years [5]. The aim of TAR algorithms is to support the review process while minimizing the overall costs deriving from annotation effort and misclassification.

2 An overview of MINECORE

MINECORE [4] is a recently proposed decision-theoretic algorithm for minimizing the expected costs of review for responsiveness and privilege in e-discovery. Given a set $D$ of documents that must each be assigned a class in $\mathcal{C} = \{c_P, c_L, c_W\}$ based on whether they belong or not to the class $c_r$ of responsive documents and/or to the class $c_p$ of privileged documents, the goal of MINECORE is to determine, for each document $d$ in $D$, whether manually reviewing $d$ by responsiveness and/or privilege is expected to be cost-effective or not. This determination is based

1. on the ("posterior") probabilities of class membership (written as $\Pr(c_r|d)$ and $\Pr(c_p|d)$, hereafter called the posteriors) returned by automated classifiers $h_r$ (which classifies documents by responsiveness) and $h_p$ (which classifies documents by privilege);
2. on the unit costs of manually checking a document for responsiveness ($\lambda^a_r$) or for privilege ($\lambda^a_p$), where superscript $a$ stands for "annotation";
3. on the costs $\lambda^m_{ij}$ incurred when mistakenly assigning class $c_i$ to a document that should be assigned class $c_j$, where $c_i, c_j \in \mathcal{C}$ and superscript $m$ stands for "misclassification".

Bullet 3 is due to the fact that in e-discovery not all misclassifications are equally serious; for instance, inadvertently disclosing a privileged document is typically a very serious mistake, while inadvertently disclosing a nonresponsive nonprivileged document is usually a less serious one.

MINECORE consists of three phases, which we summarize below. In Phase 1 we train the two automated classifiers $h_r$ and $h_p$, and use them to generate, for each document $d \in D$, the two posteriors $\Pr(c_r|d)$ and $\Pr(c_p|d)$ mentioned in Bullet 1. We can reasonably assume $c_r$ and $c_p$ to be stochastically independent, which implies that we may assume $\Pr(c_P|d) = \Pr(c_r|d)(1 - \Pr(c_p|d))$, $\Pr(c_L|d) = \Pr(c_r|d)\Pr(c_p|d)$, and $\Pr(c_W|d) = 1 - \Pr(c_r|d)$. MINECORE takes a risk-minimization approach, i.e., it classifies each document $d$ into the class

\[ h(d) = \arg\min_{c_i} R(d, c_i) = \arg\min_{c_i} \sum_{j \in \{P,L,W\}} \lambda^m_{ij} \Pr(c_j|d) \tag{1} \]

where $R(d, c_i)$ is the risk associated with assigning $d$ to class $c_i$. In other words, MINECORE assigns to each document $d$ the class that brings about the minimum misclassification risk, i.e., the minimum expected misclassification cost, thus avoiding courses of action for which a combination of probability of class membership and misclassification cost is high. The function for measuring the global misclassification cost is thus

\[ K^m(D) = \sum_{i,j \in \{P,L,W\}} \lambda^m_{ij} D_{ij} \tag{2} \]

where $D_{ij}$ is the number of documents $d \in D$ whose predicted class $h(d)$ is $c_i$ and whose true class (which we denote by $y(d)$) is $c_j$.
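As an illustration of this risk-minimization step, the following Python sketch implements Equation (1) on top of the independence assumption above. The cost matrix and the posteriors in the example are hypothetical placeholders chosen only to make the behaviour visible, not the values used in our experiments.

```python
import numpy as np

CLASSES = ["P", "L", "W"]  # Produce, Log, Withhold

# Hypothetical misclassification cost matrix: lambda_m[i, j] is the cost of
# assigning class c_i to a document whose true class is c_j. The diagonal is
# zero; wrongly producing a privileged document (i=P, j=L) is the worst error.
lambda_m = np.array([
    [0.0, 50.0, 5.0],   # predicted P
    [2.0,  0.0, 3.0],   # predicted L
    [1.0, 10.0, 0.0],   # predicted W
])

def three_class_posteriors(pr_r: float, pr_p: float) -> np.ndarray:
    """Combine Pr(c_r|d) and Pr(c_p|d) into the three-class posteriors,
    under the stochastic-independence assumption of Phase 1."""
    return np.array([
        pr_r * (1.0 - pr_p),  # Pr(c_P|d): responsive, not privileged
        pr_r * pr_p,          # Pr(c_L|d): responsive and privileged
        1.0 - pr_r,           # Pr(c_W|d): not responsive
    ])

def minimum_risk_class(pr_r: float, pr_p: float) -> str:
    """Equation (1): pick the class with minimum expected misclassification cost."""
    posteriors = three_class_posteriors(pr_r, pr_p)
    risks = lambda_m @ posteriors  # risks[i] = sum_j lambda_m[i, j] * Pr(c_j|d)
    return CLASSES[int(np.argmin(risks))]

# A document that is probably responsive and possibly privileged: with the
# cost matrix above, the high cost of wrongly producing it favours "L".
print(minimum_risk_class(0.9, 0.4))
```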
Phase 2 and Phase 3 are essentially identical to each other. The only difference is that, while Phase 2 determines the subset of documents that are expected to be cost-effective to manually review by responsiveness, Phase 3 determines the same for privilege (once those documents have indeed been manually reviewed by responsiveness). We will thus only describe Phase 2, leaving it to the reader to work out the details of Phase 3. Note that, if $\tau_r$ documents are reviewed by responsiveness and $\tau_p$ documents are reviewed by privilege, the overall cost of the entire process may be described as

\[ K^o(D) = K^m(D) + K^a(D) = K^m(D) + \lambda^a_r \tau_r + \lambda^a_p \tau_p \tag{3} \]

If document $d$ is reviewed by responsiveness, this has the effect of removing (assuming infallible reviewers) any uncertainty about whether $d \in c_r$ or not. In other words, if by subscript $n \in \{1, 2, 3\}$ we indicate the value of a given quantity after Phase $n$ has been carried out, reviewing $d$ by responsiveness means that $\Pr_2(c_r|d)$ will be either 0 or 1. As a result, if $d$ is reviewed by responsiveness it will in general hold that $\Pr_1(c_r|d) \neq \Pr_2(c_r|d)$, $h_1(d) \neq h_2(d)$, and $K^m_1(D) \neq K^m_2(D)$. Since reviewing $d$ by responsiveness brings about an additional cost $\lambda^a_r$, it is worthwhile to annotate $d$ only if, as a result of the annotation, $K^o_2(D) \leq K^o_1(D)$, i.e., $K^m_2(D) + \lambda^a_r \leq K^m_1(D)$; in other words, the additional annotation cost must be offset by a reduction in overall misclassification cost of greater or equal magnitude. Of course, computing precisely whether there is going to be such a reduction is not possible, because at the time of deciding whether $d$ should be annotated or not we know neither the value of $y_r(d)$ (a binary variable that indicates whether the reviewer will annotate $d$ as responsive or not) nor the true label $y(d)$ of $d$. However, it is possible to compute an expectation of this reduction over the $y_r(d)$ and $y(d)$ variables; when this expected value exceeds $\lambda^a_r$, MINECORE decrees that $d$ should be annotated by responsiveness. We refer the reader to [4, §3] for details on how the above expected value is computed, and for a full mathematical specification.

3 Improving our posteriors via EMQ

The goal of MINECORE is to identify the $\tau_r$ (resp., $\tau_p$) documents that, when reviewed for responsiveness (resp., privilege), will each bring about a reduction in the expected overall cost of the review process. In order to do this, MINECORE uses as input the data listed in the three bullets at the beginning of §2. In this work we attempt to improve this process not by modifying MINECORE, but by improving the quality of the input that MINECORE receives, with the goal of having MINECORE bring about a higher reduction in expected costs. Since the input data mentioned in Bullets 2 and 3 are user-defined parameters, and are hence not under our control, we focus on the input data mentioned in Bullet 1, i.e., the posteriors $\Pr(c_r|d)$ and $\Pr(c_p|d)$.

Our contribution is based on the observation that e-discovery has, from the standpoint of machine learning, a transductive nature, i.e., the unlabelled set $D$ is finite and available at training time. This fact can be exploited in order to improve the quality of our posteriors.¹ In fact, Saerens and colleagues [6] have presented an instance of the well-known EM ("expectation maximization") algorithm (an instance which, following [2], we will call EMQ) that iteratively improves the quality of such probabilities in transductive contexts.

¹ At a first approximation, we may say that the quality of a posterior $\Pr(c|d)$ is high when $|h_c(d) - \Pr(c|d)|$ is low; see §4 for more details.

EMQ is based on a mutually recursive algorithm that alternates between (a) recomputing the prior probabilities $\Pr(c)$ (hereafter: the priors) as the average of the posteriors over the entire set, and (b) recomputing the posteriors by multiplying the old posteriors by the ratio between the new priors and the old priors (and renormalizing). The iteration is carried out until convergence, and has been shown to deliver both improved posteriors [6] and improved priors [2].
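The following Python fragment is a minimal sketch of this iteration for a single binary class; the function name and the toy data are ours, and the initial posteriors are assumed to come from a calibrated classifier.

```python
import numpy as np

def emq(post0: np.ndarray, prior0: float,
        max_iter: int = 1000, eps: float = 1e-6):
    """EMQ (Saerens et al. [6]) for a single binary class c.

    post0:  initial posteriors Pr(c|d), one per unlabelled document,
            from a (calibrated) classifier trained with class prior prior0.
    Returns the adjusted posteriors and the re-estimated prior.
    """
    prior = prior0
    post = post0.copy()
    for _ in range(max_iter):
        # Recompute the posteriors: rescale the original posteriors by the
        # ratio between the current prior estimate and the training prior,
        # then renormalize so that the two class masses sum to one.
        pos = (prior / prior0) * post0
        neg = ((1.0 - prior) / (1.0 - prior0)) * (1.0 - post0)
        post = pos / (pos + neg)
        # Recompute the prior as the average posterior over the entire set.
        new_prior = float(post.mean())
        if abs(new_prior - prior) < eps:  # stop at convergence
            prior = new_prior
            break
        prior = new_prior
    return post, prior

# Toy usage: posteriors for 1000 documents from a classifier trained with
# prevalence 0.5; EMQ re-estimates the prior and adjusts the posteriors.
rng = np.random.default_rng(0)
toy_posteriors = rng.beta(2, 5, size=1000)  # skewed towards the negative class
adjusted_post, estimated_prior = emq(toy_posteriors, prior0=0.5)
```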
Our pipeline is therefore as follows: (a) we obtain calibrated posteriors $\Pr(c_r|d)$ and $\Pr(c_p|d)$, for each document $d \in D$, from our classifiers $h_r$ and $h_p$ (which, in the experiments of §4, we obtain via a linear SVM), (b) we update them via EMQ, and (c) we feed them to MINECORE.

4 Experiments

We run experiments in which we compare two versions of MINECORE, one that uses the original posteriors (here dubbed MINECORE_MLE, since the probabilities are obtained from a maximum likelihood estimator) and one that uses the EMQ-enhanced posteriors (here dubbed MINECORE_EMQ). We use the same dataset as [4], which consists of ≈ 20,000 training documents, ≈ 780,000 test documents, and 120 instantiations of the $(c_r, c_p)$ pair of classes; the values of $\lambda^a_r$, $\lambda^a_p$, and $\lambda^m_{ij}$ are from "CostStructure1" in [4, §4.3]. We refer to [4, §4] for other details about the experimental setup. The evaluation function we use is $K^o(D)$ from Equation 3. We measure the ratio

\[ \Delta K^o(D) = \frac{K^o_{\mathrm{MLE}}(D)}{K^o_{\mathrm{EMQ}}(D)} \]

between $K^o(D)$ as deriving from MINECORE_MLE and $K^o(D)$ as deriving from MINECORE_EMQ; if this ratio is > 1, this means that EMQ delivers an improvement, and that our attempt has been successful.

We show the outcome of our experiment in Figure 1, where we display the class pairs on the X axis (sorted by their $\Delta K^o(D)$ value in order to improve the readability of the plot) and $\Delta K^o(D)$ on the Y axis; the red line indicates the value $\Delta K^o(D) = 1$. A dot above the red line indicates, for the corresponding class pair, an improvement in overall cost with respect to the MINECORE_MLE baseline, whereas a dot below the red line indicates a deterioration.

Fig. 1. Values of $\Delta K^o(D)$ for the 120 class pairs; values above (resp., below) the red line indicate that the use of EMQ has generated an improvement (resp., a deterioration).

Figure 1 shows that for 60% of the pairs, using EMQ brings about a deterioration. The average value of $\Delta K^o(D)$ across the 120 class pairs is 0.99, which indicates a 1% average deterioration. This is surprising, given the positive results that have been reported for EMQ in past literature. As a result, we have further compared the quality of the $\Pr_{\mathrm{MLE}}(c|d)$ and $\Pr_{\mathrm{EMQ}}(c|d)$ sets of posteriors via an application-independent measure of quality. For this, we have chosen "soft accuracy" (a variant of what has been called the Brier score), which corresponds to the standard accuracy measure $A = \frac{TP+TN}{TP+FP+FN+TN}$ as evaluated on a "soft" contingency table whose cells are (instead of counts) sums of probabilities of class membership.² Essentially no difference in soft accuracy was detected, with MLE posteriors obtaining $A = 0.9743$ and EMQ posteriors obtaining $A = 0.9742$.

² For example, for a document $d$ that actually belongs to $c$: (1) if $c$ is actually assigned, for a standard contingency table this contributes a value of 1 to the TP cell; (2) if a posterior $\Pr(c|d)$ is returned, for a "soft" contingency table this contributes a value of $\Pr(c|d)$ to the TP cell and a value $(1 - \Pr(c|d))$ to the FN cell.
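As a concrete illustration of this "soft" contingency table, a minimal Python sketch for a single binary class follows; the function and variable names are ours.

```python
import numpy as np

def soft_accuracy(posteriors: np.ndarray, labels: np.ndarray) -> float:
    """Soft accuracy: A = (TP+TN)/(TP+FP+FN+TN) computed on a "soft"
    contingency table whose cells accumulate posterior probability mass
    instead of counts.

    posteriors: Pr(c|d) for each document; labels: 1 if d belongs to c, else 0.
    """
    y = labels.astype(float)
    tp = float(np.sum(posteriors * y))              # mass correctly on positives
    fn = float(np.sum((1.0 - posteriors) * y))      # mass missed on positives
    fp = float(np.sum(posteriors * (1.0 - y)))      # mass wrongly on negatives
    tn = float(np.sum((1.0 - posteriors) * (1.0 - y)))
    return (tp + tn) / (tp + fp + fn + tn)          # denominator = |D|

# Example: near-perfect posteriors give A close to 1 (here (1.9+1.0)/3 ≈ 0.967).
print(soft_accuracy(np.array([1.0, 0.0, 0.9]), np.array([1, 0, 1])))
```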
Note that, in our case, $\Delta K^o$ is a more appropriate evaluation measure than $A$, since it is application-dependent: while $A$ measures the quality of our posteriors in a somewhat abstract way, $\Delta K^o$ directly measures the impact that these posteriors have on MINECORE; this adds to the disappointment brought about by the results returned by EMQ.

We have also checked whether the application of EMQ has improved the class priors. In order to do so we have measured, for each estimation method $M \in \{\mathrm{MLE}, \mathrm{EMQ}\}$ and for each class $c$ that shows up in at least one of the 120 class pairs, the absolute estimation error (AE), i.e., the absolute value

\[ \mathrm{AE}_M(c) = |\Pr(c) - \hat{\Pr}_M(c)| \]

of the difference between the true class prior $\Pr(c)$ and the estimated class prior $\hat{\Pr}_M(c)$; for each $M \in \{\mathrm{MLE}, \mathrm{EMQ}\}$ we have then averaged these values across all such classes. The results show that $\mathrm{AE}_{\mathrm{MLE}} = 0.191$ and $\mathrm{AE}_{\mathrm{EMQ}} = 0.082$, i.e., the use of EMQ brings about a reduction of approximately 57% in the absolute estimation error of the class priors.

Overall, this result is surprising: the use of EMQ improved (as expected, and by a very large margin) the quality of the priors, but did not bring about any improvement (actually, it brought about a small deterioration) in the quality of the posteriors. The reason this is surprising is that the quality of the posteriors and the quality of the priors should go hand in hand; indeed, the very reason why EMQ is expected to improve the priors is that it computes them as the average of the supposedly better-quality posteriors that it also computes.

One of the reasons we are witnessing such unexpected behaviour might lie in the fact that the "distribution drift" in our data is very low (the average difference between the prevalence of a class in the training set and in the test set is ≤ 0.6%). Indeed, the results of [1] indicate that EMQ tends to work better in high-drift conditions than in low-drift ones. In order to check whether, in a dataset characterized by higher drift, EMQ would deliver a better performance, we have generated artificial drift in the dataset by removing increasingly large quantities of negative examples from the training set (a sketch of this protocol is given below). This should allow EMQ to improve the quality of both priors and posteriors, thus improving the performance of MINECORE. Since example removal is completely random, we have repeated the experiment several times, first removing up to 60% (and eventually up to 80%) of the negatives. MINECORE was then run first with MLE-generated posteriors and then with EMQ-generated posteriors. However, once again the results did not support our conjecture: even after removing 80% of the negatives from the training set (see Figure 2), $\Delta K^o(D)$ does not get higher than 1, and gets slightly smaller than 1 in most cases.

Fig. 2. Values of $\Delta K^o(D)$ for the 120 class pairs after a random 80% of the negative training examples have been removed.
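For concreteness, the following Python sketch shows the kind of removal protocol used to induce drift; the function name and the seed handling are illustrative, and the training data are assumed to be held in numpy arrays.

```python
import numpy as np

def remove_negatives(X: np.ndarray, y: np.ndarray, fraction: float,
                     seed: int = 0):
    """Randomly drop a given fraction of the negative training examples
    (label 0), leaving the positives and the test set untouched, so that
    the training prevalence no longer matches the test prevalence."""
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(y == 0)
    dropped = rng.choice(neg_idx, size=int(fraction * len(neg_idx)),
                         replace=False)
    keep = np.setdiff1d(np.arange(len(y)), dropped)
    return X[keep], y[keep]
```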
5 Conclusion

Although our idea of improving the posteriors input to MINECORE by leveraging the transductive nature of our dataset seemed (and still seems to us) reasonable and promising, the experiments we have run so far indicate the opposite: EMQ has not improved our posteriors, while it has improved the priors. This is still work in progress, and one hypothesis that we are going to test is that the EMQ algorithm might be well-suited to datasets that exhibit some types of drift but not others [3]. We still believe that exploiting the transductive nature of e-discovery could prove a key intuition for improving the results of MINECORE; indeed, our future work will focus on new ways of leveraging this aspect.

Acknowledgments. The research described in this paper is part of the work carried out by the author for his MSc thesis in Digital Humanities at the University of Pisa (supervisors: Andrea Esuli and Fabrizio Sebastiani), to be discussed in 2019.

References

1. Caelen, O.: Quantification and learning algorithms to manage prior probability shift. Master's thesis, Institut de Statistique, Biostatistique et Sciences Actuarielles, Université Catholique de Louvain, Louvain-la-Neuve, BE (2017)
2. Gao, W., Sebastiani, F.: Tweet sentiment: From classification to quantification. In: Proceedings of the 7th International Conference on Advances in Social Network Analysis and Mining (ASONAM 2015). pp. 97–104. Paris, FR (2015). https://doi.org/10.1145/2808797.2809327
3. Moreno-Torres, J.G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recognition 45(1), 521–530 (2012). https://doi.org/10.1016/j.patcog.2011.06.019
4. Oard, D.W., Sebastiani, F., Vinjumur, J.K.: Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery. ACM Transactions on Information Systems 37(1), 11:1–11:35 (2019). https://doi.org/10.1145/3268928
5. Oard, D.W., Webber, W.: Information retrieval for e-discovery. Foundations and Trends in Information Retrieval 7(2/3), 99–237 (2013). https://doi.org/10.1561/1500000025
6. Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation 14(1), 21–41 (2002). https://doi.org/10.1162/089976602753284446