=Paper=
{{Paper
|id=Vol-2888/paper10
|storemode=property
|title=On the Effectiveness of Portable Models versus Human Expertise under Continuous Active Learning
|pdfUrl=https://ceur-ws.org/Vol-2888/paper10.pdf
|volume=Vol-2888
|authors=Jeremy Pickens,Thomas C. Gricks III, Esq.
|dblpUrl=https://dblp.org/rec/conf/icail/PickensG21
}}
==On the Effectiveness of Portable Models versus Human Expertise under Continuous Active Learning==
Jeremy Pickens, OpenText, Denver, USA, jpickens@opentext.com
Thomas C. Gricks III, Esq., OpenText, Denver, USA, tgricks@opentext.com

Proceedings of the Second International Workshop of AI and Intelligent Assistance for Legal Professional in the Digital Workplace (LegalAIIA 2021), held in conjunction with ICAIL 2021. June 21, 2021. São Paulo, Brazil. Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

KEYWORDS
continuous active learning, human augmentation, human expertise, transfer learning, portable models

1 INTRODUCTION

eDiscovery is the process of identifying, preserving, collecting, reviewing, and producing to requesting parties electronically stored information that is potentially relevant to a civil litigation or regulatory inquiry. Of these activities, the review component is by far the most expensive and time consuming [8]. Modern, effective approaches to document review run the gamut from pure human-driven processes such as boolean keyword search followed by linear review, to predominantly AI-driven approaches using various forms of machine learning. A review process that involves a significant, though not exclusive, supervised machine learning component is typically referred to as technology assisted review (TAR).

One of the most efficient approaches to TAR in recent years involves a combined human-machine (IA, or intelligence amplification) approach known as Continuous Active Learning (CAL) [5]. As with any TAR review, a CAL review will benefit in some measure by overcoming the cold start problem: the machine typically cannot begin making predictions until it has been fed some number of training documents, aka seeds. In an early CAL approach, initial sets of training documents were selected via human effort, e.g., manual keyword searching. This approach to selecting seed documents relies on human knowledge and intuition.

Recently in the legal technology sector, another seeding approach has been gaining momentum, one that does not rely on human assessment of the review collection but is based on artificial intelligence (AI) methods and derived from documents outside the collection. For this technique, which is often referred to as "portable models", and known in the wider machine learning community as transfer learning, initial seed documents are selected not via human input, but by predictions from a machine learning model trained using documents from prior matters or related datasets. Portable models take a pure AI approach and eschew human knowledge in the cold start seeding process.

Notwithstanding the benefits asserted by the proponents of portable models as a seed-generation technique, we are aware of no formal or even informal studies addressing the overall impact of portable model seeding on the efficiency of a TAR review relative to human-driven seeding. It is an open question whether technology assisted review seeded by portable models offers a clear, sustained advantage over approaches that begin with human input. Therefore, this work constitutes an initial study into the relationship between human vs machine seeding and overall review efficiency.

2 MOTIVATION

Separate and apart from the inherent value of an assessment of the impact of portable models on TAR, there are two principles attendant to the creation of portable models that serve as a further motivation for this study: (1) the increased regulatory pressure to maintain personal privacy; and (2) the growing need for stringent cyber security measures. Consideration of both principles is generally recognized as an essential step in the development and utility of modern AI applications, given their breadth and proliferation.

Recent years have seen increased scrutiny from EU and United States regulatory agencies. Data collection and reuse is under heavy examination as regulators seek to minimize data collection and maximize privacy and security. Portable models are a form of data reuse; the models would not exist were it not for the original data. As such, there are rights and obligations around the use of the data that goes into training portable models, and a strong need for clearer assessments of risk when porting models. As Bacon et al noted [1]:

"The use of machine learning ("ML") models to process proprietary data is becoming increasingly common as companies recognize the potential benefits that ML can provide. Many IT vendors offer ML services that can generate valuable insights derived from their customer's proprietary data and know-how. For companies that have not yet established their own ML expertise in-house, these services can offer significant business advantages. However, there may be cases where one party owns the ML model, another party has the business expertise, and a third party owns the data. In such cases, significant intellectual property ("IP") and data protection and security risks may arise. Naturally, most companies that invest in building an ML model are looking for a return on their investment. From a financial perspective, such companies focus on using the IP laws and related IP contract terms, such as IP assignments and license grants, to maximize their control over the ML model and associated input and results. Data protection laws can run counter to these objectives by imposing an array of requirements and restrictions on the processing of various types of data, particularly to the extent they include personal information. The interplay between these competing considerations can lead to interesting results, especially when a number of different parties have a stake in the outcome."

The second, perhaps more important challenge with respect to portable models is the possibility of data leakage. In recent years, computer security and machine learning researchers have increased the sophistication of membership inference attacks [9]. These attacks are a way of probing black box, non-transparent models to "discover or reconstruct the examples used to train the machine learning model" [6]. The basic process is that:

"An attacker creates random records for a target machine learning model served on a [portable model] service. The attacker feeds each record into the model. Based on the confidence score the model returns, the attacker tunes the record's features and reruns it by the model. The process continues until the model reaches a very high confidence score. At this point, the record is identical or very similar to one of the examples used to train the model. After gathering enough high confidence records, the attacker uses the dataset to train a set of "shadow models" to predict whether a data record was part of the target model's training data. This creates an ensemble of models that can train a membership inference attack model. The final model can then predict whether a data record was included in the training dataset of the target machine learning model. The researchers found that this attack was successful on many different machine learning services and architectures." [6]
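To make the quoted procedure more concrete, the following is a minimal, illustrative sketch of a shadow-model membership inference attack against a black-box classifier. It runs on synthetic scikit-learn data, omits the confidence-guided record-synthesis step described above, and every modeling choice and name in it is an assumption for illustration rather than the exact method of [6] or [9].

    # Minimal sketch (assumption-laden, for illustration only) of a shadow-model
    # membership inference attack on a black-box classifier, using scikit-learn.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=6000, n_features=20, random_state=0)

    # The "target" model: the attacker can only query it for confidence scores.
    X_in, y_in = X[:1000], y[:1000]    # its private training data ("members")
    X_out = X[1000:2000]               # data it never saw ("non-members")
    target = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_in, y_in)

    # Shadow models are trained on data the attacker controls, so membership labels
    # are known; their confidence vectors become training data for the attack model.
    attack_X, attack_y = [], []
    for s in range(4):
        lo = 2000 + s * 1000
        Xs, ys = X[lo:lo + 1000], y[lo:lo + 1000]
        shadow = RandomForestClassifier(n_estimators=100, random_state=s).fit(Xs[:500], ys[:500])
        attack_X.append(shadow.predict_proba(Xs[:500]))   # confidence vectors for members
        attack_y.append(np.ones(500))
        attack_X.append(shadow.predict_proba(Xs[500:]))   # confidence vectors for non-members
        attack_y.append(np.zeros(500))
    attack_model = LogisticRegression().fit(np.vstack(attack_X), np.concatenate(attack_y))

    # Apply the attack model to the target's confidence scores for known members/non-members.
    acc = 0.5 * (attack_model.predict(target.predict_proba(X_in)).mean()
                 + 1.0 - attack_model.predict(target.predict_proba(X_out)).mean())
    print("membership inference accuracy:", acc)

An attack accuracy meaningfully above 50% on a probe like this is the signal that membership in the training data is leaking through the model's confidence scores.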
Carlini et al [3, 4] further elaborate on the potential for portable models to reveal private or sensitive information:

"One such risk is the potential for models to leak details from the data on which they're trained. While this may be a concern for all large language models, additional issues may arise if a model trained on private data were to be made publicly available. Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII): names, phone numbers, addresses, etc., even if trained on public data. This raises the possibility that a model trained using such data could reflect some of these private details in its output."

Tramer et al [10] note that entire models may even be stolen via such techniques, even when the adversary only has black box access (observations of outputs only, rather than internal workings) to the model: "The tension between model confidentiality and public access motivates our investigation of model extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to duplicate the functionality of (i.e., "steal") the model... We show simple, efficient attacks that extract target ML models with near-perfect fidelity for popular model classes including logistic regression, neural networks, and decision trees."

Given the potential dangers associated with modern AI applications such as portable models, we therefore ask: Do portable models provide a cognizable, sustained advantage over human augmented IA processes sufficient to warrant their use in the face of privacy and cybersecurity concerns? If not, perhaps the safer and more appropriate approach is to continue using traditional human-driven techniques.

3 RELATED WORK

The key foundation in our investigation is the observation that the current state-of-the-art document review TAR process is based on continuous active learning (CAL) [5]. Given seed documents, the basic CAL process induces a supervised machine learning model which then predicts the most likely responsive, unreviewed documents. After some (relatively small) number of those top-ranked predictions are reviewed and coded, another model is induced and the next most likely documents are queued for review. The process continues until a high recall target is hit.
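As a point of reference, the following is a minimal sketch of the CAL loop just described. The feature representation, classifier, batch size default, and helper names are illustrative assumptions, not the implementation of [5] used in this study (which, as noted in Section 5.2, updates every 30 documents and runs until 80% recall is reached).

    # Minimal sketch of a continuous active learning (CAL) review loop; illustrative
    # assumptions throughout (TF-IDF features, logistic regression, simulated coding).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def cal_review(docs, labels, seed_ids, batch_size=30, recall_target=0.80):
        """Return the review order; `labels` stands in for the reviewer's coding decisions."""
        X = TfidfVectorizer().fit_transform(docs)
        total_rel = int(np.sum(labels))
        reviewed, seen = list(seed_ids), set(seed_ids)
        while True:
            found = sum(labels[i] for i in reviewed)
            if found >= recall_target * total_rel:
                return reviewed                        # recall target hit: stop the review
            coded = np.array([labels[i] for i in reviewed])
            if len(set(coded)) < 2:
                # cold start: no positive (or no negative) example coded yet (illustrative fallback)
                batch = [i for i in range(len(docs)) if i not in seen][:batch_size]
            else:
                model = LogisticRegression(max_iter=1000).fit(X[reviewed], coded)
                scores = model.predict_proba(X)[:, 1]  # re-rank every document
                order = np.argsort(-scores)
                batch = [int(i) for i in order if i not in seen][:batch_size]
            reviewed.extend(batch)                     # "review" (code) the next batch
            seen.update(batch)

In a loop like this, the seeding differences studied in this paper (human, random, or portable-model seeds) enter only through the initial seed set; everything after the cold start is the same iterative train-rank-review cycle.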
Review workflows that are based on CAL have what might be called a "just in time" approach to prediction. Rather than attempting to induce a perfect model up front, CAL workflows dynamically adjust as the review continues. Often this means that early disadvantages, and even early advantages, wash out in the process. For example, [citation anonymized for review] found that four searchers, each working independently to find seed documents, found different seeds and different numbers of seeds. But after separately using each seed set to initialize a CAL review, approximately the same number of documents needed to be reviewed to achieve high recall. This study asks similar questions in the context of portable models: whether there is a significant improvement in review efficiency when using portable models relative to traditional, non-AI techniques.

Another common portable model theme is the claim that the more historical data they are trained on, the better their predictions will be. While that may be true in some instances, it may not be in others. What constitutes privileged documents in one matter might have a different set of characteristics than privileged documents in other matters. What constitutes evidence of fraud or sexual harassment in one matter might be different than in other matters. No amount of "big data" gathered from dozens (hundreds? thousands?) of prior matters and composed into a monolithic portable model may be relevant to the current problem if the patterns in the current problem don't match the historical ones. Therefore a question that every eDiscovery practitioner should be asking herself is where the best source of evidence for seeding the current task lies. As [2] notes: "The real goal should not be big data but to ask ourselves, for a given problem, what is the right data and how much of it is needed. For some problems this would imply big data, but for the majority of the problems much less data is necessary."

4 RESEARCH QUESTIONS

We engage three primary research questions. The first question level-sets the value of the pure AI (portable model) approach. The second two questions compare the portable model approach to a human-initiated process.

• RQ1: Does a CAL review seeded by a portable model outperform (at high recall) one seeded randomly?
• RQ2: Do portable models initially find more relevant documents than does human effort?
• RQ3: Does a CAL review seeded by a portable model outperform (at high recall) one seeded by human effort?

When attempting to consider these questions in general, issues naturally arise: What portable models are we talking about? Trained on what data? And how close was that data to the target distribution? And what humans seeded the comparison approach? And what was their prior knowledge of the subject matter?

These questions matter, and while we cannot answer them for every possible training set and human searcher, we have structured the experiments in such a way as to give the most possible "benefit of the doubt" to the portable model, and the least possible benefit to the human searcher. Thus if there are significant advantages of portable models over human effort, these should be most readily apparent when portable models are given the most affordances and humans the least.

The primary manner in which portable models are given an advantage is that we train them on a set of documents that is drawn from the exact same distribution as the target collection to which they will be applied. In practice, portable models are never given this advantage. Prior cases in eDiscovery are not always exactly the same. Different collections, even from the same corporate entity, exhibit different distributions, especially as employees and business activities change and evolve over time. Naturally, the more different the source distribution, the less effective portable models will be when applied to a new target collection. However, holding the distribution the same gives us an upper bound on portable model effectiveness and establishes a strong baseline against which the human effort can be compared.

At the same time, the human effort is minimized. As will be described in more detail below, a small team of human searchers worked for a collective total of approximately half an hour per topic. None of the humans were experts in any of the topics, nor did anyone have recent prior knowledge of the topics, as the events in this Jeb Bush TREC collection [7] took place a decade or more prior to when the searchers worked, and most of the issues were local to Florida and did not make national news. In practice, humans are rarely given this disadvantage. They often work for more than thirty minutes on a problem and can have broad domain expertise that comes from having worked on similar cases in the past.

Thus, our experiments consist of a comparison between portable models trained in the best possible light vs human effort that is kept at a minimum. We do this because the core concept of portable models is that they will be sufficiently broad in scope so as to be able to identify relevant documents in a collection that contains documents of a similar content and context to those on which they were trained. ("Relevance" here refers to the notion of "what is desired", be it some sort of topical similarity such as age discrimination or fraud cases, or something like privilege.) That distributional similarity is not always guaranteed, and in fact it can be difficult a priori to know whether a portable model has been trained on data similar enough to be useful. By using documents intentionally drawn from the exact same distribution, we are able to show an upper bound on portable model effectiveness. In practice, portable model effectiveness is likely to be lower, though how much lower remains to be studied.

5 EXPERIMENTS

5.1 Data

We test these research questions using the TREC 2016 total recall track document collection, topics, and relevance judgments [7]. This dataset contains 34 topics, each with a varying number of relevant documents. The richness of the majority of topics is under 1%, i.e. relatively low richness topics, where portable models are alleged to be most effective. Table 1 contains statistics on each topic. The first column is the topic ID, from 401 to 434, sorted in a manner that will be described in Section 6.3. The next two columns contain the number of total relevant documents and the richness for each topic. There are 290,099 total documents in the collection.

Human effort, aka manual seeding, was done with a small team of four searchers. For each topic, two of the searchers were instructed to run a single query and code the first 25 documents that resulted from that query. The other two searchers were given more interactive leeway and were instructed to utilize as many searches and whatever other analytic tools (clustering, timeline views, etc.) as they wanted, with a goal of working for about 15-30 minutes and stopping once they had tagged 25 documents. This was not strictly controlled, and some reviewers worked a few minutes longer, some a few minutes shorter. And some marked a few more than 25 documents, and some a few less, as is to be expected in the normal, "in the moment" flow of knowledge work. Table 1 contains the manual effort statistics (the total number of queries, total number of minutes, and total unique documents tagged as either relevant or non-relevant) for each topic. On average, the human reviewers worked for 31.4 minutes, issued 9.9 queries, and coded 64.3 documents, so the overall effort was done at a fairly high pace and was relatively minimal in comparison to the size of the collection.

Table 1: Collection and Manual Effort Statistics. (Total Rel and Richness are collection statistics; Queries, Minutes, and Total Docs are manual seeding statistics.)

| Topic | Total Rel | Richness | Queries | Minutes | Total Docs |
|-------|-----------|----------|---------|---------|------------|
| 403 | 1090 | 0.38% | 9 | 52 | 60 |
| 422 | 31 | 0.01% | 11 | 26 | 63 |
| 424 | 497 | 0.17% | 13 | 48 | 81 |
| 426 | 120 | 0.04% | 7 | 26 | 32 |
| 420 | 737 | 0.25% | 7 | 11 | 54 |
| 407 | 1586 | 0.55% | 6 | 23 | 59 |
| 414 | 839 | 0.29% | 12 | 39 | 108 |
| 410 | 1346 | 0.46% | 10 | 24 | 69 |
| 401 | 229 | 0.08% | 9 | 29 | 56 |
| 406 | 127 | 0.04% | 9 | 25 | 42 |
| 433 | 112 | 0.04% | 12 | 43 | 65 |
| 415 | 12106 | 4.17% | 11 | 43 | 77 |
| 430 | 991 | 0.34% | 9 | 41 | 72 |
| 417 | 5931 | 2.04% | 10 | 25 | 85 |
| 413 | 546 | 0.19% | 9 | 23 | 88 |
| 432 | 140 | 0.05% | 13 | 29 | 45 |
| 402 | 638 | 0.22% | 6 | 8 | 75 |
| 427 | 241 | 0.08% | 9 | 25 | 82 |
| 419 | 1989 | 0.69% | 7 | 27 | 50 |
| 404 | 545 | 0.19% | 10 | 23 | 54 |
| 408 | 116 | 0.04% | 9 | 23 | 91 |
| 418 | 187 | 0.06% | 8 | 27 | 53 |
| 411 | 89 | 0.03% | 9 | 29 | 64 |
| 412 | 1410 | 0.49% | 8 | 29 | 64 |
| 416 | 1446 | 0.50% | 8 | 34 | 51 |
| 423 | 286 | 0.10% | 8 | 22 | 41 |
| 429 | 827 | 0.29% | 12 | 35 | 62 |
| 409 | 202 | 0.07% | 15 | 44 | 67 |
| 405 | 122 | 0.04% | 9 | 36 | 66 |
| 428 | 464 | 0.16% | 13 | 36 | 67 |
| 431 | 144 | 0.05% | 10 | 28 | 57 |
| 434 | 38 | 0.01% | 14 | 43 | 50 |
| 425 | 714 | 0.25% | 12 | 47 | 76 |
| 421 | 21 | 0.01% | 14 | 45 | 60 |
| Averages | | | 9.9 | 31.4 | 64.3 |

5.2 Experiment Structure

In order to compare portable models against both random and human-seeded techniques in RQ1 through RQ3, there needs to be a collection on which the portable model can be trained, separate from the collection on which it and the comparative approaches are deployed. We will refer to these two collections as "source" and "target", respectively. For the reasons enumerated in Section 4, we carve out the portable model training source collection from the same distribution as the target collection, and do so by selecting documents at random. For a given topic, we proceed as follows (a code sketch of these steps appears after the list):

(1) Shuffle the collection randomly
(2) Split the collection into k groups
(3) For each group:
    (a) Use that group as the portable model training "source" collection S
    (b) Train a model M using every document in S
    (c) Use the remaining groups as the "target" collection T
    (d) Select manual (human) seeds H by intersecting all found docs (see Table 1) with T
    (e) Select random seeds R from T until five positive examples are found
    (f) Select portable seeds P from the top of the M-induced ranking on T, in an amount equal to |H|
    (g) Use the appropriate seeds to run each experiment RQ1 through RQ3
(4) Average results across all k groups for the topic, but do not average across topics
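Steps (1) through (3f) of the per-topic procedure can be summarized in the sketch below. The data structures, the caller-supplied `train_model` and `rank` helpers, and the dictionary layout are assumptions made for illustration, not the authors' code; step (3g) then runs the appropriate comparison for each research question (for example, R vs. P for RQ1 and H vs. P for RQ3), e.g. with a CAL loop like the one sketched in Section 3.

    # Illustrative sketch of steps (1)-(3f); not the authors' code. `labels` is assumed
    # to map document id -> 0/1 relevance; `train_model` and `rank` are caller-supplied
    # stand-ins for inducing the portable model M and ranking target documents with it.
    import random

    def seeds_for_topic(doc_ids, labels, human_found, k, train_model, rank, seed=0):
        rng = random.Random(seed)
        ids = list(doc_ids)
        rng.shuffle(ids)                                    # (1) shuffle the collection
        folds = [ids[i::k] for i in range(k)]               # (2) split into k groups
        per_fold = []
        for fold in folds:
            S = set(fold)                                   # (3a) this group is the "source"
            M = train_model(S, labels)                      # (3b) portable model from all of S
            T = [d for d in ids if d not in S]              # (3c) remaining groups = "target"
            # (In the 80/20 case the paper reverses (3a) and (3c): the current group is the target.)
            T_set = set(T)
            H = [d for d in human_found if d in T_set]      # (3d) manual seeds restricted to T
            R = random_seeds(T, labels, rng)                # (3e) random seeds, five positives
            P = rank(M, T)[:len(H)]                         # (3f) portable seeds, |P| = |H|
            per_fold.append({"T": T, "H": H, "R": R, "P": P})
        return per_fold                                     # (4) caller averages across folds

    def random_seeds(T, labels, rng, positives_needed=5):
        """Draw documents from the target at random until five positives are found."""
        pool = list(T)
        rng.shuffle(pool)
        picked, positives = [], 0
        for d in pool:
            picked.append(d)
            positives += labels[d]
            if positives >= positives_needed:
                break
        return picked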
The specifics of step (3g) depend on the research question being tested. For example, for RQ1, R and P are each (separately) used to seed a continuous active learning (CAL) process. For RQ3, H and P are used. Other than the different seedings, these CAL processes are run exactly as in [5], except that updates are done every 30 documents rather than every 1000 documents. And unlike some approaches, the learning is not relevance feedback for a limited number of steps. It is truly continuous in that it does not stop until the desired recall level is achieved, which in these experiments is set to 80%.

While selecting a source collection for portable model training from the same distribution as the target collection already offers great advantage to the predictive capabilities of a portable model, i.e. puts it above where it would likely perform in more realistic scenarios, we extend this advantage even further by giving the model larger and larger source collections on which to train. We compare three primary source/target partitions: 20/80, 50/50, and 80/20, with k=5, k=2, and k=5, respectively. (In the 80/20 case, steps (3a) and (3c) are reversed, with the current group used as the target collection and the other groups used as the source collection.) The reason for the 20/80 partition is that the eDiscovery problem is a recall-oriented task. The larger the review population, aka the target collection, the more realistic the CAL process is likely to be. However, the disadvantage is that only 20% of the TREC collection can be used for training the portable model. The 80/20 partition reverses the balance: 80% of the collection is used to train the portable model, but only 20% of the collection is available to simulate the CAL review, which can be problematic for especially sparse topics. The 50/50 partition splits the difference.

Comment: An astute observer may find slight fault with the structure of this experimental setup, in that there is a small amount of knowledge overlap between the source and target partitions when doing human seeding. Specifically, the human searchers originally searched across the entire collection rather than across split collections. It is possible that a document found by a human searcher that ended up in a source partition led the human to issue a query that found more or better documents that ended up in the target partition. Thus even though the human-found documents in only the target partition are used to seed a CAL process (Step 3d, above), the existence of some of those seeds could have been influenced by knowledge of documents in the source partition. We note this issue and make it explicit, but do not think that it affects the overall conclusions of the experiment. One reason is that even if humans had some knowledge of documents in the source partition when finding the documents in the target partition, the portable model M is given knowledge of every document, positive and negative, in the source partition. Table 3 shows the raw number of positive documents used for training in the source partition, and it swamps the documents that the humans would have looked at in their short search sessions. Thus, one can think of any overlap during human seed selection as the background knowledge that a human would likely already be expected to possess when working in a real scenario. E.g. an investigator working on detecting fraud or sexual harassment likely has some implicit background knowledge of fraud or sexual harassment.

6 RESULTS

6.1 RQ1: Portable- vs Random-Seeded Recall

The results for our first question are found in Table 2. Under the rubric of symmetry, the results are expressed in terms of raw percentage point (not percentage) differences between the precision achieved at 80% recall for the portable model P-seeded review versus a random R-seeded review, and averaged across all partitions for each topic. Positive values indicate better portable model performance; negative values the opposite. Nearly universally, P-seeding outperforms random seeding; the p-value under a binomial test is < 0.00001.

Table 2: CAL review relative precision based on portable model seeding versus random seeding

| Topic | 20/80 Partition (∆precision) | 50/50 Partition (∆precision) | 80/20 Partition (∆precision) |
|-------|------------------------------|------------------------------|------------------------------|
| 403 | 74.4 | 85.2 | 92.2 |
| 422 | 9.3 | 28 | 21.8 |
| 424 | 75.1 | 77.8 | 65.2 |
| 426 | 53.8 | 52.2 | 32.2 |
| 420 | 76.7 | 71.4 | 78.4 |
| 407 | 55.8 | 51.7 | 50.4 |
| 414 | 8.8 | 6.2 | 6.9 |
| 410 | 35.6 | 29.3 | 50.1 |
| 401 | 23.3 | 26.1 | 17.2 |
| 406 | 8.7 | 3.8 | 6.7 |
| 433 | 54.7 | 48.6 | 39.2 |
| 415 | -0.8 | 0.3 | 4.7 |
| 430 | 14.2 | 7.1 | 17.9 |
| 417 | 6.2 | 10 | 17.5 |
| 413 | 67.9 | 69.3 | 57.8 |
| 432 | 43.3 | 46.9 | 13.4 |
| 402 | 4.2 | 3.3 | 4 |
| 427 | 60.5 | 52.5 | 34.4 |
| 419 | 1.4 | 11 | 6.8 |
| 404 | 1.2 | -0.4 | 0.3 |
| 408 | 0.3 | 0 | 0.3 |
| 418 | 0.5 | 0.2 | 0 |
| 411 | 0.6 | 0.3 | 0.1 |
| 412 | 14.4 | 17.8 | 5.9 |
| 416 | 1.1 | 2.8 | 0.3 |
| 423 | 12.5 | 5.8 | 4.9 |
| 429 | 76.7 | 80.9 | 70.9 |
| 409 | 26.8 | 15.2 | 3.1 |
| 405 | 66 | 60.4 | 43.1 |
| 428 | 15.1 | 12.3 | 15.6 |
| 431 | 26.8 | 19.2 | 15.4 |
| 434 | 36.4 | 51 | 54 |
| 425 | 66.7 | 59 | 59 |
| 421 | 1.3 | 9.1 | 3.7 |
| Averages | 30.0 | 29.8 | 26.3 |
| p-value | <0.000001 | <0.000001 | <0.000001 |
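For concreteness, the sketch below shows one plausible way to compute the quantities reported in Table 2: precision at the point a simulated review first reaches 80% recall, the raw percentage-point difference between two seeding conditions, and a binomial sign test across topics. The function names and the exact form of the significance test are assumptions for illustration, not the authors' code.

    # Illustrative sketch of precision at 80% recall, delta-precision, and a binomial
    # sign test across topics. Each review is represented by its ordered 0/1 codings.
    from scipy.stats import binomtest   # scipy >= 1.7; older versions expose binom_test

    def precision_at_recall(review_order_labels, total_relevant, recall_target=0.80):
        """Precision over the documents reviewed up to the moment recall_target is reached."""
        needed = recall_target * total_relevant
        found = 0
        for n, rel in enumerate(review_order_labels, start=1):
            found += rel
            if found >= needed:
                return found / n
        return None   # target never reached

    def delta_precision(p_labels, other_labels, total_relevant):
        """Raw percentage-point difference: P-seeded minus the comparison condition
        (assumes both reviews reach the recall target)."""
        p = precision_at_recall(p_labels, total_relevant)
        o = precision_at_recall(other_labels, total_relevant)
        return 100.0 * (p - o)

    def sign_test(deltas):
        """Binomial test on how many topics favor P-seeding (ties ignored)."""
        wins = sum(1 for d in deltas if d > 0)
        trials = sum(1 for d in deltas if d != 0)
        return binomtest(wins, trials, p=0.5).pvalue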
This is a wholly expected result. Closer examination of the simulated review orderings shows that in low richness domains most of the precision loss comes not from the CAL iterations, but from the larger number of documents needed to find enough positive ones to start ranking. Note that random seeding on the target partition also outperforms fully linear review, by an average of 12.2, 10.6, and 5.6 percentage points on the 20/80, 50/50, and 80/20 partitions, respectively. So even random seeding of CAL is better than no CAL at all.

Thus in answer to the question of whether P-seeding produces an efficacious result, the answer is yes. However, the more important question is not whether portable models are useful; it is whether they are useful relative to other reasonable, simpler, or less risky alternatives. For that we turn to the remaining research questions.

6.2 RQ2: Portable vs Human Seed Initial Relevance

The results for our second question are found in Table 3, under the Target Portable and Target Manual columns for each partition group. For example, in the 20/80 partition, where 80% of the collection is used as the target collection and on average across all 5 folds, on topic 403 human effort found 18.4 positively-coded seed documents whereas the portable model M found 41.6 at the same level of effort (i.e. at 60 documents, as per Table 1). On topic 421 under the 20/80 partition, humans found an average 10.4 documents and M found 3.2.

The average number of positive training documents in the source partition, i.e. the data on which M is trained, is also shown. The number of negative training examples is the remainder of the fold. Averages across all 34 topics are shown at the bottom of the table, as is a binomial p-value.

These results show that when 20% of the collection is used to train M (the 20/80 partition), even though that data is literally from the same distribution as the target fold, the various M are able to find seed documents at a rate no better than a small amount of human effort. There is only a difference of 0.2 documents across all topics, and while there is some variation between topics the differences are not statistically significant (p=0.303). As the training partition increases, and 50% then 80% of the collection is used to train each M, so too does the ability of the model to find more seed documents. On the 50/50 partition M finds on average 3.5 more documents than the human at the given effort level, and on the 80/20 partition it finds an average 1.5 more documents. Both results are statistically significant.

When the number of seeds is normalized per fold and topic by the number of total seeds found, i.e. the positive seed precision, the manual effort has an average precision of 66.0% across all folds, whereas M precision is 64.2%, 76.5%, and 78.1%, respectively, across 20/80, 50/50, and 80/20. That is, even though the average number of documents that M finds on the 50/50 partition is larger (3.5) than on the 80/20 partition (1.5), the latter partition is smaller. The actual precision goes up slightly.
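One plausible reading of the normalization just described, written as a small sketch (the data layout and the averaging choice are assumptions for illustration, not taken from the paper):

    # Positive seed precision: per topic and fold, the share of selected seed documents
    # that were coded positive, then averaged. Illustrative assumption about data layout.
    def positive_seed_precision(seed_codings_by_fold):
        """seed_codings_by_fold: list (one entry per fold) of lists of 0/1 seed codings."""
        per_fold = [sum(fold) / len(fold) for fold in seed_codings_by_fold if fold]
        return 100.0 * sum(per_fold) / len(per_fold)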
Thus in answer to the question of whether portable models initially find more relevant documents than does human effort, the answer is mixed. When given 20% of the collection for training, they do not. When given 50% or 80%, they do. However, the improvement is modest: a few percentage points, or a few extra documents.

6.3 RQ3: Portable- vs Human-Seeded Recall

The results for our third and final question are also found in Table 3, under the ∆precision columns. Again, in the interest of symmetric magnitudes, ∆precision is the percentage point difference between P-seeded versus H-seeded CAL. While again there is some variation across topics, on average on the 20/80 partition P-seeding is 0.4 percentage points worse, while on the 50/50 and 80/20 partitions P-seeding is 0.6 and 1.2 percentage points better. However, none of these results are statistically significant (p=0.303).

Furthermore, when we look at the raw document count difference between the two conditions (not shown in the table), another story emerges. In the 80/20 partition, on those topics for which P-seeded CAL is better, it is better on average by 186 total documents. Where H-seeded CAL is better, it is better by 896 documents. On the 50/50 partition, P- versus H-seeding is 271 vs 743 documents better, and on the 20/80 partition it is 166 versus 655 documents. There is no consistent advantage of either approach over the other, but the negative consequences of H-seeding seem to be far smaller than those of P-seeding.

The reason for the topical sort order across all tables should now become clear: all tables in this paper are sorted by the ∆precision of the 80/20 partition. This seems to be the partition for which portable models are the strongest; they have the most training data. And sorting by ∆precision allows us to see where P-seeding vs H-seeding each shine. To that end, we introduce one more metric into the discussion: the WTF "ineffectiveness" metric [11, 12] encapsulates the notion of not only looking at average performance, but at outliers. A system that has good average performance but egregious outliers might best be avoided, especially in eDiscovery, where every case matters and the costs incurred by an outlier are more significant than in, say, ad hoc web search.

From this perspective, we see that where portable models perform the strongest, i.e. on the 80/20 partition where they are given 80% of the available positive documents, there are outliers in both directions. The top three P-advantage outliers show a 60.5, 12.0, and 5.7 percentage point difference. The top three H-advantage outliers show a 25.0, 12.6, and 11.1 percentage point difference. However, in terms of raw document counts these translate to 670, 667, and 468 documents for the P-advantage, and 9249, 2006, and 892 documents for the H-advantage. There appear to be fewer "WTFs" from H-seeding.

7 CONCLUSION

We have shown that a portable model can be useful. Certainly relative to linear review, and even relative to randomly seeded CAL workflows, taking a portable approach offers a significant advantage. Portable models are also marginally better than humans when it comes to finding initial seed documents. When it comes to sustained advantage, i.e. precision at 80% recall, the advantages fade. There is no statistically significant difference in human vs portably seeded CAL workflows, and slight evidence that the outliers for the portable approach are worse.

We note also that porting models carries with it significant risk in the form of intellectual property rights, data leakage via membership inference attacks, privacy, and security. It is every party's own subjective decision as to whether the advantages of portable models outweigh the challenges and risk. However, from the results in this study we would recommend continuing to invest in human-driven seeding (IA, or intelligence augmentation) processes and not going all in on AI. At least relative to the topics studied in this paper, the modicum of effort required of the human is a fair trade relative to risk. Even when portable models are built on a corporation's own data, and models are not swapped between different owners and therefore risk is lower, we do not yet find that the portable model provides a sustained advantage.

8 FUTURE WORK

Certainly this is but one study, and more studies with a wider range of models, data collections, and human effort are needed. Perhaps the humans could have done even better if given more time, were working on a domain in which they had specific expertise, or were given more powerful analytics with which to find seed documents. Conversely, portable models were given all possible advantages in this experimental structure by building them on documents drawn from the exact same distribution as the target collection, in ever increasing amounts (20%, 50%, and 80%). It is not likely that portable models will ever be trained on prior data as perfectly similar to the target distribution. Therefore, portable models would likely have performed much worse in realistic scenarios where the source and target collections are further apart. E.g. when modeling fraud or sexual harassment, does what constitutes evidence of fraud or harassment in one collection express itself the same way in another collection? Future research is needed in three different areas: (a) more and larger collections from similar but not identical distributions on which to train models, or perhaps the "right" small collections on which to train models, as per [2]; (b) more advanced models and better transfer learning; and (c) better and stronger baselines against which to compare.

Better and stronger baselines are not limited to more effective human effort. They also include other existing, common practices. For example, many companies dealing with sensitive information keep lexicons of search terms used to find sensitive information. While a lexicon could in some sense be thought of as an "unweighted" portable model, one difference is that it is manually constructed, transparent, and can embed human intuition and patterns never seen in prior data, i.e. lexicons do not need to be trained. Another common approach for corporations with repeat litigation is the idea of a "drop in seed". That is, instead of building large models based on huge datasets from all possible prior matters, some in the industry have developed the ad hoc practice of taking a few coded documents from previous matters that are known to be similar to the current matter, and using those as the initial seeds on the target collection. This of course only works behind the firewall, as companies will not transfer documents to other companies. But given the security, privacy, and related membership inference attack risks of portable models, companies might not want to transfer their own models to a competitor, either. So in addition to comparing portable models against human seeding, they should be compared against lexicons and drop-in seeds. Perhaps these latter approaches outperform both portable models and human-seeded approaches when considering the total cost of a review and not just the document count.

The cost of the portable model (vendor charge) versus the human approach (e.g. half an hour of searcher time) needs to be considered as well, and not just the cost of the subsequent document review. In short, this is a rich space for the exploration of tradeoffs, risks, and advantages for various human-driven vs machine-driven eDiscovery processes.
Table 3: Across each of the various partitions: (1) the number of positive training documents in the portable model "source", (2) the number of positive seed documents found by the portable model P and the manual H approaches, and (3) the relative precision (∆prec) of P-seeding over H-seeding at 80% recall, i.e. the precision of the former minus the precision of the latter. Positive numbers indicate that P-seeding was more effective, negative numbers that H-seeding was more effective.

| Topic | 20/80 Source | 20/80 Portable (P) | 20/80 Manual (H) | 20/80 ∆prec | 50/50 Source | 50/50 Portable (P) | 50/50 Manual (H) | 50/50 ∆prec | 80/20 Source | 80/20 Portable (P) | 80/20 Manual (H) | 80/20 ∆prec |
|-------|--------------|--------------------|------------------|-------------|--------------|--------------------|------------------|-------------|--------------|--------------------|------------------|-------------|
| 403 | 218.0 | 41.6 | 18.4 | 27.8 | 545.0 | 26.0 | 11.5 | 42.8 | 872.0 | 10.4 | 4.6 | 60.5 |
| 422 | 6.2 | 6.6 | 12.8 | -8.2 | 15.5 | 11.0 | 8.0 | 14.6 | 24.8 | 3.8 | 3.2 | 12.0 |
| 424 | 99.0 | 40.2 | 43.2 | 0.0 | 247.5 | 28.5 | 27.0 | 2.4 | 396.0 | 11.4 | 10.8 | 5.7 |
| 426 | 24.0 | 23.2 | 36.0 | 22.6 | 60.0 | 21.5 | 22.5 | 7.2 | 96.0 | 8.8 | 9.0 | 4.6 |
| 420 | 147.4 | 50.8 | 49.6 | 2.6 | 368.5 | 33.0 | 31.0 | 0.6 | 589.6 | 13.2 | 12.4 | 4.4 |
| 407 | 316.2 | 31.4 | 30.4 | 1.4 | 790.5 | 19.5 | 19.0 | 2.5 | 1264.8 | 7.8 | 7.6 | 4.2 |
| 414 | 167.6 | 36.2 | 6.4 | 2.9 | 419.0 | 24.0 | 4.0 | 0.2 | 670.4 | 9.4 | 1.6 | 3.2 |
| 410 | 269.0 | 71.6 | 68.0 | -1.6 | 672.5 | 45.0 | 42.5 | -2.8 | 1076.0 | 18.0 | 17.0 | 2.9 |
| 401 | 45.8 | 33.6 | 36.0 | 3.6 | 114.5 | 27.0 | 22.5 | 2.1 | 183.2 | 11.0 | 9.0 | 2.8 |
| 406 | 25.2 | 11.8 | 31.2 | 1.3 | 63.0 | 21.5 | 19.5 | -1.7 | 100.8 | 9.4 | 7.8 | 2.0 |
| 433 | 22.4 | 28.4 | 30.4 | -5.6 | 56.0 | 25.0 | 19.0 | -2.2 | 89.6 | 10.8 | 7.6 | 1.5 |
| 415 | 2408.4 | 33.6 | 46.4 | 0.1 | 6021.0 | 21.5 | 29.0 | -2.9 | 9633.6 | 8.0 | 11.6 | 1.3 |
| 430 | 198.0 | 62.6 | 48.0 | -1.0 | 495.0 | 40.5 | 30.0 | 0.4 | 792.0 | 16.8 | 12.0 | 1.3 |
| 417 | 1186.2 | 87.2 | 81.6 | 0.5 | 2965.5 | 54.5 | 51.0 | 0.8 | 4744.8 | 21.8 | 20.4 | 0.9 |
| 413 | 109.2 | 51.2 | 52.8 | -0.3 | 273.0 | 34.0 | 33.0 | 0.4 | 436.8 | 13.8 | 13.2 | 0.6 |
| 432 | 28.0 | 10.2 | 31.2 | -12.1 | 70.0 | 20.5 | 19.5 | 6.1 | 112.0 | 8.6 | 7.8 | 0.4 |
| 402 | 127.0 | 49.4 | 43.2 | 0.1 | 317.5 | 34.0 | 27.0 | 0.6 | 508.0 | 14.0 | 10.8 | 0.3 |
| 427 | 48.2 | 30.8 | 35.2 | 2.1 | 120.5 | 23.5 | 22.0 | 5.2 | 192.8 | 10.2 | 8.8 | 0.2 |
| 419 | 396.6 | 28.4 | 40.0 | -0.2 | 991.5 | 18.0 | 25.0 | -0.6 | 1586.4 | 7.2 | 10.0 | 0.1 |
| 404 | 108.8 | 30.6 | 26.4 | 0.1 | 272.0 | 19.5 | 16.5 | -0.1 | 435.2 | 8.2 | 6.6 | 0.0 |
| 408 | 22.8 | 20.8 | 16.0 | 0.1 | 57.0 | 20.5 | 10.0 | 0.0 | 91.2 | 8.6 | 4.0 | 0.0 |
| 418 | 37.4 | 13.4 | 10.4 | 0.1 | 93.5 | 16.0 | 6.5 | 0.0 | 149.6 | 7.6 | 2.6 | 0.0 |
| 411 | 17.8 | 9.4 | 9.6 | -0.1 | 44.5 | 11.5 | 6.0 | 0.1 | 71.2 | 4.4 | 2.4 | -0.1 |
| 412 | 280.0 | 55.8 | 47.2 | 1.5 | 700.0 | 37.5 | 29.5 | 0.7 | 1120.0 | 14.4 | 11.8 | -0.1 |
| 416 | 279.0 | 36.4 | 29.6 | 0.6 | 697.5 | 20.5 | 18.5 | 0.1 | 1116.0 | 8.0 | 7.4 | -0.2 |
| 423 | 57.2 | 16.8 | 16.8 | 0.2 | 143.0 | 13.0 | 10.5 | 1.0 | 228.8 | 5.4 | 4.2 | -0.4 |
| 429 | 165.4 | 61.4 | 63.2 | -1.5 | 413.5 | 39.0 | 39.5 | -0.4 | 661.6 | 15.6 | 15.8 | -1.2 |
| 409 | 39.8 | 25.2 | 28.0 | 11.7 | 99.5 | 23.5 | 17.5 | -1.8 | 159.2 | 10.2 | 7.0 | -1.9 |
| 405 | 23.8 | 31.0 | 32.8 | -0.4 | 59.5 | 22.0 | 20.5 | 1.1 | 95.2 | 9.2 | 8.2 | -2.4 |
| 428 | 92.4 | 55.6 | 48.0 | 2.7 | 231.0 | 34.5 | 30.0 | -0.9 | 369.6 | 14.0 | 12.0 | -5.4 |
| 431 | 28.8 | 24.8 | 30.4 | -3.6 | 72.0 | 21.0 | 19.0 | -11.6 | 115.2 | 7.6 | 7.6 | -8.3 |
| 434 | 7.6 | 5.6 | 24.0 | -29.8 | 19.0 | 9.0 | 15.0 | -13.6 | 30.4 | 5.4 | 6.0 | -11.1 |
| 425 | 142.8 | 51.2 | 44.0 | -7.8 | 357.0 | 32.0 | 27.5 | -10.8 | 571.2 | 12.8 | 11.0 | -12.6 |
| 421 | 4.2 | 3.2 | 10.4 | -23.4 | 10.5 | 5.0 | 6.5 | -18.6 | 16.8 | 1.8 | 2.6 | -25.0 |
| AVG | 210.3 | 34.4 | 34.6 | -0.4 | 525.8 | 25.1 | 21.6 | 0.6 | 841.2 | 10.2 | 8.7 | 1.2 |
| p-value | | | 0.303 | 0.303 | | | 0.0002 | 0.303 | | | 0.00004 | 0.303 |

REFERENCES

[1] Brittany Bacon, Tyler Maddry, and Anna Pateraki. 2020. Training a Machine Learning Model Using Customer Proprietary Data: Navigating Key IP and Data Protection Considerations. Pratt's Privacy and Cybersecurity Law Report 6, 8 (Oct. 2020), 233–244.
[2] Ricardo Baeza-Yates. 2013. Big Data or Right Data?. In Proceedings of the 7th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW 2013), Loreto Bravo and Maurizio Lenzerini (Eds.), Vol. 1087 (CEUR Workshop Proceedings). Puebla/Cholula, Mexico. http://ceur-ws.org/Vol-1087/
[3] Nicholas Carlini. 2020. Privacy Considerations in Large Language Models. Retrieved April 29, 2021 from https://ai.googleblog.com/2020/12/privacy-considerations-in-large.html
[4] Nicholas Carlini, Florian Tramer, Eric Wallace, Mathew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting Training Data from Large Language Models. Article arXiv:2012.07805.
[5] G. V. Cormack and M. R. Grossman. 2014. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. Gold Coast, Australia, 153–162.
[6] Ben Dixon. 2021. Machine Learning: What are Membership Inference Attacks? Retrieved April 29, 2021 from https://bdtechtalks.com/2021/04/23/machine-learning-membership-inference-attacks/
[7] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. 2016. TREC 2016 Total Recall Track Overview. In NIST Special Publication 500-321: The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), Ellen M. Voorhees and Angela Ellis (Eds.). Gaithersburg, Maryland. https://trec.nist.gov/pubs/trec25/trec2016.html
[8] Nicholas M. Pace and Laura Zakaras. 2012. Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. Rand Corporation, Santa Monica, CA, USA.
[9] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium. San Diego, California.
[10] Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2016. Stealing Machine Learning Models via Prediction APIs. In Proceedings of the 25th USENIX Security Symposium. Austin, Texas, 601–618.
[11] Daniel Tunkelang. 2012. WTF!@k: Measuring Ineffectiveness. Retrieved April 29, 2021 from https://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/
[12] Ellen Voorhees. 2004. Measuring Ineffectiveness. In Proceedings of the 27th Annual ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, UK, 562–563. https://doi.org/10.1145/1008992.1009121