=Paper=
{{Paper
|id=Vol-2888/paper10
|storemode=property
|title=On the Effectiveness of Portable Models versus Human Expertise under Continuous Active Learning
|pdfUrl=https://ceur-ws.org/Vol-2888/paper10.pdf
|volume=Vol-2888
|authors=Jeremy Pickens,Thomas C. Gricks III, Esq.
|dblpUrl=https://dblp.org/rec/conf/icail/PickensG21
}}
==On the Effectiveness of Portable Models versus Human Expertise under Continuous Active Learning==
Jeremy Pickens, OpenText, Denver, USA, jpickens@opentext.com
Thomas C. Gricks III, Esq., OpenText, Denver, USA, tgricks@opentext.com

Proceedings of the Second International Workshop of AI and Intelligent Assistance for Legal Professional in the Digital Workplace (LegalAIIA 2021), held in conjunction with ICAIL 2021. June 21, 2021. São Paulo, Brazil. Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published at http://ceur-ws.org

KEYWORDS
continuous active learning, human augmentation, human expertise, transfer learning, portable models

1 INTRODUCTION

eDiscovery is the process of identifying, preserving, collecting, reviewing, and producing to requesting parties electronically stored information that is potentially relevant to a civil litigation or regulatory inquiry. Of these activities, the review component is by far the most expensive and time consuming [8]. Modern, effective approaches to document review run the gamut from pure human-driven processes such as boolean keyword search followed by linear review, to predominantly AI-driven approaches using various forms of machine learning. A review process that involves a significant, though not exclusive, supervised machine learning component is typically referred to as technology assisted review (TAR).

One of the most efficient approaches to TAR in recent years involves a combined human-machine (IA, or intelligence amplification) approach known as Continuous Active Learning (CAL) [5]. As with any TAR review, a CAL review will benefit in some measure by overcoming the cold start problem: the machine typically cannot begin making predictions until it has been fed some number of training documents, aka seeds. In an early CAL approach, initial sets of training documents were selected via human effort, e.g., manual keyword searching. This approach to selecting seed documents relies on human knowledge and intuition.

Recently in the legal technology sector, another seeding approach has been gaining momentum, one that does not rely on human assessment of the review collection but is based on artificial intelligence (AI) methods and derived from documents outside the collection. For this technique, which is often referred to as "portable models", and known in the wider machine learning community as transfer learning, initial seed documents are selected not via human input, but by predictions from a machine learning model trained using documents from prior matters or related datasets. Portable models take a pure AI approach and eschew human knowledge in the cold start seeding process.

Notwithstanding the benefits asserted by the proponents of portable models as a seed-generation technique, we are aware of no formal or even informal studies addressing the overall impact of portable model seeding on the efficiency of a TAR review relative to human-driven seeding. It is an open question whether technology assisted review seeded by portable models offers a clear, sustained advantage over approaches that begin with human input. Therefore, this work constitutes an initial study into the relationship between human vs machine seeding and overall review efficiency.

2 MOTIVATION

Separate and apart from the inherent value of an assessment of the impact of portable models on TAR, there are two principles attendant to the creation of portable models that serve as a further motivation for this study: (1) the increased regulatory pressure to maintain personal privacy; and (2) the growing need for stringent cyber security measures. Consideration of both principles is generally recognized as an essential step in the development and utility of modern AI applications, given their breadth and proliferation.

Recent years have seen increased scrutiny from EU and United States regulatory agencies. Data collection and reuse is under heavy examination as regulators seek to minimize data collection and maximize privacy and security. Portable models are a form of data reuse; the models would not exist were it not for the original data. As such, there are rights and obligations around the use of the data that goes into training portable models, and a strong need for clearer assessments of risk when porting models. As Bacon et al noted [1]:

"The use of machine learning ("ML") models to process proprietary data is becoming increasingly common as companies recognize the potential benefits that ML can provide. Many IT vendors offer ML services that can generate valuable insights derived from their customer's proprietary data and know-how. For companies that have not yet established their own ML expertise in-house, these services can offer significant business advantages. However, there may be cases where one party owns the ML model, another party has the business expertise, and a third party owns the data. In such cases, significant intellectual property ("IP") and data protection and security risks may arise. Naturally, most companies that invest in building an ML model are looking for a return on their investment. From a financial perspective, such companies focus on using the IP laws and related IP contract terms, such as IP assignments and license grants, to maximize their control over the ML model and associated input and results. Data protection laws can run counter to these objectives by imposing an array of requirements and restrictions on the processing of various types of data, particularly to the extent they include personal information. The interplay between these competing considerations can lead to interesting results, especially when a number of different parties have a stake in the outcome."

The second, perhaps more important challenge with respect to portable models is the possibility of data leakage. In recent years, computer security and machine learning researchers have increased the sophistication of membership inference attacks [9]. These attacks are a way of probing black box, non-transparent models to "discover or reconstruct the examples used to train the machine learning model" [6]. The basic process is that:

"An attacker creates random records for a target machine learning model served on a [portable model] service. The attacker feeds each record into the model. Based on the confidence score the model returns, the attacker tunes the record's features and reruns it by the model. The process continues until the model reaches a very high confidence score. At this point, the record is identical or very similar to one of the examples used to train the model. After gathering enough high confidence records, the attacker uses the dataset to train a set of "shadow models" to predict whether a data record was part of the target model's training data. This creates an ensemble of models that can train a membership inference attack model. The final model can then predict whether a data record was included in the training dataset of the target machine learning model. The researchers found that this attack was successful on many different machine learning services and architectures." [6]
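To make the quoted procedure more concrete, the following is a minimal, illustrative sketch of a shadow-model membership inference attack against a black-box classifier. It runs on synthetic scikit-learn data, omits the confidence-guided record-synthesis step described above, and every modeling choice and name in it is an assumption for illustration rather than the exact method of [6] or [9].

    # Minimal sketch (assumption-laden, for illustration only) of a shadow-model
    # membership inference attack on a black-box classifier, using scikit-learn.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=6000, n_features=20, random_state=0)

    # The "target" model: the attacker can only query it for confidence scores.
    X_in, y_in = X[:1000], y[:1000]    # its private training data ("members")
    X_out = X[1000:2000]               # data it never saw ("non-members")
    target = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_in, y_in)

    # Shadow models are trained on data the attacker controls, so membership labels
    # are known; their confidence vectors become training data for the attack model.
    attack_X, attack_y = [], []
    for s in range(4):
        lo = 2000 + s * 1000
        Xs, ys = X[lo:lo + 1000], y[lo:lo + 1000]
        shadow = RandomForestClassifier(n_estimators=100, random_state=s).fit(Xs[:500], ys[:500])
        attack_X.append(shadow.predict_proba(Xs[:500]))   # confidence vectors for members
        attack_y.append(np.ones(500))
        attack_X.append(shadow.predict_proba(Xs[500:]))   # confidence vectors for non-members
        attack_y.append(np.zeros(500))
    attack_model = LogisticRegression().fit(np.vstack(attack_X), np.concatenate(attack_y))

    # Apply the attack model to the target's confidence scores for known members/non-members.
    acc = 0.5 * (attack_model.predict(target.predict_proba(X_in)).mean()
                 + 1.0 - attack_model.predict(target.predict_proba(X_out)).mean())
    print("membership inference accuracy:", acc)

An attack accuracy meaningfully above 50% on a probe like this is the signal that membership in the training data is leaking through the model's confidence scores.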
Carlini et al [3, 4] further elaborate on the potential for portable models to reveal private or sensitive information:

"One such risk is the potential for models to leak details from the data on which they're trained. While this may be a concern for all large language models, additional issues may arise if a model trained on private data were to be made publicly available. Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII): names, phone numbers, addresses, etc., even if trained on public data. This raises the possibility that a model trained using such data could reflect some of these private details in its output."

Tramer et al [10] note that entire models may even be stolen via such techniques, even when the adversary only has black box access (observations of outputs only, rather than internal workings) to the model: "The tension between model confidentiality and public access motivates our investigation of model extraction attacks. In such attacks, an adversary with black-box access, but no prior knowledge of an ML model's parameters or training data, aims to duplicate the functionality of (i.e., "steal") the model... We show simple, efficient attacks that extract target ML models with near-perfect fidelity for popular model classes including logistic regression, neural networks, and decision trees."

Given the potential dangers associated with modern AI applications such as portable models, we therefore ask: Do portable models provide a cognizable, sustained advantage over human augmented IA processes sufficient to warrant their use in the face of privacy and cybersecurity concerns? If not, perhaps the safer and more appropriate approach is to continue using traditional human-driven techniques.

3 RELATED WORK

The key foundation in our investigation is the observation that the current state-of-the-art document review TAR process is based on continuous active learning (CAL) [5]. Given seed documents, the basic CAL process induces a supervised machine learning model which then predicts the most likely responsive, unreviewed documents. After some (relatively small) number of those top-ranked predictions are reviewed and coded, another model is induced and the next most likely documents are queued for review. The process continues until a high recall target is hit.
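As a point of reference, the following is a minimal sketch of the CAL loop just described. The feature representation, classifier, batch size default, and helper names are illustrative assumptions, not the implementation of [5] used in this study (which, as noted in Section 5.2, updates every 30 documents and runs until 80% recall is reached).

    # Minimal sketch of a continuous active learning (CAL) review loop; illustrative
    # assumptions throughout (TF-IDF features, logistic regression, simulated coding).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def cal_review(docs, labels, seed_ids, batch_size=30, recall_target=0.80):
        """Return the review order; `labels` stands in for the reviewer's coding decisions."""
        X = TfidfVectorizer().fit_transform(docs)
        total_rel = int(np.sum(labels))
        reviewed, seen = list(seed_ids), set(seed_ids)
        while True:
            found = sum(labels[i] for i in reviewed)
            if found >= recall_target * total_rel:
                return reviewed                        # recall target hit: stop the review
            coded = np.array([labels[i] for i in reviewed])
            if len(set(coded)) < 2:
                # cold start: no positive (or no negative) example coded yet (illustrative fallback)
                batch = [i for i in range(len(docs)) if i not in seen][:batch_size]
            else:
                model = LogisticRegression(max_iter=1000).fit(X[reviewed], coded)
                scores = model.predict_proba(X)[:, 1]  # re-rank every document
                order = np.argsort(-scores)
                batch = [int(i) for i in order if i not in seen][:batch_size]
            reviewed.extend(batch)                     # "review" (code) the next batch
            seen.update(batch)

In a loop like this, the seeding differences studied in this paper (human, random, or portable-model seeds) enter only through the initial seed set; everything after the cold start is the same iterative train-rank-review cycle.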
Review workflows that are based on CAL have what might be called a "just in time" approach to prediction. Rather than attempting to induce a perfect model up front, CAL workflows dynamically adjust as the review continues. Often this means that early disadvantages, and even early advantages, wash out in the process. For example, [citation anonymized for review] found that four searchers, each working independently to find seed documents, found different seeds and different numbers of seeds. But after separately using each seed set to initialize a CAL review, approximately the same number of documents needed to be reviewed to achieve high recall. This study asks similar questions in the context of portable models: whether there is a significant improvement in review efficiency when using portable models relative to traditional, non-AI techniques.

Another common portable model theme is the claim that the more historical data they are trained on, the better their predictions will be. While that may be true in some instances, it may not be in others. What constitutes privileged documents in one matter might have a different set of characteristics than privileged documents in other matters. What constitutes evidence of fraud or sexual harassment in one matter might be different than in other matters. No amount of "big data" gathered from dozens (hundreds? thousands?) of prior matters and composed into a monolithic portable model may be relevant to the current problem if the patterns in the current problem don't match the historical ones. Therefore a question that every eDiscovery practitioner should be asking herself is where the best source of evidence for seeding the current task lies. As [2] notes: "The real goal should not be big data but to ask ourselves, for a given problem, what is the right data and how much of it is needed. For some problems this would imply big data, but for the majority of the problems much less data is necessary."

4 RESEARCH QUESTIONS

We engage three primary research questions. The first question level-sets the value of the pure AI (portable model) approach. The second two questions compare the portable model approach to a human-initiated process.

• RQ1: Does a CAL review seeded by a portable model outperform (at high recall) one seeded randomly?
• RQ2: Do portable models initially find more relevant documents than does human effort?
• RQ3: Does a CAL review seeded by a portable model outperform (at high recall) one seeded by human effort?

When attempting to consider these questions in general, issues naturally arise: What portable models are we talking about? Trained on what data? And how close was that data to the target distribution? And what humans seeded the comparison approach? And what was their prior knowledge of the subject matter?

These questions matter, and while we cannot answer them for every possible training set and human searcher, we have structured the experiments in such a way as to give the most possible "benefit of the doubt" to the portable model, and the least possible benefit to the human searcher. Thus if there are significant advantages of portable models over human effort, these should be most readily apparent when portable models are given the most affordances and humans the least.

The primary manner in which portable models are given an advantage is that we train them on a set of documents that is drawn from the exact same distribution as the target collection to which they will be applied. In practice, portable models are never given this advantage. Prior cases in eDiscovery are not always exactly the same. Different collections, even from the same corporate entity, exhibit different distributions, especially as employees and business activities change and evolve over time. Naturally, the more different the source distribution, the less effective portable models will be when applied to a new target collection. However, holding the distribution the same gives us an upper bound on portable model effectiveness and establishes a strong baseline against which the human effort can be compared.

At the same time, the human effort is minimized. As will be described in more detail below, a small team of human searchers worked for a collective total of approximately half an hour per topic. None of the humans were experts in any of the topics, nor did anyone have recent prior knowledge of the topics, as the events in this Jeb Bush TREC collection [7] took place a decade or more prior to when the searchers worked, and most of the issues were local to Florida and did not make national news. In practice, humans are rarely given this disadvantage. They often work for more than thirty minutes on a problem and can have broad domain expertise that comes from having worked on similar cases in the past.

Thus, our experiments consist of a comparison between portable models trained in the best possible light vs human effort that is kept at a minimum. We do this because the core concept of portable models is that they will be sufficiently broad in scope so as to be able to identify relevant documents in a collection that contains documents of a similar content and context to those on which they were trained. ("Relevance" here refers to the notion of "what is desired", be it some sort of topical similarity such as age discrimination or fraud cases, or something like privilege.) That distributional similarity is not always guaranteed, and in fact it can be difficult a priori to know whether a portable model has been trained on data similar enough to be useful. By using documents intentionally drawn from the exact same distribution, we are able to show an upper bound on portable model effectiveness. In practice, portable model effectiveness is likely to be lower, though how much lower remains to be studied.

5 EXPERIMENTS

5.1 Data

We test these research questions using the TREC 2016 total recall track document collection, topics, and relevance judgments [7]. This dataset contains 34 topics, each with a varying number of relevant documents. The richness of the majority of topics is under 1%, i.e. relatively low richness topics, where portable models are alleged to be most effective. Table 1 contains statistics on each topic. The first column is the topic ID, from 401 to 434, sorted in a manner that will be described in Section 6.3. The next two columns contain the number of total relevant documents and the richness for each topic. There are 290,099 total documents in the collection.

Human effort, aka manual seeding, was done with a small team of four searchers. For each topic, two of the searchers were instructed to run a single query and code the first 25 documents that resulted from that query. The other two searchers were given more interactive leeway and were instructed to utilize as many searches and whatever other analytic tools (clustering, timeline views, etc.) as they wanted, with a goal of working for about 15-30 minutes and stopping once they had tagged 25 documents. This was not strictly controlled, and some reviewers worked a few minutes longer, some a few minutes shorter. And some marked a few more than 25 documents, and some a few less, as is to be expected in the normal, "in the moment" flow of knowledge work. Table 1 contains the manual effort statistics (the total number of queries, total number of minutes, and total unique documents tagged as either relevant or non-relevant) for each topic. On average, the human reviewers worked for 31.4 minutes, issued 9.9 queries, and coded 64.3 documents, so the overall effort was done at a fairly high pace and was relatively minimal in comparison to the size of the collection.

Table 1: Collection and Manual Effort Statistics. (Total Rel and Richness are collection statistics; Queries, Minutes, and Total Docs are manual seeding statistics.)

| Topic | Total Rel | Richness | Queries | Minutes | Total Docs |
|-------|-----------|----------|---------|---------|------------|
| 403 | 1090 | 0.38% | 9 | 52 | 60 |
| 422 | 31 | 0.01% | 11 | 26 | 63 |
| 424 | 497 | 0.17% | 13 | 48 | 81 |
| 426 | 120 | 0.04% | 7 | 26 | 32 |
| 420 | 737 | 0.25% | 7 | 11 | 54 |
| 407 | 1586 | 0.55% | 6 | 23 | 59 |
| 414 | 839 | 0.29% | 12 | 39 | 108 |
| 410 | 1346 | 0.46% | 10 | 24 | 69 |
| 401 | 229 | 0.08% | 9 | 29 | 56 |
| 406 | 127 | 0.04% | 9 | 25 | 42 |
| 433 | 112 | 0.04% | 12 | 43 | 65 |
| 415 | 12106 | 4.17% | 11 | 43 | 77 |
| 430 | 991 | 0.34% | 9 | 41 | 72 |
| 417 | 5931 | 2.04% | 10 | 25 | 85 |
| 413 | 546 | 0.19% | 9 | 23 | 88 |
| 432 | 140 | 0.05% | 13 | 29 | 45 |
| 402 | 638 | 0.22% | 6 | 8 | 75 |
| 427 | 241 | 0.08% | 9 | 25 | 82 |
| 419 | 1989 | 0.69% | 7 | 27 | 50 |
| 404 | 545 | 0.19% | 10 | 23 | 54 |
| 408 | 116 | 0.04% | 9 | 23 | 91 |
| 418 | 187 | 0.06% | 8 | 27 | 53 |
| 411 | 89 | 0.03% | 9 | 29 | 64 |
| 412 | 1410 | 0.49% | 8 | 29 | 64 |
| 416 | 1446 | 0.50% | 8 | 34 | 51 |
| 423 | 286 | 0.10% | 8 | 22 | 41 |
| 429 | 827 | 0.29% | 12 | 35 | 62 |
| 409 | 202 | 0.07% | 15 | 44 | 67 |
| 405 | 122 | 0.04% | 9 | 36 | 66 |
| 428 | 464 | 0.16% | 13 | 36 | 67 |
| 431 | 144 | 0.05% | 10 | 28 | 57 |
| 434 | 38 | 0.01% | 14 | 43 | 50 |
| 425 | 714 | 0.25% | 12 | 47 | 76 |
| 421 | 21 | 0.01% | 14 | 45 | 60 |
| Averages | | | 9.9 | 31.4 | 64.3 |

5.2 Experiment Structure

In order to compare portable models against both random and human-seeded techniques in RQ1 through RQ3, there needs to be a collection on which the portable model can be trained, separate from the collection on which it and the comparative approaches are deployed. We will refer to these two collections as "source" and "target", respectively. For the reasons enumerated in Section 4, we carve out the portable model training source collection from the same distribution as the target collection, and do so by selecting documents at random. For a given topic, we proceed as follows (a code sketch of these steps appears after the list):

(1) Shuffle the collection randomly
(2) Split the collection into k groups
(3) For each group:
    (a) Use that group as the portable model training "source" collection S
    (b) Train a model M using every document in S
    (c) Use the remaining groups as the "target" collection T
    (d) Select manual (human) seeds H by intersecting all found docs (see Table 1) with T
    (e) Select random seeds R from T until five positive examples are found
    (f) Select portable seeds P from the top of the M-induced ranking on T, in an amount equal to |H|
    (g) Use the appropriate seeds to run each experiment RQ1 through RQ3
(4) Average results across all k groups for the topic, but do not average across topics
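Steps (1) through (3f) of the per-topic procedure can be summarized in the sketch below. The data structures, the caller-supplied `train_model` and `rank` helpers, and the dictionary layout are assumptions made for illustration, not the authors' code; step (3g) then runs the appropriate comparison for each research question (for example, R vs. P for RQ1 and H vs. P for RQ3), e.g. with a CAL loop like the one sketched in Section 3.

    # Illustrative sketch of steps (1)-(3f); not the authors' code. `labels` is assumed
    # to map document id -> 0/1 relevance; `train_model` and `rank` are caller-supplied
    # stand-ins for inducing the portable model M and ranking target documents with it.
    import random

    def seeds_for_topic(doc_ids, labels, human_found, k, train_model, rank, seed=0):
        rng = random.Random(seed)
        ids = list(doc_ids)
        rng.shuffle(ids)                                    # (1) shuffle the collection
        folds = [ids[i::k] for i in range(k)]               # (2) split into k groups
        per_fold = []
        for fold in folds:
            S = set(fold)                                   # (3a) this group is the "source"
            M = train_model(S, labels)                      # (3b) portable model from all of S
            T = [d for d in ids if d not in S]              # (3c) remaining groups = "target"
            # (In the 80/20 case the paper reverses (3a) and (3c): the current group is the target.)
            T_set = set(T)
            H = [d for d in human_found if d in T_set]      # (3d) manual seeds restricted to T
            R = random_seeds(T, labels, rng)                # (3e) random seeds, five positives
            P = rank(M, T)[:len(H)]                         # (3f) portable seeds, |P| = |H|
            per_fold.append({"T": T, "H": H, "R": R, "P": P})
        return per_fold                                     # (4) caller averages across folds

    def random_seeds(T, labels, rng, positives_needed=5):
        """Draw documents from the target at random until five positives are found."""
        pool = list(T)
        rng.shuffle(pool)
        picked, positives = [], 0
        for d in pool:
            picked.append(d)
            positives += labels[d]
            if positives >= positives_needed:
                break
        return picked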
The specifics of step (3g) depend on the research question being tested. For example, for RQ1, R and P are each (separately) used to seed a continuous active learning (CAL) process. For RQ3, H and P are used. Other than the different seedings, these CAL processes are run exactly as in [5], except that updates are done every 30 documents rather than every 1000 documents. And unlike some approaches, the learning is not relevance feedback for a limited number of steps. It is truly continuous in that it does not stop until the desired recall level is achieved, which in these experiments is set to 80%.

While selecting a source collection for portable model training from the same distribution as the target collection already offers great advantage to the predictive capabilities of a portable model, i.e. puts it above where it would likely perform in more realistic scenarios, we extend this advantage even further by giving the model larger and larger source collections on which to train. We compare three primary source/target partitions: 20/80, 50/50, and 80/20, with k=5, k=2, and k=5, respectively. (In the 80/20 case, steps (3a) and (3c) are reversed, with the current group used as the target collection and the other groups used as the source collection.) The reason for the 20/80 partition is that the eDiscovery problem is a recall-oriented task. The larger the review population, aka the target collection, the more realistic the CAL process is likely to be. However, the disadvantage is that only 20% of the TREC collection can be used for training the portable model. The 80/20 partition reverses the balance: 80% of the collection is used to train the portable model, but only 20% of the collection is available to simulate the CAL review, which can be problematic for especially sparse topics. The 50/50 partition splits the difference.

Comment: An astute observer may find slight fault with the structure of this experimental setup, in that there is a small amount of knowledge overlap between the source and target partitions when doing human seeding. Specifically, the human searchers originally searched across the entire collection rather than across split collections. It is possible that a document found by a human searcher that ended up in a source partition led the human to issue a query that found more or better documents that ended up in the target partition. Thus even though the human-found documents in only the target partition are used to seed a CAL process (Step 3d, above), the existence of some of those seeds could have been influenced by knowledge of documents in the source partition. We note this issue and make it explicit, but do not think that it affects the overall conclusions of the experiment. One reason is that even if humans had some knowledge of documents in the source partition when finding the documents in the target partition, the portable model M is given knowledge of every document, positive and negative, in the source partition. Table 3 shows the raw number of positive documents used for training in the source partition, and it swamps the documents that the humans would have looked at in their short search sessions. Thus, one can think of any overlap during human seed selection as the background knowledge that a human would likely already be expected to possess when working in a real scenario. E.g. an investigator working on detecting fraud or sexual harassment likely has some implicit background knowledge of fraud or sexual harassment.

6 RESULTS

6.1 RQ1: Portable- vs Random-Seeded Recall

The results for our first question are found in Table 2. Under the rubric of symmetry, the results are expressed in terms of raw percentage point (not percentage) differences between the precision achieved at 80% recall for the portable model P-seeded review versus a random R-seeded review, and averaged across all partitions for each topic. Positive values indicate better portable model performance; negative values the opposite. Nearly universally, P-seeding outperforms random seeding; the p-value under a binomial test is < 0.00001.

Table 2: CAL review relative precision based on portable model seeding versus random seeding

| Topic | 20/80 Partition (∆precision) | 50/50 Partition (∆precision) | 80/20 Partition (∆precision) |
|-------|------------------------------|------------------------------|------------------------------|
| 403 | 74.4 | 85.2 | 92.2 |
| 422 | 9.3 | 28 | 21.8 |
| 424 | 75.1 | 77.8 | 65.2 |
| 426 | 53.8 | 52.2 | 32.2 |
| 420 | 76.7 | 71.4 | 78.4 |
| 407 | 55.8 | 51.7 | 50.4 |
| 414 | 8.8 | 6.2 | 6.9 |
| 410 | 35.6 | 29.3 | 50.1 |
| 401 | 23.3 | 26.1 | 17.2 |
| 406 | 8.7 | 3.8 | 6.7 |
| 433 | 54.7 | 48.6 | 39.2 |
| 415 | -0.8 | 0.3 | 4.7 |
| 430 | 14.2 | 7.1 | 17.9 |
| 417 | 6.2 | 10 | 17.5 |
| 413 | 67.9 | 69.3 | 57.8 |
| 432 | 43.3 | 46.9 | 13.4 |
| 402 | 4.2 | 3.3 | 4 |
| 427 | 60.5 | 52.5 | 34.4 |
| 419 | 1.4 | 11 | 6.8 |
| 404 | 1.2 | -0.4 | 0.3 |
| 408 | 0.3 | 0 | 0.3 |
| 418 | 0.5 | 0.2 | 0 |
| 411 | 0.6 | 0.3 | 0.1 |
| 412 | 14.4 | 17.8 | 5.9 |
| 416 | 1.1 | 2.8 | 0.3 |
| 423 | 12.5 | 5.8 | 4.9 |
| 429 | 76.7 | 80.9 | 70.9 |
| 409 | 26.8 | 15.2 | 3.1 |
| 405 | 66 | 60.4 | 43.1 |
| 428 | 15.1 | 12.3 | 15.6 |
| 431 | 26.8 | 19.2 | 15.4 |
| 434 | 36.4 | 51 | 54 |
| 425 | 66.7 | 59 | 59 |
| 421 | 1.3 | 9.1 | 3.7 |
| Averages | 30.0 | 29.8 | 26.3 |
| p-value | <0.000001 | <0.000001 | <0.000001 |
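For concreteness, the sketch below shows one plausible way to compute the quantities reported in Table 2: precision at the point a simulated review first reaches 80% recall, the raw percentage-point difference between two seeding conditions, and a binomial sign test across topics. The function names and the exact form of the significance test are assumptions for illustration, not the authors' code.

    # Illustrative sketch of precision at 80% recall, delta-precision, and a binomial
    # sign test across topics. Each review is represented by its ordered 0/1 codings.
    from scipy.stats import binomtest   # scipy >= 1.7; older versions expose binom_test

    def precision_at_recall(review_order_labels, total_relevant, recall_target=0.80):
        """Precision over the documents reviewed up to the moment recall_target is reached."""
        needed = recall_target * total_relevant
        found = 0
        for n, rel in enumerate(review_order_labels, start=1):
            found += rel
            if found >= needed:
                return found / n
        return None   # target never reached

    def delta_precision(p_labels, other_labels, total_relevant):
        """Raw percentage-point difference: P-seeded minus the comparison condition
        (assumes both reviews reach the recall target)."""
        p = precision_at_recall(p_labels, total_relevant)
        o = precision_at_recall(other_labels, total_relevant)
        return 100.0 * (p - o)

    def sign_test(deltas):
        """Binomial test on how many topics favor P-seeding (ties ignored)."""
        wins = sum(1 for d in deltas if d > 0)
        trials = sum(1 for d in deltas if d != 0)
        return binomtest(wins, trials, p=0.5).pvalue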
This is a wholly expected result. Closer examination of the simulated review orderings shows that in low richness domains most of the precision loss comes not from the CAL iterations, but from the larger number of documents needed to find enough positive ones to start ranking. Note that random seeding on the target partition also outperforms fully linear review, by an average of 12.2, 10.6, and 5.6 percentage points on the 20/80, 50/50, and 80/20 partitions, respectively. So even random seeding of CAL is better than no CAL at all.

Thus in answer to the question of whether P-seeding produces an efficacious result, the answer is yes. However, the more important question is not whether portable models are useful; it is whether they are useful relative to other reasonable, simpler, or less risky alternatives. For that we turn to the remaining research questions.

6.2 RQ2: Portable vs Human Seed Initial Relevance

The results for our second question are found in Table 3, under the Target Portable and Target Manual columns for each partition group. For example, in the 20/80 partition, where 80% of the collection is used as the target collection and on average across all 5 folds, on topic 403 human effort found 18.4 positively-coded seed documents whereas the portable model M found 41.6 at the same level of effort (i.e. at 60 documents, as per Table 1). On topic 421 under the 20/80 partition, humans found an average 10.4 documents and M found 3.2.

The average number of positive training documents in the source partition, i.e. the data on which M is trained, is also shown. The number of negative training examples is the remainder of the fold. Averages across all 34 topics are shown at the bottom of the table, as is a binomial p-value.

These results show that when 20% of the collection is used to train M (the 20/80 partition), even though that data is literally from the same distribution as the target fold, the various M are able to find seed documents at a rate no better than a small amount of human effort. There is only a difference of 0.2 documents across all topics, and while there is some variation between topics the differences are not statistically significant (p=0.303). As the training partition increases, and 50% then 80% of the collection is used to train each M, so too does the ability of the model to find more seed documents. On the 50/50 partition M finds on average 3.5 more documents than the human at the given effort level, and on the 80/20 partition it finds an average 1.5 more documents. Both results are statistically significant.

When the number of seeds is normalized per fold and topic by the number of total seeds found, i.e. the positive seed precision, the manual effort has an average precision of 66.0% across all folds, whereas M precision is 64.2%, 76.5%, and 78.1%, respectively, across 20/80, 50/50, and 80/20. That is, even though the average number of documents that M finds on the 50/50 partition is larger (3.5) than on the 80/20 partition (1.5), the latter partition is smaller. The actual precision goes up slightly.
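One plausible reading of the normalization just described, written as a small sketch (the data layout and the averaging choice are assumptions for illustration, not taken from the paper):

    # Positive seed precision: per topic and fold, the share of selected seed documents
    # that were coded positive, then averaged. Illustrative assumption about data layout.
    def positive_seed_precision(seed_codings_by_fold):
        """seed_codings_by_fold: list (one entry per fold) of lists of 0/1 seed codings."""
        per_fold = [sum(fold) / len(fold) for fold in seed_codings_by_fold if fold]
        return 100.0 * sum(per_fold) / len(per_fold)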
Thus in answer to the question of whether portable models initially find more relevant documents than does human effort, the answer is mixed. When given 20% of the collection for training, they do not. When given 50% or 80%, they do. However, the improvement is modest: a few percentage points, or a few extra documents.

6.3 RQ3: Portable- vs Human-Seeded Recall

The results for our third and final question are also found in Table 3, under the ∆precision columns. Again, in the interest of symmetric magnitudes, ∆precision is the percentage point difference between P-seeded versus H-seeded CAL. While again there is some variation across topics, on average on the 20/80 partition P-seeding is 0.4 percentage points worse, while on the 50/50 and 80/20 partitions P-seeding is 0.6 and 1.2 percentage points better. However, none of these results are statistically significant (p=0.303).

Furthermore, when we look at the raw document count difference between the two conditions (not shown in the table), another story emerges. In the 80/20 partition, on those topics for which P-seeded CAL is better, it is better on average by 186 total documents. Where H-seeded CAL is better, it is better by 896 documents. On the 50/50 partition, P- versus H-seeding is 271 vs 743 documents better, and on the 20/80 partition it is 166 versus 655 documents. There is no consistent advantage of either approach over the other, but the negative consequences of H-seeding seem to be far smaller than those of P-seeding.

The reason for the topical sort order across all tables should now become clear: all tables in this paper are sorted by the ∆precision of the 80/20 partition. This seems to be the partition for which portable models are the strongest; they have the most training data. And sorting by ∆precision allows us to see where P-seeding vs H-seeding each shine. To that end, we introduce one more metric into the discussion: the WTF "ineffectiveness" metric [11, 12] encapsulates the notion of not only looking at average performance, but at outliers. A system that has good average performance but egregious outliers might best be avoided, especially in eDiscovery, where every case matters and the costs incurred by an outlier are more significant than in, say, ad hoc web search.

From this perspective, we see that where portable models perform the strongest, i.e. on the 80/20 partition where they are given 80% of the available positive documents, there are outliers in both directions. The top three P-advantage outliers show a 60.5, 12.0, and 5.7 percentage point difference. The top three H-advantage outliers show a 25.0, 12.6, and 11.1 percentage point difference. However, in terms of raw document counts these translate to 670, 667, and 468 documents for the P-advantage, and 9249, 2006, and 892 documents for the H-advantage. There appear to be fewer "WTFs" from H-seeding.

7 CONCLUSION

We have shown that a portable model can be useful. Certainly relative to linear review, and even relative to randomly seeded CAL workflows, taking a portable approach offers a significant advantage. Portable models are also marginally better than humans when it comes to finding initial seed documents. When it comes to sustained advantage, i.e. precision at 80% recall, the advantages fade. There is no statistically significant difference in human vs portably seeded CAL workflows, and slight evidence that the outliers for the portable approach are worse.

We note also that porting models carries with it significant risk in the form of intellectual property rights, data leakage via membership inference attacks, privacy, and security. It is every party's own subjective decision as to whether the advantages of portable models outweigh the challenges and risk. However, from the results in this study we would recommend continuing to invest in human-driven seeding (IA, or intelligence augmentation) processes and not going all in on AI. At least relative to the topics studied in this paper, the modicum of effort required of the human is a fair trade relative to risk. Even when portable models are built on a corporation's own data, and models are not swapped between different owners and therefore risk is lower, we do not yet find that the portable model provides a sustained advantage.

8 FUTURE WORK

Certainly this is but one study, and more studies with a wider range of models, data collections, and human effort are needed. Perhaps the humans could have done even better if given more time, were working on a domain in which they had specific expertise, or were given more powerful analytics with which to find seed documents. Conversely, portable models were given all possible advantages in this experimental structure by building them on documents drawn from the exact same distribution as the target collection, in ever increasing amounts (20%, 50%, and 80%). It is not likely that portable models will ever be trained on prior data as perfectly similar to the target distribution. Therefore, portable models would likely have performed much worse in realistic scenarios where the source and target collections are further apart. E.g. when modeling fraud or sexual harassment, does what constitutes evidence of fraud or harassment in one collection express itself the same way in another collection? Future research is needed in three different areas: (a) more and larger collections from similar but not identical distributions on which to train models, or perhaps the "right" small collections on which to train models, as per [2]; (b) more advanced models and better transfer learning; and (c) better and stronger baselines against which to compare.

Better and stronger baselines are not limited to more effective human effort. They also include other existing, common practices. For example, many companies dealing with sensitive information keep lexicons of search terms used to find sensitive information. While a lexicon could in some sense be thought of as an "unweighted" portable model, one difference is that it is manually constructed, transparent, and can embed human intuition and patterns never seen in prior data, i.e. lexicons do not need to be trained. Another common approach for corporations with repeat litigation is the idea of a "drop in seed". That is, instead of building large models based on huge datasets from all possible prior matters, some in the industry have developed the ad hoc practice of taking a few coded documents from previous matters that are known to be similar to the current matter, and using those as the initial seeds on the target collection. This of course only works behind the firewall, as companies will not transfer documents to other companies. But given the security, privacy, and related membership inference attack risks of portable models, companies might not want to transfer their own models to a competitor, either. So in addition to comparing portable models against human seeding, they should be compared against lexicons and drop-in seeds. Perhaps these latter approaches outperform both portable models and human-seeded approaches when considering the total cost of a review and not just the document count.

The cost of the portable model (vendor charge) versus the human approach (e.g. half an hour of searcher time) needs to be considered as well, and not just the cost of the subsequent document review. In short, this is a rich space for the exploration of tradeoffs, risks, and advantages for various human-driven vs machine-driven eDiscovery processes.
Table 3: Across each of the various partitions: (1) the number of positive training documents in the portable model "source", (2) the number of positive seed documents found by the portable model P and the manual H approaches, and (3) the relative precision (∆prec) of P-seeding over H-seeding at 80% recall, i.e. the precision of the former minus the precision of the latter. Positive numbers indicate that P-seeding was more effective, negative numbers that H-seeding was more effective.

| Topic | 20/80 Source | 20/80 Portable (P) | 20/80 Manual (H) | 20/80 ∆prec | 50/50 Source | 50/50 Portable (P) | 50/50 Manual (H) | 50/50 ∆prec | 80/20 Source | 80/20 Portable (P) | 80/20 Manual (H) | 80/20 ∆prec |
|-------|--------------|--------------------|------------------|-------------|--------------|--------------------|------------------|-------------|--------------|--------------------|------------------|-------------|
| 403 | 218.0 | 41.6 | 18.4 | 27.8 | 545.0 | 26.0 | 11.5 | 42.8 | 872.0 | 10.4 | 4.6 | 60.5 |
| 422 | 6.2 | 6.6 | 12.8 | -8.2 | 15.5 | 11.0 | 8.0 | 14.6 | 24.8 | 3.8 | 3.2 | 12.0 |
| 424 | 99.0 | 40.2 | 43.2 | 0.0 | 247.5 | 28.5 | 27.0 | 2.4 | 396.0 | 11.4 | 10.8 | 5.7 |
| 426 | 24.0 | 23.2 | 36.0 | 22.6 | 60.0 | 21.5 | 22.5 | 7.2 | 96.0 | 8.8 | 9.0 | 4.6 |
| 420 | 147.4 | 50.8 | 49.6 | 2.6 | 368.5 | 33.0 | 31.0 | 0.6 | 589.6 | 13.2 | 12.4 | 4.4 |
| 407 | 316.2 | 31.4 | 30.4 | 1.4 | 790.5 | 19.5 | 19.0 | 2.5 | 1264.8 | 7.8 | 7.6 | 4.2 |
| 414 | 167.6 | 36.2 | 6.4 | 2.9 | 419.0 | 24.0 | 4.0 | 0.2 | 670.4 | 9.4 | 1.6 | 3.2 |
| 410 | 269.0 | 71.6 | 68.0 | -1.6 | 672.5 | 45.0 | 42.5 | -2.8 | 1076.0 | 18.0 | 17.0 | 2.9 |
| 401 | 45.8 | 33.6 | 36.0 | 3.6 | 114.5 | 27.0 | 22.5 | 2.1 | 183.2 | 11.0 | 9.0 | 2.8 |
| 406 | 25.2 | 11.8 | 31.2 | 1.3 | 63.0 | 21.5 | 19.5 | -1.7 | 100.8 | 9.4 | 7.8 | 2.0 |
| 433 | 22.4 | 28.4 | 30.4 | -5.6 | 56.0 | 25.0 | 19.0 | -2.2 | 89.6 | 10.8 | 7.6 | 1.5 |
| 415 | 2408.4 | 33.6 | 46.4 | 0.1 | 6021.0 | 21.5 | 29.0 | -2.9 | 9633.6 | 8.0 | 11.6 | 1.3 |
| 430 | 198.0 | 62.6 | 48.0 | -1.0 | 495.0 | 40.5 | 30.0 | 0.4 | 792.0 | 16.8 | 12.0 | 1.3 |
| 417 | 1186.2 | 87.2 | 81.6 | 0.5 | 2965.5 | 54.5 | 51.0 | 0.8 | 4744.8 | 21.8 | 20.4 | 0.9 |
| 413 | 109.2 | 51.2 | 52.8 | -0.3 | 273.0 | 34.0 | 33.0 | 0.4 | 436.8 | 13.8 | 13.2 | 0.6 |
| 432 | 28.0 | 10.2 | 31.2 | -12.1 | 70.0 | 20.5 | 19.5 | 6.1 | 112.0 | 8.6 | 7.8 | 0.4 |
| 402 | 127.0 | 49.4 | 43.2 | 0.1 | 317.5 | 34.0 | 27.0 | 0.6 | 508.0 | 14.0 | 10.8 | 0.3 |
| 427 | 48.2 | 30.8 | 35.2 | 2.1 | 120.5 | 23.5 | 22.0 | 5.2 | 192.8 | 10.2 | 8.8 | 0.2 |
| 419 | 396.6 | 28.4 | 40.0 | -0.2 | 991.5 | 18.0 | 25.0 | -0.6 | 1586.4 | 7.2 | 10.0 | 0.1 |
| 404 | 108.8 | 30.6 | 26.4 | 0.1 | 272.0 | 19.5 | 16.5 | -0.1 | 435.2 | 8.2 | 6.6 | 0.0 |
| 408 | 22.8 | 20.8 | 16.0 | 0.1 | 57.0 | 20.5 | 10.0 | 0.0 | 91.2 | 8.6 | 4.0 | 0.0 |
| 418 | 37.4 | 13.4 | 10.4 | 0.1 | 93.5 | 16.0 | 6.5 | 0.0 | 149.6 | 7.6 | 2.6 | 0.0 |
| 411 | 17.8 | 9.4 | 9.6 | -0.1 | 44.5 | 11.5 | 6.0 | 0.1 | 71.2 | 4.4 | 2.4 | -0.1 |
| 412 | 280.0 | 55.8 | 47.2 | 1.5 | 700.0 | 37.5 | 29.5 | 0.7 | 1120.0 | 14.4 | 11.8 | -0.1 |
| 416 | 279.0 | 36.4 | 29.6 | 0.6 | 697.5 | 20.5 | 18.5 | 0.1 | 1116.0 | 8.0 | 7.4 | -0.2 |
| 423 | 57.2 | 16.8 | 16.8 | 0.2 | 143.0 | 13.0 | 10.5 | 1.0 | 228.8 | 5.4 | 4.2 | -0.4 |
| 429 | 165.4 | 61.4 | 63.2 | -1.5 | 413.5 | 39.0 | 39.5 | -0.4 | 661.6 | 15.6 | 15.8 | -1.2 |
| 409 | 39.8 | 25.2 | 28.0 | 11.7 | 99.5 | 23.5 | 17.5 | -1.8 | 159.2 | 10.2 | 7.0 | -1.9 |
| 405 | 23.8 | 31.0 | 32.8 | -0.4 | 59.5 | 22.0 | 20.5 | 1.1 | 95.2 | 9.2 | 8.2 | -2.4 |
| 428 | 92.4 | 55.6 | 48.0 | 2.7 | 231.0 | 34.5 | 30.0 | -0.9 | 369.6 | 14.0 | 12.0 | -5.4 |
| 431 | 28.8 | 24.8 | 30.4 | -3.6 | 72.0 | 21.0 | 19.0 | -11.6 | 115.2 | 7.6 | 7.6 | -8.3 |
| 434 | 7.6 | 5.6 | 24.0 | -29.8 | 19.0 | 9.0 | 15.0 | -13.6 | 30.4 | 5.4 | 6.0 | -11.1 |
| 425 | 142.8 | 51.2 | 44.0 | -7.8 | 357.0 | 32.0 | 27.5 | -10.8 | 571.2 | 12.8 | 11.0 | -12.6 |
| 421 | 4.2 | 3.2 | 10.4 | -23.4 | 10.5 | 5.0 | 6.5 | -18.6 | 16.8 | 1.8 | 2.6 | -25.0 |
| AVG | 210.3 | 34.4 | 34.6 | -0.4 | 525.8 | 25.1 | 21.6 | 0.6 | 841.2 | 10.2 | 8.7 | 1.2 |
| p-value | | | 0.303 | 0.303 | | | 0.0002 | 0.303 | | | 0.00004 | 0.303 |

REFERENCES

[1] Brittany Bacon, Tyler Maddry, and Anna Pateraki. 2020. Training a Machine Learning Model Using Customer Proprietary Data: Navigating Key IP and Data Protection Considerations. Pratt's Privacy and Cybersecurity Law Report 6, 8 (Oct. 2020), 233–244.
[2] Ricardo Baeza-Yates. 2013. Big Data or Right Data?. In Proceedings of the 7th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW 2013), Loreto Bravo and Maurizio Lenzerini (Eds.), Vol. 1087 (CEUR Workshop Proceedings). Puebla/Cholula, Mexico. http://ceur-ws.org/Vol-1087/
[3] Nicholas Carlini. 2020. Privacy Considerations in Large Language Models. Retrieved April 29, 2021 from https://ai.googleblog.com/2020/12/privacy-considerations-in-large.html
[4] Nicholas Carlini, Florian Tramer, Eric Wallace, Mathew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting Training Data from Large Language Models. Article arXiv:2012.07805.
[5] G. V. Cormack and M. R. Grossman. 2014. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. Gold Coast, Australia, 153–162.
[6] Ben Dixon. 2021. Machine Learning: What are Membership Inference Attacks? Retrieved April 29, 2021 from https://bdtechtalks.com/2021/04/23/machine-learning-membership-inference-attacks/
[7] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. 2016. TREC 2016 Total Recall Track Overview. In NIST Special Publication 500-321: The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), Ellen M. Voorhees and Angela Ellis (Eds.). Gaithersburg, Maryland. https://trec.nist.gov/pubs/trec25/trec2016.html
[8] Nicholas M. Pace and Laura Zakaras. 2012. Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. Rand Corporation, Santa Monica, CA, USA.
[9] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. 2019. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium. San Diego, California.
[10] Florian Tramer, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. 2016. Stealing Machine Learning Models via Prediction APIs. In Proceedings of the 25th USENIX Security Symposium. Austin, Texas, 601–618.
[11] Daniel Tunkelang. 2012. WTF!@k: Measuring Ineffectiveness. Retrieved April 29, 2021 from https://thenoisychannel.com/2012/08/20/wtf-k-measuring-ineffectiveness/
[12] Ellen Voorhees. 2004. Measuring Ineffectiveness. In Proceedings of the 27th Annual ACM SIGIR Conference on Research and Development in Information Retrieval. Sheffield, UK, 562–563. https://doi.org/10.1145/1008992.1009121