<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging the Transductive Nature of e-Discovery in Cost-Sensitive Technology-Assisted Review</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <email>alessiomolinar@gmail.com</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>MINECORE is a recently proposed algorithm for minimizing the expected costs of review for topical relevance (a.k.a. "responsiveness") and sensitivity (a.k.a. "privilege") in e-discovery. Given a set of documents that must be classified by both responsiveness and privilege, for each such document and for both classification criteria MINECORE determines whether the class assigned by an automated classifier should be manually reviewed or not. This determination is heavily dependent on the ("posterior") probabilities of class membership returned by the automated classifiers, on the costs of manually reviewing a document (for responsiveness, for privilege, or for both), and on the costs that different types of misclassification would bring about. We attempt to improve on MINECORE by leveraging the transductive nature of e-discovery, i.e., the fact that the set of documents that must be classified is finite and available at training time. This allows us to use EMQ, a well-known algorithm that attempts to improve the quality of the posterior probabilities of unlabelled documents in transductive settings, with the goal of improving the quality (a) of the posterior probabilities that are input to MINECORE, and thus (b) of MINECORE's output. We report experimental results obtained on a large (≈800K) dataset of textual documents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>– c_P (which stands for "Produce"): the document has been deemed responsive and not privileged; as such, it should be produced to the receiving party;
– c_L (which stands for "Log"): the document has been deemed responsive and privileged; as such, it should be entered into a "privilege log" (i.e., a repository open to inspection by the judge) and should not be disclosed to the receiving party;
– c_W (which stands for "Withhold"): the document has been deemed not responsive, which means that the producing party should not produce it.</p>
      <p>In such a scenario, the producing party may incur two types of costs, i.e., annotation costs (since lawyers need to be paid for their reviewing work) and misclassification costs (which may derive when an inappropriate class, e.g., "Produce" instead of "Log", is chosen for a document).</p>
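      <p>As a minimal sketch (the function name and string labels below are ours, purely illustrative, and not part of any e-discovery system), the three-way labelling scheme above can be expressed as:
```python
# Illustrative sketch of the three-way labelling scheme described above:
# a document's (responsiveness, privilege) pair determines the action class.

def assign_class(responsive: bool, privileged: bool) -> str:
    """Map the two binary judgments to one of the classes c_P, c_L, c_W."""
    if not responsive:
        return "Withhold"  # c_W: not responsive, should not be produced
    if privileged:
        return "Log"       # c_L: responsive and privileged, goes to the privilege log
    return "Produce"       # c_P: responsive and not privileged, produced to the receiving party
```
</p>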
      <p>
        Given the enormous amount of digital documents that must undergo review in many practical cases, a need for automating the review process has arisen. Several techniques for technology-assisted review (TAR), which combine techniques from information retrieval and machine learning, have thus been proposed in the last ten years [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The aim of TAR algorithms is to support the review process, minimizing the overall costs deriving from annotation efforts and misclassifications.
      </p>
    </sec>
    <sec id="sec-2">
      <title>An overview of MINECORE</title>
      <p>
        MINECORE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a recently proposed decision-theoretic algorithm for minimizing the expected costs of review for responsiveness and privilege in e-discovery. Given a set D of documents that must each be assigned a class in C = {c_P, c_L, c_W} based on whether they belong or not to the class c_r of responsive documents and/or to the class c_p of privileged documents, the goal of MINECORE is to determine, for each document d ∈ D, whether manually reviewing d by responsiveness and/or privilege is expected to be cost-effective or not. This determination is based
1. on the ("posterior") probabilities of class membership (written as Pr(c_r|d) and Pr(c_p|d), hereafter called the posteriors) returned by automated classifiers h_r (which classifies documents by responsiveness) and h_p (which classifies documents by privilege);
2. on the unit costs of manually checking a document for responsiveness (λ_r^a) or for privilege (λ_p^a), where superscript a stands for "annotation";
3. on the costs λ_ij^m incurred when mistakenly assigning class c_i to a document which should be assigned class c_j, where c_i, c_j ∈ C and superscript m stands for "misclassification".
      </p>
      <p>Bullet 3 is due to the fact that in e-discovery not all misclassifications are equally serious; for instance, inadvertently disclosing a privileged document is typically a very serious mistake, while inadvertently disclosing a nonresponsive nonprivileged document is usually a less serious one.</p>
      <p>MINECORE consists of three phases, which we summarize below. In Phase 1 we train the two automated classifiers h_r and h_p, and use them to generate, for each document d ∈ D, the two posteriors Pr(c_r|d) and Pr(c_p|d) mentioned in Bullet 1. We can reasonably assume c_r and c_p to be stochastically independent, which implies that we may assume Pr(c_P|d) = Pr(c_r|d)(1 - Pr(c_p|d)), Pr(c_L|d) = Pr(c_r|d) Pr(c_p|d), and Pr(c_W|d) = 1 - Pr(c_r|d). MINECORE takes a risk minimization approach, i.e., it classifies each document d in the class

h(d) = arg min_{c_i} R(d, c_i) = arg min_{c_i} Σ_{j ∈ {P,L,W}} λ_ij^m Pr(c_j|d)   (1)

where R(d, c_i) is the risk associated with assigning d to class c_i. In other words, MINECORE assigns to each document d the class that brings about the minimum misclassification risk, i.e., the minimum expected misclassification cost, thus avoiding courses of action for which a combination of probability of class membership and misclassification cost is high. The function for measuring the global misclassification cost is thus

K^m(D) = Σ_{i,j ∈ {P,L,W}} λ_ij^m D_ij

where D_ij is the number of documents d ∈ D whose predicted class h(d) is c_i and whose true class (which we denote by y(d)) is c_j.</p>
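      <p>The risk-minimizing rule of Equation (1) and the global cost K^m(D) can be sketched as follows; the function names and the cost values below are ours and purely illustrative, not the actual cost structure of [4]:
```python
# Sketch of MINECORE's risk-minimizing classification rule (Equation 1) and of
# the global misclassification cost K^m(D). Cost values are illustrative only.

CLASSES = ("P", "L", "W")

def min_risk_class(posteriors, costs):
    """Return the class c_i minimizing sum_j costs[(i, j)] * Pr(c_j|d)."""
    def risk(ci):
        return sum(costs[(ci, cj)] * posteriors[cj] for cj in CLASSES)
    return min(CLASSES, key=risk)

def global_cost(predicted, true, costs):
    """K^m(D): total cost of all (predicted, true) class pairs over D."""
    return sum(costs[(p, t)] for p, t in zip(predicted, true))

# Made-up cost matrix: wrongly producing a privileged document (predict P,
# true class L) is by far the most serious mistake.
COSTS = {
    ("P", "P"): 0.0, ("P", "L"): 50.0, ("P", "W"): 5.0,
    ("L", "P"): 2.0, ("L", "L"): 0.0,  ("L", "W"): 2.0,
    ("W", "P"): 3.0, ("W", "L"): 1.0,  ("W", "W"): 0.0,
}
```
Note how, under these costs, even a document whose most probable class is c_P may be assigned c_L when the (small) probability of it being privileged makes the risk of producing it too high.</p>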
      <p>Phase 2 and Phase 3 are essentially identical to each other. The only difference is that, while Phase 2 determines the subset of documents which are expected to be cost-effective to manually review by responsiveness, Phase 3 determines (once these documents have indeed been manually reviewed by responsiveness) the same for privilege. We will thus only describe Phase 2, leaving it to the reader to work out the details of Phase 3.</p>
      <p>Note that, if r documents are reviewed by responsiveness and p documents are reviewed by privilege, the overall cost of the entire process may be described as

K^o(D) = K^m(D) + K^a(D)   (2)
       = K^m(D) + λ_r^a · r + λ_p^a · p   (3)

If document d is reviewed by responsiveness, this has the effect of removing (assuming infallible reviewers) any uncertainty about whether d ∈ c_r or not. In other words, if by subscript n ∈ {1, 2, 3} we indicate the value of a given quantity after Phase n has been carried out, reviewing d by responsiveness means that Pr_2(c_r|d) will be either 0 or 1. As a result, if d is reviewed by responsiveness it will in general hold that Pr_1(c_r|d) ≠ Pr_2(c_r|d), h_1(d) ≠ h_2(d), and K_1^m(D) ≠ K_2^m(D). Since reviewing d by responsiveness brings about an additional λ_r^a cost, it is worthwhile to annotate d only if, as a result of the annotation, K_2^o(D) ≤ K_1^o(D), i.e., K_2^m(D) + λ_r^a ≤ K_1^m(D); in other words, the additional annotation cost must be offset by a reduction in overall misclassification cost of greater or equal magnitude. Of course, computing precisely whether there is going to be such a reduction at all is not possible, because at the time of deciding whether d should be annotated or not we do not know the value of y_r(d) (a binary variable that indicates whether the reviewer will annotate d as responsive or not), and we do not know the true label y(d) of d. However, it is possible to compute an expectation of this reduction over the y_r(d) and y(d) variables; when this expected value exceeds λ_r^a, MINECORE decrees that d should be annotated by responsiveness. We refer the reader to [4, §3] for details on how the above expected value is computed, and for a full mathematical specification.</p>
    </sec>
    <sec id="sec-3">
      <title>Improving our posteriors via EMQ</title>
      <p>The goal of MINECORE is to identify the r (resp., p) documents that, when reviewed for responsiveness (resp., privilege), will each bring about a reduction in the expected overall cost of the review process. In order to do this, MINECORE uses as input the data listed in the three bullets at the beginning of §2.</p>
      <p>In this work we attempt to improve this process not by modifying MINECORE, but by improving the quality of the input that MINECORE receives, with the goal of having MINECORE bring about a higher reduction in expected costs. Since the input data mentioned in Bullets 2 and 3 are user-defined parameters, and are hence not under our control, we focus on the input data mentioned in Bullet 1, i.e., the posteriors Pr(c_r|d) and Pr(c_p|d).</p>
      <p>
        Our contribution is based on the observation that e-discovery has, from the standpoint of machine learning, a transductive nature, i.e., the unlabelled set D is finite and available at training time. This fact can be exploited in order to improve the quality of our posteriors (at a first approximation, we may say that the quality of a posterior Pr(c|d) is high when |h_c(d) - Pr(c|d)| is low; see §4 for more details). In fact, Saerens and colleagues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have presented an instance of the well-known EM ("expectation maximization") algorithm (an instance which, following [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we will call EMQ) that iteratively improves the quality of such probabilities in transductive contexts. EMQ is a mutually recursive algorithm that alternates between (a) recomputing the prior probabilities Pr(c) (hereafter: the priors) as the average of the posteriors over the entire set, and (b) recomputing the posteriors by multiplying the old posteriors by the ratio between the new priors and the old priors. The iteration is carried out until convergence, and has been shown to deliver both improved posteriors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and improved priors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
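      <p>The alternation between steps (a) and (b) can be sketched as follows for a single binary class; this is a simplified sketch under our reading of [6], with variable names of our own choosing:
```python
import numpy as np

# Simplified EMQ sketch for one binary class c: alternate between
# (a) re-estimating the prior as the mean of the current posteriors, and
# (b) rescaling each posterior by the ratio of new to old priors.

def emq(posteriors, train_prior, n_iter=1000, eps=1e-8):
    """posteriors: Pr(c|d) for each unlabelled d; train_prior: prevalence
    of c in the training set, used as the initial prior estimate."""
    p = np.asarray(posteriors, dtype=float)
    old_pos, old_neg = train_prior, 1.0 - train_prior
    for _ in range(n_iter):
        new_pos = p.mean()                 # step (a): new prior
        new_neg = 1.0 - new_pos
        num = (new_pos / old_pos) * p      # step (b): rescale the posteriors
        den = num + (new_neg / old_neg) * (1.0 - p)
        p_new = num / den
        if eps > np.abs(p_new - p).max():  # stop at convergence
            p = p_new
            break
        p, old_pos, old_neg = p_new, new_pos, new_neg
    return p, float(p.mean())              # updated posteriors and prior
```
If the posteriors already average to the training prior, the procedure is at a fixed point and leaves them unchanged; otherwise the posteriors are progressively shifted towards the prevalence observed on the unlabelled set.</p>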
      <p>Therefore, (a) we obtain calibrated posteriors Pr(c_r|d) and Pr(c_p|d), for each document d ∈ D, from our classifiers h_r and h_p (which, in the experiments of §4, we obtain via a linear SVM), (b) we update them via EMQ, and (c) we feed them to MINECORE.</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        We run experiments in which we compare two versions of MINECORE: one that uses the original posteriors (here dubbed MINECORE_MLE, since the probabilities are obtained from a maximum likelihood estimator) and one that uses the EMQ-enhanced posteriors (here dubbed MINECORE_EMQ). We use the same dataset as [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which consists of 20,000 training documents, 780,000 test documents, and 120 instantiations of the (c_r, c_p) pair of classes; the values of λ_r^a, λ_p^a, λ_ij^m are from "CostStructure1" in [4, §4.3]. We refer to [4, §4] for other details about the experimental setup. The evaluation function we use is K^o(D) from Equation 3. We measure the ratio K^o_MLE(D) / K^o_EMQ(D) between K^o(D) as deriving from MINECORE_MLE and K^o(D) as deriving from MINECORE_EMQ; if this ratio is &gt; 1, this means that EMQ delivers an improvement, and that our attempt has been successful.
      </p>
      <p>We show the outcome of our experiment in Figure 1, where we display the class pairs on the X axis (sorted by their K^o_MLE(D) / K^o_EMQ(D) value in order to improve the readability of the plot) and K^o_MLE(D) / K^o_EMQ(D) on the Y axis; the red line indicates the value K^o_MLE(D) / K^o_EMQ(D) = 1. A dot above the red line indicates, for the corresponding class pair, an improvement in overall cost with respect to the MINECORE_MLE baseline, whereas a dot below the red line indicates a deterioration. Figure 1 shows that for 60% of the pairs, using EMQ brings about a deterioration. The average value of the ratio across the 120 class pairs is 0.99, which indicates a 1% average deterioration.</p>
      <p>This is surprising, given the positive results that have been reported for EMQ in past literature. As a result, we have further tried to compare the quality of the Pr_MLE(c|d) and Pr_EMQ(c|d) sets of posteriors via an application-independent measure of quality. For this, we have chosen "soft accuracy" (a variant of what has been called the Brier score), which corresponds to the standard accuracy measure A = (TP + TN) / (TP + FN + FP + TN) as evaluated on a "soft" contingency table whose cells contain, instead of counts, sums of probabilities of class membership. (For example, for a document d that actually belongs to c: (1) if c is actually assigned, for a standard contingency table this contributes a value of 1 to the TP cell; (2) if a posterior Pr(c|d) is returned, for a "soft" contingency table this contributes a value of Pr(c|d) to the TP cell and a value of (1 - Pr(c|d)) to the FN cell.) Essentially no difference in soft accuracy was detected, with MLE posteriors obtaining A = 0.9743 and EMQ posteriors obtaining A = 0.9742. Note that, in our case, K^o is a more appropriate evaluation measure than A, since it is application-dependent. In fact, while A measures the quality of our posteriors in a somehow abstract way, K^o directly measures the impact that these posteriors have on MINECORE; this adds to the disappointment that the results returned by EMQ have brought about.</p>
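      <p>The soft contingency table just described can be computed as in the following sketch (the function name is ours):
```python
# Sketch of "soft accuracy": the accuracy A = (TP + TN) / (TP + FN + FP + TN)
# computed on a contingency table whose cells accumulate probabilities of
# class membership instead of 0/1 counts.

def soft_accuracy(posteriors, true_labels):
    """posteriors: Pr(c|d) per document; true_labels: True iff d belongs to c."""
    tp = fn = fp = tn = 0.0
    for p, y in zip(posteriors, true_labels):
        if y:               # d actually belongs to c
            tp += p         # contributes Pr(c|d) to TP
            fn += 1.0 - p   # and (1 - Pr(c|d)) to FN
        else:               # d does not belong to c
            fp += p
            tn += 1.0 - p
    return (tp + tn) / (tp + fn + fp + tn)
```
Perfectly confident, correct posteriors yield A = 1, while uninformative posteriors of 0.5 everywhere yield A = 0.5.</p>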
      <p>We have also checked whether the application of EMQ has improved the class priors. In order to do so we have measured, for each estimation method M ∈ {MLE, EMQ} and for each class c that shows up in at least one of the 120 class pairs, the absolute estimation error (AE), i.e., the absolute value AE_M(c) = |Pr(c) - Pr_M(c)| of the difference between the true class prior Pr(c) and the class prior Pr_M(c) estimated by method M; for each M ∈ {MLE, EMQ} we have then averaged these values across all such classes. The results show that AE_MLE = 0.191 and AE_EMQ = 0.082, i.e., the use of EMQ brings about a reduction of approximately 57% in the absolute estimation error of the class priors.</p>
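      <p>The averaged absolute estimation error used above reduces to the following one-liner (a trivial sketch; the priors in the test below are made-up):
```python
# Sketch of the averaged absolute estimation error on class priors:
# AE_M(c) = |Pr(c) - Pr_M(c)|, averaged across classes.

def mean_absolute_error(true_priors, estimated_priors):
    """Both arguments are dicts mapping class names to prior probabilities."""
    return sum(
        abs(true_priors[c] - estimated_priors[c]) for c in true_priors
    ) / len(true_priors)
```
</p>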
      <p>Overall, this result is surprising: the use of EMQ improved (as expected, and by a very large margin) the quality of the priors, but did not bring about any improvement (actually: brought about a small deterioration) in the quality of the posteriors. The reason this is surprising is that the quality of the posteriors and the quality of the priors should go hand in hand; indeed, the very reason why EMQ is expected to improve the priors is that it computes them as the average of the supposedly better-quality posteriors that it also computes.</p>
      <p>
        One of the reasons why we are witnessing such an unexpected behaviour might lie in the fact that the "distribution drift" in our data is very low (the average difference between the prevalence of a class in the training set and in the test set is 0.6%). Indeed, the results of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] indicate that EMQ tends to work better in high-drift conditions than in low-drift ones. In order to check whether, in a dataset characterized by higher drift, EMQ would deliver a better performance, we have generated artificial drift in the dataset by removing increasingly high quantities of negative examples from the training set. This should allow EMQ to improve the quality of both priors and posteriors, thus improving the performance of MINECORE. Since example removal is completely random, we have repeated the experiment several times, first removing up to 60% (and eventually up to 80%) of the negatives. MINECORE was then run first with MLE-generated posteriors and then with EMQ-generated posteriors. However, once again the results did not support our conjecture: even after removing 80% of the negatives from the training set (see Figure 2), the ratio K^o_MLE(D) / K^o_EMQ(D) does not get higher than 1, and gets slightly smaller than 1 in most cases.
      </p>
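      <p>The drift-generation protocol above can be sketched as follows; this is a simplified sketch under stated assumptions (a fixed random seed, binary labels), not the exact sampling procedure used in our experiments:
```python
import random

# Sketch of artificial drift generation: randomly drop a fraction of the
# negative training examples, raising the prevalence of the positive class
# in the training set while leaving the test set untouched.

def remove_negatives(examples, labels, fraction, seed=0):
    """Drop `fraction` of the examples whose label is False, at random."""
    rng = random.Random(seed)
    negatives = [i for i, y in enumerate(labels) if not y]
    dropped = set(rng.sample(negatives, int(fraction * len(negatives))))
    kept = [i for i in range(len(labels)) if i not in dropped]
    return [examples[i] for i in kept], [labels[i] for i in kept]
```
</p>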
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        Although our idea of improving the posteriors input to MINECORE by leveraging the transductive nature of our dataset seemed (and still seems to us) reasonable and promising, the experiments we have run so far indicate the opposite: EMQ has not improved our posteriors, while it has improved the priors. This is still work in progress, and one hypothesis that we are going to test is that the EMQ algorithm might be well-suited to datasets that exhibit some types of drift but not others [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>We still believe that exploiting the transductive nature of e-discovery could prove key to improving the results of MINECORE. Indeed, our future work on this topic will focus on new ways to leverage this aspect.</p>
      <p>Acknowledgments. The research described in this paper is part of the work carried
out by the author for his MSc thesis in Digital Humanities at the University of Pisa
(supervisors: Andrea Esuli and Fabrizio Sebastiani), to be discussed in 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Caelen, O.: Quantification and learning algorithms to manage prior probability shift. Master's thesis, Institut de Statistique, Biostatistique et Sciences Actuarielles, Université Catholique de Louvain, Louvain-la-Neuve, BE (2017)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Gao, W., Sebastiani, F.: Tweet sentiment: From classification to quantification. In: Proceedings of the 7th International Conference on Advances in Social Network Analysis and Mining (ASONAM 2015), pp. 97-104. Paris, FR (2015). https://doi.org/10.1145/2808797.2809327</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Moreno-Torres, J.G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recognition 45(1), 521-530 (2012). https://doi.org/10.1016/j.patcog.2011.06.019</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Oard, D.W., Sebastiani, F., Vinjumur, J.K.: Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery. ACM Transactions on Information Systems 37(1), 11:1-11:35 (2019). https://doi.org/10.1145/3268928</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Oard, D.W., Webber, W.: Information retrieval for e-discovery. Foundations and Trends in Information Retrieval 7(2/3), 99-237 (2013). https://doi.org/10.1561/1500000025</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Saerens, M., Latinne, P., Decaestecker, C.: Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation 14(1), 21-41 (2002). https://doi.org/10.1162/089976602753284446</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>