<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Factors Affecting the Performance of Reviewers in a Large-Scale Technology-Assisted Review Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Harbison</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura R. Grossman</string-name>
          <email>maura.grossman@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gordon V. Cormack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bronagh McManus</string-name>
          <email>bronagh.mcmanus@ie.gt.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom O'Halloran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo, Waterloo, Ontario, Canada Information Retrieval Services, Grant Thornton Ireland</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As part of a programme of research arising from a major litigation in the Irish Courts, the authors were able to make observations as to the behaviour of a group of trained, experienced, professional barristers carrying out a series of reviews under real-world conditions. From these observations, we were able to discern several anomalous behaviour patterns which, if not identified and controlled, might have had a significant effect on the outcomes of the reviews. These behaviours were, as far as could be seen, not any fault of the reviewers, nor were they due to the specific conditions of the review, and therefore, would appear to be likely to occur in any similar review process. We describe the behaviours seen during the research programme, propose reasons why these behaviours may have arisen, and propose measures by which they can be controlled-for in other similar projects.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>electronic discovery</kwd>
        <kwd>technology-assisted review</kwd>
        <kwd>TAR</kwd>
        <kwd>Continuous Active Learning</kwd>
        <kwd>CAL</kwd>
        <kwd>recall</kwd>
        <kwd>elusion</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        While considerable work has been done in recent decades on measuring the performance of
technology-assisted information-retrieval techniques and technologies (for example [
        <xref ref-type="bibr" rid="ref2">1</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">2</xref>
        ] &amp; [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] etc.),
less attention has been paid to the performance of perhaps the most important aspect of any
information-retrieval exercise: the reviewers relied upon to “train” the information-retrieval models
in the first place. Some work was done on this topic early last decade, most notably by Grossman &amp;
Cormack [
        <xref ref-type="bibr" rid="ref1 ref5">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ], and Roitblat et al. [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ], but in recent years there has been little attention paid to
this issue. The authors recently addressed the problem of assessing reviewer influence on the
performance of information-retrieval systems in another publication [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ] but, in general, it seems the
discipline still tends to rely on the assumption that the human review component of
technology-assisted reviews is invariably accurate and that relevance can always be assessed according to some
“gold standard” of truth, when it has been long understood that neither assumption can be relied
upon [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ].
      </p>
      <p>Between 2017 and 2022, the authors carried out a complex, large-scale electronic discovery project
comprising over 300 million unique documents drawn from the information systems and archives of
a large, defunct insurance company, based in Ireland. The electronic discovery was carried out as part
of legal proceedings taken by the insurance company’s legal administrators against the company’s
former auditors for negligence. As part of the proceedings, the Defendant demanded extraordinarily
broad discovery, covering over 15 years of documents, across an array of information systems, most
of which had either been archived or retired. The project also entailed the cataloguing and retrieval of
data from over 1,000 backup tapes. The Plaintiff’s claim was of the order of $1 billion.</p>
      <p>The electronic discovery project posed a significant number of challenges which could only be
effectively met using modern information-retrieval technology. For example, the Defendant required
the Plaintiff to produce documents under 74 separate discovery criteria (Requests for Production, or
RFPs) of which 70 required relevance assessment. Complicating matters further, Irish discovery rules
require that documents be identified along with the specific discovery criteria (i.e., the RFP) to which
they are responsive.</p>
      <p>To respond to the challenges of this project, the administrators decided to employ Continuous
Active Learning® (CAL®) tools developed by Maura R. Grossman and Gordon V. Cormack, as it was
found that commercially available technology-assisted review (TAR) tools were unlikely to be able to
cope with the review burden without an unreasonable expenditure of time and resources. The
Grossman and Cormack CAL tools had to be validated, however, to ensure that they would be
accepted as being of a standard equivalent to the leading commercially available tools, which would
render them acceptable for use in an Irish Court. Accordingly, a multifaceted testing programme was
run between January 2021 and September 2022 to evaluate the Grossman and Cormack CAL® method
against that of a leading commercial TAR tool and also to evaluate the logistics necessary for such an
exceptionally large, complex document review. This testing programme inevitably led to detailed
assessment of reviewer behaviour and performance in the context of the two TAR systems under
examination.</p>
      <p>The testing programme was unique in that it was carried out on a huge document corpus in a
high-value, real-world litigation. The review criteria were determined by the counterparties in the
litigation and the research team was not involved in their content or design. The reviewer group was
made up of highly qualified and trained barristers all of whom had previous experience in large-scale
document review projects and all of whom were paid commercial rates for their work. The review
was supported by professional litigators from an international law firm (the Maples Group). The
testing was well funded because of its importance to the overall litigation. Our testing budget was of
the order of €3 million. Finally, an unusually heavy focus was placed on quality assurance, data
collection, and performance measurement in the conduct of the review. It was essential for the
purposes of the case that the results of the testing be highly defensible.</p>
      <p>
        This paper summarises the findings of these assessments of reviewer performance. It does not
delve in detail into specific aspects of reviewer activity but instead identifies different instances of
anomalous behaviour or performance observed and measured during the broader testing and attempts
to explain why these behaviour patterns arose. Any of the observed behaviours would be worthy of
deeper analysis and some have, indeed, been examined in more detail already by the authors in other
publications, e.g., [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ],[
        <xref ref-type="bibr" rid="ref11">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Initial Steps</title>
      <p>
        It was understood in advance that reviewers were unlikely to be able to cope with 70 RFPs
simultaneously [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ], so the 70 RFPs were repartitioned into 10 broader “composite categories” (CCs),
each covering a subset of the 70. Our plan was to have the reviewers carry out 20 separate CAL
reviews – one for each of the CCs on both the Grossman and Cormack CAL® system and the
leading commercial TAR tool. It was then intended that the reviewers carry out sub-reviews for
documents found responsive to each of the CCs to assign them to the specific RFPs contained within
the CC.
      </p>
      <p>In carrying out reviews of individual CCs, reviewers were organised in pairs, or groups of four (in
two pairs). Reviewers were issued documents for review in batches, usually 100 or 200 documents in
size. Each reviewer was additionally assigned 10% of the documents assigned to their partner
reviewer, for quality assurance purposes. This “crossover” set of documents allowed us to quickly
detect when reviewer pairs were diverging in their assessment of documents. We attempted, as far as
practical, to ensure that reviewers in pairs carried out their reviews in parallel, reviewing similar
numbers of documents per day as one another. Reviewer pairs were briefed on the same material
prior to each review, by the same briefing team, at the same time.</p>
      <p>Tests were performed to see if this was a practical approach to the problem. Testing covered
multiple different streams, which will be discussed below. However, in general it was found that while
both technologies worked adequately, the Grossman and Cormack CAL® system significantly
outperformed the commercial TAR tool. And while the reviewers, all trained and experienced
barristers, were intelligent and capable, the review of the documents continually posed problems in
maintaining reviewer performance, consistency, and accuracy across both test systems. In essence,
the reviewers and the technologies often failed to “gel”.</p>
      <sec id="sec-2-5">
        <title>3. Fundamental Problems in Manual Review of Documents in Litigation</title>
        <p>
          Order 31 Rule 12 of the Rules of the Superior Courts in Ireland, as enacted in Statutory Instrument
93 of 1999, provides that:
“Any party may apply to the Court by way of notice of motion for an order directing any other
party to any cause or matter to make discovery on oath of the documents which are or have
been in his possession, power or procurement relating to any matter in question therein. Every
such notice of motion shall specify the precise categories of documents in respect of which
discovery is sought and shall be grounded upon the affidavit of the party seeking such an order
of discovery…”[
          <xref ref-type="bibr" rid="ref12">11</xref>
          ]
        </p>
        <p>This Order requires that documents be produced subject to the ill-defined criterion that they be
“related to any matter in question.” This criterion is supposedly compensated for by the stipulation
that the categories of documents requiring discovery be precisely specified. However, in practice,
categories are often defined in broad or even ambiguous terms.</p>
        <p>
          This practice is not unique to the Irish jurisdiction. Discovery under the U.S. Federal Rules of Civil
Procedure, the U.K. Civil Procedure Rules (specifically Practice Direction 31A [
          <xref ref-type="bibr" rid="ref13">12</xref>
          ]) and in other
Common Law jurisdictions reflects similar ambiguity about the concept of document relevance. All
assume that reviewers are always capable of correctly discerning relevant from non-relevant
documents, even though reviews are typically conducted in circumstances where the documents
themselves are ambiguous, the criteria against which they are being reviewed are imperfectly defined,
and where the reviewers’ own knowledge of the specific matters under contention is limited. The
infallibility of reviewers has long been disproven in both the general information-retrieval literature
and in research directly related to electronic discovery processes [
          <xref ref-type="bibr" rid="ref8">7</xref>
          ],[
          <xref ref-type="bibr" rid="ref14">13</xref>
          ],[
          <xref ref-type="bibr" rid="ref15">14</xref>
          ],[
          <xref ref-type="bibr" rid="ref16">15</xref>
          ]. Different
reviewers working on the same document sets rarely achieve positive agreement of more than 70%
in their assessment of document relevance [
          <xref ref-type="bibr" rid="ref6">5</xref>
          ],[
          <xref ref-type="bibr" rid="ref7">6</xref>
          ],[
          <xref ref-type="bibr" rid="ref17">16</xref>
          ].
        </p>
        <p>
          Reviewer assessments are also affected by the conditions of the review, such as the prior
expectations of reviewers and the density or “richness” of responsive documents in the review set.
Considerable emphasis is placed on the concept of Recall (i.e., the proportion of relevant documents
in a document set returned by a specific information-retrieval process) even though, in the almost
inevitable absence of a reliable “gold standard” of relevance, such a concept is of limited value. Recall
is of some worth in comparing the relative performance of different systems across the same data set,
but as a measure of absolute review performance it is considerably flawed. The Recall value alone
(even based on independent blind assessments) tells us little and should not be used as an absolute
acceptance standard in and of itself [
          <xref ref-type="bibr" rid="ref8">7</xref>
          ].
        </p>
        <p>
          The assessment of electronic discovery reviews is beset with miscalculations of Recall caused by
combining estimates taken from different samples, assessed by different reviewers, under different
conditions. What is typically seen in modern litigation reviews is Recall measured either by:
          <list list-type="bullet">
            <list-item>
              <p>comparing the number of documents coded responsive during the review to the number
estimated in advance from a random sample of the entire collection, and stopping when the
former is 70% of the latter, or</p>
            </list-item>
            <list-item>
              <p>comparing the number of documents coded responsive during the review to the number
estimated from a random sample of the as-yet-unreviewed documents (dubbed “Elusion” by
Roitblat [
          <xref ref-type="bibr" rid="ref2">1</xref>
          ]). Recall is estimated to be the former divided by the sum of the two.</p>
            </list-item>
          </list>
        </p>
        <p>Both these methods are flawed. The former is flawed because, all other issues aside, the basis for
determining Recall is usually an assessment of the “richness” of the document collection based on review
of a modest sample of documents before the review begins. This makes any decision on the end-point
of the review largely arbitrary (per Goodhart’s Law). The latter method fails because typically the
reviewers reviewing the “Elusion” sample are incentivised to mark as few documents relevant as
possible to maximise the Recall result, and therefore (consciously or unconsciously) assess the sample
according to much stricter criteria than those used in the document review proper. The consequence
of this is that the results of the document review proper and that of the Elusion test are likely to be
barely related.</p>
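        <p>The two Recall estimates described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the project: the function names, parameter names, and every figure in the usage example are hypothetical, and the results are simple point estimates with no confidence intervals.</p>
        <preformat preformat-type="code">
```python
def recall_from_prevalence(found_responsive, sample_size,
                           sample_responsive, collection_size):
    """Method 1: documents coded responsive so far, divided by the number
    estimated in advance from a random sample of the whole collection."""
    estimated_total = collection_size * sample_responsive / sample_size
    return found_responsive / estimated_total


def recall_from_elusion(found_responsive, elusion_sample_size,
                        elusion_sample_responsive, unreviewed_size):
    """Method 2: documents coded responsive, divided by the sum of that
    number and the number estimated to remain among unreviewed documents."""
    estimated_missed = (unreviewed_size * elusion_sample_responsive
                        / elusion_sample_size)
    return found_responsive / (found_responsive + estimated_missed)


# Hypothetical figures, chosen only to show how sensitive Method 2 is to
# how strictly the Elusion sample is coded.
lenient = recall_from_elusion(7000, 1000, 3, 900000)   # 3 of 1,000 marked relevant
strict = recall_from_elusion(7000, 1000, 1, 900000)    # 1 of 1,000 marked relevant
nothing = recall_from_elusion(7000, 1000, 0, 900000)   # none marked relevant
print(round(lenient, 3), round(strict, 3), nothing)
```
        </preformat>
        <p>With these invented numbers, marking one Elusion-sample document relevant instead of three moves the estimate from roughly 72% to roughly 89%, and marking none yields 100%. The outcome therefore depends heavily on how strictly the Elusion sample is coded.</p>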
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Examples of Observed Anomalous Reviewer Behaviour</title>
      <p>As stated above, this paper is not intended to provide quantitative evidence of specific anomalous
behaviours observed among reviewers working on the litigation, but instead to set out a typology of
those behaviours, to provide real-life examples of each and, as far as is practical, to propose possible
underlying causes for these behaviours. In general terms, we observed the following anomalous
behaviours on multiple occasions throughout the 18-month testing programme:</p>
      <list list-type="bullet">
        <list-item>
          <p>Substantial differences between reviewers’ assessments of similar document sets as reflected
in the proportion of documents found relevant in specific document review batches. In some
cases, individual reviewers would continue to find large proportions of the documents
provided to them relevant when other reviewers working on similar batches of documents,
exported from the TAR system at the same stage of the review process, were finding far
fewer. One would expect that, in general, document batches exported from any TAR system
at a given point in a review would tend to have the same or similar proportions of relevant
and non-relevant documents – after all, at a given point in any TAR process, a certain
proportion of the relevant documents in the system would already have been found, and a
certain proportion would remain to be found. Although this is what was usually observed in
reviews, there were frequent exceptions.</p>
        </list-item>
        <list-item>
          <p>Significant disagreements between reviewers on the relevance of documents, and also
disagreements between the reviewers and other team members assigned to provide quality
assurance on the review. Indeed, paired reviewers would often agree with one another more
than with the quality assurance (QA) team, raising the question as to whether there is much
benefit in doing review QA at all where such a pairing approach is employed.</p>
        </list-item>
        <list-item>
          <p>The extent to which reviewers disagreed as to the relevance of documents was also far greater
than might have been expected. We had assumed that, where one reviewer found a document
relevant and another not relevant, the difference would typically be minor – a question
of context and interpretation. Instead, we often found that the disagreements were
fundamental, with reviewers holding strongly opposed views as to the relevance of certain
documents.</p>
        </list-item>
        <list-item>
          <p>Calculation of Recall using “Elusion testing” methods was found to be highly unreliable. Other
validation techniques relying on confusion-matrix testing (see below), while still flawed, were
nevertheless considerably more reliable.</p>
        </list-item>
        <list-item>
          <p>The review platforms themselves and the way the different TAR methodologies were put into
practice appeared to have a substantial bearing on the results of the reviews.</p>
        </list-item>
        <list-item>
          <p>
        Categorisation of documents proved to be of very limited value – a finding of considerable
relevance to Irish legal cases where (as noted above) categorisation of documents is
mandatory in electronic discovery. These findings have been discussed in a separate
publication [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ] and, unfortunately, there is not space to replicate them here.
          </p>
        </list-item>
      </list>
      <p>We will now discuss each of these issues in more detail.</p>
    </sec>
    <sec id="sec-4">
      <title>4.1. Different Reviewer Assessments of Similar / Identical Documents</title>
      <p>Despite the QA measures in place (as described above in Section 2), it was regularly observed that,
after some time, individual reviewers would develop very different views of what constituted
relevance from those of their review partner (or group). This manifested as the reviewer being either far more
conservative than their partner(s) about what they considered relevant, or far more open.</p>
      <p>It was observed that in CAL-based TAR reviews, when most reviewers began to “run out” of
relevant documents as the CAL process reached its end-point, certain reviewers would go on marking
documents relevant at much the same rate as they had before. For example, in the four-person review
of composite category 5 (ref. Table 1), despite the four reviewers reviewing at approximately similar
rates, a single reviewer, J, developed a much broader interpretation of what constituted a relevant
document than his three partners. He therefore continued finding “relevant” documents in the data
set long after the others had finished.</p>
      <p>[Table 1: Reviewers A, J, D, and L]</p>
      <p>Similarly, in a two-person review (ref. Table 2), this time of a different composite category with
different reviewers, Reviewer L began “running out” of documents long before Reviewer A, despite
both reviewers reviewing broadly in parallel. It was found, again, that A had developed a much
broader definition of relevance than Reviewer L.</p>
      <p>The same behaviour was observed on several other occasions. It did not seem to be restricted to
particular reviewers, review tools, or CCs (review criteria). Instead, a single reviewer in a single
review under a specific CC would unilaterally develop their own view of what constituted relevance,
often based on their understanding of documents previously reviewed or because they had learned
(usually incorrectly) to correlate certain vocabulary or phraseology in documents with relevance.
They would then use this altered version of the relevance criterion to continue the review, despite
the fact that the correct review criterion remained readily available to them in their briefing notes.</p>
      <p>This tendency was also visible when the progress of reviews was plotted on a graph. For example,
in Figure 1, it can be observed that in a review of CC#3, certain reviewers perceived a downward
trend in the number of relevant documents being seen well before others did. Reviewer A reviewed a
batch on 13 December with only 30% of documents deemed relevant whereas Reviewer C was still
finding more than 30% in batches from the same CAL process almost a month later, on 10 January.</p>
      <p>Similarly, in CC#5, a two-reviewer process (Figure 2), we see Reviewer B perceiving a downturn in
the relevant documents presented to them days before their partner, Reviewer A.</p>
      <p>We conclude that there is not a great deal that can be done to correct this phenomenon. At the
core of the issue is the simple and long-understood fact that different people often interpret the same
documents differently, and this interpretation is influenced by multifarious factors arising both from
review conditions and from the individuals’ prior training, knowledge, and life experiences. In many
cases, the reviewer with the anomalous review results was not, strictly speaking, “wrong” in their
relevance assessments, but rather had simply acquired a different understanding of what relevance
truly was. In others, it turned out that the reviewer had simply “lost track” of what the criterion for
relevance was. This was not a reflection of the reviewers’ lack of intelligence, experience, or attention
to detail, but instead something that could happen to anyone. It is the nature of humans that they
identify patterns in data, and sometimes they fix on an incorrect pattern, leading them to go astray.</p>
      <p>As discussed in Section 2 above, reviewers were assigned to specific reviews in pairs. They were
also each assigned 10% of their review partner’s documents to allow us to quickly identify when pairs
of reviewers were diverging in their understanding of the relevance criterion (CC) in each review.
Two of the composite categories underwent two separate reviews for reasons relating to the
proceedings. In five of the reviews, the 10% of shared documents were also reviewed independently by
our QA team, made up of solicitors from the Maples Group who were deeply familiar with the issues
in the legal proceedings, but resource limitations prevented us from doing this for all reviews. We
recorded the levels of agreement between paired reviewers in respect of a review of a large subset of
the overall data set, and also between each reviewer in a pair and the QA reviewer, and obtained the
following results (ref. Table 3).</p>
      <p>As can be seen, the method of using 10% of documents as “crossovers” for ensuring consistency
between reviewers worked well in most cases and reviewers achieved levels of agreement greater
than 80% in most reviews. There were, however, four cases where we were unable to achieve
agreement as high as 80%. In particular, in the second review of CC1 we observed an agreement level
of only 56% despite having the reviewers working in parallel and conferring on what documents
should be considered relevant or not. Remember that the reviewers in question were similarly
experienced, qualified barristers who had been briefed simultaneously using the same briefing
material, yet still they could not consistently agree on relevance on a CC where previously another
reviewer pair had achieved close agreement.</p>
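      <p>A crossover check of this kind can be sketched as follows. The paper does not state how the agreement figures were calculated, so the sketch assumes simple percentage agreement over the documents both reviewers coded. The function names, the document IDs, and the 80% default threshold are hypothetical choices for illustration.</p>
      <preformat preformat-type="code">
```python
def agreement_rate(codes_a, codes_b):
    """Share of crossover documents that both reviewers coded identically.

    Each argument maps a document ID to True (relevant) or False.
    """
    shared = set(codes_a).intersection(codes_b)
    if not shared:
        raise ValueError("no crossover documents to compare")
    same = sum(1 for doc_id in shared if codes_a[doc_id] == codes_b[doc_id])
    return same / len(shared)


def is_diverging(codes_a, codes_b, threshold=0.80):
    """Flag a reviewer pair whose agreement falls below the threshold."""
    return threshold > agreement_rate(codes_a, codes_b)


# Hypothetical codings; only d1..d5 were seen by both reviewers.
reviewer_a = {"d1": True, "d2": False, "d3": True, "d4": True, "d5": False}
reviewer_b = {"d1": True, "d2": True, "d3": True, "d4": False, "d5": False,
              "d9": True}
print(agreement_rate(reviewer_a, reviewer_b), is_diverging(reviewer_a, reviewer_b))
```
      </preformat>
      <p>The same function can compare each reviewer against a QA reviewer by passing the QA codings as the second argument.</p>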
      <p>Similar characteristics were observed in the case of CCs 5, 7, and 8. In the case of CCs 5 and 7,
bringing in QA did not make a substantial difference because the QA reviewers tended to disagree
roughly equally with the assessments of both primary reviewers. These findings raise the question as
to how much value a third, independent QA reviewer adds to the process, as it appears that QA
reviewers’ assessments are likely to be as subjective as those of the reviewers they are overseeing.</p>
    </sec>
    <sec id="sec-5">
      <title>4.3. Extent of Reviewer Disagreement</title>
      <p>As part of our assessment of aspects of the Grossman and Cormack CAL® tool, we decided that it
would be helpful to determine the extent to which reviewers were disagreeing on the relevance of
certain documents. We therefore selected three CCs and, instead of having the reviewer pairs review
Relevant / Non-Relevant as usual, we had them review according to four degrees of relevance:
Strongly Agree / Agree / Unsure / Not Relevant, and then assessed the differences between the paired
reviewers’ relevance assessments. Table 4 sets out our findings.</p>
      <p>As might be expected, in the case of all three CCs reviewed, most documents were assigned the
same designation by both the paired reviewers, although the level of agreement was not quite as high as might have been
expected from the results in Table 3. However, where there was disagreement, it was often
substantial. In CC6, 496 documents were reviewed with one degree of disagreement between the
paired reviewers, 204 with two degrees, and 113 with three degrees (i.e., one reviewer designating a
document “Strongly Agree,” while the other designated it “Not Relevant”). In CC8, 178 documents
were assessed with one degree of disagreement, 209 with two degrees, and 66 with three. In CC4, 259
documents were assessed with one degree of disagreement, 419 with two degrees, and 62 with three.</p>
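      <p>The “degrees of disagreement” counted above can be computed by placing the four designations on an ordinal scale and taking the distance between the two reviewers’ codes. The sketch below assumes that the four labels are evenly spaced, which the paper implies but does not state. The names and the sample pairs are hypothetical.</p>
      <preformat preformat-type="code">
```python
from collections import Counter

# Assumed ordinal positions for the four designations.
SCALE = {"Strongly Agree": 3, "Agree": 2, "Unsure": 1, "Not Relevant": 0}


def disagreement_histogram(pairs):
    """Count documents by how many scale steps separate the two codes."""
    return Counter(abs(SCALE[a] - SCALE[b]) for a, b in pairs)


def share_at_least(hist, degrees):
    """Fraction of documents on which the pair differed by `degrees` or more."""
    total = sum(hist.values())
    far_apart = sum(n for d, n in hist.items() if d >= degrees)
    return far_apart / total


# Hypothetical paired codings for five documents.
pairs = [
    ("Strongly Agree", "Strongly Agree"),
    ("Agree", "Unsure"),
    ("Strongly Agree", "Not Relevant"),
    ("Unsure", "Not Relevant"),
    ("Agree", "Not Relevant"),
]
hist = disagreement_histogram(pairs)
print(dict(hist), share_at_least(hist, 2))
```
      </preformat>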
      <p>These findings indicate that it is unsafe to assume that, where reviewers disagree as to the
relevance of a document, the disagreement is most probably minor. Instead, disagreement can
be quite fundamental. For example, in the case of CC4, the reviewers disagreed by two degrees or
more on over 20% of the documents reviewed. It was not merely a matter of one reviewer thinking
that a document was probably relevant and the other that it was probably not. There was often
fundamental disagreement as to the correct way to assess a document against the criterion provided.</p>
    </sec>
    <sec id="sec-6">
      <title>4.4. Reviewer Disagreement Makes Recall Measures Unreliable</title>
      <p>In carrying out the comparison of the Grossman and Cormack CAL® system against the leading
commercial discovery TAR tool, we relied upon confusion-matrix tests<sup>2</sup> to assess the relative Recall
of both platforms for most of the ten CC criteria for which documents were reviewed. In the case of
the commercial discovery TAR tool, we also carried out Elusion tests in the manner set out by the
tool’s manufacturer in their support documents and training courses. Each Elusion test was carried
out twice for each CC: once with the test being completed by the same reviewer pair who carried out
the CAL review, and a second time with the test being completed by a different
reviewer pair from those who had carried out the CAL review. The results are set out in Table 5.</p>
      <p>Several results stand out. The Recall results for the leading commercial TAR tool are
usually substantially lower under confusion-matrix testing than in Elusion tests carried out both by
the review team and by independent reviewers (note that time and resources were only available to
carry out confusion-matrix tests for six of the ten CC reviews under both the Grossman and Cormack
CAL® system and the leading commercial TAR one). It appears that Recall results tend to be
overstated in Elusion testing for reasons described in Section 3 above.</p>
      <p><sup>2</sup> The formula for calculating Recall for the leading commercial TAR system under the confusion-matrix method discussed here was as
follows: the number of relevant documents in the sample for the commercial TAR tool, divided by the sum of the total number of relevant documents in the sample
for both the commercial and Grossman &amp; Cormack CAL® systems, the number of mislabelled relevant documents in the sample, and the
number of relevant documents in the unreviewed population.</p>
      <p>The formula for calculating Recall for the Grossman and Cormack system under the confusion-matrix method was the number of relevant
documents in the sample for CAL®, divided by the sum of the total number of relevant documents in the sample for both systems, the number of mislabelled
relevant documents in the sample, and the number of relevant documents in the unreviewed population.</p>
      <p>
        A detailed description of the method can be found in [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ].
      </p>
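      <p>The confusion-matrix Recall formula given in the footnote can be written as a short function. This sketch assumes that the three terms after the division sign are summed to form the denominator, and that “relevant documents in the sample for both systems” means relevant sample documents credited to either system. The parameter names and the figures in the example are hypothetical.</p>
      <preformat preformat-type="code">
```python
def confusion_matrix_recall(relevant_for_system, relevant_for_either,
                            mislabelled_relevant, relevant_unreviewed):
    """Relevant sample documents credited to one system, divided by the
    estimated total of relevant documents (assumed to be the sum of the
    three remaining terms in the footnote)."""
    denominator = (relevant_for_either + mislabelled_relevant
                   + relevant_unreviewed)
    return relevant_for_system / denominator


# Hypothetical sample counts: both systems share one denominator, so the
# two results are directly comparable.
shared_terms = dict(relevant_for_either=90, mislabelled_relevant=5,
                    relevant_unreviewed=5)
commercial = confusion_matrix_recall(60, **shared_terms)
cal = confusion_matrix_recall(80, **shared_terms)
print(commercial, cal)
```
      </preformat>
      <p>Because both systems are scored against the same denominator, the ratio between the two results does not depend on how many relevant documents remain unfound, which may be why this method suits relative comparison better than absolute validation.</p>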
      <p>Recall results produced by Elusion testing on the leading commercial TAR system were quite
inconsistent. It was by no means certain that using an independent Elusion test reviewer would
produce lower Recall figures. Some results produced were highly problematic. CC6 was, in fact, a
somewhat ambiguous review criterion which produced issues of review consistency throughout the
testing process. Nevertheless, both Elusion tests supposedly proved that the TAR results had achieved
almost perfect Recall. The real reason for the high Recall score was that, because the CC was
ambiguously defined, Elusion reviewers could justify marking practically every document in the
Elusion-testing sample as non-relevant, resulting in “perfect” Recall. This demonstrates a
fundamental and insurmountable problem in the use of Recall as a validation method for TAR-based
reviews.</p>
    </sec>
    <sec id="sec-7">
      <title>4.5. Potential Influence of Review Platform on Reviewer Behaviour</title>
      <p>As can be observed in Table 5, the Recall results produced using the Grossman and Cormack CAL®
system were substantially better than those obtained by the same review team using the
market-leading TAR platform when both systems were assessed in the same way. The Grossman and
Cormack CAL® system also obtained consistently higher review precision and numbers of responsive
documents (on average over 30% more responsive documents identified) despite the commercial TAR
system’s often high Elusion-testing-based Recall figures. This demonstrates that the CAL model
employed in carrying out a legal review can have a substantial impact on the results of that review.</p>
      <p>Another issue observed was that the commercial TAR tool, in training its CAL model, made use of
“uncertainty sampling.”<sup>3</sup> We are advised that the commercial TAR tool requires around 30% of
the documents reviewed in training to be non-relevant. We observed, however, that the use of
uncertainty sampling seemed to confuse reviewers and increase the chances of them losing track of
what constituted relevance. We had intended to follow up on this finding but have since been
informed that, in future releases of the commercial TAR tool, users will be able to switch uncertainty
sampling off manually at any point, usually when they predict that stabilisation<sup>4</sup> has occurred.
The impact this will have on the commercial TAR system’s ability to continue to learn and to predict
accurately for new, as yet unfound classes of documents remains unclear.</p>
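      <p>The difference in what reviewers are shown can be sketched as follows. This is a toy illustration only: how the commercial tool actually composes its training batches is not documented here, and the function names, the n_low parameter, and the scores are hypothetical. The second function follows the description of uncertainty sampling given in footnote 3 (inclusion of low-ranked documents), not the textbook definition.</p>

```python
def top_ranked_batch(scores: dict[str, float], k: int) -> list[str]:
    """Pick the k unreviewed documents with the highest model scores."""
    if k > len(scores):
        raise ValueError("batch larger than the unreviewed pool")
    return sorted(scores, key=scores.get, reverse=True)[:k]


def mixed_batch(scores: dict[str, float], k: int, n_low: int) -> list[str]:
    """Fill k - n_low slots from the top of the ranking and n_low from the bottom.

    n_low is a hypothetical knob standing in for whatever rule a real tool
    uses to inject low-ranked documents into the training set.
    """
    if k > len(scores) or n_low > k or 0 > n_low:
        raise ValueError("need 0 to k low-ranked slots and k no larger than the pool")
    ranked = sorted(scores, key=scores.get, reverse=True)
    high = ranked[:k - n_low]
    low = ranked[len(ranked) - n_low:]
    return high + low


# Invented scores for eight unreviewed documents.
scores = {"d1": 0.95, "d2": 0.90, "d3": 0.80, "d4": 0.60,
          "d5": 0.40, "d6": 0.20, "d7": 0.10, "d8": 0.05}

print(top_ranked_batch(scores, 4))
print(mixed_batch(scores, 4, 1))
```

      <p>In the mixed batch, a reviewer is deliberately handed a document the model considers very unlikely to be relevant, which is the kind of material we observed to unsettle reviewers' sense of the criterion.</p>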
      <p>Conversely, we observed that the Grossman and Cormack CAL® tool, which tends to provide
reviewers with document sets much richer in relevant material, seemed to increase the chances that
reviewers would mark equivocally relevant documents as relevant.</p>
      <p>Finally, the commercial TAR tool seemed to score ambiguous documents lower than the Grossman
and Cormack CAL® tool. Many documents identified and marked relevant in the Grossman and
Cormack CAL® review were never even seen in the review of the same documents carried out using
the leading commercial TAR tool. This to some degree explains the better precision and responsive
document retrieval rate observed using the Grossman and Cormack CAL® system.</p>
      <p><sup>3</sup> Uncertainty sampling, at least as it is defined in terms of the commercial tool tested in this research, involves the inclusion of low-ranked
documents in the TAR review set to allow the TAR model to accurately model the cut-off between relevant and non-relevant documents.
We believe that uncertainty sampling is required by the commercial tool because it bases its TAR model on Support Vector Machines,
which require the inclusion of relevant and non-relevant documents in training the model.</p>
      <p><sup>4</sup> In this context, stabilisation is reached at a point in training the model where the addition of further training data has little or no effect on
the rankings of documents within the model. It is a rather ill-defined concept.[ ]</p>
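      <p>Because stabilisation, as described in footnote 4, is ill-defined, any concrete test for it involves a choice. One possible operationalisation, offered purely as an assumption and not as the commercial tool's method, is to compare the document rankings before and after a training round using Spearman's rank correlation and to call the model stable once the correlation exceeds a chosen threshold. The function names, the threshold, and the scores below are hypothetical, and the sketch assumes no tied scores.</p>

```python
def spearman_rho(prev_scores: dict[str, float], new_scores: dict[str, float]) -> float:
    """Spearman rank correlation between two scorings of the same documents.

    Assumes no tied scores; ties would be ranked arbitrarily by document id.
    """
    docs = sorted(prev_scores)
    if not docs or sorted(new_scores) != docs:
        raise ValueError("both scorings must cover the same non-empty document set")
    if len(docs) == 1:
        return 1.0

    def ranks(scores: dict[str, float]) -> dict[str, int]:
        ordered = sorted(docs, key=scores.get, reverse=True)
        return {doc: rank for rank, doc in enumerate(ordered, start=1)}

    r_prev, r_new = ranks(prev_scores), ranks(new_scores)
    n = len(docs)
    d_squared = sum((r_prev[d] - r_new[d]) ** 2 for d in docs)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def is_stable(prev_scores: dict[str, float], new_scores: dict[str, float],
              threshold: float = 0.95) -> bool:
    """Hypothetical rule: stable when the rankings barely move between rounds."""
    return spearman_rho(prev_scores, new_scores) >= threshold


# Invented scores before and after one further round of training.
before = {"d1": 0.90, "d2": 0.70, "d3": 0.40, "d4": 0.10}
after = {"d1": 0.80, "d2": 0.75, "d3": 0.20, "d4": 0.30}

print(spearman_rho(before, after), is_stable(before, after))
```

      <p>A rule of this kind only measures whether the ranking has stopped moving; it says nothing about whether classes of relevant documents remain unfound.</p>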
    </sec>
    <sec id="sec-8">
      <title>5. Conclusions</title>
      <p>There remains a problem with the fundamental concept of “relevance,” certainly as it applies to
information retrieval in the legal sphere and probably more generally in the discipline. However
desirable one might be, there is unlikely to be any “gold standard” against which the relevance
of documents to specific criteria can be assessed. Human language is often imprecise, and it is the
nature of legal proceedings that the documents involved are often equivocal in meaning and
ambiguous in content. If there were no uncertainty in the nature of the documents involved, there
might not, after all, be any legal questions to be decided in the first place.</p>
      <p>In practice, documents can be fully relevant, probably relevant, or possibly relevant, and it is by
no means certain that even the best reviewer will review them in a manner consistent with another
competent reviewer. Reviewers draw the line between relevance and non-relevance in different
places and in different circumstances, and often disagree with one another far more fundamentally
than might be expected. The only occasion, it seems, when reviewers will keep a consistently
conservative view of what constitutes relevance is in completing Elusion tests, where there is
normally an incentive to find as few relevant documents as possible as this will maximise the Recall
figure.</p>
      <p>We have observed that reviewers:
        <list list-type="bullet">
          <list-item>
            <p>often lose track of what constitutes relevance while reviewing a document set. This error
seems to occur more often the longer a review continues. We found that having reviewers
review in pairs with 10% of crossover documents between them allowed us quickly to identify
reviewers losing track of relevance, but it could not stop those reviewers from drifting off the
relevance criteria.</p>
          </list-item>
          <list-item>
            <p>will also occasionally keep marking documents relevant in CAL reviews even when reviewers
on the same project have begun coding most of the documents provided to them as
non-relevant. This phenomenon seems to arise either because the reviewers have lost track of relevance
or because they have, during their review, developed a much broader sense of what is relevant
than their counterparts.</p>
          </list-item>
          <list-item>
            <p>often fundamentally disagree about relevance. It may be assumed that, when
reviewers disagree about how well a document aligns with a review criterion, the
judgments involved are quite subtle. We found that, in fact, for a significant number of decisions,
disagreements between reviewers are considerable and fundamental.</p>
          </list-item>
          <list-item>
            <p>are influenced by how their review platform works and is set up. In particular, it appears that
feeding low-scoring documents into a CAL review can significantly affect reviewers’ ability
to “stay on track” in the review.</p>
          </list-item>
          <list-item>
            <p>disagree with QA reviewers quite as much as they do with one another. This raises the
question of whether QA reviewers provide much additional value, particularly where other
QA measures, such as paired reviewers, are already in place.</p>
          </list-item>
        </list>
      </p>
      <p>
        And, while this is considered in much more detail in [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ], reviewers are, in practice, extremely
poor at categorizing documents. The more categories there are, the worse and slower they
get.
      </p>
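      <p>The paired-review check described above, in which two reviewers share a proportion of crossover documents, can be sketched as a simple agreement calculation. This is a minimal illustration assuming binary relevant/non-relevant coding; the function name and the example codings are hypothetical, and Cohen's kappa is used here as one conventional chance-corrected measure rather than as the statistic used in this project.</p>

```python
def overlap_agreement(codes_a: dict[str, bool],
                      codes_b: dict[str, bool]) -> tuple[int, float, float]:
    """Return (shared documents, raw agreement, Cohen's kappa) for two reviewers.

    Each dict maps a document id to True (relevant) or False (non-relevant).
    Only documents coded by both reviewers are compared.
    """
    shared = sorted(set(codes_a).intersection(codes_b))
    if not shared:
        raise ValueError("the reviewers share no documents")
    n = len(shared)
    agree = sum(codes_a[d] == codes_b[d] for d in shared) / n
    p_a = sum(codes_a[d] for d in shared) / n   # share coded relevant by A
    p_b = sum(codes_b[d] for d in shared) / n   # share coded relevant by B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = 1.0 if expected == 1 else (agree - expected) / (1 - expected)
    return n, agree, kappa


# Invented codings of ten crossover documents.
reviewer_a = {f"doc{i}": i in {1, 2, 3, 4, 5} for i in range(1, 11)}
reviewer_b = {f"doc{i}": i in {1, 2, 3, 6} for i in range(1, 11)}

n_shared, agreement, kappa = overlap_agreement(reviewer_a, reviewer_b)
print(n_shared, round(agreement, 2), round(kappa, 2))
```

      <p>Tracking such a figure per reviewer pair over the course of a review is one way to notice a reviewer drifting, although, as noted above, detecting drift does not by itself prevent it.</p>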
      <p>The findings described here are necessarily summary in nature. These are observations that were
collected in research primarily focused on other areas. Nevertheless, all of these observed behaviours
must necessarily have a significant bearing on the effectiveness and accuracy of any legal review
using TAR or, indeed, any information-retrieval process requiring the human review of documents.
There is an old saying in computer science: “garbage in, garbage out.” These findings suggest that
perhaps more attention should be paid to the training of information-retrieval systems so that garbage
does not creep in simply as a consequence of normal, unavoidable human behaviour.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Acknowledgements</title>
      <p>The authors would like to acknowledge the assistance of Martin Elliot, the Partners and staff of
Grant Thornton Ireland, and the partners and staff of the Maples Group in the research that led to
this paper.</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
      <p>[17] James Waldron, John Rabiej, “Technology Assisted Review (TAR) Guidelines”, January 2019,
EDRM, https://edrm.net/wp-content/uploads/2019/02/TAR-Guidelines-Final.pdf</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Roitblat</surname>
          </string-name>
          ,
          “<article-title>Search and information retrieval science</article-title>,”
          <source>Sedona Conf. J.</source>
          , vol.
          <volume>8</volume>
          , p.
          <fpage>225</fpage>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.R.</given-names>
            <surname>Baron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hedin</surname>
          </string-name>
          , et al.
          <article-title>Evaluation of information retrieval for E-discovery</article-title>
          .
          <source>Artif Intell Law</source>
          <volume>18</volume>
          ,
          <fpage>347</fpage>
          -
          <lpage>386</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1007/s10506-010-9093-9
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Evaluating information retrieval system performance based on user preference</article-title>
          .
          <source>J Intell Inf Syst</source>
          <volume>34</volume>
          ,
          <fpage>227</fpage>
          -
          <lpage>248</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1007/s10844-009-0096-5
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , “
          <article-title>Navigating Imprecision in Relevance Assessments on the Road to Total Recall: Roger and Me</article-title>
          ,” in
          <source>Proc. SIGIR '17</source>
          , Aug.
          <year>2017</year>
          , doi: 10.1145/3077136.3080812.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          , “
          <article-title>Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review</article-title>
          ,”
          <source>Richmond Journal of Law and Technology</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>11</fpage>
          , Jan.
          <year>2011</year>
          . [Online]. Available at: https://scholarship.richmond.edu/cgi/viewcontent.cgi?article=1344&amp;context=jolt
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Roitblat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kershaw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Oot</surname>
          </string-name>
          , “
          <article-title>Document categorization in legal electronic discovery: computer classification vs. manual review</article-title>
          ,”
          <source>Journal of the Association for Information Science and Technology</source>
          , vol.
          <volume>61</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>80</lpage>
          , Oct.
          <year>2009</year>
          , doi: 10.1002/asi.21233.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harbison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>O'Halloran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McManus</surname>
          </string-name>
          , “
          <article-title>Unbiased Validation of Technology-Assisted Review for eDiscovery</article-title>
          ,” in
          <source>Proc. SIGIR '24</source>
          , July
          <year>2024</year>
          , doi: 10.1145/3626772.3657903.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roegiest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Impact of surrogate assessments on high-recall retrieval</article-title>
          .
          <source>In Proc. SIGIR '15</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>McManus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>O'Halloran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harbison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Limitations of the Utility of Categorization in eDiscovery Review Efforts</article-title>
          . In: Li, S. (eds) Information Management.
          <source>ICIM 2024. Communications in Computer and Information Science</source>
          , vol
          <volume>2102</volume>
          . Springer, Cham. https://doi.org/10.1007/978-3-031-64359-0_24
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>O'Halloran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McManus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harbison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Comparison of Tools and Methods for Technology-Assisted Review</article-title>
          . In: Li, S. (eds) Information Management.
          <source>ICIM 2024. Communications in Computer and Information Science</source>
          , vol
          <volume>2102</volume>
          . Springer, Cham. https://doi.org/10.1007/978-3-031-64359-0_9
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          Government of Ireland,
          <source>Statutory Instrument No. 93/2009 - Rules of the Superior Courts (Discovery)</source>
          <year>2009</year>
          , https://www.irishstatutebook.ie/eli/2009/si/93/made/en/print
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          U.K. Department of Justice, Civil Procedure Rules,
          <source>Practice Direction 31A - Disclosure and Inspection</source>
          , https://www.justice.gov.uk/courts/procedure-rules/civil/rules/part31/pd_part31a
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Hedin</surname>
          </string-name>
          , “
          <article-title>Assessor error in stratified evaluation</article-title>
          ,” in
          <source>Proc. CIKM '10</source>
          , Oct.
          <year>2010</year>
          , doi: 10.1145/1871437.1871508.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Saracevic</surname>
          </string-name>
          , “
          <article-title>Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance</article-title>
          ,”
          <source>Journal of the Association for Information Science and Technology</source>
          , vol.
          <volume>58</volume>
          , no.
          <issue>13</issue>
          , pp.
          <fpage>1915</fpage>
          -
          <lpage>1933</lpage>
          , Jan.
          <year>2007</year>
          , doi: 10.1002/asi.20682.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Saracevic</surname>
          </string-name>
          , “
          <article-title>Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behaviour and effects of relevance</article-title>
          ,”
          <source>Journal of the Association for Information Science and Technology</source>
          , vol.
          <volume>58</volume>
          , no.
          <issue>13</issue>
          , pp.
          <fpage>2126</fpage>
          -
          <lpage>2144</lpage>
          , Jan.
          <year>2007</year>
          , doi: 10.1002/asi.20681.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          , “
          <article-title>Variations in relevance judgments and the measurement of retrieval effectiveness</article-title>
          ,”
          <source>Information Processing and Management</source>
          , vol.
          <volume>36</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>697</fpage>
          -
          <lpage>716</lpage>
          , Sep.
          <year>2000</year>
          , doi: 10.1016/s0306-4573(00)00010-8.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>