<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technology-Assisted Review in Empirical Medicine: Waterloo Participation in CLEF eHealth 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gordon V. Cormack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura R. Grossman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Screening articles for studies to include in systematic reviews is an application of technology-assisted review ("TAR"). In this work, we applied the Baseline Model Implementation ("BMI") from the TREC Total Recall Track (2015-2016) to the CLEF eHealth 2018 task of screening MEDLINE abstracts to identify articles reporting studies to be considered for inclusion. We employed exactly the same approach for Sub-Task 1 and Sub-Task 2, which was in turn exactly the same approach employed for the CLEF 2017 eHealth Lab. The only difference was that for Sub-Task 1, the entire PubMed/MEDLINE database was searched, whereas for Sub-Task 2, the only records searched were those identified by CLEF using Boolean queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Apparatus</title>
      <p>
        Task 2 is essentially the Technology-Assisted Review ("TAR") task addressed by
the TREC 2015 and TREC 2016 Total Recall Tracks [
        <xref ref-type="bibr" rid="ref12 ref9">12, 9</xref>
        ]. For our participation
in CLEF 2018, we reprised our Total Recall efforts, and also our efforts from
CLEF 2017 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], using the same apparatus.
      </p>
      <p>At TREC, the systems under test were given, at the outset, a corpus of
documents and a set of topics. For each topic, a system under test repeatedly
submitted documents from the corpus to a server, and in return, was given a
simulated human assessment of "relevant" or "not relevant" for each document.</p>
      <p>The objective was to identify as many relevant documents as possible, while
submitting as few non-relevant documents as possible. The tension between these
two criteria was evaluated using rank-based measures (e.g., recall as a function of
the number of documents submitted), as well as set-based measures (e.g., recall
at a point when a certain number of documents, specified contemporaneously by
the system, had been submitted).</p>
      <p>
        Prior to TREC, we made available a Baseline Model Implementation
("BMI"),<sup>1</sup> to illustrate the client-server protocol, as well as to provide
baseline results for comparison. BMI, which encapsulates our AutoTAR Continuous
Active Learning ("CAL") method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], yielded rank-based results that compared
favorably with all systems under test. During the course of our participation in
TREC, we developed and tested the "knee method" stopping procedure [
        <xref ref-type="bibr" rid="ref2 ref3 ref5">3, 2, 5</xref>
        ],
with the purpose of achieving high recall with high probability.
      </p>
      <p>
        Sub-Task 2, which was the only task for CLEF 2017, differed operationally
from the TREC Total Recall Track in that a list of document identifiers, rather
than a corpus, was supplied at the outset, and a complete set of relevance
assessments, rather than an assessment server, was used to simulate human
assessments. Sub-Task 2 also differed substantively from the Total Recall Track in
that the corpus for each topic was narrowed by a search phase specific to that
topic, and therefore yielded a much smaller set that was richer in relevant
documents. Sub-Task 2 differed further in that two sets of relevance assessments were
available: the assessments from a previously conducted screening phase, and the
assessments from a previously conducted selection phase, raising the question of
which assessments (or combination of assessments) should be used to simulate
relevance feedback, and which should be used to evaluate the results (cf. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>Sub-Task 1, new for CLEF 2018, resembles the TREC Total Recall Track,
in that no topic-specific culling of the document set is done; each search applies
to the entire 30M-document PubMed/MEDLINE collection.</p>
      <sec id="sec-2-1">
        <title>1 Available under the GNU General Public License at http://cormack.uwaterloo.ca/trecvm.</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Training and Configuration</title>
      <sec id="sec-3-1">
        <title>Document Corpora</title>
        <p>The corpus for each topic consisted of abstracts from MEDLINE/PubMed<sup>2</sup>
identified by PMID. On April 1, 2018, we fetched the entire MEDLINE dataset
consisting of 28,256,688 XML files, each containing the title, abstract, and
metadata for an article. We used the raw XML files as documents in the corpora
that were supplied at the outset to BMI.</p>
        <p>For Sub-Task 1, we applied BMI to the entire corpus of 28,256,688 files, thus
combining the search and screening phases. In a pilot experiment on the test
topics, we found that no assessments were available for many, if not most, of
the highly ranked documents returned by BMI. To our eye, these documents
were indistinguishable from those for which "relevant" assessments were
provided. We investigated, without success, the reasons why these documents were
not retrieved by the previously conducted search phase. For example, the
documents in question were neither newer nor older than those for which assessments
were available, and appeared to contain relevant terms from the search query.
Nonetheless, for Sub-Task 1, we treated any document for which no qrel was
available as "not relevant" for the purpose of feedback.</p>
        <p>In a separate manual run, the authors used their own judgement to assess the
relevance of abstracts returned by BMI, in order to provide relevance feedback.</p>
        <p>For Sub-Task 2, we used a common corpus consisting of all documents that
were assessed for any of the 30 test topics. That is, for any given topic, the
corpus consisted of all the documents assessed for that topic, as well as all
the documents assessed for each of the other 29 topics. Our rationale was that
including documents retrieved for all topics would introduce enough diversity
to sufficiently unskew the term-frequency statistics. This approach appeared to
achieve the efficiency of using reduced corpora and the effectiveness of using the
full dataset, and was chosen for our official tests. For any given topic, a document
in the common corpus that had not been assessed for that topic was considered "not relevant."</p>
      </sec>
      <sec id="sec-3-2">
        <title>Relevance Feedback</title>
        <p>For Sub-Task 2, we used two modes of relevance feedback:</p>
        <list list-type="order">
          <list-item>
            <p>Relevance feedback based on the screening-phase assessments (Method UWA.Task2 for official testing);</p>
          </list-item>
          <list-item>
            <p>Relevance feedback based on a hybrid of screening-phase and selection-phase assessments (Method UWB.Task2 for official testing).</p>
          </list-item>
        </list>
        <p>The first method is straightforward: When BMI identifies a document for
assessment, the judgment returned to BMI is that supplied by CLEF for
the screening phase (the "abstract qrels").</p>
        <sec id="sec-3-2-1">
          <title>2 See https://www.nlm.nih.gov/bsd/pmresources.html.</title>
          <p>The second method operates in two phases: At the outset, the judgment returned to BMI is that of the abstract qrels.
The abstract qrels continue to be used until BMI identifies one document that
is relevant not only according to the abstract qrels, but also according to the
selection-phase assessments (the "content qrels"). Thereafter, the judgment returned
to BMI is that of the content qrels.</p>
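          <p>The two-phase feedback of the second method can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the official runs: the function name <monospace>make_hybrid_feedback</monospace>, the dictionary representation of the qrels, and the treatment of a missing qrel as "not relevant" are ours for the purpose of the example.</p>
          <preformat>
```python
from typing import Callable, Dict


def make_hybrid_feedback(abstract_qrels: Dict[str, bool],
                         content_qrels: Dict[str, bool]) -> Callable[[str], bool]:
    """Build a judge(pmid) function simulating the hybrid feedback mode.

    Both arguments map a document identifier to True ("relevant") or
    False ("not relevant"). The dictionary layout is an assumption made
    for this sketch.
    """
    state = {"switched": False}

    def judge(pmid: str) -> bool:
        # A document with no qrel is treated as "not relevant".
        in_abstract = abstract_qrels.get(pmid, False)
        in_content = content_qrels.get(pmid, False)
        if state["switched"]:
            # Second phase: answer from the content qrels only.
            return in_content
        # First phase: answer from the abstract qrels, and switch once a
        # document is relevant under both sets of qrels.
        if in_abstract and in_content:
            state["switched"] = True
        return in_abstract

    return judge
```
          </preformat>
          <p>Each call to <monospace>judge</monospace> corresponds to one document submitted for assessment, so the order of calls determines when the switch to the content qrels takes place.</p>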
          <p>For Sub-Task 1, we used three modes of relevance feedback:</p>
          <list list-type="order">
            <list-item>
              <p>Relevance feedback based on the screening-phase assessments (Method UWA.Task1 for official testing);</p>
            </list-item>
            <list-item>
              <p>Manual feedback based on the authors' relevance assessments (Method UWG.Task1 for official testing);</p>
            </list-item>
            <list-item>
              <p>Manual feedback based on the authors' positive relevance assessments, followed by relevance feedback based on the screening-phase assessments (Method UWX.Task1 for official testing).</p>
            </list-item>
          </list>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Stopping Criterion</title>
        <p>
          For threshold-based evaluation, it was necessary to implement a stopping
procedure to terminate screening when the best compromise between recall and effort
had been achieved, for some definition of "best." In our opinion,
technology-assisted review should be considered a satisfactory alternative to manual
review only if it yields comparable or superior recall, with high probability.
Toward this end, we deployed our knee method with default parameters (<inline-formula><tex-math>\rho = 156 - \min(\mathit{relret}, 150)</tex-math></inline-formula>, <inline-formula><tex-math>\beta = 100</tex-math></inline-formula> [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), which interprets a sharp fall-off in the
slope of the gain curve (recall vs. review effort) as evidence that substantially
all relevant documents have been identified.
        </p>
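        <p>As a rough illustration of how such a test might be evaluated, the sketch below assumes that the default parameters read rho = 156 - min(relret, 150) and beta = 100, that rho is a lower bound on the ratio of the gain-curve slope before a candidate knee to the slope after it, and that beta is a minimum number of documents screened. These readings, the add-one smoothing of the later slope, and all function names are assumptions made for the example; the definitive procedure, including how the candidate knee is located, is the one given in [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
        <preformat>
```python
from typing import Sequence


def knee_threshold(relret: int) -> int:
    """Assumed slope-ratio bound: rho = 156 - min(relret, 150)."""
    return 156 - min(relret, 150)


def slope_ratio(gain: Sequence[int], knee: int) -> float:
    """Ratio of the gain-curve slope before `knee` to the slope after it.

    gain[k] is the number of relevant documents found after k documents
    have been screened (so gain[0] == 0). The +1 in the later slope is an
    assumed smoothing term that avoids division by zero.
    """
    total = len(gain) - 1  # documents screened so far
    if knee >= total or 0 >= knee:
        raise ValueError("knee must lie strictly inside the gain curve")
    before = gain[knee] / knee
    after = (gain[total] - gain[knee] + 1) / (total - knee)
    return before / after


def should_stop(gain: Sequence[int], knee: int, beta: int = 100) -> bool:
    """Stop once at least `beta` documents are screened and the slope ratio
    at the candidate knee reaches the bound for the current relret."""
    screened = len(gain) - 1
    relret = gain[-1]
    return screened >= beta and slope_ratio(gain, knee) >= knee_threshold(relret)
```
        </preformat>
        <p>Under these assumptions the bound relaxes as more relevant documents are found: with 20 relevant documents retrieved the required ratio is 136, whereas with 150 or more it is 6.</p>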
      </sec>
    </sec>
    <sec id="sec-4">
      <title>AutoTAR</title>
      <p>
        In 2015, we published the details and rationale for AutoTAR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which remains,
to this date, the most effective TAR method of which we are aware. BMI
implements AutoTAR exactly as described above, except for the substitution of
Sofia-ML logistic regression in place of SVMlight (see [<xref ref-type="bibr" rid="ref4">4</xref>, Section 3.1]). It has
no dataset- or topic-specific tuning parameters; except for modifications to
incorporate the CLEF corpora and relevance assessments, and our knee-method
stopping procedure, we used BMI "out of the box."
      </p>
      <p>
        The AutoTAR/BMI algorithm, as modified for CLEF, is detailed in
Algorithm 1, which is reproduced from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with the following changes:
        <list list-type="bullet">
          <list-item>
            <p>In Step 1, AutoTAR gives the option of starting with a relevant document, or with a synthetic document. Here, we used a synthetic document consisting of the title of the topic, and nothing else.</p>
          </list-item>
          <list-item>
            <p>In Step 7, we introduced two different ways to simulate user feedback, corresponding to Method A and Method B, described above in Section 3.2.</p>
          </list-item>
          <list-item>
            <p>In Step 10, we introduced the option to terminate the process when the knee-method stopping criterion was met.</p>
          </list-item>
        </list>
      </p>
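      <p>The control flow of Algorithm 1, with the three changes just listed, can be sketched in a few lines. The sketch is illustrative only: the function names are hypothetical, and <monospace>mean_difference_scorer</monospace> is a deliberately simple stand-in for the logistic-regression step so that the example is self-contained. It is not the learner used by BMI.</p>
      <preformat>
```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Training = List[Tuple[str, bool]]


def mean_difference_scorer(train: Training, docs: Dict[str, str]) -> Dict[str, float]:
    """Stand-in ranker (NOT logistic regression): a term's weight is the share
    of relevant training texts containing it minus the share of non-relevant
    training texts containing it; a document's score sums its term weights."""
    rel = [set(t.lower().split()) for t, label in train if label]
    non = [set(t.lower().split()) for t, label in train if not label]
    weights: Counter = Counter()
    for toks in rel:
        for tok in toks:
            weights[tok] += 1 / len(rel)
    for toks in non:
        for tok in toks:
            weights[tok] -= 1 / len(non)
    return {d: sum(weights[tok] for tok in set(text.lower().split()))
            for d, text in docs.items()}


def cal(topic_title: str,
        docs: Dict[str, str],
        judge: Callable[[str], bool],
        scorer: Callable[[Training, Dict[str, str]], Dict[str, float]] = mean_difference_scorer,
        should_stop: Optional[Callable[[List[str], Dict[str, bool]], bool]] = None,
        seed: int = 0) -> List[str]:
    """Return document ids in the order they were screened."""
    rng = random.Random(seed)
    ids = sorted(docs)
    train: Training = [(topic_title, True)]   # step 1: synthetic seed document
    batch = 1                                  # step 2
    labels: Dict[str, bool] = {}
    screened: List[str] = []
    while len(docs) > len(screened):           # step 10(a)
        # Step 3: 100 random documents, provisionally "not relevant".
        sample = rng.sample(ids, min(100, len(ids)))
        augmented = train + [(docs[d], False) for d in sample]
        scores = scorer(augmented, docs)       # step 4
        # Step 5: `augmented` is discarded; `train` never held the sample.
        pool = [d for d in ids if d not in labels]
        pool.sort(key=lambda d: scores[d], reverse=True)
        for d in pool[:batch]:                 # step 6
            labels[d] = judge(d)               # step 7: simulated feedback
            train.append((docs[d], labels[d])) # step 8
            screened.append(d)
        batch += math.ceil(batch / 10)         # step 9
        if should_stop is not None and should_stop(screened, labels):
            break                              # step 10(b)
    return screened
```
      </preformat>
      <p>The <monospace>judge</monospace> argument is where the simulated feedback of Step 7 plugs in, and <monospace>should_stop</monospace> is where a stopping criterion such as the knee method would be consulted after each batch.</p>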
      <p>Algorithm 1. The AutoTAR Continuous Active Learning ("CAL") Method,
as implemented by the TREC Baseline Model Implementation ("BMI") and
deployed by Waterloo for the CLEF Technologically Assisted Review Task.</p>
      <list list-type="order">
        <list-item>
          <p>The initial training set consists of a synthetic document containing only the topic title, labeled as "relevant."</p>
        </list-item>
        <list-item>
          <p>Set the initial batch size B to 1.</p>
        </list-item>
        <list-item>
          <p>Temporarily augment the training set by adding 100 random documents from the collection, provisionally labeled as "not relevant."</p>
        </list-item>
        <list-item>
          <p>Apply logistic regression to the training set.</p>
        </list-item>
        <list-item>
          <p>Remove the random documents added in step 3.</p>
        </list-item>
        <list-item>
          <p>Select the highest-scoring B documents that have not yet been screened.</p>
        </list-item>
        <list-item>
          <p>Label each of the B documents as "relevant" or "not relevant" by consulting:</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>Previous "abstract" assessments supplied by CLEF [Method A]; or,</p>
            </list-item>
            <list-item>
              <p>Previous "document" assessments, once the first "relevant" document assessment is encountered [Method B].</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Add the labeled documents to the training set.</p>
        </list-item>
        <list-item>
          <p>Increase B by <inline-formula><tex-math>\lceil B/10 \rceil</tex-math></inline-formula>.</p>
        </list-item>
        <list-item>
          <p>Repeat steps 3 through 9 until either:</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>All documents have been screened [for ranked evaluation]; or,</p>
            </list-item>
            <list-item>
              <p>The "knee-method" stopping criterion is met [for threshold evaluation].</p>
            </list-item>
          </list>
        </list-item>
      </list>
      <p>Internally, BMI constructs a normalized TF-IDF (<inline-formula><tex-math>(1 + \log \mathit{tf}) \cdot \log \frac{N}{\mathit{df}}</tex-math></inline-formula>)
word-vector representation of each document in the corpus (which, as noted in
Section 3.1, consists of raw XML files), where a word is considered to be any
sequence of two or more alphanumeric characters not containing a digit, that
occurs at least twice in the corpus. Scoring is effected by Sofia-ML<sup>3</sup> with
parameters "--learner_type logreg-pegasos --loop_type roc --lambda 0.0001
--iterations 200000." As noted above, these parameters were fixed when BMI
was created in 2015.</p>
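      <p>The word-vector construction described above can be illustrated with a short sketch. Several details are assumptions made for the example rather than statements about BMI: tokens are lower-cased, logarithms are natural, "occurs at least twice in the corpus" is read as a total of at least two occurrences, and "normalized" is read as scaling each vector to unit Euclidean length. The function names are hypothetical.</p>
      <preformat>
```python
import math
import re
from collections import Counter
from typing import Dict, List

ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Keep alphanumeric runs of two or more characters containing no digit."""
    return [t.lower() for t in ALNUM_RUN.findall(text)
            if len(t) >= 2 and not any(c.isdigit() for c in t)]


def tfidf_vectors(corpus: List[str]) -> List[Dict[str, float]]:
    """Return one unit-length TF-IDF vector (term -> weight) per document."""
    term_counts = [Counter(tokenize(text)) for text in corpus]
    total: Counter = Counter()   # occurrences of each term in the corpus
    df: Counter = Counter()      # number of documents containing each term
    for tf in term_counts:
        total.update(tf)
        df.update(tf.keys())
    vocab = {t for t, n in total.items() if n >= 2}
    n_docs = len(term_counts)
    vectors = []
    for tf in term_counts:
        # Weight: (1 + log tf) * log(N / df), restricted to the vocabulary.
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
               for t, c in tf.items() if t in vocab}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```
      </preformat>
      <p>Because the documents are raw XML, element and attribute names that satisfy the token rule would enter the vocabulary alongside the words of the title and abstract.</p>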
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <sec id="sec-5-1">
        <title>3 See https://github.com/glycerine/sofia-ml.</title>
        <p>the review: The screening assessments are available only for documents retrieved
by the search phase; the selection assessments are available only for documents
retrieved by the search phase, and judged relevant during the screening phase.
Therefore, from the assessments, it is impossible to determine whether an article
not retrieved by the search phase, or an article eliminated during the screening
phase, describes a study that should have been included in the review. The
CLEF architecture tacitly assumes that no such articles exist; in other words,
that the search and screening phases used to generate the relevance assessments
were infallible, and each attained 100% recall.</p>
        <p>
          Such an assumption is unrealistic, and limits the recall of any simulated TAR
method to that of the manual review to which it is compared [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. As noted in
the Cochrane Handbook [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] with regard to the search phase: "[T]here comes a
point where the rewards of further searching may not be worth the effort required
to identify the additional references." And with regard to the screening phase:
"Using at least two authors may reduce the possibility that relevant reports will
be discarded (Edwards 2002 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ])."
        </p>
        <p>
          Our hypothesis that our TAR runs found relevant articles that were missed
by the search phase, or incorrectly discarded in the screening phase, is based
on results from other domains [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where TAR acting as a "second assessor"
was able to identify potentially relevant documents that had been judged
"non-relevant" by a human assessor. When we applied Method A to the 30 topics, it
identified 9,250 potentially relevant articles for which the abstract qrel was "not
relevant." Acquiring a second opinion on each of these documents would increase
the cost of the TAR review by approximately 12%, and would, we believe, yield
a substantial number of relevant documents, over and above the 670 identified
in the abstract qrels.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Autonomy and reliability of continuous active learning for technology-assisted review</article-title>
          .
          <source>arXiv preprint arXiv:1504.06868</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Waterloo (Cormack) participation in the TREC 2015 Total Recall Track</article-title>
          .
          <source>In Proceedings of The Twenty-Fourth Text REtrieval Conference</source>
          , TREC 2015, Gaithersburg, Maryland, USA, November
          <volume>17</volume>
          -
          <issue>20</issue>
          ,
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Engineering quality and reliability in technology-assisted review</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2016</year>
          , Pisa, Italy,
          <source>July 17-21</source>
          ,
          <year>2016</year>
          , pages
          <fpage>75</fpage>
          –
          <fpage>84</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Scalability of continuous active learning for reliable high-recall text classification</article-title>
          .
          <source>In Proceedings of the 25th ACM International Conference on Information and Knowledge Management</source>
          ,
          <string-name>
            <surname>CIKM</surname>
          </string-name>
          <year>2016</year>
          ,
          Indianapolis
          , IN, USA, October
          <volume>24</volume>
          -
          <issue>28</issue>
          ,
          <year>2016</year>
          , pages
          <fpage>1039</fpage>
          –
          <fpage>1048</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>"When to stop" Waterloo (Cormack) participation in the TREC 2016 Total Recall Track</article-title>
          .
          <source>In Proceedings of The Twenty-Fifth Text REtrieval Conference</source>
          , TREC 2016, Gaithersburg, Maryland, USA, November
          <volume>15</volume>
          -
          <issue>18</issue>
          ,
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Navigating imprecision in relevance assessments on the road to total recall: Roger and me</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2017</year>
          , Tokyo, Japan,
          <source>August</source>
          <volume>7</volume>
          -
          <issue>11</issue>
          ,
          <year>2017</year>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Technology-assisted review in empirical medicine: Waterloo participation in CLEF eHealth 2017</article-title>
          . Working Notes of CLEF, pages
          <volume>11</volume>
          –
          <fpage>14</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , C. DiGuiseppi, S. Pratap,
          <string-name>
            <surname>I. Roberts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wentz</surname>
          </string-name>
          .
          <article-title>Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records</article-title>
          . Statistics in Medicine,
          <volume>21</volume>
          (
          <issue>11</issue>
          ):
          <volume>1635</volume>
          –
          <fpage>1640</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Roegiest</surname>
          </string-name>
          .
          <article-title>TREC 2016 Total Recall Track overview</article-title>
          .
          <source>In Proceedings of The Twenty-Fifth Text REtrieval Conference</source>
          , TREC 2016, Gaithersburg, Maryland, USA, November
          <volume>15</volume>
          -
          <issue>18</issue>
          ,
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Higgins</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Green</surname>
          </string-name>
          .
          <article-title>Cochrane handbook for systematic reviews of interventions</article-title>
          , volume
          <volume>4</volume>
          . John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. E. Kanoulas,
          <string-name>
            <given-names>R.</given-names>
            <surname>Spijker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Azzopardi.</surname>
          </string-name>
          <article-title>CLEF technologically assisted reviews in empirical medicine overview</article-title>
          .
          <source>In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Roegiest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          .
          <article-title>TREC 2015 total recall track overview</article-title>
          .
          <source>In Proceedings of The Twenty-Fourth Text REtrieval Conference</source>
          , TREC 2015, Gaithersburg, Maryland, USA, November
          <volume>17</volume>
          -
          <issue>20</issue>
          ,
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. H.
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Ramadier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>J. R. M.</given-names>
          </string-name>
          <string-name>
            <surname>Palotti</surname>
            , and
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth evaluation lab 2018</article-title>
          .
          <source>In CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science</source>
          . Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>