<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technology-Assisted Review in Empirical Medicine: Waterloo Participation in CLEF eHealth 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gordon V. Cormack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura R. Grossman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
<p>[…] abstracts should be examined for each topic. According to threshold-based evaluation, the knee method identified every article that should have been included (100% recall), while examining 2,659 abstracts, on average, per topic: 72.8% of the 3,655 abstracts that would have required examination, on average, had a manual approach been used instead. While our results suggest that TAR can substantially improve the efficiency of abstract screening without compromising recall, there remains room for improvement in both the ranking and the stopping criterion, as well as in important factors that were not addressed in the CLEF eHealth 2017 framework: the completeness of the universe of abstracts gathered using keyword search, and the accuracy of the human assessments of the collected abstracts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The University of Waterloo participated in Task 2, Technologically Assisted
Reviews in Empirical Medicine [10], of the CLEF 2017 eHealth Evaluation Lab [12].
Task 2 simulates the second phase (screening) in a prototypical three-phase
workflow to identify studies for inclusion in a systematic review:
1. Search: First, Boolean queries are used to identify as many articles as possible
that may describe studies that should be included;
2. Screening: Second, titles and abstracts of the articles identified in the search
phase are examined to eliminate those that could not possibly describe
studies that should be included; and
3. Selection: Finally, articles that survive the screening phase are read in full to
determine whether or not they meet the systematic review inclusion criteria.
The overall objective of our research is to improve the human efficiency, as well
as the effectiveness, of workflows to identify studies for inclusion in systematic
reviews. The results of our CLEF experiments support the hypothesis that
continuous active learning ("CAL") can substantially improve the human efficiency
of screening, without substantially compromising its effectiveness. The results
are also consistent with the further hypothesis that CAL actually improves
effectiveness by identifying articles missed in the search phase, or articles mistakenly
eliminated during the screening phase. While this hypothesis cannot be tested
immediately within the framework of Task 2, we have identified a set of articles
that, were it determined that they describe one or more studies that should have
been included in the review, would demonstrate CAL's superior effectiveness.
Task 2 is essentially the Technology-Assisted Review ("TAR") task addressed by
the TREC 2015 and TREC 2016 Total Recall Tracks [11, 8]. For our participation
in CLEF, we reprised our Total Recall efforts using the same apparatus.</p>
<p>At TREC, the systems under test were given, at the outset, a corpus of
documents and a set of topics. For each topic, a system under test repeatedly
submitted documents from the corpus to a server, and in return, was given a
simulated human assessment of "relevant" or "not relevant" for each document.</p>
<p>The objective was to identify as many relevant documents as possible, while
submitting as few non-relevant documents as possible. The tension between these
two criteria was evaluated using rank-based measures (e.g., recall as a function of
the number of documents submitted), as well as set-based measures (e.g., recall
at a point when a certain number of documents, specified contemporaneously by
the system, had been submitted).</p>
      <p>Algorithm 1: The AutoTAR Continuous Active Learning ("CAL") method,
as implemented by the TREC Baseline Model Implementation ("BMI") and
deployed by Waterloo for the CLEF Technologically Assisted Review Task.
1. The initial training set consists of a synthetic document containing only the topic
title, labeled as "relevant."
2. Set the initial batch size B to 1.
3. Temporarily augment the training set by adding 100 random documents from the
collection, provisionally labeled as "not relevant."
4. Apply logistic regression to the training set.
5. Remove the random documents added in step 3.
6. Select the highest-scoring B documents that have not yet been screened.
7. Label each of the B documents as "relevant" or "not relevant" by consulting:
(a) previous "abstract" assessments supplied by CLEF [Method A]; or
(b) previous "document" assessments, once the first "relevant" document
assessment is encountered [Method B].
8. Add the labeled documents to the training set.
9. Increase B by ⌈B/10⌉.
10. Repeat steps 3 through 10 until either:
(a) all documents have been screened [for ranked evaluation]; or
(b) the "knee-method" stopping criterion is met [for threshold evaluation].</p>
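<p>The loop of Algorithm 1 can be sketched as follows. This is a minimal illustration, not BMI itself: the function names (score, assess, stop) are placeholders for Sofia-ML logistic regression, the simulated reviewer, and the stopping criterion, respectively.</p>

```python
import math
import random

def autotar(corpus, seed_doc, score, assess, stop):
    """Sketch of the AutoTAR CAL loop (Algorithm 1). `score(train, labels, docs)`
    stands in for Sofia-ML logistic regression; `assess(doc)` simulates the human
    reviewer; `stop(screened)` is the stopping criterion. Illustrative only."""
    training, labels = [seed_doc], [1]          # step 1: synthetic relevant seed
    batch = 1                                   # step 2
    screened = []                               # (doc, label) pairs, in order
    remaining = set(range(len(corpus)))
    while remaining:
        # steps 3-5: temporarily add 100 random docs as provisional negatives
        rem = sorted(remaining)
        sample = random.sample(rem, min(100, len(rem)))
        train = training + [corpus[i] for i in sample]
        lbls = labels + [0] * len(sample)
        scores = score(train, lbls, [corpus[i] for i in rem])
        # step 6: highest-scoring unscreened documents
        ranked = sorted(zip(rem, scores), key=lambda p: -p[1])
        for i, _ in ranked[:batch]:
            label = assess(corpus[i])           # step 7: simulated feedback
            screened.append((corpus[i], label))
            training.append(corpus[i])          # step 8
            labels.append(label)
            remaining.discard(i)
        batch += math.ceil(batch / 10)          # step 9: exponential growth
        if stop(screened):                      # step 10(b)
            break
    return screened
```

<p>With a toy scorer, the loop screens every document and records the simulated labels in the order reviewed.</p>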
      <p>
Prior to TREC, we made available a Baseline Model Implementation
("BMI"),1 to illustrate the client-server protocol, as well as to provide
baseline results for comparison. BMI, which encapsulates our AutoTAR Continuous
Active Learning ("CAL") method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], yielded rank-based results that compared
favorably with all systems under test. During the course of our participation in
TREC, we developed and tested the "knee method" stopping procedure [
        <xref ref-type="bibr" rid="ref2 ref3 ref5">3, 2, 5</xref>
        ],
with the purpose of achieving high recall with high probability.
      </p>
      <p>
Task 2 differed operationally from the TREC Total Recall Track in that a
list of document identifiers, rather than a corpus, was supplied at the outset,
and a complete set of relevance assessments, rather than an assessment server,
was used to simulate human assessments. Task 2 also differed substantively
from the Total Recall Track in that the corpus for each topic was narrowed
by a search phase specific to that topic, and therefore yielded a much smaller
set that was richer in relevant documents. Task 2 differed further in that two
sets of relevance assessments were available: the assessments from a previously
conducted screening phase, and the assessments from a previously conducted
selection phase, raising the question of which assessments (or combination of
assessments) should be used to simulate relevance feedback, and which should
be used to evaluate the results (cf. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
<p>Task 2 provides no method equivalent to TREC's "call your shot," by which a
system under test may specify a stopping criterion (for threshold-based evaluation)
while at the same time continuing until every document in the corpus has been
submitted for assessment (for rank-based evaluation).</p>
<p>Task 2, however, unlike TREC, afforded participants the opportunity to
conduct task-specific tuning and configuration, by supplying 20 training topics (with
corresponding corpora and assessments) in advance of the exercise, followed by
30 test topics, which were used for evaluation.</p>
    </sec>
    <sec id="sec-2">
<title>Training and Configuration</title>
      <sec id="sec-2-1">
        <title>Document Corpora</title>
<p>The corpus for each topic consisted of abstracts from MEDLINE/PubMed2
identified by PMID. On March 8, 2017, we fetched the entire MEDLINE dataset,
consisting of 27,348,935 XML files, each containing the title, abstract, and
metadata for an article. We used the raw XML files as documents in the corpora
that were supplied at the outset to BMI.</p>
        <sec id="sec-2-1-1">
<title>Footnotes</title>
          <p>1 Available under GNU General Public License at http://cormack.uwaterloo.ca/trecvm.
2 See https://www.nlm.nih.gov/bsd/pmresources.html.</p>
<p>Our original intent had been to apply BMI to the entire corpus of 27,348,935
files, thus combining the search and screening phases. When we employed this
strategy in a pilot experiment on the training topics, we found that no assessments
were available for many, if not most, of the highly ranked documents returned
by BMI. To our eye, these documents were indistinguishable from those for
which "relevant" assessments were provided. We investigated, without success,
the reasons why these documents were not retrieved by the previously conducted
search phase. For example, the documents in question were neither newer nor
older than those for which assessments were available, and appeared to contain
relevant terms from the search query. As we were unable to reproduce the results
of the CLEF search phase, we chose to ignore, for the purpose of relevance
feedback and evaluation, documents for which no assessments were available.
Ignoring these unjudged documents, our pilot experiment yielded what appeared
to be reasonable rank-based results.</p>
          <p>Ignoring documents for feedback and evaluation yields a substantially
different result from removing them from the corpus altogether. In a second pilot
experiment, we constructed a separate corpus for each topic, consisting of only
those documents for which relevance assessments were available. While BMI ran
much faster on these reduced corpora than on the 27M dataset, results were
apparently inferior. We conjecture that this inferior result can be explained by
skewed term-frequency statistics in the reduced corpora.</p>
<p>As a compromise between the effectiveness of searching the 27M dataset and
the (computational) efficiency of searching the reduced corpora, we conducted
a third pilot experiment using a common corpus consisting of all documents
that were assessed for any of the 20 training topics. That is, for any given topic,
the corpus consisted of all the documents assessed for that topic, as well as all
the documents assessed for each of the other 19 topics. Our rationale was that
including documents retrieved for all topics would introduce enough diversity
to sufficiently unskew the term-frequency statistics. This approach appeared to
achieve the efficiency of using reduced corpora and the effectiveness of using
the full dataset, and was chosen for our official tests: For the official tests, the
corpus consisted of all documents assessed for any of the 30 test topics (less four
documents whose PMIDs were not present in our MEDLINE database); from
this corpus, we submitted and solicited feedback only for documents for which
assessments were available.</p>
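<p>The common-corpus construction can be sketched as follows, with qrels represented as plain dicts keyed by topic; all names here are illustrative, not the actual CLEF data format.</p>

```python
def build_common_corpus(qrels_by_topic):
    """Union of all assessed PMIDs across topics: for any single topic, the
    retrieval corpus is this shared pool, which preserves diverse term-frequency
    statistics. Sketch of the compromise described above."""
    common = set()
    for assessments in qrels_by_topic.values():
        common.update(assessments)
    return common

def assessable(topic, pmid, qrels_by_topic):
    """Only documents with an assessment for the current topic are submitted
    for feedback and counted in evaluation; all others are ignored."""
    return pmid in qrels_by_topic[topic]
```

<p>For example, a document assessed only for another topic is part of every topic's corpus but never submitted for feedback.</p>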
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Relevance Feedback</title>
<p>We investigated three modes of relevance feedback, of which only two were
selected for official testing:
1. Relevance feedback based on the screening-phase assessments (selected as
Method A for official testing);
2. Relevance feedback based on the selection-phase assessments (not selected
for official testing);
3. Relevance feedback based on a hybrid of screening-phase and selection-phase
assessments (selected as Method B for official testing).</p>
<p>The first and second methods are straightforward: When BMI identifies a
document for assessment, the judgment returned to BMI is that supplied by CLEF
for either the screening phase (the "abstract qrels") or the selection phase (the
"content qrels"). The third method operates in two phases: At the outset, the
judgment returned to BMI is that of the abstract qrels. The abstract qrels
continue to be used until BMI identifies one document that is relevant not only
according to the abstract qrels, but also according to the content qrels.
Thereafter, the judgment returned to BMI is that of the content qrels.</p>
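<p>The two-phase feedback of Method B can be sketched as a stateful oracle, with qrels as plain 0/1 dicts; this is an illustrative sketch, not the official simulation harness.</p>

```python
class HybridOracle:
    """Simulated reviewer for Method B: answer from the abstract qrels until the
    first document that is relevant under BOTH qrels is encountered, then answer
    from the content qrels from that point on. Names are illustrative."""
    def __init__(self, abstract_qrels, content_qrels):
        self.abstract = abstract_qrels   # {doc_id: 0/1} screening-phase labels
        self.content = content_qrels     # {doc_id: 0/1} selection-phase labels
        self.switched = False
    def judge(self, doc_id):
        if not self.switched:
            if self.abstract.get(doc_id, 0) and self.content.get(doc_id, 0):
                self.switched = True     # first doubly-relevant doc: switch over
                return self.content[doc_id]
            return self.abstract.get(doc_id, 0)
        return self.content.get(doc_id, 0)
```

<p>Method A corresponds to always answering from the abstract qrels; Method B switches feedback sources mid-run, as above.</p>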
<p>In our pilot experiments, we found that the first method consistently yielded
superior rank-based results, whether evaluated using the abstract qrels or the
content qrels. The second method yielded consistently inferior results. The third
method showed results similar, but slightly inferior, to those of the first method, when
evaluated using the content qrels. Based on our pilot results, we selected the first
and third methods, denoted as Method A and Method B, respectively, for our
official experiments.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Stopping Criterion</title>
        <p>
For threshold-based evaluation, it was necessary to implement a stopping
procedure to terminate screening when the best compromise between recall and effort
had been achieved, for some definition of "best." In our opinion,
technology-assisted review should be considered a satisfactory alternative to manual
review only if it yields comparable or superior recall, with high probability.
Toward this end, we deployed our knee method with default parameters (ρ =
156 − min(relret, 150), β = 100 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), which interprets a sharp fall-off in the
slope of the gain curve (recall vs. review effort) as evidence that substantially
all relevant documents have been identified.
        </p>
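<p>The slope-ratio test at the heart of the knee method can be sketched as below. This follows the published description in spirit only: the +1 smoothing on the trailing slope and the omission of a minimum-effort requirement are simplifying assumptions, not the official implementation.</p>

```python
def knee_stop(rel_flags):
    """Knee-method stopping test (sketch): treat the gain curve (cumulative
    relevant found vs. documents reviewed) as having a candidate knee at each
    rank, and stop when the slope before some knee exceeds the slope after it
    by an adaptive factor of 156 - min(relret, 150)."""
    n = len(rel_flags)
    gain, total = [], 0
    for r in rel_flags:
        total += r
        gain.append(total)
    relret = total                       # relevant documents retrieved so far
    if relret == 0 or n < 2:
        return False
    threshold = 156 - min(relret, 150)
    for i in range(1, n):                # candidate knee after rank i
        slope_before = gain[i - 1] / i
        slope_after = (gain[-1] - gain[i - 1] + 1) / (n - i)  # +1: smoothing
        if slope_before / slope_after >= threshold:
            return True
    return False
```

<p>A long run of non-relevant documents after an early burst of relevant ones produces a sharp knee and triggers the stop; a steadily climbing gain curve does not.</p>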
      </sec>
      <sec id="sec-2-4">
        <title>Runs and Evaluation</title>
<p>The Task 2 guidelines specify a plethora of run types and evaluation measures,
which may be classified along two orthogonal dimensions:
1. Rank-based vs. threshold-based (or set-based) evaluation; and
2. Simple vs. cost-sensitive scoring.</p>
<p>The strategies to optimize these measures are incompatible, occasioning us to
submit four versions of the output from each of our two runs, for a total of eight
submissions, detailed in Table 1. The only difference between the "rank" and
"thresh" runs is that the latter are truncated using the knee-method stopping
procedure; the only difference between the "normal" and "cost" runs is that
the "interaction field" "AF" is replaced by "AFS" where the document receives
a "relevant" assessment, and by "AFN" where the document receives a
"non-relevant" assessment.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>AutoTAR</title>
      <p>
        In 2015, we published the details and rationale for AutoTAR [
        <xref ref-type="bibr" rid="ref1">1</xref>
], which remains,
to this date, the most effective TAR method of which we are aware. BMI
implements AutoTAR exactly as described above, except for the substitution of
Sofia-ML logistic regression in place of SVMlight (see [4, Section 3.1]). It has
no dataset- or topic-specific tuning parameters; except for modifications to
incorporate the CLEF corpora and relevance assessments, and our knee-method
stopping procedure, we used BMI "out of the box."
      </p>
      <p>
The AutoTAR/BMI algorithm, as modified for CLEF, is detailed in
Algorithm 1, which is reproduced from [
        <xref ref-type="bibr" rid="ref1">1</xref>
] with the following changes:
- In Step 1, AutoTAR gives the option of starting with a relevant document,
or with a synthetic document. Here, we used a synthetic document consisting
of the title of the topic, and nothing else.
- In Step 7, we introduced two different ways to simulate user feedback,
corresponding to Method A and Method B, described above in Section 3.2.
- In Step 10, we introduced the option to terminate the process when the
knee-method stopping criterion was met.
      </p>
<p>Internally, BMI constructs a normalized TF-IDF ((1 + log tf) · log(N/df))
word-vector representation of each document in the corpus (which, as noted in
Section 3.1, consists of raw XML files), where a word is considered to be any
sequence of two or more alphanumeric characters not containing a digit, that
occurs at least twice in the corpus. Scoring is effected by Sofia-ML3 with
parameters "--learner_type logreg-pegasos --loop_type roc --lambda 0.0001
--iterations 200000". As noted above, these parameters were fixed when BMI
was created in 2015.</p>
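<p>The stated weighting, w(t, d) = (1 + log tf) · log(N/df) followed by length normalization, can be written out as below. This is a sketch of the formula, not BMI's actual code; in particular, the letter-run tokenizer is a simplifying stand-in for "alphanumeric sequences containing no digit."</p>

```python
import math
import re

def tfidf_vectors(raw_docs):
    """Normalized TF-IDF vectors per the formula above:
    w(t, d) = (1 + log tf) * log(N / df), then L2-normalized.
    A 'word' is a run of 2+ letters occurring at least twice in the corpus."""
    token = re.compile(r"[a-z]{2,}")
    docs = [token.findall(d.lower()) for d in raw_docs]
    # corpus-wide frequency filter: keep words occurring at least twice overall
    total = {}
    for words in docs:
        for w in words:
            total[w] = total.get(w, 0) + 1
    vocab = {w for w, c in total.items() if c >= 2}
    n = len(docs)
    df = {}                                  # document frequency per word
    for words in docs:
        for w in set(words) & vocab:
            df[w] = df.get(w, 0) + 1
    vectors = []
    for words in docs:
        tf = {}
        for w in words:
            if w in vocab:
                tf[w] = tf.get(w, 0) + 1
        vec = {w: (1 + math.log(c)) * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
        vectors.append({w: x / norm for w, x in vec.items()})
    return vectors
```
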
    </sec>
    <sec id="sec-4">
      <title>Results</title>
<p>We present separately the results for our threshold-based and rank-based runs,
reporting only simple threshold-based and simple rank-based measures for each,
computed using the content qrels. At the time of writing, cost-sensitive
evaluation was not available to CLEF participants.</p>
      <sec id="sec-4-1">
        <title>Threshold-Based Results</title>
<p>Our threshold-based results are shown in Table 2. Perhaps the most important
result is shown in the first three lines: Across 30 topics, Method A identified all
607 articles referencing studies that should have been included, thus achieving
100% recall. Method B, on the other hand, identified 575 of the articles, achieving
97.9% recall.</p>
        <sec id="sec-4-1-1">
          <title>Footnote</title>
          <p>3 See https://github.com/glycerine/sofia-ml.</p>
          <p>Method A, however, entailed the review of 79,765 (72.8%) of the
109,560 abstracts identified by the search phase, while Method B entailed the
review of only 52,934 (48.3%) of the documents.</p>
<p>In other words, Method A was more effective, but Method B was more
efficient. According to the combined loss measure, which considers both factors,
Method B was superior.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Rank-Based Results</title>
<p>Our rank-based results are shown in Table 3. Work saved over sampling
("WSS"), a measure commonly reported for systematic review, reflects how
many fewer documents would have needed to be reviewed to achieve
a particular level of recall, if it were somehow known exactly when that level had
been achieved. Thus, WSS, along with all other rank-based measures, is a
measure of what might have been, rather than of achieved effectiveness. According to
WSS, Method A is marginally inferior to Method B at 95% recall (0.815 vs.
0.824), and at 100% recall (0.823 vs. 0.830).</p>
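<p>For concreteness, WSS can be computed from a ranked list of relevance labels using the standard definition WSS@R = (N − k)/N − (1 − R), where k is the shortest ranking prefix achieving recall R; a sketch, assuming 0/1 labels in ranked order:</p>

```python
import math

def wss(rel_flags, target_recall=0.95):
    """Work Saved over Sampling at recall R: WSS@R = (N - k)/N - (1 - R),
    where k is the length of the shortest ranking prefix achieving recall R.
    rel_flags is the 0/1 relevance of documents in ranked order."""
    n, total = len(rel_flags), sum(rel_flags)
    need = math.ceil(target_recall * total)   # relevant docs needed for recall R
    found = 0
    for k, r in enumerate(rel_flags, start=1):
        found += r
        if found >= need:
            return (n - k) / n - (1.0 - target_recall)
    raise ValueError("target recall not achievable")
```

<p>A perfect ranking (all relevant documents first) maximizes WSS; a ranking that defers the relevant documents to the end drives it toward zero.</p>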
<p>Conversely, Method A is marginally superior to Method B in terms of the
number of documents that had to be examined per topic before 100% recall
was achieved (461 vs. 469, representing 12.6% and 12.8%, respectively, of the
average number of documents per topic). In other words, Method A could have
achieved 100% recall with roughly one-sixth the review effort, had a stopping
procedure been able to determine when 100% recall had occurred. Similarly,
Method B could have achieved 100% recall with roughly one-quarter the effort
that it actually expended to achieve 97.9% recall, had such a stopping procedure
been available.</p>
<p>The Normalized Cumulative Gain ("NCG") results, which report the recall
achieved when a specified fraction (between 10% and 100%) of the documents
has been reviewed, tell much the same story: Very high recall could have been
achieved at a fraction of the review effort, had it been known when high recall
had been achieved.</p>
<p>In our opinion, cumulative measures like norm-area and average precision
yield very little insight into the actual or hypothetical effectiveness of
technology-assisted review for screening purposes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>We believe that both sets of the CLEF assessments are incomplete with respect
to the overall objective of identifying all studies that should be included in the
review: The screening assessments are available only for documents retrieved
by the search phase; the selection assessments are available only for documents
retrieved by the search phase, and judged relevant during the screening phase.
Therefore, from the assessments, it is impossible to determine whether an article
not retrieved by the search phase, or an article eliminated during the screening
phase, describes a study that should have been included in the review. The Task
2 architecture tacitly assumes that no such articles exist; in other words, that
the search and screening phases used to generate the relevance assessments were
infallible, and each attained 100% recall.</p>
      <p>
        Such an assumption is unrealistic, and limits the recall of any simulated TAR
method to that of the manual review to which it is compared [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. As noted in the
Cochrane Handbook [9] with regard to the search phase: "[T]here comes a point
where the rewards of further searching may not be worth the effort required
to identify the additional references." And with regard to the screening phase:
"Using at least two authors may reduce the possibility that relevant reports will
be discarded (Edwards 2002 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ])."
      </p>
      <p>
        Our hypothesis that our TAR runs found relevant articles that were missed
by the search phase, or incorrectly discarded in the screening phase, is based
on results from other domains [
        <xref ref-type="bibr" rid="ref6">6</xref>
], where TAR acting as a "second assessor"
was able to identify potentially relevant documents that had been judged
"non-relevant" by a human assessor. When we applied Method A to the 30 topics, it
identified 9,250 potentially relevant articles for which the abstract qrel was "not
relevant." Acquiring a second opinion on each of these documents would increase
the cost of the TAR review by approximately 12%, and would, we believe, yield
a substantial number of relevant documents, over and above the 670 identified
in the abstract qrels.
8. M. R. Grossman, G. V. Cormack, and A. Roegiest. TREC 2016 Total Recall Track
overview. In Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC
2016, Gaithersburg, Maryland, USA, November 15-18, 2016, 2016.
9. J. P. Higgins and S. Green. Cochrane handbook for systematic reviews of
interventions, volume 4. John Wiley &amp; Sons, 2011.
10. E. Kanoulas, D. Li, L. Azzopardi, and R. Spijker. Overview of the CLEF
technologically assisted reviews in empirical medicine. In Working Notes of CLEF 2017
- Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14,
2017, CEUR Workshop Proceedings. CEUR-WS.org, 2017.
11. A. Roegiest, G. V. Cormack, M. R. Grossman, and C. L. A. Clarke. TREC 2015
Total Recall Track overview. In Proceedings of The Twenty-Fourth Text REtrieval
Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015,
2015.
12. H. Suominen, L. Kelly, L. Goeuriot, E. Kanoulas, A. Neveol, G. Zuccon, and
J. R. M. Palotti. Overview of the CLEF eHealth Evaluation Lab 2017. In
Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International
Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September
11-14, 2017, Proceedings, Lecture Notes in Computer Science. Springer, 2017.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Autonomy and reliability of continuous active learning for technology-assisted review</article-title>
          .
          <source>arXiv preprint arXiv:1504.06868</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Waterloo (Cormack) participation in the TREC 2015 Total Recall Track</article-title>
          .
          <source>In Proceedings of The Twenty-Fourth Text REtrieval Conference</source>
          , TREC 2015, Gaithersburg, Maryland, USA, November
          <volume>17</volume>
          -
          <issue>20</issue>
          ,
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Engineering quality and reliability in technology-assisted review</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2016</year>
          , Pisa, Italy,
          <source>July 17-21</source>
          ,
          <year>2016</year>
          , pages
          <fpage>75</fpage>
-
          <fpage>84</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
<article-title>Scalability of continuous active learning for reliable high-recall text classification</article-title>
          .
          <source>In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</source>
          ,
          <string-name>
            <surname>CIKM</surname>
          </string-name>
          <year>2016</year>
          ,
          <article-title>Indianapolis</article-title>
          , IN, USA, October
          <volume>24</volume>
          -
          <issue>28</issue>
          ,
          <year>2016</year>
          , pages
          <fpage>1039</fpage>
-
          <fpage>1048</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
.
          <article-title>"When to stop" Waterloo (Cormack) participation in the TREC 2016 Total Recall Track</article-title>
          .
          <source>In Proceedings of The Twenty-Fifth Text REtrieval Conference</source>
          , TREC 2016, Gaithersburg, Maryland, USA, November
          <volume>15</volume>
          -
          <issue>18</issue>
          ,
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <article-title>Navigating imprecision in relevance assessments on the road to total recall: Roger and me</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2017</year>
          , Tokyo, Japan,
          <source>August</source>
          <volume>7</volume>
          -
          <issue>11</issue>
          ,
          <year>2017</year>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , C. DiGuiseppi, S. Pratap,
          <string-name>
            <surname>I. Roberts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wentz</surname>
          </string-name>
          .
<article-title>Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records</article-title>
          . Statistics in Medicine,
          <volume>21</volume>
          (
          <issue>11</issue>
          ):
          <volume>1635</volume>
-
          <fpage>1640</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>