<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieving and Ranking Studies for Systematic Reviews: University of Sheffield's Approach to CLEF eHealth 2018 Task 2</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Datasets</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amal Alharbi</institution>
          ,
          <addr-line>William Briggs and Mark Stevenson</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Sheffield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>This paper describes the University of Sheffield's approach to CLEF 2018 eHealth Task 2: Technologically Assisted Reviews in Empirical Medicine, which focuses on identifying relevant studies for systematic reviews. The University of Sheffield participated in both subtasks. Our approach to subtask 1 was to extract keywords from search protocols and form them into queries designed to retrieve relevant documents. Our approach to subtask 2 was to enrich queries with terms designed to identify diagnostic test accuracy studies and to make use of relevance feedback. A total of six official runs were submitted.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Systematic reviews aim to identify and summarise all available evidence to answer a
specific question such as ‘for deep vein thrombosis is D-dimer testing or ultrasound
more accurate for diagnosis?’[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The process of conducting a systematic review is
time-consuming and a single review may require up to 12 months of expert effort [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ].
A significant amount of this effort is spent on manually screening studies to identify
those which should be included in the review. This effort can be significantly reduced
by applying text mining techniques to identify relevant studies (semi-)automatically
[
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4,5,6,7</xref>
        ].
      </p>
      <p>
        There are three main stages in the process of identifying relevant studies for a
systematic review [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: Boolean search, in which a query is constructed and run against a bibliographic database; title and abstract screening, in which the retrieved citations are examined to identify those that are potentially relevant; and content screening, in which the full text of the remaining studies is assessed for inclusion in the review.
      </p>
      <p>This paper is structured as follows: Sections 2 and 3 describe the approach to, and results obtained for, subtasks 1 and 2 respectively. Conclusions are presented in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Subtask 1: No Boolean Search</title>
      <p>Before constructing a Boolean Query, reviewers design and write a search protocol that
defines in detail what constitutes a relevant study for their review.</p>
      <p>The goal of subtask 1 is to create a search strategy based on the protocol without developing a Boolean query. Participants were expected to interpret the protocol and use information from it to identify relevant studies directly from PubMed. The task is a complex problem that can be viewed as involving both Information Extraction and search.</p>
      <sec id="sec-2-1">
        <title>Datasets</title>
        <p>Participants in Subtask 1 were provided with 40 example reviews for training and an additional 30 for testing. The data provided included full protocols as well as protocol summaries. The summaries varied in length and typically contained three main headings: topic, title and objective. See Figure 1 for an example protocol summary.</p>
        <p>Topic: CD008122
Title: Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries
Objective: To assess the diagnostic accuracy of RDTs for detecting clinical P. falciparum malaria (symptoms suggestive of malaria plus P. falciparum parasitaemia detectable by microscopy) in persons living in malaria endemic areas who present to ambulatory healthcare facilities with symptoms of malaria, and to identify which types and brands of commercial test best detect clinical P. falciparum malaria.</p>
        <p>Sheffield’s approach to subtask 1 used the RAKE [12] keyword extraction algorithm to interpret protocols and Apache Lucene (https://lucene.apache.org/) as the IR engine.</p>
        <p>The PubMed database was retrieved from the PubMed FTP site (ftp://ftp.ncbi.nlm.nih.gov/) as a set of XML files and indexed using Apache Lucene. The abstract and title of each citation were parsed out of the XML files and concatenated. Each citation was pre-processed by carrying out tokenisation, stop word removal and stemming.</p>
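        <p>The pre-processing pipeline described above can be sketched as follows. This is an illustrative stand-in: the stop word list and the naive suffix-stripping stemmer are placeholders, since the paper does not specify the exact tools used at this stage.</p>
        <preformat>
```python
import re

# Illustrative stop word list and suffix rules; stand-ins for a full
# stop word list and a proper stemmer such as Porter's algorithm.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "for", "to", "is", "are", "by"}
SUFFIXES = ("ation", "ing", "ies", "es", "s")

def stem(word):
    """Very naive suffix stripping, used here only for illustration."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)]
    return word

def preprocess(text):
    """Tokenise, remove stop words and stem, as applied to each citation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

# The title and abstract of a citation are concatenated before pre-processing.
doc = preprocess("Rapid diagnostic tests" + " " + "for diagnosing malaria")
```
        </preformat>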
        <p>Information was extracted from the protocol summaries, rather than the full protocols, since our analysis suggested that the summaries contained the key information useful for creating a search.</p>
        <p>Protocol summaries were pre-processed by removing references, single characters, headers, titles and markup tags. RAKE was then applied to the remaining content with the minimum keyword frequency set to 1 (i.e. all terms were returned). The extracted terms were concatenated to form a query which was used to retrieve citations from the Lucene index (see Figure 2 for an example of the query generated from the protocol summary shown in Figure 1).</p>
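        <p>A minimal sketch of RAKE's scoring follows (an illustrative re-implementation with a toy stop word list rather than the exact configuration used): candidate phrases are runs of content words delimited by stop words and punctuation, each word is scored as degree/frequency, and a phrase scores the sum of its word scores.</p>
        <preformat>
```python
import re

# Toy stop word list for illustration; RAKE normally uses a full list.
STOPWORDS = {"to", "the", "of", "for", "in", "and", "a", "an", "with",
             "who", "which", "best", "detecting", "assess", "clinical"}

def rake(text, min_freq=1):
    """Minimal RAKE: split into candidate phrases, score words by
    degree/frequency, score phrases by summing their word scores."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w and w not in STOPWORDS:
            current.append(w)
        elif current:
            phrases.append(tuple(current))
            current = []
    if current:
        phrases.append(tuple(current))

    freq, degree = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(phrase)

    counts = {p: phrases.count(p) for p in set(phrases)}
    scored = {p: sum(degree[w] / freq[w] for w in p)
              for p in counts if counts[p] >= min_freq}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return [" ".join(p) for p in ranked]

# Extracted keywords are concatenated to form the retrieval query.
keywords = rake("To assess the diagnostic accuracy of RDTs for "
                "detecting P. falciparum malaria in endemic countries")
query = " ".join(keywords)
```
        </preformat>
        <p>With the minimum keyword frequency set to 1, every candidate phrase is kept, matching the configuration described above.</p>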
        <p>endemic countries objective ambulatory healthcare facilities rapid diagnostic tests falciparum parasitaemia detectable malaria endemic areas diagnostic accuracy falciparum malaria</p>
        <p>– The sheffield-Boolean run uses terms that occur most frequently in the document and query as a basis for ranking. Documents that contain more query terms feature higher in the overall rankings. The Apache Lucene BooleanSimilarity class (https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BooleanSimilarity.html) was used for the implementation.
– The sheffield-tfidf run uses a cosine similarity measure to compare the query and the PubMed article. Documents and queries are represented as tf.idf weighted vectors. The Apache Lucene TFIDFSimilarity class (https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) was used.
– The sheffield-bm25 run uses the BM25 similarity measure [13], implemented using the Apache Lucene BM25Similarity class (https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html).</p>
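        <p>As an illustration of the ranking functions above, BM25 can be sketched as follows (a minimal in-memory version with the common defaults k1 = 1.2 and b = 0.75, using a Lucene-style idf; Lucene's own implementation differs in indexing details):</p>
        <preformat>
```python
import math

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query using BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        score = 0.0
        for t in query_terms:
            tf = d.count(t)
            if tf == 0:
                continue
            # Lucene-style idf: ln(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

docs = [["malaria", "diagnostic", "test"],
        ["heart", "disease", "risk"],
        ["malaria", "rapid", "test", "accuracy"]]
scores = bm25_scores(["malaria", "test"], docs)
```
        </preformat>
        <p>Documents containing more of the query terms, and rarer terms, receive higher scores; documents sharing no terms with the query score zero.</p>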
      </sec>
      <sec id="sec-2-2">
        <title>Results and Discussion</title>
        <p>Results were computed over the training and test datasets and shown in Table 1 and
Table 2.
3 https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/
search/similarities/BooleanSimilarity.html
4 https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/
search/similarities/TFIDFSimilarity.html
5 https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/
search/similarities/BM25Similarity.html
Training Dataset Results for the training dataset are shown in Table 1. The Boolean
search method achieves 0.341 recall for the 5000 documents returned. This approach is
limited by the fact there is no weighting of term importance and the information used
for ranking is based only on the number of query terms contained in documents. Using
tf.idf leads to a slight improvement in performance (0.375 recall), presumably due to
the availability of term weighting information. The best recall score for the training
data (0.587) is obtained using BM25. The overall pattern of results is as expected for
the training data, particularly the relatively strong performance of BM25.
norm_area
sheffield-boolean</p>
        <p>sheffield-tfidf
sheffield-bm25
Test Dataset Results for the test dataset are shown in Table 2. Performance using the
tfidf ranking method was surprisingly poor on this dataset. This may be due to the use
of RAKE to extract keywords which reduces the impact of the idf element of the tfidf
similarity measure. As with the training data, BM25 achieves the best performance,
although generally lower than for the training data.
norm_area
sheffield-boolean</p>
        <p>sheffield-tfidf
sheffield-bm25
Subtask 2 focuses on the second stage of conducting systematic reviews (Title and
abstract screening). Participants are asked to rank the list of PubMed Document Identifiers
(PMIDs) returned from the Boolean query with the goal that all relevant citations appear
as early as possible.
3.1
The dataset for this subtask consists of 72 DTA reviews. This dataset divided into 42
reviews for training dataset and 30 reviews for test dataset. For each review,
participants are provided with the topic ID, title of the review (written by Cochrane
experts), a Boolean query using either OVID or PubMed syntax (manually constructed by
Cochrane experts), set of PMIDs returned by running the query in MEDLINE database
and relevance judgement at both abstract and content levels. Figure 3 shows an example
topic from the training dataset.</p>
        <p>Topic: CD010705
Title: The diagnostic accuracy of the GenoType R MTBDRsl assay for the
detection of resistance to second-line anti-tuberculosis drugs.
The University of Sheffield’s submission for subtask 2 extended the approach used for
CLEF 2017 [15] by augmenting queries with terms designed to identify DTA studies
and by making use of relevance feedback.</p>
        <p>Our approach used the topic title and terms from the Boolean query. A simple
parser was used to extract terms from the Boolean query automatically. The topic
title and terms extracted from the query were pre-processed by tokenisation, stemming
and removal of stop words6. The same pre-processing steps were applied to the title and
abstract for each PMID returned for the Boolean query.
3.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Runs</title>
        <p>Three runs were submitted to the subtask 2 official evaluation: sheffield-query_terms,
sheffield-general_terms and sheffield-feedback.</p>
        <p>– The sheffield-query_terms run ranks abstracts by comparing each citation against the topic title and terms extracted from the query. We used tf.idf weighted vectors to represent the information obtained from the topic and citations, then calculated the similarity between them using the cosine metric (Scikit-learn’s TfidfVectorizer and linear_kernel were used for this step). Abstracts are ranked based on this similarity score. (N.B. This run is the same as our best performing run from CLEF 2017, i.e. Sheffield-run-4.)
– The sheffield-general_terms run extends the previous approach (sheffield-query_terms) by enriching queries with terms designed specifically to identify DTA studies. Terms from standard filters developed to identify DTA studies [16] were added to the queries (see Figure 4).</p>
        <p>’sensitivity’, ’specificity’, ’diagnos’, ’diagnosis’, ’predictive’, ’accuracy’</p>
        <p>– The sheffield-feedback run used relevance feedback to re-rank abstracts based on relevance judgements. 10% of the abstracts (up to a maximum of 1,000) were used to update the query vector using Rocchio’s algorithm (equation 1) and the abstracts were then re-ranked [17]:</p>
        <p>q_m = α q + (β / N_r) Σ_{d_j ∈ D_r} d_j − (γ / N_n) Σ_{d_j ∈ D_n} d_j (1)</p>
        <p>where q is the original query vector and d_j is a weighted term vector associated with abstract j. D_r is the set of relevant abstracts among the abstracts retrieved and N_r is the number of abstracts in D_r; D_n is the set of non-relevant abstracts among the abstracts retrieved and N_n is the number of abstracts in D_n. A range of values for the weighting parameters (α, β and γ) was explored using the training data and the best results were achieved by setting α = β = 1 and γ = 1.5.</p>
        <p>Training Dataset: Table 3 shows the results of applying our approach to the training dataset, computed using the script provided by the task organisers (https://github.com/leifos/tar). As expected, all submitted runs outperform the baseline, where the list of PubMed abstracts is randomly ordered.</p>
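        <p>A minimal sketch of the Rocchio update of equation 1, followed by the cosine re-ranking step, is shown below (illustrative dense vectors stand in for the tf.idf weighted vectors used by the actual system):</p>
        <preformat>
```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.5):
    """Rocchio update (equation 1): move the query vector towards the
    centroid of the relevant abstracts and away from the non-relevant
    ones. alpha = beta = 1 and gamma = 1.5 follow the settings found
    best on the training data."""
    updated = [alpha * q for q in query]
    for d in relevant:
        for i in range(len(query)):
            updated[i] += beta * d[i] / len(relevant)
    for d in nonrelevant:
        for i in range(len(query)):
            updated[i] -= gamma * d[i] / len(nonrelevant)
    return updated

def cosine(u, v):
    """Cosine similarity used to re-rank abstracts against the query."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

q = [1.0, 0.0, 1.0]
rel = [[1.0, 0.0, 0.5]]     # vectors of abstracts judged relevant
nonrel = [[0.0, 1.0, 0.0]]  # vectors of abstracts judged non-relevant
q_new = rocchio(q, rel, nonrel)
```
        </preformat>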
        <p>Performance improves when general terms are added to the queries (comparing
sheffield-query_terms and sheffield-general_terms). These results demonstrate the
usefulness of including general terms which provide information about the types of
citations that are likely to be relevant for DTA reviews, independently of their specific
topic.</p>
        <p>The best performance was achieved using relevance feedback (sheffield-feedback). The average precision (ap) increased by 0.577 when compared with the baseline. It also produced the best results for work saved over sampling (wss@100 and wss@95) and norm_area, which improved by 0.359, 0.522 and 0.423 respectively. An improvement in performance is to be expected when relevance feedback is used, given the additional information available in the relevance judgements. The scale of the improvement demonstrates just how useful this information is for the task.</p>
        <p>Test Dataset: Table 4 shows the results on the test dataset. The pattern of performance is similar to that observed for the training dataset. The best performance was achieved using relevance feedback (sheffield-feedback). The ap improved by 0.556 when compared with the baseline; wss@100, wss@95 and norm_area also improved, by 0.421, 0.606 and 0.418 respectively.</p>
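        <p>For reference, the work saved over sampling measure reported above can be computed from a ranking as sketched below, using the standard definition WSS = (TN + FN)/N − (1 − recall); the official scores were produced with the organisers' evaluation script.</p>
        <preformat>
```python
import math

def wss_at(ranking, relevant, recall=0.95):
    """Fraction of the list a reviewer avoids screening when stopping
    once the target recall is reached, compared with screening all of
    it: WSS = (TN + FN)/N - (1 - recall)."""
    needed = math.ceil(recall * len(relevant))
    found = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        if found >= needed:
            return (len(ranking) - k) / len(ranking) - (1 - recall)
    return 0.0

# Both relevant abstracts are found within the first 3 of 10 positions,
# so 70% of the list is never screened and wss@95 is 0.70 - 0.05 = 0.65.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
score = wss_at(ranking, relevant={"d1", "d3"})
```
        </preformat>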
        <p>Results from both the training and test datasets demonstrate that retrieval performance for technology-assisted reviews can be improved by adding terms to the queries that indicate the type of citation likely to be of interest for the review and by applying relevance feedback.</p>
      </sec>
      <sec id="sec-4">
        <title>Conclusion</title>
        <p>This paper presented the University of Sheffield’s approach to CLEF 2018 Task 2. For subtask 1, we established a method for retrieving documents using search protocol summaries and showed that RAKE was an effective way of identifying keywords in the protocol from which queries could be created. For subtask 2, results demonstrated that augmenting queries with terms designed to identify DTA studies and applying relevance feedback improve retrieval performance.</p>
        <p>9. H. Suominen, L. Kelly, L. Goeuriot, E. Kanoulas, L. Azzopardi, R. Spijker, D. Li, A. Névéol, L. Ramadier, A. Robert, G. Zuccon, and J. Palotti, “Overview of the CLEF eHealth Evaluation Lab 2018,” in CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), (France), Springer, September 2018.
10. E. Kanoulas, R. Spijker, D. Li, and L. Azzopardi, “CLEF 2018 Technology Assisted Reviews in Empirical Medicine Overview,” in CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, (France), CEUR-WS, September 2018.
11. K. Abba, J. Deeks, P. Olliaro, C. Naing, S. Jackson, Y. Takwoingi, S. Donegan, and P. Garner, “Rapid diagnostic tests for diagnosing uncomplicated P. falciparum malaria in endemic countries,” The Cochrane Database of Systematic Reviews, 2011. CD008122.
12. S. Rose, D. Engel, N. Cramer, and W. Cowley, Automatic Keyword Extraction from Individual Documents, ch. 1, pp. 1–20. Wiley-Blackwell, 2010.
13. S. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, “Okapi at TREC–3,” in Overview of the Third Text REtrieval Conference (TREC–3), pp. 109–126, Gaithersburg, MD: NIST, 1995.
14. “The diagnostic accuracy of the GenoType(R) MTBDRsl assay for the detection of resistance to second-line anti-tuberculosis drugs,” The Cochrane Database of Systematic Reviews, 2014. CD010705.
15. A. Alharbi and M. Stevenson, “Ranking abstracts to identify relevant evidence for systematic reviews: The University of Sheffield’s approach to CLEF eHealth 2017 Task 2,” in Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, (Dublin, Ireland), CEUR-WS.org, September 11-14 2017.
16. Health Information Research Unit, “Search Filters for MEDLINE in Ovid Syntax and the PubMed translation [Internet],” 2016 [updated 2016 February 09; cited 2018 January 15].
17. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search. USA: Addison-Wesley Publishing Company, 2nd ed., 2011.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Di Nisio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Squizzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W. S.</given-names>
            <surname>Rutjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Büller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Zwinderman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. M. M.</given-names>
            <surname>Bossuyt</surname>
          </string-name>
          , “
          <article-title>Diagnostic accuracy of D-dimer test for exclusion of venous thromboembolism: a systematic review</article-title>
          ,
          <source>” Journal of Thrombosis and Haemostasis</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ambert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>McDonagh</surname>
          </string-name>
          , “
          <article-title>A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review</article-title>
          ,
          <source>” AMIA Annual Symposium Proceedings</source>
          , vol.
          <volume>2010</volume>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>125</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavedon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          , “
          <article-title>Boolean versus ranked querying for biomedical systematic reviews</article-title>
          ,”
          <source>BMC Medical Informatics and Decision Making</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>O'Mara-Eves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McNaught</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Miwa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , “
          <article-title>Using text mining for study identification in systematic reviews: a systematic review of current approaches</article-title>
          ,”
          <source>Systematic Reviews</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Paisley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sevra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Archer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Preston</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Chilcott</surname>
          </string-name>
          , “
          <article-title>Identifying Potential Early Biomarkers of Acute Myocardial Infarction in the Biomedical Literature: A Comparison of Text Mining and Manual Sifting Techniques,”</article-title>
          <source>in Proceedings of the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) 19th Annual European Congress</source>
          , (Vienna, Austria),
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Miwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>O'Mara-Eves</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , “
          <article-title>Reducing systematic review workload through certainty-based screening</article-title>
          ,
          <source>” Journal of Biomedical Informatics</source>
          , vol.
          <volume>51</volume>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>253</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shemilt</surname>
          </string-name>
          , Khan, Park, and Thomas, “
          <article-title>Use of Cost-effectiveness Analysis to Compare the Efficiency of Study Identification Methods in Systematic Reviews</article-title>
          ,”
          <source>Systematic Reviews</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Robert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Spijker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          , “
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          ,”
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer, September,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>