<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Inter-Review Learning-to-Rank and Intra-Review Incremental Training for Title and Abstract Screening in Systematic Reviews</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonios Anagnostou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Athanasios Lagopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigorios Tsoumakas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Vlahavas</string-name>
          <email>vlahavasg@csd.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>54124 Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>), with different fields of the topics (title, query). Our incrementally trained model is a support vector machine trained on a TF-IDF representation of the title and abstract of the documents. The results of our approach are promising, reaching 0.658 normalized cumulative gain in the top 10 ranked documents in the simple evaluation setting and 0.846 in the cost-effective evaluation setting, the latter assuming feedback can be obtained from an intermediate user/oracle instead of the end-user.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Evidence-Based Medicine (EBM) is an approach to medical practice that makes
use of the current best clinical evidence in making decisions about the care and
treatment of individual patients [13]. Researchers in the medical domain conduct
systematic research to find the best available evidence and form review articles
summarizing their discoveries on a certain topic. These systematic reviews
usually include three stages:
1. Document retrieval. Experts build a Boolean query and submit it to
a medical database, which returns a set of possibly relevant documents.
Boolean queries typically have a very complicated syntax and consist of
multiple lines. Such a query can be found for reference in Listing 1.1.
2. Title and abstract screening. Experts go through the title and abstract
of each document retrieved in the previous stage and perform a first
level of screening.
3. Document screening. Experts go through the full text of each document
that passes the screening of the previous stage to decide whether it will be
included in their systematic review.</p>
      <p>Considering the rapid pace at which libraries of medical articles are
expanding, a systematic review can be a very difficult and time-consuming task.</p>
      <p>
        Task II [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
          ] of the CLEF eHealth 2017 lab concerns Technologically Assisted
Reviews in Empirical Medicine, focusing on Diagnostic Test Accuracy (DTA),
and aims to automate the second stage of this process by ranking the set of
documents retrieved in the first stage. Its goal is to produce an efficient ordering
of these documents, reducing the number of documents that experts have to go
through for their reviews. This can be accomplished in
two stages: by classifying documents (relevant or not) and by thresholding, i.e.
showing only a subset of the returned documents (the ones that are highest on
the list).
      </p>
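As a minimal sketch of these two stages (scoring documents for relevance, then thresholding the ranked list), with hypothetical document IDs and classifier scores rather than the task's real data:

```python
# Rank-then-threshold sketch; the document IDs and scores below are invented
# for illustration, not taken from the CLEF eHealth data.
docs = ["d1", "d2", "d3", "d4", "d5"]
scores = {"d1": 0.91, "d2": 0.15, "d3": 0.67, "d4": 0.05, "d5": 0.48}

# Stage 1: order documents by predicted relevance, highest first.
ranked = sorted(docs, key=lambda d: scores[d], reverse=True)

# Stage 2: threshold, showing only the top-k documents to the reviewer.
k = 3
shown = ranked[:k]
print(shown)  # ['d1', 'd3', 'd5']
```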
      <p>Listing 1.1. Example of a query in the data set</p>
      <p>This is the first time this task has taken place, and very little research has
previously been done on the topic. Previous approaches to this problem use an
ensemble of Support Vector Machines (SVMs), built over different feature spaces
(documents' titles, text, etc.) [15]. Other approaches use Active Learning
techniques to improve the relevance of results by utilizing domain experts'
knowledge [14]. Finally, Learning to Rank (LTR) approaches have also been tested
on biomedical data and have shown promising results [12].</p>
      <p>Our approaches to the task are based on binary classification methods
combined with existing Learning to Rank techniques. We experimented with different
classifiers and we also introduce a hybrid classification mechanism which consists
of two parts: an inter-topic classifier, based on features computed on the training
set, and an intra-topic classifier, which is trained upon the test set documents.</p>
    </sec>
    <sec id="sec-2">
      <title>Task overview</title>
      <p>In CLEF eHealth 2017 Task II, participants were given a total of 20 topics with
the corresponding document IDs. An example of such a topic can be found in
Listing 1.1. Summarizing the topics' structure, they all contain:</p>
      <sec id="sec-2-1">
        <title>Topic and document fields</title>
        <p>1. A distinct topic ID,
2. A topic title,
3. An Ovid MEDLINE query, and
4. A set of PIDs of the documents returned by the query.</p>
        <p>Similarly, documents contain the following fields:</p>
        <p>1. A distinct PID,
2. A title,
3. The abstract text, and
4. MeSH headings, based on their taxonomy.</p>
      </sec>
      <sec id="sec-2-3">
        <p>The test set comprised topics of similar structure, summing up to a total of
30 topics.</p>
        <p>For both the training and the test set, participants were also provided with
the corresponding document relevance sheet, in which relevance was provided
in the format shown in Listing 1.2, where 0 denotes an irrelevant document and
1 a relevant one.</p>
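A relevance sheet in this shape can be loaded with a few lines of Python. This is a hedged sketch assuming whitespace-separated ⟨topic ID, document PID, relevance label⟩ columns; the PIDs below are invented for illustration:

```python
# Parse a relevance sheet, assuming whitespace-separated
# "<topic_id> <pid> <label>" lines (the exact column layout of the
# task's files may differ).
def parse_qrels(lines):
    rels = {}  # topic_id -> {pid: 0 or 1}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or partial lines
        topic, pid, label = parts[0], parts[1], int(parts[2])
        rels.setdefault(topic, {})[pid] = label
    return rels

sample = [
    "CD010438 10024589 0",  # hypothetical PIDs and labels
    "CD010438 10136954 1",
    "CD011984 10207731 0",
]
print(parse_qrels(sample))
```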
        <p>Listing 1.2. Example of query/document relevance (topic ID, document PID, relevance label).
CD010438 ...
CD010438 ...
CD010438 ...
...
CD011984 ...
CD011984 ...
...</p>
        <p>The task's evaluation measures were the following:
1. Area under the recall-precision curve, i.e. Average Precision (metric in task's
evaluation script: ap)
2. Minimum number of documents returned to retrieve all R relevant
documents (metric in task's evaluation script: last rel), a measure for optimistic
thresholding
3. Work Saved over Sampling @ Recall (metric in task's evaluation script:
wss 100 and wss 95):</p>
        <p>WSS = (TN + FN)/N - (1 - Recall)</p>
        <p>4. Area under the cumulative recall curve normalized by the optimal area
(metric in task's evaluation script: norm area), where</p>
        <p>optimal area = R * N - R^2/2</p>
        <p>5. Normalized cumulative gain @ 0% to 100% of documents shown (metric in
task's evaluation script: NCG@0 to NCG@100)</p>
        <p>6. Total cost uniform (metric in task's evaluation script: total cost uniform):</p>
        <p>total cost uniform = Ca * n + Cp * (m/R) * (N - n)</p>
        <p>where:
- N is the total number of documents in the collection,
- n is the number of documents shown to the user,
- (N - n) is the number of documents not shown to the user,
- m is the number of missing relevant documents,
- Ca is the cost paid for experts/users reviewing returned documents'
abstracts to determine their relevance, and
- Cp = 2 * Ca.
7. Total cost weighted (metric in task's evaluation script: total cost weighted):</p>
        <p>total cost weighted = Ca * n + Cp * (N - n) * sum_{i=1}^{m} 1/2^i</p>
        <p>
8. Reliability (metric in task's evaluation script: loss er) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
        </p>
        <p>Reliability = loss_r + loss_e
where:
- loss_r = (1 - recall)^2 (metric in task's evaluation script: loss r),
- loss_e = ((100/(R + 100)) * (n/N))^2 (metric in task's evaluation script: loss e),
- recall = n_r/R (metric in task's evaluation script: r), and
- n_r is the number of relevant documents found and R is the total number of
relevant documents.</p>
        <p>
The architecture of our approach, which comprises two models, is depicted in
Figures 1 to 3. The first model is a learning-to-rank binary classifier that
takes a topic-document pair as input and outputs whether the document is relevant
to the topic or not (Figure 1). This inter-topic model is used at the first stage
of our approach in order to obtain an initial ranking of all documents returned
by the Boolean query of an unseen test topic. The second model is a standard
binary classifier that takes a document of the given test topic as input and
outputs whether this document is relevant to the test topic. This intra-topic
model is incrementally trained based on relevance feedback that it requests after
returning one or more documents to the user. The first version of this model
is trained based on feedback obtained for the top k documents ranked by the
inter-topic model (Figure 2). The re-ranking of subsequent documents is from
then on based solely on the intra-topic model (Figure 3).
For each ⟨topic, document⟩ pair, we extracted a number of features, following
the paradigm of [12]. The majority of the features were computed by considering
the similarity of different fields of the document (title, abstract) with different
fields of the topic (title, query), using a variety of similarity metrics, such as the
number of common terms between the topic and the document parts, Levenshtein
distance, cosine similarity or Okapi BM25 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We also computed features based
solely on the topic.
        </p>
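Two of the simpler similarity features described above can be sketched as follows. This is an illustrative sketch only: the paper's exact tokenization, weighting, and its Levenshtein and BM25 variants are not reproduced, and the topic and document titles are invented:

```python
from collections import Counter
import math

# Sketch of two topic-document similarity features: common-term count and
# cosine similarity over raw term counts. Whitespace tokenization is an
# assumption made for illustration.
def tokenize(text):
    return text.lower().split()

def common_terms(topic_field, doc_field):
    # number of distinct terms shared by a topic field and a document field
    return len(set(tokenize(topic_field)) & set(tokenize(doc_field)))

def cosine_similarity(a, b):
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

topic_title = "ultrasound for diagnosing appendicitis"      # hypothetical topic
doc_title = "diagnosing acute appendicitis with ultrasound imaging"
print(common_terms(topic_title, doc_title))                  # 3
print(round(cosine_similarity(topic_title, doc_title), 3))   # 0.612
```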
        <p>In order to use the rich information available in the query field of the topics,
we used Polyglot1, a JavaScript tool that can parse and produce a full syntax
tree of Ovid MEDLINE queries. In particular, we extracted those medical subject
headings (MeSH) that should characterize the retrieved documents, avoiding the
ones that are negated in the query syntax. As an example, according to Polyglot,
the MeSH terms found in the Ovid MEDLINE query of Listing 1.1 are the
following:
1 https://github.com/CREBP/sra-polyglot</p>
        <p>We eventually settled on the 24 features that can be found in Table 1, after
extensive investigation of the performance of our model with additional
variations of these features. Two of these features are only topic-dependent, denoted
with T in the Category column of Table 1, as opposed to the remaining 22
features, which depend on both the topic and the document, denoted with T D.
The notation used in the Description column of Table 1 is explained here:
- t represents the title of each topic, consisting of tokens ti.
- m represents the MeSH terms extracted from the query of each topic.
- d represents the title or abstract of a document, consisting of |d| tokens dj.
- c(x, d) denotes the number of occurrences of title token or MeSH term x of
the topic in document d.</p>
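The count c(x, d) above can be computed directly with a term counter; a minimal sketch, assuming whitespace tokenization and invented example tokens:

```python
from collections import Counter

# c(x, d): occurrences of a topic-title token or MeSH term x in document d.
def c(x, d_tokens):
    return Counter(d_tokens)[x.lower()]

d = "screening of title and abstract title".split()  # hypothetical document tokens
print(c("title", d))          # 2
print(c("Appendicitis", d))   # 0
```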
        <p>
We have experimented with a variety of different classifiers, including
Support Vector Machines [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Gradient Boosting [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], eXtreme Gradient Boosting
(XGBoost) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and LambdaMART [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The best results were achieved with
XGBoost. We also experimented with a variety of undersampling techniques,
such as EasyEnsemble [10], but this did not lead to accuracy improvements.
        </p>
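The inter-topic model can be sketched as a boosted-tree binary classifier over per-pair feature vectors. The sketch below uses scikit-learn's GradientBoostingClassifier with synthetic feature values; the paper's best results were obtained with XGBoost, whose XGBClassifier exposes a similar fit/predict_proba interface:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hedged sketch of the inter-topic model: a boosted-tree classifier over
# 24 features per topic-document pair. The feature values and relevance
# labels here are synthetic, purely for illustration.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 24)                                   # 24 LTR features per pair
y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)   # synthetic labels

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Rank the documents of an unseen test topic by predicted relevance probability.
X_test = rng.rand(10, 24)
scores = model.predict_proba(X_test)[:, 1]
ranking = np.argsort(-scores)  # document indices, most relevant first
print(ranking)
```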
        <p>Intra-topic model</p>
        <p>The first version of the intra-topic model is trained on the top k documents
as ranked by the inter-topic model. We then iteratively re-rank the rest of the
documents, expanding the training set of the intra-topic model with the
top-ranked document, until the whole list has been added to the training set or a
certain threshold is reached. This iterative feedback and re-ranking mechanism
is described in detail in Algorithm 1. For the local classifier, a standard TF-IDF
vectorization was used, enhanced with English stop word removal.</p>
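A first training round of such a local classifier can be sketched with scikit-learn's TfidfVectorizer and a linear SVM. The documents and labels below are invented seed feedback; the paper's exact SVM variant and hyperparameters are not specified here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hedged sketch of the intra-topic model: a linear SVM over TF-IDF vectors of
# title + abstract text, with English stop-word removal.
seed_docs = [
    "ultrasound diagnosis of acute appendicitis in children",  # hypothetical abstracts
    "cost analysis of hospital administration workflows",
    "sensitivity of ultrasound for suspected appendicitis",
    "survey of electronic health record adoption",
]
seed_labels = [1, 0, 1, 0]  # relevance feedback for the top-k seed documents

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(seed_docs)
clf = LinearSVC()
clf.fit(X, seed_labels)

# Score the remaining documents; a higher decision value ranks a document earlier.
rest = ["ultrasound accuracy for appendicitis diagnosis",
        "billing software usability study"]
scores = clf.decision_function(vectorizer.transform(rest))
print(scores[0] > scores[1])  # the on-topic document should score higher
```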
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation setups and results</title>
      <p>Task II of CLEF eHealth 2017 supported two experimental setups: one for simple
evaluation and one for cost-effective evaluation.</p>
      <p>In the simple evaluation, our aim was to utilize relevance feedback as much as
possible, without any cap or limitation, so as to experiment with different
techniques for boosting ranking metrics. In the cost-effective evaluation, we
implemented thresholding by limiting the number of documents (column Threshold
in Table 2) that we request feedback for and by not showing to users documents
for which negative relevance was received.</p>
      <p>In Table 3 you can find the official results for the simple evaluation setup.
In Table 4 you can find the results for the cost-effective evaluation, as they
derive from the evaluation script provided by the task's organizers. Please note
that, because of ongoing software enhancements in the script, some metrics in
the cost-effective evaluation might be inaccurate, e.g. the total cost metrics, as
they have not been adjusted to the different run outputs for that setup. Each of
the runs has a parameterized version of HybridRankSVM and thresholding points,
which are listed in Table 2.
</p>
      <p>Algorithm 1 (excerpt): iterative feedback and re-ranking.
k' = k
while finalRanking does not contain both relevant and irrelevant documents do
    k' = k' + 1
    finalRanking[k'] = R[k']
while not (length(finalRanking) == n or length(finalRanking) == t_final) do
    train(finalRanking)  // train a local classifier by asking for abstract
                         // or document relevance for these documents
    localRanking = rerank(R - finalRanking)  // re-rank the rest of the initial
                                             // list R from the predictions of
                                             // the local classifier
    if length(finalRanking) &lt; t_step then
        step = step_init
    ...</p>
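The iterative feedback loop above can be sketched end to end in Python. This is a simplified stand-in: the scorer ranks by word overlap with positively labeled documents instead of the paper's TF-IDF + SVM local classifier, and the documents and oracle are invented:

```python
# Hedged sketch of the iterative feedback loop: seed with the top-k documents
# from the initial inter-topic ranking, ask an oracle for their labels, then
# repeatedly re-rank the rest and move the current top prediction into the
# labeled set.
def iterative_screening(initial_ranking, oracle, k=2, budget=None):
    labeled = {}                      # doc -> relevance feedback (0 or 1)
    for doc in initial_ranking[:k]:   # seed from the inter-topic ranking
        labeled[doc] = oracle(doc)
    remaining = [d for d in initial_ranking if d not in labeled]
    shown_order = list(labeled)
    while remaining and (budget is None or len(labeled) < budget):
        # stand-in "local classifier": score by word overlap with positives
        positives = {w for d, y in labeled.items() if y == 1 for w in d.split()}
        remaining.sort(key=lambda d: -len(positives & set(d.split())))
        top = remaining.pop(0)        # show the current best to the user...
        labeled[top] = oracle(top)    # ...and request relevance feedback
        shown_order.append(top)
    return shown_order

docs = ["apple pie", "car engine", "apple tart", "engine oil"]  # toy documents
truth = {"apple pie": 1, "car engine": 0, "apple tart": 1, "engine oil": 0}
order = iterative_screening(docs, lambda d: truth[d], k=2)
print(order)  # ['apple pie', 'car engine', 'apple tart', 'engine oil']
```

Note how the second relevant document ("apple tart") is re-ranked ahead of the remaining irrelevant one after the first round of feedback.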
    </sec>
    <sec id="sec-4">
      <title>Conclusion and future work</title>
      <p>In conclusion, in this paper we introduced a hybrid classification approach for
medical document ranking. Our approach constructs a global classification model
based on LTR features of the training documents, produces an initial ranking
for the test documents, and then iteratively asks for feedback and re-ranks them
based on the acquired relevance.</p>
      <p>
        As future work, we believe that experimentation with more features, such as
semantic representations (e.g. word2vec [11], LDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ], etc.) or different
undersampling setups could boost metrics even further. Moreover, it would be
worthwhile to experiment with other classification approaches as well, such as
neural networks.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by Atypon Systems Inc.</p>
      <p>10. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory under-sampling for class-imbalance learning. In: Proceedings - IEEE International Conference on Data Mining, ICDM. pp. 965-969 (2006)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
12. Qin, T., Liu, T.Y., Xu, J., Li, H.: LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4), 346-374 (2010)
13. Sackett, D.L.: Evidence-based medicine. In: Seminars in perinatology. vol. 21, pp. 3-5. Elsevier (1997)
14. Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Active learning for biomedical citation screening. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 173-182. ACM (2010)
15. Wallace, B.C., Trikalinos, T.A., Lau, J., Brodley, C., Schmid, C.H.: Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 11(1), 55 (2010)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research 3(Jan)</source>
          ,
          <volume>993</volume>
          -
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>From ranknet to lambdarank to lambdamart: An overview</article-title>
          .
          <source>Learning</source>
          <volume>11</volume>
          (
          <fpage>23</fpage>
          -
          <lpage>581</lpage>
          ),
          <volume>81</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          .
          <source>In: Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <volume>785</volume>
          -
          <fpage>794</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Engineering quality and reliability in technologyassisted review</article-title>
          .
          <source>In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <volume>75</volume>
          -
          <fpage>84</fpage>
          . SIGIR '16,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2016</year>
          ), http://doi.acm.
          <source>org/10</source>
          .1145/2911451. 2911510
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>20</volume>
          (
          <issue>3</issue>
          ),
          <volume>273</volume>
          -
          <fpage>297</fpage>
          (
          <year>1995</year>
          ), http://dx.doi.org/10.1023/A:1022627411411
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          <volume>38</volume>
          (
          <issue>4</issue>
          ),
          <volume>367</volume>
          -
          <fpage>378</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palotti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
          </string-name>
          , G.:
          <article-title>Clef 2017 ehealth evaluation lab overview</article-title>
          .
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          (
          <year>September 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>A probabilistic model of information retrieval: development and comparative experiments: Part 2</article-title>
          .
          <source>Information processing &amp; management 36(6)</source>
          ,
          <volume>809</volume>
          -
          <fpage>840</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
          </string-name>
          , R.:
          <article-title>Clef 2017 technologically assisted reviews in empirical medicine overview</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , Dublin, Ireland (
          <year>2017</year>
          ),
          <article-title>CEUR-WS</article-title>
          .org
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>