<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aristotle University's Approach to the Technologically Assisted Reviews in Empirical Medicine Task of the 2018 CLEF eHealth Lab</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki 54124</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Systematic reviews are literature reviewing processes that aim to retrieve all relevant content on a specific topic in an exhaustive manner. Such reviews are particularly useful in healthcare, where decision making must take into account all possible evidence. They are usually conducted by constructing a Boolean query, submitting it to a database, and then screening the retrieved documents for relevant ones. Task 2 of the CLEF 2018 eHealth lab focuses on automating this process on two fronts: Sub-Task 1 is about bypassing the construction of the Boolean query, retrieving relevant documents and ranking them by relevance based on a protocol that describes a topic, while Sub-Task 2 is about ranking the documents retrieved by a query already constructed by Cochrane experts. We present our approaches for both sub-tasks, which combine a learning-to-rank model trained on multiple reviews with a model incrementally trained on each individual review using relevance feedback.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Systematic reviews are a crucial part of Evidence-Based Medicine, which uses
all current evidence to support a decision on how a patient will be treated.
These reviews aim to find the aforementioned evidence, which must fit certain
criteria in order to take part in the final decision making. Systematic reviews
can be broken down into a three-step process:
1. Document Retrieval: An expert builds a Boolean query that describes
their review topic, which is then submitted to a medical database. Boolean
queries decide whether a document is relevant by the presence
(or absence) of user-specified terms in the document. By using Boolean logic,
complex queries with multiple rules can be constructed in order to filter
through large amounts of information (a hypothetical example follows this list).
2. Title and Abstract Screening: After the possibly relevant documents
have been retrieved, they must be screened to find the truly relevant ones.
Screening takes place in two stages: in the first stage, experts review each
retrieved document's title and abstract, and decide whether it is non-relevant,
or possibly relevant and therefore must be read in full.
3. Document Screening: The second stage of screening is reading the full text
of the documents that passed the first screening stage, and deciding
whether they should take part in the review.</p>
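      <p>As an illustration, a PubMed-style Boolean query combining such rules might look like the following. This is a hypothetical query, not one from the task; the field tags restrict matching to titles/abstracts or MeSH annotations:</p>
      <preformat>
("deep vein thrombosis"[Title/Abstract] OR DVT[Title/Abstract])
  AND (ultrasonography[MeSH Terms] OR "compression ultrasound"[Title/Abstract])
  NOT (animals[MeSH Terms] NOT humans[MeSH Terms])
      </preformat>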
      <p>Document screening is the most time-consuming task of this process.
Medical databases are expanding rapidly: PubMed counts 26,759,399 citations as
of 2017 (https://www.nlm.nih.gov/bsd/licensee/2017_stats/2017_LO.html). Boolean
queries on such databases are bound to retrieve a large number
of documents, hence the need for automation of this task. This is, however, a
complex problem, due to the imbalance of the data (few relevant documents,
many non-relevant ones) and the misclassification cost, where omitting
a relevant document might take a great toll on the final decision making.</p>
      <p>
        Task 2 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of CLEF 2018 eHealth lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on the first two parts of
the systematic review process. Our approach consists of phrase extraction and
querying for the document retrieval step, as well as a hybrid classification model
for the title and abstract screening step, which initially ranks the retrieved
documents using Learning-to-Rank (LTR) features and then uses relevance feedback
to iteratively re-rank them, based on simple text representations.
      </p>
      <p>The rest of this paper is organized as follows: we briefly describe Task 2 of the
CLEF 2018 eHealth lab in Section 2, and in Section 3 we analyze our approaches.
Section 4 contains the results and the submitted runs, and finally Section 5
concludes and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Task Overview</title>
      <p>This year, CLEF eHealth's Task 2 was split into two sub-tasks. Sub-Task 1 was
about searching in PubMed for relevant documents given a piece of text, while
Sub-Task 2 was the same as last year's CLEF eHealth Task 2.</p>
      <p>Sub-Task 1 aims to bypass the first part of a systematic review: the
construction of the Boolean query that would later be submitted to a database
to retrieve possibly relevant documents.</p>
      <p>Given 40 topics as a training set and 30 as a test set, participants were asked
to return a ranking with a maximum of 5000 documents per topic. Each topic
contained its ID, title and objective, as well as a protocol that described that
particular topic. Each topic protocol had six fields, including another objective field
that was slightly different from the topic's own:
1. Objective
2. Type of Study
3. Participants
4. Index Tests
5. Target Conditions
6. Reference Standards
For each topic, participants were also provided with a date cut-off. This cut-off
was also used in the Boolean queries that were constructed by Cochrane experts
to retrieve relevant documents.</p>
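      <p>As a minimal sketch, a Sub-Task 1 topic could be represented in code with the fields listed above; the class and field names here are our own illustration, not the task's official format:</p>
      <preformat>
# Hypothetical in-memory representation of a Sub-Task 1 topic.
from dataclasses import dataclass

@dataclass
class Topic:
    topic_id: str
    title: str
    objective: str
    # Protocol fields (a second, slightly different objective plus five more)
    protocol_objective: str
    type_of_study: str
    participants: str
    index_tests: str
    target_conditions: str
    reference_standards: str
    date_cutoff: str  # e.g. "2017/12/31"; applied when querying PubMed
      </preformat>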
      <p>Sub-Task 2 concerns the efficient ranking of the possibly relevant documents
retrieved. Given a topic, its query and the documents retrieved, the goal is to
rank the documents so that the most relevant ones appear first, as well as to find a
threshold after which no documents will be shown to the user. The training set
consisted of 42 topics, where each topic contained:
1. A unique topic ID
2. A title
3. An Ovid MEDLINE boolean query, constructed by Cochrane experts
4. The PubMed IDs as returned from the execution of the boolean query</p>
      <p>For both tasks, the relevant document PIDs (PubMed IDs) were provided as
well, for abstract and content relevance. This enabled the use of algorithms that
requested relevance feedback from the user.
</p>
    </sec>
    <sec id="sec-3">
      <title>Our Approach</title>
      <p>
        For both sub-tasks, we used last year's model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with some enhancements, as
well as some modifications for Sub-Task 1. It consists of two models:
1. An inter-topic XGBoost [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] classifier that is trained on LTR features between
a topic and a document and produces an initial ranking of the documents.
      </p>
      <p>This inter-topic model is trained on all the training topics.
2. An intra-topic Support Vector Machine (SVM) classifier that is iteratively
trained on TF-IDF vectors after asking for feedback on the documents
ranked highest by the inter-topic model. This intra-topic model is trained
for each of the test topics using relevance feedback at prediction time.</p>
      <p>Algorithm 1 describes the re-ranking algorithm employed by the intra-topic
model.</p>
      <preformat>
Algorithm 1: Intra-topic re-ranking with relevance feedback
Input: initial ranking R of n documents (from the inter-topic model), seed
       size k, step sizes step_init and step_secondary, step threshold t_step,
       feedback budget t_final

finalRanking ← R[1..k]; k' ← k        // ask feedback for the top-k documents
while finalRanking does not contain both relevant and irrelevant documents do
    k' ← k' + 1
    finalRanking[k'] ← R[k']
while length(finalRanking) ≠ n and length(finalRanking) ≠ t_final do
    train(finalRanking)                     // train a local classifier by asking
                                            // for abstract or document relevance
    localRanking ← rerank(R \ finalRanking) // re-rank the rest of the initial
                                            // list R with its predictions
    if length(finalRanking) &lt; t_step then
        step ← step_init
    else
        step ← step_secondary
    for i = k' + 1 to k' + step do
        finalRanking[i] ← localRanking[i - k']
    k' ← k' + step
return finalRanking
      </preformat>
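      <p>A minimal Python sketch of this feedback loop follows, assuming scikit-learn's TfidfVectorizer and LinearSVC for the intra-topic classifier; ask_feedback is a stand-in for the task's relevance oracle, and the parameter defaults mirror the values reported in Section 4 rather than a fixed part of the method:</p>
      <preformat>
# Sketch of Algorithm 1: iterative re-ranking with relevance feedback.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def intra_topic_rerank(docs, ranking, ask_feedback, k=10, step_init=1,
                       step_secondary=50, t_step=200, t_final=1000):
    """docs: doc id -> text; ranking: doc ids from the inter-topic model."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([docs[d] for d in ranking])
    row = {d: i for i, d in enumerate(ranking)}
    final, labels = [], {}
    for d in ranking:                    # seed: top k, extended until both
        labels[d] = ask_feedback(d)      # classes are present (1/0 labels)
        final.append(d)
        if len(final) >= k and len(set(labels.values())) == 2:
            break
    while len(final) &lt; min(t_final, len(ranking)):
        clf = LinearSVC(C=0.1)           # relaxed C, cf. Section 3.2
        clf.fit(X[[row[d] for d in final]], [labels[d] for d in final])
        rest = [d for d in ranking if d not in labels]
        scores = clf.decision_function(X[[row[d] for d in rest]])
        order = (-scores).argsort()      # most confidently relevant first
        rest = [rest[i] for i in order]
        step = step_init if len(final) &lt; t_step else step_secondary
        for d in rest[:step]:            # ask feedback for the next batch
            labels[d] = ask_feedback(d)
            final.append(d)
    return final
      </preformat>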
      <sec id="sec-3-1">
        <title>Sub-Task 1: No Boolean Search</title>
        <p>The first step for Sub-Task 1 was to find the initial relevant documents. For each
topic, we used its title and objective to create queries that were later submitted
to PubMed. To construct the queries, we tokenized both pieces of text, removed
the stop-words, and extracted phrases from the resulting word lists. Figure 1
shows an example of this process.</p>
      <p>The phrases we extracted were the n-grams (n ∈ {2, 3, 4, 5, 6}) of the words
of each piece of text. Each phrase was then submitted to PubMed with the
date cut-off given for each topic, and for each query we retrieved a maximum of 2500
documents. A sketch of this step follows.</p>
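      <p>The sketch below illustrates the phrase extraction and query submission, assuming Biopython's Bio.Entrez module as one way to query PubMed; the stop-word list, e-mail address, and date handling are illustrative placeholders:</p>
      <preformat>
# Sketch: n-gram phrase extraction and PubMed submission (Bio.Entrez assumed).
from Bio import Entrez

Entrez.email = "you@example.org"       # placeholder; required by NCBI
STOPWORDS = {"the", "of", "and", "in", "a", "for", "to", "with"}  # toy list

def phrases(text, n_min=2, n_max=6):
    """All n-grams (n = 2..6) over the stop-word-filtered token list."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def search(phrase, cutoff, retmax=2500):
    """Submit one phrase, restricted to the topic's date cut-off."""
    handle = Entrez.esearch(db="pubmed", term=f'"{phrase}"[Title/Abstract]',
                            retmax=retmax, datetype="pdat",
                            mindate="1800/01/01", maxdate=cutoff)
    return Entrez.read(handle)["IdList"]
      </preformat>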
      <p>
        For the query construction, we also experimented with TextRank [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an
algorithm for keyword extraction. After extracting the keywords from both the
title and the objective, we created the queries the same way as described above,
where the text of each topic was now its keywords. This process did not seem
to work well, as it decreased the total recall. We further experimented with
the maximum number of documents allowed per query, where we had to trade
off recall against the number of documents retrieved. The 2500 limit proved to be a
good fit, since retrieving more documents would not increase recall significantly,
but would require our models to rank many more documents.</p>
      <p>
        After retrieving the possibly relevant documents per topic, we use the
inter-topic and intra-topic models to rank them. The LTR features used for the
inter-topic model were computed using the title and abstract of each document and
the different fields of each topic protocol, as well as the topic's title and
objective. Table 1 shows the features employed by our model. For the inter-topic
model, we use an Easy Ensemble [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of 10 XGBoost classifiers, where each
classifier is trained on all the relevant documents and a randomly sampled subset
of the non-relevant documents, with 5 non-relevant documents sampled per
relevant one. A sketch of this ensemble follows.
      </p>
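      <p>A minimal sketch of the ensemble, assuming the xgboost scikit-learn wrapper; the hyperparameters are illustrative, and the feature matrix X and label vector y (numpy arrays) are assumed to come from the LTR feature extraction described above:</p>
      <preformat>
# Sketch: Easy Ensemble of 10 XGBoost members with 5:1 undersampling.
import numpy as np
from xgboost import XGBClassifier

def easy_ensemble(X, y, n_members=10, neg_per_pos=5, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        # All relevant documents plus a fresh random 5:1 non-relevant sample.
        sampled = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)),
                             replace=False)
        idx = np.concatenate([pos, sampled])
        clf = XGBClassifier(n_estimators=200, max_depth=4)  # illustrative
        members.append(clf.fit(X[idx], y[idx]))
    return members

def ensemble_score(members, X):
    # Average positive-class probability across the ensemble members.
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
      </preformat>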
      <p>After getting an initial ranking from the inter-topic model, we use the
intra-topic model to re-rank up to the first 20,000 documents and keep the first 5000,
as per the task's limit.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Sub-Task 2: Abstract and Title Screening</title>
        <p>For the second sub-task, we also employed last year's model, with a few
modifications to both the inter-topic and the intra-topic models.</p>
      <p>
        Inter-Topic Model. For the inter-topic model, we included some semantic
information through additional LTR features. Table 2 shows the features we
previously experimented with, along with the new semantic features. We
further improved our model by removing stop-words, and we fixed some minor
issues with the BM25 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] features.
      </p>
      <p>Features 1-24 are the same as in last year's submission. We distinguish between
two topic fields: the query, which is a list of Medical Subject Headings (MeSH)
terms extracted from the topic's Ovid MEDLINE query, and the title. MeSH terms
are semantic annotations added manually to PubMed documents. The notation
used for the LTR features is as follows:
1. t is a topic field
2. d is a document field
3. c(t_i, d) counts the number of times the term t_i appears in the document field d
4. c(m_i, d) counts the number of times the MeSH term m_i appears in the document field d
5. |C| is the total number of documents in the collection
6. df(t_i) is the number of documents that contain the term t_i
7. levenshtein(m_i, d_j) is the Levenshtein distance between the MeSH term m_i and the term d_j
A sketch of the simplest of these statistics follows.</p>
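      <p>To make the notation concrete, here is a small sketch of the raw counting statistics behind the count-based features; the collection and field texts are toy placeholders:</p>
      <preformat>
# Sketch: the counting statistics used by the LTR features.
from collections import Counter

collection = {                     # toy collection: doc id -> field text
    "d1": "ultrasound for suspected appendicitis in children",
    "d2": "mri and ct imaging in acute appendicitis",
}

def c(term, doc_field):            # c(t_i, d): occurrences of t_i in field d
    return Counter(doc_field.lower().split())[term.lower()]

def df(term):                      # df(t_i): number of documents containing t_i
    return sum(term.lower() in text.split() for text in collection.values())

C_size = len(collection)           # |C|: total number of documents
print(c("appendicitis", collection["d1"]), df("appendicitis"), C_size)  # 1 2 2
      </preformat>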
      <p>For features 25 and 26, we applied Latent Semantic Analysis (LSA) to the TF-IDF
vectors of the titles and the abstracts of each document, keeping 200 components.
Then, for each document in a topic, we computed the cosine similarities between the
corresponding LSA vectors (topic title vs. document title, and topic title vs. document abstract).</p>
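      <p>A sketch of these two features, assuming scikit-learn's TruncatedSVD over TF-IDF; the corpora are placeholders, and the component guard only matters for toy-sized input:</p>
      <preformat>
# Sketch: LSA (features 25 and 26) via TruncatedSVD on TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_features(titles, abstracts, topic_title, n_components=200):
    corpus = titles + abstracts + [topic_title]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    k = min(n_components, tfidf.shape[1] - 1)   # guard for small corpora
    lsa = TruncatedSVD(n_components=k).fit_transform(tfidf)
    topic_vec = lsa[-1:]                        # the topic title's LSA vector
    f25 = cosine_similarity(topic_vec, lsa[:len(titles)])[0]     # vs. titles
    f26 = cosine_similarity(topic_vec, lsa[len(titles):-1])[0]   # vs. abstracts
    return f25, f26
      </preformat>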
      <p>
        Features 27 and 28 use Word2Vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] vectors, obtained from the BioASQ
challenge (http://bioasq.org/). These vectors were trained on 10,876,004 abstracts from PubMed,
with a vocabulary of 1,701,632 words and a dimensionality of 200. For each piece
of text, we average all of its word vectors, which results in a single
vector representing the document. Then, we compute the cosine similarities between
a topic and a document using these vectors.
      </p>
      <p>
        Features 29 and 30 use the Word2Vec vectors again, this time to compute
the Word Mover's Distance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] between pieces of text.
      </p>
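      <p>A combined sketch of features 27-30, assuming a gensim KeyedVectors model loaded from the BioASQ release; the file path is a placeholder, and wmdistance additionally requires gensim's optional Word Mover's Distance dependency:</p>
      <preformat>
# Sketch: averaged-Word2Vec cosine similarity (27/28) and WMD (29/30).
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("bioasq_pubmed_word2vec.bin", binary=True)

def avg_vector(text):
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

topic = "imaging tests for acute appendicitis"
title = "ultrasound for suspected appendicitis"
sim_27_28 = cosine(avg_vector(topic), avg_vector(title))
dist_29_30 = wv.wmdistance(topic.lower().split(), title.lower().split())
      </preformat>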
      <p>Feature 31 uses document vector representations, obtained by training a
Doc2Vec [<xref ref-type="bibr" rid="ref10">10</xref>] model on the documents collected from the training set. The
model was trained on each document's title and abstract. Vectors for
documents not in the model's training set were inferred.</p>
      <p>The new semantic features seemed to improve performance, but some of them
proved to be better than others. For the final runs, from the semantic features
we kept only 25, 26, 29 and 30, which use the Latent Semantic Analysis and the
Word Mover's Distance.</p>
      <p>[Table of LTR features: only the column skeleton survives extraction; the columns were Category, Topic field (Title or Query), and Document field (Title or Abstract).]</p>
      <p>Apart from adding new LTR features, we experimented with a variety of
other techniques. First, we tried expanding the title query with more words,
to obtain a bigger piece of text and thus compute more accurate similarities.
For each word in the title, we found its K most similar words using cosine
similarity on the Word2Vec embeddings and added them to the title (see the
sketch below). Even for small values of K (e.g. 2) this did not seem to improve
performance. We also tested providing the document vectors (query title, document)
from Doc2Vec directly to the inter-topic model, either concatenated or subtracted
one from the other, which still did not improve performance. Lastly, we experimented
with resampling techniques, specifically Easy Ensemble (undersampling) and SMOTE
(oversampling) [<xref ref-type="bibr" rid="ref11">11</xref>], which
did not improve performance either. On the contrary, Easy Ensemble works well
for the first sub-task, where the number of non-relevant documents is on average
an order of magnitude larger.</p>
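      <p>For completeness, a sketch of the title-expansion experiment described above, taking a KeyedVectors model such as the one loaded earlier; most_similar returns (word, score) pairs:</p>
      <preformat>
# Sketch: expand each title word with its K nearest Word2Vec neighbours.
def expand_title(title, wv, k=2):
    words = title.lower().split()
    expanded = list(words)
    for w in words:
        if w in wv:
            expanded += [word for word, _ in wv.most_similar(w, topn=k)]
    return " ".join(expanded)
      </preformat>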
      <p>Intra-Topic Model. For the intra-topic model, we relaxed the C parameter
of the SVM, which controls how "strict" the hyperplane is in avoiding
misclassification, to allow for a bigger margin. The intuition came from the
fact that, given the sheer class imbalance, finding a hyperplane with a bigger
margin will probably fit the data better than finding a strict one, which may lead
to overfitting. This relaxation seemed to improve the model's predictions in our
evaluations. A toy illustration follows.</p>
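      <p>A toy illustration of the relaxation, assuming scikit-learn's LinearSVC; the value C = 0.1 and the two-document corpus are illustrative, not the submitted setting:</p>
      <preformat>
# Sketch: a smaller C widens the margin and tolerates more training error,
# which we found preferable under heavy class imbalance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["relevant screening abstract", "non-relevant cardiology abstract"]
labels = [1, 0]

relaxed = make_pipeline(TfidfVectorizer(), LinearSVC(C=0.1)).fit(docs, labels)
print(relaxed.decision_function(["screening abstract for a review"]))
      </preformat>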
      <p>Additionally, we experimented with different SVM kernels, but they proved
much slower and less effective than the linear one. We also added n-grams (2, 3),
but they did not give better results either. Finally, we tried to use embeddings
for this task as well, using the average Word2Vec vectors or the document
vectors from Doc2Vec as input instead of the simple TF-IDF representations, to
no avail.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Both sub-tasks of CLEF eHealth Task 2 supported both thresholded and
non-thresholded runs. Our models, however, do not apply a threshold to the final
ranking automatically; instead, we submitted thresholded runs with fixed,
hand-picked thresholds.</p>
      <p>Multiple metrics were used for evaluation; they are described in detail
on the task's website (https://sites.google.com/view/clef-ehealth-2018/task-2-technologically-assisted-reviews-in-empirical-medicine).
The primary ones, as stated there, are Mean Average Precision and Recall,
on which we focus below. Note that in the official evaluation script
(https://github.com/CLEF-TAR/tar), which we used to produce the following results,
Mean Average Precision is computed on the whole ranking, without taking the
threshold into account.</p>
      <p>Table 3 shows our results for Sub-Task 1. The re-ranking parameters for the
intra-topic model of HybridSVM are:</p>
      <p>k = 10, step_init = 1, t_step = 200, step_secondary = 50, t_final = 1000.
The Threshold column refers to the hand-picked threshold mentioned above, and
the Train Relevance column indicates which relevance judgements (abstract or
content) were used for training. For evaluation, content relevance was used, as per
the competition's guidelines. We submitted runs 1, 2 and 3, since we found only
after the submission deadline that training with abstract relevance gave slightly
better results. This is, however, an interesting observation: since there are more
relevant documents at abstract level than at content level, the class imbalance is
slightly less severe when training with abstract relevance, thus producing
slightly better results.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this paper, we described our approaches for both sub-tasks of Task 2 of CLEF
eHealth 2018. We introduced new features and tweaked last year's models to
improve performance, with an emphasis on semantic features.</p>
      <p>As future work, we believe that further improvements can be made in both
sub-tasks. For Sub-Task 1, the query construction stage could benefit from filtering
out words that are not medically relevant, in order to reduce the number of
queries and consequently the number of retrieved documents. For the
ranking model (Sub-Tasks 1 and 2), more semantic features could benefit the
inter-topic model, while a better strategy for requesting feedback in the intra-topic
model could boost the metrics. Finally, it would be interesting to apply deep
learning techniques to the task, and to use word embeddings in a more
efficient way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          , Rene Spijker,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          .
          <article-title>CLEF 2018 Technology Assisted Reviews in Empirical Medicine Overview</article-title>
          .
          <source>In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hanna</given-names>
            <surname>Suominen</surname>
          </string-name>
          , Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Leif Azzopardi, Rene Spijker,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Lionel Ramadier, Aude Robert, Guido Zuccon, and
          <string-name>
            <given-names>João</given-names>
            <surname>Palotti</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2018</article-title>
          .
          <source>In CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Anagnostou</surname>
          </string-name>
          , Athanasios Lagopoulos, Grigorios Tsoumakas, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Combining inter-review learning-to-rank and intra-review incremental training for title and abstract screening in systematic reviews</article-title>
          .
          <source>In CLEF 2017 Working Notes, CEUR Workshop Proceedings</source>
          , volume
          <volume>1866</volume>
          , Dublin, Ireland,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>XGBoost: A Scalable Tree Boosting System</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          , pages
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          , New York, New York, USA,
          <year>2016</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Tarau</surname>
          </string-name>
          .
          <article-title>TextRank: Bringing Order into Texts</article-title>
          .
          <source>In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004</source>
          , pages
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          , Barcelona, Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Xu-Ying</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianxin</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhi-Hua</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Exploratory Undersampling for Class Imbalance Learning</article-title>
          .
          <source>IEEE Transactions on Systems, Man and Cybernetics</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ):
          <fpage>539</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>A probabilistic model of information retrieval: development and comparative experiments. Part 2</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          :
          <fpage>809</fpage>
          -
          <lpage>840</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of the International Conference on Learning Representations (ICLR
          <year>2013</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Matt J.</given-names>
            <surname>Kusner</surname>
          </string-name>
          , Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger.
          <article-title>From Word Embeddings to Document Distances</article-title>
          .
          <source>Proceedings of The 32nd International Conference on Machine Learning</source>
          ,
          <volume>37</volume>
          :
          <fpage>957</fpage>
          -
          <lpage>966</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Machine Learning, ICML 2014</source>
          , volume
          <volume>32</volume>
          , pages
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          , Beijing, China,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Nitesh V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          , Kevin W. Bowyer, Lawrence O. Hall, and
          <string-name>
            <given-names>W. Philip</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          .
          <article-title>SMOTE: Synthetic Minority Over-sampling Technique</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>