<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aristotle University's Approach to the Technologically Assisted Reviews in Empirical Medicine Task of the 2018 CLEF eHealth Lab</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki 54124</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Systematic reviews are literature reviewing processes that aim to retrieve all relevant content on a specific topic in an exhaustive manner. Such reviews are particularly useful in healthcare, where decision making must take into account all possible evidence. They are usually conducted by constructing a Boolean query, submitting it to a database, and then screening the retrieved documents for relevant ones. Task 2 of the CLEF 2018 eHealth lab focuses on automating this process on two fronts: Sub-Task 1 is about bypassing the construction of the Boolean query, retrieving relevant documents and ranking them by relevance based on a protocol that describes a topic, while Sub-Task 2 is about ranking the documents retrieved by a query already constructed by Cochrane experts. We present our approaches for both sub-tasks, which combine a learning-to-rank model trained on multiple reviews with a model incrementally trained on each individual review using relevance feedback.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Systematic reviews are a crucial part of Evidence-Based Medicine, which uses
all current evidence to support a decision on how a patient will be treated.
These reviews aim to find the aforementioned evidence, which must fit certain
criteria in order to take part in the final decision making. Systematic reviews
can be broken down into a three-step process:
1. Document Retrieval: An expert builds a Boolean query that describes
their review topic, which is then submitted to a medical database. Boolean
queries decide whether a document is relevant by the presence
(or absence) of user-specified terms in the document. By using Boolean logic,
complex queries with multiple rules can be constructed in order to filter
through large amounts of information (a hypothetical example follows this list).
2. Title and Abstract Screening: After the possibly relevant documents
have been retrieved, they must be screened to find the truly relevant ones.
Screening takes place in two stages: in the first stage, experts review each
retrieved document's title and abstract, and decide whether it is non-relevant,
or possibly relevant and therefore must be read in full.
3. Document Screening: The second stage of screening is reading the full text
of the documents that passed the first screening stage, and deciding
whether they should take part in the review.</p>
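      <p>As an illustration, a PubMed-style Boolean query combining such rules might look like the following. This is a hypothetical query, not one from the task; the field tags restrict matching to titles/abstracts or MeSH annotations:</p>
      <preformat>
("deep vein thrombosis"[Title/Abstract] OR DVT[Title/Abstract])
  AND (ultrasonography[MeSH Terms] OR "compression ultrasound"[Title/Abstract])
  NOT (animals[MeSH Terms] NOT humans[MeSH Terms])
      </preformat>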
      <p>Document screening is the most time-consuming task of this process.
Medical databases are expanding rapidly: PubMed counts 26,759,399 citations as
of 2017 (https://www.nlm.nih.gov/bsd/licensee/2017_stats/2017_LO.html). Boolean
queries on such databases are bound to retrieve a large number
of documents, hence the need for automation of this task. This is, however, a
complex problem, due to the imbalance of the data (few relevant documents,
many non-relevant ones) and the misclassification cost, where omitting
a relevant document might take a great toll on the final decision making.</p>
      <p>
        Task 2 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of CLEF 2018 eHealth lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on the first two parts of
the systematic review process. Our approach consists of phrase extraction and
querying for the document retrieval step, as well as a hybrid classification model
for the title and abstract screening step, which initially ranks the retrieved
documents using Learning-to-Rank (LTR) features and then uses relevance feedback
to iteratively re-rank them, based on simple text representations.
      </p>
      <p>The rest of this paper is organized as follows: we briefly describe Task 2 of the
CLEF 2018 eHealth lab in Section 2, and in Section 3 we analyze our approaches.
Section 4 contains the results and the submitted runs, and finally Section 5
concludes and discusses future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Task Overview</title>
      <p>This year, CLEF eHealth's Task 2 was split into two sub-tasks. Sub-Task 1 was
about searching in PubMed for relevant documents given a piece of text, while
Sub-Task 2 was the same as last year's CLEF eHealth Task 2.</p>
      <p>Sub-Task 1 aims to bypass the first part of a systematic review: the
construction of the Boolean query that would later be submitted to a database
to retrieve possibly relevant documents.</p>
      <p>Given 40 topics as a training set and 30 as a test set, participants were asked
to return a ranking with a maximum of 5000 documents per topic. Each topic
contained its ID, title and objective, as well as a protocol that described that
particular topic. Each topic protocol had six fields, including another objective field
that was slightly different from the topic's own:
1. Objective
2. Type of Study
3. Participants
4. Index Tests
5. Target Conditions
6. Reference Standards
For each topic, participants were also provided with a date cut-off. This cut-off
was also used in the Boolean queries that were constructed by Cochrane experts
to retrieve relevant documents.</p>
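      <p>As a minimal sketch, a Sub-Task 1 topic could be represented in code with the fields listed above; the class and field names here are our own illustration, not the task's official format:</p>
      <preformat>
# Hypothetical in-memory representation of a Sub-Task 1 topic.
from dataclasses import dataclass

@dataclass
class Topic:
    topic_id: str
    title: str
    objective: str
    # Protocol fields (a second, slightly different objective plus five more)
    protocol_objective: str
    type_of_study: str
    participants: str
    index_tests: str
    target_conditions: str
    reference_standards: str
    date_cutoff: str  # e.g. "2017/12/31"; applied when querying PubMed
      </preformat>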
      <p>Sub-Task 2 concerns the efficient ranking of the possibly relevant documents
retrieved. Given a topic, its query and the documents retrieved, the goal is to
rank the documents so that the most relevant ones appear first, as well as to find a
threshold after which no documents will be shown to the user. The training set
consisted of 42 topics, where each topic contained:
1. A unique topic ID
2. A title
3. An Ovid MEDLINE boolean query, constructed by Cochrane experts
4. The PubMed IDs as returned from the execution of the boolean query</p>
      <p>For both tasks, the relevant document PIDs (PubMed IDs) were provided as
well, for abstract and content relevance. This enabled the use of algorithms that
requested relevance feedback from the user.
</p>
    </sec>
    <sec id="sec-3">
      <title>Our Approach</title>
      <p>
        For both sub-tasks, we used last year's model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with some enhancements, as
well as some modifications for Sub-Task 1. It consists of two models:
1. An inter-topic XGBoost [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] classifier that is trained on LTR features between
a topic and a document and produces an initial ranking of the documents.
      </p>
      <p>This inter-topic model is trained on all the training topics.
2. An intra-topic Support Vector Machine (SVM) classifier that is iteratively
trained on TF-IDF vectors after asking for feedback on the documents
ranked highest by the inter-topic model. This intra-topic model is trained
for each of the test topics using relevance feedback at prediction time.</p>
      <p>Algorithm 1 describes the re-ranking algorithm employed by the intra-topic
model.</p>
      <preformat>
Algorithm 1: Intra-topic re-ranking with relevance feedback
Input: initial ranking R of n documents (from the inter-topic model), seed
       size k, step sizes step_init and step_secondary, step threshold t_step,
       feedback budget t_final

finalRanking ← R[1..k]; k' ← k        // ask feedback for the top-k documents
while finalRanking does not contain both relevant and irrelevant documents do
    k' ← k' + 1
    finalRanking[k'] ← R[k']
while length(finalRanking) ≠ n and length(finalRanking) ≠ t_final do
    train(finalRanking)                     // train a local classifier by asking
                                            // for abstract or document relevance
    localRanking ← rerank(R \ finalRanking) // re-rank the rest of the initial
                                            // list R with its predictions
    if length(finalRanking) &lt; t_step then
        step ← step_init
    else
        step ← step_secondary
    for i = k' + 1 to k' + step do
        finalRanking[i] ← localRanking[i - k']
    k' ← k' + step
return finalRanking
      </preformat>
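      <p>A minimal Python sketch of this feedback loop follows, assuming scikit-learn's TfidfVectorizer and LinearSVC for the intra-topic classifier; ask_feedback is a stand-in for the task's relevance oracle, and the parameter defaults mirror the values reported in Section 4 rather than a fixed part of the method:</p>
      <preformat>
# Sketch of Algorithm 1: iterative re-ranking with relevance feedback.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def intra_topic_rerank(docs, ranking, ask_feedback, k=10, step_init=1,
                       step_secondary=50, t_step=200, t_final=1000):
    """docs: doc id -> text; ranking: doc ids from the inter-topic model."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([docs[d] for d in ranking])
    row = {d: i for i, d in enumerate(ranking)}
    final, labels = [], {}
    for d in ranking:                    # seed: top k, extended until both
        labels[d] = ask_feedback(d)      # classes are present (1/0 labels)
        final.append(d)
        if len(final) >= k and len(set(labels.values())) == 2:
            break
    while len(final) &lt; min(t_final, len(ranking)):
        clf = LinearSVC(C=0.1)           # relaxed C, cf. Section 3.2
        clf.fit(X[[row[d] for d in final]], [labels[d] for d in final])
        rest = [d for d in ranking if d not in labels]
        scores = clf.decision_function(X[[row[d] for d in rest]])
        order = (-scores).argsort()      # most confidently relevant first
        rest = [rest[i] for i in order]
        step = step_init if len(final) &lt; t_step else step_secondary
        for d in rest[:step]:            # ask feedback for the next batch
            labels[d] = ask_feedback(d)
            final.append(d)
    return final
      </preformat>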
      <sec id="sec-3-1">
        <title>Sub-Task 1: No Boolean Search</title>
        <p>The first step for Sub-Task 1 was to find the initial relevant documents. For each
topic, we used its title and objective to create queries that were later submitted
to PubMed. To construct the queries, we tokenized both pieces of text, removed
the stop-words, and extracted phrases from the resulting word lists. Figure 1
shows an example of this process.</p>
      <p>The phrases we extracted were the n-grams (n ∈ {2, 3, 4, 5, 6}) of the words
of each piece of text. Each phrase was then submitted to PubMed with the
date cut-off given for each topic, and for each query we retrieved a maximum of 2500
documents. A sketch of this step follows.</p>
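      <p>The sketch below illustrates the phrase extraction and query submission, assuming Biopython's Bio.Entrez module as one way to query PubMed; the stop-word list, e-mail address, and date handling are illustrative placeholders:</p>
      <preformat>
# Sketch: n-gram phrase extraction and PubMed submission (Bio.Entrez assumed).
from Bio import Entrez

Entrez.email = "you@example.org"       # placeholder; required by NCBI
STOPWORDS = {"the", "of", "and", "in", "a", "for", "to", "with"}  # toy list

def phrases(text, n_min=2, n_max=6):
    """All n-grams (n = 2..6) over the stop-word-filtered token list."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

def search(phrase, cutoff, retmax=2500):
    """Submit one phrase, restricted to the topic's date cut-off."""
    handle = Entrez.esearch(db="pubmed", term=f'"{phrase}"[Title/Abstract]',
                            retmax=retmax, datetype="pdat",
                            mindate="1800/01/01", maxdate=cutoff)
    return Entrez.read(handle)["IdList"]
      </preformat>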
      <p>
        For the query construction, we also experimented with TextRank [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an
algorithm for keyword extraction. After extracting the keywords from both the
title and the objective, we created the queries the same way as described above,
where the text of each topic was now its keywords. This process did not seem
to work well, as it decreased the total recall. We further experimented with
the maximum number of documents allowed per query, where we had to trade
off recall against the number of documents retrieved. The 2500 limit proved to be a
good fit, since retrieving more documents would not increase recall significantly,
but would require our models to rank many more documents.</p>
      <p>
        After retrieving the possibly relevant documents per topic, we use the
inter-topic and intra-topic models to rank them. The LTR features used for the
inter-topic model were computed using the title and abstract of each document and
the different fields of each topic protocol, as well as the topic's title and
objective. Table 1 shows the features employed by our model. For the inter-topic
model, we use an Easy Ensemble [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of 10 XGBoost classifiers, where each
classifier is trained on all the relevant documents and a randomly sampled subset
of the non-relevant documents, with 5 non-relevant documents sampled per
relevant one. A sketch of this ensemble follows.
      </p>
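      <p>A minimal sketch of the ensemble, assuming the xgboost scikit-learn wrapper; the hyperparameters are illustrative, and the feature matrix X and label vector y (numpy arrays) are assumed to come from the LTR feature extraction described above:</p>
      <preformat>
# Sketch: Easy Ensemble of 10 XGBoost members with 5:1 undersampling.
import numpy as np
from xgboost import XGBClassifier

def easy_ensemble(X, y, n_members=10, neg_per_pos=5, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        # All relevant documents plus a fresh random 5:1 non-relevant sample.
        sampled = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)),
                             replace=False)
        idx = np.concatenate([pos, sampled])
        clf = XGBClassifier(n_estimators=200, max_depth=4)  # illustrative
        members.append(clf.fit(X[idx], y[idx]))
    return members

def ensemble_score(members, X):
    # Average positive-class probability across the ensemble members.
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
      </preformat>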
      <p>After getting an initial ranking from the inter-topic model, we use the
intra-topic model to re-rank up to the first 20,000 documents and keep the first 5000,
as per the task's limit.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Sub-Task 2: Abstract and Title Screening</title>
        <p>For the second sub-task, we also employed last year's model, with a few
modifications to both the inter-topic and the intra-topic models.</p>
      <p>
        Inter-Topic Model. For the inter-topic model, we included some semantic
information through additional LTR features. Table 2 shows the features we
previously experimented with, along with the new semantic features. We
further improved our model by removing stop-words, and we fixed some minor
issues with the BM25 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] features.
      </p>
      <p>Features 1-24 are the same as in last year's submission. We distinguish between
two topic fields: the query, which is a list of Medical Subject Headings (MeSH)
terms extracted from the topic's Ovid MEDLINE query, and the title. MeSH terms
are semantic annotations added manually to PubMed documents. The notation
used for the LTR features is as follows:
1. t is a topic field
2. d is a document field
3. c(t_i, d) counts the number of times the term t_i appears in the document field d
4. c(m_i, d) counts the number of times the MeSH term m_i appears in the document field d
5. |C| is the total number of documents in the collection
6. df(t_i) is the number of documents that contain the term t_i
7. levenshtein(m_i, d_j) is the Levenshtein distance between the MeSH term m_i and the term d_j
A sketch of the simplest of these statistics follows.</p>
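      <p>To make the notation concrete, here is a small sketch of the raw counting statistics behind the count-based features; the collection and field texts are toy placeholders:</p>
      <preformat>
# Sketch: the counting statistics used by the LTR features.
from collections import Counter

collection = {                     # toy collection: doc id -> field text
    "d1": "ultrasound for suspected appendicitis in children",
    "d2": "mri and ct imaging in acute appendicitis",
}

def c(term, doc_field):            # c(t_i, d): occurrences of t_i in field d
    return Counter(doc_field.lower().split())[term.lower()]

def df(term):                      # df(t_i): number of documents containing t_i
    return sum(term.lower() in text.split() for text in collection.values())

C_size = len(collection)           # |C|: total number of documents
print(c("appendicitis", collection["d1"]), df("appendicitis"), C_size)  # 1 2 2
      </preformat>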
      <p>For features 25 and 26, we applied Latent Semantic Analysis (LSA) to the TF-IDF
vectors of the titles and the abstracts of each document, keeping 200 components.
Then, for each document in a topic, we computed the cosine similarities between the
corresponding LSA vectors (topic title vs. document title, and topic title vs. document abstract).</p>
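      <p>A sketch of these two features, assuming scikit-learn's TruncatedSVD over TF-IDF; the corpora are placeholders, and the component guard only matters for toy-sized input:</p>
      <preformat>
# Sketch: LSA (features 25 and 26) via TruncatedSVD on TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_features(titles, abstracts, topic_title, n_components=200):
    corpus = titles + abstracts + [topic_title]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    k = min(n_components, tfidf.shape[1] - 1)   # guard for small corpora
    lsa = TruncatedSVD(n_components=k).fit_transform(tfidf)
    topic_vec = lsa[-1:]                        # the topic title's LSA vector
    f25 = cosine_similarity(topic_vec, lsa[:len(titles)])[0]     # vs. titles
    f26 = cosine_similarity(topic_vec, lsa[len(titles):-1])[0]   # vs. abstracts
    return f25, f26
      </preformat>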
      <p>
        Features 27 and 28 use Word2Vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] vectors, obtained from the BioASQ
challenge (http://bioasq.org/). These vectors were trained on 10,876,004 abstracts from PubMed,
with a vocabulary of 1,701,632 words and a dimensionality of 200. For each piece
of text, we average all of its word vectors, which results in a single
vector representing the document. Then, we compute the cosine similarities between
a topic and a document using these vectors.
      </p>
      <p>
        Features 29 and 30 use the Word2Vec vectors again, this time to compute
the Word Mover's Distance [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] between pieces of text.
      </p>
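      <p>A combined sketch of features 27-30, assuming a gensim KeyedVectors model loaded from the BioASQ release; the file path is a placeholder, and wmdistance additionally requires gensim's optional Word Mover's Distance dependency:</p>
      <preformat>
# Sketch: averaged-Word2Vec cosine similarity (27/28) and WMD (29/30).
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("bioasq_pubmed_word2vec.bin", binary=True)

def avg_vector(text):
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

topic = "imaging tests for acute appendicitis"
title = "ultrasound for suspected appendicitis"
sim_27_28 = cosine(avg_vector(topic), avg_vector(title))
dist_29_30 = wv.wmdistance(topic.lower().split(), title.lower().split())
      </preformat>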
      <p>Feature 31 uses document vector representations, obtained by training a
Doc2Vec [<xref ref-type="bibr" rid="ref10">10</xref>] model on the documents collected from the training set. The
model was trained on each document's title and abstract. Vectors for
documents not in the model's training set were inferred.</p>
      <p>The new semantic features seemed to improve performance, but some of them
proved to be better than others. For the final runs, from the semantic features
we kept only 25, 26, 29 and 30, which use the Latent Semantic Analysis and the
Word Mover's Distance.</p>
      <p>[Table of LTR features: only the column skeleton survives extraction; the columns were Category, Topic field (Title or Query), and Document field (Title or Abstract).]</p>
      <p>Apart from adding new LTR features, we experimented with a variety of
other techniques. First, we tried expanding the title query with more words,
to obtain a bigger piece of text and thus compute more accurate similarities.
For each word in the title, we found its K most similar words using cosine
similarity on the Word2Vec embeddings and added them to the title (see the
sketch below). Even for small values of K (e.g. 2) this did not seem to improve
performance. We also tested providing the document vectors (query title, document)
from Doc2Vec directly to the inter-topic model, either concatenated or subtracted
one from the other, which still did not improve performance. Lastly, we experimented
with resampling techniques, specifically Easy Ensemble (undersampling) and SMOTE
(oversampling) [<xref ref-type="bibr" rid="ref11">11</xref>], which
did not improve performance either. On the contrary, Easy Ensemble works well
for the first sub-task, where the number of non-relevant documents is on average
an order of magnitude larger.</p>
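      <p>For completeness, a sketch of the title-expansion experiment described above, taking a KeyedVectors model such as the one loaded earlier; most_similar returns (word, score) pairs:</p>
      <preformat>
# Sketch: expand each title word with its K nearest Word2Vec neighbours.
def expand_title(title, wv, k=2):
    words = title.lower().split()
    expanded = list(words)
    for w in words:
        if w in wv:
            expanded += [word for word, _ in wv.most_similar(w, topn=k)]
    return " ".join(expanded)
      </preformat>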
      <p>Intra-Topic Model. For the intra-topic model, we relaxed the C parameter
of the SVM, which controls how "strict" the hyperplane is in avoiding
misclassification, to allow for a bigger margin. The intuition came from the
fact that, given the sheer class imbalance, finding a hyperplane with a bigger
margin will probably fit the data better than finding a strict one, which may lead
to overfitting. This relaxation seemed to improve the model's predictions in our
evaluations. A toy illustration follows.</p>
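      <p>A toy illustration of the relaxation, assuming scikit-learn's LinearSVC; the value C = 0.1 and the two-document corpus are illustrative, not the submitted setting:</p>
      <preformat>
# Sketch: a smaller C widens the margin and tolerates more training error,
# which we found preferable under heavy class imbalance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["relevant screening abstract", "non-relevant cardiology abstract"]
labels = [1, 0]

relaxed = make_pipeline(TfidfVectorizer(), LinearSVC(C=0.1)).fit(docs, labels)
print(relaxed.decision_function(["screening abstract for a review"]))
      </preformat>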
      <p>Additionally, we experimented with different SVM kernels, but they proved
much slower and less effective than the linear one. We also added n-grams (2, 3),
but they did not give better results either. Finally, we tried to use embeddings
for this task as well, using the average Word2Vec vectors or the document
vectors from Doc2Vec as input instead of the simple TF-IDF representations, to
no avail.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Both sub-tasks of CLEF eHealth Task 2 supported both thresholded and
non-thresholded runs. Our models, however, do not apply a threshold to the final
ranking automatically; instead, we submitted thresholded runs with fixed,
hand-picked thresholds.</p>
      <p>Multiple metrics were used for evaluation; they are described in detail
on the task's website (https://sites.google.com/view/clef-ehealth-2018/task-2-technologically-assisted-reviews-in-empirical-medicine).
The primary ones, as stated there, are Mean Average Precision and Recall,
on which we focus below. Note that in the official evaluation script
(https://github.com/CLEF-TAR/tar), which we used to produce the following results,
Mean Average Precision is computed on the whole ranking, without taking the
threshold into account.</p>
      <p>Table 3 shows our results for Sub-Task 1. The re-ranking parameters for the
intra-topic model of HybridSVM are:</p>
      <p>k = 10, step_init = 1, t_step = 200, step_secondary = 50, t_final = 1000.
The Threshold column refers to the hand-picked threshold mentioned above, and
the Train Relevance column indicates which relevance judgements (abstract or
content) were used for training. For evaluation, content relevance was used, as per
the competition's guidelines. We submitted runs 1, 2 and 3, since we found only
after the submission deadline that training with abstract relevance gave slightly
better results. This is, however, an interesting observation: since there are more
relevant documents at abstract level than at content level, the class imbalance is
slightly less severe when training with abstract relevance, thus producing
slightly better results.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this paper, we described our approaches for both sub-tasks of Task 2 of CLEF
eHealth 2018. We introduced new features and tweaked last year's models to
improve performance, with an emphasis on semantic features.</p>
      <p>As future work, we believe that further improvements can be made in both
sub-tasks. For Sub-Task 1, the query construction stage could benefit from filtering
out words that are not medically relevant, in order to reduce the number of
queries and consequently the number of retrieved documents. For the
ranking model (Sub-Tasks 1 and 2), more semantic features could benefit the
inter-topic model, while a better strategy for requesting feedback in the intra-topic
model could boost the metrics. Finally, it would be interesting to apply deep
learning techniques to the task, and to use word embeddings in a more
efficient way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          , Rene Spijker,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          .
          <article-title>CLEF 2018 Technology Assisted Reviews in Empirical Medicine Overview</article-title>
          .
          <source>In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Hanna</given-names>
            <surname>Suominen</surname>
          </string-name>
          , Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Leif Azzopardi, Rene Spijker,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Lionel Ramadier, Aude Robert, Guido Zuccon, and
          <string-name>
            <given-names>João</given-names>
            <surname>Palotti</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2018</article-title>
          .
          <source>In CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Anagnostou</surname>
          </string-name>
          , Athanasios Lagopoulos, Grigorios Tsoumakas, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Combining inter-review learning-to-rank and intra-review incremental training for title and abstract screening in systematic reviews</article-title>
          .
          <source>In CLEF 2017 Working Notes, CEUR Workshop Proceedings</source>
          , volume
          <volume>1866</volume>
          , Dublin, Ireland,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>XGBoost: A Scalable Tree Boosting System</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          , pages
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          , New York, New York, USA,
          <year>2016</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Tarau</surname>
          </string-name>
          .
          <article-title>TextRank: Bringing Order into Texts</article-title>
          .
          <source>In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004</source>
          , pages
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          , Barcelona, Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Xu-Ying</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianxin</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhi-Hua</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Exploratory Undersampling for Class Imbalance Learning</article-title>
          .
          <source>IEEE Transactions on Systems, Man and Cybernetics</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ):
          <fpage>539</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>A probabilistic model of information retrieval: development and comparative experiments. Part 2</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          :
          <fpage>809</fpage>
          -
          <lpage>840</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of the International Conference on Learning Representations (ICLR
          <year>2013</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Matt J.</given-names>
            <surname>Kusner</surname>
          </string-name>
          , Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger.
          <article-title>From Word Embeddings to Document Distances</article-title>
          .
          <source>Proceedings of The 32nd International Conference on Machine Learning</source>
          ,
          <volume>37</volume>
          :
          <fpage>957</fpage>
          -
          <lpage>966</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Machine Learning, ICML 2014</source>
          , volume
          <volume>32</volume>
          , pages
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          , Beijing, China,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Nitesh V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          , Kevin W. Bowyer, Lawrence O. Hall, and
          <string-name>
            <given-names>W. Philip</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          .
          <article-title>SMOTE: Synthetic Minority Over-sampling Technique</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>