
A Distributed Effort Approach for Systematic Reviews. IMS Unipd at CLEF 2019 eHealth Task 2.

Giorgio Maria Di Nunzio

Department of Information Engineering and Department of Mathematics, University of Padua, Italy
giorgiomaria.dinunzio@unipd.it

This is the third participation of the Information Management Systems (IMS) group in the CLEF eHealth task on Technologically Assisted Reviews in Empirical Medicine. This task focuses on the problem of medical systematic reviews, a problem which requires a recall close (if not equal) to 100%. Semi-automated approaches are essential to support this type of search when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present a variation of the system we presented last year; in particular, not only do we set the maximum number of documents that the physician is willing to read, but we also distribute the effort across the topics proportionally to the number of documents in the pool. We compare the results of this approach with the "frozen" system we used in 2018 and a BM25 baseline.

In this paper, we describe the participation of the Information Management Systems (IMS) group at the CLEF eHealth 2019 [2] Technology Assisted Review Task [1]. This task focuses on the problem of systematic reviews, that is, the process of collecting articles that summarise all the evidence (if possible) that has been published regarding a certain medical topic. This task requires long search sessions by experts in the field of medicine; for this reason, semi-automatic approaches are essential to support this type of search when the amount of data exceeds the limits of users, i.e. in terms of attention or patience.

The objective of our participation was to compare the system that we used in the previous year with a new strategy to distribute the effort of the user (the physician or an expert in the field of medicine) across the topics. In particular,
- we re-use the stopping strategy to simulate the maximum amount of documents that a physician is willing to review in the two-dimensional approach presented in [5];
- we distribute the effort, in terms of number of documents to read, proportionally to the size of the pool of documents for each topic;
- we estimate the 95% confidence interval of the proportion of relevant documents present in the collection [6].

The source code of the experiments is available for reproducibility purposes (see footnote 3).

2 Approach

In this paper, we continue to investigate the interaction with the two-dimensional interpretation of the BM25 model applied to the problem of explicit relevance feedback [3, 5, 6, 7, 8, 9].

In particular, the two-dimensional representation of probabilities [4, 9] is an intuitive way of presenting a two-class classification problem on a two-dimensional space. Given two classes, for example relevant R and non-relevant NR, a document d is assigned to category R if the following inequality holds:

$$\underbrace{P(d \mid NR)}_{y} < m \, \underbrace{P(d \mid R)}_{x} + q \qquad (1)$$

where P(d|R) and P(d|NR) are the likelihoods of the object d given the two categories, while m and q are two parameters that can be optimized to compensate for either unbalanced classes or different misclassification costs.
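As an illustration of Equation (1), the following minimal sketch checks on which side of the decision line a document falls. The function name and the numeric likelihoods are hypothetical and chosen only for the example; in practice the two likelihoods would come from the trained model, possibly on a logarithmic scale.

```python
def assign_to_relevant(p_d_given_r: float, p_d_given_nr: float,
                       m: float = 1.0, q: float = 0.0) -> bool:
    """Return True when document d is assigned to class R by Equation (1).

    x = P(d|R) and y = P(d|NR) are the coordinates of the document in the
    two-dimensional space; the document is assigned to R when the point
    (x, y) lies strictly below the line y = m * x + q.
    """
    x = p_d_given_r
    y = p_d_given_nr
    return y < m * x + q


# Hypothetical likelihoods: with the default line (m = 1, q = 0) the document
# is not assigned to R; a steeper line (m = 2) moves the boundary in favour of R.
default_line = assign_to_relevant(0.02, 0.03)
steeper_line = assign_to_relevant(0.02, 0.03, m=2.0)
```

Increasing m or q enlarges the region assigned to R, which is one way to favour recall when relevant documents are rare.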

We focused on the following problems:
1. study the effectiveness of a classifier given a fixed amount of documents that a physician is willing to review;
2. design a sampling strategy to estimate the 95% confidence interval of the number of relevant documents in the collection.
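To make the second problem concrete, the sketch below computes a 95% confidence interval for the proportion of relevant documents from a random sample and scales it to the size of the collection. The text does not state which interval estimator is used; the Wilson score interval is an assumption made here for illustration, and the function names and sample figures are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%).

    k is the number of relevant documents found among n randomly sampled ones.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def relevant_count_interval(k: int, n: int, collection_size: int,
                            z: float = 1.96) -> Tuple[float, float]:
    """Scale the proportion interval to an estimated number of relevant documents."""
    low, high = wilson_interval(k, n, z)
    return low * collection_size, high * collection_size


# Hypothetical sample: 5 relevant documents among 100 sampled from a pool of 2000.
low_count, high_count = relevant_count_interval(5, 100, 2000)
```

The interval is wide for small samples, which is why the number of sampled documents matters when deciding how many relevant documents may still be missing.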

In the experiments, we used the same procedure we used last year [6]:
- we set a number n of documents that the physician is willing to read and a number s that tells the algorithm when (every s documents) to randomly sample a document from the collection instead of presenting to the physician the next most relevant document;
- for each topic, we run an optimized (hyper-parameters) BM25 retrieval model and we obtain the relevance feedback for the first abstract in the ranking list;
- from the second document until document n/2 - 1, we continuously update the relevance weights of the terms according to the explicit relevance feedback given by the physician (simulated by the qrels available with the test collection);
- for the last half of the documents (n/2) that the physician is willing to read, we use a Naïve Bayes classifier continuously updated with the explicit relevance feedback [5].

3 https://github.com/gmdn/CLEF2019
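The procedure above can be sketched as a simple simulation loop. This is only an outline under stated assumptions: the `score` callback is a hypothetical stand-in for the two models (BM25 with relevance feedback in the first half, Naïve Bayes in the second half), the exact boundary between the two phases is approximated by n // 2, and the names are not taken from the released source code.

```python
import random
from typing import Callable, Dict, List, Tuple

Feedback = Dict[str, bool]  # doc_id -> label given by the (simulated) physician


def simulate_review(doc_ids: List[str], qrels: Dict[str, bool], n: int, s: int,
                    score: Callable[[str, Feedback, str], float],
                    seed: int = 0) -> List[Tuple[str, str, bool]]:
    """Simulate a review session of at most n documents.

    Every s-th document is drawn at random from the unjudged documents;
    all the others are the top-scored unjudged document under the current model.
    """
    rng = random.Random(seed)
    feedback: Feedback = {}
    log: List[Tuple[str, str, bool]] = []
    budget = min(n, len(doc_ids))
    for step in range(1, budget + 1):
        # First half: BM25 with feedback; second half: Naive Bayes (assumed split).
        phase = "bm25" if step <= n // 2 else "naive_bayes"
        unjudged = [d for d in doc_ids if d not in feedback]
        if s > 0 and step % s == 0:
            doc, how = rng.choice(unjudged), "sampled"
        else:
            doc = max(unjudged, key=lambda d: score(d, feedback, phase))
            how = "ranked"
        # Explicit relevance feedback, simulated with the qrels.
        feedback[doc] = qrels.get(doc, False)
        log.append((doc, how, feedback[doc]))
    return log


# Toy run: ten hypothetical documents and a static score preferring low ids.
toy_docs = [f"d{i}" for i in range(1, 11)]
toy_log = simulate_review(toy_docs, {"d1": True, "d4": True}, n=6, s=3,
                          score=lambda d, fb, phase: -int(d[1:]))
```

In a real run the `score` callback would re-estimate term weights or classifier parameters from `feedback` at each step; the toy score ignores the feedback and only shows the control flow.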

Instead of setting n equal for all topics, this year we tried a different approach that lets the user read more documents for topics with more documents in the pool. In Table 1, we show, for each topic, the number of documents in the pool, the proportion of documents of the pool compared to the total number of documents pooled, and the number of documents we will show to the user (to be multiplied by 2).
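A minimal sketch of this proportional allocation is given below. The rounding rule (integer floor) and the cap at the pool size are assumptions made for the example, and the topic names and pool sizes are invented; they are not the values of Table 1.

```python
from typing import Dict


def distribute_effort(pool_sizes: Dict[str, int], total_budget: int) -> Dict[str, int]:
    """Split a total reading budget across topics proportionally to pool size.

    Each topic receives floor(total_budget * pool_size / total_pool) documents,
    never more than the number of documents in its own pool.
    """
    total_pool = sum(pool_sizes.values())
    if total_pool == 0:
        return {topic: 0 for topic in pool_sizes}
    return {
        topic: min(size, total_budget * size // total_pool)
        for topic, size in pool_sizes.items()
    }


# Invented pool sizes for three hypothetical topics and a budget of 500 documents.
allocation = distribute_effort(
    {"topic_a": 1000, "topic_b": 3000, "topic_c": 6000}, total_budget=500
)
# allocation == {"topic_a": 50, "topic_b": 150, "topic_c": 300}
```

With this rule, topics with small pools receive very few documents, which is relevant when interpreting the results of the distributed effort runs.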

3 Experiments

For all the experiments, we set the values of the BM25 hyper-parameters in the same way as in [6].

3.1 Official Runs

We submitted runs for three different systems:
- a BM25 baseline with continuous active learning and a fixed threshold for each topic;
- the "frozen" system of 2018, with different proportions of documents to be read in the initial phase but with a fixed threshold for each topic;
- the new approach, with a different threshold for each topic.

In particular, for the frozen system, we used 10% or 50% of the initial pool of documents per topic to build the classifier. The new distributed effort approach uses 10% of the pool at the beginning of the training but, in general, it may stop earlier than the other approach if the effort allowed for a topic, in terms of number of documents, is low.

3.2 Unofficial Runs

In order to compare the BM25 model with the other systems at a similar proportion of documents shown to the user, we added some BM25 runs and removed others that showed a different number of documents.

3.3 Evaluation Measures

In order to evaluate the performance of the systems, we chose the number of documents shown to the user as one of the performance measures since, in our case, it is also the point where we stop retrieving documents. In addition, we use recall and recall averaged across topics to measure the accuracy of the retrieval.
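These measures can be computed as in the following minimal sketch. It assumes that, for each topic, we have the set of documents shown to the user and the set of relevant documents from the qrels; the function names and the toy data are hypothetical, and topics without relevant documents are skipped in the average by assumption.

```python
from typing import Dict, Set


def recall(shown: Set[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were shown to the user."""
    if not relevant:
        return 0.0
    return len(shown & relevant) / len(relevant)


def average_recall(shown_by_topic: Dict[str, Set[str]],
                   relevant_by_topic: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic recall over topics with at least one relevant document."""
    values = [
        recall(shown_by_topic.get(topic, set()), relevant)
        for topic, relevant in relevant_by_topic.items()
        if relevant
    ]
    return sum(values) / len(values) if values else 0.0


def documents_shown(shown_by_topic: Dict[str, Set[str]]) -> int:
    """Total number of documents shown to the user across all topics."""
    return sum(len(docs) for docs in shown_by_topic.values())


# Toy data for two hypothetical topics.
shown = {"t1": {"d1", "d3"}, "t2": {"d4", "d5"}}
relevant = {"t1": {"d1", "d2"}, "t2": {"d4"}}
toy_average = average_recall(shown, relevant)   # (0.5 + 1.0) / 2
toy_cost = documents_shown(shown)               # 4 documents in total
```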

3.4 Results

In Figures 1 and 2, we show a topic by topic comparison of groups of runs: BM25, distributed effort, and original 2018 with 10% or 50% of the initial pool selected. By increasing the threshold of the number of documents shown to the user, we are able to tune the performance of the system and reach an average recall close to 100% for all the systems under evaluation. Some topics are much more difficult than others; for example, topic CD011558 requires the retrieval of most of the pooled documents in order to achieve a reasonable recall (around 0.8).

In Figure 3, we show the performance of the four groups of runs in terms of average recall (across topics) given the number of documents shown to the user. As the number of documents increases (from left to right), the average recall of the four approaches increases and goes beyond 90%; some approaches, for example the two 2018 configurations of the frozen system, do so even with less than 4% of the total number of documents.

The distributed effort approach we proposed this year performed worse than expected. It seems that by reducing the number of documents allowed per topic too much, especially for topics with smaller pools, we obtain a suboptimal system compared to the original one. In other words, it may be more convenient to set a fixed cost per topic and use all the documents of the pool if necessary, instead of saving some resources for topics with more documents in the pool.

4 Conclusions

In this work, we presented a variation of the continuous active learning approach used in [6] that uses a fixed stopping strategy to simulate the maximum amount of documents that a physician is willing to review and a sampling strategy that is used to estimate the number of relevant documents in the collection. The results of the distributed effort approach were worse than expected compared to the original approach presented in 2018. The performance of the new system is still remarkable, since it achieves an average recall of 90% by using only 10% of the documents in the collection; however, the original system can achieve the same results with half the number of documents shown to the user.

We are currently analyzing the results provided by the organizers and adding to the official runs a set of unofficial runs that will complete the picture of all the possible settings. As future work, we will study a methodology to dynamically vary the amount of documents according to the estimate of the amount of relevant documents still missing.

Fig. 1: Results for BM25 and distributed effort runs. Panels: (a) topic by topic BM25 results (runs 2019_baseline_bm25_t200, 2019_baseline_bm25_t400, 2019_baseline_bm25_t600, 2019_baseline_bm25_t800, 2019_baseline_bm25_t1000, 2019_baseline_bm25_t2000); (b) topic by topic distributed effort results. Horizontal axis: topic.

Fig. 2: Results for original 2018 p10 and p50 runs. Panels: (a) topic by topic original 2018 p10 results; (b) topic by topic original 2018 p50. Legend entry recovered: run 2018_stem_original_t100. Horizontal axis: topic.

Fig. 3: Horizontal axis: documents shown (feedback).

References

1. Evangelos Kanoulas, Dan Li, Leif Azzopardi, and Rene Spijker, editors. CLEF 2019 Technology Assisted Reviews in Empirical Medicine Overview. CLEF 2019 Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings. CEUR-WS.org, 2019.
2. Liadh Kelly, Hanna Suominen, Lorraine Goeuriot, Mariana Neves, Evangelos Kanoulas, Dan Li, Leif Azzopardi, Rene Spijker, Guido Zuccon, Jimmy, and Joao Palotti, editors. Overview of the CLEF eHealth Evaluation Lab 2019. CLEF 2019 - 10th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer, September 2019.
3. Giorgio Maria Di Nunzio. A new decision to take for cost-sensitive naïve Bayes classifiers. Inf. Process. Manage., 50(5):653-674, 2014.
4. Giorgio Maria Di Nunzio. Interactive text categorisation: The geometry of likelihood spaces. Studies in Computational Intelligence, 668:13-34, 2017.
5. Giorgio Maria Di Nunzio. A study of an automatic stopping strategy for technologically assisted medical reviews. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings, pages 672-677, 2018.
6. Giorgio Maria Di Nunzio, Giacomo Ciuffreda, and Federica Vezzani. Interactive sampling for systematic reviews. IMS unipd at CLEF 2018 ehealth task 2. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.
7. Giorgio Maria Di Nunzio, Maria Maistro, and Federica Vezzani. A gamified approach to naïve Bayes classification: A case study for newswires and systematic medical reviews. In Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon, France, April 23-27, 2018, pages 1139-1146, 2018.
8. Giorgio Maria Di Nunzio, Maria Maistro, and Daniel Zilio. Gamification for machine learning: The classification game. In Proceedings of the Third International Workshop on Gamification for Information Retrieval co-located with 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), Pisa, Italy, July 21, 2016, pages 45-52, 2016.
9. Giorgio Maria Di Nunzio, Maria Maistro, and Daniel Zilio. The university of padua (IMS) at TREC 2016 total recall track. In Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016, Gaithersburg, Maryland, USA, November 15-18, 2016, 2016.