<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Interactive Sampling for Systematic Reviews. IMS Unipd at CLEF 2018 eHealth Task 2.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Giacomo Ciuffreda</string-name>
          <email>giacomo.ciuffreda@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Vezzani</string-name>
          <email>federica.vezzani@phd.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Linguistic and Literary Studies</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This is the second participation of the Information Management Systems (IMS) group in the CLEF eHealth Task of Technologically Assisted Reviews in Empirical Medicine. This task focuses on the problem of medical systematic reviews, a problem which requires a recall close (if not equal) to 100%. Semi-automated approaches are essential to support these types of searches when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present a variation of the two-dimensional approach which 1) sets the maximum amount of documents that the physician is willing to read, and 2) uses a sampling strategy to estimate the 95% confidence interval of the number of relevant documents present in the collection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
In this paper, we describe the participation of the Information Management
Systems (IMS) group at CLEF eHealth 2018 [10] Task 2 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This task focuses on
the problem of systematic reviews, that is, the process of collecting articles that
summarise all the evidence (if possible) that has been published regarding a certain
medical topic. This task requires long search sessions by experts in the field of
medicine; for this reason, semi-automatic approaches are essential to support
these types of searches when the amount of data exceeds the limits of users, i.e.
in terms of attention or patience.
      </p>
<p>The objective of our participation in this task was to: 1) study the effectiveness of a classifier given a fixed amount of documents that a physician is willing to review; 2) design a sampling strategy to estimate the 95% confidence interval of the number of relevant documents in the collection.</p>
      <p>
        tf
where wi is the weight of the i-th term, k1 and b are two parameters (some
default parameters are3 k1 = 1:2 and b = 0:75), tf is the term frequency in the
document, and wBIM is the Binary Independence Model weight of the i-th term:
i
where iR and iN R are the parameters of the Bernoulli random variable that
represent the presence (or absence) of the i-th term in the relevant (R) and
non-relevant (N R) documents. The estimate of each parameter is:
feedback [
        <xref ref-type="bibr" rid="ref2 ref4 ref6 ref7 ref8">8, 2, 7, 4, 6</xref>
        ]. In order to explain how the two-dimensional BM25 space
works, in the following sections we present a brief review of the BM25 model.
2.1
      </p>
      <p>
        BM25
The BM25 is a probabilistic retrieval model where the weight of a term in a
document is equal to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
(1)
(2)
(3)
(4)
(5)
iR =
      </p>
      <p>ri +
R +</p>
      <p>R +</p>
      <p>R
iN R =</p>
      <p>N
ni
R +
ri +</p>
      <p>R</p>
      <p>N R
N R +</p>
      <p>N R
where R is the number of relevant documents, ri the number of relevant
documents in which the i-th term appears, N is the total number of documents and
ni is the total number of documents in which the i-th term appears. Parameters
and correspond to the hyper-parameter of the conjugate beta prior
distribution of the Bernoulli random variable. For R = R = 0:5 and R =N R= 0:5,
we obtain the de nition of the well-known Robertson - Sparck Jones weight
wRSJ . Given a document d, the probability of the document being relevant is
i
proportional to:</p>
      <p>P (Rjd) /</p>
      <p>X wBM25(tf )</p>
      <p>i
i2d
2.2</p>
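<p>As a minimal illustration of Equations (1)-(5), the weighting can be sketched as follows; function and variable names are our own, and the document-length normalization follows the standard BM25 formulation:</p>

```python
import math

def bim_weight(r_i, n_i, R, N, alpha_R=0.5, beta_R=0.5, alpha_NR=0.5, beta_NR=0.5):
    """Binary Independence Model weight, Eqs. (2)-(4), with Beta priors."""
    theta_R = (r_i + alpha_R) / (R + alpha_R + beta_R)                # Eq. (3)
    theta_NR = (n_i - r_i + alpha_NR) / (N - R + alpha_NR + beta_NR)  # Eq. (4)
    # Eq. (2): log-odds of term presence in relevant vs non-relevant documents
    return math.log(theta_R * (1 - theta_NR) / (theta_NR * (1 - theta_R)))

def bm25_weight(tf, dl, avgdl, w_bim, k1=1.2, b=0.75):
    """Eq. (1): term-frequency saturation multiplied by the BIM weight."""
    return tf / (k1 * ((1 - b) + b * dl / avgdl) + tf) * w_bim

def relevance_score(doc_tf, dl, avgdl, term_stats, R, N):
    """Eq. (5): P(R|d) is proportional to the sum of BM25 weights over d's terms.

    term_stats maps a term to the feedback counts (r_i, n_i).
    """
    return sum(
        bm25_weight(tf, dl, avgdl, bim_weight(*term_stats[t], R, N))
        for t, tf in doc_tf.items()
    )
```

<p>With α = β = 0.5 on both classes, bim_weight reduces to the Robertson-Sparck Jones weight mentioned above.</p>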
      <sec id="sec-1-1">
        <title>Two-Dimensional Model</title>
<p>
          The two-dimensional representation of probabilities [
          <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
          ] is an intuitive way of
presenting a two-class classification problem on a two-dimensional space. Given
two classes, for example relevant R and non-relevant NR, a document d is
assigned to category R if the following inequality holds:
        </p>
<p>P(d|NR) &lt; m P(d|R) + q (6)</p>
        <p>where P(d|R) and P(d|NR) are the likelihoods of the object d given the two
categories (labelled x and y, respectively), while m and q are two parameters that can be optimized to compensate
for either the unbalanced class issues or different misclassification costs.</p>
<p>If we interpret the two likelihoods as two coordinates x and y of a
two-dimensional space, the problem of classification can be studied on a two-dimensional
plot. The decision of the classification is represented by the line y = mx + q
that splits the plane into two parts: all the points that fall 'below' this line are
classified as objects that belong to class R.</p>
<p>Two-dimensional BM25. In order to link the two-dimensional model to the
BM25 model, first we define the BIM weight as a difference of logarithms:</p>
        <p>w_i^BIM = log [ θ_iR (1 - θ_iNR) ] - log [ θ_iNR (1 - θ_iR) ]</p>
        <p>We now have all the elements to define the two coordinates x = P(d|R) and
y = P(d|NR) in the following way:</p>
        <p>x = Σ_{i∈d} tf / (k1 ((1 - b) + b dl/avgdl) + tf) * log [ θ_iR (1 - θ_iNR) ]</p>
        <p>y = Σ_{i∈d} tf / (k1 ((1 - b) + b dl/avgdl) + tf) * log [ θ_iNR (1 - θ_iR) ]</p>
        <p>where Σ_{i∈d} indicates (with an abuse of notation) the sum over all the terms of
document d.</p>
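<p>A sketch of how each document could be mapped to the (x, y) coordinates and classified with the decision line of Equation (6); the names are ours, and for brevity a single pair of prior parameters (α, β) is shared by the two classes:</p>

```python
import math

def coordinates(doc_tf, dl, avgdl, term_stats, R, N, k1=1.2, b=0.75,
                alpha=0.5, beta=0.5):
    """Map a document to (x, y) = (P(d|R), P(d|NR)): each coordinate is the
    sum of saturated log-likelihood contributions over the document's terms."""
    x = y = 0.0
    for t, tf in doc_tf.items():
        r_i, n_i = term_stats[t]  # feedback counts for term t
        theta_R = (r_i + alpha) / (R + alpha + beta)
        theta_NR = (n_i - r_i + alpha) / (N - R + alpha + beta)
        sat = tf / (k1 * ((1 - b) + b * dl / avgdl) + tf)  # BM25 saturation
        x += sat * math.log(theta_R * (1 - theta_NR))
        y += sat * math.log(theta_NR * (1 - theta_R))
    return x, y

def is_relevant(x, y, m=1.0, q=0.0):
    """Decision line of Eq. (6): points 'below' y = m*x + q are classified R."""
    return y < m * x + q
```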
<p>In Figure 1, we show an example of the visualization of a collection of
documents using the two-dimensional BM25 model. Relevant and non-relevant
documents which have already been judged by a user (in our case the physician) are
colored in green and red, respectively; documents that have not been judged are greyed out. The
two lines represent two possible decision lines (see Equation 6) to rank/classify
new documents as relevant.</p>
        <p>[Figure 1: two-dimensional BM25 plot of a collection; x axis from -200 to 0; legend: rel, not rel, not judged.]</p>
<p>
          In the experiments, we used the following procedure:
- we set a number n of documents that the physician is willing to read, and
a number s that tells the algorithm when (every s documents) to randomly
sample a document from the collection instead of presenting to the physician
the next most relevant document;
- for each topic, we run an optimized (hyper-parameters) BM25 retrieval
model and we obtain the relevance feedback for the first abstract in the
ranking list;
- from the second document until n/2 - 1, we continuously update the relevance
weights of the terms according to the explicit relevance feedback given by
the physician (simulated by the qrels available with the test collection);
- for the last half n/2 of the documents that the physician is willing to read, we
use a Naive Bayes classifier continuously updated with the explicit relevance
feedback [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
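<p>The interactive procedure above can be sketched as the following loop (a simplified simulation; the ranking, the feedback updates, and the qrels interface are placeholders of our own):</p>

```python
import random

def review_session(ranked_docs, qrels, n, s=10):
    """Simulate a physician reviewing at most n documents.

    Every s-th document is drawn at random from the unjudged pool (this feeds
    the confidence-interval estimate); otherwise the next best-ranked document
    is shown. Judged documents provide the explicit relevance feedback.
    """
    judged, sampled, shown = {}, [], 0
    pool = list(ranked_docs)  # ordered by the current relevance score
    while pool and shown < n:
        if shown % s == s - 1:             # every s documents: random sample
            doc = random.choice(pool)
            sampled.append(doc)
        else:                              # otherwise: next ranked document
            doc = pool[0]
        pool.remove(doc)
        judged[doc] = qrels.get(doc, 0)    # feedback simulated by the qrels
        shown += 1
        # first half of the session: update the BM25 relevance weights;
        # second half: update a Naive Bayes classifier (not shown here)
    return judged, sampled
```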
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
<p>
        For all the experiments, we set the values of the BM25 hyper-parameters in the
following way:
- β_R = β_NR = 1.0
- α_R = α_NR = 0.01
These values are consistent with other experiments and indicate that a Beta prior
distribution that discounts the 'presence' of a term in favour of its 'absence' (high
β and low α) results in a better retrieval performance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The slope m of the
decision line is set to m = 1.0 and q = 0 for the first half n/2 of the documents; then,
m and q are continuously updated according to the relevance information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-2-1">
<title>Official runs</title>
<p>We submitted three runs by varying the number of documents n that the
physician is willing to read per topic: n = 1000, n = 2000, n = 3000. We set the
parameter s = 10; this means that every ten documents we sample a random
document from the collection instead of showing the physician the next ranked
document. The three official runs are named as follows:
- ims unipd t500.task2, n = 1000
- ims unipd t1000.task2, n = 2000
- ims unipd t1500.task2, n = 3000</p>
<p>In Figure 2, we show the recall per topic for each official run. We see that there
are two topics in particular that are more difficult than the others: CD009263 and
CD012010, with a recall less than (or close to) 0.6 for all the runs. Seven topics can
be considered of medium difficulty (recall between 0.6 and 0.6 for at least one
of the experiments): CD008567, CD010213, CD010502, CD012165, CD012179,
CD012281, CD012599.</p>
<p>In Figure 3, we compare the results of our three runs with the summary of
all the other CLEF 2018 participants. This plot confirms that most of the high and
medium difficulty topics are also topics that, on average, were difficult for most
of the participants (barplots more stretched and median far from the value 1.00).</p>
        <p>Confidence intervals of the number of relevant documents. During the
experiments, every 10 documents we sample a random document from the collection
and show the document for relevance assessment in order to estimate the number
of relevant documents in the collection. In Tables 1, 2, and 3, we show a breakdown
of the number of documents per topic, how many documents were read (explicit
relevance feedback), the number of relevant documents, how many documents were
randomly sampled, the estimate of the number of relevant documents based on
the random sample as well as the 95% confidence interval (minimum and
maximum range), and the number of relevant documents found within the limit of the
threshold. In most cases, the estimate of the number of relevant documents (and
the 95% range) is much larger than the true number of relevant documents. The
analysis of the results shown in these tables is still under study, since we would
need a more sophisticated cost-benefit model to understand whether we want to
put more effort into the estimate of the number of relevant documents or into the
automatic classifier.</p>
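<p>The text does not spell out the exact estimator; a sketch of how the estimate and its 95% interval could be obtained from the random sample, assuming a simple normal-approximation binomial interval scaled to the collection size:</p>

```python
import math

def estimate_relevant(n_collection, sample_judgments):
    """Estimate the number of relevant documents in a collection of
    n_collection documents from a simple random sample of judgments
    (1 = relevant, 0 = not relevant), with a normal-approximation
    95% confidence interval scaled to the collection size."""
    k = len(sample_judgments)
    p = sum(sample_judgments) / k              # sample proportion of relevant
    half = 1.96 * math.sqrt(p * (1 - p) / k)   # 95% margin of error
    est = round(n_collection * p)
    lo = max(0, round(n_collection * (p - half)))
    hi = min(n_collection, round(n_collection * (p + half)))
    return est, (lo, hi)
```

<p>With small samples per topic the interval is wide, which is consistent with the observation above that the estimate often exceeds the true number of relevant documents.</p>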
      </sec>
      <sec id="sec-2-2">
<title>Unofficial runs</title>
<p>
          In addition to the three official runs, we prepared two unofficial runs in order to
study the feasibility of the query rewriting approach based on the work of [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We
asked two experts in linguistics to rewrite the query, each with a different goal:
the first variant is written with the aim of creating a list of keywords resulting
from the semic analysis (the study of meaning in linguistic units) of the technical
terms contained in the initial query. The second variant is written with the aim
of reformulating the information need into a humanly readable sentence using
alternative terms such as synonyms, orthographic variants, related forms and/or
acronyms. The two experts worked independently from each other by
following a structured linguistic methodology and focusing on different terminological
aspects. We name these two experiments "keyword" and "readable".
        </p>
        <p>
          Linguistic Methodology: Terminological Record. The methodology
applied for the process of query rewriting is based on a linguistic and terminological
analysis of all the technical terms contained in the information needs provided
in the dataset. The approach is divided into the following steps:
1. Recognition of technical terms;
2. Extraction of technical terms;
3. Linguistic and semantic analysis;
4. Formulation of terminological records;
5. Query rewriting.
        </p>
<p>The core of our methodology is basically a new model of terminological record
used for the analysis of medical terminology [11]. This tool is a structured set
of terminological data referring to a specific concept, and it is used in order to
provide linguistic information about the concept itself and the term used for its
designation, both for its lexical and semantic framing. This terminological record
is composed of four general fields, which individually refer to formal features,
semantics, corpus and references. Each field is in turn divided into specific subfields
describing the term according to linguistic and notional criteria.</p>
<p>Focusing on the first two subfields, the section named "formal features"
contains lexical and morphosyntactic information such as genre, tonic accent, spelling,
etymology (derivation and composition), orthographic variants, acronyms/expansions
and related forms. From the semantic viewpoint, the subfield "semantics"
contains the definition of the term, its semic analysis, cases of phraseology
(collocations and colligations) and all the possible semantic variants.</p>
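<p>The four-field record described above could be modelled as a small data structure; this is an illustrative sketch of ours, whose field names follow the description in the text rather than an actual schema:</p>

```python
from dataclasses import dataclass, field

@dataclass
class TerminologicalRecord:
    """A terminological record: four general fields, each holding subfields."""
    term: str
    formal_features: dict = field(default_factory=dict)  # genre, spelling, variants...
    semantics: dict = field(default_factory=dict)        # definition, semic analysis...
    corpus: list = field(default_factory=list)           # attested usage examples
    references: list = field(default_factory=list)       # sources

# Example record for the term "cirrhosis" (semes taken from the text above)
rec = TerminologicalRecord(
    term="cirrhosis",
    semantics={"semic_analysis": ["chronic disease", "liver", "degeneration",
                                  "cells", "inflammation", "fibrous", "tissue"]},
)

def keyword_reformulation(records):
    """First variant: concatenate the semes of every analysed term."""
    return [seme for r in records for seme in r.semantics.get("semic_analysis", [])]
```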
<p>For example, for topic CD011602, the information need provided is:
Ultrasonography for diagnosis of alcoholic cirrhosis in people with
alcoholic liver disease.</p>
<p>We initially proceeded with the extraction of technical terms (both single-word
and multi-word terms) such as ultrasonography, diagnosis, alcoholic cirrhosis,
cirrhosis, alcoholic liver disease, liver, disease, and then we started to formulate
terminological records for each of them. The subfield named "formal features"
was useful for the human readable reformulation, whereas the "semantics" subfield
provided the information necessary for the keywords reformulation.</p>
        <p>First variant: keywords reformulation. In particular, semic analysis turns
out to be the most useful process for the keyword reformulation, and it aims to
decompose the meaning of the term analyzed. This process consists of breaking
down the sememe (i.e. the meaning) of a word into all its sense components, i.e.
the semes. So, for example, for the term cirrhosis the process of decomposition
of meaning produced the following list of keywords: /chronic disease/ /liver/
/degeneration/ /cells/ /human body/ /inflammation/ /fibrous/ /thickening/
/tissue/ /alcoholism/ /hepatitis/.</p>
<p>We repeat this kind of analysis for each technical term in the information need
and, considering the above mentioned example for topic CD011602, the keyword
reformulation is the following:
/technique/ /echoes/ /ultrasound pulses/ /ultrasound/ /pulse/
/delineate/ /areas/ /different density/ /body/ /human being/ /cells/
/examination/ /evaluation/ /diagnostic/ /diagnosing/ /diagnose/ /alcohol/
/chronic/ /disease/ /cirrhosis of the liver/ /liver/ /degeneration/ /cells/
/inflammation/ /fibrous/ /thickening/ /tissue/ /alcoholism/
/hepatitis/ /patient/ /large lobed glandular organ/ /abdomen/ /vertebrates/
/metabolic processes/ /disorder/ /structure/ /function/ /symptoms/
/affect/ /location/ /physical injury/.</p>
      </sec>
      <sec id="sec-2-3">
<title>Second variant: human readable reformulation</title>
        <p>The second type of query was written with the aim of reformulating the information need in a humanly
readable sentence. Thanks to the terminological records, we have been able to replace
original terms with validly attested synonyms, use orthographic alternatives
as variants of the medical terms provided in the original information need, as
well as to systematically replace acronyms with their expansions and expansions
with their acronyms. Considering the previous topic CD011602, we obtained the
following readable reformulation:</p>
<p>Diagnostic accuracy of medical ultrasound, known as diagnostic
sonography or ultrasonography, for the detection of alcoholic liver disease (ALD)
as the liver manifestations of alcohol overconsumption, including fatty
liver, alcoholic hepatitis, and chronic hepatitis with liver fibrosis or
cirrhosis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
<p>We are currently evaluating the results of these reformulations topic by topic,
comparing the number of relevant documents retrieved by each
reformulation in the top 10 retrieved documents (Table 5).</p>
<p>In this phase of the analysis, we noted that there are some topics for which
the two reformulations ("keywords" and/or "readable") retrieved, in the first
10 positions, more relevant documents than the original query. Table 6 shows
these topics and the number of documents retrieved depending on the type
of reformulation. We then proceeded with the manual analysis of such topics by
reading the abstracts of the relevant documents retrieved by the two variants,
and we started to analyse from a linguistic viewpoint which terms contained in
the two reformulations allowed the retrieval of such relevant documents.</p>
<p>As a first and approximate analysis, we noted that the terms that were most
frequently used in the two reformulations are those related to the diagnostic and
evaluative sphere, such as diagnosis and related forms such as diagnostic, diagnose and
diagnosing, as well as evaluation, examination, test and detection. Furthermore,
even the replacement of full multi-word terms with their acronyms, such as
DMSA for Dimercaptosuccinic Acid Scan, VUR for Vesicoureteral Reflux and
UTI for Urinary Tract Infection, has turned out to be a good approach, because
reduced lexical forms are one of the typical features of medical language and
abbreviations are used in order to rapidly transmit health information.</p>
    </sec>
    <sec id="sec-4">
      <title>Ongoing and Future Work</title>
<p>In this work, we presented a continuous active learning approach that uses a
fixed stopping strategy to simulate the maximum amount of documents that a
physician is willing to review, and a sampling strategy that is used to estimate
the number of relevant documents in the collection. We are currently performing
a failure analysis to understand the possible reasons for a recall below 90% and to
identify the linguistic aspects of a query rewriting approach that may help to
improve the performance of an interactive system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          , Rene Spijker,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          , and Leif Azzopardi, editors.
          <article-title>CLEF 2018 Technology Assisted Reviews in Empirical Medicine Overview</article-title>
          .
          <article-title>CLEF 2018 Evaluation Labs</article-title>
          and Workshop: Online Working Notes,
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
<article-title>A new decision to take for cost-sensitive naive bayes classifiers</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>50</volume>
          (
          <issue>5</issue>
          ):
          <fpage>653</fpage>
          -
          <lpage>674</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>Interactive text categorisation: The geometry of likelihood spaces</article-title>
          .
          <source>Studies in Computational Intelligence</source>
          ,
<volume>668</volume>
          :
          <fpage>13</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>A study of an automatic stopping strategy for technologically assisted medical reviews</article-title>
          .
          <source>In Advances in Information Retrieval - 40th European Conference on IR Research</source>
          , ECIR
          <year>2018</year>
          , Grenoble, France, March 26-29,
          <year>2018</year>
, Proceedings, pages
          <fpage>672</fpage>
          -
          <lpage>677</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Federica Beghini, Federica Vezzani, and
          <string-name>
            <given-names>Genevieve</given-names>
            <surname>Henrot</surname>
          </string-name>
          .
          <article-title>An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS unipd at CLEF ehealth task 2</article-title>
          .
          <source>In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
, Dublin, Ireland, September 11-14,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and
          <string-name>
            <given-names>Federica</given-names>
            <surname>Vezzani</surname>
          </string-name>
          .
<article-title>A gamified approach to naive bayes classification: A case study for newswires and systematic medical reviews</article-title>
          .
          <source>In Companion of the The Web Conference 2018 on The Web Conference</source>
          <year>2018</year>
          ,
          <article-title>WWW 2018</article-title>
          , Lyon , France,
          <source>April 23-27</source>
          ,
          <year>2018</year>
, pages
          <fpage>1139</fpage>
          -
          <lpage>1146</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and Daniel Zilio.
<article-title>Gamification for machine learning: The classification game</article-title>
          .
          <source>In Proceedings of the Third International Workshop on Gamification for Information Retrieval co-located with 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016)</source>
          , Pisa, Italy, July 21,
          <year>2016</year>
          , pages
          <fpage>45</fpage>
          -
          <lpage>52</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Zilio</surname>
          </string-name>
          .
          <article-title>The university of padua (IMS) at TREC 2016 total recall track</article-title>
          .
          <source>In Proceedings of The Twenty-Fifth Text REtrieval Conference</source>
, TREC 2016, Gaithersburg, Maryland, USA, November 15-18,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
<string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
<volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Hanna Suominen, Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Leif Azzopardi, Rene Spijker, Dan Li, Aurelie Neveol, Lionel Ramadier, Aude Robert, Joao Palotti, Jimmy, and Guido Zuccon, editors. <article-title>Overview of the CLEF eHealth Evaluation Lab 2018</article-title>. <source>CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>. Springer, September <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Federica Vezzani, Giorgio Maria Di Nunzio, and Genevieve Henrot. <article-title>TriMED: A multilingual terminological database</article-title>. <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018</source>, Miyazaki, Japan, May 7-12, <year>2018</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>