<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>An Interactive Two-Dimensional Approach to Query Aspects Rewriting in Systematic Reviews: IMS Unipd at CLEF eHealth Task 2</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Beghini</string-name>
          <email>fede.beghini92@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Vezzani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Genevieve Henrot</string-name>
          <email>genevieve.henrot@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Linguistic and Literary Studies</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 2. This task focuses on the problem of systematic reviews, that is, articles that summarise all the evidence published on a certain medical topic. This task, known in Information Retrieval as the total recall problem, requires long and tedious search sessions by experts in the field of medicine. Automatic (or semi-automatic) approaches are essential to support these types of searches when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present the two-dimensional probabilistic version of BM25 with explicit relevance feedback, together with a query aspect rewriting approach, for both the simple evaluation and the cost-effective evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        In this paper, we describe the participation of the Information Management
Systems (IMS) group at CLEF eHealth 2017 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] Task 2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This task focuses on the
problem of systematic reviews, that is, articles that summarise all the evidence
published on a certain medical topic. This task, known in Information
Retrieval as the total recall problem, requires long and tedious search sessions
by experts in the field of medicine. Automatic (or semi-automatic) approaches
are essential to support these types of searches when the amount of data exceeds
the limits of users, i.e. in terms of attention or patience. In particular, the aim
is to make the screening of research paper abstracts and titles more effective given the
results of a boolean search submitted to a medical database.
      </p>
      <p>The CLEF eHealth Task 2 has two types of evaluation procedures to assess
the quality of a system that supports systematic reviews. These procedures are
based on the following assumptions:
- Simple evaluation: the user of the system is the researcher (end-user) who
performs the abstract and title screening of the retrieved articles. Every time
the system returns an abstract to the end-user, there is an incurred cost.
- Cost-effective evaluation: the user who performs the screening is not the
end-user. The user can interchangeably perform abstract and title screening,
or document screening, and decide which documents to pass to the end-user.
Every time the system provides an abstract to the user, she/he can i) either
read the abstract (with an incurred cost, as in the simple evaluation) and
decide whether to pass this document to the end-user, ii) or read the full
document (with a higher cost) and decide whether to pass this document
to the end-user, iii) or directly pass the document to the end-user. For each
document passed to the end-user, there are additional costs that need to be
added.</p>
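      <p>As a sketch of the cost accounting implied above (the function name and the unit costs are hypothetical; the actual values are set by the task organisers):</p>

```python
def total_cost(n_abstracts, n_fulltexts, n_passed,
               c_abstract=1.0, c_fulltext=2.0, c_passed=1.0):
    # Each abstract read, each full text read, and each document passed
    # to the end-user adds its own unit cost (illustrative values only).
    return (n_abstracts * c_abstract
            + n_fulltexts * c_fulltext
            + n_passed * c_passed)
```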
      <p>
        The objectives of our participation in this task were to:
- find the best parameters (in terms of classification/ranking accuracy) of the
BM25 model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ];
- explore the problem of query aspects and query (re-)formulation given an
information need [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
        ];
- integrate the query aspects into the two-dimensional probabilistic model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ];
- study an automatic feedback loop to find the optimal stopping strategy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>In this paper, we continue to investigate the interaction with the two-dimensional
interpretation of the BM25 model applied to the problem of explicit relevance
feedback, with three goals in mind:
- we want to create a set of relevance judgements with the least effort by
human assessors,
- we use interactive visualizations to interpret the intermediate results of the
relevance feedback,
- we use explicit query rewriting by non-experts to create different aspects of
the information need.</p>
      <p>
        Following the work started in [
        <xref ref-type="bibr" rid="ref3 ref4 ref6 ref7 ref8">6, 4, 8, 3, 7</xref>
        ], we continue to study the two-dimensional
interpretation of the BM25 model defined in the following section.
      </p>
      <p>
        2.1 BM25
      </p>
      <p>
        The BM25 is a probabilistic retrieval model where, if we use the definition given
by Zaragoza and Robertson in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the weight of the i-th term in a document is
equal to:

w_i^BM25 = (tf_i / (k1 ((1 − b) + b · dl/avdl) + tf_i)) · w_i^BIM   (1)

where tf_i is the frequency of the i-th term in the document, dl and avdl are the
document length and the average document length, k1 and b are free parameters,
and w_i^BIM is the Binary Independence Model weight of the i-th term:

w_i^BIM = log [ θ_iR (1 − θ_iNR) / (θ_iNR (1 − θ_iR)) ]   (2)

where θ_iR and θ_iNR are the parameters of the Bernoulli random variables that
represent the presence (or absence) of the i-th term in the relevant (R) and
non-relevant (NR) documents. The estimate of each parameter is:

θ_iR = (r_i + α_R) / (R + α_R + β_R)   (3)

θ_iNR = (n_i − r_i + α_NR) / (N − R + α_NR + β_NR)   (4)

where R is the number of relevant documents, r_i the number of relevant
documents in which the i-th term appears, N is the total number of documents, and
n_i is the total number of documents in which the i-th term appears. The parameters
α and β correspond to the hyper-parameters of the conjugate Beta prior
distribution of the Bernoulli random variable. For α_R = β_R = 0.5 and α_NR = β_NR = 0.5,
we obtain the definition of the well-known Robertson - Sparck Jones weight
w_i^RSJ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>3 http://terrier.org</p>
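      <p>As a minimal sketch (hypothetical function names, not the authors' code), the Beta-smoothed estimates and the BIM weight defined above can be computed as:</p>

```python
import math

def theta_R(r_i, R, alpha_R, beta_R):
    # Beta-smoothed estimate of the probability that term i occurs
    # in a relevant document: (r_i + alpha_R) / (R + alpha_R + beta_R).
    return (r_i + alpha_R) / (R + alpha_R + beta_R)

def theta_NR(n_i, r_i, N, R, alpha_NR, beta_NR):
    # Estimate for non-relevant documents: term occurrences outside the
    # relevant set are n_i - r_i, out of N - R documents.
    return (n_i - r_i + alpha_NR) / (N - R + alpha_NR + beta_NR)

def w_bim(t_R, t_NR):
    # Binary Independence Model weight of the i-th term.
    return math.log((t_R * (1 - t_NR)) / (t_NR * (1 - t_R)))

# With alpha = beta = 0.5 on both classes we recover the
# Robertson - Sparck Jones weight.
t_R = theta_R(r_i=10, R=20, alpha_R=0.5, beta_R=0.5)
t_NR = theta_NR(n_i=50, r_i=10, N=1000, R=20, alpha_NR=0.5, beta_NR=0.5)
print(round(w_bim(t_R, t_NR), 3))
```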
      <sec id="sec-2-1">
        <title>Two-Dimensional Model</title>
        <p>
          The two-dimensional representation of probabilities [
          <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
          ] is an intuitive way of
presenting a two-class classification problem on a two-dimensional space. Given
two classes, for example relevant R and non-relevant NR, a document d is
assigned to category R if the following inequality holds:
        </p>
        <p>P(d|NR) &lt; m · P(d|R) + q   (6)
where the left-hand side plays the role of the coordinate y and the term P(d|R) the role of
the coordinate x; P(d|R) and P(d|NR) are the likelihoods of the object d given the two
categories, while m and q are two parameters that can be assigned (automatically
or by a user) to compensate for either unbalanced class issues or different
misclassification costs.</p>
        <p>If we interpret the two likelihoods as two coordinates x and y of a
two-dimensional space, the problem of classification can be studied on a two-dimensional
plot. The decision of the classification is represented by the line y = mx + q
that splits the plane into two parts: all the points that fall `below' this line are
classified as objects that belong to class R.</p>
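        <p>In code, the decision rule is a one-line comparison (a sketch with made-up coordinates, not the authors' implementation):</p>

```python
def assign_to_R(x, y, m=1.0, q=0.0):
    # x = P(d|R), y = P(d|NR): a document is assigned to class R when
    # its point lies below the line y = m*x + q.
    return y < m * x + q

print(assign_to_R(x=3.0, y=2.0))  # point below the line y = x
print(assign_to_R(x=3.0, y=5.0))  # point above it
```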
        <p>Two-dimensional BM25 In order to link the two-dimensional model to the
BM25 model, first we define the BIM weight as a difference of logarithms:

w_i^BIM = log( θ_iR / (1 − θ_iR) ) − log( θ_iNR / (1 − θ_iNR) )   (7)

We now have all the elements to define the two coordinates x = P(d|R) and
y = P(d|NR) in the following way:

P(d|R) = Σ_{i ∈ d} w_i^{BM25,R}(tf)   (8)

P(d|NR) = Σ_{i ∈ d} w_i^{BM25,NR}(tf)   (9)

where Σ_{i ∈ d} indicates (with an abuse of notation) the sum over all the terms of
document d.</p>
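        <p>The two coordinates defined above are then just per-class sums of term weights over the document. A sketch under stated assumptions: the weight function below is a hypothetical stand-in for the per-class BM25 term weight, with a saturating tf component times the log-odds of the class-conditional term probability.</p>

```python
import math

def bm25_term_weight(tf, theta, k1=1.2):
    # Saturating tf component times log-odds of the term probability
    # (an assumed stand-in for w_i^{BM25} restricted to one class).
    return (tf * (k1 + 1) / (k1 + tf)) * math.log(theta / (1 - theta))

def coordinates(doc_tf, theta_R, theta_NR):
    # doc_tf: {term: tf}; theta_R / theta_NR: {term: class estimate}.
    # Returns (x, y) = (P(d|R), P(d|NR)) as sums over the doc's terms.
    x = sum(bm25_term_weight(tf, theta_R[t]) for t, tf in doc_tf.items())
    y = sum(bm25_term_weight(tf, theta_NR[t]) for t, tf in doc_tf.items())
    return x, y
```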
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>Given the definition of the two-dimensional BM25 model, we focused on the following
problems:
1. find the best combination of hyper-parameters α_R, α_NR, β_R, β_NR,
2. devise a strategy to create different query aspects of the same information
need given a minimum amount of relevance feedback,
3. explore different options of explicit relevance feedback for both the simple
and the cost-effective evaluation subtasks.</p>
      <p>For step 1), we used the training data available for this task to find the best
combination of parameters through a brute-force approach.</p>
      <p>For step 2), we decided to use the following procedure:
- for each topic, we run a plain BM25 retrieval model and get the relevance
feedback for the first abstract in the ranking list,
- we get the explicit relevance feedback on that abstract and ask two
different people (non-experts in the field of medicine) to review the abstract
and rewrite an alternative query, also according to the value of the feedback
(relevant or not).
For example, for topic CD008803 the original information need is expressed by
the following statement:</p>
      <p>"Optic nerve head and fibre layer imaging for diagnosing glaucoma"
We run BM25 and the top retrieved abstract is document 19028735, the
content of which is:
title: Imaging of the retinal nerve fibre layer for glaucoma.
abstract: Glaucoma is a group of diseases characterised by retinal
ganglion cell dysfunction and death. Detection of glaucoma and its
progression are based on identification of abnormalities or changes in the
optic nerve head (ONH) or the retinal nerve fibre layer (RNFL), either
functional or structural. This review will focus on the identification of
structural abnormalities in the RNFL associated with glaucoma. [...]
Then we pass the information that this abstract is not relevant (according to
the abstract qrels) to the two users that rewrite the query, and we obtain two
new query aspects. One user was asked to prepare a list of terms:
"optic nerve head, ONH, optic disc, fibre layer, diagnosis, retinal,
imaging, RNFL, glaucoma, SLP, Scanning laser polarimetry, HRT, Heidelberg
Retina Tomograph, OCT, Optical Coherence Tomography, GDx"
The other user had to write a sort of information need instead of a list of words:
"Diagnostic accuracy of HRT, OCT and GDx for diagnosing manifest
glaucoma by detecting ONH and RNFL damage."</p>
      <p>The first type of query was written with the aim of entering the key words
contained in the topic title, in the boolean query and in the article that was
given (if relevant), along with other terms which were the result of various
processes: the componential analysis of some words, the variants, the synonyms, the
declensions and the acronyms of some terms, and the reading of other relevant
information using sources on the web4. The componential analysis consists of
breaking down the sememe (i.e. the meaning) of a word into all its sense
components5, e.g. the semes of radiculopathy6 (topic CD007431) are /pathology/,
/nerve root/, /spinal/, /inflammation/, /compression/. Therefore, in this case
we also included all these terms in the query, even though they were not present in the
information need7. We did not enter the semes of all the words, but
just of the terms whose semes we thought could improve the search of the most
relevant articles. Furthermore, if the terms had many variants, we added them
to the query: e.g. in topic CD008760, we did not just enter oesophageal and
oesophagus, but also esophageal and esophagus. Moreover, we tried to use more than
one grammatical form to describe a concept, which is why we did not just
enter nouns, but also verbs and adjectives, e.g. radiculopathy, radicular and
spinal, spine (topic CD007431); endometriosis, endometrial (topic CD012019);
diagnosis, diagnose, diagnosing, diagnostic (topic CD010542). We also added
synonyms, e.g. diagnosis, screening, examination (topic CD009925) and
diagnosis, detection (topic CD010783). As for the process of declension,
sometimes we wrote not only the singular, but also the plural form of a noun,
e.g. dementia, dementias; biomarker, biomarkers (topic CD008782). Then, we
entered the acronym of some terms, e.g. LBP (lumbago) (topic CD007431); mild
cognitive impairment (MCI) (topic CD008782). Finally, the terms were entered
in a random order.</p>
      <p>4 PubMed https://www.ncbi.nlm.nih.gov/pubmed/; The Free Dictionary by Farlex - Medical Dictionary http://medical-dictionary.thefreedictionary.com/radiculopathy; Merriam Webster Dictionary https://www.merriam-webster.com/; Wikipedia https://en.wikipedia.org/wiki/Main_Page</p>
      <p>5 Rastier, F. (1987), Sémantique interprétative, éd. Presses Universitaires de France, 2009, Paris, pp. 18-32; Dubois, J. et al. (1994), Dictionnaire de linguistique et des sciences du langage, éd. Larousse, Paris, pp. 423-424; Ducrot, O., Schaeffer, J.-M. (1972), Nouveau dictionnaire encyclopédique des sciences du langage, éd. du Seuil, 1995, Paris, pp. 445-447.</p>
      <p>6 The Free Dictionary by Farlex - Medical Dictionary http://medical-dictionary.thefreedictionary.com/radiculopathy</p>
      <p>7 "Physical examination for lumbar radiculopathy due to disc herniation in patients with low-back pain."</p>
      <p>The second type of query was written with the aim of reformulating the
information need. The purpose was to rewrite the information given for each
topic using an alternative terminology, trying to produce a
meaningful and humanly readable sentence. For this reason, validly attested synonyms
and orthographic alternatives were used as variants of the medical terms
provided in the original information need. In addition, another criterion was to
systematically replace acronyms with their expansions and expansions with their
acronyms. For example, for topic CD009135, the information need "Rapid tests
for the diagnosis of visceral leishmaniasis in patients with suspected disease"
was rewritten using synonyms and acronyms for "visceral leishmaniasis":
"Evaluation of rapid examinations in order to detect VL (kala-azar, black fever and
Dumdum fever) in patients with clinically suspected infection". This approach
allowed us to expand the medical terminology and to evaluate also the
documents in which the selected variants were present. The sources from which the
terminological variants were selected were PubMed, the online medical
dictionary Merriam Webster and Wikipedia. For the topics presenting
a relevant document (relevance index 1) selected by the expert, the criterion for
re-writing the information need was based on the knowledge acquired by reading
the PubMed article abstract. This step facilitated the reformulation of the title
by focusing on the typology of the request and its related aspects. Conversely,
for the topics where the document's relevance index was 0, the reformulation
was based on the terminology frequency analysis and on in-depth research of
the topic on reliable sources available on the web.</p>
      <p>For step 3), we designed alternative strategies that use the following
parameters:
- number of documents to assess, in batches or iteratively,
- percentage of documents to assess,
- maximum number of documents to assess per iteration,
- number of terms to add at each feedback iteration,
- for the cost-effective evaluation, the minimum precision the system can reach
before stopping the search.</p>
      <p>Simple evaluation For the simple evaluation subtask, we focused on the
number (or percentage) of documents to use for explicit relevance feedback and on how
to combine the query aspects. No threshold was set on the number of documents to
retrieve.</p>
      <p>Cost-effective evaluation For the cost-effective subtask, we performed two
rounds of relevance feedback: first retrieve, then classify. In the first round, we
select a percentage of documents for explicit relevance feedback; then, we use
the relevance information to build the two classes R and NR. Once the two
classes are built, we use the two-dimensional space to pick the document at
partial recall 100% (by `partial', we mean that if during the iteration we retrieve
10 relevant documents out of 20, we pick the relevant document with the lowest
score) and let the classification line pass through that point. Then we iterate the
feedback until precision reaches 0.2.</p>
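      <p>The judge-in-batches loop described above can be sketched in a simplified form (the ranking, the relevance oracle and the batch size are hypothetical stand-ins; the real system re-ranks with the two-dimensional BM25 after every round):</p>

```python
def feedback_loop(ranking, is_relevant, batch=10, min_precision=0.2):
    # ranking: document ids ordered by score; is_relevant: oracle giving
    # explicit relevance feedback. Judge `batch` documents per round and
    # stop when the precision of the current round drops below
    # min_precision.
    judged, found = [], []
    pending = list(ranking)
    while pending:
        round_docs, pending = pending[:batch], pending[batch:]
        hits = [d for d in round_docs if is_relevant(d)]
        judged.extend(round_docs)
        found.extend(hits)
        if len(hits) / len(round_docs) < min_precision:
            break
    return judged, found
```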
      <p>In Figure 1, we show the two-dimensional situation at four different steps of
the iteration. Green dots represent relevant documents, red dots non-relevant
documents, black dots documents to be ranked (or judged). In Figure 1 (a), we
see the documents at the end of the relevance feedback phase. After we re-set the
probabilities by building the two classes of relevant and non-relevant documents,
the documents are in a different position of the two-dimensional space, Figure 1
(b). The space between the interpolating line of the relevant documents (dashed
line) and the line of the last relevant document (dot-dashed line) is the `grey
area' where we expect to find more relevant documents. After a few iterations,
the relevant and non-relevant clouds of points become more and more separate,
Figure 1 (c). When all the documents within the space between the two lines
are judged (plus some others in the `extra rounds'), the system stops sending
documents to the user, Figure 1 (d).</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In all experiments, we used the first document retrieved with a BM25 approach
(and then judged) to build two different queries that represent the same
information need. The two alternative queries are combined with the original one in
different ways, as described in the following sections.</p>
      <p>For all the experiments, we set the best values for the parameters α_R,
α_NR, β_R, β_NR of the BM25 found with a brute-force approach on the training
data. The values are:
- β_R = β_NR = 1.0
- α_R = α_NR = 0.01
These values are consistent with other experiments and indicate that a Beta prior
distribution that discounts the `presence' of a term in favour of its `absence' (high
β and low α) results in a better retrieval performance.</p>
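      <p>The brute-force search over the hyper-parameters can be sketched as a plain grid search (the evaluation function below is a toy stand-in; the actual objective was retrieval performance on the training topics):</p>

```python
from itertools import product

def brute_force(evaluate, alphas, betas):
    # Try every (alpha, beta) pair and keep the best-scoring one.
    best_score, best_params = float("-inf"), None
    for alpha, beta in product(alphas, betas):
        score = evaluate(alpha, beta)
        if score > best_score:
            best_score, best_params = score, (alpha, beta)
    return best_params, best_score

# Toy objective peaking at alpha = 0.01, beta = 1.0 (illustration only).
params, score = brute_force(
    lambda a, b: -((a - 0.01) ** 2 + (b - 1.0) ** 2),
    alphas=[0.01, 0.1, 0.5, 1.0],
    betas=[0.01, 0.1, 0.5, 1.0],
)
print(params)
```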
      <p>Fig. 1: (a) End of relevance feedback; (b) beginning of classification.</p>
      <p>We also run a set of experiments on the training data to find the value of
the number of documents k to use for relevance feedback that gives the best balance</p>
      <p>between cost and effectiveness, and we found that k = 50 is a good compromise.</p>
      <p>4.1 Simple Evaluation</p>
      <p>For the simple evaluation subtask, we submitted four runs:
- ims iafa m10k150f0m10, run-1: this run uses Interactive Automatic
Feedback with query Aspects (iafa) and, for each topic, uses k = 150 feedback
rounds where, at each round, a new word is picked from the relevant
documents and the top document is judged. For each topic, a total of 150
documents are judged.
- ims iafas m10k50f0m10, run-2: this run uses Interactive Automatic
Feedback with query Aspects with Separate rankings (iafas). At each round of
feedback, the two query variants are run in parallel with the original one and
three different documents are judged. There are k = 50 rounds, for a total of
150 documents judged per topic.
- ims iafap m10p2f0m10, run-3: this run uses Interactive Automatic
Feedback with query Aspects using a Percent (iafap) of documents for feedback.
This run is similar to the first one, but it uses a number of documents for
relevance feedback that is proportional to the number of documents to rank.
In this case, p = 2 is two percent of feedback.
- ims iafap m10p5f0m10, run-4: this run is the same as run-3, but the
percent of feedback is p = 5.</p>
      <sec id="sec-4-1">
        <title>Cost-Effective Evaluation</title>
        <p>For the cost-effective evaluation subtask, we submitted four runs. All four
runs use the same approach, named Interactive Automatic Feedback with query
Aspects and Percent of relevance feedback and Classification (iafapc). In
particular, we tried different values of the parameters concerning the percentage of documents
for relevance feedback and the maximum number of documents for relevance
feedback in the initial phase.</p>
        <p>During the classification phase, we calculate the linear interpolation of
relevant documents if 5 or more relevant documents are available; otherwise, we
compute the linear interpolation of non-relevant documents. If the angular
coefficient of the line is less than 0.9, we adjust it. We iterate this process by
selecting the top 10 documents and performing explicit relevance feedback until
precision reaches 0.2. After that point, extra iterations are performed with half
of the documents used in the previous feedback round. We stop if no other
documents are available, or if precision is below 0.2 and we have only one document
left for the extra rounds of relevance feedback.</p>
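        <p>A simplified sketch of the line-fitting step of this classification phase (the least-squares fit stands in for the linear interpolation, and raising the slope to the threshold is an assumption, since the exact adjustment is not specified above):</p>

```python
def fit_line(points):
    # Ordinary least-squares fit of y = m*x + q through the points.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    cov = sum((x - mx) * (y - my) for x, y in points)
    m = cov / var
    return m, my - m * mx

def classification_line(points, min_slope=0.9):
    # Fit the interpolating line of the judged documents and, if the
    # angular coefficient falls below min_slope, raise it to min_slope
    # (hypothetical adjustment rule).
    m, q = fit_line(points)
    if m < min_slope:
        m = min_slope
    return m, q
```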
        <p>- ims iafapc m10p5f0t0p2m10, run-5: this run uses 5 percent of relevance
feedback documents per round of relevance feedback in the initial phase.</p>
        <p>The results for the simple evaluation are reported in Table 1 and Figure 2a,
while the results for the cost-effective evaluation are reported in Table 2 and
Figure 2b.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Final Remarks and Future Work</title>
      <p>In this paper, we presented the experiments of our research group at the CLEF
eHealth Task 2.</p>
      <p>Fig. 2: NCG at different recall values for the simple and cost-effective evaluation. (a) NCG for simple evaluation.</p>
      <p>The objective of our participation in this task was to investigate
a better set of parameters for the BM25, explore the problem of query
aspects and query (re-)formulation given an information need, integrate the query
aspects into the two-dimensional probabilistic model, and study an automatic
feedback loop to find the optimal stopping strategy.</p>
      <p>Some interesting findings during the training phase, which we will document
more deeply in the future, can be summarised as follows:
- there are values for the α and β parameters that clearly outperform the
standard BM25 with α = β = 0.5;
- performing iterative explicit relevance feedback one document at a time
changes significantly the performance of both retrieval and classification (the
cost of training at each round of feedback is very high in computational
terms, though);
- adding query aspects to the original information need increases consistently
the performance of both retrieval and classification;
- choosing the right terms to add during the iterations of relevance feedback
may change significantly the results of both retrieval and classification.</p>
      <p>The results of the test phase presented in the previous section will be
analyzed more deeply in the next weeks. In particular, it is not clear whether a
fixed amount of documents (k = 150, for example) may be better than a fixed
percentage of documents (say, p = 5). It will be interesting to study and
compare the simple and the cost-effective strategies in terms of the actual costs, as
shown by Table 1 and Table 2. We will also continue to investigate the process
of query aspect rewriting by extending it to the case of iteratively rewriting the
query aspects according to the shifts of the two-dimensional cloud of points, as
shown in Figure 2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly, Hanna Suominen, Aurélie Névéol, Aude Robert, Evangelos Kanoulas, Rene Spijker, João Palotti, and Guido Zuccon, editors.
          <source>CLEF 2017</source>
          eHealth
          <article-title>Evaluation Lab Overview</article-title>
          .
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science</source>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          , and Rene Spijker, editors.
          <source>CLEF</source>
          <year>2017</year>
          <article-title>Technologically Assisted Reviews in Empirical Medicine Overview</article-title>
          . In Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          ., CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>A new decision to take for cost-sensitive naïve bayes classifiers</article-title>
          .
          <source>Inf</source>
          . Process. Manage.,
          <volume>50</volume>
          (
          <issue>5</issue>
          ):
          <volume>653</volume>
          -
          <fpage>674</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>Geometric perspectives of the BM25</article-title>
          .
          <source>In Proceedings of the 6th Italian Information Retrieval Workshop</source>
          , Cagliari, Italy, May
          <volume>25</volume>
          -26,
          <year>2015</year>
          .,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>Interactive text categorisation: The geometry of likelihood spaces</article-title>
          .
          <source>Studies in Computational Intelligence</source>
          ,
          <volume>668</volume>
          :
          <fpage>13</fpage>
          -
          <fpage>34</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and Daniel Zilio.
          <article-title>Gamification for IR: the query aspects game</article-title>
          .
          <source>In Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          ), Napoli, Italy, December 5-
          <issue>7</issue>
          ,
          <year>2016</year>
          .,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and Daniel Zilio.
          <article-title>Gamification for machine learning: The classification game</article-title>
          .
          <source>In Proceedings of the Third International Workshop on Gamification for Information Retrieval co-located with 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2016</year>
          ), Pisa, Italy, July
          <volume>21</volume>
          ,
          <year>2016</year>
          ., pages
          <volume>45</volume>
          -
          <fpage>52</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          , Maria Maistro, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Zilio</surname>
          </string-name>
          .
          <article-title>The university of padua (IMS) at TREC 2016 total recall track</article-title>
          .
          <source>In Proceedings of The Twenty-Fifth Text REtrieval Conference</source>
          , TREC 2016, Gaithersburg, Maryland, USA, November
          <volume>15</volume>
          -
          <issue>18</issue>
          ,
          <year>2016</year>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stephen</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Robertson</surname>
            and
            <given-names>Hugo</given-names>
          </string-name>
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <volume>333</volume>
          -
          <fpage>389</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kazutoshi</surname>
            <given-names>Umemoto</given-names>
          </string-name>
          , Takehiro Yamamoto, and
          <string-name>
            <given-names>Katsumi</given-names>
            <surname>Tanaka</surname>
          </string-name>
          .
          <article-title>Scentbar: A query suggestion interface visualizing the amount of missed relevant information for intrinsically diverse search</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2016</year>
          , Pisa, Italy,
          <source>July 17-21</source>
          ,
          <year>2016</year>
          , pages
          <fpage>405</fpage>
          -
          <fpage>414</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>