<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Understandability Features to Personalize Consumer Health Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joao Palotti</string-name>
          <email>palotti@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navid Rekabsaz</string-name>
          <email>rekabsaz@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Technology (TUW) Favoritenstrasse 9-11/188</institution>
          <addr-line>1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the participation of the Vienna University of Technology (TUW)
in CLEF eHealth 2017 Task 3 [
        <xref ref-type="bibr" rid="ref5 ref9">5,9</xref>
        ]. This track has run annually since 2013 (see
[
        <xref ref-type="bibr" rid="ref12 ref3 ref4 ref7">3,4,7,12</xref>
        ]) and this year’s challenge is a continuation of the 2016 edition. The
Information Retrieval task of the CLEF eHealth Lab aims to foster research on search by
health consumers, emphasizing crucial aspects of this domain such as document
understandability and trustworthiness.
      </p>
      <p>In 2016, fifty topics were extracted from real user posts/interactions in the
AskDocs section of Reddit (https://www.reddit.com/r/AskDocs/). Each topic was presented
to six query creators with different medical expertise. Their job was to read a post
(usually containing a medical question) and formulate a query using their medical
background knowledge, if any. In total, 300 queries were created.</p>
      <p>
        This year, the track has four subtasks (named IRTasks, see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for a full
description of each task) and TUW submitted runs for two of them, IRTask 1 and IRTask 2.
IRTask 1 is the ad-hoc task with the same topics as last year, aiming to increase
the number of assessed documents for the collection. IRTask 2 is a new task whose
goal is to personalize the results for each query creator according to their
medical expertise.
      </p>
      <p>The experiments conducted by TUW aim to investigate two research directions:
1. IRTask 1: Can understandability metrics be used to improve retrieval?
2. IRTask 2: How can retrieval be personalized in a learning to rank setting,
according to different reading profiles and user expertise?</p>
      <p>
        For IRTask 1, a previous study conducted in the context of CLEF eHealth
2014 and 2015 ([
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) showed promising improvements when using a small set of
understandability estimators in a learning to rank context. Here we expand both the
set of understandability features used and the non-understandability features
(see Section 2.2). Our aim is to investigate whether the improvements first seen in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
would also occur in this dataset. For IRTask 2, we propose to explicitly define
learning to rank features based on different user profiles, and we study the effect of
the suggested features on system effectiveness.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section we describe our learning to rank approach, the feature set devised,
and our functions to map topical and understandability assessments into a single
relevance label.</p>
      <p>Learning to Rank. Our learning to rank approach is based on three items: (1) a
set of features, (2) a set of &lt;document, label&gt; pairs, and (3) a learning to rank
algorithm. The set of features is described in Section 2.2. We consider in this work
three different functions to label documents: for IRTask 1, we only use the pure
topical relevance as judged in 2016; for IRTask 2, we define two understandability-biased
functions (named boost and float). Given a document with topical relevance T
and understandability score U, and a user with a reading goal G, we define
boost and float as:
boost(T, U) = 2 · T if |G − U| ≤ 0.2, and boost(T, U) = T if |G − U| &gt; 0.2 (1)</p>
      <p>float(T, U) = T · (1.0 − |G − U|) (2)</p>
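      <p>To make the two labeling functions concrete, the following minimal Python sketch (an
illustration under the definitions above, not the exact code used in our experiments)
computes the boost and float labels from a document's topical relevance, its
understandability score, and the user's reading goal:</p>
      <preformat>
def boost_label(topical, understandability, goal):
    """Equation 1: double the topical relevance when the document's
    understandability is within 0.2 of the user's reading goal."""
    if abs(goal - understandability) &lt;= 0.2:
        return 2 * topical
    return topical


def float_label(topical, understandability, goal):
    """Equation 2: scale the topical relevance by how close the document's
    understandability is to the user's reading goal."""
    return topical * (1.0 - abs(goal - understandability))


# Example: a highly relevant document (T = 2) with understandability 0.5,
# for a user whose reading goal is 0.6.
print(boost_label(2, 0.5, 0.6))  # 4
print(float_label(2, 0.5, 0.6))  # 1.8
      </preformat>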
      <p>
        As topical relevance scores are either 0, 1 or 2, and the understandability
scores are float numbers from 0.0 to 1.0, the possible values of the boost function
are the integers 4, 2, 1 and 0, while the possible values of the float function are any
floating-point number between 0.0 and 2.0. All experiments used the pairwise
learning to rank algorithm based on gradient boosting implemented in XGBoost
(https://github.com/dmlc/xgboost/tree/master/demo/rank), with NDCG@20 as the metric
to be optimized. Differently from past work [
        <xref ref-type="bibr" rid="ref10 ref6">10,6</xref>
        ], we
consider up to 1000 documents when re-ranking.
      </p>
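      <p>A minimal sketch of this setup with the XGBoost low-level API is shown below. The
feature matrix, labels, and per-query group sizes are random placeholders, and the
parameter values are illustrative rather than the exact ones used in our submitted runs:</p>
      <preformat>
import numpy as np
import xgboost as xgb

# X: one row of 91 features per query-document pair.
# y: relevance labels (topical labels for IRTask 1; boost/float labels for IRTask 2).
# groups: number of candidate documents per query (we re-rank up to 1000 per query).
X = np.random.rand(3000, 91)
y = np.random.randint(0, 3, size=3000)
groups = [1000, 1000, 1000]

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)

params = {
    "objective": "rank:pairwise",  # pairwise learning to rank
    "eval_metric": "ndcg@20",      # optimize NDCG at cutoff 20
    "eta": 0.1,                    # illustrative learning rate
    "max_depth": 6,                # illustrative tree depth
}
model = xgb.train(params, dtrain, num_boost_round=100)

# Re-rank the candidates of one query: score them and sort by descending score.
scores = model.predict(xgb.DMatrix(X[:1000]))
reranked = np.argsort(-scores)
      </preformat>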
      <p>
        Features. We devised 91 features from three distinct groups: traditional information
retrieval features, understandability-related features, and the modified output of
regression algorithms trained to estimate the understandability of a document.
More elaborate features based on recent advances in semantic similarity, as used in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
are left as future work. A comprehensive list of all features used in this work is
shown in Table 1.
      </p>
      <p>Table 1. Features used in this work, grouped into IR Features (12), Understandability
Features (72), and Regression Features (7); each feature category is followed by its feature names.</p>
      <p>IR Features (12):
Common IR Models (7): BM25, PL2, TF-IDF, LemurTF-IDF, DirichletLM, DFRee, Hiemstra LM.
Query Independent (3): Document Length, Document Spam Score, Document PageRank.
Document Score Modifiers (2): Markov Random Field, Divergence from Randomness.</p>
      <p>Understandability Features (72):
Traditional Formulas (8): ARI Index, Coleman-Liau Index, Dale-Chall Score, Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog Index, LIX Index, SMOG Index.
Surface Measures (25): # Characters ♦†, # Sentences ♦, # Syllables ♦†, # Words ♦†, # (Syllables(Word) &gt; 3) ♦†, # (| Word | &gt; 4) ♦†, # (| Word | &gt; 6) ♦†, # (| Word | &gt; 10) ♦†, # (| Word | &gt; 13) ♦†.
General Vocabulary Related (12): English Dictionary ♦†, Stopwords ♦†, Numbers ♦†, Dale-Chall List ♦†.
Medical Vocabulary Related (27): Acronyms ♦†, MeSH ♦†, DrugBank ♦†, ICD-10 (International Classification of Diseases) ♦†, Medical Prefixes ♦†, Medical Suffixes ♦†, Consumer Health Vocabulary ♦†, Sum(CHV Score) ♦†, Mean(CHV Score) ♦†.</p>
      <p>Regression Features (7):
Modified Regression Scores (7): Ada Boosting Regressor, Extra Tree Regressor, Gradient Boosting Regressor, K-Nearest Neighbors Regressor, Linear Regression, Support Vector Machine Regressor, Random Forest Regressor.</p>
      <p>
        IR Features: Regularly used information retrieval features are considered in
this work. This list includes many commonly used retrieval models and document-specific
values, such as Spam scores [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and PageRank scores (http://www.lemurproject.org/clueweb12/PageRank.php).
      </p>
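      <p>The retrieval model scores themselves are produced by the search engine used to index
the collection (a Terrier index, as described in Section 4). Purely as an illustration of one
of the Common IR Models listed in Table 1, a minimal BM25 scorer could look like the
following sketch, with the usual default values for k1 and b (not necessarily those of our index):</p>
      <preformat>
import math

def bm25(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25 score of one document for one query.
    doc_terms: tokens of the document; doc_freqs: term -&gt; document frequency."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
      </preformat>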
      <p>
        Understandability Features: All HTML pages were preprocessed with
Boilerpipe (https://pypi.python.org/pypi/boilerpipe) to remove undesirable boilerplate content, as suggested in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Then,
a series of traditional readability metrics was calculated [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as well as a number
of basic syntactic and lexical features that are important components of such
readability metrics. Finally, we measure the occurrence of words in different
vocabularies, both medical and non-medical ones.
      </p>
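      <p>As an illustration of the understandability features, the sketch below computes one of the
traditional formulas (Flesch Reading Ease) and a few of the surface measures of Table 1 from
pre-computed sentence, word, and syllable counts; it assumes the text has already been cleaned
with Boilerpipe and tokenized, which is a simplification of the actual pipeline:</p>
      <preformat>
def flesch_reading_ease(n_sentences, n_words, n_syllables):
    # Standard Flesch Reading Ease formula: higher scores mean easier text.
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def surface_measures(words, syllables_per_word):
    # A few of the surface measures listed in Table 1 (raw counts only).
    return {
        "n_words": len(words),
        "n_long_words_4": sum(1 for w in words if len(w) &gt; 4),
        "n_long_words_10": sum(1 for w in words if len(w) &gt; 10),
        "n_polysyllabic": sum(1 for s in syllables_per_word if s &gt; 3),
    }
      </preformat>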
      <p>Regression Features: We adapted the output of regression algorithms to
create personalized features. The 2016 judgements were used as labels for a
number of regression algorithms (the list of algorithms used is shown in Table 1).
Models were trained on a Latent Semantic Analysis (LSA) applied to the words
of 3,549 documents marked as topically relevant in the QRels from 2016, whose
understandability labels varied from 0 (easy to understand) to 100 (hard to
understand). The number of LSA dimensions varied from 40 to 240, according to the best
result of a 10-fold cross-validation experiment. In order to avoid interference from the
training set in the learning to rank algorithm, scores for the documents in the training
set were also predicted in a 10-fold cross-validation fashion. The personalization
step consisted in calculating the absolute difference between the estimated score
and the goal score, which is defined by the user. For example, if the score estimated
by a regression algorithm for a document D was 0.45 and the reading goal of
a user U was 0.80, we used as feature the value 0.35 (the absolute difference
between 0.80 and 0.45). We want to evaluate whether features like these can help the
learning to rank model adapt to the reading skills of a user.</p>
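      <p>A simplified sketch of how one such regression feature could be produced is shown below,
using scikit-learn with TruncatedSVD as the LSA step, one of the regressors from Table 1, and
cross_val_predict for the out-of-fold predictions; the corpus, labels, and parameter values
are random placeholders:</p>
      <preformat>
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict

# Placeholder corpus and understandability labels (0 = easy, 100 = hard).
vocab = ["word%d" % i for i in range(500)]
docs = [" ".join(random.choices(vocab, k=200)) for _ in range(200)]
labels = [random.randint(0, 100) for _ in range(200)]

# LSA representation followed by one of the regressors listed in Table 1.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=40),  # LSA dimensionality tuned between 40 and 240
    RandomForestRegressor(),
)

# Out-of-fold predictions for the training documents (10-fold cross-validation).
predicted = cross_val_predict(pipeline, docs, labels, cv=10)

# Personalization step: absolute difference between the (normalized) estimated
# score and the user's reading goal, used as a learning to rank feature.
goal = 0.80
features = [abs(goal - score / 100.0) for score in predicted]
      </preformat>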
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>Evaluation Metrics. We consider a large number of evaluation metrics in this work.
As topical relevance-centred evaluation metrics, we consider Precision at 10 (P@10) and Rank
Biased Precision with the μ parameter set to 0.8 (RBP(0.8)). Because a
learning to rank algorithm has the potential to bring many unjudged documents
to the top of the ranking list, we also consider a modified version of P@10, Only
Judged P@10, which calculates P@10 considering only the first 10 judged
documents of each topic.</p>
      <p>
        As modified metrics that take understandability scores into account, we
consider the understandability-biased Rank Biased Precision, also with the μ parameter
set to 0.8 (uRBP(0.8)), as proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and propose three new metrics for
personalized search.
      </p>
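      <p>For reference, a simple sketch of RBP and of its understandability-biased variant uRBP
(following the general form proposed in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; the per-rank relevance and understandability gains are simplified placeholders here) is:</p>
      <preformat>
def rbp(gains, p=0.8):
    """Rank Biased Precision: gains is a list of per-rank relevance gains in [0, 1]."""
    return (1 - p) * sum(g * p ** i for i, g in enumerate(gains))

def urbp(rel_gains, und_gains, p=0.8):
    """Understandability-biased RBP: the gain at each rank combines the topical
    relevance gain with an understandability gain (both in [0, 1])."""
    return (1 - p) * sum(r * u * p ** i
                         for i, (r, u) in enumerate(zip(rel_gains, und_gains)))
      </preformat>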
      <p>The first personalization-aware metric is a specialization of uRBP, auRBP,
which uses an α parameter to model the kind of documents a user
wants to read. We assume that α models the understandability
profile of an entity: a low α is assigned to items/documents/users that are
expert, while a high α indicates the opposite. We assume that a user with a low α
wants to read specialized documents to the detriment of easy and introductory
documents, while laypeople want the opposite. In auRBP we model a penalty
for the case in which a low α document is presented to a user who wants high
α documents, and vice versa. While we are still investigating which function best
models this penalty, we currently assume a Gaussian penalty. Figure 1 shows
an example in which a user wants to read documents with α=20;
other values of α are penalized according to a normal curve with mean 20 and
standard deviation 30. We use a standard deviation of 30 in all of our experiments,
and ways to estimate a suitable value for it are left as future work.</p>
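      <p>A minimal sketch of the Gaussian penalty we assume is given below; the exact scaling
used inside auRBP may differ, and here the gain is simply normalized so that a perfect match
with the user's target α yields 100:</p>
      <preformat>
import math

def gaussian_gain(doc_alpha, user_alpha, sigma=30.0):
    """Gain in [0, 100]: 100 when the document matches the user's target alpha,
    decaying as a normal curve (standard deviation sigma) as they diverge."""
    return 100.0 * math.exp(-((doc_alpha - user_alpha) ** 2) / (2.0 * sigma ** 2))

# Example: user who wants to read documents with alpha = 20 (as in Figure 1).
print(gaussian_gain(20, 20))   # 100.0 (perfect match)
print(gaussian_gain(80, 20))   # strongly penalized
      </preformat>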
      <p>The second and third personalization-aware metrics are simple modifications
of Precision at depth X. For the relevant documents found in the top X, we
inspect how far the understandability label of each document is from the value
expected by the user. We can penalize the absolute difference linearly
(LinUndP@X) or using the same Gaussian curve as in auRBP (GaussianUndP@X).
Note that lower values are better for LinUndP@10, meaning that the distance
from the required understandability value is small, and higher values are better
for GaussianUndP@10, as a value of 100 is the best one could reach.</p>
      <p>[Figure 1: Gaussian penalty curve for a user who wants to read documents with α=20;
α values from 0 to 100 on the x-axis are mapped to a gain between 0 and 100 according to a
normal curve with mean 20 and standard deviation 30.]</p>
      <p>Seven runs were submitted to IRTask 1 and another seven were submitted to
IRTask 2. Tables 2 and 3 present a summary of each approach and the results using the
2016 QRels, for the submissions to IRTask 1 and IRTask 2, respectively.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Conclusion</title>
      <p>As shown in Tables 2 and 3, we based our runs on the BM25 implementation from
a Terrier 4.2 index of ClueWeb 12-B. The results of the runs using relevance feedback are
high because the documents judged as relevant appear at the top of the ranking
list of each topic, but this does not necessarily mean that these approaches will be
much better than a plain BM25 for 2017, as the already judged documents will
be discarded by the organizers.</p>
      <p>[Table 2: Results on CLEF eHealth 2016 QRels.]</p>
      <p>[Table 3: Results on CLEF eHealth 2016 QRels.]</p>
      <p>We observe that higher RBP(0.8) values are accompanied by higher uRBP(0.8) and
auRBP(0.8) values. This means that our efforts to retrieve more topically relevant documents
also increase uRBP and auRBP, but do not affect the LinUnd. and GaussianUnd. metrics.</p>
      <p>We look forward to evaluating our results with the 2017 QRels; as
the 2017 assessments are still being conducted, an analysis of the official
results will be posted online at
https://github.com/joaopalotti/tuw_at_clef_ehealth_2017.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Gordon V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark D.</given-names>
            <surname>Smucker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Charles L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          .
          <article-title>Efficient and effective spam filtering and re-ranking for large web datasets</article-title>
          .
          <source>CoRR, abs/1004.5168</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>William H.</given-names>
            <surname>Dubay</surname>
          </string-name>
          .
          <article-title>The principles of readability</article-title>
          . Costa Mesa, CA: Impact Information,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Gareth JF Jones,
          <string-name>
            <surname>Liadh Kelly</surname>
          </string-name>
          , Johannes Leveling, Allan Hanbury, Henning Müller, Sanna Salantera, Hanna Suominen, and Guido Zuccon.
          <source>ShARe/CLEF eHealth Evaluation Lab</source>
          <year>2013</year>
          ,
          <article-title>Task 3: Information retrieval to address patients' questions when reading clinical reports</article-title>
          .
          <source>CLEF 2013 Online Working Notes</source>
          ,
          <volume>8138</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Lee</surname>
          </string-name>
          , Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Henning Mueller, and
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <source>ShARe/CLEF eHealth Evaluation Lab</source>
          <year>2014</year>
          ,
          <article-title>Task 3: User-centred health information retrieval</article-title>
          .
          <source>In CLEF 2014 Evaluation Labs and Workshop:</source>
          Online Working Notes, Sheffield, UK,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly, Hanna Suominen, Aurélie Névéol, Aude Robert, Evangelos Kanoulas, Rene Spijker, Joao Palotti, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>CLEF 2017 eHealth Evaluation Lab overview</article-title>
          .
          <source>In Proceedings of CLEF 2017 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS)</source>
          , Springer,
          <year>September 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Joao</given-names>
            <surname>Palotti</surname>
          </string-name>
          , Lorraine Goeuriot, Guido Zuccon, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>Ranking health web pages with relevance and understandability</article-title>
          .
          <source>In Proceedings of the 39th international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>965</fpage>
          -
          <lpage>968</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Joao</given-names>
            <surname>Palotti</surname>
          </string-name>
          , Guido Zuccon, Lorraine Goeuriot, Liadh Kelly, Allan Hanbury,
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , Mihai Lupu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Pecina</surname>
          </string-name>
          .
          <source>CLEF eHealth Evaluation Lab</source>
          <year>2015</year>
          ,
          <article-title>Task 2: Retrieving Information about Medical Symptoms</article-title>
          .
          <source>In CLEF 2015 Online Working Notes. CEUR-WS</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Joao</given-names>
            <surname>Palotti</surname>
          </string-name>
          , Guido Zuccon, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>The influence of pre-processing on the estimation of readability of web documents</article-title>
          .
          <source>In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management</source>
          , pages
          <fpage>1763</fpage>
          -
          <lpage>1766</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Joao</given-names>
            <surname>Palotti</surname>
          </string-name>
          , Guido Zuccon, Jimmy, Pavel Pecina, Mihai Lupu, Lorraine Goeuriot, Liadh Kelly, and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>CLEF 2017 task overview: The IR task at the eHealth evaluation lab</article-title>
          .
          <source>In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings. Proceedings of CLEF 2017 - 8th Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Luca</given-names>
            <surname>Soldaini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nazli</given-names>
            <surname>Goharian</surname>
          </string-name>
          .
          <article-title>Learning to Rank for Consumer Health Search: A Semantic Approach</article-title>
          , pages
          <fpage>640</fpage>
          -
          <lpage>646</lpage>
          . Springer International Publishing,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>Understandability biased evaluation for information retrieval</article-title>
          .
          <source>In Proc. of ECIR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          , Joao Palotti, Lorraine Goeuriot, Liadh Kelly, Mihai Lupu, Pavel Pecina, Henning Mueller, Julie Budaher, and
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Deacon</surname>
          </string-name>
          .
          <article-title>The IR Task at the CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval</article-title>
          .
          <source>In CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS</source>
          ,
          <year>September 2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>