=Paper=
{{Paper
|id=Vol-1276/MedIR-SIGIR2014-08
|storemode=property
|title=Integrating Understandability in the Evaluation of Consumer Health Search Engines
|pdfUrl=https://ceur-ws.org/Vol-1276/MedIR-SIGIR2014-08.pdf
|volume=Vol-1276
|dblpUrl=https://dblp.org/rec/conf/sigir/ZucconK14
}}
==Integrating Understandability in the Evaluation of Consumer Health Search Engines==
Guido Zuccon (Queensland University of Technology, Brisbane, Australia, g.zuccon@qut.edu.au) and Bevan Koopman (Australian e-Health Research Centre, CSIRO, Brisbane, Australia, bevan.koopman@csiro.au)

Copyright is held by the author/owner(s). MedIR, July 11, 2014, Gold Coast, Australia. ACM SIGIR.

===Abstract===
In this paper we propose a method that integrates the notion of understandability, as a factor of document relevance, into the evaluation of information retrieval systems for consumer health search. We consider the gain-discount evaluation framework (RBP, nDCG, ERR) and propose two understandability-based variants (uRBP) of rank biased precision, characterised by an estimation of understandability based on document readability and by different models of how readability influences user understanding of document content. The proposed uRBP measures are empirically contrasted with RBP by comparing the system rankings obtained with each measure. The findings suggest that considering understandability along with topicality in the evaluation of information retrieval systems leads to different claims about system effectiveness than considering topicality alone.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval

General Terms: Evaluation.

===1. Introduction===
Searching for health advice on the Web is an increasingly common practice. Recent research found that in 2012 about 58% of US adults (72% of all US Internet users, up from 66% in 2011) consulted the Internet for health advice [5]; of these, 77% used search engines like Google, Bing or Yahoo! to gather health information, while only 13% started their health information seeking activities from specialised sites such as WebMD. It is, therefore, crucial to create and evaluate information retrieval (IR) systems that specifically support consumers searching for health advice on the Web. In this paper, we focus on the evaluation of IR systems for consumer health search.

Previous studies within health informatics have investigated online consumer health information beyond its topicality to specific health topics, in particular with respect to its understandability and reliability. For example, Wiener and Wiener-Pla [11] investigated the readability (measured by the SMOG reading index [7]) of Web pages concerning pregnancy and the periodontium as retrieved by Google, Bing and Yahoo!. Walsh and Volsko [10] showed that most online information sampled from five US consumer health organisations and related to the top five medical causes of death in the US is presented at a readability level (measured by the SMOG, FOG and Flesch-Kincaid reading indexes [7]) that exceeds that of the average US citizen (7th grade level). Ahmed et al. [1] highlighted the variability in readability (measured by the Flesch Reading Ease and Flesch-Kincaid reading indexes [7]) and quality of concussion information accessed through Google searches. The understandability and reliability of online health information are considered critical issues for supporting online consumer health search because (1) consumers may not benefit from health information that is not provided in an understandable way; and (2) the provision of unreliable, misleading or false information on a health topic, e.g., a medical condition or treatment, may lead to negative health outcomes. This previous research suggests that topicality should not be considered the only relevance factor when assessing the effectiveness of IR systems for consumer health search: other factors, such as understandability and reliability, should also be included in the evaluation framework.

Research on the user perception of document relevance has shown that users' relevance assessments are affected by a number of factors beyond topicality, although topicality has been found to be the essential relevance criterion. For example, Xu and Chen proposed and validated a five-factor model of relevance consisting of novelty, reliability, understandability and scope, along with topicality [12]. Their empirical findings highlight the importance of understandability, reliability and novelty, along with topicality, in the relevance judgements they collected. Nevertheless, typical evaluation of IR systems commonly considers relevance assessments in terms of topicality only (with the recent exception of novelty and diversity, e.g., [4]); this is also the case when evaluating systems for consumer health search, for example within CLEF eHealth 2013 [6].
In this paper, we aim to close this gap in the evaluation of IR systems, focusing on integrating understandability along with topicality in the evaluation of consumer health search engines. The integration of other factors influencing relevance, such as reliability, is left for future work.

The integration of understandability within the evaluation methodology is achieved by extending the general gain-discount framework synthesised by Carterette [3]; this framework encompasses the widely used nDCG, RBP and ERR. The result is a series of understandability-biased evaluation measures. Specifically, we examine one such measure, the understandability-based rank biased precision (uRBP), a variant of rank biased precision (RBP) [8]; variants of nDCG and ERR may also be derived within our framework.

The proposed evaluation measure is further instantiated by considering specific estimations of understandability based on readability measures computed for each retrieved document. While understandability encompasses other aspects in addition to text readability (e.g., prior knowledge), readability is a good first approximation of understandability. This choice is also supported by prior work in health informatics regarding the understandability of consumer health information (e.g., see [11, 10, 1]).

The impact of the proposed framework and of the specific resultant measures on the evaluation of IR systems is investigated in the context of the consumer health search task of CLEF eHealth 2013 [6]; empirical findings show that systems that are most effective according to uRBP are not necessarily as effective when considering topicality alone (i.e., RBP).
===2. Understandability-Based Evaluation===

====2.1 The gain-discount framework====
We tackle the problem of jointly evaluating topicality and understandability for measuring IR system effectiveness within the gain-discount framework synthesised by Carterette [3]. Within this framework, the effectiveness of a system, conveyed by a ranked list of documents, is measured by the evaluation measure M, defined as:

M = (1/N) Σ_{k=1..K} g(k) d(k)    (1)

where g(k) and d(k) are respectively the gain and discount functions computed for the document at rank k (for simplicity of notation, in the following we override k to represent either the rank position k or the document at rank k; the context of use determines the intended meaning), K is the depth of assessment at which the measure is evaluated, and 1/N is an (optional) normalisation factor, which serves to bound the value of the sum to the range [0, 1] (see [9]).

Different measures developed within the gain-discount framework are characterised by different instantiations of its components. For example, the discount function in RBP is modelled by d(k) = β^(k−1), where β ∈ [0, 1] reflects user behaviour (high values representing persistent users, low values representing impatient users); in nDCG the discount function is given by d(k) = 1/log2(1 + k), and in ERR by d(k) = 1/k. Similarly, instantiations of the gain function differ depending on the considered measure. In RBP, the gain function is binary-valued (i.e., g(k) = 1 if the document at rank k is relevant, g(k) = 0 otherwise); for nDCG, g(k) = 2^r(k) − 1, and for ERR, g(k) = (2^r(k) − 1)/2^rmax (with r(k) being the relevance grade of the document at rank k).

Without loss of generality, we can express the gain provided by a document at rank k as a function of its probability of relevance; for simplicity we shall write g(k) = f(P(R|k)), where P(R|k) is the probability of relevance given the document at rank k. Note that a similar form has been used to define the gain function of time-biased evaluation measures [9]. The specific instantiations of g(k) in measures like RBP, nDCG and ERR can then be seen as the application of different functions f(.) to estimations of P(R|k). Traditional TREC-style relevance assessors are instructed to consider topicality as the only (explicit) factor influencing relevance, thus P(R|k) = P(T|k), i.e., the probability that the document at k is topically relevant (to a query).

====2.2 Integrating understandability====
As discussed in previous work, e.g. [12], relevance is influenced by many factors, topicality being only one of them, although the most important. To integrate understandability into the gain-discount framework, we model P(R|k) as the joint P(T,U|k), i.e., the probability of relevance of a document (at rank k) is estimated using the joint probability of the document being topical and understandable.

To compute the joint probability we assume that topicality and understandability are compositional events and that their probabilities are independent, i.e., P(T,U|k) = P(T|k)P(U|k). This is a strong assumption and its limitations are briefly discussed in Section 4. Following this assumption, the gain function in the gain-discount framework is expressed as:

g(k) = f(P(R|k)) = f(P(T|k) P(U|k))    (2)

Different evaluation measures developed within this framework would instantiate f(P(T|k)P(U|k)) in different ways. In the following we propose two RBP-based instantiations; other instantiations are left for future work.
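To make the framework concrete, the following minimal Python sketch implements Equation (1) with the RBP, nDCG and ERR discounts listed above. The function names, the binary example gains and the handling of the normalisation factor N are our own illustrative choices; the paper itself provides no code.

```python
import math

def rbp_discount(k, beta=0.95):
    # RBP discount: d(k) = beta^(k-1); high beta models a persistent user.
    return beta ** (k - 1)

def ndcg_discount(k):
    # nDCG discount: d(k) = 1 / log2(1 + k).
    return 1.0 / math.log2(1 + k)

def err_discount(k):
    # ERR discount: d(k) = 1 / k.
    return 1.0 / k

def gain_discount_measure(gains, discount, N=1.0):
    # Equation (1): M = (1/N) * sum over k = 1..K of g(k) * d(k),
    # where `gains` lists g(k) for ranks 1..K.
    return sum(g * discount(k) for k, g in enumerate(gains, start=1)) / N

# Binary gains, as in RBP: g(k) = 1 iff the document at rank k is relevant.
gains = [1, 0, 1, 1, 0]
beta = 0.95
# For RBP, 1/N = (1 - beta), i.e. RBP = (1 - beta) * sum beta^(k-1) g(k).
rbp = gain_discount_measure(gains, lambda k: rbp_discount(k, beta),
                            N=1.0 / (1 - beta))
```

Passing the discount in as a function keeps Equation (1) as the single shared skeleton: each named measure is then just a particular choice of gain, discount and N.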
====2.3 Estimating understandability====
In the traditional TREC setting, assessments of the topicality of a document to a query are collected through the manual annotation of query-document pairs by assessors, i.e., binary or graded relevance assessments (recall that, although called "relevance assessments", in TREC-style assessments annotators are usually instructed to consider only the topicality of a document to a query, isolating this factor from the others that influence relevance in real settings); these are then turned into estimations of P(T|k). This process could be mimicked to collect understandability assessments; in this paper, however, we do not explore this possibility. Instead, we explore computing understandability as a property of a document and integrating it into the evaluation process along with standard relevance assessments. To this aim, readability is used as a proxy for understandability. (The limitations of this choice are briefly noted in Section 4.) Its use is nonetheless justifiable because readability is one of the aspects that influence the understanding of text.

To estimate readability (and thus understandability), we employ established general readability measures such as those used in [1, 10, 11], e.g., the SMOG, FOG and Flesch-Kincaid reading indexes. These measures consider the surface level of the text contained in Web pages, that is, the wording and syntax of sentences: the presence of long sentences, words with many syllables, and unpopular words are all indicators of text that is difficult to read [7]. In this paper, we use the FOG measure to estimate the readability of a text; the FOG reading level is computed as

FOG(d) = 0.4 · (avgslen(d) + phw(d))    (3)

where avgslen(d) is the average length of the sentences in a document d and phw(d) is the percentage of hard words (i.e., words with more than two syllables) in d.

The use of such general readability measures to assess the readability of documents concerning health information has been questioned, as they do not seem to adequately correlate with human judgements for documents in this domain [13]. Nevertheless, the adoption of standard readability measures in this paper is a first step towards demonstrating the use of the proposed understandability-biased measures and analysing how system rankings change accordingly. In addition, their usage is partially supported by previous work within health informatics on assessing the readability of online health advice [1, 10, 11].
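As a worked example of Equation (3), here is a rough Python sketch of the FOG computation. The regular-expression tokenisation and the vowel-group syllable counter are crude assumptions of ours; the paper does not specify these details, and a careful implementation would use a proper sentence splitter and a pronunciation dictionary.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog(text):
    # Equation (3): FOG(d) = 0.4 * (avgslen(d) + phw(d)).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    avgslen = len(words) / len(sentences)  # average sentence length, in words
    hard = sum(1 for w in words if count_syllables(w) > 2)  # "hard" words
    phw = 100.0 * hard / len(words)        # percentage of hard words
    return 0.4 * (avgslen + phw)

print(fog("The provision of unreliable information may lead "
          "to negative health outcomes."))
```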
====2.4 Modelling P(U|k)====
Given the readability score of the document at rank k, P(U|k) needs to be estimated; this is achieved by considering user models that encode different ways in which a user is affected by document readability.

We first consider a user model, P1(U|k), in which a user is characterised by a readability threshold th: every document with a readability score below th is considered certainly understandable, i.e., P1(U|k) = 1, while documents with readability above th are considered not understandable, i.e., P1(U|k) = 0. This is a (Heaviside) step function centred at th; the function is depicted in Figure 1 (P1(U|k)) with th = 20, along with the FOG readability score distribution for documents from CLEF eHealth 2013 [6]. The use of a step function to model P(U|k) is akin to the gain function in RBP (also a step function). The understandability-based RBP for user model one is then given by:

uRBP1 = (1 − β) Σ_{k=1..K} β^(k−1) r(k) u1(k)    (4)

where, for simplicity of notation, u1(k) indicates the value of P1(U|k) and r(k) is the (topical) relevance assessment of document k (alternatively, the value of P(T|k)); thus g(k) = f(P(T|k)P(U|k)) = P(T|k)P(U|k) = r(k)u1(k).

A second user model, P2(U|k), is proposed, in which the probability estimation is similar to a step function but smoothed in the surroundings of the threshold value; this provides a more realistic transition between readable and non-readable content:

P2(U|k) ∝ 1/2 − arctan(FOG(k) − th)/π    (5)

where arctan is the arctangent trigonometric function and FOG(k) is the FOG readability score of the document at rank k; other readability scores could be used instead of FOG. The distribution of P2(U|k) values is shown in Figure 1. Equation 5 is not a proper probability distribution, but one can be obtained by normalising Equation 5 by its integral over [min FOG(k), max FOG(k)]; Equation 5 is, however, rank equivalent to such a distribution and does not change the behaviour of the uRBP variant. These settings lead to the formulation of a second understandability-based RBP, uRBP2, based on the second user model, obtained by simply substituting u2(k) = P2(U|k) for u1(k) in Equation 4.

Note that in both understandability-based measures (as well as in the original RBP) the contribution of an irrelevant document is zero, irrespective of its P(U|k). The contribution (to the gain) of a relevant document with a readability score above th is 1 for RBP, 0 for uRBP1, and less than 0.5 for uRBP2 (for uRBP2 the contribution quickly tends to 0 the further the readability score exceeds the threshold value).

Finally, note that it is possible to design other user models of how readability influences document understandability; the challenge is to determine which model better represents the relationship between readability and document understanding.

''Figure 1: Distributions of P1(U|k) and P2(U|k) with respect to threshold th = 20 (y-axis: P(U|k); x-axis: readability score of the document at k), along with the density distribution of readability scores (computed using FOG) for the documents in the CLEF eHealth 2013 qrels.''
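Continuing the sketch under the same assumptions, the two user models and Equation (4) can be written as follows; the parallel-list input format and the example relevance values and FOG scores are illustrative only.

```python
import math

def u1(fog_score, th):
    # User model 1: Heaviside step, P1(U|k) = 1 iff FOG(k) is below th.
    return 1.0 if fog_score < th else 0.0

def u2(fog_score, th):
    # User model 2, Equation (5) (unnormalised):
    # P2(U|k) proportional to 1/2 - arctan(FOG(k) - th) / pi.
    return 0.5 - math.atan(fog_score - th) / math.pi

def urbp(rels, fog_scores, u, th, beta=0.95):
    # Equation (4): uRBP = (1 - beta) * sum_k beta^(k-1) * r(k) * u(k).
    return (1 - beta) * sum(
        (beta ** k) * r * u(f, th)              # k is 0-based here
        for k, (r, f) in enumerate(zip(rels, fog_scores))
    )

rels = [1, 0, 1, 1, 0]                # topical relevance r(k), ranks 1..5
fogs = [8.2, 25.0, 14.1, 21.3, 9.9]   # FOG score of the document at each rank
urbp1 = urbp(rels, fogs, u1, th=15)
urbp2 = urbp(rels, fogs, u2, th=15)
```

In this toy ranking with th = 15, the relevant document at rank 4 (FOG 21.3) contributes nothing to uRBP1 but a small amount (u2 ≈ 0.05) to uRBP2, illustrating the smoothing effect of the arctan transition.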
===3. Empirical Analysis===

====3.1 Experiment design and settings====
To understand how accounting for understandability influences the evaluation of IR systems tailored to searching for health advice on the Web, we consider the runs submitted to CLEF eHealth 2013 [6], which specifically aimed at evaluating systems for this task. Our empirical experiments and subsequent analysis focus on the changes in system rankings obtained when evaluating with a standard measure (RBP) and with the understandability-based measures (uRBP1 and uRBP2). System rankings are compared using Kendall rank correlation (τ) and AP correlation [14] (τAP), which gives higher weight to rank changes that affect top systems. We do not experiment with different values of β in RBP, and set β = .95 across RBP and uRBP.

The document collection used in CLEF eHealth 2013 has been retired due to the removal of duplicate and copyrighted documents; to allow reproducibility of the reported results we thus use the CLEF eHealth 2014 collection (a subset of the CLEF eHealth 2013 collection), together with the 2013 qrels for relevance assessment. For each document in the collection, the FOG readability score (Equation 3) was computed; the score distribution for all documents in the CLEF eHealth 2013 qrels is shown in Figure 1. Three thresholds on the FOG readability values were explored for the computation of the two alternative formulations of uRBP: th = 10, 15, 20. Documents with a FOG score below 10 should be near-universally understandable, while FOG scores above 15 and 20 increasingly restrict the audience able to understand the text.
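For the ranking comparison, Kendall's τ is available off the shelf (e.g., scipy.stats.kendalltau), while τAP can be implemented directly from its definition in [14]. The sketch below, including the toy score dictionaries, is our own illustration of this comparison methodology, not the authors' code.

```python
from scipy.stats import kendalltau

def tau_ap(true_order, pred_order):
    # AP rank correlation [14]: like Kendall's tau, but discordances near
    # the top of the ranking are penalised more heavily. Both arguments
    # list system ids, best system first; ties are ignored for brevity.
    true_pos = {s: i for i, s in enumerate(true_order)}
    n = len(pred_order)
    total = 0.0
    for i in range(1, n):
        # C(i): systems ranked above position i in pred_order that the
        # reference ranking also places above the system at position i.
        c = sum(1 for s in pred_order[:i]
                if true_pos[s] < true_pos[pred_order[i]])
        total += c / i
    return 2.0 * total / (n - 1) - 1.0

# Toy example: three systems scored by RBP and by uRBP (higher is better).
rbp_scores = {"sysA": 0.30, "sysB": 0.25, "sysC": 0.10}
urbp_scores = {"sysA": 0.12, "sysB": 0.20, "sysC": 0.05}
systems = sorted(rbp_scores)
tau, _ = kendalltau([rbp_scores[s] for s in systems],
                    [urbp_scores[s] for s in systems])
by_rbp = sorted(systems, key=rbp_scores.get, reverse=True)
by_urbp = sorted(systems, key=urbp_scores.get, reverse=True)
print(tau, tau_ap(by_rbp, by_urbp))
```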
====3.2 Results and analysis====
Figure 2 reports RBP vs. uRBP for the IR systems participating in CLEF eHealth 2013, for the two user models proposed in Section 2.4 and the three readability thresholds considered in the experiments. Similarly, Table 1 reports the values of Kendall rank correlation (τ) and AP correlation (τAP) between the system rankings obtained with RBP and with the two versions of uRBP.

''Figure 2: RBP vs. uRBP of CLEF eHealth 2013 systems (left: uRBP1; right: uRBP2) for varying values of the threshold on the readability scores (th = 10, 15, 20).''

''Table 1: Correlation coefficients (τ and τAP) between system rankings obtained with RBP and uRBP1 or uRBP2 for different values of the readability threshold.''

| | th = 10 | th = 15 | th = 20 |
|---|---|---|---|
| RBP vs. uRBP1 | τ = .1277, τAP = −.0255 | τ = .5603, τAP = .2746 | τ = .9574, τAP = .9261 |
| RBP vs. uRBP2 | τ = .5887, τAP = .2877 | τ = .6791, τAP = .4102 | τ = .9574, τAP = .9407 |

Higher correlation between the system rankings obtained with RBP and uRBP is observed for higher values of th, irrespective of the uRBP version (see Table 1). This is expected: the higher the threshold, the more documents are characterised by P(U|k) = 1 (or ≈ 1 for uRBP2), thus reducing uRBP to RBP. The fact that, in general, uRBP2 is more strongly correlated with RBP than uRBP1 is highlights the effect of the smoothing obtained with the arctan function; specifically, extending the range of readability scores for which P(U|k) is not zero beyond th narrows the scope for ranking differences between systems. These observations are confirmed in Figure 2, where only a few changes in the rank of systems are visible for th = 20 (× in Figure 2), while more changes are found for th = 10 (◦) and th = 15 (+). Note that the small differences in the absolute effectiveness values recorded by uRBP with th = 10 should not be interpreted as a lack of discriminative power: when th = 10, only 1.4% of the documents in the CLEF eHealth 2013 qrels are both relevant and readable, and thus contribute to uRBP.

Figure 2 demonstrates the importance of considering understandability along with topicality in the evaluation of systems for the considered task. The system ranked highest according to RBP (MEDINFO.1.3.noadd) is second to a number of systems according to uRBP if user understandability of up to FOG level 15 is wanted. Specifically, the highest uRBP1 for th = 10 is achieved by UTHealth_CCB.1.3.noadd, which is ranked 28th according to RBP; the highest uRBP1 for th = 15 is achieved by teamAEHRC.6.3, which is ranked 19th according to RBP and also achieves the highest uRBP2 for th = 10, 15.

===4. Limitations and Conclusions===
In this paper, we have investigated how understandability can be integrated into the gain-discount framework for evaluating IR systems. The approach studied here is general and can be applied to other factors of relevance, such as reliability. Information reliability plays an important role in consumer health advice search; its integration will be studied in future work.

In the proposed approach, relevance (P(R|k)) was modelled as the joint probability P(T,U|k). The two events were assumed to be compositional and their probabilities independent, thus allowing us to derive P(T,U|k) = P(T|k)P(U|k) and to treat topicality and understandability separately. This is a strong assumption and it is not necessarily true; alternatives are under investigation, e.g. [2].

The approach was demonstrated by deriving understandability-based variants of RBP; other measures can also be extended, e.g., nDCG and ERR. Note, however, that nDCG-style versions would require normalising the gain function by the ideal gain, which in turn requires finding the optimal ranking based on two criteria, relevance score and understandability, instead of one as in the standard nDCG.

Xu and Chen [12] have noted that the factors of relevance influence relevance assessments in different proportions; in their study, for example, topicality was found to be more influential than understandability. The specific uRBP measures studied here do not consider this aspect; however, a weighting of the different factors could be accomplished through a different f(.) function for converting P(T,U|k) into gain values.

In this paper, we have used readability as a proxy for understandability, but readability is only one of the aspects that influence understandability [12]; future work may explore other factors, e.g., users' prior knowledge, as well as the presence of images that further explain the textual information. Furthermore, readability was estimated using general, surface-level readability measures, and previous work has shown that these measures are often not suitable for evaluating the readability of health information. For example, Yan et al. [13] claim that people experience the greatest readability difficulties at the word level rather than at the sentence level; they further propose a new metric based on concept-based readability, specifically instantiated in the health domain. A number of alternative approaches that measure text readability beyond the surface characteristics of text have been proposed; future work will investigate their use to estimate P(U|k), along with actual readability assessments collected from users.
===References===
[1] O. H. Ahmed, S. J. Sullivan, A. G. Schneiders, and P. R. McCrory. Concussion information online: evaluation of information quality, content and readability of concussion-related websites. British Journal of Sports Medicine, 46(9):675–683, 2012.
[2] P. D. Bruza, G. Zuccon, and L. Sitbon. Modelling the information seeking user by the decisions they make. In MUBE 2013, pages 5–6. ACM, 2013.
[3] B. Carterette. System effectiveness, user models, and user utility: a conceptual framework for investigation. In SIGIR'11, pages 903–912, 2011.
[4] C. L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 web track. In TREC'09, 2009.
[5] S. Fox and M. Duggan. Health online 2013. Tech. Rep., Pew Research Center's Internet & American Life Project, 2013.
[6] L. Goeuriot, G. Jones, L. Kelly, J. Leveling, A. Hanbury, H. Müller, S. Salanterä, H. Suominen, and G. Zuccon. ShARe/CLEF eHealth evaluation lab 2013, task 3: Information retrieval to address patients' questions when reading clinical reports. In CLEF, 2013.
[7] D. R. McCallum and J. L. Peterson. Computer-based readability indexes. In ACM'82 Conf., pages 44–48, 1982.
[8] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. TOIS, 27(1):2, 2008.
[9] M. D. Smucker and C. L. Clarke. Time-based calibration of effectiveness measures. In SIGIR'12, pages 95–104, 2012.
[10] T. M. Walsh and T. A. Volsko. Readability assessment of internet-based consumer health information. Respiratory Care, 53(10):1310–1315, 2008.
[11] R. C. Wiener and R. Wiener-Pla. Literacy, pregnancy and potential oral health changes: The internet and readability levels. Maternal and Child Health Journal, pages 1–6, 2013.
[12] Y. C. Xu and Z. Chen. Relevance judgment: What do information users consider beyond topicality? JASIST, 57(7):961–973, 2006.
[13] X. Yan, D. Song, and X. Li. Concept-based document readability in domain specific information retrieval. In CIKM'06, pages 540–549, 2006.
[14] E. Yilmaz, J. A. Aslam, and S. Robertson. A new rank correlation coefficient for information retrieval. In SIGIR'08, pages 587–594, 2008.