=Paper=
{{Paper
|id=Vol-1866/paper_114
|storemode=property
|title=Exploring Understandability Features to Personalize Consumer Health Search. TUW at CLEF 2017 eHealth
|pdfUrl=https://ceur-ws.org/Vol-1866/paper_114.pdf
|volume=Vol-1866
|authors=Joao Palotti,Navid Rekabsaz
|dblpUrl=https://dblp.org/rec/conf/clef/PalottiR17
}}
==Exploring Understandability Features to Personalize Consumer Health Search. TUW at CLEF 2017 eHealth==
Exploring Understandability Features to Personalize Consumer Health Search
TUW at CLEF 2017 eHealth
Joao Palotti and Navid Rekabsaz
Vienna University of Technology (TUW)
Favoritenstrasse 9-11/188, 1040 Vienna, Austria
[palotti,rekabsaz]@ifs.tuwien.ac.at
1 Introduction
This paper describes the participation of the Vienna University of Technology (TUW) in CLEF eHealth 2017 Task 3 [5,9]. The track has run annually since 2013 (see [3,4,7,12]), and this year's challenge is a continuation of the 2016 edition. The Information Retrieval task of the CLEF eHealth Lab aims to foster research on search for health consumers, emphasizing crucial aspects of this domain such as document understandability and trustworthiness.
In 2016, fifty topics were extracted from real user posts and interactions in the AskDocs section of Reddit (https://www.reddit.com/r/AskDocs/). Each topic was presented to six query creators with different levels of medical expertise. Their task was to read a post (usually containing a medical question) and formulate a query using their medical background knowledge, if any. In total, 300 queries were created.
This year the track has four subtasks (named IRTasks; see [9] for a full description of each), and TUW submitted runs to two of them, IRTask 1 and IRTask 2. IRTask 1 is the ad-hoc task with the same topics as last year, aiming to increase the number of assessed documents in the collection. IRTask 2 is a new task whose goal is to personalize the results for each query creator according to his/her medical expertise.
The experiments conducted by TUW investigate two research directions:
1. IRTask 1: Can understandability metrics be used to improve retrieval?
2. IRTask 2: How can retrieval be personalized in a learning to rank setting, according to different reading profiles and levels of user expertise?
For IRTask 1, a previous study conducted in the context of CLEF eHealth 2014 and 2015 [6] showed promising improvements when using a small set of understandability estimators in a learning to rank context. Here we expand both the set of understandability features and the set of non-understandability features (see Section 2.2). Our aim is to investigate whether the improvements first observed in [6] also occur in this dataset. For IRTask 2, we propose to explicitly define learning to rank features based on different user profiles, and we study the effect of the suggested features on system effectiveness.
2 Methodology
In this section we describe our learning to rank approach, the feature set we devised, and the functions we use to map topical and understandability assessments into a single relevance label.
2.1 Learning to Rank
Our learning to rank approach is based on three components: (1) a set of features, (2) a set of labelled query-document pairs, and (3) a learning to rank algorithm. The set of features is described in Section 2.2. We consider three different functions to label documents: for IRTask 1, we use only the pure topical relevance as judged in 2016; for IRTask 2, we define two understandability-biased functions (named boost and float). Given a document with topical relevance T and understandability score U, and a user with reading goal G, we define boost and float as:
boost(T, U) = \begin{cases} 2T & \text{if } |G - U| \le 0.2 \\ T & \text{if } |G - U| > 0.2 \end{cases}   (1)

float(T, U) = T \cdot (1.0 - |G - U|)   (2)
As topical relevance scores are either 0, 1 or 2, and understandability scores are real numbers between 0.0 and 1.0, the possible values of boost are the integers 0, 1, 2 and 4, while float can take any real value between 0.0 and 2.0. All experiments used the pairwise learning to rank algorithm based on gradient boosting implemented in XGBoost (https://github.com/dmlc/xgboost/tree/master/demo/rank), with NDCG@20 as the optimization objective. Unlike past work [10,6], we consider up to 1000 documents when re-ranking.
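As a concrete illustration, below is a minimal Python sketch of the two labelling functions and of how a pairwise XGBoost ranker could be configured; the function and variable names, as well as the training parameter values, are ours and only illustrate the setup described above, not the exact implementation used for the submitted runs.

```python
import xgboost as xgb

def boost(t, u, g):
    # Double the topical relevance t when the document's understandability
    # score u is within 0.2 of the user's reading goal g (Equation 1).
    return 2 * t if abs(g - u) <= 0.2 else t

def float_label(t, u, g):
    # Scale the topical relevance t by how close the understandability
    # score u is to the reading goal g (Equation 2).
    return t * (1.0 - abs(g - u))

# Example: a highly relevant document (t=2) with understandability 0.5,
# for a user whose reading goal is 0.6.
print(boost(2, 0.5, 0.6))        # -> 4
print(float_label(2, 0.5, 0.6))  # -> 1.8

def train_ranker(X, y, group_sizes):
    # Pairwise learning to rank with gradient boosting (illustrative parameters).
    # X: feature matrix, y: labels from one of the functions above,
    # group_sizes: number of candidate documents per query.
    dtrain = xgb.DMatrix(X, label=y)
    dtrain.set_group(group_sizes)
    params = {"objective": "rank:pairwise", "eval_metric": "ndcg@20"}
    return xgb.train(params, dtrain, num_boost_round=100)
```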
2.2 Features
We devised 91 features from three distinct groups: traditional information retrieval features, understandability-related features, and the modified output of regression algorithms trained to estimate the understandability of a document. More elaborate features based on recent advances in semantic similarity, as used in [10], are left as future work. A comprehensive list of all features used in this work is shown in Table 1.
IR Features (12)
  Common IR Models (7): BM25, PL2, DirichletLM, LemurTF_IDF, TF_IDF, DFRee, Hiemstra_LM
  Query Independent (3): Document Length, Document Spam Scores, Document PageRank
  Document Score Modifiers (2): Divergence from Randomness, Markov Random Field

Understandability Features (72)
  Traditional Formulas (8): ARI Index, Coleman-Liau Index, Dale-Chall Score, Flesch-Kincaid Grade, Flesch Reading Ease, Gunning Fog Index, LIX Index, SMOG Index
  Surface Measures (25): # Characters ♦†, # Sentences ♦, # Syllables ♦†, # Words †, # (|Syllables(Word)| > 3) ♦†, # (|Word| > 4) ♦†, # (|Word| > 6) ♦†, # (|Word| > 10) ♦†, # (|Word| > 13) ♦†
  General Vocabulary Related Features (12): Numbers ♦†, English Dictionary ♦†, Dale-Chall List ♦†, Stopwords ♦†
  Medical Vocabulary Related Features (27): Acronyms ♦†, MeSH ♦†, DrugBank ♦†, ICD10 (International Classification of Diseases) ♦†, Medical Prefixes ♦†, Medical Suffixes ♦†, Consumer Health Vocabulary ♦†, Sum(CHV Score) ♦†, Mean(CHV Score) ♦†

Regression Features (7)
  Modified Regression Scores (7): AdaBoost Regressor, Extra Trees Regressor, Gradient Boosting Regressor, K-Nearest Neighbors Regressor, Linear Regression, Support Vector Machine Regressor, Random Forest Regressor

Table 1: Features used in the learning to rank process; the number of features in each group is reported in parentheses. ♦: both the raw feature value and the value normalised by the number of words in the document are used. †: both the raw feature value and the value normalised by the number of sentences in the document are used.
IR Features: We consider commonly used information retrieval features, including several standard retrieval models and query-independent document values such as spam scores [1] and PageRank scores (http://www.lemurproject.org/clueweb12/PageRank.php).
Understandability Features: All HTML pages were preprocessed with Boilerpipe (https://pypi.python.org/pypi/boilerpipe) to remove undesirable boilerplate content, as suggested in [8]. A series of traditional readability metrics was then calculated [2], together with a number of basic syntactic and lexical features that are important components of such readability metrics. Finally, we measure the occurrence of words from different vocabularies, both medical and non-medical.
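To make the feature definitions concrete, here is a minimal sketch, under our own simplified tokenisation (not the exact preprocessing pipeline used for the runs), of one traditional readability formula and one vocabulary-coverage feature:

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: higher values indicate easier text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def vocabulary_coverage(text, vocabulary):
    # Fraction of words found in a given vocabulary, e.g. the Dale-Chall list,
    # an English dictionary, or a medical lexicon such as MeSH or DrugBank.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return sum(w in vocabulary for w in words) / len(words)
```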
Regression Features: We adapted the output of regression algorithms to create personalized features. The 2016 judgements were used as labels for a number of regression algorithms (the full list is shown in Table 1). Models were trained on a Latent Semantic Analysis (LSA) representation of the words of the 3,549 documents marked as topically relevant in the 2016 QRels, whose understandability labels varied from 0 (easy to understand) to 100 (hard to understand). The number of LSA dimensions varied from 40 to 240, chosen according to the best result of a 10-fold cross-validation experiment. To avoid interference between the training set and the learning to rank algorithm, the scores for documents in the training set were also predicted in a 10-fold cross-validation fashion. The personalization step consists of calculating the absolute difference between the estimated score and the goal score, which is defined per user. For example, if the score estimated by a regression algorithm for a document D is 0.45 and the reading goal of a user U is 0.80, the feature value is 0.35 (the absolute difference between 0.80 and 0.45). We want to evaluate whether features like these can help the learning to rank model adapt to the reading skills of a user.
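A minimal sketch of this personalization step, assuming a scikit-learn style pipeline (the helper name, the choice of a single regressor, and the fixed LSA dimensionality are ours; the paper tunes the dimensionality between 40 and 240 and uses all the regressors listed in Table 1):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def personalized_regression_feature(train_texts, train_labels,
                                    doc_text, reading_goal, n_lsa_dims=100):
    # Train an understandability regressor on an LSA representation of the
    # relevant 2016 documents and return |estimated score - reading goal|
    # as a personalized learning to rank feature.
    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=n_lsa_dims),   # LSA
        RandomForestRegressor(n_estimators=100),
    )
    model.fit(train_texts, train_labels)          # labels scaled to [0, 1]
    estimated = model.predict([doc_text])[0]
    return abs(estimated - reading_goal)

# For documents that are themselves part of the training set, predictions
# should instead come from 10-fold cross-validation to avoid label leakage:
# estimated = cross_val_predict(model, train_texts, train_labels, cv=10)
```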
3 Experiments
3.1 Evaluation Metrics
We consider a large number of evaluation metrics in this work. As topical-relevance-centred metrics, we use Precision at 10 (P@10) and Rank Biased Precision with its persistence parameter set to 0.8 (RBP(0.8)). Because a learning to rank algorithm has the potential to bring many unjudged documents to the top of the ranking list, we also consider a modified version of P@10, Only Judged P@10, which computes P@10 over only the first 10 judged documents of each topic.
As modified metrics that take understandability scores into account, we consider understandability-biased Rank Biased Precision, also with the parameter set to 0.8 (uRBP(0.8)), as proposed by [11], and we propose three new metrics for personalized search.
The first personalization-aware metric, auRBP, is a specialization of uRBP that uses an α parameter to model the kind of documents a user wants to read. We assume that α models the understandability profile of an entity: a low α is assigned to items/documents/users that are expert-oriented, while a high α indicates the opposite. A user with a low α wants to read specialized documents rather than easy, introductory ones, while laypeople want the opposite. auRBP therefore penalizes the case in which a low-α document is presented to a user who wants high-α documents, and vice versa. While we are still investigating the best function to model this penalty, we currently assume a Gaussian penalty. Figure 1 shows an example in which a user wants documents with α=20; other values of α incur a penalty according to a normal curve with mean 20 and standard deviation 30. We use a standard deviation of 30 in all of our experiments; estimating an appropriate value for it is left as future work.
The second and third personalization-aware metrics are simple modifications of Precision at depth X. For the relevant documents found in the top X, we inspect how far the understandability label of each document is from the value required by the user. The absolute difference can be penalized linearly (LinUndP@X) or with the same Gaussian curve used in auRBP (GaussianUndP@X). Note that lower values are better for LinUndP@10, meaning that the distance from the required understandability value is small, while higher values are better for GaussianUndP@10, where a value of 100 is the best one could reach.
Fig. 1: A Gaussian penalty model. This example normal curve has its peak (mean) at 20 and a standard deviation of 30. A document with α=60 would be worth only 41% of a document with the desired α=20.
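The sketch below shows one way to implement the Gaussian penalty and the two precision-based metrics as we read their description above; the exact normalisation in the official evaluation scripts may differ, and the function names are ours.

```python
import math

def gaussian_penalty(alpha_doc, alpha_user, sigma=30.0):
    # Weight in [0, 1]: 1.0 when the document's alpha matches the desired
    # alpha, decaying as a Gaussian as the two values diverge.
    return math.exp(-((alpha_doc - alpha_user) ** 2) / (2 * sigma ** 2))

# A document with alpha=60 shown to a user who wants alpha=20
# keeps roughly 41% of its worth:
print(round(gaussian_penalty(60, 20), 2))  # -> 0.41

def lin_und_p_at_k(und_labels, alpha_user, k=10):
    # Mean absolute distance (0-100 scale) between the understandability labels
    # of the relevant documents in the top k and the user's goal (lower is better).
    top = und_labels[:k]
    return sum(abs(u - alpha_user) for u in top) / len(top)

def gaussian_und_p_at_k(und_labels, alpha_user, k=10, sigma=30.0):
    # Mean Gaussian-weighted score rescaled to 0-100 (higher is better, 100 is best).
    top = und_labels[:k]
    return 100.0 * sum(gaussian_penalty(u, alpha_user, sigma) for u in top) / len(top)
```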
3.2 Runs Description
Seven runs were submitted to IRTask 1 and another seven to IRTask 2. Tables 2 and 3 present a summary of each approach for IRTask 1 and IRTask 2, respectively, together with the results using the 2016 QRels.
4 Discussion and Conclusion
As shown in Tables 2 and 3, we based our runs on the BM25 implementation over a Terrier 4.2 index of ClueWeb12-B. The results of the runs that use relevance feedback are high because the documents judged as relevant appear at the top of the ranking list of each topic; this does not necessarily mean that these approaches will be much better than plain BM25 for 2017, as the already judged documents will be discarded by the organizers.
Results on CLEF eHealth 2016 QRels. Columns (in order): Run ID, Run description, P@10, RBP(0.8)@10, LinUndP@10, GaussianUndP@10, Only Judged P@10, uRBP(0.8)@10, auRBP(0.8)@10.
TUW1 Baseline: Terrier 4.2 BM25 26.46 27.63 33.39 55.73 27.32 17.34 14.14
TUW2 Baseline2: Terrier 4.2 BM25 with Rel. Feedback 39.83 44.87 33.02 56.76 41.73 25.76 21.36
TUW3 LTR on top 1000 from TUW1 - All Features 25.46 29.37 33.19 56.07 27.23 16.98 13.91
TUW4 LTR on top 1000 from TUW2 - All Features 41.36 50.07 33.55 56.04 41.94 25.69 21.26
TUW5 LTR on top 1000 from TUW1 - IR only 25.26 27.57 33.80 55.04 26.44 16.59 13.68
TUW6 LTR on top 1000 from TUW2 - IR only 44.36 51.50 33.71 56.22 46.43 27.41 23.64
- LTR on top 1000 from TUW1 - IR + Underst. 25.00 29.50 33.20 56.37 26.56 16.67 13.64
TUW7 LTR on top 1000 from TUW2 - IR + Underst. 41.97 49.57 33.58 56.32 42.90 25.96 21.68
Table 2: Results on CLEF eHealth 2016 QRels and runs submitted to CLEF
eHealth 2017 IRTask 1.
Results on CLEF eHealth 2016 QRels. Columns (in order): Run ID, Run description, P@10, RBP(0.8)@10, LinUndP@10, GaussianUndP@10, Only Judged P@10, uRBP(0.8)@10, auRBP(0.8)@10.
BL1 Baseline: Terrier 4.2 BM25 26.46 27.63 33.39 55.73 27.32 17.34 14.14
BL2 Baseline2: Terrier 4.2 BM25 with Rel. Feedback 39.83 44.87 33.02 56.76 41.73 25.76 21.36
TUW1 LTR on top 1000 from BL1 - All Features. Labels w. Boost 24.90 28.90 32.95 56.49 26.84 16.95 13.90
TUW2 LTR on top 1000 from BL2 - All Features. Labels w. Boost 42.60 49.73 34.24 55.44 42.88 26.34 21.70
TUW3 LTR on top 1000 from BL1 - All Features. Labels w. Float 25.43 29.30 32.88 56.96 27.16 17.11 13.87
TUW4 LTR on top 1000 from BL2 - All Features. Labels w. Float 42.23 50.20 33.92 55.91 43.59 26.44 22.17
- LTR on top 1000 from BL1 - IR only w. Boost 25.23 27.53 33.57 55.35 26.39 16.56 13.52
TUW5 LTR on top 1000 from BL2 - IR only w. Boost 43.20 51.13 33.34 56.34 45.67 27.10 23.01
- LTR on top 1000 from BL1 - IR only w. Float 25.57 27.60 33.45 55.47 26.31 16.53 13.54
TUW6 LTR on top 1000 from BL2 - IR only w. Float 43.53 51.33 33.50 56.51 45.42 26.55 22.95
- LTR on top 1000 from BL1 - IR + Regres. Labels w. Boost 25.30 27.83 33.77 55.07 26.27 16.36 13.50
- LTR on top 1000 from BL2 - IR + Regres. Labels w. Boost 42.56 50.70 33.49 56.14 45.22 26.20 22.85
- LTR on top 1000 from BL1 - IR + Regres. Labels w. Float 25.10 27.90 33.79 55.10 26.06 16.41 13.33
TUW7 LTR on top 1000 from BL2 - IR + Regres. Labels w. Float 43.63 51.50 33.64 55.90 45.83 26.78 23.11
- LTR on top 1000 from BL1 - IR + Unders. Labels w. Boost 24.70 29.40 33.30 56.25 25.39 17.47 13.55
- LTR on top 1000 from BL2 - IR + Unders. Labels w. Boost 43.03 50.97 34.15 55.23 42.93 27.16 21.84
- LTR on top 1000 from BL1 - IR + Underst. Labels w. Float 25.50 30.60 33.42 56.10 25.80 17.86 13.91
- LTR on top 1000 from BL2 - IR + Underst. Labels w. Float 41.00 50.03 33.82 55.30 41.95 26.39 21.41
Table 3: Results on CLEF eHealth 2016 QRels and runs submitted to CLEF
eHealth 2017 IRTask 2.
Considering our experiments for IRTask 1, shown in Table 2, we observe a degradation of P@10 for runs TUW3 and TUW5 compared to the baseline TUW1. This, however, is not the case for the modified version of P@10 that considers only judged documents, nor for RBP(0.8). Our expectation is therefore that TUW3 and TUW5 will be more effective than TUW1, and likewise that TUW4, TUW6 and TUW7 will be more effective than TUW2. Note that higher RBP(0.8) values are accompanied by higher uRBP(0.8) and auRBP(0.8) values, while higher P@10 values are not accompanied by higher LinUndP@10 or GaussianUndP@10. This means that our efforts to retrieve more topically relevant documents also increase uRBP and auRBP, but do not affect the LinUnd and GaussianUnd metrics.
Table 3 shows our experiments with the different labelling functions (boost and float). Again, no approach beats the best P@10 value of BL1 (which is TUW1 in IRTask 1), while several approaches beat the P@10 of BL2. When looking at the understandability-biased metrics, in particular LinUndP@10 and GaussianUndP@10, we see much more variance than in Table 2. The best result found was TUW3 in Table 3, with 32.88 for LinUndP@10 (the smaller the better) and 56.96 for GaussianUndP@10 (the higher the better).
We look forward to evaluating our results with the 2017 QRels; as the 2017 assessments are still being conducted, an analysis of the official results will be posted online at https://github.com/joaopalotti/tuw_at_clef_ehealth_2017.
References
1. Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. CoRR, abs/1004.5168, 2010.
2. William H. Dubay. The principles of readability. Costa Mesa, CA: Impact Infor-
mation, 2004.
3. Lorraine Goeuriot, Gareth J.F. Jones, Liadh Kelly, Johannes Leveling, Allan Hanbury, Henning Müller, Sanna Salantera, Hanna Suominen, and Guido Zuccon. ShARe/CLEF eHealth Evaluation Lab 2013, Task 3: Information retrieval to address patients' questions when reading clinical reports. CLEF 2013 Online Working Notes, 8138, 2013.
4. Lorraine Goeuriot, Liadh Kelly, Wei Lee, Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Gareth J.F. Jones, and Henning Mueller. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In CLEF 2014 Evaluation Labs and Workshop: Online Working Notes, Sheffield, UK, 2014.
5. Lorraine Goeuriot, Liadh Kelly, Hanna Suominen, Aurélie Névéol, Aude Robert, Evangelos Kanoulas, Rene Spijker, Joao Palotti, and Guido Zuccon. CLEF 2017 eHealth Evaluation Lab overview. In Proceedings of CLEF 2017 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer, September 2017.
6. Joao Palotti, Lorraine Goeuriot, Guido Zuccon, and Allan Hanbury. Ranking
health web pages with relevance and understandability. In Proceedings of the 39th
international ACM SIGIR conference on Research and development in information
retrieval, pages 965–968. ACM, 2016.
7. Joao Palotti, Guido Zuccon, Lorraine Goeuriot, Liadh Kelly, Allan Hanbury, Gareth J.F. Jones, Mihai Lupu, and Pavel Pecina. CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving Information about Medical Symptoms. In CLEF 2015 Online Working Notes. CEUR-WS, 2015.
8. Joao Palotti, Guido Zuccon, and Allan Hanbury. The influence of pre-processing
on the estimation of readability of web documents. In Proceedings of the 24th ACM
International on Conference on Information and Knowledge Management, pages
1763–1766. ACM, 2015.
9. Joao Palotti, Guido Zuccon, Jimmy, Pavel Pecina, Mihai Lupu, Lorraine Goeuriot, Liadh Kelly, and Allan Hanbury. CLEF 2017 task overview: The IR task at the eHealth Evaluation Lab. In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings. Proceedings of CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, 2017.
10. Luca Soldaini and Nazli Goharian. Learning to Rank for Consumer Health Search:
A Semantic Approach, pages 640–646. Springer International Publishing, 2017.
11. Guido Zuccon. Understandability biased evaluation for information retrieval. In
Proc. of ECIR, 2016.
12. Guido Zuccon, Joao Palotti, Lorraine Goeuriot, Liadh Kelly, Mihai Lupu, Pavel
Pecina, Henning Mueller, Julie Budaher, and Anthony Deacon. The IR Task at the
CLEF eHealth Evaluation Lab 2016: User-centred Health Information Retrieval.
In CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS,
September 2016.