=Paper=
{{Paper
|id=Vol-3052/paper10
|storemode=property
|title=ADOR: A New Medical Dataset for Sentiment-based IR
|pdfUrl=https://ceur-ws.org/Vol-3052/paper10.pdf
|volume=Vol-3052
|authors=Mohammad Bahrani,Thomas Roelleke
|dblpUrl=https://dblp.org/rec/conf/cikm/BahraniR21
}}
==ADOR: A New Medical Dataset for Sentiment-based IR==
Mohammad Bahrani, Thomas Roelleke
Queen Mary University of London, UK
Abstract
Sentiment analysis has received attention in retrieval applications. Combining opinions, such as user feelings, with semantics would enhance the performance of these applications, especially when the level of urgency is essential, e.g., in the medical domain. However, no widely used medical benchmark is known for evaluating sentiment-aware IR. In this paper, we create a dataset based on Amazon reviews for medical products and make it publicly available. To assess the compatibility of the benchmark with opinions and concepts, we propose a sentiment-aware extension of TF.IDF and apply it to the dataset. This model is derived from linear combinations of a sentiment-based TF.IDF score with term-based and conceptual TF.IDF scores. The benchmark could help healthcare organizations to effectively detect, rank and filter the most urgent notifications based on patients' health status, narratives and conditions.
Keywords
Semantic Retrieval, Query Analysis, Language Modelling, Benchmark, TREC, Query Formulation, Knowledge Representation
1. Introduction

Despite the fact that both sentiment analysis and IR are of importance with regard to medical applications, the work on incorporating sentiments into medical IR is limited, and there is no well-known benchmark established for this task. Many review-based datasets have been released for the task of sentiment analysis, such as the multi-domain Amazon dataset [1], INEX social book search [2] and the IMDB dataset of reviews [3]. However, researchers need a benchmark which primarily takes into consideration the integration of opinions and medical concepts. This is due to the importance of feelings in detecting the level of urgency in the medical domain. Moreover, bio-medical companies need to analyse customers' general feelings about their products. On the other hand, patients need to know the sentiment of product reviews before buying. Therefore, the examination of sentiments would be beneficial for both buyers and suppliers of medical products.

In this paper, we address this problem by creating and making available a medical benchmark specifically for the task of opinion-aware retrieval.

Bio-medical benchmarks consider various pillars of semantics in collections and queries, e.g., terms, concepts and attributes. These semantics enable data scientists to develop effective models for different tasks, e.g., filtering and classification. Several benchmarks have been published to examine different IR models with respect to medical applications, including OHSUMED [4] and CLEF-eHealth [2, 5]. However, developing a sentiment-focused query-set for a dataset such as OHSUMED is not optimal, since its documents are generated from medical literature. Although sentiment-bearing topics, e.g., cancer and treatment, are included in the documents, indications of urgency and feeling, e.g., emojis, are rarely found. Table 1 gives an overview of well-known medical datasets and lists fundamental statistics of their semantic features.

Sentiment analysis and opinion mining are popular research fields in natural language processing, data science and text mining. They analyse textual content based on people's opinions, emotions and attitudes [6]. In this paper, we create a benchmark that consists of a dataset, a query-set and the relevance results. The dataset consists of Amazon reviews for medical products. Additionally, it supports the use of common semantics (terms, concepts and relations) in biomedical retrieval.

The second contribution of this paper is to apply sentiment-aware models to the dataset. We propose a family of opinion-aware models for ranking medical reviews. These models are semantic instances of a generalizable TF.IDF. The technology of semantic retrieval is of particular importance in medical applications, and the integration of semantics with standard content-based retrieval tools could lead to more intelligent search experiences [7, 8]. The generalization of TF.IDF towards semantic frameworks is discussed in [9]. Compared to retrieval systems built upon only bag-of-words, such integrated methods result in more performant question answering (QA) systems with constraint-checking abilities. There has been research on developing conceptual models for medical applications [10, 11]. It could be interesting to leverage sentiments and feelings in these applications.

CIKM'21: Fourth Workshop on Knowledge-driven Analytics and Systems Impacting Human Quality of Life, November 01–05, 2021, CIKM, Australia
m.bahrani@qmul.ac.uk (M. Bahrani); t.roelleke@qmul.ac.uk (T. Roelleke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
Table 1: Overview of well-established benchmarks for health-related retrieval.

| Dataset | Task | Report | #Queries | avg. opinions per query | avg. concepts per query |
| CLEF 2013 eHealth | Task 3: patients' questions when reading clinical reports | Overview of the ShARe/CLEF eHealth Evaluation Lab 2013 [5] | 50 | 0.30 | 2.90 |
| CLEF 2014 eHealth | Task 3: use of information, e.g. discharge summaries and ontologies, in IR | Overview of the ShARe/CLEF eHealth Evaluation Lab 2014 [12] | 50 | 0.34 | 1.86 |
| OHSUMED | TREC-9 Filtering: evaluate text filtering systems | OHSUMED [4]; TREC-9 Filtering Track Final Report [13] | 63 | 0.41 | 4.87 |
| TREC 2006 Genomics Track | Passage retrieval for genomics question answering | TREC 2006 Genomics Track overview [14] | 27 | 0.32 | 6.00 |
| TREC 2007 Genomics Track | Genomics passage retrieval based on biologists' needs | TREC 2007 Genomics Track overview | 35 | 0.27 | 4.60 |
By consolidating the methods for modelling opinions and sentiments in medical ranking, we aim to address the deficiencies in different tasks, including but not limited to notification filtering and review filtering. In terms of notification filtering, doctors and patients are overloaded with massive amounts of health-related data, and it is critical for health organizations to focus on the most important and urgent cases. In this scenario, the detection of urgency is associated with both ranking and the acquisition of sentiments.

Our work contributes to building the grounds for improving medical review filtering through IR. It is the starting point for developing models that could better meet the needs of bio-medical organizations, companies and individual buyers in analysing the most critical, positive and negative reviews.

2. The ADOR Dataset

The Amazon Dataset of Reviews (ADOR) is based on reviews of bio-medical Amazon products drawn from three super-categories: Medication & Remedies, Diagnostic and Monitoring Tools, and Health-Related Books. We defined a set of sub-category products inherited from the super-categories and subsequently extracted the reviews of the top ten related items retrieved by the Amazon search engine. However, in order to achieve a dataset that is more balanced in terms of polarity, we ignored items without negative reviews.

Table 2: The statistics of ADOR.

| #Concepts | 595442 |
| #Distinct Concepts | 404748 |
| #Opinions | 194790 |
| #Distinct Opinions | 163045 |
| #Queries | 25 |
| #Docs | 44796 |
| Avg. Query Length | 9.08 |
| Avg. Review-Text Length | 35.38 |
| Sampling Date | 31-03-2020 |

To make the data easily reusable, we followed two steps: firstly, we converted the encoding of the contents to UTF-8, and secondly, we defined the schema and the required fields. The essential fields, namely the Amazon ASIN number, the medical category, the star-rating, the title of the review, the review text, and the labels star-rating and helpful, have been embedded into the dataset; a minimal loading sketch is given below.
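The following sketch illustrates how such a schema can be loaded and validated. The JSON-lines layout and the field names are assumptions made for illustration; the published files may use a different serialization.

```python
import json

# Hypothetical field names mirroring the schema described above; the
# released ADOR files may name or serialize these fields differently.
REQUIRED_FIELDS = ["asin", "category", "star_rating",
                   "review_title", "review_text", "helpful"]

def load_ador(path):
    """Yield review records from a UTF-8 JSON-lines file, checking the schema."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"record is missing fields: {missing}")
            yield record
```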
2.1. ADOR Query Set

We have defined 25 topics based on five purposes. Figure 3 shows the distribution of queries and the number of relevant documents. The five categories of information need are as follows:

1. The retrieval of positive or negative reviews associated with medical products.
2. Fact-based and non-sentiment-bearing queries which only intend to retrieve medical entities.
3. Ranking the polarity of item-reviews within the sub-categories, e.g. vitiligo cream and flu tablets.
4. Ranking the polarity of item-reviews within the super-categories, e.g. medications or diagnostic tools.
5. The retrieval of extreme (most positive or most negative) reviews given different medical concepts. We used modifiers to emphasize the information need, e.g. "Highly negative reviews for books about borderline personality disorder."

Figure 1: Document and collection statistics of the ADOR semantic types: the opinions group has the highest document frequency.
2.2. Overview of ADOR

In this section, we briefly present the dataset and provide the statistics of ADOR. Table 2 lists the fundamental statistics of the dataset. There are 194790 opinion features and 595442 medical concepts in the dataset, distributed across 44796 documents. We used the VADER lexicon to capture opinions and MetaMap to bind terms to medical concepts; a minimal sketch of the opinion-extraction step is given below.
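As a hedged illustration of the opinion-capturing step, the sketch below matches review tokens against the VADER lexicon as shipped with NLTK; the MetaMap concept binding is omitted since it requires a local UMLS/MetaMap installation.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def opinion_features(text):
    """Return the review tokens found in the VADER lexicon, together with
    their valence scores (negative < 0 < positive)."""
    tokens = text.lower().split()
    return {t: sia.lexicon[t] for t in tokens if t in sia.lexicon}

print(opinion_features("great product but the side effects were terrible"))
# e.g. {'great': 3.1, 'terrible': -2.1}
```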
Figure 2 presents the distribution of document length and query length. The majority of queries (more than 35%) have a length between 9 and 12 words. More than 50% of documents have between 1 and 20 words, whereas 7% of them are longer than 100 words. The statistics regarding the distribution of queries and their relevant documents are shown in Figure 3. As can be seen, 28% of queries have between 1 and 60 relevant documents, which is exactly the same percentage as for queries with more than 240 relevant documents. The rest of the queries have between 60 and 240 relevant documents. We also extracted the average document and collection frequencies of the semantic types (neutral terms, concepts and opinions) of ADOR, which can be found in Figure 1. Even though the average document frequency of opinions is high, which would normally lower their IDF weight, opinions could significantly impact the retrieval quality due to the nature of reviews.

Figure 2: The distribution of document length and query length.

3. Application of the Benchmark

3.1. Rationales

Although the use of human judgments could seem ideal for the generation of gold standards, we developed a generic framework which has some advantages, e.g., it can easily be used to build gold standards for new query sets.

When preparing the data, we provided informative labels, including the star-rating, the number of people who found a review helpful, and the medical categories of the Amazon products. This framework helps to rapidly develop new queries that can be formulated over the provided labels. Considering the example query "Why do some customers are happy with books about caffeine addiction and narcissistic personality disorder.", the formulated query is: (Rating=[4,5], Super-Category=[Books], Sub-Category=[NPD, Caffeine Addiction]). In other words, any review in the dataset that meets the information need requested by the formulated query can be selected; a sketch of this selection step follows at the end of this section.

To evaluate the accuracy of models, one approach would be to use existing reviews as queries. However, there are two substantial issues with this approach. Firstly, data scientists need to analyse and classify their experimental results based on the query intent, e.g. fact-based, binary and explorative queries. The use of reviews as queries is not in line with this notion of query intent. Secondly, reviews are strongly focused on opinions, so generating a robust query set with a balanced combination of concepts, terms and opinions would interfere with the structure of reviews.
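The label-based selection step can be sketched as follows, reusing the hypothetical field names of the earlier loading sketch; the actual label names in the released files may differ.

```python
# A formulated query is a set of constraints over the provided labels;
# every review that satisfies all constraints is marked as relevant.

def is_relevant(review, formulated_query):
    """True if the review satisfies every label constraint of the query."""
    return all(review.get(field) in allowed
               for field, allowed in formulated_query.items())

# Formulated version of the example query above (field names are assumptions).
formulated = {
    "star_rating": [4, 5],
    "super_category": ["Books"],
    "sub_category": ["NPD", "Caffeine Addiction"],
}

# qrels = [r for r in load_ador("ador.jsonl") if is_relevant(r, formulated)]
```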
3.2. Baseline Models
The focus of this paper is to introduce a dataset for the task of semantic retrieval in the medical domain, i.e., sentiment-based and conceptual IR. Therefore, advanced ranking algorithms are the primary baselines. However, the benchmark can also be used for prediction/classification tasks. For example, a review could be considered as a message posted by a patient or a customer. In this case, the evaluation approach is to predict whether the message is extreme (very negative) and requires attention by an expert, e.g., a doctor, a nurse or a company member. Another applicable task is notification systems. In this scenario, users post messages and an algorithm needs to decide who (e.g. which doctor or expert) should be notified to analyse the message or respond to it.

Figure 3: The distribution of queries and number of relevant documents.

Furthermore, the framework could be employed by data scientists to predict features provided by the dataset, such as positive/negative and helpful/not helpful. Baselines such as neural network classifiers (e.g., BERT-based or scikit-learn models), Bayesian predictors, regression models and k-NN (nearest neighbours) could be used to measure the prediction quality. The k-NN classifier could be applied to retrieve the most similar training reviews (e.g., by cosine similarity), aggregate the evidence and assign a label to the test review, as in the sketch below.
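A minimal sketch of such a k-NN baseline with scikit-learn, using TF-IDF vectors and cosine distance; the toy reviews and labels are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy training reviews with helpfulness labels (illustrative only).
train_texts  = ["works perfectly, highly recommend",
                "no effect at all, a waste of money"]
train_labels = ["helpful", "not_helpful"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# metric="cosine" makes the neighbour search use cosine distance.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["waste of money, no effect on my allergy"])
print(knn.predict(X_test))  # -> ['not_helpful']
```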
Table 3: Ranking performances of the opinion-aware models and the baseline methods. The OF.IDF+CF.IDF combination achieves the best result under every evaluation metric; β, θ and ζ indicate statistically significant improvements of the best model over BM25, KNRM and DSSM, respectively. Statistical significance is based on the paired t-test with p-value < 0.05.

| Model | P@5 | P@10 | NDCG | MAP |
| TF.IDF | 0.2480 | 0.2720 | 0.2354 | 0.0833 |
| BM25 | 0.3120 | 0.3160 | 0.2336 | 0.0813 |
| KNRM | 0.2320 | 0.2440 | 0.2445 | 0.0906 |
| DSSM | 0.2080 | 0.2200 | 0.2422 | 0.1039 |
| arc-I | 0.3520 | 0.3040 | 0.2476 | 0.0902 |
| CF.IDF | 0.3840 | 0.4080 | 0.2619 | 0.1106 |
| OF.IDF | 0.3680 | 0.4120 | 0.2758 | 0.1250 |
| OF.IDF+TF.IDF (w=0.5) | 0.3600 | 0.3920 | 0.2705 | 0.1175 |
| OF.IDF+CF.IDF (w=0.5) | 0.4640 (βθζ) | 0.4280 (βθζ) | 0.2825 (βθζ) | 0.1274 (βθ) |
3.3. Processing the New Queries

To confirm the compatibility of the benchmark with models derived from opinions and concepts, we have developed a naive semantic approach. We briefly describe the methodology and then show the experimental results of comparing the semantic approach with well-known and recent IR methods on ADOR.

3.3.1. Methodology

Our approach is to leverage the well-known TF.IDF and capture its semantic extensions, which are built upon opinions and/or concepts. To make the formulations readable, we use type-aware frequency functions (generically xF, where x denotes the semantic type), e.g. OF(o, d) is the opinion frequency of opinion o in document d, and CF(c, d) is the frequency of concept c in the document. Let q be a query, d be a document and c be the collection; the Retrieval Status Value (RSV) of the opinion-aware model is then:

RSV_{OF.IDF}(d, q, c) := \sum_{o \in t} OF(o, q) \cdot OF(o, d) \cdot IDF(o, c)    (1)

Here, IDF(o, c) is the Inverse Document Frequency of the opinion o in the collection, and t is the list of all lexical features in the lexicon whose sentiment polarity equals the query polarity. For example, given the query "Any useless or poor medications for allergy or cold sore.", the query polarity is negative, and consequently, the list t comprises all negative opinions in the lexicon.

Let \phi be a medical concept and let IDF(\phi, c) be the Inverse Document Frequency weight of the concept; the conceptual extension of TF.IDF is then defined as:

RSV_{CF.IDF}(d, q, c) := \sum_{\phi \in q} CF(\phi, q) \cdot CF(\phi, d) \cdot IDF(\phi, c)    (2)
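For concreteness, here is a small self-contained sketch of Eqs. (1) and (2). It assumes that queries and documents have already been mapped to opinion and concept feature lists (e.g. via VADER and MetaMap), and the IDF smoothing is an illustrative choice rather than the paper's exact formulation.

```python
import math
from collections import Counter

def idf(feature, collection):
    """IDF of a feature in a collection of feature lists (smoothed variant; an assumption)."""
    df = sum(1 for doc in collection if feature in doc)
    return math.log((len(collection) + 1) / (df + 1))

def rsv(query_feats, doc_feats, collection, t=None):
    """Generic semantic TF.IDF: sum of F(x,q) * F(x,d) * IDF(x,c) over the
    query features. Restricting to the polarity-matched list t gives the
    OF.IDF form of Eq. (1); with t=None this is the CF.IDF form of Eq. (2)."""
    qf, df = Counter(query_feats), Counter(doc_feats)
    feats = set(qf) if t is None else set(qf) & set(t)
    return sum(qf[x] * df[x] * idf(x, collection) for x in feats)

def combined_rsv(q, d, coll_ops, coll_cons, t, w=0.5):
    """OF.IDF+CF.IDF with aggregation parameter w, as used in Table 3.
    q and d are dicts with 'opinions' and 'concepts' feature lists."""
    return (w * rsv(q["opinions"], d["opinions"], coll_ops, t=t)
            + (1 - w) * rsv(q["concepts"], d["concepts"], coll_cons))
```

Note that summing over the query features intersected with t is equivalent to Eq. (1)'s sum over o in t, since OF(o, q) = 0 for opinions absent from the query.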
3.3.2. Evaluation

In this section, we briefly discuss the evaluation results of the proposed semantic models, TF.IDF, BM25 and the neural ranking models when applied to ADOR.

We trained neural ranking models, including KNRM [15], DSSM [16] and arc-I [17], on ADOR. We performed 5-fold cross-validation: the queries were randomly divided into five folds, the held-out fold in each run was used as the test set, and we report the average of the fold-level evaluation results. All neural models were developed using MatchZoo [18] on top of TensorFlow, with the Adam optimizer, a batch size of 16 and a learning rate of 0.001. Using the Lucene framework and Language Modelling with a Dirichlet prior, we retrieved pseudo-relevant documents, and subsequently the top 100 documents were re-ranked by the models. In addition to OF.IDF and CF.IDF, we conducted experiments on linear combinations of the opinion-aware TF.IDF with the term-based and conceptual TF.IDF, using the aggregation parameter w = 0.5. Concerning the concept-based models, we used MetaMap to extract concepts accompanied by their frequencies, semantic types and scores. We counted the 'trigger' attributes of the MetaMap outputs to calculate the corresponding frequencies of the semantic types.
Table 3 shows the experimental results on ADOR using four metrics: P@5, P@10, NDCG and Mean Average Precision (MAP). We also conducted a paired t-test with p < 0.05 to assess the significance of improvements. The isolated OF.IDF and CF.IDF worked better than TF.IDF, BM25 and the neural models (KNRM, DSSM, arc-I), while the combination of opinions and concepts achieved the best results. An interesting finding is that the models based on combinations of opinions with both terms (OF.IDF+TF.IDF) and concepts (OF.IDF+CF.IDF) improved all the measures. A sketch of the significance test follows.
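The paired t-test used above can be sketched as follows, with hypothetical per-query average-precision values standing in for the real per-query scores.

```python
from scipy.stats import ttest_rel

# Toy per-query AP values for two models over the same queries (illustrative only).
ap_of_cf_idf = [0.31, 0.12, 0.27, 0.44, 0.09]  # OF.IDF+CF.IDF
ap_bm25      = [0.22, 0.10, 0.19, 0.35, 0.11]  # BM25

t_stat, p_value = ttest_rel(ap_of_cf_idf, ap_bm25)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# An improvement is reported as significant when p < 0.05.
```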
4. Conclusion

In this paper, we introduced a new benchmark, namely ADOR, which is based on a subset of Amazon reviews. For our research aim, the dataset allows sentiment-based IR to be brought to, and tested in, the medical domain. The dataset focuses on medical products within three categories: medicine, monitoring tools and health-related books. The collection of reviews comes with a structured framework which enables users to automatically generate relevance labels for new topics. Moreover, a query set with relevance results was consolidated into the benchmark. In order to develop this query set, we considered factors such as query intent, the sentiment score of the query and the concept frequency of the query.

To measure the suitability of the benchmark for sentiment-based IR, we proposed naive but reproducible opinion-aware models as semantic instances of the generalizable TF.IDF. These models are derived from combinations of a sentiment-only TF.IDF with term-only and concept-only TF.IDF. We compared the new approach with well-established and modern retrieval models. Our experiments confirmed that the integration of sentiments with IR improves the quality of ranking with regard to the ADOR dataset. The semantic model based on the combination of OF.IDF and CF.IDF achieved the best results against the gold standards.

In conclusion, the ADOR benchmark could help researchers to develop and evaluate opinion-aware retrieval models. These models would benefit companies and healthcare organizations in effectively detecting, ranking and filtering urgent notifications based on patients' health status, narratives and conditions. The benchmark is available at https://github.com/mb320/ADOR.
References

[1] S. Li, C. Zong, Multi-domain sentiment classification, in: Proceedings of ACL-08: HLT, Short Papers, 2008, pp. 257–260.
[2] M. Hall, H. Huurdemann, M. Skov, D. Walsh, et al., Overview of the INEX 2014 interactive social book search track, in: Conference & Labs of the Evaluation Forum (CLEF), 2014.
[3] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics, 2011, pp. 142–150.
[4] W. Hersh, C. Buckley, T. Leone, D. Hickam, OHSUMED: an interactive retrieval evaluation and new large test collection for research, in: SIGIR '94, Springer, 1994, pp. 192–201.
[5] H. Suominen, S. Salanterä, S. Velupillai, W. W. Chapman, G. Savova, N. Elhadad, S. Pradhan, B. R. South, D. L. Mowery, G. J. Jones, et al., Overview of the ShARe/CLEF eHealth evaluation lab 2013, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2013, pp. 212–231.
[6] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies 5 (2012) 1–167.
[7] R. Van Zwol, T. Van Loosbroek, Effective use of semantic structure in XML retrieval, in: European Conference on Information Retrieval, Springer, 2007, pp. 621–628.
[8] M. Bahrani, T. Roelleke, FDCM: Towards balanced and generalizable concept-based models for effective medical ranking, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1957–1960.
[9] H. Azzam, S. Yahyaei, M. Bonzanini, T. Roelleke, A schema-driven approach for knowledge-oriented retrieval and query formulation, in: Proceedings of the Third International Workshop on Keyword Search on Structured Data, ACM, 2012, pp. 39–46.
[10] E. Meij, D. Trieschnigg, M. De Rijke, W. Kraaij, Conceptual language models for domain-specific retrieval, Information Processing & Management 46 (2010) 448–469.
[11] C. Wang, R. Akella, Concept-based relevance models for medical and semantic information retrieval, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 173–182.
[12] L. Kelly, L. Goeuriot, H. Suominen, T. Schreck, G. Leroy, D. L. Mowery, S. Velupillai, W. W. Chapman, D. Martinez, G. Zuccon, et al., Overview of the ShARe/CLEF eHealth evaluation lab 2014, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2014, pp. 172–191.
[13] S. Robertson, D. A. Hull, The TREC-9 filtering track final report, in: TREC, volume 10, Citeseer, 2000, pp. 344250–344253.
[14] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Rekapalli, TREC 2006 genomics track overview, in: TREC, volume 7, 2006, pp. 500–274.
[15] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 55–64.
[16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 2333–2338.
[17] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: Advances in Neural Information Processing Systems, 2014, pp. 2042–2050.
[18] Y. Fan, L. Pang, J. Hou, J. Guo, Y. Lan, X. Cheng, MatchZoo: A toolkit for deep text matching, arXiv preprint arXiv:1707.07270 (2017).