<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADOR: A New Medical Dataset for Sentiment-based IR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Bahrani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Roelleke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Sentiment analysis has received attention in retrieval applications. Combining opinions such as user feelings with semantics would enhance the performance of these applications, especially when the level of urgency is essential, e.g., in the medical domain. However, no widely used medical benchmark is known for evaluating sentiment-aware IR. In this paper, we create a dataset based on Amazon reviews for medical products and make it publicly available. To assess the compatibility of the benchmark with opinions and concepts, we propose a sentiment-aware extension of TF.IDF and apply it to the dataset. This model is derived from linear combinations of a sentiment-based TF.IDF score with term-based and conceptual TF.IDF scores. The benchmark could help healthcare organizations to effectively detect, rank and filter the most urgent notifications based on patients' health status, narratives and conditions.</p>
      </abstract>
      <kwd-group>
<kwd>Semantic Retrieval</kwd>
        <kwd>Query Analysis</kwd>
        <kwd>Language Modelling</kwd>
        <kwd>Benchmark</kwd>
        <kwd>TREC</kwd>
        <kwd>Query Formulation</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Despite the fact that both sentiment analysis and IR are of importance with regards to medical applications, the work on incorporating sentiments into medical IR is limited, and there is no well-known benchmark established for this task. Many review-based datasets have been released for the task of sentiment analysis, such as the multi-domain Amazon dataset [1], INEX social book search [2] and the IMDB dataset of reviews [3]. However, researchers need a benchmark which primarily takes into consideration the integration of opinions and medical concepts. This is due to the importance of feelings in detecting the level of urgency in the medical domain. Moreover, bio-medical companies need to analyse customers' general feelings about their products. On the other hand, patients need to know the sentiment of product reviews before buying. Therefore the examination of sentiments would be beneficial for both buyers and suppliers of medical products.</p>
      <p>In this paper, we address this problem by creating and making available a medical benchmark specifically for the task of opinion-aware retrieval.</p>
      <p>Bio-medical benchmarks consider various pillars of semantics in collections and queries, e.g., terms, concepts and attributes. These semantics enable data scientists to develop effective models for different tasks, e.g., filtering and classification. Several benchmarks have been published to examine different IR models with respect to medical applications, including OHSUMED [4] and CLEF-eHealth [2, 5]. However, developing a sentiment-focused query set for a dataset such as OHSUMED is not optimal since its documents are generated from medical literature. Although sentiment-bearing content, e.g., about cancer and treatment, is included in the documents, implications of urgency and feelings, e.g., emojis, are rarely found. Table 1 shows the overview of well-known medical datasets and lists fundamental statistics of their semantic features.</p>
      <p>Sentiment analysis and opinion mining are popular research fields in natural language processing, data science and text mining. They analyse textual contents based on people's opinions, emotions and attitudes [6]. In this paper, we create a benchmark that consists of a dataset, a query set and the relevance results. The dataset consists of Amazon reviews for medical products. Additionally, it supports the use of common semantics (terms, concepts and relations) in biomedical retrieval.</p>
      <p>The second contribution of this paper is to apply sentiment-aware models to the dataset. We propose a family of opinion-aware models for ranking medical reviews. These models are semantic instances of a generalizable TF.IDF. The technology of semantic retrieval is of particular importance in medical applications, and the integration of semantics with standard content-based retrieval tools could lead to more intelligent search experiences [<xref ref-type="bibr" rid="ref3">7, 8</xref>]. The generalization of TF.IDF towards semantic frameworks is discussed in [9]. When compared to retrieval systems built upon only bag-of-words, the integrated methods result in more performant question answering (QA) systems with constraint-checking abilities. There has been research on developing conceptual models for medical applications [10, 11]. It could be interesting to leverage sentiments and feelings in these applications.</p>
      <p>CIKM'21: Fourth Workshop on Knowledge-driven Analytics and Systems Impacting Human Quality of Life, November 01-05, 2021, CIKM, Australia. m.bahrani@qmul.ac.uk (M. Bahrani); t.roelleke@qmul.ac.uk (T. Roelleke). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>By consolidating the methods for modelling opinions and sentiments in medical ranking, we aim to address the deficiencies in different tasks, including but not limited to notification filtering and review filtering. In terms of notification filtering, we know that doctors and patients are overloaded with massive health-related data, and it is critical for health organizations to focus on the most important and urgent cases. In this scenario, the detection of urgency is associated with both ranking and the acquisition of sentiments.</p>
      <p>Our work contributes to building the grounds for improving medical review filtering through IR. It is the starting point for developing models that could better meet the needs of bio-medical organizations, companies and individual buyers for analysing the most critical, positive and negative reviews.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The ADOR Dataset</title>
      <sec id="sec-2-1">
        <p>The Amazon Dataset of Reviews (ADOR) is based on reviews of bio-medical Amazon products derived from three super-categories: Medication &amp; Remedies, Diagnostic and Monitoring Tools, and Health-Related Books. We defined a set of sub-category products inherited from the super-categories and subsequently extracted the reviews of the top ten related items retrieved by the Amazon search engine. However, in order to achieve a more balanced dataset in terms of polarity, we ignored items without negative reviews.</p>
        <p>[Table 2 (fundamental statistics of ADOR): #Concepts, #Distinct.Concepts, #Opinions, #Distinct.Opinions, #Query, #Docs, #Avg.Query Length, #Avg.Review.Text Length, #Sampling Date]</p>
        <p>To make the data easily reusable, we followed two steps. Firstly, we converted the encoding of the contents to UTF-8; secondly, we defined the schema and the required fields. The essential fields, consisting of the Amazon ASIN number, the medical category, the star-rating, the review title, the review text and labels including star-rating and helpful, are embedded into the dataset.</p>
        <p>We have defined 25 topics based on five purposes. Figure 3 shows the distribution of queries and the number of relevant documents. The five categories of information need are as follows:
1. The retrieval of positive or negative reviews associated with medical products.
2. Fact-based and non-sentiment-bearing queries which only intend to retrieve medical entities.
3. Ranking the polarity of item-reviews within the sub-categories, e.g. vitiligo cream and flu tablets.
4. Ranking the polarity of item-reviews within the super-categories, e.g. medications or diagnostic tools.
5. The retrieval of extreme (most positive or most negative) reviews given different medical concepts. We used modifiers to give attention to the information need, e.g. Highly negative reviews for books about borderline personality disorder.</p>
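        <p>The two preparation steps above can be sketched as follows; the field and function names are illustrative, not the exact ADOR schema.</p>

```python
# A minimal sketch (hypothetical field names) of the two preparation steps:
# normalise the review text to UTF-8 and project each raw record onto the
# required fields of the schema.
from dataclasses import dataclass

@dataclass
class AdorReview:
    asin: str            # Amazon ASIN number
    category: str        # medical (sub-)category
    star_rating: int     # 1..5 stars, also used as a label
    title: str
    text: str
    helpful: int         # number of "found this helpful" votes

def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Step 1: re-encode arbitrary source bytes as UTF-8 text."""
    return raw.decode(source_encoding, errors="replace")

def make_record(raw: dict) -> AdorReview:
    """Step 2: keep only the fields defined in the schema."""
    return AdorReview(
        asin=raw["asin"],
        category=raw["category"],
        star_rating=int(raw["star_rating"]),
        title=raw["title"],
        text=raw["text"],
        helpful=int(raw.get("helpful", 0)),
    )

r = make_record({"asin": "B000123", "category": "Medication",
                 "star_rating": 2, "title": "Did not work",
                 "text": to_utf8(b"No relief after two weeks")})
```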
        <sec id="sec-2-1-1">
          <title>2.2. Overview of ADOR</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Application of the Benchmark</title>
      <p>In this section, we briefly present the dataset and provide the statistics of ADOR. Table 2 lists the fundamental statistics of the dataset. There are 194,790 opinion features and 59,442 medical concepts in the dataset, distributed across 44,796 documents. We used the VADER lexicon to capture opinions and MetaMap to bind terms to medical concepts. Figure 2 presents the distribution of document length and query length. The majority of queries (more than 35%) have a length between 9 and 12 words. More than 50% of documents have between 1 and 20 words, whereas 7% of them are longer than 100 words. The statistics regarding the distribution of queries and their relevant documents are shown in Figure 3. As can be seen, 28% of queries have 1-60 relevant documents, which is the exact same percentage as for queries with more than 240 relevant documents. The rest of the queries have between 60 and 240 relevant documents. We extracted the average document and collection frequencies of the semantic types (neutral terms, concepts and opinions) of ADOR, which can be found in Figure 1. Even though the average document frequency of opinions is high, opinions could significantly impact the retrieval quality due to the nature of reviews.</p>
      <p>3.1. Rationales</p>
      <p>Although the use of human judgments could seem ideal for the generation of gold standards, we developed a generic framework which has some privileges, e.g., it could easily be used to build gold standards for new query sets.</p>
      <p>We provided informative labels, including the star-rating, the number of people who found a review helpful and the medical categories of Amazon products, when preparing the data. This framework helps to rapidly develop new queries that can be formulated over the provided labels. Considering the example query Why are some customers happy with books about caffeine addiction and narcissistic personality disorder?, the formulated query is: (Rating=[4,5], Super-Category=[Books], Sub-Category=[NPD, Caffeine Addiction]). In other words, any review in the dataset that meets the information need requested by the formulated query can be selected.</p>
      <p>To evaluate the accuracy of models, one approach would be the use of existing reviews as queries. However, there are two substantial issues with this approach.</p>
      <p>Firstly, data scientists need to analyse and classify their experimental results based on the query intent, e.g. fact-based versus sentiment-bearing queries. Secondly, existing reviews are not designed as queries: using reviews as queries does not ensure a balanced combination of concepts, terms and opinions and interferes with the structure of reviews.</p>
      <p>[Figure 2: distribution of (a) query and (b) document lengths, percentage per length bucket (0,6), [6,9), [9,12), &gt;=12. Evaluated models: TF.IDF, BM25, KNRM, DSSM, arc-I, CF.IDF, OF.IDF, OF.IDF+TF.IDF (w=0.5), OF.IDF+CF.IDF (w=0.5).]</p>
      <p>One could further apply k-NN (nearest neighbours) to measure the prediction quality: the k-NN classifier retrieves the most similar training reviews (e.g., by cosine similarity), aggregates the evidence and assigns a label to the test review.</p>
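      <p>The k-NN step can be sketched with scikit-learn as follows; the toy corpus and vectoriser settings are illustrative, not the paper's experimental setup.</p>

```python
# Sketch of the described k-NN check: retrieve the most similar training
# reviews by cosine similarity over TF.IDF vectors and assign the majority
# label to the test review. Toy data; settings are illustrative.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["great product, very happy", "terrible, made me sick",
               "works well for allergy", "useless and overpriced"]
train_labels = ["pos", "neg", "pos", "neg"]

vec = TfidfVectorizer()
train_m = vec.fit_transform(train_texts)

def knn_label(test_text: str, k: int = 3) -> str:
    sims = cosine_similarity(vec.transform([test_text]), train_m)[0]
    top = sims.argsort()[::-1][:k]                 # indices of the k nearest reviews
    votes = Counter(train_labels[i] for i in top)  # aggregate the evidence
    return votes.most_common(1)[0][0]              # majority label
```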
      <sec id="sec-3-1">
        <title>3.3. Processing the New Queries</title>
        <p>To confirm the capability of the benchmark with models derived from opinions and concepts, we have developed a naive semantic approach. We briefly describe the methodology and then show the experimental results of comparing the semantic approach with well-known and recent IR methods on ADOR.</p>
        <p>3.3.1. Methodology</p>
        <p>Our approach is to leverage the well-known TF.IDF and capture its semantic extensions built upon opinions and/or concepts. To make the formulations readable, we use type-aware frequency functions, e.g. OF(o, d) is the opinion frequency of opinion o in document d, whereas CF(c, d) is the frequency of concept c in the document. Let q be a query, d be a document and let C be the collection; the Retrieval Status Value (RSV) of the opinion-aware model is as follows:</p>
        <p>RSV_OF.IDF(q, d, C) := ∑_{o ∈ L} OF(o, q) · OF(o, d) · IDF(o, C)   (1)</p>
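        <p>To make Eq. (1) concrete, the following is a minimal runnable sketch; the tiny lexicon and tokenised texts are illustrative stand-ins for the VADER lexicon and the real collection.</p>

```python
# Runnable sketch of Eq. (1): OF(o, x) counts occurrences of opinion term o
# in text x, and IDF(o, C) is a log(N / df) weight over the collection C
# (smoothed here to avoid division by zero). NEG_LEXICON is a toy stand-in
# for the polarity-matched list L drawn from a real lexicon such as VADER.
import math

NEG_LEXICON = {"useless", "poor", "terrible"}

def of(o: str, text: list) -> int:
    return text.count(o)

def idf(o: str, collection: list) -> float:
    n = len(collection)
    df = sum(1 for d in collection if o in d)
    return math.log((n + 1) / (df + 1))

def rsv_of_idf(query: list, doc: list, collection: list) -> float:
    # L = lexicon entries whose polarity equals the (here: negative) query polarity
    return sum(of(o, query) * of(o, doc) * idf(o, collection) for o in NEG_LEXICON)

C = [["useless", "pills"], ["great", "product"], ["poor", "quality", "useless"]]
q = "any useless or poor medications".split()
score = rsv_of_idf(q, C[0], C)
```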
        <sec id="sec-3-1-1">
          <p>IDF(o, C) is the Inverse Document Frequency of the opinion o in the collection. L is the list of all lexical features in the lexicon whose sentiment polarity equals the query polarity. For example, given the query Any useless or poor medications for allergy or cold sore., the query polarity prediction is negative, and consequently, the list L comprises all negative opinions in the lexicon.</p>
          <p>Let c be a medical concept and let IDF(c, C) be the Inverse Document Frequency weight of the concept; the conceptual extension of TF.IDF is defined as below:</p>
          <p>RSV_CF.IDF(q, d, C) := ∑_{c ∈ q} CF(c, q) · CF(c, d) · IDF(c, C)   (2)</p>
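          <p>The combined runs such as OF.IDF+CF.IDF (w=0.5) can plausibly be read as a weighted sum of the RSVs of Eqs. (1) and (2); the following is an illustrative sketch of that reading, not the paper's exact definition.</p>

```python
# Illustrative sketch (one plausible reading, not the paper's exact formula):
# mix the opinion-based RSV of Eq. (1) and the concept-based RSV of Eq. (2)
# with an aggregation parameter w, as in the OF.IDF+CF.IDF (w=0.5) runs.
def rsv_combined(rsv_of: float, rsv_cf: float, w: float = 0.5) -> float:
    return w * rsv_of + (1.0 - w) * rsv_cf

# Documents are then ranked by the combined score.
score = rsv_combined(0.4, 0.8)   # equal weight on opinions and concepts
```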
        </sec>
        <sec id="sec-3-1-2">
          <title>3.3.2. Evaluation</title>
          <p>In this section, we briefly discuss the evaluation results of the proposed semantic models, TF.IDF, BM25 and neural ranking models when applied to ADOR.</p>
          <p>We have trained neural ranking models, including KNRM [15], DSSM [16] and arc-I [17], on ADOR. We performed 5-fold cross-validation where the final fold in each run was considered as the test set. We randomly divided the queries into five folds and captured the average of the fold-level evaluation results. All neural models were developed using MatchZoo [18] based on TensorFlow with the Adam optimizer, batch size 16 and learning rate 0.001. Using the Lucene framework and Language Modelling with a Dirichlet prior, we retrieved pseudo-relevant documents; subsequently, the top 100 documents were re-ranked by the models. In addition to OF.IDF and CF.IDF, we conducted experiments on linear combinations of opinion-aware TF.IDF with term-based and conceptual TF.IDF using aggregation parameter w = 0.5. Concerning concept-based models, we used MetaMap to extract concepts accompanied by their frequencies, semantic types and scores. We counted the 'trigger' attributes of the MetaMap outputs to calculate the corresponding frequencies of semantic types.</p>
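          <p>The cross-validation protocol above can be sketched as follows; evaluate() is a placeholder for whichever per-fold metric is actually computed.</p>

```python
# Sketch of the evaluation protocol: shuffle the queries, split them into
# five folds, use each fold once as the test set and average the per-fold
# scores. evaluate() is a hypothetical placeholder for the real metric.
import random

def five_fold_scores(queries: list, evaluate, seed: int = 0) -> float:
    rng = random.Random(seed)
    q = list(queries)
    rng.shuffle(q)
    folds = [q[i::5] for i in range(5)]   # five roughly equal folds
    scores = []
    for i in range(5):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# 25 topics, as in ADOR; the dummy metric just checks the fold sizes.
avg = five_fold_scores(list(range(25)), lambda tr, te: len(te) / 5.0)
```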
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, we introduced a new benchmark, namely ADOR, which is a subset of Amazon reviews. For our research aim, the dataset allows for bringing and testing sentiment-based IR in the medical domain. The dataset focuses on medical products within three categories: medicine, monitoring tools and health-related books. The collection of reviews comes with a structured framework which enables users to automatically generate relevance labels for new topics. Moreover, a query set with relevance results was consolidated into the benchmark. In order to develop this query set, we considered factors such as query intent, the sentiment score of a query and the concept frequency of a query.</p>
        <p>To measure the suitability of the benchmark for
sentiment-based IR, we proposed naive but reproducible
opinion-aware models as semantic instances of the
generalizable TF.IDF. These models are derived from
combinations of sentiment-only TF.IDF with term-only and
concept-only TF.IDF. We compared the new approach
with well-established and modern retrieval models. Our
experiments confirmed that the integration of sentiments
with IR improves the quality of ranking with regards to
the ADOR dataset. The semantic model based on
combination of OF.IDF and CF.IDF achieved the best results
against gold standards.</p>
        <p>In conclusion, the ADOR benchmark could help researchers to develop and evaluate opinion-aware retrieval models. These models would benefit companies and healthcare organizations to effectively detect, rank and filter urgent notifications based on patients' health status, narratives and conditions. The benchmark is available at https://github.com/mb320/ADOR.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[14] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Rekapalli, TREC 2006 genomics track overview, in: TREC, volume 7, 2006.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[15] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 55-64.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management, 2013, pp. 2333-2338.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[17] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: Advances in Neural Information Processing Systems, 2014, pp. 2042-2050.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[18] Y. Fan, L. Pang, J. Hou, J. Guo, Y. Lan, X. Cheng, MatchZoo: a toolkit for deep text matching, arXiv preprint arXiv:1707.07270 (2017).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>