<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADOR: A New Medical Dataset for Sentiment-based IR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Bahrani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Roelleke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Sentiment analysis has received attention in retrieval applications. Combining opinions such as user feelings with semantics would enhance the performance of these applications, especially when the level of urgency is essential, e.g., in the medical domain. However, no widely used medical benchmark is known for evaluating sentiment-aware IR. In this paper, we create a dataset based on Amazon reviews for medical products and make it publicly available. To assess the compatibility of the benchmark with opinions and concepts, we propose a sentiment-aware extension of TF.IDF and apply it to the dataset. This model is derived from linear combinations of a sentiment-based TF.IDF score with term-based and conceptual TF.IDF scores. The benchmark could help healthcare organizations to effectively detect, rank and filter the most urgent notifications based on patients' health status, narratives and conditions.</p>
      </abstract>
      <kwd-group>
<kwd>Semantic Retrieval</kwd>
        <kwd>Query Analysis</kwd>
        <kwd>Language Modelling</kwd>
        <kwd>Benchmark</kwd>
        <kwd>TREC</kwd>
        <kwd>Query Formulation</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Despite the fact that both sentiment analysis and IR are of importance with regards to medical applications, the work on incorporating sentiments into medical IR is limited, and there is no well-known benchmark established for this task. Many review-based datasets have been released for the task of sentiment analysis, such as the multi-domain Amazon dataset [1], INEX social book search [2] and the IMDB dataset of reviews [3]. However, researchers need a benchmark which primarily takes into consideration the integration of opinions and medical concepts. This is due to the importance of feelings in detecting the level of urgency in the medical domain. Moreover, bio-medical companies need to analyse customers' general feelings about their products. On the other hand, patients need to know the sentiment of product reviews before buying. Therefore the examination of sentiments would be beneficial for both buyers and suppliers of medical products.</p>
      <p>In this paper, we address this problem by creating and making available a medical benchmark specifically for the task of opinion-aware retrieval.</p>
      <p>Bio-medical benchmarks consider various pillars of semantics in collections and queries, e.g., terms, concepts and attributes. These semantics enable data scientists to develop effective models for different tasks, e.g., filtering and classification. Several benchmarks have been published to examine different IR models with respect to medical applications, including OHSUMED [4] and CLEF-eHealth [2, 5]. However, developing a sentiment-focused query set for a dataset such as OHSUMED is not optimal since its documents are generated from medical literature. Although sentiment-bearing content, e.g., about cancer and treatment, is included in the documents, implications of urgency and feelings, e.g., emojis, are rarely found. Table 1 shows the overview of well-known medical datasets and lists fundamental statistics of their semantic features.</p>
      <p>Sentiment analysis and opinion mining are popular research fields in natural language processing, data science and text mining. They analyse textual contents based on people's opinions, emotions and attitudes [6]. In this paper, we create a benchmark that consists of a dataset, a query set and the relevance results. The dataset consists of Amazon reviews for medical products. Additionally, it supports the use of common semantics (terms, concepts and relations) in biomedical retrieval.</p>
      <p>The second contribution of this paper is to apply sentiment-aware models to the dataset. We propose a family of opinion-aware models for ranking medical reviews. These models are semantic instances of a generalizable TF.IDF. The technology of semantic retrieval is of particular importance in medical applications, and the integration of semantics with standard content-based retrieval tools could lead to more intelligent search experiences [<xref ref-type="bibr" rid="ref3">7, 8</xref>]. The generalization of TF.IDF towards semantic frameworks is discussed in [9]. When compared to retrieval systems built upon only bag-of-words, the integrated methods result in more performant question answering (QA) systems with constraint-checking abilities. There has been research on developing conceptual models for medical applications [10, 11]. It could be interesting to leverage sentiments and feelings in these applications.</p>
      <p>CIKM'21: Fourth Workshop on Knowledge-driven Analytics and Systems Impacting Human Quality of Life, November 01-05, 2021, CIKM, Australia. m.bahrani@qmul.ac.uk (M. Bahrani); t.roelleke@qmul.ac.uk (T. Roelleke). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>By consolidating the methods for modelling opinions and sentiments in medical ranking, we aim to address the deficiencies in different tasks, including but not limited to notification filtering and review filtering. In terms of notification filtering, we know that doctors and patients are overloaded with massive health-related data, and it is critical for health organizations to focus on the most important and urgent cases. In this scenario, the detection of urgency is associated with both ranking and the acquisition of sentiments.</p>
      <p>Our work contributes to building the grounds for improving medical review filtering through IR. It is the starting point for developing models that could better meet the needs of bio-medical organizations, companies and individual buyers for analysing the most critical, positive and negative reviews.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The ADOR Dataset</title>
      <sec id="sec-2-1">
        <p>The Amazon Dataset of Reviews (ADOR) is based on reviews of bio-medical Amazon products derived from three super-categories: Medication &amp; Remedies, Diagnostic and Monitoring Tools, and Health-Related Books. We defined a set of sub-category products inherited from the super-categories and subsequently extracted the reviews of the top ten related items retrieved by the Amazon search engine. However, in order to achieve a more balanced dataset in terms of polarity, we ignored items without negative reviews.</p>
        <p>[Table 2 (fundamental statistics of ADOR): #Concepts, #Distinct.Concepts, #Opinions, #Distinct.Opinions, #Query, #Docs, #Avg.Query Length, #Avg.Review.Text Length, #Sampling Date]</p>
        <p>To make the data easily reusable, we followed two steps. Firstly, we converted the encoding of the contents to UTF-8; secondly, we defined the schema and the required fields. The essential fields, consisting of the Amazon ASIN number, the medical category, the star-rating, the review title, the review text and labels including star-rating and helpful, are embedded into the dataset.</p>
        <p>We have defined 25 topics based on five purposes. Figure 3 shows the distribution of queries and the number of relevant documents. The five categories of information need are as follows:
1. The retrieval of positive or negative reviews associated with medical products.
2. Fact-based and non-sentiment-bearing queries which only intend to retrieve medical entities.
3. Ranking the polarity of item-reviews within the sub-categories, e.g. vitiligo cream and flu tablets.
4. Ranking the polarity of item-reviews within the super-categories, e.g. medications or diagnostic tools.
5. The retrieval of extreme (most positive or most negative) reviews given different medical concepts. We used modifiers to give attention to the information need, e.g. Highly negative reviews for books about borderline personality disorder.</p>
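        <p>The two preparation steps above can be sketched as follows; the field and function names are illustrative, not the exact ADOR schema.</p>

```python
# A minimal sketch (hypothetical field names) of the two preparation steps:
# normalise the review text to UTF-8 and project each raw record onto the
# required fields of the schema.
from dataclasses import dataclass

@dataclass
class AdorReview:
    asin: str            # Amazon ASIN number
    category: str        # medical (sub-)category
    star_rating: int     # 1..5 stars, also used as a label
    title: str
    text: str
    helpful: int         # number of "found this helpful" votes

def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Step 1: re-encode arbitrary source bytes as UTF-8 text."""
    return raw.decode(source_encoding, errors="replace")

def make_record(raw: dict) -> AdorReview:
    """Step 2: keep only the fields defined in the schema."""
    return AdorReview(
        asin=raw["asin"],
        category=raw["category"],
        star_rating=int(raw["star_rating"]),
        title=raw["title"],
        text=raw["text"],
        helpful=int(raw.get("helpful", 0)),
    )

r = make_record({"asin": "B000123", "category": "Medication",
                 "star_rating": 2, "title": "Did not work",
                 "text": to_utf8(b"No relief after two weeks")})
```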
        <sec id="sec-2-1-1">
          <title>2.2. Overview of ADOR</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Application of the Benchmark</title>
      <p>In this section, we briefly present the dataset and provide the statistics of ADOR. Table 2 lists the fundamental statistics of the dataset. There are 194,790 opinion features and 59,442 medical concepts in the dataset, distributed across 44,796 documents. We used the VADER lexicon to capture opinions and MetaMap to bind terms to medical concepts. Figure 2 presents the distribution of document length and query length. The majority of queries (more than 35%) have a length between 9 and 12 words. More than 50% of documents have between 1 and 20 words, whereas 7% of them are longer than 100 words. The statistics regarding the distribution of queries and their relevant documents are shown in Figure 3. As can be seen, 28% of queries have 1-60 relevant documents, which is the exact same percentage as for queries with more than 240 relevant documents. The rest of the queries have between 60 and 240 relevant documents. We extracted the average document and collection frequencies of the semantic types (neutral terms, concepts and opinions) of ADOR, which can be found in Figure 1. Even though the average document frequency of opinions is high, opinions could significantly impact the retrieval quality due to the nature of reviews.</p>
      <p>3.1. Rationales</p>
      <p>Although the use of human judgments could seem ideal for the generation of gold standards, we developed a generic framework which has some privileges, e.g., it could easily be used to build gold standards for new query sets.</p>
      <p>We provided informative labels, including the star-rating, the number of people who found a review helpful and the medical categories of Amazon products, when preparing the data. This framework helps to rapidly develop new queries that can be formulated over the provided labels. Considering the example query Why are some customers happy with books about caffeine addiction and narcissistic personality disorder?, the formulated query is: (Rating=[4,5], Super-Category=[Books], Sub-Category=[NPD, Caffeine Addiction]). In other words, any review in the dataset that meets the information need requested by the formulated query can be selected.</p>
      <p>To evaluate the accuracy of models, one approach would be the use of existing reviews as queries. However, there are two substantial issues with this approach.</p>
      <p>Firstly, data scientists need to analyse and classify their experimental results based on the query intent, e.g. fact-based versus sentiment-bearing queries. Secondly, existing reviews are not designed as queries: using reviews as queries does not ensure a balanced combination of concepts, terms and opinions and interferes with the structure of reviews.</p>
      <p>[Figure 2: distribution of (a) query and (b) document lengths, percentage per length bucket (0,6), [6,9), [9,12), &gt;=12. Evaluated models: TF.IDF, BM25, KNRM, DSSM, arc-I, CF.IDF, OF.IDF, OF.IDF+TF.IDF (w=0.5), OF.IDF+CF.IDF (w=0.5).]</p>
      <p>One could further apply k-NN (nearest neighbours) to measure the prediction quality: the k-NN classifier retrieves the most similar training reviews (e.g., by cosine similarity), aggregates the evidence and assigns a label to the test review.</p>
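      <p>The k-NN step can be sketched with scikit-learn as follows; the toy corpus and vectoriser settings are illustrative, not the paper's experimental setup.</p>

```python
# Sketch of the described k-NN check: retrieve the most similar training
# reviews by cosine similarity over TF.IDF vectors and assign the majority
# label to the test review. Toy data; settings are illustrative.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["great product, very happy", "terrible, made me sick",
               "works well for allergy", "useless and overpriced"]
train_labels = ["pos", "neg", "pos", "neg"]

vec = TfidfVectorizer()
train_m = vec.fit_transform(train_texts)

def knn_label(test_text: str, k: int = 3) -> str:
    sims = cosine_similarity(vec.transform([test_text]), train_m)[0]
    top = sims.argsort()[::-1][:k]                 # indices of the k nearest reviews
    votes = Counter(train_labels[i] for i in top)  # aggregate the evidence
    return votes.most_common(1)[0][0]              # majority label
```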
      <sec id="sec-3-1">
        <title>3.3. Processing the New Queries</title>
        <p>To confirm the capability of the benchmark with models derived from opinions and concepts, we have developed a naive semantic approach. We briefly describe the methodology and then show the experimental results of comparing the semantic approach with well-known and recent IR methods on ADOR.</p>
        <p>3.3.1. Methodology</p>
        <p>Our approach is to leverage the well-known TF.IDF and capture its semantic extensions built upon opinions and/or concepts. To make the formulations readable, we use type-aware frequency functions, e.g. OF(o, d) is the opinion frequency of opinion o in document d, whereas CF(c, d) is the frequency of concept c in the document. Let q be a query, d be a document and let C be the collection; the Retrieval Status Value (RSV) of the opinion-aware model is as follows:</p>
        <p>RSV_OF.IDF(q, d, C) := ∑_{o ∈ L} OF(o, q) · OF(o, d) · IDF(o, C)   (1)</p>
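        <p>To make Eq. (1) concrete, the following is a minimal runnable sketch; the tiny lexicon and tokenised texts are illustrative stand-ins for the VADER lexicon and the real collection.</p>

```python
# Runnable sketch of Eq. (1): OF(o, x) counts occurrences of opinion term o
# in text x, and IDF(o, C) is a log(N / df) weight over the collection C
# (smoothed here to avoid division by zero). NEG_LEXICON is a toy stand-in
# for the polarity-matched list L drawn from a real lexicon such as VADER.
import math

NEG_LEXICON = {"useless", "poor", "terrible"}

def of(o: str, text: list) -> int:
    return text.count(o)

def idf(o: str, collection: list) -> float:
    n = len(collection)
    df = sum(1 for d in collection if o in d)
    return math.log((n + 1) / (df + 1))

def rsv_of_idf(query: list, doc: list, collection: list) -> float:
    # L = lexicon entries whose polarity equals the (here: negative) query polarity
    return sum(of(o, query) * of(o, doc) * idf(o, collection) for o in NEG_LEXICON)

C = [["useless", "pills"], ["great", "product"], ["poor", "quality", "useless"]]
q = "any useless or poor medications".split()
score = rsv_of_idf(q, C[0], C)
```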
        <sec id="sec-3-1-1">
          <p>IDF(o, C) is the Inverse Document Frequency of the opinion o in the collection. L is the list of all lexical features in the lexicon whose sentiment polarity equals the query polarity. For example, given the query Any useless or poor medications for allergy or cold sore., the query polarity prediction is negative, and consequently, the list L comprises all negative opinions in the lexicon.</p>
          <p>Let c be a medical concept and let IDF(c, C) be the Inverse Document Frequency weight of the concept; the conceptual extension of TF.IDF is defined as below:</p>
          <p>RSV_CF.IDF(q, d, C) := ∑_{c ∈ q} CF(c, q) · CF(c, d) · IDF(c, C)   (2)</p>
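          <p>The combined runs such as OF.IDF+CF.IDF (w=0.5) can plausibly be read as a weighted sum of the RSVs of Eqs. (1) and (2); the following is an illustrative sketch of that reading, not the paper's exact definition.</p>

```python
# Illustrative sketch (one plausible reading, not the paper's exact formula):
# mix the opinion-based RSV of Eq. (1) and the concept-based RSV of Eq. (2)
# with an aggregation parameter w, as in the OF.IDF+CF.IDF (w=0.5) runs.
def rsv_combined(rsv_of: float, rsv_cf: float, w: float = 0.5) -> float:
    return w * rsv_of + (1.0 - w) * rsv_cf

# Documents are then ranked by the combined score.
score = rsv_combined(0.4, 0.8)   # equal weight on opinions and concepts
```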
        </sec>
        <sec id="sec-3-1-2">
          <title>3.3.2. Evaluation</title>
          <p>In this section, we briefly discuss the evaluation results of the proposed semantic models, TF.IDF, BM25 and neural ranking models when applied to ADOR.</p>
          <p>We have trained neural ranking models, including KNRM [15], DSSM [16] and arc-I [17], on ADOR. We performed 5-fold cross-validation where the final fold in each run was considered as the test set. We randomly divided the queries into five folds and captured the average of the fold-level evaluation results. All neural models were developed using MatchZoo [18] based on TensorFlow with the Adam optimizer, batch size 16 and learning rate 0.001. Using the Lucene framework and Language Modelling with a Dirichlet prior, we retrieved pseudo-relevant documents; subsequently, the top 100 documents were re-ranked by the models. In addition to OF.IDF and CF.IDF, we conducted experiments on linear combinations of opinion-aware TF.IDF with term-based and conceptual TF.IDF using aggregation parameter w = 0.5. Concerning concept-based models, we used MetaMap to extract concepts accompanied by their frequencies, semantic types and scores. We counted the 'trigger' attributes of the MetaMap outputs to calculate the corresponding frequencies of semantic types.</p>
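          <p>The cross-validation protocol above can be sketched as follows; evaluate() is a placeholder for whichever per-fold metric is actually computed.</p>

```python
# Sketch of the evaluation protocol: shuffle the queries, split them into
# five folds, use each fold once as the test set and average the per-fold
# scores. evaluate() is a hypothetical placeholder for the real metric.
import random

def five_fold_scores(queries: list, evaluate, seed: int = 0) -> float:
    rng = random.Random(seed)
    q = list(queries)
    rng.shuffle(q)
    folds = [q[i::5] for i in range(5)]   # five roughly equal folds
    scores = []
    for i in range(5):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# 25 topics, as in ADOR; the dummy metric just checks the fold sizes.
avg = five_fold_scores(list(range(25)), lambda tr, te: len(te) / 5.0)
```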
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, we introduced a new benchmark, namely ADOR, which is a subset of Amazon reviews. For our research aim, the dataset allows for bringing and testing sentiment-based IR in the medical domain. The dataset focuses on medical products within three categories: medicine, monitoring tools and health-related books. The collection of reviews comes with a structured framework which enables users to automatically generate relevance labels for new topics. Moreover, a query set with relevance results was consolidated into the benchmark. In order to develop this query set, we considered factors such as query intent, the sentiment score of a query and the concept frequency of a query.</p>
        <p>To measure the suitability of the benchmark for
sentiment-based IR, we proposed naive but reproducible
opinion-aware models as semantic instances of the
generalizable TF.IDF. These models are derived from
combinations of sentiment-only TF.IDF with term-only and
concept-only TF.IDF. We compared the new approach
with well-established and modern retrieval models. Our
experiments confirmed that the integration of sentiments
with IR improves the quality of ranking with regards to
the ADOR dataset. The semantic model based on
combination of OF.IDF and CF.IDF achieved the best results
against gold standards.</p>
        <p>In conclusion, the ADOR benchmark could help researchers to develop and evaluate opinion-aware retrieval models. These models would benefit companies and healthcare organizations to effectively detect, rank and filter urgent notifications based on patients' health status, narratives and conditions. The benchmark is available at https://github.com/mb320/ADOR.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[14] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Rekapalli, TREC 2006 genomics track overview, in: TREC, volume 7, 2006.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[15] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 55-64.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management, 2013, pp. 2333-2338.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[17] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: Advances in Neural Information Processing Systems, 2014, pp. 2042-2050.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[18] Y. Fan, L. Pang, J. Hou, J. Guo, Y. Lan, X. Cheng, MatchZoo: a toolkit for deep text matching, arXiv preprint arXiv:1707.07270 (2017).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>