ADOR: A New Medical Dataset for Sentiment-based IR
Mohammad Bahrani, Thomas Roelleke
Queen Mary University of London, UK


Abstract
Sentiment analysis has received attention in retrieval applications. Combining opinions, such as user feelings, with semantics can enhance the performance of these applications, especially when the level of urgency is essential, e.g., in the medical domain. However, no widely known medical benchmark exists for evaluating sentiment-aware IR. In this paper, we create a dataset based on Amazon reviews of medical products and make it publicly available. To assess the compatibility of the benchmark with opinions and concepts, we propose a sentiment-aware extension of TF.IDF and apply it to the dataset. This model is derived from linear combinations of a sentiment-based TF.IDF score with term-based and conceptual TF.IDF scores. The benchmark could help healthcare organizations to effectively detect, rank and filter the most urgent notifications based on patients' health status, narratives and conditions.

Keywords
Semantic Retrieval, Query Analysis, Language Modelling, Benchmark, TREC, Query Formulation, Knowledge Representation



CIKM'21: Fourth Workshop on Knowledge-driven Analytics and Systems Impacting Human Quality of Life, November 01–05, 2021, CIKM, Australia
m.bahrani@qmul.ac.uk (M. Bahrani); t.roelleke@qmul.ac.uk (T. Roelleke)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Despite the fact that both sentiment analysis and IR are important for medical applications, work on incorporating sentiments into medical IR is limited, and there is no well-known benchmark established for this task. Many review-based datasets have been released for the task of sentiment analysis, such as the multi-domain Amazon dataset [1], INEX social book search [2] and the IMDB dataset of reviews [3]. However, researchers need a benchmark that primarily takes into consideration the integration of opinions and medical concepts. This is due to the importance of feelings in detecting the level of urgency in the medical domain. Moreover, bio-medical companies need to analyse customers' general feelings about their products. On the other hand, patients need to know the sentiment of product reviews before buying. Therefore, the examination of sentiments would be beneficial for both buyers and suppliers of medical products.

In this paper, we address this problem by creating and making available a medical benchmark specifically for the task of opinion-aware retrieval.

Bio-medical benchmarks consider various pillars of semantics in collections and queries, e.g., terms, concepts and attributes. These semantics enable data scientists to develop effective models for different tasks, e.g., filtering and classification.

Several benchmarks have been published to examine different IR models with respect to medical applications, including OHSUMED [4] and CLEF-eHealth [2, 5]. However, developing a sentiment-focused query-set for a dataset such as OHSUMED is not optimal, since its documents are generated from medical literature. Although sentiment-bearing terms, e.g., cancer and treatment, are included in the documents, implications of urgency and feelings, e.g., emojis, are rarely found. Table 1 gives an overview of well-known medical datasets and lists fundamental statistics of their semantic features.

Sentiment analysis and opinion mining are popular research fields in natural language processing, data science and text mining. They analyse textual contents based on people's opinions, emotions and attitudes [6]. In this paper, we create a benchmark that consists of a dataset, a query-set and the relevance results. The dataset consists of Amazon reviews for medical products. Additionally, it supports the use of common semantics (terms, concepts and relations) in biomedical retrieval.

The second contribution of this paper is to apply sentiment-aware models to the dataset. We propose a family of opinion-aware models for ranking medical reviews. These models are semantic instances of a generalizable TF.IDF. The technology of semantic retrieval is of particular importance in medical applications, and integrating semantics with standard content-based retrieval tools could lead to more intelligent search experiences [7, 8]. The generalization of TF.IDF towards semantic frameworks is discussed in [9]. When compared to retrieval systems built upon only bag-of-words, the integrated methods result in more performant question answering (QA) systems with constraint-checking abilities. There has been research on developing conceptual models for medical applications [10, 11]. It could be interesting to leverage sentiments and feelings in these applications.
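The linear combination of semantic TF.IDF scores described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the field names and the mixing weights (lambdas) are assumptions.

```python
import math
from collections import Counter

def tfidf_score(query_items, doc_items, doc_freq, n_docs):
    """Sum of TF.IDF weights of query items occurring in the document."""
    tf = Counter(doc_items)
    score = 0.0
    for t in query_items:
        if tf[t] and doc_freq.get(t):
            score += (tf[t] / len(doc_items)) * math.log(n_docs / doc_freq[t])
    return score

def combined_score(query, doc, doc_freq, n_docs, lambdas=(0.4, 0.3, 0.3)):
    """Linear combination of term-, concept- and sentiment-based TF.IDF.

    query/doc map each semantic type ("terms", "concepts", "opinions")
    to a list of items; doc_freq holds per-type document frequencies.
    """
    l_term, l_concept, l_opinion = lambdas
    return (l_term * tfidf_score(query["terms"], doc["terms"], doc_freq["terms"], n_docs)
            + l_concept * tfidf_score(query["concepts"], doc["concepts"], doc_freq["concepts"], n_docs)
            + l_opinion * tfidf_score(query["opinions"], doc["opinions"], doc_freq["opinions"], n_docs))

# Toy example: one matching document, one non-matching document.
doc_freq = {"terms": {"flu": 1}, "concepts": {"C001": 1}, "opinions": {"great": 1}}
q = {"terms": ["flu"], "concepts": ["C001"], "opinions": ["great"]}
d1 = {"terms": ["flu", "pill"], "concepts": ["C001"], "opinions": ["great"]}
d2 = {"terms": ["pill"], "concepts": ["C002"], "opinions": ["bad"]}
```

The weights could be tuned per task, e.g., emphasising the opinion component for urgency-oriented queries.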
Dataset                  | Task                                                                        | Overview report                                            | #Queries | avg-Opinions/query | avg-Concepts/query
CLEF 2013 eHealth        | Task 3: patients' questions when reading clinical reports                   | Overview of the ShARe/CLEF eHealth Evaluation Lab 2013 [5] | 50       | 0.30               | 2.90
CLEF 2014 eHealth        | Task 3: use of information, e.g., discharge summaries and ontologies, in IR | Overview of the ShARe/CLEF eHealth Evaluation Lab 2014 [12]| 50       | 0.34               | 1.86
OHSUMED                  | TREC-9 Filtering: evaluate text filtering systems                           | OHSUMED [4]; TREC-9 Final Report [13]                      | 63       | 0.41               | 4.87
TREC 2006 Genomics Track | Passage retrieval for genomics question answering                           | TREC 2006 Genomics Track overview [14]                     | 27       | 0.32               | 6.00
TREC 2007 Genomics Track | Genomics passage retrieval based on biologists' needs                       | TREC 2007 Genomics Track overview                          | 35       | 0.27               | 4.60

Table 1: Overview of well-established benchmarks for health-related retrieval.


By consolidating the methods for modelling opinions and sentiments in medical ranking, we aim to address the deficiencies in different tasks, including but not limited to notification filtering and review filtering. In terms of notification filtering, doctors and patients are overloaded with massive amounts of health-related data, and it is critical for health organizations to focus on the most important and urgent cases. In this scenario, the detection of urgency is associated with both ranking and the acquisition of sentiments.

Our work contributes to laying the groundwork for improving medical review filtering through IR. It is the starting point for developing models that could better meet the needs of bio-medical organizations, companies and individual buyers for analysing the most critical, positive and negative reviews.

2. The ADOR Dataset

The Amazon Dataset of Reviews (ADOR) is based on reviews of bio-medical Amazon products drawn from three super-categories: Medication & Remedies, Diagnostic and Monitoring Tools, and Health-Related Books. We defined a set of sub-category products inherited from the super-categories and subsequently extracted the reviews of the top ten related items retrieved by the Amazon search engine. However, in order to achieve a dataset that is more balanced in terms of polarity, we ignored items without negative reviews.

To make the data easily reusable, we followed two steps. Firstly, we converted the encoding of the contents to UTF-8 and, secondly, we defined the schema and the required fields. The essential fields embedded into the dataset are the Amazon ASIN number, the medical category, the title of the review, the review text, and labels including the star-rating and helpfulness.

#Concepts                595442
#Distinct.Concepts       404748
#Opinions                194790
#Distinct.Opinions       163045
#Queries                 25
#Docs                    44796
Avg.Query Length         9.08
Avg.Review-Text Length   35.38
Sampling Date            31-03-2020

Table 2: The statistics of ADOR.

2.1. ADOR Query Set

We have defined 25 topics based on five purposes. Figure 3 shows the distribution of queries and the number of relevant documents. The five categories of information need are as follows:

   1. The retrieval of positive or negative reviews associated with medical products.
   2. Fact-based and non-sentiment-bearing queries which only intend to retrieve medical entities.
   3. Ranking the polarity of item-reviews within the sub-categories, e.g. vitiligo cream and flu tablets.
   4. Ranking the polarity of item-reviews within the super-categories, e.g. medications or diagnostic tools.
   5. The retrieval of extreme (most positive or most negative) reviews given different medical concepts. We used modifiers to draw attention to the information need, e.g. Highly negative reviews for books about borderline personality disorder.

Figure 1: Document and collection statistics of the ADOR semantic types: the opinions group has the highest document frequency.
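The schema described above could be represented per record roughly as follows. This is an illustrative sketch only: the field and class names are assumptions, not the released schema, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AdorReview:
    """One ADOR record; field names are illustrative, not the official schema."""
    asin: str          # Amazon product identifier (ASIN)
    category: str      # medical (super-)category of the product
    title: str         # title of the review
    review_text: str   # UTF-8 review body
    star_rating: int   # 1-5 stars, usable as a label
    helpful: int       # number of users who found the review helpful

# Hypothetical example record.
review = AdorReview("B000000000", "Medication & Remedies",
                    "Works well", "These tablets eased my flu symptoms.", 5, 3)
```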
2.2. Overview of ADOR

In this section, we briefly present the dataset and provide the statistics of ADOR. Table 2 lists the fundamental statistics of the dataset. There are 194790 opinion features and 595442 medical concepts in the dataset, distributed across 44796 documents. We used the VADER lexicon to capture opinions and MetaMap to bind terms to medical concepts. Figure 2 presents the distribution of document length and query length. The majority of queries (more than 35%) have a length between 9 and 12 words. More than 50% of documents have between 1 and 20 words, whereas 7% of them are longer than 100 words. The statistics regarding the distribution of queries and their relevant documents are shown in Figure 3. As can be seen, 28% of queries have 1-60 relevant documents, which is exactly the same percentage as for queries with more than 240 relevant documents. The rest of the queries have between 60 and 240 relevant documents. We extracted the average document and collection frequencies of the semantic types (neutral terms, concepts and opinions) of ADOR, which can be found in Figure 1. Even though the average document frequency of opinions is high, opinions could significantly impact the retrieval quality due to the nature of reviews.

Figure 2: The distribution of document length and query length.

3. Application of the Benchmark

3.1. Rationales

Although the use of human judgments could seem ideal for the generation of gold standards, we developed a generic framework which has some advantages, e.g., it can easily be used to build gold standards for new query sets.

We provided informative labels, including the star-rating, the number of people who found a review helpful, and the medical categories of Amazon products, when preparing the data. This framework helps to rapidly develop new queries that can be formulated over the provided labels. Considering the example query "Why are some customers happy with books about caffeine addiction and narcissistic personality disorder?", the formulated query is: (Rating=[4,5], Super-Category=[Books], Sub-Category=[NPD, Caffeine Addiction]). In other words, any review in the dataset that meets the information need expressed by the formulated query can be selected.

To evaluate the accuracy of models, one approach would be to use existing reviews as queries. However, there are two substantial issues with this approach. Firstly, data scientists need to analyse and classify their experimental results based on the query intent, e.g. fact-based, binary and explorative queries, and the use of reviews as queries is not in line with the notion of query intent. Secondly, reviews are strongly focused on opinions, so generating a robust query set consisting of a balanced combination of concepts, terms and opinions would interfere with the structure of reviews.
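A label-formulated query such as (Rating=[4,5], Super-Category=[Books], Sub-Category=[NPD, Caffeine Addiction]) can be evaluated as a simple filter over the provided labels. A minimal sketch, with illustrative field names:

```python
def matches(review, rating=None, super_category=None, sub_category=None):
    """True if a review satisfies a label-formulated query; a None
    constraint means "any value"."""
    if rating is not None and review["rating"] not in rating:
        return False
    if super_category is not None and review["super_category"] not in super_category:
        return False
    if sub_category is not None and review["sub_category"] not in sub_category:
        return False
    return True

# Hypothetical mini-collection.
reviews = [
    {"rating": 5, "super_category": "Books", "sub_category": "NPD"},
    {"rating": 2, "super_category": "Books", "sub_category": "NPD"},
    {"rating": 5, "super_category": "Medication & Remedies", "sub_category": "Flu Tablets"},
]

# All reviews meeting the formulated query are selected as relevant.
relevant = [r for r in reviews
            if matches(r, rating=[4, 5], super_category=["Books"],
                       sub_category=["NPD", "Caffeine Addiction"])]
```

Every review passing the filter is taken as relevant for the corresponding topic, which is what makes the framework reusable for new query sets.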
Figure 3: The distribution of queries and the number of relevant documents.

3.2. Baseline Models

The focus of this paper is to introduce a dataset for the task of semantic retrieval in the medical domain, covering sentiment-based and conceptual IR. Therefore, advanced ranking algorithms are the primary baselines. However, the benchmark can also be used for prediction/classification tasks. For example, a review can be considered as a message posted by a patient or a customer. In this case, the evaluation approach is to predict whether it is extreme (very negative) and requires attention from an expert, e.g., a doctor, nurse or company member. The other applicable task is notification systems: users post messages, and an algorithm needs to decide who (e.g. which doctor or expert) should be notified to analyse the message or respond to it.
                                                                                                                                                                                                                                                                                         Furthermore, the framework could be employed by
                                                                                                                                                                                                                                                                                      data scientists to predict features provided by the dataset
                                                                                                                                                                                                                                                                                      such as positive/negative and helpful/not helpful. Base-
                                                                                                                                                                                                                                                                                      lines could be used such as Neural Network classifier
                                                                                                                                                                                                                                                                                      (e.g., Bert or scikit), Bayesian predictor, regression and
                   Model                                         Evaluation Measure
                                                  P@5            P@10       NDCG               MAP
                   TF.IDF                         0.2480         0.2720     0.2354             0.0833
                   BM25                           0.3120         0.3160     0.2336             0.0813
                   KNRM                           0.2320         0.2440     0.2445             0.0906
                   DSSM                           0.2080         0.2200     0.2422             0.1039
                   arc-I                          0.3520         0.3040     0.2476             0.0902
                   CF.IDF                         0.3840         0.4080     0.2619             0.1106
                   OF.IDF                         0.3680         0.4120     0.2758             0.1250
                   OF.IDF+TF.IDF (w=0.5)          0.3600         0.3920     0.2705             0.1175
                   OF.IDF+CF.IDF (w=0.5)          0.4640𝛽𝜃𝜁      0.4280𝛽𝜃𝜁  0.2825𝛽𝜃𝜁          0.1274𝛽𝜃


Table 3: Ranking performances of the opinion-aware models and the baseline methods. The bold font denotes the
         best result for each evaluation metric. 𝛽, 𝜃 and 𝜁 indicate statistically significant improvements of the best
         model over BM25 (𝛽), KNRM (𝜃) and DSSM (𝜁). Statistical significance is based on a paired t-test with
         p-value < 0.05.


K-NN (k nearest neighbours) to measure the prediction quality. The KNN classifier could be applied to retrieve the most similar training reviews (e.g., by cosine similarity), aggregate the evidence and assign a label to the test review.

3.3. Processing the New Queries

To confirm the compatibility of the benchmark with models derived from opinions and concepts, we have developed a naive semantic approach. We briefly describe the methodology and then show the experimental results of comparing the semantic approach with well-known and recent IR methods on ADOR.

3.3.1. Methodology

Our approach is to leverage the well-known TF.IDF and capture its semantic extensions, which are built upon opinions and/or concepts. To make the formulations readable, we use type-aware functions, e.g. OF(𝑜, 𝑑) is the opinion frequency of opinion 𝑜 in document 𝑑, and CF(𝑐, 𝑑) is the frequency of concept 𝑐 in the document. Let 𝑞 be a query, 𝑑 be a document and let 𝑐 be the collection; the Retrieval Status Value (RSV) of the opinion-aware model is as follows:

         RSV_OF.IDF(𝑑, 𝑞, 𝑐) := ∑_{𝑜∈𝑡} OF(𝑜, 𝑞) · OF(𝑜, 𝑑) · IDF(𝑜, 𝑐)        (1)

   IDF(𝑜, 𝑐) is the Inverse Document Frequency of the opinion 𝑜 in the collection. 𝑡 is the list of all lexical features in the lexicon whose sentiment polarity is equal to the query polarity. For example, given the query "Any useless or poor medications for allergy or cold sore.", the query polarity is negative, and consequently, the 𝑡 list comprises all negative opinions in the lexicon.
   Let 𝜙 be a medical concept and let IDF(𝜙, 𝑐) be the Inverse Document Frequency weight of the concept; the conceptual extension of TF.IDF is defined as below:

         RSV_CF.IDF(𝑑, 𝑞, 𝑐) := ∑_{𝜙∈𝑞} CF(𝜙, 𝑞) · CF(𝜙, 𝑑) · IDF(𝜙, 𝑐)        (2)

3.3.2. Evaluation

In this section, we briefly discuss the evaluation results of the proposed semantic models, TF.IDF, BM25 and neural ranking models when applied to ADOR.
   We have trained neural ranking models including KNRM [15], DSSM [16] and arc-I [17] on ADOR. We performed 5-fold cross-validation where the final fold in each run was considered as the test set. We randomly divided the queries into five folds and averaged the evaluation results over the folds. All neural models were developed using MatchZoo [18] based on TensorFlow, with the Adam optimizer, batch size 16 and learning rate 0.001. Using the Lucene framework and Language Modelling with Dirichlet Prior, we retrieved pseudo-relevant documents, and subsequently the top 100 documents were re-ranked by the models. In addition to OF.IDF and CF.IDF, we conducted experiments on linear combinations of opinion-aware TF.IDF with term-based and conceptual TF.IDF using aggregation parameter 𝑤 = 0.5. Concerning concept-based models, we used MetaMap to extract concepts accompanied by their frequencies, semantic types and scores. We counted 'trigger' attributes of the MetaMap outputs to calculate the corresponding frequencies of semantic types.
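The scoring above can be sketched in code. The following is a minimal, illustrative sketch, not the authors' implementation: it assumes queries and documents have already been reduced to lists of opinion terms and concept identifiers, and it sums over features occurring in the query, which is equivalent to summing over the 𝑡 list in Eq. (1) since OF(𝑜, 𝑞) = 0 for any opinion absent from the query. The helper names (idf, rsv, combined_rsv) are hypothetical.

```python
from collections import Counter
from math import log

def idf(feature, collection):
    """Inverse document frequency of a feature (opinion term or concept)
    over a collection of documents, each given as a list of features."""
    n = len(collection)
    df = sum(1 for doc in collection if feature in doc)
    return log(n / df) if df else 0.0

def rsv(query_feats, doc_feats, collection):
    """Eqs. (1)/(2): sum of query-frequency * doc-frequency * IDF
    over the features present in the query."""
    qf, dfreq = Counter(query_feats), Counter(doc_feats)
    return sum(qf[f] * dfreq[f] * idf(f, collection) for f in qf)

def combined_rsv(q_opinions, d_opinions, q_concepts, d_concepts,
                 op_collection, con_collection, w=0.5):
    """OF.IDF+CF.IDF: linear combination with aggregation parameter w."""
    of_idf = rsv(q_opinions, d_opinions, op_collection)
    cf_idf = rsv(q_concepts, d_concepts, con_collection)
    return w * of_idf + (1 - w) * cf_idf
```

With 𝑤 = 0.5 this reproduces the equal-weight combination evaluated for the OF.IDF+CF.IDF model in Table 3.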
   Table 3 shows the experimental results on ADOR using four metrics: P@5, P@10, NDCG and Mean Average Precision (MAP). We also conducted the paired t-test with 𝑝 < 0.05 to compute the significance of improvements. The isolated OF.IDF and CF.IDF worked better than TF.IDF, BM25 and the neural models (KNRM, DSSM, arc-I), while the combination of opinions and concepts achieved the best results. The interesting finding is that the models based on combinations of opinions with both terms (OF.IDF+TF.IDF) and concepts (OF.IDF+CF.IDF) improved all the measures.


4. Conclusion

In this paper, we introduced a new benchmark, namely ADOR, which is a subset of Amazon reviews. For our research aim, the dataset allows for bringing and testing sentiment-based IR in the medical domain. The corresponding dataset focuses on medical products within three categories: medicine, monitoring tools and health-related books. The collection of reviews comes with a structured framework which enables users to automatically generate relevance labels for new topics. Moreover, a query set with relevance results was consolidated into the benchmark. In order to develop this query set, we considered factors such as query intent, sentiment score of the query and concept query frequency.
   To measure the suitability of the benchmark for sentiment-based IR, we proposed naive but reproducible opinion-aware models as semantic instances of the generalizable TF.IDF. These models are derived from combinations of sentiment-only TF.IDF with term-only and concept-only TF.IDF. We compared the new approach with well-established and modern retrieval models. Our experiments confirmed that the integration of sentiments with IR improves the quality of ranking with regard to the ADOR dataset. The semantic model based on the combination of OF.IDF and CF.IDF achieved the best results against the gold standards.
   In conclusion, the ADOR benchmark could help researchers to develop and evaluate opinion-aware retrieval models. These models would benefit companies and healthcare organizations to effectively detect, rank and filter urgent notifications based on patients' health status, narratives and conditions. The benchmark is available at https://github.com/mb320/ADOR.


References

 [1] S. Li, C. Zong, Multi-domain sentiment classification, in: Proceedings of ACL-08: HLT, Short Papers, 2008, pp. 257–260.
 [2] M. Hall, H. Huurdemann, M. Skov, D. Walsh, et al., Overview of the INEX 2014 interactive social book search track, in: Conference & Labs of the Evaluation Forum (CLEF), 2014.
 [3] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics, 2011, pp. 142–150.
 [4] W. Hersh, C. Buckley, T. Leone, D. Hickam, OHSUMED: an interactive retrieval evaluation and new large test collection for research, in: SIGIR'94, Springer, 1994, pp. 192–201.
 [5] H. Suominen, S. Salanterä, S. Velupillai, W. W. Chapman, G. Savova, N. Elhadad, S. Pradhan, B. R. South, D. L. Mowery, G. J. Jones, et al., Overview of the ShARe/CLEF eHealth evaluation lab 2013, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2013, pp. 212–231.
 [6] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies 5 (2012) 1–167.
 [7] R. Van Zwol, T. Van Loosbroek, Effective use of semantic structure in XML retrieval, in: European Conference on Information Retrieval, Springer, 2007, pp. 621–628.
 [8] M. Bahrani, T. Roelleke, FDCM: Towards balanced and generalizable concept-based models for effective medical ranking, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1957–1960.
 [9] H. Azzam, S. Yahyaei, M. Bonzanini, T. Roelleke, A schema-driven approach for knowledge-oriented retrieval and query formulation, in: Proceedings of the Third International Workshop on Keyword Search on Structured Data, ACM, 2012, pp. 39–46.
[10] E. Meij, D. Trieschnigg, M. De Rijke, W. Kraaij, Conceptual language models for domain-specific retrieval, Information Processing & Management 46 (2010) 448–469.
[11] C. Wang, R. Akella, Concept-based relevance models for medical and semantic information retrieval, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 173–182.
[12] L. Kelly, L. Goeuriot, H. Suominen, T. Schreck, G. Leroy, D. L. Mowery, S. Velupillai, W. W. Chapman, D. Martinez, G. Zuccon, et al., Overview of the ShARe/CLEF eHealth evaluation lab 2014, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2014, pp. 172–191.
[13] S. Robertson, D. A. Hull, The TREC-9 filtering track final report, in: TREC, volume 10, Citeseer, 2000,
     pp. 344250–344253.
[14] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Reka-
     palli, Trec 2006 genomics track overview., in: TREC,
     volume 7, 2006, pp. 500–274.
[15] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-
     to-end neural ad-hoc ranking with kernel pooling,
     in: Proceedings of the 40th International ACM SI-
     GIR conference on research and development in
     information retrieval, 2017, pp. 55–64.
[16] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero,
     L. Heck, Learning deep structured semantic models
     for web search using clickthrough data, in: Pro-
     ceedings of the 22nd ACM international conference
     on Information & Knowledge Management, 2013,
     pp. 2333–2338.
[17] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neu-
     ral network architectures for matching natural lan-
     guage sentences, in: Advances in neural informa-
     tion processing systems, 2014, pp. 2042–2050.
[18] Y. Fan, L. Pang, J. Hou, J. Guo, Y. Lan, X. Cheng,
     Matchzoo: A toolkit for deep text matching, arXiv
     preprint arXiv:1707.07270 (2017).