Knowledge-Based Transformer Model for Information Retrieval

Jibril Frej (jibril.frej@univ-grenoble-alpes.fr)
Didier Schwab (didier.schwab@univ-grenoble-alpes.fr)
Jean-Pierre Chevallet (jean-pierre.chevallet@univ-grenoble-alpes.fr)
Univ. Grenoble Alpes, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LIG

ABSTRACT
Vocabulary mismatch is a frequent problem in information retrieval (IR). It can occur when the query is short and/or ambiguous, but also in specialized domains where queries are made by non-specialists and documents are written by experts. Recently, vocabulary mismatch has been addressed with neural learning-to-rank (NLTR) models and word embeddings to avoid relying only on the exact matching of terms for retrieval. Another approach to vocabulary mismatch is to use knowledge bases (KB) that can associate different terms with the same concept. Given the recent success of transformer encoders for NLP, we propose KTRel: an NLTR model that uses word embeddings, knowledge bases and transformer encoders for IR.

KEYWORDS
Information Retrieval, Neural Networks, Learning-to-Rank, Knowledge Base

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION
In specialized domains like the medical one, non-specialists express their queries in plain English, whereas documents contain domain-specific terms. For example, if a user asks "How plant-based diets may extend our lives?", a bag-of-words (BoW) based IR system will be unable to retrieve relevant documents such as "A review of methionine dependency and the role of methionine restriction in cancer growth control and life-span extension". To retrieve this document, an IR system should associate "plant-based diets" with "methionine restriction".

On the one hand, neural learning-to-rank (NLTR) models that use prior knowledge from word embeddings trained on large amounts of raw text are a promising approach to this problem. However, most NLTR models are not interpretable, struggle with unknown or rare words, and struggle to outperform a well-tuned BoW baseline on standard IR collections such as Robust04, where the amount of annotated data is limited [30].

On the other hand, using knowledge bases (KB) to expand queries and/or documents with concepts has often been proposed to tackle vocabulary mismatch, since the same concept/entity can be related to words belonging to both non-specialist and expert vocabularies. However, it is a challenging task, since KBs can be incomplete, can add noise, and may require hand-crafted features [33]. In this work, we study the potential of NLTR models to ignore the noise introduced by the KB and focus on the relevant knowledge to improve search.

Given the recent success of transformer encoders for several NLP tasks [6, 13, 21, 27], we propose KTRel: an NLTR model that uses: (1) word embeddings pre-trained on large amounts of text; (2) concept embeddings pre-trained on a specialized KB; (3) transformer encoders that map sequences of word embeddings and concept embeddings to a fixed-size representation.

2 RELATED WORK
Several methods to include KBs for IR in specialized domains have already been proposed. These methods use one (or a combination) of the following three strategies: (1) explicit rules; (2) machine learning methods based on hand-crafted features; (3) deep learning methods.

Explicit rules. Entity query feature expansion [5], which uses relations between KB elements to extend queries with entities, has been studied in depth by Jimmy et al. [33] in the medical field. They show that such methods require several key choices and design decisions to be effective and are therefore difficult to use in practice.

Machine learning. Soldaini et al. [25] proposed to use KBs to add medical and health hand-crafted features to improve the performance of learning-to-rank methods for IR in the medical field. However, this approach relies on hand-crafted features that require domain- and KB-specific knowledge when they are designed.

Deep learning. Recently, KBs have been successfully combined with NLTR approaches for question-answering systems [23] and for web search in the general domain [15]. NLTR models for IR work similarly to neural models for natural language processing (NLP). The main difference is that the objective functions used by NLTR models optimize the ranking of a list of documents with respect to a query. Unsupervised learning methods have also been used to learn vector representations of documents based on medical concepts for information retrieval in the medical domain [19].

3 KTREL
In this section, we describe the different steps our model follows to perform IR. The overall architecture of KTRel is shown in Figure 1.

[Figure 1: Architecture of KTRel. The word branch (KTRel-W) encodes query words and document words with transformer encoders into q_w and d_w; the concept branch (KTRel-C) encodes query concepts and document concepts into q_c and d_c. The word similarity cos(q_w, d_w) and the concept similarity cos(q_c, d_c) are linearly combined into the relevance score.]

Prior step. In order to include prior knowledge in our model, we pre-train word embeddings on raw text and we pre-train concept embeddings on a specialized domain KB.

Knowledge step. Queries and documents are annotated with a set of candidate concepts from the specialized domain KB. We adopt the same strategy as Shen et al. [23]: n-grams are annotated with their top-K candidate concepts in order to deal with the possible ambiguity of some n-grams.

Transformer step. Considering the significant performance gains recently obtained by transformers in NLP [6, 13, 21, 27], we propose to use transformer encoders to associate both sequences of words and sequences of concepts with a fixed-size representation using the following steps: (1) a mapping of the elements of the input sequence to their corresponding embeddings; (2) a self-attention mechanism [27] to compute context-aware representations of the elements of the sequence; (3) a position-wise feed-forward network; (4) an element-wise sum of the representations obtained previously to get a fixed-size sequence encoding.

Relevance step. A concept-based similarity is computed using the cosine between the transformer encoding of the query's concepts q_c and the transformer encoding of the document's concepts d_c. Analogously, we calculate a word-based similarity (see Figure 1). The final relevance score between query Q and document D is a linear combination of the concept-based similarity and the word-based similarity:

    Rel(Q, D) = a cos(q_w, d_w) + b cos(q_c, d_c)    (1)

with a ∈ R and b ∈ R two parameters learned during training.

4 EXPERIMENTS
In this section, we describe the empirical evaluation of our NLTR models. We first present the data (Section 4.1), the experimental setup (Section 4.2) and our baselines (Section 4.3).

4.1 Datasets
Collection. We evaluate KTRel on the NFCorpus [3]: a publicly available collection for learning-to-rank in the medical domain. It consists of 5,276 different queries written in plain English and 3,633 documents composed of titles and abstracts from PubMed and PMC with a highly technical vocabulary. We did not evaluate our model on standard medical ad hoc IR collections such as CLEF eHealth 2013 [26] or CLEF eHealth 2014 [7] because they contain about 50 annotated queries each, which is not enough to train NLTR models [9, 30].

Knowledge base. We use medical concepts from the version 2018AA of the UMLS Metathesaurus [1]. We choose the UMLS Metathesaurus mainly because of its huge coverage: 3.67 million concepts from 203 source vocabularies.

4.2 Experimental setup
Concepts. We use MetamorphoSys to extract the relational graph of medical concepts from UMLS. We discard concepts that do not belong to a medical semantic type (e.g. Quantitative Concept). Text is annotated with medical concepts using QuickUMLS [24] with default parameter values. As done by Shen et al. [23], the number of candidate concepts K is set to 8.

Pre-trained Embeddings. We use word embeddings trained with word2vec [17] on a combination of PubMed and PMC texts available at http://bio.nlplab.org. Concept embeddings are trained on the UMLS relational graph with TransE [2]. All embeddings are updated during training and both word and concept embedding dimensions are set to 200.

Implementation. KTRel is implemented in PyTorch (https://pytorch.org/).

Loss. Models are trained to minimize the Margin Ranking Loss:

    L = max(0, 1 - rel(Q, D+) + rel(Q, D-))    (2)

where D+ is a document more relevant to query Q than D-.

Transformer encoder. The number of attention heads and the dimension of the feed-forward network are selected from {1, 2, 5, 10} and {50, 100, 200, 500} respectively. We use the ReLU activation function. Preliminary experiments showed that using a single transformer encoder layer yields the best results. This is probably due to the small size of our collection.

Training. The Adam optimizer [12] is used with default parameter values. Batch size and dropout rate are selected from {10, 20, 50} and {0.1, 0.2, 0.3, 0.4, 0.5} respectively. We apply early stopping on the validation MAP.

Validation. The hyper-parameters listed above are tuned on the validation MAP using grid search.

Evaluation. We use 4 standard evaluation metrics: MAP, Recall, Precision and nDCG on the top 1,000 documents. These metrics are implemented with pytrec-eval [11]. We use a two-tailed paired t-test with Bonferroni correction to measure statistically significant differences between the evaluation metrics. Because the NFCorpus has only 3,633 documents, we can evaluate every (query, document) pair in a reasonable amount of time and avoid relying on a re-ranking strategy [9, 31]. Therefore the recall of KTRel and the NLTR baselines is not upper-bounded by a prior ranking stage.

4.3 Baselines
We compare KTRel with three types of baseline methods: BoW models, NLTR models, and pre-trained BERT encoders.

BoW. As suggested by Yang et al. [30], we use Okapi BM25 [22] and Okapi BM25 with RM3 pseudo-relevance feedback [16] as our BoW baselines. Stemming, indexing and evaluation of BM25 and BM25-RM3 are performed with Terrier [20]. Hyper-parameter values are tuned on the validation MAP with grid search.

NLTR. DUET [18], KNRM [29], DRMM [8] and Conv-KNRM [4] are used as NLTR baselines for IR. Training and evaluation of these models is performed with MatchZoo [10]. Hyper-parameter values are tuned on the validation MAP with random search over 10 runs. We use the tuner provided by MatchZoo to sample values from the hyper-parameter space associated with each model.

BERT. We also compare KTRel against the BERT [6] encoder: a state-of-the-art language representation model. We use the "bert-base-uncased" model provided by Hugging Face [28], pre-trained on the BooksCorpus [32] and English Wikipedia. During training, we fine-tune the last layer of the model.

BioBERT. Finally, we compare KTRel against BioBERT [14]: a biomedical language representation model obtained by training the BERT language model on large-scale biomedical corpora.

To study the usefulness of combining concepts and words, we also train and evaluate separately the part of the KTRel architecture that uses only concepts (denoted KTRel-C) and the part that uses only words (denoted KTRel-W), as pictured in Figure 1. KTRel-C and KTRel-W are trained independently using the same setup as described in Section 4.2.

5 RESULTS
The performance of KTRel against the baselines is shown in Table 1. In the following, we propose empirical answers to several research questions.

model       P@5      P@10     P@20     nDCG@5   nDCG@10  nDCG@20  MAP      Recall
BM25        0.2846-  0.2419-  0.1733-  0.3524-  0.3267-  0.3038-  0.1548-  0.4740-
BM25-RM3    0.3056   0.2603   0.1912   0.3664   0.3431   0.3249   0.1801   0.6249
DUET        0.1967-  0.1840-  0.1561-  0.1857-  0.1883-  0.1892-  0.1264-  0.7673+
KNRM        0.2082-  0.1887-  0.1617-  0.1914-  0.1914-  0.1936-  0.1216-  0.7764+
DRMM        0.2940   0.2489   0.1819   0.3540   0.3330   0.3116   0.1651   0.7051+
Conv-KNRM   0.3146   0.2865   0.2378+  0.3010-  0.3090-  0.3138   0.2110+  0.8143+
BERT        0.2084-  0.1998-  0.1536-  0.2090-  0.2196-  0.2062-  0.1567-  0.7847+
BioBERT     0.3148   0.2989+  0.2373+  0.3508   0.3377   0.3228   0.2358+  0.8265+
KTRel-C     0.3127   0.2889+  0.2304+  0.3285-  0.3295-  0.3094-  0.2194+  0.8047+
KTRel-W     0.3204+  0.3008+  0.2377+  0.3465   0.3369   0.3141   0.2228+  0.8187+
KTRel       0.3554+  0.3294+  0.2498+  0.3708   0.3584   0.3424+  0.2411+  0.8520+

Table 1: Performance comparison of different models on the NFCorpus. + (resp. -) denotes a significant performance gain (resp. degradation) against BM25-RM3 (p-value < 0.01). Best performances are highlighted in bold.

[Figure 2: MAP on all query fields against the number of queries used in training, for BM25, BM25-RM3, DRMM, Conv-KNRM and UMLSRank. Markers indicate when an NLTR model has enough training queries to achieve a statistically significant improvement over BM25-RM3 (p-value < 0.05).]

Is it useful for ranking to use both words and medical concepts? KTRel outperforms all the NLTR baselines on all metrics with statistical significance. The fact that KTRel also outperforms both KTRel-W and KTRel-C provides empirical evidence that IR in specialized domains can benefit from combining pre-trained concept representations with pre-trained word representations.

Can KTRel outperform a strong BoW baseline? KTRel achieves statistical significance against BM25 with RM3 query expansion on most metrics. The overall ranking is largely improved by KTRel: +33.9% w.r.t. MAP. The notable exceptions are nDCG@5 and nDCG@10: even if KTRel outperforms BM25-RM3 in terms of nDCG@5 (+1.2%) and nDCG@10 (+4.4%), it does not achieve statistical significance. Interestingly, KTRel does achieve statistical significance on P@5 (+16.3%) and P@10 (+26.5%). The difference between P@k and nDCG@k is that precision only looks at the proportion of relevant documents, whereas nDCG@k puts more emphasis on the ranking itself and takes into account the relevance levels of the documents. Therefore, we can conclude that even if KTRel is able to retrieve more relevant documents in the top-k results, BM25-RM3 is still a strong baseline when it comes to the ranking of the top-k documents.

How does BERT perform for IR in specialised domains? BERT performs worse than BM25 despite its success in several NLP tasks [6]. Because BioBERT outperforms BERT by a wide margin, we can conclude that, when using a language model in a specialized domain, it is essential to pre-train the model on text from the same domain.

Can baseline NLTR models outperform a strong BoW baseline? First, we notice that the DUET and KNRM models perform worse than BM25. The reason is probably that these models were developed on much larger datasets [18, 29] than the NFCorpus. Second, DRMM performs slightly better than BM25 (+6.7% w.r.t. MAP, +3.3% w.r.t. P@5 and +0.5% w.r.t. nDCG@5) but it does not manage to outperform BM25-RM3. Finally, Conv-KNRM is the only NLTR baseline that manages to outperform BM25 and BM25-RM3 w.r.t. MAP, Precision and Recall (but not w.r.t. nDCG). These results empirically confirm that NLTR models have not achieved significant breakthroughs in IR [9].

Do transformer encoders provide useful representations for IR in specialised domains? KTRel-W and KTRel-C perform similarly to the best NLTR baseline. Moreover, the fact that these models rely on a simple cosine similarity between the query and the document representations empirically demonstrates that transformer encoders do produce useful representations for IR.

How do NLTR models affect the recall? Since the number of documents is limited in the NFCorpus, we do not rely on a re-ranking strategy based on BM25 [31]. Therefore the recall of the NLTR models is not upper-bounded by the recall of BM25. The results indicate that the gain of KTRel in terms of recall is significant compared to BM25 (+78.6%) and BM25-RM3 (+36.3%). This happens because BoW models can only retrieve documents that contain terms of the query, whereas NLTR models do not have this restriction.

How much data is needed to outperform BM25-RM3 with a neural network? As we can see in Figure 2, the number of queries required for an NLTR model to outperform the BM25 or BM25-RM3 baselines varies depending on the model under consideration. On the NFCorpus, about 200 queries are required by DRMM to obtain results comparable to those of BM25. This is due to the fact that DRMM is a model with very few parameters (≈ 450) and therefore does not need a lot of data to converge. However, and for the same reasons, DRMM does not benefit from a lot of training data and does not even outperform the BM25-RM3 reference model when given more training queries. The Conv-KNRM model manages to outperform BM25 and BM25-RM3, but about 3,000 training queries are required for Conv-KNRM to outperform BM25-RM3 on the NFCorpus. It seems that less data (≈ 1,500 queries) is needed for the KTRel model. This suggests that the use of concepts can be useful in resource-constrained scenarios. These results also confirm that training an NLTR model on a collection containing only a few hundred queries is a very difficult task. This may explain why significant breakthroughs have yet to be achieved by NLTR models on standard IR collections, which contain only a few hundred queries at best and a few dozen at worst.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we propose KTRel: a transformer-based NLTR model that uses both words and concepts for IR in specialized domains. We empirically demonstrate that adding concepts to a neural learning-to-rank model is useful for IR in the medical domain. We show that transformer encoders provide effective sequence representations for IR. We also empirically confirm that BM25 with RM3 query expansion is still a strong baseline, especially with respect to high-precision metrics. As future work, we plan to evaluate KTRel on more collections and other specialized domains. To make our model scalable to larger collections, we will adapt it to learn word-based and concept-based sparse representations compatible with an inverted index, as suggested by Zamani et al. [31].

REFERENCES
[1] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl_1, D267–D270.
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.
[3] Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In ECIR. 716–722.
[4] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In WSDM '18. ACM, 126–134. https://doi.org/10.1145/3159652.3159659
[5] Jeffrey Dalton, Laura Dietz, and James Allan. 2014. Entity Query Feature Expansion Using Knowledge Base Links. In SIGIR '14. ACM, 365–374. https://doi.org/10.1145/2600428.2609628
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
[7] Lorraine Goeuriot, Liadh Kelly, Wei Li, Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Gareth J. F. Jones, and Henning Müller. 2014. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014. Sheffield, United Kingdom.
[8] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM. ACM, 55–64.
[9] Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2019. A Deep Look into Neural Ranking Models for Information Retrieval. arXiv preprint arXiv:1903.06902.
[10] Jiafeng Guo, Yixing Fan, Xiang Ji, and Xueqi Cheng. 2019. MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. In SIGIR '19. ACM, 1297–1300. https://doi.org/10.1145/3331184.3331403
[11] Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR 2018. 873–876. https://doi.org/10.1145/3209978.3210065
[12] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR 2015. http://arxiv.org/abs/1412.6980
[13] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.
[14] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
[15] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. arXiv preprint arXiv:1805.07591.
[16] Yuanhua Lv and ChengXiang Zhai. 2009. A comparative study of methods for estimating query language models with pseudo feedback. In CIKM. 1895–1898.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111–3119.
[18] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW. 1291–1299.
[19] Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, and Nathalie Souf. 2017. Learning Concept-Driven Document Embeddings for Medical Information Search. In AIME 2017. 160–170. https://doi.org/10.1007/978-3-319-59758-4_17
[20] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. 2005. Terrier Information Retrieval Platform. In ECIR. Springer, 517–519.
[21] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1, 8.
[22] Stephen Robertson and Steve Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In SIGIR. 232–241.
[23] Ying Shen, Yang Deng, Min Yang, Yaliang Li, Nan Du, Wei Fan, and Kai Lei. 2018. Knowledge-aware Attentive Neural Network for Ranking Question Answer Pairs. In SIGIR. 901–904.
[24] Luca Soldaini and Nazli Goharian. 2016. QuickUMLS: a Fast, Unsupervised Approach for Medical Concept Extraction. In MedIR Workshop, SIGIR.
[25] Luca Soldaini and Nazli Goharian. 2017. Learning to Rank for Consumer Health Search: a Semantic Approach. In ECIR. Springer, 640–646.
[26] Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, et al. 2013. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In CLEF. Springer, 212–231.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017. Attention Is All You Need. In NIPS. 5998–6008.
[28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
[29] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In SIGIR. ACM, 55–64.
[30] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In SIGIR 2019. 1129–1132. https://doi.org/10.1145/3331184.3331340
[31] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In CIKM. ACM, 497–506.
[32] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv preprint arXiv:1506.06724.
[33] Guido Zuccon, Bevan Koopman, et al. 2018. Payoffs and Pitfalls in Using Knowledge Bases for Consumer Health Search. Information Retrieval Journal, 1–45.
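The Knowledge step's annotation strategy (each n-gram mapped to its top-K candidate concepts) can be sketched with a toy dictionary matcher. The concept dictionary and concept identifiers below are hypothetical stand-ins: in the paper, this role is played by QuickUMLS over the UMLS Metathesaurus.

```python
def annotate(tokens, concept_dict, max_n=3, top_k=8):
    """Attach up to top_k candidate concepts to every n-gram found in the dictionary."""
    annotations = []
    for n in range(max_n, 0, -1):                # scan longer n-grams first
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in concept_dict:
                # several candidates are kept (K = 8 in the paper) to handle
                # n-grams that are ambiguous in the knowledge base
                annotations.append((ngram, concept_dict[ngram][:top_k]))
    return annotations

# Hypothetical concept identifiers, not real UMLS CUIs:
concept_dict = {
    "plant-based diets": ["C_VEG_DIET", "C_DIET"],
    "lives": ["C_LIFE", "C_LIFESPAN"],
}
query = "how plant-based diets may extend our lives".split()
print(annotate(query, concept_dict))
```

Keeping several candidates per n-gram defers disambiguation to the downstream model, which can learn to down-weight irrelevant candidates.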
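The four operations of the Transformer step can be sketched in a few lines of numpy. This is a minimal single-head sketch with random stand-in weights: it omits the residual connections, layer normalization and multi-head projections of the full transformer encoder [27]. The dimensions follow the experimental setup (embedding size 200; the FFN size is one of the paper's candidate values), while the sequence length is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, seq_len = 200, 200, 5          # d follows the paper; weights below are random stand-ins

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """(1) X holds the embedded input sequence; (2) self-attention computes
    context-aware representations; (3) a position-wise ReLU FFN transforms them;
    (4) an element-wise sum over positions yields one fixed-size vector."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d)) @ V            # scaled dot-product attention
    H = np.maximum(0.0, A @ W1 + b1) @ W2 + b2       # position-wise feed-forward network
    return H.sum(axis=0)                             # fixed-size sequence encoding

X = rng.standard_normal((seq_len, d))                # stand-in for embedded words/concepts
params = [rng.standard_normal(s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
print(encode(X, *params).shape)   # (200,)
```

The same encoder structure is applied to word sequences and concept sequences, giving the q_w, d_w, q_c and d_c vectors used in Equation (1).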
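Equations (1) and (2) translate directly into code. The sketch below uses tiny stand-in vectors in place of the transformer encodings q_w, d_w, q_c and d_c; the margin of 1 matches Equation (2).

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rel(q_w, d_w, q_c, d_c, a, b):
    """Eq. (1): linear combination of word-based and concept-based similarities."""
    return a * cos(q_w, d_w) + b * cos(q_c, d_c)

def margin_ranking_loss(rel_pos, rel_neg, margin=1.0):
    """Eq. (2): hinge loss pushing rel(Q, D+) above rel(Q, D-) by the margin."""
    return max(0.0, margin - rel_pos + rel_neg)

# Toy encodings: identical word vectors, orthogonal concept vectors
q_w = d_w = np.array([1.0, 0.0])
q_c, d_c = np.array([1.0, 0.0]), np.array([0.0, 1.0])
score = rel(q_w, d_w, q_c, d_c, a=1.0, b=1.0)   # word term contributes 1, concept term 0
```

In training, a and b are learned parameters and the loss is computed over (query, relevant document, non-relevant document) triples.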
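The evaluation metrics used in the experiments are standard; the paper computes them with pytrec-eval [11], but pure-Python reference versions make the definitions explicit. The nDCG below uses one common formulation (graded gains with a log2 position discount); trec_eval-style tools may differ in details.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranking, gains, k):
    """gains maps each document to its graded relevance level (0 if absent)."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranking, relevant = ["d1", "d2", "d3", "d4"], {"d1", "d3"}
print(precision_at_k(ranking, relevant, 2))            # 0.5
print(round(average_precision(ranking, relevant), 3))  # 0.833
```

MAP is then the mean of average_precision over all queries, and Recall at depth 1,000 is the fraction of relevant documents appearing in the top 1,000.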
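The contrast drawn between P@k and nDCG@k is easy to make concrete: two rankings can place the same number of relevant documents in the top k (identical P@k) while ordering them very differently (different position-discounted gain). A small binary-relevance illustration:

```python
import math

def dcg_at_k(rels, k):
    # rels: relevance of each retrieved document, in ranked order
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels[:k], start=1))

early = [1, 1, 0, 0, 0]   # both relevant documents at ranks 1-2
late  = [0, 0, 0, 1, 1]   # same documents pushed down to ranks 4-5

# P@5 is 2/5 for both, but the discounted gain differs:
print(round(dcg_at_k(early, 5), 2))   # 1.63
print(round(dcg_at_k(late, 5), 2))    # 0.82
```

This is why a model can improve P@k significantly while its nDCG@k gain over a strong baseline remains modest.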
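The recall argument comes down to exact term matching: a BoW model gives a document a nonzero score only if it shares terms with the query, so the introduction's vocabulary-mismatch example can never be retrieved. A minimal illustration with toy term sets (not the actual index):

```python
def bow_overlap(query_terms, doc_terms):
    # Exact-matching models (BM25-like) weight only the terms
    # shared between the query and the document
    return len(query_terms & doc_terms)

query = {"plant-based", "diets", "extend", "lives"}
relevant_doc = {"methionine", "restriction", "cancer", "life-span", "extension"}

print(bow_overlap(query, relevant_doc))   # 0: the document never enters the ranking
```

NLTR models score documents through dense representations instead, so such documents remain reachable, which is consistent with the large recall gains observed for all NLTR models in Table 1.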