Knowledge Based Transformer Model for Information Retrieval

Jibril Frej (jibril.frej@univ-grenoble-alpes.fr)
Didier Schwab (didier.schwab@univ-grenoble-alpes.fr)
Jean-Pierre Chevallet (jean-pierre.chevallet@univ-grenoble-alpes.fr)
Univ. Grenoble Alpes, CNRS, Grenoble INP*, LIG
* Institute of Engineering Univ. Grenoble Alpes

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

ABSTRACT
Vocabulary mismatch is a frequent problem in information retrieval (IR). It can occur when the query is short and/or ambiguous, but also in specialized domains where queries are made by non-specialists and documents are written by experts. Recently, vocabulary mismatch has been addressed with neural learning-to-rank (NLTR) models and word embeddings to avoid relying only on the exact matching of terms for retrieval. Another approach to vocabulary mismatch is to use knowledge bases (KB) that can associate different terms with the same concept. Given the recent success of transformer encoders for NLP, we propose KTRel: a NLTR model that uses word embeddings, knowledge bases and transformer encoders for IR.

KEYWORDS
Information Retrieval, Neural Networks, Learning-to-Rank, Knowledge Base

1 INTRODUCTION
In specialized domains like the medical one, non-specialists express their queries using plain English, whereas documents contain domain-specific terms. For example, if a user asks "How plant-based diets may extend our lives?", a bag-of-words (BoW) based IR system will be unable to retrieve relevant documents such as "A review of methionine dependency and the role of methionine restriction in cancer growth control and life-span extension". To retrieve this document, an IR system should associate "plant-based diets" with "methionine restriction".
On the one hand, neural learning-to-rank (NLTR) models that use prior knowledge from word embeddings trained on large amounts of raw text are a promising approach to this problem. However, most NLTR models are not interpretable, struggle with unknown or rare words, and struggle to outperform a well-tuned BoW baseline on standard IR collections such as Robust04, where the amount of annotated data is limited [30].
On the other hand, using knowledge bases (KB) to expand queries and/or documents with concepts has often been proposed to tackle the vocabulary mismatch, since the same concept/entity can be related to words belonging to both non-specialist and expert vocabularies. However, it is a challenging task since KB can be incomplete, lead to noise addition and require hand-crafted features [33]. In this work, we study the potential of NLTR models to ignore the noise introduced by KB and focus on the relevant knowledge to improve search.
Given the recent success of transformer encoders for several NLP tasks [6, 13, 21, 27], we propose KTRel: a NLTR model that uses: (1) word embeddings pre-trained on a large amount of text; (2) concept embeddings pre-trained on a specialized KB; (3) transformer encoders that associate sequences of word embeddings and concept embeddings with a fixed-size representation.

2 RELATED WORK
Several methods to include KB for IR in specialized domains have already been proposed. These models use one (or a combination) of the following three strategies: (1) explicit rules; (2) machine learning methods based on hand-crafted features; (3) deep learning methods.
Explicit rules. The entity query feature expansion [5] that uses relations between KB elements to extend queries with entities has been studied in depth by Jimmy et al. [33] in the medical field. They show that such methods require several key choices and design decisions to be effective and are therefore difficult to use in practice.
Machine learning. Soldaini et al. [25] proposed to use KBs to add medical and health hand-crafted features to improve the performance of learning-to-rank methods for IR in the medical field. However, this approach relies on hand-crafted features that require domain- and KB-specific knowledge when they are designed.
Deep learning. Recently, KBs have been successfully combined with NLTR approaches for question-answering systems [23] and for web search in the general domain [15]. NLTR models for IR work similarly to neural models for natural language processing (NLP). The main difference is that the objective functions used by NLTR models optimize the ranking of a list of documents with respect to a query. Unsupervised learning methods were also used to learn vector representations of documents based on medical concepts for information retrieval in the medical domain [19].

3 KTREL
In this section, we describe the different steps our model follows to perform IR. The overall architecture of KTRel is shown in Figure 1.
Prior step. In order to include prior knowledge in our model, we pre-train word embeddings on raw text and we pre-train concept embeddings on a specialized domain KB.
[Figure 1: Architecture of KTRel. Two branches share the same structure: KTRel-C passes the query concepts and document concepts through transformer encoders to obtain q_c and d_c, and KTRel-W passes the query words and document words through transformer encoders to obtain q_w and d_w. Each branch computes a cosine similarity (concept similarity and word similarity), and a linear combination of the two yields the relevance score.]
Knowledge step. Queries and documents are annotated with a set of candidate concepts from the specialized domain KB. We adopt the same strategy as Shen et al. [23]: n-grams are annotated with their top-K candidate concepts in order to deal with the possible ambiguity of some n-grams.
Transformer step. Considering the significant performance gains recently obtained by transformers in NLP [6, 13, 21, 27], we propose to use transformer encoders to associate both sequences of words and sequences of concepts with a fixed-size representation using the following steps: (1) a mapping of the elements of the input sequence to their corresponding embeddings; (2) a self-attention mechanism [27] to compute a context-aware representation of the elements of the sequence; (3) a position-wise Feed-Forward Network; (4) an element-wise sum of the representations obtained previously to get a fixed-size sequence encoding.
Relevance step. A concept-based similarity is computed using the cosine between the transformer encoding of the query's concepts q_c and the transformer encoding of the document's concepts d_c. Analogously, we calculate a word-based similarity (see Figure 1). The final relevance score between query Q and document D consists in a linear combination of the concept-based similarity and the word-based similarity:

Rel(Q, D) = a cos(q_w, d_w) + b cos(q_c, d_c)    (1)

with a ∈ R and b ∈ R two parameters learned during training.
4 EXPERIMENTS
In this section, we describe the empirical evaluation of our NLTR models. We first present the data (Section 4.1), the experimental setup (Section 4.2) and our baselines (Section 4.3).

4.1 Datasets
Collection. We evaluate KTRel on the NFCorpus [3]: a publicly available collection for learning-to-rank in the medical domain. It consists of 5,276 different queries written in plain English and 3,633 documents composed of titles and abstracts from PubMed and PMC with a highly technical vocabulary. We did not evaluate our model on standard medical ad hoc IR collections such as CLEF eHealth 2013 [26] or CLEF eHealth 2014 [7] because they contain about 50 annotated queries each, which is not enough to train NLTR models [9, 30].
Knowledge base. We use medical concepts from the version 2018AA of the UMLS Metathesaurus [1]. We choose the UMLS Metathesaurus mainly because of its huge coverage: 3.67 million concepts from 203 source vocabularies.

4.2 Experimental setup
Concepts. We use MetamorphoSys to extract the relational graph of medical concepts from UMLS. We discard concepts that do not belong to a medical semantic type (e.g. Quantitative Concept). Text is annotated with medical concepts using QuickUMLS [24] with default parameter values. As done by Shen et al. [23], the number of candidate concepts K is set to 8.
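As an illustration of this annotation step, here is a sketch using QuickUMLS's Python interface; the index path is a placeholder and the annotate helper is our hypothetical wrapper, not the pipeline used in the paper.

```python
from quickumls import QuickUMLS

# Path to a QuickUMLS index built from UMLS 2018AA (placeholder).
matcher = QuickUMLS("/path/to/quickumls_index")  # default parameter values

def annotate(text: str, k: int = 8):
    """Annotate text with the top-K candidate concepts (CUIs) of each
    matched n-gram; keeping K candidates handles ambiguous n-grams."""
    # best_match=False keeps all candidate concepts per matched n-gram.
    candidates_per_ngram = matcher.match(text, best_match=False)
    return [
        [m["cui"] for m in sorted(group, key=lambda m: m["similarity"],
                                  reverse=True)[:k]]
        for group in candidates_per_ngram
    ]
```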
Pre-trained Embeddings. We use word embeddings trained with word2vec [17] on a combination of PubMed and PMC texts available at http://bio.nlplab.org. Concept embeddings are trained on the UMLS relational graph with TransE [2]. All embeddings are updated during training and both word and concept embedding dimensions are set to 200.
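For reference, a compact sketch of the TransE objective [2] used to pre-train the concept embeddings on the UMLS relational graph; the class is our simplification (the entity-norm constraint and uniform negative sampling of the original paper are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    """TransE models a relation as a translation: for a true triple
    (head, relation, tail) of the UMLS graph, h + r should be close
    to t, so the L2 distance below acts as a dissimilarity score."""
    def __init__(self, n_concepts: int, n_relations: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(n_concepts, dim)   # concept embeddings
        self.rel = nn.Embedding(n_relations, dim)  # relation embeddings

    def score(self, h, r, t):
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    def loss(self, pos, neg, margin: float = 1.0):
        # Margin loss between a true triple and a corrupted (negative) one.
        return F.relu(margin + self.score(*pos) - self.score(*neg)).mean()
```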
Implementation. KTRel is implemented in PyTorch (https://pytorch.org/).
Loss. Models are trained to minimize the Margin Ranking Loss:

L = max(0, 1 − rel(Q, D+) + rel(Q, D−))    (2)

where D+ is a document more relevant to query Q than D−.
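Equation (2) corresponds to PyTorch's built-in MarginRankingLoss with a margin of 1. A sketch of one training step, reusing the hypothetical KTRelScorer from Section 3:

```python
import torch

loss_fn = torch.nn.MarginRankingLoss(margin=1.0)  # the constant 1 in Eq. (2)

def training_step(model, optimizer, q, d_pos, d_neg):
    """q, d_pos and d_neg are (word_ids, concept_ids) pairs; d_pos is a
    document D+ more relevant to the query than d_neg (D-)."""
    s_pos = model(q[0], d_pos[0], q[1], d_pos[1])
    s_neg = model(q[0], d_neg[0], q[1], d_neg[1])
    # target = 1 requests s_pos > s_neg: loss = max(0, 1 - s_pos + s_neg)
    loss = loss_fn(s_pos, s_neg, torch.ones_like(s_pos))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```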
Transformer encoder. The number of attention heads and the dimension of the feed-forward network are selected from {1, 2, 5, 10} and {50, 100, 200, 500} respectively. We use the ReLU activation function. Preliminary experiments showed that using a single transformer encoder layer yields the best results, probably due to the small size of our collection.
Training. The Adam optimizer [12] is used with default parameter values. Batch size and dropout rate are selected from {10, 20, 50} and {0.1, 0.2, 0.3, 0.4, 0.5} respectively. We apply early stopping on the validation MAP.
Validation. The hyper-parameters listed above are tuned on the validation MAP using grid search.
Evaluation. We use four standard evaluation metrics: MAP, Recall, Precision and nDCG on the top 1,000 documents. These metrics are implemented with pytrec_eval [11]. We use a two-tailed paired t-test with Bonferroni correction to measure statistically significant differences between the evaluation metrics. Because the NFCorpus has only 3,633 documents, we can evaluate every (query, document) pair in a reasonable amount of time and avoid relying on a re-ranking strategy [9, 31]. Therefore the recall of KTRel and the NLTR baselines is not upper-bounded by a prior ranking stage.
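A minimal sketch of this evaluation protocol with pytrec_eval and SciPy; compare_runs is an illustrative helper, and we assume both runs score every (query, document) pair as described above.

```python
import pytrec_eval
from scipy import stats

def compare_runs(qrels, run_a, run_b, measure="map"):
    """qrels: {qid: {docid: rel}}; runs: {qid: {docid: score}} covering
    every (query, document) pair, so recall is not upper-bounded by a
    first-stage ranker. Returns the paired t-test over per-query scores."""
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {measure})
    res_a = evaluator.evaluate(run_a)
    res_b = evaluator.evaluate(run_b)
    per_query_a = [res_a[q][measure] for q in res_a]
    per_query_b = [res_b[q][measure] for q in res_a]  # same query order
    # Two-tailed paired t-test; the p-value is then compared against a
    # Bonferroni-corrected threshold when several metrics are tested.
    return stats.ttest_rel(per_query_a, per_query_b)
```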
4.3 Baselines
We compare KTRel with three types of baseline methods: BoW models, NLTR models and pre-trained BERT encoders.
BoW. As suggested by Yang et al. [30], we use Okapi BM25 [22] and Okapi BM25 with RM3 pseudo-relevance feedback [16] as our BoW baselines. Stemming, indexing and evaluation of BM25 and BM25-RM3 are performed with Terrier [20]. Hyper-parameter values are tuned on the validation MAP with grid search.
NLTR. DUET [18], KNRM [29], DRMM [8] and Conv-KNRM [4] are used as NLTR baselines for IR. Training and evaluation of these models are performed with MatchZoo [10]. Hyper-parameter values are tuned on the validation MAP with random search over 10 runs. We use the tuner provided by MatchZoo to sample values from the hyper-parameter space associated with each model.
BERT. We also compare KTRel against the BERT [6] encoder: a state-of-the-art language representation model. We use the "bert-base-uncased" model provided by Hugging Face [28], pre-trained on the BooksCorpus [32] and English Wikipedia. During training, we fine-tune the last layer of the model.
BioBERT. Finally, we compare KTRel against BioBERT [14]: a biomedical language representation model that trains the BERT language model on large-scale biomedical corpora.
To study the usefulness of combining concepts and words, we also train and evaluate separately the part of the KTRel architecture that uses only concepts (denoted KTRel-C) and the part that uses only words (denoted KTRel-W), as pictured in Figure 1. KTRel-C and KTRel-W are trained independently using the same setup as described in Section 4.2.

[Figure 2: MAP on all query fields against the number of queries used in training, for BM25, BM25-RM3, DRMM, Conv-KNRM and UMLSRank. Markers indicate when a NLTR model has enough queries in training to achieve a statistically significant improvement compared to BM25-RM3 (p-value < 0.05).]

5 RESULTS
The performance of KTRel against the baselines is shown in Table 1. In the following, we propose empirical answers to several research questions.
model     |  P@5    |  P@10   |  P@20   | nDCG@5  | nDCG@10 | nDCG@20 |  MAP    | Recall
BM25      | 0.2846- | 0.2419- | 0.1733- | 0.3524- | 0.3267- | 0.3038- | 0.1548- | 0.4740-
BM25-RM3  | 0.3056  | 0.2603  | 0.1912  | 0.3664  | 0.3431  | 0.3249  | 0.1801  | 0.6249
DUET      | 0.1967- | 0.1840- | 0.1561- | 0.1857- | 0.1883- | 0.1892- | 0.1264- | 0.7673+
KNRM      | 0.2082- | 0.1887- | 0.1617- | 0.1914- | 0.1914- | 0.1936- | 0.1216- | 0.7764+
DRMM      | 0.2940  | 0.2489  | 0.1819  | 0.3540  | 0.3330  | 0.3116  | 0.1651  | 0.7051+
Conv-KNRM | 0.3146  | 0.2865  | 0.2378+ | 0.3010- | 0.3090- | 0.3138  | 0.2110+ | 0.8143+
BERT      | 0.2084- | 0.1998- | 0.1536- | 0.2090- | 0.2196- | 0.2062- | 0.1567- | 0.7847+
BioBERT   | 0.3148  | 0.2989+ | 0.2373+ | 0.3508  | 0.3377  | 0.3228  | 0.2358+ | 0.8265+
KTRel-C   | 0.3127  | 0.2889+ | 0.2304+ | 0.3285- | 0.3295- | 0.3094- | 0.2194+ | 0.8047+
KTRel-W   | 0.3204+ | 0.3008+ | 0.2377+ | 0.3465  | 0.3369  | 0.3141  | 0.2228+ | 0.8187+
KTRel     | 0.3554+ | 0.3294+ | 0.2498+ | 0.3708  | 0.3584  | 0.3424+ | 0.2411+ | 0.8520+

Table 1: Performance comparison of different models on the NFCorpus. + (resp. -) denotes a significant performance gain (resp. degradation) against BM25-RM3 (p-value < 0.01). The best performance on each metric is obtained by KTRel (last row).



Is it useful for ranking to use both words and medical concepts? KTRel outperforms all the NLTR baselines on all metrics with statistical significance. The fact that KTRel also outperforms both KTRel-W and KTRel-C provides empirical evidence that IR in specialized domains can benefit from combining pre-trained concept representations with pre-trained word representations.
Can KTRel outperform a strong BoW baseline? KTRel achieves statistical significance against BM25 with RM3 query expansion on most metrics. The overall ranking is largely improved by KTRel: +33.9% w.r.t. MAP. The notable exceptions are nDCG@5 and nDCG@10: even if KTRel outperforms BM25-RM3 in terms of nDCG@5 (+1.2%) and nDCG@10 (+4.4%), it does not achieve statistical significance. Interestingly, KTRel does achieve statistical significance on P@5 (+16.3%) and P@10 (+26.5%). The difference between P@k and nDCG@k is that precision only looks at the proportion of relevant documents, whereas nDCG@k puts more emphasis on the ranking itself and takes into account the relevance levels of documents. Therefore, we can conclude that even if KTRel is able to retrieve more relevant documents in the top-k results, BM25-RM3 is still a strong baseline when it comes to the ranking of the top-k documents.
How does BERT perform in IR in specialised domains? BERT performs worse than BM25 despite its success in several NLP tasks [6]. Because BioBERT outperforms BERT by a large margin, we can conclude that, when using a language model in a specialized domain, it is essential to pre-train the model on text of the same domain.
Can baseline NLTR models outperform a strong BoW baseline? First, we notice that the DUET and KNRM models perform worse than BM25. The reason is probably that these models were developed on much larger datasets [18, 29] than the NFCorpus. Second, DRMM performs slightly better than BM25 (+6.7% w.r.t. MAP, +3.3% w.r.t. P@5 and +0.5% w.r.t. nDCG@5) but it does not manage to outperform BM25-RM3. Finally, Conv-KNRM is the only NLTR baseline that manages to outperform BM25 and BM25-RM3 w.r.t. MAP, Precision and Recall (but not w.r.t. nDCG). These results empirically confirm that NLTR models have not achieved significant breakthroughs in IR [9].
Do transformer encoders provide useful representations for IR in specialised domains? KTRel-W and KTRel-C perform similarly to the best NLTR baseline. Moreover, the fact that these models rely on a simple cosine similarity between the query and document representations empirically demonstrates that transformer encoders do produce useful representations for IR.
How do NLTR models affect the recall? Since the number of documents is limited in the NFCorpus, we do not rely on a re-ranking strategy based on BM25 [31]. Therefore the recall of the NLTR models is not upper-bounded by the recall of BM25. The results indicate that the gain of KTRel in terms of recall is significant compared to BM25 (+78.6%) and BM25-RM3 (+36.3%). This happens because BoW models can only retrieve documents that contain terms of the query, whereas NLTR models do not have this restriction.
How much data is needed to outperform BM25-RM3 with a neural network? As we can see in Figure 2, the number of queries required for an NLTR model to outperform the BM25 or BM25-RM3 baselines varies depending on the model under consideration. On the NFCorpus, about 200 queries are required by DRMM to obtain results comparable to those of BM25. This is due to the fact that DRMM is a model with very few parameters (≈ 450) and therefore does not need a lot of data to converge. However, and for the same reasons, DRMM does not benefit from more training data and does not outperform the BM25-RM3 reference model even when given more training queries. The Conv-KNRM model manages to outperform BM25 and BM25-RM3, but about 3,000 queries are required in the training set for Conv-KNRM to outperform BM25-RM3 on the NFCorpus. It seems that less data (≈ 1,500 queries) is needed for the KTRel model. This suggests that the use of concepts can be useful in resource-constrained scenarios. These results also confirm that training an NLTR model on a collection containing only a few hundred queries is a very difficult task. This may explain why significant breakthroughs have yet to be achieved by NLTR models on standard IR collections that contain only a few hundred queries at best and a few dozen at worst.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we propose KTRel: a transformer-based NLTR model that uses both words and concepts for IR in specialized domains. We empirically demonstrate that adding concepts to a neural learning-to-rank model is useful for IR in the medical domain. We show that transformer encoders provide effective sequence representations for IR. We also empirically confirm that BM25 with RM3 query expansion is still a strong baseline, especially with respect to high-precision metrics. As future work, we plan to evaluate KTRel on more collections and other specialized domains. To make our model scalable to larger collections, we will adapt it to learn word-based and concept-based sparse representations compatible with an inverted index, as suggested by Zamani et al. [31].
REFERENCES
[1] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl_1, D267–D270.
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.
[3] Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In ECIR. 716–722.
[4] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18). ACM, New York, NY, USA, 126–134. https://doi.org/10.1145/3159652.3159659
[5] Jeffrey Dalton, Laura Dietz, and James Allan. 2014. Entity Query Feature Expansion Using Knowledge Base Links. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '14). ACM, New York, NY, USA, 365–374. https://doi.org/10.1145/2600428.2609628
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Lorraine Goeuriot, Liadh Kelly, Wei Li, Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Gareth J. F. Jones, and Henning Müller. 2014. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014. Sheffield, United Kingdom.
[8] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 55–64.
[9] Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2019. A Deep Look into Neural Ranking Models for Information Retrieval. arXiv preprint arXiv:1903.06902 (2019).
[10] Jiafeng Guo, Fan Yixing, Ji Xiang, and Cheng Xueqi. 2019. MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19). ACM, New York, NY, USA, 1297–1300. https://doi.org/10.1145/3331184.3331403
[11] Christophe Van Gysel and Maarten de Rijke. 2018. pytrec_eval: An Extremely Fast Python Interface to trec_eval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2018), Ann Arbor, MI, USA, July 08-12, 2018. 873–876. https://doi.org/10.1145/3209978.3210065
[12] D. P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
[13] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291 (2019).
[14] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746 (2019). http://arxiv.org/abs/1901.08746
[15] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. arXiv preprint arXiv:1805.07591 (2018).
[16] Yuanhua Lv and ChengXiang Zhai. 2009. A comparative study of methods for estimating query language models with pseudo feedback. In CIKM. 1895–1898.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[18] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1291–1299.
[19] Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, and Nathalie Souf. 2017. Learning Concept-Driven Document Embeddings for Medical Information Search. In Artificial Intelligence in Medicine - 16th Conference on Artificial Intelligence in Medicine (AIME 2017), Vienna, Austria, June 21-24, 2017, Proceedings. 160–170. https://doi.org/10.1007/978-3-319-59758-4_17
[20] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. 2005. Terrier information retrieval platform. In European Conference on Information Retrieval. Springer, 517–519.
[21] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
[22] Stephen Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR. 232–241.
[23] Ying Shen, Yang Deng, Min Yang, Yaliang Li, Nan Du, Wei Fan, and Kai Lei. 2018. Knowledge-aware Attentive Neural Network for Ranking Question Answer Pairs. In SIGIR. 901–904.
[24] Luca Soldaini and Nazli Goharian. 2016. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR.
[25] Luca Soldaini and Nazli Goharian. 2017. Learning to rank for consumer health search: a semantic approach. In ECIR. Springer, 640–646.
[26] Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, et al. 2013. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 212–231.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017. Attention is all you need. In NIPS. 5998–6008.
[28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771 (2019).
[29] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 55–64.
[30] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 21-25, 2019. 1129–1132. https://doi.org/10.1145/3331184.3331340
[31] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 497–506.
[32] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. CoRR abs/1506.06724 (2015). http://arxiv.org/abs/1506.06724
[33] Jimmy, Guido Zuccon, and Bevan Koopman. 2018. Payoffs and pitfalls in using knowledge-bases for consumer health search. Information Retrieval Journal (2018), 1–45.