Knowledge-Based Transformer Model for Information Retrieval

Jibril Frej (jibril.frej@univ-grenoble-alpes.fr)
Didier Schwab (didier.schwab@univ-grenoble-alpes.fr)
Jean-Pierre Chevallet (jean-pierre.chevallet@univ-grenoble-alpes.fr)
Univ. Grenoble Alpes, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LIG

ABSTRACT
Vocabulary mismatch is a frequent problem in information retrieval (IR). It can occur when the query is short and/or ambiguous, but also in specialized domains where queries are made by non-specialists and documents are written by experts. Recently, vocabulary mismatch has been addressed with neural learning-to-rank (NLTR) models and word embeddings to avoid relying only on the exact matching of terms for retrieval. Another approach to vocabulary mismatch is to use knowledge bases (KB) that can associate different terms with the same concept. Given the recent success of transformer encoders for NLP, we propose KTRel: an NLTR model that uses word embeddings, knowledge bases and transformer encoders for IR.

KEYWORDS
Information Retrieval, Neural Networks, Learning-to-Rank, Knowledge Base

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION
In specialized domains like the medical one, non-specialists express their queries in plain English, whereas documents contain domain-specific terms. For example, if a user asks "How plant-based diets may extend our lives?", a bag-of-words (BoW) based IR system will be unable to retrieve relevant documents such as "A review of methionine dependency and the role of methionine restriction in cancer growth control and life-span extension". To retrieve this document, an IR system should associate "plant-based diets" with "methionine restriction".

On the one hand, neural learning-to-rank (NLTR) models that use prior knowledge from word embeddings trained on large amounts of raw text are a promising approach to this problem. However, most NLTR models are not interpretable, struggle with unknown or rare words, and struggle to outperform a well-tuned BoW baseline on standard IR collections such as Robust04, where the amount of annotated data is limited [30].

On the other hand, using knowledge bases (KB) to expand queries and/or documents with concepts has often been proposed to tackle vocabulary mismatch, since the same concept/entity can be related to words belonging to both non-specialist and expert vocabularies. However, it is a challenging task, since KBs can be incomplete, can add noise, and may require hand-crafted features [33]. In this work, we study the potential of NLTR models to ignore the noise introduced by the KB and focus on the relevant knowledge to improve search.

Given the recent success of transformer encoders for several NLP tasks [6, 13, 21, 27], we propose KTRel: an NLTR model that uses: (1) word embeddings pre-trained on large amounts of text; (2) concept embeddings pre-trained on a specialized KB; (3) transformer encoders that map sequences of word embeddings and concept embeddings to a fixed-size representation.

2 RELATED WORK
Several methods to include KBs for IR in specialized domains have already been proposed. These methods use one (or a combination) of the following three strategies: (1) explicit rules; (2) machine learning methods based on hand-crafted features; (3) deep learning methods.

Explicit rules. Entity query feature expansion [5], which uses relations between KB elements to extend queries with entities, has been studied in depth by Jimmy et al. [33] in the medical field. They show that such methods require several key choices and design decisions to be effective and are therefore difficult to use in practice.

Machine learning. Soldaini et al. [25] proposed to use KBs to add medical and health hand-crafted features to improve the performance of learning-to-rank methods for IR in the medical field. However, this approach relies on hand-crafted features that require domain- and KB-specific knowledge when they are designed.

Deep learning. Recently, KBs have been successfully combined with NLTR approaches for question-answering systems [23] and for web search in the general domain [15]. NLTR models for IR work similarly to neural models for natural language processing (NLP). The main difference is that the objective functions used by NLTR models optimize the ranking of a list of documents with respect to a query. Unsupervised learning methods have also been used to learn vector representations of documents based on medical concepts for information retrieval in the medical domain [19].

3 KTREL
In this section, we describe the different steps our model follows to perform IR. The overall architecture of KTRel is shown in Figure 1.

[Figure 1: Architecture of KTRel. The word branch (KTRel-W) encodes query words and document words with transformer encoders into q_w and d_w; the concept branch (KTRel-C) encodes query concepts and document concepts into q_c and d_c. The word similarity cos(q_w, d_w) and the concept similarity cos(q_c, d_c) are linearly combined into the relevance score.]

Prior step. In order to include prior knowledge in our model, we pre-train word embeddings on raw text and we pre-train concept embeddings on a specialized domain KB.

Knowledge step. Queries and documents are annotated with a set of candidate concepts from the specialized domain KB. We adopt the same strategy as Shen et al. [23]: n-grams are annotated with their top-K candidate concepts in order to deal with the possible ambiguity of some n-grams.

Transformer step. Considering the significant performance gains recently obtained by transformers in NLP [6, 13, 21, 27], we propose to use transformer encoders to associate both sequences of words and sequences of concepts with a fixed-size representation using the following steps: (1) a mapping of the elements of the input sequence to their corresponding embeddings; (2) a self-attention mechanism [27] to compute context-aware representations of the elements of the sequence; (3) a position-wise feed-forward network; (4) an element-wise sum of the representations obtained previously to get a fixed-size sequence encoding.

Relevance step. A concept-based similarity is computed using the cosine between the transformer encoding of the query's concepts q_c and the transformer encoding of the document's concepts d_c. Analogously, we calculate a word-based similarity (see Figure 1). The final relevance score between query Q and document D is a linear combination of the concept-based similarity and the word-based similarity:

    Rel(Q, D) = a cos(q_w, d_w) + b cos(q_c, d_c)    (1)

with a ∈ R and b ∈ R two parameters learned during training.

4 EXPERIMENTS
In this section, we describe the empirical evaluation of our NLTR models. We first present the data (Section 4.1), the experimental setup (Section 4.2) and our baselines (Section 4.3).

4.1 Datasets
Collection. We evaluate KTRel on the NFCorpus [3]: a publicly available collection for learning-to-rank in the medical domain. It consists of 5,276 different queries written in plain English and 3,633 documents composed of titles and abstracts from PubMed and PMC with a highly technical vocabulary. We did not evaluate our model on standard medical ad hoc IR collections such as CLEF eHealth 2013 [26] or CLEF eHealth 2014 [7] because they contain about 50 annotated queries each, which is not enough to train NLTR models [9, 30].

Knowledge base. We use medical concepts from the version 2018AA of the UMLS Metathesaurus [1]. We choose the UMLS Metathesaurus mainly because of its huge coverage: 3.67 million concepts from 203 source vocabularies.

4.2 Experimental setup
Concepts. We use MetamorphoSys to extract the relational graph of medical concepts from UMLS. We discard concepts that do not belong to a medical semantic type (e.g. Quantitative Concept). Text is annotated with medical concepts using QuickUMLS [24] with default parameter values. As done by Shen et al. [23], the number of candidate concepts K is set to 8.

Pre-trained Embeddings. We use word embeddings trained with word2vec [17] on a combination of PubMed and PMC texts available at http://bio.nlplab.org. Concept embeddings are trained on the UMLS relational graph with TransE [2]. All embeddings are updated during training and both word and concept embedding dimensions are set to 200.

Implementation. KTRel is implemented in PyTorch (https://pytorch.org/).

Loss. Models are trained to minimize the Margin Ranking Loss:

    L = max(0, 1 - rel(Q, D+) + rel(Q, D-))    (2)

where D+ is a document more relevant to query Q than D-.

Transformer encoder. The number of attention heads and the dimension of the feed-forward network are selected from {1, 2, 5, 10} and {50, 100, 200, 500} respectively. We use the ReLU activation function. Preliminary experiments showed that using a single transformer encoder layer yields the best results. This is probably due to the small size of our collection.

Training. The Adam optimizer [12] is used with default parameter values. Batch size and dropout rate are selected from {10, 20, 50} and {0.1, 0.2, 0.3, 0.4, 0.5} respectively. We apply early stopping on the validation MAP.

Validation. The hyper-parameters listed above are tuned on the validation MAP using grid search.

Evaluation. We use 4 standard evaluation metrics: MAP, Recall, Precision and nDCG on the top 1,000 documents. These metrics are implemented with pytrec-eval [11]. We use a two-tailed paired t-test with Bonferroni correction to measure statistically significant differences between the evaluation metrics. Because the NFCorpus has only 3,633 documents, we can evaluate every (query, document) pair in a reasonable amount of time and avoid relying on a re-ranking strategy [9, 31]. Therefore the recall of KTRel and the NLTR baselines is not upper-bounded by a prior ranking stage.

4.3 Baselines
We compare KTRel with three types of baseline methods: BoW models, NLTR models, and pre-trained BERT encoders.

BoW. As suggested by Yang et al. [30], we use Okapi BM25 [22] and Okapi BM25 with RM3 pseudo-relevance feedback [16] as our BoW baselines. Stemming, indexing and evaluation of BM25 and BM25-RM3 are performed with Terrier [20]. Hyper-parameter values are tuned on the validation MAP with grid search.

NLTR. DUET [18], KNRM [29], DRMM [8] and Conv-KNRM [4] are used as NLTR baselines for IR. Training and evaluation of these models is performed with MatchZoo [10]. Hyper-parameter values are tuned on the validation MAP with random search over 10 runs. We use the tuner provided by MatchZoo to sample values from the hyper-parameter space associated with each model.

BERT. We also compare KTRel against the BERT [6] encoder: a state-of-the-art language representation model. We use the "bert-base-uncased" model provided by Hugging Face [28], pre-trained on the BooksCorpus [32] and English Wikipedia. During training, we fine-tune the last layer of the model.

BioBERT. Finally, we compare KTRel against BioBERT [14]: a biomedical language representation model obtained by training the BERT language model on large-scale biomedical corpora.

To study the usefulness of combining concepts and words, we also train and evaluate separately the part of the KTRel architecture that uses only concepts (denoted KTRel-C) and the part that uses only words (denoted KTRel-W), as pictured in Figure 1. KTRel-C and KTRel-W are trained independently using the same setup as described in Section 4.2.

5 RESULTS
The performance of KTRel against the baselines is shown in Table 1. In the following, we propose empirical answers to several research questions.

model       P@5      P@10     P@20     nDCG@5   nDCG@10  nDCG@20  MAP      Recall
BM25        0.2846-  0.2419-  0.1733-  0.3524-  0.3267-  0.3038-  0.1548-  0.4740-
BM25-RM3    0.3056   0.2603   0.1912   0.3664   0.3431   0.3249   0.1801   0.6249
DUET        0.1967-  0.1840-  0.1561-  0.1857-  0.1883-  0.1892-  0.1264-  0.7673+
KNRM        0.2082-  0.1887-  0.1617-  0.1914-  0.1914-  0.1936-  0.1216-  0.7764+
DRMM        0.2940   0.2489   0.1819   0.3540   0.3330   0.3116   0.1651   0.7051+
Conv-KNRM   0.3146   0.2865   0.2378+  0.3010-  0.3090-  0.3138   0.2110+  0.8143+
BERT        0.2084-  0.1998-  0.1536-  0.2090-  0.2196-  0.2062-  0.1567-  0.7847+
BioBERT     0.3148   0.2989+  0.2373+  0.3508   0.3377   0.3228   0.2358+  0.8265+
KTRel-C     0.3127   0.2889+  0.2304+  0.3285-  0.3295-  0.3094-  0.2194+  0.8047+
KTRel-W     0.3204+  0.3008+  0.2377+  0.3465   0.3369   0.3141   0.2228+  0.8187+
KTRel       0.3554+  0.3294+  0.2498+  0.3708   0.3584   0.3424+  0.2411+  0.8520+

Table 1: Performance comparison of different models on the NFCorpus. + (resp. -) denotes a significant performance gain (resp. degradation) against BM25-RM3 (p-value < 0.01). Best performances are highlighted in bold.

[Figure 2: MAP on all query fields against the number of queries used in training, for BM25, BM25-RM3, DRMM, Conv-KNRM and UMLSRank. Markers indicate when an NLTR model has enough training queries to achieve a statistically significant improvement over BM25-RM3 (p-value < 0.05).]

Is it useful for ranking to use both words and medical concepts? KTRel outperforms all the NLTR baselines on all metrics with statistical significance. The fact that KTRel also outperforms both KTRel-W and KTRel-C provides empirical evidence that IR in specialized domains can benefit from combining pre-trained concept representations with pre-trained word representations.

Can KTRel outperform a strong BoW baseline? KTRel achieves statistical significance against BM25 with RM3 query expansion on most metrics. The overall ranking is largely improved by KTRel: +33.9% w.r.t. MAP. The notable exceptions are nDCG@5 and nDCG@10: even if KTRel outperforms BM25-RM3 in terms of nDCG@5 (+1.2%) and nDCG@10 (+4.4%), it does not achieve statistical significance. Interestingly, KTRel does achieve statistical significance on P@5 (+16.3%) and P@10 (+26.5%). The difference between P@k and nDCG@k is that precision only looks at the proportion of relevant documents, whereas nDCG@k puts more emphasis on the ranking itself and takes into account the relevance levels of the documents. Therefore, we can conclude that even if KTRel is able to retrieve more relevant documents in the top-k results, BM25-RM3 is still a strong baseline when it comes to the ranking of the top-k documents.

How does BERT perform for IR in specialised domains? BERT performs worse than BM25 despite its success in several NLP tasks [6]. Because BioBERT outperforms BERT by a wide margin, we can conclude that, when using a language model in a specialized domain, it is essential to pre-train the model on text from the same domain.

Can baseline NLTR models outperform a strong BoW baseline? First, we notice that the DUET and KNRM models perform worse than BM25. The reason is probably that these models were developed on much larger datasets [18, 29] than the NFCorpus. Second, DRMM performs slightly better than BM25 (+6.7% w.r.t. MAP, +3.3% w.r.t. P@5 and +0.5% w.r.t. nDCG@5) but it does not manage to outperform BM25-RM3. Finally, Conv-KNRM is the only NLTR baseline that manages to outperform BM25 and BM25-RM3 w.r.t. MAP, Precision and Recall (but not w.r.t. nDCG). These results empirically confirm that NLTR models have not achieved significant breakthroughs in IR [9].

Do transformer encoders provide useful representations for IR in specialised domains? KTRel-W and KTRel-C perform similarly to the best NLTR baseline. Moreover, the fact that these models rely on a simple cosine similarity between the query and the document representations empirically demonstrates that transformer encoders do produce useful representations for IR.

How do NLTR models affect the recall? Since the number of documents is limited in the NFCorpus, we do not rely on a re-ranking strategy based on BM25 [31]. Therefore the recall of the NLTR models is not upper-bounded by the recall of BM25. The results indicate that the gain of KTRel in terms of recall is significant compared to BM25 (+78.6%) and BM25-RM3 (+36.3%). This happens because BoW models can only retrieve documents that contain terms of the query, whereas NLTR models do not have this restriction.

How much data is needed to outperform BM25-RM3 with a neural network? As we can see in Figure 2, the number of queries required for an NLTR model to outperform the BM25 or BM25-RM3 baselines varies depending on the model under consideration. On the NFCorpus, about 200 queries are required by DRMM to obtain results comparable to those of BM25. This is due to the fact that DRMM is a model with very few parameters (≈ 450) and therefore does not need a lot of data to converge. However, and for the same reasons, DRMM does not benefit from a lot of training data and does not even outperform the BM25-RM3 reference model when given more training queries. The Conv-KNRM model manages to outperform BM25 and BM25-RM3, but about 3,000 training queries are required for Conv-KNRM to outperform BM25-RM3 on the NFCorpus. It seems that less data (≈ 1,500 queries) is needed for the KTRel model. This suggests that the use of concepts can be useful in resource-constrained scenarios. These results also confirm that training an NLTR model on a collection containing only a few hundred queries is a very difficult task. This may explain why significant breakthroughs have yet to be achieved by NLTR models on standard IR collections, which contain only a few hundred queries at best and a few dozen at worst.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we propose KTRel: a transformer-based NLTR model that uses both words and concepts for IR in specialized domains. We empirically demonstrate that adding concepts to a neural learning-to-rank model is useful for IR in the medical domain. We show that transformer encoders provide effective sequence representations for IR. We also empirically confirm that BM25 with RM3 query expansion is still a strong baseline, especially with respect to high-precision metrics. As future work, we plan to evaluate KTRel on more collections and other specialized domains. To make our model scalable to larger collections, we will adapt it to learn word-based and concept-based sparse representations compatible with an inverted index, as suggested by Zamani et al. [31].

REFERENCES
[1] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl_1, D267–D270.
[2] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.
[3] Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In ECIR. 716–722.
[4] Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. 2018. Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search. In WSDM '18. ACM, 126–134. https://doi.org/10.1145/3159652.3159659
[5] Jeffrey Dalton, Laura Dietz, and James Allan. 2014. Entity Query Feature Expansion Using Knowledge Base Links. In SIGIR '14. ACM, 365–374. https://doi.org/10.1145/2600428.2609628
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
[7] Lorraine Goeuriot, Liadh Kelly, Wei Li, Joao Palotti, Pavel Pecina, Guido Zuccon, Allan Hanbury, Gareth J. F. Jones, and Henning Müller. 2014. ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014. Sheffield, United Kingdom.
[8] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM. ACM, 55–64.
[9] Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2019. A Deep Look into Neural Ranking Models for Information Retrieval. arXiv preprint arXiv:1903.06902.
[10] Jiafeng Guo, Yixing Fan, Xiang Ji, and Xueqi Cheng. 2019. MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching. In SIGIR '19. ACM, 1297–1300. https://doi.org/10.1145/3331184.3331403
[11] Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR 2018. 873–876. https://doi.org/10.1145/3209978.3210065
[12] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR 2015. http://arxiv.org/abs/1412.6980
[13] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.
[14] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
[15] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. arXiv preprint arXiv:1805.07591.
[16] Yuanhua Lv and ChengXiang Zhai. 2009. A comparative study of methods for estimating query language models with pseudo feedback. In CIKM. 1895–1898.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111–3119.
[18] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW. 1291–1299.
[19] Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, and Nathalie Souf. 2017. Learning Concept-Driven Document Embeddings for Medical Information Search. In AIME 2017. 160–170. https://doi.org/10.1007/978-3-319-59758-4_17
[20] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. 2005. Terrier Information Retrieval Platform. In ECIR. Springer, 517–519.
[21] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog 1, 8.
[22] Stephen Robertson and Steve Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In SIGIR. 232–241.
[23] Ying Shen, Yang Deng, Min Yang, Yaliang Li, Nan Du, Wei Fan, and Kai Lei. 2018. Knowledge-aware Attentive Neural Network for Ranking Question Answer Pairs. In SIGIR. 901–904.
[24] Luca Soldaini and Nazli Goharian. 2016. QuickUMLS: a Fast, Unsupervised Approach for Medical Concept Extraction. In MedIR Workshop, SIGIR.
[25] Luca Soldaini and Nazli Goharian. 2017. Learning to Rank for Consumer Health Search: a Semantic Approach. In ECIR. Springer, 640–646.
[26] Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, et al. 2013. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In CLEF. Springer, 212–231.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al. 2017. Attention Is All You Need. In NIPS. 5998–6008.
[28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
[29] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In SIGIR. ACM, 55–64.
[30] Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. 2019. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. In SIGIR 2019. 1129–1132. https://doi.org/10.1145/3331184.3331340
[31] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In CIKM. ACM, 497–506.
[32] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv preprint arXiv:1506.06724.
[33] Guido Zuccon, Bevan Koopman, et al. 2018. Payoffs and Pitfalls in Using Knowledge Bases for Consumer Health Search. Information Retrieval Journal, 1–45.
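The Knowledge step's annotation strategy (each n-gram mapped to its top-K candidate concepts) can be sketched with a toy dictionary matcher. The concept dictionary and concept identifiers below are hypothetical stand-ins: in the paper, this role is played by QuickUMLS over the UMLS Metathesaurus.

```python
def annotate(tokens, concept_dict, max_n=3, top_k=8):
    """Attach up to top_k candidate concepts to every n-gram found in the dictionary."""
    annotations = []
    for n in range(max_n, 0, -1):                # scan longer n-grams first
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in concept_dict:
                # several candidates are kept (K = 8 in the paper) to handle
                # n-grams that are ambiguous in the knowledge base
                annotations.append((ngram, concept_dict[ngram][:top_k]))
    return annotations

# Hypothetical concept identifiers, not real UMLS CUIs:
concept_dict = {
    "plant-based diets": ["C_VEG_DIET", "C_DIET"],
    "lives": ["C_LIFE", "C_LIFESPAN"],
}
query = "how plant-based diets may extend our lives".split()
print(annotate(query, concept_dict))
```

Keeping several candidates per n-gram defers disambiguation to the downstream model, which can learn to down-weight irrelevant candidates.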
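The four operations of the Transformer step can be sketched in a few lines of numpy. This is a minimal single-head sketch with random stand-in weights: it omits the residual connections, layer normalization and multi-head projections of the full transformer encoder [27]. The dimensions follow the experimental setup (embedding size 200; the FFN size is one of the paper's candidate values), while the sequence length is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, seq_len = 200, 200, 5          # d follows the paper; weights below are random stand-ins

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """(1) X holds the embedded input sequence; (2) self-attention computes
    context-aware representations; (3) a position-wise ReLU FFN transforms them;
    (4) an element-wise sum over positions yields one fixed-size vector."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d)) @ V            # scaled dot-product attention
    H = np.maximum(0.0, A @ W1 + b1) @ W2 + b2       # position-wise feed-forward network
    return H.sum(axis=0)                             # fixed-size sequence encoding

X = rng.standard_normal((seq_len, d))                # stand-in for embedded words/concepts
params = [rng.standard_normal(s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
print(encode(X, *params).shape)   # (200,)
```

The same encoder structure is applied to word sequences and concept sequences, giving the q_w, d_w, q_c and d_c vectors used in Equation (1).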
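Equations (1) and (2) translate directly into code. The sketch below uses tiny stand-in vectors in place of the transformer encodings q_w, d_w, q_c and d_c; the margin of 1 matches Equation (2).

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rel(q_w, d_w, q_c, d_c, a, b):
    """Eq. (1): linear combination of word-based and concept-based similarities."""
    return a * cos(q_w, d_w) + b * cos(q_c, d_c)

def margin_ranking_loss(rel_pos, rel_neg, margin=1.0):
    """Eq. (2): hinge loss pushing rel(Q, D+) above rel(Q, D-) by the margin."""
    return max(0.0, margin - rel_pos + rel_neg)

# Toy encodings: identical word vectors, orthogonal concept vectors
q_w = d_w = np.array([1.0, 0.0])
q_c, d_c = np.array([1.0, 0.0]), np.array([0.0, 1.0])
score = rel(q_w, d_w, q_c, d_c, a=1.0, b=1.0)   # word term contributes 1, concept term 0
```

In training, a and b are learned parameters and the loss is computed over (query, relevant document, non-relevant document) triples.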
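The evaluation metrics used in the experiments are standard; the paper computes them with pytrec-eval [11], but pure-Python reference versions make the definitions explicit. The nDCG below uses one common formulation (graded gains with a log2 position discount); trec_eval-style tools may differ in details.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranking, gains, k):
    """gains maps each document to its graded relevance level (0 if absent)."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranking, relevant = ["d1", "d2", "d3", "d4"], {"d1", "d3"}
print(precision_at_k(ranking, relevant, 2))            # 0.5
print(round(average_precision(ranking, relevant), 3))  # 0.833
```

MAP is then the mean of average_precision over all queries, and Recall at depth 1,000 is the fraction of relevant documents appearing in the top 1,000.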
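The contrast drawn between P@k and nDCG@k is easy to make concrete: two rankings can place the same number of relevant documents in the top k (identical P@k) while ordering them very differently (different position-discounted gain). A small binary-relevance illustration:

```python
import math

def dcg_at_k(rels, k):
    # rels: relevance of each retrieved document, in ranked order
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels[:k], start=1))

early = [1, 1, 0, 0, 0]   # both relevant documents at ranks 1-2
late  = [0, 0, 0, 1, 1]   # same documents pushed down to ranks 4-5

# P@5 is 2/5 for both, but the discounted gain differs:
print(round(dcg_at_k(early, 5), 2))   # 1.63
print(round(dcg_at_k(late, 5), 2))    # 0.82
```

This is why a model can improve P@k significantly while its nDCG@k gain over a strong baseline remains modest.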
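The recall argument comes down to exact term matching: a BoW model gives a document a nonzero score only if it shares terms with the query, so the introduction's vocabulary-mismatch example can never be retrieved. A minimal illustration with toy term sets (not the actual index):

```python
def bow_overlap(query_terms, doc_terms):
    # Exact-matching models (BM25-like) weight only the terms
    # shared between the query and the document
    return len(query_terms & doc_terms)

query = {"plant-based", "diets", "extend", "lives"}
relevant_doc = {"methionine", "restriction", "cancer", "life-span", "extension"}

print(bow_overlap(query, relevant_doc))   # 0: the document never enters the ranking
```

NLTR models score documents through dense representations instead, so such documents remain reachable, which is consistent with the large recall gains observed for all NLTR models in Table 1.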