=Paper=
{{Paper
|id=Vol-2950/paper-20
|storemode=property
|title=RUDI: Real-Time Learning to Update Dense Retrieval Indices
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-20.pdf
|volume=Vol-2950
|authors=Sophia Althammer
|dblpUrl=https://dblp.org/rec/conf/desires/Althammer21
}}
==RUDI: Real-Time Learning to Update Dense Retrieval Indices==
Sophia Althammer, TU Wien, Karlsplatz 13, 1040 Vienna, Austria

'''Keywords:''' Dense retrieval, Real-time update, In-production systems

Dense retrieval models demonstrate great effectiveness gains for retrieval and re-ranking by learning a vector space embedding for the queries and the documents in the corpus [1, 2, 3, 4, 5, 6]. At the same time, dense retrieval improves inference speed at query time with fast approximate nearest neighbor search [7, 8, 9] compared to exact k-nearest neighbor search, by moving most of the computational effort to the indexing phase.

For search indices in production, there are continuously millions of new data points which need to be included in the index in real-time [10, 11]. With constantly new content being indexed, the overall content of the corpus shifts gradually, and with it what should be considered relevant for a given query. For example, the global COVID-19 pandemic resulted in an explosion of COVID-19-related websites, news articles and scientific publications [12]. In order to track, understand and seek this rapidly growing, novel information, the information retrieval community created a continuously growing research dataset containing COVID-19-related publications [13].

To find high-quality, relevant, and recent results, the novel content not only needs to be included in the search index; the indexing model also needs to account for the content shift and update the index in real-time. To incorporate this content shift in real-time in production systems, the dense retrieval model needs to be re-trained and the corpus re-indexed with the re-trained model. Real-time user interactions provide labels for re-training the dense retrieval model in real-time. However, updating the search index in real-time remains an open challenge: in production systems the search indices reach sizes of up to 100 million terabytes, so re-indexing the whole corpus is computationally expensive and not feasible in real-time scenarios.

In this paper we propose the concept RUDI for Real-time learning to Update Dense retrieval Indices with simple transformations. In RUDI, a computationally lightweight vector space transformation function π : V → V_π between the vector embedding space V of the previous retrieval model and the space V_π of the re-trained dense retrieval model is used to transform the vector embeddings of the previous index into embeddings of the re-trained indexing model. The advantage of RUDI is that the corpus does not need to be fully re-indexed with the re-trained dense retrieval model; instead, the index is updated with a learned, computationally lightweight transformation function. This allows updating the dense retrieval index in real-time.

First, the dense retrieval model is re-trained in real-time with new labels accounting for the shift in the corpus. These new labels are determined by indexing the new content with the original retrieval model and collecting implicit feedback through user interaction. Re-indexing the whole corpus with the re-trained dense retrieval model would yield the vector embedding space V_π. To approximate the embeddings in V_π, the transformation function π takes the embedding v_d ∈ V of a document d from the previous embedding space as input and outputs the approximated embedding ṽ_d^π. This vector ṽ_d^π approximates the embedding v_d^π ∈ V_π of document d under the re-trained dense retrieval model, and serves as the updated embedding of document d in V_π. The transformation function π is learned in real-time on a small, sampled fraction D of the documents in the corpus. For these training documents d ∈ D, the updated vector v_d^π ∈ V_π of the re-trained dense retrieval model is computed.
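The update step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the simplest possible transformation function (a single linear map, fit in closed form by least squares), and it simulates the re-trained embedding space as a rotation of the previous space plus noise; all names, sizes and dimensions are illustrative.

```python
import numpy as np

def learn_transformation(v_old, v_new):
    """Fit a linear map W minimizing ||v_old @ W - v_new||_F^2
    over the sampled training documents (least squares)."""
    W, *_ = np.linalg.lstsq(v_old, v_new, rcond=None)
    return W

def update_index(index_old, W):
    """Apply the learned map to every embedding of the previous
    index, approximating the re-trained model's embeddings."""
    return index_old @ W

# Toy corpus: 1000 documents with 64-dim embeddings. The re-trained
# space V_pi is simulated as a random rotation of V plus small noise.
rng = np.random.default_rng(0)
index_old = rng.normal(size=(1000, 64))
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
index_new = index_old @ rotation + 0.01 * rng.normal(size=(1000, 64))

# Learn pi on a small sampled fraction D of the corpus (here 20%),
# then update the whole index without full re-indexing.
sample = rng.choice(len(index_old), size=200, replace=False)
W = learn_transformation(index_old[sample], index_new[sample])
index_updated = update_index(index_old, W)

err = np.linalg.norm(index_updated - index_new) / np.linalg.norm(index_new)
print(f"relative approximation error: {err:.4f}")
```

A single fully connected layer trained by gradient descent, as considered in the paper, would replace the closed-form least-squares fit here; the structure of the update is the same either way: fit π on the sampled document pairs, then apply it to every embedding in the index.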
Then the transformation function is trained on the pairs (v_d, v_d^π) with the objective of minimizing the distance between the approximated embedding ṽ_d^π = π(v_d) and the true embedding v_d^π:

: min_π ‖ṽ_d^π − v_d^π‖

With this learned, lightweight transformation function, the whole index can be updated in real-time while accounting for the temporal content shift in the corpus.

We plan to first analyze the shift of the vector embedding space between the previous and the re-trained dense retrieval model. Furthermore, we plan to study to what extent a simple, lightweight transformation function between the embedding spaces of the previous and the re-trained dense retrieval model can be learned. We investigate different transformation functions, from a single fully connected layer to exponential transformation functions, and compare their approximation performance. We also plan to investigate how the overall retrieval effectiveness is influenced by updating the retrieval index with RUDI compared to re-indexing the whole corpus.

As re-indexing the training documents d ∈ D for training the transformation function in real-time is computationally expensive, we plan to analyze the trade-off between the number of training documents and the overall retrieval quality on the updated index. Furthermore, we investigate different strategies for sampling the training documents from the overall index. We plan to compare random sampling with strategies that aim to sample documents with maximally orthogonal embeddings, and to compare the effectiveness of the transformation functions trained with the different sampling strategies. We also plan speed comparisons between updating the dense retrieval index with different numbers of training samples for the transformation function and re-indexing the whole corpus.

One could further include additional features, such as date or version, in the embedding space, in order to capture the recency of the results and make additional filter systems redundant.

Another open challenge is the evaluation of updated indices. As the query and content distributions gradually shift in the real-time scenario, evaluation with fixed test collections fails to account for this shift. It is therefore an interesting question how to evaluate an in-production system, for example with A/B testing.

We conclude that our goal is to update dense retrieval indices in real-time while incorporating the temporal content shift. To this end we propose RUDI for updating dense retrieval indices with transformations in real-time, and we outline the research questions necessary to investigate the effectiveness and efficiency of RUDI.

''DESIRES 2021 – 2nd International Conference on Design of Experimental Search Information REtrieval Systems, September 15–18, 2021, Padua, Italy. Contact: sophia.althammer@tuwien.ac.at, https://sophiaalthammer.github.io/ (S. Althammer). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).''

==Acknowledgments==
This work was supported by the EU Horizon 2020 ITN/ETN on Domain Specific Systems for Information Extraction and Retrieval (H2020-EU.1.3.1., ID: 860721).

==References==
[1] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410. doi:10.18653/v1/D19-1410.

[2] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.

[3] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=zeFrfgyZln.

[4] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, 2021. arXiv:2104.06967.

[5] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 39–48. URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.

[6] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, 2020. URL: http://arxiv.org/abs/2004.13969.

[7] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data (2019) 1–1. doi:10.1109/TBDATA.2019.2921572.

[8] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3887–3896. URL: http://proceedings.mlr.press/v119/guo20h.html.

[9] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, Proc. VLDB Endow. 12 (2019) 461–474. URL: https://doi.org/10.14778/3303753.3303754. doi:10.14778/3303753.3303754.

[10] Internet Live Stats, Total number of websites, https://www.internetlivestats.com/total-number-of-websites/, 2021. [Online; accessed 17-June-2021].

[11] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1998) 107–117. URL: http://dx.doi.org/10.1016/S0169-7552(98)00110-X. doi:10.1016/S0169-7552(98)00110-X.

[12] H. Poon, Domain-specific language model pretraining for biomedical natural language processing, https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/, 2020. [Online; accessed 11-June-2021].

[13] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, 2020. arXiv:2004.10706.
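The sampling-strategy comparison outlined in the research plan (random sampling versus sampling documents with maximally orthogonal embeddings) could be instantiated, for example, with a greedy farthest-point-style heuristic. The sketch below is hypothetical — the paper does not specify this algorithm — and all names and sizes are illustrative.

```python
import numpy as np

def sample_random(emb, k, rng):
    """Baseline: uniform random sample of k document indices."""
    return rng.choice(len(emb), size=k, replace=False)

def sample_max_orthogonal(emb, k, rng):
    """Greedy heuristic for maximally orthogonal training documents:
    repeatedly pick the document whose largest absolute cosine
    similarity to the already chosen documents is smallest."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    chosen = [int(rng.integers(len(emb)))]        # random seed document
    max_sim = np.abs(unit @ unit[chosen[0]])      # |cos| to chosen set
    for _ in range(k - 1):
        max_sim[chosen] = np.inf                  # never re-pick a document
        nxt = int(np.argmin(max_sim))             # most orthogonal candidate
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, np.abs(unit @ unit[nxt]))
    return np.array(chosen)

# Toy index: 500 documents with 32-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
rand_idx = sample_random(emb, 20, rng)
orth_idx = sample_max_orthogonal(emb, 20, rng)
print(len(set(rand_idx.tolist())), len(set(orth_idx.tolist())))  # 20 20
```

Either index set would then be re-indexed with the re-trained model to form the training pairs for the transformation function, so the comparison directly trades sampling cost against approximation quality.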