
RUDI: Real-Time Learning to Update Dense Retrieval Indices

Sophia Althammer

TU Vienna, Austria, Karlsplatz 13, Vienna, 1040, Austria

2021

15 18

Keywords: Dense retrieval, Real-time update, In-production systems

Dense retrieval models demonstrate great effectiveness gains for retrieval and re-ranking by learning a vector space embedding for the queries and the documents in the corpus [1, 2, 3, 4, 5, 6]. At the same time, dense retrieval improves inference speed at query time with fast approximate nearest neighbor search [7, 8, 9] compared to exact k-nearest neighbor search by moving most of the computational effort to the indexing phase.

For search indices in production, millions of new data points continuously need to be included in the index in real-time [10, 11]. As new content is constantly indexed, the overall content of the whole corpus shifts gradually, and with it what should be considered relevant for a given query. For example, the global COVID-19 pandemic resulted in an explosion of COVID-19 related websites, news articles and scientific publications [12]. In order to track, understand and search this rapidly growing, novel information, the information retrieval community created a continuously growing research dataset containing COVID-19 related publications [13].

To be able to find high-quality, relevant, and recent results, the novel content not only needs to be included in the search index, but the indexing model also needs to account for the content shift and update the index in real-time. To incorporate this content shift in real-time in production systems, the dense retrieval model needs to be re-trained and the corpus re-indexed with the re-trained dense retrieval model. Real-time user interactions provide labels for re-training the dense retrieval model in real-time. However, updating the search index in real-time remains an open challenge. In production systems the search indices can reach a size of up to 100 million terabytes, so re-indexing the whole corpus is computationally expensive and not feasible in real-time scenarios.

In this paper we propose the concept RUDI for Real-time learning to Update Dense retrieval Indices with simple transformations. In RUDI a computationally lightweight vector space transformation function $f: V \to V'$ between the vector embedding space $V$ of the previous retrieval model and the vector embedding space $V'$ of the re-trained dense retrieval model is used to transform the vector embeddings of the previous index to the embeddings of the re-trained indexing model. The advantage of RUDI is that the index does not need to be fully re-indexed with the re-trained dense retrieval model; instead, the index is updated with a learned, computationally lightweight transformation function. This allows updating the dense retrieval index in real-time.

First, the dense retrieval model is re-trained in real-time with new labels accounting for the shift in the corpus. These new labels are determined by indexing the new content with the original retrieval model and collecting implicit feedback through user interaction. Re-indexing the whole corpus with the re-trained dense retrieval model would give the vector embedding space $V'$ of the re-trained dense retrieval model. To approximate the embeddings in $V'$, the transformation function $f$ takes the embedding $v_d \in V$ of the document $d$ from the previous embedding space as input and outputs the approximated vector space embedding $\hat{v}'_d = f(v_d)$. This vector approximates the vector space embedding $v'_d \in V'$ of document $d$ under the re-trained dense retrieval model. The approximated vector space embedding $\hat{v}'_d$ of document $d$ is then the updated embedding in vector space $V'$. The transformation function $f$ is learned in real-time on a small, sampled fraction $D$ of the documents in the corpus. For these training documents $d \in D$ the updated vector $v'_d \in V'$ of the re-trained dense retrieval model is computed. Then the transformation function $f$ is trained on $v_d$ and $v'_d$ with the objective of minimizing the distance between the approximate vector space embedding $f(v_d)$ and $v'_d$:

$$\lVert f(v_d) - v'_d \rVert .$$

With this learned, lightweight transformation function the whole index can be updated in real-time while accounting for the temporal content shift in the corpus.
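The training step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation behind RUDI: it assumes the transformation is a single affine map (one fully connected layer), trains it with plain stochastic gradient descent on the squared Euclidean distance, and replaces the two retrieval models with synthetic embeddings. The names `apply_map`, `train_affine_map` and `mean_distance` are hypothetical.

```python
import math
import random

def apply_map(W, b, v):
    """Apply the affine map W v + b to one embedding."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def train_affine_map(old_vecs, new_vecs, epochs=200, lr=0.05):
    """Fit W, b by SGD so that W v + b is close to the re-trained embedding."""
    dim = len(old_vecs[0])
    # Start from the identity: before training, embeddings are left unchanged.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for _ in range(epochs):
        for v, target in zip(old_vecs, new_vecs):
            pred = apply_map(W, b, v)
            for i in range(dim):
                err = pred[i] - target[i]  # gradient of 0.5 * squared distance
                for j in range(dim):
                    W[i][j] -= lr * err * v[j]
                b[i] -= lr * err
    return W, b

def mean_distance(vecs_a, vecs_b):
    """Average Euclidean distance between paired embeddings."""
    return sum(math.dist(a, c) for a, c in zip(vecs_a, vecs_b)) / len(vecs_a)

# Synthetic stand-in for the two models (assumption): the re-trained model's
# embeddings are an unknown affine shift of the previous model's embeddings.
rng = random.Random(0)
dim = 4
true_W = [[(1.0 if i == j else 0.0) + rng.uniform(-0.2, 0.2) for j in range(dim)]
          for i in range(dim)]
true_b = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

index_old = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(500)]
index_new = [apply_map(true_W, true_b, v) for v in index_old]  # full re-index

# Only a small sampled fraction is re-encoded to train the transformation.
sample_ids = rng.sample(range(len(index_old)), 50)
W, b = train_affine_map([index_old[i] for i in sample_ids],
                        [index_new[i] for i in sample_ids])

# The whole index is then updated by applying the learned map.
index_updated = [apply_map(W, b, v) for v in index_old]
print("distance without update:", mean_distance(index_old, index_new))
print("distance after update:  ", mean_distance(index_updated, index_new))
```

In this toy setting the shift is exactly affine, so the learned map recovers it almost perfectly from 50 of the 500 documents. How well a lightweight map approximates the shift between real dense retrieval models is precisely the open question.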

We plan to first analyze the shift of the vector embedding space between the previous and the re-trained dense retrieval model. Furthermore, we plan to study to what extent we can learn a simple, lightweight transformation function between the embedding spaces of the previous and the re-trained dense retrieval model. We will investigate different transformation functions, ranging from a single fully connected layer to exponential transformation functions, and compare their approximation performance. We also plan to investigate how the overall retrieval effectiveness is influenced by updating the retrieval index with RUDI compared to re-indexing the whole index.
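One simple way to compare an index updated by a transformation against a fully re-indexed one is to check how many of the top-k results they share for the same query. The sketch below illustrates this; the overlap measure, the dot-product scoring and the function names `top_k` and `overlap_at_k` are assumptions made for illustration, not an evaluation protocol taken from this work.

```python
def top_k(query, index, k):
    """Return the ids of the k documents with the highest dot-product score."""
    def score(item):
        _, vec = item
        return sum(q * x for q, x in zip(query, vec))
    ranked = sorted(index.items(), key=score, reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def overlap_at_k(query, approx_index, reindexed_index, k):
    """Fraction of the top-k results shared by the two indices."""
    approx = set(top_k(query, approx_index, k))
    exact = set(top_k(query, reindexed_index, k))
    return len(approx & exact) / k

# Toy indices (made-up values): document id -> embedding.
reindexed = {"d1": [1.0, 0.0], "d2": [0.8, 0.2], "d3": [0.0, 1.0], "d4": [-1.0, 0.0]}
approximated = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.7, 0.2], "d4": [-1.0, 0.0]}

query = [1.0, 0.0]
print(overlap_at_k(query, approximated, reindexed, k=2))  # 0.5
```

An overlap of 1.0 means the approximated index returns the same top-k set as the fully re-indexed one. Averaging this value over a query sample gives a label-free proxy that complements effectiveness metrics computed on relevance judgments.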

Because re-indexing the training documents $d \in D$ for training the transformation function in real-time is computationally expensive, we plan to analyze the trade-off between the number of training documents and the overall retrieval quality on the updated index. Furthermore, we will investigate different strategies for sampling the training documents from the overall index. We plan to compare random sampling with strategies that aim to sample documents whose embeddings are maximally orthogonal, and to compare the effectiveness of the transformation functions trained with the different sampling strategies. Finally, we plan to compare the speed of updating the dense retrieval index, for different numbers of training samples for the transformation function, with the speed of re-indexing the whole index.
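The two kinds of sampling strategy can be contrasted with a small sketch. The greedy procedure below is only one possible reading of "maximally orthogonal" and is an assumption: starting from a seed document, it repeatedly adds the document whose largest absolute cosine similarity to the already selected documents is smallest. The names `random_sample` and `greedy_orthogonal_sample` are hypothetical, and embeddings are assumed to be non-zero.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def random_sample(embeddings, n, seed=0):
    """Baseline: pick n document positions uniformly at random."""
    return random.Random(seed).sample(range(len(embeddings)), n)

def greedy_orthogonal_sample(embeddings, n, start=0):
    """Greedily pick documents that are as orthogonal as possible to the
    ones already selected (smallest worst-case |cosine|)."""
    selected = [start]
    while len(selected) < n:
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        best = min(
            candidates,
            key=lambda i: max(abs(cosine(embeddings[i], embeddings[j]))
                              for j in selected),
        )
        selected.append(best)
    return selected

# Toy index (made-up values): documents 0 and 1 are near-duplicates.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
print(greedy_orthogonal_sample(embeddings, 3))  # [0, 2, 3]
print(random_sample(embeddings, 3))
```

The greedy variant compares every candidate with every selected document in each round, so on a large index it would have to run on a pre-sampled candidate pool. That extra cost belongs in the planned speed comparison.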

One could include additional features in the embedding space for hyperparameters like date or version, in order to include the recency of the results in the embedding space and make additional filter systems redundant.
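As a rough illustration of this idea, a recency feature could be appended as one extra dimension of each document embedding, with the query carrying a weight in the same dimension, so that the dot product mixes semantic similarity and recency. Everything in this sketch is an assumption: the linear decay, the `max_age_days` parameter and the function names are hypothetical.

```python
def with_recency(doc_vec, age_days, max_age_days=365.0):
    """Append a recency feature in [0, 1] (1 = brand new) to a document embedding."""
    recency = max(0.0, 1.0 - age_days / max_age_days)
    return list(doc_vec) + [recency]

def with_recency_weight(query_vec, weight):
    """Append the weight that the query gives to the recency dimension."""
    return list(query_vec) + [weight]

def score(query_vec, doc_vec):
    """Dot-product relevance score."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# Made-up example: an older, slightly more similar document and a recent one.
old_doc = with_recency([0.9, 0.1], age_days=300)
new_doc = with_recency([0.8, 0.2], age_days=10)

neutral_query = with_recency_weight([1.0, 0.0], weight=0.0)
recent_query = with_recency_weight([1.0, 0.0], weight=0.5)

print(score(neutral_query, old_doc) > score(neutral_query, new_doc))  # True
print(score(recent_query, new_doc) > score(recent_query, old_doc))    # True
```

With a weight of zero the ranking depends on the semantic dimensions only. A positive weight lets the recent document overtake the older one without a separate filtering stage.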

Another open challenge is the evaluation of updated indices. Since the query and content distributions shift gradually in the real-time scenario, evaluation with fixed test collections fails to account for this shift. It is therefore an interesting question how to evaluate an in-production system, for example with A/B testing.

We conclude that our goal is to update dense retrieval indices in real-time while incorporating the temporal content shift. Therefore we propose RUDI for updating dense retrieval indices with transformations in real-time. We outline which research questions are necessary to investigate the effectiveness and efficiency of RUDI.

Acknowledgments

This work was supported by the EU Horizon 2020 ITN/ETN on Domain Specific Systems for Information Extraction and Retrieval (H2020-EU.1.3.1., ID: 860721).

References

[1] Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410. doi:10.18653/v1/D19-1410.

[2] Karpukhin, Oguz, Min, Lewis, Wu, Edunov, Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.

[3] Xiong, Li, K.-F. Tang, J. Liu, P. N. Bennett, Ahmed, Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=zeFrfgyZln.

[4] Hofstätter, S.-C. Lin, J.-H. Yang, Lin, Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, 2021. arXiv:2104.06967.

[5] Khattab, Zaharia, Colbert: Efficient and effective passage search via contextualized late interaction over bert, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, p. 39–48. URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.

[6] Gao, Dai, Chen, Fan, B. V. Durme, Callan, Complementing lexical retrieval with semantic residual embedding (2020). URL: http://arxiv.org/abs/2004.13969.

[7] Johnson, M. Douze, Jégou, Billion-scale similarity search with gpus, IEEE Transactions on Big Data (2019) 1–1. doi:10.1109/TBDATA.2019.2921572.

[8] Guo, Sun, Lindgren, Geng, Simcha, Chern, Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3887–3896. URL: http://proceedings.mlr.press/v119/guo20h.html.

[9] Fu, Xiang, Wang, Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, Proc. VLDB Endow. 12 (2019) 461–474. URL: https://doi.org/10.14778/3303753.3303754. doi:10.14778/3303753.3303754.

[10] I. L. Stats, Total number of websites, https://www.internetlivestats.com/total-number-of-websites/, 2021. [Online; accessed 17-June-2021].

[11] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1998) 107–117. URL: http://dx.doi.org/10.1016/S0169-7552(98)00110-X. doi:10.1016/S0169-7552(98)00110-X.

[12] H. Poon, Domain-specific language model pretraining for biomedical natural language processing, https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/, 2020. [Online; accessed 11-June-2021].

[13] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, Cord-19: The covid-19 open research dataset, 2020. arXiv:2004.10706.