=Paper= {{Paper |id=Vol-2950/paper-20 |storemode=property |title=RUDI: Real-Time Learning to Update Dense Retrieval Indices |pdfUrl=https://ceur-ws.org/Vol-2950/paper-20.pdf |volume=Vol-2950 |authors=Sophia Althammer |dblpUrl=https://dblp.org/rec/conf/desires/Althammer21 }} ==RUDI: Real-Time Learning to Update Dense Retrieval Indices== https://ceur-ws.org/Vol-2950/paper-20.pdf
RUDI: Real-Time Learning to Update Dense Retrieval
Indices
Sophia Althammer1
1
    TU Wien, Karlsplatz 13, 1040 Vienna, Austria

Keywords
Dense retrieval, Real-time update, In-production systems



DESIRES 2021 – 2nd International Conference on Design of Experimental Search Information REtrieval Systems, September 15–18, 2021, Padua, Italy
sophia.althammer@tuwien.ac.at (S. Althammer)
https://sophiaalthammer.github.io/ (S. Althammer)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

   Dense retrieval models demonstrate great effectiveness gains for retrieval and re-ranking by learning a vector space embedding for the queries and the documents in the corpus [1, 2, 3, 4, 5, 6]. At the same time, dense retrieval improves inference speed at query time with fast approximate nearest neighbor search [7, 8, 9] compared to exact k-nearest neighbor search, by moving most of the computational effort to the indexing phase.
   For search indices in production, there are continuously millions of new data points which need to be included in the index in real-time [10, 11]. With the constant stream of new content to be indexed, the overall content of the whole corpus shifts gradually, and with it what should be considered relevant for a given query. For example, the global COVID-19 pandemic resulted in an explosion of COVID-19 related websites, news articles and scientific publications [12]. In order to track, understand and search this rapidly growing, novel information, the information retrieval community created a continuously growing research dataset of COVID-19 related publications [13].
   To be able to find high-quality, relevant, and recent results, the novel content not only needs to be included in the search index, but the indexing model also needs to account for the content shift and update the index in real-time. To incorporate this content shift in real-time in production systems, the dense retrieval model needs to be re-trained and the corpus re-indexed with the re-trained model. Real-time user interactions provide labels for re-training the dense retrieval model in real-time. However, updating the search index in real-time remains an open challenge: in production systems the search indices reach sizes of up to 100 million terabytes, so re-indexing the whole corpus is computationally expensive and not feasible in real-time scenarios.
   In this paper we propose the concept RUDI for Real-time learning to Update Dense retrieval Indices with simple transformations. In RUDI a computationally lightweight vector space transformation function T : V → V_r between the vector embedding space V of the previous retrieval model and the space V_r of the re-trained dense retrieval model is used to transform the vector embeddings of the previous index into embeddings of the re-trained indexing model. The advantage of RUDI is that the index does not need to be fully re-indexed with the re-trained dense retrieval model; instead, the index is updated with a learned, computationally lightweight transformation function. This allows updating the dense retrieval index in real-time.
   First the dense retrieval model is re-trained in real-time with new labels accounting for the shift in the corpus. These new labels are determined by indexing the new content with the original retrieval model and gathering implicit feedback through user interactions. Re-indexing the whole corpus with the re-trained dense retrieval model would yield the vector embedding space V_r of the re-trained model. To approximate the embeddings in V_r, the transformation function T takes the embedding v^d ∈ V of a document d from the previous embedding space as input and outputs an approximated embedding v_a^d, which approximates the embedding v_r^d ∈ V_r of document d under the re-trained dense retrieval model. The approximated embedding v_a^d of document d then serves as the updated embedding in vector space V_r. The transformation function T is learned in real-time on a small, sampled fraction D of the documents in the corpus. For these training documents d ∈ D the updated vector v_r^d ∈ V_r of the re-trained dense retrieval model is computed. Then the transformation function is trained on v^d and v_r^d with the objective of minimizing the distance between the approximated embedding v_a^d and v_r^d:

   min ‖v_a^d − v_r^d‖.

With this learned, lightweight transformation function the whole index can be updated in real-time while accounting for the temporal content shift in the corpus.
   We plan to first analyze the shift of the vector embedding space between the previous and the re-trained dense retrieval model.
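The concrete form of T is left open here. As a minimal sketch of the update step, the following assumes a purely linear T fitted by least squares on a small sampled fraction of documents; the embeddings are synthetic stand-ins, and all names and dimensions are illustrative, not part of RUDI:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two embedding spaces: V (previous model)
# and V_r (re-trained model). In practice these would come from
# encoding documents with the respective dense retrieval models.
dim = 64
n_docs = 10_000
V = rng.normal(size=(n_docs, dim))       # previous index embeddings v^d

# Simulate the re-trained space as an unknown drift of the old one.
drift = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
V_r = V @ drift                          # embeddings v_r^d after re-training

# Sample a small fraction D of documents and re-index only those
# with the re-trained model (here: look them up in V_r).
sample = rng.choice(n_docs, size=200, replace=False)

# Learn a linear transformation T minimizing ||T(v^d) - v_r^d||
# over the sampled documents, via least squares.
T, *_ = np.linalg.lstsq(V[sample], V_r[sample], rcond=None)

# Update the whole index without full re-indexing.
V_approx = V @ T

# Approximation error of the updated index vs. full re-indexing.
err = np.linalg.norm(V_approx - V_r) / np.linalg.norm(V_r)
print(f"relative approximation error: {err:.4f}")
```

In this synthetic setting the drift is itself linear, so a linear T can recover it almost exactly; with a real re-trained encoder the residual error, and whether a more expressive T pays off, is exactly the open question.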
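Since T is trained only on a small sampled fraction D of the corpus, the sampling strategy matters. One candidate beyond random sampling is to pick documents with maximally orthogonal embeddings, so that D spans the embedding space well. A possible greedy sketch (illustrative only; no particular algorithm is fixed here):

```python
import numpy as np

def sample_orthogonal(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k document indices whose (normalized) embeddings
    are maximally orthogonal to the documents already selected."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # seed with an arbitrary first document
    for _ in range(k - 1):
        # Maximum absolute cosine similarity to any selected document;
        # the next pick minimizes it, i.e. is the most orthogonal.
        sims = np.abs(unit @ unit[selected].T).max(axis=1)
        sims[selected] = np.inf  # never re-pick a selected document
        selected.append(int(np.argmin(sims)))
    return selected

rng = np.random.default_rng(1)
embs = rng.normal(size=(1000, 32))  # stand-in index embeddings
D = sample_orthogonal(embs, k=50)
print(D[:5])
```

Each greedy step scans the whole index, so for production-scale indices one would restrict the scan to a random candidate pool; comparing such strategies against plain random sampling is part of the planned experiments.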
Furthermore we plan to study to what extent we can learn a simple, lightweight transformation function between the embedding space of the previous and the re-trained dense retrieval model. We will investigate different transformation functions, from a single fully connected layer to exponential transformation functions, and compare their approximation performance. We also plan to investigate how overall retrieval effectiveness is influenced by updating the retrieval index with RUDI compared to re-indexing the whole corpus.
   As re-indexing the training documents d ∈ D for training the transformation function in real-time is computationally expensive, we plan to analyze the trade-off between the number of training documents and the overall retrieval quality of the updated index. Furthermore we will investigate different strategies for sampling the training documents from the overall index. We plan to compare random sampling with strategies that aim to sample documents with maximally orthogonal embeddings, and to compare the effectiveness of the transformation functions trained with the different sampling strategies. Furthermore we plan to compare the speed of updating the dense retrieval index with different numbers of training samples for the transformation function against re-indexing the whole index.
   One could also include additional features in the embedding space for metadata like date or version, in order to capture the recency of results in the embedding space and make additional filtering systems redundant.
   Another open challenge is the evaluation of updated indices. As the query and content distributions gradually shift in the real-time scenario, evaluation with fixed test collections fails to account for this shift. It is therefore an interesting question how to evaluate an in-production system, for example with A/B testing.
   We conclude that our goal is to update dense retrieval indices in real-time while incorporating the temporal content shift. To this end we propose RUDI for updating dense retrieval indices with learned transformations in real-time, and we outline the research questions necessary to investigate the effectiveness and efficiency of RUDI.

Acknowledgments

This work was supported by the EU Horizon 2020 ITN/ETN on Domain Specific Systems for Information Extraction and Retrieval (H2020-EU.1.3.1., ID: 860721).

References

 [1] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. URL: https://www.aclweb.org/anthology/D19-1410. doi:10.18653/v1/D19-1410.
 [2] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
 [3] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=zeFrfgyZln.
 [4] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, 2021. arXiv:2104.06967.
 [5] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 39–48. URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.
 [6] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, 2020. URL: http://arxiv.org/abs/2004.13969.
 [7] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data (2019) 1–1. doi:10.1109/TBDATA.2019.2921572.
 [8] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3887–3896. URL: http://proceedings.mlr.press/v119/guo20h.html.
 [9] C. Fu, C. Xiang, C. Wang, D. Cai, Fast approximate nearest neighbor search with the navigating spreading-out graph, Proc. VLDB Endow. 12 (2019) 461–474. URL: https://doi.org/10.14778/3303753.3303754. doi:10.14778/3303753.3303754.
[10] Internet Live Stats, Total number of websites, https://www.internetlivestats.com/total-number-of-websites/, 2021. [Online; accessed 17-June-2021].
[11] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1998) 107–117. URL: http://dx.doi.org/10.1016/S0169-7552(98)00110-X. doi:10.1016/S0169-7552(98)00110-X.
[12] H. Poon, Domain-specific language model pretraining for biomedical natural language processing, https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/, 2020. [Online; accessed 11-June-2021].
[13] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, 2020. arXiv:2004.10706.