<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUDI: Real-Time Learning to Update Dense Retrieval Indices</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sophia Althammer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Vienna</institution>
          ,
          <addr-line>Austria, Karlsplatz 13, Vienna, 1040</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>15</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Dense retrieval models demonstrate great effectiveness gains for retrieval and re-ranking by learning a vector space embedding for the queries and the documents in the corpus [1, 2, 3, 4, 5, 6]. At the same time, dense retrieval improves inference speed at query time with fast approximate nearest neighbor search [7, 8, 9] compared to exact k-nearest neighbor search by moving most of the computational effort to the indexing phase.</p>
        <p>For search indices in production, millions of new data points continuously need to be included in the index in real-time [10, 11]. As new content is constantly indexed, the overall content of the whole corpus shifts gradually, and with it what should be considered relevant for a given query. For example, the global COVID-19 pandemic resulted in an explosion of COVID-19 related websites, news articles and scientific publications [12]. In order to track, understand and search this rapidly growing, novel information, the information retrieval community created a continuously growing research dataset containing COVID-19 related publications [13].</p>
        <p>To be able to find high-quality, relevant, and recent results, the novel content not only needs to be included in the search index, but the indexing model also needs to account for the content shift and update the index in real-time. To incorporate this content shift in real-time in production systems, the dense retrieval model needs to be re-trained and the corpus re-indexed with the re-trained dense retrieval model. Real-time user interactions provide labels for re-training the dense retrieval model in real-time. However, updating the search index in real-time remains an open challenge. In production systems the search indices have a size of up to 100 million terabytes, thus re-indexing the whole corpus is computationally expensive and not feasible in real-time scenarios.</p>
        <p>In this paper we propose the concept RUDI for Real-time learning to Update Dense retrieval Indices with simple transformations. In RUDI a computationally lightweight vector space transformation function <inline-formula><tex-math>f : V_{\mathrm{old}} \rightarrow V_{\mathrm{new}}</tex-math></inline-formula> between the vector embedding space of the previous retrieval model <inline-formula><tex-math>V_{\mathrm{old}}</tex-math></inline-formula> and of the re-trained dense retrieval model <inline-formula><tex-math>V_{\mathrm{new}}</tex-math></inline-formula> is used to transform the vector embeddings of the previous index to the embeddings of the re-trained indexing model. The advantage of RUDI is that the index does not need to be fully re-indexed with the re-trained dense retrieval model, but is updated with a learned, computationally lightweight transformation function. This allows updating the dense retrieval index in real-time.</p>
        <p>First the dense retrieval model is re-trained in real-time with new labels accounting for the shift in the corpus. These new labels are determined by indexing the new content with the original retrieval model and getting implicit feedback through user interaction. Re-indexing the whole corpus with the re-trained dense retrieval model would give the vector embedding space of the re-trained dense retrieval model <inline-formula><tex-math>V_{\mathrm{new}}</tex-math></inline-formula>. To approximate the embeddings in <inline-formula><tex-math>V_{\mathrm{new}}</tex-math></inline-formula>, the transformation function <inline-formula><tex-math>f</tex-math></inline-formula> takes the embedding <inline-formula><tex-math>v_d^{\mathrm{old}} \in V_{\mathrm{old}}</tex-math></inline-formula> of the document <inline-formula><tex-math>d</tex-math></inline-formula> from the previous embedding space as input and outputs the approximated vector space embedding <inline-formula><tex-math>\hat{v}_d</tex-math></inline-formula>. This vector <inline-formula><tex-math>\hat{v}_d</tex-math></inline-formula> approximates the vector space embedding <inline-formula><tex-math>v_d^{\mathrm{new}} \in V_{\mathrm{new}}</tex-math></inline-formula> of document <inline-formula><tex-math>d</tex-math></inline-formula> of the re-trained dense retrieval model. The approximated vector space embedding <inline-formula><tex-math>\hat{v}_d</tex-math></inline-formula> of document <inline-formula><tex-math>d</tex-math></inline-formula> is then the updated embedding in vector space <inline-formula><tex-math>V_{\mathrm{new}}</tex-math></inline-formula>. The transformation function <inline-formula><tex-math>f</tex-math></inline-formula> is learned in real-time on a small, sampled fraction <inline-formula><tex-math>D</tex-math></inline-formula> of the documents in the corpus. For these training documents <inline-formula><tex-math>d \in D</tex-math></inline-formula> the updated vector <inline-formula><tex-math>v_d^{\mathrm{new}} \in V_{\mathrm{new}}</tex-math></inline-formula> of the re-trained dense retrieval model is computed. Then the transformation function is trained on <inline-formula><tex-math>v_d^{\mathrm{old}}</tex-math></inline-formula> and <inline-formula><tex-math>v_d^{\mathrm{new}}</tex-math></inline-formula> with the objective of minimizing the distance between the approximated vector space embedding <inline-formula><tex-math>\hat{v}_d</tex-math></inline-formula> and <inline-formula><tex-math>v_d^{\mathrm{new}}</tex-math></inline-formula>.</p>
      </abstract>
      <kwd-group>
        <kwd>Dense retrieval</kwd>
        <kwd>Real-time update</kwd>
        <kwd>In-production systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
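      <p>The training objective is the distance
<disp-formula><tex-math>\lVert \hat{v}_d - v_d^{\mathrm{new}} \rVert</tex-math></disp-formula>
between the approximated embedding <inline-formula><tex-math>\hat{v}_d</tex-math></inline-formula> and the embedding <inline-formula><tex-math>v_d^{\mathrm{new}}</tex-math></inline-formula> of document <inline-formula><tex-math>d</tex-math></inline-formula> under the re-trained dense retrieval model.</p>
      <p>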
With this learned, lightweight transformation function
the whole index can be updated in real-time while
accounting for the temporal content shift in the corpus.</p>
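      <p>The index update can be illustrated with a minimal sketch, assuming the transformation is a single fully connected layer without activation (an affine map) fitted by least squares, which minimizes the squared Euclidean distance between approximated and re-computed embeddings. The function names, the synthetic data, and the hidden affine shift standing in for the re-trained encoder are hypothetical and not part of RUDI itself.</p>
      <preformat>
```python
import numpy as np


def fit_linear_transform(old_vecs, new_vecs):
    """Least-squares fit of W, b so that old_vecs @ W + b approximates new_vecs."""
    ones = np.ones((old_vecs.shape[0], 1))
    design = np.hstack([old_vecs, ones])  # extra column of ones models the bias
    solution, *_ = np.linalg.lstsq(design, new_vecs, rcond=None)
    return solution[:-1], solution[-1]


def update_index(index_old, W, b):
    """Apply the learned transformation to every embedding of the old index."""
    return index_old @ W + b


rng = np.random.default_rng(0)
dim, n_docs, n_train = 16, 2000, 200

# Old index: one embedding per document (synthetic stand-in).
index_old = rng.normal(size=(n_docs, dim))

# Hypothetical stand-in for the re-trained encoder: a hidden affine shift.
true_W = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
true_b = 0.05 * rng.normal(size=dim)


def reencode(vecs):
    """Pretend to re-encode documents with the re-trained model (plus noise)."""
    return vecs @ true_W + true_b + 0.01 * rng.normal(size=vecs.shape)


# Re-encode only a small sampled fraction of the corpus and fit the map on it.
train_ids = rng.choice(n_docs, size=n_train, replace=False)
W, b = fit_linear_transform(index_old[train_ids], reencode(index_old[train_ids]))

# Update the whole index without re-encoding the remaining documents.
index_updated = update_index(index_old, W, b)

# Compare against the (normally unavailable) noise-free full re-index.
full_reindex = index_old @ true_W + true_b
mean_err = float(np.mean(np.linalg.norm(index_updated - full_reindex, axis=1)))
print("mean approximation error:", mean_err)
```
      </preformat>
      <p>In this toy setting the shift between the two embedding spaces is affine by construction, so a linear map recovers it well; how well such a map approximates the shift between real dense retrieval models is exactly the open question raised here.</p>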
      <p>We plan to first analyze the shift of the vector
embedding space between the previous and the re-trained dense
retrieval model. Furthermore, we plan to study to what
extent we can learn a simple, lightweight transformation
function between the embedding spaces of the previous
and the re-trained dense retrieval model. We will investigate
different transformation functions, from one fully
connected layer to exponential transformation functions, and
compare their approximation performance. We also plan
to investigate how the overall retrieval effectiveness is
influenced by updating the retrieval index with RUDI
compared to re-indexing the whole index.</p>
      <p>As re-indexing the training documents <inline-formula><tex-math>d \in D</tex-math></inline-formula> for
training the transformation function in real-time is
computationally expensive, we plan to analyze the trade-off
between the number of training documents and the overall
retrieval quality on the updated index. Furthermore, we
will investigate different strategies for sampling the
training documents from the overall index. We plan to
compare random sampling with strategies that aim to
sample documents with maximally orthogonal
embeddings, and to compare the effectiveness of
the transformation functions trained with the different
sampling strategies. Finally, we plan to compare the speed
of updating the dense retrieval index
with different numbers of training samples for the
transformation function against the speed of re-indexing the whole index.</p>
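      <p>The two sampling strategies can be contrasted with a small sketch. It assumes one possible reading of "maximal orthogonal embeddings": a greedy selection that repeatedly adds the document whose embedding has the smallest maximal absolute cosine similarity to the documents chosen so far. The function names and the synthetic embeddings are hypothetical.</p>
      <preformat>
```python
import numpy as np


def sample_random(embeddings, k, rng):
    """Uniform random sample of k document ids without replacement."""
    return rng.choice(embeddings.shape[0], size=k, replace=False)


def sample_near_orthogonal(embeddings, k, rng):
    """Greedy sample of k ids whose embeddings are as close to orthogonal as possible."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [int(rng.integers(embeddings.shape[0]))]
    # Largest absolute cosine similarity of each document to the selected set.
    max_abs_cos = np.abs(unit @ unit[selected[0]])
    for _ in range(k - 1):
        max_abs_cos[selected] = np.inf  # never pick a document twice
        nxt = int(np.argmin(max_abs_cos))
        selected.append(nxt)
        max_abs_cos = np.maximum(max_abs_cos, np.abs(unit @ unit[nxt]))
    return np.array(selected)


def mean_abs_cos(embeddings, ids):
    """Mean absolute pairwise cosine similarity within a sample (0 means orthogonal)."""
    unit = embeddings[ids] / np.linalg.norm(embeddings[ids], axis=1, keepdims=True)
    sims = np.abs(unit @ unit.T)
    off_diagonal = sims[~np.eye(len(ids), dtype=bool)]
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))  # synthetic stand-in for the index embeddings

rand_ids = sample_random(emb, 20, rng)
orth_ids = sample_near_orthogonal(emb, 20, rng)

print("random sample:         ", mean_abs_cos(emb, rand_ids))
print("near-orthogonal sample:", mean_abs_cos(emb, orth_ids))
```
      </preformat>
      <p>The greedy variant scans the whole index once per selected document, so its cost grows with both the index size and the sample size; this overhead would have to be weighed against any gain in the quality of the learned transformation.</p>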
      <p>One could include additional features in the embedding
space for hyperparameters like date or version, in order
to include the recency of the results in the embedding
space and make additional filter systems redundant.</p>
      <p>Another open challenge is the evaluation of updated
indices. As the query and content distributions shift
gradually in the real-time scenario, evaluation with fixed
test collections fails to account for this shift.
It is therefore an interesting question how to evaluate an
in-production system, for example with A/B testing.</p>
      <p>We conclude that our goal is to update dense retrieval
indices in real-time while incorporating the temporal
content shift. To this end we propose RUDI, which updates
dense retrieval indices in real-time with learned transformations.
We outline the research questions that need to be
investigated to assess the effectiveness and efficiency of RUDI.</p>
      <sec>
        <title>Acknowledgments</title>
        <p>This work was supported by the EU Horizon 2020
ITN/ETN on Domain Specific Systems for Information
Extraction and Retrieval (H2020-EU.1.3.1., ID: 860721).</p>
      </sec>
      <ref-list>
        <ref id="ref10">
          <mixed-citation>[10] Internet Live Stats, Total number of websites, https://www.internetlivestats.com/total-number-of-websites/, 2021. [Online; accessed 17-June-2021].</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>[11] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1998) 107–117. URL: http://dx.doi.org/10.1016/S0169-7552(98)00110-X. doi:10.1016/S0169-7552(98)00110-X.</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>[12] H. Poon, Domain-specific language model pretraining for biomedical natural language processing, https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/, 2020. [Online; accessed 11-June-2021].</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>[13] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, 2020. arXiv:2004.10706.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . URL: https://www.aclweb.org/anthology/D19-1410. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          . URL: https://www.aclweb.org/anthology/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-F.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=zeFrfgyZln.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Efficiently teaching an effective dense retriever with balanced topic aware sampling</article-title>
          ,
          <year>2021</year>
          . arXiv:2104.06967.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20,
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          . URL: https://doi.org/10.1145/3397271.3401075. doi:10.1145/3397271.3401075.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Durme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Complementing lexical retrieval with semantic residual embedding</article-title>
          (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/2004.13969.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:10.1109/TBDATA.2019.2921572.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lindgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Simcha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Accelerating large-scale inference with anisotropic vector quantization</article-title>
          , in: H. Daumé III, A. Singh (Eds.),
          <source>Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>3887</fpage>
          -
          <lpage>3896</lpage>
          . URL: http://proceedings.mlr.press/v119/guo20h.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Fast approximate nearest neighbor search with the navigating spreading-out graph</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>461</fpage>
          -
          <lpage>474</lpage>
          . URL: https://doi.org/10.14778/3303753.3303754. doi:10.14778/3303753.3303754.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>