<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sean MacAvaney</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Glasgow</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Modern information retrieval systems often rely on multiple components executed in a pipeline. In a research setting, this can lead to substantial redundant computations (e.g., retrieving the same query multiple times for evaluating different downstream rerankers). To overcome this, researchers take cached “result” files as inputs, which represent the output of another pipeline. However, these result files can be brittle and can cause a disconnect between the conceptual design of the pipeline and its logical implementation. To overcome both the redundancy problem (when executing complete pipelines) and the disconnect problem (when relying on intermediate result files), we describe our recent efforts to improve the caching capabilities in the open-source PyTerrier IR platform. We focus on two main directions: (1) automatic implicit caching of common pipeline prefixes when comparing systems and (2) explicit caching of operations through a new extension package, pyterrier-caching. These approaches allow for the best of both worlds: pipelines can be fully expressed end-to-end, while also avoiding redundant computations between pipelines.</p>
      </abstract>
      <kwd-group>
<kwd>Information Retrieval Experiments</kwd>
        <kwd>Caching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Information retrieval systems are now often multi-stage architectures. In large-scale deployments, this
allows separation of efforts: one team may work on the indexer or first-stage retriever architecture,
while other teams may be responsible for rerankers (e.g. learning-to-rank, neural models), ad selection,
answer generation or result presentation. While developed independently, it is imperative that these
pipeline components successfully operate in tandem. For instance, if a reranker is not prepared to
handle the distribution of results provided by the retriever, the engine’s effectiveness will suffer.</p>
      <p>When proposing a new method, it is important for researchers to demonstrate its robustness in
different environments. Robustness is typically shown by evaluating the method on multiple benchmarks,
but this is only one dimension of robustness. We argue that it is also important to show how well a
method performs in various pipelines since this reflects the diversity of environments in which the
method may be deployed.</p>
      <p>However, the available data and tooling often make this type of experimentation challenging. For
instance, some QA datasets (such as 2Wiki [1]) commonly provide retrieved documents from a single
retriever, meaning that the impact of different retrieval systems on answer generation quality is not
systematically ablated. Moreover, precomputing the first-stage results for testing an answer generator or reranker
can lead to an inflexible experimental workflow that prevents the researcher from dog-fooding, i.e.
testing the approach on their own queries.</p>
      <p>A central goal of PyTerrier [2, 3]1 is to provide a shared platform that enables this kind of
experimentation. Components in the platform can be composed into declarative pipelines that clearly define
their constituent components, called Transformers2, and how they interact. This design also allows
individual components to be easily ablated (e.g., swapping out one retriever for another). However, this
approach can incur substantial redundant costs at experimentation time. For instance, when comparing
two answer generators under the same retrieve-and-rerank pipeline, the preceding pipeline is fully
executed twice (even though the results are the same). Similarly, if the researcher seeks to ablate the
retriever choice within the same pipeline, the reranker will redundantly score any document retrieved
by both systems twice, even though they are assigned the same score.</p>
      <p>This paper presents our recent efforts to address these problems while retaining the platform’s
flexibility. We apply two complementary approaches. The first automatically detects common prefixes
in pipelines when running an experiment and only executes the common prefix once (Section 3). The
second approach is to let users explicitly define a caching strategy over individual components (Section 4).
Together, these approaches help reduce the computational overhead of conducting experiments, abiding
by the “reuse” principle of GreenIR [4].</p>
      <p>Naturally, our approaches are not the only ways to perform caching for information retrieval research.3
However, we feel that our approaches hit a sweet spot between functionality and ease-of-use for
day-to-day experimentation. The traditional file-based workflow involves saving intermediate results to
files (e.g., TREC-formatted “result” or “run” files). These result files can be used as starting points
(and are hosted in places like ranxhub [5] or the TREC “Past TREC Results” page4), but result files on
their own do not clearly define their provenance5 and are relatively brittle (e.g., one could mistakenly
use a run file that operated over a topic’s “title” field when the remainder of the pipeline runs over
its “description”). Meanwhile, the TIREx platform [7, 8] performs community-wide pipeline prefix
caching, but requires containerising all components (which is a benefit in terms of reproducibility, but
reduces flexibility). Finally, we note that there is a significant body of literature on the caching of search
engine posting lists [9, 10] or results [11, 12], however, to the best of our knowledge, all existing work
is concerned with caching in deployed information retrieval systems, rather than for the purposes of
aiding experimentation.</p>
      <p>In the remainder of this paper, we provide a background on PyTerrier (Section 2), followed by a
description of our precomputation method for IR experiments (Section 3) and of our explicit caching
transformers (Section 4). Section 5 provides some demonstration experiments of the benefits of
precomputation and caching on an exemplar experiment; Section 6 discusses limitations and future work, while
concluding remarks follow in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Information Retrieval Pipelines and Experiments in PyTerrier</title>
      <p>We provide an overview of the PyTerrier data model and operator language (Section 2.1), as well
as the manner in which evaluation is conducted in PyTerrier (Section 2.2). Together, these provide
the necessary background for understanding the functional aspects of our solutions described in
Sections 3 and 4.</p>
      <sec id="sec-2-1">
        <title>2.1. Declarative Pipelines</title>
        <p>We now provide a short summary of the PyTerrier [2] data model, the families of transformers, and
operators for their combination. Firstly, let Q(qid, query) be a relation type for a set of queries, and
similarly D(docno, text, ...) a relation type for a set of documents, with additional possible attributes.
From these principal types, we can derive types such as R for (i) documents ranked in response to a query,
R(qid, docno, score, rank, ...); and (ii) relevance assessments, RA(qid, docno, label). All relation types are
extensible, in that extra columns can be added. In PyTerrier, these relation types can be instantiated
as Pandas DataFrames, or as lists of dictionaries, both with the required and any optional attributes.
Indeed, the exact choice of instantiation is left to the preferences of the developer of a particular class,
and PyTerrier maps between the DataFrames and lists as needed.
3In fact, PyTerrier previously had a general-purpose caching operator (∼). This operator is now deprecated in favor of these
more robust and flexible caching approaches.
4https://trec.nist.gov/results.html
5We note that efforts have been undertaken to improve the provenance of result files, however [6].</p>
        <p>Transformations between these relation types are called transformers: for instance a transformer
class fulfilling a retrieval role should expect data of type Q, and return type R. Typical families of
pipeline stages can be expressed as mappings between these relations, for instance:
• Retrieval: Q → R
• Reranking: R → R
• Query rewriting: Q → Q
• Document rewriting: D → D
• Pseudo-relevance feedback: R → Q
• Indexing: D → ∅ (a terminal operation)</p>
        <p>Each transformer object, T, operates as a function, i.e. we can obtain a set of results for a given input
Q by invoking T on that input, T(Q).</p>
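        <p>For illustration, a retrieval transformer can be invoked directly on a set of queries instantiated as a
Pandas DataFrame (a minimal sketch; bm25 here stands for any transformer of type Q → R):
import pandas as pd
# a set of queries of type Q, with the required qid and query attributes
queries = pd.DataFrame([
    {'qid': 'q1', 'query': 'pipeline caching'},
    {'qid': 'q2', 'query': 'prefix precomputation'},
])
results = bm25(queries) # a frame of type R: qid, docno, score, rank, ...</p>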
        <p>
          As argued above, IR systems are more commonly phrased as pipelines. To combine different
transformers, PyTerrier offers a number of operators defined on transformers that allow them to be succinctly
expressed. For instance, ≫ is known as “then” or “compose”, and is defined as:
(T₁ ≫ T₂)(Q) := T₂(T₁(Q)) (1)
This allows many pipelines of transformers to be created, and easily understood. Indeed, rather than
an imperative programming style, where a series of steps may be executed for a given set of data in
sequence, here the entire pipeline is constructed before execution on a set of queries. The ≫ notation
has been seen to be easily understandable and has been appearing as notation in many IR papers in
recent years. Table 1 summarises all of the operators defined in PyTerrier. Each operator has relational
algebra semantics, as explored in Macdonald and Tonellotto [2].
        </p>
        <p>Table 1: The operators defined on transformers in PyTerrier.
≫ (then): Pass the output from one transformer to the next transformer
% (rank cutoff): Shorten a retrieved results list to the first K elements
+ (linear combination): Sum the query-document scores of the two retrieved results lists
* (scalar product): Multiply the query-document scores of a retrieved results list by a scalar
** (feature union): Combine two retrieved results lists as features
| (set union): Make the set union of documents from the two retrieved results lists
&amp; (set intersection): Make the set intersection of the two retrieved results lists
^ (concatenate): Add the retrieved results list from one transformer to the bottom of the other
∼ (cache): Keep the results of this transformer on disk [now deprecated]</p>
        <p>Of note from Table 1, ≫ and % (rank cutoff) are the most commonly used operators. For example,
we might want to apply rank cutoffs before different reranking stages:
pipe = bm25 % 100 &gt;&gt; MonoT5() % 10 &gt;&gt; DuoT5()</p>
        <p>Within Table 1, ∼ is a caching operator, which is used to cache the results of a retrieval transformer
to disk. This operates such that when a query is executed repeatedly, the results for that query can be
retrieved directly from the cache, in order to speed up retrieval, particularly for experiments where a
first-stage retrieval may be invoked repeatedly for the same query(ies). In brief, this is the only operator
we have been unhappy with, and it is now deprecated for removal in a future release of PyTerrier. The
semantics of how the caching operator is currently defined do not offer sufficiently fine-grained
control - in particular, the type of caching that is most appropriate likely varies according to the family of
transformer, which is not immediately obvious to the cache implementation. For example, a cache for a
retriever should cache on the qid attribute (the primary key of the Q datatype), and return the results for that
query. However, caching on only the qid would result in a rewritten query retrieving the same results as
the original query, so ⟨qid, query⟩ would be a safer cache key. On the other hand, the cache for a
cross-encoder reranker should cache on at least ⟨qid, docno⟩, and, more safely, ⟨qid, query, docno, text⟩ (in
order to rescore when the query or text have been expanded). More generically, the primary keys
and the functional dependencies of the input and output types of a transformer are not exposed by a
given transformer, and would be required for caching to operate properly. Such fine-grained control
over caching cannot be achieved using only a unary caching operator to signify a cached transformer
(indeed, the current caching operator assumes that all cached transformers are retrievers). For this
reason, this paper discusses alternative caching strategies, addressing common experimental use cases
easily, while also offering fine-grained control over caching behaviour.</p>
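        <p>To illustrate the cache-key discussion above, the following minimal in-memory sketch (not the
PyTerrier implementation) keys a retriever cache on ⟨qid, query⟩, so that a rewritten query with the
same qid does not wrongly hit the entry of the original query:
import pandas as pd
retriever_cache = {}
def cached_retrieve(retriever, queries):
    results = []
    for row in queries.to_dict(orient='records'):
        key = (row['qid'], row['query']) # qid alone would be unsafe for rewritten queries
        if key not in retriever_cache:
            retriever_cache[key] = retriever([row])
        results.append(retriever_cache[key])
    return pd.concat(results, ignore_index=True)</p>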
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Declarative Experiments</title>
        <p>PyTerrier also defines an experimentation abstraction for the purposes of evaluating different retrieval
systems (including composed pipelines). In particular, the pt.Experiment() function takes four key
arguments: (i) a list of systems to be compared; (ii) the set of queries on which they should be evaluated
(type Q); (iii) the set of relevance assessments to use to evaluate them (type RA); and (iv) the evaluation
measures to compute. This invokes each system on the specified queries, and applies the specified
evaluation measures on the outcome. An example experiment is shown below, which would evaluate
the impact of the number of documents retrieved by BM25 on the nDCG@10 effectiveness of a pipeline
involving MonoT5 and DuoT5 (recall that % denotes application of a rank cutoff):
pt.Experiment(
    [bm25 % k &gt;&gt; MonoT5() % 10 &gt;&gt; DuoT5() for k in [20, 50, 100, 200]],
    dataset.get_topics('test'),
    dataset.get_qrels('test'),
    [nDCG@10]
)</p>
        <p>The succinctness of the experimentation abstraction demonstrates its utility for researchers - indeed,
we place considerable focus on developing memorable APIs that don’t require researchers to regularly
refer to documentation. It is also more succinct than an imperative workflow - it is a single statement,
with no for loops and no different invocations for different stages of a ranking pipeline. Additional options for
pt.Experiment() allow the calculation of significance tests with respect to a baseline, application of
multiple-testing correction (as recommended by Fuhr [13], Sakai [14]), limiting batch sizes, etc.</p>
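        <p>For example, a sketch of an experiment using some of these additional options (the argument values
here are illustrative):
pt.Experiment(
    [bm25, bm25 &gt;&gt; MonoT5()],
    dataset.get_topics('test'),
    dataset.get_qrels('test'),
    [nDCG@10],
    baseline=0, # significance tests relative to the first system
    correction='bonferroni', # apply multiple-testing correction
    batch_size=100, # process queries in batches
)</p>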
        <p>However, the example above also illustrates one of the challenges with a declarative workflow, in that
the BM25 retriever would be invoked for the topic set on each of the 4 pipelines. This is a considerable
efficiency disadvantage compared to an imperative workflow, whereby a user may gather all BM25
results once, before applying the cutoffs. An alternative formulation is shown below - here BM25 is applied
on the topics before ingestion into the experiment; an Identity transformer6 is used in place of BM25 to
allow the results to be passed through to a rank cutoff:
pt.Experiment(
    [pt.Transformer.identity() % k &gt;&gt; MonoT5() % 10 &gt;&gt; DuoT5() for k in [20, 50, 100, 200]],
    bm25(dataset.get_topics('test')), # &lt;--- BM25 results as input, rather than queries
    dataset.get_qrels('test'),
    [nDCG@10]
)</p>
        <p>So while this is explicit, it somehow feels less appropriate - a reader of the code would not be naturally
drawn to line 3, where bm25 is invoked on the test queries; the clear separation between queries and
systems has been lost. The use of the Identity transformer also feels unnatural, and is likely to confuse readers.</p>
        <p>Instead, in the next section, we discuss a different strategy whereby the same experiment is
efficiently conducted, i.e. without repetitive invocations of BM25.
6The Identity transformer just returns its input, unchanged.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Prefix Precomputation in Comparative Experiments</title>
      <p>PyTerrier’s operator-based language for expressing pipelines can be seen as closer to the conceptual
design that one might write in a paper.7 However, the logical implementation may differ. For example,
consider a retrieval transformer that has a rank cutoff operation (% k) applied. A more
efficient pipeline formulation would be to apply the rank cutoff directly in the retriever instance.8
PyTerrier supports a number of such optional compile operations, which apply a rewriting of the
conceptual pipeline into a more efficient logical variant [2] - i.e. a syntactically different but semantically
equivalent reformulation of a pipeline that executes more quickly.</p>
      <p>In this vein, let us consider an experiment comparing rerankers (A and B) applied to the results of a
retriever R. This would be instantiated as an experiment involving two pipelines R ≫ A and R ≫ B.
When performing a side-by-side evaluation of these pipelines, as mentioned in Section 2.2, there may
be efficiency gains in pre-computing the results of R.</p>
      <p>To this end, PyTerrier now supports prefix precomputation when conducting experiments: here, any
common prefix of all the evaluated pipelines is invoked once, and the results applied to the remainders of
the pipelines. This is shown visually in Figure 1. To expose this functionality to the researcher, we simply
add an optional argument to pt.Experiment(), namely precompute_prefix, as shown below:</p>
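      <p>A sketch of the resulting experiment, following the example of Section 2.2 (and assuming that
precompute_prefix takes a boolean value):
pt.Experiment(
    [bm25 % k &gt;&gt; MonoT5() % 10 &gt;&gt; DuoT5() for k in [20, 50, 100, 200]],
    dataset.get_topics('test'),
    dataset.get_qrels('test'),
    [nDCG@10],
    precompute_prefix=True, # bm25 is invoked only once across the four pipelines
)</p>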
      <p>Our current implementation identifies the longest common prefix (LCP)9 of a set of pipelines - efficient
implementations of this algorithm can be instantiated that only assume that transformers have an
equality property (i.e. we can test whether two transformers are equal).</p>
      <p>
        Formally, let P be a set of transformer pipelines for a given experiment, where each pipeline p ∈ P
consists of different stages p[1] ≫ . . . ≫ p[‖p‖], where the number of stages in a pipeline
is denoted by ‖·‖. Further, let p[i] denote the ith stage of a pipeline, and p[i..j] denote a range of
transformer stages. In applying the LCP algorithm, we identify a common prefix LCP(P), as follows:
LCP(P) = arg max_c { ‖c‖ s.t. p[i] == c[i] ∀p ∈ P, i ∈ 1..‖c‖ } (2)
Using LCP(P), we can identify the remainder pipelines P̂, where each constituent p̂ is the remainder
of the pipeline p starting after the common prefix, i.e. p̂ = p[‖LCP(P)‖+1 .. ‖p‖]. Then, for a given
set of queries, Q, evaluation naturally takes place by obtaining the results on the common prefix, i.e.
R_prefix = LCP(P)(Q), followed by evaluation of each pipeline remainder, i.e. p̂(R_prefix).
      </p>
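      <p>A minimal Python sketch of this procedure, assuming only that pipelines are sequences of stages
that support equality testing:
def longest_common_prefix(pipelines):
    prefix = []
    for stages in zip(*pipelines): # walk stage-by-stage across all pipelines
        if all(stage == stages[0] for stage in stages[1:]):
            prefix.append(stages[0])
        else:
            break
    return prefix

# remainders: each pipeline with the common prefix removed
# remainders = [p[len(prefix):] for p in pipelines]</p>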
      <p>The astute reader may observe that there are possible experiments of n &gt; 2 pipelines, where a
common prefix is shared by only a subset of the pipelines (e.g., pipelines 2..n−1). Our current implementation would not benefit
from such cases, as it does not make use of all possible pipeline prefixes; we leave this further development to future work.
7Indeed, we’ve noted an increasing number of papers using the ≫ notation to indicate composition of pipeline stages.
8This is akin to a SQL query optimisation, where selection operations are moved earlier.
9It has to be a prefix, rather than the more general longest common subsequence, as pipelines are affected by their leftmost
constituent transformation.</p>
      <p>That said, our experience with prefix precomputation thus far has been positive - the functionality
works as expected, with no unexpected corner cases that needed to be resolved. As a result, we
may make this the default setting for pt.Experiment() in the future. Overall, we argue that the
simplicity of prefix precomputation addresses a key efficiency disadvantage of the declarative workflow,
while enabling end-to-end evaluation and retaining legibility of the experiments. It can also be seen
as marked progress in separating the conceptual model of an IR experiment from its logical
implementation. We are not aware of any previous work considering the programmatic decomposition
of IR pipelines in this manner to benefit experimentation.</p>
      <p>For more fine-grained control, in the next section we discuss a different type of caching, where the
researcher wishes to retain full control over what is reused between pipelines or pipeline invocations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Explicit Caching Strategies</title>
      <p>We now describe the explicit caching strategies provided by the pyterrier-caching10 package. Four
strategies are provided to cover a variety of use cases: key-value caching (Section 4.1); caching for
scorers/rerankers (Section 4.2); caching for retrievers (Section 4.3), and caching for indexing operations
(Section 4.4).</p>
      <sec id="sec-4-1">
        <title>4.1. KeyValueCache: Key-Value Caching</title>
        <p>The KeyValueCache is a basic caching strategy that maps one or more “key” columns to one or more
“value” columns. The cache operates on a row-by-row basis, working under the assumption that each
row does not affect the results of other rows. A visual depiction of the cache is given in Figure 2. This
formulation makes the KeyValueCache suitable for operations like Query Rewriting (Q → Q) and
Document Rewriting (D → D). For example, a Doc2Query [15] transformer can be cached as follows:
from pyterrier_caching import KeyValueCache
from pyterrier_doc2query import Doc2Query
dataset = pt.get_dataset('irds:msmarco-passage') # some dataset
model = Doc2Query(append=True)
cache = KeyValueCache('doc2query.cache', model, key='text', value='querygen')

# First index with Terrier: (fills cache with Doc2Query results)
index = pt.terrier.TerrierIndex('doc2query.terrier')
pipeline = cache &gt;&gt; index.indexer()
pipeline.index(dataset.get_corpus_iter())

# Indexing with PISA is now faster, since Doc2Query results are cached
from pyterrier_pisa import PisaIndex
index = PisaIndex('doc2query.pisa')
pipeline = cache &gt;&gt; index.indexer()
pipeline.index(dataset.get_corpus_iter())
10https://github.com/terrierteam/pyterrier-caching</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. ScorerCache: Caching Scorer Results</title>
        <p>Typical scorers (rerankers) operate by independently assigning a new relevance score for each document
under the probability ranking principle [16].11 Given the prominence of this pattern and the need to
re-assign the rank column based on these new scores, the ScorerCache implements this special case
of the general-purpose KeyValueCache. The qid and docno columns serve as the key, and the score
column serves as the value, though this functionality can be overridden. An example demonstrating
the value of the ScorerCache when caching the results of a MonoElectra [20] model follows:
from pyterrier_caching import ScorerCache
from pyterrier_dr import ElectraScorer
from pyterrier_pisa import PisaIndex
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = PisaIndex.from_hf('macavaney/msmarco-passage.pisa')
scorer = dataset.text_loader() &gt;&gt; ElectraScorer()
cached_scorer = ScorerCache('electra.cache', scorer)

# Use the ScorerCache cache object just as you would a scorer
cached_pipeline = index.bm25() &gt;&gt; cached_scorer
cached_pipeline(dataset.get_topics())
cached_pipeline(dataset.get_topics()) # &lt;-- all values are cached

# Will only compute scores for docnos that were not returned by bm25
another_cached_pipeline = index.qld() &gt;&gt; cached_scorer
another_cached_pipeline(dataset.get_topics())</p>
        <p>Implementation Details.</p>
        <p>By default, ScorerCache uses the same SQLite strategy as
KeyValueCache, while also re-assigning the rank column based on the new scores. In some cases,
a large proportion of a corpus is scored – for instance, when exploring the effect of an exhaustive
search strategy for cross-encoders [21]. In this case, a SQLite cache is inefficient because of the high
repetition of document identifiers and other overheads. Therefore an alternative implementation
(DenseScorerCache) is available. DenseScorerCache uses an HDF5 backend, with a separate npids12
file serving as a mapping between the docnos and their corresponding indexes in the storage array.</p>
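        <p>A DenseScorerCache can be constructed in the same manner as a ScorerCache (a sketch, assuming
the constructor mirrors that of ScorerCache):
from pyterrier_caching import DenseScorerCache
# drop-in replacement for the ScorerCache in the example above
dense_cached_scorer = DenseScorerCache('electra.dense.cache', scorer)</p>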
      </sec>
      <sec id="sec-4-3">
        <title>4.3. RetrieverCache: Caching Retriever Results</title>
        <p>Other operations — most notably retrievers — map each input row to multiple output rows. The
RetrieverCache (shown visually in Figure 3) handles caching in this situation. The following code
snippet shows how this cache can save on repetitive calls to a retriever:
from pyterrier_caching import RetrieverCache
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = pt.terrier.TerrierIndex.from_hf('macavaney/msmarco-passage.terrier')
bm25_cache = RetrieverCache('path/to/cache', index.bm25())

bm25_cache(dataset.get_topics())
bm25_cache(dataset.get_topics()) # &lt;-- all values are cached</p>
        <p>Implementation Details. RetrieverCache is implemented using Python’s dbm package. dbm is
a generic key-value database interface that is included in Python’s standard library. The keys to
the DBM database are SHA256-hashed pickles of the key columns, and the values are LZ4-compressed pickles
of the value frames.</p>
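        <p>The following sketch illustrates this storage scheme outside of PyTerrier (details assumed from the
description above; lz4.frame is provided by the lz4 package):
import dbm, hashlib, pickle
import lz4.frame

db = dbm.open('path/to/cache.dbm', 'c') # 'c' creates the database if needed

def put(db, key_obj, value_frame):
    # keys: SHA256 hashes of pickled key values
    key = hashlib.sha256(pickle.dumps(key_obj)).digest()
    # values: LZ4-compressed pickles of the result frames
    db[key] = lz4.frame.compress(pickle.dumps(value_frame))

def get(db, key_obj):
    key = hashlib.sha256(pickle.dumps(key_obj)).digest()
    return pickle.loads(lz4.frame.decompress(db[key]))</p>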
      </sec>
      <sec id="sec-4-4">
        <title>4.4. IndexerCache: Caching Indexing Streams</title>
        <p>In some cases it is beneficial to store an entire sequence of inputs. This is especially true for indexing
operations, such as document encoding with Learned Sparse Retrieval [22]. The following example
shows how to cache SPLADE document representations [23]. Note that unlike other caching operations,
the IndexerCache acts as an indexer:
from pyt_splade import Splade
from pyterrier_caching import IndexerCache
from pyterrier_pisa import PisaIndex
dataset = pt.get_dataset('irds:msmarco-passage')
splade = Splade()
cache = IndexerCache('splade.cache')
cache_pipeline = splade &gt;&gt; cache

# The following line will save the results of splade to splade.cache
cache_pipeline.index(dataset.get_corpus_iter())

# Now you can build multiple indexes over the results of splade without
# needing to re-run it each time
indexer1 = pt.terrier.TerrierIndex('splade.terrier').indexer()
indexer1.index(cache)
indexer2 = PisaIndex('./path/to/index.pisa').toks_indexer()
indexer2.index(cache)</p>
        <p>Note that unlike the other caching components, IndexerCache captures a sequence of
documents, where the order of records is potentially important [24]. Consequently, it does not wrap a
pipeline the way that other components do (i.e., Splade() &gt;&gt; IndexerCache('path') instead of
IndexerCache('path', Splade())). The user may decide that the order is not important; in these
cases, it can also be used as a basic forward index, as it allows for efficient row lookups based on the
docno column (if present).</p>
        <p>Implementation Details. IndexerCache stores the sequence as LZ4-compressed pickles of each
input row. When the cache object is iterated over, the sequence is decompressed row-by-row as a
row generator. The cache also captures the docno column (if present), and stores it in an npids file to
facilitate the forward index functionality.
11Note that this does not apply to some types of scorers, e.g., adaptive rerankers [17] (which add new documents to the pool),
or pairwise [18] and listwise [19] rerankers (for which each score depends on the others present in the pool).
12https://github.com/seanmacavaney/npids</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Other Caching Features</title>
        <p>Cache objects conform to PyTerrier’s Artifact API, which allows them to be shared using HuggingFace,
Zenodo, or other platforms. This can enable the sharing of computational resources across research
groups.
cache.to_hf('username/some-cache') # upload to HuggingFace
cache.to_zenodo() # upload to Zenodo
cache = pt.Artifact.from_hf('username/some-cache') # download from HuggingFace
cache = pt.Artifact.from_zenodo('1234') # download from Zenodo</p>
        <p>All caches support a "temporary" cache mode, where a temporary cache directory is created and
cleaned up when the object is deleted. This mode is selected by omitting the cache path when creating
the cache. We recommend using these as context managers so that the lifetime and cleanup of the
temporary cache is well-defined. For example:
from pyterrier_caching import RetrieverCache
dataset = pt.get_dataset('irds:msmarco-passage/dev/small')
index = pt.terrier.TerrierIndex.from_hf('macavaney/msmarco-passage.terrier')

# construct a temporary retriever cache
with RetrieverCache(retriever=index.bm25()) as bm25_cache:
    bm25_cache(dataset.get_topics())
    # second time faster due to caching
    bm25_cache(dataset.get_topics())
# (temporary cache deleted when the context manager exits)</p>
        <p>All the caches also support a mode where the transformer is not provided. If there is a cache miss and
the transformer is not provided, an exception is raised. Alternatively, the transformer can be constructed
on-demand using the Lazy utility transformer, which only constructs the actual transformer once,
if/when it is invoked. Both are helpful for situations where the user wants to avoid constructing the
transformer object due to the resources that it would consume, e.g., a GPU for a neural scorer. For
example:
from pyterrier_caching import ScorerCache, Lazy
from pyterrier_dr import ElectraScorer

# scorer omitted (raises error on cache miss)
cached_scorer = ScorerCache('electra.cache')

# lazy scorer (only constructed once called)
lazy_scorer = Lazy(lambda: dataset.text_loader() &gt;&gt; ElectraScorer())
lazy_cached_scorer = ScorerCache('electra.cache', lazy_scorer)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Demonstration Experiments</title>
      <p>To demonstrate the benefit of the PyTerrier techniques described in this paper, we examine the response
times of four different experimental settings. Our chosen experiment is based on the example pipelines
discussed in Section 2.2, where we vary the number of documents retrieved by BM25 to determine the
impact on MonoT5 and DuoT5. Such an experiment would use a list of pipelines expressed as follows:
[bm25 % k &gt;&gt; MonoT5() % 10 &gt;&gt; DuoT5() for k in [20, 50, 100, 200]].</p>
      <p>
        The four different experimental settings are: (1) without using any caching, such that BM25 is
executed four times on each query; (2) using precomputation (Section 3), such that BM25 is only
executed once; (3) using a cold ScorerCache around MonoT5, such that MonoT5 is not executed more
than once for a given query/document pair within the experiment; and (4) a re-run of the same setting
where the ScorerCache is hot, such that MonoT5 reuses the results computed in (3). Finally, recall that
DuoT5 is not amenable to caching in these pipelines, as the overall score of a document depends on the
other retrieved documents for that query.
      </p>
      <p>Summary of the four settings:
Setting: (1) (2) (3) (4)
Prefix precomputation: ✗ ✓ ✓ ✓
ScorerCache: ✗ ✗ ✓ (cold) ✓ (hot)
      </p>
      <p>We execute these experiments on a machine with an Intel Xeon Gold 5222 CPU @ 3.80GHz (16 cores)
and an NVIDIA RTX 3090 GPU. We use a Terrier backend for BM25, with the index stored on disk. We
use two indices: the MSMARCO v1 passage corpus using 43 queries from the TREC 2019 Deep Learning
track; and the MSMARCO v2 passage corpus using 53 queries from the TREC 2021 Deep Learning track.
The Jupyter notebooks for executing these experiments can be found on the PyTerrier GitHub repo:
https://github.com/terrier-org/pyterrier/blob/master/examples/notebooks.md.</p>
      <p>The obtained timings are shown in Table 2, allowing the following observations: precomputation
allows reducing the experimental execution time by 8% on MSMARCO v1 and 28% on MSMARCO v2
– the small relative benefit on MSMARCO v1 is due to the small size of the MSMARCO v1 index and
the resulting fast BM25 retrieval (only 3 seconds for 43 queries); caching of MonoT5 results shows a benefit
of 27-41%; re-running using a hot ScorerCache reduced overall time by 50-68% - essentially executing
BM25 once, plus the re-executions of DuoT5. Overall, the table supports the expected benefits of prefix
precomputation and appropriate transformer caching, which are easily accessible through PyTerrier’s
pt.Experiment API and the pyterrier-caching package. Prefix precomputation is more beneficial
for more expensive shared prefixes of transformer pipelines in an experiment.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and Future Work</title>
      <p>We have described two approaches for caching in PyTerrier – a precomputation approach that can be
automatically applied to experiments, and explicit caching components that can be incorporated into
indexing and retrieval pipelines. Although these additions provide a major improvement over the prior
operator-based caching strategy, we see these as promising starting points, rather than final products,
due to several shortcomings.</p>
      <p>
        Although the precomputation approach covers many practical experimental settings, it is not
comprehensive. For instance, consider an ablation experiment where components are progressively added
to a pipeline: (1) A, (2) A ≫ B, (3) A ≫ B ≫ C. The precomputation approach will only precompute
A (since it is common to all three pipelines), even though also precomputing A ≫ B would
benefit both pipelines 2 and 3. Furthermore, the current strategy only supports precomputation across
sequential pipelines; operations that occur as part of linear combinations, set operations, or feature
combinations are not supported. We aim to assess which of these cases can be practically addressed
and efficiently precomputed in future iterations of this feature – potentially in combination with the
pipeline.compile() functionality [2].
      </p>
      <p>The explicit caches also have limitations. Most notably, they rely on direct application by the
researcher. This is by design, since current transformer implementations do not provide sufficient
information to automatically infer the correct caching strategy. In the future, we may enhance the
Transformer API to include this kind of information, e.g. the input and output columns, to ease the
process of identifying the caching strategy to apply.13</p>
      <p>Both precomputation and explicit caching make determinism assumptions; that is, the same input
will yield the same output. This is not always the case, especially for components running on GPUs.14
In some sense, caching reduces the variability in experiments due to this noise. On the other hand, it
means that such variations are not encountered through the course of experimentation, which could
lead the researcher to infer a false sense of stability.
13An added benefit is that this information would also allow the automatic type-checking of pipelines.</p>
      <p>Finally, precomputation is a first study of optimisation across multiple IR pipelines, and can be
seen as similar to multi-query optimisation [25] in database management systems - we believe there
may be other multi-query optimisation techniques from databases that can result in improved IR
experimentation, which we leave to future work.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We presented recent additions to the open-source PyTerrier platform to better facilitate caching. This
involves two approaches: the precomputation of common pipeline prefixes when executing an
experiment and new explicit result caching components. Demonstration experiments on the MSMARCO v1 and
v2 passage corpora concerning a pipeline with BM25, MonoT5 and DuoT5 showed how precomputation
and caching could reduce the execution time of a particular experiment by up to 41% in a cold-cache
setting. We hope that these approaches are intuitive for researchers to use, and will help reduce the
computational cost of running experiments, promote GreenIR principles, and ease collaboration through
shared caches.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank Jan Heinrich Merker for helpful feedback and suggestions on the pyterrier-caching package.
We also thank Andrew Parry for help implementing the longest common prefix algorithm for prefix
computation. Finally, we thank the anonymous reviewers for their detailed and thoughtful feedback.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for: Grammar and spelling check.
After using these tools, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.
14The non-determinism on GPUs arises from the non-deterministic order in which operations are assigned to the GPU, which
can ultimately cause cascading differences in floating point operations.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[4] H. Scells, S. Zhuang, G. Zuccon, Reduce, reuse, recycle: Green information retrieval research, in:
E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, G. Kazai (Eds.), SIGIR ’22: The 45th
International ACM SIGIR Conference on Research and Development in Information Retrieval,
Madrid, Spain, July 11 - 15, 2022, ACM, 2022, pp. 2825–2837. doi:10.1145/3477495.3531766.
[5] E. Bassani, ranxhub: An online repository for information retrieval runs, in: H. Chen, W. E. Duh,
H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan,
July 23-27, 2023, ACM, 2023, pp. 3210–3214. doi:10.1145/3539618.3591823.
[6] T. Breuer, J. Keller, P. Schaer, ir_metadata: An extensible metadata schema for IR experiments,
in: E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, G. Kazai (Eds.), SIGIR ’22: The
45th International ACM SIGIR Conference on Research and Development in Information Retrieval,
Madrid, Spain, July 11 - 15, 2022, ACM, 2022, pp. 3078–3089. doi:10.1145/3477495.3531738.
[7] M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, J. Bevendorf, B. Stein, M. Hagen, M. Potthast,
The information retrieval experiment platform, in: M. Leyer, J. Wichmann (Eds.), Lernen, Wissen,
Daten, Analysen (LWDA) Conference Proceedings, Marburg, Germany, October 9-11, 2023, volume
3630 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 175–178. URL: https://ceur-ws.org/
Vol-3630/LWDA2023-paper16.pdf.
[8] M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorf, B. Stein, M. Hagen,
M. Potthast, The information retrieval experiment platform, in: H. Chen, W. E. Duh, H. Huang,
M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference
on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27,
2023, ACM, 2023, pp. 2826–2836. doi:10.1145/3539618.3591888.
[9] R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, The impact of
caching on search engines, in: Proceedings of the 30th annual international ACM SIGIR conference
on Research and development in information retrieval, 2007, pp. 183–190. doi:10.1145/1277741.
1277775.
[10] C. Macdonald, R. L. Santos, I. Ounis, B. He, About learning models with multiple query-dependent
features, ACM Trans. Inf. Syst. 31 (2013). doi:10.1145/2493175.2493176.
[11] Q. Gan, T. Suel, Improved techniques for result caching in web search engines, in: Proceedings of
the 18th international conference on World Wide Web, 2009, pp. 431–440. doi:10.1145/1526709.
1526768.
[12] E. P. Markatos, On caching search engine query results, Computer Communications 24 (2001)
137–143. doi:10.1016/S0140-3664(00)00308-X.
[13] N. Fuhr, Proof by experimentation? Towards better IR research, in: Proceedings of the 43rd
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 2. doi:10.1145/
3397271.3402426.
[14] T. Sakai, On Fuhr’s guideline for IR evaluation, SIGIR Forum 54 (2021). doi:10.1145/3451964.3451976.</p>
      <p>
[15] R. F. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, CoRR
abs/1904.08375 (2019). URL: http://arxiv.org/abs/1904.08375. arXiv:1904.08375.
[16] S. E. Robertson, The probability ranking principle in IR, Journal of documentation 33 (1977)
294–304. doi:10.1108/eb026647.
[17] S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: M. A.
Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information
&amp; Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 1491–1500.
doi:10.1145/3511808.3557231.
[18] R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with
pretrained sequence-to-sequence models, CoRR abs/2101.05667 (2021). URL: https://arxiv.org/abs/
2101.05667. arXiv:2101.05667.
[19] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is chatgpt good at search?
investigating large language models as re-ranking agents, in: H. Bouamor, J. Pino, K. Bali (Eds.),
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023,
pp. 14918–14937. doi:10.18653/V1/2023.EMNLP-MAIN.923.
[20] R. Pradeep, Y. Liu, X. Zhang, Y. Li, A. Yates, J. Lin, Squeezing water from a stone: A bag of
tricks for further improving cross-encoder effectiveness for reranking, in: M. Hagen, S. Verberne,
C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval
- 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022,
Proceedings, Part I, volume 13185 of Lecture Notes in Computer Science, Springer, 2022, pp. 655–670.
doi:10.1007/978-3-030-99736-6_44.
[21] S. MacAvaney, X. Wang, Online distillation for pseudo-relevance feedback, CoRR
abs/2306.09657 (2023). URL: https://arxiv.org/abs/2306.09657. doi:10.48550/ARXIV.2306.
09657. arXiv:2306.09657.
[22] T. Nguyen, S. MacAvaney, A. Yates, A unified framework for learned sparse retrieval, in: J. Kamps,
L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR
2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III, volume 13982 of Lecture Notes in
Computer Science, Springer, 2023, pp. 101–116. doi:10.1007/978-3-031-28241-6_7.
[23] T. Formal, B. Piwowarski, S. Clinchant, SPLADE: sparse lexical and expansion model for first
stage ranking, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), SIGIR ’21: The
44th International ACM SIGIR Conference on Research and Development in Information Retrieval,
Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2288–2292. doi:10.1145/3404835.
3463098.
[24] L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, A. Shalita, Compressing graphs and
indexes with recursive graph bisection, in: B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal,
D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, ACM, 2016,
pp. 1535–1544. doi:10.1145/2939672.2939862.
[25] P. Roy, S. Sudarshan, Multi-Query Optimization, Springer US, Boston, MA, 2009, pp. 1849–1852.
doi:10.1007/978-0-387-39940-9_239.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sugawara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <article-title>Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning steps</article-title>
          , in: D.
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Bel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Zong (Eds.),
          <source>Proceedings of the 28th International Conference on Computational Linguistics</source>
          ,
          COLING
          <year>2020</year>
          , Barcelona, Spain (Online),
          <source>December 8-13</source>
          ,
          <year>2020</year>
          ,
          <source>International Committee on Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6609</fpage>
          -
          <lpage>6625</lpage>
          . doi:10.18653/V1/2020.COLING-MAIN.580.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative experimentation in information retrieval using PyTerrier</article-title>
          , in:
          K.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Setty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , K. Berberich (Eds.),
          <source>ICTIR '20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval</source>
          , Virtual Event, Norway,
          <source>September 14-17</source>
          ,
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          . doi:10.1145/3409256.3409829.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          , S. MacAvaney, I. Ounis,
          <article-title>PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval</article-title>
          , in: G. Demartini, G. Zuccon,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          , H. Tong (Eds.),
          <source>CIKM '21: The 30th ACM International Conference on Information and Knowledge Management</source>
          , Virtual Event, Queensland, Australia, November 1 -
          5, 2021
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>4526</fpage>
          -
          <lpage>4533</lpage>
          . doi:10.1145/3459637.3482013.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>