<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Turning on a DIME: Estimating Dimension Importance for Dense Information Retrieval</article-title>
        <subtitle>Extended Abstract</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research Council</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Dense Information Retrieval approaches are considered state-of-the-art and are based on projecting queries and documents into a latent space, where each dimension encodes a latent characteristic of the text. In this paper, we enunciate the Manifold Clustering (MC) Hypothesis: projecting queries and documents onto a subspace of the original representation space can improve retrieval effectiveness. Based on the MC hypothesis, we define the Dimension IMportance Estimators (DIMEs). DIMEs operate on the query representation to estimate the expected importance of each dimension and can be used to truncate the representation to only the most important dimensions. We describe two DIMEs, one based on the response generated by a Large Language Model (LLM) and one that relies on the user's active feedback. Our experiments show that the LLM-based DIME enables performance improvements of up to +11.5% (moving from 0.675 to 0.752 nDCG@10) compared to the baseline methods using all dimensions. Even more impressively, the DIME based on active feedback allows us to outperform the baseline by up to +0.224 nDCG@10 points (+58.6%, moving from 0.384 to 0.608).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) systems have benefited from the emergence of pretrained Large
Language Models (LLMs), leading to the development of new systems with improved retrieval
effectiveness over the previous state-of-the-art IR systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These new IR systems leverage
neural networks to acquire a comprehensive understanding of documents and queries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Among them, dense IR systems rely on learning semantic representations of queries and
documents, called contextualised word embeddings. These representations aim at better
encoding the relevance of documents to queries. In dense IR systems, both query and document
texts are embedded into the same latent representation space, characterised by a lower
dimensionality yet a denser representation than in traditional IR systems. When queries and
documents are encoded as multidimensional vectors, the different dimensions of the embeddings
represent features that the model has learned to be important for representing the textual
content in the latent space. Each dimension of the vector may correspond to a specific aspect,
and the values along those dimensions measure the importance or presence of those features in
a given query or document. Ad hoc retrieval in this setting requires identifying the document
embeddings nearest to the query embedding in the latent space and subsequently ranking
them according to the chosen similarity measure.
      </p>
      <p>We conjecture that it is possible to find a subspace of the original latent space that best
represents the query and the associated relevant documents. Thus, we formulate the following
Manifold Clustering hypothesis (MC hypothesis) for dense IR systems:</p>
      <sec id="sec-1-1">
        <p><italic>High-dimensional representations of queries and documents relevant to them often lie in a query-dependent lower-dimensional manifold of the representation space.</italic></p>
        <p>If our MC hypothesis holds, there is a query-dependent, low-dimensional manifold in the
latent space where retrieval is more effective, since the query and its relevant documents are
closer than in the original latent space. In other terms, we assume it is possible to select a
subset of the dimensions of the latent space that is optimal for representing the query and the
documents, and to discard the remaining ones. This assumption corresponds to restricting our
search to linear subspaces of the original latent space, one for each query.</p>
        <p>In this work, we describe two methods to estimate which dimensions to retain and which
ones to discard; we call them Dimension IMportance Estimators (DIMEs). Thorough
experimentation of the proposed DIMEs with state-of-the-art dense IR systems on various TREC
collections shows impressive performance improvements: up to +0.126 (+52.8%, moving from
0.238 to 0.364) in AP and +0.224 (+58.6%, moving from 0.384 to 0.608) in nDCG@10.</p>
        <p><bold>1.1. The Dimension Importance Estimation Framework</bold></p>
        <p>According to the MC hypothesis, by reducing the number of dimensions considered when
computing the similarity between dense query and document representations, we can improve
the retrieval results. Importantly, this does not mean that we train a representation model with
fewer dimensions: we take a complete representation and remove some of its dimensions
at query time. The task is then to determine which dimensions to remove. To this
end, we employ heuristics, which we refer to as Dimension IMportance Estimators (DIMEs).
A DIME is a function that takes in input the representation q ∈ ℝ<sup>d</sup> of a query and – possibly
– some additional information, and outputs a real number for each dimension i of q describing
its importance. Given a generic DIME, we compute the projection of the query on the top k
dimensions. In practical terms, this corresponds to setting to 0 the d − k dimensions that are not
among the top k ones according to the DIME scores. Finally, we use the novel representation of
the query to rank the documents, leaving the original representations of the documents unaltered.
This operationalisation allows for seamless integration of DIMEs in already deployed retrieval
pipelines: there is no need to re-index the collection; it is sufficient to operate on the
query representations only.</p>
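        <p>As a minimal sketch of this operationalisation (not the authors' released implementation; the function and variable names are illustrative), the top-k truncation of the query and the subsequent ranking can be written as:</p>
        <preformat>
```python
import numpy as np

def apply_dime(q, scores, k):
    """Zero out all but the k dimensions of q with the highest DIME scores."""
    q_proj = np.zeros_like(q)
    top_k = np.argsort(scores)[-k:]   # indices of the k most important dimensions
    q_proj[top_k] = q[top_k]          # the remaining d - k dimensions are set to 0
    return q_proj

def rank(q_proj, doc_matrix):
    """Rank documents by inner product with the truncated query.

    Document vectors are left untouched, so no re-indexing is needed."""
    return np.argsort(-(doc_matrix @ q_proj))
```
        </preformat>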
        <p>LLM DIME. LLMs are the current state of the art for generating documents. Therefore,
given a query, we harness their power to generate an artificial document that can be used
to determine which dimensions of q are the most important. In more detail, we employ a
state-of-the-art LLM to generate an answer in response to the query. We are not interested
in investigating whether the returned answer is correct, as it will not be presented to the user but
used only for computing the DIME. To avoid introducing any form of bias, we do not perform
any prompt engineering: we directly input the verbatim query to the LLM, without any form
of preprocessing, granting the highest possible reproducibility. Once the text in response to
the query has been generated by the LLM, we compute its representation a in the latent space.
Then, the DIME based on LLM feedback is defined as ψ<sub>LLM</sub>(i) = q<sub>i</sub> · a<sub>i</sub>: the dimension
importance is given by the product of the i-th dimensions of the representations of the query
and the LLM-generated answer.</p>
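        <p>Assuming the query and answer embeddings are already available (how they are produced depends on the dense model in use), the LLM DIME reduces to an element-wise product followed by the top-k truncation; a hedged sketch:</p>
        <preformat>
```python
import numpy as np

def llm_dime(q, a):
    """LLM-based DIME: the importance of dimension i is q_i * a_i,
    where q is the query embedding and a is the embedding of the
    answer the LLM generated for the verbatim query text."""
    return q * a

# Keep only the k = 2 dimensions of q with the highest scores.
q = np.array([0.9, -0.1, 0.4, -0.7])
a = np.array([0.8, 0.3, -0.2, -0.6])
scores = llm_dime(q, a)            # element-wise products
keep = np.argsort(scores)[-2:]     # top-2 dimensions
q_trunc = np.zeros_like(q)
q_trunc[keep] = q[keep]            # dimensions 0 and 3 survive here
```
        </preformat>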
        <p>Active-Feedback DIME. This DIME builds upon the LLM DIME by replacing the
document generated by the LLM with an actual, human-assessed, relevant document. This
importance estimator is not a suitable option in an offline scenario, as it requires knowing,
for each query, at least one relevant document. Nevertheless, it can be particularly effective
in online situations. Let us thus assume to have access to a relevant document
in response to a query, and let s be its representation in the latent space. The DIME based on
Active Feedback is defined as ψ<sub>AF</sub>(i) = q<sub>i</sub> · s<sub>i</sub>: the weight of each dimension
is the product of the i-th dimension of the relevant document representation and the i-th
dimension of the query representation.</p>
        <p>While this DIME has a specific area of application, i.e., real-time retrieval, it is also effective in
showing the power of DIMEs in identifying the optimal dimensions. In turn, it represents a sort
of middle ground between the superior performance of the oracle DIME and the performance
of the other, more practical DIMEs.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Results</title>
      <p>
        In our experimental analysis<sup>1</sup>, we examine three dense retrieval models: ANCE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
Contriever [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and TAS-B [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In terms of datasets, we consider four experimental collections:
TREC Deep Learning ‘19 (DL ‘19) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], TREC Deep Learning ‘20 (DL ‘20) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Deep Learning
Hard (DL HD) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and TREC Robust ‘04 (RB ‘04) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To instantiate the DIME based on LLMs,
we used GPT4 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>[Figure: AP as a function of the fraction of retained dimensions (0.2 to 1).]</p>
      <sec id="sec-2-1">
        <p><sup>1</sup>Source code available at: https://github.com/guglielmof/DIME-SIGIR-2024</p>
        <p>Table 1 shows the performance achieved if we retain a varying fraction of the representation
dimensions based on the two DIMEs described before.</p>
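        <p>Such a sweep over the retained fraction can be simulated as follows (an illustrative sketch with hypothetical names, not the released code):</p>
        <preformat>
```python
import numpy as np

def sweep_retained(q, scores, doc_matrix, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Rank documents while retaining a varying fraction of the query dimensions."""
    d = q.shape[0]
    rankings = {}
    for frac in fractions:
        k = max(1, int(round(frac * d)))   # number of dimensions to keep
        keep = np.argsort(scores)[-k:]
        q_proj = np.zeros_like(q)
        q_proj[keep] = q[keep]             # zero out the d - k least important dims
        rankings[frac] = np.argsort(-(doc_matrix @ q_proj))
    return rankings
```
        </preformat>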
        <p>Concerning the DIME based on an LLM (ψ<sub>LLM</sub>), we notice that, on ANCE, the improvement
ranges from +0.005 for AP on DL ‘20 to +0.020 for nDCG@10 on DL ‘19 – while the improvement
is present, it is not statistically significant. On the contrary, for both Contriever and TAS-B,
we observe an impressive improvement over the baseline. Indeed, the improvement for
Contriever ranges from +0.023 (+9.55%) (Average Precision (AP) for RB ‘04) up to +0.077 (+11.5%)
in the case of nDCG@10 for DL ‘19. For TAS-B, on the other hand, the improvement ranges from
+0.021 (+8.96%) in the case of AP for DL HD to +0.053 (+11.2%) for DL ‘19. The analysis highlights
the large impact of using DIMEs for zero-shot application of IR models: on the
RB ‘04 collection, in almost all scenarios there is a significant improvement over the baseline
for both Contriever and TAS-B.</p>
        <p>We now consider the scenario in which the user provides us with some feedback, using the DIME
ψ<sub>AF</sub>. To simulate such feedback, for each query, we randomly pick a document with maximum
relevance among those annotated for the query. First of all, it is interesting to notice that in all
scenarios there is an improvement over the baseline. In particular, in the case of Contriever
and TAS-B, the improvement is significant (and very large), regardless of the collection or
evaluation measure considered. The maximum improvement is observed on DL HD, where
Contriever and TAS-B reach an impressive improvement in nDCG@10 of +0.220 (+57.7%) and
+0.225 (+58.6%), respectively. ANCE, on the other hand, remains the most challenging model,
with improvements that are not significant, although they are quite large in some cases (e.g.,
+0.056 of nDCG@10 on DL HD).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and Future Work</title>
      <p>This paper introduces the MC hypothesis for the latent space learned by dense IR neural models:
“high-dimensional representations of queries and documents relevant to them often lie in a
query-dependent lower-dimensional manifold of the representation space”. According to this
hypothesis, for a given query there is a subspace of the learned representation space where the
representations of relevant documents tend to cluster closer around the query representation.
To address the task of finding such a space, we define the problem of Dimension Importance
Estimation and a novel class of models, the DIMEs. Given a dense IR model and a query, a DIME
identifies the most important dimensions to induce the optimal document ranking. We propose
a DIME that exploits a pseudo-relevant document generated by an LLM, which allows us to gain
+11.5% in the best scenario, moving from 0.675 to 0.752 nDCG@10. We also propose an
active-feedback DIME that, by using a single relevant document, is capable of largely improving
the retrieval performance of dense IR models. The improvement is as large as +52.8% (moving
from 0.238 to 0.364 AP) and +58.6% (moving from 0.384 to 0.608 nDCG@10).</p>
      <p>Among future developments, we plan to tackle the automatic selection of the optimal number
of dimensions to retain. Additionally, we plan to explore DIMEs based on other signals,
such as previous utterances in a conversational search scenario or query reformulations.
Finally, we plan to develop DIMEs based on linear combinations of the dimensions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Dimension importance estimation for dense information retrieval</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), July 14-18, 2024, Washington D.C., USA</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          , in:
          <source>Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</source>
          , Minneapolis, MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), ACL, 2019, pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <article-title>Sparse, dense, and attentional representations for text retrieval</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>329</fpage>
          -
          <lpage>345</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00369. doi:10.1162/tacl_a_00369.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations, ICLR 2021</source>
          , Virtual Event, Austria, May 3-7,
          <year>2021</year>
          , OpenReview.net, 2021. URL: https://openreview.net/forum?id=zeFrfgyZln.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Towards unsupervised dense information retrieval with contrastive learning</article-title>
          ,
          <source>CoRR abs/2112.09118</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.09118. arXiv:2112.09118.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Efficiently teaching an effective dense retriever with balanced topic aware sampling</article-title>
          , in: F. Diaz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          (Eds.),
          <source>SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , ACM, 2021, pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          . URL: https://doi.org/10.1145/3404835.3462891. doi:10.1145/3404835.3462891.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2019 deep learning track</article-title>
          ,
          <source>CoRR abs/2003.07820</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2003.07820. arXiv:2003.07820.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2020 deep learning track</article-title>
          ,
          <source>CoRR abs/2102.07662</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2102.07662. arXiv:2102.07662.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mackie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>How deep is your learning: the DL-HARD annotated deep learning dataset</article-title>
          , in: F. Diaz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          (Eds.),
          <source>SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , ACM, 2021, pp.
          <fpage>2335</fpage>
          -
          <lpage>2341</lpage>
          . URL: https://doi.org/10.1145/3404835.3463262. doi:10.1145/3404835.3463262.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2004 robust retrieval track</article-title>
          ,
          <year>2005</year>
          . doi:10.6028/NIST.SP.500-261.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>ChatGPT [large language model]</source>
          , accessed December
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>