<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MDCU: A Multi-Dimensional Cumulative Utility Metric for Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Luigi De Faveri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kalervo Järvelin</string-name>
          <email>kalervo.jarvelin@tuni.fi</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tampere University</institution>
          ,
          <addr-line>Tampere</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information Retrieval (IR) effectiveness metrics typically assume that a relevant document fully satisfies the user's information need. However, this assumption becomes inadequate when such information needs are faceted or comprise multiple subtopics, with each document addressing only a subset of them. Additionally, ranked lists may contain multiple documents focusing on the same subtopics, leading to content redundancy while neglecting other aspects of the information need. Search results whose top-ranked documents cover diverse subtopics are generally more desirable than those offering overlapping content on limited facets. The Multi-Dimensional Cumulated Utility (MDCU) metric, introduced by Järvelin and Sormunen, addresses this issue by incorporating content overlap into the evaluation of novelty and diversity. While their work demonstrated MDCU's conceptual foundation and illustrated it with a simplified example, its empirical application has yet to be explored. In this study, we empirically evaluate the practical stability of MDCU using publicly available TREC test collections. Moreover, we examine its relationship with the well-established α-nDCG metric and present a Python implementation of MDCU to support its adoption in future evaluation settings. Our experimental results reveal a strong positive correlation between MDCU and α-nDCG, indicating that both metrics consistently capture similar performance trends across IR systems. Moreover, compared to α-nDCG, MDCU demonstrates greater statistical power, identifying up to nine times more statistically significant differences between system pairs.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation Metrics</kwd>
        <kwd>Multidimensional Cumulated Utility</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Traditional Information Retrieval (IR) evaluation metrics, such as those in the Cumulated Gain family [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
are based on mono-dimensional relevance judgments that assume queries are mono-thematic and focus
solely on topical relevance. This abstraction supports efficient offline evaluation but fails to capture
the complexity of real-world information needs, which often span multiple subtopics and involve
factors such as novelty, redundancy, and document accessibility [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. To address these limitations,
multi-dimensional evaluation measures have been proposed [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        A recent development in this area is the Multi-Dimensional Cumulated Utility (MDCU) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
accounts for thematic diversity, content overlap, and document-level attributes (e.g., language,
complexity, recency). MDCU operates under four assumptions: i) An information need might be multi-thematic,
and documents might satisfy one, many, or none of such themes; ii) The relevance of a document to
each sub-theme of the information need can be multi-graded; iii) When traversing the ranked list of
documents, the user experiences a gain that decreases in proportion to the information accrued up to that
point: i.e., a partially relevant document inspected after a highly relevant one contributes less to the
user’s total gain; iv) The contribution of the document to the user’s utility gain depends on its attributes,
e.g., the language, complexity, recency [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. While the theoretical foundations of MDCU have been
established, its empirical performance remains unexplored. In this work, we operationalise MDCU as
a practical evaluation metric, apply it to TREC Web Diversity Track collections, compare it with the
established α-nDCG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and release a public implementation1.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background on Multi-Dimensional Evaluation Measures</title>
      <p>Assume an IR system has produced a ranked list of documents RL = {d_1, ..., d_n} for a given information
need. We consider multi-theme information needs, meaning each information need can be split into
themes t ∈ T. Therefore, the relevance judgement for the document d_i is a vector (r_{i,1}, ..., r_{i,|T|}) where
element r_{i,t} describes the relevance of d_i to theme t. The r_{i,t} value can be binary or graded.</p>
      <p>
        The Multi-Dimensional Cumulated Utility (MDCU) framework introduced in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] assesses cumulative
gain by considering multi-dimensional relevance judgments and usability attributes of the documents
retrieved by the system in an IR search task. Therefore, document d_i is associated with a vector of
usability attributes (a_{i,1}, ..., a_{i,k}), with a_{i,j} ∈ [0, 1] for 1 ≤ j ≤ k. The algorithm takes as input the ranked list of
search results, denoted as RL, and a discounting parameter b, accounting for the overlap. The cumulated
relevance vector CR is initialized as the 0-vector of length |T|. For each document d_i ∈ RL, MDCU defines
the attribute factor A_i = ∏_{j=1}^{k} a_{i,j} that describes its usability. After each document's inspection, the
cumulated relevance vector CR is updated considering the multi-dimensional relevance of the document
d_i. The contribution of the i-th document combines its relevance to the various sub-themes and weighs
it by the usability attribute. Furthermore, the contribution on theme t is discounted by the cumulated
contribution CR_t accrued until that point. The MDCU is the sum of all cumulated contributions CR_t across
the relevance themes in T.</p>
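      <p>To make the cumulation procedure concrete, the following Python sketch mirrors the loop described above. It is an illustration rather than the reference implementation: the function name, the input layout, and the geometric overlap penalty b ** cr[t] are our assumptions; the exact functional form of the discount is the one defined in [7] and provided in the released repository.</p>
      <preformat>
def mdcu(rel, attrs, b=0.5):
    """Illustrative sketch of the MDCU cumulation loop (not the reference
    implementation). rel[i][t] is the graded relevance of the i-th ranked
    document to theme t; attrs[i] lists its usability attributes in [0, 1];
    b is the discounting parameter accounting for overlap."""
    n_themes = len(rel[0]) if rel else 0
    cr = [0.0] * n_themes              # cumulated relevance vector, one cell per theme
    for i, doc_rel in enumerate(rel):
        # attribute factor: product of the document's usability attributes
        a_i = 1.0
        for a in attrs[i]:
            a_i = a_i * a
        for t in range(n_themes):
            # contribution on theme t, weighted by usability and discounted by the
            # contribution already accrued on that theme (a geometric penalty
            # b ** cr[t] is assumed here; the exact form is defined in [7])
            cr[t] = cr[t] + a_i * doc_rel[t] * (b ** cr[t])
    # MDCU is the sum of the cumulated contributions across all themes
    return sum(cr)
      </preformat>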
    </sec>
    <sec id="sec-3">
      <title>3. Challenges and Solutions of the MDCU</title>
      <p>
        Collections for Computing MDCU. Identifying a suitable test collection to empirically validate
MDCU represents a preliminary experimental challenge. As noted by Järvelin and Sormunen [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], no
collection contains document annotations for multi-theme relevance and usability. We stress that, at
the moment of the experiments, the literature lacks a suitable benchmark test collection that assesses
both multi-dimensional relevance themes and usability attributes. Consequently, we focus on the
multi-theme evaluation, leaving the investigation of the role of the usability attributes, e.g., scores that
simulate the credibility, virality, or sensitivity of retrieved documents, as future work. We
remark that, as noted in the seminal MDCU paper by Järvelin and Sormunen [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], not considering
the usability attributes in the MDCU analysis means that the measure only accounts for the impact
of the themes’ relevance, and each document is used equally by the user. In detail, we consider the
TREC Web collections spanning 2009 to 2012 and use MDCU to evaluate systems submitted to the Web
Diversity cat-B task [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11, 12</xref>
        ]. Within the Web Diversity challenge, relevance judgments for the
documents are provided considering distinct subtopics relevant to the queries, thus representing themes’
relevance. The objective of the diversity task was to produce a ranked list of pages that collectively
offer comprehensive coverage for a query, minimizing excessive redundancy.
      </p>
      <p>
        Normalizing The MDCU Score. As noted by Clarke et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], determining the optimal run to
normalize α-nDCG in the multi-theme scenario is an NP-hard problem. The same challenge holds
for the MDCU computation. Indeed, finding such a run would require evaluating all possible ranking
permutations to find the one that maximizes the final MDCU score. Specifically, for a list of n documents,
this entails considering the n!(n − 1)! rankings. To make the problem computationally tractable and
avoid heuristics that might lead to inconsistent values, we propose two strategies to project MDCU
scores in standard intervals: Z-Score standardization and MinMax normalization. These approaches
have been demonstrated to map the values of a stochastic variable in equivalent intervals in [13].
      </p>
      <sec id="sec-3-1">
        <title>1https://github.com/Kekkodf/MDCUEval</title>
        <p>We follow Webber et al. [14] to apply Z-score standardization. In detail, given S the set of systems
to evaluate, a query q, and denoting MDCU_{s,q} the MDCU score of system s ∈ S on query q, to
standardize the MDCU values we compute, across the systems under evaluation, the observed mean
MDCU μ_q = Σ_{s ∈ S} MDCU_{s,q} / |S| and standard deviation σ_q = stdev({MDCU_{s,q}, ∀ s ∈ S}). The
standardized MDCU score of system s on query q is computed as:</p>
        <p>zMDCU(s, q) = (MDCU_{s,q} − μ_q) / σ_q.   (1)</p>
        <p>
          To map the values of the MDCU in the interval [0, 1], we employ the MinMax normalization. Thus, for
a query q, we first compute the minimum and maximum MDCU scores observed for that query across
the retrieval systems as min_q = min_{s ∈ S} MDCU_{s,q} and max_q = max_{s ∈ S} MDCU_{s,q}. The MinMax
normalized MDCU for a system s on query q is computed as:
        </p>
        <p>minmaxMDCU(s, q) = (MDCU_{s,q} − min_q) / (max_q − min_q).   (2)</p>
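        <p>Both projections can be computed per query directly from the matrix of per-system MDCU scores. The following numpy-based sketch illustrates Equations (1) and (2); the array layout and function names are ours.</p>
        <preformat>
import numpy as np

def z_score_mdcu(scores):
    """Z-score standardization per query (Eq. 1). scores has shape
    (n_systems, n_queries): scores[s, q] is the MDCU of system s on query q."""
    mu = scores.mean(axis=0)             # observed mean per query
    sigma = scores.std(axis=0, ddof=1)   # observed standard deviation per query (sample std assumed)
    return (scores - mu) / sigma

def minmax_mdcu(scores):
    """MinMax normalization per query (Eq. 2), mapping scores into [0, 1]."""
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    return (scores - lo) / (hi - lo)
        </preformat>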
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>
        We compare  -nDCG and MDCU when evaluating the systems submitted to the TREC Web Diversity
cat-B track. To compute α-nDCG, we use the pyndeval package of the ir_measures Python library2.
The α value has been maintained as the default specified in the package, i.e., α = 0.5. To ensure
transparency and reproducibility, we provide the MDCU code and results in the online repository.
Assessing the MDCU Stability. We present the analysis concerning the stability of MDCU compared
to α-nDCG on the TREC Web Diversity cat-B3. For the comparison cut-off points k, we select k = 5
and k = 20, which correspond to the measurement standard adopted in the original challenges of
the Web Diversity track [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11, 12</xref>
        ]. Due to space limitations, we discuss here only the results for
k = 5, but a more comprehensive discussion is provided in the paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and in the repository. We
analyze the correlation and agreement between different runs by computing Pearson's ρ and Kendall's
τ correlation coefficients [15, 16] for pairs of evaluation measures. The ideal outcome of the stability
analysis is to ensure that α-nDCG and MDCU positively correlate (indicating that the two measures
are related) while still capturing distinct aspects of the multi-dimensional evaluation, as testified by the
absence of pathologically high correlation.
      </p>
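      <p>Both coefficients can be computed with scipy.stats over the per-system scores obtained under the two measures; the short sketch below illustrates the procedure on hypothetical values (the arrays are purely illustrative placeholders).</p>
      <preformat>
from scipy import stats

# Per-system average scores under the two measures, aligned by system.
# The values below are illustrative placeholders, not actual results.
alpha_ndcg = [0.21, 0.35, 0.28, 0.41]
mdcu_norm = [0.18, 0.39, 0.25, 0.44]

pearson_r, _ = stats.pearsonr(alpha_ndcg, mdcu_norm)
kendall_tau, _ = stats.kendalltau(alpha_ndcg, mdcu_norm)
print(f"Pearson = {pearson_r:.2f}, Kendall tau = {kendall_tau:.2f}")
      </preformat>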
      <p>Measuring the MDCU Statistical Power. In addition to the stability analysis of the MDCU
framework, we conduct the ANalysis Of VAriance (ANOVA) [17] and the Siegel-Tukey test [18] using
the Pingouin package [19] to assess the concordance between α-nDCG and MDCU, as proposed
in [20, 21, 22]. In detail, the concordance measures proposed by [20, 21, 22] consider pair-wise
comparisons of systems carried out in different experimental settings, in our case, using different measures.
Such measures consider two aspects of a system-system pair-wise comparison: “statistical significance”
and “directional agreement”. The first dimension categorizes system pair comparisons as Active (A) if
both evaluation measures detect statistically significant differences between the systems, Mixed (M) if
only one measure identifies significance, and Passive (P) if neither measure finds a significant difference
between systems. The second axis assesses whether the measure yields consistent rankings, classifying
them as Agreements (A) when both measures consider the same system to be better and Disagreements
(D) when rankings conflict. Combining these dimensions results in the following six concordance
measurements: Active Agreement (AA), Mixed Agreement (MA), Passive Agreement (PA), Active
Disagreement (AD), Mixed Disagreement (MD), and Passive Disagreement (PD), each capturing
different relationships between the pairs of systems analysed. Moreover, we employ the Conclusion Bias4 to
quantify the proportion of conflicting outcomes discovered where the two evaluation measures lead to
opposite findings, i.e., assessing instances where one measure identifies a model as the statistically best.</p>
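      <p>The six concordance classes can be derived mechanically from the per-measure significance outcomes and the signs of the per-measure score differences. The helper below sketches this classification; the function name and the 0.05 significance threshold are our assumptions for illustration.</p>
      <preformat>
def concordance(p1, p2, delta1, delta2, alpha=0.05):
    """Classify a system-system comparison observed under two measures.
    p1, p2: p-values of the significance test under measure 1 and measure 2;
    delta1, delta2: signed score differences (system A minus system B)."""
    sig1 = p1 &lt; alpha
    sig2 = p2 &lt; alpha
    if sig1 and sig2:
        activity = "Active"
    elif sig1 or sig2:
        activity = "Mixed"
    else:
        activity = "Passive"
    # directional agreement: both measures prefer the same system
    agreement = "Agreement" if delta1 * delta2 &gt;= 0 else "Disagreement"
    return activity + " " + agreement    # e.g. "Active Agreement" (AA)
      </preformat>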
      <sec id="sec-4-1">
        <title>2https://github.com/terrierteam/ir_measures</title>
        <p>3Here, we show the results for Pearson's and Kendall's correlation on the runs of the Web Diversity'12 [12]. The results on
the other collections are reported in the repository.
4Ferro and Sanderson [22] call this score Publication Bias, as it would lead to different outcomes being published.</p>
        <p>MDCU Stability Analysis. Figure 4 shows the results of the different runs considering the average
α-nDCG and normalized MDCU on the Web Diversity'12 cat-B collection at k = 5. The correlation
analysis results indicate a Pearson's correlation of 0.96 for both normalization methods with respect to
α-nDCG. On the other hand, Kendall's τ is 0.81 for the Z-score standardized MDCU, increasing to 0.83 with the MinMax
normalization. However, the correlation between the measures remains within a non-pathological
range. This ensures that while the measures are related, they still capture different aspects of retrieval
performance. Since only the top 5 retrieved documents are considered in the evaluation, there is a
limited opportunity for the documents to overlap themes, leading to a higher correlation.
MDCU Statistical Power. Table 1 presents the concordance results for k = 5. The normalized
MDCU measure demonstrates a stronger ability to identify statistically significant differences between
system pairs. Moreover, the number of ADs, representing the most undesirable outcome in the analysis,
consistently remains the lowest, indicating that the two measures rarely produce conflicting rankings
between systems. Notably, the agreement ratio, i.e., the sum of active and passive agreements, shows
strong concordance between the system rankings found in all four collections. Finally, the Conclusion
Bias highlights that the use of MDCU or α-nDCG emphasises their differing evaluation perspectives.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose the implementation of the MDCU framework and evaluate its effectiveness
using the α-nDCG measure as the baseline to assess novelty and diversity with overlapping themes in
system search results. We also performed a statistical analysis of the results obtained in four TREC
collections, investigating the concordance between these two measures. Our findings indicate strong
positive correlations between these metrics at a cut-off of 5, with a slightly lower positive
correlation at a cut-off of 20 due to the higher number of distinct relevance aspects assessed, considering
a larger pool of documents. One study limitation is the absence of an analysis of the documents’ usability
attributes. Since the literature lacks a suitable benchmark, we intend to develop ad hoc collections. The
usability attributes may be derived by employing LLMs to generate the usability aspects of the systems’
retrieved documents, thus simulating, for example, the virality and credibility of the texts.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for readability and spelling checks.
After using this tool, the authors reviewed and edited the content as needed and took full responsibility
for the publication's content.</p>
      <p>[12] C. L. A. Clarke, N. Craswell, E. M. Voorhees, Overview of the TREC 2012 web track, in: E. M. Voorhees, L. P. Buckland (Eds.), Proceedings of The Twenty-First Text REtrieval Conference, TREC 2012, Gaithersburg, Maryland, USA, November 6-9, 2012, volume 500-298 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2012. URL: http://trec.nist.gov/pubs/trec21/papers/WEB12.overview.pdf.
[13] Z. Khasidashvili, J. R. W. Glauert, Discrete normalization and standardization in deterministic residual structures, in: M. Hanus, M. Rodríguez-Artalejo (Eds.), Algebraic and Logic Programming, 5th International Conference, ALP'96, Aachen, Germany, September 25-27, 1996, Proceedings, volume 1139 of Lecture Notes in Computer Science, Springer, 1996, pp. 135-149. URL: https://doi.org/10.1007/3-540-61735-3_9. doi:10.1007/3-540-61735-3_9.
[14] W. Webber, A. Moffat, J. Zobel, Score standardization for inter-collection comparison of retrieval systems, in: S. Myaeng, D. W. Oard, F. Sebastiani, T. Chua, M. Leong (Eds.), Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, ACM, 2008, pp. 51-58. URL: https://doi.org/10.1145/1390334.1390346. doi:10.1145/1390334.1390346.
[15] K. Pearson, VII. Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240-242.
[16] M. G. Kendall, A new measure of rank correlation, Biometrika 30 (1938) 81-93.
[17] A. Rutherford, ANOVA and ANCOVA: a GLM approach, John Wiley &amp; Sons, 2011.
[18] S. Siegel, J. W. Tukey, A nonparametric sum of ranks procedure for relative spread in unpaired samples, Journal of the American Statistical Association 55 (1960) 429-445. URL: https://api.semanticscholar.org/CorpusID:121903915.
[19] R. Vallat, Pingouin: statistics in Python, Journal of Open Source Software 3 (2018) 1026. doi:10.21105/joss.01026.
[20] A. Moffat, F. Scholer, P. Thomas, Models and metrics: IR evaluation as a user process, in: A. Trotman, S. J. Cunningham, L. Sitbon (Eds.), The Seventeenth Australasian Document Computing Symposium, ADCS '12, Dunedin, New Zealand, December 5-6, 2012, ACM, 2012, pp. 47-54. URL: https://doi.org/10.1145/2407085.2407092. doi:10.1145/2407085.2407092.
[21] G. Faggioli, N. Ferro, System effect estimation by sharding: A comparison between ANOVA approaches to detect significant differences, in: D. Hiemstra, M. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, volume 12657 of Lecture Notes in Computer Science, Springer, 2021, pp. 33-46. URL: https://doi.org/10.1007/978-3-030-72240-1_3. doi:10.1007/978-3-030-72240-1_3.
[22] N. Ferro, M. Sanderson, How do you test a test?: A multifaceted examination of significance tests, in: K. S. Candan, H. Liu, L. Akoglu, X. L. Dong, J. Tang (Eds.), WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, ACM, 2022, pp. 280-288. URL: https://doi.org/10.1145/3488560.3498406. doi:10.1145/3488560.3498406.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. L.</given-names>
            <surname>De Faveri</surname>
          </string-name>
          , G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <article-title>Evaluating multi-dimensional cumulated utility in information retrieval</article-title>
          , in: N.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Trotman</surname>
          </string-name>
          , S. Verberne (Eds.),
          <source>Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2025</year>
          , Padua, Italy,
          <source>July 13-18</source>
          ,
          <year>2025</year>
          , ACM,
          <year>2025</year>
          , pp.
          <fpage>2622</fpage>
          -
          <lpage>2626</lpage>
          . URL: https://doi.org/10.1145/3726302.3730191. doi:
          <volume>10</volume>
          .1145/3726302.3730191.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          ,
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          . URL: http://doi.acm.
          <source>org/10</source>
          .1145/582415.582418. doi:
          <volume>10</volume>
          .1145/582415. 582418.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          ,
          <article-title>How many relevances in information retrieval?</article-title>
          ,
          <source>Interact. Comput</source>
          .
          <volume>10</volume>
          (
          <year>1998</year>
          )
          <fpage>303</fpage>
          -
          <lpage>320</lpage>
          . URL: https://doi.org/10.1016/S0953-
          <volume>5438</volume>
          (
          <issue>98</issue>
          )
          <fpage>00012</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <volume>10</volume>
          .1016/S0953-
          <volume>5438</volume>
          (
          <issue>98</issue>
          )
          <fpage>00012</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashkan</surname>
          </string-name>
          ,
          <article-title>A comparative analysis of cascade measures for novelty and diversity</article-title>
          , in: I.
          <string-name>
            <surname>King</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Nejdl</surname>
          </string-name>
          , H. Li (Eds.),
          <source>Proceedings of the Forth International Conference on Web Search and Web Data Mining, WSDM</source>
          <year>2011</year>
          ,
          Hong Kong
          , China, February 9-
          <issue>12</issue>
          ,
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          . URL: https://doi.org/10.1145/1935826.1935847. doi:
          <volume>10</volume>
          .1145/ 1935826.1935847.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Ma,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Result diversification in search and recommendation: A survey</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>5354</fpage>
          -
          <lpage>5373</lpage>
          . URL: https: //doi.org/10.1109/TKDE.
          <year>2024</year>
          .
          <volume>3382262</volume>
          . doi:
          <volume>10</volume>
          .1109/TKDE.
          <year>2024</year>
          .
          <volume>3382262</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vechtomova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Büttcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>MacKinnon</surname>
          </string-name>
          ,
          <article-title>Novelty and diversity in information retrieval evaluation</article-title>
          , in: S. Myaeng,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chua</surname>
          </string-name>
          , M. Leong (Eds.),
          <source>Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2008</year>
          , Singapore,
          <source>July 20-24</source>
          ,
          <year>2008</year>
          , ACM,
          <year>2008</year>
          , pp.
          <fpage>659</fpage>
          -
          <lpage>666</lpage>
          . URL: https://doi.org/10.1145/1390334.1390446. doi:
          <volume>10</volume>
          .1145/1390334. 1390446.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sormunen</surname>
          </string-name>
          ,
          <article-title>A blueprint of IR evaluation integrating task and user characteristics</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>42</volume>
          (
          <year>2024</year>
          )
          <volume>164</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>164</lpage>
          :
          <fpage>38</fpage>
          . URL: https://doi.org/10.1145/3675162. doi:
          <volume>10</volume>
          .1145/ 3675162.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giachanou</surname>
          </string-name>
          , G. Grefenstette,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>An information nutritional label for online documents</article-title>
          ,
          <source>SIGIR Forum 51</source>
          (
          <year>2017</year>
          )
          <fpage>46</fpage>
          -
          <lpage>66</lpage>
          . URL: https://doi.org/10.1145/3190580.3190588. doi:
          <volume>10</volume>
          .1145/3190580. 3190588.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2009 web track</article-title>
          , in: E. M.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          Buckland (Eds.),
          <source>Proceedings of The Eighteenth Text REtrieval Conference</source>
          , TREC 2009, Gaithersburg, Maryland, USA, November
          <volume>17</volume>
          -
          <issue>20</issue>
          ,
          <year>2009</year>
          , volume
          <volume>500</volume>
          -278 of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>2009</year>
          . URL: http://trec.nist.gov/pubs/trec18/ papers/WEB09.OVERVIEW.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          , I. Soboroff,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2010 web track</article-title>
          , in: E. M.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          Buckland (Eds.),
          <source>Proceedings of The Nineteenth Text REtrieval Conference</source>
          , TREC 2010, Gaithersburg, Maryland, USA, November
          <volume>16</volume>
          -
          <issue>19</issue>
          ,
          <year>2010</year>
          , volume
          <volume>500</volume>
          -294 of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>2010</year>
          . URL: https://trec.nist. gov/pubs/trec19/papers/WEB.OVERVIEW.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2011 web track</article-title>
          , in: E. M.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          Buckland (Eds.),
          <source>Proceedings of The Twentieth Text REtrieval Conference</source>
          , TREC 2011, Gaithersburg, Maryland, USA, November 15-18, 2011, volume 500-296 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2011. URL: http://trec.nist.gov/pubs/trec20/papers/WEB.OVERVIEW.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>