<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Soprano</string-name>
          <email>soprano.michael@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Roitero</string-name>
          <email>roitero.kevin@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information retrieval effectiveness evaluation is often carried out by means of test collections. Many works have investigated possible sources of bias in such an approach. We propose a systematic approach to identify bias and its causes, and to remove it, thus enforcing fairness in effectiveness evaluation by means of test collections.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>In Information Retrieval (IR) the evaluation of systems is often
carried out using test collections. Different initiatives, such as TREC,
FIRE, CLEF, etc., implement this setting in a competition scenario. In
the well-known TREC initiative, participants are provided with a
collection of documents and a set of topics, which are representations
of information needs. Each participant can submit one or more runs,
each consisting of a ranked list of (usually) 1000 documents for each
topic. The retrieved documents are then pooled, and expert judges
provide relevance judgements for the pooled ⟨topic, document⟩
pairs. Then, an effectiveness metric (such as AP, NDCG, etc.) is
computed for each ⟨run, topic⟩ pair, and the final effectiveness
score for each run is obtained by averaging its effectiveness over
the set of topics. Finally, the set of runs is ranked in descending
order of effectiveness.</p>
      <p>Different works have investigated possible sources of bias in this
evaluation model by looking at system-topic correlations. In this work
we propose to extend prior work by considering the many
dimensions of the problem, and we develop a statistical model to capture
the magnitude of the effects of the different dimensions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 BACKGROUND AND RELATED WORK</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 HITS Hits TREC</title>
      <p>
        The output of the TREC initiative can be represented as an
effectiveness matrix E as in Table 1, where each si is a system configuration
(i.e., run), each tj is a topic, esi,tj represents the effectiveness (with
a metric such as AP) of the i-th system for the j-th topic, and Es and Et
represent respectively the average effectiveness of a system (with a
metric such as MAP) and the average topic difficulty (with a metric
such as AAP [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ]).
      </p>
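      <p>As a minimal sketch of these quantities (Python with NumPy; the matrix is a random stand-in, not real TREC data), Es and Et are simply the row and column means of E:</p>
      <preformat>
import numpy as np

# Hypothetical stand-in for a real TREC effectiveness matrix E:
# rows are systems (runs), columns are topics, and each cell E[i, j]
# holds the AP of system i on topic j.
rng = np.random.default_rng(0)
E = rng.random((5, 10))

Es = E.mean(axis=1)  # per-system mean over topics: MAP
Et = E.mean(axis=0)  # per-topic mean over systems: AAP (topic ease)
</preformat>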
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>The effectiveness matrix E: systems on the rows, topics on the columns; the last column and row report the marginal means Es (MAP) and Et (AAP).</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>t1</th><th>· · ·</th><th>tn</th><th>Es</th></tr>
          </thead>
          <tbody>
            <tr><td>s1</td><td>es1,t1</td><td>· · ·</td><td>es1,tn</td><td>Es (s1)</td></tr>
            <tr><td>· · ·</td><td>· · ·</td><td>· · ·</td><td>· · ·</td><td>· · ·</td></tr>
            <tr><td>sm</td><td>esm,t1</td><td>· · ·</td><td>esm,tn</td><td>Es (sm)</td></tr>
            <tr><td>Et</td><td>Et (t1)</td><td>· · ·</td><td>Et (tn)</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        To capture the bias of such an evaluation setting, Mizzaro and
Robertson [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] normalise the E matrix, or more precisely each esi,tj,
in two ways: (i) by removing the system effectiveness effect, achieved
by subtracting Es from each esi,tj, and (ii) by removing the topic
effect, achieved by subtracting Et from each esi,tj. After the
normalisation, Mizzaro and Robertson merge the two effectiveness
matrices obtained from (i) and (ii) to form a graph in which: each
link from a system to a topic expresses how easy the topic appears
to the system, and each link from a topic to a system expresses how
effective the system appears to the topic. Then, Mizzaro and Robertson
compute the hubness and authority values of the systems and topics
by running the HITS algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on such a graph; the hubness
value of a system expresses its ability to recognise easy topics,
while the hubness value of a topic expresses its ability to recognise
effective systems. Results of the analysis by Mizzaro and Robertson
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], as well as by Roitero et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], demonstrate that the evaluation
is biased, and in particular that easy topics are better at recognising
effective systems; in other words, to be effective a retrieval system
needs to be effective on the easy topics.
      </p>
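      <p>The following Python sketch illustrates the construction; the rescaling of the normalised matrices to non-negative edge weights is our assumption for this illustration (the exact normalisation in [5] may differ), and the data is again a random stand-in:</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(42)
E = rng.random((5, 10))             # stand-in effectiveness matrix (AP)
Es = E.mean(axis=1, keepdims=True)  # system effectiveness (MAP)
Et = E.mean(axis=0, keepdims=True)  # topic ease (AAP)

def rescale(M):
    # Map to [0, 1] so HITS runs on non-negative weights (our choice).
    return (M - M.min()) / (M.max() - M.min())

S2T = rescale(E - Es)  # system-to-topic links ("the topic looks easy")
T2S = rescale(E - Et)  # topic-to-system links ("the system looks effective")

# Merged bipartite graph: first m nodes are systems, last n are topics.
m, n = E.shape
M = np.block([[np.zeros((m, m)), S2T],
              [T2S, np.zeros((n, n))]])

# Plain HITS power iteration (Kleinberg [3]).
h, a = np.ones(m + n), np.ones(m + n)
for _ in range(100):
    a = M.T @ h
    a /= np.linalg.norm(a)
    h = M @ a
    h /= np.linalg.norm(h)

hub_systems = h[:m]  # ability of a system to recognise easy topics
hub_topics = h[m:]   # ability of a topic to recognise effective systems
</preformat>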
    </sec>
    <sec id="sec-4">
      <title>2.2 HITS Hits Readersourcing</title>
      <p>
        Soprano et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] applied the same HITS-based analysis
described in Section 2.1 to analyse the bias present in
the Readersourcing model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an alternative peer review proposal
that exploits readers to assess paper quality. Due to the lack of real
data, Soprano et al. ran a series of simulations based on synthetic
but realistic user models that simulate readers assessing the quality
of the papers. Their results show that the Readersourcing model
presents some (both good and bad) bias under certain conditions
derived from how the synthetic data is produced; for example:
(i) the ability of a reader to recognise good papers is independent
of whether s/he read papers that on average receive high/low
judgements, and (ii) a paper is able to recognise high/low quality
readers independently of its average score or of its quality.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Breaking Components Down</title>
      <p>
        Breaking down the effect of a single dimension on a complex
system has been widely studied in IR. A problem of particular
interest is to break down the system effectiveness score (such as
AP) into the effect of systems, topics, and system components, like
for example the effect of stemming, query expansion, etc. For this
purpose, Ferro and Silvello [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used Generalised Linear Mixture
Models (GLMM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to break down the effectiveness score of a
system considering its components, Ferro and Sanderson [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
considered the effect of sub-corpora, and Zampieri et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provided a
complete analysis of topic ease and the relative effect of system
configurations, corpora, and interactions between components.
      </p>
      <p>[Figure: y-axis “ρ (Hubness Topic - Avg NDCG)”, ranging from −0.2 to 1.0.]</p>
    </sec>
    <sec id="sec-6">
      <title>3 EXPERIMENTS</title>
      <p>In this paper, we propose to extend results from the related work
(see Sections 2.1 and 2.2) to define a sound and engineered pipeline
to find and correct the bias in the IR effectiveness evaluation setting.
In more detail, we plan to investigate how the specific bias of
effective systems being recognised by easy topics varies when varying
the components of a test collection, such as the system population,
pool depth, etc. Finally, we propose to use a GLMM, as done in the
related work (see Section 2.3), to compute the magnitude of the effect
of the various components on the bias.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Pool Effect</title>
      <p>To investigate the effect of the different pool depths, we plan to
compute, for each effectiveness metric, its value at different
cut-offs. Figure 1 shows a preliminary result: the plot shows, for the
Precision metric, the different cut-off values on the x-axis and, on
the y-axis, the bias value represented by the Pearson’s ρ
correlation between the hubness measure of systems and their average
precision value; this bias represents the fact that effective systems
are recognised by easy topics. As we can see from the plot, there is
a clear trend suggesting that the bias grows together with the pool
depth. The undesired effect that effective systems are mainly those
that work well on easy topics becomes stronger when increasing the
pool depth.</p>
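      <p>A sketch of the computation behind such a plot (Python; the relevance data is synthetic, and the hub_scores helper reuses the HITS construction of Section 2.1, so this illustrates the procedure rather than reproducing the exact experimental code):</p>
      <preformat>
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def hub_scores(E):
    """System hubness via the HITS construction of Section 2.1."""
    Es = E.mean(axis=1, keepdims=True)
    Et = E.mean(axis=0, keepdims=True)
    rescale = lambda M: (M - M.min()) / (M.max() - M.min())
    S2T, T2S = rescale(E - Es), rescale(E - Et)
    m, n = E.shape
    M = np.block([[np.zeros((m, m)), S2T], [T2S, np.zeros((n, n))]])
    h = np.ones(m + n)
    for _ in range(100):
        a = M.T @ h
        a /= np.linalg.norm(a)
        h = M @ a
        h /= np.linalg.norm(h)
    return h[:m]

# Synthetic stand-in: rel[i, j, r] is 1 if the r-th document retrieved
# by system i for topic j is relevant (real data would come from qrels).
rel = rng.integers(0, 2, size=(20, 50, 1000)).astype(float)

for depth in (10, 100, 1000):
    E = rel[:, :, :depth].mean(axis=2)  # Precision at this cut-off
    rho, _ = pearsonr(hub_scores(E), E.mean(axis=1))
    print(f"P@{depth}: rho = {rho:.3f}")
</preformat>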
    </sec>
    <sec id="sec-8">
      <title>3.2 Collection and Corpora Effect</title>
      <p>To investigate the effect of the different collections, we plan to
use different TREC collections: Robust 2003 (R03), 2004 (R04), and
2005 (R05), Terabyte 2004 (TB04) and 2005 (TB05), TREC5, TREC6,
TREC7, TREC8, and TREC2001. Furthermore, we plan to break
down the sub-corpora effect by considering the different corpora
of the collections.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Metric Effect</title>
      <p>
        We will investigate the effect of different evaluation metrics on
the model bias; specifically, we will consider: Precision (P), AP,
Recall (R), NDCG, τAP, RBP, etc. When dealing with the metric
effect, we can consider two approaches in the normalisation step:
remove the average system effectiveness and topic ease, for
example removing MAP and AAP from AP (as done by Mizzaro and
Robertson [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Roitero et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Soprano et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), or try more
complex approaches; in the latter case, we can remove a score
computed on a shallow pool from one computed on a deep pool
(e.g., remove AP@10 from AP@1000) in order to remove the
top-heaviness of a metric, or remove Precision (or Recall) from F1 to
enhance the effect of precision-oriented or recall-oriented systems,
and so on.
      </p>
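      <p>In code, the two normalisation families could look like the following sketch (random stand-in matrices; in practice the AP@k values would be computed from the evaluated runs):</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(1)
ap = rng.random((20, 50))       # AP per (system, topic)
ap_10 = rng.random((20, 50))    # AP@10: shallow cut-off
ap_1000 = rng.random((20, 50))  # AP@1000: deep cut-off

# (i) Remove average system effectiveness (MAP, row means) or
#     average topic ease (AAP, column means), as in Section 2.1.
norm_sys = ap - ap.mean(axis=1, keepdims=True)
norm_topic = ap - ap.mean(axis=0, keepdims=True)

# (ii) Remove the shallow-pool score from the deep-pool score to
#      factor out the top-heaviness of the metric.
norm_depth = ap_1000 - ap_10
</preformat>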
    </sec>
    <sec id="sec-10">
      <title>3.4 System and Topic Population Effect</title>
      <p>
        Another effect we can investigate is that of different system
and topic populations. We can consider, for example, systems
ordered by effectiveness, topics ordered by difficulty, or even
the most representative subset of systems/topics selected according
to the strategy developed by Roitero et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10a">
      <title>3.5 GLMM</title>
      <p>
        Finally, we can develop a GLMM adapting the techniques used in
[
        <xref ref-type="bibr" rid="ref1 ref10 ref2">1, 2, 10</xref>
        ] to study the effect that the different components described
so far (see Sections 3.1–3.4) have on the bias of the model. Thus,
we can define the following GLMM:
      </p>
      <p>
        Bias<sub>ijklm</sub> = Pool<sub>i</sub> + Collection<sub>j</sub> + Corpora<sub>k</sub> + System-subset<sub>l</sub> + Topic-subset<sub>m</sub> + (interactions) + Error.
      </p>
      <p>
        From the above equation, we can compute the Size of Effect index
ω<sup>2</sup>, which is an “unbiased and standardised index and estimates a
parameter that is independent of sample size and quantifies the
magnitude of difference between populations or the relationships
between explanatory and response variables” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such an index
expresses the magnitude of the effect that the different components
of a test collection have on the bias of the model.
      </p>
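      <p>A simplified sketch of this analysis (Python with statsmodels; a fixed-effects ANOVA stands in for the full GLMM, the factor levels are made up, and the bias measurements are synthetic) shows how ω² can be derived from the fitted model:</p>
      <preformat>
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic stand-in: one bias measurement (e.g., the rho of Section 3.1)
# per combination of factor levels; real values would come from the pipeline.
rng = np.random.default_rng(2)
rows = [(p, c, k, s, t, rng.normal())
        for p in (10, 100, 1000)         # pool depth
        for c in ("R04", "TB05")         # collection
        for k in ("corpus1", "corpus2")  # sub-corpus
        for s in ("all", "top")          # system subset
        for t in ("all", "easy")]        # topic subset
df = pd.DataFrame(rows, columns=["pool", "collection", "corpus",
                                 "sys_subset", "topic_subset", "bias"])

# Main-effects-only model for brevity; interaction terms can be added.
model = ols("bias ~ C(pool) + C(collection) + C(corpus)"
            " + C(sys_subset) + C(topic_subset)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)

# omega^2 per factor: (SS_f - df_f * MS_error) / (SS_total + MS_error)
ms_err = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]
ss_tot = anova["sum_sq"].sum()
omega2 = (anova["sum_sq"] - anova["df"] * ms_err) / (ss_tot + ms_err)
print(omega2.drop("Residual"))
</preformat>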
    </sec>
    <sec id="sec-11">
      <title>4 CONCLUSIONS</title>
      <p>Our contribution is twofold: we propose an engineered pipeline
based on network analysis and mixture models that can be used to
detect bias and its causes in retrieval evaluation, and we present
some preliminary results. We plan to conduct the experiments
described, which will allow us to better understand the effects and
causes of bias and fairness in retrieval evaluation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Sub-corpora Impact on System Effectiveness</article-title>
          .
          <source>In Proceedings of the 40th ACM SIGIR Conference</source>
          . ACM, New York,
          <fpage>901</fpage>
          -
          <lpage>904</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gianmaria</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Toward an anatomy of IR system component performances</article-title>
          .
          <source>JASIST 69</source>
          ,
          <issue>2</issue>
          (
          <year>2018</year>
          ),
          <fpage>187</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jon M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Authoritative Sources in a Hyperlinked Environment</article-title>
          .
          <source>J. ACM</source>
          <volume>46</volume>
          ,
          <issue>5</issue>
          (Sept.
          <year>1999</year>
          ),
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Quality control in scholarly publishing: A new proposal</article-title>
          .
          <source>JASIST 54</source>
          ,
          <issue>11</issue>
          (
          <year>2003</year>
          ),
          <fpage>989</fpage>
          -
          <lpage>1005</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>HITS Hits TREC: Exploring IR Evaluation Results with Network Analysis</article-title>
          .
          <source>In Proceedings of the 30th ACM SIGIR Conference</source>
          .
          <fpage>479</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Olejnik</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Algina</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs</article-title>
          .
          <source>Psychological Methods 8</source>
          ,
          <issue>4</issue>
          (
          <year>2003</year>
          ),
          <fpage>434</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Roitero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Shane</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , Mark Sanderson, Falk Scholer, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Fewer topics? A million topics? Both?! On topics subsets in test collections</article-title>
          .
          <source>Information Retrieval Journal</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Roitero</surname>
          </string-name>
          , Eddy Maddalena, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Do Easy Topics Predict Effectiveness Better Than Difficult Topics?</article-title>
          .
          <source>In Advances in Information Retrieval</source>
          , Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait (Eds.). Springer,
          <fpage>605</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Soprano</surname>
          </string-name>
          , Kevin Roitero, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis</article-title>
          .
          <source>In Proceedings of the 4th BIRNDL Workshop at the 42nd ACM SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Kevin Roitero, Shane Culpepper,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Kurland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>On Topic Difficulty in IR Evaluation: The Effect of Corpora, Systems, and System Components</article-title>
          .
          <source>In Proceedings of the 42nd ACM SIGIR Conference.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>