Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models

Michael Soprano, Kevin Roitero, Stefano Mizzaro
University of Udine, Italy
soprano.michael@spes.uniud.it, roitero.kevin@spes.uniud.it, mizzaro@uniud.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.

ABSTRACT
Information retrieval effectiveness evaluation is often carried out by means of test collections. Many works have investigated possible sources of bias in such an approach. We propose a systematic approach to identify bias and its causes, and to remove it, thus enforcing fairness in effectiveness evaluation by means of test collections.

1 INTRODUCTION
In Information Retrieval (IR) the evaluation of systems is often carried out using test collections. Different initiatives, such as TREC, FIRE, CLEF, etc., implement this setting in a competition scenario. In the well-known TREC initiative, participants are provided with a collection of documents and a set of topics, which are representations of information needs. Each participant can submit one or more runs, each consisting of a ranked list of (usually) 1000 documents per topic. The retrieved documents are then pooled, and expert judges provide relevance judgements for the pooled ⟨topic, document⟩ pairs. Then, an effectiveness metric (such as AP, NDCG, etc.) is computed for each ⟨run, topic⟩ pair, and the final effectiveness score for each run is obtained by averaging its effectiveness over the set of topics. Finally, the set of runs is ranked in descending order of effectiveness.

Several works have investigated possible sources of bias in this evaluation model by looking at system-topic correlations. In this work we propose to extend prior work by considering the many dimensions of the problem, and we develop a statistical model to capture the magnitude of the effects of the different dimensions.

2 BACKGROUND AND RELATED WORK

2.1 HITS Hits TREC
The output of the TREC initiative can be represented as an effectiveness matrix E as in Table 1, where each s_i is a system configuration (i.e., a run), each t_j is a topic, e_{s_i,t_j} represents the effectiveness (with a metric such as AP) of the i-th system on the j-th topic, and Es and Et represent respectively the average effectiveness of a system (with a metric such as MAP) and the average topic ease (with a metric such as AAP [5, 8]).

Table 1: Effectiveness Table.

        t_1           ...   t_n           Es
s_1     e_{s_1,t_1}   ...   e_{s_1,t_n}   Es(s_1)
...     ...           ...   ...           ...
s_m     e_{s_m,t_1}   ...   e_{s_m,t_n}   Es(s_m)
Et      Et(t_1)       ...   Et(t_n)

To capture the bias of such an evaluation setting, Mizzaro and Robertson [5] normalise the E matrix, or more precisely each e_{s_i,t_j}, in two ways: (i) by removing the system effectiveness effect, achieved by subtracting Es from each e_{s_i,t_j}, and (ii) by removing the topic effect, achieved by subtracting Et from each e_{s_i,t_j}. After the normalisation, Mizzaro and Robertson merge the two effectiveness matrices obtained from (i) and (ii) to form a graph in which each link from a system to a topic expresses how easy the system thinks the topic is, and each link from a topic to a system expresses how effective the topic thinks the system is. Then, Mizzaro and Robertson compute the hubness and authority values of the systems and topics by running the HITS algorithm [3] on such a graph; the hubness value of a system expresses its ability to recognise easy topics, while the hubness value of a topic expresses its ability to recognise effective systems. Results of the analysis by Mizzaro and Robertson [5], as well as by Roitero et al. [8], demonstrate that the evaluation is biased, and in particular that easy topics are better at recognising effective systems; in other words, to be effective a retrieval system needs to be effective on the easy topics.
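To make the construction just described concrete, the following is a minimal Python/NumPy sketch of a weighted HITS computation on the system-topic graph. It is not the implementation used by Mizzaro and Robertson [5]: in particular, the shift of the normalised scores to non-negative edge weights is an assumption made here only to keep the power iteration well behaved, and the toy matrix stands in for real AP values.

import numpy as np

def hits_system_topic(E, n_iter=100):
    """Weighted HITS on the system-topic graph built from an
    effectiveness matrix E (rows = systems, columns = topics).
    A sketch: the non-negative rescaling of edge weights is an
    assumption, not necessarily the choice made in [5]."""
    Es = E.mean(axis=1, keepdims=True)   # system effectiveness (MAP-like)
    Et = E.mean(axis=0, keepdims=True)   # topic ease (AAP-like)

    # (i) remove the system effect: how easy system s "thinks" topic t is
    S2T = E - Es
    # (ii) remove the topic effect: how effective topic t "thinks" system s is
    T2S = (E - Et).T

    # Shift to non-negative weights (assumption) so the iteration converges
    S2T = S2T - S2T.min()
    T2S = T2S - T2S.min()

    m, n = E.shape
    hub_sys, hub_top = np.ones(m), np.ones(n)
    auth_sys, auth_top = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        # authority = weighted sum of the hub scores of incoming nodes
        auth_top = S2T.T @ hub_sys
        auth_sys = T2S.T @ hub_top
        # hubness = weighted sum of the authority scores of outgoing nodes
        hub_sys = S2T @ auth_top
        hub_top = T2S @ auth_sys
        for v in (auth_top, auth_sys, hub_sys, hub_top):
            v /= np.linalg.norm(v)
    return hub_sys, hub_top, auth_sys, auth_top

# Toy example: 4 runs x 3 topics with made-up AP values
E = np.array([[0.30, 0.55, 0.20],
              [0.25, 0.60, 0.15],
              [0.10, 0.35, 0.05],
              [0.40, 0.70, 0.30]])
hub_sys, hub_top, _, _ = hits_system_topic(E)
print(hub_sys)  # ability of each system to recognise easy topics
print(hub_top)  # ability of each topic to recognise effective systems

On real data, E would be filled with AP values computed from the runs and qrels of a TREC collection.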
2.2 HITS Hits Readersourcing
Soprano et al. [9] used the same analysis based on the HITS algorithm and described in Section 2.1 to analyse the bias present in the Readersourcing model [4], an alternative peer review proposal that exploits readers to assess paper quality. Due to the lack of real data, Soprano et al. ran a series of simulations to produce synthetic but realistic user models that simulate readers assessing the quality of the papers. Their results show that the Readersourcing model presents some (both good and bad) bias under certain conditions derived from how the synthetic data is produced; for example: (i) the ability of a reader to recognise good papers is independent of whether s/he read papers that on average receive high or low judgements, and (ii) a paper is able to recognise high or low quality readers independently of its average score or of its quality.

2.3 Breaking Components Down
Breaking down the effect caused by a dimension on a complex system has been widely studied in IR. A problem of particular interest is to break down the system effectiveness score (such as AP) into the effect of systems, topics, and system components, like for example the effect of stemming, query expansion, etc. For this purpose, Ferro and Silvello [2] used Generalised Linear Mixture Models (GLMM) [6] to break down the effectiveness score of a system considering its components, Ferro and Sanderson [1] considered the effect of sub-corpora, and Zampieri et al. [10] provided a complete analysis of topic ease and the relative effect of system configurations, corpora, and the interactions between components.

3 EXPERIMENTS
In this paper, we propose to extend results from the related work (see Sections 2.1 and 2.2) to define a sound and engineered pipeline to find and correct the bias in the IR effectiveness evaluation setting. More in detail, we plan to investigate how the specific bias of effective systems being recognised by easy topics varies when varying the components of a test collection, such as the system population, pool depth, etc. Finally, we propose to use a GLMM, as done in the related work (see Section 2.3), to compute the magnitude of the effect of the various components on the bias.

3.1 Pool Effect
To investigate the effect of the different pool depths, we plan to compute, for each effectiveness metric, its value at different cut-offs. Figure 1 shows a preliminary result: the plot shows, for the Precision metric, the different cut-off values on the x-axis and, on the y-axis, the bias value represented by the Pearson's ρ correlation between the hubness measure of systems and their average precision value; this bias represents the fact that effective systems are recognised by easy topics. As we can see from the plot, there is a clear trend suggesting that the bias grows together with the pool depth. The undesired effect that effective systems are mainly those that work well on easy topics becomes stronger when increasing the pool depth.

[Figure 1: Effect of pool depth on the model bias. One line per collection (R03, R04, R05, TB04, TB05, TREC5, TREC6, TREC7, TREC8, TREC2001); x-axis: pool depth (P@ cut-offs), y-axis: ρ(topic hubness, average NDCG).]
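As an illustration of this analysis, the following sketch computes the bias value at each cut-off. It reuses the hits_system_topic function from the sketch in Section 2.1; the E_at_depth mapping, which must be built beforehand from runs and qrels, and the choice of correlating system hubness with the per-system average score (rather than topic hubness with average NDCG, as in Figure 1) are assumptions for illustration only.

import numpy as np
from scipy.stats import pearsonr

def pool_depth_bias(E_at_depth):
    """Bias vs. pool depth: Pearson correlation between system hubness
    and the per-system average score at each cut-off.
    `E_at_depth` is a hypothetical mapping cut-off -> systems-by-topics
    matrix of P@cut-off values. Reuses hits_system_topic() from the
    sketch in Section 2.1."""
    bias = {}
    for depth, E in sorted(E_at_depth.items()):
        hub_sys, _, _, _ = hits_system_topic(E)
        mean_score = E.mean(axis=1)           # per-system average P@depth
        rho, _ = pearsonr(hub_sys, mean_score)
        bias[depth] = rho                     # larger rho = stronger bias
    return bias

# Toy usage with random matrices in place of real TREC data
rng = np.random.default_rng(0)
E_at_depth = {d: rng.uniform(size=(20, 50)) for d in (5, 10, 15, 20, 30, 100)}
print(pool_depth_bias(E_at_depth))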
3.2 Collection and Corpora Effect
To investigate the effect of the different collections, we plan to use different TREC collections: Robust 2003 (R03), 2004 (R04), and 2005 (R05), Terabyte 2004 (TB04) and 2005 (TB05), TREC5, TREC6, TREC7, TREC8, and TREC2001. Furthermore, we plan to break down the sub-corpora effect by considering the different corpora of the collections.

3.3 Metric Effect
We will investigate the effect of different evaluation metrics on the model bias; specifically, we will consider Precision (P), AP, Recall (R), NDCG, τ_AP, RBP, etc. When dealing with the metric effect, we can consider two approaches in the normalisation step: remove the average system effectiveness and topic ease, for example remove MAP and AAP from AP (as done by Mizzaro and Robertson [5], Roitero et al. [8], and Soprano et al. [9]), or try more complex approaches. In the latter case, we can remove a score computed at a shallow cut-off from one computed at a deep cut-off (e.g., remove AP@10 from AP@1000) in order to remove the top-heaviness of a metric, or remove Precision (or Recall) from F1 to enhance the effect of precision-oriented or recall-oriented systems, and so on.

3.4 System and Topic Population Effect
Another effect we can investigate concerns different system and topic populations. We can consider, for example, systems ordered by effectiveness, topics ordered by difficulty, or even the most representative subset of systems/topics selected according to the strategy developed by Roitero et al. [7].

3.5 GLMM
Finally, we can develop a GLMM, adapting the techniques used in [1, 2, 10], to study the effect that the different components described so far (see Sections 3.1–3.4) have on the bias of the model. Thus, we can define the following GLMM:

Bias_{ijklm} = Pool_i + Collection_j + Corpora_k + System-subset_l + Topic-subset_m + (interactions) + Error.

From the above equation we can compute the Size of Effect index ω², which is an "unbiased and standardised index and estimates a parameter that is independent of sample size and quantifies the magnitude of difference between populations or the relationships between explanatory and response variables" [6]. Such an index expresses the magnitude of the effect that the different components of a test collection have on the bias of the model.
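As a sketch of how this decomposition could be computed, the following uses an ordinary fixed-effects ANOVA (via statsmodels) as a stand-in for the full GLMM, with main effects only and hypothetical column names; ω² is obtained per factor with the standard formula ω² = (SS_effect − df_effect · MS_error) / (SS_total + MS_error) [6].

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def omega_squared(data: pd.DataFrame) -> pd.Series:
    """Fit a fixed-effects linear model of the bias measurements and
    estimate omega^2 per factor. Column names (bias, pool, collection,
    corpus, system_subset, topic_subset) are assumptions; interaction
    terms can be added to the formula with `*` or `:`."""
    model = smf.ols(
        "bias ~ C(pool) + C(collection) + C(corpus) "
        "+ C(system_subset) + C(topic_subset)",
        data=data,
    ).fit()
    aov = anova_lm(model, typ=2)                 # Type-II sums of squares
    ms_error = aov.loc["Residual", "sum_sq"] / aov.loc["Residual", "df"]
    ss_total = model.centered_tss
    effects = aov.drop(index="Residual")
    # omega^2 = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
    return (effects["sum_sq"] - effects["df"] * ms_error) / (ss_total + ms_error)

On real data, each row of data would hold one bias measurement (e.g., the correlation of Section 3.1) obtained for one combination of pool depth, collection, corpus, and system/topic subset.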
4 CONCLUSIONS
Our contribution is twofold: we propose an engineered pipeline based on network analysis and mixture models that can be used to detect bias and its causes in retrieval evaluation, and we present some preliminary results. We plan to conduct the experiments described above, which will allow us to better understand the effects and causes of bias and fairness in retrieval evaluation.

REFERENCES
[1] Nicola Ferro and Mark Sanderson. 2017. Sub-corpora Impact on System Effectiveness. In Proceedings of the 40th ACM SIGIR Conference. ACM, New York, 901–904.
[2] Nicola Ferro and Gianmaria Silvello. 2018. Toward an Anatomy of IR System Component Performances. JASIST 69, 2 (2018), 187–200.
[3] Jon M. Kleinberg. 1999. Authoritative Sources in a Hyperlinked Environment. J. ACM 46, 5 (Sept. 1999), 604–632.
[4] Stefano Mizzaro. 2003. Quality Control in Scholarly Publishing: A New Proposal. JASIST 54, 11 (2003), 989–1005.
[5] Stefano Mizzaro and Stephen Robertson. 2007. HITS Hits TREC: Exploring IR Evaluation Results with Network Analysis. In Proceedings of the 30th ACM SIGIR Conference. 479–486.
[6] Stephen Olejnik and James Algina. 2003. Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs. Psychological Methods 8, 4 (2003), 434.
[7] Kevin Roitero, J. Shane Culpepper, Mark Sanderson, Falk Scholer, and Stefano Mizzaro. 2019. Fewer Topics? A Million Topics? Both?! On Topics Subsets in Test Collections. Information Retrieval Journal (2019).
[8] Kevin Roitero, Eddy Maddalena, and Stefano Mizzaro. 2017. Do Easy Topics Predict Effectiveness Better Than Difficult Topics?. In Advances in Information Retrieval, Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait (Eds.). Springer, 605–611.
[9] Michael Soprano, Kevin Roitero, and Stefano Mizzaro. 2019. HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis. In Proceedings of the 4th BIRNDL Workshop at the 42nd ACM SIGIR.
[10] Fabio Zampieri, Kevin Roitero, J. Shane Culpepper, Oren Kurland, and Stefano Mizzaro. 2019. On Topic Difficulty in IR Evaluation: The Effect of Corpora, Systems, and System Components. In Proceedings of the 42nd ACM SIGIR Conference.