Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models

Michael Soprano, Kevin Roitero, Stefano Mizzaro
University of Udine, Italy
soprano.michael@spes.uniud.it, roitero.kevin@spes.uniud.it, mizzaro@uniud.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.

ABSTRACT
Information retrieval effectiveness evaluation is often carried out by means of test collections. Many works have investigated possible sources of bias in such an approach. We propose a systematic approach to identify bias and its causes, and to remove it, thus enforcing fairness in effectiveness evaluation by means of test collections.

1 INTRODUCTION
In Information Retrieval (IR) the evaluation of systems is often carried out using test collections. Different initiatives, such as TREC, FIRE, CLEF, etc., implement this setting in a competition scenario. In the well-known TREC initiative, participants are provided with a collection of documents and a set of topics, which are representations of information needs. Each participant can submit one or more runs, each consisting of a ranked list of (usually) 1000 documents per topic. The retrieved documents are then pooled, and expert judges provide relevance judgements for the pooled ⟨topic, document⟩ pairs. Then, an effectiveness metric (such as AP, NDCG, etc.) is computed for each ⟨run, topic⟩ pair, and the final effectiveness score for each run is obtained by averaging its effectiveness over the set of topics. Finally, the set of runs is ranked in descending order of effectiveness.

Several works have investigated possible sources of bias in this evaluation model by looking at system-topic correlations. In this work we propose to extend prior work by considering the many dimensions of the problem, and we develop a statistical model to capture the magnitude of the effects of the different dimensions.

2 BACKGROUND AND RELATED WORK

2.1 HITS Hits TREC
The output of the TREC initiative can be represented as an effectiveness matrix E as in Table 1, where each s_i is a system configuration (i.e., a run), each t_j is a topic, e_{s_i,t_j} represents the effectiveness (with a metric such as AP) of the i-th system on the j-th topic, and Es and Et represent respectively the average effectiveness of a system (with a metric such as MAP) and the average topic ease (with a metric such as AAP [5, 8]).

Table 1: Effectiveness Table.

        t_1           ...   t_n           Es
s_1     e_{s_1,t_1}   ...   e_{s_1,t_n}   Es(s_1)
...     ...           ...   ...           ...
s_m     e_{s_m,t_1}   ...   e_{s_m,t_n}   Es(s_m)
Et      Et(t_1)       ...   Et(t_n)

To capture the bias of such an evaluation setting, Mizzaro and Robertson [5] normalise the E matrix, or more precisely each e_{s_i,t_j}, in two ways: (i) by removing the system effectiveness effect, achieved by subtracting Es from each e_{s_i,t_j}, and (ii) by removing the topic effect, achieved by subtracting Et from each e_{s_i,t_j}. After the normalisation, Mizzaro and Robertson merge the two effectiveness matrices obtained from (i) and (ii) to form a graph in which each link from a system to a topic expresses how easy the system thinks the topic is, and each link from a topic to a system expresses how effective the topic thinks the system is. Then, Mizzaro and Robertson compute the hubness and authority values of the systems and topics by running the HITS algorithm [3] on such a graph; the hubness value of a system expresses its ability to recognise easy topics, while the hubness value of a topic expresses its ability to recognise effective systems. Results of the analysis by Mizzaro and Robertson [5], as well as by Roitero et al. [8], demonstrate that the evaluation is biased, and in particular that easy topics are better at recognising effective systems; in other words, to be effective a retrieval system needs to be effective on the easy topics.
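To make the construction just described concrete, the following is a minimal Python/NumPy sketch of a weighted HITS computation on the system-topic graph. It is not the implementation used by Mizzaro and Robertson [5]: in particular, the shift of the normalised scores to non-negative edge weights is an assumption made here only to keep the power iteration well behaved, and the toy matrix stands in for real AP values.

import numpy as np

def hits_system_topic(E, n_iter=100):
    """Weighted HITS on the system-topic graph built from an
    effectiveness matrix E (rows = systems, columns = topics).
    A sketch: the non-negative rescaling of edge weights is an
    assumption, not necessarily the choice made in [5]."""
    Es = E.mean(axis=1, keepdims=True)   # system effectiveness (MAP-like)
    Et = E.mean(axis=0, keepdims=True)   # topic ease (AAP-like)

    # (i) remove the system effect: how easy system s "thinks" topic t is
    S2T = E - Es
    # (ii) remove the topic effect: how effective topic t "thinks" system s is
    T2S = (E - Et).T

    # Shift to non-negative weights (assumption) so the iteration converges
    S2T = S2T - S2T.min()
    T2S = T2S - T2S.min()

    m, n = E.shape
    hub_sys, hub_top = np.ones(m), np.ones(n)
    auth_sys, auth_top = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        # authority = weighted sum of the hub scores of incoming nodes
        auth_top = S2T.T @ hub_sys
        auth_sys = T2S.T @ hub_top
        # hubness = weighted sum of the authority scores of outgoing nodes
        hub_sys = S2T @ auth_top
        hub_top = T2S @ auth_sys
        for v in (auth_top, auth_sys, hub_sys, hub_top):
            v /= np.linalg.norm(v)
    return hub_sys, hub_top, auth_sys, auth_top

# Toy example: 4 runs x 3 topics with made-up AP values
E = np.array([[0.30, 0.55, 0.20],
              [0.25, 0.60, 0.15],
              [0.10, 0.35, 0.05],
              [0.40, 0.70, 0.30]])
hub_sys, hub_top, _, _ = hits_system_topic(E)
print(hub_sys)  # ability of each system to recognise easy topics
print(hub_top)  # ability of each topic to recognise effective systems

On real data, E would be filled with AP values computed from the runs and qrels of a TREC collection.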
2.2 HITS Hits Readersourcing
Soprano et al. [9] used the same analysis based on the HITS algorithm and described in Section 2.1 to analyse the bias present in the Readersourcing model [4], an alternative peer review proposal that exploits readers to assess paper quality. Due to the lack of real data, Soprano et al. ran a series of simulations to produce synthetic but realistic user models that simulate readers assessing the quality of the papers. Their results show that the Readersourcing model presents some (both good and bad) bias under certain conditions derived from how the synthetic data is produced; for example: (i) the ability of a reader to recognise good papers is independent of whether s/he read papers that on average receive high or low judgements, and (ii) a paper is able to recognise high or low quality readers independently of its average score or of its quality.

2.3 Breaking Components Down
Breaking down the effect caused by a dimension on a complex system has been widely studied in IR. A problem of particular interest is to break down the system effectiveness score (such as AP) into the effect of systems, topics, and system components, like for example the effect of stemming, query expansion, etc. For this purpose, Ferro and Silvello [2] used Generalised Linear Mixture Models (GLMM) [6] to break down the effectiveness score of a system considering its components, Ferro and Sanderson [1] considered the effect of sub-corpora, and Zampieri et al. [10] provided a complete analysis of topic ease and the relative effect of system configurations, corpora, and the interactions between components.

3 EXPERIMENTS
In this paper, we propose to extend results from the related work (see Sections 2.1 and 2.2) to define a sound and engineered pipeline to find and correct the bias in the IR effectiveness evaluation setting. More in detail, we plan to investigate how the specific bias of effective systems being recognised by easy topics varies when varying the components of a test collection, such as the system population, pool depth, etc. Finally, we propose to use a GLMM, as done in the related work (see Section 2.3), to compute the magnitude of the effect of the various components on the bias.

3.1 Pool Effect
To investigate the effect of the different pool depths, we plan to compute, for each effectiveness metric, its value at different cut-offs. Figure 1 shows a preliminary result: the plot shows, for the Precision metric, the different cut-off values on the x-axis and, on the y-axis, the bias value represented by the Pearson's ρ correlation between the hubness measure of systems and their average precision value; this bias represents the fact that effective systems are recognised by easy topics. As we can see from the plot, there is a clear trend suggesting that the bias grows together with the pool depth. The undesired effect that effective systems are mainly those that work well on easy topics becomes stronger when increasing the pool depth.

[Figure 1: Effect of pool depth on the model bias. One line per collection (R03, R04, R05, TB04, TB05, TREC5, TREC6, TREC7, TREC8, TREC2001); x-axis: pool depth (P@ cut-offs), y-axis: ρ(topic hubness, average NDCG).]
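As an illustration of this analysis, the following sketch computes the bias value at each cut-off. It reuses the hits_system_topic function from the sketch in Section 2.1; the E_at_depth mapping, which must be built beforehand from runs and qrels, and the choice of correlating system hubness with the per-system average score (rather than topic hubness with average NDCG, as in Figure 1) are assumptions for illustration only.

import numpy as np
from scipy.stats import pearsonr

def pool_depth_bias(E_at_depth):
    """Bias vs. pool depth: Pearson correlation between system hubness
    and the per-system average score at each cut-off.
    `E_at_depth` is a hypothetical mapping cut-off -> systems-by-topics
    matrix of P@cut-off values. Reuses hits_system_topic() from the
    sketch in Section 2.1."""
    bias = {}
    for depth, E in sorted(E_at_depth.items()):
        hub_sys, _, _, _ = hits_system_topic(E)
        mean_score = E.mean(axis=1)           # per-system average P@depth
        rho, _ = pearsonr(hub_sys, mean_score)
        bias[depth] = rho                     # larger rho = stronger bias
    return bias

# Toy usage with random matrices in place of real TREC data
rng = np.random.default_rng(0)
E_at_depth = {d: rng.uniform(size=(20, 50)) for d in (5, 10, 15, 20, 30, 100)}
print(pool_depth_bias(E_at_depth))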
3.2 Collection and Corpora Effect
To investigate the effect of the different collections, we plan to use different TREC collections: Robust 2003 (R03), 2004 (R04), and 2005 (R05), Terabyte 2004 (TB04) and 2005 (TB05), TREC5, TREC6, TREC7, TREC8, and TREC2001. Furthermore, we plan to break down the sub-corpora effect by considering the different corpora of the collections.

3.3 Metric Effect
We will investigate the effect of different evaluation metrics on the model bias; specifically, we will consider Precision (P), AP, Recall (R), NDCG, τ_AP, RBP, etc. When dealing with the metric effect, we can consider two approaches in the normalisation step: remove the average system effectiveness and topic ease, for example remove MAP and AAP from AP (as done by Mizzaro and Robertson [5], Roitero et al. [8], and Soprano et al. [9]), or try more complex approaches. In the latter case, we can remove a score computed at a shallow cut-off from one computed at a deep cut-off (e.g., remove AP@10 from AP@1000) in order to remove the top-heaviness of a metric, or remove Precision (or Recall) from F1 to enhance the effect of precision-oriented or recall-oriented systems, and so on.

3.4 System and Topic Population Effect
Another effect we can investigate concerns different system and topic populations. We can consider, for example, systems ordered by effectiveness, topics ordered by difficulty, or even the most representative subset of systems/topics selected according to the strategy developed by Roitero et al. [7].

3.5 GLMM
Finally, we can develop a GLMM, adapting the techniques used in [1, 2, 10], to study the effect that the different components described so far (see Sections 3.1–3.4) have on the bias of the model. Thus, we can define the following GLMM:

Bias_{ijklm} = Pool_i + Collection_j + Corpora_k + System-subset_l + Topic-subset_m + (interactions) + Error.

From the above equation we can compute the Size of Effect index ω², which is an "unbiased and standardised index and estimates a parameter that is independent of sample size and quantifies the magnitude of difference between populations or the relationships between explanatory and response variables" [6]. Such an index expresses the magnitude of the effect that the different components of a test collection have on the bias of the model.
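As a sketch of how this decomposition could be computed, the following uses an ordinary fixed-effects ANOVA (via statsmodels) as a stand-in for the full GLMM, with main effects only and hypothetical column names; ω² is obtained per factor with the standard formula ω² = (SS_effect − df_effect · MS_error) / (SS_total + MS_error) [6].

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def omega_squared(data: pd.DataFrame) -> pd.Series:
    """Fit a fixed-effects linear model of the bias measurements and
    estimate omega^2 per factor. Column names (bias, pool, collection,
    corpus, system_subset, topic_subset) are assumptions; interaction
    terms can be added to the formula with `*` or `:`."""
    model = smf.ols(
        "bias ~ C(pool) + C(collection) + C(corpus) "
        "+ C(system_subset) + C(topic_subset)",
        data=data,
    ).fit()
    aov = anova_lm(model, typ=2)                 # Type-II sums of squares
    ms_error = aov.loc["Residual", "sum_sq"] / aov.loc["Residual", "df"]
    ss_total = model.centered_tss
    effects = aov.drop(index="Residual")
    # omega^2 = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
    return (effects["sum_sq"] - effects["df"] * ms_error) / (ss_total + ms_error)

On real data, each row of data would hold one bias measurement (e.g., the correlation of Section 3.1) obtained for one combination of pool depth, collection, corpus, and system/topic subset.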
4 CONCLUSIONS
Our contribution is twofold: we propose an engineered pipeline based on network analysis and mixture models that can be used to detect bias and its causes in retrieval evaluation, and we present some preliminary results. We plan to conduct the experiments described above, which will allow us to better understand the effects and causes of bias and fairness in retrieval evaluation.

REFERENCES
[1] Nicola Ferro and Mark Sanderson. 2017. Sub-corpora Impact on System Effectiveness. In Proceedings of the 40th ACM SIGIR Conference. ACM, New York, 901–904.
[2] Nicola Ferro and Gianmaria Silvello. 2018. Toward an Anatomy of IR System Component Performances. JASIST 69, 2 (2018), 187–200.
[3] Jon M. Kleinberg. 1999. Authoritative Sources in a Hyperlinked Environment. J. ACM 46, 5 (Sept. 1999), 604–632.
[4] Stefano Mizzaro. 2003. Quality Control in Scholarly Publishing: A New Proposal. JASIST 54, 11 (2003), 989–1005.
[5] Stefano Mizzaro and Stephen Robertson. 2007. HITS Hits TREC: Exploring IR Evaluation Results with Network Analysis. In Proceedings of the 30th ACM SIGIR Conference. 479–486.
[6] Stephen Olejnik and James Algina. 2003. Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs. Psychological Methods 8, 4 (2003), 434.
[7] Kevin Roitero, J. Shane Culpepper, Mark Sanderson, Falk Scholer, and Stefano Mizzaro. 2019. Fewer Topics? A Million Topics? Both?! On Topics Subsets in Test Collections. Information Retrieval Journal (2019).
[8] Kevin Roitero, Eddy Maddalena, and Stefano Mizzaro. 2017. Do Easy Topics Predict Effectiveness Better Than Difficult Topics?. In Advances in Information Retrieval, Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait (Eds.). Springer, 605–611.
[9] Michael Soprano, Kevin Roitero, and Stefano Mizzaro. 2019. HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis. In Proceedings of the 4th BIRNDL Workshop at the 42nd ACM SIGIR.
[10] Fabio Zampieri, Kevin Roitero, J. Shane Culpepper, Oren Kurland, and Stefano Mizzaro. 2019. On Topic Difficulty in IR Evaluation: The Effect of Corpora, Systems, and System Components. In Proceedings of the 42nd ACM SIGIR Conference.