Introduction

How Topic and System Size Affect the Correlation among Evaluation Measures?

Nicola Ferro

ferro@dei.unipd.it, University of Padua, Italy

In this paper, we investigate the effect of topic and system sizes on the correlation among evaluation measures for both Kendall's τ and τ_AP. We found that topic size matters more than system size and that τ and τ_AP do not lead to noticeably different rankings among measures.

Correlation analysis plays a central role in Information Retrieval (IR) evaluation, where it is one of the tools we use to study properties of and relationships among evaluation measures. When a new evaluation measure is proposed, correlation analysis is used to assess how the new measure ranks IR systems with respect to the existing measures and, thus, to understand whether it actually captures different aspects of the systems and whether its introduction is motivated. In this context, the most used correlation coefficients are Kendall's τ correlation [4] and the AP correlation τ_AP [6].

In this paper, we investigate the effect of the number of systems and topics on the correlation among evaluation measures and the differences between using τ and τ_AP.

To answer these research questions, we rely on 3 different Text REtrieval Conference (TREC) collections and, for each collection, we create a Grid of Points (GoP) [2, 3], i.e. a set of system runs originating from all the possible combinations of the following components: 6 different stop lists, 6 types of stemmers, 7 flavors of n-grams, and 17 distinct IR models, leading to 1,326 distinct system runs. These GoPs represent nearly all the state-of-the-art components which constitute the common denominator almost always present in any IR system for English retrieval. We consider 8 different evaluation measures (namely, AP, P@10, Rprec, RBP, nDCG, nDCG@20, ERR, and Twist) and we compute the correlation among them over the created GoPs. Finally, we use a General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) [5] to conduct the analyses needed to answer the above research questions.
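The run count can be checked with a short sketch. It assumes, as one reading consistent with the stated total of 1,326, that each run uses either a stemmer or an n-gram setting but not both; the component labels are hypothetical placeholders, and only the counts come from the text.

```python
from itertools import product

# Hypothetical component labels; only the counts (6, 6, 7, 17) come from the text.
stop_lists = [f"stop{i}" for i in range(6)]
stemmers = [f"stem{i}" for i in range(6)]
ngrams = [f"ngram{i}" for i in range(7)]
models = [f"model{i}" for i in range(17)]

# Assumption: stemmers and n-grams are alternatives for the same slot,
# so a run picks one of 6 + 7 = 13 options for it.
lexical_units = stemmers + ngrams

grid_of_points = list(product(stop_lists, lexical_units, models))
print(len(grid_of_points))  # 6 * (6 + 7) * 17 = 1326
```

Under this reading the grid has 6 × 13 × 17 = 1,326 runs, whereas a full four-way product would give 6 × 6 × 7 × 17 = 4,284.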
The paper is organized as follows: Section 2 introduces the GLMM used for the analyses; Section 3 discusses the experimental findings; finally, Section 4 draws some conclusions and provides an outlook for future work.
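As a minimal sketch of the two correlation coefficients discussed above, the following pure-Python functions compute Kendall's τ (the tau-a variant, assuming no ties) and the AP correlation between two lists of system scores. The function names and the toy scores are illustrative assumptions, not the code used for the experiments.

```python
def kendall_tau(x, y):
    """Kendall's tau (tau-a, assumes no ties) between two score lists."""
    n = len(x)
    net = 0  # concordant pairs minus discordant pairs
    for a in range(n):
        for b in range(a + 1, n):
            s = (x[a] - x[b]) * (y[a] - y[b])
            net += (s > 0) - (s < 0)
    return net / (n * (n - 1) / 2)


def tau_ap(x, y):
    """AP correlation of the ranking induced by x against the reference y."""
    n = len(x)
    order = sorted(range(n), key=lambda s: -x[s])  # systems by decreasing x
    total = 0.0
    for i in range(1, n):  # ranks 2..n in 1-based terms
        current = order[i]
        # Systems ranked above `current` by x that y also ranks above it.
        correct = sum(1 for s in order[:i] if y[s] > y[current])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0


# Toy scores for four systems (illustrative only).
x = [0.5, 0.4, 0.3, 0.2]
y_top_swap = [0.4, 0.5, 0.3, 0.2]     # reference swaps the top two systems
y_bottom_swap = [0.5, 0.4, 0.2, 0.3]  # reference swaps the bottom two systems

print(kendall_tau(x, y_top_swap), kendall_tau(x, y_bottom_swap))
print(tau_ap(x, y_top_swap), tau_ap(x, y_bottom_swap))
```

In this toy case both swaps give the same τ, while τ_AP is lower for the swap at the top of the ranking, which is the top-heaviness that distinguishes the two coefficients.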


Fig. 1. AP vs nDCG@20: each plot shows the correlation, both τ and τ_AP, for a given number of topics as the number of systems increases. [Panels: 10, 20, 30, 40, 50, 60, and 70 topics; x-axis: System Size; y-axis: Correlation AP vs nDCG@20.]

Model

We create a GoP using the TREC 13, 14, and 15 Terabyte track, thus containing 149 topics and 1,326 runs. For each topic size t ∈ T = {10, 20, 30, 40, 50, 60, 70} and system size s ∈ S = {10, 20, 50, 75, 100, 125, 150, 200, 250, 500}, we independently draw H = 100 random samples of t topics and H = 100 random samples of s systems from the GoP. Overall, for each combination (t, s) ∈ T × S of topic and system sizes and for each measure pair, this procedure yields H = 100 samples of correlation values for both τ and τ_AP.
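The sampling procedure can be sketched as follows. This is an illustrative reading under stated assumptions, not the author's code: the h-th topic sample is paired with the h-th system sample, systems are scored by their mean over the sampled topics, and the score matrices are synthetic toy data. Only Kendall's τ is shown; τ_AP would be computed on the same samples.

```python
import random


def kendall_tau(x, y):
    """Kendall's tau (tau-a, assumes no ties) between two score lists."""
    n = len(x)
    net = 0
    for a in range(n):
        for b in range(a + 1, n):
            s = (x[a] - x[b]) * (y[a] - y[b])
            net += (s > 0) - (s < 0)
    return net / (n * (n - 1) / 2)


def sample_correlations(scores_a, scores_b, t, s, H=100, seed=0):
    """Draw H (topic sample, system sample) pairs and correlate two measures.

    scores_a[system][topic] and scores_b[system][topic] hold the per-topic
    scores of the same systems under two evaluation measures.
    """
    rng = random.Random(seed)
    n_systems, n_topics = len(scores_a), len(scores_a[0])
    values = []
    for _ in range(H):
        topics = rng.sample(range(n_topics), t)    # t topics, no replacement
        systems = rng.sample(range(n_systems), s)  # s systems, no replacement
        mean_a = [sum(scores_a[sy][tp] for tp in topics) / t for sy in systems]
        mean_b = [sum(scores_b[sy][tp] for tp in topics) / t for sy in systems]
        values.append(kendall_tau(mean_a, mean_b))
    return values


# Synthetic toy data (not from the paper): 30 systems x 20 topics.
toy = random.Random(42)
measure_a = [[toy.random() for _ in range(20)] for _ in range(30)]
measure_b = [[v + toy.gauss(0, 0.1) for v in row] for row in measure_a]

values = sample_correlations(measure_a, measure_b, t=10, s=10, H=100)
print(sum(values) / len(values))  # average correlation over the H samples
```

Repeating the call for every (t, s) in T × S and every measure pair would produce the samples of correlation values that enter the analysis.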

We use the following model

\[
Y_{ijkl} = \underbrace{\mu_{\cdot\cdot\cdot\cdot} + \tau_i + \alpha_j + \beta_k + \gamma_l}_{\text{Main Effects}} + \underbrace{(\alpha\beta)_{jk} + (\alpha\gamma)_{jl} + (\beta\gamma)_{kl}}_{\text{Interaction Effects}} + \underbrace{\varepsilon_{ijkl}}_{\text{Error}} \tag{1}
\]

where: μ.... is the grand mean; τ_i is the effect of the i-th subject, i.e. one of the h = 1, ..., H samples (not to be confused with Kendall's τ); α_j is the effect of the j-th factor, i.e. measure pairs; β_k is the effect of the k-th factor, i.e. number of topics; γ_l is the effect of the l-th factor, i.e. number of systems; (αβ)_jk, (αγ)_jl, and (βγ)_kl are, respectively, the interactions between measure pairs and number of topics, measure pairs and number of systems, and number of topics and number of systems; and ε_ijkl is the error.
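To make the main-effect terms of the model concrete, the sketch below estimates them for a balanced design as the marginal mean of each level minus the grand mean. This is the textbook least-squares estimate for balanced data, offered as an illustration under that assumption rather than as the fitting procedure actually used; the function name and the toy observations are hypothetical.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def main_effects(observations):
    """Estimate the grand mean and main effects in a balanced design.

    observations maps (subject, measure_pair, topic_size, system_size) to a
    correlation value. Returns (grand_mean, effects), where effects[p] maps
    each level of the factor in key position p to its estimated effect.
    """
    grand = mean(observations.values())
    effects = []
    for pos in range(4):
        by_level = defaultdict(list)
        for key, y in observations.items():
            by_level[key[pos]].append(y)
        # Effect of a level = its marginal mean minus the grand mean.
        effects.append({lvl: mean(v) - grand for lvl, v in by_level.items()})
    return grand, effects


# Toy balanced data with known effects (illustrative only): two levels per
# factor, no subject or system-size effect, no interactions, no noise.
measure_effect = {0: -0.1, 1: 0.1}
topic_effect = {0: -0.2, 1: 0.2}
observations = {
    (i, j, k, l): 0.5 + measure_effect[j] + topic_effect[k]
    for i, j, k, l in product(range(2), repeat=4)
}

grand, effects = main_effects(observations)
print(grand)       # grand mean
print(effects[1])  # estimated measure pair effects
print(effects[2])  # estimated topic size effects
```

On this noise-free toy data the estimates recover the effects that were used to generate it, and the system size effects come out as zero.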

Experimental Results

General Trends. As Figure 1 highlights, the number of topics affects both τ and τ_AP, since their average value increases as the number of topics increases. On the other hand, the number of systems has less impact on the two correlation coefficients: indeed, apart from a small transient up to around 75-100 systems, the trend for both coefficients is roughly constant, especially when the number of topics increases. We can note how, in the transient phase, τ and τ_AP behave differently: τ tends to slightly increase before reaching stability, while τ_AP manifests an initial decrease, sometimes followed by an increase, before becoming more or less constant.


Table 1. τ correlation: ANOVA table for the GLMM model of equation (1).

Table 2. τ_AP correlation: ANOVA table for the GLMM model of equation (1).

Source (rows of both tables): Subject, Measure Pair, Topic Size, System Size, Measure Pair*Topic Size, Measure Pair*System Size, Topic Size*System Size, Error, Total.

When it comes to confidence intervals, lower numbers of topics and systems call for larger intervals, which is not surprising. However, τ generally exhibits smaller confidence intervals than τ_AP, especially for low numbers of topics. Moreover, τ seems to be a bit more effective than τ_AP in benefiting from the increasing number of topics and systems; indeed, correlation values get more stable and confidence intervals get smaller in a "faster" way for τ than for τ_AP.

ANOVA Analysis. Tables 1 and 2 report the results of the ANOVA analyses on the GLMM model of equation (1) for τ and τ_AP, respectively. The most prominent effect is the measure pair one, which is a large size effect in terms of ω̂², and it has almost the same size for both τ and τ_AP. The second biggest effect is the topic size one, which again is a large size effect and has the same size for both τ and τ_AP. This supports the previous observations about Figure 1, where we noted that the topic size is the most prominent factor influencing the correlation among evaluation measures. Finally, the system size effect, even if significant, is a very small size effect and we can consider it almost negligible; however, it should be noted that this effect is a little more than three times bigger for τ_AP than for τ. Overall, this supports the observations made above about the smaller importance of the number of systems on the correlation among evaluation measures, with τ_AP being more sensitive to this factor than τ.
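For reference, a commonly used estimator of the ω̂² effect size of a factor in an ANOVA table is shown below. It is given as an assumption, since the text does not state which estimator was used:

```latex
\[
\hat{\omega}^2_{\langle fact \rangle} =
  \frac{df_{fact}\,(F_{fact} - 1)}{df_{fact}\,(F_{fact} - 1) + N}
\]
```

Here df_fact is the degrees of freedom of the factor, F_fact is its F statistic, and N is the total number of observations. A common rule of thumb treats values of about 0.01, 0.06, and 0.14 as the lower bounds of small, medium, and large effects, respectively.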

When it comes to the interaction between effects, for both τ and τ_AP, the measure pair and topic size (αβ)_jk and the topic size and system size (βγ)_kl interactions are statistically significant. On the other hand, the measure pair and system size (αγ)_jl interaction is not significant, and this further stresses the fact that the number of systems does not much influence the correlation among evaluation measures.

Fig. 2. Comparison of τ and τ_AP in terms of how they rank evaluation measures. [Plot title: Correlation among RoMP: tauCorr = 0.9735; apCorr = 0.8815.]

The comparison ranks the measure pairs according to τ and τ_AP: we can note how there are very few swaps, always among adjacent rank positions. On the right, we show the actual correlation values but with means centered around zero: it is evident how close τ and τ_AP are, apart from a constant offset; indeed, the discrepancy between the two curves is just 0.0242, indicating very small differences.

Overall, when you take these measures and compare them across a large set of topic and system sizes, removing those effects, τ and τ_AP have different absolute values but they provide a quite consistent assessment of what the differences among these measures are.

Conclusions and Future Work

We investigated how topic and system size affect the correlation among evaluation measures.

We discovered that the number of topics has more impact than the number of systems and that the number of systems does not cause the correlation to steadily increase; rather, the correlation reaches a stable point quite early. We also found that the behavior of τ and τ_AP is quite consistent when comparing a set of evaluation measures, although they produce different absolute correlation values.

As future work, we plan to investigate how the different system components, e.g. stop lists, stemmers, and IR models, affect the correlation among evaluation measures.

References

1. Ferro, N.: What Does Affect the Correlation Among Evaluation Measures? ACM Transactions on Information Systems 36(2), 19:1-19:40 (2017)
2. Ferro, N., Harman, D.: CLEF 2009: Grid@CLEF Pilot Track Overview. In: CLEF 2009. pp. 552-565. LNCS 6241 (2010)
3. Ferro, N., Silvello, G.: Toward an Anatomy of IR System Component Performances. JASIST 69(2), 187-200 (2018)
4. Kendall, M.G.: Rank correlation methods. Griffin, Oxford, England (1948)
5. Rutherford, A.: ANOVA and ANCOVA. A GLM Approach. John Wiley & Sons (2011)
6. Yilmaz, E., Aslam, J.A., Robertson, S.E.: A New Rank Correlation Coefficient for Information Retrieval. In: SIGIR 2008 (2008)