<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Component-Based Evaluation using GLMM?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Topic variance has a greater effect on performance than system variance, but it cannot be controlled by system developers, who can only try to cope with it. On the other hand, system variance is important in its own right, since it is what system developers can affect directly by changing system components, and it determines the differences among systems. In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performance. To this end, we propose a methodology based on the General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance and component effects, as well as their interactions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The analysis of experimental results is a core activity in Information Retrieval (IR),
aimed at, firstly, understanding and improving system performance and,
secondly, assessing our own experimental methods, such as the robustness of
experimental collections or the properties of evaluation measures. When it comes to
explaining system performance and the differences between algorithms, it is
commonly understood [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that system performance can be broken down, to a reasonable approximation, as
system performance = topic effect + system effect + topic/system interaction effect
even though it is not always possible to estimate these effects separately,
especially the interaction one.
      </p>
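      <p>As an illustration, the additive decomposition above can be sketched with a few lines of code. The scores below are synthetic, and this is the standard two-way mean-based decomposition, not the exact estimation procedure of the paper; with a single observation per topic/system cell, the interaction term also absorbs the error:</p>

```python
import numpy as np

# Hypothetical topic x system matrix of effectiveness scores (e.g. AP):
# rows = topics, columns = systems. Values are illustrative only.
scores = np.array([
    [0.30, 0.35, 0.25],
    [0.60, 0.55, 0.65],
    [0.10, 0.20, 0.15],
])

grand_mean = scores.mean()
topic_effect = scores.mean(axis=1) - grand_mean    # one value per topic
system_effect = scores.mean(axis=0) - grand_mean   # one value per system

# What remains after removing the additive effects is the topic/system
# interaction (confounded with error in this single-observation setup).
interaction = (scores - grand_mean
               - topic_effect[:, None] - system_effect[None, :])

# The decomposition reconstructs the original scores exactly.
assert np.allclose(scores,
                   grand_mean + topic_effect[:, None]
                   + system_effect[None, :] + interaction)
```

      <p>By construction, the topic and system effects each sum to zero, so the grand mean carries the overall performance level and the effects carry the deviations.</p>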
      <p>
        It is well known that topic variability is greater than system variability, and
a lot of effort has been put into better understanding this source of variance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as
well as into making IR systems more robust to it. Nevertheless, with respect to an
IR system, topic variance is a kind of "external source" of variation, which cannot
be controlled, but can only be taken into account in order to better deal with it. On the
other hand, system variance is a kind of "internal source" of variation, since it
originates from the choice of system components, may be directly affected by
developers working on those components, and represents the intrinsic differences between
algorithms.
? This is an extended abstract of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Please refer to the original paper for the full
model and experimental results.
      </p>
      <p>Currently, in experimental evaluation we consider system variance as a
single monolithic contribution, and we cannot break it down into the smaller
pieces (the components) constituting an IR system.</p>
      <p>
        We propose a methodology, based on the General Linear Mixed Model (GLMM)
and ANalysis Of VAriance (ANOVA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to address this issue and to estimate
the effects of the different components of an IR system, thus giving us
better insights into what system variance and system effects are. In particular, the
proposed methodology allows us to break down the system effect into the
contributions of stop lists, stemmers or n-grams, and IR models, as well as to study
their interactions.
      </p>
      <p>In this extended abstract we report the main ideas behind the adopted
methodology and the main results we obtained from the experimental evaluation
conducted on standard Text REtrieval Conference (TREC) Ad-hoc collections.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology and Experimentation</title>
      <p>The goal of the proposed methodology is to decompose the effects of different
components on the overall system performance. In particular, we are interested
in investigating the effects of the following components: stop lists; the Lexical Unit
Generator (LUG), namely stemmers or n-grams; and IR models, such as the vector
space or the probabilistic model.</p>
      <p>We considered three main components of an IR system: the stop list, the LUG,
and the IR model. We selected a set of alternative implementations of each
component and, by using the Terrier open source system, we created a run for each
system defined by combining the available components in all possible ways. The
components we selected are:
stop list: nostop, indri, lucene, smart, terrier;
stemmer: nolug, weak Porter, Porter, Krovetz, Lovins;
model: BB2, BM25, DFRBM25, DFRee, DLH, DLH13, DPH, HiemstraLM,
IFB2, InL2, InexpB2, InexpC2, LGD, LemurTFIDF, PL2, TFIDF.</p>
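      <p>The resulting grid of systems can be sketched as follows; the component names mirror the lists above, and the Cartesian product yields one run configuration per combination:</p>

```python
from itertools import product

# Component alternatives, mirroring the lists above.
stop_lists = ["nostop", "indri", "lucene", "smart", "terrier"]
stemmers = ["nolug", "weakPorter", "Porter", "Krovetz", "Lovins"]
models = ["BB2", "BM25", "DFRBM25", "DFRee", "DLH", "DLH13", "DPH",
          "HiemstraLM", "IFB2", "InL2", "InexpB2", "InexpC2",
          "LGD", "LemurTFIDF", "PL2", "TFIDF"]

# One system (and thus one run) per combination: 5 * 5 * 16 = 400.
grid = list(product(stop_lists, stemmers, models))
print(len(grid))  # 400
```

      <p>Each tuple in the grid identifies one system configuration to be indexed and run; how each configuration is passed to Terrier is not shown here.</p>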
      <p>We conducted single-factor and three-factor ANOVA tests for both
groups on the TREC 05, 06, 07, 08, 09, and 10 collections, employing the
following five measures: AP, P@10, nDCG@20, RBP, and ERR@20.</p>
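      <p>As a reminder of what is being measured, two of the five measures can be sketched in a few lines; the ranking and relevance judgments below are toy data, not from the TREC collections:</p>

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of precision@k over the ranks where a relevant document
    appears, normalized by the total number of relevant documents (AP)."""
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}              # judged relevant documents
print(precision_at_k(ranking, relevant, 5))  # 0.4
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 3 = 1/3
```

      <p>nDCG@20, RBP, and ERR@20 follow the same pattern of scoring a ranked list against judgments, with different discounting of rank positions.</p>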
      <p>The full GLMM for the described factorial ANOVA for repeated
measures, with three fixed factors (stop list, stemmer, model) and a
random factor (topic), is:</p>
      <p>Y<sub>ijkl</sub> = μ + τ<sub>i</sub> + α<sub>j</sub> + β<sub>k</sub> + γ<sub>l</sub> + (αβ)<sub>jk</sub> + (αγ)<sub>jl</sub> + (βγ)<sub>kl</sub> + (αβγ)<sub>jkl</sub> + ε<sub>ijkl</sub>
where μ + τ<sub>i</sub> + α<sub>j</sub> + β<sub>k</sub> + γ<sub>l</sub> are the main effects (grand mean, topic, stop list, LUG, and IR model, respectively), the terms (αβ)<sub>jk</sub>, (αγ)<sub>jl</sub>, (βγ)<sub>kl</sub>, and (αβγ)<sub>jkl</sub> are the interaction effects among the fixed factors, and ε<sub>ijkl</sub> is the error.</p>
      <p>[Figure 1: main effect, interaction effect, and Tukey HSD plots for the stop list, stemmer, and IR model factors on the TREC 09 and 10 collections.]</p>
      <p>In Figure 1 we can see a graphical representation of the main analyses we
conducted by running the ANOVA tests on the grids of points described above.
We report only the plots for the TREC 09 and 10 collections, showing
three main plots: the Tukey HSD plot, the main effect plot, and the interaction effect plots.
From the main effect and Tukey Honestly Significant Difference (HSD)
plots we can see that the less aggressive stemmers form
the top group in the case of Web search, while krovetz and lovins stay together
in the second group, well above the group employing no stemmer at all. Compared with
the news search case, the less aggressive stemmers perform better for
Web search, and this may be motivated again by the hypothesis that the noisy
Web context benefits more from avoiding further noise due to over-stemming.</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion and main results</title>
      <p>In general, from the experimental analysis we have seen that linguistic
preprocessing and linguistic resources are very important and contribute
substantially to the effectiveness of an IR system. Thus, the role of the stop list is significant,
as is the choice between stemmers and n-grams.</p>
      <p>In particular, we have seen that the choice of which stop list to use does not make as
big a difference as the choice of whether to use a stop list at all; indeed, we have seen that
there are no significant differences between the "indri", "smart", and "terrier"
stop lists, whereas the "lucene" stop list (which is composed of 15 words) is
significantly different from the other three.</p>
      <p>The main effect of the stemmer is always significant, even though its size
is quite small; nevertheless, there is a tangible difference between systems using
or not using a stemmer. In particular, we observe that there is no significant
difference between the Porter and the Krovetz stemmers, which are the stemmers
with the highest impact on variance, followed by the weak Porter and the Lovins
ones.</p>
      <p>For all the collections, consistently across the measures and for both the
stemmer and the n-grams group, the highest effect size is reported by the stop
list*model interaction effect, which is always of medium or large size. This effect
shows us that the variance of the systems is explained for the most part by the
stop list and the model components. The stop list*stemmer interaction effects
are never significant, and a very similar trend can be observed for the
stemmer*model interaction effect.</p>
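      <p>The effect sizes mentioned above are computed from the ANOVA sums of squares. One common estimator is omega-squared; the sketch below assumes that estimator and uses illustrative numbers, not values from the paper:</p>

```python
def omega_squared(ss_effect, df_effect, ss_total, ms_error):
    """Omega-squared effect-size estimate for one factor in an ANOVA table.

    ss_effect, df_effect: sum of squares and degrees of freedom of the factor;
    ss_total: total sum of squares; ms_error: mean square of the error term.
    """
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Illustrative numbers only (not from the paper):
w2 = omega_squared(ss_effect=4.2, df_effect=4, ss_total=30.0, ms_error=0.05)
print(round(w2, 3))  # 0.133
```

      <p>A common rule of thumb labels values around 0.01 small, around 0.06 medium, and 0.14 or above large, which is the sense in which "medium or large" is used above.</p>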
      <p>It is interesting to note that the second-order interactions for the n-grams
group are all statistically significant and that, in particular,
n-grams, differently from stemmers, have a bigger effect on the stop list than
on the IR model.</p>
      <p>We observe that the different measures see the stop lists in a comparable way
in terms of effect size. This also holds for the stemmer, with the exception
of ERR@20, for which the effect size is almost negligible even though it is
statistically significant. For the n-grams group all the measures are comparable,
and the ERR@20 effect size is not as low as it is for the stemmers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>A General Linear Mixed Models Approach to Study System Component Effects</article-title>
          .
          <source>Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016)</source>
          . ACM Press, New York, USA,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          .
          <article-title>On Per-topic Variance in IR Evaluation</article-title>
          . In W. Hersh,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          , and M. Sanderson, editors,
          <source>Proc. 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012)</source>
          , pages
          <fpage>891</fpage>
          –
          <lpage>900</lpage>
          . ACM Press, New York, USA,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          .
          <article-title>ANOVA and ANCOVA. A GLM Approach</article-title>
          . John Wiley &amp; Sons, New York, USA, 2nd edition,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>