<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Component-Based Evaluation using GLMM?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Topic variance has a greater effect on performance than system variance, but it cannot be controlled by system developers, who can only try to cope with it. On the other hand, system variance is important in its own right, since it is what system developers can affect directly by changing system components, and it determines the differences among systems. In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performance. To this end, we propose a methodology based on the General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance and component effects, as well as their interactions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The analysis of experimental results is a core activity in Information Retrieval (IR),
aimed at, firstly, understanding and improving system performance and,
secondly, assessing our own experimental methods, such as the robustness of
experimental collections or the properties of evaluation measures. When it comes to
explaining system performance and the differences between algorithms, it is
commonly understood [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that system performance can be broken down, to a reasonable approximation, as
system performance = topic effect + system effect + topic/system interaction effect
even though it is not always possible to estimate these effects separately,
especially the interaction one.
      </p>
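      <p>As an illustration, the additive decomposition above can be sketched with a few lines of code. The scores below are synthetic, and this is the standard two-way mean-based decomposition, not the exact estimation procedure of the paper; with a single observation per topic/system cell, the interaction term also absorbs the error:</p>

```python
import numpy as np

# Hypothetical topic x system matrix of effectiveness scores (e.g. AP):
# rows = topics, columns = systems. Values are illustrative only.
scores = np.array([
    [0.30, 0.35, 0.25],
    [0.60, 0.55, 0.65],
    [0.10, 0.20, 0.15],
])

grand_mean = scores.mean()
topic_effect = scores.mean(axis=1) - grand_mean    # one value per topic
system_effect = scores.mean(axis=0) - grand_mean   # one value per system

# What remains after removing the additive effects is the topic/system
# interaction (confounded with error in this single-observation setup).
interaction = (scores - grand_mean
               - topic_effect[:, None] - system_effect[None, :])

# The decomposition reconstructs the original scores exactly.
assert np.allclose(scores,
                   grand_mean + topic_effect[:, None]
                   + system_effect[None, :] + interaction)
```

      <p>By construction, the topic and system effects each sum to zero, so the grand mean carries the overall performance level and the effects carry the deviations.</p>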
      <p>
        It is well known that topic variability is greater than system variability, and
a lot of effort has been put into better understanding this source of variance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as
well as into making IR systems more robust to it. Nevertheless, with respect to an
IR system, topic variance is a kind of "external source" of variation, which cannot
be controlled, but can only be taken into account in order to better deal with it. On the
other hand, system variance is a kind of "internal source" of variation, since it
originates from the choice of system components, may be directly affected by
developers working on those components, and represents the intrinsic differences between
algorithms.
? This is an extended abstract of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Please refer to the original paper for the full
model and experimental results.
      </p>
      <p>Currently, in experimental evaluation we consider system variance as a
single monolithic contribution, and we cannot break it down into the smaller
pieces (the components) constituting an IR system.</p>
      <p>
        We propose a methodology, based on the General Linear Mixed Model (GLMM)
and ANalysis Of VAriance (ANOVA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to address this issue and to estimate
the effects of the different components of an IR system, thus giving us
better insights into what system variance and system effects are. In particular, the
proposed methodology allows us to break down the system effect into the
contributions of stop lists, stemmers or n-grams, and IR models, as well as to study
their interactions.
      </p>
      <p>In this extended abstract we report the main ideas behind the adopted
methodology and the main results we obtained from the experimental evaluation
conducted on standard Text REtrieval Conference (TREC) Ad-hoc collections.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology and Experimentation</title>
      <p>The goal of the proposed methodology is to decompose the effects of different
components on the overall system performance. In particular, we are interested
in investigating the effects of the following components: stop lists; the Lexical Unit
Generator (LUG), namely stemmers or n-grams; and IR models, such as the vector
space or the probabilistic model.</p>
      <p>We considered three main components of an IR system: the stop list, the LUG,
and the IR model. We selected a set of alternative implementations of each
component and, by using the Terrier open source system, we created a run for each
system defined by combining the available components in all possible ways. The
components we selected are:
stop list: nostop, indri, lucene, smart, terrier;
stemmer: nolug, weak Porter, Porter, Krovetz, Lovins;
model: BB2, BM25, DFRBM25, DFRee, DLH, DLH13, DPH, HiemstraLM,
IFB2, InL2, InexpB2, InexpC2, LGD, LemurTFIDF, PL2, TFIDF.</p>
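      <p>The resulting grid of systems can be sketched as follows; the component names mirror the lists above, and the Cartesian product yields one run configuration per combination:</p>

```python
from itertools import product

# Component alternatives, mirroring the lists above.
stop_lists = ["nostop", "indri", "lucene", "smart", "terrier"]
stemmers = ["nolug", "weakPorter", "Porter", "Krovetz", "Lovins"]
models = ["BB2", "BM25", "DFRBM25", "DFRee", "DLH", "DLH13", "DPH",
          "HiemstraLM", "IFB2", "InL2", "InexpB2", "InexpC2",
          "LGD", "LemurTFIDF", "PL2", "TFIDF"]

# One system (and thus one run) per combination: 5 * 5 * 16 = 400.
grid = list(product(stop_lists, stemmers, models))
print(len(grid))  # 400
```

      <p>Each tuple in the grid identifies one system configuration to be indexed and run; how each configuration is passed to Terrier is not shown here.</p>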
      <p>We conducted single-factor and three-factor ANOVA tests for both
groups on the TREC 05, 06, 07, 08, 09, and 10 collections, employing the
following five measures: AP, P@10, nDCG@20, RBP, and ERR@20.</p>
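      <p>As a reminder of what is being measured, two of the five measures can be sketched in a few lines; the ranking and relevance judgments below are toy data, not from the TREC collections:</p>

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of precision@k over the ranks where a relevant document
    appears, normalized by the total number of relevant documents (AP)."""
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}              # judged relevant documents
print(precision_at_k(ranking, relevant, 5))  # 0.4
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 3 = 1/3
```

      <p>nDCG@20, RBP, and ERR@20 follow the same pattern of scoring a ranked list against judgments, with different discounting of rank positions.</p>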
      <p>The full GLMM for the described factorial ANOVA for repeated
measures, with three fixed factors (stop list, stemmer, model) and a
random factor (topic), is:</p>
      <p>Y<sub>ijkl</sub> = μ + τ<sub>i</sub> + α<sub>j</sub> + β<sub>k</sub> + γ<sub>l</sub> + (αβ)<sub>jk</sub> + (αγ)<sub>jl</sub> + (βγ)<sub>kl</sub> + (αβγ)<sub>jkl</sub> + ε<sub>ijkl</sub>
where μ + τ<sub>i</sub> + α<sub>j</sub> + β<sub>k</sub> + γ<sub>l</sub> are the main effects (grand mean, topic, stop list, LUG, and IR model, respectively), the terms (αβ)<sub>jk</sub>, (αγ)<sub>jl</sub>, (βγ)<sub>kl</sub>, and (αβγ)<sub>jkl</sub> are the interaction effects among the fixed factors, and ε<sub>ijkl</sub> is the error.</p>
      <p>[Figure 1: main effect, interaction effect, and Tukey HSD plots for the stop list, stemmer, and IR model factors on the TREC 09 and 10 collections.]</p>
      <p>In Figure 1 we can see a graphical representation of the main analyses we
conducted by running the ANOVA tests on the grids of points described above.
We report only the plots for the TREC 09 and 10 collections, showing
three main plots: the Tukey HSD plot, the main effect plot, and the interaction effect plots.
From the main effect and Tukey Honestly Significant Difference (HSD)
plots we can see that the less aggressive stemmers form
the top group in the case of Web search, while krovetz and lovins stay together
in the second group, well above the group employing no stemmer at all. Compared with
the news search case, the less aggressive stemmers perform better for
Web search, and this may be motivated again by the hypothesis that the noisy
Web context benefits more from avoiding further noise due to over-stemming.</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion and main results</title>
      <p>In general, from the experimental analysis we have seen that linguistic
preprocessing and linguistic resources are very important and contribute
substantially to the effectiveness of an IR system. Thus, the role of the stop list is significant,
as is the choice between stemmers and n-grams.</p>
      <p>In particular, we have seen that the choice of which stop list to use does not make as
big a difference as the choice of whether to use a stop list at all; indeed, we have seen that
there are no significant differences between the "indri", "smart", and "terrier"
stop lists, whereas the "lucene" stop list (which is composed of 15 words) is
significantly different from the other three.</p>
      <p>The main effect of the stemmer is always significant, even though its size
is quite small; nevertheless, there is a tangible difference between systems using
or not using a stemmer. In particular, we observe that there is no significant
difference between the Porter and the Krovetz stemmers, which are the stemmers
with the highest impact on variance, followed by the weak Porter and the Lovins
ones.</p>
      <p>For all the collections, consistently across the measures and for both the
stemmer and the n-grams group, the highest effect size is reported by the stop
list*model interaction effect, which is always of medium or large size. This effect
shows us that the variance of the systems is explained for the most part by the
stop list and the model components. The stop list*stemmer interaction effects
are never significant, and a very similar trend can be observed for the
stemmer*model interaction effect.</p>
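      <p>The effect sizes mentioned above are computed from the ANOVA sums of squares. One common estimator is omega-squared; the sketch below assumes that estimator and uses illustrative numbers, not values from the paper:</p>

```python
def omega_squared(ss_effect, df_effect, ss_total, ms_error):
    """Omega-squared effect-size estimate for one factor in an ANOVA table.

    ss_effect, df_effect: sum of squares and degrees of freedom of the factor;
    ss_total: total sum of squares; ms_error: mean square of the error term.
    """
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Illustrative numbers only (not from the paper):
w2 = omega_squared(ss_effect=4.2, df_effect=4, ss_total=30.0, ms_error=0.05)
print(round(w2, 3))  # 0.133
```

      <p>A common rule of thumb labels values around 0.01 small, around 0.06 medium, and 0.14 or above large, which is the sense in which "medium or large" is used above.</p>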
      <p>It is interesting to note that the second-order interactions for the n-grams
group are all statistically significant and that, in particular,
n-grams, differently from stemmers, have a bigger effect on the stop list than
on the IR model.</p>
      <p>We observe that the different measures see the stop lists in a comparable way
in terms of effect size. This also holds for the stemmer, with the exception
of ERR@20, for which the effect size is almost negligible even though it is
statistically significant. For the n-grams group all the measures are comparable,
and the ERR@20 effect size is not as low as it is for the stemmers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <article-title>A General Linear Mixed Models Approach to Study System Component Effects</article-title>
          .
          <source>Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016)</source>
          . ACM Press, New York, USA,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          .
          <article-title>On Per-topic Variance in IR Evaluation</article-title>
          . In W. Hersh,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          , and M. Sanderson, editors,
          <source>Proc. 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012)</source>
          , pages
          <fpage>891</fpage>
          –
          <lpage>900</lpage>
          . ACM Press, New York, USA,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          .
          <article-title>ANOVA and ANCOVA. A GLM Approach</article-title>
          . John Wiley &amp; Sons, New York, USA, 2nd edition,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>