<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Topic and System Size Affect the Correlation among Evaluation Measures?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we investigate the effect of topic and system sizes on the correlation among evaluation measures for both τ and τ<sub>AP</sub>. We found that topic size matters more than system size and that τ and τ<sub>AP</sub> do not lead to noticeably different rankings among measures.</p>
        <p>Correlation analysis plays a central role in Information Retrieval (IR) evaluation, where it is one of the tools we use to study the properties of, and the relationships among, evaluation measures. When a new evaluation measure is proposed, correlation analysis is used to assess how it ranks IR systems with respect to the existing measures and, thus, to understand whether it actually captures different aspects of the systems and whether its introduction is motivated. In this context, the most used correlation coefficients are Kendall's tau correlation τ [4] and the AP correlation τ<sub>AP</sub> [6].</p>
        <p>In this paper, we investigate the effect of the number of systems and topics on the correlation among evaluation measures, and the differences between using τ and τ<sub>AP</sub>. To answer these research questions, we rely on 3 different Text REtrieval Conference (TREC) collections and, for each collection, we create a Grid of Points (GoP) [2, 3], i.e. a set of system runs originating from all the possible combinations of the following components: 6 different stop lists, 6 types of stemmers, 7 flavors of n-grams, and 17 distinct IR models, leading to 1,326 distinct system runs. These GoPs represent nearly all the state-of-the-art components that constitute the common denominator almost always present in any IR system for English retrieval. We consider 8 different evaluation measures (namely AP, P@10, Rprec, RBP, nDCG, nDCG@20, ERR, and Twist) and we compute the correlation among them over the created GoPs. Finally, we use a General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) [5] to conduct the analyses needed to answer the above research questions.</p>
        <p>The paper is organized as follows: Section 2 introduces the GLMM used for the analyses; Section 3 discusses the experimental findings; finally, Section 4 draws some conclusions and provides an outlook for future work.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Model</title>
      <p>Fig. 1. AP vs nDCG@20: each plot shows the correlation, both τ and τ<sub>AP</sub>, for a given number of topics as the number of systems increases. [Figure not recovered from the extraction. Panels: 10, 20, 30, 40, 50, 60, and 70 topics; x-axis: System Size; y-axis: correlation between AP and nDCG@20; curves: τ and τ<sub>AP</sub>.]</p>
      <p>We create a GoP using the TREC 13, 14, and 15 Terabyte tracks, thus containing 149 topics and 1,326 runs. For each topic size t ∈ T = {10, 20, 30, 40, 50, 60, 70} and system size s ∈ S = {10, 20, 50, 75, 100, 125, 150, 200, 250, 500}, we independently draw H = 100 random samples of t topics and H = 100 random samples of s systems from the GoP. Overall, for each combination (t, s) ∈ T × S of topic and system sizes and for each measure pair, this procedure yields H = 100 samples of correlation values for both τ and τ<sub>AP</sub>.</p>
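      <p>A minimal sketch of this sampling procedure is given below. It assumes that per-topic scores are stored as scores[system][topic] for each measure, that systems are ranked by their mean score over the sampled topics, and that the h-th topic sample is paired with the h-th system sample; none of these details is stated above, and all function names are hypothetical. Kendall's τ is computed in its tau-a form, and τ<sub>AP</sub> follows the usual definition of the AP correlation, which is asymmetric: the first argument plays the role of the reference ranking.</p>
      <preformat>
```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists (tied pairs count as neither)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif 0 > sign:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_ap(reference, other):
    """AP correlation of the ranking induced by `other` against `reference`."""
    n = len(reference)
    # Walk the systems in the order induced by `other`, best first.
    order = sorted(range(n), key=lambda s: other[s], reverse=True)
    total = 0.0
    for i in range(1, n):
        current = order[i]
        # Systems that `other` puts above `current` and `reference` agrees on.
        agree = sum(1 for above in order[:i]
                    if reference[above] > reference[current])
        total += agree / i
    return 2.0 * total / (n - 1) - 1.0


def sample_correlations(scores_a, scores_b, t, s, h=100, seed=0):
    """Draw h samples of t topics and s systems; return (tau, tau_AP) pairs.

    scores_a and scores_b hold per-topic scores of two measures,
    indexed as scores[system][topic] (an assumed layout).
    """
    rng = random.Random(seed)
    n_systems, n_topics = len(scores_a), len(scores_a[0])
    samples = []
    for _ in range(h):
        topics = rng.sample(range(n_topics), t)
        systems = rng.sample(range(n_systems), s)
        # Rank systems by their mean score over the sampled topics.
        mean_a = [sum(scores_a[sy][tp] for tp in topics) / t for sy in systems]
        mean_b = [sum(scores_b[sy][tp] for tp in topics) / t for sy in systems]
        samples.append((kendall_tau(mean_a, mean_b), tau_ap(mean_a, mean_b)))
    return samples
```
      </preformat>
      <p>In this sketch, swapping the two top-ranked systems lowers τ<sub>AP</sub> more than swapping the two bottom-ranked ones, while τ treats both swaps alike; this top-heaviness is what distinguishes the two coefficients.</p>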
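      <p>We use the following model:
        <disp-formula id="eq1"><label>(1)</label><tex-math>Y_{ijkl} = \mu_{\cdot\cdot\cdot\cdot} + \underbrace{\tau_i + \alpha_j + \beta_k + \gamma_l}_{\text{Main Effects}} + \underbrace{(\alpha\beta)_{jk} + (\alpha\gamma)_{jl} + (\beta\gamma)_{kl}}_{\text{Interaction Effects}} + \underbrace{\varepsilon_{ijkl}}_{\text{Error}}</tex-math></disp-formula>
        where: μ<sub>····</sub> is the grand mean; τ<sub>i</sub> (a model term, distinct from Kendall's τ) is the effect of the i-th subject, i.e. one of the h = 1, …, H samples; α<sub>j</sub> is the effect of the j-th factor, i.e. measure pairs; β<sub>k</sub> is the effect of the k-th factor, i.e. number of topics; γ<sub>l</sub> is the effect of the l-th factor, i.e. number of systems; (αβ)<sub>jk</sub>, (αγ)<sub>jl</sub>, and (βγ)<sub>kl</sub> are, respectively, the interactions between measure pairs and number of topics, measure pairs and number of systems, and number of topics and number of systems; and ε<sub>ijkl</sub> is the error.</p>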
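      <p>To give a feel for the size of the design behind equation (1), the sketch below counts measure pairs, design cells, and observations per correlation coefficient, together with the degrees of freedom such a design would imply. It assumes that measure pairs are unordered pairs of distinct measures, which is not stated explicitly; with ordered pairs the pair count would double. The degrees of freedom are derived from these counts only and are not taken from the ANOVA tables.</p>
      <preformat>
```python
from itertools import combinations

measures = ["AP", "P@10", "Rprec", "RBP", "nDCG", "nDCG@20", "ERR", "Twist"]
topic_sizes = [10, 20, 30, 40, 50, 60, 70]
system_sizes = [10, 20, 50, 75, 100, 125, 150, 200, 250, 500]
H = 100  # samples (subjects) per combination

# Assumption: unordered pairs of distinct measures.
measure_pairs = list(combinations(measures, 2))
cells = len(measure_pairs) * len(topic_sizes) * len(system_sizes)
observations = cells * H
print(len(measure_pairs), cells, observations)  # 28 1960 196000

# Degrees of freedom implied by the terms of equation (1).
df = {
    "Subject": H - 1,
    "Measure Pair": len(measure_pairs) - 1,
    "Topic Size": len(topic_sizes) - 1,
    "System Size": len(system_sizes) - 1,
}
df["Measure Pair*Topic Size"] = df["Measure Pair"] * df["Topic Size"]
df["Measure Pair*System Size"] = df["Measure Pair"] * df["System Size"]
df["Topic Size*System Size"] = df["Topic Size"] * df["System Size"]
df["Total"] = observations - 1
# Everything not captured by a model term is left to the error term.
df["Error"] = df["Total"] - sum(v for k, v in df.items() if k != "Total")
```
      </preformat>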
    </sec>
    <sec id="sec-2">
      <title>Experimental Results</title>
      <p>General Trends. As Figure 1 highlights, the number of topics affects both τ and τ<sub>AP</sub>, since their average value increases as the number of topics increases. On the other hand, the number of systems has less impact on the two correlation coefficients: indeed, apart from a small transient up to around 75-100 systems, the trend for both coefficients is roughly constant, especially when the number of topics increases. Note how, in the transient phase, τ and τ<sub>AP</sub> behave differently: τ tends to slightly increase before reaching stability, while τ<sub>AP</sub> shows an initial decrease, sometimes followed by an increase, before becoming more or less constant.</p>
      <p>Table 1. τ correlation: ANOVA table for the GLMM model of equation (1). [Table values not recovered from the extraction. Rows (Source): Subject, Measure Pair, Topic Size, System Size, Measure Pair*Topic Size, Measure Pair*System Size, Topic Size*System Size, Error, Total.]</p>
      <p>Table 2. τ<sub>AP</sub>: ANOVA table for the GLMM model of equation (1). [Table values not recovered from the extraction. Rows (Source): Subject, Measure Pair, Topic Size, System Size, Measure Pair*Topic Size, Measure Pair*System Size, Topic Size*System Size, Error, Total.]</p>
      <sec id="sec-2-11">
        <p>When it comes to confidence intervals, fewer topics and systems lead to wider intervals, which is not surprising. However, τ generally exhibits smaller confidence intervals than τ<sub>AP</sub>, especially for low numbers of topics. Moreover, τ seems to be somewhat more effective than τ<sub>AP</sub> in benefiting from the increasing number of topics and systems; indeed, correlation values stabilize and confidence intervals shrink faster for τ than for τ<sub>AP</sub>.</p>
        <p>ANOVA Analysis. Tables 1 and 2 report the results of the ANOVA analyses on the GLMM model of equation (1) for τ and τ<sub>AP</sub>, respectively. The most prominent effect is the measure pair one, which is a large size effect in terms of ω̂<sup>2</sup>, and it has almost the same size for both τ and τ<sub>AP</sub>. The second biggest effect is the topic size one, which again is a large size effect and has the same size for both τ and τ<sub>AP</sub>. This supports the previous observations about Figure 1, where we noted that topic size influences the correlation among evaluation measures more than system size does. Finally, the system size effect, even if significant, is a very small size effect and we can consider it almost negligible; however, it should be noted that this effect is a little more than three times bigger for τ<sub>AP</sub> than for τ. Overall, this supports the observations made above about the smaller importance of the number of systems for the correlation among evaluation measures, with τ<sub>AP</sub> being more sensitive to this factor than τ.</p>
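        <p>The effect sizes above are expressed through ω̂<sup>2</sup>. A minimal sketch of one common estimator, computed from a factor's F statistic, its degrees of freedom, and the total number of observations, is given below. The exact estimator and thresholds behind Tables 1 and 2 are not stated in the text, so both the formula and the 0.01 / 0.06 / 0.14 rule-of-thumb thresholds are assumptions, the function names are hypothetical, and the F value in the usage line is made up.</p>
        <preformat>
```python
def omega_squared(f_stat, df_effect, n_obs):
    """Estimate omega squared for one factor from its F statistic.

    Assumed estimator: df * (F - 1) / (df * (F - 1) + N).
    """
    numerator = df_effect * (f_stat - 1.0)
    return numerator / (numerator + n_obs)


def effect_label(omega2):
    """Map an effect size to a rule-of-thumb label (assumed thresholds)."""
    if omega2 >= 0.14:
        return "large"
    if omega2 >= 0.06:
        return "medium"
    if omega2 >= 0.01:
        return "small"
    return "below the small threshold"


# Made-up F statistic for a factor with 6 degrees of freedom.
example = omega_squared(f_stat=5000.0, df_effect=6, n_obs=196000)
print(round(example, 3), effect_label(example))
```
        </preformat>
        <p>With a very large number of observations, even a tiny effect can be statistically significant, which is why an effect size such as ω̂<sup>2</sup> is reported alongside significance.</p>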
        <p>When it comes to the interaction between effects, for both τ and τ<sub>AP</sub>, the measure pair and topic size (αβ)<sub>jk</sub> and the topic size and system size (βγ)<sub>kl</sub> interactions are statistically significant. On the other hand, the measure pair and system size (αγ)<sub>jl</sub> interaction is not significant, which further stresses the fact that the number of systems does not much influence the correlation among evaluation measures.</p>
        <p>Figure caption: Comparison [of] τ and τ<sub>AP</sub> in terms of how they rank evaluation measures. [Figure not recovered from the extraction. Panel title: Correlation among RoMP: tauCorr = 0.9735; apCorr = 0.8815; curves: τ and τ<sub>AP</sub>.]</p>
        <p>[…] according to τ and τ<sub>AP</sub>: we can note that there are very few swaps, and always among adjacent rank positions. On the right, [we] show the actual correlation values but with means centered around zero: it is evident how close τ and τ<sub>AP</sub> are, apart from a constant offset; indeed the […] among the two curves is just 0.0242, indicating very small differences.</p>
        <p>Overall, these […] measures and you compare them across a large set of topic and system sizes, removing those effects, τ and τ<sub>AP</sub> have different absolute values but they provide a quite consistent assessment of what the differences among these measures are.</p>
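        <p>The mean-centering comparison can be illustrated with a short sketch. The name of the discrepancy measure reported in the text is missing from the source; the root mean square error (RMSE) used below is therefore only an assumption, the function names are hypothetical, and the values are made up for illustration.</p>
        <preformat>
```python
import math


def center(values):
    """Shift a series so that its mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def rmse(xs, ys):
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Made-up correlation values for three measure pairs (not from the paper).
tau_values = [0.90, 0.80, 0.70]
tau_ap_values = [0.84, 0.75, 0.62]

raw_gap = rmse(tau_values, tau_ap_values)
centered_gap = rmse(center(tau_values), center(tau_ap_values))
print(round(raw_gap, 4), round(centered_gap, 4))
```
        </preformat>
        <p>Centering removes any constant offset between the two series, so the remaining gap reflects only differences in shape.</p>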
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>We investigated how topic and system size affect the correlation among evaluation measures. We discovered that the number of topics has more impact than the number of systems, and that the number of systems does not cause the correlation [to] steadily increase but it reaches a stable point quite […]. [We] also […] that the behavior of τ and τ<sub>AP</sub> is quite consistent when comparing a whole set of evaluation measures, yet producing different absolute correlation values.</p>
      <p>As future work, we plan to investigate how the different system […], e.g. stop lists, stemmers, IR […], affect the correlation among evaluation measures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <mixed-citation>Ferro, N.: […] Does Affect the Correlation […] 36(2), 19:1–19:40 (2017)</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <mixed-citation>Ferro, N., Harman, D.: CLEF 2009: Grid@CLEF […] 2009. pp. 552–565. LNCS 6241 (2010)</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <mixed-citation>Ferro, N., Silvello, G.: Toward an Anatomy of IR System Component Performances. JASIST 69(2), 187–200 (2018)</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <mixed-citation>Kendall, M.G.: Rank correlation methods. […], Oxford, England (1948)</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <mixed-citation>Rutherford, A.: ANOVA and ANCOVA. A GLM Approach. John Wiley &amp; Sons, […] (2011)</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <mixed-citation>Yilmaz, E., Aslam, J.A., Robertson, S.E.: A New Rank Correlation […]</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>