<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Correlation to Evaluate QPP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INSPE</institution>
          ,
          <addr-line>UT2J</addr-line>
          ,
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut de Recherche en Informatique de Toulouse</institution>
          ,
          <addr-line>IRIT, UMR5505, CNRS, Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Correlation is widely used to test the hypothesis of a relationship between two variables. In this paper we focus the discussion on query difficulty prediction, for which correlation is often used to measure the accuracy of predictors. Here, the correlation is calculated between the actual system effectiveness and the predicted one. Although fairly simple to calculate, the Pearson correlation coefficient can be difficult to interpret and use correctly, especially because of its sensitivity to outliers. This paper illustrates the problem and opens discussion pathways.</p>
      </abstract>
      <kwd-group>
        <kwd>Information systems</kwd>
        <kwd>Information retrieval</kwd>
        <kwd>Query performance prediction</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Correlation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There are various methods to quantify the relationship between two variables; the correlation
coefficient is one of them. Among correlation coefficients, the Pearson product-moment coefficient is the most
widely used. The Kendall and Spearman correlations are other measures used when two variables are to be
analysed.</p>
      <p>Correlation calculation results in a value that ranges between −1 (strong negative correlation)
and 1 (strong positive correlation); 0 indicates that the two variables are not correlated.
The associated p-value indicates the confidence, or risk of error, in rejecting the hypothesis that the two
variables are independent.</p>
      <p>This paper aims to discuss the possible misinterpretation of correlation through some
examples. Here, we mainly focus on the Pearson correlation, which is the most used in QPP, although we
also consider the other correlation coefficients.</p>
      <p>
        In addition to some assumptions made on the variables, which we describe in Section 2, one
of the main problems in using the Pearson correlation measure is its sensitivity to outliers [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], which
we illustrate in Section 3. The specific case of QPP is studied in Section 4. Section 5 concludes
this paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Correlation measures</title>
      <p>The most familiar measure of correlation is the Pearson product-moment correlation
coefficient (also called simply the correlation coefficient and labelled r), which is a normalised form of the
covariance. The covariance between two random variables measures their joint deviation from their
expected values, which, for numerical data, is the deviation from the mean. Pearson r assumes a
linear relationship between X and Y.</p>
      <p>More formally, r is calculated by dividing the covariance of the two variables by the
product of their standard deviations. The correlation coefficient between two random variables
X (x_1, x_2, ..., x_i, ..., x_n) and Y (y_1, y_2, ..., y_i, ..., y_n) is defined as:</p>
      <p>r(X, Y) = cov(X, Y) / (σ(X) σ(Y))</p>
      <p>where cov(X, Y) = (1 / (n − 1)) ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) and σ(X)² = (1 / (n − 1)) ∑_{i=1}^{n} (x_i − x̄)².</p>
      <p>Thus, this correlation coefficient measures the link between the two variables through
the mean of the product of their distances to their respective means. When it is
close to 1 or −1, the two variables are strongly correlated (positively or negatively), supporting
the hypothesis that there is a linear relationship between the two variables.</p>
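      <p>As a sanity check of the definition above, the coefficient can be computed directly. The sketch below is ours, not code from the paper; it implements r with the sample (n − 1) covariance and standard deviations:</p>
      <preformat>
```python
import math

def pearson(xs, ys):
    """Pearson r: the covariance of the two variables divided by the
    product of their standard deviations (sample form, n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

# A perfectly linear relationship gives r = 1 (up to rounding).
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```
      </preformat>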
      <p>Alternatively, Spearman’s correlation (ρ) considers the ranks rather than the values and
measures how far the ranks of the variables are from each other. ρ is Pearson computed on ranks (ρ =
r once the columns of X and Y are replaced by their ranks). Spearman’s ρ assumes a monotonic
relationship between X and Y.</p>
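      <p>This "Pearson on ranks" description translates directly to code. The sketch below (our illustration) ranks the two columns, averaging ranks over ties, and then applies Pearson:</p>
      <preformat>
```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(vs):
    """1-based rank of each value, averaging ranks over tied groups."""
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    r = [0.0] * len(vs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Monotonic but non-linear: rho = 1 while Pearson r on the values is < 1.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman(xs, ys))
```
      </preformat>
      <p>ρ reaches 1 here because the cubic relationship is perfectly monotonic, even though it is not linear.</p>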
      <p>Similarly, the Kendall correlation measures the correlation on ranks, that is, the similarity of the
orderings of the data when ranked by each of the variable values. As opposed to ρ, it is affected only by whether
the ranks of observations agree, without considering how far apart they are.
It is thus considered more appropriate for discrete variables. Kendall measures
the concordance of any pair of observations (x_i, y_i) and (x_j, y_j), where i ≠ j. The pair is said
to be concordant if the orderings of both elements agree (x_i &gt; x_j and y_i &gt; y_j, or x_i &lt; x_j and
y_i &lt; y_j), and discordant if the reverse occurs. The Kendall τ coefficient is defined as:</p>
      <p>τ = [(number of concordant pairs) − (number of discordant pairs)] / [n(n − 1)/2]</p>
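      <p>The definition can be applied verbatim by enumerating all pairs; the sketch below (ours) implements the tau-a variant, in which tied pairs count for neither side:</p>
      <preformat>
```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / (n(n-1)/2).
    A pair (i, j) is concordant when the orderings of x and y agree."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # s == 0 (a tie) counts for neither side in this tau-a form
    return (concordant - discordant) / (n * (n - 1) / 2)

# One discordant pair out of 6: tau = (5 - 1) / 6.
print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))
```
      </preformat>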
      <p>Whatever the correlation measure, for it to be significant the link between the two variables
should not be due to the data sample only (i.e. random) but should reflect the link between the
two variables in the entire population. Testing the null hypothesis aims at answering this question.</p>
      <p>Thus, when considering the Pearson correlation (the same holds for the other correlation
measures), what is tested is H0: r = 0 (no statistical link between the two variables) vs.
H1: r ≠ 0 (there is a statistical link between the two variables). In bivariate normal data, r = 0
if and only if X and Y are independent, so testing for independence is equivalent to testing
r = 0 in this situation.</p>
      <p>The null hypothesis H0: r = 0 (there is no relationship between the two variables X
and Y) is usually rejected when the p-value &lt; 0.05 (and thus the variables are considered
related in that case). The p-value is a number between 0 and 1 representing the likelihood of
the observation if the null hypothesis is assumed to be correct: a statistically significant result is
one that would be highly improbable if the null hypothesis were true.</p>
      <p>Thus, calculating r(X, Y) and checking that the p-value &lt; 0.05 is commonly used in order to
conclude whether X and Y are related.</p>
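      <p>Statistical packages report the parametric test on r; purely as an illustration of what the p-value means, it can also be estimated with a permutation test using only the standard library. This is our sketch, not the procedure used in QPP evaluations:</p>
      <preformat>
```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value for H0: r = 0, estimated by shuffling Y:
    the share of random pairings whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    r_obs = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= r_obs - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate > 0

# A clearly linear relationship: the p-value is small, H0 is rejected.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
print(permutation_pvalue(xs, ys))
```
      </preformat>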
      <p>Correlation is easy to calculate, although some misinterpretation or over-interpretation can
occur, as illustrated by Anscombe’s quartet, presented in the next section.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Anscombe’s quartet</title>
      <p>
        Anscombe illustrates how plotting the data graphically complements the calculation of
correlation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Table 1 presents the 4 data sets Anscombe designed: each element is represented by two
variables X and Y for which we want to know whether they correlate or not. Table 2 presents
some statistics of the 4 data sets; it reports that various aggregate values are the same for all
4 data sets: the number of elements, the mean of the variable X and that of Y, as well as
the Pearson correlation and the associated p-values. In addition, from the same table, it can be
observed that the r correlation value is 0.816 (which is considered a high value) and the
p-value &lt; 0.05 (which is considered significant).</p>
      <p>Because real data may not respect the mathematical assumptions (a linear relationship
between X and Y in the case of r), and because r is also sensitive to outliers, without having a
look at the data and simply trusting the r value and the associated p-value, one could consider
the 4 cases equivalent in terms of strength of correlation. However, the data plots tell a
different story (see Figure 1).</p>
      <p>When plotting the corresponding dots as in Figure 1, it is obvious that the 4 data sets are very
different. For data set #1, 0.816 seems to reflect appropriately the linear correlation between
X and Y. In data set #2, there is a clear correlation between X and Y, but it is far from
linear; in this latter case, a different correlation measure may better reflect this perfect
relationship. In data set #3, the correlation between X and Y would be 1 if the outlier were
removed from the data set; this outlier abnormally lowers the correlation value. Finally, in data
set #4, there is no correlation at all, and the high correlation value is due to an outlier. Removing
this outlier would make the correlation 0.</p>
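      <p>The effect in data set #3 can be reproduced numerically from the published values of Anscombe's quartet; the point (13, 12.74) is the outlier discussed above:</p>
      <preformat>
```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Anscombe's data set #3: every point lies on one line except (13, 12.74).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

print(round(pearson(x, y), 3))  # 0.816, the same value as the other sets

# Drop the outlier (index 2): the correlation becomes almost exactly 1.
print(round(pearson(x[:2] + x[3:], y[:2] + y[3:]), 3))
```
      </preformat>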
      <p>Anscombe’s quartet illustrates that a correlation value cannot be considered without having a
look at the plots. However, most of the time in IR studies (and in other areas as well), correlation
is reported without plotting, just "trusting" the associated p-value. There is thus
a risk of misinterpretation, all the more so since many authors use correlation coefficients without
checking whether the assumptions are met (e.g. a linear relationship in the case of the Pearson correlation).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Query difficulty predictors and correlation</title>
      <p>In query difficulty prediction, the accuracy of a predictor is often measured in terms of how
much the values of the predictor correlate with the actual system effectiveness.</p>
      <p>In this section, we consider NDCG as the system effectiveness measure and thus as the value
to be predicted by the query difficulty predictor. The system we use here is a simple BM25
weighting scheme. We also consider as illustrative examples two well-known query difficulty
predictors, BM25 and IDF. BM25 is based on the scores obtained by the retrieved documents; it is thus
a post-retrieval feature. IDF, on the other hand, is a pre-retrieval feature based on the IDF of the
query words. We consider two variants that have been used in the literature for these two features:
the maximum and standard deviation for BM25, later referenced as BM25_MAX and BM25_STD
(the maximum and standard deviation of the BM25 weights of the document-query pairs for that
query); and the maximum and average for IDF, later referenced as IDF_MAX and IDF_AVG (the
maximum and average inverse document frequency of the query terms).</p>
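      <p>The four predictors can be sketched as follows. The helper names, the score values, and the log(N/df) form of IDF are our illustrative assumptions, not the exact formulation used in the experiments:</p>
      <preformat>
```python
import math
import statistics

def bm25_features(scores):
    """Post-retrieval predictors from the BM25 scores of the documents
    retrieved for one query (illustrative sketch)."""
    return {"BM25_MAX": max(scores),
            "BM25_STD": statistics.stdev(scores)}

def idf_features(term_df, n_docs, query_terms):
    """Pre-retrieval predictors from the IDF of each query term.
    term_df maps a term to its document frequency; IDF uses the
    classic log(N / df) form (one of several variants in use)."""
    idfs = [math.log(n_docs / term_df.get(t, 1)) for t in query_terms]
    return {"IDF_MAX": max(idfs),
            "IDF_AVG": sum(idfs) / len(idfs)}

# Hypothetical numbers for one query over a 1.7M-document collection.
feats = bm25_features([12.1, 9.4, 8.7, 8.6, 3.2])
feats.update(idf_features({"neural": 1200, "ranking": 300}, 1_700_000,
                          ["neural", "ranking"]))
print(feats)
```
      </preformat>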
      <sec id="sec-4-1">
        <title>4.1. Measuring correlation</title>
        <p>A typical problem is to compare the accuracy of different variables (here, these four
features) in predicting query difficulty. One common solution is to compute the correlation between
each variable that corresponds to a predictor and the target variable that represents the system
effectiveness (e.g. NDCG).</p>
        <p>
          Table 3 reports the Pearson correlation as well as the Kendall τ and Spearman correlation of the
4 query features (BM25_MAX, BM25_STD, IDF_MAX, IDF_AVG) with NDCG on the WT10G TREC collection,
which consists of topics 451-550 and about 1.7 million web pages. [Table 3: Correlation between query
features and NDCG; p-values below the usual 0.05 threshold are marked with *.] The three calculations
agree on the fact that the correlation values are weak, which is often the case in this task [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. They also agree that the BM25
post-retrieval features are better predictors than the IDF pre-retrieval features, that IDF_AVG is
weakly, and generally not significantly, correlated with NDCG, and that IDF_MAX’s correlation is also
weak. However, the three correlation measures disagree on the best predictor: while Pearson
suggests BM25_MAX is the best, Kendall and Spearman prefer BM25_STD.
        </p>
        <p>Should the disagreement among methods be seen as a warning when discussing the
results and drawing conclusions? We believe so.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Plotting the data</title>
        <p>Visually (Figure 2), it becomes difficult to see which is the best predictor for NDCG.</p>
        <p>We can see that IDF_MAX has many outliers (right side of Figure 2a). If we removed these
outliers with very high IDF_MAX values, the remaining measurements would be much more correlated than
those of BM25_MAX, for which it is difficult to identify any correlation.</p>
        <p>Should we plot the data to make sure that the calculated coefficients are meaningful
and comparable? We think so.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Impact of outliers</title>
        <p>When observing the (Pearson) correlation value only (first line of Table 3), BM25_MAX is more
correlated to NDCG than BM25_STD, both being statistically significantly correlated. When
observing the plots in Figure 2, we can see that a topic (#463) in the bottom right corner of
Figure 2d is an "outlier" (like the outlier from the 3rd Anscombe data set). If we remove this
outlier and calculate the correlation again, we obtain the first group of rows in Table 4. Indeed,
when removing this single topic from the collection, the Pearson correlation of BM25_STD
increases by about 46% (from 0.232 to 0.339) and becomes higher than that of BM25_MAX, while the
latter is stable (0.294).</p>
        <p>In the same way, when considering IDF_AVG, the numerical results indicate that
independence cannot be rejected (cor = 0.127 and p-value = 0.2125). When topic 463 is removed from the
collection, the correlation of IDF_AVG doubles; more importantly, while it
was not significant initially, independence can now be rejected with fairly high confidence
(the p-value of 0.027 is lower than the commonly used 0.05 threshold).</p>
        <p>We believe that the coefficients should be used with caution when comparing
different predictors.</p>
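        <p>A cheap diagnostic that would flag a topic like #463 automatically is to recompute the correlation with each observation left out and look at the largest swing. This leave-one-out check is our suggestion, not a procedure from the experiments above; the data below are toy values:</p>
        <preformat>
```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def loo_influence(xs, ys):
    """Recompute r with each observation left out; a large swing flags
    a point whose removal changes the conclusion."""
    r_full = pearson(xs, ys)
    return [(i, pearson(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]) - r_full)
            for i in range(len(xs))]

# Toy data: the last point is an outlier that inflates the correlation.
xs = [1.0, 2.0, 1.5, 2.5, 2.0, 10.0]
ys = [2.1, 1.4, 2.3, 1.2, 2.2, 9.0]
worst = max(loo_influence(xs, ys), key=lambda t: abs(t[1]))
print(worst)  # index 5 moves r the most
```
        </preformat>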
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>In this paper we point out the need for discussion of the use of correlation coefficients for query
performance prediction. We illustrated the possible misinterpretation of correlation measures.
This is a challenge when comparing several variables with regard to their link with a target
variable.</p>
      <p>The influence of outliers has been little studied in the case of correlation coefficients.</p>
      <p>
        In the case of Principal Component Analysis, which is also used to analyse variable
relationships when a large number of variables are involved, Kriegel et al. proposed an approach to
increase robustness: they suggested using a weighted covariance in order to make PCA less
sensitive to outliers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In the case of regression, Huang et al. proposed robust regression
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], also to make the method less sensitive to outliers. To the best of our knowledge, nothing
similar has been proposed for correlation. Considering the popularity of this method, it would
be worth investigating this problem.
      </p>
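      <p>To make the direction concrete, here is a minimal sketch of a Pearson coefficient that accepts per-observation weights, in the spirit of the weighted-covariance idea; the specific weighting function is our illustration, not the scheme of [12]:</p>
      <preformat>
```python
import math
import statistics

def weighted_pearson(xs, ys, ws):
    """Pearson correlation with per-observation weights, so that
    suspected outliers can be downweighted instead of dropped."""
    s = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / s
    my = sum(w * y for w, y in zip(ws, ys)) / s
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    vx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    vy = sum(w * (y - my) ** 2 for w, y in zip(ws, ys))
    return cov / math.sqrt(vx * vy)

def distance_weights(xs, ys):
    """One possible weighting (illustrative): downweight points far
    from the coordinate-wise medians."""
    cx, cy = statistics.median(xs), statistics.median(ys)
    d = [math.hypot(x - cx, y - cy) for x, y in zip(xs, ys)]
    scale = statistics.median(d) or 1.0
    return [1.0 / (1.0 + (di / scale) ** 2) for di in d]

# Anscombe's data set #3 again: near-zero weight on the outlier at
# index 2 recovers the almost perfect linear correlation.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
ws = [1.0] * len(x)
ws[2] = 0.01
print(round(weighted_pearson(x, y, ws), 3))
```
      </preformat>
      <p>With uniform weights the formula reduces to the ordinary Pearson coefficient, so such a measure degrades gracefully when no outliers are present.</p>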
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Ravana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajagopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          ,
          <article-title>Ranking retrieval systems using pseudo relevance judgments</article-title>
          ,
          <source>Aslib Journal of Information Management</source>
          <volume>67</volume>
          (
          <year>2015</year>
          )
          <fpage>700</fpage>
          -
          <lpage>714</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Maskari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <article-title>A review of factors influencing user satisfaction in information retrieval</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>61</volume>
          (
          <year>2010</year>
          )
          <fpage>859</fpage>
          -
          <lpage>868</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Krasakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Voskarides</surname>
          </string-name>
          , E. Kanoulas,
          <article-title>Analysing the effect of clarifying questions on document ranking in conversational search</article-title>
          ,
          <source>in: Proc. of the ACM SIGIR Intern. Conference on Theory of Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yom-Tov</surname>
          </string-name>
          ,
          <article-title>Estimating the query difficulty for information retrieval</article-title>
          ,
          <source>Synthesis Lectures on Information Concepts</source>
          ,
          <source>Retrieval, and Services</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>1</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <article-title>Why do you think this query is difficult?: A user study on human query prediction</article-title>
          ,
          <source>in: Proc. of the 39th Inter. ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1073</fpage>
          -
          <lpage>1076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          , F. de Jong,
          <article-title>A survey of pre-retrieval query performance predictors</article-title>
          ,
          <source>in: Proc. of the 17th ACM Conference on Information and Knowledge Management</source>
          ,
          ACM
          ,
          <year>2008</year>
          , pp.
          <fpage>1419</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>A relative information gain-based query performance prediction framework with generated query variants</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>41</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Pearson</surname>
          </string-name>
          ,
          <article-title>Exploring process data</article-title>
          ,
          <source>Journal of Process Control</source>
          <volume>11</volume>
          (
          <year>2001</year>
          )
          <fpage>179</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Casper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tufis</surname>
          </string-name>
          ,
          <article-title>Correlation versus interchangeability: The limited robustness of empirical findings on democracy using highly correlated data sets</article-title>
          ,
          <source>Political Analysis</source>
          <volume>11</volume>
          (
          <year>2003</year>
          )
          <fpage>196</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Anscombe</surname>
          </string-name>
          ,
          <article-title>Graphs in statistical analysis</article-title>
          ,
          <source>The American Statistician</source>
          <volume>27</volume>
          (
          <year>1973</year>
          )
          <fpage>17</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <article-title>Analytics methods to understand information retrieval effectiveness - a survey</article-title>
          ,
          <source>Mathematics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>2135</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kröger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          ,
          <article-title>A general framework for increasing the robustness of pca-based correlation clustering algorithms</article-title>
          ,
          <source>in: International Conference on Scientific and Statistical Database Management</source>
          , Springer,
          <year>2008</year>
          , pp.
          <fpage>418</fpage>
          -
          <lpage>435</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cabral</surname>
          </string-name>
          , F. De la Torre,
          <article-title>Robust regression</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>38</volume>
          (
          <year>2016</year>
          )
          <fpage>363</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>