<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Harmony Assumptions: Extending Probability Theory for Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In many applications, event occurrences are assumed to be independent even when there is evidence of dependence. Capturing dependence leads to complex models, and even where the complex models are superior, they fail to beat the simplicity and scalability of the independence assumption. Many models therefore assume independence and apply heuristics to improve results. Theoretical explanations of these heuristics are seldom given, and those that are given are seldom generalisable. Roelleke et al. [1] report that some of these heuristics can be explained as encoding dependence in an exponent based on the generalised harmonic sum. Under independence, the probability of subsequent occurrences of an event is the product of the single-event probabilities; harmony is instead based on a product with a decaying exponent. For independence, the sequence probability is <inline-formula><tex-math>p^{1+1+\ldots+1} = p^{n}</tex-math></inline-formula>. For harmony, the probability is <inline-formula><tex-math>p^{1+1/2+\ldots+1/n} \approx p^{1+\log(n)}</tex-math></inline-formula>. The generalised harmonic sum is the exponent of p, and this leads to a spectrum of harmony assumptions. We will discuss how settings of the term frequency (TF) in IR correspond to harmony assumptions. We will focus on four settings of the TF:</p>
        <p>[1] Thomas Roelleke, Andreas Kaltenbrunner, and Ricardo A. Baeza-Yates. Harmony assumptions in information retrieval and social networks. Comput. J., 58(11):2982-2999, 2015.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p><disp-formula><tex-math><![CDATA[
\mathrm{TF}(t, d) :=
\begin{cases}
tf_d & \text{total TF: corresponds to assuming independence} \\
\sqrt{tf_d + 1} - 1 & \text{sqrt TF: middle between total TF and log-TF} \\
\log(tf_d + 1) & \text{log-TF: assumes a form of harmony} \\
tf_d / (tf_d + K_d) & \text{BM25 TF: assumes a strong form of harmony}
\end{cases}
]]></tex-math></disp-formula>
Roelleke et al. [1] give series-based explanations of the TF settings, and these lead to new insights into the relationships between IR and probability theory. From an IR point of view, an exciting finding is that the BM25-TF is the harmonic sum of Gaussian sums.</p>
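      <p>As a rough illustration, the four TF settings can be written as small Python functions. This is a sketch under stated assumptions, not the implementation used in [1]: the function names are hypothetical, the sqrt variant is read as sqrt(tf_d + 1) - 1, the logarithm is taken to be natural, and K_d = 1 is only a placeholder default.</p>
      <preformat><![CDATA[
```python
import math


def tf_total(tf):
    """Total TF: corresponds to assuming independence."""
    return float(tf)


def tf_sqrt(tf):
    """Sqrt TF, assumed here to be sqrt(tf + 1) - 1."""
    return math.sqrt(tf + 1) - 1


def tf_log(tf):
    """Log TF: log(tf + 1); the natural logarithm is an assumption."""
    return math.log(tf + 1)


def tf_bm25(tf, k_d=1.0):
    """BM25 TF: tf / (tf + K_d); k_d=1.0 is a placeholder default."""
    return tf / (tf + k_d)


if __name__ == "__main__":
    # Print the four quantifications side by side for a few term frequencies.
    for tf in (0, 1, 2, 5, 10, 100):
        row = (tf_total(tf), tf_sqrt(tf), tf_log(tf), tf_bm25(tf))
        print(tf, [round(value, 3) for value in row])
```
]]></preformat>
      <p>All four variants are 0 for a term that does not occur. As the term frequency grows, the total TF increases linearly, the sqrt and log variants increase ever more slowly, and the BM25 TF saturates below 1.</p>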
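      <p><disp-formula><tex-math><![CDATA[
\frac{tf_d}{tf_d + 1} = \frac{1}{2} \left( 1 + \frac{1}{1 + 2} + \ldots + \frac{1}{1 + 2 + \ldots + tf_d} \right)
]]></tex-math></disp-formula>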
This finding provides a probabilistic interpretation of the BM25-TF quantification.
An experimental study for IR and social media investigates assumptions that explain the
dependence between term occurrences. Interestingly, the sqrt-harmony assumption, i.e. the middle
between total TF and log-TF, is on average a better assumption than either independence or the strong
harmony assumptions corresponding to log-TF and BM25-TF. The potential impact of harmony
assumptions extends beyond IR, since many scientific disciplines and applications rely on probability
theory and apply heuristics to compensate for the independence assumption. With the concept of
harmony assumptions, the dependence between multiple occurrences of an event can be reflected
in an intuitive and effective way.</p>
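      <p>The relationship between the BM25-TF and the harmonic sum of Gaussian sums can be checked numerically. The sketch below assumes K_d = 1 and uses exact rational arithmetic; the function names are hypothetical. It tests that 1 + 1/(1+2) + ... + 1/(1+2+...+tf_d) equals twice tf_d / (tf_d + 1).</p>
      <preformat><![CDATA[
```python
from fractions import Fraction


def gaussian_sum(k):
    """Gaussian sum 1 + 2 + ... + k."""
    return k * (k + 1) // 2


def harmonic_sum_of_gaussian_sums(tf):
    """1/1 + 1/(1+2) + ... + 1/(1+2+...+tf) as an exact fraction."""
    terms = (Fraction(1, gaussian_sum(k)) for k in range(1, tf + 1))
    return sum(terms, Fraction(0))


def bm25_tf(tf, k_d=1):
    """BM25 TF tf / (tf + K_d) as an exact fraction; K_d = 1 is assumed."""
    return Fraction(tf, tf + k_d)


def identity_holds(max_tf):
    """Check bm25_tf(tf) == (1/2) * harmonic sum for tf = 1..max_tf."""
    return all(
        bm25_tf(tf) == harmonic_sum_of_gaussian_sums(tf) / 2
        for tf in range(1, max_tf + 1)
    )


if __name__ == "__main__":
    print(identity_holds(50))
```
]]></preformat>
      <p>The factor 1/2 follows from 1/(1+2+...+k) = 2/(k(k+1)) = 2(1/k - 1/(k+1)), so the sum telescopes to 2 tf_d / (tf_d + 1).</p>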
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>