=Paper= {{Paper |id=Vol-1670/paper-70 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1670/paper-70.pdf |volume=Vol-1670 }} ==None== https://ceur-ws.org/Vol-1670/paper-70.pdf
        Harmony Assumptions: Extending Probability Theory for
                      Information Retrieval

                                          Thomas Roelleke
                                     Queen Mary University of London

In many applications, independence of event occurrences is assumed, even if there is evidence
for dependence. Capturing dependence leads to complex models, and even if the complex models
were superior, they fail to beat the simplicity and scalability of the independence assumption.
Therefore, many models assume independence and apply heuristics to improve results. Theoreti-
cal explanations of the heuristics are seldom given or generalisable.
[1] reports that some of these heuristics can be explained as encoding dependence in an exponent
based on the generalised harmonic sum. Unlike independence, where the probability of subse-
quent occurrences of an event is the product of the single event probability, harmony is based on
a product with decaying exponent.
For independence, the sequence probability is p1+1+...+1 = pn . For harmony, the probability is
p1+1/2+...+1/n ≈ p1+log(n) . The generalised harmonic sum is the exponent of p, and this leads to
a spectrum of harmony assumptions. We will discuss that settings of the term frequency (TF) in
IR correspond to harmony assumptions. We will focus on four settings of the TF:
                     
                       tfd              total TF: corresponds to assuming independence
                     √
                     
                     
                          tfd + 1 − 1 sqrt TF: middle between total TF and log-TF
         TF(t, d) :=
                     
                      log(tfd + 1)     log-TF: assumes a form for harmony
                       tfd /(tfd + Kd ) BM25 TF: assumes a strong form of harmony
                     

[1] shows series-based explanations of the TF settings, and these lead to new insights regarding
the relationships between IR and probability theory. From an IR point of view exciting is the
finding that the BM25-TF is the harmonic sum of Gaussian sums.
                                                                            
                        tfd     1          1                    1
                             = · 1+            + ... +
                     tfd + 1    2        1+2             1 + 2 + . . . + tfd
This finding provides a probabilistic interpretation of the BM25-TF quantification.
An experimental study for IR and social media investigates assumptions that explain the depen-
dence between term occurrences. Interestingly, the assumption sqrt-harmony, i.e. the middle be-
tween the total-TF and log-TF, is on average a better assumption than independence or the strong
harmony assumptions corresponding to log-TF and BM25-TF. The potential impact of harmony
assumptions lies beyond IR, since many scientific disciplines and applications rely on probability
theory and apply heuristics to compensate the independence assumption. Given the concept of
harmony assumptions, the dependence between multiple occurrences of an event can be reflected
in an intuitive and effective way.


References
1. Thomas Roelleke, Andreas Kaltenbrunner, and Ricardo A. Baeza-Yates. Harmony assumptions in information
   retrieval and social networks. Comput. J., 58(11):2982–2999, 2015.