Single Query Optimisation is the Root of All Evil

J. Shane Culpepper
RMIT University
Melbourne, Australia

Overview. The demise of the age of one-shot web query optimisation is nigh. For Information Retrieval researchers and search engine engineers, this is a time to rejoice, as new opportunities to revisit old techniques are once again upon us. For years, search systems have tried to infer the intentions of a user using only a few (sometimes) carefully selected search terms. However, the classic search interface (the web browser) on a computer will soon be obsolete. Instead, users will find information through mobile devices and conversational search systems such as Alexa, Cortana, or Siri. These interfaces provide direct access to relevance feedback mechanisms from searchers, and allow new opportunities to model state instead of depending on only a single query. In this abstract, we argue that now is the time for IR researchers to once again return to building relevance models for information needs, and to stop thinking in terms of one-off queries. We show that simple combinations of classic techniques, along with multiple representations of a single information need, can easily outperform state-of-the-art models which perform optimisations on a query-by-query basis. This is a simple first step in the right direction.

Problem. The pitfalls of over-optimising a complex multi-stage retrieval system for a single query are rarely considered by search engine designers. Recent work by Bailey et al. [1] showed that thinking in terms of queries and not the underlying information need can lead to dramatic variance in system effectiveness, but the authors do not consider the efficiency implications of query variation, or fully explore how higher-level modeling of the information need might be accomplished. So, the key research challenge we set in this abstract is:

Research Challenge: How should academics and system designers model and optimise search performance based on information needs and not a single query?

Table 1 compares three state-of-the-art search systems, with a properly tuned BM25 bag-of-words model as a baseline, using 100 ad hoc queries from the ClueWeb12B UQV100 collection [1]. The three systems being compared are BM25, a field-based SDM model [9] (the exact configuration is identical to the one described by Gallagher et al. [7]), a LambdaMART learning-to-rank (LTR) model [4, 5] (here LightGBM is used with 459 features), and double unsupervised fusion [3] (RRF [6] over all UQV query variations and two systems, SDM-Field and BM25). We can see that not only does fusion make more queries better on average, it is also far less likely to make queries worse. This can clearly be seen when comparing Wins, Ties, and Losses (W/T/L) in the table, where a Win or a Loss is recorded for any query that increases or decreases the NDCG@10 score for that topic by 10% or more.

Summary. So, simple fusion over query variations is clearly effective. This has been known for some time [2, 8], particularly on “hard” queries [10]. But system designers generally still focus on learning-to-rank on single queries. How can we as a community step back and learn from over fifty years of research in Information Retrieval as we confront the radical shift from classic web search with ten blue links to interactive search through virtual assistants? Will system designers once again over-commit to optimising for the “current” query, or can we move beyond this paradigm to devise and develop entirely new approaches to search? We face many new challenges as a community – collection construction, open source stateful search systems, evaluation metrics, data privacy – in order to not be left behind by the paradigm shift in the way people search and consume information.

Acknowledgements. This work was supported by the Australian Research Council’s Discovery Projects Scheme (DP170102231) and a grant from the Mozilla Foundation.

REFERENCES
[1] P.
Bailey, A. Moffat, F. Scholer, and P. Thomas. UQV100: A test collection with query variability. In Proc. SIGIR, pages 725–728, 2016.
[2] N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Inf. Proc. & Man., 31(3):431–448, 1995.
[3] R. Benham and J. S. Culpepper. Risk-reward trade-offs in rank fusion. In Proc. ADCS, pages 1:1–1:8, 2017.
[4] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11(23-581):81, 2010.
[5] R.-C. Chen, L. Gallagher, R. Blanco, and J. S. Culpepper. Efficient cost-aware cascade ranking in multi-stage retrieval. In Proc. SIGIR, pages 445–454, 2017.
[6] G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proc. SIGIR, pages 758–759, 2009.
[7] L. Gallagher, J. Mackenzie, R. Benham, R.-C. Chen, F. Scholer, and J. S. Culpepper. RMIT at the NTCIR-13 We Want Web task. In Proc. NTCIR, 2017.
[8] K-L. Kwok, L. Grunfeld, and P. Deng. Employing web mining and data fusion to improve weak ad hoc retrieval. Inf. Proc. & Man., 43(2):406–419, 2007.
[9] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proc. SIGIR, pages 472–479, 2005.
[10] E. M. Voorhees. The TREC robust retrieval track. SIGIR Forum, 39(1):11–20, 2005.

Method              NDCG@10   W/T/L
BM25                0.212     -/-/-
SDM-Field           0.233     57/3/40
LambdaMART          0.225     59/2/39
DoubleFuse, v=all   0.300‡    80/1/19

Table 1: Effectiveness comparison of three state-of-the-art ranking methods for the most common query variation for each topic from the ClueWeb12B UQV100 collection [1]. Here ‡ means p < 0.001 in a Bonferroni corrected two-tailed t-test.

DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).
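The double fusion run in Table 1 merges one ranked list per query variation per system using reciprocal rank fusion (RRF) [6]. The following is a minimal illustrative sketch of RRF, not the actual system above: the function name and the toy runs are our own, and k = 60 is the constant suggested by Cormack et al. [6].

```python
# Illustrative sketch of reciprocal rank fusion (RRF) over ranked lists,
# e.g. one list per query variation per system. k = 60 per Cormack et al. [6].

def rrf_fuse(runs, k=60):
    """Fuse ranked lists (each a list of doc ids, best first) via RRF.

    Each document scores sum(1 / (k + rank)) over the runs it appears in.
    """
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Return documents ordered by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: rankings for one topic from two query variations.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d3", "d1"]
print(rrf_fuse([run_a, run_b]))  # → ['d2', 'd1', 'd3']
```

Note that a document ranked moderately well in every run can overtake one ranked first in a single run, which is why fusion over variations tends to reduce the number of badly failing queries.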