Exploring Real-Time Temporal Query Auto-Completion

                                  Stewart Whiting, James McMinn and Joemon M. Jose
                                                        School of Computing Science
                                                           University of Glasgow
                                                               Scotland, UK.
                                                    {stewh,mcminn,jj}@dcs.gla.ac.uk


ABSTRACT
Query auto-completion (QAC) is a common interactive feature for
assisting users during query formulation. Following each query in-
put keystroke, QAC suggests queries prefixed by the input charac-
ters; allowing the user to avoid further cognitive and physical effort
if any are acceptable. To rank suggestions, QAC approaches typ-
ically aggregate past query popularity to determine the likelihood
of a query being used again. Hence, QAC is usually very effective
for consistently popular queries. However, as the web becomes in-
creasingly real-time, more people are turning to search engines to
find out about unpredictable emerging and ongoing events and phe-
nomena. QAC approaches reliant on aggregating long-term historic          Figure 1: Google auto-completion suggestions for the query prefix
query-logs are not sensitive to very recent real-time events, because     ‘hur’. Screenshot taken November 8th 2012, 10 days after Hurri-
newly popular queries will be outweighed by long-term popular             cane Sandy made landfall on the East Coast of the USA. Browser
queries, especially for less-specific prefix lengths (e.g. 2 or 3 char-   cookies were cleared to avoid individual personalization effects.
acters). We explore limiting the aggregation period of past query-
log evidence to increase the temporal sensitivity of QAC. We vary
the query-log aggregation period between 2 and 14 days, for pre-          enough past evidence, completion suggestions ranked solely by
fix lengths of 2 to 5 characters. Experimentation simulates a real-       their popularity in past query-logs provides reasonably effective
time environment using openly available MSN and AOL query-log             QAC [3, 10].
datasets. Analysis indicates a linear relationship between prefix            Figure 1 illustrates the ten completion suggestions offered by
length and QAC performance when using different query-log ag-             Google for the three character query prefix ‘hur’ on November 8th,
gregation periods. In particular, we find QAC for shorter prefix          2012. The list of query suggestions indicates the historically most
lengths is optimal when a shorter query-log aggregation period is         likely queries to be submitted with the given prefix, possibly in the
used, and vice-versa, longer prefix lengths benefit from a longer         context of some undisclosed ranking features such as geo-location
query-log aggregation period.                                             or the user’s past queries. Despite the recency and prominence of
                                                                          Hurricane Sandy, the query ranks very low in the completion sug-
                                                                          gestions, while ‘hurricane isaac’ ranks first, regardless that it oc-
1.    INTRODUCTION                                                        curred many months previously. Aside from this issue, QAC for
   For users, cognitively formulating and physically typing queries       short and unspecific prefixes (i.e. 1 or 2 common characters) is
is a time-consuming and error prone process. As such, query auto-         often unsuccessful as there are usually a huge number of possible
completion (QAC) [3, 10] has been widely adopted by major web             completion suggestions [3]. Consequently, it is typically long-term
search engines to reduce the effort necessary to submit a query.          ‘head’ queries that are suggested as completions for such short pre-
   As a user types their query into the search box, QAC attempts          fixes.
to predict the completed query the user may have in mind. Fol-               With the web increasingly becoming a platform for real-time
lowing each query input keystroke, QAC suggests possible queries          news and media, time plays a central role in information interac-
(which we refer to as completion suggestions) beginning with the          tion. A substantial volume of daily top queries are the result of
already input character sequence (i.e. prefix). The goal for effective    users turning to search engines for up-to-date information about
QAC is to present the user’s intended query after the least possible      very recent or ongoing events [2, 6]. 15% of the daily queries to
keystrokes, and at the highest rank in the list of completion sugges-     an industrial web search engine have never been seen before1 ; a
tions.                                                                    substantial proportion of these queries may be attributed to real-
   Conventional QAC approaches rank completion suggestions by             time events and phenomena, rather than the long-tail of very un-
aggregating their popularity in past query-logs. Further work has         common queries. Similarly, previously unpopular queries may sud-
incorporated personal contextual features for short prefixes [3] and      denly become extremely popular because of recent developments.
time-series modelling of temporal trends [10]. However, with              It is therefore important that QAC supports queries which become
                                                                          highly popular only during brief periods of time, which we refer to
                                                                          as real-time temporal queries.
Copyright is held by the author/owner(s).                                 1 http://www.google.com/competition/howgooglesearchworks.html
DIR 2013, April 26, 2013, Delft, The Netherlands.
   Although the common approach to QAC is to rank completion                         month period from the 1st March 2006 to the 31st May 2006. The
suggestions by their popularity in the historic query-log (i.e. past                 MSN query-log contains almost 14.9M user interactions over a 1
query-log evidence), there has been very little study on the aggre-                  month period from the 1st May 2006 to the 31st May 2006.
gation period necessary to achieve optimal QAC effectiveness, and                       Query-log entries necessary for identifying result clicks were re-
whether this varies for each prefix length [10]. Thus, the objective                 moved. By extracting all the unique query and timestamp combi-
of this paper is to investigate this uncertainty by conducting exper-                nations, we obtained only queries directly typed by users. Navi-
iments based upon the AOL and MSN query-log datasets. For each                       gational queries containing the URL substrings: .com, .net, .org,
prefix length we use an N day sliding window of past query-log ev-                   http, .edu or www were removed. We were left with 21.8M and
idence to rank completion suggestions, hence making QAC more                         12.2M queries for AOL and MSN, respectively. Preliminary analy-
sensitive to real-time querying distribution changes. We present re-                 sis discovered a sizeable number of short bursts of what we suspect
sults and observe overall QAC effectiveness for different periods of                 is bot spamming activity in the AOL query-logs. Generic queries
N days, at prefix lengths of 2 to 5 characters.                                      such as ‘personalfinance’, ‘aolcelebrity’, ‘computercheckup’, ap-
                                                                                     pear in high volume with very uniform spacing (e.g. every 30 or
                                                                                     60 seconds). We manually observed and removed around 10,000
2.     MOTIVATION                                                                    instances of these queries from our analysis.
   As time undoubtedly plays a central role in user search behaviour
[2], it is important for QAC to suggest completions that become                                                             Volume of Queries
highly popular over very short periods (i.e. real-time temporal
                                                                                                       Window Size (Days)    AOL       MSN
queries), while also supporting always popular ’head’ queries.                                                          1    9.2%      3.5%
   Relying on a long period of past query-log evidence will ensure                                                      3   10.1%      4.5%
QAC is robust for continually popular queries, however, it will also                                                    5   10.4%      5.1%
have the effect of smoothing over short-term popular queries. For
example, imagine a scenario where query q1 is consistently popu-                     Table 1: The volume of queries in each query-log which were used
lar, appearing 1000 times each day in the query-log. Aggregating                     ≥ 4 times, and for which 80% of their overall occurrence is within
query popularity over a past 30 day period would mean that query                     a window of N days.
q2 which is popular only today would need to be appear 30,000
times before it outweighed the long-term popular query in a proba-                      In Table 1 we present the volume of queries (i.e. % of the total
bilistic QAC approach. At the same time, reducing the aggregation                    queries submitted in the query-log) which occur four or more times,
period may mean the long-term popular query q1 is not adequately                     and have at least 80% of their use concentrated within a period of N
represented, allowing short-term noise to reduce its ranking.                        days. In AOL, the most popular 1 to 5 day highly temporal queries
   Ultimately, developing an effective QAC system that can re-                       include: ‘amelia earhart pictures’, ‘karl der grosse’, ‘the simp-
spond to real-time temporal trends is a trade-off between robust-                    sons live action’ and ‘leisure suit larry’. Likewise, in MSN among
ness and sensitivity. In this paper we aim to study this trade-off                   the most popular are: ‘stephen colbert’, ‘poison milk’, ‘ohio bear
in terms of how much past query-log evidence is optimal for ag-                      attack’ and ‘kimberley dozier’. Investigation shows that many of
gregating query popularity, and how this changes for each prefix                     these queries describe, or are strongly related to significant events.
length. Moreover, as there has been little experimentation on open                      These results suggest a reasonable volume of real-time temporal
datasets, this work establishes baseline QAC performance for fur-                    queries in both query-logs, at least in the relatively short periods
ther studies.                                                                        we are able to study. We suspect that the percentage of real-time
   The effectiveness of using a shorter query-log evidence aggre-                    temporal queries will have substantially increased in more recent
gation period has been noted previously, particularly for real-time                  query-logs, given the increase in real-time media available on the
temporal queries [10]. While time-series modelling for query                         internet.
trends is able to improve QAC for recurring predictable temporal
trends, for short-term real-time temporal queries it often proved
problematic due to lag and over-fitting [10]. Time-series models
                                                                                     3.    RELATED WORK
were not able to model the increasing trend quickly enough, and                         The majority of research has concentrated on the inherent engi-
likewise, continued to predict increased popularity for some time                    neering complexity of providing efficient and scalable QAC, which
after the brief period of actual popularity.                                         is resilient to typing errors. There have been relatively few stud-
                                                                                     ies on improving QAC effectiveness in search engines; likely due
2.1      Temporal Query-log Analysis                                                 to the fact that there are few suitable query-logs available outside
                                                                                     industrial search engine companies for experimentation.
   We quantify the extent to which the query-logs are composed of
                                                                                        Exploiting the user’s personal context, and past query sessions
real-time temporal querying, in order to determine the degree to
                                                                                     has led to considerable QAC improvement, especially for shorter
which QAC must support this behaviour. We define real-time tem-
                                                                                     prefixes [3, 8]. Shokouhi and Radinsky [9, 10] used time-series
poral queries as those which appear as a ‘spike’ - with the vast ma-
                                                                                     modelling of past temporal query patterns to improve QAC effec-
jority of their occurrence within a short period, e.g. hours to days.
                                                                                     tiveness. Popular queries recurring during specific temporal inter-
Similarly, the queries are unlikely to have been recently popular, or
                                                                                     vals, such as day/night, day of week, month, etc. were modelled
even seen previously.
                                                                                     so that current query popularity could be predicted based on prior
   We analyse the temporal trends contained in two publicly avail-
                                                                                     evidence only. Shokouhi and Radinsky [10] propose the short time
able2 datasets: the AOL [7] and MSN [1] query-log datasets. Ex-
                                                                                     window technique we experiment with in this paper as a baseline
tensive temporal analysis of longer-term and larger proprietary
                                                                                     (which they refer to as p1 , etc.). They note its relative effectiveness,
query-log data has been performed previously by others [6, 4, 2].
                                                                                     particularly for correctly predicting short-term highly temporal and
   The AOL query-log contains 36.3M user interactions over a 3
                                                                                     unpredictable queries for which time-series modelling is problem-
2 MSN available on request. We justify our use of AOL as we study the data without   atic. However, no detailed analysis on the performance impact of
identifying individuals.                                                             aggregation period for each prefix length is performed.
4.    AUTO-COMPLETION APPROACH
   The common “standard” approach to QAC is Maximum Likeli-                   We conduct experimentation using the AOL and MSN query-log
hood Estimation (MLE), based on past query popularity (i.e. ‘most          datasets. By experimenting with millions of queries contained in
popular completion’) [3]. MLE for a prefix ρn (of n characters),           each query-log, we achieve a representative indication of how each
with each query q in all past queries Q prefixed by ρn , is formalised     approach would perform in a real-world setting.
as follows:                                                                   Using two different query-logs validates the approach across two
                                                                           query samples of varying characteristics, and different user popula-
                      MLE(ρn ) = arg maxP(q)                        (1)    tions of the two industrial search engines. The exact sampling and
                                      q∈Q
                                                                           construction of each query-log is unknown. MSN has been filtered
   P(q) is the probability of the query appearing in the past query-       for privacy (e.g. clearing known number patterns, such as phone
log. We refer to this method, aggregating all query-log evidence           numbers), appears to contain fewer adult queries, and is more in-
prior to the current time qt as our baseline MLE-ALL.                      depth as it is only for a one month period. In contrast, the AOL
                                                                           query-log contains more queries, but has greater breadth as it cov-
4.1     Limiting Past Query-log Evidence                                   ers a three month period. However, as noted in [2], the sampling of
   We propose using only the last N days of query-log evidence             AOL may not be truly representative of normal querying distribu-
(e.g., N = 2, 4,7 or 14 days) for computing P(q) at qt (i.e., a sliding    tions because of re-finding behaviour.
window of past evidence). We refer to this approach as MLE-WN.
   The intuition underlying this approach is that a more recent and        6.1     Experiment Settings
limited period of queries may more accurately reflect the current             We report results in Section 7 for two QAC settings: MLE-
query distribution. Similarly, although consistently popular queries       ALL using all query-log evidence prior to qt (we treat this as the
will still be adequately reflected in the distribution, their total fre-   baseline), and MLE-WN, with 2,4,7 and 14 days of past query-
quency will no longer be great enough to outweigh the frequency            log (characterising short and medium-term event/evidence peri-
of popular queries that only spike in shorter periods.                     ods). With only sampled query-log datasets, reducing the evidence
                                                                           period further leads to relatively sparse querying data. To emu-
5.    METHODOLOGY                                                          late a real user interface scenario, we assume the user would see 4
                                                                           highest-ranking completion suggestions for each prefix they input.
   The objective of our experiment is to study the trade-off between
                                                                              We run each approach for all query prefixes of 2 to 5 characters.
sensitivity and robustness of QAC, for different prefix lengths. As
                                                                           Experiments for each prefix length were run independently, hence
such, we explore various query-log aggregation periods for each
                                                                           a successful completion suggestion at a prefix of 2 characters had
prefix length, and measure the effect on overall QAC performance.
                                                                           no effect on the later evaluation for 3 characters.
   Our experimental methodology simulates a real-time user search
                                                                              Learning Period. We report the MRR of MLE QAC computed
scenario; such that the user types a prefix, and receives comple-
                                                                           over the period of the query-log, minus the first N days which
tion suggestions based only on evidence prior to the time of their
                                                                           we treat as the learning period. Doing this makes the MRR ob-
query. QAC effectiveness is measured by the presence, and rank of
                                                                           tained from MLE-ALL and MLE-WN directly comparable as both
a ground-truth match for each set of suggestions.
                                                                           are computed over exactly the same set of queries. In any case,
   A time-ordered query-log provides a stream of ground-truth user
                                                                           QAC performance during this early period will be extremely low
queries. We assume that each query present in the query-log is the
                                                                           as there is very little query popularity evidence (i.e. the ‘cold-start
result of a user having manually typed it into the search box. As
                                                                           problem’), and wouldn’t reflect a real-world scenario where a QAC
such, for each prefix of length n of the query, QAC provides com-
                                                                           system would almost always be trained on past evidence.
pletion suggestions. Each suggestion is matched with the ground-
truth of the user’s actual query (we discuss matching in the follow-
ing section).                                                              7.    RESULTS
   Evaluation Metric. Similar to past QAC work [3, 10], we                    Table 2 presents the overall MRR observed for MLE QAC ex-
rely on Mean Reciprocal Rank (MRR) to observe the effectiveness            periments on the AOL and MSN and AOL query-logs; using the
of each QAC approach. Reciprocal Rank (RR) has typically been              past 2, 4, 7, 14 days as well all past query-log evidence, for prefix
used for evaluation in IR situations where there is a single relevant      lengths of 2 to 5 characters. The MLE-ALL MRR reported beside
document. For a set of completion suggestions S, RR is computed            each MLE-WN corresponds to the baseline using all queries prior
as:                                                                        to qt , but with the first N days of queries excluded for comparison.
                                            1                                 The aggregated statistical power of 21.8M and 12.2M RR mea-
                RR(S, qintended ) =                               (2)      sures (i.e. each query) provided by the AOL and MSN experiments,
                                    S, Rank(qintended )
                                                                           respectively, means that the results we report are statistically signif-
   If no match for qintended is present, then a RR of 0 is assigned        icant according to standard t-tests [5]. Therefore, our analysis con-
(avoiding divide by zero errors). MRR is then computed as the              centrates on the effect size of each window period over the baseline
arithmetic average of RR for all queries.                                  - that is, change in MRR over the corresponding MLE-ALL.
   MRR reflects the user interaction model of QAC; a higher-                  Firstly, for all runs and both query-logs it is clear that QAC is
ranked completion suggestion is more beneficial, but the difference        considerably more effective with a longer (i.e. more specific) pre-
in ordering of lower-ranked completion candidates is less signifi-         fix. This is expected, given that each extra character in the prefix
cant. That is to say, there is less noticeable difference between a        reduces the space of possible completion suggestions, thus increas-
correct completion suggestion ranked at either the 3rd or 4th po-          ing the chance of a completion suggestion match [3].
sition, compared to the 1st or 2nd position. We consider a lit-               QAC is almost always more effective for MSN than for AOL,
eral lower-case string match between completion suggestion and             especially for prefix lengths of 4 or less characters. Using a sliding
ground-truth as a successful match.                                        window of evidence has a significant effect on overall QAC perfor-
                                                                           mance in almost all cases.
6.    EXPERIMENT                                                              For AOL there is a sliding window of evidence which can im-
 ρ        MLE-ALL             MLE-W2        MLE-ALL             MLE-W4         MLE-ALL                  MLE-W7           MLE-ALL                 MLE-W14
 AOL
 2              0.090    0.091 (1.11%)            0.090     0.091 (1.11%)              0.090       0.091 (1.11%)                 0.090       0.091 (1.11%)
 3              0.143    0.147 (2.80%)            0.143     0.146 (2.10%)              0.143       0.145 (1.40%)                 0.143       0.145 (1.40%)
 4              0.185     0.189 (2.16%)           0.184     0.189 (2.72%)              0.184       0.188 (2.17%)                 0.184       0.187 (1.63%)
 5              0.217    0.215 (-0.92%)           0.216     0.217 (0.46%)              0.217       0.218 (0.46%)                 0.217       0.219 (0.92%)
 MSN
 2              0.112    0.117 (4.46%)            0.111     0.115 (3.60%)              0.111       0.113 (1.80%)                 0.110       0.111 (0.91%)
 3              0.164    0.163 (-0.61%)           0.164    0.165 (0.61%)               0.164      0.165 (0.61%)                  0.164       0.164 (0.00%)
 4              0.197    0.188 (-4.57%)           0.197    0.193 (-2.03%)              0.197      0.196 (-0.51%)                 0.197       0.197 (0.00%)
 5              0.215    0.197 (-8.37%)           0.216    0.205 (-5.09%)              0.216      0.211 (-2.31%)                 0.218      0.216 (-0.92%)

Table 2: MRR observed for QAC when using all prior query-log evidence, and the past 2, 4, 7 or 14 days of query-log evidence. Prefix (ρ)
lengths of 2 to 5 characters are reported for the AOL and MSN query-logs. The best performing sliding window setting is highlighted for
each prefix length (although in some cases this is still outperformed by the baseline, at least for the reported window periods).

prove QAC performance over MLE-ALL for all prefix lengths, al-              likely to play a significant role in the already reduced set of pos-
beit relatively marginally for 5 characters. Using a shorter 2 day          sible completion suggestions. Moreover, for less common prefixes
window of evidence improves QAC performance by up to nearly                 (and therefore rarer queries), relying on a longer query-log period
3% for shorter prefixes of 2 or 3 characters. For a prefix of 4 char-       is more likely to include the evidence necessary to rank them effec-
acters, using a little more evidence, e.g. 4 days is optimal. Simi-         tively as they were less likely to be used recently.
larly, the best performance for a 5 character prefix is obtained when          Conclusion. In this paper we examined the trade-off between
using 14 days of evidence.                                                  QAC robustness and real-time temporal sensitivity. We found that
   For MSN, shorter prefixes (e.g. 2 or 3 characters) can outper-           QAC effectiveness can be improved by up to nearly 5%, simply by
form the baseline when using a window of evidence. Specifically,            selecting the optimal time period of query popularity aggregation
we see the best performance improvement of nearly 4.5% when us-             for each prefix length. The period necessary to achieve optimal
ing 2 days of evidence for a 2 character prefix. However, using             QAC effectiveness varies by prefix length; shorter prefixes (e.g. 2-
between 2 and 14 days of query-log evidence always impairs QAC              3 characters) perform best with only short-term evidence (e.g. 2-7
performance compared to the baseline for 4 or 5 character prefixes.         days), whereas longer prefixes (e.g. 4-5 characters) require more
Notably, the detrimental effect on performance is reduced as the            long-term evidence (e.g. 7-14 days, or more). Results also indicate
sliding window of evidence is increased.                                    the need to train per query-log, in order to capture intrinsic tempo-
                                                                            ral and demographic characteristics. Care must also be taken with
8.    DISCUSSION AND CONCLUSION                                             the sampling of queries used for training.
   The baseline QAC performance, and the following sliding win-                Further work will experiment with larger, more recent query-
dow QAC improvement characteristics for each query-log are                  logs and perform cross-validation to verify the preliminary findings
considerably different between the query-logs. AOL QAC is                   we present in this paper. Moreover, we will investigate alternative
marginally but consistently improved in almost all cases, whereas           modelling techniques to improve QAC effectiveness for real-time
MSN QAC is only improved for shorter prefixes, albeit much more             temporal queries, which are problematic for time-series modelling
so than for AOL. This suggests that the two query-logs have dif-            as they spike so briefly in time.
ferent temporal characteristics. In part, this may be caused by a              Acknowledgments. This work was partially supported by the
couple of factors. Firstly, although AOL has more queries, it is            EU LiMoSINe project (288024).
spread more sparsely over a three month period, in contrast, MSN
queries are concentrated in a 1 month period. Additionally, AOL             9.[1] WSCD
                                                                                   REFERENCES
                                                                                       ’09: Proceedings of the 2009 workshop on Web Search Click Data, New
has a day of missing data [2] which will harm QAC effectiveness                  York, NY, USA, 2009. ACM.
following the affected period. Secondly, there may be underlying             [2] E. Adar, D. S. Weld, B. N. Bershad, and S. S. Gribble. Why we search:
demographic differences between users of the two search engines                  visualizing and predicting user behavior. WWW ’07, pages 161–170, New
                                                                                 York, NY, USA, 2007. ACM.
that lead to changes in query distributions.                                 [3] Z. Bar-Yossef and N. Kraus. Context-sensitive query auto-completion. In WWW
   Although the performance improvement characteristics for each                 ’11, pages 107–116, 2011.
prefix and sliding window are different for each query-log, there is         [4] S. M. Beitzel, E. C. Jensen, A. Chowdhury, O. Frieder, and D. Grossman.
a clear overall linear relationship emerging in the results between              Temporal analysis of a very large topically categorized web query log. J. Am.
                                                                                 Soc. Inf. Sci. Technol., 58(2):166–178, Jan. 2007.
prefix length and optimal sliding window period. As such, QAC for            [5] P. Ellis. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis,
shorter prefixes performs optimally with a shorter sliding window                and the Interpretation of Research Results. Cambridge University Press, 2010.
of evidence, and conversely, QAC for longer prefixes performs best           [6] A. Kulkarni, J. Teevan, K. M. Svore, and S. T. Dumais. Understanding temporal
with a longer sliding window of evidence.                                        query dynamics. In ACM WSDM ’11, pages 167–176, New York, NY, USA,
                                                                                 2011. ACM.
   This relationship probably arises from the uncertainty posed by           [7] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. InfoScale ’06,
short and non-specific prefix lengths [3], where the space of possi-             New York, NY, USA, 2006. ACM.
ble completion suggestions is large. In these cases, using a shorter-        [8] C. Sengstock and M. Gertz. Conquer: a system for efficient context-aware query
period of evidence will still reflect long-term popular queries, but             suggestions. WWW ’11, pages 265–268, New York, NY, USA, 2011. ACM.
                                                                             [9] M. Shokouhi. Detecting seasonal queries by time-series analysis. In SIGIR ’11,
also be sensitive to temporal variation. Longer prefixes are more                pages 1171–1172, 2011.
specific and thus narrow possible completion suggestions consid-            [10] M. Shokouhi and K. Radinsky. Time-sensitive query auto-completion. In SIGIR
erably. In these cases, real-time temporal factors are probably less             ’12, pages 601–610, 2012.