Exploring Real-Time Temporal Query Auto-Completion Stewart Whiting, James McMinn and Joemon M. Jose School of Computing Science University of Glasgow Scotland, UK. {stewh,mcminn,jj}@dcs.gla.ac.uk ABSTRACT Query auto-completion (QAC) is a common interactive feature for assisting users during query formulation. Following each query in- put keystroke, QAC suggests queries prefixed by the input charac- ters; allowing the user to avoid further cognitive and physical effort if any are acceptable. To rank suggestions, QAC approaches typ- ically aggregate past query popularity to determine the likelihood of a query being used again. Hence, QAC is usually very effective for consistently popular queries. However, as the web becomes in- creasingly real-time, more people are turning to search engines to find out about unpredictable emerging and ongoing events and phe- nomena. QAC approaches reliant on aggregating long-term historic Figure 1: Google auto-completion suggestions for the query prefix query-logs are not sensitive to very recent real-time events, because ‘hur’. Screenshot taken November 8th 2012, 10 days after Hurri- newly popular queries will be outweighed by long-term popular cane Sandy made landfall on the East Coast of the USA. Browser queries, especially for less-specific prefix lengths (e.g. 2 or 3 char- cookies were cleared to avoid individual personalization effects. acters). We explore limiting the aggregation period of past query- log evidence to increase the temporal sensitivity of QAC. We vary the query-log aggregation period between 2 and 14 days, for pre- enough past evidence, completion suggestions ranked solely by fix lengths of 2 to 5 characters. Experimentation simulates a real- their popularity in past query-logs provides reasonably effective time environment using openly available MSN and AOL query-log QAC [3, 10]. datasets. Analysis indicates a linear relationship between prefix Figure 1 illustrates the ten completion suggestions offered by length and QAC performance when using different query-log ag- Google for the three character query prefix ‘hur’ on November 8th, gregation periods. In particular, we find QAC for shorter prefix 2012. The list of query suggestions indicates the historically most lengths is optimal when a shorter query-log aggregation period is likely queries to be submitted with the given prefix, possibly in the used, and vice-versa, longer prefix lengths benefit from a longer context of some undisclosed ranking features such as geo-location query-log aggregation period. or the user’s past queries. Despite the recency and prominence of Hurricane Sandy, the query ranks very low in the completion sug- gestions, while ‘hurricane isaac’ ranks first, regardless that it oc- 1. INTRODUCTION curred many months previously. Aside from this issue, QAC for For users, cognitively formulating and physically typing queries short and unspecific prefixes (i.e. 1 or 2 common characters) is is a time-consuming and error prone process. As such, query auto- often unsuccessful as there are usually a huge number of possible completion (QAC) [3, 10] has been widely adopted by major web completion suggestions [3]. Consequently, it is typically long-term search engines to reduce the effort necessary to submit a query. ‘head’ queries that are suggested as completions for such short pre- As a user types their query into the search box, QAC attempts fixes. to predict the completed query the user may have in mind. Fol- With the web increasingly becoming a platform for real-time lowing each query input keystroke, QAC suggests possible queries news and media, time plays a central role in information interac- (which we refer to as completion suggestions) beginning with the tion. A substantial volume of daily top queries are the result of already input character sequence (i.e. prefix). The goal for effective users turning to search engines for up-to-date information about QAC is to present the user’s intended query after the least possible very recent or ongoing events [2, 6]. 15% of the daily queries to keystrokes, and at the highest rank in the list of completion sugges- an industrial web search engine have never been seen before1 ; a tions. substantial proportion of these queries may be attributed to real- Conventional QAC approaches rank completion suggestions by time events and phenomena, rather than the long-tail of very un- aggregating their popularity in past query-logs. Further work has common queries. Similarly, previously unpopular queries may sud- incorporated personal contextual features for short prefixes [3] and denly become extremely popular because of recent developments. time-series modelling of temporal trends [10]. However, with It is therefore important that QAC supports queries which become highly popular only during brief periods of time, which we refer to as real-time temporal queries. Copyright is held by the author/owner(s). 1 http://www.google.com/competition/howgooglesearchworks.html DIR 2013, April 26, 2013, Delft, The Netherlands. Although the common approach to QAC is to rank completion month period from the 1st March 2006 to the 31st May 2006. The suggestions by their popularity in the historic query-log (i.e. past MSN query-log contains almost 14.9M user interactions over a 1 query-log evidence), there has been very little study on the aggre- month period from the 1st May 2006 to the 31st May 2006. gation period necessary to achieve optimal QAC effectiveness, and Query-log entries necessary for identifying result clicks were re- whether this varies for each prefix length [10]. Thus, the objective moved. By extracting all the unique query and timestamp combi- of this paper is to investigate this uncertainty by conducting exper- nations, we obtained only queries directly typed by users. Navi- iments based upon the AOL and MSN query-log datasets. For each gational queries containing the URL substrings: .com, .net, .org, prefix length we use an N day sliding window of past query-log ev- http, .edu or www were removed. We were left with 21.8M and idence to rank completion suggestions, hence making QAC more 12.2M queries for AOL and MSN, respectively. Preliminary analy- sensitive to real-time querying distribution changes. We present re- sis discovered a sizeable number of short bursts of what we suspect sults and observe overall QAC effectiveness for different periods of is bot spamming activity in the AOL query-logs. Generic queries N days, at prefix lengths of 2 to 5 characters. such as ‘personalfinance’, ‘aolcelebrity’, ‘computercheckup’, ap- pear in high volume with very uniform spacing (e.g. every 30 or 60 seconds). We manually observed and removed around 10,000 2. MOTIVATION instances of these queries from our analysis. As time undoubtedly plays a central role in user search behaviour [2], it is important for QAC to suggest completions that become Volume of Queries highly popular over very short periods (i.e. real-time temporal Window Size (Days) AOL MSN queries), while also supporting always popular ’head’ queries. 1 9.2% 3.5% Relying on a long period of past query-log evidence will ensure 3 10.1% 4.5% QAC is robust for continually popular queries, however, it will also 5 10.4% 5.1% have the effect of smoothing over short-term popular queries. For example, imagine a scenario where query q1 is consistently popu- Table 1: The volume of queries in each query-log which were used lar, appearing 1000 times each day in the query-log. Aggregating ≥ 4 times, and for which 80% of their overall occurrence is within query popularity over a past 30 day period would mean that query a window of N days. q2 which is popular only today would need to be appear 30,000 times before it outweighed the long-term popular query in a proba- In Table 1 we present the volume of queries (i.e. % of the total bilistic QAC approach. At the same time, reducing the aggregation queries submitted in the query-log) which occur four or more times, period may mean the long-term popular query q1 is not adequately and have at least 80% of their use concentrated within a period of N represented, allowing short-term noise to reduce its ranking. days. In AOL, the most popular 1 to 5 day highly temporal queries Ultimately, developing an effective QAC system that can re- include: ‘amelia earhart pictures’, ‘karl der grosse’, ‘the simp- spond to real-time temporal trends is a trade-off between robust- sons live action’ and ‘leisure suit larry’. Likewise, in MSN among ness and sensitivity. In this paper we aim to study this trade-off the most popular are: ‘stephen colbert’, ‘poison milk’, ‘ohio bear in terms of how much past query-log evidence is optimal for ag- attack’ and ‘kimberley dozier’. Investigation shows that many of gregating query popularity, and how this changes for each prefix these queries describe, or are strongly related to significant events. length. Moreover, as there has been little experimentation on open These results suggest a reasonable volume of real-time temporal datasets, this work establishes baseline QAC performance for fur- queries in both query-logs, at least in the relatively short periods ther studies. we are able to study. We suspect that the percentage of real-time The effectiveness of using a shorter query-log evidence aggre- temporal queries will have substantially increased in more recent gation period has been noted previously, particularly for real-time query-logs, given the increase in real-time media available on the temporal queries [10]. While time-series modelling for query internet. trends is able to improve QAC for recurring predictable temporal trends, for short-term real-time temporal queries it often proved problematic due to lag and over-fitting [10]. Time-series models 3. RELATED WORK were not able to model the increasing trend quickly enough, and The majority of research has concentrated on the inherent engi- likewise, continued to predict increased popularity for some time neering complexity of providing efficient and scalable QAC, which after the brief period of actual popularity. is resilient to typing errors. There have been relatively few stud- ies on improving QAC effectiveness in search engines; likely due 2.1 Temporal Query-log Analysis to the fact that there are few suitable query-logs available outside industrial search engine companies for experimentation. We quantify the extent to which the query-logs are composed of Exploiting the user’s personal context, and past query sessions real-time temporal querying, in order to determine the degree to has led to considerable QAC improvement, especially for shorter which QAC must support this behaviour. We define real-time tem- prefixes [3, 8]. Shokouhi and Radinsky [9, 10] used time-series poral queries as those which appear as a ‘spike’ - with the vast ma- modelling of past temporal query patterns to improve QAC effec- jority of their occurrence within a short period, e.g. hours to days. tiveness. Popular queries recurring during specific temporal inter- Similarly, the queries are unlikely to have been recently popular, or vals, such as day/night, day of week, month, etc. were modelled even seen previously. so that current query popularity could be predicted based on prior We analyse the temporal trends contained in two publicly avail- evidence only. Shokouhi and Radinsky [10] propose the short time able2 datasets: the AOL [7] and MSN [1] query-log datasets. Ex- window technique we experiment with in this paper as a baseline tensive temporal analysis of longer-term and larger proprietary (which they refer to as p1 , etc.). They note its relative effectiveness, query-log data has been performed previously by others [6, 4, 2]. particularly for correctly predicting short-term highly temporal and The AOL query-log contains 36.3M user interactions over a 3 unpredictable queries for which time-series modelling is problem- 2 MSN available on request. We justify our use of AOL as we study the data without atic. However, no detailed analysis on the performance impact of identifying individuals. aggregation period for each prefix length is performed. 4. AUTO-COMPLETION APPROACH The common “standard” approach to QAC is Maximum Likeli- We conduct experimentation using the AOL and MSN query-log hood Estimation (MLE), based on past query popularity (i.e. ‘most datasets. By experimenting with millions of queries contained in popular completion’) [3]. MLE for a prefix ρn (of n characters), each query-log, we achieve a representative indication of how each with each query q in all past queries Q prefixed by ρn , is formalised approach would perform in a real-world setting. as follows: Using two different query-logs validates the approach across two query samples of varying characteristics, and different user popula- MLE(ρn ) = arg maxP(q) (1) tions of the two industrial search engines. The exact sampling and q∈Q construction of each query-log is unknown. MSN has been filtered P(q) is the probability of the query appearing in the past query- for privacy (e.g. clearing known number patterns, such as phone log. We refer to this method, aggregating all query-log evidence numbers), appears to contain fewer adult queries, and is more in- prior to the current time qt as our baseline MLE-ALL. depth as it is only for a one month period. In contrast, the AOL query-log contains more queries, but has greater breadth as it cov- 4.1 Limiting Past Query-log Evidence ers a three month period. However, as noted in [2], the sampling of We propose using only the last N days of query-log evidence AOL may not be truly representative of normal querying distribu- (e.g., N = 2, 4,7 or 14 days) for computing P(q) at qt (i.e., a sliding tions because of re-finding behaviour. window of past evidence). We refer to this approach as MLE-WN. The intuition underlying this approach is that a more recent and 6.1 Experiment Settings limited period of queries may more accurately reflect the current We report results in Section 7 for two QAC settings: MLE- query distribution. Similarly, although consistently popular queries ALL using all query-log evidence prior to qt (we treat this as the will still be adequately reflected in the distribution, their total fre- baseline), and MLE-WN, with 2,4,7 and 14 days of past query- quency will no longer be great enough to outweigh the frequency log (characterising short and medium-term event/evidence peri- of popular queries that only spike in shorter periods. ods). With only sampled query-log datasets, reducing the evidence period further leads to relatively sparse querying data. To emu- 5. METHODOLOGY late a real user interface scenario, we assume the user would see 4 highest-ranking completion suggestions for each prefix they input. The objective of our experiment is to study the trade-off between We run each approach for all query prefixes of 2 to 5 characters. sensitivity and robustness of QAC, for different prefix lengths. As Experiments for each prefix length were run independently, hence such, we explore various query-log aggregation periods for each a successful completion suggestion at a prefix of 2 characters had prefix length, and measure the effect on overall QAC performance. no effect on the later evaluation for 3 characters. Our experimental methodology simulates a real-time user search Learning Period. We report the MRR of MLE QAC computed scenario; such that the user types a prefix, and receives comple- over the period of the query-log, minus the first N days which tion suggestions based only on evidence prior to the time of their we treat as the learning period. Doing this makes the MRR ob- query. QAC effectiveness is measured by the presence, and rank of tained from MLE-ALL and MLE-WN directly comparable as both a ground-truth match for each set of suggestions. are computed over exactly the same set of queries. In any case, A time-ordered query-log provides a stream of ground-truth user QAC performance during this early period will be extremely low queries. We assume that each query present in the query-log is the as there is very little query popularity evidence (i.e. the ‘cold-start result of a user having manually typed it into the search box. As problem’), and wouldn’t reflect a real-world scenario where a QAC such, for each prefix of length n of the query, QAC provides com- system would almost always be trained on past evidence. pletion suggestions. Each suggestion is matched with the ground- truth of the user’s actual query (we discuss matching in the follow- ing section). 7. RESULTS Evaluation Metric. Similar to past QAC work [3, 10], we Table 2 presents the overall MRR observed for MLE QAC ex- rely on Mean Reciprocal Rank (MRR) to observe the effectiveness periments on the AOL and MSN and AOL query-logs; using the of each QAC approach. Reciprocal Rank (RR) has typically been past 2, 4, 7, 14 days as well all past query-log evidence, for prefix used for evaluation in IR situations where there is a single relevant lengths of 2 to 5 characters. The MLE-ALL MRR reported beside document. For a set of completion suggestions S, RR is computed each MLE-WN corresponds to the baseline using all queries prior as: to qt , but with the first N days of queries excluded for comparison. 1 The aggregated statistical power of 21.8M and 12.2M RR mea- RR(S, qintended ) = (2) sures (i.e. each query) provided by the AOL and MSN experiments, S, Rank(qintended ) respectively, means that the results we report are statistically signif- If no match for qintended is present, then a RR of 0 is assigned icant according to standard t-tests [5]. Therefore, our analysis con- (avoiding divide by zero errors). MRR is then computed as the centrates on the effect size of each window period over the baseline arithmetic average of RR for all queries. - that is, change in MRR over the corresponding MLE-ALL. MRR reflects the user interaction model of QAC; a higher- Firstly, for all runs and both query-logs it is clear that QAC is ranked completion suggestion is more beneficial, but the difference considerably more effective with a longer (i.e. more specific) pre- in ordering of lower-ranked completion candidates is less signifi- fix. This is expected, given that each extra character in the prefix cant. That is to say, there is less noticeable difference between a reduces the space of possible completion suggestions, thus increas- correct completion suggestion ranked at either the 3rd or 4th po- ing the chance of a completion suggestion match [3]. sition, compared to the 1st or 2nd position. We consider a lit- QAC is almost always more effective for MSN than for AOL, eral lower-case string match between completion suggestion and especially for prefix lengths of 4 or less characters. Using a sliding ground-truth as a successful match. window of evidence has a significant effect on overall QAC perfor- mance in almost all cases. 6. EXPERIMENT For AOL there is a sliding window of evidence which can im- ρ MLE-ALL MLE-W2 MLE-ALL MLE-W4 MLE-ALL MLE-W7 MLE-ALL MLE-W14 AOL 2 0.090 0.091 (1.11%) 0.090 0.091 (1.11%) 0.090 0.091 (1.11%) 0.090 0.091 (1.11%) 3 0.143 0.147 (2.80%) 0.143 0.146 (2.10%) 0.143 0.145 (1.40%) 0.143 0.145 (1.40%) 4 0.185 0.189 (2.16%) 0.184 0.189 (2.72%) 0.184 0.188 (2.17%) 0.184 0.187 (1.63%) 5 0.217 0.215 (-0.92%) 0.216 0.217 (0.46%) 0.217 0.218 (0.46%) 0.217 0.219 (0.92%) MSN 2 0.112 0.117 (4.46%) 0.111 0.115 (3.60%) 0.111 0.113 (1.80%) 0.110 0.111 (0.91%) 3 0.164 0.163 (-0.61%) 0.164 0.165 (0.61%) 0.164 0.165 (0.61%) 0.164 0.164 (0.00%) 4 0.197 0.188 (-4.57%) 0.197 0.193 (-2.03%) 0.197 0.196 (-0.51%) 0.197 0.197 (0.00%) 5 0.215 0.197 (-8.37%) 0.216 0.205 (-5.09%) 0.216 0.211 (-2.31%) 0.218 0.216 (-0.92%) Table 2: MRR observed for QAC when using all prior query-log evidence, and the past 2, 4, 7 or 14 days of query-log evidence. Prefix (ρ) lengths of 2 to 5 characters are reported for the AOL and MSN query-logs. The best performing sliding window setting is highlighted for each prefix length (although in some cases this is still outperformed by the baseline, at least for the reported window periods). prove QAC performance over MLE-ALL for all prefix lengths, al- likely to play a significant role in the already reduced set of pos- beit relatively marginally for 5 characters. Using a shorter 2 day sible completion suggestions. Moreover, for less common prefixes window of evidence improves QAC performance by up to nearly (and therefore rarer queries), relying on a longer query-log period 3% for shorter prefixes of 2 or 3 characters. For a prefix of 4 char- is more likely to include the evidence necessary to rank them effec- acters, using a little more evidence, e.g. 4 days is optimal. Simi- tively as they were less likely to be used recently. larly, the best performance for a 5 character prefix is obtained when Conclusion. In this paper we examined the trade-off between using 14 days of evidence. QAC robustness and real-time temporal sensitivity. We found that For MSN, shorter prefixes (e.g. 2 or 3 characters) can outper- QAC effectiveness can be improved by up to nearly 5%, simply by form the baseline when using a window of evidence. Specifically, selecting the optimal time period of query popularity aggregation we see the best performance improvement of nearly 4.5% when us- for each prefix length. The period necessary to achieve optimal ing 2 days of evidence for a 2 character prefix. However, using QAC effectiveness varies by prefix length; shorter prefixes (e.g. 2- between 2 and 14 days of query-log evidence always impairs QAC 3 characters) perform best with only short-term evidence (e.g. 2-7 performance compared to the baseline for 4 or 5 character prefixes. days), whereas longer prefixes (e.g. 4-5 characters) require more Notably, the detrimental effect on performance is reduced as the long-term evidence (e.g. 7-14 days, or more). Results also indicate sliding window of evidence is increased. the need to train per query-log, in order to capture intrinsic tempo- ral and demographic characteristics. Care must also be taken with 8. DISCUSSION AND CONCLUSION the sampling of queries used for training. The baseline QAC performance, and the following sliding win- Further work will experiment with larger, more recent query- dow QAC improvement characteristics for each query-log are logs and perform cross-validation to verify the preliminary findings considerably different between the query-logs. AOL QAC is we present in this paper. Moreover, we will investigate alternative marginally but consistently improved in almost all cases, whereas modelling techniques to improve QAC effectiveness for real-time MSN QAC is only improved for shorter prefixes, albeit much more temporal queries, which are problematic for time-series modelling so than for AOL. This suggests that the two query-logs have dif- as they spike so briefly in time. ferent temporal characteristics. In part, this may be caused by a Acknowledgments. This work was partially supported by the couple of factors. Firstly, although AOL has more queries, it is EU LiMoSINe project (288024). spread more sparsely over a three month period, in contrast, MSN queries are concentrated in a 1 month period. Additionally, AOL 9.[1] WSCD REFERENCES ’09: Proceedings of the 2009 workshop on Web Search Click Data, New has a day of missing data [2] which will harm QAC effectiveness York, NY, USA, 2009. ACM. following the affected period. Secondly, there may be underlying [2] E. Adar, D. S. Weld, B. N. Bershad, and S. S. Gribble. Why we search: demographic differences between users of the two search engines visualizing and predicting user behavior. WWW ’07, pages 161–170, New York, NY, USA, 2007. ACM. that lead to changes in query distributions. [3] Z. Bar-Yossef and N. Kraus. Context-sensitive query auto-completion. In WWW Although the performance improvement characteristics for each ’11, pages 107–116, 2011. prefix and sliding window are different for each query-log, there is [4] S. M. Beitzel, E. C. Jensen, A. Chowdhury, O. Frieder, and D. Grossman. a clear overall linear relationship emerging in the results between Temporal analysis of a very large topically categorized web query log. J. Am. Soc. Inf. Sci. Technol., 58(2):166–178, Jan. 2007. prefix length and optimal sliding window period. As such, QAC for [5] P. Ellis. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, shorter prefixes performs optimally with a shorter sliding window and the Interpretation of Research Results. Cambridge University Press, 2010. of evidence, and conversely, QAC for longer prefixes performs best [6] A. Kulkarni, J. Teevan, K. M. Svore, and S. T. Dumais. Understanding temporal with a longer sliding window of evidence. query dynamics. In ACM WSDM ’11, pages 167–176, New York, NY, USA, 2011. ACM. This relationship probably arises from the uncertainty posed by [7] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. InfoScale ’06, short and non-specific prefix lengths [3], where the space of possi- New York, NY, USA, 2006. ACM. ble completion suggestions is large. In these cases, using a shorter- [8] C. Sengstock and M. Gertz. Conquer: a system for efficient context-aware query period of evidence will still reflect long-term popular queries, but suggestions. WWW ’11, pages 265–268, New York, NY, USA, 2011. ACM. [9] M. Shokouhi. Detecting seasonal queries by time-series analysis. In SIGIR ’11, also be sensitive to temporal variation. Longer prefixes are more pages 1171–1172, 2011. specific and thus narrow possible completion suggestions consid- [10] M. Shokouhi and K. Radinsky. Time-sensitive query auto-completion. In SIGIR erably. In these cases, real-time temporal factors are probably less ’12, pages 601–610, 2012.