Assessing Query Suggestions for Search Session Simulation

Sebastian Günther and Matthias Hagen
Martin-Luther-Universität Halle-Wittenberg, Halle (Saale), Germany
sebastian.guenther@informatik.uni-halle.de, matthias.hagen@informatik.uni-halle.de

Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021

Abstract
Research on simulating search behavior has mainly dealt with result list interactions in recent years. We instead focus on the querying process and describe a pilot study that assesses the applicability of search engine query suggestions for simulating search sessions (i.e., sequences of topically related queries). In automatic and manual assessments, we evaluate to what extent a session detection approach considers the simulated query sequences to be "authentic" and how humans perceive their quality in the sense of coherence, realism, and representativeness of the underlying topic. As for the actual suggestion-based simulation, we compare different approaches for selecting the next query in a sequence (always selecting the first suggestion, random sampling, or topic-informed selection) to the human sessions of the TREC Session track and to a previously suggested simulation scheme. Our results show that while it is easy to create query logs that appear authentic to both human assessors and an automated evaluation, keeping the sessions related to an underlying topic can be difficult when relying on the given suggestions only.

Keywords
Simulating query sequences, search session simulation, query suggestion, TREC Session track, task-based search

1. Introduction

Many studies on the simulation of search behavior focus on using simulated user behavior in system evaluations, while others cover aspects of user modeling in general. Using simulated interactions for evaluation purposes is usually motivated by retrieval setups with no or only few actual users whose behavior can be observed and used to improve the actual system (e.g., system variants in digital libraries or new (academic) search prototypes without an established user base). Such few-user systems could also be evaluated in lab studies. But lab studies are difficult to scale up and also consume a lot of time since actual users need to be hired, instructed, and observed. In such situations, simulation promises a way out, but the extent to which simulated search interactions can authentically replace real users in specific scenarios is still an open question. In recent years, mostly result clicks or stopping decisions have been the focus of user modeling and simulation studies, while simulating querying behavior has received less attention.

In this paper, we describe a pilot study on query simulation that aims to assess the suitability of stitching together query suggestions to form "realistic" search sessions (i.e., sequences of queries on the same information need that some human might have submitted). The scenario we address is inspired by typical TREC-style evaluation setups where search topics are given as a verbal description of some information need along with a title or first query. To simulate a search session with a couple of queries, we examine sequences of query suggestions provided by some suggestion approach; in our pilot experiments, we simply use the suggestions that the Google search engine returns, but any other suggestion approach could also be applied. Starting with the actual title or the first query of a TREC topic, the second query for the session is selected among the suggestions for the first query, the third query is selected from the suggestions for the second query, and so on.
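To make this chaining scheme concrete, the following Python sketch builds such a session. It is a minimal illustration rather than our exact experimental code, and it assumes the unofficial Google Suggest JSON endpoint (suggestqueries.google.com); any other suggestion service could be substituted.

    import random
    import requests

    SUGGEST_URL = "https://suggestqueries.google.com/complete/search"

    def get_suggestions(query: str) -> list[str]:
        """Return the (up to 10) suggestions for a query; may be empty."""
        response = requests.get(SUGGEST_URL,
                                params={"client": "firefox", "q": query},
                                timeout=10)
        response.raise_for_status()
        return response.json()[1]  # response format: [query, [suggestion, ...]]

    def simulate_session(seed_query: str, max_followups: int = 4,
                         pick_random: bool = False) -> list[str]:
        """Chain suggestions: each next query is a suggestion for the previous one."""
        session = [seed_query]
        while len(session) <= max_followups:
            candidates = [s for s in get_suggestions(session[-1])
                          if s != session[-1]]  # avoid immediate repeats
            if not candidates:
                break  # terminate early when no suggestions are returned
            session.append(random.choice(candidates) if pick_random
                           else candidates[0])
        return session

Setting pick_random=True corresponds to the random-selection variant described in Section 3; the default corresponds to always taking the first suggestion.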
Our research question is how such suggestion-based simulated sessions compare to real user sessions in the sense of coherence, realism, and representativeness of the underlying topic. In our pilot study, we thus let a human annotator assess human sessions from the TREC Session track mixed with sessions generated from suggestion sequences and sessions generated by a previous, more static query simulation scheme. The results show that suggestion-based sessions replicate patterns commonly seen in query logs: both humans and a session detection framework were unable to distinguish the simulated sessions from real ones. However, keeping close to the given topic when using suggestions as simulated queries is rather difficult. Among other reasons, the limited terminology in the topic, query, and suggestions and, most importantly, the relatively small number of suggestions provided by the Google Suggest Search API often cause the session to drift away from the given topic.

2. Related Work

Similar to recent developments in the field of recommenders [1], simulation in the context of information retrieval often aims to support the experimental evaluation of retrieval systems (e.g., in scenarios with few user interactions like in digital libraries) in a cost-gain scenario [2] (cost for retrieval system interactions, gain for retrieving good results). Different areas of user behavior have been addressed by simulation: scanning snippets / result pages, judging document relevance, clicking on results, reading result documents, deciding about stopping the search, and query (re-)formulation itself. Some simulation studies combine several of these areas while others focus on a particular one. In this paper, we focus on simulating query (re-)formulation behavior. While quite a few studies on user click models and stopping decisions have been published in recent years, query formulation is still perceived as difficult to simulate [3] but also as necessary to generate useful simulations for interactive retrieval evaluation [4].
The existing approaches to query simulation can be divided into approaches that generate queries following rather static underlying schemes [5, 6, 7, 8, 9] and approaches that use language models constructed from the topic itself, from observed snippets, or from some result documents to generate queries of varying lengths [10, 11, 3, 12]. Not all, but most of the query simulations aim to simulate search sessions in the sense of query sequences that all have a similar intent [13, 14].

As for the static simulation schemes, many different ideas have been suggested. Jordan et al. [7] generate controlled sets of single-term, two-term, and multi-term queries for retrieval scenarios on the Reuters-21578 corpus by combining terms of selected specificity in the documents of the corpus (e.g., only highly discriminative terms to form very specific queries). Later studies have suggested combining terms from manually generated query word pools and have tested this on TREC topics. The respective querying strategies sample initial and subsequent query words from these pools and combine them to search sessions [5, 6, 8] following static schemes of, for instance, keeping the same two terms in every query but adding different third terms, or generating all possible three-permutations of three-term queries [6]. The suggested static schemes have been "idealized" from real searcher interactions [8] and have also been used in a later language modeling query simulator [12]. Similar to the mentioned keep-two-terms-but-vary-the-third-term query formulation strategy, Verberne et al. [9] create queries of n terms for the iSearch collection where n−1 terms are kept and the last term is varied, to mimic academic information seeking behavior and to evaluate the cumulated gain over a simulated session.

One of the earliest more language model-based query simulators was suggested by Azzopardi et al. [10] in the domain of known-item search on the EuroGOV corpus (a crawl of European government-related sites). Single queries for some given known-item document are generated from the term distribution within the document and some added "noise" to mimic imperfect human memory. The later InQuery system of Keskustalo et al. [15] used Bayesian inference networks to generate queries, Azzopardi [11] generated additional ad hoc queries for existing TREC collections, and Carterette et al. [3] suggest a reformulation simulator that simulates whole sessions by also including the snippets from the seen result pages in the language model, using TREC Session track data.

Some anchor text-based approaches to "simulate" complete query logs or to train query translation models also constitute a topic loosely related to ours [16, 17]. However, we aim to simulate shorter sequences of topically related queries instead of complete query logs. As for the simulation, we want to study in pilot experiments whether and how well sequences of query suggestions stitched together may form search sessions. This idea is inspired by studies on query suggestions to support task-based search [18, 19], since more complicated tasks usually result in more interactions and queries from the respective users. Our research question thus is how "authentic" sessions can be that are formed by simply following suggestions up to some depth.

3. Query Log Generation

As described above, various types of datasets and models have been suggested for query simulation. In this paper, we want to study a source not yet covered: query suggestions. Our reasoning is that query suggestions from large search engines are derived from their large query logs and thus represent "typical" user behavior. In our pilot experiments, we specifically focus on query suggestions provided by the Google Suggest Search API (which serves up to 10 suggestions at a time) but, in principle, any other suggestion approach could also be applied (e.g., suggestions from other large search engines or suggestion methods from the literature). Still, the characteristics of the suggestions may vary between different services, such that the results of our pilot experiments should be tested in a more general setup with different suggestion approaches.

As our basis for simulated and real sessions, we use the TREC 2014 Session track dataset [20] containing 1021 sessions on 60 topics. Each topic is defined by an information need given as a short description. The respective sessions include (among other information) the queries some user formulated on the topic with timestamps, the shown snippets, and the clicked results. We extract the first queries of the sessions as seed queries for the simulated sessions since the topics themselves do not have explicit titles that might be used as a first query. In addition to the TREC data, we also sampled sessions from the Webis-SMC-12 dataset [21] that contains query sequences from the AOL log [22].

As suggestion-based session simulations, we consider the following three strategies in our pilot study.
First Suggestion. This strategy always selects the first suggestion that the Google Suggest Search API provides for the previous query of the session. A generated session contains a maximum of four queries in addition to the original query (in several query log datasets that we analyzed, the average sessions had up to five queries). A session might be terminated early if the API does not provide additional suggestions.

Random Suggestion. This strategy randomly selects one of the suggestions that the Google Suggest Search API provides for the previous query of the session. Like with the first suggestion strategy, generated sessions contain up to four queries in addition to the original query. The same query cannot appear back-to-back, and a session might be terminated early if the API does not provide additional suggestions.

Three Word Queries (adapted). This strategy is based on the idea of the Session Strategy S3 described by Keskustalo et al. [8], which is also implemented in the SimIIR framework (https://github.com/leifos/simiir) as TriTermQueryGenerator. The original idea uses two terms as the basis, extended by a third term selected from a topic description. We adapt this strategy with a few modifications. Initially, we start with the original query from the real session without any additions. We then extract the 10 keywords with the highest tf·idf scores from the topic's description (idf computed on the English Wikipedia). In each round, we calculate the cosine similarity of each suggestion to each original query–keyword pair and select the suggestion that is closest to one of the pairs. We limit the sessions to a maximum of four queries in addition to the original query. We also employ a dynamic threshold for the cosine similarity that stops accepting suggestions when the similarity falls below a certain value. Due to the varying length and specificity of the descriptions and the ambiguity of the topics, this threshold has to be manually adjusted for each topic. In our evaluation, we note that choosing an important term from the topic description gives this strategy an advantage over the previous two with respect to the topic representativeness of the generated sessions.
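The selection step of the adapted three word strategy can be sketched as follows. This is a simplified stand-in for our setup: the tf·idf keywords are assumed to be precomputed (in the paper, idf is computed on the English Wikipedia), a plain bag-of-words cosine similarity is used, and the fixed threshold of 0.3 merely illustrates the per-topic threshold that we adjust manually.

    import math
    from collections import Counter

    def cosine_similarity(text_a: str, text_b: str) -> float:
        """Bag-of-words cosine similarity between two strings."""
        vec_a = Counter(text_a.lower().split())
        vec_b = Counter(text_b.lower().split())
        dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
        norm = (math.sqrt(sum(v * v for v in vec_a.values()))
                * math.sqrt(sum(v * v for v in vec_b.values())))
        return dot / norm if norm else 0.0

    def select_suggestion(original_query: str, topic_keywords: list[str],
                          suggestions: list[str], threshold: float = 0.3):
        """Pick the suggestion closest to any query-keyword pair, or None
        if even the best similarity falls below the stop threshold."""
        best_suggestion, best_score = None, 0.0
        for suggestion in suggestions:
            for keyword in topic_keywords:
                score = cosine_similarity(suggestion, f"{original_query} {keyword}")
                if score > best_score:
                    best_suggestion, best_score = suggestion, score
        return best_suggestion if best_score >= threshold else None

Returning None here corresponds to terminating the session once no suggestion is similar enough to the topic anymore.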
For the three approaches, we generate 100, 100, and 20 sessions, respectively (in the case of the three word strategy, the strict selection process and the small pool of suggestions often result in very short sessions, such that we could only include 20 sessions in the evaluation). While we mostly focus on the textual aspect of the queries in this paper, user session logs often come with additional information like user agent, user identification, IP address, and the date and time of the interaction. Each of our sessions consists of at least one query with a fixed user assigned to it. To be able to run automatic session detection, we also simulate a timestamp for each query submission.

Inter-Query Time. To simulate the time gaps between query submissions, we have extracted the timings of user sessions from the Webis-SMC-12 dataset [21]. Our analysis shows that 25% of the time gaps are shorter than 41 seconds, while half of the gaps are no longer than 137 seconds. The distribution of timings shows a peak at 8 seconds and a long tail with the highest values in the multi-hour range. To account for logging and annotation errors, we have removed outliers by deleting the 10% longest gaps, which limits the simulated time between query submissions to at most 20 minutes. We use this remaining pool of time gaps to reproduce the timing distribution for our generated sessions by randomly drawing values from it, which naturally favors shorter time spans since they are more frequent.
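A minimal sketch of this timestamp simulation, assuming the observed gaps (in seconds) have already been extracted from the Webis-SMC-12 sessions:

    import random

    def build_gap_pool(observed_gaps: list[float]) -> list[float]:
        """Keep the shortest 90% of the observed gaps (outlier removal)."""
        ordered = sorted(observed_gaps)
        return ordered[: int(len(ordered) * 0.9)]

    def simulate_timestamps(start_time: float, num_queries: int,
                            gap_pool: list[float]) -> list[float]:
        """Assign a timestamp to each query of a session; drawing uniformly
        from the pool reproduces the empirical gap distribution, in which
        short gaps are more frequent."""
        timestamps = [start_time]
        for _ in range(num_queries - 1):
            timestamps.append(timestamps[-1] + random.choice(gap_pool))
        return timestamps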
Limits when using Suggestions. While working on our pilot study, we experimented with various combinations of suggestion selection strategies and session lengths. We identified issues in our strategies that are a direct result of the nature of search engine suggestions. The first suggestion strategy is particularly prone to loops when two queries are the top-ranked suggestions for each other, causing the generated session to alternate between two query strings; we also observed this for singular–plural pairs or categories (e.g., file formats, programming languages). To counter the looping issue, we use a unique query approach, which ensures that queries are not repeated within a session. Additionally, another policy ensures a minimum dissimilarity between consecutive queries, which helps to avoid plurals as top suggestions. However, while unique / dissimilar queries mitigate looping, we find that especially longer sessions (say, ten queries) narrow down to very specific topics. A possible reason is that today's search engine query suggestions do not only show related queries but often offer more specific autocompletions. Further details on the evaluation are provided in Section 4.
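The two counter-measures described above can be sketched as a filter over the returned suggestions. The difflib-based similarity and the 0.9 cutoff are illustrative assumptions, not the exact dissimilarity measure we used.

    from difflib import SequenceMatcher

    def is_acceptable(candidate: str, session: list[str],
                      max_similarity: float = 0.9) -> bool:
        """Reject candidates that repeat an earlier query (loop prevention)
        or are near-duplicates of the previous query (e.g., plural forms)."""
        if candidate.lower() in (query.lower() for query in session):
            return False  # unique-query policy
        ratio = SequenceMatcher(None, candidate.lower(),
                                session[-1].lower()).ratio()
        return ratio < max_similarity  # minimum-dissimilarity policy

    def filter_suggestions(suggestions: list[str],
                           session: list[str]) -> list[str]:
        """Keep only the suggestions that pass both policies."""
        return [s for s in suggestions if is_acceptable(s, session)]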
4. Evaluation

In the evaluation, we compare the sessions generated by our three approaches to sessions from both the Webis-SMC-12 dataset and the TREC 2014 Session track. As a first step, we perform an automated evaluation by running the sessions through the session detection approach of Hagen et al. [21]; ideally, the simulated sessions should not be split by the session detection in order to count as "authentic". In a second step, a human assessor looked at the simulated as well as the original sessions and had to judge whether a session seems to be simulated or of human origin. In a third step, a human assessor judged whether a session actually covers the intended information need given by the topic description.

4.1. Automatic Session Detection

The goal of a session detection system is to identify whether consecutive queries belong to the same information need or not. When a consecutive pair is detected that seems to belong to two different information needs, a split is introduced. Later, some of these sessions might be run through a mission detection to identify non-consecutive sessions that belong to the same search task, etc.

As an automatic evaluation of the simulated sessions' authenticity, we individually run each simulated session and the individual sessions from the TREC and Webis-SMC-12 data through the session detection approach of Hagen et al. [21]. A simulated or original session "passes" the automatic authenticity test iff the detection approach does not introduce a split. The results are shown in Table 1 (sessions with only one query were removed since they can never be split).

Table 1
Number of within-session splits the automatic session detection introduced for simulated and real sessions (more splits mean that more query pairs seem to be unrelated; * indicates that one-query sessions were removed).

Strategy                   Sessions   Splits
First suggestion*                64        1
Random suggestion*               65        2
Three word queries               20        0
TREC 2014 Session Track        1257      142
Webis-SMC-12                   2882      217

Altogether, the simulated sessions are hardly split by the automatic detection. The one wrong split for the first suggestion strategy and one of the wrong splits for the random suggestion strategy are likely due to the first query being uppercased while the subsequent suggestions are lowercased; the second "wrong" split for the random suggestion strategy is likely caused by a reformulation with an abbreviation and no term overlap ("no air conditioning alternatives" to "what to use instead of ac"). These examples serve as a good demonstration of the limitations of a fully automatic authenticity evaluation, such that we also manually assess the simulated sessions.
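Conceptually, the automatic authenticity test reduces to the following check, where detect_splits is a hypothetical stand-in for the session detection approach of Hagen et al. [21], which we treat as a black box:

    from typing import Callable

    def authenticity_rate(sessions: list[list[str]],
                          detect_splits: Callable[[list[str]], int]) -> float:
        """Fraction of multi-query sessions that the detector leaves unsplit
        (a session "passes" iff no within-session split is introduced)."""
        multi_query = [s for s in sessions if len(s) > 1]  # one-query sessions removed
        passed = sum(1 for s in multi_query if detect_splits(s) == 0)
        return passed / len(multi_query) if multi_query else 0.0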
4.2. Human Authenticity Assessment

An automated session detection system only "assesses" whether consecutive queries seem to belong together based on factors like lexical or semantic similarity and time gaps. We complement this purely automatic relatedness detection by a manual assessment of how "authentic" the simulated sessions are perceived to be by humans, i.e., whether a human can distinguish simulated from real sessions.

Procedure. All simulated sessions and a sample of original sessions are combined into one session pool. The sessions are then presented to the judge as a kind of log excerpt with user ID, timestamps, and queries. The judge has no accurate knowledge about the number of queries for each approach, and there is no obvious way to determine the source of a session. The judge then labels each session as real (sampled from actual query logs) or simulated (by one of the three approaches). The results in Table 2 indicate that the simulated sessions are perceived as real even though the assessor was told that some sessions actually are simulated.

Table 2
Manual judgments for all sessions whether they are simulated or "real" (* indicates that one-query sessions were removed). "Real" in the upper group and "simulated" in the lower group indicate cases where the judge was misled.

Strategy                   Sessions   Real   Simulated
First suggestion*                64     62           2
Random suggestion*               65     62           3
Three word queries               20     17           3

TREC 2014 Session Track          50     49           1
Webis-SMC-12                     50     50           0

During the assessment, the assessor took notes on which features of a session or query determined the judgment. This helps us in understanding how humans and algorithms may come to different verdicts. The primary criteria for the relatedness of two queries are their term composition and length; similarities in those aspects are perceived as patterns. This is also true for small editing actions (adding or replacing single words), which naturally come with the specialization towards a topic. The opposite effect is perceived for rapid topic changes: when multiple closely related tasks have to be fulfilled within one session, there may be large changes from query to query. This is also true for replacing words by synonyms or abbreviations. While a human judge will usually be able to infer context for those rapid changes, an automatic process is more likely to detect a new session. Another discrepancy between human and algorithmic evaluation becomes apparent for outlier behavior like text formatting (e.g., all-uppercase queries) that a human might judge as a simple typing error while a detection approach without lowercasing preprocessing might be misled.

In a nutshell, while both humans and algorithms look for patterns in the sessions and queries, the human judge does so more selectively by looking for mistakes. If found, the type of a mistake usually heavily influences the assessment of a session. Finally, note that due to the nature of the three word query strategy, there might be a chance for an informed human to guess a session's origin.

4.3. Human Topicality Assessment

So far, we have shown that the authenticity of a session is largely influenced by its term composition and appearance. However, to serve as a replacement for humans, a session generator not only has to provide sessions that a detection approach or some human would assess as authentic, but it also has to simulate sessions that follow the topic given as part of the evaluation study.

Procedure. Determining whether a session or query is on topic is a non-trivial task. While a query like "car" overlaps with the topic "find information on used car prices", it does not address the information need formulated in the topic description. We therefore set the following criteria to evaluate whether a session is "on topic": A session is "on topic" if its last query addresses at least one information need formulated in the topic description or shows clear signs that the session is headed in that direction (such that very short sessions are more likely to be on topic). A session is also "on topic" if any query of the session addresses at least one information need formulated in the topic description; this condition is necessary to account for topics with multiple subtasks.

Hypothesis. The first and random suggestion approaches do not take the topic into account; both simply converge to whatever the suggestion API provides for the initial query. The three word approach, instead, makes informed decisions when choosing suggestions and should therefore be able to stay more "on topic".

Results. We have manually judged all generated sessions. The results in Table 3 show that even the uninformed strategies stay "on topic" in about one third of the sessions. This can largely be attributed to the nature of the TREC Session track topics, which often contain several subtasks. Sessions generated by the three word strategy stay "on topic" even more often.

Table 3
Number of simulated sessions judged as "on topic" with respect to the TREC topic description (* indicates that one-query sessions were removed).

Strategy              Sessions   On Topic
First suggestion*           64         21
Random suggestion*          65         20
Three word queries          20         20

4.4. Notable Examples

As part of the judgment process, we have also taken note of simulated sessions that contain conspicuous editing steps or queries. The examples in Table 4 include a positive and a negative example with respect to authenticity.

Table 4
Example sessions with unusual editing patterns.

Query String                                    Time

First suggestion
air conditioning alternatives                   15:05:53
air conditioning alternatives car               15:10:22
no air conditioning in car alternatives         15:11:07
how can i keep my car cool without ac           15:15:28
ways to keep car cool without ac                15:21:16

Random suggestion
air conditioning alternatives                   17:31:54
no air conditioning alternatives                17:32:27
what to use instead of ac                       17:36:28
what to use instead of activator                17:45:42
what can i use instead of activator for nails   17:51:03
how to make nail activator                      17:53:26

Random suggestion
Philadelphia                                    03:31:29
philadelphia cheese                             03:34:50
philadelphia cheese recipes                     03:35:05
philadelphia cheese recipes salmon pasta        03:53:17

The first example was judged as "real" based on the usage of an abbreviation for air conditioning in the fourth query. The replacement of terms or groups of terms with a common abbreviation might be seen as a typical step for a human user after gaining more insight into a topic. The second example includes an issue that was caused by the autocomplete feature of the Google suggestions: the abbreviation 'ac' was falsely extended to the term 'activator', which ultimately changed the subject of the session. The third example shows a very common issue of ambiguous first queries. For the first and random suggestion strategies, there is no way to determine that a city is referenced in this example, such that the session quickly diverges to the food domain.

4.5. Long Sessions

For the simulated sessions up to this point, parameters like session length and inter-query time were set to values that seemed appropriate based on some initial experiments on our end, in order to generate "close to real" sessions. We also did not include navigational queries or known-item searches, which often result in either very short or very long sessions. To investigate the applicability of our approaches to such outlier behavior, we have also further assessed some sessions with up to 20 queries.

Even without imposing any limits on the generation process, the sessions were still often terminated early due to a lack of suggestions. This was mostly caused by two reasons: either the query became too specific to still yield additional suggestions, or the pool of unique and dissimilar queries was used up. In cases where long sessions could actually be generated, the sessions usually became rather specific quickly and diverged substantially from the given topic towards the end of the session.

Using a different set of more technically oriented topics, we were able to generate longer sessions more frequently. For this to work, we had to limit the dissimilarity filter, as abbreviations within the queries were more frequent and therefore editing distances were smaller. We also observed that queries from this field were mostly composed of categorical keywords stitched together, compared to the more natural-looking sessions from standard query logs.

Those observations, while helping to shape our pilot study, show that parameters and strategies for authentic session generation are a very dynamic and potentially also topic-specific issue.

5. Conclusion

In this paper, we have investigated how well authentic sessions can be simulated using web search engine query suggestions. By employing different strategies for selecting and combining the suggestions, we showcased the potential but also the limits of the overall usefulness of suggestion-based session simulation. Our evaluation showed that both humans and a session detection framework are unable to distinguish suggestion-based sessions from sampled real sessions. While some kind of authenticity can thus be attributed to the simulated sessions, staying on topic proved to be rather difficult. Addressing the outlined shortcomings is an interesting direction for future work. We plan to continue investigating query simulation as follows.

Data Independence. Relying on suggestions as query candidates limits the flexibility and applicability of the simulated sessions. We will work on query modifications that include "knowledge" from language models or predefined editing rules.

Influence on the Topic. For accurate session simulation, it is necessary to influence the topic that the queries follow. We will evaluate how and where those decisions have to be made to create an effective user model.

User Types and Editing. Since query modifications often follow well-known patterns, we will also investigate ways to replicate editing patterns in simulated queries that are typical for specific user groups or tasks.

Acknowledgments

This work has been partially supported by the DFG through the project "SINIR: Simulating INteractive Information Retrieval" (grant HA 5851/3-1).
References

[1] S. Zhang, K. Balog, Evaluating conversational recommender systems via user simulation, in: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020), ACM, 2020, pp. 1512–1520. doi:10.1145/3394486.3403202.
[2] M. McGregor, L. Azzopardi, M. Halvey, Untangling cost, effort, and load in information seeking and retrieval, in: Proceedings of the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2021), ACM, 2021, pp. 151–161. doi:10.1145/3406522.3446026.
[3] B. Carterette, A. Bah, M. Zengin, Dynamic test collections for retrieval evaluation, in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval (ICTIR 2015), ACM, 2015, pp. 91–100. doi:10.1145/2808194.2809470.
[4] B. Carterette, E. Kanoulas, E. Yilmaz, Simulating simple user behavior for system effectiveness evaluation, in: Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM 2011), ACM, 2011, pp. 611–620. doi:10.1145/2063576.2063668.
[5] F. Baskaya, H. Keskustalo, K. Järvelin, Time drives interaction: Simulating sessions in diverse searching environments, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), ACM, 2012, pp. 105–114. doi:10.1145/2348283.2348301.
[6] F. Baskaya, H. Keskustalo, K. Järvelin, Modeling behavioral factors in interactive information retrieval, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), ACM, 2013, pp. 2297–2302. doi:10.1145/2505515.2505660.
[7] C. Jordan, C. R. Watters, Q. Gao, Using controlled query generation to evaluate blind relevance feedback algorithms, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2006), ACM, 2006, pp. 286–295. doi:10.1145/1141753.1141818.
[8] H. Keskustalo, K. Järvelin, A. Pirkola, T. Sharma, M. Lykke, Test collection-based IR evaluation needs extension toward sessions: A case of extremely short queries, in: Proceedings of the 5th Asia Information Retrieval Symposium (AIRS 2009), volume 5839 of Lecture Notes in Computer Science, Springer, 2009, pp. 63–74. doi:10.1007/978-3-642-04769-5_6.
[9] S. Verberne, M. Sappelli, K. Järvelin, W. Kraaij, User simulations for interactive search: Evaluating personalized query suggestion, in: Proceedings of the 37th European Conference on IR Research (ECIR 2015), volume 9022 of Lecture Notes in Computer Science, Springer, 2015, pp. 678–690. doi:10.1007/978-3-319-16354-3_75.
[10] L. Azzopardi, M. de Rijke, K. Balog, Building simulated queries for known-item topics: An analysis using six European languages, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, 2007, pp. 455–462. doi:10.1145/1277741.1277820.
[11] L. Azzopardi, Query side evaluation: An empirical analysis of effectiveness and effort, in: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), ACM, 2009, pp. 556–563. doi:10.1145/1571941.1572037.
[12] D. Maxwell, L. Azzopardi, K. Järvelin, H. Keskustalo, Searching and stopping: An analysis of stopping rules and strategies, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), ACM, 2015, pp. 313–322. doi:10.1145/2806416.2806476.
[13] R. Jones, K. L. Klinkner, Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), ACM, 2008, pp. 699–708. doi:10.1145/1458082.1458176.
[14] A. H. Awadallah, X. Shi, N. Craswell, B. Ramsey, Beyond clicks: Query reformulation as a predictor of search satisfaction, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), ACM, 2013, pp. 2019–2028. doi:10.1145/2505515.2505682.
[15] H. Keskustalo, K. Järvelin, A. Pirkola, Evaluating the effectiveness of relevance feedback based on a user simulation model: Effects of a user scenario on cumulated gain value, Information Retrieval 11 (2008) 209–228. doi:10.1007/s10791-007-9043-7.
[16] V. Dang, W. B. Croft, Query reformulation using anchor text, in: Proceedings of the Third International Conference on Web Search and Web Data Mining (WSDM 2010), ACM, 2010, pp. 41–50. doi:10.1145/1718487.1718493.
[17] N. Craswell, B. Billerbeck, D. Fetterly, M. Najork, Robust query rewriting using anchor data, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM 2013), ACM, 2013, pp. 335–344. doi:10.1145/2433396.2433440.
[18] D. Garigliotti, K. Balog, Generating query suggestions to support task-based search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), ACM, 2017, pp. 1153–1156. doi:10.1145/3077136.3080745.
[19] H. Ding, S. Zhang, D. Garigliotti, K. Balog, Generating high-quality query suggestion candidates for task-based search, in: Proceedings of the 40th European Conference on IR Research (ECIR 2018), volume 10772 of Lecture Notes in Computer Science, Springer, 2018, pp. 625–631. doi:10.1007/978-3-319-76941-7_54.
[20] B. Carterette, E. Kanoulas, M. M. Hall, P. D. Clough, Overview of the TREC 2014 Session track, in: Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014), volume 500-308 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2014. http://trec.nist.gov/pubs/trec23/papers/overview-session.pdf.
[21] M. Hagen, J. Gomoll, A. Beyer, B. Stein, From search session detection to search mission detection, in: Proceedings of Open Research Areas in Information Retrieval (OAIR 2013), ACM, 2013, pp. 85–92. http://dl.acm.org/citation.cfm?id=2491769.
[22] G. Pass, A. Chowdhury, C. Torgeson, A picture of search, in: Proceedings of the 1st International Conference on Scalable Information Systems (Infoscale 2006), volume 152 of ACM International Conference Proceeding Series, ACM, 2006, p. 1. doi:10.1145/1146847.1146848.