Assessing Query Suggestions for Search Session Simulation

Sebastian Günther and Matthias Hagen
Martin-Luther-Universität Halle-Wittenberg, Halle (Saale), Germany
sebastian.guenther@informatik.uni-halle.de, matthias.hagen@informatik.uni-halle.de

Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021

Abstract
Research on simulating search behavior has mainly dealt with result list interactions in recent years. We instead focus on the querying process and describe a pilot study that assesses the applicability of search engine query suggestions for simulating search sessions (i.e., sequences of topically related queries). In automatic and manual assessments, we evaluate to what extent a session detection approach considers the simulated query sequences to be "authentic" and how humans perceive their quality in the sense of coherence, realism, and representativeness of the underlying topic. As for the actual suggestion-based simulation, we compare different approaches for selecting the next query in a sequence (always selecting the first suggestion, random sampling, or topic-informed selection) to the human sessions of the TREC Session track and to a previously suggested simulation scheme. Our results show that while it is easy to create query logs that appear authentic to both human assessors and an automated evaluation, keeping the sessions related to an underlying topic can be difficult when relying on the given suggestions only.

Keywords
Simulating query sequences, search session simulation, query suggestion, TREC Session track, task-based search

1. Introduction

Many studies on the simulation of search behavior focus on using simulated user behavior in system evaluations, while others cover aspects of user modeling in general. Using simulated interactions for evaluation purposes is usually motivated by retrieval setups with no or only few actual users whose behavior can be observed and used to improve the actual system (e.g., system variants in digital libraries or new (academic) search prototypes without an established user base). Such few-user systems could also be evaluated in lab studies. But lab studies are difficult to scale up and also consume a lot of time since actual users need to be hired, instructed, and observed. In such situations, simulation promises a way out, but the extent to which simulated search interactions can authentically replace real users in specific scenarios is still an open question. In recent years, mostly result clicks or stopping decisions have been the focus of user modeling and simulation studies, while simulating querying behavior has received less attention.

In this paper, we describe a pilot study on query simulation that aims to assess the suitability of stitching together query suggestions to form "realistic" search sessions (i.e., sequences of queries on the same information need that some human might have submitted). The scenario we address is inspired by typical TREC-style evaluation setups where search topics are given as a verbal description of some information need along with a title or first query. To simulate a search session with a couple of queries, we examine sequences of query suggestions provided by some suggestion approach; in our pilot experiments, we simply use the suggestions that the Google search engine returns, but any other suggestion approach could also be applied. Starting with the actual title or the first query of a TREC topic, the second query for the session is selected among the suggestions for the first query, the third query is selected from the suggestions for the second query, and so on.
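To make this chaining scheme concrete, the following Python sketch builds such a session. It is a minimal illustration rather than our exact experimental code, and it assumes the unofficial Google Suggest JSON endpoint (suggestqueries.google.com); any other suggestion service could be substituted.

    import random
    import requests

    SUGGEST_URL = "https://suggestqueries.google.com/complete/search"

    def get_suggestions(query: str) -> list[str]:
        """Return the (up to 10) suggestions for a query; may be empty."""
        response = requests.get(SUGGEST_URL,
                                params={"client": "firefox", "q": query},
                                timeout=10)
        response.raise_for_status()
        return response.json()[1]  # response format: [query, [suggestion, ...]]

    def simulate_session(seed_query: str, max_followups: int = 4,
                         pick_random: bool = False) -> list[str]:
        """Chain suggestions: each next query is a suggestion for the previous one."""
        session = [seed_query]
        while len(session) <= max_followups:
            candidates = [s for s in get_suggestions(session[-1])
                          if s != session[-1]]  # avoid immediate repeats
            if not candidates:
                break  # terminate early when no suggestions are returned
            session.append(random.choice(candidates) if pick_random
                           else candidates[0])
        return session

Setting pick_random=True corresponds to the random-selection variant described in Section 3; the default corresponds to always taking the first suggestion.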
Our research question is how such suggestion-based simulated sessions compare to real user sessions in the sense of coherence, realism, and representativeness of the underlying topic. In our pilot study, we thus let a human annotator assess human sessions from the TREC Session track mixed with sessions generated from suggestion sequences and sessions generated by a previous, more static query simulation scheme. The results show that suggestion-based sessions replicate patterns commonly seen in query logs: both humans and a session detection framework were unable to distinguish the simulated sessions from real ones. However, keeping close to the given topic when using suggestions as simulated queries is rather difficult. Among other reasons, the limited terminology in the topic, query, and suggestions and, most importantly, the relatively small number of suggestions provided by the Google Suggest Search API often cause the session to drift away from the given topic.

2. Related Work

Similar to recent developments in the field of recommenders [1], simulation in the context of information retrieval often aims to support the experimental evaluation of retrieval systems (e.g., in scenarios with few user interactions like in digital libraries) in a cost-gain scenario [2] (cost for retrieval system interactions, gain for retrieving good results). Different areas of user behavior have been addressed by simulation: scanning snippets / result pages, judging document relevance, clicking on results, reading result documents, deciding about stopping the search, and query (re-)formulation itself. Some simulation studies combine several of these areas while others focus on a particular one. In this paper, we focus on simulating query (re-)formulation behavior. While quite a few studies on user click models and stopping decisions have been published in recent years, query formulation is still perceived as difficult to simulate [3] but also as necessary to generate useful simulations for interactive retrieval evaluation [4].
The existing approaches to query simulation can be divided into approaches that generate queries following rather static underlying schemes [5, 6, 7, 8, 9] and approaches that use language models constructed from the topic itself, from observed snippets, or from some result documents to generate queries of varying lengths [10, 11, 3, 12]. Not all, but most of the query simulations aim to simulate search sessions in the sense of query sequences that all have a similar intent [13, 14].

As for the static simulation schemes, many different ideas have been suggested. Jordan et al. [7] generate controlled sets of single-term, two-term, and multi-term queries for retrieval scenarios on the Reuters-21578 corpus by combining terms of selected specificity in the documents of the corpus (e.g., only highly discriminative terms to form very specific queries). Later studies have suggested combining terms from manually generated query word pools and have tested this on TREC topics. The respective querying strategies sample initial and subsequent query words from these pools and combine them to search sessions [5, 6, 8] following static schemes of, for instance, keeping the same two terms in every query but adding different third terms, or generating all possible three-permutations of three-term queries [6]. The suggested static schemes have been "idealized" from real searcher interactions [8] and have also been used in a later language modeling query simulator [12]. Similar to the mentioned keep-two-terms-but-vary-the-third-term query formulation strategy, Verberne et al. [9] create queries of n terms for the iSearch collection where n−1 terms are kept and the last term is varied, to mimic academic information seeking behavior and to evaluate the cumulated gain over a simulated session.

One of the earliest more language model-based query simulators was suggested by Azzopardi et al. [10] in the domain of known-item search on the EuroGOV corpus (a crawl of European government-related sites). Single queries for some given known-item document are generated from the term distribution within the document and some added "noise" to mimic imperfect human memory. The later InQuery system of Keskustalo et al. [15] used Bayesian inference networks to generate queries, Azzopardi [11] generated additional ad hoc queries for existing TREC collections, and Carterette et al. [3] suggest a reformulation simulator that simulates whole sessions by also including the snippets from the seen result pages in the language model, using TREC Session track data.

Some anchor text-based approaches to "simulate" complete query logs or to train query translation models also constitute a topic loosely related to ours [16, 17]. However, we aim to simulate shorter sequences of topically related queries instead of complete query logs. As for the simulation, we want to study in pilot experiments whether and how well sequences of query suggestions stitched together may form search sessions. This idea is inspired by studies on query suggestions to support task-based search [18, 19], since more complicated tasks usually result in more interactions and queries from the respective users. Our research question thus is how "authentic" sessions can be that are formed by simply following suggestions up to some depth.

3. Query Log Generation

As described above, various types of datasets and models have been suggested for query simulation. In this paper, we want to study a source not yet covered: query suggestions. Our reasoning is that query suggestions from large search engines are derived from their large query logs and thus represent "typical" user behavior. In our pilot experiments, we specifically focus on query suggestions provided by the Google Suggest Search API (which serves up to 10 suggestions at a time) but, in principle, any other suggestion approach could also be applied (e.g., suggestions from other large search engines or suggestion methods from the literature). Still, the characteristics of the suggestions may vary between different services, such that the results of our pilot experiments should be tested in a more general setup with different suggestion approaches.

As our basis for simulated and real sessions, we use the TREC 2014 Session track dataset [20] containing 1021 sessions on 60 topics. Each topic is defined by an information need given as a short description. The respective sessions include (among other information) the queries some user formulated on the topic with timestamps, the shown snippets, and the clicked results. We extract the first queries of the sessions as seed queries for the simulated sessions since the topics themselves do not have explicit titles that might be used as a first query. In addition to the TREC data, we also sampled sessions from the Webis-SMC-12 dataset [21] that contains query sequences from the AOL log [22].

As suggestion-based session simulations, we consider the following three strategies in our pilot study.
First Suggestion. This strategy always selects the first suggestion that the Google Suggest Search API provides for the previous query of the session. A generated session contains a maximum of four queries in addition to the original query (in several query log datasets that we analyzed, the average sessions had up to five queries). A session might be terminated early if the API does not provide additional suggestions.

Random Suggestion. This strategy randomly selects one of the suggestions that the Google Suggest Search API provides for the previous query of the session. Like with the first suggestion strategy, generated sessions contain up to four queries in addition to the original query. The same query cannot appear back-to-back, and a session might be terminated early if the API does not provide additional suggestions.

Three Word Queries (adapted). This strategy is based on the idea of the Session Strategy S3 described by Keskustalo et al. [8], which is also implemented in the SimIIR framework (https://github.com/leifos/simiir) as TriTermQueryGenerator. The original idea uses two terms as the basis, extended by a third term selected from a topic description. We adapt this strategy with a few modifications. Initially, we start with the original query from the real session without any additions. We then extract the 10 keywords with the highest tf·idf scores from the topic's description (idf computed on the English Wikipedia). In each round, we calculate the cosine similarity of each suggestion to each original query–keyword pair and select the suggestion that is closest to one of the pairs. We limit the sessions to a maximum of four queries in addition to the original query. We also employ a dynamic threshold for the cosine similarity that stops accepting suggestions when the similarity falls below a certain value. Due to the varying length and specificity of the descriptions and the ambiguity of the topics, this threshold has to be manually adjusted for each topic. In our evaluation, we note that choosing an important term from the topic description gives this strategy an advantage over the previous two with respect to the topic representativeness of the generated sessions.
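The selection step of the adapted three word strategy can be sketched as follows. This is a simplified stand-in for our setup: the tf·idf keywords are assumed to be precomputed (in the paper, idf is computed on the English Wikipedia), a plain bag-of-words cosine similarity is used, and the fixed threshold of 0.3 merely illustrates the per-topic threshold that we adjust manually.

    import math
    from collections import Counter

    def cosine_similarity(text_a: str, text_b: str) -> float:
        """Bag-of-words cosine similarity between two strings."""
        vec_a = Counter(text_a.lower().split())
        vec_b = Counter(text_b.lower().split())
        dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
        norm = (math.sqrt(sum(v * v for v in vec_a.values()))
                * math.sqrt(sum(v * v for v in vec_b.values())))
        return dot / norm if norm else 0.0

    def select_suggestion(original_query: str, topic_keywords: list[str],
                          suggestions: list[str], threshold: float = 0.3):
        """Pick the suggestion closest to any query-keyword pair, or None
        if even the best similarity falls below the stop threshold."""
        best_suggestion, best_score = None, 0.0
        for suggestion in suggestions:
            for keyword in topic_keywords:
                score = cosine_similarity(suggestion, f"{original_query} {keyword}")
                if score > best_score:
                    best_suggestion, best_score = suggestion, score
        return best_suggestion if best_score >= threshold else None

Returning None here corresponds to terminating the session once no suggestion is similar enough to the topic anymore.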
For the three approaches, we generate 100, 100, and 20 sessions, respectively (in the case of the three word strategy, the strict selection process and the small pool of suggestions often result in very short sessions, such that we could only include 20 sessions in the evaluation). While we mostly focus on the textual aspect of the queries in this paper, user session logs often come with additional information like user agent, user identification, IP address, and the date and time of the interaction. Each of our sessions consists of at least one query with a fixed user assigned to it. To be able to run automatic session detection, we also simulate a timestamp for each query submission.

Inter-Query Time. To simulate the time gaps between query submissions, we have extracted the timings of user sessions from the Webis-SMC-12 dataset [21]. Our analysis shows that 25% of the time gaps are shorter than 41 seconds, while half of the gaps are no longer than 137 seconds. The distribution of timings shows a peak at 8 seconds and a long tail with the highest values in the multi-hour range. To account for logging and annotation errors, we have removed outliers by deleting the 10% longest gaps, which limits the simulated time between query submissions to at most 20 minutes. We use this remaining pool of time gaps to reproduce the timing distribution for our generated sessions by randomly drawing values from it, which naturally favors shorter time spans since they are more frequent.
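A minimal sketch of this timestamp simulation, assuming the observed gaps (in seconds) have already been extracted from the Webis-SMC-12 sessions:

    import random

    def build_gap_pool(observed_gaps: list[float]) -> list[float]:
        """Keep the shortest 90% of the observed gaps (outlier removal)."""
        ordered = sorted(observed_gaps)
        return ordered[: int(len(ordered) * 0.9)]

    def simulate_timestamps(start_time: float, num_queries: int,
                            gap_pool: list[float]) -> list[float]:
        """Assign a timestamp to each query of a session; drawing uniformly
        from the pool reproduces the empirical gap distribution, in which
        short gaps are more frequent."""
        timestamps = [start_time]
        for _ in range(num_queries - 1):
            timestamps.append(timestamps[-1] + random.choice(gap_pool))
        return timestamps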
Limits when using Suggestions. While working on our pilot study, we experimented with various combinations of suggestion selection strategies and session lengths. We identified issues in our strategies that are a direct result of the nature of search engine suggestions. The first suggestion strategy is particularly prone to loops when two queries are the top-ranked suggestions for each other, causing the generated session to alternate between two query strings; we also observed this for singular–plural pairs or categories (e.g., file formats, programming languages). To counter the looping issue, we use a unique query approach, which ensures that queries are not repeated within a session. Additionally, another policy ensures a minimum dissimilarity between consecutive queries, which helps to avoid plurals as top suggestions. However, while unique / dissimilar queries mitigate looping, we find that especially longer sessions (say, ten queries) narrow down to very specific topics. A possible reason is that today's search engine query suggestions do not only show related queries but often offer more specific autocompletions. Further details on the evaluation are provided in Section 4.
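The two counter-measures described above can be sketched as a filter over the returned suggestions. The difflib-based similarity and the 0.9 cutoff are illustrative assumptions, not the exact dissimilarity measure we used.

    from difflib import SequenceMatcher

    def is_acceptable(candidate: str, session: list[str],
                      max_similarity: float = 0.9) -> bool:
        """Reject candidates that repeat an earlier query (loop prevention)
        or are near-duplicates of the previous query (e.g., plural forms)."""
        if candidate.lower() in (query.lower() for query in session):
            return False  # unique-query policy
        ratio = SequenceMatcher(None, candidate.lower(),
                                session[-1].lower()).ratio()
        return ratio < max_similarity  # minimum-dissimilarity policy

    def filter_suggestions(suggestions: list[str],
                           session: list[str]) -> list[str]:
        """Keep only the suggestions that pass both policies."""
        return [s for s in suggestions if is_acceptable(s, session)]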
4. Evaluation

In the evaluation, we compare the sessions generated by our three approaches to sessions from both the Webis-SMC-12 dataset and the TREC 2014 Session track. As a first step, we perform an automated evaluation by running the sessions through the session detection approach of Hagen et al. [21]; ideally, the simulated sessions should not be split by the session detection in order to count as "authentic". In a second step, a human assessor looked at the simulated as well as the original sessions and had to judge whether a session seems to be simulated or of human origin. In a third step, a human assessor judged whether a session actually covers the intended information need given by the topic description.

4.1. Automatic Session Detection

The goal of a session detection system is to identify whether consecutive queries belong to the same information need or not. When a consecutive pair is detected that seems to belong to two different information needs, a split is introduced. Later, some of these sessions might be run through a mission detection to identify non-consecutive sessions that belong to the same search task, etc.

As an automatic evaluation of the simulated sessions' authenticity, we individually run each simulated session and the individual sessions from the TREC and Webis-SMC-12 data through the session detection approach of Hagen et al. [21]. A simulated or original session "passes" the automatic authenticity test iff the detection approach does not introduce a split. The results are shown in Table 1 (sessions with only one query were removed since they can never be split).

Table 1
Number of within-session splits the automatic session detection introduced for simulated and real sessions (more splits mean that more query pairs seem to be unrelated; * indicates that one-query sessions were removed).

Strategy                   Sessions   Splits
First suggestion*                64        1
Random suggestion*               65        2
Three word queries               20        0
TREC 2014 Session Track        1257      142
Webis-SMC-12                   2882      217

Altogether, the simulated sessions are hardly split by the automatic detection. The one wrong split for the first suggestion strategy and one of the wrong splits for the random suggestion strategy are likely due to the first query being uppercased while the subsequent suggestions are lowercased; the second "wrong" split for the random suggestion strategy is likely caused by a reformulation with an abbreviation and no term overlap ("no air conditioning alternatives" to "what to use instead of ac"). These examples serve as a good demonstration of the limitations of a fully automatic authenticity evaluation, such that we also manually assess the simulated sessions.
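Conceptually, the automatic authenticity test reduces to the following check, where detect_splits is a hypothetical stand-in for the session detection approach of Hagen et al. [21], which we treat as a black box:

    from typing import Callable

    def authenticity_rate(sessions: list[list[str]],
                          detect_splits: Callable[[list[str]], int]) -> float:
        """Fraction of multi-query sessions that the detector leaves unsplit
        (a session "passes" iff no within-session split is introduced)."""
        multi_query = [s for s in sessions if len(s) > 1]  # one-query sessions removed
        passed = sum(1 for s in multi_query if detect_splits(s) == 0)
        return passed / len(multi_query) if multi_query else 0.0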
4.2. Human Authenticity Assessment

An automated session detection system only "assesses" whether consecutive queries seem to belong together based on factors like lexical or semantic similarity and time gaps. We complement this purely automatic relatedness detection by a manual assessment of how "authentic" the simulated sessions are perceived to be by humans, i.e., whether a human can distinguish simulated from real sessions.

Procedure. All simulated sessions and a sample of original sessions are combined into one session pool. The sessions are then presented to the judge as a kind of log excerpt with user ID, timestamps, and queries. The judge has no accurate knowledge about the number of queries for each approach, and there is no obvious way to determine the source of a session. The judge then labels each session as real (sampled from actual query logs) or simulated (by one of the three approaches). The results in Table 2 indicate that the simulated sessions are perceived as real even though the assessor was told that some sessions actually are simulated.

Table 2
Manual judgments for all sessions whether they are simulated or "real" (* indicates that one-query sessions were removed). "Real" in the upper group and "simulated" in the lower group indicate cases where the judge was misled.

Strategy                   Sessions   Real   Simulated
First suggestion*                64     62           2
Random suggestion*               65     62           3
Three word queries               20     17           3

TREC 2014 Session Track          50     49           1
Webis-SMC-12                     50     50           0

During the assessment, the assessor took notes on which features of a session or query determined the judgment. This helps us in understanding how humans and algorithms may come to different verdicts. The primary criteria for the relatedness of two queries are their term composition and length; similarities in those aspects are perceived as patterns. This is also true for small editing actions (adding or replacing single words), which naturally come with the specialization towards a topic. The opposite effect is perceived for rapid topic changes: when multiple closely related tasks have to be fulfilled within one session, there may be large changes from query to query. This is also true for replacing words by synonyms or abbreviations. While a human judge will usually be able to infer context for those rapid changes, an automatic process is more likely to detect a new session. Another discrepancy between human and algorithmic evaluation becomes apparent for outlier behavior like text formatting (e.g., all-uppercase queries) that a human might judge as a simple typing error while a detection approach without lowercasing preprocessing might be misled.

In a nutshell, while both humans and algorithms look for patterns in the sessions and queries, the human judge does so more selectively by looking for mistakes. If found, the type of a mistake usually heavily influences the assessment of a session. Finally, note that due to the nature of the three word query strategy, there might be a chance for an informed human to guess a session's origin.

4.3. Human Topicality Assessment

So far, we have shown that the authenticity of a session is largely influenced by its term composition and appearance. However, to serve as a replacement for humans, a session generator not only has to provide sessions that a detection approach or some human would assess as authentic, but it also has to simulate sessions that follow the topic given as part of the evaluation study.

Procedure. Determining whether a session or query is on topic is a non-trivial task. While a query like "car" overlaps with the topic "find information on used car prices", it does not address the information need formulated in the topic description. We therefore set the following criteria to evaluate whether a session is "on topic": A session is "on topic" if its last query addresses at least one information need formulated in the topic description or shows clear signs that the session is headed in that direction (such that very short sessions are more likely to be on topic). A session is also "on topic" if any query of the session addresses at least one information need formulated in the topic description; this condition is necessary to account for topics with multiple subtasks.

Hypothesis. The first and random suggestion approaches do not take the topic into account; both simply converge to whatever the suggestion API provides for the initial query. The three word approach, instead, makes informed decisions when choosing suggestions and should therefore be able to stay more "on topic".

Results. We have manually judged all generated sessions. The results in Table 3 show that even the uninformed strategies stay "on topic" in about one third of the sessions. This can largely be attributed to the nature of the TREC Session track topics, which often contain several subtasks. Sessions generated by the three word strategy stay "on topic" even more often.

Table 3
Number of simulated sessions judged as "on topic" with respect to the TREC topic description (* indicates that one-query sessions were removed).

Strategy              Sessions   On Topic
First suggestion*           64         21
Random suggestion*          65         20
Three word queries          20         20

4.4. Notable Examples

As part of the judgment process, we have also taken note of simulated sessions that contain conspicuous editing steps or queries. The examples in Table 4 include a positive and a negative example with respect to authenticity.

Table 4
Example sessions with unusual editing patterns.

Query String                                    Time

First suggestion
air conditioning alternatives                   15:05:53
air conditioning alternatives car               15:10:22
no air conditioning in car alternatives         15:11:07
how can i keep my car cool without ac           15:15:28
ways to keep car cool without ac                15:21:16

Random suggestion
air conditioning alternatives                   17:31:54
no air conditioning alternatives                17:32:27
what to use instead of ac                       17:36:28
what to use instead of activator                17:45:42
what can i use instead of activator for nails   17:51:03
how to make nail activator                      17:53:26

Random suggestion
Philadelphia                                    03:31:29
philadelphia cheese                             03:34:50
philadelphia cheese recipes                     03:35:05
philadelphia cheese recipes salmon pasta        03:53:17

The first example was judged as "real" based on the usage of an abbreviation for air conditioning in the fourth query. The replacement of terms or groups of terms with a common abbreviation might be seen as a typical step for a human user after gaining more insight into a topic. The second example includes an issue that was caused by the autocomplete feature of the Google suggestions: the abbreviation 'ac' was falsely extended to the term 'activator', which ultimately changed the subject of the session. The third example shows a very common issue of ambiguous first queries. For the first and random suggestion strategies, there is no way to determine that a city is referenced in this example, such that the session quickly diverges to the food domain.

4.5. Long Sessions

For the simulated sessions up to this point, parameters like session length and inter-query time were set to values that seemed appropriate based on some initial experiments on our end, in order to generate "close to real" sessions. We also did not include navigational queries or known-item searches, which often result in either very short or very long sessions. To investigate the applicability of our approaches to such outlier behavior, we have also further assessed some sessions with up to 20 queries.

Even without imposing any limits on the generation process, the sessions were still often terminated early due to a lack of suggestions. This was mostly caused by two reasons: either the query became too specific to still yield additional suggestions, or the pool of unique and dissimilar queries was used up. In cases where long sessions could actually be generated, the sessions usually became rather specific quickly and diverged substantially from the given topic towards the end of the session.

Using a different set of more technically oriented topics, we were able to generate longer sessions more frequently. For this to work, we had to limit the dissimilarity filter, as abbreviations within the queries were more frequent and therefore editing distances were smaller. We also observed that queries from this field were mostly composed of categorical keywords stitched together, compared to the more natural-looking sessions from standard query logs.

Those observations, while helping to shape our pilot study, show that parameters and strategies for authentic session generation are a very dynamic and potentially also topic-specific issue.

5. Conclusion

In this paper, we have investigated how well authentic sessions can be simulated using web search engine query suggestions. By employing different strategies for selecting and combining the suggestions, we showcased the potential but also the limits of the overall usefulness of suggestion-based session simulation. Our evaluation showed that both humans and a session detection framework are unable to distinguish suggestion-based sessions from sampled real sessions. While some kind of authenticity can thus be attributed to the simulated sessions, staying on topic proved to be rather difficult. Addressing the outlined shortcomings is an interesting direction for future work. We plan to continue investigating query simulation as follows.

Data Independence. Relying on suggestions as query candidates limits the flexibility and applicability of the simulated sessions. We will work on query modifications that include "knowledge" from language models or predefined editing rules.

Influence on the Topic. For accurate session simulation, it is necessary to influence the topic that the queries follow. We will evaluate how and where those decisions have to be made to create an effective user model.

User Types and Editing. Since query modifications often follow well-known patterns, we will also investigate ways to replicate editing patterns in simulated queries that are typical for specific user groups or tasks.

Acknowledgments

This work has been partially supported by the DFG through the project "SINIR: Simulating INteractive Information Retrieval" (grant HA 5851/3-1).
References

[1] S. Zhang, K. Balog, Evaluating conversational recommender systems via user simulation, in: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020), ACM, 2020, pp. 1512–1520. doi:10.1145/3394486.3403202.
[2] M. McGregor, L. Azzopardi, M. Halvey, Untangling cost, effort, and load in information seeking and retrieval, in: Proceedings of the ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2021), ACM, 2021, pp. 151–161. doi:10.1145/3406522.3446026.
[3] B. Carterette, A. Bah, M. Zengin, Dynamic test collections for retrieval evaluation, in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval (ICTIR 2015), ACM, 2015, pp. 91–100. doi:10.1145/2808194.2809470.
[4] B. Carterette, E. Kanoulas, E. Yilmaz, Simulating simple user behavior for system effectiveness evaluation, in: Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM 2011), ACM, 2011, pp. 611–620. doi:10.1145/2063576.2063668.
[5] F. Baskaya, H. Keskustalo, K. Järvelin, Time drives interaction: Simulating sessions in diverse searching environments, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), ACM, 2012, pp. 105–114. doi:10.1145/2348283.2348301.
[6] F. Baskaya, H. Keskustalo, K. Järvelin, Modeling behavioral factors in interactive information retrieval, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), ACM, 2013, pp. 2297–2302. doi:10.1145/2505515.2505660.
[7] C. Jordan, C. R. Watters, Q. Gao, Using controlled query generation to evaluate blind relevance feedback algorithms, in: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2006), ACM, 2006, pp. 286–295. doi:10.1145/1141753.1141818.
[8] H. Keskustalo, K. Järvelin, A. Pirkola, T. Sharma, M. Lykke, Test collection-based IR evaluation needs extension toward sessions: A case of extremely short queries, in: Proceedings of the 5th Asia Information Retrieval Symposium (AIRS 2009), volume 5839 of Lecture Notes in Computer Science, Springer, 2009, pp. 63–74. doi:10.1007/978-3-642-04769-5_6.
[9] S. Verberne, M. Sappelli, K. Järvelin, W. Kraaij, User simulations for interactive search: Evaluating personalized query suggestion, in: Proceedings of the 37th European Conference on IR Research (ECIR 2015), volume 9022 of Lecture Notes in Computer Science, Springer, 2015, pp. 678–690. doi:10.1007/978-3-319-16354-3_75.
[10] L. Azzopardi, M. de Rijke, K. Balog, Building simulated queries for known-item topics: An analysis using six European languages, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007), ACM, 2007, pp. 455–462. doi:10.1145/1277741.1277820.
[11] L. Azzopardi, Query side evaluation: An empirical analysis of effectiveness and effort, in: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), ACM, 2009, pp. 556–563. doi:10.1145/1571941.1572037.
[12] D. Maxwell, L. Azzopardi, K. Järvelin, H. Keskustalo, Searching and stopping: An analysis of stopping rules and strategies, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), ACM, 2015, pp. 313–322. doi:10.1145/2806416.2806476.
[13] R. Jones, K. L. Klinkner, Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), ACM, 2008, pp. 699–708. doi:10.1145/1458082.1458176.
[14] A. H. Awadallah, X. Shi, N. Craswell, B. Ramsey, Beyond clicks: Query reformulation as a predictor of search satisfaction, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), ACM, 2013, pp. 2019–2028. doi:10.1145/2505515.2505682.
[15] H. Keskustalo, K. Järvelin, A. Pirkola, Evaluating the effectiveness of relevance feedback based on a user simulation model: Effects of a user scenario on cumulated gain value, Information Retrieval 11 (2008) 209–228. doi:10.1007/s10791-007-9043-7.
[16] V. Dang, W. B. Croft, Query reformulation using anchor text, in: Proceedings of the Third International Conference on Web Search and Web Data Mining (WSDM 2010), ACM, 2010, pp. 41–50. doi:10.1145/1718487.1718493.
[17] N. Craswell, B. Billerbeck, D. Fetterly, M. Najork, Robust query rewriting using anchor data, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM 2013), ACM, 2013, pp. 335–344. doi:10.1145/2433396.2433440.
[18] D. Garigliotti, K. Balog, Generating query suggestions to support task-based search, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), ACM, 2017, pp. 1153–1156. doi:10.1145/3077136.3080745.
[19] H. Ding, S. Zhang, D. Garigliotti, K. Balog, Generating high-quality query suggestion candidates for task-based search, in: Proceedings of the 40th European Conference on IR Research (ECIR 2018), volume 10772 of Lecture Notes in Computer Science, Springer, 2018, pp. 625–631. doi:10.1007/978-3-319-76941-7_54.
[20] B. Carterette, E. Kanoulas, M. M. Hall, P. D. Clough, Overview of the TREC 2014 Session track, in: Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014), volume 500-308 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2014. http://trec.nist.gov/pubs/trec23/papers/overview-session.pdf.
[21] M. Hagen, J. Gomoll, A. Beyer, B. Stein, From search session detection to search mission detection, in: Proceedings of Open Research Areas in Information Retrieval (OAIR 2013), ACM, 2013, pp. 85–92. http://dl.acm.org/citation.cfm?id=2491769.
[22] G. Pass, A. Chowdhury, C. Torgeson, A picture of search, in: Proceedings of the 1st International Conference on Scalable Information Systems (Infoscale 2006), volume 152 of ACM International Conference Proceeding Series, ACM, 2006, p. 1. doi:10.1145/1146847.1146848.