=Paper=
{{Paper
|id=Vol-2911/paper6
|storemode=property
|title=Assessing Query Suggestions for Search Session Simulation
|pdfUrl=https://ceur-ws.org/Vol-2911/paper6.pdf
|volume=Vol-2911
|authors=Sebastian Günther,Matthias Hagen
}}
==Assessing Query Suggestions for Search Session Simulation==
Sebastian Günther, Matthias Hagen
Martin-Luther-Universität Halle-Wittenberg, Halle (Saale), Germany
sebastian.guenther@informatik.uni-halle.de (S. Günther); matthias.hagen@informatik.uni-halle.de (M. Hagen)
Causality in Search and Recommendation (CSR) and Simulation of Information Retrieval Evaluation (Sim4IR) workshops at SIGIR, 2021
Abstract
Research on simulating search behavior has mainly dealt with result list interactions in recent years. We instead focus on the querying process and describe a pilot study to assess the applicability of search engine query suggestions for simulating search sessions (i.e., sequences of topically related queries). In automatic and manual assessments, we evaluate to what extent a session detection approach considers the simulated query sequences as "authentic" and how humans perceive their quality in the sense of coherence, realism, and representativeness of the underlying topic. As for the actual suggestion-based simulation, we compare different approaches to select the next query in a sequence (always selecting the first suggestion, random sampling, or topic-informed selection) to the human TREC Session track sessions and a previously suggested simulation scheme. Our results show that while it is easy to create query logs that both humans and an automated evaluation perceive as authentic, keeping the sessions related to an underlying topic can be difficult when relying on given suggestions only.
Keywords
Simulating query sequences, Search session simulation, Query suggestion, TREC Session track, Task-based search
1. Introduction

Many studies on the simulation of search behavior focus on using simulated user behavior in system evaluations, while others cover aspects of user modeling in general. Using simulated interactions for evaluation purposes is usually motivated by retrieval setups with no or only few actual users whose behavior can be observed and used to improve the actual system (e.g., system variants in digital libraries or new (academic) search prototypes without an established user base). Such few-user systems could also be evaluated in lab studies. But lab studies are difficult to scale up and also consume a lot of time since actual users need to be hired, instructed, and observed. In such situations, simulation promises a way out, but the extent to which simulated search interactions can actually authentically replace real users in specific scenarios is still an open question. In recent years, mostly result clicks or stopping decisions have been the focus of user modeling and simulation studies, while simulating querying behavior has received less attention.

In this paper, we describe a pilot study on query simulation that aims to assess the suitability of stitching together query suggestions to form "realistic" search sessions (i.e., sequences of queries on the same information need that some human might have submitted). The scenario we address is inspired by typical TREC-style evaluation setups where search topics are given as a verbal description of some information need along with a title or first query. To simulate a search session with a couple of queries, we examine sequences of query suggestions provided by some suggestion approach; in our pilot experiments, we simply use the suggestions that the Google search engine returns, but any other suggestion approach could also be applied. Starting with the actual title or the first query of a TREC topic, the second query for the session is selected among the suggestions for the first query, the third query is selected from the suggestions for the second query, etc.

Our research question is how such suggestion-based simulated sessions compare to real user sessions in the sense of coherence, realism, and representativeness of the underlying topic. In our pilot study, we thus let a human annotator assess human sessions from the TREC Session track mixed with sessions generated from suggestion sequences and sessions generated by a previous, more static query simulation scheme. The results show that suggestion-based sessions replicate patterns commonly seen in query logs. Both humans and a session detection framework were unable to differentiate the simulated sessions from real ones. However, keeping close to the given topic when using suggestions as simulated queries is rather difficult. Among other reasons, the limited terminology in the topic, query, and suggestions, and most importantly the relatively small number of suggestions provided by the Google Suggest Search API, often cause the session to drift away from the given topic.
2. Related Work

Similar to recent developments in the field of recommenders [1], simulation in the context of information retrieval often aims to support experimental evaluation of retrieval systems (e.g., in scenarios with few user interactions like in digital libraries) in a cost-gain scenario [2] (cost for retrieval system interactions, gain for retrieving good results). Different areas of user behavior have been addressed by simulation: scanning snippets / result pages, judging document relevance, clicking on results, reading result documents, deciding about stopping the search, and query (re-)formulation itself. Some simulation studies combine several of these areas, but some also just focus on a particular one. In this paper, we focus on the domain of simulating query (re-)formulation behavior. While quite a few studies on user click models and stopping decisions have been published in recent years, query formulation is still perceived as difficult to simulate [3] but also necessary to generate useful simulations for interactive retrieval evaluation [4].
The existing approaches to query simulation can be divided into approaches that generate queries following rather static underlying schemes [5, 6, 7, 8, 9] and approaches that use language models constructed from the topic itself, from observed snippets, or from some result documents to generate queries of varying lengths [10, 11, 3, 12]. Not all, but most of the query simulations aim to simulate search sessions in the sense of query sequences that all have a similar intent [13, 14].

As for the static simulation schemes, many different ideas have been suggested. Jordan et al. [7] generate controlled sets of single-term, two-term, and multiple-term queries for retrieval scenarios on the Reuters-21578 corpus by combining terms of selected specificity in the documents of the corpus (e.g., only highly discriminative terms to form very specific queries). Later studies have suggested combining terms from manually generated query word pools and tested that on TREC topics. The respective querying strategies sample initial and subsequent query words from these pools and combine them to search sessions [5, 6, 8] following static schemes of, for instance, keeping the same two terms in every query but adding different third terms, or generating all possible three-permutations of three-term queries [6]. The suggested static schemes have been "idealized" from real searcher interactions [8] and have also been used in a later language modeling query simulator [12]. Similar to the mentioned keep-two-terms-but-vary-third-term query formulation strategy, Verberne et al. [9] create queries of n terms for the iSearch collection where n−1 terms are kept and the last term is varied to mimic academic information seeking behavior and to evaluate the cumulated gain over a simulated session.

One of the earliest more language-model-based query simulators was suggested by Azzopardi et al. [10] in the domain of known-item search on the EuroGOV corpus (a crawl of European government-related sites). Single queries for some given known-item document are generated from the term distribution within the document and some added "noise" to mimic imperfect human memory. The later InQuery system of Keskustalo et al. [15] used Bayesian inference networks to generate queries, Azzopardi [11] generated additional ad-hoc queries for existing TREC collections, while Carterette et al. [3] suggest a reformulation simulator that simulates whole sessions by also including the snippets from the seen result pages in the language model, using TREC Session track data.

Some anchor-text-based approaches to "simulate" complete query logs or to train query translation models also constitute a topic loosely related to ours [16, 17]. However, we aim to simulate shorter sequences of topically related queries instead of complete query logs. As for the simulation, we want to study in pilot experiments whether and how well sequences of query suggestions stitched together may form search sessions. This idea is inspired by studies on query suggestions to support task-based search [18, 19] since more complicated tasks usually result in more interactions and queries from the respective users. Our research question thus is how "authentic" sessions can be that are formed by simply following suggestions up to some depth.

3. Query Log Generation

As described above, there are various types of datasets and models that have been suggested for query simulation. In this paper, we want to study a source not yet covered: query suggestions. Our reasoning is that query suggestions from large search engines are derived from their large query logs and thus represent "typical" user behavior. In our pilot experiments, we specifically focus on query suggestions provided by the Google Suggest Search API (which serves up to 10 suggestions at a time) but, in principle, any other suggestion approach could also be applied (e.g., suggestions from other large search engines or suggestion methods from the literature). Still, the characteristics of the suggestions may vary between different services such that the results of our pilot experiments should be tested in a more general setup with different suggestion approaches.
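For readers who want to reproduce the setup, suggestions could be fetched roughly as sketched below. The paper does not specify how the Google Suggest Search API is accessed; the endpoint URL, its JSON response format, and the get_suggestions helper are assumptions made only for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Hypothetical suggest endpoint (an assumption; widely used but unofficial and
# undocumented, so format and availability may change).
SUGGEST_URL = "https://suggestqueries.google.com/complete/search?client=firefox&q={}"

def get_suggestions(query: str) -> List[str]:
    """Fetch up to 10 query suggestions for the given query string."""
    url = SUGGEST_URL.format(urllib.parse.quote(query))
    with urllib.request.urlopen(url, timeout=10) as response:
        # The endpoint typically answers with JSON of the form
        # [original_query, [suggestion_1, ..., suggestion_n]].
        payload = json.loads(response.read().decode("utf-8", errors="replace"))
    return payload[1]

if __name__ == "__main__":
    print(get_suggestions("air conditioning alternatives"))
```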
As our basis for simulated and real sessions, we use the TREC 2014 Session track dataset [20] containing 1021 sessions on 60 topics. Each topic is defined by an information need given as a short description. The respective sessions include (among other information) the queries some user formulated on the topic with timestamps, the shown snippets, and clicked results. We extract the first queries of the sessions as seed queries for the simulated sessions since the topics themselves do not have explicit titles that might be used as a first query. In addition to the TREC data, we also sampled sessions from the Webis-SMC-12 dataset [21] that contains query sequences from the AOL log [22].
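A small sketch of this seed-query extraction, assuming the TREC 2014 Session track sessions come as an XML file with <session> elements containing query interactions in document order; the file name and element names below are assumptions that would have to be checked against the actual distribution.

```python
import xml.etree.ElementTree as ET
from typing import Dict

def extract_seed_queries(path: str) -> Dict[str, str]:
    """Return the first query of every session, keyed by session number.

    Assumes <session num="..."> elements whose first <query> descendant is the
    session's initial query; adjust the element/attribute names if the actual
    TREC 2014 Session track schema differs.
    """
    seeds = {}
    root = ET.parse(path).getroot()
    for session in root.iter("session"):
        num = session.get("num")
        first_query = session.find(".//query")  # first <query> in document order
        if num and first_query is not None and first_query.text:
            seeds[num] = first_query.text.strip()
    return seeds

# Example call (hypothetical file name):
# seeds = extract_seed_queries("sessiontrack2014.xml")
```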
As suggestion-based session simulations, we consider the following three strategies in our pilot study.
First Suggestion. This strategy always selects the first suggestion provided by the Google Suggest Search API for the previous query of the session as input. A generated session will contain a maximum of four queries in addition to the original query (analyzing several query log datasets, the average sessions had up to five queries). A session might be terminated early if the API does not provide additional suggestions.
Random Suggestion. The random selection strategy randomly selects one of the suggestions provided by the Google Suggest Search API for the previous query of the session as input. Like with the first suggestion strategy, generated sessions contain up to four queries in addition to the original query. The same query cannot appear back-to-back, and a session might be terminated early if the API does not provide additional suggestions.
Three Word Queries (adapted). This strategy is based on the idea of the Session Strategy S3 described by Keskustalo et al. [8], which is also implemented in the SimIIR framework (https://github.com/leifos/simiir) as TriTermQueryGenerator. The original idea uses two terms as the basis, extended by a third term selected from a topic description. We adapt this strategy with a few modifications. Initially, we start with the original query from the real session without any additions. We then extract the 10 keywords from the topic's description with the highest tf·idf scores (idf computed on the English Wikipedia). In each round, we calculate the cosine similarity of each suggestion and each original query–keyword pair and select the suggestion that is closest to one of the query–keyword pairs. We limit the sessions to a maximum of four queries in addition to the original query. We also employ a dynamic threshold for the cosine similarity that stops accepting suggestions when the similarity falls below a certain value. Due to the varying length and specificity of the descriptions and the ambiguity of the topics, the threshold has to be manually adjusted for each topic. In our evaluation, we note that choosing an important term from the topic description gives this strategy an advantage over the previous two with respect to the topic representativeness of the generated sessions. (An illustrative code sketch of the selection strategies follows below.)
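The following minimal sketch illustrates how the three selection strategies could be implemented; it is not the code used in the study. It assumes a get_suggestions(query) helper like the one sketched above, approximates the tf·idf keyword extraction by an externally supplied keyword list, and uses a simple term-frequency cosine similarity; all names and the default threshold are illustrative.

```python
import math
import random
from collections import Counter

def tokens(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity of two texts over raw term-frequency vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def pick_first(suggestions, **_):
    """First suggestion strategy: always take the top-ranked suggestion."""
    return suggestions[0] if suggestions else None

def pick_random(suggestions, previous=None, **_):
    """Random suggestion strategy: sample uniformly, avoiding an immediate repeat."""
    candidates = [s for s in suggestions if s != previous]
    return random.choice(candidates) if candidates else None

def pick_topic_informed(suggestions, seed_query="", keywords=(), threshold=0.3, **_):
    """Adapted three word strategy: choose the suggestion closest to one of the
    seed-query/keyword pairs; stop once the best similarity falls below the
    (per-topic) threshold. keywords stands in for the 10 highest-tf·idf terms
    of the topic description."""
    best, best_sim = None, 0.0
    for suggestion in suggestions:
        for keyword in keywords:
            sim = cosine(suggestion, f"{seed_query} {keyword}")
            if sim > best_sim:
                best, best_sim = suggestion, sim
    return best if best_sim >= threshold else None

def simulate_session(seed_query, get_suggestions, pick, max_extra=4, **kwargs):
    """Stitch suggestions together: query i+1 is chosen among the suggestions
    for query i, up to max_extra queries in addition to the seed query."""
    session = [seed_query]
    while len(session) < 1 + max_extra:
        suggestions = get_suggestions(session[-1])
        nxt = pick(suggestions, previous=session[-1], seed_query=seed_query, **kwargs)
        if nxt is None or nxt in session:  # terminate early; no repeated queries
            break
        session.append(nxt)
    return session
```

For example, simulate_session("air conditioning alternatives", get_suggestions, pick_first) would yield a first-suggestion session of at most five queries, while passing pick_topic_informed together with the extracted topic keywords yields the topic-informed variant.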
For the three approaches, we generate 100, 100, and 20 sessions, respectively (in the case of the three word strategy, the strict selection process and the small pool of suggestions often result in very short sessions such that we could only include 20 in the evaluation). While we mostly focus on the textual aspect of the queries in this paper, user session logs often come with additional information like user agent, user identification, IP address, and date and time of the interaction. Each of our sessions consists of at least one query with a fixed user assigned to it. To run automatic session detection, we also simulate timestamps for each query submission.

Inter-Query Time. To simulate the time gap between query submissions, we have extracted the timings from user sessions in the Webis-SMC-12 dataset [21]. Our analysis shows that 25% of the time gaps are shorter than 41 seconds, while half of the gaps are no longer than 137 seconds. The distribution of timings shows a peak at 8 seconds and a long tail with the highest values in the multi-hour range. To account for logging and annotation errors, we have removed outliers by deleting the 10% longest gaps, which limits the simulated time between query submissions to no longer than 20 minutes. We use this remaining pool of time gaps to accurately reproduce the timing distribution for our generated sessions by randomly drawing values from it, which naturally favors shorter time spans since they are more frequent.
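A minimal sketch of this gap sampling under the numbers stated above; extracting the observed gaps from the Webis-SMC-12 log is omitted, and observed_gaps_seconds, the function names, and the trim-fraction parameter are illustrative.

```python
import random

def build_gap_pool(observed_gaps_seconds, trim_fraction=0.10):
    """Keep the shortest 90% of the observed gaps; dropping the longest 10%
    removes logging/annotation outliers and caps the simulated gaps at
    roughly 20 minutes for the Webis-SMC-12 timings described above."""
    gaps = sorted(observed_gaps_seconds)
    keep = int(len(gaps) * (1.0 - trim_fraction))
    return gaps[:keep]

def simulate_timestamps(start_time, num_queries, gap_pool, rng=random):
    """Assign a timestamp (in seconds) to every query by drawing gaps uniformly
    from the empirical pool, which naturally favors the frequent short gaps."""
    timestamps = [start_time]
    for _ in range(num_queries - 1):
        timestamps.append(timestamps[-1] + rng.choice(gap_pool))
    return timestamps
```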
Limits when using Suggestions. While working on our pilot study, we experimented with various combinations of suggestion selection strategies and session lengths. We identified issues in our strategies that are a direct result of the nature of search engine suggestions. The first suggestion strategy is particularly prone to loops when two queries are the top-ranked suggestions for each other, causing the generated session to alternate between two query strings; we also observed this for singular–plural pairs or categories (e.g., file formats, programming languages). To counter the looping issue, we use a unique query approach, which ensures that queries are not repeated in loops within a session. Additionally, another policy ensures a minimum dissimilarity between consecutive queries, which helps to avoid plurals as top suggestions. However, while unique / dissimilar queries mitigate looping, we find that especially longer sessions (say, ten queries) narrow down to very specific topics. A possible reason is that today's search engine query suggestions do not only show related queries, but often offer more specific autocompletions. Further details on the evaluation are provided in Section 4.
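The two counter-measures just described could look roughly as follows; the paper does not give the exact dissimilarity measure or threshold, so the character-level ratio from Python's difflib and the 0.15 default are stand-ins for illustration.

```python
from difflib import SequenceMatcher

def dissimilarity(q1, q2):
    """Character-level dissimilarity in [0, 1]; 0 means identical strings
    (a stand-in for an edit-distance-based filter)."""
    return 1.0 - SequenceMatcher(None, q1.lower(), q2.lower()).ratio()

def acceptable_next_query(candidate, session, min_dissimilarity=0.15):
    """Unique-query policy plus a minimum dissimilarity to the previous query:
    rejects exact repeats (loops) and near-duplicates such as plural variants."""
    if candidate in session:  # unique-query policy: never revisit a query
        return False
    return dissimilarity(candidate, session[-1]) >= min_dissimilarity
```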
4. Evaluation

Table 1: Number of within-session splits the automatic session detection introduced for simulated and real sessions (more splits mean more query pairs seem to be unrelated; * indicates that one-query sessions were removed).

Strategy                   Sessions   Splits
First suggestion*          64         1
Random suggestion*         65         2
Three word queries         20         0
TREC 2014 Session Track    1257       142
Webis-SMC-12               2882       217

Table 2: Manual judgments for all sessions whether they are simulated or "real" (* indicates that one-query sessions were removed). "Real" in the upper group and "simulated" in the lower group indicate cases where the judge was misled.

Strategy                   Sessions   Real   Simulated
First suggestion*          64         62     2
Random suggestion*         65         62     3
Three word queries         20         17     3
TREC 2014 Session Track    50         49     1
Webis-SMC-12               50         50     0
In the evaluation, we compare the sessions generated by our three approaches to sessions from both the Webis-SMC-12 dataset and the TREC 2014 Session track. As a first step, we perform an automated evaluation by running the sessions through the session detection approach of Hagen et al. [21]. Ideally, the simulated sessions should not be split by the session detection in order to count as "authentic". In a second step, a human assessor looked at the simulated sessions as well as original sessions and had to judge whether a session seems to be simulated or of human origin. In a third step, a human assessor judged whether a session actually covers the intended information need given by the topic description.
4.1. Automatic Session Detection

The goal of a session detection system is to identify consecutive queries as belonging to the same information need or not. When a consecutive pair is detected that seems to belong to two different information needs, a split is introduced. Later, some of these sessions might be run through a mission detection to identify non-consecutive sessions that belong to the same search task, etc.

As an automatic evaluation of the simulated sessions' authenticity, we individually run each simulated session and the individual sessions from the TREC and Webis-SMC-12 data through the session detection approach of Hagen et al. [21]. A simulated or original session "passes" the automatic authenticity test iff the detection approach does not introduce a split. The results are shown in Table 1 (sessions with only one query were removed since they will never be split).
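As a compact sketch of this pass criterion, with detect_splits as a hypothetical stand-in for the session detection approach of Hagen et al. [21] (returning the number of splits it introduces for a given query sequence):

```python
def passes_authenticity_test(queries, timestamps, detect_splits):
    """A session counts as 'authentic' iff the detector introduces no split.
    One-query sessions are excluded since they can never be split."""
    if len(queries) < 2:
        return None
    return detect_splits(queries, timestamps) == 0
```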
Altogether, the simulated sessions are hardly split by the automatic detection. The one wrong split for the first suggestion strategy and one wrong split for the random suggestion strategy are likely due to the first query being uppercased while the subsequent suggestions are lowercased, while the second "wrong" split for the random suggestion strategy is likely caused by a reformulation with an abbreviation and no term overlap ("no air conditioning alternatives" to "what to use instead of ac"). These examples serve as a good demonstration of the limitations of a fully automatic authenticity evaluation such that we also manually assess the simulated sessions.

4.2. Human Authenticity Assessment

An automated session detection system only "assesses" whether consecutive queries seem to belong together based on factors like lexical or semantic similarity and time gaps. However, we want to complement this purely automatic relatedness detection by a manual assessment of how "authentic" the simulated sessions are perceived by humans, i.e., whether a human can distinguish simulated from real sessions.

Procedure. All simulated sessions and a sample of original sessions are combined into one session pool. The sessions are then presented to the judge as a kind of log excerpt with user ID, timestamps, and queries. The judge has no accurate knowledge about the number of queries for each approach, and there is no obvious way to determine the source of a session. The judge then labels each session as real (sampled from actual query logs) or simulated (by one of the three approaches). The results in Table 2 indicate that the simulated sessions are perceived as real even though the assessor was told that some sessions actually are simulated.

During the assessment, the assessor took notes of which features of a session or query determine the judgment. This helps us understand how humans and algorithms may come up with different verdicts. The primary criteria for the relatedness of two queries are their term composition and length. Similarities in those aspects are perceived as patterns. This is also true for small editing actions (adding or replacing single words), which naturally come with the specialization towards a topic. The opposite effect is perceived for rapid topic changes. When multiple closely related tasks have to be fulfilled within one session, there may be large changes from query to query. This is also true for replacing words by synonyms or abbreviations. While a human judge will usually be able to infer context for those rapid changes, an automatic process is more likely to detect a new session.

Another discrepancy between human and algorithmic evaluation becomes apparent when we consider outlier behavior like text formatting (e.g., all-uppercase) that a human might be able to judge as a simple typing error while a detection approach without lowercasing preprocessing might be misled.

In a nutshell, while both humans and algorithms look for patterns in the sessions and queries, the human judge does so more selectively by looking for mistakes. If found, the type of a mistake usually heavily influences the assessment of a session. Finally, note that due to the nature of the three word query strategy there might be a chance for an informed human to guess a session's origin.
Table 3: Number of simulated sessions judged as "on topic" with respect to the TREC topic description (* indicates that one-query sessions were removed).

Strategy              Sessions   On Topic
First suggestion*     64         21
Random suggestion*    65         20
Three word queries    20         20

Table 4: Example sessions with unusual editing patterns.

Query String                                      Time
First suggestion
air conditioning alternatives                     15:05:53
air conditioning alternatives car                 15:10:22
no air conditioning in car alternatives           15:11:07
how can i keep my car cool without ac             15:15:28
ways to keep car cool without ac                  15:21:16
Random suggestion
air conditioning alternatives                     17:31:54
no air conditioning alternatives                  17:32:27
what to use instead of ac                         17:36:28
what to use instead of activator                  17:45:42
what can i use instead of activator for nails     17:51:03
how to make nail activator                        17:53:26
Random suggestion
Philadelphia                                      03:31:29
philadelphia cheese                               03:34:50
philadelphia cheese recipes                       03:35:05
philadelphia cheese recipes salmon pasta          03:53:17
4.3. Human Topicality Assessment

So far, we have shown that the authenticity of a session is largely influenced by its term composition and appearance. However, to serve as a replacement for humans, a session generator not only has to provide sessions that a detection approach or some human would assess as authentic, but also has to simulate sessions that follow the topic given as part of the evaluation study.

Procedure. Determining if a session or query is on topic is a non-trivial task. While a query like "car" overlaps with the topic "find information on used car prices", it does not address the information need formulated in the topic description. We therefore set the following criteria to evaluate if a session is "on topic": A session is "on topic" if the last query addresses at least one information need formulated in the topic description or shows clear signs that the session is headed in that direction, such that very short sessions are more likely to be on topic. A session is also "on topic" if any query of the session addresses at least one information need formulated in the topic description (a necessary condition to account for topics with multiple subtasks).

Hypothesis: The first and random approach do not take the topic into account. Both strategies simply converge to anything the search suggestion API provides for the initial query. Instead, the three word approach makes informed decisions when choosing suggestions and should therefore be able to stay more "on topic".

Results: We have manually judged all generated sessions. The results in Table 3 show that even the uninformed strategies stay "on topic" for about one third of the sessions. This can largely be attributed to the nature of the TREC Session track topics that often contain several subtasks. Sessions generated by the three word strategy stay "on topic" even more often.

4.4. Notable Examples

As part of the judgment process, we have also taken note of simulated sessions which contain conspicuous editing steps or queries. The examples in Table 4 include a positive and a negative example with respect to authenticity. The first example was judged as "real" based on the usage of an abbreviation for air conditioning in the fourth query. The replacement of terms or groups of terms with a common abbreviation might be seen as a typical step for a human user after gaining more insight into a topic. The second example includes an issue that was caused by the autocomplete feature of the Google suggestions: the abbreviation 'ac' was falsely extended to the term 'activator', which ultimately changed the subject of the session. The third example shows a very common issue of ambiguous first queries. For the first and random suggestion strategies, there is no way to determine that a city is referenced in this example, such that the session quickly diverges to the food domain.
4.5. Long Sessions

For the simulated sessions up to this point, parameters like session length and inter-query time were set to values that seemed appropriate based on some initial experiments on our end in order to generate "close to real" sessions. We also did not include navigational queries or known-item searches, which often could result in either very short or very long sessions. To investigate the applicability of our approaches to such outlier behavior, we have also further assessed some sessions with up to 20 queries.

In many of the cases without any limits imposed on the generation process, the sessions still were often terminated early due to a lack of suggestions. This was mostly caused by two reasons: either the query became too specific to still yield additional suggestions, or the pool of unique and dissimilar queries was used up. In cases where long sessions could actually be generated, the session usually quickly became rather specific and diverged substantially from the actual given topic towards the end of the session.

Using a different set of more technically oriented topics, we were able to generate longer sessions more frequently. For this to work, we had to limit the dissimilarity filter, as abbreviations within the queries were more frequent and therefore editing distances were smaller. We also observed that queries from this field were mostly comprised of categorical keywords stitched together, compared to the more natural-looking sessions from standard query logs.

Those observations, while helping to shape our pilot study, show that parameters and strategies for authentic session generation are a very dynamic and potentially also topic-specific issue.
5. Conclusion

In this paper, we have investigated how well authentic sessions can be simulated using web search engine query suggestions. By employing different strategies of selecting and combining the suggestions, we showcased the potential but also the limits of the overall usefulness of suggestion-based session simulation. Our evaluation showed that both humans and a session detection framework are unable to distinguish suggestion-based sessions from sampled real sessions. While some kind of authenticity can thus be attributed to the simulated sessions, staying on topic proved to be rather difficult. Addressing the outlined shortcomings is an interesting direction for future work. We plan to continue investigating query simulation as follows.

Data Independence. Relying on suggestions as query candidates limits the flexibility and applicability of the simulated sessions. We will work on query modifications that include "knowledge" from language models or predefined editing rules.

Influence on the Topic. For accurate session simulation, it is necessary to influence the topic that the queries follow. We will evaluate how and where those decisions have to be made to create an effective user model.

User Types and Editing. Since query modifications often follow well-known patterns, we will also investigate ways to replicate editing patterns in simulated queries that are typical for specific user groups or tasks.

Acknowledgments

This work has been partially supported by the DFG through the project "SINIR: Simulating INteractive Information Retrieval" (grant HA 5851/3-1).

References
[1] S. Zhang, K. Balog, Evaluating conversational recommender systems via user simulation, in: KDD '20, ACM, 2020, pp. 1512–1520. doi:10.1145/3394486.3403202.
[2] M. McGregor, L. Azzopardi, M. Halvey, Untangling cost, effort, and load in information seeking and retrieval, in: CHIIR '21, ACM, 2021, pp. 151–161. doi:10.1145/3406522.3446026.
[3] B. Carterette, A. Bah, M. Zengin, Dynamic test collections for retrieval evaluation, in: ICTIR 2015, ACM, 2015, pp. 91–100. doi:10.1145/2808194.2809470.
[4] B. Carterette, E. Kanoulas, E. Yilmaz, Simulating simple user behavior for system effectiveness evaluation, in: CIKM 2011, ACM, 2011, pp. 611–620. doi:10.1145/2063576.2063668.
[5] F. Baskaya, H. Keskustalo, K. Järvelin, Time drives interaction: Simulating sessions in diverse searching environments, in: SIGIR '12, ACM, 2012, pp. 105–114. doi:10.1145/2348283.2348301.
[6] F. Baskaya, H. Keskustalo, K. Järvelin, Modeling behavioral factors in interactive information retrieval, in: CIKM '13, ACM, 2013, pp. 2297–2302. doi:10.1145/2505515.2505660.
[7] C. Jordan, C. R. Watters, Q. Gao, Using controlled query generation to evaluate blind relevance feedback algorithms, in: JCDL 2006, ACM, 2006, pp. 286–295. doi:10.1145/1141753.1141818.
[8] H. Keskustalo, K. Järvelin, A. Pirkola, T. Sharma, M. Lykke, Test collection-based IR evaluation needs extension toward sessions - A case of extremely short queries, in: AIRS 2009, volume 5839 of Lecture Notes in Computer Science, Springer, 2009, pp. 63–74. doi:10.1007/978-3-642-04769-5_6.
[9] S. Verberne, M. Sappelli, K. Järvelin, W. Kraaij, User simulations for interactive search: Evaluating personalized query suggestion, in: ECIR 2015, volume 9022 of Lecture Notes in Computer Science, 2015, pp. 678–690. doi:10.1007/978-3-319-16354-3_75.
[10] L. Azzopardi, M. de Rijke, K. Balog, Building simulated queries for known-item topics: An analysis using six European languages, in: SIGIR 2007, ACM, 2007, pp. 455–462. doi:10.1145/1277741.1277820.
[11] L. Azzopardi, Query side evaluation: An empirical analysis of effectiveness and effort, in: SIGIR 2009, ACM, 2009, pp. 556–563. doi:10.1145/1571941.1572037.
[12] D. Maxwell, L. Azzopardi, K. Järvelin, H. Keskustalo, Searching and stopping: An analysis of stopping rules and strategies, in: CIKM 2015, ACM, 2015, pp. 313–322. doi:10.1145/2806416.2806476.
[13] R. Jones, K. L. Klinkner, Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs, in: CIKM 2008, ACM, 2008, pp. 699–708. doi:10.1145/1458082.1458176.
[14] A. H. Awadallah, X. Shi, N. Craswell, B. Ramsey, Beyond clicks: Query reformulation as a predictor of search satisfaction, in: CIKM '13, ACM, 2013, pp. 2019–2028. doi:10.1145/2505515.2505682.
[15] H. Keskustalo, K. Järvelin, A. Pirkola, Evaluating the effectiveness of relevance feedback based on a user simulation model: Effects of a user scenario on cumulated gain value, Inf. Retr. 11 (2008) 209–228. doi:10.1007/s10791-007-9043-7.
[16] V. Dang, W. B. Croft, Query reformulation using anchor text, in: WSDM 2010, ACM, 2010, pp. 41–50. doi:10.1145/1718487.1718493.
[17] N. Craswell, B. Billerbeck, D. Fetterly, M. Najork, Robust query rewriting using anchor data, in: WSDM 2013, ACM, 2013, pp. 335–344. doi:10.1145/2433396.2433440.
[18] D. Garigliotti, K. Balog, Generating query suggestions to support task-based search, in: SIGIR 2017, ACM, 2017, pp. 1153–1156. doi:10.1145/3077136.3080745.
[19] H. Ding, S. Zhang, D. Garigliotti, K. Balog, Generating high-quality query suggestion candidates for task-based search, in: ECIR 2018, volume 10772 of Lecture Notes in Computer Science, Springer, 2018, pp. 625–631. doi:10.1007/978-3-319-76941-7_54.
[20] B. Carterette, E. Kanoulas, M. M. Hall, P. D. Clough, Overview of the TREC 2014 Session track, in: TREC 2014, volume 500-308 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2014. URL: http://trec.nist.gov/pubs/trec23/papers/overview-session.pdf.
[21] M. Hagen, J. Gomoll, A. Beyer, B. Stein, From search session detection to search mission detection, in: OAIR '13, ACM, 2013, pp. 85–92. URL: http://dl.acm.org/citation.cfm?id=2491769.
[22] G. Pass, A. Chowdhury, C. Torgeson, A picture of search, in: Infoscale 2006, volume 152 of ACM International Conference Proceeding Series, ACM, 2006, p. 1. doi:10.1145/1146847.1146848.