<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing Query Suggestions for Search Session Simulation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Günther</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hagen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Martin-Luther-Universität Halle-Wittenberg</institution>
          ,
          <addr-line>Halle (Saale)</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Research on simulating search behavior has mainly dealt with result list interactions in recent years. We instead focus on the querying process and describe a pilot study to assess the applicability of search engine query suggestions to simulate search sessions (i.e., sequences of topically related queries). In automatic and manual assessments, we evaluate to what extent a session detection approach considers the simulated query sequences as “authentic” and how humans perceive the quality in the sense of coherence, realism, and representativeness of the underlying topic. As for the actual suggestion-based simulation, we compare different approaches to select the next query in a sequence (always selecting the first suggestion, random sampling, or topic-informed selection) to the human TREC Session track sessions and a previously suggested simulation scheme. Our results show that while it is easy to create query logs that are authentic to both users and automated evaluation, keeping the sessions related to an underlying topic can be difficult when relying on given suggestions only.</p>
      </abstract>
      <kwd-group>
        <kwd>Simulating query sequences</kwd>
        <kwd>Search session simulation</kwd>
        <kwd>Query suggestion</kwd>
        <kwd>TREC Session track</kwd>
        <kwd>Task-based search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Many studies on the simulation of search behavior focus on using simulated user behavior in system evaluations, while others cover aspects of user modeling in general. Using simulated interactions for evaluation purposes is usually motivated by retrieval setups with no or only few actual users whose behavior can be observed and used to improve the actual system (e.g., system variants in digital libraries or new (academic) search prototypes without an established user base). Such few-user systems could also be evaluated in lab studies. But lab studies are difficult to scale up and also consume a lot of time since actual users need to be hired, instructed, and observed. In such situations, simulation promises a way out, but the extent to which simulated search interactions can authentically replace real users in specific scenarios is still an open question. In recent years, mostly result clicks or stopping decisions have been the focus of user modeling and simulation studies, while simulating querying behavior has received less attention.</p>
      <p>In this paper, we describe a pilot study on query simulation that aims to assess the suitability of stitching together query suggestions to form “realistic” search sessions (i.e., sequences of queries on the same information need that some human might have submitted). The scenario we address is inspired by typical TREC style evaluation setups where search topics are given as a verbal description of some information need along with a title or first query. To simulate some search session with a couple of queries, we examine sequences of query suggestions provided by some suggestion approach; in our pilot experiments, we simply use the suggestions that the Google search engine returns, but any other suggestion approach could also be applied. Starting with the actual title or the first query of a TREC topic, the second query for the session is selected among the suggestions for the first query, the third query is selected from the suggestions for the second query, etc.</p>
      <p>Our research question is how such suggestion-based simulated sessions compare to real user sessions in the sense of coherence, realism, and representativeness of the underlying topic. In our pilot study, we thus let a human annotator assess human sessions from the TREC Session track mixed with sessions generated from suggestion sequences and sessions generated by a previous more static query simulation scheme. The results show that suggestion-based sessions replicate patterns commonly seen in query logs. Both humans and a session detection framework were unable to differentiate the simulated sessions from real ones. However, keeping close to the given topic when using suggestions as simulated queries is rather difficult. Among other reasons, the limited terminology in the topic, query, and suggestions and, most importantly, the relatively small number of suggestions provided by the Google Suggest Search API often cause the session to drift away from the given topic.</p>
      <sec>
        <title>2. Related Work</title>
        <p>Similar to recent developments in the field of recommenders [<xref ref-type="bibr" rid="ref1">1</xref>], simulation in the context of information retrieval often aims to support experimental evaluation of retrieval systems (e.g., in scenarios with few user interactions like in digital libraries) in a cost-gain scenario [<xref ref-type="bibr" rid="ref2">2</xref>] (cost for retrieval system interactions, gain for retrieving good results). Different areas of user behavior have been addressed by simulation: scanning snippets / result pages, judging document relevance, clicking on results, reading result documents, deciding about stopping the search, and query (re-)formulation itself. Some simulation studies combine several of these areas but some also just focus on a particular one. In this paper, we focus on the domain of simulating query (re-)formulation behavior. While quite a few studies on user click models and stopping decisions have been published in recent years, query formulation is still perceived as difficult to simulate [<xref ref-type="bibr" rid="ref3">3</xref>] but also necessary to generate useful simulations for interactive retrieval evaluation [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
        <p>The existing approaches to query simulation can be divided into approaches that generate queries following rather static underlying schemes [5, 6, 7, 8, 9] and approaches that use language models constructed from the topic itself, from observed snippets, or from some result documents to generate queries of varying lengths [<xref ref-type="bibr" rid="ref3">10, 11, 3, 12</xref>]. Not all, but most of the query simulations aim to simulate search sessions in the sense of query sequences that all have a similar intent [13, 14].</p>
        <p>As for the static simulation schemes, many different ideas have been suggested. Jordan et al. [7] generate controlled sets of single-term, two-term, and multiple-term queries for retrieval scenarios on the Reuters-21578 corpus by combining terms of selected specificity in the documents of the corpus (e.g., only highly discriminative terms to form very specific queries). Later studies have suggested to combine terms from manually generated query word pools and tested that on TREC topics. The respective querying strategies sample initial and subsequent query words from these pools and combine them to search sessions [5, 6, 8] following static schemes of, for instance, keeping the same two terms in every query but adding different third terms or, for instance, generating all possible three-permutations of three-term queries [6]. The suggested static schemes have been “idealized” from real searcher interactions [8] and have also been used in a later language modeling query simulator [12]. Similar to the mentioned keep-two-terms-but-vary-third-term query formulation strategy, Verberne et al. [9] create queries of <italic>n</italic> terms for the iSearch collection where <italic>n</italic> − 1 terms are kept and the last term is varied to mimic academic information seeking behavior and to evaluate the cumulated gain over a simulated session.</p>
        <p>One of the earliest more language model-based query simulators was suggested by Azzopardi et al. [10] in the domain of known-item search on the EuroGOV corpus (crawl of European government-related sites). Single queries for some given known-item document are generated from the term distribution within the document and some added “noise” to mimic imperfect human memory. The later InQuery system of Keskustalo et al. [15] used Bayesian inference networks to generate queries, Azzopardi [11] generated additional ad-hoc queries for existing TREC collections, while Carterette et al. [<xref ref-type="bibr" rid="ref3">3</xref>] suggest a reformulation simulator to simulate whole sessions by also including the snippets from the seen result pages in the language model using TREC Session track data.</p>
        <p>Some anchor text-based approaches to “simulate” complete query logs or to train query translation models also constitute a topic loosely related to ours [<xref ref-type="bibr" rid="ref5">16, 17</xref>]. However, we aim to simulate shorter sequences of topically related queries instead of complete query logs. As for the simulation, we want to study in pilot experiments whether and how well sequences of query suggestions stitched together may form search sessions. This idea is inspired by studies on query suggestions to support task-based search [<xref ref-type="bibr" rid="ref6 ref7">18, 19</xref>] since more complicated tasks usually result in more interactions and queries from the respective users. Our research question thus is how “authentic” sessions can be that are formed from simply following suggestions up to some depth.</p>
      </sec>
      <sec>
        <title>3. Query Log Generation</title>
        <p>As described above, there are various types of datasets and models that have been suggested for query simulation. In this paper, we want to study a source not yet covered: query suggestions. Our reasoning is that query suggestions from large search engines are derived from their large query logs and thus represent “typical” user behavior. In our pilot experiments, we specifically focus on query suggestions provided by the Google Suggest Search API (that serves up to 10 suggestions at a time) but, in principle, any other suggestion approach could also be applied (e.g., suggestions from other large search engines or suggestion methods from the literature). Still, the characteristics of the suggestions may vary between different services such that the results of our pilot experiments should be tested in a more general setup with different suggestion approaches.</p>
        <p>As our basis for simulated and real sessions, we use the TREC 2014 Session track dataset [<xref ref-type="bibr" rid="ref8">20</xref>] containing 1021 sessions on 60 topics. Each topic is defined by an information need given as a short description. The respective sessions include (among other information) the queries some user formulated on the topic with timestamps, the shown snippets, and clicked results. We extract the first queries of the sessions as seed queries for the simulated sessions since the topics themselves do not have explicit titles that might be used as a first query. In addition to the TREC data, we also sampled sessions from the Webis-SMC-12 dataset [<xref ref-type="bibr" rid="ref9">21</xref>] that contains query sequences from the AOL log [<xref ref-type="bibr" rid="ref10">22</xref>].</p>
        <p>While we mostly focus on the textual aspect of the queries in this paper, user session logs often come with additional information like user agent, user identification, IP address, date and time of the interaction. Each of our sessions consists of at least one query with a fixed user assigned to it. To run automatic session detection, we also simulate timestamps for each query submission.</p>
        <p>As suggestion-based session simulations, we consider the following three strategies in our pilot study.</p>
      </sec>
      <sec>
        <title>First Suggestion.</title>
        <p>This strategy always selects the first suggestion provided by the Google Suggest Search API for the previous query of the session as input. A generated session will contain a maximum of four queries in addition to the original query (in our analysis of several query log datasets, the average sessions had up to five queries). A session might be terminated early if the API does not provide additional suggestions.</p>
      </sec>
      <sec id="sec-1-1">
        <title>Random Suggestion.</title>
        <p>The random selection strategy randomly selects one of the suggestions provided
by the Google Suggest Search API for the previous
query of the session as input. Like with the first
suggestion strategy, generated sessions contain up
to four queries in addition to the original query.
The same query cannot appear back-to-back and
a session might be terminated early if the API
does not provide additional suggestions.</p>
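        <p>The first and random suggestion strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the names <monospace>simulate_session</monospace> and <monospace>toy_suggest</monospace> are hypothetical, the suggestion source is passed in as a plain function instead of a call to a real suggestion service, and the toy suggestion data is invented.</p>
        <preformat><![CDATA[
```python
import random
from typing import Callable, List, Optional

# Hypothetical type: any function mapping a query to a ranked suggestion list.
SuggestFn = Callable[[str], List[str]]


def simulate_session(seed: str, suggest: SuggestFn, strategy: str = "first",
                     max_added: int = 4,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Build a session by repeatedly following a suggestion for the last query."""
    if strategy not in ("first", "random"):
        raise ValueError(f"unknown strategy: {strategy}")
    rng = rng or random.Random()
    session = [seed]
    for _ in range(max_added):  # at most four queries besides the seed
        previous = session[-1]
        suggestions = suggest(previous)
        if strategy == "random":
            # The same query must not appear back-to-back.
            suggestions = [s for s in suggestions if s != previous]
        if not suggestions:
            break  # terminate early: no additional suggestions available
        if strategy == "first":
            session.append(suggestions[0])
        else:
            session.append(rng.choice(suggestions))
    return session


# Toy stand-in for a suggestion service (invented data, not real API output).
TOY_SUGGESTIONS = {
    "used car prices": ["used car prices uk", "used car prices 2021"],
    "used car prices uk": ["used car prices uk guide"],
}


def toy_suggest(query: str) -> List[str]:
    return TOY_SUGGESTIONS.get(query, [])


print(simulate_session("used car prices", toy_suggest, "first"))
```
]]></preformat>
        <p>Passing the suggestion source as a function keeps the sketch independent of any particular service, in line with the remark that any other suggestion approach could also be applied.</p>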
        <sec id="sec-1-1-1">
          <title>Three Word Queries (adapted).</title>
          <p>This strategy is based on the idea of the Session Strategy S3
described by Keskustalo et al. [8], which is
also implemented in the SimIIR framework<sup>1</sup> as
TriTermQueryGenerator. The original idea
uses two terms as the basis extended by a third
term selected from a topic description. We adapt
this strategy with a few modifications. Initially
we start with the original query from the real
session without any additions. We then extract
the 10 keywords from the topic’s description
with highest tf · idf scores (idf computed on the
English Wikipedia). In each round, we calculate
the cosine similarity of each suggestion and
each original query–keyword pair. We select
the suggestion that is closest to one of the
query–keyword pairs. We limit the sessions
to a maximum of four queries in addition to
the original query. We also employ a dynamic
threshold for the cosine similarity that stops
accepting suggestions when the similarity falls
below a certain threshold. Due to the varying
length and specificity of the descriptions and
the ambiguity of the topics, the threshold has
to be manually adjusted for each topic. In our
evaluation, we note that choosing an important
term from the topic description provides an
advantage to this strategy over the previous two
with respect to the topic representativeness of
the generated sessions.</p>
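          <p>The topic-informed selection step described above can be sketched as follows. This is only an illustration under assumptions: the text does not state how queries are turned into vectors, so the sketch uses plain bag-of-words counts, and the function names, the whitespace tokenization, and the zero weight for terms missing from the idf table are all hypothetical choices.</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Optional


def top_keywords(description: str, idf: Dict[str, float], k: int = 10) -> List[str]:
    """Return the k description terms with the highest tf * idf score."""
    tf = Counter(description.lower().split())
    # Assumption: terms missing from the idf table get weight 0.0.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf.get(t, 0.0), t))
    return ranked[:k]


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two strings as bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def pick_suggestion(seed: str, keywords: List[str], suggestions: List[str],
                    threshold: float) -> Optional[str]:
    """Pick the suggestion closest to any 'seed query + keyword' pair.

    Returns None when even the best similarity falls below the threshold,
    which ends the simulated session.
    """
    best, best_sim = None, 0.0
    for suggestion in suggestions:
        sim = max((cosine(suggestion, f"{seed} {kw}") for kw in keywords),
                  default=0.0)
        if sim > best_sim:
            best, best_sim = suggestion, sim
    return best if best_sim >= threshold else None
```
]]></preformat>
          <p>In this sketch the threshold is a plain parameter; the per-topic manual adjustment mentioned in the text would correspond to passing a different value for each topic.</p>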
          <p>
            Inter-Query Time. To simulate the time gap between
query submissions, we have extracted the timings from
user sessions from the Webis-SMC-12 dataset [
            <xref ref-type="bibr" rid="ref9">21</xref>
            ]. Our
analysis shows that 25% of the time gaps are shorter than
41 seconds, while half of the gaps are no longer than
137 seconds. The distribution of timings shows a peak at
8 seconds and a long tail with the highest values in the
multi-hour range. To account for logging and annotation
errors, we have removed outliers by deleting the longest
10% of the gaps, which limits the simulated time between
query submissions to no longer than 20 minutes. We
use the remaining pool of time gaps to
reproduce the timing distribution for our generated sessions by
randomly drawing values from it, which naturally
favors shorter time spans since they are more frequent.
          </p>
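          <p>The timestamp simulation amounts to sampling from an empirical pool of gaps. The following sketch assumes the observed gaps are available as a list of seconds; the function names and the example values are hypothetical and only illustrate the trimming and sampling steps.</p>
          <preformat><![CDATA[
```python
import random
from typing import List, Optional


def trim_longest(gaps: List[float], fraction: float = 0.10) -> List[float]:
    """Drop the longest `fraction` of the observed gaps (outlier removal)."""
    ordered = sorted(gaps)
    keep = len(ordered) - int(len(ordered) * fraction)
    return ordered[:keep]


def simulate_timestamps(n_queries: int, pool: List[float], start: float = 0.0,
                        rng: Optional[random.Random] = None) -> List[float]:
    """Assign a timestamp (in seconds) to each of n_queries >= 1 queries.

    Gaps are drawn uniformly from the pool, so frequent (short) gaps are
    drawn more often, which reproduces the empirical distribution.
    """
    rng = rng or random.Random()
    stamps = [start]
    for _ in range(n_queries - 1):
        stamps.append(stamps[-1] + rng.choice(pool))
    return stamps


# Invented example gaps in seconds; the last value plays the role of an outlier.
observed = [8, 8, 20, 41, 90, 137, 300, 600, 1200, 7200]
pool = trim_longest(observed)
print(simulate_timestamps(5, pool, rng=random.Random(42)))
```
]]></preformat>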
        </sec>
        <sec id="sec-1-1-2">
          <title>Limits when using Suggestions.</title>
          <p>While working on our pilot study, we experimented with various
combinations of suggestion selection strategies and session
lengths. We identified issues in our strategies that are a
direct result of the nature of search engine suggestions.</p>
          <p>The first suggestion strategy is particularly prone to
loops when two queries are the top-ranked suggestions
for each other, causing the generated session to alternate
between two query strings; this is also observed for singular–plural
pairs or categories (e.g., file formats, programming
languages). To counter the looping issue, we use a unique
query approach, which ensures that queries are not
repeated within a session. Additionally, another
policy ensures a minimum dissimilarity between
consecutive queries, which helps to avoid plurals as top
suggestions. However, while unique / dissimilar queries
mitigate looping, we find that especially longer sessions
(say, ten queries) narrow down to very specific topics.
A possible reason is that today’s search engine query
suggestions do not only show related queries, but often
offer more specific autocompletions. Further details on
the evaluation are provided in Section 4.</p>
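          <p>The two counter-measures (unique queries and a minimum dissimilarity between consecutive queries) can be illustrated with a small filter. The sketch makes assumptions that are not taken from the study: dissimilarity is computed with <monospace>difflib.SequenceMatcher</monospace> from the Python standard library, and the threshold of 0.1 is an arbitrary illustrative value.</p>
          <preformat><![CDATA[
```python
from difflib import SequenceMatcher
from typing import List


def dissimilarity(a: str, b: str) -> float:
    """1 - SequenceMatcher ratio; 0.0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_suggestions(session: List[str], suggestions: List[str],
                       min_dissimilarity: float = 0.1) -> List[str]:
    """Keep suggestions that are new to the session and differ enough
    from the most recent query (e.g., drops plain plural variants)."""
    seen = {q.lower() for q in session}
    previous = session[-1]
    return [s for s in suggestions
            if s.lower() not in seen
            and dissimilarity(previous, s) >= min_dissimilarity]


# "file formats" would loop back; the other two suggestions survive.
print(filter_suggestions(
    ["file formats", "file format"],
    ["file formats", "file format types and extensions", "image compression tools"]))
```
]]></preformat>
          <p>A stricter threshold removes more near-duplicates but also shrinks the already small pool of candidates, which is one reason why sessions may end early.</p>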
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Evaluation</title>
      <p>For the three approaches, we generate 100, 100, and 20 sessions, respectively (in case of the three word strategy, the strict selection process and the small pool of suggestions often result in very short sessions such that we could only include 20 in the evaluation).</p>
      <p><sup>1</sup>https://github.com/leifos/simiir</p>
      <p>In the evaluation, we compare the sessions generated by our three approaches to sessions from both the Webis-SMC-12 dataset and the TREC 2014 Session track. As a first step, we perform an automated evaluation by running the sessions through the session detection approach of Hagen et al. [<xref ref-type="bibr" rid="ref9">21</xref>]. Ideally, the simulated sessions should not be split by the session detection in order to count as “authentic”. In a second step, a human assessor looked at the simulated sessions as well as original sessions and had to judge whether a session seems to be simulated or of human origin. In a third step, a human assessor judged whether a session actually covers the intended information need given by the topic description.</p>
      <sec id="sec-2-1">
        <title>4.1. Automatic Session Detection</title>
        <p>The goal of a session detection system is to identify consecutive queries as belonging to the same information need or not. When a consecutive pair is detected that seems to belong to two different information needs, a split is introduced. Later some of these sessions might be run through a mission detection to identify non-consecutive sessions that belong to the same search task, etc.</p>
        <p>As an automatic evaluation of the simulated sessions’ authenticity, we individually run each simulated session and the individual sessions from the TREC and Webis-SMC-12 data through the session detection approach of Hagen et al. [<xref ref-type="bibr" rid="ref9">21</xref>]. A simulated or original session “passes” the automatic authenticity test if the detection approach does not introduce a split. The results are shown in Table 1 (sessions with only one query were removed since they will never be split).</p>
        <p>Altogether, the simulated sessions are hardly split by the automatic detection. The one wrong split for the first suggestion strategy and one wrong split for the random suggestion strategy are likely due to the first query being uppercased while the subsequent suggestions are lowercased, while the second “wrong” split for the random suggestion strategy is likely caused by a reformulation with abbreviation and no term overlap (“no air conditioning alternatives” to “what to use instead of ac”). These examples serve as a good demonstration for the limitations of a fully automatic authenticity evaluation such that we also manually assess the simulated sessions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Human Authenticity Assessment</title>
        <p>An automated session detection system only “assesses” whether the consecutive queries seem to belong together based on factors like lexical or semantic similarity and time gaps. However, we want to complement this purely automatic relatedness detection by a manual assessment of how “authentic” the simulated sessions are perceived by humans, i.e., whether a human can distinguish simulated from real sessions.</p>
        <p><bold>Procedure.</bold> All simulated sessions and a sample of original sessions are combined into one session pool. The sessions are then presented to the judge as a kind of log excerpt with user ID, timestamps, and queries. The judge has no accurate knowledge about the number of queries for each approach and there is no obvious way to determine the source of a session. The judge then labels each session as real (sampled from actual query logs) or simulated (by one of the three approaches). The results in Table 2 indicate that the simulated sessions are perceived as real even though the assessor was told that some sessions actually are simulated.</p>
        <p>During the assessment, the assessor took notes of which features of a session or query determine the judgment. This helps us in understanding how humans and algorithms may come up with different verdicts. The primary criteria for the relatedness of two queries are their term composition and length. Similarities in those aspects are perceived as patterns. This is also true for small editing actions (adding or replacing single words), which naturally come with the specialization towards a topic. The opposite effect is perceived for rapid topic changes. When multiple closely related tasks have to be fulfilled within one session, there may be large changes from query to query. This is also true for replacing words by synonyms or abbreviations. While a human judge will usually be able to infer context for those rapid changes, an automatic process is more likely to detect a new session. Another discrepancy between human and algorithmic evaluation becomes apparent when we consider outlier behavior like text formatting (e.g., all-uppercase) that a human might be able to judge as a simple typing error while a detection approach without lowercasing preprocessing might be misled.</p>
        <p>In a nutshell, while both humans and algorithms look for patterns in the sessions and queries, the human judge does so more selectively by looking for mistakes. If found, the type of a mistake usually heavily influences the assessment of a session. Finally, note that due to the nature of the three word query strategy there might be a chance for an informed human to guess the session’s origin.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.3. Human Topicality Assessment</title>
        <p>So far, we have shown that the authenticity of a session is largely influenced by its term composition and appearance. However, to serve as a replacement for humans, a session generator not only has to provide sessions that a detection approach or some human would assess as authentic, but also has to simulate sessions that follow the topic given as part of the evaluation study.</p>
        <p><bold>Hypothesis.</bold> The first and random approach do not take the topic into account. Both strategies simply converge to anything the search suggestion API provides for the initial query. Instead, the three word approach makes informed decisions when choosing suggestions and should therefore be able to stay more “on topic”.</p>
        <p><bold>Procedure.</bold> Determining if a session or query is on topic is a non-trivial task. While a query like “car” overlaps with the topic “find information on used car prices”, it does not address the information need formulated in the topic description. We therefore set the following criteria to evaluate if a session is “on topic”: A session is “on topic” if the last query addresses at least one information need formulated in the topic description or shows clear signs that the session is headed in that direction (such that very short sessions are more likely to be on topic). A session is also “on topic” if any query of the session addresses at least one information need formulated in the topic description (a necessary condition to account for topics with multiple subtasks).</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Notable Examples</title>
        <p>As part of the judgment process, we have also taken note of simulated sessions which contain conspicuous editing steps or queries. The examples in Table 4 include a positive and a negative example with respect to authenticity. The first example was judged as “real” based on the usage of an abbreviation for air conditioning in the fourth query. The replacement of terms or groups of terms with a common abbreviation might be seen as a typical step for a human user after gaining more insight into a topic. The second example includes an issue that was caused by the autocomplete feature of the Google suggestions: the abbreviation ‘ac’ was falsely extended to the term ‘activator’, which ultimately changed the subject of the session. The third example shows a very common issue of ambiguous first queries. For the first and random suggestion strategies, there is no way to determine that a city is referenced in this example such that the session quickly diverges to the food domain.</p>
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Strategy</th><th>Query sequence</th></tr>
            </thead>
            <tbody>
              <tr><td>First suggestion</td><td>air conditioning alternatives<break/>air conditioning alternatives car<break/>no air conditioning in car alternatives<break/>how can i keep my car cool without ac<break/>ways to keep car cool without ac</td></tr>
              <tr><td>Random suggestion</td><td>air conditioning alternatives<break/>no air conditioning alternatives<break/>what to use instead of ac<break/>what to use instead of activator<break/>what can i use instead of activator for nails<break/>how to make nail activator</td></tr>
              <tr><td>Random suggestion</td><td>Philadelphia<break/>philadelphia cheese<break/>philadelphia cheese recipes<break/>philadelphia cheese recipes salmon pasta</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-5">
        <title>4.5. Long Sessions</title>
        <p>For the simulated sessions up to this point, parameters like session length and inter-query time were set to values that we deemed appropriate in some initial experiments in order to generate “close to real” sessions. We also did not include navigational queries or known-item searches, which often could result in either very short or very long sessions. To investigate the applicability of our approaches to such outlier behavior, we have also assessed some sessions with up to 20 queries.</p>
        <p>Even without imposing any limits on
the generation process, the sessions were often
terminated early due to a lack of suggestions. This
mostly had two reasons: either the query became
too specific to still yield additional suggestions or the
pool of unique and dissimilar queries was used up. In
cases where long sessions could actually be generated,
the session usually became rather specific quickly and
diverged substantially from the given topic towards
the end of the session.</p>
        <p>Using a different set of more technically oriented
topics, we were able to generate longer sessions more
frequently. For this to work, we had to limit the dissimilarity
filter, as abbreviations within the query were more
frequent and therefore editing distances were smaller. We
also observed that queries from this field mostly
consisted of categorical keywords stitched together,
compared to the more natural looking sessions from standard
query logs.</p>
        <p>Those observations, while helping to shape our pilot
study, show that parameters and strategies for authentic
session generation are a very dynamic and potentially
also topic-specific issue.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>In this paper, we have investigated how well
authentic sessions can be simulated using web search engine
query suggestions. By employing different strategies of
selecting and combining the suggestions, we showcased
the potential but also the limits of the overall usefulness
of suggestion-based session simulation. Our evaluation
showed that both humans and a session detection
framework are unable to distinguish suggestion-based sessions
from sampled real sessions. While some kind of
authenticity can thus be attributed to the simulated sessions,
staying on topic proved to be rather difficult. Addressing
the outlined shortcomings is an interesting direction for
future work. We plan to continue investigating query
simulation as follows.</p>
      <sec id="sec-3-1">
        <title>Data Independence.</title>
        <p>Relying on suggestions as query candidates limits the flexibility and applicability of the
simulated sessions. We will work on query
modifications that include “knowledge” from language models or
predefined editing rules.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Influence on the Topic.</title>
        <p>For accurate session simulation, it is necessary to influence the topic that the queries follow. We will evaluate how and where those decisions have to be made to create an effective user model.</p>
      </sec>
      <sec id="sec-3-3">
        <title>User Types and Editing.</title>
        <p>Since query modifications often follow well-known patterns, we will also investigate ways to replicate editing patterns in simulated queries that are typical for specific user groups or tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work has been partially supported by the DFG through the project “SINIR: Simulating INteractive Information Retrieval” (grant HA 5851/3-1).</p>
        <p>2011, Glasgow, United Kingdom, October 24-28, 2011, ACM, 2011, pp. 611–620. URL: https://doi.org/10.1145/2063576.2063668. doi:10.1145/2063576.2063668.
[5] F. Baskaya, H. Keskustalo, K. Järvelin, Time drives interaction: Simulating sessions in diverse searching environments, in: W. R. Hersh, J. Callan, Y. Maarek, M. Sanderson (Eds.), The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, ACM, 2012, pp. 105–114. URL: https://doi.org/10.1145/2348283.2348301. doi:10.1145/2348283.2348301.
[6] F. Baskaya, H. Keskustalo, K. Järvelin, Modeling behavioral factors in interactive information retrieval, in: Q. He, A. Iyengar, W. Nejdl, J. Pei, R. Rastogi (Eds.), 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, ACM, 2013, pp. 2297–2302. URL: https://doi.org/10.1145/2505515.2505660. doi:10.1145/2505515.2505660.
[7] C. Jordan, C. R. Watters, Q. Gao, Using controlled query generation to evaluate blind relevance feedback algorithms, in: G. Marchionini, M. L. Nelson, C. C. Marshall (Eds.), ACM/IEEE Joint Conference on Digital Libraries, JCDL 2006, Chapel Hill, NC, USA, June 11-15, 2006, Proceedings, ACM, 2006, pp. 286–295. URL: https://doi.org/10.1145/1141753.1141818. doi:10.1145/1141753.1141818.
[8] H. Keskustalo, K. Järvelin, A. Pirkola, T. Sharma, M. Lykke, Test collection-based IR evaluation needs extension toward sessions - A case of extremely short queries, in: G. G. Lee, D. Song, C. Lin, A. N. Aizawa, K. Kuriyama, M. Yoshioka, T. Sakai (Eds.), Information Retrieval Technology, 5th Asia Information Retrieval Symposium, AIRS 2009, Sapporo, Japan, October 21-23, 2009. Proceedings, volume 5839 of Lecture Notes in Computer Science, Springer, 2009, pp. 63–74. URL: https://doi.org/10.1007/978-3-642-04769-5_6. doi:10.1007/978-3-642-04769-5_6.
de Vries, C. L. A. Clarke, N. Fuhr, N. Kando (Eds.), SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, ACM, 2007, pp. 455–462. URL: https://doi.org/10.1145/1277741.1277820. doi:10.1145/1277741.1277820.
[11] L. Azzopardi, Query side evaluation: An empirical analysis of effectiveness and effort, in: J. Allan, J. A. Aslam, M. Sanderson, C. Zhai, J. Zobel (Eds.), Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009, ACM, 2009, pp. 556–563. URL: https://doi.org/10.1145/1571941.1572037. doi:10.1145/1571941.1572037.
[12] D. Maxwell, L. Azzopardi, K. Järvelin, H. Keskustalo, Searching and stopping: An analysis of stopping rules and strategies, in: J. Bailey, A. Moffat, C. C. Aggarwal, M. de Rijke, R. Kumar, V. Murdock, T. K. Sellis, J. X. Yu (Eds.), Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, ACM, 2015, pp. 313–322. URL: https://doi.org/10.1145/2806416.2806476. doi:10.1145/2806416.2806476.
[13] R. Jones, K. L. Klinkner, Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs, in: J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K. Choi, A. Chowdhury (Eds.), Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, ACM, 2008, pp. 699–708. URL: https://doi.org/10.1145/1458082.1458176. doi:10.1145/1458082.1458176.
[14] A. H. Awadallah, X. Shi, N. Craswell, B. Ramsey, Beyond clicks: query reformulation as a predictor of search satisfaction, in: Q. He, A. Iyengar, W. Nejdl, J. Pei, R. Rastogi (Eds.), 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, ACM, 2013, pp. 2019–
[9] S. Verberne, M. Sappelli, K. Järvelin, W. Kraaij,
User simulations for interactive search: Evaluat- 2028. URL: https://doi.org/10.1145/2505515.2505682.
ing personalized query suggestion, in: A. Hanbury, doi:10.1145/2505515.2505682.
G. Kazai, A. Rauber, N. Fuhr (Eds.), Advances in [15] H. Keskustalo, K. Järvelin, A. Pirkola, Evaluating
Information Retrieval - 37th European Conference the efectiveness of relevance feedback based on a
on IR Research, ECIR 2015, Vienna, Austria, March user simulation model: Efects of a user scenario on
29 - April 2, 2015, Proceedings, volume 9022 of Lec- cumulated gain value, Inf. Retr. 11 (2008) 209–228.
ture Notes in Computer Science, 2015, pp. 678–690. URL: https://doi.org/10.1007/s10791-007-9043-7.
URL: https://doi.org/10.1007/978-3-319-16354-3_75. doi:10.1007/s10791-007-9043-7.
doi:10.1007/978-3-319-16354-3\_75. [16] V. Dang, W. B. Croft, Query reformulation using
[10] L. Azzopardi, M. de Rijke, K. Balog, Building sim- anchor text, in: B. D. Davison, T. Suel, N. Craswell,
ulated queries for known-item topics: An analysis B. Liu (Eds.), Proceedings of the Third International
using six european languages, in: W. Kraaij, A. P. Conference on Web Search and Web Data Mining,</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Balog,
          <article-title>Evaluating conversational recommender systems via user simulation</article-title>
          , in: R. Gupta,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          (Eds.),
          <source>KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , Virtual Event, CA, USA,
          August 23-27
          ,
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>1512</fpage>
          -
          <lpage>1520</lpage>
          . URL: https://doi.org/10.1145/3394486.3403202. doi:10.1145/3394486.3403202.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>McGregor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          ,
          <article-title>Untangling cost, effort, and load in information seeking and retrieval</article-title>
          , in: F. Scholer, P. Thomas,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elsweiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>CHIIR '21: ACM SIGIR Conference on Human Information Interaction and Retrieval</source>
          , Canberra, ACT, Australia, March 14-19,
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>161</lpage>
          . URL: https://doi.org/10.1145/3406522.3446026. doi:10.1145/3406522.3446026.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zengin</surname>
          </string-name>
          ,
          <article-title>Dynamic test collections for retrieval evaluation</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          , C. Zhai (Eds.),
          <source>Proceedings of the 2015 International Conference on The Theory of Information Retrieval</source>
          , ICTIR
          <year>2015</year>
          , Northampton, Massachusetts, USA, September 27-30
          ,
          <year>2015</year>
          , ACM,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>100</lpage>
          . URL: https://doi.org/10.1145/2808194.2809470. doi:10.1145/2808194.2809470.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          , E. Yilmaz,
          <article-title>Simulating simple user behavior for system effectiveness evaluation</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          , I. Ruthven (Eds.),
          <source>Proceedings of the 20th ACM Conference on Information and Knowledge Management</source>
          , CIKM
          <year>2011</year>
          , Glasgow, United Kingdom, October 24-28,
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , pp.
          <fpage>611</fpage>
          -
          <lpage>620</lpage>
          . URL: https://doi.org/10.1145/2063576.2063668. doi:10.1145/2063576.2063668.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Billerbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fetterly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Robust query rewriting using anchor data</article-title>
          , in: S. Leonardi,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panconesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          (Eds.),
          <source>Sixth ACM International Conference on Web Search and Data Mining, WSDM</source>
          <year>2013</year>
          , Rome, Italy, February 4-8
          ,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>344</lpage>
          . URL: https://doi.org/10.1145/2433396.2433440. doi:10.1145/2433396.2433440.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garigliotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <article-title>Generating query suggestions to support task-based search</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Shinjuku, Tokyo, Japan, August 7-11
          ,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>1153</fpage>
          -
          <lpage>1156</lpage>
          . URL: https://doi.org/10.1145/3077136.3080745. doi:10.1145/3077136.3080745.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garigliotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <article-title>Generating high-quality query suggestion candidates for task-based search</article-title>
          , in: G. Pasi,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval - 40th European Conference on IR Research</source>
          , ECIR
          <year>2018</year>
          , Grenoble, France, March 26-29,
          <year>2018</year>
          , Proceedings, volume
          <volume>10772</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>631</lpage>
          . URL: https://doi.org/10.1007/978-3-319-76941-7_54. doi:10.1007/978-3-319-76941-7_54.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
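          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>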
          ,
          <article-title>Overview of the TREC 2014 session track</article-title>
          , in:
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ellis</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of The Twenty-Third Text REtrieval Conference</source>
          , TREC 2014, Gaithersburg, Maryland, USA, November 19-21,
          <year>2014</year>
          , volume
          <volume>500-308</volume>
          of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>2014</year>
          . URL: http://trec.nist.gov/pubs/trec23/papers/overview-session.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>From search session detection to search mission detection</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Magalhães</surname>
          </string-name>
          , P. Calado (Eds.), Open research Areas in Information Retrieval, OAIR '13, Lisbon, Portugal, May 15-17,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>92</lpage>
          . URL: http://dl.acm.org/citation.cfm?id=2491769.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Torgeson</surname>
          </string-name>
          ,
          <article-title>A picture of search</article-title>
          , in:
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the 1st International Conference on Scalable Information Systems, Infoscale</source>
          <year>2006</year>
          , Hong Kong, May 30-June 1,
          <year>2006</year>
          , volume
          <volume>152</volume>
          of ACM International Conference Proceeding Series, ACM,
          <year>2006</year>
          , p.
          <fpage>1</fpage>
          . URL: https://doi.org/10.1145/1146847.1146848. doi:10.1145/1146847.1146848.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>