=Paper=
{{Paper
|id=Vol-3160/paper8
|storemode=property
|title=Simulating User Interaction and Search Behaviour in Digital Libraries
|pdfUrl=https://ceur-ws.org/Vol-3160/paper8.pdf
|volume=Vol-3160
|authors=Saber Zerhoudi,Michael Granitzer,Christin Seifert,Jörg Schlötterer
|dblpUrl=https://dblp.org/rec/conf/ircdl/ZerhoudiGSS22
}}
==Simulating User Interaction and Search Behaviour in Digital Libraries==
Saber Zerhoudi¹, Michael Granitzer¹, Christin Seifert² and Joerg Schloetterer²

¹ University of Passau, Passau, Germany
² University of Duisburg-Essen, Duisburg, Germany

Abstract

Providing a satisfying user experience is key to successful digital libraries. An efficient search user interface design requires a detailed understanding of user behaviour to match users' needs. A common approach to modelling users is to transform users' recorded navigation behaviour into Markov models and analyse them to provide personalised content. However, personalisation based on log data is hard to achieve due to the lack of sufficient daily user interactions in comparison to web search engines; even when achieved, it usually comes at the cost of users' privacy and the confidentiality of their personal information, potentially leading to privacy violations. In order to allow data analysis while retaining users' privacy, we propose a Markov approach to simulate user sessions and help reduce the amount of collected user data while preserving the profiling efficiency. Specifically, given log data of search sequences performed by users in a digital library, we explore basic, conditional, contextual, time-aware and query-change Markov models to simulate similar search sequences. Our results on an academic search engine log dataset show that our methods reliably simulate global and type-specific search behaviour. Simulated search sequences representing an approximation of real behaviour can be used to generate log data at any desired volume and variety, such that A/B-testing digital libraries becomes scalable.

Keywords: Simulating user sessions, Markov models, User search behaviour, User modelling

IRCDL 2022: 18th Italian Research Conference on Digital Libraries, February 24–25, 2022, Padova, Italy
saber.zerhoudi@uni-passau.de (S. Zerhoudi); michael.granitzer@uni-passau.de (M. Granitzer); christin.seifert@uni-due.de (C. Seifert); joerg.schloetterer@uni-due.de (J. Schloetterer)
ORCID: 0000-0003-2259-0462 (S. Zerhoudi); 0000-0003-3566-5507 (M. Granitzer); 0000-0002-6776-3868 (C. Seifert); 0000-0002-3678-0390 (J. Schloetterer)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The main aim of search engines is to offer a satisfying user experience. Their success can be measured by how well they are able to predict user actions and provide users with the right content at the right time. Modelling user behaviour accurately also allows search engines to adapt their user interface design to match users' needs. This includes deciding where each user interface component should be placed and which content should be provided to the user. However, problems arise when improving platforms such as digital libraries (e.g., academic search engines) that aim to provide personalised content to their user base, namely researchers and students. In fact, it is often not possible to build personalised user models due to the lack of sufficient daily user interactions in comparison to web search engines [1], and even when available, such models would be generated at a significantly smaller scale and would be less accurate.
Recently, user behaviour has become the subject of controversial debates in information security and privacy related areas, especially given the large amount of user data collected by web search engines and digital libraries. Analysing this wealth of information without privacy concerns has allowed web search engines on a larger scale, and digital libraries on a smaller one, to improve their search experience. However, concerns started to rise regarding the usage of collected sessions and the impact of storing personal, privacy-sensitive data without the user's knowledge. To reduce the amount of collected user data while preserving the profiling efficiency, and to help domain-specific search engines, where the amount of user data is limited, provide accurate information to users while protecting their privacy and the confidentiality of their personal information, we propose to use simulation. Simulation offers a way to overcome the lack of experimental real-world data, especially when the acquisition of such data is costly or challenging [2]. Simulating user search behaviour also allows information retrieval (IR) systems to conduct A/B-tests for the purpose of monitoring user interactions and parameterising the a-priori distribution of search types using different back-end configurations and user interface variants (e.g., moving facets to a more prominent place in the user interface) to improve the retrieval functionality.

To better simulate users' search behaviour and model their information needs, search engines offer contextual information such as previous queries and click actions, which is very useful for understanding user search behaviour and improving the retrieval system configurations involved in the search process. In fact, digital libraries should consider the user's browsing context (e.g., which users are more likely to formulate more queries) and sequentiality (e.g., which search actions are performed knowing that the user has formulated their x-th query) to improve the precision of results. Users in the same group are most likely to behave in a similar manner. Therefore, it becomes easier to recognise users' intentions and simulate their behaviour when exploiting the available contextual information.

In this paper, we propose the use of Markov models to simulate global and search-type specific behaviour to support users in their search. Markov models have been widely used for discovering meaningful patterns in browsing data due to their good interpretability [3, 4, 5, 6]. In particular, they capture sequences in search patterns using transition probabilities between states and translate user sessions into Markov processes. Our contributions are the following: (1) we conduct experiments on a real-world dataset and simulate user search behaviour using a basic Markov approach, which we consider as baseline; (2) we then demonstrate that, by using the user's browsing sequentiality, the Markov model simulates user sessions marginally better than the baseline; (3) we also show that Markov models perform well when considering the user's actions or the user's intent-level query change as context, even when the contexts are derived from common-sense assumptions; and (4) modelling the dwell time and using it as an additional feature for session simulation yields better results.
Our contributions also include (5) an empirical validation that utilises the Kolmogorov-Smirnov statistical test to compare the similarity between the real log and simulated data distributions, and (6) an evaluation technique that uses classification to assess the quality of simulated search sessions and calculate each model's improvement in comparison to the baseline.

2. Related Work

While Markov models can be used in other domains such as credit card fraud detection [7], computational biology [8] or cyber-security [9], we restrict the discussion here to understanding, modelling and predicting users' search behaviour on search engines.

Several Markov model variations and extensions have been proposed for modelling user navigation behaviour. For example, Cadez et al. [10] used first-order Markov models to cluster user behaviour and reported some interesting insights (e.g., the order in which users request pages) that improved the website from which the data was taken. Liu et al. [11] explored continuous-time Markov processes for modelling users' browsing behaviour in web search to help compute page importance. Azzopardi introduced Search Economic Theory (SET) [12] to model the search process and estimate users' effort during a search process using a cost function. He compared the interaction costs of different search strategies using simulated user interactions. Hybrid-order tree-like Markov models [13], selective Markov models [14] and variable-order Markov models [15] have also been used to extract users' interactions from logs and capture both small- and large-order Markov dependencies.

Besides these, the most commonly used techniques are based on Hidden Markov Models (HMMs), which are used to model exploratory search sessions. Lucas found HMMs to be suitable for pattern recognition problems on the basis of sequential data [16] (e.g., credit card fraud detection, handwriting recognition or biological sequence analysis). However, working with HMMs usually requires an understanding of the domain, since we do not have control over the state transitions and the states are not completely observable. Kiseleva et al. [17] suggested a hierarchical clustering approach to decompose users' web sessions into non-overlapping temporal segments. The authors also identified temporal context and utilised it to predict future user actions more accurately using Markov models.

Our discussion of related work reveals a strong focus on predicting and modelling users' search behaviour. Despite this fact, the application of Markov models as simulation models outside the healthcare domain [18] is still in its infancy.

3. Methodology

The main purpose of this study is to simulate user search behaviour based on users' behaviour patterns. In other words, given log data of search sequences performed by users in a digital library, our goal is to simulate similar search sequences. In this work, we use the Markov model as a simulation model. In general, the input for this problem is a sequence of search sessions produced by real users, and the goal is to build Markov models using different approaches that model the user's search behaviour.

3.1. Basic Markov model

We propose investigating the use of Markov chains to model the search dynamics. The theoretical model is based on first-order Markov models [19]. Our model is defined as a weighted directed graph, where user actions are represented as nodes and the transition probabilities between actions as edges.
Formally, let $G = (A, E, w)$ where:

• $A$ is the set of all possible user search actions, including the START/END node, with each action defined as a state $S_a$,
• $E \subseteq A \times A$ is the set of all possible transitions between two actions,
• $w : E \to [0, 1]$ is a function that assigns a weight $w$ to every pair of states $(S_i, S_j)$, representing the likelihood of a transition from state $i$ to state $j$.

Let $X_k$ be the random variable that models actions in a user search session. The basic Markov approach models the transition probability between a pair of states using maximum likelihood estimation as follows:

$P(X_k = S_b \mid X_{k-1} = S_a) = \frac{N_{S_a,S_b}}{N_{S_a}}$   (1)

where $N_{S_a}$ is the total number of times the state $S_a$ occurred in the training data and $N_{S_a,S_b}$ is the number of times the transition from state $a$ to state $b$ has been observed. We use the first-order Markov model as a baseline to model the global user search behaviour and simulate user search sessions in such a way that each action performed by a user corresponds to a state in the model.

3.2. Conditional Markov model

In the conditional Markov approach, we aim to model the probability of performing an action given the current action and query, and the probability of formulating a query given the current action and query. We extend the work that has been done in [20] by adjusting it to the simulation scenario. We utilise the user search sessions in the log data to calculate the transition probabilities between different states, resulting in a Markov model that takes the following definitions into consideration:

• $i \in \mathbb{N}$ user query: the $i$-th query formulated in a user search session
• $j \in \mathbb{N}$ user action: the $j$-th action performed for the $i$-th query in a user search session

Every user action in a search session is related to the preceding query formulation. With the exception of the first query formulation, which marks the start of a user search session, every query is also related to the preceding action performed by the user. Let $X_k$ be the random variable that models query formulation in a user search session and $Y_k$ the random variable that models user actions performed for the last query in a search session after $k$ transitions. In order to model user sessions, we define the pair $(X_k = i, Y_k = j)$ such that it represents a user performing the $j$-th action after formulating the $i$-th query. We introduce two different types of transitions. The first one models the transition probability of formulating the $i$-th query given that $j$ actions were performed for the $(i-1)$-th query [20]:

$P(X_k = i \mid X_{k-1} = i-1, Y_{k-1} = j) = \frac{P(X_k = i, Y_k = j)}{P(X_{k-1} = i-1, Y_{k-1} = j)}$   (2)

and the second one models the transition probability of performing $j$ actions given that the $i$-th query was formulated [20]:

$P(Y_k = j \mid X_{k-1} = i, Y_{k-1} = j-1) = \frac{P(X_k = i, Y_k = j)}{P(X_{k-1} = i, Y_{k-1} = j-1)}$   (3)

where the first event in a user search session is a query formulation, followed by either a user action or a new query reformulation. After $k$ events, the transition probability is calculated using the Markov transition matrix $M^k$ of $k$-th order. In the basic approach, the random variable $X_k$ represents the action/state (regardless of whether it is a query or an action) and the transition probability from state $a$ to $b$ is $P(X_k = b \mid X_{k-1} = a)$. The Markov property (i.e., the future state depends only on the current one) is also preserved.
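Before turning to the conditional extension below, the following minimal sketch (not the authors' implementation) illustrates the basic first-order model of Eq. (1): transition probabilities are estimated by counting transitions in logged sessions, and a synthetic session is sampled by walking the chain from START to END. The action labels, function names and toy log data are illustrative assumptions.

```python
import random
from collections import defaultdict

START, END = "START", "END"

def estimate_transitions(sessions):
    """Estimate first-order transition probabilities (Eq. 1) from logged sessions.
    Each session is a list of action labels, e.g. ["search", "view_record"]."""
    counts = defaultdict(lambda: defaultdict(int))
    for actions in sessions:
        states = [START] + list(actions) + [END]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1                     # N_{S_a, S_b}
    probs = {}
    for a, successors in counts.items():
        total = sum(successors.values())          # N_{S_a}
        probs[a] = {b: n / total for b, n in successors.items()}
    return probs

def simulate_session(probs, max_len=50):
    """Sample one synthetic session by walking the chain until END is reached."""
    state, session = START, []
    while state != END and len(session) < max_len:
        successors = probs[state]
        state = random.choices(list(successors), weights=successors.values())[0]
        if state != END:
            session.append(state)
    return session

# Toy log data with illustrative action labels:
logs = [["search", "view_record"], ["search", "search", "view_record"]]
model = estimate_transitions(logs)
print(simulate_session(model))
```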
For the conditional approach, we add a second condition, i.e., the future action always depends on the current action and query, which makes it necessary to distinguish queries $X_k$ from actions $Y_k$. We then have $P(X_k = i \mid X_{k-1} = i-1, Y_{k-1} = j)$ and $P(Y_k = j \mid X_{k-1} = i, Y_{k-1} = j-1)$ as transition probabilities. We compute the transition probabilities based on these two definitions (i.e., Eq. 2 and 3) by looking at the current action and the currently formulated query when trying to model the probability of performing another action or formulating another query.

3.3. Contextual Markov model

During a search session, a user performs different search actions to find documents that fulfil his/her information need. The technique that we propose here aims to categorise users into different groups based on their search behaviour. Search tasks are commonly divided into two major types of user behaviour [21, 22]: (1) "exploratory search", where users tend to explore the search result list exhaustively, rephrase queries or extensively use filtering operations on the results, and (2) "known-item", where users only investigate the first few results and rephrase their queries quickly to match their need.

In this approach, we utilise search-type context to split the training data into smaller portions. More precisely, we build a first-order Markov model for each search type (i.e., exploratory search and known-item) by adopting context and compare the simulation's performance to the Markov model built from the whole data. This allows us to evaluate the impact of context on the accuracy of simulated sessions. Athukorala et al. [22] extended the work of Marchionini [21] and provided a few simple indicators of information search behaviour to categorise users into exploratory and known-item searchers. They showed that the most distinctive indicators characterising exploratory search behaviour are query length, maximum scroll depth, and task completion time, and demonstrated empirically that exploratory tasks can be distinguished within the first query session by these indicators. Following their findings, we identify known-item search sessions by fine-tuning their indicators to fit our dataset, capping the first query duration at 200 seconds, the dwell duration of the first three actions at 120 seconds, the cumulative action tally in the first query iteration at 4 actions and the task completion time at 800 seconds. The remaining user sessions are considered exploratory search. The transition probability between any two states $S_a$ and $S_b$ is modelled similarly to Eq. (1).
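For illustration, the sketch below applies the known-item thresholds described above to split logged sessions into the two search types. The per-session summary fields are hypothetical names, assuming such statistics can be derived from the log entries; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SessionStats:
    # Hypothetical per-session summary assumed to be extractable from the log data.
    first_query_duration: float      # seconds spent on the first query
    first_three_dwell: float         # cumulative dwell time of the first three actions
    first_query_action_count: int    # actions performed in the first query iteration
    task_completion_time: float      # total session duration in seconds

def is_known_item(s: SessionStats) -> bool:
    """Known-item sessions stay below all four thresholds from Section 3.3;
    everything else is treated as exploratory search."""
    return (s.first_query_duration <= 200
            and s.first_three_dwell <= 120
            and s.first_query_action_count <= 4
            and s.task_completion_time <= 800)

def split_by_search_type(sessions: List[SessionStats]) -> Tuple[List[SessionStats], List[SessionStats]]:
    """Partition the training data so that one Markov model per search type can be built."""
    known_item = [s for s in sessions if is_known_item(s)]
    exploratory = [s for s in sessions if not is_known_item(s)]
    return known_item, exploratory
```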
3.4. Time-aware Markov model

Our time-aware Markov model addresses the time distribution between actions by taking the dwell time spent on each transition into account. We assume the existence of a distinctive distribution that governs how much time a user needs to move from state $a$ to state $b$ for each possible transition $S_a \to S_b$. Each transition time is collected from the training data and used to estimate the time distribution of that transition. In order to calculate this time distribution, we utilise the gamma distribution, which is governed by two parameters [23, 24]. Let $Y$ be the random variable that models the time between transitions. $Y$ is gamma-distributed, where $\theta$ is the scale parameter and $k$ the shape parameter, denoted by $Y \sim \Gamma(k, \theta)$.

The probability density function of the gamma distribution can be expressed as follows [25]:

$f(x; k, \theta) = \frac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-x/\theta}, \quad x > 0; \; k, \theta > 0.$

Let $N$ be a set of independent observations $(x_1, \ldots, x_N)$ of transition times between two states $(S_a, S_b)$. The likelihood function is:

$L(k, \theta) = \prod_{i=1}^{N} f(x_i; k, \theta)$   (4)

and thus the maximum likelihood estimate with respect to $\theta$, obtained by setting the derivative of the log of Equation (4) to zero, is $\hat{\theta} = \frac{1}{kN}\sum_{i=1}^{N} x_i$. Similarly, the maximum with respect to $k$ satisfies $\ln(k) - \psi(k) = \ln\big(\frac{1}{N}\sum_{i=1}^{N} x_i\big) - \frac{1}{N}\sum_{i=1}^{N} \ln x_i$, where $\psi$ is the digamma function, and can be solved using the approximation $\ln(k) - \psi(k) \approx \frac{1}{2k}\left(1 + \frac{1}{6k+1}\right)$; both parameters can thus be estimated from the training data. With the gamma distribution parameters estimated to fit the data, we aim to represent the dwell times in a realistic manner.

Similarly to the basic Markov model, the time-aware Markov approach models the transition probability between any two states $S_a$ and $S_b$ using maximum likelihood estimation as $P(X_k = S_b \mid X_{k-1} = S_a) = \frac{N_{S_a,S_b}}{N_{S_a}}$, where $N_{S_a,S_b}$ is the number of times we observed a transition from state $S_a$ to $S_b$ and $N_{S_a}$ is the total number of transitions from state $S_a$ to any other state in the training data.

In this experiment, we seek to describe user action sequences while taking dwell time into consideration. We use maximum likelihood to estimate the parameters of the gamma distribution for every transition from the log data. We then use the most likely transition time as a feature.
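As a rough sketch of this step, the following assumes SciPy is used to fit the gamma parameters for a single transition's dwell times and to derive the most likely transition time (the mode of the fitted distribution) as a feature; the function names and sample values are illustrative, not the authors' code.

```python
import numpy as np
from scipy import stats

def fit_dwell_time(dwell_times):
    """Fit Gamma(k, theta) to the dwell times (in seconds) observed for one
    transition S_a -> S_b; the location parameter is fixed at zero."""
    k, _, theta = stats.gamma.fit(dwell_times, floc=0)   # maximum likelihood fit
    return k, theta

def most_likely_dwell_time(k, theta):
    """Mode of Gamma(k, theta), i.e. the most likely transition time (valid for k >= 1)."""
    return (k - 1) * theta

def sample_dwell_time(k, theta, rng=None):
    """Draw one plausible dwell time for a simulated transition."""
    rng = rng or np.random.default_rng()
    return rng.gamma(shape=k, scale=theta)

# Illustrative dwell times (seconds) for a single transition:
observed = [3.1, 5.4, 2.2, 8.9, 4.0, 6.3, 2.8]
k, theta = fit_dwell_time(observed)
print(most_likely_dwell_time(k, theta), sample_dwell_time(k, theta))
```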
3.5. Query-change Markov model

Users tend to generate search sessions with borderline behaviours and mixed characteristics. In fact, users may have precise search goals, yet the search process is not straightforward. Several studies have shown that the query change which occurs in a search session is heavily related to the contents of previously viewed search results [26, 27, 28]. Therefore, looking into how users change their query from $q_{i-1}$ to $q_i$ can partially explain how they express their dynamic information need in the session. As a consequence, such terms may be used for categorising users into finer groups to better simulate search-type specific sessions.

In this approach, we expand the findings of Huang et al. [29] and adopt the user's session-level query change process as an implicit indicator for categorising search behaviour and estimating the change in the user's information need. We develop a syntactic taxonomy by analysing the query change $\Delta q_i$ as the syntactic change which occurs between two consecutive queries $q_{i-1}$ and $q_i$: $\Delta q_i = q_i - q_{i-1}$. Let $W(q)$ be the bag of words for $q$, and $q_i$, $q_{i-1}$ two successive queries. A query $q_i$ can be decomposed into three main parts in relation to $q_{i-1}$, namely theme, added and removed terms, written as $q_i = \Delta q_i^= + \Delta q_i^+ - \Delta q_i^-$, where:

$\Delta q_i^= = \{w \mid w \in W(q_i), w \in W(q_{i-1})\}$;
$\Delta q_i^+ = \{w \mid w \in W(q_i), w \notin W(q_{i-1})\}$;
$\Delta q_i^- = \{w \mid w \notin W(q_i), w \in W(q_{i-1})\}$.

We refer to the common terms that appear in both queries $q_i$ and $q_{i-1}$ as theme terms and denote them $\Delta q_i^=$, as they usually represent the main theme of a session. Added terms ($\Delta q_i^+$) represent the new terms that users add to the previous query, and removed terms ($\Delta q_i^-$) refer to the terms that users delete from the previous query as they seek to improve badly performing queries. Theme terms ($\Delta q_i^=$) are generated using the longest common subsequence (LCS) approach [30] on both queries $q_i$, $q_{i-1}$. The LCS can be the common part or parts of two queries, as long as the subsequence appears in both queries in the same relative order, but not necessarily contiguously. For instance, let $S_1: q_1 \rightarrow q_2$ be a search session where $q_1$ = "tire recycling technique in Europe" and $q_2$ = "tire recycling facilities in Europe". Then $\Delta q_2^=$ = "tire recycling in Europe", $\Delta q_2^+$ = "facilities" and $\Delta q_2^-$ = "technique".

In the query-change Markov model, we employ queries as states. Specifically, we define the pair $(X_k = i, Y_k = j)$ such that it represents a user performing the $j$-th action (i.e., keeping, adding or removing query terms) after formulating the $i$-th query. The transition probability of formulating the $i$-th query given that $j$ actions were performed for the $(i-1)$-th query is modelled as follows:

$P(X_k = i \mid X_{k-1} = i-1, Y_{k-1} = j) = \frac{P(X_k = i, Y_k = j)}{P(X_{k-1} = i-1, Y_{k-1} = j)}$   (5)

Similarly to the first type of transition in the conditional approach, we compute the transition probabilities based on the first definition (Eq. 2) by looking at the current query change (i.e., adding, removing or keeping query terms) and the currently formulated query when trying to model the probability of formulating another query.
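To make the decomposition concrete, here is a minimal sketch of the theme/added/removed split using a word-level longest common subsequence, reproducing the example above; the helper names are illustrative.

```python
def lcs_words(a, b):
    """Word-level longest common subsequence of two queries (dynamic programming)."""
    A, B = a.split(), b.split()
    dp = [[[] for _ in range(len(B) + 1)] for _ in range(len(A) + 1)]
    for i, wa in enumerate(A):
        for j, wb in enumerate(B):
            if wa == wb:
                dp[i + 1][j + 1] = dp[i][j] + [wa]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def query_change(q_prev, q_cur):
    """Decompose q_cur w.r.t. q_prev into theme, added and removed terms (Section 3.5)."""
    theme = lcs_words(q_prev, q_cur)
    added = [w for w in q_cur.split() if w not in theme]
    removed = [w for w in q_prev.split() if w not in theme]
    return theme, added, removed

# Example from the paper:
theme, added, removed = query_change("tire recycling technique in Europe",
                                     "tire recycling facilities in Europe")
print(theme, added, removed)
# ['tire', 'recycling', 'in', 'Europe'] ['facilities'] ['technique']
```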
4. Experimental Setup

4.1. Dataset

We use the Sowiport User Search Session data set (SUSS) [31] (http://www.sowiport.de) for our experiments. The digital library provides records and information for the social sciences in English and German. We used data that was collected over a period of one year (from April 2014 to April 2015). The dataset contains over 480,000 individual search sessions and around 8 million log entries. We filter the logs to remove any session that does not contain a query (i.e., users having searched nothing) or has invalid query annotations. In order to describe users' search actions, SUSS offers a list of 58 different actions that covers all user activities while interacting with the interface of the digital library (e.g., formulating a query, clicking on a document, viewing the full document's content, selecting a facet, using search filters, etc.). For each user interaction, a session id, date stamp, length of the action and other additional information are stored to describe the user's path during the search process.

4.2. Evaluation Metrics

4.2.1. Kolmogorov-Smirnov-based Evaluation

Several statistical tests exist to evaluate the goodness of fit of two statistical models, such as the likelihood ratio and the Pearson chi-square [32, 33, 34]. In this work, we utilise the Kolmogorov-Smirnov goodness-of-fit test [35] in its two-sample (KS-2) variant. The advantage of this test is that it is one of the most useful non-parametric methods for comparing two sample distributions. Essentially, we test the null hypothesis that the two independent samples belong to the same continuous distribution, calculating the absolute value of the distance between the two data samples, which we refer to as the test statistic $d$, to compare their distributions for similarities. The test statistic $d$ is calculated as follows:

$d = \sup_x |E_{n_1}(x) - E_{n_2}(x)|$   (6)

where $n_1$ and $n_2$ are the numbers of observations in the first and second sample respectively, and $E_{n_1}(x)$ and $E_{n_2}(x)$ are the empirical distribution functions of the first and second sample, obtained from the $n_1$ and $n_2$ observations. We then compare the test statistic against the critical value derived from the KS-2 table to either accept or reject the hypothesis. The null hypothesis is accepted when the test statistic is less than the table value and rejected otherwise.

4.2.2. Classification-based Evaluation

In addition to the KS-2 test, we define a classification-based evaluation to (1) evaluate the simulation performance of our models and (2) calculate their improvement in comparison to the baseline. In order to evaluate how well our models simulate user search behaviour, we first develop a set of features that represent the sequentiality of a user search session in the form of a feature vector. Then we train a classifier to distinguish simulated sessions from real log data sessions and report the results.

Inspired by the findings in [36] about what kinds of engineered features are best suited to various machine learning model types, we develop a set of features that represent the sequentiality of the search session (i.e., query search types; search types; advanced search types; clicking, viewing and exporting actions) and discard those that only describe the user's overall search behaviour (e.g., tally of search actions, query formulations and clicks). We used one-cold encoding to indicate the presence of a feature (i.e., (0) if present and (1) if not) and ordered the features by their position in the sequence (i.e., i_feature, where i refers to the sequence order of the query in a session, e.g., 1_search, 2_view_record). The resulting feature vector has a higher dimensionality (170) than the raw feature categories (58) due to the inclusion of the sequence order in the binary feature vector. Fig. 1 gives an example of a user session in the form of a feature vector.

["1_search", "1_search_change_paging", "1_view_record", "2_search", "2_search_change_paging", "3_search", "3_view_record", "3_view_description", "end"]
Figure 1: Example of a user search session in the form of a feature vector.

Another important feature, used in our third approach as described in Section 3.4, is the dwell time. Dwell time in a search action refers to the amount of time that the user spends between the current and the next action (formulating a query, browsing a search result page, assessing a document, ending the session, etc.). We estimated the dwell times using a gamma distribution for each state transition and added them to the list of features described above.

Each user session is converted to a feature vector, labelled and fed to a classifier. This task was repeated separately for each of the Markov approaches defined in Section 3. We combined an equal amount of data from log sessions and simulated sessions for a balanced classification and split our sequential dataset of user actions into a training set (consisting of 80%), on which we trained the classifiers, and a test set (consisting of 20%), on which we evaluated the model. To evaluate how well the classification results generalise, we adopted 10-fold cross-validation in all classification experiments. We ran all experiments using the simulated sessions generated by each approach and report the average score over 10 independent runs. We used five of the most popular algorithms in binary classification, namely Logistic Regression [37], Support Vector Machine [38], K-Nearest Neighbours [39], Decision Trees [40] and Random Forest [41], and report the average score.
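For concreteness, a minimal sketch of the ordered one-cold encoding illustrated in Fig. 1, under the assumption that the feature vocabulary is collected from the position-prefixed action labels observed in the data; names are illustrative and this is not the authors' implementation.

```python
def build_vocabulary(sessions):
    """Collect the ordered feature names (e.g. '1_search', '2_view_record', 'end')
    observed across all sessions; the paper reports 170 such features."""
    return sorted({feat for session in sessions for feat in session})

def one_cold_encode(session, vocab):
    """One-cold encoding: 0 if the feature occurs in the session, 1 if it does not."""
    present = set(session)
    return [0 if feat in present else 1 for feat in vocab]

# Example using the session from Fig. 1:
sessions = [["1_search", "1_search_change_paging", "1_view_record",
             "2_search", "2_search_change_paging",
             "3_search", "3_view_record", "3_view_description", "end"]]
vocab = build_vocabulary(sessions)
print(dict(zip(vocab, one_cold_encode(sessions[0], vocab))))
```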
We also used automated machine learning (Auto-sklearn [42]), as it employs an ensemble of the top-performing models discovered during the optimisation process and reports the best result. Since we are interested in finding a classifier that is close to 100% recall on the real log sessions (i.e., successful in detecting all real log sessions) and has a high recall on the simulated sessions (i.e., good at detecting most of the simulated sessions), we incorporate a bias in the classifier by weighting the class of real data ($w_{real} = 10^4$, $w_{simulated} = 1$), as we need to penalise bad predictions on real log sessions.

In order to calculate the improvements of our models in comparison to the baseline, we utilise metrics such as Precision, Recall, F-measure and Accuracy, which are common for objectively measuring a classifier's performance. In our case, we consider a True Positive (TP) to be the scenario where the model classifies simulated sessions as simulated. A score of 0.5 (chance level for balanced binary classification) means that the classifier cannot distinguish between simulated and real log sessions and, therefore, that the simulated sessions are similar to real log data sessions. With a score below 0.5, the classifier performs even worse than chance level, whereas with a score of 1, simulated sessions and log data are completely different. Since we can distinguish between real log and simulated sessions, reporting the accuracy alone can obfuscate some of the performance that the F-measure would highlight. The F-measure tells how precise the classifier is (i.e., how many instances it classifies correctly), as well as how robust it is (i.e., it does not miss a significant number of instances). In fact, if the F-measure shows low precision/recall along with a low accuracy, we can have better confidence in the results. Therefore, we utilise all four metrics to demonstrate the relative performance and the consistency of the results.

5. Experimental Results

5.1. Kolmogorov-Smirnov-based Evaluation

For this evaluation, we used the baseline Markov model described in Section 3.1 to simulate global user search sessions and a separate model for each approach to simulate type-specific user search sessions. For each model, we utilise the transition probabilities between states, drawn from the log sessions and the simulated sessions separately, to generate two independent samples. By feeding these data points into the given formula (Eq. 6), we obtained the results reported in Table 1. The p-value provides a measure of how much we can trust the statistical test; in general, the closer the p-value is to zero, the more reliable the test is. Likewise, the critical value for the two samples can be approximated using the following formula:

$D_{n_1,n_2} = c(\alpha)\sqrt{\frac{n_1 + n_2}{n_1 \cdot n_2}}$

where $\alpha$ refers to the level of significance and $n_1$, $n_2$ are the numbers of observations in the first and second sample respectively. In our experiment, we set $n_1 = n_2 = 300{,}000$ and $\alpha = 0.01$.
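A minimal sketch of this evaluation, assuming SciPy's two-sample Kolmogorov-Smirnov test and the critical-value approximation above (for which $c(0.01) \approx 1.63$); the sample data below merely stands in for the two sets of transition probabilities and is purely illustrative.

```python
import numpy as np
from scipy import stats

def ks2_evaluation(real_sample, simulated_sample, alpha=0.01):
    """Two-sample KS test between transition-probability samples drawn from the
    real log sessions and from the simulated sessions (cf. Eq. 6 and Table 1)."""
    res = stats.ks_2samp(real_sample, simulated_sample)
    n1, n2 = len(real_sample), len(simulated_sample)
    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2))            # c(0.01) ~ 1.63
    d_critical = c_alpha * np.sqrt((n1 + n2) / (n1 * n2))
    # Retain the null hypothesis (same distribution) when D < D_critical.
    return res.statistic, res.pvalue, d_critical, res.statistic < d_critical

# Purely illustrative samples standing in for the two transition-probability sets:
rng = np.random.default_rng(0)
real = rng.beta(2.0, 5.0, size=300_000)
simulated = rng.beta(2.0, 5.0, size=300_000)
print(ks2_evaluation(real, simulated))
```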
Table 1: Two-sample Kolmogorov-Smirnov test of the log session and simulated session distributions using the different Markov model approaches. The p-value provides a measure of how much we can trust the statistical test (the closer the p-value is to zero, the more reliable the test is).

| Approach | D-statistic (p-value) | D-critical |
|---|---|---|
| Basic (Baseline) | 0.00417 (7.44e-11) | 0.00421 |
| Conditional | 0.00367 (1.36e-12) | 0.00398 |
| Contextual – Exploratory | 0.00381 (6.54e-12) | 0.00389 |
| Contextual – Known-item | 0.00302 (4.11e-12) | 0.00356 |
| Time-aware | 0.00406 (6.57e-12) | 0.00415 |
| Query-change | 0.00387 (3.88e-12) | 0.00391 |

Table 1 shows that the test statistic is smaller than the critical value across all models, hence we retain the null hypothesis. Therefore, we conclude that the simulated and the real log sessions belong to the same distribution.

5.2. Classification-based Evaluation

Since the KS-2 statistics fall below the critical value for all approaches, the test does not show that our different approaches improve the simulation, or at least the improvement is hard to quantify using a KS-2 test. Therefore, we adopt a second evaluation method: we investigate whether we can train a classifier to distinguish between real log and simulated sessions through controlled scenarios. For each scenario, we simulate an amount of sessions equal to that present in the log data to balance the class distribution for each approach, namely baseline, conditional, contextual, time-aware and query-change. We then investigate whether we can distinguish between log sessions and simulated sessions generated using the approaches defined in Section 3.

Table 2: Classification of real log sessions vs. simulated sessions using the baseline and the different Markov model approaches. We report the accuracy, recall, precision and F-measure across 10-CV folds while (a) averaging over the five classifiers defined in Section 4.2.2 and (b) using Auto-sklearn. W.Sum is the weighted average score (with 0.39 and 0.61 being the relative sizes of the exploratory and known-item models respectively) and RI is the relative improvement compared to the baseline. Bold indicates the best result in terms of the corresponding metric. Lower results are better, as we aim to reduce the classifier's ability to distinguish between real log and simulated sessions.

(a) Averaging over the five classifiers defined in Section 4.2.2

| Approach | Accuracy (RI) | Recall (RI) | Precision (RI) | F-measure (RI) |
|---|---|---|---|---|
| Basic (Baseline) | 0.661 | 0.773 | 0.756 | 0.750 |
| Conditional | 0.545 (17.54%) | 0.578 (25.22%) | 0.589 (22.08%) | 0.583 (22.27%) |
| Contextual – Exploratory | 0.608 | 0.527 | 0.636 | 0.568 |
| Contextual – Known-item | 0.564 | 0.487 | 0.573 | 0.509 |
| Contextual – W.Sum | 0.581 (12.08%) | 0.502 (34.98%) | 0.597 (20.96%) | 0.532 (29.07%) |
| Time-aware | 0.653 (1.21%) | 0.764 (1.16%) | 0.732 (3.17%) | 0.748 (2.67%) |
| Query-change | 0.551 (16.64%) | 0.582 (24.71%) | 0.601 (20.50%) | 0.562 (25.06%) |

(b) Using Auto-sklearn

| Approach | Accuracy (RI) | Recall (RI) | Precision (RI) | F-measure (RI) |
|---|---|---|---|---|
| Basic (Baseline) | 0.660 | 0.764 | 0.748 | 0.746 |
| Conditional | 0.512 (22.42%) | 0.577 (24.47%) | 0.582 (22.19%) | 0.579 (22.38%) |
| Contextual – Exploratory | 0.612 | 0.531 | 0.631 | 0.577 |
| Contextual – Known-item | 0.568 | 0.486 | 0.577 | 0.528 |
| Contextual – W.Sum | 0.585 (11.34%) | 0.503 (34.09%) | 0.598 (20.05%) | 0.546 (26.71%) |
| Time-aware | 0.655 (7.57%) | 0.742 (2.87%) | 0.733 (2.01%) | 0.737 (1.21%) |
| Query-change | 0.536 (18.78%) | 0.589 (22.91%) | 0.591 (20.98%) | 0.590 (20.91%) |

Table 2 shows that the conditional Markov model, which simulates user actions based on the query-action approach, achieves higher performance in comparison to the baseline, with a relative improvement of 22% in the F-measure score. This demonstrates that separating user search actions depending on the query's ordinal position helps improve the simulation quality.
Table 2 also shows that, when using the contextual Markov approach, the model did better at simulating sessions for known-item search types, with an F-measure score of 0.509, than for exploratory search types, with a score of 0.568. One possible explanation is that known-item sessions are probably easier to simulate since there is less variation. The exploratory group of users generates longer sessions, and thus a higher total of state transitions, which results in a more diverse set of simulated sessions. Using search types to discover context helped create simulated sessions more similar to the original ones (i.e., it reduced the accuracy of the classifier). In addition, Table 2a also reports low precision (i.e., 0.573 for known-item and 0.636 for exploratory) and recall (i.e., 0.487 for known-item and 0.527 for exploratory) values along with a low accuracy score (i.e., 0.564 for known-item and 0.608 for exploratory) when using the exploratory/known-item approach in comparison to the baseline, which indicates that we can have better confidence in the results.

Our initial hypothesis was that the baseline model should outperform the time-aware Markov model, since the baseline is less complex to learn (e.g., it excludes dwell-time features): the less information that needs to be encoded in the model, the easier it is to obtain "similar" simulated data. However, adding time improves our model's simulation quality, as we gained around 2.67%/1.21% in terms of relative improvement over the baseline. Another hypothesis that we tested was to replace the time distribution between actions estimated with the gamma distribution by the calculated average dwell time of the same state transition, fed as a feature to the model [43]. We concluded that the gamma estimation scored much better than averaging the dwell time value. Although these results do not indisputably prove that we succeeded in simulating look-alike user search sessions, they show that users in our simulated sessions behave in a way that comes closer to the original behaviour in the log data.

6. Conclusion

In this paper we have presented several variants of Markov model-based simulation techniques that simulate global and user-type specific behaviour models. In our approaches, we first introduced a formal baseline that consists of a first-order Markov model, which we argue to be the best fit for our dataset. Then we extended the work of Baeza-Yates et al. [20] and presented a conditional Markov approach adapted for simulation modelling. Next, we introduced the problem of learning contextual Markov models, where we partitioned users according to their navigation behaviour; in particular, we grouped users by learning a mixture of search characteristics that categorises them into exploratory search and known-item search types. Then we built on the work of Huang et al. [29] and adopted the user's session-level query change to develop a syntactic Markov model that models the syntactic change which occurs between two consecutive queries. Finally, in our time-aware approach, we modelled the dwell time that users spend between actions using the gamma distribution and fed it as a feature to our evaluation model. There are several straightforward extensions to our work: we could use better clustering techniques for context discovery in contextual Markov models, and we could model the duration of state transitions differently.
We performed experiments on a real-world dataset with conditional, contextual, time-aware and query-change Markov models and provided further empirical results showing that our Markov approaches succeeded in simulating sessions that are deemed to be good approximations of actual user sessions, making our Markov model approaches a suitable choice for user session simulation.

Acknowledgments

This work has been partially carried out within the project "SINIR: Simulating INteractive Information Retrieval" funded by the DFG (grant HA 5851/3-1).

References

[1] H. Xie, D. Wolfram, State digital library usability: Contributing organizational factors, Journal of the American Society for Information Science and Technology 53 (2002) 1085–1097.
[2] Y. Zhang, X. Liu, C. Zhai, Information retrieval evaluation as search simulation: A general formal framework for IR evaluation, in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, 2017, pp. 193–200.
[3] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of navigation patterns on a web site using model-based clustering, in: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000, pp. 280–284.
[4] P. L. Pirolli, J. E. Pitkow, Distributions of surfers' paths through the world wide web: Empirical characterizations, World Wide Web 2 (1999) 29–45.
[5] E. Manavoglu, D. Pavlov, C. L. Giles, Probabilistic user behavior models, in: Third IEEE International Conference on Data Mining, IEEE, 2003, pp. 203–210.
[6] P. Haider, L. Chiarandini, U. Brefeld, Discriminative clustering for market segmentation, in: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012, pp. 417–425.
[7] Y. Lucas, P.-E. Portier, L. Laporte, L. He-Guelton, O. Caelen, M. Granitzer, S. Calabretto, Towards automated feature engineering for credit card fraud detection using multi-perspective HMMs, Future Generation Computer Systems 102 (2020) 393–402.
[8] A. Krogh, B. Larsson, G. Von Heijne, E. L. Sonnhammer, Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, Journal of Molecular Biology 305 (2001) 567–580.
[9] Y. Xie, S.-Z. Yu, A large-scale hidden semi-Markov model for anomaly detection on user browsing behaviors, IEEE/ACM Transactions on Networking 17 (2008) 54–65.
[10] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of navigation patterns on a web site using model-based clustering, in: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000, pp. 280–284.
[11] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, H. Li, BrowseRank: letting web users vote for page importance, in: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008, pp. 451–458.
[12] L. Azzopardi, The economics in interactive information retrieval, in: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, 2011, pp. 15–24.
[13] X. Dongshan, S. Junyi, A new Markov model for web access prediction, Computing in Science & Engineering 4 (2002) 34–39.
[14] M. Deshpande, G. Karypis, Selective Markov models for predicting web page accesses, ACM Transactions on Internet Technology (TOIT) 4 (2004) 163–184.
[15] J. Borges, M. Levene, Evaluating variable-length Markov chain models for analysis of user web navigation sessions, IEEE Transactions on Knowledge and Data Engineering 19 (2007) 441–452.
[16] Y. Lucas, P.-E. Portier, L. Laporte, L. He-Guelton, O. Caelen, M. Granitzer, S. Calabretto, Towards automated feature engineering for credit card fraud detection using multi-perspective HMMs, Future Generation Computer Systems 102 (2020) 393–402.
[17] J. Kiseleva, H. Thanh Lam, M. Pechenizkiy, T. Calders, Discovering temporal hidden contexts in web sessions for user trail prediction, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 1067–1074.
[18] D. Budd, L. C. Burns, Z. Guo, G. L'Italien, P. Lapuerta, Impact of early intervention and disease modification in patients with predementia Alzheimer's disease: a Markov model simulation, ClinicoEconomics and Outcomes Research: CEOR 3 (2011) 189.
[19] M. H. Davis, Markov models and optimization, Routledge, 2018.
[20] R. Baeza-Yates, C. Hurtado, M. Mendoza, G. Dupret, Modeling user search behavior, in: Third Latin American Web Congress (LA-WEB 2005), IEEE, 2005.
[21] G. Marchionini, Exploratory search: from finding to understanding, Communications of the ACM 49 (2006) 41–46.
[22] K. Athukorala, D. Głowacka, G. Jacucci, A. Oulasvirta, J. Vreeken, Is exploratory search different? A comparison of information search behavior for exploratory and lookup tasks, Journal of the Association for Information Science and Technology 67 (2016) 2635–2651. doi:10.1002/asi.23617.
[23] J. Mun, Understanding and choosing the right probability distributions, Advanced Analytical Models (2015) 899–917.
[24] A. Hassan, R. Jones, K. L. Klinkner, Beyond DCG: user behavior as a predictor of a successful search, in: Proceedings of the third ACM international conference on Web search and data mining, 2010, pp. 221–230.
[25] R. V. Hogg, J. McKean, A. T. Craig, Introduction to mathematical statistics, Pearson Education, 2005.
[26] J. Jiang, C. Ni, What affects word changes in query reformulation during a task-based search session?, 2016, pp. 111–120. doi:10.1145/2854946.2854978.
[27] H. Yang, D. Guan, S. Zhang, The query change model: Modeling session search as a Markov decision process, ACM Transactions on Information Systems 33 (2015). doi:10.1145/2747874.
[28] M. Sloan, H. Yang, J. Wang, A term-based methodology for query reformulation understanding, Information Retrieval Journal 18 (2015) 145–165. doi:10.1007/s10791-015-9251-5.
[29] J. Huang, E. N. Efthimiadis, Analyzing and evaluating query reformulation strategies in web search logs, in: Proceedings of the 18th ACM conference on Information and knowledge management, 2009, pp. 77–86.
[30] D. S. Hirschberg, Algorithms for the longest common subsequence problem, Journal of the ACM 24 (1977) 664–675. doi:10.1145/322033.322044.
[31] P. Mayr, Sowiport user search sessions data set (SUSS), GESIS Datorium (2016).
[32] P. Billingsley, Statistical methods in Markov chains, The Annals of Mathematical Statistics (1961) 12–40.
[33] R. N. Hiscott, Chi-square tests for Markov chain analysis, Journal of the International Association for Mathematical Geology 13 (1981) 69–80.
[34] T. W. Anderson, L. A. Goodman, Statistical inference about Markov chains, The Annals of Mathematical Statistics (1957) 89–110.
[35] F. J. Massey Jr., The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association 46 (1951) 68–78.
[36] J. Heaton, An empirical analysis of feature engineering for predictive modeling, SoutheastCon 2016 (2016). doi:10.1109/secon.2016.7506650.
[37] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, M. Klein, Logistic regression, Springer, 2002.
[38] W. S. Noble, What is a support vector machine?, Nature Biotechnology 24 (2006) 1565–1567.
[39] S. A. Dudani, The distance-weighted k-nearest-neighbor rule, IEEE Transactions on Systems, Man, and Cybernetics (1976) 325–327.
[40] J. R. Quinlan, Simplifying decision trees, International Journal of Man-Machine Studies 27 (1987) 221–234.
[41] M. Belgiu, L. Drăguţ, Random forest in remote sensing: A review of applications and future directions, ISPRS Journal of Photogrammetry and Remote Sensing 114 (2016) 24–31.
[42] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, F. Hutter, Auto-sklearn: efficient and robust automated machine learning, in: Automated Machine Learning, Springer, Cham, 2019, pp. 113–134.
[43] V. Tran, D. Maxwell, N. Fuhr, L. Azzopardi, Personalised search time prediction using Markov chains, in: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, 2017, pp. 237–240.