
1613-0073

Conventional Metrics: Assessing User Simulators in Information Retrieval

Saber Zerhoudi

saber.zerhoudi@uni-passau.de

Michael Granitzer

michael.granitzer@uni-passau.de

Evaluation Metrics, Simulation Evaluation, Information Retrieval

University of Passau, Germany

Figure 1: Bootstrap analysis of simulation approaches

Traditional evaluation methods for user simulators in Information Retrieval systems have limitations in assessing their reliability for comparative analysis. To address this, we apply the Fréchet Distance (FD) to measure similarity between real and simulated user search session distributions. Using the TREC Session 2014 Track dataset, we compare FD's performance against established metrics like session nDCG and Expected Global Utility. Our study explores FD's effectiveness in various scenarios, including those with minimal and extensive interaction data, and examines its sensitivity to different feature extraction methods. Results show that FD correlates strongly with existing metrics while offering unique insights into session similarity, particularly for complex, multi-query sessions. FD demonstrates robustness across feature extraction techniques and versatility in various evaluation scenarios. This research contributes to the field of Interactive Information Retrieval (IIR) by providing a more comprehensive framework for evaluating simulated search sessions.

CEUR ceur-ws.org

1. Introduction

The evolution of Interactive Information Retrieval (IIR) systems has introduced new challenges in performance evaluation, particularly for simulated search sessions. Traditional metrics often fail to capture the complex, dynamic nature of user interactions in modern search environments, which involve query sequences, diverse user actions, and temporal elements. Conventional methods typically require extensive real user interaction data, which is costly and difficult to obtain, and may not adequately assess simulation fidelity for comparative analyses across different IIR systems.

To address these limitations, we propose the application of Fréchet Distance (FD) as a novel metric for evaluating the similarity between real and simulated user search sessions in IIR. This approach extends FD’s successful application from fields like computer vision to information retrieval. FD offers a quantitative measure of how well simulated data replicates complex patterns of real user behavior, potentially serving as a standard for assessing user simulator performance and reliability in IIR systems. Our investigation into FD’s efficacy for IIR systems is guided by four research questions: (1) How effectively does FD measure the quality of simulated search sessions with minimal interaction data? (2) Can FD accurately evaluate the quality of simulated search sessions with extensive interaction data? (3) How correlated is the performance of IR systems in simulated search sessions, as measured by FD, compared to other metrics used to assess the similarity of simulated user sessions? (4) How sensitive is FD to different feature extraction methods when assessing the similarity of simulated search sessions?

Our experimental study utilizes the TREC Session 2014 Track dataset [1], which provides detailed logs of user interactions across multiple search sessions. This dataset is particularly suitable for our research due to its structured representation of query sequences and user actions over time. To compute FD, we employ various feature extraction methods to create vector representations of both simulated and real sessions, ranging from simple query-based embeddings to more sophisticated approaches like BERT-based Session Embedding and Time-aware Session Embedding.

We investigate the correlation between FD and established session similarity metrics such as session nDCG (sDCG) [2] and Expected Global Utility (EGU) [3], examine FD’s sensitivity to different feature extraction methods, and explore its performance across various session lengths and complexities. Additionally, we assess FD’s performance with both minimal and extensive interaction data, providing insights into its versatility as an evaluation metric. By integrating this approach into IIR system evaluation, our research aims to demonstrate FD’s potential as a metric that not only compares the quality of simulated and real user sessions but also offers insights into the reliability and accuracy of different simulation methodologies. This study could significantly enhance the development and assessment of user simulators, leading to more effective and user-centric interactive information retrieval systems.

2. Related Work

The evaluation of interactive information retrieval (IIR) systems has evolved significantly, moving from single query-based metrics to more comprehensive session-based measures. This shift reflects the recognition of complex, multi-query search behaviors and the limitations of traditional evaluation methods. Early efforts to address this led to the development of session-based extensions of traditional metrics, such as Session nDCG (sDCG) by Järvelin et al. [2], which applies a discount to results from later queries in a session.

Building on this concept, Yang and Lad [3] proposed a framework for modeling user browsing behavior and computing Expected Global Utility (EGU) over a session, while Kanoulas et al. [4] introduced the concept of modeling a user’s browsing behavior as a “path.” Although these session-based metrics represented a significant advancement, they often did not model detailed browsing behaviors, such as clicking decisions. The development of click models [5] addressed this gap, providing insights into user interaction patterns. Fuhr [6] proposed the Interactive IR Probability Ranking Principle (IIR-PRP), which theoretically integrates a user’s clicking decision with a measure of overall utility of a ranked list. Zerhoudi et al. [7] used the two-sample Kolmogorov-Smirnov (KS-2) goodness-of-fit test and a classification-based evaluation to evaluate simulated user interactions in the context of a search session. However, most existing session-based evaluation measures and click models assume sequential browsing, which may not hold in modern search interfaces with complex layouts and interaction possibilities.

The Fréchet metric, a natural measure of similarity between two curves, has gained prominence in various applications [8]. This metric can be intuitively understood by imagining a dog and its handler walking on separate curves. Both can control their speed but must move forward, with the Fréchet distance representing the minimal leash length required for them to traverse their respective paths from start to finish. Due to its effectiveness in comparing curve similarities, the Fréchet distance and its variants have found widespread use across diverse fields. These applications include dynamic time-warping [9], speech recognition [10], and matching of time series in databases [11]. The versatility of the Fréchet metric in these domains underscores its significance in analyzing and comparing complex, non-linear data patterns.

Inspired by the success of distribution-based metrics in other fields, such as the Fréchet Inception Distance (FID) in computer vision [12], our work introduces the Fréchet Distance (FD) as a novel metric for evaluating simulated search sessions in IIR. This approach addresses several limitations of existing methods by handling complex interactions, requiring minimal data, providing distribution-based comparisons, offering flexibility in feature representation, and investigating correlations with established metrics.

By adapting FD to the evaluation of simulated search sessions, our work bridges the gap between advanced evaluation techniques in other fields and the specific needs of IIR evaluation. This approach offers a promising direction for developing more accurate and comprehensive evaluation methods for modern IIR systems, particularly in scenarios where traditional metrics may fall short due to data sparsity or complex user interaction patterns.

Arabzadeh and Clarke [13] proposed using the Fréchet Distance to measure the distance between the distributions of relevant judged items and retrieved results. Their proposal aligns with our approach and further supports its potential as a robust and flexible metric. Their research provides additional evidence for the effectiveness of distribution-based comparisons in scenarios with sparse data, complementing our exploration of FD for simulated search sessions.

3. Fréchet Distance for Evaluating Simulated Search Sessions

3.1. Fréchet Distance

The Fréchet distance is a measure of dissimilarity between two curves or trajectories. It can be conceptualized as the minimum leash length required for a dog walking along one path while its owner walks along another, with both potentially moving at different speeds [14, 15].

Formally, given two curves $P$ and $Q$ represented as sequences of points in a metric space, the Fréchet distance $F(P, Q)$ is computed as:

$$F(P, Q) = \inf_{\alpha, \beta} \; \max_{t \in [0, 1]} \; d\big(P(\alpha(t)),\, Q(\beta(t))\big)$$

where $P$ and $Q$ are continuous maps from $[0, 1]$ to a metric space with distance function $d$, and $\alpha$ and $\beta$ are continuous, non-decreasing surjections representing reparameterizations of $[0, 1]$. This formulation ensures that neither the dog nor its owner can backtrack along their respective curves.
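For sampled curves, the continuous definition is commonly approximated by the discrete Fréchet distance of Eiter and Mannila [14]. The following is a minimal Python sketch of that dynamic program, assuming curves are given as lists of coordinate tuples and that the metric $d$ is Euclidean. The function name `discrete_frechet` and the example curves are illustrative assumptions, not part of the original study.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def discrete_frechet(p: List[Point], q: List[Point]) -> float:
    """Discrete Fréchet distance between two polylines via dynamic programming."""
    if not p or not q:
        raise ValueError("both curves need at least one point")
    n, m = len(p), len(q)
    # ca[i][j] = shortest leash that lets the walkers reach p[i] and q[j]
    # while only ever moving forward along their curves.
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (assumed metric)
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # One walker or both advance by one point; nobody backtracks.
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Two parallel horizontal segments one unit apart (toy data).
curve_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
curve_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(curve_a, curve_b))  # 1.0
```

The table has one cell per pair of points, so the sketch runs in time proportional to the product of the two curve lengths.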

The Fréchet distance can also be applied to assess the disparity between probability distributions [12]. For two univariate normal distributions $X$ and $Y$, the Fréchet Distance is given by:

$$F(X, Y) = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2,$$

where $\mu$ and $\sigma$ represent the mean and standard deviation of the distributions, respectively. This versatility makes the Fréchet distance a powerful tool for comparing both geometric curves and statistical distributions in various fields of study.

3.2. Fréchet Distance for Evaluating Simulated Search Sessions

The evaluation of simulated search sessions using Fréchet Distance provides a robust method for assessing the quality of simulation models in Information Retrieval (IR). This approach considers both the semantic content and sequential nature of user actions within search sessions. Let $S$ represent a set of simulated search sessions, where each session $s \in S$ consists of a sequence of user actions $a_1, \dots, a_m$. These actions may include queries, clicks, scrolls, or other interactions with the search engine. $I$ denotes the set of ideal or expected actions for the sessions in $S$. The function $f(s)$ generates a sequence of actions for a given simulated session $s$, producing the simulated action sequence.

To apply Fréchet Distance, we map the actions to a suitable embedding space using a function $E(\cdot)$, which transforms any action into an $n$-dimensional vector. This embedding captures both semantic and behavioral aspects of each action. The Fréchet Distance for Simulated Search Sessions ($\mathrm{FD}_S$) is then calculated as:

$$\mathrm{FD}_S = \mathrm{FD}\big(E(I),\, E(f(S))\big) \tag{5}$$

Here, FD measures the distance between the distribution of the set of embeddings of the ideal actions $E(I)$ and those of the simulated actions $E(f(S))$. A lower $\mathrm{FD}_S$ indicates higher similarity between simulated and ideal actions, suggesting better performance of the simulation model on the session set $S$. To account for the sequential nature of search sessions, we extend this measure to consider the order of actions within each session:

$$\mathrm{FD}_{\mathrm{seq}} = \frac{1}{|S|} \sum_{s \in S} \mathrm{FD}\big(E(I_s),\, E(f(s))\big) \tag{6}$$

where $I_s$ denotes the ideal actions for session $s$. This sequential Fréchet Distance ($\mathrm{FD}_{\mathrm{seq}}$) calculates the average Fréchet Distance between ideal and simulated action sequences for each session. This provides a more nuanced evaluation of the simulation model’s ability to capture the temporal dynamics of user behavior in search sessions. By incorporating both semantic content and sequential information, this approach offers a comprehensive evaluation framework for simulated search sessions, enabling researchers to assess and improve the fidelity of their simulation models in IR experiments.
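To make the set-level and sequential variants concrete, the sketch below computes both on toy embeddings. It rests on a simplifying assumption that is not taken from the paper: each embedding dimension is treated as an independent univariate normal, so the univariate formula is summed over dimensions and cross-dimension covariance is ignored. The names `gaussian_fd` and `sequential_fd` and all data values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence

Vector = Sequence[float]


def gaussian_fd(real: List[Vector], simulated: List[Vector]) -> float:
    """FD between two embedding sets under an independent-dimensions assumption."""
    dims = len(real[0])
    total = 0.0
    for k in range(dims):
        r = [v[k] for v in real]
        s = [v[k] for v in simulated]
        # Univariate formula: (mu_X - mu_Y)^2 + (sigma_X - sigma_Y)^2.
        total += (fmean(r) - fmean(s)) ** 2 + (pstdev(r) - pstdev(s)) ** 2
    return total


def sequential_fd(
    real_sessions: List[List[Vector]],
    simulated_sessions: List[List[Vector]],
) -> float:
    """Average the per-session FD over paired sessions (the sequential variant)."""
    if len(real_sessions) != len(simulated_sessions):
        raise ValueError("sessions must be paired one to one")
    per_session = [
        gaussian_fd(r, s) for r, s in zip(real_sessions, simulated_sessions)
    ]
    return fmean(per_session)


# Toy 2-dimensional action embeddings (illustrative values only).
ideal = [(0.0, 0.0), (2.0, 2.0)]
simulated = [(1.0, 0.0), (3.0, 2.0)]

print(gaussian_fd(ideal, simulated))  # means differ by 1 in one dimension
print(sequential_fd([ideal, ideal], [ideal, simulated]))
```

A full multivariate treatment would replace the per-dimension sum with the mean and covariance form used by FID [12]; the diagonal version is shown only because it needs no linear algebra library.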

4. Experimental Setup

In this section, we describe the general settings of our experiments, including the dataset, traditional evaluation metrics, click models, simulation framework, and the embeddings used to represent user search sessions.

4.1. Dataset

This study employs the TREC Session 2014 Track dataset [1], which is designed for evaluating multi-query search behavior. The dataset comprises 1,257 sessions with 4,680 queries and 1,685 clicks, averaging 4.33 queries per session (median: 2). It includes real user queries, interaction logs, and ranked document lists with snippets, making it ideal for simulated search session evaluation. The dataset’s diverse composition allows for a comprehensive analysis of user interactions and search strategies in multi-query scenarios, enabling robust conclusions about the effectiveness of various information retrieval techniques.

4.2. Click Models and Simulation Framework

This study employs a comprehensive approach to simulate user search sessions, utilizing both traditional probabilistic graphical models and a neural click model. The traditional models include the Position-Based Model (PBM) [16], User Browsing Model (UBM) [17], Dependent Click Model (DCM) [18], and Dynamic Bayesian Network Model (DBN) [19], implemented using the PyClick library. Additionally, we incorporate the neural click model NCM [20] to enhance simulation diversity. To create complete user search sessions, we use the SimIIR 2.0 framework [21], which simulates complex user behaviors including query formulation, result list examination, and click decisions. This integrated approach allows for a comprehensive assessment of FD’s ability to quantify the quality of simulated search sessions across a broad spectrum of user behaviors.
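As an illustration of how the simplest of these models generates clicks, the sketch below samples a click vector under the Position-Based Model, where the click probability at a rank is the product of a rank-dependent examination probability and a document-dependent attractiveness. It is a standalone toy and does not use the PyClick API; the function name and all parameter values are hypothetical.

```python
import random
from typing import List


def simulate_pbm_clicks(
    attractiveness: List[float],
    examination: List[float],
    rng: random.Random,
) -> List[int]:
    """Sample clicks with P(click at rank r) = examination[r] * attractiveness[r]."""
    if len(attractiveness) > len(examination):
        raise ValueError("need an examination probability for every rank")
    clicks = []
    for rank, attr in enumerate(attractiveness):
        p_click = examination[rank] * attr
        # rng.random() is uniform on [0, 1), so the comparison is a Bernoulli draw.
        clicks.append(1 if rng.random() < p_click else 0)
    return clicks


# Hypothetical parameters: examination decays with rank,
# attractiveness depends on the query-document pair.
examination = [1.0, 0.6, 0.4, 0.2, 0.1]
attractiveness = [0.9, 0.1, 0.5, 0.0, 0.7]

rng = random.Random(42)  # fixed seed for reproducibility
clicks = simulate_pbm_clicks(attractiveness, examination, rng)
print(clicks)
```

In practice both parameter vectors would be estimated from click logs rather than set by hand.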

4.3. Embeddings for Search Session Representation

This study explores two approaches for embedding search sessions to apply the Fréchet Distance metric. The first approach uses action-based embeddings, employing a fine-tuned BERT model [22] to embed individual user actions and aggregating them through mean pooling or sequence modeling. The second approach adapts Doc2Vec [23] to create Session2Vec [24], learning fixed-length vector representations of entire search sessions. Both methods aim to capture semantic and behavioral aspects of user search sessions. For query and document representations, pre-trained word embeddings are used, with query embeddings computed as the average of query word embeddings and document embeddings derived from title and snippet words. These approaches provide insights into effectively capturing search behavior nuances and improving simulated search session evaluation using the Fréchet Distance metric.
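The averaging steps described above can be sketched in a few lines. The example assumes a small in-memory dictionary of word vectors standing in for pre-trained embeddings; the vocabulary, the vector values, and the helper names `mean_pool` and `embed_query` are hypothetical and only illustrate the pooling arithmetic.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty list")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def embed_query(query: str, word_vectors: Dict[str, Vector]) -> Vector:
    """Average the embeddings of the query words found in the vocabulary."""
    known = [word_vectors[w] for w in query.lower().split() if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in query: {query!r}")
    return mean_pool(known)


# Toy 2-dimensional vocabulary (hypothetical values, for illustration only).
toy_vectors = {
    "frechet": [1.0, 0.0],
    "distance": [0.0, 1.0],
    "metric": [1.0, 1.0],
}

q1 = embed_query("Frechet distance", toy_vectors)  # [0.5, 0.5]
q2 = embed_query("distance metric", toy_vectors)   # [0.5, 1.0]

# Mean pooling over the action embeddings gives one vector per session.
session_vector = mean_pool([q1, q2])
print(session_vector)  # [0.5, 0.75]
```

The same pooling function would apply to action embeddings produced by a BERT encoder, since it only assumes vectors of equal length.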

4.4. Evaluation Process

Our evaluation process involves generating simulated search sessions using various click models and SimIIR 2.0, then embedding these sessions along with ground truth sessions from the TREC Session 2014 Track dataset. We compare simulation approaches by calculating the Fréchet Distance between simulated and ground truth session embedding distributions. To account for temporal dynamics, we also compute the sequential Fréchet Distance as defined in equation (6). The effectiveness of Fréchet Distance as an evaluation metric is assessed by comparing FD scores with traditional metrics and analyzing results across different click models and simulation configurations. This approach provides insights into the strengths and limitations of various simulation methods in replicating realistic user search behavior.

5. Evaluating Search Sessions with Minimal Interaction Data

This section explores the effectiveness of the Fréchet Distance (FD) in assessing the quality of simulated search sessions with limited interaction data. We analyze the performance of various click models and the SimIIR 2.0 framework on the TREC Session 2014 Track dataset, comparing the FD metric with traditional session-based metrics.

Our study utilizes a subset of 200 sessions from the TREC Session 2014 Track, each containing an average of 4-5 queries. We generate simulated interactions using click models and the SimIIR 2.0 framework, then compute the FD between simulated and ground truth sessions using action-based embeddings and Session2Vec representations.

Table 1 presents the performance of different simulation approaches using traditional metrics (nDCG@10 and ERR@10) and Fréchet Distance metrics (FD@1 and FD@10). The results show that FD effectively quantifies the quality of simulated search sessions. The PBM model, being the simplest, shows the highest FD values, indicating the largest discrepancy from ground truth sessions. Conversely, SimIIR 2.0, incorporating more complex user behaviors, achieves the lowest FD values, suggesting simulations closest to the ground truth. FD aligns well with traditional evaluation metrics, consistently ranking simulation approaches. This indicates that FD can effectively capture simulation quality even with minimal interaction data.

Bootstrap analysis (Figure 1) confirms these patterns across different samples, with narrow confidence intervals indicating stability. The FD metric demonstrates sensitivity to simulation model complexity, with more sophisticated models like NCM and SimIIR 2.0 achieving lower FD scores. However, its discriminative power decreases when comparing closely performing models. Overall, Fréchet Distance proves to be a promising metric for evaluating simulated search sessions, especially with minimal interaction data, complementing traditional evaluation metrics in interactive information retrieval.

6. Evaluating Search Sessions with Extensive Interaction Data

This section examines the effectiveness of the Fréchet Distance (FD) in assessing simulated search sessions with extensive interaction data. Using a subset of the TREC Session 2014 Track dataset, we focus on longer and more complex sessions to address whether FD can accurately evaluate the quality of simulated search sessions with extensive interaction data. Our experimental setup involved selecting 100 sessions from the dataset, each containing at least 10 queries and a rich set of user interactions. We generated simulated interactions using various click models and the SimIIR 2.0 framework, computing the FD between simulated and ground truth sessions using both action-based embeddings and Session2Vec representations. To investigate the impact of interaction data quantity, we evaluated the simulations using full sessions, first 5 queries, and first 10 queries.

Results in Table 2 demonstrate that Fréchet Distance (FD) effectively quantifies the quality of simulated search sessions across varying amounts of interaction data. FD scores for full sessions are generally lower than those for partial sessions, indicating improved simulation accuracy with more interaction data. Figure 2 shows a strong negative correlation between nDCG@10 and FD scores for full sessions, which suggests alignment with traditional evaluation metrics.

Bootstrap analysis, using 1000 subsets of 50 sessions each, revealed narrow confidence intervals for both ERR@10 and FD scores, indicating stability across different session subsets. High correlations between FD scores for full and partial sessions (5Q and 10Q) suggest that FD maintains discriminative power even when evaluating partial sessions (see Table 3). FD demonstrates several advantages, including consistency across different amounts of interaction data, robustness in assessments, sensitivity to simulation complexity, and alignment with traditional metrics. However, computational complexity may increase with larger datasets, warranting future exploration of efficient approximation methods.
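A bootstrap of this kind can be sketched as follows. The example assumes one score per session (for instance a per-session FD value), draws subsets with replacement, and reports a percentile interval of the subset mean; whether the original analysis sampled with or without replacement is not stated, so this choice is an assumption. The function name `bootstrap_ci`, its defaults, and the synthetic scores are hypothetical.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    subset_size: int = 50,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    stat: Callable[[List[float]], float] = fmean,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval of `stat` over random subsets of `scores`."""
    rng = random.Random(seed)
    # Each resample draws `subset_size` scores with replacement (an assumption).
    estimates = sorted(
        stat(rng.choices(scores, k=subset_size)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic per-session scores for 100 sessions (not data from the study).
data_rng = random.Random(1)
session_scores = [data_rng.gauss(2.0, 0.3) for _ in range(100)]

low, high = bootstrap_ci(session_scores)
print(f"95% CI for the mean score: [{low:.3f}, {high:.3f}]")
```

A narrow interval relative to the differences between simulators is what supports the stability claim made above.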

In conclusion, Fréchet Distance proves to be an effective and robust metric for evaluating simulated search sessions, particularly with extensive interaction data. Its ability to capture distributional similarities between simulated and ground truth sessions makes it valuable for assessing and improving search session simulation models.

7. Correlation Analysis with Session Similarity Metrics

This section examines the correlation between the Fréchet Distance (FD) and other established metrics used to evaluate the similarity of simulated user sessions in Information Retrieval (IR) systems. The study aims to understand how FD’s performance in measuring simulated search sessions correlates with other session similarity metrics.

We employ the TREC Session 2014 dataset, selecting 200 sessions of varying lengths and complexities. Simulated sessions are generated using click models and the SimIIR 2.0 framework. FD is compared with Session nDCG (sDCG), Expected Global Utility (EGU), Path-based Session Evaluation (PSE), and a metric based on the Interactive IR Probability Ranking Principle (IIR-PRP).

Results show strong negative correlations between FD and all other metrics (see Table 4), with the strongest correlation observed with PSE. As FD decreases, indicating better simulation quality, other metrics tend to increase.

Correlation analysis for different session lengths reveals that the relationship between FD and other metrics strengthens as session length increases, as shown in Table 5, suggesting FD’s effectiveness in capturing complex interaction patterns in longer sessions. The study supports FD as a valid and effective measure for evaluating simulated search sessions, demonstrating consistency with established metrics and sensitivity to session complexity.
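The sign convention behind these correlations can be illustrated with a small Pearson computation. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the paired scores below are invented purely to show that a lower-is-better metric paired with a higher-is-better metric yields a negative coefficient.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Invented scores for four simulators (not results from the study):
# FD is lower-is-better, the session metric is higher-is-better.
fd_scores = [4.0, 3.0, 2.0, 1.0]
session_metric = [0.2, 0.4, 0.6, 0.8]

r = pearson_r(fd_scores, session_metric)
print(round(r, 3))  # -1.0 for this perfectly linear toy example
```

A rank-based coefficient such as Spearman's would be a natural alternative when only the ordering of simulators matters.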

However, limitations include dataset specificity, simulation model dependency, and the need for further investigation into metric assumptions and computational efficiency. Future work should validate findings on other datasets, explore a wider range of simulation approaches, and correlate metric-based assessments with human judgments of session similarity. In conclusion, this study supports the use of FD as a valuable tool in evaluating IR systems, particularly for complex, multi-query sessions, potentially offering complementary insights to existing metrics and enhancing the assessment and improvement of IR systems in interactive, session-based search contexts.

8. Sensitivity to Feature Extraction in Simulated Sessions

This section examines the impact of various feature extraction methods on the Fréchet Distance when evaluating simulated search sessions. The study utilizes the TREC Session 2014 dataset, selecting 300 sessions of varying lengths and complexities. Five feature extraction methods are investigated: Query-based Embedding (QBE), Action Sequence Embedding (ASE), Session2Vec (S2V), BERT-based Session Embedding (BSE), and Time-aware Session Embedding (TSE).

9. Conclusion and Future Work

This paper explores the application of Fréchet Distance (FD) as a novel metric for evaluating simulated search sessions in interactive information retrieval systems. Our experiments demonstrate FD’s effectiveness and robustness in assessing the quality of simulated user interactions across various scenarios.

FD shows strong correlations with established metrics like session nDCG and Expected Global Utility, capturing similar aspects of session quality while offering additional insights due to its distributional nature. It proves particularly effective in evaluating longer, more complex sessions and demonstrates effectiveness even with minimal interaction data. These findings have significant implications for interactive information retrieval, potentially leading to more accurate and realistic simulation models. Future research directions include investigating FD’s performance with multi-modal session representations, extending this work to larger datasets, correlating FD assessments with human judgments, and exploring FD in evaluating generative models for search session simulation.

While our study offers valuable insights, it’s important to acknowledge certain limitations. FD requires sets of sessions for evaluation and assumes multivariate normal distributions, which may not always hold for all types of session data. Additionally, as an unbounded metric, the interpretation of FD scores may vary depending on dataset characteristics and sample sizes.

Despite these limitations, we believe that FD offers a powerful and flexible approach to evaluating simulated search sessions, complementing existing metrics and potentially driving improvements in both simulation models and real-world information retrieval systems.

References

[1] Carterette, E. Kanoulas, M. M. Hall, P. D. Clough, Overview of the TREC 2014 session track, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014, volume 500-308 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2014. URL: http://trec.nist.gov/pubs/trec23/papers/overview-session.pdf.

[2] Järvelin, S. L. Price, L. M. L. Delcambre, M. L. Nielsen, Discounted cumulated gain based evaluation of multiple-query IR sessions, in: C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, R. W. White (Eds.), Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008. Proceedings, volume 4956 of Lecture Notes in Computer Science, Springer, 2008, pp. 4-15. URL: https://doi.org/10.1007/978-3-540-78646-7_4. doi:10.1007/978-3-540-78646-7_4.

[3] Yang, Lad, Modeling expected utility of multi-session information distillation, in: L. Azzopardi, G. Kazai, S. E. Robertson, S. M. Rüger, M. Shokouhi, D. Song, E. Yilmaz (Eds.), Advances in Information Retrieval Theory, Second International Conference on the Theory of Information Retrieval, ICTIR 2009, Cambridge, UK, September 10-12, 2009, Proceedings, volume 5766 of Lecture Notes in Computer Science, Springer, 2009, pp. 164-175. URL: https://doi.org/10.1007/978-3-642-04417-5_15. doi:10.1007/978-3-642-04417-5_15.

[4] Kanoulas, Carterette, P. D. Clough, Sanderson, Evaluating multi-query sessions, in: W. Ma, J. Nie, Baeza-Yates, Chua, W. B. Croft (Eds.), Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, ACM, 2011, pp. 1053-1062. URL: https://doi.org/10.1145/2009916.2010056. doi:10.1145/2009916.2010056.

[5] Chuklin, I. Markov, M. de Rijke, Click Models for Web Search, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers, 2015. URL: https://doi.org/10.2200/S00654ED1V01Y201507ICR043. doi:10.2200/S00654ED1V01Y201507ICR043.

[6] Fuhr, A probability ranking principle for interactive information retrieval, Inf. Retr. 11 (2008) 251-265. URL: https://doi.org/10.1007/s10791-008-9045-0. doi:10.1007/S10791-008-9045-0.

[7] Zerhoudi, Granitzer, Seifert, Schloetterer, Evaluating simulated user interaction and search behaviour, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part, volume 13186 of Lecture Notes in Computer Science, Springer, 2022, pp. 240-247. URL: https://doi.org/10.1007/978-3-030-99739-7_28. doi:10.1007/978-3-030-99739-7_28.

[8] Efrat, L. J. Guibas, Har-Peled, J. S. B. Mitchell, T. M. Murali, New similarity measures between polylines with applications to morphing and polygon sweeping, Discret. Comput. Geom. 28 (2002) 535-569. URL: https://doi.org/10.1007/s00454-002-2886-1. doi:10.1007/S00454-002-2886-1.

[9] E. J. Keogh, M. J. Pazzani, Scaling up dynamic time warping to massive dataset, in: J. M. Zytkow, J. Rauch (Eds.), Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD '99, Prague, Czech Republic, September 15-18, 1999, Proceedings, volume 1704 of Lecture Notes in Computer Science, Springer, 1999, pp. 1-11. URL: https://doi.org/10.1007/978-3-540-48247-5_1. doi:10.1007/978-3-540-48247-5_1.

[10] Kwong, He, Man, Chau, Tang, Parallel genetic-based hybrid pattern matching algorithm for isolated word recognition, Int. J. Pattern Recognit. Artif. Intell. 12 (1998) 573-594. URL: https://doi.org/10.1142/S0218001498000348. doi:10.1142/S0218001498000348.

[11] Kim, Shin, Optimization of subsequence matching under time warping in time-series databases, in: H. Haddad, L. M. Liebrock, A. Omicini, R. L. Wainwright (Eds.), Proceedings of the 2005 ACM Symposium on Applied Computing (SAC), Santa Fe, New Mexico, USA, March 13-17, 2005, ACM, 2005, pp. 581-586. URL: https://doi.org/10.1145/1066677.1066814. doi:10.1145/1066677.1066814.

[12] Heusel, Ramsauer, Unterthiner, Nessler, Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 6626-6637. URL: https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html.

[13] Arabzadeh, C. L. A. Clarke, Fréchet distance for offline evaluation of information retrieval systems with sparse labels, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian's, Malta, March 17-22, 2024, Association for Computational Linguistics, 2024, pp. 420-431. URL: https://aclanthology.org/2024.eacl-long.26.

[14] Eiter, Mannila, Computing discrete Fréchet distance (1994).

[15] Alt, The computational geometry of comparing shapes, in: S. Albers, Alt, S. Näher (Eds.), Efficient Algorithms, Essays Dedicated to Kurt Mehlhorn on the Occasion of His 60th Birthday, volume 5760 of Lecture Notes in Computer Science, Springer, 2009, pp. 235-248. URL: https://doi.org/10.1007/978-3-642-03456-5_16. doi:10.1007/978-3-642-03456-5_16.

[16] Craswell, Zoeter, M. J. Taylor, B. Ramsey, An experimental comparison of click position-bias models, in: M. Najork, A. Z. Broder, S. Chakrabarti (Eds.), Proceedings of the International Conference on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California, USA, February 11-12, 2008, ACM, 2008, pp. 87-94. URL: https://doi.org/10.1145/1341531.1341545. doi:10.1145/1341531.1341545.

[17] Dupret, Piwowarski, A user browsing model to predict search engine click data from past observations, in: S. Myaeng, D. W. Oard, Sebastiani, Chua, M. Leong (Eds.), Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, ACM, 2008, pp. 331-338. URL: https://doi.org/10.1145/1390334.1390392. doi:10.1145/1390334.1390392.

[18] Guo, C. Liu, Y. M. Wang, Efficient multiple-click models in web search, in: R. Baeza-Yates, P. Boldi, B. A. Ribeiro-Neto, B. B. Cambazoglu (Eds.), Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, February 9-11, 2009, ACM, 2009, pp. 124-131. URL: https://doi.org/10.1145/1498759.1498818. doi:10.1145/1498759.1498818.

[19] Chapelle, Y. Zhang, A dynamic bayesian network click model for web search ranking, in: J. Quemada, G. León, Y. S. Maarek, W. Nejdl (Eds.), Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, ACM, 2009, pp. 1-10. URL: https://doi.org/10.1145/1526709.1526711. doi:10.1145/1526709.1526711.

[20] Borisov, I. Markov, M. de Rijke, Serdyukov, A neural click model for web search, in: J. Bourdeau, J. Hendler, R. Nkambou, I. Horrocks, B. Y. Zhao (Eds.), Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, ACM, 2016, pp. 531-541. URL: https://doi.org/10.1145/2872427.2883033. doi:10.1145/2872427.2883033.

[21] Zerhoudi, Günther, Plassmeier, Borst, Seifert, Hagen, Granitzer, The simiir 2.0 framework: User types, markov model-based interaction simulation, and advanced query generation, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 4661-4666. URL: https://doi.org/10.1145/3511808.3557711. doi:10.1145/3511808.3557711.

[22] F. A. Acheampong, Nunoo-Mensah, Chen, Transformer models for text-based emotion detection: a review of bert-based approaches, Artif. Intell. Rev. 54 (2021) 5789-5829. URL: https://doi.org/10.1007/s10462-021-09958-2. doi:10.1007/S10462-021-09958-2.

[23] Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, in: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, JMLR.org, 2014, pp. 1188-1196. URL: http://proceedings.mlr.press/v32/le14.html.

[24] Bing, Niu, Lam, Wang, Learning a semantic space of web search via session data, in: S. Ma, J. Wen, Liu, Dou, Zhang, Chang, W. X. Zhao (Eds.), Information Retrieval Technology - 12th Asia Information Retrieval Societies Conference, AIRS 2016, Beijing, China, November 30 - December 2, 2016, Proceedings, volume 9994 of Lecture Notes in Computer Science, Springer, 2016, pp. 83-97. URL: https://doi.org/10.1007/978-3-319-48051-0_7. doi:10.1007/978-3-319-48051-0_7.