Using CoST on self-assessment domain expertise in complex search tasks

Cheyenne Dosso1, Jose G. Moreno2, Aline Chevalier1 and Lynda Tamine2
1 Université Jean-Jaures, CLLE, Toulouse, France
2 University of Toulouse, IRIT, UMR 5505 CNRS, F-31000, Toulouse, France

Abstract
While great progress has been made in the area of information access, there are still open issues around designing intelligent systems that support task-based search. Despite the importance of task-based search, the information retrieval and information science communities still lack open and annotated datasets that enable the evaluation of a number of related facets of search tasks in downstream applications. Existing datasets are either sampled from large-scale logs but provide poor annotations, or sampled from smaller-scale user studies but focus on ranked list evaluation. In this work, we briefly present CoST 1: a novel richly annotated dataset for evaluating complex search tasks, collaboratively designed by researchers from the computer science and cognitive psychology domains, and intended to answer a wide range of research questions dealing with task-based search. The CoST collection has been fully described in a previous paper [1]. We report here its main design methodology, the characteristics of the data provided and illustrative evaluations showing its importance to the IR community, among which a new evaluation (in comparison to [1]) related to the impact of users' domain expertise on their self-assessment of task complexity.

Keywords
Complex search task, Expertise, User study, Evaluation

1 The data collection is available at https://doi.org/10.6084/m9.figshare.15286353 and fully described in [1].

CIRCLE'22: Joint Conference of the Information Retrieval Communities in Europe, July 04–07, 2022, Samatan, Gers, France
cheyenne.dosso@univ-tlse2.fr (C. Dosso); jose.moreno@irit.fr (J. G. Moreno); aline.chevalier@univ-tlse2.fr (A. Chevalier); tamine@irit.fr (L. Tamine)
ORCID: 0000-0002-8852-5797 (J. G. Moreno)

1. Introduction

Over the years, users' search activity has increasingly diversified from solving simple and well-defined tasks, such as fact finding, to more complex and knowledge-oriented tasks, such as learning and decision-making [2, 3, 4, 5]. These complex tasks generally involve richer search interactions requiring the mobilization of cognitive resources on the part of the user. While search systems are well adapted to simple tasks, they do not suitably assist users in solving complex tasks. Early attempts to fill this gap were made through TREC and CLEF evaluation campaigns such as the TREC Interactive Track (1997–2002) [6], followed later by the TREC Session Track (2011–2014) [7]. More recently, the TREC Dynamic Domain Track (2015–2017) [8], the TREC Tasks Track (2015–2017) [9] and the CLEF Dynamic Search Lab (2017–2018) [10] have also enabled significant progress in this research area. Other attempts are exemplified by collections sampled from the publicly available AOL query log [11], providing human annotations of within-session tasks [12] and cross-session tasks [13, 14]. Recently, Völske et al. [15] published a large-scale AOL search log annotated with cross-user task identifiers and extended with Google and Bing query suggestions.
The AOL logs and TREC tracks have a design focus on shared ranking tasks and do not provide complete and systematic data on search tasks, sessions and users. Through an open design methodology, the CoST collection [1] addresses this limitation by providing rich task and session data. It is worth mentioning that the CoST collection provides queries annotated by experts in the domains of cognitive psychology and computer science. Finally, it enables the evaluation of language-agnostic or multilingual models, since it is the first such collection to have been published in French. The CoST dataset includes 5667 queries recorded from 630 task-based sessions resulting from a user study involving 70 participants with varying domains of expertise (computer science, medicine, psychology). Among the 15 search tasks, 3 are simple fact-finding tasks (designed as evaluation controls) and 12 are complex search tasks related to 3 domains of expertise: medicine, psychology and computer science.

2. The CoST data collection

Table 1
Statistics of the CoST data collection.

# Search tasks                          15
# Sessions and human answers            630
# Queries                               5667
Min/Max/Avg/Std queries per task        1/60/5.39/6.70
Min/Max/Avg/Std query length (terms)    1/36/3.93/2.45
Min/Max/Avg/Std clicks per query        1/24/1.7/1.8

Here, we briefly describe CoST, as full details of the collection can be found in [1].

Participants. Seventy native French-speaking participants took part in the user study. All of them were experts in one domain and non-experts in the other domains: 25 in computer science, 10 in medicine, 35 in psychology. They had at least a bachelor's degree and were asked to complete a multiple-choice questionnaire (MCQ) in each domain assessing their level of prior knowledge. All participants had to solve 15 tasks, 5 of which were in their domain of knowledge and 10 outside their domain.

Protocol. The user study followed two main steps. First, participants were asked to complete an online pre-questionnaire containing the MCQs, a free and informed consent form, and socio-demographic questions. Second, participants were asked to perform 15 search sessions to solve tasks varying in complexity according to 5 levels. Three fact-finding tasks [16], where the answer is directly accessible on the SERPs [17]. Three multi-criteria inference tasks [16], requiring the user to produce inferences to clarify the terms of the statement and to integrate different search criteria to access the answer [17]. Three exploratory learning tasks [18], where the objective is to lead users to acquire new knowledge about a topic. Three decision-making tasks, where the objective is to compare a set of information in order to make a final decision [19]. Finally, three problem-solving tasks, where the objective is to create a new coherent set of information from the knowledge acquired during information seeking [19]. Before each task, users had to fill in a pre-questionnaire. This included the expected difficulty assessment proposed by [20], which evaluates difficulty according to 5 items: difficulty in searching for information using a search engine, difficulty in understanding the information found, difficulty in determining the usefulness of the information found, difficulty in integrating all of the information found into the answer, and difficulty in determining when to stop the search. For each item, the response modality was a 4-point Likert scale ranging from "not at all difficult" to "very difficult". Then, participants were asked to rate their familiarity with the tasks' topics.
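As an illustration of how these ratings can be handled, the sketch below derives a single difficulty score per questionnaire by summing the five items into a 5–20 composite. This aggregation is an assumption on our part (it is consistent with the range of the difficulty means reported in Section 3, but the exact scoring used for CoST is the one described in [1] and [20]), and the example ratings are hypothetical.

```python
# Five difficulty items, each rated on a 4-point Likert scale
# (1 = "not at all difficult", 4 = "very difficult").
ITEMS = ("search", "understand", "judge_usefulness", "integrate", "stop")

def difficulty_score(ratings: dict) -> int:
    """Hypothetical composite: sum of the five item ratings (range 5-20)."""
    assert set(ratings) == set(ITEMS) and all(1 <= v <= 4 for v in ratings.values())
    return sum(ratings.values())

# Example: expected (pre-task) vs. experienced (post-task) difficulty.
pre  = {"search": 2, "understand": 3, "judge_usefulness": 2, "integrate": 3, "stop": 2}
post = {"search": 3, "understand": 3, "judge_usefulness": 3, "integrate": 4, "stop": 3}
print(difficulty_score(pre), difficulty_score(post))  # 12 16
```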
During the search sessions, participants used a browser developed for the purpose of recording human-system interactions. It was used to generate logs capturing: 1) the keyboard keys; 2) mouse clicks; 3) SERPs and visited documents; 4) timestamps in milliseconds (e.g., the instant of a click on a selected SERP). From these logs, we extracted the search session data released in CoST. More precisely, the CoST sessions mainly include: 1) the identifiers (Id) of the search sessions; 2) the Id of the search task with its complexity and domain attributes; 3) the anonymous Id of the user together with his/her domain of expertise; 4) the query Id and the query textual formulation; 5) SERP clicks (i.e., page and rank); 6) the URLs of visited documents. After each task, users were asked to complete a post-questionnaire including an evaluation of the difficulty experienced, according to the same 5 items as the expected difficulty [20]. In addition, the post-questionnaire contained different questions concerning users' perceptions of: the quality of the answer provided, the thematic relevance of the search engine, the websites and the documents visited, the usefulness and reliability of the information gathered [21], and the general satisfaction regarding the accomplishment of the task. Table 1 shows the statistics of the CoST collection and the full set of data retrieved during the user study and integrated into the CoST collection.

Query annotation. The CoST collection also provides richly annotated queries based on two main query reformulation strategies [22, 23, 24, 25]: exploration vs. exploitation. The exploration strategy refers to the regulation and adaptation behaviors of the user's information seeking activity. At the task level, the user might dynamically reframe his/her goal as the search task evolves, by integrating new incoming information from the visited online content. Exploration allows the opening and initiation of new search paths, so that the user processes an additional part of the search space (e.g., moving from one subtask to another with a clear cut-off) [26, 23]. At the query formulation level, an exploration strategy results in a large semantic jump between the contents of two successive queries. The exploitation strategy reflects perseverance behaviors in processing similar information needs during the information seeking activity. At the task level, this strategy allows the deep processing of a previously opened search path, initiated with the aim of processing a specific part of the search space [24, 22]. At the query formulation level, exploitation corresponds to a narrow semantic jump between the contents of two successive queries. A total of 5667 queries were double annotated by humans using a three-step annotation process. For detailed information regarding the annotation process as well as the confidentiality and anonymization procedure, the taxonomies of task complexity, and examples of the tasks used, please refer to [1].
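To make the notion of a semantic jump more concrete, the sketch below shows a crude, purely illustrative heuristic that labels each reformulation as exploration or exploitation from the lexical overlap between two successive queries. This is an assumption-laden simplification: the CoST labels were produced by manual double annotation, not by such a rule, and both the overlap measure (Jaccard over terms) and the threshold are arbitrary choices; the example queries are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two term sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def label_reformulations(queries, threshold=0.3):
    """Illustrative heuristic only: a small lexical overlap with the previous
    query is read as a large 'semantic jump' (exploration), a large overlap as
    a narrow jump (exploitation). NOT the manual annotation procedure of CoST."""
    labels = []
    for prev, curr in zip(queries, queries[1:]):
        overlap = jaccard(set(prev.lower().split()), set(curr.lower().split()))
        labels.append("exploitation" if overlap >= threshold else "exploration")
    return labels

# Hypothetical session: the second reformulation opens a new search path.
session = ["symptoms of type 2 diabetes",
           "type 2 diabetes treatment",
           "side effects of metformin"]
print(label_reformulations(session))  # ['exploitation', 'exploration']
```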
3. Using the CoST collection to analyze users' assessments in complex search tasks

In Dosso et al. [1], we presented the performance results of two downstream tasks, namely query task mapping and search strategy identification, using the CoST collection. We also studied, using the behavioural data and query annotations provided in the CoST collection, the effects of task complexity and domain knowledge of the task on the users' behaviors (ClickSerp, NoClickSerp, TimeSerp, TimeURL, TimeSession) and search strategies (exploration-exploitation). In this paper, we extend this work by analyzing the effects of domain expertise and task complexity on: 1) the user's self-assessment of task difficulty, hereafter called perceived difficulty [20], and 2) the users' perceptions [21] (quality of the response provided, relevance of information, usefulness of information, reliability of information, and overall satisfaction with task completion).

We perform repeated-measures ANOVAs on the dependent variables cited above, linked to the pre- and post-questionnaires (see Section 2). We focus on two independent variables. First, computer science expertise as a between-subject factor with two modalities (In domain and Out domain). The experimental group "In domain" includes the 25 computer science students and the group "Out domain" includes the 35 psychology students and the 10 medicine students. Second, we select task complexity as a within-subject factor with 5 modalities (fact-finding, exploratory learning, decision-making, problem-solving, multicriteria-inferential). When an ANOVA test is significant, we perform Scheffé's post-hoc tests.
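To make the analysis setup concrete, below is a minimal sketch of how such a mixed-design ANOVA could be run in Python with the pingouin library. Everything specific in it is an assumption for illustration: the CSV file name and the column names (participant, expertise, task_complexity, difficulty) are hypothetical, the scores are assumed to have been aggregated into long format beforehand, and the analysis reported in this paper was not necessarily produced with this tool. Note also that pingouin does not implement Scheffé post-hoc tests, so the pairwise tests at the end are only a rough stand-in for the post-hocs used here.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format table: one row per participant x task-complexity level,
# e.g. the perceived-difficulty score averaged over the 3 tasks of each level.
df = pd.read_csv("cost_difficulty_long.csv")  # assumed columns: participant,
                                              # expertise, task_complexity, difficulty

# Mixed ANOVA: expertise (In/Out domain) as between-subject factor,
# task complexity (5 levels) as within-subject factor.
aov = pg.mixed_anova(data=df, dv="difficulty",
                     between="expertise", within="task_complexity",
                     subject="participant")
print(aov.round(3))  # reports F, uncorrected p and partial eta squared (np2)

# When the ANOVA is significant, the paper applies Scheffé post-hoc tests;
# pingouin only offers pairwise tests, used here as an approximation.
posthoc = pg.pairwise_tests(data=df, dv="difficulty",
                            between="expertise", within="task_complexity",
                            subject="participant", padjust="bonf")
print(posthoc.round(3))
```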
Table 2 presents a summary of the ANOVA results for domain expertise and for the interaction between domain expertise and task complexity on perceived difficulty and users' perceptions.

Table 2
Summary of ANOVA results for Domain Expertise (DE, F(1,68)) and Domain Expertise * Task Complexity (DE*TC, F(4,272)). η²p is the partial eta squared effect size; n.s. = not significant.

                 DE: F     p       η²p      DE*TC: F   p       η²p
Difficulty       9.77      .003    0.126    3.67       .006    0.051
Quality          11.1      .001    0.141    1.22       n.s.    -
Relevance        14.2      <.001   0.173    2.47       .05     0.035
Usefulness       4.75      <.05    0.065    0.58       n.s.    -
Reliability      7.35      .008    0.098    1.35       n.s.    -
Satisfaction     16.6      <.001   0.196    1.69       n.s.    -

We discuss below the results obtained and the primary findings that emerged from them. Overall, we can see from Table 2 that domain expertise in computer science has an effect on all the measures (perceived difficulty and users' perceptions). Computer science experts evaluate the tasks as significantly less difficult (M = 9.9, SD = 3.3) than non-experts do (M = 11.5, SD = 4.8). Experts rate their final written answer as of higher quality (M = 3, SD = 0.7) than non-experts do (M = 2.62, SD = 0.93). In addition, domain experts report that the information they accessed during their search sessions is more relevant (M = 3.04, SD = 0.73), more useful (M = 4.82, SD = 1.52) and more reliable (M = 5.44, SD = 1) than non-experts in the domain do (relevance: M = 2.71, SD = 0.83; usefulness: M = 4.43, SD = 1.64; reliability: M = 4.92, SD = 1.3). Finally, domain experts are more satisfied with task completion (M = 4.8, SD = 1.53) than their counterparts (M = 3.9, SD = 1.9). In summary, domain expertise reduces the perceived difficulty of the tasks regardless of their level of complexity and leads to higher satisfaction with task completion, together with the perception of higher-quality answers and of access to more relevant, useful and reliable information during the search session.

To go further in our analysis, we aim to determine to what extent the interaction between expertise and task complexity affects perceived difficulty and users' perceptions. For difficulty, we note that it is specifically the problem-solving task and the multicriteria-inferential task that are perceived differently by experts and non-experts. Computer science experts perceive these two tasks as less difficult (problem-solving task: M = 12.13, SD = 3.11; multicriteria-inferential task: M = 10.7, SD = 3.4) than non-experts do (M = 14.7, SD = 4.1 and M = 13.1, SD = 4, respectively). As for the users' perceptions, only the variable related to the relevance of the information retrieved during the search session shows a significant interaction. Experts in the computer science domain judge that they had access to more relevant information for the problem-solving task (M = 3, SD = 0.7) than non-experts (M = 2.23, SD = 0.9). According to [19], understanding the dynamics of solving complex tasks requires considering both the objective complexity of the tasks and their subjective complexity, which refers to the difficulty of the tasks as perceived by the users. Here, we find that two of the most complex tasks in the CoST collection (i.e., problem solving [19] and multicriteria-inferential [16, 17]) are perceived as more difficult by users without domain expertise in computer science. Going further, we find that for the problem-solving task the non-experts felt they had access to less relevant information than the domain experts. By coupling the results on difficulty and perceived relevance, we can argue that experts were able to access more relevant information for the task thanks to their prior knowledge of the domain and thus felt that they could solve it more easily than non-experts of the domain.

4. Conclusions and future work

In this paper, we briefly presented the CoST collection, which is specifically useful for task-based search evaluation. The CoST collection includes a wide range of task-related and user-related data and annotations, in particular the cognitive complexity of tasks, the domain expertise of participants and query type annotations. These critical attributes allow a wide range of experiments for researchers from different fields including, but not limited to, IR, IS and psychology.

Acknowledgements

This work was supported by the Agence Nationale de la Recherche (ANR), through project CoST (https://www.irit.fr/COST/), code ANR-18-CE23-0016. We thank Claire Ibarboure for carrying out the double annotation of the queries.

References

[1] C. Dosso, J. G. Moreno, A. Chevalier, L. Tamine, CoST: An annotated data collection for complex search, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 4455–4464. URL: https://doi.org/10.1145/3459637.3481998. doi:10.1145/3459637.3481998.
[2] M.-A. Cartright, R. W. White, E. Horvitz, Intentions and attention in exploratory health search, SIGIR '11, 2011, pp. 65–74.
[3] A. Hassan Awadallah, R. W. White, P. Pantel, S. T. Dumais, Y.-M. Wang, Supporting complex search tasks, CIKM '14, 2014, pp. 829–838.
[4] S. Y. Rieh, K. Collins-Thompson, P. Hansen, H.-J. Lee, Towards searching as a learning process: A review of current perspectives and future directions, Journal of Information Science 42 (2016) 19–34.
[5] N. Belkin, T. Bogers, J. Kamps, D. Kelly, M. Koolen, E. Yilmaz, Second workshop on supporting complex search tasks, CHIIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 433–435. URL: https://doi.org/10.1145/3020165.3022163.
[6] P. Over, The TREC interactive track: an annotated bibliography, Information Processing & Management 37 (2001) 369–381. URL: https://www.sciencedirect.com/science/article/pii/S0306457300000534. doi:10.1016/S0306-4573(00)00053-4. Interactivity at the Text Retrieval Conference (TREC).
[7] B. Carterette, P. Clough, M. Hall, E. Kanoulas, M. Sanderson, Evaluating retrieval over sessions: The TREC Session Track 2011–2014, SIGIR '16, 2016, pp. 685–688.
[8] G. H. Yang, I. Soboroff, TREC 2016 Dynamic Domain Track overview, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016, Gaithersburg, Maryland, USA, November 15–18, 2016, volume 500-321 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2016.
[9] E. Kanoulas, E. Yilmaz, R. Mehrotra, B. Carterette, N. Craswell, P. Bailey, TREC 2017 Tasks Track overview, in: Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15–17, 2017, volume 500-324 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2017.
[10] E. Kanoulas, L. Azzopardi, G. H. Yang, Overview of the CLEF Dynamic Search evaluation lab 2018, in: P. Bellot, C. Trabelsi, J. Mothe, F. Murtagh, J. Y. Nie, L. Soulier, E. SanJuan, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2018, pp. 362–371.
[11] G. Pass, A. Chowdhury, C. Torgeson, A picture of search, InfoScale '06, Association for Computing Machinery, New York, NY, USA, 2006, p. 1–es.
[12] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei, Identifying task-based sessions in search engine query logs, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, 2011, pp. 277–286.
[13] P. Sen, D. Ganguly, G. Jones, Tempo-lexical context driven word embedding for cross-session search task extraction, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 283–292.
[14] M. Hagen, J. Gomoll, A. Beyer, B. Stein, From search session detection to search mission detection, in: Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, OAIR '13, Le Centre de Hautes Études Internationales d'Informatique Documentaire, 2013, pp. 85–92.
[15] M. Völske, E. Fatehifar, B. Stein, M. Hagen, Query-task mapping, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, 2019, pp. 969–972.
[16] D. J. Bell, I. Ruthven, Searcher's assessments of task complexity for web searching, Lecture Notes in Computer Science (2004) 57–71. doi:10.1007/978-3-540-24752-4_5.
[17] M. Sanchiz, A. Chevalier, F. Amadieu, How do older and young adults start searching for information? Impact of age, domain knowledge and problem complexity on the different steps of information searching, Computers in Human Behavior 72 (2017) 67–78. doi:10.1016/j.chb.2017.02.038.
[18] G. Marchionini, Exploratory search: from finding to understanding, Communications of the ACM 49 (2006) 41–46. doi:10.1145/1121949.1121979.
[19] D. J. Campbell, Task complexity: A review and analysis, Academy of Management Review 13 (1988) 40–52. doi:10.5465/amr.1988.430677.
[20] W.-C. Wu, D. Kelly, A. Edwards, J. Arguello, Grannies, tanning beds, tattoos and NASCAR: Evaluation of search tasks with varying levels of cognitive complexity, in: Proceedings of the 2012 Information Interaction in Context Symposium, IIiX '12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 254–257. doi:10.1145/2362724.2362768.
[21] J. Jiang, D. He, D. Kelly, J. Allan, Understanding ephemeral state of relevance, in: Proceedings of the 2017 Conference on Human Information Interaction & Retrieval, CHIIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 137–146. doi:10.1145/3020165.3020176.
[22] M. Sanchiz, F. Amadieu, A. Chevalier, An evolving perspective to capture individual differences related to fluid and crystallized abilities in information searching with a search engine, in: W. T. Fu, H. van Oostendorp (Eds.), Understanding and Improving Information Search: A Cognitive Approach, Human-Computer Interaction, 1st ed., Springer, Cham, Switzerland, 2020, pp. 71–96.
[23] J. Liu, S. Sarkar, C. Shah, Identifying and predicting the states of complex search tasks, in: Proceedings of the 2020 Conference on Human Information Interaction & Retrieval, CHIIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 193–202. doi:10.1145/3343413.3377976.
[24] B. J. Jansen, D. L. Booth, A. Spink, Patterns of query reformulation during web searching, J. Am. Soc. Inf. Sci. Technol. 60 (2009) 1358–1371.
[25] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, Y. Chang, Learning to rewrite queries, in: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, 2016, pp. 1443–1452.
[26] B. M. Wildemuth, D. Kelly, E. Boettcher, E. Moore, G. Dimitrova, Examining the impact of domain and cognitive complexity on query formulation and reformulation, Information Processing & Management 54 (2018) 433–450.