Evaluating the Success of Search Sessions in Interactive Information Retrieval Magdalena Nikola University of Milan Bicocca, Milan, Italy m.nikola@campus.unimib.it Abstract. Interactive Information Retrieval (IIR) studies include both system evaluations and users’ information search behaviors, and the in- teraction of users with systems and information. The development and testing of appropriate measures and methodologies for evaluating IIR is central to information science. To better understand users’ needs and support their interactions with information, IIR systems need to be able to understand the goals underlying users’ search behaviors. This work is conceived to address some aspects of this problem. In particular, it considers how people evaluate the success of a complete search session and of the various search intentions within a search session, with respect to the task which motivated the search. In this paper a pilot study is described. Keywords: Information Retrieval · Interactive Information Retrieval · Work Task · Evaluation · Search Session. 1 Introduction The main goal of an Information Retrieval System (IRS) is to return to users the most relevant documents in response to their queries, thus respecting the so- called paradigm “one query-one response”. However, people usually engage in longer and more complex information seeking episodes. Therefore, when people try to address a new type of problem, they need to engage in many activities other than just clicking on a search result retrieved by the system. In IIR, the crucial point is to develop systems that allow the user to easily access the in- formation s/he needs, while also providing solutions to a series of problems that may arise during a search session. According to Cole [2], the evaluation of a system should focus on how users are able to achieve their goals, how the system helps users to identify and engage in appropriate interactions, and the relation- ship between the results of these interactions and the progress towards the goals. In order to understand and develop suitable measures for the evaluation of IIR systems, it is necessary to know how people evaluate the system’s support for achieving the goals of an intention during a search session, and, in general, how they evaluate the success in achieving the goals of the entire search session. In Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). FDIA 2019, 17-18 July 2019, Milan, Italy. order to do this, it is necessary to understand what these intentions are, and what the nature of the work tasks is since it has been shown that task’s topic has an essential influence on user behavior during a search session [5]. This pa- per presents a methodology and a pilot study of a project undertaken during my master’s thesis.1 2 Related Works Several studies have shown that capturing a user’s information need is one of the most critical aspects of IR. Although it is difficult to create an all-including definition of an information need, most information needs can be characterized in terms of tasks and topics: a task represents the goal or purpose of the search, this is what a user wants to accomplish by searching, e.g., a user wants to plan a trip; a topic represents the subject area that is the focus of the task, e.g., the user might plan a trip to Africa. Research has also shown that information needs evolve during the search process, as these are dynamic information needs. This evolution is due to the fact that during a search for information, people learn more about their needs, and consequently their pertinent behaviors change. Li & Belkin in [1] define tasks as “activities people attempt to accomplish in order to keep their work or life moving on”. More in general, a work task is defined as an activity people complete in order to achieve their work’s goal, e.g., writing a report, planning a vacation. Moreover, a work task is without a doubt a motiva- tion for information search, and includes both a) information-seeking tasks and b) information-search tasks. With information-search is intended information search only through an information system. Instead, with information-seeking is intended the fact that users may also seek information from other sources, such as human or printed documents. One important development in IIR evaluation and experimentation has been the simulated work task that describes the situa- tion leading to the information need. The nature of the task that leads a person to engage in the interaction with an IRS in searching for information has been shown to influence the behavior of users during the search sessions. In recent years, the characteristics of search tasks have been studied, such as how different search tasks could be classified, what they are influenced by, and how they differ according to their attributes. A concrete example is a study conducted by Wildemuth, Freund, and Toms [4] in 2014, in which two attributes of the search tasks are studied and implemented: task complexity and task diffi- culty. That work provides a “detailed revision of existing practice in developing search tasks to test, observe or control” these two attributes, because as they say “it is not clear if these attributes are mutually exclusive or share some di- mensions, as current definitions have tended to blur the distinction” [4]. 3 A New Paradigm of User Study for Evaluating IIR This project for the evaluation of IIR systems aims at investigating the following issues: a) given a search session in response to a motivating task, how would 1 The project was undertaken under the supervision of Prof. Nicholas Belkin at Rut- gers University and Prof. Gabriela Pasi at the University of Milano - Bicocca we evaluate the system support for that search session? b) given an intention associated with a query segment, how would we evaluate the system support for that intention? c) can we discover measures for evaluating the contribution of each query segment to the success of the search session as a whole? To address these issues, the main practical goal of the work is to develop a framework for the evaluation of IIR Systems. To do this, the following research questions have to be answered: 1) RQ1 How do people judge the success of a search session? 2) RQ2 How useful was each intention/query segment in accomplishing the goal of the search session? RQ1 concerns the ability to learn how satisfied are users in carrying out the search task, or how successful, according to them, was their search session. Specifically, it wants to investigate the kind of measures that users adopt when they evaluate the whole search session: what is/are appropriate measure(s) for evaluating the system support of the search session? Do different types of moti- vating tasks require different evaluating measures? RQ2 aims to learn about the usefulness of each intention of the search ses- sion and the usefulness of each query segment of the same search session in accomplishing the goal of the search task. Furthermore, it aims at understand- ing what are the appropriate measures for evaluating the contribution of each intention/query segment in accomplishing the goal of the search session. 4 Research methodology In the performed pilot study, users were required to follow a specific procedure, whose steps are summarized in Table 1. Table 1. Summary procedure. Procedure Time 1 Read and sign the consent form 3 min 2 Initial questionnaire 2 min 3 Shown the tutorial about the system 10 min 4 Shown the task and the topic of the search 3 min 5 Second questionnaire 2 min 6 Search, all behaviors are logged 20 min 7 Replay the search, by query segment & annotation of query segments 40 min 8 Search session evaluation and comparison 12 min As shown in Table 1, prior to conducting their searches, subjects were asked to read and sign a consent form in which each of them was informed about the experiment. Then, searchers completed a brief questionnaire about their demo- graphic characteristics and their normal searching behaviors. Next, searchers were given a video tutorial which was designed to interactively guide them through the workings of the experimental system. In the next step, to the users were shown the tasks and the topics of the search. Before doing their search, subjects were asked to take familiarity with the topic and the motivating task and to anticipate their supposed difficulty in completing the assignment. While doing the search they had the possibility of saving/unsaving pages they consid- ered useful/not useful for accomplishing the task. The search ended when the time required for the search was expired or when users have felt that the task was accomplished. After the search was completed, participants were required to fill a questionnaire, whose focus was understanding their intentions in each query segment and the successes related to them. At the end of the entire searching ex- perience, subjects participated in a structured post-search interview which was designed to elicit confidence, attitudes, strategies, and behaviors directly related to the success or unsuccessful of their search session. 4.1 Study Motivating Tasks The task type classification framework proposed by Li & Belkin [1] was used to construct two motivating tasks for this study. The specific intention in task con- struction was to design motivating tasks that differed systematically on several of the facets of the task that were shown to affect search behavior. In particu- lar, two task types were chosen because they have shown, in previous work, to lead to significant differences in search behaviors, including frequency of search intentions. We hypothesize that the understanding of success in the two tasks is different. The motivating tasks used in the study are based on the following Task Scenario: You are about to plan a vacation with your partner to improve your personal relationship between you and him/ her. You want to do this after the end of Spring semester, when you have 18-26 May when you’ll both be free, and can book for a week somewhere, including travel time. The considered two tasks to be executed by participants are Task 1 or Task 2, summarized in Table 2 below. Table 2. Description of the tasks. No. Your Task Find at least three resorts in different countries that you think will be good for the purpose of the trip to show your partner. If you book flights now to the general area for that period, you’ll be able to afford a nice resort. Be sure that 1 the region safety won’t be a problem for the places you find. Please save up to three pages for each resort, that you think will be useful in helping you and your partner to decide which one is best. Please find and save the page(s) for each of the resorts named below, which give you information about the best available weekly rate for two at that resort, 2 during the period March 16-24. Available resorts: Indonesia - The Santai, Malaysia - The Banjaran Hotsprings Retreat, Singapore - Capella Singapore, Thailand - Pimalai Resort & Spa, Vietnam - Fusion Maia Da Nang. 5 Results Undergraduate students were recruited from Rutgers University to participate in the study. The age of participants ranged between 18 and 21, and the aver- age number of years the participants have been conducting online searches was 11 years. All participants rated themselves as an experienced searcher in using search engines (e.g., Google, Bing). Some of them indicated that they are also experts in searching through social media (e.g., Facebook, Twitter, YouTube), or marked that they are also experts in searching within community-based forums (e.g., Quora, Stack Overflow). However, only one participant rated himself as an experienced searcher in using other search tools, such as a library database. In general, on average, participants were experienced with online information searching, because they usually search for information online for their every-day needs (e.g., homework, studies). In the first part of the study, it may be said that most of the participants were successful with their intentions: in fact, in most cases, users have managed to complete the intentions of query segments, so these intentions have been marked as successful. During the search, however, there were cases in which the participants failed to positively conclude some intentions, which is why they have been labeled as non-successful intentions. Summarizing, 77% of the intentions chosen during the search sessions were marked as successful, and 15% as non- successful. Moreover, some intentions have not been reported either as successful or as not successful, this number covers 8% of the total intentions. Instead, the reasons for which users have reformulated their queries can be grouped as follows: a) the user entered the new query because s/he was able to find the best-rated resort from what TripAdvisor stated, b) the user was still trying to find information from each of the websites, c) the user wanted to find another review of the resort besides TripAdvisor, d) the participant found the top resort in Vietnam and was looking for more information about the resort, e) the user was trying to load the website for another resort but it would not load, so s/he moved onto the next resort which also would not load, f ) the user wanted to obtain details about the best luxury resorts in Malaysia. It can be said that the most important part of this project was to understand what the users meant by the success of a task, what it means to achieve the goal of the task and positively conclude the search session. For this reason, all users were asked to provide us with their own and personal definition of successful. To the question “What do you mean by successful?”, we have received several answers, which vary from the simplest answer in which the user says that s/he was able to find three websites/resort in three countries, to the most reformulated ones in which the users explain that s/he found what s/he was looking for to the best of his/her ability, or that s/he did not find package pricing, rather the nightly pricing for each of the resorts and their amenities. 6 Discussion of the Results The most important outcomes from the study are: a) even with this small sample, participants made use of almost all the available intentions and they seem to have been sufficient to describe what the participants wanted to accomplish; b) the reasons for judgments of success or unsuccess of the different intentions depend on the considered intention, thus indicating that they would require different measures for evaluating the system support. What such measures would be could not be determined, given the small number of participants, but with more data, it seems to be possible to infer categories of different measures; c) the reasons for the success of the search session have to do with the accomplishment of the task, which means that any possible measure for evaluating the search session as a whole should be directly related to the type of motivating task. Since there are two task types in the study, with more data it should be possible to identify, based on both the reasons given and the reasons for changes of search strategy, some general evaluation measures for the different tasks; d) the descriptions of plans or search strategies and the reasons for changing can clearly be sources for identifying criteria, and possible measures, for evaluation of the search session as a whole. 7 Conclusions and Future Developments In the field of Interactive Information Retrieval (IIR), the main goal of this work was to understand the reasons why people change their queries, what is successful to them and why, and, more precisely, to understand how people evaluate the success of a search episode. The few data obtained in this pilot study and described in this paper, indicate that we are in a promising direction to arrive at defining standard methods and metrics for the evaluation of IIR systems. In order to validate in a more complete way our hypothesis and results, it will be necessary to wait for the conclsion of the project, and for the global collection of data relative to all the participants expected for this project. References 1. Li, Yuelin and Belkin, Nicholas J: A faceted approach to conceptualizing tasks in information seeking., vol. 44, pp. 1822–1837. Information Processing & Management (2008). 2. Cole, Michael and Liu, Jingjing and Belkin, Nicholas and Bierig, R and Gwizdka, J and Liu, C and Zhang, J and Zhang, X: Usefulness as the criterion for evaluation of interactive information retrieval., vol. 44, pp. 1–4. Proc. HCIR (2009). 3. Lin, Shinjeng and Xie, Iris: Behavioral changes in transmuting multisession succes- sive searches over the web., vol. 64, pp. 1259–1283. Journal of the American Society for Information Science and Technology (2013). 4. Wildemuth, Barbara and Freund, Luanne and G. Toms, Elaine: Untangling search task complexity and difficulty in the context of interactive information retrieval studies., vol. 44, pp. 1118–1140. Journal of Documentation (2014). 5. Hienert, Daniel and Mitsui, Matthew and Mayr, Philipp and Shah, Chirag and Belkin, Nicholas J: The role of the task topic in web search of different task types., pp. 72–81., Proceedings of the 2018 Conference on Human Information Interaction & Retrieval, ACM (2018).