A Proposal for User-Focused Evaluation and Prediction of Information Seeking Process

Chirag Shah
School of Communication & Information (SC&I), Rutgers University
4 Huntington St, New Brunswick, NJ 08901, USA
chirags@rutgers.edu

ABSTRACT
One of the ways IR systems help searchers is by predicting or assuming what could be useful for their information needs, based on analyzing information objects (documents, queries) and finding other related objects that may be relevant. Such approaches often ignore the underlying search process of information seeking, thus forgoing opportunities for making process-based recommendations. To overcome this limitation, we propose a new approach that analyzes a searcher's current process to forecast the likelihood of achieving a certain level of success in the future. Specifically, we propose a machine-learning-based method to dynamically evaluate and predict search performance several time steps ahead at each given time point of the search process during an exploratory search task. Our prediction method uses a collection of features extracted solely from the search process, such as dwell time, query entropy, and relevance judgments, in order to evaluate whether the process will lead to low or high performance in the future. Experiments that simulate the effects of switching search paths show a significant number of subpar search processes improving after the recommended switch. In effect, the work reported here provides a new framework for evaluating search processes and predicting search performance. Importantly, this approach is based on user processes and is independent of any IR system, allowing for wider applicability that ranges from searching to recommendations.

Categories and Subject Descriptors
H.3 [INFORMATION STORAGE AND RETRIEVAL] H.3.3 Information Search and Retrieval: Search process; H.3 [INFORMATION STORAGE AND RETRIEVAL] H.3.4 Systems and Software: Performance evaluation (efficiency and effectiveness)

General Terms
Measurement, Performance, Experimentation

Keywords
Exploratory search, Evaluation, Performance prediction

Presented at EuroHCIR2013. Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
IR evaluations are often concerned with explaining factors relating to user or system performance after the search and retrieval are conducted [20]. Most recommender systems, however, operate with the objective of suggesting objects that could be useful to a user based on his/her or others' past actions [2][19]. We commenced our investigation by broadly asking how we could take valuable lessons from both IR evaluations and recommender systems to not only evaluate an ongoing search process, but also predict how well it will unfold and suggest a better path to the searcher if it is likely to underperform. The motivation behind this investigation was based on the following assumptions and realizations grounded in the literature.

1. The underlying rational processes involved in information search are reflected in the actions users take while searching. These actions include entering search queries, skimming the results, and selecting and collecting useful information [8][14][15].
2. A searcher's performance is a function of these actions performed during a search episode [7][22].

With these assumptions, we propose to quantify a search process using various user actions, and to use it for user performance (henceforth, "search performance" or "performance") prediction as well as search process recommendations.

2 BACKGROUND
Past research on predictive models that relates to the approach we describe in this paper can be grouped into two main categories: (1) behavioral studies and (2) IR approaches. In both cases, however, the focus has been on end products instead of on the process required to produce them.

As far as the behavioral studies go, research has been conducted to explore user models that help anticipate specific aspects of the search process. One goal in this context has been determining whether a search process will be completed in a single session or in multiple sessions. For example, Agichtein et al. [3] investigated different patterns that can be identified in tasks that require multiple sessions. As a result, the authors devised an algorithm capable of predicting whether users will continue or abandon the task. Similar work is described in Diriye et al. [6], which focuses on predicting and understanding why and when users abandon Web searches. To address this problem, the authors studied features such as queries and interactions with result pages. Based on this approach, the authors were able to determine reasons for search abandonment, such as accidental causes (e.g., Web browser crashing), satisfaction levels, and query suggestions, among others.

There have also been attempts to understand users' past behaviors in order to predict future ones under similar conditions. For example, Adar et al. [1] visually explored behavioral aspects using large-scale datasets containing queries and other information objects produced by users. The authors were able to identify different behavioral patterns that seem to appear consistently in different datasets. While not directly related to performance prediction, this work focused on attributes of the search process instead of final products derived from it.

Research like that described above often relies on historic data from large populations and the use of trend and seasonal components, which are used to model the long-term direction and periodicity patterns of time series [17]. For example, some have explored seasonal aspects of Web search (e.g., weekly, monthly, or annual behaviors) that provide useful information for predicting and suggesting queries [5].

From an IR perspective, Radinski et al. [18] explored models to predict users' behaviors in a population in order to improve results from IR systems. The authors also developed a learning algorithm capable of selecting an appropriate predictive model depending on the situation and time. As described by the authors, applications of this approach could range from click predictions to query-URL predictions. In contrast to this approach, the method presented in this paper considers both population trends and an individual user's behavior.

Along a similar track, several works have addressed query performance prediction, focusing on techniques that help an IR system anticipate whether a query will be effective in providing results that satisfy users' needs [4][10][11]. For example, Gao et al. [10] found that features derived from search results and interaction features offer better prediction results than a baseline defined in terms of query features. Results from this study have direct implications for individual users by aiding the automatic evaluation process of IR systems.

In information search, users may be unaware of their individual performance when solving an information search task. For instance, Shah & Marchionini [23] showed how lack of awareness about different objects involved in searching (queries, visited pages, bookmarks) could result in a mistaken perception of search performance during an exploratory search task. Even if an IR system is highly effective, users may go through multiple query formulations and the evaluation of several pages before finding what they need. This process, which can be related to search strategies, implies effort and time that are usually underestimated by the users themselves. In this sense, instead of predicting end products (i.e., overall performance), the approach we introduce in this paper is oriented toward predictions at different times in order to increase users' awareness of their own search process. Similar to a weather forecast, this information could help users become aware of possible trends based on past and current behavior.

For a more recent discussion of IR evaluations and their shortcomings, see [12]. To the best of our knowledge, search process performance prediction at different times from a user perspective has not been explored. Similar approaches can be found in weather and stock market studies. For example, using machine learning approaches such as the Support Vector Machine (SVM), models have been implemented to predict the trends of two different daily stock price indices using NASDAQ and Korean stock prices [13][16]. In a similar fashion, our approach is oriented to forecasting users' search performance N steps ahead, with the aim of aiding their awareness of the search process and its performance trends.

Unlike previous work in IR, we are not proposing to use time series analyses or seasonal components of historic data. Instead, we investigate predictive models based on machine learning (ML) techniques, namely SVM, logistic regression, and Naïve Bayes, trained over a set of features such as time, number of queries, and page dwell time. In contrast to most IR evaluations, our method focuses on user processes. Also, unlike most recommender systems, our approach can output alternative strategies instead of similar/relevant products to help the searcher. In essence, the work reported here takes several lessons from traditional IR evaluations, recommender system designs, and weather/stock forecasting to come up with a new approach for evaluating and predicting search performance.

In the next section we provide a detailed description of our method, feature selection, and the measures we used in order to create ML-based predictive models.

3 METHOD
In order to analyze the search processes followed by different users/teams, we assume that the underlying dynamics of a search process are expressed by a collection of activities that take place from the beginning to the end of the process. The first part of our method is a feature extraction step in which we extract a wide array of features relating to webpages, queries, and snippets saved from the search process for each unit of time t. This step is performed in order to evaluate how well those features capture the underlying dynamics, which would lead to recognizing whether a search process is going to lead to high or low performance at future time steps t+n (n = 1, 2, …, N), where N is the furthest time step.

The decision to include or exclude a feature was based on the literature (e.g., [7]) as well as our past experience [22] with representing and evaluating search objects and processes. Each feature is extracted for each user or team, u, up to time t from the search process; the features are explained in detail as follows.
• Total coverage (u,t): The total number of distinct webpages visited by user u up to time t. This feature captures the webpage-based activity performed by a user and provides a measure of how much distinct information the user has found up to this time.

• Useful coverage (u,t): The total number of distinct webpages on which the user spent at least 30 seconds, up to time t. This measure evaluates how many of the total pages he/she visited were useful in finding relevant information, leading to satisfaction in the context of completing the exploratory task [9][22][25].

• Number of queries (u,t): The total number of unique queries executed by the user up to time t. This feature implicitly relates to how much effort and cognitive work the user has put into the task.

• Number of saved snippets (u,t): The total number of snippets saved by user u up to time t. This measures the amount of information that the user thought might be relevant for completing the task in the future and needed to be remembered. In other words, this feature is an indication of explicit relevance judgments made by the user.

• Length of query (u,q,t): The length of each query q executed by user u, based on the character count of the query, up to time t. This feature captures how the user posed the queries and how long they were at different times of the search process.

• Number of tokens in each query (u,q,t): The count of tokens/words in each query q executed by user u up to time t. This query-based measure takes into account how specific the user was in defining the query. By inspecting the datasets, we observed that queries with fewer tokens tend to produce general results, whereas composed queries with multiple terms relate to more specific searches. We also observed that users typically started with general queries of a few words at the beginning of the search process, but then moved to more detailed queries to find more specific information later. For all these reasons, we found it useful to capture the number of tokens used in a query.

• Query entropy (u,q,t): This measures the information content of a given query q, by finding the expected value of the information contained in the query. We used the widely recognized notion of Shannon entropy [24] from information theory to calculate the information content of a query. We counted the unique characters appearing in each query, which represent the observed counts of the random variable; this was used as the input to the Shannon entropy calculation, and we used the maximum-likelihood method to calculate the entropy. The query entropy feature has been used in the past to predict the goodness of a query for making query expansion decisions [21].
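To make the feature definitions concrete, the following sketch shows how these per-user features could be computed from a timestamped action log. The log format, the function names, and the helper are illustrative assumptions and not part of the study; the 30-second dwell threshold comes from the Useful coverage definition above.

```python
import math
from collections import Counter

def shannon_entropy(query: str) -> float:
    """Maximum-likelihood (plug-in) Shannon entropy, in bits, over the
    characters of a query, following the Query entropy feature above."""
    counts = Counter(query)
    n = len(query)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def extract_features(actions, t, dwell_threshold=30):
    """Compute the per-user features described above from a list of
    (timestamp, kind, value) actions, using only actions up to time t.
    kind is 'visit' (value = (url, dwell_seconds)), 'query' (value =
    the query string), or 'snippet' (value = the saved snippet text).
    This log format is an assumption made for illustration."""
    visits = [v for ts, kind, v in actions if kind == 'visit' and ts <= t]
    queries = [v for ts, kind, v in actions if kind == 'query' and ts <= t]
    snippets = [v for ts, kind, v in actions if kind == 'snippet' and ts <= t]

    return {
        # Distinct pages visited up to time t.
        'total_coverage': len({url for url, _ in visits}),
        # Distinct pages with dwell time of at least 30 seconds.
        'useful_coverage': len({url for url, dwell in visits
                                if dwell >= dwell_threshold}),
        'num_queries': len(set(queries)),
        'num_snippets': len(snippets),
        'query_lengths': [len(q) for q in queries],          # characters
        'query_tokens': [len(q.split()) for q in queries],   # words
        'query_entropy': [shannon_entropy(q) for q in queries],
    }
```

Such a feature dictionary, computed once per minute, would form one row of the per-user feature matrix described in the next subsection.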
The method used to assess the search performance of a user is described below. We define a measure called Efficiency(u,t) for each user u up to time t, in order to predict whether a given search process is going to yield high or low performance in the future. We first define the Effectiveness of user u up to time t as the ratio of useful coverage to total coverage (both defined earlier). A similar measure was used in [7] and [22].

Effectiveness(u,t) = Useful coverage(u,t) / Total coverage(u,t)    (1)

We then calculate Efficiency as defined in Equation 2.

Efficiency(u,t) = Effectiveness(u,t) / Number of queries(u,t)    (2)

In other words, Efficiency is defined as the Effectiveness obtained per query, or how effective a query is in terms of achieving a certain level of useful coverage.
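Equations 1 and 2 translate directly into code. A minimal sketch follows; the function names and the guards against division by zero are our additions, not part of the original formulation.

```python
def effectiveness(useful_coverage: int, total_coverage: int) -> float:
    # Eq. 1: fraction of visited pages that proved useful.
    return useful_coverage / total_coverage if total_coverage else 0.0

def efficiency(useful_coverage: int, total_coverage: int,
               num_queries: int) -> float:
    # Eq. 2: effectiveness obtained per query issued.
    eff = effectiveness(useful_coverage, total_coverage)
    return eff / num_queries if num_queries else 0.0
```

For example, a user with 6 useful pages out of 12 visited, over 4 unique queries, has Effectiveness 0.5 and Efficiency 0.125.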
The performance of each user u at each time t was classified into two classes, high performance and low performance, based on the following criterion:

Class(u,t) = { high, if Efficiency(u,t) ≥ τ; low, otherwise }    (3)

where τ is the Efficiency threshold used to separate the two classes.

Using data from the various user studies available to us, we constructed feature matrices consisting of all the aforementioned features for each minute t for all the users in each dataset, and converted them into a long feature vector that we fed as input to the classification models.¹ The class labels were generated as high/low performance at minute t+n based on the above-mentioned criterion and threshold, and were used as the output class labels in the n-step-ahead prediction model. If a class label n steps ahead was correctly predicted by the classification model based on the features extracted up to time t, it was considered correctly classified; if not, misclassified.

¹ In the interest of space and the scope of the work here, details of these experiments have been omitted, but will be available for discussion at the workshop.

4 EXPERIMENTS
In order to evaluate whether users who are predicted to perform poorly in the future, based on their current search process, could benefit from this analysis to improve their search process, we conducted a simple simulation analysis.

We considered each individual user's search process as a collection of search paths, where each search path is defined as the search process from the time a user issued a query up to the time the user issued another, quite different query. This was determined using the generalized Levenshtein (edit) distance, a commonly used metric for measuring the distance between two character sequences. If the Levenshtein distance between two subsequent queries was greater than 2 (assuming a distance of no more than 2 reflected changes due to simple spelling mistakes or refinements of the query), we considered the search process from the former query to the next query a single search path.

Following this method, we found the first search path of each user, and based on the features extracted up to the end of that first search path, together with the classification model learnt for the corresponding n-step-ahead prediction, we predicted whether the user was going to have low or high performance at the end of the session. If a user was predicted to have low performance, then out of the users predicted to have high performance, we looked for the high-performing user with the lowest Levenshtein distance between the queries issued within the first search path and those of the low-performing user, and considered the two a pair to use in the simulation. Then, for each matched pair, we switched the search process of the low-performing user at the end of the first search path with the high-performing user's search path up to t = T minutes, where T is the total number of minutes in a session. We then evaluated whether switching the search process early in the overall process would help each low-performing user improve their performance. We found that we were able to move most of the underperforming search processes to higher performance through early detection and switching, while keeping the higher-performing processes unharmed.

These simulations provide verification that, by realizing early in the search process whether a user is going to perform well or not, one could recommend better search processes/strategies that would lift the search performance of a user previously destined for low performance.
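The search-path segmentation used in the simulation can be sketched as follows, assuming a plain dynamic-programming edit distance (the paper's "generalized" Levenshtein distance may weight operations differently) and the threshold of 2 described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def split_search_paths(queries, threshold=2):
    """Segment a user's query sequence into search paths: a new path
    starts when the edit distance to the previous query exceeds the
    threshold (smaller distances are treated as spelling fixes or
    refinements, as described above)."""
    if not queries:
        return []
    paths = [[queries[0]]]
    for prev_q, q in zip(queries, queries[1:]):
        if levenshtein(prev_q, q) > threshold:
            paths.append([q])       # substantially new query: new path
        else:
            paths[-1].append(q)     # minor refinement: same path
    return paths
```

In a full simulation, one would then pair each low-performing user's first path with the closest high-performing user's path (by edit distance over queries) and replay the remainder of the session along the better path.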
5 CONCLUSION
When it comes to prediction, information retrieval and filtering systems are primarily focused on objects while assessing what, and whether, something could help the users. These approaches are often system-dependent, even though the process of information seeking is usually user-specific. Personalization and recommendation are frequently exercised as methods to address user-specific IR and filtering, but are still limited to comparing and recommending objects, without focusing on the underlying IR processes carried out by the searchers. We presented a new approach to address these shortcomings. We began by asking whether we could model a user's search process based on the actions he/she performs during an exploratory search task and forecast how well that process will do in the future. This was based on the realization that an information seeker's search goal/task can be mapped as a series of actions, and that the sequence of actions or choices the searcher makes, especially the search path he/she takes, affects how well he/she will do. Thus, in contrast to approaches that measure the goodness of search products (e.g., documents, queries) as a way to evaluate overall search effectiveness, we measured the likelihood of an existing search process producing good results. Here we presented simulations to demonstrate what could happen if one can make process-based predictions, but one could also develop an actual recommender system using the proposed method. Another potential application of such a prediction-based method would be in IR systems, giving users awareness of how their future performance will unfold based on their current and past search process. The system could identify, at an early stage of the process, that a user will have low performance if he continues in this manner, and suggest what could be done to improve overall performance.

Given that the proposed technique is independent of any specific kind of system, and is solely focused on user-based processes, it will presumably be easy to apply to a variety of IR systems and situations, irrespective of retrieval, ranking, or recommendation algorithms. Finally, while we have used datasets borrowed from previous user studies, one could easily apply the proposed method to Web logs, TREC data, and other forms of datasets with various user actions recorded over time.

6 ACKNOWLEDGEMENTS
The work reported here is supported by the Institute of Museum and Library Services (IMLS) Cyber Synergy project as well as IMLS grant # RE-04-12-0105-12. The author is also grateful to his PhD students Chathra Hendahewa and Roberto Gonzalez-Ibanez for their valuable contributions to this work.

7 REFERENCES
[1] Adar, E., Weld, D. S., Bershad, B. N., & Gribble, S. D. (2007). Why we search: visualizing and predicting user behavior. In Proceedings of the World Wide Web (WWW) Conference 2007.
[2] Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749.
[3] Agichtein, E., White, R. W., Dumais, S. T., & Bennett, P. N. (2012). Search interrupted: Understanding and predicting search task continuation. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) 2012.
[4] Cronen-Townsend, S., Zhou, Y., & Croft, B. (2002). Predicting query performance. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) 2002.
[5] Dignum, S., Kruschwitz, U., Fasli, M., Yunhyong, K., Dawei, S., Beresi, U. C., & De Roeck, A. (2010). Incorporating seasonality into search suggestions derived from intranet query logs. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) 2010, vol. 1, pp. 425–430.
[6] Diriye, A., White, R. W., Buscher, G., & Dumais, S. T. (2012). Leaving so soon? Understanding and predicting web search abandonment rationales. In Proceedings of CIKM 2012.
[7] González-Ibáñez, R., Shah, C., & White, R. W. (2012). Pseudo-collaboration as a method to perform selective algorithmic mediation in collaborative IR systems. In Proceedings of the 75th Annual Meeting of the Association for Information Science and Technology (ASIS&T). Baltimore, MD, USA.
[8] Gwizdka, J. (2008). Cognitive load on web search tasks. Workshop on Cognition and the Web: Information Processing, Comprehension, and Learning. Granada, Spain. Available from http://eprints.rclis.org/14162/1/GwizdkaJ_WCW2008_short_paper_finalp.pdf
[9] Fox, S., Karnawat, K., Mydland, M., Dumais, S., & White, T. (2005). Evaluating implicit measures to improve web search. ACM TOIS, 23(2), 147–168.
[10] Gao, Q., White, R., Dumais, S. T., Wang, S., & Anderson, B. (2010). Predicting query performance using query, result and interaction features. In Proceedings of RIAO 2010.
[11] He, B., & Ounis, I. (2006). Query performance prediction. Information Systems, 31(7), 585–594.
[12] Järvelin, K. (2012). IR research: systems, interaction, evaluation and theories. ACM SIGIR Forum, 45(2), 17.
[13] Kyoung-jae, K. (2003). Financial time series forecasting using support vector machines. Neurocomputing, 55(1–2), 307–319.
[14] Liu, C., Gwizdka, J., Liu, J., Xu, T., & Belkin, N. J. (2010). Analysis and evaluation of query reformulations in different task types. Proceedings of the American Society for Information Science and Technology, 47(17). Available from http://dl.acm.org/citation.cfm?id=1920331.1920356
[15] Liu, J., Gwizdka, J., Liu, C., & Belkin, N. J. (2010). Predicting task difficulty for different task types. Proceedings of the American Society for Information Science and Technology, 47(16). Available from http://dl.acm.org/citation.cfm?id=1920331.1920355
[16] Ming-Chi, L. (2009). Using support vector machine with a hybrid feature selection method to the stock trend prediction. Expert Systems with Applications, 36(8), 10896–10904.
[17] Ord, J., Hyndman, R., Koehler, A., & Snyder, R. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer.
[18] Radinski, K., Svore, K., Dumais, S. T., Teevan, J., Horvitz, E., & Bocharov, A. (2012). Modeling and predicting behavioral dynamics on the Web. In Proceedings of WWW 2012.
[19] Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.
[20] Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) (pp. 138–146).
[21] Shah, C., & Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) (pp. 2–9). Sheffield, UK.
[22] Shah, C., & Gonzalez-Ibanez, R. (2011). Evaluating the synergic effect of collaboration in information seeking. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) (pp. 913–922). Beijing, China.
[23] Shah, C., & Marchionini, G. (2010). Awareness in collaborative information seeking. Journal of the American Society for Information Science and Technology (JASIST), 61(10), 1970–1986.
[24] Shannon, C. E., & Weaver, W. (1963). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.
[25] White, R. W., & Huang, J. (2010). Assessing the scenic route: Measuring the value of search trails in web logs. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR). Geneva, Switzerland.