Using semantic differentials for an evaluative view of the search engine as an interactive system Frances Johnson Department of Languages, Information & Communications Manchester Metropolitan University Geoffrey Manton +44 161 247 6156 F.Johnson@mmu.ac.uk ABSTRACT In this paper, we investigate the use of semantic differentials in Accordingly, search engine developments have focused on obtaining the evaluative view held by users of the search engine. providing query assistance drawing on contextual aspects to the The completed scales of bipolar adjectives were analysed to search, such as personal history and/or current context [9]. At the suggest the dimensions of the user judgment formed when asked interface, developments focus on improving the search process via to characterize a search engine. These were then used to obtain a richer information representations and interactions, such as comparative evaluation of two engines potentially offering previews and facets through to tools that allow the user to view different types of support (or assistance) during a search. We and explore connections in the results, for example ‘the relation consider the value of using the semantic differential as a technique browser data analysis tool’ [10]. These shifts into HCIR are in the toolkit for assessing the user experience during information intended to help in the various stages of search, from starting the interactions in exploratory search tasks. task and understanding the query topic, throughout the search in deciding what to do next, and to stopping with a sense of Categories and Subject Descriptors H3.3 [Information search confidence. In short, developments aim to support true and retrieval]; Search process. H.5.2 [User interfaces]: exploration of the search and, whilst many efforts may fall short, Evaluation/methodology they will provide some form of user support in query assistance General Terms and in improving the search process as an interactive experience. Measurement, Performance, Design, Human Factors The context for evaluation is predicated on White and Roth’s [3] model of the exploratory search process. This involves the Keywords searcher in a dynamic interplay between the cognition of their Semantic Differentials, User Evaluation, Exploratory Search, ‘problem space’ and their exploratory activities in the iterative Information Interaction, User Interface Design, search process including the query formulation, results examination and information extraction. Data collected on the searcher’s information interactions may confirm this model [7] as well as attempt to systematically evaluate the effectiveness of exploratory search systems. In evaluation, a framework is used to 1. INTRODUCTION attempt to assess performance during the search stages and to The design of interfaces to support exploratory search seeks to relate aspects of the system to its role in supporting information provide users with the tools for and the experience of an exploration, including sense making or query visualisation [5]. interactive and engaging search. This is a departure from the The challenge for the evaluation of exploratory search is the classic model of information retrieval wherein the user submits a assumption that the user is willing or able to make an evaluative keyword query to the system and scans the list of retrieved results judgment throughout the search or that valid measures can be for relevance, either stopping with relevant results or refining the found through their actions, for example of usage of query terms. query to get results that are closer to the information need. In general, evaluation draws from established HCI measures of Exploratory search does not necessarily assume that the user has a effectiveness (can people complete their tasks?) efficiency (how well defined information need (at least one that can be articulated long do people take?), an assessment of the user’s overall as a keyword query) or indeed that the query will be ‘static’ and satisfaction or other affective responses. Where possible, and thus satisfied by a single list of retrieved results. increasingly so, the user actions are observed and recorded as dependent on the system and/or its interface. In this study we focus on an attempt to obtain the user’s evaluative view of the search engine, based on criteria which may be affected by the Presented at EuroHCIR2012. Copyright © 2012 for the individual developments for new and richer interactive designs. It is papers by the papers' authors. Copying permitted only for private assumed that this would be part of an assessment which when and academic purposes. This volume is published and copyrighted taken with others will build a picture of the ‘user experience’ of by its editors. the system used in exploratory search. 2. USER EVALUATION The assumption made here, in the use of SDs on ‘search engines’ is that users hold an evaluative view which is formed when using In developing an instrument to collect the user assessment effort the engine to find and/or explore information. The SD is used to goes into ensuring that the evaluation is made in the task context. investigate the adjectives that best ‘conceptualise’ the search It means little to know that the user is ‘satisfied’ with the interface engine, from the user perspective. Factorial analysis is also used without gaining insight into why this assessment has been formed. to identify the dimensions of the judgment, in a sense the A variety of questionnaires have been developed for assessing packaging of the components of the judgment into smaller units of usability of interactive systems, such as search engines. Two well meaning reflecting what is important when responding to the known are the SUS (System Usability Scale) developed at the concept ‘search engine’. Digital Equipment Corporation [2] and the QUIS (Questionnaire for User Interaction Satisfaction) from the University of Maryland The design of the SD aims to allow a degree of abstraction in the [4]. Both assess usability from the user perspective with 10 evaluation so that participants can reflect the complexity of their statements and rating scales in the SUS and the QUIS with 27 response. In this study, the adjectives to include on the SD scale questions. The QUIS asks the user to respond on a rating scale to were chosen from Microsoft’s Product Reaction Cards, these statements which address specific usability aspects of the system, having been collected in previous research, usability studies and such as “use of the terms were consistent throughout the website”. in the marketing of web sites and systems. The majority of the The SUS on the other hand focuses on collecting the users’ terms formed pairs on some continuum and 40 terms (20 pairs) overall reaction to the site/system on statements, such as “I found were selected to present in the SD. The selection was subject to the website unnecessarily complex”. Arguably the QUIS focuses the judgment of the researcher. This is a limitation of this on the concerns that a developer might have when assessing exploratory study, however some steps were taken to formalise the usability whilst the SUS assumes that the user’s overall selection. A loose grouping of the adjective pairs was made as assessment is a reflection on the extent to which their goal relating to appearance (such as ‘attractive’), judgment (‘relevant’), directed tasks were facilitated by the system and its design. emotive (‘boring’) and use (‘fast’). Five pairs from each of these groupings were made. The pairs were mixed on the SD to avoid Questionnaires, such as SUS, are used in an experimental set up having all the positive terms on one side of the scale and only when an explanation of the user’s overall assessment is sought. intervals were shown on the scales with the numerical values used However, the limitations of the questionnaire to capture and only for data entry. This allowed participants to focus on how an provide insight into the complexity of the user’s assessment has adjective pair related to the engine and its characteristics, rather lead to alternative tools, for example Microsoft’s Product than on ‘scoring’ it in some way. Reaction Cards in the "Desirability Toolkit". This invites participants on a usability test to select as many, or as few, words from a list of 118 which best describe their reaction and/or interaction with the system they have just used. Benedek and 3.1 Implementation Miner [1] includes a list of the words used and point out that the The study was conducted on our undergraduates studying BSc approach helps elicit negative comments as well as positive, thus Web Development and on a postgraduate cohort studying on MA overcoming a problem with questionnaires biased towards Library and Information Management or the MSc Information positive responses. Management. A total of 89 students participated in the study. At Given the potential scope of the users’ response (represented in the start of the class each participant was asked to think about a the reaction cards with some 100+ terms) this study sets out to search engine, and adjectives they would use to describe the investigate the value in assembling these into a framework (of engine, (in other words, “what it means to them”). Each sorts) for the collection of the users’ evaluative judgment of an participant was then given the SD to complete. This is referred to interactive system based on the technique known as ‘semantic as the ‘baseline’ and the data were analysed to gauge user differentials’. Specifically the aim of this small preliminary perceptions of search engines. investigation was to begin to determine the extent to which users In the following lab sessions (about one hour later) each hold an evaluative view of a ‘search engine’ and, what are the participant was required to perform two search tasks on each of dimensions (traits or criteria) on which we form this view. If it the two search engines - Google, an engine we can assume some can be found that this view is strongly held (that is, an attitude is familiarity and, a second clustering engine (Yippy, formerly formed which may influence how we behave and interact with the Clusty). The two tasks were as follows search engine) then it may be feasible to investigate the influence, 1. Find information on the symptoms for diabetes type II if any, of a design for information interaction on the evaluative view. In this study the technique of semantic differentials is used 2. Find information to help write an assignment on the to best describe the evaluative view held by its participants. This debate ‘nurture vs nature’ is then employed to assess two quite different search engines following the completion of two query based searches. These were selected to give the participants experience of using the engines for a closed question (find symptoms) and on a more open ‘informational’ type of query (on the ‘nature nurture’ 3. SEMANTIC DIFFERENTIALS debate). A measure of search success was not taken as the aim was simply to get the participants using the engines. The order of Semantic Differentials (SDs) originate from the work of Osgood use of the two sites was randomized so that approximately half of [8] as a technique for attitude measurement, scaling people on the participants worked on Google first and half on the clustering their responses to adjectives in respect to a concept. Typically engine. All were told to spend no longer than 10 minutes individuals respond to several pairs of bipolar adjectives scored searching on each engine and to complete the SD for each engine on a continuum + to – and in doing so differentiate their meaning immediately after each use. of the concept in intensity and in direction (in a ‘semantic space’). 4. FINDINGS 4.2 Comparative evaluations 4.1 Evaluative views Using the same SDs, participants scaled their responses post The responses to the baseline (think of an engine) were entered search using Google and the clustering search engine. These were into SPSS with the scales coded (7-1) so that the positive entered into a worksheet to obtain basic statistics. The mode for adjectives corresponded to the higher numbers. Descriptive each adjective is shown Figure 1 with a note of those with mode statistics of mean, mode and standard deviation were calculated >4 and < 3 suggesting a positive or negative response. for each of the adjectives. Those with a mean greater than 4 or less than 3 were taken to suggest the adjective pairs that best characterise the participants’ view, as follows attractive unattractive powerful simplistic valuable not valuable relevant irrelevant satisfying frustrating fast slow predictable unpredictable intuitive rigid Google (mode > 4 or < 3) easy difficult & in bold where mean is also > 4 Factor analysis investigates the correlations among subsets of the 1attractive - , 6valuable - , 8relevant - , responses to the bipolar pairs and groups the correlated variables such that each group is largely independent of the others. 15satisfying - , 16fast - , 17predictable - ,18controllable -, Exploratory factor analysis was employed to identify the groups and (where mode < 3) 19rigid - which might explain most of the variance in the data. With 20 pairs of adjectives to perform Principal Components Analysis (PCA) in SPSS it is recommended that a minimum of 100 responses are obtained, whilst others recommend that a sample requires approx 5-10 times the number of people as scale pairs [6]. With 89 responses we should use a reduced number of pairs, however the Kaiser-Meyer-Olkin measure of sampling adequacy (.616) is greater than the 0.6 needed to indicate that the correlations matrix may be able to factorise. So with this, PCA was run (with varimax rotation to force items to ‘load’ with only one factor group), to identify the possible ‘factors’ or subsets derived from patterns of correlation of the adjective pairs. The following five subsets were obtained (the adjectives from the list above having a low or high mean are shown in bold). The labels were assigned to suggest the evaluative dimension. Factor 1 label USE – Utility Clustering search engine (mode > 4 or < 3) effective, valuable, satisfy, relevant, predictable, & in bold where mean > 4 or < 3 intimidating, inspiring, stimulating 14engaging - , 19intuitive – Factor 2 label QUALITY – Affective and (where mode < 3) 13intimidating - , 17 – unpredictable engaging, fun, connected Factor 3 label QUALITY - Appearance Figure 1. Responses to the adjectives for both engines high quality, personal, meaningful, rigid, attractive Using the suggested dimensions or aspects of the user evaluation Factor 4 label USE – Efficient from the factor analysis of the ‘baseline’ data we can compare the participants’ responses on the high or low scoring adjectives easy, intuitive, fast, powerful across the engines. On QUALITY – Appearance Google was rated rigid and attractive and whereas Google was neutral on Factor 5 label USE - Control the factor QUALITY- Affective, the clustering search engine obtained a positive score towards the adjective engaging. controllable On the factor labeled USE- Utility Google was scored as predictable, valuable, relevant and satisfying, whereas the [2] Brooke, J. SUS: A Quick and Dirty Usability Scale. In: P.W. clustering engine as unpredictable and towards intimidating. On Jordan, B. Thomas, B.A. Weerdmeester & I.L. McClelland (Eds.), USE-Efficient Google was rated as fast and the clustering Usability Evaluation in Industry. London: Taylor & Francis, 1996 engine appears more intuitive. Google was also rated as [www.itu.dk/courses/U/E2005/litteratur/sus.pdf# controllable. [3] Capra, R., and Marchionini, G. The Relation Browser tool for faceted exloratory search. Proceedings of the 2008 Conference on 5. DISCUSSION Digital Libraries, Pittsburg, Pennsylvania, June, 2008 This is an exploratory study and it has its limitations. It is questionable whether the selection of the adjectives to use in the [4] Chin, J. P., Diehl, V. A, & Norman, K. Development of an SD influenced the results. In particular there is uncertainty in the instrument measuring user satisfaction of the human-computer results that intuitive to rigid is on some continuum. Also there is interface, Proceedings of ACM SIGCHI ,1988, pp. 213-218. some unease at accepting a factor with 8 out of 20 pairs and one http://www.cs.umd.edu/hcil/quis/ with only one. Perhaps the sample size was too small to attempt factoring. The results also raise questions on how some of the [5] Daqing He, et al An evaluation of adaptive filtering in the adjectives were interpreted by the participants. These context of realistic task-based information exploration. . withstanding, the participants in this study did appear to hold an Information Processing Management, 44(2), 2008 pp. 511-533 evaluative judgment of the concept ‘search engine’ and the traits represented in the scale were grouped to suggest the aspects on [6] Gable, R. K., & Wolf, M. E.. Instrument development in the which an assessment may be formed. It is of particular interest affective domain (2nd ed.). Boston: Kluwer Academic, 1993 that upon using the search engine Google to conduct a search task the ratings on the SD, on the whole, altered only in the factors of [7] Kules, B and Capra, R. Visualizing stages during an ‘controllable’ and USE –efficient (easy, intuitive and exploratory search. Proceedings HCIR October 20th, 2011. powerful). Perhaps we can assume that Google was the typical engine when asked to think of an engine in the baseline and, when [8] Osgood, C.E, Suci, G., & Tannenbaum, P The Measurement it came to use Google, users shifted their perception with regards of Meaning. University of Illinois Press, 1957 to some of the adjectives. Perhaps this is not surprising but it may suggest that we hold an implicit view of search engines, and that [9] Teevan, J., Dumais,S.T and E. Horvitz. Potential for this view will be influenced by actual use (and the experience). Personalization. ACM Transactions on Computer-Human Our participants may have had less familiarity with the clustering Interaction special issue on Data Mining for Understanding User engine, and in the evaluation this appears to have prompted an Needs, 17(1), 2010 http://people.csail.mit.edu/teevan/work/ ‘affective’ response in finding the engine to be ‘engaging’ whilst publications/ papers/tochi10.pdf also indicating shifts in the ‘use’ factors (towards an assessment of the engine as ‘unpredictable’). Again the infallibility of some [10] White, Ryen W. & Roth., R. A. Exploratory Search: Beyond of the terms is highlighted where an ‘unpredictable’ system may the Query-Response Paradigm, CA: Morgan and Claypool, 2009 be regarded to be a negative judgment, but if the system is also considered to be engaging the assessment could be highly desirable depending on the user’s goals. This study of the use of Appendix: The Semantic Differential scale semantic differentials indicates that it is worth running the test with a new cohort of students to determine the extent to which a attractive _ _ _ _ _ _ _ unattractive consistent view is obtained. As an exploratory study it also impersonal _ _ _ _ _ _ _ personal suggests that further research on user’s perceptions and mental dull _ _ _ _ _ _ _ fun models of search engines is worthwhile. With regards to the powerful _ _ _ _ _ _ _ simplistic challenge of providing an evaluation of the exploratory search, disconnected _ _ _ _ _ _ _ connected this study falls short as no behavioural data was obtained. valuable _ _ _ _ _ _ _ not valuable However, perhaps, with further design of the SD and use in an high quality _ _ _ _ _ _ _ low quality experimental set up with honed tasks, a user assessment of the irrelevant _ _ _ _ _ _ _ relevant interface may be obtained as dependent on the search interface effective _ _ _ _ _ _ _ ineffective development and design. incomprehensible _ _ _ _ _ _ _ meaningful stimulating _ _ _ _ _ _ _ confusing boring _ _ _ _ _ _ _ inspiring 6. REFERENCES intimidating _ _ _ _ _ _ _ empowering stressful _ _ _ _ _ _ _ engaging [1] Benedek, J. and Miner, T. "Measuring Desirability: New satisfying _ _ _ _ _ _ _ frustrating Methods for Evaluating Desirability in a Usability Lab Setting." fast _ _ _ _ _ _ _ slow Redmond, WA: Microsoft Corporation, 2002. predictable _ _ _ _ _ _ _ unpredictable http://www.microsoft.com/usability/UEPostings/Desirability controllable _ _ _ _ _ _ _ uncontrollable Toolkit.doc intuitive _ _ _ _ _ _ _ rigid difficult _ _ _ _ _ _ _ eas