Using semantic differentials for an evaluative view of the
             search engine as an interactive system
                                                          Frances Johnson
                                  Department of Languages, Information & Communications
                                            Manchester Metropolitan University
                                                     Geoffrey Manton
                                                    +44 161 247 6156
                                                     F.Johnson@mmu.ac.uk


ABSTRACT
In this paper, we investigate the use of semantic differentials in     Accordingly, search engine developments have focused on
obtaining the evaluative view held by users of the search engine.      providing query assistance drawing on contextual aspects to the
The completed scales of bipolar adjectives were analysed to            search, such as personal history and/or current context [9]. At the
suggest the dimensions of the user judgment formed when asked          interface, developments focus on improving the search process via
to characterize a search engine. These were then used to obtain a      richer information representations and interactions, such as
comparative evaluation of two engines potentially offering             previews and facets through to tools that allow the user to view
different types of support (or assistance) during a search. We         and explore connections in the results, for example ‘the relation
consider the value of using the semantic differential as a technique   browser data analysis tool’ [10]. These shifts into HCIR are
in the toolkit for assessing the user experience during information    intended to help in the various stages of search, from starting the
interactions in exploratory search tasks.                              task and understanding the query topic, throughout the search in
                                                                       deciding what to do next, and to stopping with a sense of
Categories and Subject Descriptors H3.3 [Information search            confidence. In short, developments aim to support true
and retrieval]; Search process. H.5.2 [User interfaces]:               exploration of the search and, whilst many efforts may fall short,
Evaluation/methodology                                                 they will provide some form of user support in query assistance
General Terms                                                          and in improving the search process as an interactive experience.
Measurement, Performance, Design, Human Factors                        The context for evaluation is predicated on White and Roth’s [3]
                                                                       model of the exploratory search process. This involves the
Keywords                                                               searcher in a dynamic interplay between the cognition of their
Semantic Differentials, User Evaluation, Exploratory Search,           ‘problem space’ and their exploratory activities in the iterative
Information Interaction, User Interface Design,                        search process including the query formulation, results
                                                                       examination and information extraction. Data collected on the
                                                                       searcher’s information interactions may confirm this model [7] as
                                                                       well as attempt to systematically evaluate the effectiveness of
                                                                       exploratory search systems. In evaluation, a framework is used to
1.      INTRODUCTION                                                   attempt to assess performance during the search stages and to
The design of interfaces to support exploratory search seeks to        relate aspects of the system to its role in supporting information
provide users with the tools for and the experience of an              exploration, including sense making or query visualisation [5].
interactive and engaging search. This is a departure from the          The challenge for the evaluation of exploratory search is the
classic model of information retrieval wherein the user submits a      assumption that the user is willing or able to make an evaluative
keyword query to the system and scans the list of retrieved results    judgment throughout the search or that valid measures can be
for relevance, either stopping with relevant results or refining the   found through their actions, for example of usage of query terms.
query to get results that are closer to the information need.          In general, evaluation draws from established HCI measures of
Exploratory search does not necessarily assume that the user has a     effectiveness (can people complete their tasks?) efficiency (how
well defined information need (at least one that can be articulated    long do people take?), an assessment of the user’s overall
as a keyword query) or indeed that the query will be ‘static’ and      satisfaction or other affective responses. Where possible, and
thus satisfied by a single list of retrieved results.                  increasingly so, the user actions are observed and recorded as
                                                                       dependent on the system and/or its interface. In this study we
                                                                       focus on an attempt to obtain the user’s evaluative view of the
                                                                       search engine, based on criteria which may be affected by the
Presented at EuroHCIR2012. Copyright © 2012 for the individual         developments for new and richer interactive designs. It is
papers by the papers' authors. Copying permitted only for private      assumed that this would be part of an assessment which when
and academic purposes. This volume is published and copyrighted        taken with others will build a picture of the ‘user experience’ of
by its editors.                                                        the system used in exploratory search.
2.     USER EVALUATION                                                    The assumption made here, in the use of SDs on ‘search engines’
                                                                          is that users hold an evaluative view which is formed when using
In developing an instrument to collect the user assessment effort         the engine to find and/or explore information. The SD is used to
goes into ensuring that the evaluation is made in the task context.       investigate the adjectives that best ‘conceptualise’ the search
It means little to know that the user is ‘satisfied’ with the interface   engine, from the user perspective. Factorial analysis is also used
without gaining insight into why this assessment has been formed.         to identify the dimensions of the judgment, in a sense the
A variety of questionnaires have been developed for assessing             packaging of the components of the judgment into smaller units of
usability of interactive systems, such as search engines. Two well        meaning reflecting what is important when responding to the
known are the SUS (System Usability Scale) developed at the               concept ‘search engine’.
Digital Equipment Corporation [2] and the QUIS (Questionnaire
for User Interaction Satisfaction) from the University of Maryland        The design of the SD aims to allow a degree of abstraction in the
[4]. Both assess usability from the user perspective with 10              evaluation so that participants can reflect the complexity of their
statements and rating scales in the SUS and the QUIS with 27              response. In this study, the adjectives to include on the SD scale
questions. The QUIS asks the user to respond on a rating scale to         were chosen from Microsoft’s Product Reaction Cards, these
statements which address specific usability aspects of the system,        having been collected in previous research, usability studies and
such as “use of the terms were consistent throughout the website”.        in the marketing of web sites and systems. The majority of the
The SUS on the other hand focuses on collecting the users’                terms formed pairs on some continuum and 40 terms (20 pairs)
overall reaction to the site/system on statements, such as “I found       were selected to present in the SD. The selection was subject to
the website unnecessarily complex”. Arguably the QUIS focuses             the judgment of the researcher. This is a limitation of this
on the concerns that a developer might have when assessing                exploratory study, however some steps were taken to formalise the
usability whilst the SUS assumes that the user’s overall                  selection. A loose grouping of the adjective pairs was made as
assessment is a reflection on the extent to which their goal              relating to appearance (such as ‘attractive’), judgment (‘relevant’),
directed tasks were facilitated by the system and its design.             emotive (‘boring’) and use (‘fast’). Five pairs from each of these
                                                                          groupings were made. The pairs were mixed on the SD to avoid
Questionnaires, such as SUS, are used in an experimental set up           having all the positive terms on one side of the scale and only
when an explanation of the user’s overall assessment is sought.           intervals were shown on the scales with the numerical values used
However, the limitations of the questionnaire to capture and              only for data entry. This allowed participants to focus on how an
provide insight into the complexity of the user’s assessment has          adjective pair related to the engine and its characteristics, rather
lead to alternative tools, for example Microsoft’s Product                than on ‘scoring’ it in some way.
Reaction Cards in the "Desirability Toolkit". This invites
participants on a usability test to select as many, or as few, words
from a list of 118 which best describe their reaction and/or
interaction with the system they have just used. Benedek and              3.1        Implementation
Miner [1] includes a list of the words used and point out that the        The study was conducted on our undergraduates studying BSc
approach helps elicit negative comments as well as positive, thus         Web Development and on a postgraduate cohort studying on MA
overcoming a problem with questionnaires biased towards                   Library and Information Management or the MSc Information
positive responses.                                                       Management. A total of 89 students participated in the study. At
Given the potential scope of the users’ response (represented in          the start of the class each participant was asked to think about a
the reaction cards with some 100+ terms) this study sets out to           search engine, and adjectives they would use to describe the
investigate the value in assembling these into a framework (of            engine, (in other words, “what it means to them”). Each
sorts) for the collection of the users’ evaluative judgment of an         participant was then given the SD to complete. This is referred to
interactive system based on the technique known as ‘semantic              as the ‘baseline’ and the data were analysed to gauge user
differentials’. Specifically the aim of this small preliminary            perceptions of search engines.
investigation was to begin to determine the extent to which users         In the following lab sessions (about one hour later) each
hold an evaluative view of a ‘search engine’ and, what are the            participant was required to perform two search tasks on each of
dimensions (traits or criteria) on which we form this view. If it         the two search engines - Google, an engine we can assume some
can be found that this view is strongly held (that is, an attitude is     familiarity and, a second clustering engine (Yippy, formerly
formed which may influence how we behave and interact with the            Clusty). The two tasks were as follows
search engine) then it may be feasible to investigate the influence,
                                                                                1.    Find information on the symptoms for diabetes type II
if any, of a design for information interaction on the evaluative
view. In this study the technique of semantic differentials is used             2.   Find information to help write an assignment on the
to best describe the evaluative view held by its participants. This                  debate ‘nurture vs nature’
is then employed to assess two quite different search engines
following the completion of two query based searches.                     These were selected to give the participants experience of using
                                                                          the engines for a closed question (find symptoms) and on a more
                                                                          open ‘informational’ type of query (on the ‘nature nurture’
3.     SEMANTIC DIFFERENTIALS                                             debate). A measure of search success was not taken as the aim
                                                                          was simply to get the participants using the engines. The order of
Semantic Differentials (SDs) originate from the work of Osgood            use of the two sites was randomized so that approximately half of
[8] as a technique for attitude measurement, scaling people on            the participants worked on Google first and half on the clustering
their responses to adjectives in respect to a concept. Typically          engine. All were told to spend no longer than 10 minutes
individuals respond to several pairs of bipolar adjectives scored         searching on each engine and to complete the SD for each engine
on a continuum + to – and in doing so differentiate their meaning         immediately after each use.
of the concept in intensity and in direction (in a ‘semantic space’).
4.     FINDINGS
                                                                     4.2     Comparative evaluations
4.1    Evaluative views                                              Using the same SDs, participants scaled their responses post
The responses to the baseline (think of an engine) were entered      search using Google and the clustering search engine. These were
into SPSS with the scales coded (7-1) so that the positive           entered into a worksheet to obtain basic statistics. The mode for
adjectives corresponded to the higher numbers. Descriptive           each adjective is shown Figure 1 with a note of those with mode
statistics of mean, mode and standard deviation were calculated      >4 and < 3 suggesting a positive or negative response.
for each of the adjectives. Those with a mean greater than 4 or
less than 3 were taken to suggest the adjective pairs that best
characterise the participants’ view, as follows
         attractive          unattractive
         powerful            simplistic
         valuable            not valuable
         relevant            irrelevant
         satisfying          frustrating
         fast                slow
         predictable         unpredictable
         intuitive           rigid
                                                                                         Google (mode > 4 or < 3)
         easy                difficult
                                                                       & in bold where mean is also > 4
Factor analysis investigates the correlations among subsets of the
                                                                       1attractive - , 6valuable - , 8relevant - ,
responses to the bipolar pairs and groups the correlated variables
such that each group is largely independent of the others.             15satisfying - , 16fast - , 17predictable - ,18controllable -,
Exploratory factor analysis was employed to identify the groups        and (where mode < 3) 19rigid -
which might explain most of the variance in the data. With 20
pairs of adjectives to perform Principal Components Analysis
(PCA) in SPSS it is recommended that a minimum of 100
responses are obtained, whilst others recommend that a sample
requires approx 5-10 times the number of people as scale pairs
[6]. With 89 responses we should use a reduced number of pairs,
however the Kaiser-Meyer-Olkin measure of sampling adequacy
(.616) is greater than the 0.6 needed to indicate that the
correlations matrix may be able to factorise. So with this, PCA
was run (with varimax rotation to force items to ‘load’ with only
one factor group), to identify the possible ‘factors’ or subsets
derived from patterns of correlation of the adjective pairs. The
following five subsets were obtained (the adjectives from the list
above having a low or high mean are shown in bold). The labels
were assigned to suggest the evaluative dimension.
         Factor 1 label USE – Utility
                                                                               Clustering search engine (mode > 4 or < 3)
            effective, valuable, satisfy, relevant, predictable,       & in bold where mean > 4 or < 3
               intimidating, inspiring, stimulating
                                                                       14engaging - , 19intuitive –
         Factor 2 label QUALITY – Affective                            and (where mode < 3)
                                                                       13intimidating - , 17 – unpredictable
            engaging, fun, connected

         Factor 3 label QUALITY - Appearance
                                                                     Figure 1. Responses to the adjectives for both engines
            high quality, personal, meaningful, rigid, attractive
                                                                     Using the suggested dimensions or aspects of the user evaluation
         Factor 4 label USE – Efficient                              from the factor analysis of the ‘baseline’ data we can compare the
                                                                     participants’ responses on the high or low scoring adjectives
            easy, intuitive, fast, powerful                          across the engines. On QUALITY – Appearance Google
                                                                     was rated rigid and attractive and whereas Google was neutral on
         Factor 5 label USE - Control                                the factor QUALITY- Affective, the clustering search
                                                                     engine obtained a positive score towards the adjective engaging.
            controllable                                             On the factor labeled USE- Utility Google was scored as
predictable, valuable, relevant and satisfying, whereas the             [2] Brooke, J. SUS: A Quick and Dirty Usability Scale. In: P.W.
clustering engine as unpredictable and towards intimidating. On        Jordan, B. Thomas, B.A. Weerdmeester & I.L. McClelland (Eds.),
USE-Efficient Google was rated as fast and the clustering              Usability Evaluation in Industry. London: Taylor & Francis, 1996
engine appears more intuitive. Google was also rated as                [www.itu.dk/courses/U/E2005/litteratur/sus.pdf#
controllable.
                                                                       [3] Capra, R., and Marchionini, G. The Relation Browser tool for
                                                                       faceted exloratory search. Proceedings of the 2008 Conference on
5.     DISCUSSION                                                      Digital Libraries, Pittsburg, Pennsylvania, June, 2008
This is an exploratory study and it has its limitations. It is
questionable whether the selection of the adjectives to use in the     [4] Chin, J. P., Diehl, V. A, & Norman, K. Development of an
SD influenced the results. In particular there is uncertainty in the   instrument measuring user satisfaction of the human-computer
results that intuitive to rigid is on some continuum. Also there is    interface, Proceedings of ACM SIGCHI ,1988, pp. 213-218.
some unease at accepting a factor with 8 out of 20 pairs and one       http://www.cs.umd.edu/hcil/quis/
with only one. Perhaps the sample size was too small to attempt
factoring. The results also raise questions on how some of the         [5] Daqing He, et al An evaluation of adaptive filtering in the
adjectives were interpreted by the participants. These                 context of realistic task-based information exploration. .
withstanding, the participants in this study did appear to hold an     Information Processing Management, 44(2), 2008 pp. 511-533
evaluative judgment of the concept ‘search engine’ and the traits
represented in the scale were grouped to suggest the aspects on        [6] Gable, R. K., & Wolf, M. E.. Instrument development in the
which an assessment may be formed. It is of particular interest        affective domain (2nd ed.). Boston: Kluwer Academic, 1993
that upon using the search engine Google to conduct a search task
the ratings on the SD, on the whole, altered only in the factors of     [7] Kules, B and Capra, R. Visualizing stages during an
‘controllable’ and USE –efficient (easy, intuitive and                 exploratory search. Proceedings HCIR October 20th, 2011.
powerful). Perhaps we can assume that Google was the typical
engine when asked to think of an engine in the baseline and, when       [8] Osgood, C.E, Suci, G., & Tannenbaum, P The Measurement
it came to use Google, users shifted their perception with regards     of Meaning. University of Illinois Press, 1957
to some of the adjectives. Perhaps this is not surprising but it may
suggest that we hold an implicit view of search engines, and that      [9] Teevan, J., Dumais,S.T and E. Horvitz. Potential for
this view will be influenced by actual use (and the experience).       Personalization. ACM Transactions on Computer-Human
Our participants may have had less familiarity with the clustering     Interaction special issue on Data Mining for Understanding User
engine, and in the evaluation this appears to have prompted an         Needs, 17(1), 2010 http://people.csail.mit.edu/teevan/work/
‘affective’ response in finding the engine to be ‘engaging’ whilst     publications/ papers/tochi10.pdf
also indicating shifts in the ‘use’ factors (towards an assessment
of the engine as ‘unpredictable’). Again the infallibility of some     [10] White, Ryen W. & Roth., R. A. Exploratory Search: Beyond
of the terms is highlighted where an ‘unpredictable’ system may        the Query-Response Paradigm, CA: Morgan and Claypool, 2009
be regarded to be a negative judgment, but if the system is also
considered to be engaging the assessment could be highly
desirable depending on the user’s goals. This study of the use of      Appendix: The Semantic Differential scale
semantic differentials indicates that it is worth running the test
with a new cohort of students to determine the extent to which a       attractive         _    _   _    _   _   _    _   unattractive
consistent view is obtained. As an exploratory study it also           impersonal         _    _   _    _   _   _    _   personal
suggests that further research on user’s perceptions and mental        dull               _    _   _    _   _   _    _   fun
models of search engines is worthwhile. With regards to the            powerful           _    _   _    _   _   _    _   simplistic
challenge of providing an evaluation of the exploratory search,        disconnected       _    _   _    _   _   _    _   connected
this study falls short as no behavioural data was obtained.            valuable           _    _   _    _   _   _    _   not valuable
However, perhaps, with further design of the SD and use in an          high quality       _    _   _    _   _   _    _   low quality
experimental set up with honed tasks, a user assessment of the         irrelevant         _    _   _    _   _   _    _   relevant
interface may be obtained as dependent on the search interface         effective          _    _   _    _   _   _    _   ineffective
development and design.                                                incomprehensible   _    _   _    _   _   _    _   meaningful
                                                                       stimulating        _    _   _    _   _   _    _   confusing
                                                                       boring             _    _   _    _   _   _    _   inspiring
6.      REFERENCES                                                     intimidating       _    _   _    _   _   _    _   empowering
                                                                       stressful          _    _   _    _   _   _    _   engaging
[1] Benedek, J. and Miner, T. "Measuring Desirability: New             satisfying         _    _   _    _   _   _    _   frustrating
Methods for Evaluating Desirability in a Usability Lab Setting."       fast               _    _   _    _   _   _    _   slow
Redmond, WA: Microsoft Corporation, 2002.                              predictable        _    _   _    _   _   _    _   unpredictable
http://www.microsoft.com/usability/UEPostings/Desirability             controllable       _    _   _    _   _   _    _   uncontrollable
Toolkit.doc                                                            intuitive          _    _   _    _   _   _    _   rigid
                                                                       difficult          _    _   _    _   _   _    _   eas