=Paper= {{Paper |id=None |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-763/euroHCIR2011_proceedings.pdf |volume=Vol-763 }} ==None== https://ceur-ws.org/Vol-763/euroHCIR2011_proceedings.pdf

euroHCIR2011
4th July 2011 – Newcastle, UK

Proceedings of the
1st European Workshop on
Human-Computer Interaction with
Information Retrieval
A workshop at BCS-HCI2011

Executive Summary
EuroHCIR2011 was the first in a series of new workshops aimed at stimulating the European Human Computer Interaction and Information Retrieval (HCIR) community in a similar manner to the series of successful workshops held in the USA. The workshop, which won industry sponsorship from LexisNexis, was highly successful, accepting 11 short papers and drawing participants from a dozen countries across Europe. In addition to the 8 insightful presentations and 3 poster presentations, Ann Blandford, from University College London’s Interaction Centre, gave an inspiring keynote about their work on Exploratory Search and Serendipity.

Organised by
Max L. Wilson, Future Interaction Technologies Lab, Swansea University, UK, m.l.wilson@swansea.ac.uk
Birger Larson, The Royal School of Library and Information Science, Denmark, blar@iva.dk
Tony Russell-Rose, UXLabs, UK, tgr@uxlabs.co.uk
James Kalbach, LexisNexis, UK, James.kalbach@lexisnexis.co.uk

Sponsored by LexisNexis

Session 1
Exploratory Search in an Audio-Visual Archive: Evaluating a Professional Search Tool for Non-Professional Users
Marc Bron, Jasmijn Van Gorp, Frank Nack and Maarten De Rijke

Supplying Collaborative Source-code Retrieval Tools to Software Developers
Juan M. Fernández-Luna, Juan F. Huete and Julio Rodriguez-Cano

Interactive Analysis and Exploration of Experimental Evaluation Results
Emanuele Di Buccio, Marco Dussin, Nicola Ferro, Ivano Masiero, Giuseppe Santucci and Giuseppe Tino

A Taxonomy of Enterprise Search
Tony Russell-Rose, Joe Lamantia and Mark Burrell

Session 2
Back to MARS: The unexplored possibilities in query result visualization
Alfredo Ferreira, Pedro B. Pascoal and Manuel J. Fonseca

The Mosaic Test: Benchmarking Colour-based Image Retrieval Systems Using Image Mosaics
William Plant, Joanna Lumsden and Ian Nabney

Evaluating the Cognitive Impact of Search User Interface Design Decisions
Max L. Wilson

The potential of Recall and Precision as interface design parameters for information retrieval systems situated in everyday environments
Ayman Moghnieh and Josep Blat

Posters
Towards User-Centered Retrieval Algorithms
Manuel J. Fonseca

Design Thinking Search User Interfaces
Arne Berger

The Development and Application of an Evaluation Methodology for Person Search Engines
Roland Brennecke, Thomas Mandl and Christa Womser-Hacker

Exploratory Search in an Audio-Visual Archive:
Evaluating a Professional Search Tool for
Non-Professional Users

Marc Bron
ISLA, University of Amsterdam
m.m.bron@uva.nl

Jasmijn van Gorp
TViT, Utrecht University
j.vangorp@uu.nl

Frank Nack
ISLA, University of Amsterdam
nack@uva.nl

Maarten de Rijke
ISLA, University of Amsterdam
derijke@uva.nl

ABSTRACT
As archives are opening up and publishing their content online, the general public can now directly access archive collections. To support access, archives typically provide the public with their internal search tools that were originally intended for professional archivists. We conduct a small-scale user study where non-professionals perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio visual archive. We evaluate the tool using objective and subjective measures and find that non-professionals find the search interface difficult to use in terms of both. Analysis of search behavior shows that non-professionals who often visit the description pages of individual items in a result list are more successful on search tasks than those who visit fewer pages. A more direct presentation of entities present in the metadata fields of items in a result list can be beneficial for non-professional users on exploratory search tasks.

Categories and Subject Descriptors
H.5.2 [User interfaces]: Evaluation/methodology

General Terms
Measurement, Performance, Design, Experimentation

Keywords
Exploratory search, Usability evaluation

Copyright © 2011 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

1. INTRODUCTION
Traditionally, archives have been the domain of archivists and librarians, who retrieve relevant items for a user’s request through their knowledge of the content in, and organization of, the archive. Increasingly, archives are opening up and publishing their content online, making their collections directly accessible for the general public. There are two major problems that these non-professional users face. First, most users are unfamiliar or only partially familiar with the archive content and its representation in the repository. The internal representation is designed from the expert point of view, i.e., the type of information included in the metadata, which does not necessarily match the expectation of the general public. This leads to an increase in exploratory types of search [5], as users are unable to translate their information need into terms that correspond with the representation of the content in the archive. The second problem is that archives provide users with professional search tools to search through their collections. Such tools were originally developed to support professional users in searching through the metadata descriptions in a collection. Given their knowledge of the collection, professionals primarily exhibit directed search behavior [3], but it is unclear to what extent professional search tools support non-professional users in exploratory search.

The focus of most work on improving exploratory search is towards professionals [1]. In this paper we present a small-scale user study where non-professional users perform exploratory search tasks in an audio-visual archive using a search tool originally developed for media professionals and archivists. We investigate the following hypotheses: (i) a search interface designed for professional users does not provide satisfactory support for non-professional users on exploratory search tasks; and (ii) users with high performance on exploratory search tasks have different search behavior than users with lower performance.

In order to investigate the first hypothesis we evaluate the search tool performance objectively in terms of the number of correct answers found for the search tasks and subjectively through a usability questionnaire. To answer the second hypothesis, we perform an analysis of the click data logged during search.

2. EXPERIMENTAL DESIGN
The environment. The setting for our experiment was the Netherlands Institute for Sound and Vision (S&V), the Dutch national audiovisual broadcast archive. In the experiment we used the archive’s collection consisting of around 1.5 M (television) programs with metadata descriptions provided by professional annotators.

We also utilized the search interface of S&V (http://zoeken.beeldengeluid.nl). The interface is available in a simple and an advanced version. The simple version is similar to search engines known from the web. It has a single search box and submitting a query results in a ranked list of 10 programs. Clicking on one of the programs, the interface shows a page with the complete metadata description of the program. Table 1 shows the metadata fields available for a program. Instead of
the usual snippets presented with each item in a result list, the interface shows the title, date, owner and keywords for each item on the result page. Only the keywords and title field provide information about the actual content of the program while the other fields provide information primarily used for the organization of programs in the archive collection. The description and summary fields contain the most information about the content of programs but are only available by visiting the program description page.

We used the advanced version of the interface in the experiment which next to the search box offers two other components: search boxes operating on specific fields and filters for certain categories of terms. Fielded searches operate on specific fields in the program metadata. The filters become available after a list of programs has been returned in response to a query. The filters display the top five most frequent terms in the returned documents for a metadata field. The metadata fields displayed in the filter component of the interface are highlighted in bold in Table 1. Once a checkbox next to one of the terms has been ticked, programs not containing that term in that field are removed from the result list.

Table 1: All metadata fields available for programs. We differentiate between fields that describe program content and fields that do not. Bold indicates fields used by the filter component.

content descriptors                            organizational descriptors
field         explanation                      field    explanation
description   program highlights               medium   storage medium
person        people in program                genre    gameshow; news
keyword       terms provided by annotator      rights   parties allowed to broadcast
summary       summary of the program format    owner    owner of the broadcast rights
organization  organization in program          date     broadcast date
location      locations in program             origin   program origin
title         program title

Subjects. In total, 22 first year university students from media studies participated in the experiment. The students (16 female, 6 male) were between 19 and 22 years of age. As a reward for participation the students gained free entrance to the museum of the archive.

Experiment setup. In each of the five studios available at S&V either one or two subjects performed the experiment at a time in a single studio. In case two subjects were present, each of them worked on machines facing opposite sides of the studio. We instructed subjects not to communicate during the experiment. During the experiment one instructor was always present in a studio. Before starting, the subjects learned the goals of the experiment, got a short tutorial on the search interface and performed a test query. During this phase the subjects were allowed to ask questions.

In the experiment each subject had to complete three search tasks in 45 minutes. If after 15 minutes a task was not finished, the instructor asked the subject to move on to the next task. Search tasks are related to matters that could potentially occur within courses of the student’s curriculum. Each search task required the subjects to find five answers before moving on to the next task. A correct answer was a page with the complete metadata description of a program that fulfilled the information need expressed by the search task. Subjects could indicate that a page was an answer through a submit button added to the interface for the experiment.

We used the following three search tasks in the experiment: (i) For the course “media and ethnicity” you need to investigate the role of ethnicity in television-comedy. Find five programs with different comedians with a non-western background. (ii) For the course “television geography” you need to investigate the representation of places in drama series. Find five drama series where location plays an important role. (iii) For the course “media and gender” you need to give a presentation about the television career of five different female hosts of game shows broadcast during the 1950s, 1960s or 1970s. Find five programs that you can use in your presentation.

Subjects received the search tasks in random order to avoid any bias. Also, subjects were encouraged to perform the search in any means that suited them best. During the experiment we logged all search actions, e.g., clicks, performed by each subject. After a subject had finished all three search tasks, he or she was asked to fill out a questionnaire about the experiences with the search interface.

Methodology for evaluation and analysis. We performed two types of evaluation of the search interface: a usability questionnaire and the number of correct answers submitted for the search tasks. The questionnaire consists of three sets of questions. The first set involves aspects of the experienced search behaviour with the interface. The second set contains questions about how useful users find the filter component, fielded search component, and metadata fields presented in the interface. The third set asks subjects to indicate the usefulness of a series of term clouds. The primary goal is not to evaluate the term clouds or their visualization but to find preferences for information from certain metadata fields. We generated a term cloud for a specific field as follows. First, we got the top 1000 program descriptions for the query “comedian.” We counted the terms for a field for each of the documents. The cloud then represented a graphical display of the top 50 most frequent terms in the fields of those documents, where the size of a term was relative to its frequency, i.e., the higher the frequency the bigger the term. In the questionnaire subjects indicate agreement on a 5 point Likert scale ranging from one (not at all) to five (extremely). The second type of evaluation was based on the evaluation methodology applied at TREC [2]. We pooled the results of all subjects and let two assessors make judgements about the relevance of the submitted answers to a search task. An answer is only considered relevant if both assessors agree. Performance is measured in terms of the number of correct answers (#correct) submitted to the system.

For the analysis of the search behavior of subjects we looked at (i) the number of times a search query is submitted using any combination of components (#queries); (ii) the number of times a program description page is visited (#pages); and (iii) the number of times a specific component is used, i.e., the general searchbox, filters and fields. A large value for #queries indicates a look up type search behavior. It is characterized by a pattern of submitting a query, checking if the answer can be found in the result list and, if it is not, formulating a new query. The new query is not necessarily based on information gained from the retrieved results but rather inspired by the subject’s personal knowledge [4]. A large value for #pages indicates a learning style search behavior. In this search strategy a subject visits the program description of each search result to get a better understanding of the organization and content of the archive. New queries are then also based on information gained from the previous text analysis [4]. We check the usage frequency of specific components to see if performance differences between subjects are due to alternative uses of interface components.

3. RESULTS
Search interface evaluation. Figure 1 shows the distribution of the amount of correct answers submitted for a search task, together with the distribution of the amount of answers (correct or incorrect) submitted. Out of the possible total of 330 answers, 173 are actually submitted. Subjects submit the maximum number of five
answers for 18 of the tasks. This suggests that subjects have difficulties in finding answers within the given time limit. Subjects find no correct answers for 31 of the tasks, five subjects find no correct answer for any of the tasks, and none of the subjects reaches the maximum of five correct answers for a task. In total 64 out of 173 answers are correct. This low precision indicates that subjects find it difficult to judge if an answer is correct based on the metadata provided by the program description. Table 2 shows questions about the satisfaction of subjects with the interface. Subjects indicate their level of agreement from one (not at all) to five (extremely). For all questions the majority of subjects find the amount of support offered by the interface on the exploratory search tasks marginal. This finding supports our first hypothesis that the search interface intended for professional users does not provide satisfactory support to non-professional users on exploratory search tasks.

[Figure 1: Distribution of amount correct/submitted answers. Chart with y-axis #tasks (0 to 30), x-axis 0 to 5, and series #correct and #submitted.]

Table 2: Questionnaire results about the satisfaction of subjects with the search interface. Agreement is indicated on a 5 point Likert scale ranging from one (not at all) to five (extremely).

question                                                                                      mode  avg
To what degree are you satisfied with the search experience offered by the interface?         2     2.3
To what degree did the interface support you by suggesting new search terms?                  2     2.4
To what degree are you satisfied with the suggestions for new search terms by the interface?  2     2.3

Search behavior analysis. Although all subjects are non-experts with respect to search with this particular interface, some perform better than others. We investigate whether there is a difference in the search behavior of subjects that have high performance on the search tasks and users that have lower performance. We divide subjects into two groups depending on the average number of correct answers found aggregated over the three tasks, i.e., 2.9 out of the possible maximum of 15. The group with higher performance (group G) consists of 11 subjects with 3 or more correct answers, whereas the group with lower performance (group B) consists of 11 subjects with 2 or fewer correct answers.

Table 3 shows the averages of the search behavior indicators for each of the two groups. We first look at the usage frequency of the filter, field, and search box components by subjects in group G vs. group B. There is no significant difference between the groups, indicating that there is no direct correlation between performance on the search tasks and use of specific search components. Next we look at search behavior as an explanation for the difference in performance between the groups. Our indicator for lookup searches, i.e., #queries, shows a small difference in the number of submitted queries. That subjects in both groups submit a comparable number of queries suggests that the difference in performance is not due to one group doing more lookups than the other. The indicator for learning type search, i.e., #pages, shows that there is a significant difference in the number of program description pages visited between subjects of the two groups, i.e., subjects in group G tend to visit program description pages more often than subjects of group B. We also find that the average time subjects in group G spend on a program description page is 27 seconds, while subjects from group B spend on average 39 seconds. These observations support our hypothesis that there are differences in search behavior between subjects that have high performance on exploratory search tasks and subjects with lower performance.

Table 3: Analysis of search behavior of subjects. Significance is tested using a standard two-tailed t-test. The symbol ▲ indicates a significant increase at the α < 0.01 significance level.

        filter  field  searchbox  #queries  #pages
B avg   21.3    29.5   44.8       35.2      21.2
G avg   15.2    44.0   42.0       34.3      35.7▲

Usefulness of program descriptions. One explanation for this difference in performance is that through their search behavior subjects from group G learn more about the content and organization of the archive and are able to assimilate this information faster from the program descriptions than subjects from group B. As subjects process more program descriptions they learn more about the available programs and terminology in the domain. This results in a richer set of potential search terms to formulate their information need. To investigate whether subjects found information in the program descriptions useful in suggesting new search terms, we analyse the second set of questions from the questionnaire. The top half of Table 4 shows subjects’ responses to questions about the usefulness of metadata fields present on the search result page. Considering responses from all subjects the genre and keyword fields are found most useful and the title and date fields as well, although to a lesser degree. The fields intended for professionals, i.e., origin, owner, rights, and medium are found not useful by the majority of subjects. Between group B and G there are no significant differences in subjects’ judgement of the usefulness of the fields.

Table 4: Questions about the usefulness of metadata fields on program description pages and the mode and average (avg) of the subjects’ responses: for all subjects, the good (G) and bad (B) performing group. We use a Wilcoxon signed rank test for the ordinal scale. The symbol △ (▲) indicates a significant increase at the α < 0.05 (0.01) level.

                all    B      B     G     G
field           mode   mode   avg   mode  avg
Degree to which fields on the result page were useful in suggesting new terms
date            3      2      2.2   3     3.0
owner           1      1      1.6   1     2.0
rights          1      1      1.3   1     1.4
genre           4      1      2.8   4     3.9
keyword         4      1,5    3.1   4     3.5
origin          1      1,2    1.7   1     2.0
title           3,4    2      2.2   4     3.0
medium          1      1      1.5   1,2   1.6
Degree to which fields in program descriptions were useful in suggesting new terms
summary         4      1,4    2.8   5     3.8
description     4      4      3.3   4△    4.1
person          4      1,3,4  2.8   4▲    3.8
location        1,3,4  1,3    2.0   4▲    3.0
organization    1      1      1.8   1,2   2.0
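
The grouping and significance test behind Table 3 can be illustrated with a short sketch. It is not the analysis code used in the study: the per-subject numbers are hypothetical toy values, the names `split_groups` and `pooled_t` are illustrative, and the pooled-variance form of the t statistic is an assumption, since the text only specifies a standard two-tailed t-test. The sketch stops at the t statistic and degrees of freedom; turning these into a p-value needs a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def split_groups(correct_by_subject, threshold=3):
    """Split subjects into group G (>= threshold correct answers) and group B (fewer)."""
    good = [s for s, c in correct_by_subject.items() if c >= threshold]
    bad = [s for s, c in correct_by_subject.items() if c < threshold]
    return good, bad


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (an assumed variant) and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical toy data, NOT the logged data from the experiment.
correct = {"s1": 5, "s2": 3, "s3": 4, "s4": 1, "s5": 0, "s6": 2}    # #correct per subject
pages = {"s1": 40, "s2": 36, "s3": 32, "s4": 20, "s5": 24, "s6": 22}  # #pages per subject

good, bad = split_groups(correct)
t, df = pooled_t([pages[s] for s in good], [pages[s] for s in bad])
print(f"G={good} B={bad} t={t:.2f} df={df}")
```

The threshold of 3 mirrors the split described in the text (3 or more correct answers for group G, 2 or fewer for group B); the same comparison would be repeated for each indicator column of Table 3.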
The bottom part of Table 4 shows subjects’ responses to questions about the usefulness of metadata fields only present on the program description page and not already shown on the search result page. Based on all responses, the summary, description, person and location metadata fields are considered most useful by the majority of the subjects. These findings further support our argument that program descriptions provide useful information for subjects to complete their search tasks.

When we contrast responses of the two groups we find that group G subjects consider the description, person, and location metadata fields significantly more useful than subjects from group B. This suggests that group B subjects have more difficulties in distilling useful information from these fields (recall also the longer time spent on a page). This does not say that these users cannot understand the provided information. All that is indicated is that the chosen modality, i.e., text, might not be the right one. A graphical representation, for example as term clouds, might be better.

Fields as term clouds. In response to the observations just made, we also investigated how users would judge visual representations of search results, i.e., in the form of term clouds directly on the search result page. Here the goal is not to evaluate the visualization of the clouds or the method by which they are created. Of interest to us is whether subjects would find a direct presentation of information normally “hidden” on the program description page useful.

Recall from §2 that we generate term clouds for each field on the basis of the terms in the top 1000 documents returned for a query. From Table 5 we observe that subjects do not consider the description and summary clouds useful, while previously these fields were judged most useful among the fields in the program description. Both clouds contain general terms from the television domain, e.g., program and series, which do not provide subjects with useful search terms. Although this could be due to the use of frequencies to select terms, these fields are inherently difficult to visualize without losing the relations between the terms. The genre, keyword, location and, to some degree, person clouds are all considered useful, but they support the user in different ways. The genre field supports the subject in understanding how content in the archive is organized, i.e., it provides an overview of the genres used for categorization. The keyword cloud provides the user with alternative search terms for his original query, for example, satire or parody instead of cabaret. The location and person clouds offer an indication of which locations and persons are present in the archive and how prominent they are. For these fields visualization is easier, i.e., genre, keywords or entities by themselves are meaningful without having to represent relations between them. Subjects consider the title field only marginally useful. For this field the usefulness is dependent on the knowledge of the subject as titles are not necessarily descriptive. The subjects also consider the organization field marginally useful, probably due to the nature of our search tasks, i.e., two tasks focus on finding persons and in one locations play an important role. We assume though that in general this type of information need occurs when the general public starts exploring the archive. Together, the above findings suggest that subjects find a direct presentation of short and meaningful terms, i.e., categories, keywords, and entities, on the search results page useful.

Table 5: Questions about the usefulness of term clouds based on specific metadata fields. Agreement is indicated on a 5 point Likert scale ranging from one (not at all) to five (extremely).

cloud         mode  avg    cloud        mode  avg
title         2     2.8    description  1     2.5
person        2,3   2.9    genre        4     3.4
location      4     3.3    summary      1     2.3
organization  2     2.2    keyword      4     3.8

4. CONCLUSION
We presented results from a user study where non-professional users perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio visual archive. We hypothesized that such search tools provide unsatisfactory support to non-professional users on exploratory search tasks. By means of a TREC style evaluation we find that subjects achieve low recall in the number of correct answers found. In a questionnaire regarding the user satisfaction with the search support offered by the tool, subjects indicate this to be marginal. Both findings support our hypothesis that a professional search tool is unsuitable for non-professional users performing exploratory search tasks.

Through an analysis of the data logged during the experiment, we find evidence to support our second hypothesis that subjects perform different search strategies. Subjects that visit more program description pages are more successful on the exploratory search tasks. We also find that subjects consider certain metadata fields on the program description pages more useful than others. Subjects indicate that visualization of certain fields as term clouds directly in the search interface would be useful in completing the search tasks. Subjects especially consider presentations of short and meaningful text units, e.g., categories, keywords, and entities, useful.

In future work we plan to perform an experiment in which we present non-professional users with two interfaces: the current search interface and one with a direct visualization of categories, keywords and entities on the search result page.

Acknowledgements. This research was partially supported by the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the PROMISE Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191, the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, and under COMMIT project Infiniti.

REFERENCES
[1] J.-w. Ahn, P. Brusilovsky, J. Grady, D. He, and R. Florian. Semantic annotation based exploratory search for information analysts. Inf. Proc. & Management, 46(4):383–402, 2010.
[2] D. K. Harman. The TREC test collections. In E. M. Voorhees and D. K. Harman, editors, TREC: Experiment and evaluation in information retrieval. MIT, 2005.
[3] B. Huurnink, L. Hollink, W. van den Heuvel, and M. de Rijke. Search behavior of media professionals at an audiovisual archive. J. Am. Soc. Inf. Sci. and Techn., 61:1180–1197, 2010.
[4] G. Marchionini. Exploratory search: from finding to understanding. Comm. ACM, 49(4):41–46, April 2006.
[5] R. White, B. Kules, S. Drucker, and M. Schraefel. Supporting exploratory search: Special issue. Comm. ACM, 49(4), 2006.
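
The term-cloud construction described in Section 2 (count the terms of one metadata field over the retrieved documents, keep the most frequent terms, and size each term by its frequency) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function name `term_cloud`, the representation of a document as a dictionary of already tokenized field values, and the linear scaling of size against the most frequent term are all illustrative choices, and the example documents are hypothetical.

```python
from collections import Counter


def term_cloud(documents, field, top_k=50):
    """Return (term, relative_size) pairs for the top_k most frequent terms of one field.

    relative_size is the term count divided by the highest count, so the most
    frequent term has size 1.0 (linear scaling is an assumption of this sketch).
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.get(field, []))  # documents lacking the field contribute nothing
    top = counts.most_common(top_k)
    if not top:
        return []
    max_count = top[0][1]
    return [(term, count / max_count) for term, count in top]


# Hypothetical toy documents; in the study the input would be the
# top 1000 program descriptions returned for a query.
docs = [
    {"keyword": ["satire", "cabaret"]},
    {"keyword": ["satire", "parody"]},
    {"keyword": ["satire", "cabaret"]},
]
cloud = term_cloud(docs, "keyword", top_k=2)
print(cloud)
```

Selecting terms purely by frequency is what lets general domain terms such as program and series dominate the description and summary clouds, which matches the limitation noted in the discussion of Table 5.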
Supplying Collaborative Source-code Retrieval Tools
to Software Developers

Juan M. Fernández-Luna
Departamento de Ciencias de la Computación e Inteligencia Artificial, CITIC-UGR. Universidad de Granada, 18071 Granada, Spain
jmfluna@decsai.ugr.es

Juan F. Huete
Departamento de Ciencias de la Computación e Inteligencia Artificial, CITIC-UGR. Universidad de Granada, 18071 Granada, Spain
jhg@decsai.ugr.es

Julio C. Rodríguez-Cano
Centro de Desarrollo Territorial Holguín. Universidad de las Ciencias Informáticas, 80100 Holguín, Cuba
jcrcano@uci.cu

ABSTRACT
Collaborative information retrieval (CIR) and search-driven software development (SDD) are both new emerging research fields; the first one was born in response to the problem of satisfying shared information needs of groups of users that collaborate explicitly, and the second to explore the source-code retrieval concept as an essential activity during the software development process. Taking advantage of the recent contributions in CIR and SDD, in this paper we introduce a plug-in that can be added to the NetBeans IDE in order to enable remote teams of developers to use collaborative source-code retrieval tools. We also include in this work experimental results to confirm that CIR&SDD techniques give better search results than individual strategies.

Categories and Subject Descriptors
H.5.3 [Information Interfaces and presentation (e.g., HCI)]: Group and Organization Interfaces; H.3.3 [Information Storage and Retrieval]: Search Process.

General Terms
Design, Human Factors.

Keywords
Collaborative Information Seeking and Retrieval, Search-driven Software Development, Multi-user Search Interface.

1. INTRODUCTION
“Collaboration” seems to be the buzzword this year, just like “knowledge management” was last year.
– David Coleman

In the last few years, Information Retrieval (IR) Systems have become critical tools for software developers. Today we can use vertical IR systems focused on integrated development environment (IDE) extensions for source-code retrieval such as Strathcona [5], CodeConjurer [6], and CodeGenie [1], but these only allow an individual interaction from the team developers’ perspective.

One of the reasons that the existing IR systems do not adequately support collaboration is that there are no good models and methods that describe users’ behavior during collaborative tasks. To address this issue, the community has adopted CIR as an emerging research field in charge of establishing techniques to satisfy the shared information needs of group members, starting from the extension of the IR process with the knowledge about the queries, the context, and the explicit collaboration habits among group members. The CIR community identifies four fundamental features in this multidisciplinary field that can enhance the value of collaborative search tools: user intent transition, awareness, division of labor, and sharing of knowledge [2].

In addition, SDD is a new research area motivated by the observation that software developers spend most of their time searching for pertinent information that they need in order to solve their tasks at hand. We identified that the SDD context was a very interesting field where collaborative IR features could be greatly exploited. For this reason we use the phrase collaborative SDD to refer to the application of different collaborative IR techniques in the SDD process [3].

It is known that some IDEs incorporate tools with support for developers’ collaboration practices, but without placing emphasis on source-code retrieval. In this sense, the objective of this paper is to present the results of the comparison of traditional SDD and collaborative SDD. In both search scenarios, we use the NetBeans IDE plug-in COSME (COllaborative Search MEeting) with the appropriate configurations. COSME endows NetBeans IDE with traditional and collaborative source-code retrieval tools.

This paper is organized as follows: The first section presents a brief overview of related works and places our research in context. Then, we describe our software tool and method, explaining the different aspects of our experimental evaluation. Finally we discuss the results and present some concluding remarks.

2. RELATED WORK
There is a small body of work that investigates methods to join collaborative information retrieval and search-driven software development. On the one hand, some researchers have identified different search scenarios where it is necessary to extend IR systems with collaborative capabilities. For example, in the Web context, SearchTogether [8] is a system which enables remote users to synchronously or asynchronously collaborate when searching the Web. It supports
collaboration with several mechanisms of group awareness, division of labor, and persistence. On the other hand, the SDD community presents different prototypes and systems. For example, Sourcerer [1] is an infrastructure for large-scale indexing and analysis of open source code. Sourcerer crawls the Internet looking for Java code from a variety of locations, such as open source repositories, public web sites, and version control systems.

CIR systems can be applied in several domains, such as travel planning, organizing social events, working on a homework assignment or medical environments, among many others. We identified software development as another possible application field where much evidence of collaboration among programmers on a development task can be found. For example, concurrent editing of models and processes requires synchronous collaboration between architects and developers who cannot be physically present at a common location [7].

However, current SDD systems do not have support for explicit collaboration among developers with shared technical information needs, who frequently look for additional documentation on the API (Application Programming Interface), read posts from people having the same problem, search the company's site for help with the API, or look for source code examples where other people successfully used the API. Fortunately, in the last few years, some researchers have realized that collaboration is an important feature, which should be analyzed in detail in order to be integrated with operational IR systems, upgrading them to CIR systems.

As an approach to these situations, we propose in this work the COSME plug-in [4]. It contributes to current SDD by providing explicit support for teams of developers, enabling developers to collaborate on both the process and results of a search. COSME provides collaborative search functions for exploring and managing source-code repositories and documents about technical information in the software development context.

In order to support such CIR techniques, COSME provides some collaborative services in the context of SDD:

• The embedded chat tool enables direct communication among different developers.

• Relevant search results can be shared with the explicit recommender mechanisms.

• Another important feature is the automatic division of labor. By implementing an effective division of labor policy the search task can be split across team developers, thereby avoiding considerable duplication of effort.

• Through awareness mechanisms all developers are always informed about the team activities to save effort. Awareness is a valuable learning mechanism that helps the less experienced developers to view the syntax used by their teammates, being an inspiration to reformulate their queries.

• All search results can be annotated, either for personal use, like a summary, or in the team context, for discussion threads and ratings.

3. THE COSME PLUG-IN
To support software developers with shared technical information needs we implemented the COSME front-end as a NetBeans IDE plug-in. The principal technologies that we used to implement it include the CIRLab framework [2], the NetBeans IDE platform, Java as programming language, and AMENITIES (A MEthodology for aNalysis and desIgn of cooperaTIve systEmS) as software engineering methodology. COSME is designed to enable either synchronous or asynchronous, but explicit, remote collaboration among teams of developers with shared technical needs. In the following section we outline COSME.

3.1 Current Features
Figure 1 is a screenshot showing various features of our COSME plug-in. We refer to the circled numbers in the following text.

1. Search Control Panel: It is in turn composed of three collapsible panels: (a) configuration, where the developers can select the search options and engines to accomplish the search tasks; (b) filters, which show the user's fields of interest according to the collection contents; and (c) collection type, which permits specifying the type of the search result items.

2. Search Results Window: The search results can be classified according to three different source-code localizations: (d) results can be obtained as a consequence of division of labor techniques introduced by the collaborative search session (CoSS) chairman. A CoSS is a group of end-users working together to satisfy their shared information needs. A CoSS can have only one developer in the role of chairman; (e) or by explicit recommendations made among the group members of their CoSS; (f) finally, search results can also be obtained by individual search.

3. Item Viewer: It shows full item content in different formats, e.g. pdf, plain text, and Java source-code files. All item formats are shown to the developers within the NetBeans IDE.

4. CoSS Portal: Developers can use the chat tool embedded in the CoSS Portal to negotiate the creation of a collaborative search session or to join any active CoSS. For each CoSS, the chairman can establish the integrity criteria, membership policy, and division of labor principles.

4. EXPERIMENTAL EVALUATION
In this section we show how collaborative features applied to SDD improve on traditional operation without them. If we consider the null hypothesis ($H_0$) that $A_{TSDD} \geq A_{CSDD}$, our alternative hypothesis ($H_1$) is that collaborative work should help to improve the retrieval performance in an SDD task: $A_{TSDD} < A_{CSDD}$, where TSDD stands for Traditional SDD and CSDD for Collaborative SDD. To evaluate our proposal we compare 10 group interactions in two different kinds of search scenarios (SS) on SDD, $SS_{2k+1}$ and $SS_{2(k+1)}$; $k \in 0, \ldots, 9$. $SS_{2k+1}$ represents a team of developers that uses a conventional IR system, which means that developers do not have access to techniques of division of labor, sharing of knowledge, or awareness (traditional SDD – TSDD), while $SS_{2(k+1)}$ represents a team of developers that uses a CIR system. Thus, 5 teams worked in a TSDD context (those with odd subindexes) and the other 5 with CSDD (even subindexes). In both search scenarios, we used COSME with the appropriate configurations for both settings.
Figure 1: Screenshot of NetBeans IDE with COSME plug-in installed

The search scenario was a common task proposed to a group of developers without Java background: select the most relevant classes to manage GUI (Graphical User Interface) components using different Java APIs with a total of 2420 files. Specifically, Jidesoft (634), OpenSwing (434), SwingX (732) and Swing (620). We have focused on these APIs because they are directly related to the context of the experiment, although they are not complete: we have only considered their most relevant API packages for the experiment.

For evaluation purposes, we created our own test collection: a group of 10 experts proposed a set of 100 topics strongly related to the objective of the experimentation, then their corresponding queries were submitted to each of the following search engines: Lucene, Minion, Indri and Terrier. A document pool was obtained by ranking fusion and later the experts, grouped in pairs, determined the relevant documents for each topic.

In collaborative SDD, it is very important to analyze the interaction among group members; therefore, unlike the evaluation of a traditional SDD system, we cannot fix the queries. Each participating group could thus freely formulate its queries to the search engine. In order to compare team results, the search engine identified the most similar queries formulated by the members of the teams with respect to those formulated by experts. If the system found enough similarity and if they occur in all the groups, then these queries are considered to deal with the same topic and are selected for group comparison purposes. The similarity measure between queries is calculated by Equation 1. A user query ($q_u$) and an expert query ($q_e$) are considered to be the same if they are within a given similarity threshold. A new query $q_u'$ is obtained by applying the Porter stemmer algorithm to $q_u$'s terms, and analogously, we would obtain $q_e'$.

$$\mathrm{sim}(q_u, q_e) = \frac{|q_u' \cap q_e'|}{|q_u' \cup q_e'|} \qquad (1)$$

In Equation 1, $\mathrm{sim}(q_u, q_e)$ is a value between 0 and 1. For this experiment we assumed that there exists an expert's relevance judgement for $q_u$ only if there exists a similarity value of at least $\frac{N+1}{2N}$, where $N = |q_u' \cup q_e'|$, selecting the relevance judgements that correspond to the maximum similarity for each $q_e$.

In order to measure the effectiveness of the described $SS_{TSDD}$ and $SS_{CSDD}$ scenarios, we considered as evaluation measures the metrics proposed by Pickens et al. in [9], i.e. selected precision ($P_s$, the fraction of documents judged relevant by the developer that were marked relevant in the ground truth), and selected recall ($R_s$) as their dependent measures. To summarize effectiveness in a single number we use the $F1_s$ measure.

According to the documents that each team selected for each common topic, the $F1_s$ measure was computed. In order to accomplish the statistical analysis of the results, we use the non-parametric Wilcoxon test (all against all). The Monte Carlo method was used and adjusted with 99% confidence intervals and 10000 signs. Significance (Sig.) was considered as it appears in Table 1.

We could notice significant differences between TSDD and CSDD groups, considered two by two. As $F1_s$ values for CSDD groups are better than those computed from TSDD groups for those cases, we could conclude that when teams work supported by collaborative tools, they obtain better results. From Table 1, we could see that apart from $SS_5$, each $SS_{TSDD}$ has at least one $SS_{CSDD}$ with significantly different values of $F1_s$. With these results we accept $H_1$, because $A_{TSDD} < A_{CSDD}$.
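As an illustration of the query matching (Equation 1) and scoring steps described in this section, the following Python sketch may help. It is not the COSME implementation: the function names are hypothetical, `str.lower` stands in for the Porter stemmer, the threshold $(N+1)/(2N)$ is one reading of the condition stated above, and selected recall is assumed to be the fraction of ground-truth relevant documents that the team selected.

```python
from typing import Callable, Set


def stemmed_terms(query: str, stem: Callable[[str], str] = str.lower) -> Set[str]:
    """Split a query on whitespace and normalise each term.

    `stem` is a placeholder (assumption): the paper applies the Porter
    stemmer, which is not in the standard library, so lower-casing is used.
    """
    return {stem(term) for term in query.split()}


def sim(q_u: str, q_e: str, stem: Callable[[str], str] = str.lower) -> float:
    """Equation 1: size of the term intersection over size of the term union."""
    u, e = stemmed_terms(q_u, stem), stemmed_terms(q_e, stem)
    union = u | e
    return len(u & e) / len(union) if union else 0.0


def same_topic(q_u: str, q_e: str, stem: Callable[[str], str] = str.lower) -> bool:
    """Assumed threshold: sim >= (N + 1) / (2N), with N the size of the union.

    The comparison is done on integers (2 * |intersection| >= N + 1) to avoid
    floating-point rounding at the boundary.
    """
    u, e = stemmed_terms(q_u, stem), stemmed_terms(q_e, stem)
    n = len(u | e)
    return n > 0 and 2 * len(u & e) >= n + 1


def selected_f1(selected: Set[str], relevant: Set[str]) -> float:
    """F1 over selected precision and selected recall (assumed definitions)."""
    hits = len(selected & relevant)
    if hits == 0:
        return 0.0
    p_s = hits / len(selected)   # selected documents that are truly relevant
    r_s = hits / len(relevant)   # relevant documents that were selected
    return 2 * p_s * r_s / (p_s + r_s)
```

With these definitions, a user query that shares two of four distinct terms with an expert query has similarity 0.5 and falls below the assumed threshold of 5/8, so it would not be matched to that expert topic.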
F1s     SS1     SS2     SS3     SS4     SS5     SS6     SS7     SS8     SS9
SS2     0.062
SS3     0.180   0.051
SS4     0.022†  0.212   0.038†
SS5     0.272   0.069   0.152   0.054
SS6     0.045†  0.201   0.080   0.290   0.056
SS7     0.215   0.031†  0.340   0.090   0.206   0.042†
SS8     0.053   0.131   0.061   0.190   0.072   0.158   0.070
SS9     0.243   0.072   0.201   0.029†  0.344   0.068   0.238   0.042†
SS10    0.065   0.098   0.041†  0.290   0.072   0.235   0.045†  0.132   0.058
†: significant difference (0.01 ≤ Sig < 0.05)
‡: highly significant difference (Sig < 0.01)

Table 1: Wilcoxon Test Results.
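The kind of pairwise comparison reported in Table 1 can be sketched as follows. This is only an illustration under explicit assumptions, since the text does not say which Wilcoxon variant or software was used: the sketch assumes the paired signed-rank test applied to the per-topic $F1_s$ values of two teams, estimates a two-sided significance value by Monte Carlo sign flipping with 10000 samples, and uses hypothetical function names and a fixed random seed.

```python
import random
from typing import List, Sequence, Tuple


def signed_ranks(diffs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Drop zero differences and rank the rest by absolute value.

    Tied absolute values share the average of the ranks they occupy.
    """
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        average = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return nonzero, ranks


def w_plus(diffs: Sequence[float]) -> float:
    """Signed-rank statistic: sum of the ranks of the positive differences."""
    nonzero, ranks = signed_ranks(diffs)
    return sum(r for d, r in zip(nonzero, ranks) if d > 0)


def monte_carlo_p_value(x: Sequence[float], y: Sequence[float],
                        samples: int = 10000, seed: int = 0) -> float:
    """Two-sided Monte Carlo estimate of the signed-rank significance."""
    nonzero, ranks = signed_ranks([a - b for a, b in zip(x, y)])
    centre = sum(ranks) / 2  # expected W+ when signs are random
    observed = abs(sum(r for d, r in zip(nonzero, ranks) if d > 0) - centre)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(samples):
        # Under the null hypothesis each rank is positive with probability 1/2.
        simulated = sum(r for r in ranks if rng.random() < 0.5)
        if abs(simulated - centre) >= observed:
            extreme += 1
    return extreme / samples
```

Calling `monte_carlo_p_value(f1_team_a, f1_team_b)` on two equally long lists of per-topic scores returns an estimate that can be compared against the 0.05 and 0.01 levels used in the table legend.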

5. CONCLUSIONS AND FUTURE WORKS
Collaboration in SDD is just being recognized as an important research area. While in some cases collaborative SDD can be handled by conventional search engines, we need to understand how the collaborative nature of source-code retrieval affects the requirements on search algorithms. Research in this direction needs to adopt the theories and methodologies of SDD and CIR, and supplement them with new approach constructs as appropriate. In this work we present COSME as a collaborative SDD tool that helps team developers to find better sources than searching with traditional SDD strategies, as well as an experimental approach that confirms our hypotheses.

Our ongoing work focuses on the COSME back-end, which poses fundamental research challenges as well as provides new opportunities to let group members collaborate in new ways:

(i) Profile Analysis. We aim to analyze the user-generated data using various techniques from the study of different collaborative virtual environments and recommender systems. With the results, our goal is to provide better personalized search results, support the users while searching and recommend relevant, trustworthy collaborators to users.

(ii) P2P/hybrid-network Retrieval. Due to scalability and privacy issues we favor a distributed environment by means of a P2P (peer-to-peer) retrieval feature based on a hybrid architecture to store the user-generated data and collections (CASPER – CollAborative Search in PEer-to-peer netwoRks). The main challenges in this respect are to ensure a reliable and efficient data analysis.

6. ACKNOWLEDGMENTS
This work has been partially supported by the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018), the Spanish MICIN project TIN2008-06566-C04-01 and the Andalusian Consejería de Innovación, Ciencia y Empresa project TIC-04526. We also would like to thank Carmen Torres for support and discussions and all of our experiment participants.

7. REFERENCES
[1] S. Bajracharya, J. Ossher, and C. Lopes. Sourcerer: An internet-scale software repository. In SUITE '09: Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, pages 1–4, Washington, DC, USA, 2009. IEEE Computer Society.
[2] J. M. Fernández-Luna, J. F. Huete, R. Pérez-Vázquez, and J. C. Rodríguez-Cano. Cirlab: A groupware framework for collaborative information retrieval research. Information Processing and Management, 44(1):256–273, 2009.
[3] J. M. Fernández-Luna, J. F. Huete, R. Pérez-Vázquez, and J. C. Rodríguez-Cano. Improving search–driven development with collaborative information retrieval techniques. In HCIR '09: IIIrd Workshop on Human–Computer Interaction and Information Retrieval, Washington DC, USA, 2009.
[4] J. M. Fernández-Luna, J. F. Huete, R. Pérez-Vázquez, and J. C. Rodríguez-Cano. Cosme: A netbeans ide plugin as a team–centric alternative for search driven software development. In Group 2010: Ist Workshop on Collaborative Information Seeking, Florida, USA, 2010.
[5] R. Holmes. Do developers search for source code examples using multiple facts? In SUITE 2009: First International Workshop on Search-Driven Development Users, Infrastructure, Tools and Evaluation, Vancouver, Canada, 2009.
[6] W. Janjic. Lowering the barrier to reuse through test-driven search. In SUITE 2009: First International Workshop on Search-Driven Development Users, Infrastructure, Tools and Evaluation, Vancouver, Canada, 2009.
[7] M. Jiménez, M. Piattini, and A. Vizcaíno. Challenges and improvements in distributed software development: A systematic review. 2009.
[8] M. R. Morris and E. Horvitz. Searchtogether: an interface for collaborative web search. In UIST '07: Proceedings of the 20th annual ACM symposium on User interface software and technology, pages 3–12, New York, NY, USA, 2007. ACM.
[9] J. Pickens, G. Golovchinsky, C. Shah, P. Qvarfordt, and M. Back. Algorithmic mediation for collaborative exploratory search. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 315–322, New York, NY, USA, 2008. ACM.
Interactive Analysis and Exploration of Experimental Evaluation Results

Emanuele Di Buccio, University of Padua, Italy, dibuccio@dei.unipd.it
Marco Dussin, University of Padua, Italy, dussinma@dei.unipd.it
Nicola Ferro, University of Padua, Italy, ferro@dei.unipd.it
Ivano Masiero, University of Padua, Italy, masieroi@dei.unipd.it
Giuseppe Santucci, Sapienza University of Rome, Italy, santucci@dis.uniroma1.it
Giuseppe Tino, Sapienza University of Rome, Italy, tino@dis.uniroma1.it

ABSTRACT
This paper proposes a methodology based on discounted cumulated gain measures and visual analytics techniques in order to improve the analysis and understanding of IR experimental evaluation results. The proposed methodology is geared to favour a natural and effective interaction of the researchers and developers with the experimental data and it is demonstrated by developing an innovative application based on Apple iPad.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: [Search process]; H.3.4 [Systems and Software]: [Performance evaluation (efficiency and effectiveness)]

General Terms
Experimentation, Human Factors, Measurement, Performance

Keywords
Ranking, Visual Analytics, Interaction, Discounted Cumulated Gain, Experimental Evaluation, DIRECT

1. INTRODUCTION
The Information Retrieval (IR) field has a strong and long-lived tradition, dating back to the late 50s/early 60s of the last century, as far as the assessment of the performances of an IR system is concerned. In particular, in the last 20 years, large-scale evaluation campaigns, such as the Text REtrieval Conference (TREC, http://trec.nist.gov/) in the United States and the Cross-Language Evaluation Forum (CLEF, http://www.clef-campaign.org/) in Europe, have conducted cooperative evaluation efforts involving hundreds of research groups and industries, producing a huge amount of valuable data to be analysed, mined, and understood.

The aim of this work is to explore how we can improve the comprehension of and the interaction with the experimental results by IR researchers and IR system developers. We imagine the following scenarios: (i) a researcher or a developer is attending the workshop of one of the large-scale evaluation campaigns and s/he wants to explore and understand the experimental results as s/he is listening to the presentation discussing them; (ii) a team of researchers or developers is working on tuning and improving an IR system and they need tools and applications that allow them to investigate and discuss the performances of the system under examination in a handy and effective way.

These scenarios call for: (a) proper metrics that allow us to understand the behaviour of a system; (b) effective analysis and visualization techniques that allow us to get an overall idea of the main factors and critical areas which have influenced performances in order to be able to dig into details; (c) tools and applications that allow us to interact with the experimental results in a way that is both effective and natural.

To this end, we propose a methodology which allows us to quickly get an idea of the distance of an IR system with respect to both its own optimal performances and the best performances possible. We rely on the (normalized) discounted cumulated gain (n)DCG family of measures [7] because they have been shown to be especially well-suited not only to quantify system performances but also to give an idea of the overall user satisfaction with a given ranked list considering the persistence of the user in scanning the list.

The contribution of this paper is to improve on the previous work [7,11] by trying to better understand what happens when you flip documents with different relevance grades in a ranked list. This is achieved by providing a formal model that allows us to properly frame the problem and quantify the gain/loss with respect to an optimal ranking, rank by rank, according to the actual result list produced by an IR system.

The proposed model provides the basis for the development of Visual Analytics (VA) techniques that give us the possibility to get a quick and intuitive idea of what happened in a result list and what determined its perceived performances. Visual Analytics [8, 10, 14] is an emerging multi-disciplinary area that takes into account both ad-hoc and classical Data Mining (DM) algorithms and Information Visualization (IV) techniques, combining the strengths of human and electronic data processing. Visualisation becomes the medium of a semi-automated analytical process, where human beings and machines cooperate using their respective distinct capabilities for the most effective results. Decisions on which direction analysis should take in order to accomplish a certain task are left to the final user. While IV techniques have been extensively explored [4,13], combining them with automated data analysis for specific application domains is still a challenging activity [9]. Moreover, the Visual Analytics community acknowledges the relevance of interaction for visual data analysis, and that the current research activities very often focus only on visual representation, neglecting the interaction design, as clearly stated in [14]. This refers to two different typologies of interaction: 1) interaction within a visualization and, 2), closer to the paper contribution, interaction between the visual and the analytical components.

The idea of exploring and applying VA techniques to the experimental evaluation in the IR field is quite innovative since it has never been attempted before and, due to the complexity of the evaluation measures and the amount of data produced by large-scale evaluation campaigns, there is a strong need for better and more effective representation techniques. Moreover, visualizing and assessing ranked lists of items, to the best of the authors' knowledge, has not been addressed by the VA community. The few related proposals, see, e.g., [12], use rankings for presenting the user with the most relevant visualizations, or for browsing the ranked result, see, e.g., [5], but do not deal with the problem of observing the ranked item position, comparing it with an ideal solution, to assess and improve the ranking quality. A first attempt in such a direction is in [1], where the authors explored the basic issues associated with the problem, providing basic metrics and introducing a VA web based system that allows for exploring the quality of a ranking with respect to an optimal solution.

On top of the proposed model, we have built a running prototype where the experimental results and data are stored in a dedicated system accessible via standard Web services. This allows for the design and development of various client applications and tools for exploiting the managed data. In particular, in this paper, we have started to explore the possibility of adopting the Apple iPad (http://www.apple.com/ipad/) as an appropriate device to allow users to easily and naturally interact with the experimental data and we have developed an initial prototype that allows us to interactively inspect the actual experimental data in order to get insights about the behaviour of an IR system.

Overall, the proposed model, the proposed visualization techniques, and the implemented prototype meet all the (a-c) requirements for the two scenarios introduced above.

The paper is organized as follows. Section 2 introduces the model underlying the system together with its visualization techniques; Section 3 describes the interaction strategies of the system, Section 4 describes the overall architecture of the system, and Section 5 concludes the paper, pointing out ongoing research activities.

2. THE PROTOTYPE
According to [7] we model the retrieval results as a ranked vector of n documents V, i.e., V[1] contains the identifier of the document predicted by the system to be most relevant, V[n] the least relevant one. The ground truth GT function assigns to each document V[i] a value in the relevance interval {0..k}, where k represents the highest relevance score, e.g. k = 3 in [7]. The basic assumption is that the greater the position of a document the less likely it is that the user will examine it, because of the required time and effort and the information coming from the documents already examined. As a consequence, the greater the rank of a relevant document the less useful it is for the user. This is modeled through a discounting function DF that progressively reduces the relevance of a document, GT(V[i]), as i increases:

$$DF(V[i]) = \begin{cases} GT(V[i]), & \text{if } i \le x \\ GT(V[i]) / \log_x(i), & \text{if } i > x \end{cases} \qquad (1)$$

The quality of a result can be assessed using the discounted cumulative gain function $DCG(V, i) = \sum_{j=1}^{i} DF(V[j])$ that estimates the information gained by a user that examines the first i documents of V.

The DCG function allows for comparing the performances of different search engines, e.g., plotting the DCG(i) values of each engine and comparing the curve behavior.

However, if the user's task is to improve the ranking performance of a single search engine, looking at the misplaced documents (i.e., ranked too high or too low with respect to the other documents), the DCG function does not help: the same value DCG(i) could be generated by different permutations of V and it does not point out the loss in cumulative gain caused by misplaced elements. To this aim, we introduce the following definitions and novel metrics.

We denote with OptPerm(V) the set of optimal permutations of V such that $\forall OV \in OptPerm(V)$ it holds that $GT(OV[i]) \ge GT(OV[j]) \;\; \forall i, j \le n \wedge i < j$, that is, OV maximizes the values of $DCG(OV, i) \; \forall i$. In other words, OptPerm(V) represents the set of the optimal rankings for a given search result. It is worth noting that each vector in OptPerm(V) is composed of k + 1 intervals of documents sharing the same GT values. As an example, assuming a result vector composed of 12 elements and k = 3, a possible sequence of GT values of an optimal vector OV is <3,3,3,3,2,2,2,2,1,1,0,0>; according to this we define the max_index(V, r) and min_index(V, r) functions, with $0 \le r \le k$, that return the greatest and the lowest indexes of elements in a vector belonging to OptPerm(V) that share the same GT value r. As an example, considering the above 12 GT values, min_index(V, 2) = 5 and max_index(V, 2) = 8.

Using the above definitions we can define the relative position R_Pos(V[i]) function for each document in V as follows:

$$R\_Pos(V[i]) = \begin{cases} 0, & \text{if } min\_index(V, GT(V[i])) \le i \le max\_index(V, GT(V[i])) \\ min\_index(V, GT(V[i])) - i, & \text{if } i < min\_index(V, GT(V[i])) \\ max\_index(V, GT(V[i])) - i, & \text{if } i > max\_index(V, GT(V[i])) \end{cases}$$

R_Pos(V[i]) allows for pointing out misplaced elements and understanding how much they are misplaced: 0 values denote documents that are within the optimal interval, negative and positive values denote elements that are respectively below and above the optimal interval. The absolute value of R_Pos(V[i]) gives the minimum distance of a misplaced element from its optimal interval.

According to the actual relevance and rank position, the same value of R_Pos(V[i]) can produce different variations of the DCG function. We measure the contributions of misplaced elements with the function Gain(V, i) that compares, $\forall i$, the actual values of DF(V[i]) with the corresponding values in OV, DF(OV[i]): $Gain(V, i) = DF(V[i]) - DF(OV[i])$.

Figure 1: The iPad prototype interface.

3. INTERACTION
A multi-touch prototype interface based on the model presented in Section 2 has been designed for the iPad device. It has been developed and tested on iOS 4.2 (http://developer.apple.com/) with the integration of the Core Plot (http://code.google.com/p/core-plot/) plotting framework for the graphical visualization of data. The interface allows the end user to compare the curve of the ranked results, for a given experiment/topic, with the optimal one and with the ideal one. This facilitates the activities of failure analysis, easily locating misplaced elements, blue or red items, that pop up from the visualization together with the extent of their displacement and the impact they have on DCG.

Figure 1 shows a screenshot of the current interface: the main list on the left represents the top n = 200 ranked results for a given experiment/topic and it can be easily scrolled by the user. Each row corresponds to a document ID, a short snippet of the content is included in the subtitle of each cell, and more information on a specific result (i.e. relevance score, DCG, R_Pos, Gain) can be viewed by touching the row. On the right side there are two coloured vectors which show the R_Pos and Gain functions. The R_Pos vector presents the results using different color shadings: light green, light red and light blue respectively for documents that are within, below and above the optimal interval. It allows for locating misplaced documents and, thanks to the shading, understanding how far they are from the optimal position. Similarly, the Gain vector codes the function using colors: light blue refers to positive values, light red codes negative values, and green 0 values. Moreover, if the user touches a specific area of the R_Pos vector (simulated by the gray circle in Figure 1), the main results list automatically scrolls back, providing the end user with a detailed view on the corresponding documents. The rightmost part of the screen shows the DCG graphs of the ideal, the optimal and the experiment vector, i.e. the ranking curves. The navigation bar displays a back button on the right which lets the user visualize the results for a different topic.

4. ARCHITECTURE
The design of the architecture of the system benefits from what has been learned in ten years of work for CLEF and in the design and implementation of the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT), the system developed in CLEF since 2005 to manage all the aspects of an evaluation campaign [2, 3].

The approach to the architecture is the implementation of a modular design, as sketched in Figure 2, with the aim to clearly separate the logic entailed by the application into three levels of abstraction – data, application, and interface logic – able to reciprocally communicate, easily extensible and implementable using modular and reusable components. The Data Logic layer, depicted at the bottom of Figure 2, deals with the persistence of the information coming from the other layers. From the implementation point of view, data stored into databases and indexes are mapped to resources and communicate with the upper levels through the mechanism granted by the Data Access Object (DAO) pattern (http://java.sun.com/blueprints/corej2eepatterns/), see point (1) in Figure 2. The Application Logic
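The measures defined in Section 2 (DF, DCG, R_Pos and Gain) can be illustrated with the following Python sketch. It is not the prototype's code and rests on stated assumptions: a ranking is passed directly as the list of GT values in ranked order, the logarithm base x defaults to 2 because the text does not fix a value, one optimal permutation is obtained by sorting the grades in decreasing order, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def discounted_gains(gt: Sequence[int], x: int = 2) -> List[float]:
    """DF per 1-based rank i: GT if i <= x, otherwise GT / log_x(i) (x > 1)."""
    return [g if i <= x else g / math.log(i, x)
            for i, g in enumerate(gt, start=1)]


def dcg(gt: Sequence[int], x: int = 2) -> List[float]:
    """Cumulative sums of DF: element i-1 holds DCG(V, i)."""
    totals, running = [], 0.0
    for value in discounted_gains(gt, x):
        running += value
        totals.append(running)
    return totals


def optimal(gt: Sequence[int]) -> List[int]:
    """GT values of one optimal permutation (non-increasing grades)."""
    return sorted(gt, reverse=True)


def index_interval(gt: Sequence[int], r: int) -> Tuple[int, int]:
    """(min_index, max_index) of grade r in the optimal ordering, 1-based.

    Assumes at least one document in `gt` has grade r.
    """
    positions = [i for i, g in enumerate(optimal(gt), start=1) if g == r]
    return positions[0], positions[-1]


def r_pos(gt: Sequence[int]) -> List[int]:
    """Relative position: 0 inside the optimal interval, signed distance otherwise."""
    result = []
    for i, g in enumerate(gt, start=1):
        low, high = index_interval(gt, g)
        if i < low:
            result.append(low - i)    # positive: ranked above its interval
        elif i > high:
            result.append(high - i)   # negative: ranked below its interval
        else:
            result.append(0)
    return result


def gain(gt: Sequence[int], x: int = 2) -> List[float]:
    """Gain(V, i) = DF(V[i]) - DF(OV[i]) for every rank i."""
    actual = discounted_gains(gt, x)
    best = discounted_gains(optimal(gt), x)
    return [a - b for a, b in zip(actual, best)]
```

For the 12-element example of Section 2, `index_interval([3,3,3,3,2,2,2,2,1,1,0,0], 2)` gives `(5, 8)`, in line with min_index(V, 2) = 5 and max_index(V, 2) = 8; moving one grade-2 document to the first rank yields a positive R_Pos and a negative Gain at that rank.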
Figure 2: The Architecture of the Application. [Diagram with the labels: Application and Interface Logic; Access Control; RESTful Web Service; Application Logic; Resource; Logging Infrastructure; Data Logic; Resource DAO; Databases and Indexes; Application Abstraction; Application Implementation; numbered points 1–6.]

layer is in charge of the high-level tasks made by the system, such as the enrichment of raw data, the calculation of metrics and the carrying out of statistical analyses on experiments. These resources (2) are therefore accessible via HTTP through a RESTful Web service [6], sketched at point (3). After the validation of credentials and permissions made by the access control mechanism (4), it is possible for remote devices such as web browsers or custom clients (5)

Acknowledgements
The work reported in this paper has been partially supported by the PROMISE network of excellence (contract n. 258191), as a part of the 7th Framework Program of the European commission (FP7/2007-2013).

6. REFERENCES
[1] N. Ferro, A. Sabetta, G. Santucci, G. Tino, and F. Veltri. Visual comparison of ranked result cumulated gains. In Proc. of EuroVA 2011. Eurographics, 2011.
[2] M. Agosti, G. Di Nunzio, M. Dussin, and N. Ferro. 10 Years of CLEF Data in DIRECT: Where We Are and Where We Can Go. In Proc. of EVIA 2010, pages 16–24. Tokyo, Japan, 2010.
[3] M. Agosti and N. Ferro. Towards an Evaluation Infrastructure for DL Performance Evaluation. In Evaluation of Digital Libraries: An Insight to Useful Applications and Methods. Chandos Publishing, Oxford, UK, 2009.
[4] S. K. Card and J. Mackinlay. The structure of the information visualization design space. In Proc. of InfoVis '97, pages 92–99, Washington, DC, USA, 1997. IEEE Computer Society.
[5] M. Derthick, M. G. Christel, A. G. Hauptmann, and H. D. Wactlar. Constant density displays using diversity sampling. In Proc. of the IEEE Information Visualization, pages 137–144, 2003.
[6] R. T. Fielding and R. N. Taylor. Principled design of the modern web architecture. ACM TOIT, 2:115–150, 2002.
[7] K. Järvelin and J. Kekäläinen. Cumulated Gain-Based Evaluation of IR Techniques. ACM TOIS, 20(4):422–446, October 2002.
[8] D. Keim, G. Andrienko, J.-D. Fekete, C. Görg, J. Kohlhammer, and G. Melançon. Information visualization. chapter Visual Analytics: Definition, Process, and Challenges, pages 154–175.
to create, modify, or delete resources attaching their rep- Springer-Verlag, Berlin, Heidelberg, 2008.
resentation in XML7 or JSON8 format to the body of an [9] D. Keim, J. Kohlhammer, G. Santucci, F. Mansmann,
HTTP request, and to read them as response of specific F. Wanner, and M. Schäfer. Visual Analytics
queries. A logging infrastructure (6) grants the tracking of Challenges. In Proc. of eChallenges 2009, 2009.
all the activities made at each layer and can be used to ob- [10] D. A. Keim, F. Mansmann, J. Schneidewind, and
tain information about the provenance of all the managed H. Ziegler. Challenges in visual data analysis. In Proc.
resources. of IV’06, pages 9–16, 2006.
[11] H. Keskustalo, K. Järvelin, A. Pirkola, and
5. CONCLUSIONS J. Kekäläinen. Intuition-Supporting Visualization of
We have presented a model and a prototype which allow User’s Performance Based on Explicit Negative
users to easily interact with the experimental results and to Higher-Order Relevance. In Proc. of SIGIR ’08, pages
work together in a cooperative way while actually accessing 675–681. ACM Press, NY, USA, 2008.
the data. This first step uncovers new and interesting pos- [12] J. Seo and B. Shneiderman. A rank-by-feature
sibilities for the experimental evaluation and for the way in framework for interactive exploration of
which researchers and developers usually carry out such ac- multidimensional data. In Proc. of the IEEE
tivities. For example, the proposed techniques may alleviate Information Visualization, pages 65–72, 2004.
the burden of certain tasks, such as failure analysis, which [13] B. Shneiderman. The eyes have it: a task by data type
are often overlooked due to their demanding nature, thus taxonomy for information visualizations. In Proc. of
making easier and more common to perform them and, as a the 1996 IEEE Symposium on Visual Languages,
consequence, improving the overall comprehension of system pages 336 –343, 1996.
behaviour. This will be explored in the future work. [14] J. J. Thomas and K. A. Cook. A visual analytics
Patterns/DataAccessObject.html agenda. IEEE Computer Graphics and Applications,
7
http://www.w3.org/XML/ 26:10–13, 2006.
8
http://www.ietf.org/rfc/rfc4627.txt
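As an illustration of the layered architecture described in this paper (persistence hidden behind a DAO, resources handled by the application logic, access control before every operation, and a log of all activity), the following Python sketch may help. It is not the authors' implementation: the names Resource, InMemoryResourceDAO, AccessControl and ResourceService, the in-memory store, the read/write grant model and the sample payload are all assumptions made for the example, and the HTTP side of the RESTful Web service is deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Resource:
    """A managed resource (hypothetical shape: an id plus a JSON-like payload)."""
    resource_id: str
    payload: dict


class InMemoryResourceDAO:
    """Data Logic, point (1): stand-in for a DAO over databases and indexes."""

    def __init__(self) -> None:
        self._rows: Dict[str, Resource] = {}

    def get(self, resource_id: str) -> Optional[Resource]:
        return self._rows.get(resource_id)

    def save(self, resource: Resource) -> None:
        self._rows[resource.resource_id] = resource

    def delete(self, resource_id: str) -> None:
        self._rows.pop(resource_id, None)


class AccessControl:
    """Access control, point (4): a plain set of (user, action) grants."""

    def __init__(self, grants: Set[Tuple[str, str]]) -> None:
        self._grants = grants

    def check(self, user: str, action: str) -> None:
        if (user, action) not in self._grants:
            raise PermissionError(f"{user} may not {action}")


class ResourceService:
    """Application Logic: what a RESTful endpoint, point (3), could delegate to."""

    def __init__(self, dao: InMemoryResourceDAO, access: AccessControl,
                 log: List[str]) -> None:
        self._dao, self._access, self._log = dao, access, log

    def put(self, user: str, resource_id: str, body: str) -> None:
        """Create or modify a resource from a JSON request body."""
        self._access.check(user, "write")
        self._dao.save(Resource(resource_id, json.loads(body)))
        self._log.append(f"{user} write {resource_id}")  # logging, point (6)

    def get(self, user: str, resource_id: str) -> Optional[str]:
        """Return the JSON representation, or None if the resource is unknown."""
        self._access.check(user, "read")
        self._log.append(f"{user} read {resource_id}")
        found = self._dao.get(resource_id)
        return None if found is None else json.dumps(found.payload, sort_keys=True)

    def delete(self, user: str, resource_id: str) -> None:
        """Delete a resource; requires the same grant as writing."""
        self._access.check(user, "write")
        self._dao.delete(resource_id)
        self._log.append(f"{user} delete {resource_id}")


log: List[str] = []
service = ResourceService(
    InMemoryResourceDAO(),
    AccessControl({("alice", "read"), ("alice", "write"), ("bob", "read")}),
    log,
)
service.put("alice", "run-42", '{"topic": "T1", "dcg": 3.5}')
print(service.get("bob", "run-42"))
```

Keeping the DAO behind a narrow get/save/delete surface is what would let the in-memory store be swapped for a database-backed one without touching the service, which is the separation of layers the paper argues for.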
A Taxonomy of Enterprise Search
Tony Russell-Rose, UXLabs Ltd., London, UK, +44 7779 936191, tgr@uxlabs.co.uk
Joe Lamantia, Endeca, 101 Main St., Cambridge, USA, +1 617 674 6000, jlamantia@endeca.com
Mark Burrell, Endeca, 101 Main St., Cambridge, USA, +1 617 674 6000, mburrell@endeca.com

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

ABSTRACT
Classic IR (information retrieval) is predicated on the notion of users searching for information in order to satisfy a particular “information need”. However, it is now accepted that much of what we recognize as search behaviour is often not informational per se. For example, Broder (2002) has shown that the need underlying a given web search could in fact be navigational (e.g. to find a particular site or known item) or transactional (e.g. to find a site through which the user can transact, e.g. through online shopping, social media, etc.). Similarly, Rose & Levinson (2004) have identified consumption of online resources as a further category of search behaviour and query intent.

In this paper, we extend this work to the enterprise context, examining the needs and behaviours of individuals across a range of search and discovery scenarios within various types of enterprise. We present an initial taxonomy of “discovery modes”, and discuss some initial implications for the design of more effective search and discovery platforms and tools.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process; H.3.5 [Online Information Services]: Web-based services

General Terms
Human Factors.

Keywords
Enterprise search, information seeking, user behaviour, knowledge workers, search modes, information discovery, user experience design.

1. INTRODUCTION
To design better search and discovery experiences we must understand the complexities of the human-information seeking process. Numerous theoretical frameworks have been proposed to characterize this complex process, notably the standard model (Sutcliffe & Ennis 1998), the cognitive model (Norman 1988) and the dynamic model (Bates, 1989). In addition, others have investigated search as a strategic process, examining the various problem solving strategies and tactics that information seekers employ over extended periods of time (e.g. Kuhlthau, 1991).

In this paper, we examine the needs and behaviours of varied individuals across a range of search and discovery scenarios within various types of enterprise. These are based on an analysis of the scenarios derived from numerous engagements involving the development of search and business intelligence solutions utilizing the Endeca Latitude software platform. In so doing, we extend the classic IR concept of information-seeking to a broader notion of discovery-oriented problem solving, accommodating the much wider range of behaviours required to fulfil the typical goals and objectives of enterprise knowledge workers.

Our approach to enterprise discovery is an activity-centred model inspired by Don Norman’s Activity Centred Design, which “organizes according to usage” whereas “...traditional human centred design organizes according to topic, in isolation, outside the context of real, everyday use.” (Norman 2006). This approach is an extension of previous activity-centred modelling efforts which focused on “captur[ing] a systematic and holistic view of what users need to accomplish when undertaking information retrieval tasks more complex than searching” (Lamantia 2006), employing Grounded Theory to provide methodological structure (Glaser & Strauss 1967).

In this context, we present an alternative model focused on information discovery rather than information seeking per se, which has at its core an initial taxonomy of the “modes of discovery” that knowledge workers employ to satisfy their information search and discovery goals. We then discuss some initial implications of this model for the design of more effective search and discovery platforms and tools.

2. INFORMATION RETRIEVAL MODELS
The classic model of IR assumes an interaction cycle consisting of four main activities: the identification of an information need, the specification of an appropriate query, the examination of retrieval results, and reformulation (where necessary) of the original query. This cycle is then repeated until a suitable result set is found (Salton 1989).

In both the above models, the user’s information need is assumed to be static. However, it is now acknowledged that information seekers’ needs often change as they interact with a search system. In recognition of this, alternative models of information seeking have been proposed. For example, Bates (1989) proposed the dynamic “berry-picking” model of information seeking, in which the information need (and consequently the query) changes throughout the search process. This model also recognises that information needs are not satisfied by a single, final result set, but
by the aggregation of results, insights and interactions along the way.

Bates’ work is particularly interesting as it explores the connections between the dynamic model and the search strategies and tactics that professional information-seekers employ. In particular, Bates identifies a set of 29 individual tactics, organised into four broad categories (Bates, 1979). Likewise, O’Day & Jeffries (1993) examined the use of information search results by clients of professional information intermediaries and identified three distinct “search modes” or major categories of search behaviour: (1) Monitoring a known topic or set of variables over time; (2) Following a specific plan for information gathering; (3) Exploring a topic in an undirected fashion.

O’Day and Jeffries also observed that a given search would often evolve over time into a series of interconnected searches, delimited by certain triggers and stop conditions that indicate the transitions between modes or individual searches executed as part of an overall enquiry or scenario. Moreover, O’Day & Jeffries also attempted to characterise the analysis techniques employed by the clients in interpreting the search results, identifying the following six primary categories: (1) Looking for trends or correlations; (2) Making comparisons; (3) Experimenting with different aggregations/scaling; (4) Identifying critical subsets; (5) Making assessments; (6) Interpreting data to find meaning.

More recent investigations into the relationship between information needs and search activities include that of Marchionini (2006), who identifies three major categories of search activity, namely “Lookup”, “Learn” and “Investigate”.

3. A TAXONOMY OF ENTERPRISE SEARCH AND DISCOVERY
The primary source of data in this study is a set of user scenarios captured during numerous engagements involving the development of search and business intelligence solutions utilizing the Endeca Latitude software platform. These scenarios take the form of a simple narrative that illustrates the user’s end goal and the primary task or action they take to complete it, followed by a brief description of their job function or role, for example:

- “I need to understand a portfolio’s exposures to assess portfolio-level investment mix” (Portfolio Manager)
- “I need to understand the quality performance of a part and module set in manufacturing and the field so that I can determine if I should replace that part” (Engineering)

These scenarios were manually analyzed to identify themes or modes that appeared consistently throughout the set. For example, in each of the scenarios above there is an articulation of the need to develop an understanding or comprehension of some aspect of the data, implying that “comprehending” may constitute one such discovery mode. Inevitably, this analysis process was somewhat iterative and subjective, echoing the observations made by Bates (1979) in the identification of her search tactics: “While our goal over the long term may be a parsimonious few, highly effective tactics, our goal in the short term should be to uncover as many as we can, as being of potential assistance. Then we can test the tactics and select the good ones. If we go for closure too soon, i.e., seek that parsimonious few prematurely, then we may miss some valuable tactics.”

There are however some guiding principles that we can apply to facilitate convergence on a stable set. For example, an ideal set of modes would exhibit properties such as: Consistency (they represent approximately the same level of abstraction); Orthogonality (they operate independently of each other); and Comprehensiveness (they address the full range of discovery scenarios).

The initial set of discovery modes to emerge from this analysis consists of a set of nine, arranged into three top-level categories consistent with those of Marchionini (2006). The nine modes are as follows, each shown with a brief definition:

1. Lookup
1a. Locating: To find a specific (possibly known) item;
1b. Verifying: To confirm or substantiate that an item or set of items meets some specific criterion;
1c. Monitoring: To maintain awareness of the status of an item or data set for purposes of management or control.

2. Learn
2a. Comparing: To examine two or more items to identify similarities & differences;
2b. Comprehending: To generate insight by understanding the nature or meaning of an item or data set;
2c. Exploring: To proactively investigate or examine an item or data set for the purpose of serendipitous knowledge discovery.

3. Investigate
3a. Analyzing: To critically examine the detail of an item or data set to identify patterns & relationships;
3b. Evaluating: To use judgment to determine the significance or value of an item or data set with respect to a specific benchmark or model;
3c. Synthesizing: To generate or communicate insight by integrating diverse inputs to create a novel artefact or composite view.

Evidently, the output of this process has been optimized for the current data set and in that respect represents an initial interpretation that will need to evolve further. For example, “monitoring” may appear to be a lookup activity when considered in the context of a simple alert message, but when viewed as a strategic activity performed by an executive in the context of an organisational dashboard, a much greater degree of interaction and complexity is implied. Conversely, “exploring” is a concept whose level of abstraction may prove somewhat higher than the others, thus breaking the consistency principle suggested above.

However, the true value of the modes will be realised not by their conceptual purity or elegance but by their utility as a design resource. In this respect, they should be judged by the extent to which they facilitate the design process in capturing important characteristics common to enterprise search and discovery experiences, whilst flexibly accommodating arbitrary variations in domain, information resources, etc.

4. MODE SEQUENCES AND PATTERNS
A further interesting observation arising from the above analysis is that the mapping between scenarios and modes is not one-to-one. Instead, some scenarios are seen to involve a number of modes, sometimes with a primary or dominant mode, and often with an implied linear sequence. Moreover, certain sequences of modes tend to re-occur more frequently than others, forming specific “mode chains” or patterns, analogous to higher-level syntactic units. These patterns provide a framework for
understanding the transitions between modes (echoing the triggers identified by O’Day & Jeffries), and allude to the existence of natural seams that can be used to provide further insight into enterprise search and discovery behaviour.

These mode chains echo the above-mentioned efforts to create goal-based information retrieval models, which yielded modes and a set of broadly applicable “information retrieval patterns that describe the ways users combine and switch modes to meet goals: Each pattern is assembled from combinations of the same four [elemental] modes” (Lamantia 2006).

Figure 1. Discovery mode network

The five most frequent mode patterns are listed below. These have been assigned descriptive (if somewhat informal) labels to aid their characterisation, along with the sequence of modes they represent and an associated example scenario:

1. Comparison-driven optimization (Analyze-Compare-Evaluate) e.g. “Replace a problematic part with an equivalent or better part without compromising quality and cost”
2. Exploration-driven optimization (Explore-Analyze-Evaluate) e.g. “Identify opportunities to optimize use of tooling capacity for my commodity/parts”
3. Strategic Insight (Analyze-Comprehend-Evaluate) e.g. “Understand a lead's underlying positions so that I can assess the quality of the investment opportunity”
4. Strategic Oversight (Monitor-Analyze-Evaluate) e.g. “Monitor & assess commodity status against strategy/plan/target”
5. Comparison-driven Synthesis (Analyze-Compare-Synthesize) e.g. “Analyze and understand consumer-customer-market trends to inform brand strategy & communications plan”

Further insight may be derived by examining how the mode patterns combine across all the scenarios to form a “mode network”, as shown in Figure 1. Evidently, some modes act as “terminal” nodes, i.e. entry points or exit points to a discovery scenario. For example, Monitor and Explore feature only as entry points at the initiation of a scenario, whilst Synthesize and Evaluate feature only as exit points to a scenario.

5. DESIGN PRINCIPLES FOR SEARCH AND DISCOVERY SOLUTIONS
The modes establish a ‘taskonomy’ or collection of defined discovery activities which are structurally consistent, domain and scale independent, orthogonal, semantically distinct, conceptually connected, and flexibly sequenceable. Such a profile -- analogous to notes in the musical scale, or the words and phrases we assemble into sentences -- should allow the modes to serve as a language for the design of variable scale activity-centered discovery solutions through common constructive mechanisms such as concatenation, combination and nesting. And if the modes do act as an elementary grammar for discovery, then sustained use as a functional and interaction design language should result in the creation of larger and more complex units of meaning which offer cumulative value.

Professional experience with employing the modes as both an analytical framework for understanding discovery needs and as a design grammar for the definition of discovery solutions suggests that both implications are valid. Further, our observations of using the modes suggest the existence of recognizable patterns in the design of discovery solutions. We will briefly discuss some of the patterns observed, doing so at three common levels of solution scale: on the level of a single functional or interface element, for whole screens or interfaces composed of multiple functional elements, and for applications comprising multiple screens.

5.1 Single element patterns
5.1.1 Comparison Views
One of the most common design patterns is to support the need for the Compare mode by creating A/B type comparison views that present two display panes - each containing data display charts or tables; or single items or groups of items - side by side to emphasize similarities and differences.

5.1.2 Contextual Views
Another common design pattern supports the Analysis mode by allowing a fore-grounded view of a single chart, table, item, or list, accompanied by its contextual ‘halo’ - the full body of information available about the element such as status, origin, format, relationships to other elements, annotations, etc.

5.2 Whole screen patterns
5.2.1 Dashboard
One of the most common screen-level design patterns is to support the Monitoring and Synthesis modes by presenting a collection of metrics which in aggregate provide the status of independent processes, groups, or progress versus goals in a ‘dashboard’ style screen.

5.2.2 Visual Discovery Screen: 4-Dimensions
A second common screen-level design pattern for discovery experiences is the visual discovery screen, which supports modes such as Exploration, Evaluation, and Verification by layering views that present visualizations of several dimensions of a single axis of focus such as a core process, organizational unit, or KPI. When switching between layered views, the axis in focus remains the same, but the data and presentation in the dimensions adjust to match the preferred discovery mode.

5.3 Application-level patterns
5.3.1 Differentiated Application
The ‘Differentiated Application’ pattern assembles a collection of individual screens whose distinct compositions and designs support individual discovery modes of Analysis, Comparison, Evaluation and Monitoring in aggregate to address the ‘Strategic Oversight’ mode sequence. Application-level patterns often
address a spectrum of discovery needs for a group of users with differing organizational responsibilities, such as management vs. detailed analysis.

6. DISCUSSION
The above analysis is predicated on the notion that the user scenarios provide a unique insight into the information needs of enterprise knowledge workers. However, a number of caveats apply to both the data and the approach.

Firstly, the scenarios were originally generated to support the development of a specific implementation rather than for the analysis above. Therefore, the principles governing their creation may not faithfully reflect the true distribution or priority of information needs among the various end user populations. Secondly, the particular sample we selected for this study was based on a number of pragmatic factors (including availability), which may not faithfully represent the true distribution or priority among enterprise organizations. Thirdly, the data will inevitably contain some degree of subjectivity, particularly in cases where scenarios were generated by proxy rather than with direct end-user contact. Fourthly, the data will inevitably contain some degree of inconsistency in cases where scenarios were documented by different individuals.

We should also acknowledge a number of caveats concerning the process itself. In inductive work with foundations in qualitatively centered frameworks such as Grounded Theory, it is expected that a number of iterations of a “propose-classify-refine” cycle will be required for the process to converge on a stable output (e.g. Rose & Levinson, 2004). In addition, those iterations should involve a variety of critical viewpoints, with the output tested and refined using a separate, independent sample on each iteration. Likewise, the process by which scenarios are classified would benefit from further rigour: this is a critical part of the process and of course relies on human judgement and inference, but that judgement needs to go beyond simple word matching and be consistently applied to each scenario so that subtle distinctions in meaning and intent can be accurately identified and recorded.

That said, some interesting comparisons can already be made with the existing frameworks. For example, the first and third of the search modes suggested by O’Day and Jeffries have also been identified as distinct discovery modes in our own study, and the second (arguably) maps on to one or more of the mode chains identified above. Likewise, the search results analysis techniques that O’Day & Jeffries identified also present some interesting parallels.

7. CONCLUSIONS AND FUTURE DIRECTIONS
To design better search and discovery experiences we must understand the complexities of the human-information seeking process. In this paper, we have examined the needs and behaviours of varied individuals across a range of search and discovery scenarios within various types of enterprise. In so doing, we have extended the classic IR concept of information-seeking to a broader notion of discovery-oriented problem solving, accommodating the much wider range of behaviours required to fulfil the typical goals and objectives of enterprise knowledge workers.

In addition, we have proposed an alternative model focused on information discovery rather than information seeking which has at its core a taxonomy of “modes of discovery” that knowledge workers employ to satisfy their information search and discovery goals. We have also examined some of the initial implications of this model for the design of more effective search and discovery platforms and tools.

Suggestions for future work include further iterations on the “propose-classify-refine” cycle using independent data. This data should ideally be acquired based on a principled sampling strategy that attempts where possible to address any biases introduced in the creation of the original scenarios. In addition, this process should be complemented by empirical research and observation of knowledge workers in context to validate and refine the discovery modes and triggers that give rise to the observed patterns of usage.

8. REFERENCES
[1] Bates, Marcia J. 1979. "Information Search Tactics." Journal of the American Society for Information Science 30: 205-214.
[2] Bates, Marcia J. 1989. "The Design of Browsing and Berrypicking Techniques for the Online Search Interface." Online Review 13: 407-424.
[3] Broder, A. 2002. A taxonomy of web search. ACM SIGIR Forum, v.36 n.2, Fall 2002.
[4] Kuhlthau, C. C. 1991. Inside the information search process: Information seeking from the user's perspective. Journal of the American Society for Information Science, 42, 361-371.
[5] Lamantia, J. 2006. “10 Information Retrieval Patterns”. JoeLamantia.com, http://www.joelamantia.com/information-architecture/10-information-retrieval-patterns
[6] Glaser, B. & Strauss, A. 1967. The Discovery of Grounded Theory: Strategies for Qualitative Research. New York: Aldine de Gruyter.
[7] Marchionini, G. 2006. Exploratory search: from finding to understanding. Commun. ACM 49(4): 41-46.
[8] Norman, Donald A. 1988. The psychology of everyday things. New York, NY, US: Basic Books.
[9] Norman, Donald A. 2006. Logic versus usage: the case for activity centered design. Interactions 13, 6.
[10] O'Day, V. and Jeffries, R. 1993. Orienteering in an information landscape: how information seekers get from here to there. INTERCHI 1993: 438-445.
[11] Rose, D. and Levinson, D. 2004. Understanding user goals in web search. Proceedings of the 13th international conference on World Wide Web, New York, NY, USA.
[12] Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[13] Sutcliffe, A.G. and Ennis, M. 1998. Towards a cognitive theory of information retrieval. Interacting with Computers, 10: 321–351.
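To make the taxonomy and the mode chains of this paper concrete, the following Python sketch encodes the nine modes under their three categories and derives a small transition network from the five most frequent chains. It is an illustration under stated assumptions, not the authors' analysis: it uses only the five chains listed in Section 4 (the mode network in Figure 1 is built from all scenarios), it spells the modes with the verb forms used in the chain labels, and the names TAXONOMY, MODE_CHAINS, build_network and terminal_nodes are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# The nine discovery modes under the three top-level categories (Section 3).
TAXONOMY: Dict[str, List[str]] = {
    "Lookup": ["Locate", "Verify", "Monitor"],
    "Learn": ["Compare", "Comprehend", "Explore"],
    "Investigate": ["Analyze", "Evaluate", "Synthesize"],
}

# The five most frequent mode chains listed in Section 4.
MODE_CHAINS: Dict[str, List[str]] = {
    "Comparison-driven optimization": ["Analyze", "Compare", "Evaluate"],
    "Exploration-driven optimization": ["Explore", "Analyze", "Evaluate"],
    "Strategic Insight": ["Analyze", "Comprehend", "Evaluate"],
    "Strategic Oversight": ["Monitor", "Analyze", "Evaluate"],
    "Comparison-driven Synthesis": ["Analyze", "Compare", "Synthesize"],
}


def build_network(chains: Dict[str, List[str]]) -> Counter:
    """Count directed transitions between consecutive modes in each chain."""
    edges: Counter = Counter()
    for chain in chains.values():
        for source, target in zip(chain, chain[1:]):
            edges[(source, target)] += 1
    return edges


def terminal_nodes(edges: Counter) -> Tuple[Set[str], Set[str]]:
    """Return (entry-only, exit-only) modes.

    Entry-only modes have no incoming transition; exit-only modes have
    no outgoing transition.
    """
    sources = {source for source, _ in edges}
    targets = {target for _, target in edges}
    return sources - targets, targets - sources


network = build_network(MODE_CHAINS)
entry_only, exit_only = terminal_nodes(network)
print("entry-only:", sorted(entry_only))
print("exit-only:", sorted(exit_only))
```

For these five chains the sketch reports Explore and Monitor as entry-only and Evaluate and Synthesize as exit-only, which agrees with the observation made about Figure 1. The edge counts (for example, Analyze to Compare occurs in two chains) give a rough sense of which transitions a design would need to support most often.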
Back to MARS: The unexplored possibilities in query
result visualization

Alfredo Ferreira, INESC-ID/IST/TU Lisbon, Lisbon, Portugal, alfredo.ferreira@ist.utl.pt
Pedro B. Pascoal, INESC-ID/IST/TU Lisbon, Lisbon, Portugal, pmbp@ist.utl.pt
Manuel J. Fonseca, INESC-ID/IST/TU Lisbon, Lisboa, Portugal, mjf@inesc-id.pt

Copyright © 2011 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

ABSTRACT
A decade ago, Nakazato proposed 3D MARS, an immersive virtual reality environment for content-based image retrieval. Even so, the idea of taking advantage of post-WIMP interfaces for multimedia retrieval was not further explored for content-based retrieval. Considering the latest low-cost, off-the-shelf hardware for visualization and interaction, we believe that it is time to explore immersive virtual environments for multimedia retrieval. In this paper we highlight the advantages of such an approach, identifying possibilities and challenges. Focusing on a specific field, we introduce a preliminary immersive virtual reality prototype for 3D object retrieval. However, the concepts behind this prototype can be easily extended to other media.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.2 [Information Interfaces and Presentation]: User Interfaces - Interaction Styles, Input Devices and Strategies

Keywords
Multimedia Information Retrieval, 3D Object Retrieval, Immersive Virtual Environment

1. INTRODUCTION
Despite advances in multimedia information retrieval (MIR), this field is still in its infancy, especially when compared to its textual counterpart. Current textual search engines are maturely developed and their widespread use makes them familiar to most users. The current scenario in MIR is quite different. Indeed, existing content-based MIR solutions are far from being widely used by the common user.

A few exceptional systems were able to thrive with relative success, such as Retrievr¹, a search tool for Flickr² based on visual queries. However, most existing solutions still face major drawbacks and challenges to be tackled. Among others, extensively identified in Datta’s survey [5], we highlight two. First, queries rely mostly on meta-information, often keyword-based. This means that, on closer analysis, searches can be reduced to text information retrieval of multimedia objects. Second, the result visualization follows the traditional paradigm, where the results are presented as a list of items on a screen. These items are usually thumbnails, but can be just filenames or metadata. Such a methodology greatly hinders the interpretation of query results on collections of videos or 3D objects.

Notably, a decade ago, a new visualization system for content-based image retrieval (CBIR) was proposed by Nakazato and Huang from the University of Illinois. 3D MARS [11] was an immersive virtual reality (VR) environment to perform image retrieval. It worked on the NCSA CAVE [4], which provided a fully immersive experience, and later on desktop VR systems. However, despite this ground-breaking work and recent developments in the interaction domain, little advantage has been taken of immersive virtual environments by the multimedia information retrieval community.

In this paper we bring up the work of Nakazato and Huang as a starting point for the exploration of new possibilities for result visualization in multimedia information retrieval. With the spreading of stereoscopic viewing and latest-generation interaction devices outside the lab environment and into our everyday lives, we believe that in a short time users will expect richer results from multimedia search engines than just a list of thumbnails. Following this rationale, and although it could be applied to any type of media, we will focus our approach on 3D object retrieval (3DOR).

2. TRADITIONAL 3DOR APPROACHES
The first and most noticeable 3D search engine, at least among researchers working in this area, is the Princeton 3D Model Search Engine [8]. This remarkable work provides content-based retrieval of 3D models from a collection of more than 36000 objects. Four query specification options are available: text based; by example; by 2D sketch; and by 3D sketch. The results of these queries are presented as an array of model thumbnails.

In addition to queries by example and sketch-based queries, the FOX-MIIRE search engine [1] introduced the query by photo. This was the first tool capable of retrieving a 3D model from a photograph of a similar object. However, and similarly to the Princeton engine, the results are displayed as a thumbnail list.

1 http://labs.systemone.at/retrievr/
2 http://www.flickr.com/

Outside the research field, Google 3D Warehouse³ offers a text-based search engine for the
common user. This online repository contains a very large number of different models, from
monuments to cars and furniture, humans and spaceships. However, searching for models in this
collection is limited to textual queries or, when models represent real objects, to their
georeference. On the other hand, the results are displayed as model images in a list, with the
opportunity to manipulate a 3D view of a selected model.

³ http://sketchup.google.com/3dwarehouse/

Generally, query specification and result visualization in commercial tools for 3D object
retrieval, usually associated with online 3D model stores, do not differ much from those
presented above. The query is specified through keywords or by example, and results are
presented as a list of model thumbnails.

These traditional approaches to query specification and result visualization take advantage of
the latest advances in neither computer graphics nor interaction paradigms. Current hardware and
software are capable of handling millions of triangles per frame and generating complex effects
in real time. Additionally, the increasingly common use of new human-computer interaction (HCI)
paradigms and devices has brought new possibilities for multi-modal systems.

3. NEW PARADIGMS IN HCI
The recent dissemination among common users of new HCI paradigms and devices (e.g. Nintendo
Wiimote⁴ or Microsoft Kinect⁵) has brought new possibilities for multi-modal systems. For
decades, the “windows, icons, menus, pointing device” (WIMP) interaction style prevailed outside
the research field, while post-WIMP interfaces were being devised and explored [16], but without
major impact on the everyday use of computer systems.

⁴ http://www.nintendo.com/wii/console/controllers
⁵ http://www.xbox.com/en-US/kinect

In particular, the use of gestures to interact with a system has been part of the interface
scene since the very early days. A pioneering multimodal application was “Put-that-there” [2],
by Bolt. In “Put-that-there”, the user commands simple shapes on a large-screen graphics display
surface. This approach combined gestures and voice commands to interact with the system.
However, only recently has such an interaction paradigm been introduced in off-the-shelf
commodity products.

Recent technological advances have allowed the development of low-cost, lightweight,
easy-to-use systems. With limited resources, novel and more natural HCI can be developed and
explored. For instance, Lee [10] used a Wiimote and took advantage of its high-resolution
infra-red camera to implement a multipoint interactive whiteboard, finger tracking and head
tracking for desktop virtual reality displays. Post-WIMP has finally reached the masses.

Generally, post-WIMP approaches abandoned the traditional mouse and keyboard combination,
favouring devices with six degrees of freedom (DoF). Unlike the traditional WIMP interaction
style, where it is necessary to map the inputs from a 2D interaction space to a 3D visualization
space, six-DoF devices allow a straightforward direct mapping between device movements and
rotations and the corresponding effects in three-dimensional space. This represents a huge leap
for the concept of direct manipulation, which, according to Shneiderman [14], offers rapid
incremental operations and allows immediate visualization of their effects on the manipulated
object. This helps make the interaction more comprehensible, predictable and controllable.

By combining six-DoF devices with stereoscopy, it is possible to create a multi-modal immersive
interaction with direct and natural manipulation of object shapes within virtual environments.
This may be experienced using immersive displays (e.g., HMDs, CAVEs) [7] or a desktop [15].

Despite the growing interest in the application of these new paradigms in HCI, no relevant
efforts have been made to explore the latest technological advances for multimedia information
retrieval. Indeed, to the best of our knowledge, no research or new solution that takes
advantage of immersive virtual environments for information retrieval has been presented since
Nakazato’s 3D MARS [11].

4. 3D MARS
The 3D MARS system demonstrates that the use of 3D visualization in multimedia retrieval has
two benefits. First, more content can be displayed at the same time without items occluding one
another. Second, by assigning a different meaning to each axis, the user can determine which
features are important as well as examine the query result with respect to three different
criteria at the same time.

Figure 1: The interface of 3D MARS.

Nakazato focused his work on query result visualization. Thus 3D MARS supports only a
query-by-example mechanism to specify the search. The user selects one image from a list and the
system retrieves and displays the most similar images from the image database in a 3D virtual
space. The location of an image in this space is determined by its distance to the query image,
where more similar images are closer to the origin of the space. The distance along each
coordinate axis depends on a pre-defined set of features. The X-axis, Y-axis and Z-axis
represent color, texture and structure of images respectively.
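
The axis mapping described above can be illustrated with a minimal sketch. It is not code from
3D MARS: the toy feature vectors, the use of a Euclidean distance for every feature, and the
names `AXES`, `place_results` and `distance_to_origin` are assumptions made purely for
illustration. Each result receives one coordinate per feature, equal to its distance from the
query on that feature, so the query sits at the origin and similar results gather around it.

```python
import math

# Hypothetical feature names, one per axis (X, Y, Z), following the text.
AXES = ("color", "texture", "structure")


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def place_results(query, candidates):
    """Map each candidate to (x, y, z): its per-feature distance to the query."""
    positions = {}
    for name, features in candidates.items():
        positions[name] = tuple(
            euclidean(query[axis], features[axis]) for axis in AXES
        )
    return positions


def distance_to_origin(position):
    """Overall dissimilarity of a placed result (length of its position vector)."""
    return math.sqrt(sum(c ** 2 for c in position))


# Toy data (assumed): each feature is a small numeric vector.
query = {"color": [0.2, 0.8], "texture": [0.5], "structure": [1.0, 0.0]}
candidates = {
    "near": {"color": [0.2, 0.8], "texture": [0.5], "structure": [1.0, 0.0]},
    "far": {"color": [0.8, 0.0], "texture": [0.1], "structure": [0.0, 1.0]},
}
positions = place_results(query, candidates)
```

In this sketch a result identical to the query lands exactly on the origin, while a result that
differs mainly in color is displaced mainly along the X-axis, which is what lets the user read
off which feature drives the dissimilarity.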

The interaction with the query results is done through a wand that the user holds while freely
walking around the CAVE, as depicted in Figure 1. By wearing shutter glasses, the user can see a
stereoscopic view of the world, which provides a fully immersive experience. In such a solution,
visualizing query results goes far beyond scrolling through a list of thumbnails. The user
navigates among the results in a three-dimensional space.

3D MARS was the catalyst for the idea proposed in this paper: exploring immersive visualization
systems for multimedia information retrieval. Following that idea, we devised an immersive 3D
virtual reality system for displaying the results of queries for 3D object retrieval.

5. IMMERSIVE 3DOR
Taking advantage of the new paradigms in HCI, we propose an immersive VR system for 3D object
retrieval (Im-O-Ret). The version of the system presented in this paper relies on a large-screen
display, the LEMe Wall [6], and a six-DoF interaction device, the SpacePoint Fusion, an
off-the-shelf device developed by PNI Sensor Corporation.

However, minimal effort is required to have the system working with HMD glasses or stereoscopic
glasses, as well as with other input devices, such as the Wiimote or Kinect.

Regardless of the hardware details, Im-O-Ret allows the user to browse the results of a query to
a collection of 3D objects in an immersive virtual environment. The objects are distributed in
the virtual 3D space according to their similarity. This is measured by the distance of each
result to the query, which sits at the origin of the coordinate system. A different shape
matching algorithm is assigned to each of the three axes. The similarity to the query returned
by the corresponding algorithm determines the coordinate. The current version of Im-O-Ret uses
the Lightfield Descriptors [3] on the X-axis, the Coord and Angle Histogram [13] on the Y-axis,
and the Spherical Harmonics Descriptor [9] on the Z-axis. Figure 2 illustrates a user browsing
the results of a query.

Figure 2: User exploring query results in Im-O-Ret

5.1 Possibilities
Similar to 3D MARS, this work opens a myriad of new possibilities. By assigning different shape
matching algorithms to each axis, one can adapt the query mechanism to specific domains,
producing more precise results. By applying transparency to results, it is possible to overlay
the results of distinct queries. Effects such as glow or special colors can be added to results
in order to convey additional information.

Since query results are not images or thumbnails, but three-dimensional models, it is possible
to navigate around them in the virtual environment and even manipulate them. Moreover, instead
of a static view of the result, displaying it as a 3D object that can rotate about one axis
offers a better perception of the model. Adding stereoscopy will improve the visualization even
more, since the user gains depth perception of the environment.

The combined use of virtual environments (VE) and six-DoF devices provides a more complete
visualization and makes interaction more natural, comprehensible and predictable. Their use will
also add some challenges to the implementation of such a system.

5.2 Challenges
While in traditional 3DOR systems the query results are represented as a list of thumbnails
ordered by a given similarity measure, when we move to a virtual environment, the distribution
of results in a 3D space becomes a challenge. How query results should be arranged in 3D space
to be meaningful to the user remains an open question. In our approach we selected three shape
descriptors and assigned each one to a coordinate axis, but this is a preliminary approach. We
believe that a final solution is more complex than this. Further investigation on this topic is
clearly required.

On the other hand, the way users navigate in an immersive environment and interact with its
objects is still an open issue. Norman [12] stated that gesturing is a natural, automatic
behaviour, but the unintended interpretation of gestures can create undesirable states. With
this in mind, it is important to aim for an interface that is both predictable and easy to
learn.

Above all, an important challenge remains open. No easy query specification mechanism has been
presented, either in traditional search engines or with new HCI paradigms. Although sketch-based
queries apparently provide good results, they depend greatly on the ability of the user to draw
a 3D model, which hinders the goal of a widely used, content-based 3D search engine.

6. CONCLUSIONS
We believe that recent advances in low-cost, post-WIMP enabling technology can be seen as an
opportunity to overcome some drawbacks of current multimedia information retrieval solutions.
Combined with the dissemination of stereoscopic visualization as a commodity, these interaction
paradigms will acquaint common users with immersive virtual reality environments.
In this paper we highlight that such a scenario is fertile ground to be explored by search
engines for multimedia information retrieval. In that context, we identified two major research
topics: query result visualization and query specification. While the latter requires further
study, we have already started tackling the former.

We developed a novel visualization approach for 3D object retrieval. Im-O-Ret offers users an
immersive virtual environment for browsing the results of a query to a collection of 3D objects.
The query results are displayed as 3D models in a 3D space, instead of the traditional list of
thumbnails. The user can explore the results, navigating in that space and directly manipulating
the objects.

Looking back at 3D MARS, the initial work proposed by Nakazato, we realize it was a valid idea
that almost fell into oblivion. We expect that our preliminary work, which builds on concepts
introduced by 3D MARS, can demonstrate the value of our call to explore the possibilities
offered by immersive virtual environments for multimedia information retrieval.

7. ACKNOWLEDGMENTS
The work described in this paper was partially supported by the Portuguese Foundation for
Science and Technology (FCT) through the project 3DORuS, reference PTDC/EIA-EIA/102930/2008 and
by the INESC-ID multiannual funding, through the PIDDAC Program funds.

8. REFERENCES
[1] T. F. Ansary, J.-P. Vandeborre, and M. Daoudi. 3d-model search engine from photos. In
Proceedings of the 6th ACM international conference on Image and video retrieval, CIVR ’07,
pages 89–92, New York, NY, USA, 2007. ACM.
[2] R. A. Bolt. Put-that-there: Voice and gesture at the graphics interface. In Proceedings of
the 7th annual conference on Computer graphics and interactive techniques, SIGGRAPH ’80, pages
262–270, New York, NY, USA, 1980. ACM.
[3] D.-Y. Chen, X.-P. Tian, Y. te Shen, and M. Ouhyoung. On visual similarity based 3d model
retrieval. volume 22 of EUROGRAPHICS 2003 Proceedings, pages 223–232, 2003.
[4] C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti. Surround-screen projection-based virtual
reality: the design and implementation of the cave. In Proceedings of the 20th annual conference
on Computer graphics and interactive techniques, SIGGRAPH ’93, pages 135–142, New York, NY, USA,
1993. ACM.
[5] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of
the new age. ACM Comput. Surv., 40:5:1–5:60, May 2008.
[6] B. R. de Araújo, T. Guerreiro, R. J. Costa, J. A. P. Jorge, and J. M. Pereira. Leme wall:
Desenvolvendo um sistema de multi-projecção. 13º Encontro Português de Computação Gráfica, Vila
Real, Portugal, 2005.
[7] T. DeFanti, D. Acevedo, R. Ainsworth, M. Brown, S. Cutchin, G. Dawe, K.-U. Doerr,
A. Johnson, C. Knox, R. Kooima, F. Kuester, J. Leigh, L. Long, P. Otto, V. Petrovic, K. Ponto,
A. Prudhomme, R. Rao, L. Renambot, D. Sandin, J. Schulze, L. Smarr, M. Srinivasan, P. Weber, and
G. Wickham. The future of the cave. Central European Journal of Engineering, 1:16–37, 2011.
10.2478/s13531-010-0002-5.
[8] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs. A search
engine for 3d models. ACM Trans. Graph., 22:83–105, January 2003.
[9] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic
representation of 3d shape descriptors. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH
symposium on Geometry processing, SGP ’03, pages 156–164, Aire-la-Ville, Switzerland, 2003.
Eurographics Association.
[10] J. Lee. Hacking the nintendo wii remote. Pervasive Computing, IEEE, 7(3):39–45, July-Sept.
2008.
[11] M. Nakazato and T. S. Huang. 3d mars: Immersive virtual reality for content-based image
retrieval. In Proceedings of 2001 IEEE International Conference on Multimedia and Expo
(ICME2001), 2001.
[12] D. A. Norman. Natural user interfaces are not natural. interactions, 17:6–10, May 2010.
[13] E. Paquet and M. Rioux. Nefertiti: a query by content software for three-dimensional models
databases management. In NRC 97: Proceedings of the International Conference on Recent Advances
in 3-D Digital Imaging and Modeling, page 345, Washington, DC, USA, 1997. IEEE Computer Society.
[14] B. Shneiderman. Direct manipulation for comprehensible, predictable and controllable user
interfaces. In Proceedings of the 2nd international conference on Intelligent user interfaces,
IUI ’97, pages 33–39, New York, NY, USA, 1997. ACM.
[15] B. Sousa Santos, P. Dias, A. Pimentel, J.-W. Baggerman, C. Ferreira, S. Silva, and
J. Madeira. Head-mounted display versus desktop for 3d navigation in virtual reality: a user
study. Multimedia Tools Appl., 41:161–181, January 2009.
[16] A. van Dam. Post-wimp user interfaces. Commun. ACM, 40:63–67, February 1997.
The Mosaic Test: Benchmarking Colour-based Image Retrieval Systems Using Image Mosaics

William Plant, Joanna Lumsden, Ian T. Nabney
School of Engineering and Applied Science
Aston University
Birmingham, U.K.

ABSTRACT
Evaluation and benchmarking in content-based image retrieval has always been a somewhat
neglected research area, making it difficult to judge the efficacy of many presented approaches.
In this paper we investigate the issue of benchmarking for colour-based image retrieval systems,
which enable users to retrieve images from a database based on low-level colour content alone.
We argue that current image retrieval evaluation methods are not suited to benchmarking
colour-based image retrieval systems, due in main to not allowing users to reflect upon the
suitability of retrieved images within the context of a creative project and their reliance on
highly subjective ground-truths. As a solution to these issues, the research presented here
introduces the Mosaic Test for evaluating colour-based image retrieval systems, in which
test-users are asked to create an image mosaic of a predetermined target image, using the
colour-based image retrieval system that is being evaluated. We report on our findings from a
user study which suggests that the Mosaic Test overcomes the major drawbacks associated with
existing image retrieval evaluation methods, by enabling users to reflect upon image selections
and automatically measuring image relevance in a way that correlates with the perception of many
human assessors. We therefore propose that the Mosaic Test be adopted as a standardised
benchmark for evaluating and comparing colour-based image retrieval systems.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation; H.2.8
[Database Management]: Database Applications—Image Databases

Keywords
Image databases, content-based image retrieval, image mosaic, performance evaluation,
benchmarking.

Copyright © 2011 for the individual papers by the papers’ authors. Copying permitted only for
private and academic purposes. This volume is published and copyrighted by the editors of
euroHCIR2011.

1. INTRODUCTION
Colour-based image retrieval systems such as Chromatik [1], MultiColr [5] and Picitup [10]
enable users to retrieve images from a database based on colour content alone. Such a facility
is particularly useful to users across a number of different creative industries, such as
graphic, interior and fashion design [6, 7]. Surprisingly, however, little research appears to
have been conducted into evaluating colour-based image retrieval systems. Currently, there is no
standardised measure and image database to evaluate the performance of an image retrieval system
[8]. The most commonly applied evaluation methods are those of precision and recall [8] and the
target search and category search tasks [11]. The precision and recall measure is used to
evaluate the accuracy of image results returned by a system in response to a query, whilst the
target search and category search tasks are both user-based evaluation strategies in which
test-users are asked to retrieve images from a database that are relevant to a given target,
using the image retrieval system that is being evaluated.

In this research, we argue that the image retrieval system evaluation strategies listed above
are not suitable for evaluating and benchmarking colour-based image systems for two fundamental
reasons. Firstly, none of the above evaluation methods allow test-users to perform an important
process often conducted by creative users, known as reflection-in-action [12]. In
reflection-in-action, a creative project is modified by a user and then reviewed by the user
after the modification. After assessing their modification, the creative individual will then
decide whether to maintain or discard the modification to the project. As an example, a graphic
designer will add an image to a web page before making an assessment as to its aesthetic
suitability. Secondly, the category search and precision and recall measures require an image
database and associated ground-truth (a manually generated list pre-defining which images in the
database are similar to others) for defining image relevance during a system evaluation. Such
human-based definitions of similarity, however, can often be highly subjective resulting in
retrieved images being incorrectly assessed as irrelevant.

As a result of these drawbacks, no method currently exists for reliably evaluating colour-based
image retrieval systems. The following section introduces the Mosaic Test which has been
developed to address the current problem, providing a reliable means for benchmarking
colour-based image retrieval systems.
2. THE MOSAIC TEST
For the Mosaic Test, participants are asked to manually create an image mosaic (comprising 16
cells) of a predetermined target image. An image mosaic (first devised by Silvers [14]) is a
form of art that is typically generated automatically through use of content-based image
analysis. A target image is divided into cells, each of which is then replaced by a small image
with similar colour content to the corresponding cell in the target image. Viewed from a
distance, the smaller images collectively appear to form the target image, whilst viewing an
image mosaic close up reveals the detail contained within each of the smaller images. An example
of an automatically generated image mosaic is shown in Figure 1.

Figure 1: An example of an image mosaic. The region highlighted green in the image mosaic
(right) has been created using the images shown (left).

For target images in the Mosaic Test, photographs of jelly beans are used. The images of jelly
beans produce a bright, interesting target image for participants to create in mosaic form and
the generation of an image mosaic that appears visually similar to the target image is also very
achievable. More importantly, retrieving images from a database comprising large areas of a
small number of distinct colours is a practice commonly performed by users in creative
industries.

To complete their image mosaics, participants must identify the colours required to fill an
image mosaic cell (by inspecting the corresponding region in the target image), and retrieve a
suitably coloured image from the 25,000 contained within the MIRFLICKR-25000 image collection
[4] using the colour-based image retrieval system under evaluation. When selecting images for
use in their image mosaic, users can add, move or remove images accordingly to assess the
suitability of images within the context of their image mosaic. It is in this way that the
Mosaic Test overcomes the first major drawback of existing evaluation methods, by enabling
participants to perform the creative practice of reflection-in-action [12]. Upon completion of
an image mosaic, the time required by the user to finish the image mosaic is recorded, along
with the visual accuracy of their creation in comparison with the initial target image. Through
analysing the accuracy of user-generated image mosaics (in a manner which correlates with the
perception of a number of different human assessors), the Mosaic Test is able to overcome the
second drawback associated with existing evaluation techniques. This is because it does not rely
on a highly subjective image database ground-truth. The image mosaic accuracy measure adopted
for use with the Mosaic Test is discussed further in Section 3.1. Additionally, participants are
asked to indicate their subjective experience of workload (using the NASA TLX scales [2]) post
test.

The time (number of seconds), subjective workload (user NASA-TLX ratings) and relevance (image
mosaic accuracy) measures achieved by colour-based image retrieval systems evaluated using the
Mosaic Test can be directly compared and used for benchmarking. When comparing the Mosaic Test
measures achieved by different systems, the more effective colour-based image retrieval system
will be the one that enables users to create the most accurate image mosaics, fastest and with
the least workload.

2.1 Mosaic Test Tool
To support users in their manual creation of image mosaics using the Mosaic Test, we have
developed a novel software tool in which an image mosaic of a predetermined target image can be
created using simple drag and drop functions. We refer to this as the Mosaic Test Tool. The
Mosaic Test Tool has been designed so that it can be displayed simultaneously with the
colour-based image retrieval system under evaluation (as can be seen in Figure 2). This removes
the need for users to constantly switch between application windows, and permits users to easily
drag images from the colour-based image retrieval system being tested to their image mosaic in
the Mosaic Test Tool. It is important to note that the facility to export images through drag
and drop operations is the only requirement of a colour-based image retrieval system for it to
be compatible with the Mosaic Test Tool and thus the Mosaic Test.

Figure 2: The Mosaic Test Tool (left) and an image retrieval system under evaluation (right)
during a Mosaic Test session.

The target image and image mosaic are displayed simultaneously on the Mosaic Test Tool interface
to allow users to manually inspect and identify the colours (and colour layout) required for
each image mosaic cell. As can be seen in Figure 2, the target image (the image the user is
trying to replicate in the form of an image mosaic) is displayed in the top half of the Mosaic
Test Tool. Coupled with the ease in which images can be added to, or removed from, image mosaic
cells, users of the Mosaic Test Tool can simply assess the suitability of a retrieved image by
dragging it to the appropriate image mosaic cell and viewing it alongside the other image mosaic
cells.

3. USER STUDY
To evaluate the Mosaic Test, we recruited 24 users to participate in a user study. Participants
were given written instructions explaining the concept of an image mosaic and the functionality
of the Mosaic Test Tool. A practice session was undertaken by each participant, in which they
were asked to complete a practice image mosaic using a small selection of suitable images.
Participants were then asked to complete 3 image mosaics using 3 different colour-based image
retrieval systems. To ensure that users did not simply learn a set of database images suitable
for use in a solitary image mosaic, 3 different target images were used. These target images
were carefully selected so that the number of jelly beans (and thus colours) in each were evenly
balanced, with only the colour and layout of the jelly beans varying between the target images.
To also ensure that results were not affected by a target image being more difficult to create
in image mosaic form than another, the order in which the target images were presented to
participants remained constant whilst the order in which the colour-based image retrieval
systems were used was counterbalanced. After completing the 3 image mosaics, participants were
asked to rank each of their creations in ascending order of ‘closeness’ to its corresponding
target image.

We wanted to investigate whether the Mosaic Test does overcome the drawbacks of existing
evaluation strategies so that it may be adopted as a reliable benchmark of colour-based image
retrieval systems. Firstly, we hypothesised that users in the study would perform
reflection-in-action and so we wanted to observe whether this was indeed true for participants
when judging the suitability of images retrieved from the database. Secondly, we were eager to
investigate which method should be adopted for measuring the accuracy of an image mosaic in the
Mosaic Test.

3.1 Assessing Image Mosaic Accuracy
As an image mosaic is an art form intended to be viewed and enjoyed by humans, it seems logical
that the adopted measure of image mosaic accuracy - i.e., how close an image mosaic looks to its
intended target image - should correlate with the inter-image distance perceptions of a number
of human assessors. An existing measure for automatically computing the distance between an
image mosaic and its corresponding target image is the Average Pixel-to-Pixel (APP) distance
[9]. The APP distance is expressed formally in Equation (1), where i is 1 of a total n
corresponding pixels in the mosaic image M and target image T, and r, g and b are the red, green
and blue colour values of a pixel.

$$ APP = \frac{\sum_{i=0}^{n} \sqrt{(r_M^i - r_T^i)^2 + (g_M^i - g_T^i)^2 + (b_M^i - b_T^i)^2}}{n} \qquad (1) $$

We were eager to compare the existing APP image mosaic distance measure with a variety of image
colour descriptors (and associated distance measures) commonly used for content-based image
retrieval, to discover which best correlates with human perceptions of image mosaic distance. To
do this, we calculated the image mosaic distance rankings according to the existing measure and
several colour descriptors (and their associated distance measures), and then calculated the
Spearman’s rank correlation coefficient between each of the tested distance measures and the
rankings assigned by the users in our study.

For the image colour descriptors (and associated distance measures), we firstly tested the
global colour histogram (GCH) as an image descriptor. A colour histogram contains a normalised
pixel count for each unique colour in the colour space. We used a 64-bin histogram, in which
each of the red, green and blue colour channels (in an RGB colour space) were quantised to 4
bins (4 x 4 x 4 = 64). We adopted the Euclidean distance metric to compare the global colour
histograms of the image mosaics and corresponding target images. We also tested local colour
histograms (LCH) as an image descriptor. For this, 64-bin colour histograms were calculated for
each image mosaic cell (for the image mosaic descriptor), and its corresponding area in the
target image (for the target image descriptor). The average Euclidean distance between all of
the corresponding colour histograms (in the image mosaic and target image LCH descriptors) was
used to compare LCH descriptors. Finally, we tested (along with their associated distance
measures) the MPEG-7 colour structure (MPEG-7 CST) and colour layout (MPEG-7 CL) descriptors
[13], as well as the auto colour correlogram descriptor (ACC) [3].

The auto colour-correlogram (ACC) of an image can be described as a table indexed by colour
pairs, where the k-th entry for colour i specifies the probability of finding another pixel of
colour i in the image at a distance k. For the MPEG-7 colour structure descriptor (MPEG-7 CST),
a sliding window (8 × 8 pixels in size) moves across the image in the HMMD colour space [13]
(reduced to 256 colours). With each shift of the structuring element, if a pixel with colour i
occurs within the block, the total number of occurrences in the image for colour i is
incremented to form a colour histogram. The distance between two MPEG-7 CSTs or two ACCs can be
calculated using the L1 (or city-block) distance metric. Finally, the MPEG-7 colour layout
descriptor (MPEG-7 CL) [13] divides an image into 64 regular blocks, and calculates the dominant
colour of the pixels within each block [13]. The cumulative distance between the colours (in the
YCbCr colour space) of corresponding blocks forms the measure of similarity between 2 MPEG-7 CL
descriptors.

Accuracy Measure    r_s      Significant (5%)
MPEG-7 CST          0.572    YES
APP                 0.275    NO
GCH                 0.242    NO
MPEG-7 CL           0.198    NO
LCH                 0.176    NO
ACC                 0.154    NO

Table 1: The Spearman’s rank correlation coefficients (r_s) between the image mosaic distance
rankings made by humans and the rankings generated by the tested colour descriptors.
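
As a rough illustration of two of the simpler measures listed in Table 1, the sketch below
computes the APP distance of Equation (1) and the Euclidean distance between 64-bin global
colour histograms. It is not the code used in the study: images are assumed to be flat lists of
(r, g, b) integer tuples in the range 0 to 255, the average is taken over the n pixels supplied,
and the function names are hypothetical.

```python
import math


def app_distance(mosaic, target):
    """Average Pixel-to-Pixel distance between two images of equal size."""
    if not mosaic or len(mosaic) != len(target):
        raise ValueError("images must be non-empty and contain the same number of pixels")
    total = 0.0
    for (rm, gm, bm), (rt, gt, bt) in zip(mosaic, target):
        # Euclidean distance between corresponding pixels in RGB space.
        total += math.sqrt((rm - rt) ** 2 + (gm - gt) ** 2 + (bm - bt) ** 2)
    return total / len(mosaic)


def global_colour_histogram(pixels, bins_per_channel=4):
    """Normalised histogram with bins_per_channel ** 3 bins (4 x 4 x 4 = 64 by default)."""
    width = 256 // bins_per_channel  # 64 intensity levels per bin for 4 bins
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // width) * bins_per_channel ** 2
                 + (g // width) * bins_per_channel
                 + (b // width))
        hist[index] += 1.0
    return [count / len(pixels) for count in hist]


def gch_distance(mosaic, target):
    """Euclidean distance between the global colour histograms of two images."""
    hm = global_colour_histogram(mosaic)
    ht = global_colour_histogram(target)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hm, ht)))


# Two-pixel toy images (assumed data).
mosaic = [(255, 0, 0), (0, 0, 0)]
target = [(255, 0, 0), (3, 4, 0)]
app = app_distance(mosaic, target)
gch = gch_distance(mosaic, target)
```

In this toy case the APP distance is non-zero because one pixel pair differs slightly, whereas
the GCH distance is zero because both pixels fall into the same coarse bins. The two measures
therefore respond to different kinds of difference, which is one reason they can rank the same
set of mosaics differently.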
ture descriptors from the user-generated image mosaics and their corresponding target images,
and calculating the L1 (or city-block) distance between them. As a result of our findings, we
propose that the Mosaic Test be adopted in all future research evaluating the effectiveness of
colour-based image retrieval systems. Future work will be to publicly release the Mosaic Test
Tool and procedural documentation for other researchers in the domain of content-based image
retrieval.

4. RESULTS
Table 1 shows the Spearman’s rank correlation coefficients (r_s) calculated between the
human-assigned rankings and each of the rankings generated by the tested colour descriptors. We
compare the r_s correlation coefficient for each measure tested with the critical value of r,
which at a 5% significance level with 22 d.f. (24 − 2) equates to 0.423. Any r_s value greater
than this critical value can be considered a significant correlation at a 5% level.

5. DISCUSSION 7. REFERENCES
We observed the actions taken by the participants of the user [1] Exalead. Chromatik. Accessed December 1, 2010, at:
study when creating their image mosaics. It was clear that http://chromatik.labs.exalead.com/.
the majority of users performed reflection-in-action when [2] S. G. Hart. NASA-Task Load Index (NASA-TLX); 20
assessing the relevance (or suitability) of images retrieved Years Later. In Proceedings of the Human Factors and
from the database for use in their image mosaics. As partic- Ergonomics Society 50th Annual Meeting, pages
ipants of a Mosaic Test were able to perform this reflection- 904–908, 2006.
in-action [12], it is clear that the Mosaic Test also overcomes [3] J. Huang, S. R. Kumar, M. Mitra, W. Zhu, and
the first of the two major drawbacks present in current im- R. Zabih. Image Indexing Using Color Correlograms.
age retrieval evaluation methods. As shown in Table 1, the In Computer Vision and Pattern Recognition, pages
MPEG-7 colour structure descriptor (MPEG-7 CST) was 762–768, 1997.
the only colour descriptor (and associated distance measure) [4] M. J. Huiskes and M. S. Lew. The MIR Flickr
we found to correlate with human perceptions of image mo- Retrieval Evaluation. In ACM International
saic distance at the 5% significance level. Therefore, by mea- Conference on Multimedia Information Retrieval,
suring the L1 (or city-block) distance between the MPEG-7 pages 39–43, 2008.
CSTs of the target image and user-generated image mosaics, [5] idée Inc. idée MultiColr Search Lab. Accessed
the Mosaic Test can automatically calculate the relevance November 2, 2010 at
of retrieved images in a manner that correlates with human http://labs.ideeinc.com/multicolr.
perception, thus overcoming the second major drawback of
[6] Imagekind Inc. Shop Art by Color. Accessed
existing image retrieval evaluation methods for benchmark-
November 2, 2010, at:
ing colour-based image retrieval systems (the reliance on a
http://www.imagekind.com/shop/ColorPicker.aspx.
highly subjective image database ground-truth).
[7] T. K. Lau and I. King. Montage : An Image Database
for the Fashion, Textile, and Clothing Industry in
6. CONCLUSION Hong Kong. In Third Asian Conference on Computer
Current image retrieval system evaluation methods have two Vision, pages 410–417, 1998.
fundamental drawbacks that result in them being unsuit-
[8] H. Müller, W. Müller, D. M. Squire,
able for evaluating and benchmarking colour-based image
S. Marchand-Maillet, and T. Pun. Performance
retrieval systems. These evaluation strategies do not enable
Evaluation in Content-Based Image Retrieval:
users to perform the practise of reflection-in-action [12], in
Overview and Proposals. Pattern Recognition Letters,
which creative users assess project modifications within the
22(5):593–601, 2001.
context of the creative piece he/she is working on. The
existing image retrieval system evaluation methods also rely [9] S. Nakade and P. Karule. Mosaicture: Image Mosaic
heavily upon highly subjective image database ground-truths Generating System Using CBIR Technique. In
International Conference on Computational
when assessing the relevance of images selected by test users
Intelligence and Multimedia Applications, pages
or returned by a system. As a result of these drawbacks, no
339–343, 2007.
method currently exists for reliably evaluating and bench-
marking colour-based image retrieval systems. In this paper, [10] Picitup. Picitup. Accessed January 21, 2011, at:
we have introduced the Mosaic Test which has been devel- http://www.picitup.com/.
oped to address the current problem, by providing a reliable [11] W. Plant and G. Schaefer. Evaluation and
means by which to evaluate colour-based image retrieval sys- Benchmarking of Image Database Navigation Tools. In
tems. International Conference on Image Processing,
Computer Vision, and Pattern Recognition, pages
The findings of a user study reveal that the Mosaic Test 248–254, 2009.
overcomes the two major drawbacks associated with existing [12] D. A. Schön. The Reflective Practitioner: How
evaluation method used in the research domain of image re- Professionals Think in Action. Basic Books, 1983.
trieval. As well as also providing valuable effectiveness data [13] T. Sikora. The MPEG-7 Visual Standard for Content
relating to efficiency and user workload, the Mosaic Test Description - An Overview. IEEE Transactions on
enables participants to reflect on the relevance of retrieved Circuits and Systems for Video Technology, 11(6),
images within the context of their image mosaic (i.e., per- 2001.
form reflection-in-action [12]). The Mosaic Test is also able [14] R. Silvers. Photomosaics: Putting Pictures in their
to automatically measure the relevance of retrieved images Place. Master’s thesis, Massachusetts Institute of
in a manner which correlates with the perceptions of mul- Technology, 1996.
tiple human assessors, by computing MPEG-7 colour struc-
Evaluating the Cognitive Impact of
Search User Interface Design Decisions
Max L. Wilson
Future Interaction Technology Labs
Department of Computer Science, College of Science
Swansea University, UK
m.l.wilson@swansea.ac.uk
ABSTRACT
The design of search user interfaces has developed dramatically over the years, from simple keyword search systems to complex combinations of faceted filters and sorting mechanisms. These complicated interactions can provide the searcher with a lot of power and control, but at what cost? Our own work has seen users experience a sharp learning curve with faceted browsers, even before they begin interacting. This paper describes a forthcoming period of work that intends to investigate the cognitive impact of incrementally adding features to search user interfaces. We intend to produce search user interface design recommendations to help designers maximize support for searchers while minimizing cognitive impact.

Author Keywords
Search, Exploratory Search, User Interface Design, Cognitive Load Theory

ACM Classification Keywords
H5.2. Information interfaces and presentation (User Interfaces): evaluation/methodology, screen design. H3.3. Information search and retrieval: Search process.

INTRODUCTION
User Interface (UI) Designers are always concerned with supporting users effectively and intuitively, but a common recent focus for Search User Interface (SUI) designers has been to increase the interactive power and control that searchers have over results. As a community, we want to support users in exploring, discovering, comparing, and choosing results that meet their needs. SUI designers, therefore, are concerned with maximizing the use of powerful interface features while maintaining a clear and intuitive design.

In our prior work, we developed mSpace [7] as a faceted browser that lets searchers use combinations of orthogonal metadata filters to narrow their search. We developed advanced interactions for faceted browsers that took advantage of visual location within the SUI, and highlighted related options in unused filters to guide searchers [10]. Frequently, however, we informally noted that searchers spent increasing periods of time on visually comprehending the interface before making their first move. In follow-up studies, we saw minimal interaction with facets during the first visit, but recorded a significant increase in the use of faceted features during subsequent return visits. It is the hypothesis of our forthcoming work that this non-use of such powerful features is caused by an increased cognitive load created by the associated increased complexity of the SUI. It is this cognitive impact that we believe can be measured and attributed to specific design decisions.

mSpace is one specific faceted browser, but the principle of faceted browsing can be implemented in many different ways [2]. We also hypothesize that not only the presence, but also the subsequent design of SUI features can have an impact. The following sections cover some related work before describing our plans to evaluate the cognitive impact that adding features to SUIs can have.

RELATED WORK
SUI design is affected by many factors. Interaction designers can decide how best to support searchers, but designs may be limited by the metadata that is available about the possible results. Both the underlying data and the graphical design may also have an impact, then, on how the chosen interaction will look and feel. As perhaps the most recognized SUI for many users around the world, Google has always maintained a very clean and clear white design¹, and makes very incremental, careful design changes that stay within that design. Competitor search engines have notably changed over the years, with many now being very similar to Google in terms of interaction design, while trying to keep their own visual design consistent.

¹ http://searchengineland.com/qa-with-marissa-mayer-google-vp-search-products-user-experience-10370

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

For more exploratory websites that sell a wide range of products, or provide large collections of information or documents, there are now many different features that support people, from tabular or dropdown-based sorting
mechanisms, to categories, clusters, filters, and facets. Some websites that provide these features are frustrating and difficult to use, while others are simple, intuitive, and successful. In these systems it is often the way that the ideal support has been developed that has affected their success. In a study of the success of different faceted browser implementations, Capra et al. [1] directly compared two faceted browsers to a government website, all over the same hierarchical government dataset, and discovered that the customized hierarchical design of the original website supported searchers far better than the functionally more powerful faceted browsers.

The choice of content and the visual design have both been shown to have an impact on usability. White et al. showed that text that includes the search terms is best, and that highlighting these terms also improves search [12]. Similarly, Lin et al. have shown that simply highlighting the domain name in the URL bar significantly reduces the chances that users will be caught by phishing attacks [4]. Zheng et al. [13] have also shown that users can make often-accurate snap judgments about the credibility of websites within half a second. Further, Wilson et al. [10] noted that the success of adding guiding highlights to their faceted browser was affected by the choice of highlight-colour and its implied meaning.

The choice of SUI features within a single implementation has also been shown to have an impact on search success. Diriye et al. compared a keyword search interface with a revised version that also included query suggestions [3]. Their results showed that such features slowed down searchers who were performing simple lookup tasks, but supported those who were performing more complicated exploratory tasks. Similarly, Wilson and Wilson have found early results indicating that, during exploratory tasks, the simple presence of a keyword cloud, without interaction, provides additional support, whereas subsequent interaction provides very little gain [11]. Wilson and Wilson’s results suggest that searchers can learn more about the result set from seeing the terms in the keyword cloud than from actually using them to filter the results.

The location of features within a SUI has also been shown to have an impact. Morgan and Wilson studied the visual layout of search thumbnails, predicting that having a rack of thumbnails at the top of the user interface would allow searchers to make faster judgments when trying to re-find pages [5]. Their results showed that the disruption a rack of thumbnails caused to searchers when the target page was not in the results was significantly greater than the support it provided when it was.

The studies above indicate that the success of SUIs can be attributed to the appropriateness of the functionality provided, where unnecessary functionality can slow users down. Further, the studies indicate that the success of SUIs can be determined by simple visual or spatial changes that do not necessarily impact functionality. Consequently, where two systems provide the same support, one may be harder or easier to use because of its simple visual design. Our conclusion is that to understand the success of a SUI, we must analyse both the support in terms of functionality, and the cognitive impact it creates. Being able to understand and predict these two things would help us to design and build better SUIs.

EVALUATING THE SUPPORT PROVIDED BY SUIS
Beyond the common practice of performing task-oriented user studies, my own doctoral work focused on the design of an analytical evaluation metric for SUIs, called the Search Interface Inspector² (Sii). Sii calculates the support for different types of users based upon the set of features in the interface, and how many interactions they take to use [9]. To analyse a SUI, the evaluator catalogues the features of the design and calculates how many interactions are required to perform a set of known search tactics. The method then interpolates the likely support for different types of searchers (explorers or searchers that know what they are looking for, for example), based upon the types of tactics they are likely to perform. Sii can be used to compare several designs and produces a series of 3 interactive graphs that allow evaluators to perform an investigative analysis of the results.

² http://mspace.fm/sii

Sii is based on detailed, established information seeking theory and rewards the design of search functionality that has simple interaction. Consequently, however, Sii rewards the addition of new simple functionality, without being able to estimate the increasing complexity of the SUI as new features are added. To remedy this problem, a chapter of the thesis investigated Cognitive Load Theory and initially specified a similar metric that calculated the cognitive load of a UI. This second measure, proposed for inclusion in Sii, estimated the intrinsic cognitive load of a SUI. Similar to how the original metric was correlated with study results, one aim of the work described below is to further refine and validate this analytical measure of the cognitive impact of SUIs.

Cognitive Load Theory highlights that capacity for learning is affected by three aspects: intrinsic, extrinsic, and germane cognitive load. Intrinsic cognitive load is created by the materials providing the learning experience, or in our case the SUI. Extrinsic cognitive load is created by the complexity in the task at hand. Germane cognitive load is then required to process what is learned and commit it to long-term memory. If intrinsic load and extrinsic load are too high, then there may not be enough space left for germane cognitive load. Although it is commonly accepted that effort can increase overall capacity, the aim should still be to reduce intrinsic cognitive load by improving the design of learning materials or SUIs [6]. Reducing intrinsic load creates space for users to perform increasingly
complex tasks, or opens up germane cognitive load so that what is being learned can be retained.

EVALUATING THE COGNITIVE IMPACT OF SUIS
The general structure of the studies we are planning is to use brain scanners to record the cognitive impact that different SUIs have on a user. The initial phases will focus on identifying and measuring such responses to significant and obvious differences, before trying to capture changes to more subtle designs and, hopefully, in situ. Initially, we will be using EPOC Emotiv headsets³, as shown in Figure 1, to take readings. These headsets are commercialized versions of EEG scanners, but are designed for use in more natural contexts. EEG scanners, as with many other brain scanning systems, are typically affected by simple body movements and so are often restricted to confined conditions. Such scanners, therefore, are often not suitable for task-based evaluations, which require action and movement. In psychology, EEG scanners are typically used in constrained environments where users are only allowed to move their thumbs to answer yes or no. Consequently, this work requires scanners that can be used in more natural contexts while performing everyday searching tasks. In the future, funding permitting, we also intend to buy an fNIR scanner, which has been shown to be suitable for task-based evaluation conditions [8]. We intend to use these measurements to understand the impact of design decisions, in order to make clear recommendations to SUI designers.

³ http://www.emotiv.com/

Figure 1: EPOC Emotiv Headset

Phase 1 – the impact of additional features
Beginning this summer, with two summer interns, we will be performing our first studies, which will simply display SUIs of incremental complexity to participants. We will begin with a simple keyword search design, and add features such as recommendations and filters. The order in which interfaces are shown to participants will be randomized to avoid learning and familiarity bias. The aim of this phase is to prove that the learning curves experienced by users exist and that the cognitive load can be measured objectively. We hope that the results will show initial insight into the amount of impact that different features have, which may in turn help us make hypotheses about design issues. This phase will help us identify the cost of adding a feature, where task success would allow us to measure their benefit.

Phase 2 – capturing impact in the context of tasks
Where the first phase above allows us to learn to recognize the signs from EEG signals, we intend to try and detect cognitive load in situ, and in the context of a task. We will be setting participants specific simple and exploratory tasks, whilst controlling the type of user interface features they see, to capture the cognitive impact as they start. This phase will help us identify whether the impact of a search user interface is affected by task context.

Phase 3 – the impact of different implementations
While adding features creates an obvious change in the user interface, different features can be put in different places in the SUI and also be implemented differently. Google, for example, puts suggested refinements at the bottom of the page, while Bing has them on the side. Bing also chooses to provide a mix of refinements and alternative directions. In Phase 3 we intend to analyse both of these kinds of variables to see if they have significant impacts on cognitive load. This phase will help us identify whether the cost of adding SUI features can be minimized by refining their design.

Discussion
There are many challenges remaining in this planned work. So far, we have planned very controlled comparisons of SUI changes, but in real life these systems are used in the context of complex tasks and for extended periods of time. Controlled situations will help identify cause and effect, but other similar objective measurements, like eye trackers, still require interpretation. We hope to expand on these methods, and the findings of existing brain scanning HCI research [8], by addressing this issue over time. Finally, although this research is primarily interested in the development of SUIs and how they affect people learning to use powerful search features, there are many other things that can be distracting in general UI design. These methods will likely expand to help address other design questions; we, however, are particularly aiming to answer questions about encouraging exploratory search and learning, by increasing the power of SUIs, while reducing their impact on searchers.

CONCLUSIONS
This work has yet to begin formally, but we intend to learn more about the impact that very simple design decisions can have on searchers. From previous experience of searcher success in evaluations, both industry and academia know that such changes can seriously impact the success of a search user interface. This work will use objective measurements of brain response to help us identify the factors that make search user interfaces hard to comprehend. We hope that such measurements will a) help us analyse the cost-benefit trade-off of adding additional
support to search user interfaces, and b) help us develop design recommendations for implementing search user interface features so that they have minimal impact.

REFERENCES
1. Robert Capra, Gary Marchionini, Jung Sun Oh, Fred Stutzman, and Yan Zhang. 2007. Effects of structure and interaction style on distinct search tasks. In Proc. JCDL '07. ACM, New York, NY, USA, 442-451.
2. Edward C. Clarkson, Shamkant B. Navathe, and James D. Foley. 2009. Generalized formal models for faceted user interfaces. In Proc. JCDL '09. ACM, New York, NY, USA, 125-134.
3. Abdigani Diriye, Ann Blandford, and Anastasios Tombros. 2010. Exploring the impact of search interface features on search tasks. In Proc. ECDL'10.
4. Eric Lin, Saul Greenberg, Eileah Trotter, David Ma, John Aycock. Does Domain Highlighting Help People Identify Phishing Sites. In Proc. CHI2011 (in press).
5. Rhys Morgan and Max L. Wilson. 2010. The Revisit Rack: grouping web search thumbnails for optimal visual recognition. In Proc. ASIS&T '10.
6. Sharon Oviatt. 2006. Human-centered design meets cognitive load theory: designing interfaces that help people think. In Proc. MULTIMEDIA'06. ACM, New York, NY, USA, 871-880.
7. m.c. schraefel, Max Wilson, Alistair Russell, and Daniel A. Smith. 2006. mSpace: improving information access to multimedia domains with multimodal exploratory search. Commun. ACM 49, 4 (April 2006), 47-49.
8. Erin Treacy Solovey, Audrey Girouard, Krysta Chauncey, Leanne M. Hirshfield, Angelo Sassaroli, Feng Zheng, Sergio Fantini, and Robert J.K. Jacob. 2009. Using fNIRS brain sensing in realistic HCI settings: experiments and guidelines. In Proc. UIST '09. ACM, New York, NY, USA, 157-166.
9. Max L. Wilson, M. C. schraefel, and Ryen W. White. 2009. Evaluating advanced search interfaces using established information-seeking models. J. Am. Soc. Inf. Sci. Technol. 60, 7 (July 2009), 1407-1422.
10. Max L. Wilson, Paul André, and mc schraefel. 2008. Backward highlighting: enhancing faceted search. In Proc. UIST '08. ACM, New York, NY, USA, 235-238.
11. Wilson, M. J. and Wilson, M. L. Tag Clouds and Keyword Clouds: evaluating zero-interaction benefits. In Ext. Abstract CHI’11.
12. Ryen W. White, Ian Ruthven, and Joemon M. Jose. 2002. Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes. In Proc. SIGIR '02. ACM, New York, NY, USA, 57-64.
13. Xianjun Sam Zheng, Ishani Chakraborty, James Jeng-Weei Lin, and Robert Rauschenberger. 2009. Correlating low-level image statistics with users - aesthetic and affective judgments of web pages. In Proc. CHI '09. ACM, New York, NY, USA, 1-10.

The potential of Recall and Precision as interface design
parameters for information retrieval systems situated in
everyday environments
Ayman Moghnieh Josep Blat
Universitat Pompeu Fabra Universitat Pompeu Fabra
C/Tanger 122-140, E-08018 C/Tanger 122-140, E-08018
Barcelona, Spain Barcelona, Spain
ayman.moghnie@upf.edu josep.blat@upf.edu

ABSTRACT
In this paper, we investigate ways for a tighter integration of IR and HCI in new urban contexts, as HCI expands its reach outside the workplace towards environments where efficiency and performance no longer constitute the backbone of interaction requirements. In particular, we propose to use Recall and Precision as design parameters to describe the information settings and performance of situated interfaces acting as retrieval systems in these environments. To explore this notion, we follow an inductive design research process by which different prototypes are designed, developed, and evaluated. Our experience shows that Recall and Precision, as design parameters, help to reflect the information requirements onto the interface design, and contribute to adapting IR to the contemporary challenges it faces, although more work is needed to consolidate its role vis-à-vis the growing ubiquity of computer technologies.

Categories and Subject Descriptors
H.5.2 User Interfaces.

General Terms
Design, Experimentation, Human Factors, Theory.

Keywords
Information Retrieval, Human-Information Interaction, Situated Interfaces, Interface and Interaction Design

1. INTRODUCTION
As computer technologies become more ubiquitous and versatile, and get further integrated in human environments, several genres of situated information interfaces (e.g. interactive peripheral displays, ambient displays, and interactive surfaces) are starting to assume a mediating role between people and digital information spaces in different environments. From an HCI perspective, these situated interfaces, primarily found in public and semi-public environments such as malls, public transportation, building entrances, and public squares, represent new border zones that maintain connectivity and mutual presence between the real and the digital worlds, and actively sustain flows of useful or relevant information towards nearby people who in turn search, discover, and interact with the displayed information.

The human interaction with information via situated interfaces creates new challenges for conventional information retrieval (IR) systems. First, the relationship between people and digital information spaces becomes more explicit and the technology that supports it more ubiquitous. Second, the human interaction with information spaces adopts a more direct approach supported by the coming of age of new interaction paradigms (e.g. touch, gesture, speech) that emulate the manipulation of objects. Third, the information space hosted by a situated interface tends to be specialized in subjects and themes befitting the environment where the interface is situated, and the goals and interests of the people present in it. Fourth, the interaction properties may vary considerably in terms of interaction duration and the amount of user attention delegated to the situated interface [1].

These challenges, among others [2], justify the search for a tighter coupling of interface and interaction design, and IR systems, by which IR as a supporting technology for interacting with information contributes to making the interface design more transparent and the human-information interaction more fluid and direct. Therefore, we reason that the performance of situated interfaces as IR systems ought to be attuned according to the nature of each specific interaction scenario, given that a maximization of IR performance may not be adequate for answering the interaction design requirements in all kinds of user experiences with situated interfaces [5, 10]. Consequently, IR performance tilts towards becoming a design issue that determines some of the characteristics of situated interfaces that mediate this interaction.

Currently, two metrics (Recall and Precision) are used to assess the performance of IR systems in response to user queries [3]. Recall is the fraction of the elements in the information space that are relevant to the user query which are actually retrieved. Precision is the fraction of retrieved elements found relevant with respect to the user query, over the entire set of retrieved elements. However, the query as a middleman between humans and information spaces goes against the transparent design of situated interfaces that support a direct interaction with information spaces. In addition, the information spaces hosted by situated interfaces are usually predetermined or pre-queried in accordance with the specific interests of potential
users and the characteristics or nature of the environments where the hosting interfaces are situated. Instead of querying, the explicit momentary needs of users are answered by direct interaction with the visualized information. This effectively converts the relevance of the displayed information to the user interests from a performance factor to a design issue.

Therefore, we argue that the definition of Recall and Precision can be loosened or reinterpreted to respectively describe the quantity of retrieved information elements and their visual diversity as displayed on the interface, since relevance is no longer a performance factor from an HCI stance. These two metrics can consequently act as parameters that bind the design and performance of situated interfaces as retrieval systems to the informational expectations of users, by controlling the amount and diversity of visualized information in order to maximize the transparency of their designs to support a direct human-information interaction.

In order to explore this idea further, we followed a line of inductive design research by conceptualizing, designing, and evaluating experimental prototypes. We first introduce two sets of prototypes devised to understand how users perceive the quantity and visible diversity of information objects. We then define parameterization scales for Recall and Precision based on these experiments. In order to develop a thoughtful understanding of how Recall and Precision, which we will hereafter refer to as R and P respectively, can act as design parameters for situated interfaces, we use them in the analysis, design, and evaluation of five different situated interfaces. Next, we investigate how these two parameters can be dynamically controlled by users through the design of two interactive interfaces for searching and browsing news articles. We conclude by assessing our experience and discussing the viability and implications of our approach.

2. RECALL AND PRECISION FROM A PERCEPTUAL STANCE

Figure 1. An instance of the InformationCasserole prototypes

InformationCasserole is a series of video prototypes (figure 1) designed to study the effect that the number of visualized elements (R) has on the way humans perceive the information revealed on the interface. They show classified ads from magazines and newspapers floating on different levels in a glass container filled with slowly moving water. Therefore, their settings emulate a transparent interface design and foster a direct relationship between the human and digital information spaces.

Miller’s Law argues that the total number of different objects that humans can simultaneously hold in their working memory is approximately seven [4]. This affects the manner by which information is perceived when the cardinality of the visualized set of objects increases. In particular, there is a natural observable tendency to perceptually cluster or group these objects recursively whenever the perceivable number exceeds Miller’s threshold. To observe this phenomenon, eight 10-minute think-aloud sessions were organized with eight different university students, who watched InformationCasserole showing magazine ads progressively being added to the water container, and commented on how the number of ads shown in the casserole affects the way they perceive the set of visualized ads.

We observed that when one object is shown, it tends to engage the subjects in a prolonged and detailed examination. This changes when two to seven objects are displayed, since subjects become more interested in identifying relations among the objects and comparing them. The interest in object relations abates with a higher object number, and instead the relations among clusters or collections of objects start to proportionally grab attention. When the number of visualized objects crosses a certain threshold, which we estimate at Miller’s number squared, the casserole becomes perceptually saturated and the subjects begin to treat the set of ads as a space, reasoning about different regions in it. In conclusion, we find that the quantity of visualized objects (R) is perceived in four different density thresholds, and to each we accord a parameter value: R=0 for visualizing no or a single object; R=1 for a single collection of seven or fewer objects; R=2 for seven or fewer collections; and R=3 for a single information space, or more than seven squared objects. This is reflected in figure 2.

Figure 2. R as a design parameter

In order to study the effects that the visible diversity of information objects (P) has on the manner by which people perceive information, eight paper-based prototypes similar to the InformationCasserole were conceived. Each prototype shows a combination of twelve to fifteen information objects from different genres (e.g. classified ads, news headlines, blog posts, news pictures, movie posters, youtube videos, secondhand goods, and city events). The object genre was emphasized and differentiated by aesthetic design. The visible object diversity encourages people to search for relations among visualized objects [6]. Therefore, the combinations, ranging from one to eight genres, were designed to encourage subjects to search for patterns and relations among the objects. Six twenty-minute think-aloud sessions were organized with subjects who were asked to search for and identify different genres of objects in each of the eight combinations presented in random order.

As expected, the subjects perceptually clustered the objects primarily in accordance with their genre. However, they sometimes tended to search for inner divisions in objects of the same genre (e.g. clustering movies according to their cinematic kind or news articles in familiar news categories), or to merge related genres as
a single genre (e.g. news articles and blog posts, or movie posters
and news pictures). In total, the subjects perceived the diversity of
objects (P) in four different levels, and to each level we accord a corresponding parameter value inversely proportional to the number of visible object genres: the first level is a single-genre diversity (P=3); the second level is a diversity of two to three genres (P=2); the third level refers to diversity of three to four genres (P=1); the fourth level describes a diversity of five to seven genres of objects (P=0). Figure 3 shows the number of visible genres of objects in each of the eight combinations as seen by the subjects, and the P value of each of the four identified diversity levels.

Figure 3. P as a design parameter

3. SITUATED INTERFACES AS IR SYSTEMS

In order to assess how R and P act as design parameters for the information settings of situated interfaces, the following five interfaces that act as retrieval systems in real-world environments were analyzed, and for each a corresponding design was developed and evaluated in settings that resemble or emulate its deployment environment.

The Arts&Movies is a situated interface intended for movie theatre lobbies to support the search and discovery of new interesting movies through an animated visualization that draws attention to relationships between movies and concepts. The DigiJuke is installed inside a bar to allow people to browse and select songs on the touch-screen, and play their video clips accompanied by related images on the projection display. The YouServe prototype is located in a university library lobby to assist people in familiarizing themselves with the available library services, and finding a service relevant to specific needs. The NewsWall is a large display situated in the news production room of a broadcasting corporation. The prototype subtly visualizes the constantly evolving news information space on the web. The MetroWindow is designed for metro wagons and broadcasts summarized local news about cultural and civic events in the city of Barcelona.

In related works [7, 8] we have argued how R and P, as design parameters, can be quantified during requirement analysis and used alongside other aspects to conceptualize the design of information interfaces. For each situated interface, a couple of designers analyzed the characteristics of three entities: the deployment environment, the humans present in it, and the adequate information space, which was defined based on an understanding of the needs and goals of the humans alongside the nature of the environment and the information and activity flows that it hosts. Based on this analysis, the designers qualified the values of R and P for each situated interface, and consequently described its information settings, being the quantity of information to visualize and its visible diversity. This qualification of R and P was defined in accordance with several non-disjoint or co-dependent situational aspects of human-information interaction such as:

· The amount of available user attention (e.g. MetroWindow has little user attention at its disposal, in contrast with DigiJuke).
· The duration of human interaction with information (e.g. NewsWall remains in contact for prolonged durations, while the interaction with YouServe is more momentary).
· The convergence or divergence of the information seeking tasks (e.g. YouServe supports finding a specific library service, while Arts&Movies is designed to acquaint people with many movies).

Table 1. Values of R and P parameters for each interface

Situated interface    Recall (R)    Precision (P)
Arts&Movies           2             1
DigiJuke              3             3
YouServe              1             2
NewsWall              1             1
MetroWindow           0             3

The results of this R and P qualification are summarized in table 1. They show how R and P can characterize, from a perceptual stance, the role of a situated interface as an information retrieval engine, and parameterize the design of its information settings accordingly. For example, when the user objectives are to search for specific objects (e.g. YouServe), R is minimized, while P can be maximized when the search converges on specific genres (e.g. MetroWindow) or minimized when it diverges to cover many genres (e.g. NewsWall). A maximized R signals that the interaction tackles a large number of objects. In this case, when P is maximized (e.g. DigiJuke), it determines that this large number is a single collection of similar objects, or, when it is minimized (e.g. Arts&Movies), it signals that this large number of objects is a visually diversified information space.

The designers also developed the interfaces' information architecture and aesthetic design, but these activities lie outside the scope of this paper. The final designs are shown in figure 4.

Figure 4. The situated interfaces' final designs

4. USER CONTROL OVER R AND P

Based on the discerned ability of R and P to describe the information settings of situated interfaces and consequently their performance as information retrieval systems, we explored the possibility of allowing users to control them dynamically in classic search and retrieval scenarios. Therefore, we designed two experimental prototypes (figure 5) for querying a large information space of news articles, by which users can set and control the values of both R and P. The prototypes were evaluated to assess the feasibility of this approach and its utility.
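As an illustration only, and not as part of the prototypes reported in this paper, the two scales from Section 2 and the qualification in Table 1 can be restated as simple threshold lookups. The sketch below rests on several assumptions: the function names recall_level and precision_level are hypothetical, the R cut-offs at 7 and 49 are one reading of "Miller's number" and its square, and because the reported genre ranges overlap at three genres, assigning three genres to P=2 is an arbitrary tie-break.

```python
MILLER = 7  # Miller's number as used in Section 2


def recall_level(num_objects: int) -> int:
    """Map a count of visualized objects to R (0-3). Cut-offs are assumed."""
    if num_objects < 0:
        raise ValueError("num_objects must be non-negative")
    if num_objects <= 1:
        return 0  # no object or a single object
    if num_objects <= MILLER:
        return 1  # a single collection of seven or fewer objects
    if num_objects <= MILLER ** 2:
        return 2  # seven or fewer collections
    return 3      # a saturated information space


def precision_level(num_genres: int) -> int:
    """Map a count of perceived genres to P (0-3). Tie-break at 3 is assumed."""
    if num_genres < 1:
        raise ValueError("num_genres must be at least 1")
    if num_genres == 1:
        return 3
    if num_genres <= 3:
        return 2
    if num_genres == 4:
        return 1
    return 0


# (R, P) values copied from Table 1.
TABLE_1 = {
    "Arts&Movies": (2, 1),
    "DigiJuke": (3, 3),
    "YouServe": (1, 2),
    "NewsWall": (1, 1),
    "MetroWindow": (0, 3),
}

if __name__ == "__main__":
    for name, (r, p) in TABLE_1.items():
        print(f"{name}: R={r}, P={p}")
```

Note that the Table 1 values were qualified by designers from situational aspects rather than computed from object counts, so the lookup functions only restate the perceptual scales and do not reproduce that qualification.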
The NewSearch prototype places two slide-bars adjacent to the query textbox for setting R and P explicitly, and returns an equivalent clustered visualization of news articles. Users control the number of clusters (discerned by color) by P and their average cardinality by R. The 3DQuery prototype uses a tag-map as a new concept for defining user queries, and shows a corresponding map of news articles. The tag-map is a rectangular box where users can place different tags of distinct sizes. The position of each tag determines that of the corresponding cluster of news articles, and the tag size the cluster cardinality.

Figure 5. NewSearch (left) and 3DQuery (right) prototypes

Each prototype was evaluated by a different group of ten subjects in the lab. The subjects were asked to browse and read the collection of news articles for fifteen minutes, and then answer a set of open-ended questions concerning the prototype's utility and usability. The user evaluations of both prototypes showed that their learning curve is not negligible. Subjects were not naturally inclined to use the slide-bars of NewSearch to control the information settings. An explanation for this may well be that they are accustomed to a given query paradigm and the difficulty lies in making the paradigm change [9]. However, this issue requires further investigation. Subjects found it easy to use the tag-map paradigm in general, but it was deemed too complicated for simple queries and more useful for prolonged search and exploration since it allows users to dynamically adjust queries and therefore eliminates or reduces the need for re-querying.

The experience and knowledge gathered with the design and evaluation of these two prototypes will be used for developing future prototypes that intend to delegate more intuitively a dynamic control over the information settings of information retrieval interfaces to their users.

5. CONCLUSIONS

During the course of this paper we have explored ways to tightly integrate IR and HCI in a variety of human-information interaction scenarios where interfaces act as information retrieval systems. In particular, we studied how R and P as design parameters can describe the information settings of these interfaces. Both aspects were parameterized on a 0-3 scale on the basis of conducted experiments to analyze different possible information settings. Consequently, five situated interfaces were designed and analyzed to discern how R and P are qualified during requirement analysis, and how together they describe the information settings of situated interfaces, and therefore help reflect the interaction requirements onto the interface design. Finally, we investigated the feasibility and utility of delegating control of R and P dynamically to users during classic search and retrieval scenarios, and concluded that while this approach is clearly advantageous for exploration tasks and tasks that require re-querying, a more profound study should be conducted for further analysis. Such an endeavor will constitute the essence of our future work.

6. DISCUSSION

The approach that we presented in this paper demonstrates that a tighter integration of HCI and IR is possible, by exploring the potential of R and P as design parameters for the information settings of situated interfaces. The use of these two performance metrics as design parameters may be seen as controversial; however, it is justified given that efficiency and information relevance no longer constitute the backbone of user expectations in all cases of human-information interaction. Instead, new aspects of human-information interaction (e.g. emotional, cognitive, experiential, situational, and cultural) are affecting the manner by which we conceptualize information systems. Our approach does not comprehensively address all these aspects, and therefore can be complemented by introducing new parameters to reflect with a higher affinity the aspects of human-information interaction onto the system design.

7. ACKNOWLEDGEMENTS

The authors would like to thank Oriol Galimany and other members of the Interactive Technology Group at Universitat Pompeu Fabra for their support.

8. REFERENCES

[1] Vogel, D. and Balakrishnan, R. 2004. Interactive public ambient displays: transitioning from implicit to explicit, public to personal, interaction with multiple users. Proceedings of UIST '04, pp. 137-146.
[2] NJ Belkin. Some(what) grand challenges for information retrieval. ACM SIGIR Forum, 2008.
[3] R.A. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[4] Miller G. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review, 1956.
[5] L. Hallnäs and J. Redström. 2001. Slow Technology, Designing for Reflection. Personal Ubiquitous Comput. 5, 3 (January 2001), 201-212.
[6] Koffka, K. (1935): Principles of Gestalt Psychology. London, Routledge & Kegan Paul Ltd.
[7] Moghnieh, A., & Blat, J. (2009). A basic framework for integrating social and collaborative applications into learning environments. Proceedings of m-ICTE'09 Vol. 2 (pp. 1057-1061), 2009.
[8] Moghnieh, A., Sayago, S., Arroyo, E., Sopi, G., and Blat, J. Parameterized User-Centered Design for Interacting with Multimedia Repositories. In Proc. MMEDIA '09, IEEE.
[9] B. Buxton. 2007. Sketching User Experiences: Getting the Design Right and the Right Design. Morgan Kaufmann Publishers Inc. CA, USA.
[10] S. Bødker. 2006. When second wave HCI meets third wave challenges. In Proceedings of NordiCHI '06.

Towards User-Centered Retrieval Algorithms

Manuel J. Fonseca
Department of Computer Science and Engineering
INESC-ID/IST/Technical University of Lisbon
R. Alves Redol, 9, 1000-029 Lisboa, Portugal
mjf@inesc-id.pt

ABSTRACT
Nowadays almost all retrieval algorithms (for text, images, drawings, etc.) are mainly concerned with achieving good system-centered measures, such as precision and recall. However, these systems are used by users, who try to achieve goals through the execution of tasks. To better satisfy the users' needs we must involve them in the development process of the retrieval systems.
In this paper, we argue that a user-centered approach, where users are included in the development cycle of the overall retrieval system, can lead to improved retrieval algorithms and also to a better user satisfaction while using the system.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.2 [Information Interfaces and Presentation]: User Interfaces - Graphical user interfaces (GUI)

General Terms
Design, Human Factors

Keywords
User-Centered Design, User-centered approach, Retrieval algorithms

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.
EuroHCIR '11 Newcastle, UK

1. INTRODUCTION
The majority of the retrieval algorithms, whether they are for text, images, drawings, 3D objects, audio, video, etc., are mainly interested in performing well for system-centered measures, like for instance precision and recall. However, these systems are used by users who want to perform specific tasks and achieve specific goals. We can develop a good retrieval system that performs well against a predefined ground truth, but when we deliver it to users they may not be able to find what they want or they may not even be able to submit a query to the system.
For illustration purposes let us consider the following hypothetical scenario: "We developed a system for retrieving generic complex vector drawings, like for instance technical drawings, architectural plans or clipart drawings. We evaluated it using query-by-example and a set of predefined drawings, achieving a good precision and recall measure. Afterwards, when we delivered the system to users, we noticed that they were not able to use it, because they could not find the (first) drawing that they must use as query to find the desired drawing. Moreover, users do not want to search for the complete drawing, but only by a subpart of the drawing."
This scenario could be avoided if, before we developed the retrieval system, we asked users what their needs were, what they wanted to perform on the system and how they wanted to do it. To collect all this information we need to apply a user-centered approach where users are involved in the development of the retrieval system and algorithms.
In this paper we defend a user-centered approach as a way to create better retrieval algorithms and improve the overall retrieval system. We start by briefly describing the user-centered approach and the iterative cycle used in the user interface design. In Section 3 we describe our application of the user-centered approach in the development of retrieval algorithms. Finally, we present some conclusions.

2. USER-CENTERED DESIGN
The user-centered design (UCD) is a design methodology, where the needs, skills and limitations of the users are taken into account during all stages of the development of the system. The key premise of the user-centered design is that the active involvement of the users in the development process as well as in the evaluation of the interactive products can lead to well-designed systems that best meet the desired usability goals. These systems will take advantage of users' skills, will be relevant to their work and activities, and will help them rather than constrain their actions.
One of the principles from the UCD [4] states that we first need to identify who the users will be (profile, skills, limitations, etc.) and what tasks they perform and/or wish to perform. The second principle mentions that the systems should be exposed to users in the early stages of development to collect feedback from them. Finally, the third principle is iterative design. The results and feedback from user testing should be used to fix and improve the system. The UCD assumes an iterative cycle with identification of the users' needs, design of the solution and evaluation, repeated as
often as necessary, as depicted in Figure 1.

[Figure 1 diagram labels: User and Task Analysis; Solution Design and Prototyping; Evaluation]
Figure 1: User-centered design iterative cycle.

3. USER-CENTERED RETRIEVAL
Typically when we want to develop a new retrieval approach, we look at the media to retrieve (text, audio, video, drawings, images, etc.), identify the features that better describe the media, create a matching algorithm and finally we compute precision and recall. Although this methodology allows us to create retrieval systems, we believe that including the user in the development cycle will allow us to deliver better and more usable retrieval systems that will allow users to achieve their goals, and not only systems that have a good precision and recall performance.
Moreover, we should not develop retrieval systems, and that includes descriptor computation, matching algorithms and presentation of the results, without first identifying a set of user needs and functional requirements (first step in the user-centered design). We need to know our users, their skills, their background, their profile. We must identify their needs and requirements, their goals and how they achieve them. In summary, we need to do a user and task analysis before we start developing our retrieval system. User and task analysis should not only influence the design of the user interface, but also the design of the retrieval approach or algorithm.
For instance, users could use various strategies to perform a search in a drawing retrieval system. They could use a drawing that they already have, in a file, to search for similar drawings using query-by-example, or they could draw a sketch of the drawing that they want to find. As we can see, the retrieval solution (feature extraction, indexing and matching algorithms) will be different in each case. While in the first case we only need to compare two drawings of the same complexity and with the same characteristics (sets of lines and polygons), in the second case we need to compare complex drawings with sketches (typically simpler and with fewer elements). Thus, the way users perform the task to achieve their goal influences the retrieval approach that we should develop.
After developing the retrieval solution based on the user requirements, we should evaluate the retrieval system, using not only system-centered measures, but also user-centered measures, such as time to complete tasks, error rates, satisfaction, etc. As in the user-centered design of interactive systems, results from the evaluation of the retrieval system (system and user centered measures) should be used to improve the system and to refine the user and functional requirements of the retrieval system.
One of the things that we observed in one evaluation session with users was that users did not care about where in the order of retrieval the intended drawing appears, the important fact being that it was there. One of the users produced this comment: "It [the system] found it [the drawing]! That is what counts!" However, when we evaluate retrieval systems, the majority of the existing measures and ground truth datasets privilege precision. Of course this system-centered evaluation is important, but we should also take into account the users' perspective, where they privilege recall.

3.1 An Example
Involving the users can affect the way we develop the retrieval algorithms. In recent years we developed a generic approach for complex vector drawing retrieval, based on the topology and geometry of the elements present in the drawing. These two features were used to describe the content of the drawings, and during matching, we first compare the drawings using topology and then we compare the geometry of those with similar topologies, giving the same weight to both features (for more details see [1]). This generic retrieval approach was used to develop one system for retrieving technical drawings [3] and another for retrieving clipart drawings [2].
Before we developed this solution and the two retrieval systems, we performed user and task analysis to understand how users wanted to make queries to this type of system. We noticed that they preferred to draw sketches of the drawing that they were looking for rather than submit an existing drawing to perform a query-by-example. Moreover, most of the time they do not have a drawing similar to the one that they are looking for.
The two systems were both evaluated with users, and from those evaluations we observed that the way users search for technical drawings was different from the way they search for clipart drawings [6]. In the case of technical drawings, users draw more complete sketches with several visual elements, consequently defining a richer topological configuration, as illustrated in Figure 2; for clipart drawings, users produced simpler sketches, with fewer elements and with a poorer topological description (see Figure 3).

Figure 2: Sketch specifying a query to find a technical drawing.

Figure 3: Sketch specifying a query to find a clipart drawing.

Due to this observation during tests with users, we refined our retrieval algorithm for retrieving clipart drawings [5], putting more emphasis on the geometry than on topology. With this change we were able to achieve a better precision and recall measure for clipart drawings, and we adapted our retrieval system to the users' way of sketching queries.

3.2 Discussion
We cannot develop our retrieval algorithms without involving our users in the development cycle. As in the design of interactive systems, in the development of retrieval systems we must also involve the users.
They must be involved in the initial phase, so we can understand how they search for the information, what their knowledge is, what their limitations are and what their profile is. With this we are able to identify users' needs and functional requirements.
Later on, during the development of the algorithms we should take into account this input and adapt the algorithms to provide "good results" for "our" users, and not for the users in general, or for the system.
Finally, during the evaluation stage, besides computing the traditional system-centered measures, for a set of datasets defined as ground truth, we should also involve users in the evaluation to collect quantitative and qualitative measures. Information gathered during evaluation should be used to improve the retrieval algorithms and the overall retrieval system, in the next iteration of the iterative cycle of the user-centered approach.

4. CONCLUSIONS
In this paper we defended a user-centered approach for the development of retrieval systems. As in the case of user interface design, for retrieval systems it is also important to know our users, adapt the algorithms to them, and involve the users in the evaluation of the system.
We believe, and we have confirmed, that the involvement of the user in the development cycle of retrieval systems can lead to better systems that satisfy users' needs and are more adapted to them.

5. ACKNOWLEDGMENTS
This work was supported by FCT through the PIDDAC Program funds (INESC-ID multiannual funding) and the Crush project, PTDC/EIA-EIA/108077/2008.

6. REFERENCES
[1] M. J. Fonseca. Sketch-Based Retrieval in Large Sets of Drawings. PhD thesis, Instituto Superior Técnico / Technical University of Lisbon, July 2004.
[2] M. J. Fonseca, B. Barroso, P. Ribeiro, and J. A. Jorge. Retrieving clipart images by content. In Proceedings of the 3rd International Conference on Image and Video Retrieval (CIVR'04), volume 3115 of Lecture Notes in Computer Science, pages 500–507. Springer-Verlag, Dublin, Ireland, July 2004.
[3] M. J. Fonseca, A. Ferreira, and J. A. Jorge. Content-based retrieval of technical drawings. International Journal of Computer Applications in Technology (IJCAT), 23(2–4):86–100, 2005.
[4] J. D. Gould and C. Lewis. Designing for usability: key principles and what designers think. Commun. ACM, 28(3):300–311, 1985.
[5] P. Sousa and M. J. Fonseca. Geometric matching for clip-art drawing retrieval. Journal of Visual Communication and Image Representation (JVCI), 20(2):71–83, February 2009.
[6] P. Sousa and M. J. Fonseca. Sketch-based retrieval of drawings using spatial proximity. Journal of Visual Languages and Computing (JVLC), 21(2):69–80, April 2010.

Design Thinking for Search User Interface Design
Arne Berger
Chemnitz University of Technology
Strasse der Nationen 62
09107 Chemnitz
Germany

arne.berger {at} informatik.tu-chemnitz.de

ABSTRACT
The paper describes, with the help of a brief example, how design methods, namely those formed in design thinking, can help search user interface design to innovate throughout the software development process.

Categories and Subject Descriptors
H.5.2 [Ergonomics, Evaluation/methodology]: Design Methods in Search User Interface Design

General Terms
Measurement, Documentation, Performance, Design, Human Factors, Experimentation

Keywords
Design Thinking, User Interface Design, Design Methods, Qualitative Studies

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011

1. INTRODUCTION
Since Tim Brown's ingenious talk on TED [1.], Design Thinking (DT) has had a huge impact on the business and design world. By injecting the way designers think into accustomed business processes, CEOs hoped to gain an advantage in competition. Designers on the other hand hoped their overall influence might increase. However, the field has more to offer than bringing creative techniques to supposedly uncreative domains. The first publications on the matter appeared as early as the late 1960s [2., 3., 4.] as a way to externalize the enigmatic design process. Since then, the creative application of design methods (DM) has proven its effectiveness, fun and relevance countless times [5., 6.].
Despite its persistent application in typical creative domains, the radical application of DM for digital age products is still a young discipline.

1.1 Design Thinking vs. Design Methods
The difference between DT, coined and developed at Stanford [7.], and DM, as defined by Jones amongst many others [3.], needs to be clarified in another publication. For now, the author (a Designer) is grateful to see the broad spectrum of DM finally being brought to attention due to the success of DT. However, there are far more methods to use than the 51 methods suggested by DT [8.] and there are far more feasible design processes than defined in DT. Because of the briefness of this paper and for the sake of a better understanding, DT is used as an expression for the design process, while DM is used as an expression for any design method from the DT or any other DM toolbox.

2. CURRENT STATE OF DESIGN METHODS IN SEARCH USER INTERFACE DESIGN
The possibilities of DM are still badly implemented into product development. However, a subset of DM, namely User Centered Design (UCD), is fairly well implemented in the domain of interface design, including that of search user interface design. UCD significantly helps evaluating user needs but often fails to innovate. UCD methods mainly consist of a relatively strict set of methods compared to what DT and DM have to offer [9.]. Those methods are capable of gaining insight and evaluating interfaces but do not encourage an innovation process for future user interfaces.
Written from the perspective of a user interface design professional working in an academic development environment that is mainly formed by information retrieval experts, the following description of a typical workflow abstracts the prototypical UCD process of developing search user interfaces.

2.1 Current Process of Search User Interface Design
1. Users' tasks and problems are observed via Site Visits or Website Analytics [10.]. Those methods help to gain insight into specific user problems. The combination of both nowadays is the holy grail of gaining insight into users' issues [10.].
2. Information retrieval experts and search user interface designers use methods like brainstorming to plan a software product. It is used mainly as a conversation starter, but also functions as a way to frame the current state of technical possibilities.
3. Users' problems (step 1.) are interpreted, and solutions are attempted with the help of the technical possibilities (step 2.), which are then implemented.
4. The usability of the search user interface proposed in 3. is evaluated via user studies comparable to the ones in step 1.
Iterations: The abovementioned steps are iteratively repeated several times. With the help of prototypes the interface is refined before a final implementation takes place. However, these steps only help to streamline the interface. They are not fully useful for innovating an interface according to DT's possibilities.
2.2 Critics of the Current Process
We believe that the process of nailing down the problem and
suggesting a vital solution after framing technical possibilities and
observing users is insufficient. Those well established methods
have the main advantage of providing hard numerical measures.
This is even more so when measures like precision and recall
are used to learn how efficient a system is. Via those standardized
measurements a comparison between different solutions is easy to
draw. Relying on those hard measures alone, however, only reveals
insights that can be formulated in numbers and concluded from them.

On the other hand, soft properties of a search user interface like
»what users really want«, »fun of use«, »suitability to unusual
tasks« and in parts »user satisfaction« are next to impossible to
measure via hard numbers. Although efforts exist [11], the
measurement of qualitative soft properties is hard to
standardize. Outcomes therefore are less clear cut and often fail
to be comparable via statistics. As the academic viewpoint in the
field tends towards analytic comparison, soft properties are seldom
explored, described and measured. Therefore subsequent findings
often fail to be implemented.
Based on the above, we propose the radical application
of DT in search user interface design via »participatory
prototypes«. This concept integrates users and developers alike.
We demonstrate its process briefly in the next chapter and explain
its application in three following examples.

3. PROPOSED DESIGN THINKING PROCESS FOR SEARCH USER INTERFACES
In the business world (see introduction) DT is foremost a process
used for innovating new products.
The DT process is defined as follows [8]:
Understand: Understand problem and context.
Observe: Externalize future users' problems via e.g. extreme user
interviews or empathy maps.
Define: Interpret and weight the knowledge gained from
the previous steps via e.g. ad-hoc personas.
Ideate: Use common or uncommon creative techniques, e.g.
body storming, to generate many ideas.
Prototype: Visualize and communicate ideas with the help of fast
and cheap prototypes with paper, Lego bricks or the product box
method.
Test: Future users test those prototypes, via e.g. story telling
techniques.
We believe that DT can and should be incorporated in any
possible stage of a development cycle. Interface design prototypes
are extraordinarily easy to manufacture and cost next to nothing.

We suggest applying the DT process more closely to the
development of search user interfaces to benefit from its many
advantages, especially to force the pace of innovation.

3.1 Prototype Categories
As the label »prototype« may be misleading, we tend to think of
anything capable of producing feedback as a prototype. To make
further understanding easier we classify prototypes as follows, in
the order of their advancement:

3.1.1 Very Low-Fi Prototype (Conceptual Model)
Generated by: user
Function: none, may not be technically feasible
Workflow: only conceptual
Visual Design: none
Medium: analog
Modality: any

Usually user generated, often not understandable without the
creator's explanations. It only describes a preliminary workflow of
operations and functions and is not necessarily technically
feasible.

3.1.2 Low-Fi Prototype (e.g. Paper Prototype)
Generated by: user, designer
Function: none, may not be technically feasible
Workflow: preliminary, mimicking operations
Visual Design: none
Medium: analog
Modality: any

Usually presented via the Wizard-Of-Oz technique, it incorporates
as many operations as possible and always fakes function.

3.1.3 Mock-Up
Generated by: designer
Function: none, may not be technically feasible
Workflow: mimicking operations closely
Visual Design: none
Medium: digital
Modality: any

Is often (and should be) visually unappealing, mimicking
operations closely, but fakes function.

3.1.4 Dummy (often referred to as Click Dummy)
Generated by: designer
Function: none, may not be technically feasible
Workflow: mimicking operations
Visual Design: existing, often visually polished
Medium: digital
Modality: any

Incorporates a polished visual design, mimicking operations, but
fakes function. May or may not incorporate the proposed
interaction paradigm. The most common implementation of the
latter is a browser based click dummy that fakes the functions of a
mobile touchscreen device.

3.1.5 High-Fi Prototype
Generated by: designer, developer
Function: incorporates some or most of the proposed functions
Workflow: mimicking operations
Visual Design: existing, often visually polished
Medium: digital
Modality: same as end product

Is similar to a Dummy but also incorporates some of the
proposed functions. It also incorporates the proposed interaction
paradigm.

3.1.6 Alpha Grade Version
Generated by: developer
Function: incorporates some or most of the proposed functions
Workflow: mostly operational
Visual Design: may or may not exist
Medium: digital
Modality: any

A prototype proposed by developers that demonstrates most basic
functions; it usually does not feature a polished design.

3.1.7 Beta Version
Generated by: developer
Function: incorporates some or most of the proposed functions
Workflow: fully operational
Visual Design: existing
Medium: digital
Modality: same as end product

A visually polished prototype, most often proposed by developers.
It is a functioning program that may have bugs or quirks and is
mainly used in order to get rid of those.

3.2 Observations for Prototypes
As this brief listing suggests, most of the prototyping work in
search user interface design is done by a designer. This helps to
maintain a conversation between what users want and what
developers can implement.

There are usually no direct prototypes from the users. Users'
comments or observations are interpreted multiple times. First
they are made operable via prototypes, crafted by designers,
which subsequently are interpreted by the developers.

Prototypes from the perspective of a developer are used only for
evaluation towards the end of the implementation cycle. As a lot of
code and effort went into these, heavy changes are avoided and,
ideally, made unnecessary by earlier prototypes.

While the main goal of DT is to encourage interdisciplinary user
groups to create innovative prototypes, it does not focus on direct
prototypes from users or developers.

3.3 Implications for Process
We want to continuously integrate user prototypes into the
development and we also encourage a process where developers
explain technical feasibility via prototypes even in very draft and
early stages.
This realization came through practical usage of various DM in a
couple of projects. The following chapter briefly describes how
we introduced participatory prototypes to search user interface
design for the creation of playlists for mobile video consumption.
Two other successful projects include Design Thinking for a
customized faceted navigation and Design Thinking for a
multitouch interface for searching in large multimedia
repositories.

4. DESIGN THINKING THE CREATION OF PLAYLISTS FOR MOBILE VIDEO CONSUMPTION
We wanted to address a problem known to many smartphone users
on the move. We understand that, whether commuting or going
out with friends, users usually avoid constructing complex search
queries to find suitable content to watch.
To define the problem, we asked users what they miss and want
from a mobile TV application. Two main points emerged:
With services like YouTube, consumers are left having to refine a
basic search query several times or to use non-customized item lists
such as »most viewed«. On the other hand, in traditional TV a
moderator weaves a golden thread and guides viewers via this
potentially emotional connection through a series of video clips.
After an ideate session the most promising prototype was a mixed
breed of playlists, woven together by emotional metadata. To gain
insight into users' mindsets regarding the construction of those
personalized playlists we applied various DM.
To find out which emotional content attributes users are looking
for, we asked participants to map out a virtual space of content
properties and show how they thought to navigate within it. This
method usually helps to discover pathways and interests in which
people make sense of a particular content space. The results
eventually help to make sense of how to construct queries for
filter specification.

Users were asked to individually draw a map or diagram of what
comes to their mind when being on the move and having a mobile
video handset available, whether sitting on public transportation
alone or being in a pub with friends. The six users had 15 minutes
to draw a map or scheme and were asked to freely associate
parameters to form a personalized playlist. Given the mindset of
being on the move, users formed questions from a simple
vocabulary and subsequently wanted to change only certain
parameters after watching a few video items. A discussion with all
participants followed.

The results lead to the assumption that users are interested in
direct mood filters. Most of the user generated maps feature mood
clusters or the simple question »how« in a list of questions.

Based on those findings the developers of the future interface, with
the help of a designer, proposed a low fidelity prototype containing
a filter named »How« together with more filters based on the four
cardinal questions Who, Where, When, What. This was done
because all those metadata fields could be filled with metadata
readily available in the existing database. To prove the concept it
was introduced to twelve users. Users' feedback on this approach
was insightful in two ways. On one hand, users at large expressed
their general approval of the advantages that might arise from
constructing exhaustive content filters with just a few steps of
interaction. On the other hand, the pre-structured characteristic
was heavily criticized. However, the rigidly defined prototype
inspired participants to incredibly rich feedback. This proposal in
combination with open ended questions has proved to be a fast
and convenient way to gain user feedback on a large variety of
issues without a lot of explanation. The main insight is that all
users found and used the filter option »how«. Most user feedback
was given on this feature alone. Findings are discussed in depth in
[12].

TV Anytime [13] is a metadata standard that defines metadata for
broadcasts. It is commonly used for describing video items and also
features 53 moods. For the sake of technical interoperability we
wanted to stay within the realm of this particular metadata
standard but also wanted to make the proposed moods more
accessible for users. Based on those technical restrictions and the
previous results we individually asked 45 potential users to sort
the moods into self-defined categories that made sense to them.

At least two completely different ways of sorting prevailed. One
group of users preferred an order that resembles a classification
into movie genres, while a second group was interested in sorting
them according to emotional dependencies. While a number of 45
users was large enough to reveal two groups, users assigned
to the first group were too few to manifest significance. Focusing
on the larger group (35 participants), seven mood categories were
filled unanimously. Apart from very few moods, all moods
were consistently joined into groups. This could make the previously
discussed low fidelity prototype more flexible in navigating
complete mood sets. Based on those findings, users proposed an
interface that asks questions in an order that is more determined
by them. A subsequent High-Fi prototype was built, incorporating
1000 video items. It allows the selection of a variety of moods as
well as a combination of filters derived from the five cardinal
questions. A formal user study is now underway.

5. Acknowledgements
This publication was prepared as a part of the research initiative
sachsMedia (http://sachsmedia.tv), which is funded by the
German Federal Ministry of Education and Research under the
grant reference number 03IP608. The authors take sole
responsibility for the contents of this publication.

References
[1] http://www.ted.com/talks/tim_brown_urges_designers_to_think_big.html
(accessed Apr 29, 2011)
[2] Archer. Design as a discipline. Design Studies (1979) vol. 1
(1) pp. 17-20
[3] Jones. Design Methods. John Wiley and Sons (1992)
[4] Newell et al. The processes of creative thinking. (1959)
[5] Lawson. How designers think: the design process
demystified. Elsevier (2006)
[6] Schön. The reflective practitioner: how professionals think in
action. Basic Books (1983)
[7] Kelley. The Art of Innovation: Lessons in Creativity from
IDEO. Crown Business (2001)
[8] Plattner. Design Thinking. Mi Wirtschaftsbuch (2009)
[9] Cooper. About Face 3. Wiley and Sons (2007)
[10] Hearst. Search User Interfaces. Cambridge University Press
(2009)
[11] Hassenzahl et al. AttrakDiff: Ein Fragebogen zur Messung
wahrgenommener hedonischer und pragmatischer Qualität.
In: Proceedings of Mensch & Computer (2003)
[12] Knauf, Berger, et al. Constraints and simplification for a
better mobile video annotation and content customization
process. In: Workshop Proceedings of the EuroITV (2010)
[13] TV-Anytime Phase 1: Metadata schemas
http://www.etsi.org/deliver/etsi_ts/102800_102899/1028220301/01.02.01_60/ts_1028220301v010201p.pdf
(accessed Oct 10, 2010)
The Development and Application of an
Evaluation Methodology for Person Search Engines
Roland Brenneke Thomas Mandl Christa Womser-Hacker
Information Science Information Science Information Science
University of Hildesheim University of Hildesheim University of Hildesheim
Marienburger Platz 22 Marienburger Platz 22 Marienburger Platz 22
Germany Germany Germany
roland.brenneke@gmx.de mandl@uni-hildesheim.de womser@uni-hildesheim.de

EuroHCIR 2011. The 1st European Workshop on Human-Computer
Interaction and Information Retrieval. July 4th 2011. Newcastle, UK

ABSTRACT
This paper presents a user oriented evaluation methodology for
comparing person search services on the Web. Many established
system oriented methods from information retrieval cannot be
applied to this domain. Our user oriented methodology is applied
to a test comparing the person search engines yasni, pipl.com and
123people. The user study with over 30 participants led to
relevant results. The coverage of data object types within the
person search engine results is quite different. Especially the
amount of pictures and social media network entries which are
presented by the systems and which are perceived by the test users
differs greatly. The results also revealed a tendency to judge people
more positively when more information was found.

1. INTRODUCTION
Person search engines are important specialized search services on
the Web. These systems consult other services for information
about a person and integrate it in one interface. They can be
regarded as meta search services or one-stop services for personal
information. Mostly, they are tailored for normal people and not
for celebrities and other famous people. As such, person search is
different from named entity search in general.
Especially with the Web 2.0 and its ease of publishing content on
the Web, many people deposit much information about themselves or
content they created on various sites. Users need to have the
proper information competence to foresee the consequences of
such behavior. Often, users are advised not to publish too much
information. Online reputation management becomes an
important issue. On the side of the users, social networks and
person search services lead to information ethical considerations
about the use of personal information.

Searching for information about others is a very frequent
information need and a reason for using a search service.
According to Google Trends, the most popular person search
services receive over 200,000 hits per day. However, 90% of the
users do not rely on person search engines but use general
Web search or go directly to social networks to find out about
people. Nevertheless, 10% is still a significant share and hit rates
for person search engines are constantly high. In addition, many
of these searches may have a high impact. Many recruiters use
person search engines for checking on candidates.

A questionnaire study among 548 enterprises was published in
2010 [5]. This Social Media HR Report 2010 revealed that in
2009 over 59% of the companies had used the internet to check
on applicants. Almost 10% had already turned down an
application because of information on the Web. Companies who
do not use the Web for checking on applicants state that lack of
time and ethical questions are the main reasons not to do so [5].
An international study showed that this behaviour is more
widespread in the US than in European countries [3]. Interviews
with decision makers in German companies revealed that they are
well aware of the potential of retrieving applicant information
[11].

The use of person search engines for job applicants is only one
potential usage scenario; however, it is a very prominent one.
Other than that, there are many reasons why a user would want
to search for a person. And despite the use of a named entity in
the search, the information need is rather vague and can be
rephrased as “Find out something about person X”.
The success of a person search engine depends on many factors.
Person search engines are meta services which extract results from
a large variety of different online media. The presentation of these
results in the user interface is an essential factor for the success of
the search service. If a result is far down on the result page and
the user never scrolls there, potentially relevant items cannot be
found. That means that the search capability is only one success
factor for person search engines. Consequently, our experiment
was designed as a user test. We intended to evaluate the user
experience and the success with person search engines as a tool,
rather than specific system components or absolute retrieval
performance.

2. RELATED WORK
The evaluation of retrieval systems is central in information
retrieval research because the system performance cannot be
predicted. The most influential retrieval evaluation methodology
is called the Cranfield paradigm. Information retrieval research
has adopted an evaluation scheme which tries to ignore subjective
differences between users in order to be able to compare systems
and algorithms. The user is replaced by a prototypical and
constant user. Relevance judgments are provided by domain
experts [8, 10].
Cranfield evaluations have often been criticised for several
reasons. The main objections come from advocates of user
oriented studies. The search situation of users depends on many
individual and contextual factors which can only be captured in
user experiments [6]. The real user experience and the success in
a real world situation cannot be measured with the laboratory style
experiments based on the Cranfield paradigm [12].
Person search engines have a higher chance to succeed than
general purpose search services. Retrieval with named entities
is known to be easier than searches without named entities [9].
The selection of a person search engine hints at the type of result.
Consequently, synonymy between names and words is a smaller
problem than in general purpose search engines. Synonymy
between names, on the other hand, is a big challenge for person
search engines.

3. METHODOLOGY
The balance between control and realism is a challenge for each
experiment. For the presented study, we chose a user experiment
to test person search engines because an approach purely
dedicated to retrieval power does not mirror the user experience
for person search engines well. It is necessary to limit the realism
in a user experiment in order to allow comparison across
participants in the test. We selected a job applicant scenario in
order to make the experiment interesting for the users. Applicant
search is a very prominent usage type. The method was successful
in making the experiment attractive. The test users liked the
experiment very much and, through word of mouth, more
people wanted to register for the experiment than were needed.
The selection of persons for the task defines the content for the
test. It seemed necessary to identify people for whom much
information can be found on the Web. If there were no videos,
working results like presentations or social network entries, then
the performance of the person search engine could not be tested
with our experiment. So even if the persons selected are not
representative in terms of amount of online information for the
whole population or all persons who are indexed in a person
search service, it increases the validity of the test to select persons
with a large amount of online information.

Three people were carefully selected who had similar
qualifications. For them, a job profile was developed which was
given to the participants together with the names of the people.
The users were asked to search for these people, who would be
interviewed for the position, and check if they were appropriate.
Each of the candidates was well qualified for the
job but had one negative aspect in his online data. One was an
advocate of nuclear power and the job was offered by an
alternative energy company. The second applicant was a serial
entrepreneur who portrayed himself on Facebook in pictures with
attractive women and sports cars. The third applicant had party
photos online where he could be seen smoking cigarettes and he
described himself as lazy in one social network while he had a
very business oriented self image in another social network.

Obviously, such a scenario has some limitations. Person search
engines need to disambiguate between people with the same
name. We decided to choose people who are not ambiguous in
order to have the same difficulty for each person. Such issues are
evaluated in the system oriented campaign WEPS [1].
We selected people who had posted a large amount of information
about themselves in the network. Again, this was done to obtain
similar and comparable difficulty for the three test cases. Three
person search engines were selected for the comparative test. We
chose yasni, pipl.com and 123people because they were very
popular at the time of the study according to Google Trends. All
three companies claim that they exploit only information available
on the public Web.

4. STUDY
Students of the University of Hildesheim were recruited through a
mailing list of students. Participation was voluntary and no
compensation was given. None of the participants had a computer
science background. They all were frequent Internet users and had
searched for people before but only 10% had used a person search
engine before. The others use Google or social networks to find
information on people.
The issue of relevance is always a crucial one in information
retrieval evaluation. In our study, any item could contribute to the
full picture of the applicant. Despite the clearly defined scenario,
it remains vague which information is needed and what type of
information is useful. It is difficult to assign relevance to items or
even weights to categories. The user interfaces of the person
search engines present the items in categories such as social
network entries or videos.
A questionnaire study [7] showed that users search mainly for the
following items, in the order presented, when retrieving
information about a specific person:
• Contact information
• Profile on a social network
• Photo
• Information about professional accomplishments or
interests

The most frequently researched item, contact information, does not
apply to our scenario because the persons had sent a letter of
application. The next two most frequent items are included. The
fourth item is rather vague, as are some of the subsequent items, as
far as the categories of person search engines are concerned. As a
consequence, the data available does not justify the assignment of
weights to some items. In our study, all clicks on items were
scored equally. The results will also show which of the items were
most popular. The time per applicant was limited to 10 minutes.
The entire experiment took 45 minutes on average including the
pre- and post-questionnaire.

One search service modified the interface after the first two tests.
So it was necessary to eliminate three test sessions from the
results and recruit further test users. This shows that not only the
dynamics of the personal data present a challenge for the test but
also the ongoing modifications of the search engine. Overall, 34
users took part in the experiment. Due to the problems of a relaunch
of one service, we could consider the experiments of 10 users of
123people, 11 users of Pipl and 10 users of Yasni.

Each test person worked with one search engine on all three
applicants. This between-groups approach was mainly
applied to avoid a long learning phase for each of the person
search engines. All tests were recorded with appropriate software.
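
The between-groups design and the participant accounting described above can be sketched as follows. This is only an illustration: the round-robin assignment, the participant identifiers and the function names are assumptions and not the procedure actually used in the study; only the engine names, the three applicants and the session counts are taken from the text.

```python
from collections import Counter
from itertools import cycle

ENGINES = ["123people", "Pipl", "Yasni"]
APPLICANTS = ["applicant 1", "applicant 2", "applicant 3"]


def assign_between_groups(participants, engines=ENGINES):
    """Assign each participant to exactly one engine (hypothetical round-robin)."""
    return {p: e for p, e in zip(participants, cycle(engines))}


def sessions(assignment, applicants=APPLICANTS):
    """Every participant searches for all applicants with their single engine."""
    return [(p, e, a) for p, e in assignment.items() for a in applicants]


# Counts reported in the text: 34 participants, three sessions eliminated,
# leaving 10 + 11 + 10 usable sessions.
recruited, eliminated = 34, 3
usable = {"123people": 10, "Pipl": 11, "Yasni": 10}
assert recruited - eliminated == sum(usable.values())

# Small made-up example with six participant identifiers.
assignment = assign_between_groups([f"P{i:02d}" for i in range(1, 7)])
print(Counter(assignment.values()))
print(len(sessions(assignment)))
```

Because every participant sees only one engine, differences between engines are compared across groups of people rather than within one person, which is why the groups should be of similar size.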
Figure 1: Popularity of person search engines according to Google Trends

5. RESULTS
The result description focuses on the information perceived by
users and the performance of the test users in the application task.
The information items clicked by the users were categorized. It
can be seen that the services lead to a similar number of clicks
when summed up over all users. Each of the services resulted in
between 110 and 120 clicks for the ten test persons. In the case of
Pipl, 11 test persons were considered. Each engine leads to a
sufficient number of entries and has abundant information on the
applicants in our scenario. This was a goal of the test design and
was accomplished.
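
As an illustration of the scoring used here, where every click counts equally and clicks are grouped by item category, a tally per engine could be computed along the following lines. The click log below is invented purely for demonstration; the record layout, the function names and the values are assumptions and do not reproduce the study's data.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (engine, participant, item category).
CLICKS = [
    ("123people", "P01", "Photo"),
    ("123people", "P01", "Homepage/Blog"),
    ("123people", "P02", "Photo"),
    ("Pipl", "P03", "Social network"),
    ("Pipl", "P03", "Social network"),
    ("Pipl", "P04", "Photo"),
    ("Yasni", "P05", "Business network"),
    ("Yasni", "P06", "Business network"),
]


def tally_clicks(clicks):
    """Count clicks per engine and category; every click has weight 1."""
    per_engine = defaultdict(Counter)
    for engine, _participant, category in clicks:
        per_engine[engine][category] += 1
    return per_engine


def clicks_per_participant(clicks):
    """Average number of clicks per participant for each engine."""
    totals = Counter(engine for engine, _p, _c in clicks)
    users = defaultdict(set)
    for engine, participant, _c in clicks:
        users[engine].add(participant)
    return {engine: totals[engine] / len(users[engine]) for engine in totals}


tally = tally_clicks(CLICKS)
for engine, counts in tally.items():
    print(engine, counts.most_common())
print(clicks_per_participant(CLICKS))
```

Dividing by the number of participants is a design choice that keeps the engines comparable when group sizes differ, as they do here with 11 test persons for Pipl and 10 for the other two services.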
The type of information which was encountered was quite
different. It can be easily seen that 123people facilitates access to
photos whereas Pipl leads more users to social network entries. A
comparative analysis of the services for the most popular item
types is shown in Table 1.
In the post test questionnaire, users were asked about their
subjective impression of the service they had used. In the overall
satisfaction, 123people was rated highest. For the page structure,
Pipl received the best grades, and for the coverage of different
business networks Yasni was rated as most successful. In the latter
case, the finding from the objective click data was confirmed.
Further details on the results are provided in [2].

Figure 2: Clicks on items in the three person search engines

Table 1: Comparison of data types encountered

Item               123people   Pipl   Yasni
Photo              ++          +−     −−
Business network   −           −      ++
Social network     −           ++     +
Homepage/Blog      +           +      +−
Microblog          +           +−     +
Yellow pages       +−          −−     +
Forum post         −           +−     +
Videoclip          +           +−     +−
Publication        (no rating)
Presentation       (no rating)
Email address      (no rating)
Address            (no rating)
Phone number       (no rating)

Perception: ++ Excellent, + Good, +− Moderate, − Poor, −− Unperceived.
Where no rating is given, the number of clicks was too low for a
rating to be possible.
For two services, applicant 1 was selected by the majority of the
test users. These two services had identified most items for this
applicant. For yasni, applicant 2 was chosen as the best
applicant despite the fact that the other two services found on
average 10 items more for this person. Applicant 3 was given
the last place for all three person search services. For each
service, he is the applicant with the fewest items. There might be
a trend to rate people higher when more information is available
online.

6. RESUME
We presented a holistic evaluation methodology for person
search engines. The performance of these search services is
measured by observing the perception of test users. The test
methodology is built on a realistic scenario and use case but it
does not cover all the relevant quality aspects of person search
engines. The important capability to resolve the ambiguity of
names was not dealt with. In future work, it might be promising
to develop a performance based test for this task only.

The complete information seeking behaviour and its success are
also not measured with our test. In a realistic scenario, people
might access the social media networks through a person search
engine and continue their search mainly there. This issue could
be resolved by observing real behaviour.

In the test, the search engine 123people was the winner. It not
only led users to the highest number of items, but it was also
subjectively judged to be the best person search engine.
However, in several aspects other systems performed better and
were judged better. The evaluation showed that the different
tools are all based on the freely available data on the Web but
that they lead to different results. The most sought items in our
test were photos, entries and profiles in social and business
networks and personal homepages. Each of the engines
exhibited a strength in one of these items, e.g. 123people for
photos because they are shown as top results. This is also
confirmed by the questionnaire study among American
recruiters [7].
For users who publish information about themselves, and
who become information providers by doing so, the issue of
information competence will become more and more important.
Personal Online Identity Management is a growing field and
several new companies are entering the market.

7. REFERENCES
[1] Artiles, J.; Borthwick, A.; Gonzalo, J.; Sekine, S.; Amigó,
E. 2010. WePS-3 Evaluation Campaign: Overview of the
Web People Search Clustering and Attribute Extraction
Tasks. In: CLEF Working Notes.
http://nlp.uned.es/weps/weps-3/papers
[2] Brenneke, R. 2010. Evaluation von Personensuchmaschinen
und Umgang mit persönlichen Daten im Internet. Master
Thesis, University of Hildesheim, Germany. International
Information Management.
[3] CrossTab Marketing Services. 2010. Europäischer
Datenschutztag: Studie zur Online Reputation.
Trustworthy Computing Group, Microsoft (ed.).
http://www.microsoft.com/germany/sicherheit/datenschutzstudie.mspx
[4] Hellmann, R.; Griesbaum, J.; Mandl, T. 2010. Quality in
Blogs: How to find the best User Generated Content. In:
13th Intl Conf on Business Information Systems (BIS 2010)
Berlin, 3.-5. May. Berlin et al.: Springer [LNBIP 47] pp.
47-58.
[5] Zur Jacobsmühlen, T. 2010. Social Media HR Report
2010. Stepstone.de & HRM.de (eds.).
http://www.jacobsmuehlen.de/studie/
[6] Lamm, K.; Greve, W.; Mandl, T.; Womser-Hacker, C.
2010. The Influence of Expectation and System
Performance on User Satisfaction with Retrieval Systems.
In: Proc EVIA 2010: The First Intl Workshop on
Evaluating Information Access. National
Institute of Informatics (NII) Tokyo, Japan, June 15-18.
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings8/EVIA/09-EVIA2010-LammK.pdf
[7] Madden, M.; Smith, A. 2010. Reputation Management and
Social Media: How people monitor their identity and
search for others online. PEW Internet & American Life
Project.
http://pewinternet.org/Reports/2010/Reputation-Management.aspx
[8] Mandl, T. 2008. Recent Developments in the Evaluation of
Information Retrieval Systems: Moving Toward Diversity
and Practical Applications. In: Informatica – An Intl.
Journal of Computing and Informatics vol. 32. pp. 27-38.
[9] Mandl, T.; Womser-Hacker, C. 2005. The Effect of Named
Entities on Effectiveness in Cross-Language Information
Retrieval Evaluation. In: Proc 2005 ACM SAC Symposium
on Applied Computing (SAC). Santa Fe, New Mexico,
USA. March 13.-17. 2005. pp. 1059-1064.
[10] Robertson, S. 2008. On the history of evaluation in IR. In:
Journal of Information Science 34(4). pp. 439-456.
[11] Schäuble, T.; Griesbaum, J.; Mandl, T. 2009.
Mehrwertpotenziale von Online-Social-Business-Netzwerken für
die Personalbeschaffung von Fach- und Führungskräften.
In: Informatik 2009 - Beiträge 39. Jahrestagung der
Gesellschaft für Informatik e.V. (GI) Lübeck [LNI P-154]
pp. 2166 – 2180.
[12] Tawileh, W.; Mandl, T.; Griesbaum, J. 2010. Evaluation of
five web search engines in Arabic language. In: LWA –
Lernen - Wissensentdeckung – Adaptivität: Proc
Workshopwoche GI, Universität Kassel. Workshop Information
Retrieval.
http://www.kde.cs.uni-kassel.de/conf/lwa10/papers/ir1.pdf