Exploratory Search in an Audio-Visual Archive:
                 Evaluating a Professional Search Tool for
                          Non-Professional Users

                                             Marc Bron                             Jasmijn van Gorp
                                 ISLA, University of Amsterdam                    TViT, Utrecht University
                                       m.m.bron@uva.nl                              j.vangorp@uu.nl

                                             Frank Nack                             Maarten de Rijke
                                 ISLA, University of Amsterdam              ISLA, University of Amsterdam
                                           nack@uva.nl                               derijke@uva.nl


ABSTRACT
As archives are opening up and publishing their content online, the general public can now directly access archive collections. To support access, archives typically provide the public with their internal search tools that were originally intended for professional archivists. We conduct a small-scale user study where non-professionals perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio-visual archive. We evaluate the tool using objective and subjective measures and find that non-professionals find the search interface difficult to use in terms of both. Analysis of search behavior shows that non-professionals who often visit the description page of individual items in a result list are more successful on search tasks than those who visit fewer pages. A more direct presentation of entities present in the metadata fields of items in a result list can be beneficial for non-professional users on exploratory search tasks.

Categories and Subject Descriptors
H.5.2 [User interfaces]: Evaluation/methodology

General Terms
Measurement, Performance, Design, Experimentation

Keywords
Exploratory search, Usability evaluation

Copyright © 2011 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

1.    INTRODUCTION
   Traditionally, archives have been the domain of archivists and librarians, who retrieve relevant items for a user’s request through their knowledge of the content in, and organization of, the archive. Increasingly, archives are opening up and publishing their content online, making their collections directly accessible for the general public. There are two major problems that these non-professional users face. First, most users are unfamiliar or only partially familiar with the archive content and its representation in the repository. The internal representation is designed from the expert point of view, i.e., the type of information included in the metadata, which does not necessarily match the expectation of the general public. This leads to an increase in exploratory types of search [5], as users are unable to translate their information need into terms that correspond with the representation of the content in the archive. The second problem is that archives provide users with professional search tools to search through their collections. Such tools were originally developed to support professional users in searching through the metadata descriptions in a collection. Given their knowledge of the collection, professionals primarily exhibit directed search behavior [3], but it is unclear to what extent professional search tools support non-professional users in exploratory search.
   Most work on improving exploratory search focuses on professionals [1]. In this paper we present a small-scale user study where non-professional users perform exploratory search tasks in an audio-visual archive using a search tool originally developed for media professionals and archivists. We investigate the following hypotheses: (i) a search interface designed for professional users does not provide satisfactory support for non-professional users on exploratory search tasks; and (ii) users with high performance on exploratory search tasks have different search behavior than users with lower performance.
   In order to investigate the first hypothesis we evaluate the search tool performance objectively in terms of the number of correct answers found for the search tasks and subjectively through a usability questionnaire. To test the second hypothesis, we perform an analysis of the click data logged during search.

2.      EXPERIMENTAL DESIGN

The environment. The setting for our experiment was the Netherlands Institute for Sound and Vision (S&V), the Dutch national audiovisual broadcast archive. In the experiment we used the archive’s collection consisting of around 1.5 M (television) programs with metadata descriptions provided by professional annotators.
   We also used the search interface of S&V (http://zoeken.beeldengeluid.nl). The interface is available in a simple and an advanced version. The simple version is similar to search engines known from the web. It has a single search box and submitting a query results in a ranked list of 10 programs. When a user clicks on one of the programs, the interface shows a page with the complete metadata description of the program. Table 1 shows the metadata fields available for a program. Instead of
the usual snippets presented with each item in a result list, the interface shows the title, date, owner and keywords for each item on the result page. Only the keyword and title fields provide information about the actual content of the program while the other fields provide information primarily used for the organization of programs in the archive collection. The description and summary fields contain the most information about the content of programs but are only available by visiting the program description page.
   We used the advanced version of the interface in the experiment, which next to the search box offers two other components: search boxes operating on specific fields and filters for certain categories of terms. Fielded searches operate on specific fields in the program metadata. The filters become available after a list of programs has been returned in response to a query. The filters display the top five most frequent terms in the returned documents for a metadata field. The metadata fields displayed in the filter component of the interface are highlighted in bold in Table 1. Once a checkbox next to one of the terms has been ticked, programs not containing that term in that field are removed from the result list.

Table 1: All metadata fields available for programs. We differentiate between fields that describe program content and fields that do not. Bold indicates fields used by the filter component.

content descriptors                              organizational descriptors
field          explanation                       field     explanation
description    program highlights                medium    storage medium
person         people in program                 genre     gameshow; news
keyword        terms provided by annotator       rights    parties allowed to broadcast
summary        summary of the program format     owner     owner of the broadcast rights
organization   organization in program           date      broadcast date
location       locations in program              origin    program origin
title          program title

Subjects. In total, 22 first-year university students from media studies participated in the experiment. The students (16 female, 6 male) were between 19 and 22 years of age. As a reward for participation the students gained free entrance to the museum of the archive.

Experiment setup. Five studios were available at S&V, and either one or two subjects at a time performed the experiment in a single studio. When two subjects were present, they worked on machines facing opposite sides of the studio. We instructed subjects not to communicate during the experiment. During the experiment one instructor was always present in a studio. Before starting, the subjects learned the goals of the experiment, got a short tutorial on the search interface and performed a test query. During this phase the subjects were allowed to ask questions.
   In the experiment each subject had to complete three search tasks in 45 minutes. If after 15 minutes a task was not finished, the instructor asked the subject to move on to the next task. Search tasks are related to matters that could potentially occur within courses of the student’s curriculum. Each search task required the subjects to find five answers before moving on to the next task. A correct answer was a page with the complete metadata description of a program that fulfilled the information need expressed by the search task. Subjects could indicate that a page was an answer through a submit button added to the interface for the experiment.
   We used the following three search tasks in the experiment: (i) For the course “media and ethnicity” you need to investigate the role of ethnicity in television-comedy. Find five programs with different comedians with a non-western background. (ii) For the course “television geography” you need to investigate the representation of places in drama series. Find five drama series where location plays an important role. (iii) For the course “media and gender” you need to give a presentation about the television career of five different female hosts of game shows broadcast during the 1950s, 1960s or 1970s. Find five programs that you can use in your presentation.
   Subjects received the search tasks in random order to avoid any bias. Also, subjects were encouraged to perform the search by any means that suited them best. During the experiment we logged all search actions, e.g., clicks, performed by each subject. After a subject had finished all three search tasks, he or she was asked to fill out a questionnaire about the experiences with the search interface.

Methodology for evaluation and analysis. We performed two types of evaluation of the search interface: a usability questionnaire and the number of correct answers submitted for the search tasks. The questionnaire consists of three sets of questions. The first set involves aspects of the experienced search behavior with the interface. The second set contains questions about how useful users find the filter component, fielded search component, and metadata fields presented in the interface. The third set asks subjects to indicate the usefulness of a series of term clouds. The primary goal is not to evaluate the term clouds or their visualization but to find preferences for information from certain metadata fields. We generated a term cloud for a specific field as follows. First, we retrieved the top 1000 program descriptions for the query “comedian.” We counted the terms for a field for each of the documents. The cloud then represented a graphical display of the top 50 most frequent terms in the fields of those documents, where the size of a term was relative to its frequency, i.e., the higher the frequency the bigger the term. In the questionnaire subjects indicate agreement on a 5-point Likert scale ranging from one (not at all) to five (extremely). The second type of evaluation was based on the evaluation methodology applied at TREC [2]. We pooled the results of all subjects and let two assessors make judgements about the relevance of the submitted answers to a search task. An answer is only considered relevant if both assessors agree. Performance is measured in terms of the number of correct answers (#correct) submitted to the system.
   For the analysis of the search behavior of subjects we looked at (i) the number of times a search query is submitted using any combination of components (#queries); (ii) the number of times a program description page is visited (#pages); and (iii) the number of times a specific component is used, i.e., the general searchbox, filters and fields. A large value for #queries indicates a lookup-type search behavior. It is characterized by a pattern of submitting a query, checking if the answer can be found in the result list and, if it is not, formulating a new query. The new query is not necessarily based on information gained from the retrieved results but rather inspired by the subject’s personal knowledge [4]. A large value for #pages indicates a learning-style search behavior. In this search strategy a subject visits the program description of each search result to get a better understanding of the organization and content of the archive. New queries are then also based on information gained from the previous text analysis [4]. We check the usage frequency of specific components to see if performance differences between subjects are due to alternative uses of interface components.

3.    RESULTS

Search interface evaluation. Figure 1 shows the distribution of the number of correct answers submitted for a search task, together with the distribution of the number of answers (correct or incorrect) submitted. Out of the possible total of 330 answers, 173 are actually submitted. Subjects submit the maximum number of five
answers for 18 of the tasks. This suggests that subjects have difficulties in finding answers within the given time limit. Subjects find no correct answers for 31 of the tasks, five subjects find no correct answer for any of the tasks, and none of the subjects reaches the maximum of five correct answers for a task. In total 64 out of 173 answers are correct. This low precision indicates that subjects find it difficult to judge if an answer is correct based on the metadata provided by the program description. Table 2 shows questions about the satisfaction of subjects with the interface. Subjects indicate their level of agreement from one (not at all) to five (extremely). For all questions the majority of subjects find the amount of support offered by the interface on the exploratory search tasks marginal. This finding supports our first hypothesis that the search interface intended for professional users does not provide satisfactory support to non-professional users on exploratory search tasks.

[Figure 1: Distribution of the number of correct/submitted answers. Vertical axis: #tasks; legend: #correct, #submitted.]

Table 2: Questionnaire results about the satisfaction of subjects with the search interface. Agreement is indicated on a 5-point Likert scale ranging from one (not at all) to five (extremely).

question                                                           mode   avg
To what degree are you satisfied with the search experience
offered by the interface?                                            2    2.3
To what degree did the interface support you by suggesting
new search terms?                                                    2    2.4
To what degree are you satisfied with the suggestions for new
search terms by the interface?                                       2    2.3

Search behavior analysis. Although all subjects are non-experts with respect to search with this particular interface, some perform better than others. We investigate whether there is a difference in the search behavior of subjects that have high performance on the search tasks and subjects that have lower performance. We divide subjects into two groups depending on the average number of correct answers found aggregated over the three tasks, i.e., 2.9 out of the possible maximum of 15. The group with higher performance (group G) consists of 11 subjects with 3 or more correct answers, whereas the group with lower performance (group B) consists of 11 subjects with 2 or fewer correct answers.
   Table 3 shows the averages of the search behavior indicators for each of the two groups. We first look at the usage frequency of the filter, field, and search box components by subjects in group G vs. group B. There is no significant difference between the groups, indicating that there is no direct correlation between performance on the search tasks and use of specific search components. Next we look at search behavior as an explanation for the difference in performance between the groups. Our indicator for lookup searches, i.e., #queries, shows a small difference in the number of submitted queries. That subjects in both groups submit a comparable number of queries suggests that the difference in performance is not due to one group doing more lookups than the other. The indicator for learning type search, i.e., #pages, shows that there is a significant difference in the number of program description pages visited between subjects of the two groups, i.e., subjects in group G tend to visit program description pages more often than subjects of group B. We also find that the average time subjects in group G spend on a program description page is 27 seconds, while subjects from group B spend on average 39 seconds. These observations support our hypothesis that there are differences in search behavior between subjects that have high performance on exploratory search tasks and subjects with lower performance.

Table 3: Analysis of search behavior of subjects. Significance is tested using a standard two-tailed t-test. The symbol ▲ indicates a significant increase at the α < 0.01 significance level.

          filter   field   searchbox   #queries   #pages
B avg      21.3    29.5      44.8        35.2      21.2
G avg      15.2    44.0      42.0        34.3      35.7▲

Usefulness of program descriptions. One explanation for this difference in performance is that through their search behavior subjects from group G learn more about the content and organization of the archive and are able to assimilate this information faster from the program descriptions than subjects from group B. As subjects process more program descriptions they learn more about the available programs and terminology in the domain. This results in a richer set of potential search terms to formulate their information need. To investigate whether subjects found information in the program descriptions useful in suggesting new search terms, we analyse the second set of questions from the questionnaire. The top half of Table 4 shows subjects’ responses to questions about the usefulness of metadata fields present on the search result page. Considering responses from all subjects the genre and keyword fields are found most useful and the title and date fields as well, although to a lesser degree. The fields intended for professionals, i.e., origin, owner, rights, and medium are found not useful by the majority of subjects. Between groups B and G there are no significant differences in subjects’ judgement of the usefulness of the fields.

Table 4: Questions about the usefulness of metadata fields on program description pages and the mode and average (avg) of the subjects’ responses: for all subjects, the good (G) and bad (B) performing group. We use a Wilcoxon signed rank test for the ordinal scale. The symbol △ (▲) indicates a significant increase at the α < 0.05 (0.01) level.

                   all       B               G
field              mode      mode    avg     mode   avg

Degree to which fields on the result page were useful in suggesting new terms:
date               3         2       2.2     3      3.0
owner              1         1       1.6     1      2.0
rights             1         1       1.3     1      1.4
genre              4         1       2.8     4      3.9
keyword            4         1,5     3.1     4      3.5
origin             1         1,2     1.7     1      2.0
title              3,4       2       2.2     4      3.0
medium             1         1       1.5     1,2    1.6

Degree to which fields in program descriptions were useful in suggesting new terms:
summary            4         1,4     2.8     5      3.8
description        4         4       3.3     4△     4.1
person             4         1,3,4   2.8     4▲     3.8
location           1,3,4     1,3     2.0     4▲     3.0
organization       1         1       1.8     1,2    2.0
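Table 4 reports both a mode and an average per field, and several cells list more than one mode (for example 1,3,4), which happens when ratings tie for most frequent. As a minimal sketch of how such a summary could be computed, and not the analysis code used in the study, the Python fragment below uses the standard statistics module. The function name summarize_likert and the ratings are hypothetical and chosen only for illustration; they are not the subjects’ responses.

```python
from statistics import mean, multimode


def summarize_likert(responses):
    """Return (sorted list of modes, average rounded to one decimal)
    for a list of ratings on a 1-5 Likert scale."""
    if not responses:
        raise ValueError("at least one rating is required")
    if any(r not in range(1, 6) for r in responses):
        raise ValueError("ratings must be integers from 1 to 5")
    # multimode returns every value that ties for the highest count
    modes = sorted(multimode(responses))
    return modes, round(mean(responses), 1)


# Hypothetical ratings from an 11-subject group for a single field
ratings = [1, 1, 1, 3, 3, 3, 4, 4, 4, 2, 5]
modes, avg = summarize_likert(ratings)
print(",".join(str(m) for m in modes), avg)
```

A multi-modal cell signals that opinions in a group were split, so the average alone can hide disagreement that the list of modes makes visible.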
   The bottom part of Table 4 shows subjects’ responses to questions about the usefulness of metadata fields only present on the program description page and not already shown on the search result page. Based on all responses, the summary, description, person and location metadata fields are considered most useful by the majority of the subjects. These findings further support our argument that program descriptions provide useful information for subjects to complete their search tasks.
   When we contrast responses of the two groups we find that group G subjects consider the description, person, and location metadata fields significantly more useful than subjects from group B. This suggests that group B subjects have more difficulties in distilling useful information from these fields (recall also the longer time spent on a page). This does not say that these users cannot understand the provided information. All that is indicated is that the chosen modality, i.e., text, might not be the right one. A graphical representation, for example as term clouds, might be better.

Fields as term clouds. In response to the observations just made, we also investigated how users would judge visual representations of search results, i.e., in the form of term clouds directly on the search result page. Here the goal is not to evaluate the visualization of the clouds or the method by which they are created. Of interest to us is whether subjects would find a direct presentation of information normally “hidden” on the program description page useful.
   Recall from §2 that we generate term clouds for each field on the basis of the terms in the top 1000 documents returned for a query. From Table 5 we observe that subjects do not consider the description and summary clouds useful, while previously these fields were judged most useful among the fields in the program description. Both clouds contain general terms from the television domain, e.g., program and series, which do not provide subjects with useful search terms. Although this could be due to the use of frequencies to select terms, these fields are inherently difficult to visualize without losing the relations between the terms. The genre, keyword, location and, to some degree, person clouds are all considered useful, but they support the user in different ways. The genre field supports the subject in understanding how content in the archive is organized, i.e., it provides an overview of the genres used for categorization. The keyword cloud provides the user with alternative search terms for the original query, for example, satire or parody instead of cabaret. The location and person clouds offer an indication of which locations and persons are present in the archive and how prominent they are. For these fields visualization is easier, i.e., genre, keywords or entities by themselves are meaningful without having to represent relations between them. Subjects consider the title field only marginally useful. For this field the usefulness is dependent on the knowledge of the subject as titles are not necessarily descriptive. The subjects also consider the organization field marginally useful, probably due to the nature of our search tasks, i.e., two tasks focus on finding persons and in one locations play an important role. We assume though that in general this type of information need occurs when the general public starts exploring the archive. Together, the above findings suggest that subjects find a direct presentation of short and meaningful terms, i.e., categories, keywords, and entities, on the search results page useful.

Table 5: Questions about the usefulness of term clouds based on specific metadata fields. Agreement is indicated on a 5-point Likert scale ranging from one (not at all) to five (extremely).

cloud          mode   avg     cloud          mode   avg
title          2      2.8     description    1      2.5
person         2,3    2.9     genre          4      3.4
location       4      3.3     summary        1      2.3
organization   2      2.2     keyword        4      3.8

4.    CONCLUSION
   We presented results from a user study where non-professional users perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio-visual archive. We hypothesized that such search tools provide unsatisfactory support to non-professional users on exploratory search tasks. By means of a TREC-style evaluation we find that subjects achieve low recall in the number of correct answers found. In a questionnaire regarding the user satisfaction with the search support offered by the tool, subjects indicate this to be marginal. Both findings support our hypothesis that a professional search tool is unsuitable for non-professional users performing exploratory search tasks.
   Through an analysis of the data logged during the experiment, we find evidence to support our second hypothesis that high-performing and lower-performing subjects follow different search strategies. Subjects that visit more program description pages are more successful on the exploratory search tasks. We also find that subjects consider certain metadata fields on the program description pages more useful than others. Subjects indicate that visualization of certain fields as term clouds directly in the search interface would be useful in completing the search tasks. Subjects especially consider presentations of short and meaningful text units, e.g., categories, keywords, and entities, useful.
   In future work we plan to perform an experiment in which we present non-professional users with two interfaces: the current search interface and one with a direct visualization of categories, keywords and entities on the search result page.

Acknowledgements. This research was partially supported by the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the PROMISE Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191, the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, and under COMMIT project Infiniti.

REFERENCES
[1] J.-w. Ahn, P. Brusilovsky, J. Grady, D. He, and R. Florian. Semantic annotation based exploratory search for information analysts. Inf. Proc. & Management, 46(4):383–402, 2010.
[2] D. K. Harman. The TREC test collections. In E. M. Voorhees and D. K. Harman, editors, TREC: Experiment and evaluation in information retrieval. MIT, 2005.
[3] B. Huurnink, L. Hollink, W. van den Heuvel, and M. de Rijke. Search behavior of media professionals at an audiovisual archive. J. Am. Soc. Inf. Sci. and Techn., 61:1180–1197, 2010.
[4] G. Marchionini. Exploratory search: from finding to understanding. Comm. ACM, 49(4):41–46, April 2006.
[5] R. White, B. Kules, S. Drucker, and M. Schraefel. Supporting exploratory search: Special issue. Comm. ACM, 49(4), 2006.