Exploratory Search in an Audio-Visual Archive: Evaluating a Professional Search Tool for Non-Professional Users

Marc Bron, ISLA, University of Amsterdam, m.m.bron@uva.nl
Jasmijn van Gorp, TViT, Utrecht University, j.vangorp@uu.nl
Frank Nack, ISLA, University of Amsterdam, nack@uva.nl
Maarten de Rijke, ISLA, University of Amsterdam, derijke@uva.nl

ABSTRACT

As archives are opening up and publishing their content online, the general public can now directly access archive collections. To support access, archives typically provide the public with their internal search tools that were originally intended for professional archivists. We conduct a small-scale user study where non-professionals perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio-visual archive. We evaluate the tool using objective and subjective measures and find that non-professionals find the search interface difficult to use in terms of both. Analysis of search behavior shows that non-professionals who often visit the description pages of individual items in a result list are more successful on search tasks than those who visit fewer pages. A more direct presentation of entities present in the metadata fields of items in a result list can be beneficial for non-professional users on exploratory search tasks.

Categories and Subject Descriptors
H.5.2 [User interfaces]: Evaluation/methodology

General Terms
Measurement, Performance, Design, Experimentation

Keywords
Exploratory search, Usability evaluation

Copyright © 2011 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by the editors of euroHCIR2011.

1. INTRODUCTION

Traditionally, archives have been the domain of archivists and librarians, who retrieve relevant items for a user's request through their knowledge of the content in, and organization of, the archive. Increasingly, archives are opening up and publishing their content online, making their collections directly accessible for the general public. There are two major problems that these non-professional users face. First, most users are unfamiliar or only partially familiar with the archive content and its representation in the repository. The internal representation is designed from the expert point of view, i.e., the type of information included in the metadata, which does not necessarily match the expectation of the general public. This leads to an increase in exploratory types of search [5], as users are unable to translate their information need into terms that correspond with the representation of the content in the archive. The second problem is that archives provide users with professional search tools to search through their collections. Such tools were originally developed to support professional users in searching through the metadata descriptions in a collection. Given their knowledge of the collection, professionals primarily exhibit directed search behavior [3], but it is unclear to what extent professional search tools support non-professional users in exploratory search.

The focus of most work on improving exploratory search is on professionals [1]. In this paper we present a small-scale user study where non-professional users perform exploratory search tasks in an audio-visual archive using a search tool originally developed for media professionals and archivists. We investigate the following hypotheses: (i) a search interface designed for professional users does not provide satisfactory support for non-professional users on exploratory search tasks; and (ii) users with high performance on exploratory search tasks have different search behavior than users with lower performance.

In order to investigate the first hypothesis we evaluate the search tool performance objectively in terms of the number of correct answers found for the search tasks and subjectively through a usability questionnaire. To answer the second hypothesis, we perform an analysis of the click data logged during search.

2. EXPERIMENTAL DESIGN

The environment. The setting for our experiment was the Netherlands Institute for Sound and Vision (S&V), the Dutch national audiovisual broadcast archive. In the experiment we used the archive's collection consisting of around 1.5 M (television) programs with metadata descriptions provided by professional annotators.

We also utilized the search interface of S&V (http://zoeken.beeldengeluid.nl). The interface is available in a simple and an advanced version. The simple version is similar to search engines known from the web. It has a single search box and submitting a query results in a ranked list of 10 programs. When a user clicks on one of the programs, the interface shows a page with the complete metadata description of the program. Table 1 shows the metadata fields available for a program. Instead of the usual snippets presented with each item in a result list, the interface shows the title, date, owner and keywords for each item on the result page. Only the keywords and title fields provide information about the actual content of the program, while the other fields provide information primarily used for the organization of programs in the archive collection. The description and summary fields contain the most information about the content of programs but are only available by visiting the program description page.

We used the advanced version of the interface in the experiment, which next to the search box offers two other components: search boxes operating on specific fields and filters for certain categories of terms. Fielded searches operate on specific fields in the program metadata. The filters become available after a list of programs has been returned in response to a query. The filters display the top five most frequent terms in the returned documents for a metadata field. The metadata fields displayed in the filter component of the interface are highlighted in bold in Table 1. Once a checkbox next to one of the terms has been ticked, programs not containing that term in that field are removed from the result list.

Table 1: All metadata fields available for programs. We differentiate between fields that describe program content and fields that do not. Bold indicates fields used by the filter component.

  content descriptors
  field          explanation
  description    program highlights
  person         people in program
  keyword        terms provided by annotator
  summary        summary of the program
  organization   organization in program
  location       locations in program
  title          program title

  organizational descriptors
  field          explanation
  medium         storage medium
  genre          gameshow; news
  rights         parties allowed to broadcast
  owner          owner of the broadcast rights
  format
  date           broadcast date
  origin         program origin

Subjects. In total, 22 first year university students from media studies participated in the experiment. The students (16 female, 6 male) were between 19 and 22 years of age. As a reward for participation the students gained free entrance to the museum of the archive.

Experiment setup. In each of the five studios available at S&V either one or two subjects performed the experiment at a time in a single studio. In case two subjects were present, each of them worked on machines facing opposite sides of the studio. We instructed subjects not to communicate during the experiment. During the experiment one instructor was always present in a studio. Before starting, the subjects learned the goals of the experiment, got a short tutorial on the search interface and performed a test query. During this phase the subjects were allowed to ask questions.

In the experiment each subject had to complete three search tasks in 45 minutes. If after 15 minutes a task was not finished, the instructor asked the subject to move on to the next task. Search tasks are related to matters that could potentially occur within courses of the student's curriculum. Each search task required the subjects to find five answers before moving on to the next task. A correct answer was a page with the complete metadata description of a program that fulfilled the information need expressed by the search task. Subjects could indicate that a page was an answer through a submit button added to the interface for the experiment.

We used the following three search tasks in the experiment: (i) For the course “media and ethnicity” you need to investigate the role of ethnicity in television-comedy. Find five programs with different comedians with a non-western background. (ii) For the course “television geography” you need to investigate the representation of places in drama series. Find five drama series where location plays an important role. (iii) For the course “media and gender” you need to give a presentation about the television career of five different female hosts of game shows broadcast during the 1950s, 1960s or 1970s. Find five programs that you can use in your presentation.

Subjects received the search tasks in random order to avoid any bias. Also, subjects were encouraged to perform the search by any means that suited them best. During the experiment we logged all search actions, e.g., clicks, performed by each subject. After a subject had finished all three search tasks, he or she was asked to fill out a questionnaire about the experiences with the search interface.

Methodology for evaluation and analysis. We performed two types of evaluation of the search interface: a usability questionnaire and the number of correct answers submitted for the search tasks. The questionnaire consists of three sets of questions. The first set involves aspects of the experienced search behaviour with the interface. The second set contains questions about how useful users find the filter component, fielded search component, and metadata fields presented in the interface. The third set asks subjects to indicate the usefulness of a series of term clouds. The primary goal is not to evaluate the term clouds or their visualization but to find preferences for information from certain metadata fields. We generated a term cloud for a specific field as follows. First, we got the top 1000 program descriptions for the query “comedian.” We counted the terms for a field for each of the documents. The cloud then represented a graphical display of the top 50 most frequent terms in the fields of those documents, where the size of a term was relative to its frequency, i.e., the higher the frequency the bigger the term. In the questionnaire subjects indicate agreement on a 5 point Likert scale ranging from one (not at all) to five (extremely). The second type of evaluation was based on the evaluation methodology applied at TREC [2]. We pooled the results of all subjects and let two assessors make judgements about the relevance of the submitted answers to a search task. An answer is only considered relevant if both assessors agree. Performance is measured in terms of the number of correct answers (#correct) submitted to the system.

For the analysis of the search behavior of subjects we looked at (i) the number of times a search query is submitted using any combination of components (#queries); (ii) the number of times a program description page is visited (#pages); and (iii) the number of times a specific component is used, i.e., the general searchbox, filters and fields. A large value for #queries indicates a lookup-type search behavior. It is characterized by a pattern of submitting a query, checking if the answer can be found in the result list and, if it is not, formulating a new query. The new query is not necessarily based on information gained from the retrieved results but rather inspired by the subject's personal knowledge [4]. A large value for #pages indicates a learning-style search behavior. In this search strategy a subject visits the program description of each search result to get a better understanding of the organization and content of the archive. New queries are then also based on information gained from the previous text analysis [4]. We check the usage frequency of specific components to see if performance differences between subjects are due to alternative uses of interface components.

3. RESULTS

Search interface evaluation. Figure 1 shows the distribution of the number of correct answers submitted for a search task, together with the distribution of the number of answers (correct or incorrect) submitted. Out of the possible total of 330 answers (22 subjects × 3 tasks × 5 answers), 173 are actually submitted. Subjects submit the maximum number of five answers for 18 of the tasks. This suggests that subjects have difficulties in finding answers within the given time limit. Subjects find no correct answers for 31 of the tasks, five subjects find no correct answer for any of the tasks, and none of the subjects reaches the maximum of five correct answers for a task. In total 64 out of 173 answers are correct. This low precision indicates that subjects find it difficult to judge if an answer is correct based on the metadata provided by the program description. Table 2 shows questions about the satisfaction of subjects with the interface. Subjects indicate their level of agreement from one (not at all) to five (extremely). For all questions the majority of subjects find the amount of support offered by the interface on the exploratory search tasks marginal. This finding supports our first hypothesis that the search interface intended for professional users does not provide satisfactory support to non-professional users on exploratory search tasks.

Figure 1: Distribution of amount correct/submitted answers. [Plot: #tasks (0 to 30) against number of answers (0 to 5); series: #correct, #submitted.]

Table 2: Questionnaire results about the satisfaction of subjects with the search interface. Agreement is indicated on a 5 point Likert scale ranging from one (not at all) to five (extremely).

  question                                                              mode  avg
  To what degree are you satisfied with the search experience
  offered by the interface?                                             2     2.3
  To what degree did the interface support you by suggesting
  new search terms?                                                     2     2.4
  To what degree are you satisfied with the suggestions for new
  search terms by the interface?                                        2     2.3

Search behavior analysis. Although all subjects are non-experts with respect to search with this particular interface, some perform better than others. We investigate whether there is a difference in the search behavior of subjects that have high performance on the search tasks and users that have lower performance. We divide subjects into two groups depending on the average number of correct answers found aggregated over the three tasks, i.e., 2.9 out of the possible maximum of 15. The group with higher performance (group G) consists of 11 subjects with 3 or more correct answers, whereas the group with lower performance (group B) consists of 11 subjects with 2 or fewer correct answers. Table 3 shows the averages of the search behavior indicators for each of the two groups. We first look at the usage frequency of the filter, field, and search box components by subjects in group G vs. group B. There is no significant difference between the groups, indicating that there is no direct correlation between performance on the search tasks and use of specific search components. Next we look at search behavior as an explanation for the difference in performance between the groups. Our indicator for lookup searches, i.e., #queries, shows a small difference in the number of submitted queries. That subjects in both groups submit a comparable number of queries suggests that the difference in performance is not due to one group doing more lookups than the other. The indicator for learning type search, i.e., #pages, shows that there is a significant difference in the number of program description pages visited between subjects of the two groups, i.e., subjects in group G tend to visit program description pages more often than subjects of group B. We also find that the average time subjects in group G spend on a program description page is 27 seconds, while subjects from group B spend on average 39 seconds. These observations support our hypothesis that there are differences in search behavior between subjects that have high performance on exploratory search tasks and subjects with lower performance.

Table 3: Analysis of search behavior of subjects. Significance is tested using a standard two-tailed t-test. The symbol ▲ indicates a significant increase at the α < 0.01 significance level.

          filter  field  searchbox  #queries  #pages
  B avg   21.3    29.5   44.8       35.2      21.2
  G avg   15.2    44.0   42.0       34.3      35.7▲

Usefulness of program descriptions. One explanation for this difference in performance is that through their search behavior subjects from group G learn more about the content and organization of the archive and are able to assimilate this information faster from the program descriptions than subjects from group B. As subjects process more program descriptions they learn more about the available programs and terminology in the domain. This results in a richer set of potential search terms to formulate their information need. To investigate whether subjects found information in the program descriptions useful in suggesting new search terms, we analyse the second set of questions from the questionnaire. The top half of Table 4 shows subjects' responses to questions about the usefulness of metadata fields present on the search result page. Considering responses from all subjects the genre and keyword fields are found most useful and the title and date fields as well, although to a lesser degree. The fields intended for professionals, i.e., origin, owner, rights, and medium are found not useful by the majority of subjects. Between group B and G there are no significant differences in subjects' judgement of the usefulness of the fields.

Table 4: Questions about the usefulness of metadata fields on program description pages and the mode and average (avg) of the subjects' responses: for all subjects, the good (G) and bad (B) performing group. We use a Wilcoxon signed rank test for the ordinal scale. The symbol △ (▲) indicates a significant increase at the α < 0.05 (0.01) level.

                    all     B              G
  field             mode    mode    avg    mode   avg
  Degree to which fields on the result page were useful in suggesting new terms:
  date              3       2       2.2    3      3.0
  owner             1       1       1.6    1      2.0
  rights            1       1       1.3    1      1.4
  genre             4       1       2.8    4      3.9
  keyword           4       1,5     3.1    4      3.5
  origin            1       1,2     1.7    1      2.0
  title             3,4     2       2.2    4      3.0
  medium            1       1       1.5    1,2    1.6
  Degree to which fields in program descriptions were useful in suggesting new terms:
  summary           4       1,4     2.8    5      3.8
  description       4       4       3.3    4△     4.1
  person            4       1,3,4   2.8    4▲     3.8
  location          1,3,4   1,3     2.0    4▲     3.0
  organization      1       1       1.8    1,2    2.0

The bottom part of Table 4 shows subjects' responses to questions about the usefulness of metadata fields only present on the program description page and not already shown on the search result page. Based on all responses, the summary, description, person and location metadata fields are considered most useful by the majority of the subjects. These findings further support our argument that program descriptions provide useful information for subjects to complete their search tasks.

When we contrast responses of the two groups we find that group G subjects consider the description, person, and location metadata fields significantly more useful than subjects from group B. This suggests that group B subjects have more difficulties in distilling useful information from these fields (recall also the longer time spent on a page). This does not say that these users cannot understand the provided information. All that is indicated is that the chosen modality, i.e., text, might not be the right one. A graphical representation, for example as term clouds, might be better.

Fields as term clouds. In response to the observations just made, we also investigated how users would judge visual representations of search results, i.e., in the form of term clouds directly on the search result page. Here the goal is not to evaluate the visualization of the clouds or the method by which they are created. Of interest to us is whether subjects would find a direct presentation of information normally “hidden” on the program description page useful.

Recall from §2 that we generate term clouds for each field on the basis of the terms in the top 1000 documents returned for a query. From Table 5 we observe that subjects do not consider the description and summary clouds useful, while previously these fields were judged most useful among the fields in the program description. Both clouds contain general terms from the television domain, e.g., program and series, which do not provide subjects with useful search terms. Although this could be due to the use of frequencies to select terms, these fields are inherently difficult to visualize without losing the relations between the terms. The genre, keyword, location and, to some degree, person clouds are all considered useful, but they support the user in different ways. The genre field supports the subject in understanding how content in the archive is organized, i.e., it provides an overview of the genres used for categorization. The keyword cloud provides the user with alternative search terms for his original query, for example, satire or parody instead of cabaret. The location and person clouds offer an indication of which locations and persons are present in the archive and how prominent they are. For these fields visualization is easier, i.e., genre, keywords or entities by themselves are meaningful without having to represent relations between them. Subjects consider the title field only marginally useful. For this field the usefulness is dependent on the knowledge of the subject as titles are not necessarily descriptive. The subjects also consider the organization field marginally useful, probably due to the nature of our search tasks, i.e., two tasks focus on finding persons and in one locations play an important role. We assume though that in general this type of information need occurs when the general public starts exploring the archive. Together, the above findings suggest that subjects find a direct presentation of short and meaningful terms, i.e., categories, keywords, and entities, on the search results page useful.

Table 5: Questions about the usefulness of term clouds based on specific metadata fields. Agreement is indicated on a 5 point Likert scale ranging from one (not at all) to five (extremely).

  cloud          mode  avg     cloud         mode  avg
  title          2     2.8     description   1     2.5
  person         2,3   2.9     genre         4     3.4
  location       4     3.3     summary       1     2.3
  organization   2     2.2     keyword       4     3.8

4. CONCLUSION

We presented results from a user study where non-professional users perform exploratory search tasks with a search tool originally developed for media professionals and archivists in an audio-visual archive. We hypothesized that such search tools provide unsatisfactory support to non-professional users on exploratory search tasks. By means of a TREC style evaluation we find that subjects achieve low recall in the number of correct answers found. In a questionnaire regarding the user satisfaction with the search support offered by the tool, subjects indicate this to be marginal. Both findings support our hypothesis that a professional search tool is unsuitable for non-professional users performing exploratory search tasks.

Through an analysis of the data logged during the experiment, we find evidence to support our second hypothesis that subjects follow different search strategies. Subjects that visit more program description pages are more successful on the exploratory search tasks. We also find that subjects consider certain metadata fields on the program description pages more useful than others. Subjects indicate that visualization of certain fields as term clouds directly in the search interface would be useful in completing the search tasks. Subjects especially consider presentations of short and meaningful text units, e.g., categories, keywords, and entities, useful.

In future work we plan to perform an experiment in which we present non-professional users with two interfaces: the current search interface and one with a direct visualization of categories, keywords and entities on the search result page.

Acknowledgements. This research was partially supported by the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, the PROMISE Network of Excellence co-funded by the 7th Framework Programme of the European Commission, grant agreement no. 258191, the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.061.814, 612.061.815, 640.004.802, 380-70-011, the Center for Creation, Content and Technology (CCCT), the Hyperlocal Service Platform project funded by the Service Innovation & ICT program, the WAHSP project funded by the CLARIN-nl program, and under COMMIT project Infiniti.

REFERENCES

[1] J.-w. Ahn, P. Brusilovsky, J. Grady, D. He, and R. Florian. Semantic annotation based exploratory search for information analysts. Inf. Proc. & Management, 46(4):383–402, 2010.
[2] D. K. Harman. The TREC test collections. In E. M. Voorhees and D. K. Harman, editors, TREC: Experiment and evaluation in information retrieval. MIT, 2005.
[3] B. Huurnink, L. Hollink, W. van den Heuvel, and M. de Rijke. Search behavior of media professionals at an audiovisual archive. J. Am. Soc. Inf. Sci. and Techn., 61:1180–1197, 2010.
[4] G. Marchionini. Exploratory search: from finding to understanding. Comm. ACM, 49(4):41–46, April 2006.
[5] R. White, B. Kules, S. Drucker, and M. Schraefel. Supporting exploratory search: Special issue. Comm. ACM, 49(4), 2006.