=Paper=
{{Paper
|id=Vol-1174/CLEF2008wn-iCLEF-PeinadoEt2008b
|storemode=property
|title=UNED at iCLEF 2008: Analysis of a Large Log of Multilingual Image Searches in Flickr
|pdfUrl=https://ceur-ws.org/Vol-1174/CLEF2008wn-iCLEF-PeinadoEt2008b.pdf
|volume=Vol-1174
|dblpUrl=https://dblp.org/rec/conf/clef/PeinadoGAL08a
}}
==UNED at iCLEF 2008: Analysis of a Large Log of Multilingual Image Searches in Flickr==
Víctor Peinado, Julio Gonzalo, Javier Artiles and Fernando López-Ostenero

NLP & IR Group, ETSI Informática, UNED, c/ Juan del Rosal, 16, E-28040 Madrid, Spain

victor@lsi.uned.es, javart@gmail.com, {julio, flopez}@lsi.uned.es

Abstract. In this paper, we summarise our analysis of the large log of multilingual image searches in Flickr provided to iCLEF 2008 participants. We have studied (a) correlations between the language skills of searchers in the target language and other session parameters, such as success (was the image found?), number of query refinements, etc.; (b) learning effects over time; (c) usage of specific cross-language search facilities; and (d) users' perceptions of the task (questionnaire analysis). We have identified 5101 complete search sessions (searcher/target image pairs) in the logs provided by the organisation. Our analysis shows that when users have active competence in the target language, their success rate is 12% higher than if they do not know the language at all. If the user has passive competence in the language (i.e. can partially understand texts but cannot formulate queries), the success rate equals that of users with active competence, but at the expense of executing more interactions with the system. The most remarkable learning effect is that users carry out fewer interactions once they are familiarised with the task and the system, while the success rate and the number of hints remain invariant. Finally, the usage of specific cross-language facilities (such as refining translations offered by the system) is low, but significantly higher than that of standard relevance feedback facilities, and is perceived as useful by searchers.
Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.4 [Information Systems Applications]: H.4.m Miscellaneous

General Terms: interactive information retrieval, cross-language information retrieval

Keywords: iCLEF, Flickr, log analysis, multilingual image search, user studies

1 Introduction

In this paper, we summarise our analysis of the large log of multilingual image searches in Flickr provided to iCLEF 2008 participants [1]. In this search log, every session consists of a searcher (a registered user with a profile that includes her native language and her proficiency in English, Spanish, Italian, German, Dutch and French) and a target image (from the Flickr image database, annotated in one or more of those six languages). When the session starts, the user does not know in which language(s) the image is annotated. The interface provides facilities to perform queries simultaneously in up to six languages (via dictionary translation of query terms), to provide controlled relevance feedback (clicking on suggested terms and on terms from the images found) and to refine the translations provided by the system (changing the selection made by the system or adding new translations). The task is, therefore, a multilingual known-item retrieval task. If the user gives up, she can ask for hints; the first hint is the target language (which turns the task into bilingual or monolingual search, depending on the language profile of the user). The rest of the hints are keywords used to annotate the image, which is aimed at preventing users from being discouraged by difficult images. The log consists of more than 5,000 search sessions by more than 200 users with a wide range of skills in the interface languages, coming from four continents. The size of this corpus permits studying the behaviour of users in a multilingual search scenario at a scale that had not been possible before.
The UNED team has focused on studying (a) correlations between the language skills of searchers in the target language and other session parameters, such as success (was the image found?), number of query refinements, etc.; (b) learning effects over time; (c) usage of specific cross-language search facilities; and (d) users' perceptions of the task (questionnaire analysis). This paper is a summary of our study.

The structure of the rest of the paper is as follows: Section 2 describes the process performed to regularise the logs and characterise each user's search sessions. In Section 3 we search for correlations between the language skills of searchers and other parameters of the search sessions. In Section 4 we study learning effects over time. In Section 5 we report on other aspects of our study, focusing on the usage of cross-lingual refinement facilities and users' perceptions of the task. Finally, in Section 6 we draw some general conclusions.

2 Log Processing and Characterisation of the Search Sessions

We have processed the logs provided by the iCLEF organisation in order to identify and characterise search sessions. A search session starts when the user is given a target image and finishes when the user either finds the image or gives up and stops searching. In the meantime, the user may log out and log in (even several times) and, essentially, interact with the interface: launch queries, explore the ranking of results, ask for hints, read the descriptions associated with images, manipulate the translations suggested by the system and thereby improve her personal dictionary, etc. Once search sessions were identified and open sessions were filtered out (those that were still active when the log was produced or those that died because of user inactivity for more than 24 hours), we retained 5101 search sessions. We have processed the logs to provide a rich characterisation of each session.
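The session-identification step described above can be sketched as follows. This is an illustrative reconstruction, not the actual processing code: the event field names (user_id, image_id, timestamp) and log layout are assumptions, while the grouping by searcher/target-image pair and the 24-hour inactivity cut-off come from the text.

```python
from datetime import datetime, timedelta

# A session is identified by a (user, target image) pair; sessions with a
# gap of more than 24 hours between consecutive events are discarded as
# "dead". Field names are hypothetical, not the iCLEF 2008 log schema.
MAX_GAP = timedelta(hours=24)

def split_sessions(events):
    """Group log events into search sessions keyed by (user, target image)."""
    sessions = {}
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        key = (ev["user_id"], ev["image_id"])
        sessions.setdefault(key, []).append(ev)
    return sessions

def filter_dead_sessions(sessions):
    """Drop sessions with more than 24 hours of inactivity between events."""
    kept = {}
    for key, evs in sessions.items():
        gaps = (b["timestamp"] - a["timestamp"] for a, b in zip(evs, evs[1:]))
        if all(gap <= MAX_GAP for gap in gaps):
            kept[key] = evs
    return kept
```

Note that log-in/log-out events do not close a session under this scheme; only the inactivity threshold does, which matches the paper's definition of a session spanning multiple logins.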
The essential features are the user's profile (in particular her language skills), the use of the different interface facilities (including translation features), the session number (when the image was searched for in the search history of the user), etc. We have also distinguished between the behaviour before and after the first hint, which is the language in which the image is annotated, because it represents the frontier between fully multilingual search (the image can be annotated in any of six languages) and bilingual or monolingual search. See Appendix A for a comprehensive list of the features that we have extracted.

3 Analysis Considering Language Skills

In our first analysis we have divided the search sessions into three groups: "active" is the group of sessions where the image was annotated in a language in which the user can read and write. Sessions in "passive" are those where the target language was partially understandable by the user, but the user could not formulate queries in that language (think, for instance, of French for most Spanish or Italian speakers). Finally, "unknown" stands for images annotated in languages completely unfamiliar to the user. In our pool of sessions we found 2809 active, 726 passive and 1566 unknown sessions. These figures are large enough to reach quantitatively meaningful conclusions. Table 1 shows average values for some session features for each of these three groups. The most notable result is the degree of success (was the image found?) for each of the groups: active and passive speakers successfully found the image 82% and 81% of the time, respectively. Users with no competence in the annotation language obtained 72%, performing 12% worse. It is somewhat surprising that users who have only passive knowledge of the target language perform as well as those with active knowledge, because the first group must necessarily use the translation capabilities of the system to express their queries.
The unknown group performs only 12% worse, which reveals a difference but not a large gap. Note that the translation capabilities of the interface were not optimal: they used only freely available dictionaries with some coverage gaps, and they were not tailored to the domain (the Flickr database). Note that, as users could ask for hints, it could be the case that "passive" sessions reach the same success rate as "active" ones simply because they ask for more hints. This is not the case: the average number of hints hardly varies between the three groups, ranging from a minimum of 2.13 hints per session to a maximum of 2.18.

competence  success  # hints  # queries      refinements    ranking exploration
                              mono   multi   mono   multi   mono   multi
active      82%      2.13     3.96   3.8     3.91   3.76    2.58   2.69
passive     81%      2.18     4.32   4.05    4.27   4.01    3.4    3.14
unknown     72%      2.15     4.5    4.38    4.41   4.28    2.93   3.06

Table 1: Users' behaviour according to language skills

The rest of the columns of Table 1 show the average number of queries launched, the number of direct query refinements and the number of times that the user explored the ranking beyond the first page of results (containing 20 items). In general, there is a clear ordering between active, passive and unknown sessions: active sessions need fewer interactions, passive more, and unknown even more. For instance, the average number of queries posed in the multilingual search mode is 3.8 for active sessions, 4.05 for passive sessions, and 4.38 for unknown sessions. Therefore, passive sessions achieve similar success to active sessions, but with higher effort. Unknown sessions involve even higher effort, and still show a 12% loss in effectiveness. The only feature in which this tendency is broken is rank exploration: passive sessions tend to explore the ranking further than unknown sessions, perhaps because the textual information in the images can be more easily used for relevance feedback. Note that we have not included search time in the tables.
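The per-group success figures above are straightforward aggregations over the characterised sessions. As a minimal sketch, assuming each session record carries a competence label (feature 13 in Appendix A) and a success flag (feature 15) under hypothetical field names:

```python
from collections import defaultdict

def success_rate_by_competence(sessions):
    """Average success per competence group (active/passive/unknown).

    Each session is assumed to be a dict with a "competence" label and a
    boolean "success" flag; these names are illustrative, not the paper's.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for s in sessions:
        total[s["competence"]] += 1
        found[s["competence"]] += int(s["success"])
    return {c: found[c] / total[c] for c in total}
```

The same pattern, with the count of queries, refinements or rank explorations in place of the success flag, yields the remaining columns of Table 1.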
Although the logs provide time stamps, we have discarded them because there is no way of knowing whether the user was actively engaged in the task or performing some other task while the session remained open. Therefore, time is a less reliable activity indicator than the number of interactions with the system.

4 Analysis Considering Time: The Learning Effect

For many users, this might be the first occasion on which they search simultaneously in several languages. It seems interesting to check whether there is a learning effect once they have searched for a certain number of images and are familiarised with the possibilities and difficulties of searching in a cross-language setting. For instance, will they learn to refine the translations chosen by the system when they are not appropriate? With this goal in mind, we have also analysed users' behaviour considering the number of search sessions completed, and extracted some trends which are depicted in the following figures. First, Figure 1 shows how the success rate remains stable regardless of the number of sessions performed previously. The number of hints requested, in spite of showing wider variability, remains stable too. The next figures show that, the more time users spend interacting with the search engine, three features decrease: the average number of queries launched (see Figure 2), the average number of direct query refinements (see Figure 3) and the average number of ranking explorations beyond the first page of results (see Figure 4), both in the monolingual and the multilingual environments. In other words, it takes users less effort to find the images.

Figure 1: Average hints requested and success rate per sessions performed.
Figure 2: Average queries launched per sessions performed.
Figure 3: Average direct refinements per sessions performed.
Figure 4: Average navigations beyond the first page of results per sessions performed.
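The aggregation behind Figures 1-4 can be sketched as follows: sessions are grouped by their position in each user's search history (feature 10 in Appendix A, the session number) and a per-position average of the feature of interest is computed. Field names here are assumptions for illustration.

```python
from collections import defaultdict

def average_by_session_number(sessions, feature):
    """Average a numeric session feature per position in the user's history.

    "session_number" is the position of the session in the user's search
    history (1 = first image searched); the returned dict maps each
    position to the mean value of `feature` over sessions at that position.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s in sessions:
        n = s["session_number"]
        sums[n] += s[feature]
        counts[n] += 1
    return {n: sums[n] / counts[n] for n in sums}
```

A decreasing curve of, e.g., `average_by_session_number(sessions, "queries")` over the session number is what the paper reads as a learning effect, while a flat curve for success rate and hints indicates no change in outcome.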
5 Other Issues

5.1 Usage of Specific Cross-Lingual Refinement Facilities

The FlickLing [2] search interface provides some functionalities which take advantage of Flickr's services (see http://www.flickr.com/services/api for further information about the Flickr API). Flickr's services suggest new terms related to a given query, and FlickLing allows the user to launch a new query with these terms or to refine a previous one. This functionality was used by less than 2% of users, as shown in Table 2. This is in agreement with the commonplace observation that relevance feedback facilities are rarely used in search engines (at least in non-specialised search scenarios), even if they can improve search effectiveness. The percentage of queries where users change the translations chosen by the system or add new translations to the dictionary is also quite low in absolute terms, but it is notable if we compare it with the use of standard relevance feedback mechanisms. Usage rates ranging from 4% to 11% are much higher than those of relevance feedback (2%) and can be taken as a positive indication of their usefulness at certain points of the search process.

                  Flickr related terms
            new query      query refinement   personal dictionary   adding
competence  mono   multi   mono   multi       manipulations         new terms
active      1%     1%      2%     2%          4%                    7%
passive     1%     1%      3%     1%          6%                    9%
unknown     2%     1%      4%     5%          9%                    11%

Table 2: Usage of specific cross-lingual refinement facilities

5.2 User Perceptions of the Task

Although the primary source of information is the users' activity logs, the logs also collect the answers to two types of questionnaires: one is presented after each session (in two forms: one if the search failed and another if it succeeded), and another is presented only once, when the user has performed fifteen search sessions (and therefore has a rather complete overall impression of the task).

5.2.1 Post-session questionnaires

Let us start with the results of the post-session questionnaires, which are depicted in Figure 5 (after success) and Figure 6 (after failure).
In cases of success, the task was perceived as easy in more than 1800 cases, and as hard in over 2200 cases. The two most popular sources of difficulty are not related to the cross-language nature of the task: "it was difficult to describe the query" and "it was hard because of the size of the image set". Not knowing the target language is mentioned as a difficulty in slightly over 400 cases, and bad translations in around 350 cases. In around 200 answers, having to translate the query was considered a problem. In cases of failure, the pattern is very similar: the two most common causes of failure are "I can't find suitable keywords for this image" and "There are too many images for my search". Overall, the perception of users is that multilinguality is a difficulty, but not as relevant as other aspects of the search task.

5.2.2 Final questionnaires

These questionnaires are not answered exactly when the experience is over, but when the user has performed fifteen search sessions and is therefore familiarised with the task. These are the results:

Figure 5: Post-image questionnaires after finding an image: overall results.
Figure 6: Post-image questionnaires after giving up on an image: overall results.

Challenging aspects of the task. Which, in your opinion, are the most challenging aspects of the task? Figure 7 shows the answers to this question. Notably, when we restrict this question to experienced users, who have searched for at least fifteen images, the results change drastically. Over 85% of the users agree or strongly agree that "Selecting/finding appropriate translations for the terms in my query" is a challenging aspect of the task, which makes it the most challenging aspect.

Figure 7: Which, in your opinion, are the most challenging aspects of the task?

Interface facilities. Two questions were aimed at how users perceived the interface facilities.
The first one is "Which interface facilities did you find most useful?" and the results are depicted in Figure 8. Note that the cross-language facilities (automatic translation of query terms and the possibility of improving the translations chosen by the system) are valued much more highly than the standard feedback facilities (the assistant to select new terms from the set of results and the additional query terms suggested by Flickr). This is in agreement with the proportional usage of these two kinds of facilities, although we must remark that the actual usage of those facilities is lower than what would be expected from the questionnaire. Do users learn that they are useful late in their search history? According to our study of learning effects, this is not the case.

Figure 8: Which interface facilities did you find most useful?

The second question is "Which interface facilities did you miss?", with a list of explicit facilities. The results are shown in Figure 9. Three facilities have an agreement rate (agree or strongly agree) above 70%: "a system able to select the translations for my query better", "the classification of search results in different tabs according to the image caption language", and "the possibility to search according to the visual features of the image". Other choices have slightly lower agreement rates: "an advanced search mode giving more control on how Flickr is queried", "bilingual dictionaries with a better coverage", and "more support to decide what the possible translations mean and therefore which ones are more appropriate". The least valued option (yet with an agreement rate above 50%) is "detection and translation of multi-word expressions", perhaps due to the nature of the task and the annotations (tags are frequently single words).
It is difficult to extract conclusions from the answers to this question, apart from the fact that users seem to appreciate all features that can seemingly improve the search experience, even if interactive features are not frequently used in practice.

Figure 9: Which interface facilities did you miss?

Finding appropriate translations. The last question was "How did you select/find the best translations for your query terms?" (see Figure 10). By far the most popular answer is "using my knowledge of target languages whenever possible", which was frequently used by 60% of the users and sometimes by another 30%. In contrast, less than 10% frequently "did not pay attention to the translations. I just trusted the system". This is in sharp contrast with the average behaviour of users, who rarely modify the translations chosen by the system, and deserves further investigation. Finally, "using additional dictionaries and other online sources" is used frequently by less than 20% of the users, and "sometimes" by another 20%.

Figure 10: How did you select/find the best translations for your query terms?

6 Conclusions

The search logs under study in the iCLEF 2008 task provide a more solid basis for extracting conclusions about the behaviour of users in multilingual search scenarios than most previous experiments, which were mostly performed under laboratory conditions and therefore more restricted in size. At UNED we have identified 5101 complete search sessions (searcher/target image pairs) in the logs provided by the organisation. Our analysis shows that when users have active competence in the target language, their success rate is 12% higher than if they do not know the language at all. If the user has passive competence in the language (i.e. can partially understand texts but cannot formulate queries), the success rate equals that of users with active competence, but at the expense of executing more interactions with the system.
The most remarkable learning effect is that users carry out fewer interactions once they are familiarised with the task and the system, while the success rate and the number of hints remain invariant. The usage of specific cross-language facilities (such as refining the translations offered by the system) is low, but significantly higher than that of standard relevance feedback facilities, and is perceived as useful by searchers. Finally, the perception that experienced users have of the interactive cross-language retrieval facilities is very positive, in spite of the fact that these facilities are not frequently used. This is an indication that advanced search features - in this case, the manipulation of translations offered by the system - might not be used frequently, but when they are used they become critical for the success of the task. A consequence is that query translation assistance should be hidden in the default settings of a cross-language search interface, but it should be possible to invoke it for certain advanced users or specific search situations.

Acknowledgements

This work has been partially supported by the Regional Government of Madrid under the MAVIR Research Network (S-0505/TIC-0267) and the Spanish Government under the Text-Mess project (TIN2006-15265-C06-02).

References

[1] Gonzalo, J., Clough, P., Karlgren, J.: Overview of iCLEF 2008: Search Log Analysis for Multilingual Image Retrieval. This volume.

[2] Peinado, V., Artiles, J., Gonzalo, J., Barker, E., López-Ostenero, F.: FlickLing: A Multilingual Search Interface for Flickr. This volume.

A Characterization of the Search Sessions

In order to characterize our users' behavior, we decided to represent every search session by capturing the user's profile, the use of the translation capabilities, and the state of every single feature of the interface.
In most cases, we single out a crucial moment in the development of the search session: the first hint, which reveals the language of annotation of the image and may therefore greatly delimit the problem. Thus, we extracted the following 111 features.

## SEARCH SESSION AND USER PROFILE
1: user ID
2: user's mother language
3: interface language used
4: German is (active|passive|unknown)
5: English is (active|passive|unknown)
6: Spanish is (active|passive|unknown)
7: French is (active|passive|unknown)
8: Italian is (active|passive|unknown)
9: Dutch is (active|passive|unknown)
10: session number for the individual user
11: target image ID
12: image language annotations
13: image language is (active|passive|unknown)
14: hints requested
15: success or failure in the session

## MONOLINGUAL INTERFACE
16: queries
17: queries before asking the first hint
18: queries after asking the first hint
19: direct query refinements
20: direct query refinements before asking the first hint
21: direct query refinements after asking the first hint
22: query refinements from a related term suggested by Flickr
23: query refinements from a related term suggested by Flickr before asking the first hint
24: query refinements from a related term suggested by Flickr after asking the first hint
25: query refinements by adding a related term to a previous query
26: query refinements by adding a related term to a previous query before asking the first hint
27: query refinements by adding a related term to a previous query after asking the first hint
28: query refinements from an image tag
29: query refinements from an image tag before asking the first hint
30: query refinements from an image tag after asking the first hint
31: query refinements by adding an image tag to a previous query
32: query refinements by adding an image tag to a previous query before asking the first hint
33: query refinements by adding an image tag to a previous query after asking the first hint
34: exploration of the ranking beyond the first page (20 results)
35: exploration of the ranking beyond the first page (20 results) before asking the first hint
36: exploration of the ranking beyond the first page (20 results) after asking the first hint
37: clicks to enlarge an image
38: clicks to enlarge an image before asking the first hint
39: clicks to enlarge an image after asking the first hint
40: wrong clicks on "I found it" icon
41: wrong clicks on "I found it" icon before asking the first hint
42: wrong clicks on "I found it" icon after asking the first hint
43: clicks on "show description" icon
44: clicks on "show description" icon before asking the first hint
45: clicks on "show description" icon after asking the first hint

## MULTILINGUAL INTERFACE
46: queries
47: queries before asking the first hint
48: queries after asking the first hint
49: direct query refinements
50: direct query refinements before asking the first hint
51: direct query refinements after asking the first hint
52: query refinements from a related term suggested by Flickr
53: query refinements from a related term suggested by Flickr before asking the first hint
54: query refinements from a related term suggested by Flickr after asking the first hint
55: query refinements by adding a related term to a previous query
56: query refinements by adding a related term to a previous query before asking the first hint
57: query refinements by adding a related term to a previous query after asking the first hint
58: query refinements from an image tag
59: query refinements from an image tag before asking the first hint
60: query refinements from an image tag after asking the first hint
61: query refinements by adding an image tag to a previous query
62: query refinements by adding an image tag to a previous query before asking the first hint
63: query refinements by adding an image tag to a previous query after asking the first hint
64: exploration of the ranking beyond the first page (20 results)
65: exploration of the ranking beyond the first page (20 results) before asking the first hint
66: exploration of the ranking beyond the first page (20 results) after asking the first hint
67: clicks to enlarge an image
68: clicks to enlarge an image before asking the first hint
69: clicks to enlarge an image after asking the first hint
70: wrong clicks on "I found it" icon
71: wrong clicks on "I found it" icon before asking the first hint
72: wrong clicks on "I found it" icon after asking the first hint
73: clicks on "show description" icon
74: clicks on "show description" icon before asking the first hint
75: clicks on "show description" icon after asking the first hint

## LANGUAGES ENABLED AND PERSONAL DICTIONARY
76: German as target language
77: German as target language before asking the first hint
78: German as target language after asking the first hint
79: English as target language
80: English as target language before asking the first hint
81: English as target language after asking the first hint
82: Spanish as target language
83: Spanish as target language before asking the first hint
84: Spanish as target language after asking the first hint
85: French as target language
86: French as target language before asking the first hint
87: French as target language after asking the first hint
88: Italian as target language
89: Italian as target language before asking the first hint
90: Italian as target language after asking the first hint
91: Dutch as target language
92: Dutch as target language before asking the first hint
93: Dutch as target language after asking the first hint
94: modifications of the translations suggested by the system
95: modifications of the translations suggested by the system before asking the first hint
96: modifications of the translations suggested by the system after asking the first hint
97: new translations added to the personal dictionary
98: new translations added to the personal dictionary before asking the first hint
99: new translations added to the personal dictionary after asking the first hint

## POST-IMAGE QUESTIONNAIRES AFTER SUCCESS
100: It was easy
101: It was hard because of the size of the image set
102: It was hard because the translations were bad
103: It was difficult to describe the image
104: It was hard because I didn't know the language in which the image was annotated
105: It was hard because of the number of potential target languages
106: It was hard because I needed to translate the query

## POST-IMAGE QUESTIONNAIRES AFTER GIVING UP
107: There are too many images for my search
108: The translations provided by the system are not right
109: I can't find suitable keywords for this image
110: I have difficulties with the search interface
111: I just don't know what else to do
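The before/after-first-hint split that runs through the whole feature list can be sketched as follows. This is a hypothetical illustration of the idea, not the actual extraction code: the event representation (dicts with an "action" field) is an assumption.

```python
# Each counting feature in Appendix A is computed twice: over the events
# that precede the first hint request (fully multilingual phase) and over
# those that follow it (bilingual/monolingual phase).

def split_at_first_hint(events):
    """Return (before, after) sub-lists of events around the first hint."""
    for i, ev in enumerate(events):
        if ev["action"] == "hint":
            return events[:i], events[i:]
    return events, []  # no hint was requested in this session

def count_action(events, action):
    """Count occurrences of a given action, e.g. "query" or "enlarge"."""
    return sum(1 for ev in events if ev["action"] == action)
```

Applying `count_action` to each half of the split for every tracked action type yields the triples (total, before first hint, after first hint) that make up most of the 111 features.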