UNED at iCLEF 2009: Analysis of Multilingual Image Search Sessions

Víctor Peinado, Fernando López-Ostenero and Julio Gonzalo
NLP & IR Group, ETSI Informática, UNED
c/ Juan del Rosal, 16, E-28040 Madrid, Spain
{victor, flopez, julio}@lsi.uned.es

Abstract

In this paper we summarize the analysis performed on the logs of multilingual image search provided by iCLEF09 and its comparison with the logs released in the iCLEF08 campaign. We have processed more than one million log lines in order to identify and characterize 5,243 individual search sessions. We focus on the analysis of users' behavior and their performance, trying to find possible correlations between: a) the language skills of the users and the annotation language of the target images; and b) the final outcome of the search session. We have observed that the proposed task can be considered easy, even though users with no competence in the annotation language of the images tend to perform more interactions and to use cross-language facilities more frequently. Usage of relevance feedback is remarkably low, but successful users use it more often.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries; H.5.2 User Interfaces

General Terms

cross-language information retrieval, interactive systems, image search, known-item retrieval task

Keywords

flickr, images, log analysis

1 Introduction

In this paper we summarize the analysis performed on the logs of multilingual image search provided in the iCLEF 2009 track [2] and its comparison with the logs released in the iCLEF 2008 campaign [1]. In the search logs provided by the organizers, individual search sessions can be easily identified. Each session starts when a registered user is shown a target image and finishes when the user finds the image or decides to give up. The logs collect all the interactions that occurred in the meantime: monolingual and multilingual queries launched, query refinements, navigation across the results ranking, hints shown by the system, usage of the personal dictionaries and other cross-language facilities, etc. These logs are automatically generated by the FlickLing search engine. Please see [3] for a complete description of the interface's functionalities and the logs.

Last year [5] we focused on the analysis of possible correlations between the language skills of the users and the annotation language of the target images, along with the usage of some of the specific cross-language facilities that FlickLing provides. In this work we focus on the analysis of users' behavior and their performance, trying to find possible correlations between: a) the language skills of the users and the annotation language of the target images; and b) the final outcome of the search session. Being aware of the differences between the two groups of users involved in the interactive experiments and between the two pools of images used, we replicate the analysis, trying to find new correlations and to reinforce or discard the evidence observed.

The remainder of the paper is organized as follows: Section 2 describes the processing tasks and the characterization of the search sessions performed on the iCLEF logs. Next, we discuss some correlations found between our users' search behavior and their profile according to their language skills (Section 3) and the final outcome of their search sessions (Section 4).
Then, in Section 5 we present additional data of our study based on the questionnaires collected during the experiments. Finally, in Section 6 we draw some general conclusions and propose future lines of work.

2 iCLEF Logs Processing

The logs provided by the iCLEF organization in 2009 were considerably smaller than last year's corpus. Table 1 shows some of the most relevant statistics of both raw logs.

                                2008        2009
  registered users               305         130
  log lines                1,483,806     617,947
  valid search sessions        5,101       2,410
  images found                 4,033       2,149
  images not found             1,068         261
  hints asked                 11,044       5,805
  monolingual queries         37,125      13,037
  multilingual queries        36,504      17,872
  promoted translations          584         725
  penalized translations         215         353
  image descriptions shown       418         100

Table 1: Statistics of the logs provided by the iCLEF organization

Of course, we had many users who registered and tried just a few searches. As last year, for the current analysis, we focus only on those users who, regardless of their final outcome, were able to complete at least 15 search sessions and filled in the overall questionnaire. We think that these users finished the experiment (even though they could go on searching at will) and became sufficiently experienced with FlickLing. Table 2 shows some of the most relevant statistics of both logs considering only the mentioned sub-sets of users.

                                2008        2009
  considered users                65          33
  log lines                  841,957     357,703
  valid search sessions        3,640       1,603
  images found                 2,983       1,439
  images not found               657         164
  hints asked                  8,093       3,886
  monolingual queries         23,060       8,461
  multilingual queries        20,607      10,463
  promoted translations          223         525
  penalized translations          70         246
  image descriptions shown       126          42

Table 2: Statistics of the sub-sets of logs analyzed

Notice that we analyze more than one million log lines generated by 98 users, containing 5,243 search sessions and more than 62,000 queries. Comparing a collection of logs generated in an interactive image search experiment with different users and two different sets of target images is not straightforward, but we think these figures are large enough to reach quantitatively meaningful conclusions.

We have therefore processed the logs in order to obtain a rich characterization of the search sessions: the user and her behavior, the target image, and the usefulness of the search and translation facilities provided by FlickLing. As in our previous work [4], we have extracted 115 features for each session, capturing the user's complete profile according to her language skills, the target image's profile, and the usage of the interface's functionalities.

The first hint provided by the system, when the user requests one, is always the language in which the target image is annotated. Since this fact may turn the initial fully multilingual search (target images can be tagged in up to six different languages, namely Dutch, English, French, German, Italian and Spanish) into a bilingual or monolingual search (depending on the user's language skills), we have also tracked the user's behavior before and after asking for this first hint.

In the following sections we present the analyses performed on these two sub-sets of search sessions according to the language skills of the users (Section 3) and considering the final outcome of the search sessions (Section 4).
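As a rough illustration of the kind of log processing described above, the following is a minimal sketch in Python. The tab-separated layout, field names and event labels are hypothetical simplifications for illustration only, not the actual FlickLing log schema, and only a handful of the 115 features are shown.

  # Minimal sketch: group log lines into search sessions and compute a
  # few per-session features. Field and event names are hypothetical.
  import csv
  from collections import defaultdict

  def load_sessions(log_path):
      """Group log events by (user, target image) into sessions."""
      sessions = defaultdict(list)
      with open(log_path, newline='', encoding='utf-8') as f:
          for row in csv.DictReader(f, delimiter='\t'):
              key = (row['user_id'], row['target_image_id'])
              sessions[key].append(row)
      return sessions

  def characterize(events):
      """Compute a few illustrative per-session features."""
      return {
          'mono_queries': sum(e['event'] == 'mono_query' for e in events),
          'multi_queries': sum(e['event'] == 'multi_query' for e in events),
          'hints_asked': sum(e['event'] == 'hint_request' for e in events),
          'found': any(e['event'] == 'image_found' for e in events),
      }

  if __name__ == '__main__':
      sessions = load_sessions('flickling.log')   # hypothetical file name
      rows = [characterize(ev) for ev in sessions.values()]
      success_rate = sum(r['found'] for r in rows) / len(rows)
      print(f'{len(rows)} sessions, success rate {success_rate:.2%}')

Each session is thus reduced to a feature vector, and the tables in the following sections are simple aggregations of such vectors over groups of sessions.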
3 Analysis According to Language Skills

We have divided our search sessions into three different 'profiles' according to the user's language skills with respect to the annotation language of the target image. "Active" denotes the sessions where the image was annotated in a language in which the user was able to read and write fluently. "Passive" sessions are those where the target language was partially understandable by the user, but the user could not make queries in that language (e.g. images annotated in Italian for most Spanish or French speakers). Finally, "unknown" refers to sessions where the image is annotated in a language completely unfamiliar to the user.

3.1 Users' Behavior

While the iCLEF08 corpus has enough samples in each of these categories (2,345 sessions for active, 535 for passive and 760 for unknown), the iCLEF09 corpus has no active sessions at all and a great majority of unknown sessions (only 18 are passive and 1,585 are unknown). The explanation lies in the different characteristics of the target images proposed each year. Last year the image corpus was fully multilingual, but most of the images could be easily found by simply searching in English and Spanish, the most popular languages among our users. This year, on the contrary, the image corpus was collected trying to avoid images annotated in English and placing special emphasis on Dutch and German. Our users came mostly from Romania, Italy and Spain, with little knowledge of these languages.

Table 3 shows the number of samples per profile, the average success rate (was the image found?) and the average number of hints requested per search session for each year's logs, along with the aggregate values. According to the figures, the degree of success was high in all cases. In the iCLEF08 corpus, active and passive speakers performed similarly (passive users asking for more hints, though): they successfully found the target image 84% and 82% of the time, respectively. On the other hand, as expected, users with no competence in the annotation language obtained a 73% success rate and asked for more hints (2.42).

  iCLEF08
  result      samples   success rate   # hints requested
  active        2,345       85%              2.14
  passive         535       82%              2.22
  unknown         760       73%              2.42

  iCLEF09
  result      samples   success rate   # hints requested
  active            0        -                -
  passive          18       78%              1.22
  unknown       1,585       90%              2.43

  iCLEF08 + iCLEF09
  result      samples   success rate   # hints requested
  active        2,345       85%              2.14
  passive         553       82%              2.12
  unknown       2,345       84%              2.45

Table 3: Users' behavior according to language skills: average success rate and hints requested

In the iCLEF09 corpus, the division into profiles does not allow us to find clear correlations because of the lack of samples. Unknown users, nonetheless, were able to successfully find the image 90% of the time, while asking for 2.43 hints, a figure similar to iCLEF08. It is worth noticing that hints in iCLEF09 were more specific and concrete than in iCLEF08. Thus, even though most of the target images were annotated in an unknown language, asking for hints was definitely more useful this year. Finally, in the aggregate figures, it can be observed that while all three profiles present a similar success rate, unknown users ask for more hints than users with some degree of competence in the annotation language of the image.
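As an illustration of how the per-profile averages reported in Table 3 can be derived from the per-session features, the following is a minimal sketch using pandas; the column names and the tiny inline sample are hypothetical and stand in for the real feature vectors.

  # Sketch of the per-profile aggregation behind Table 3, assuming one
  # row per session; column names and sample values are hypothetical.
  import pandas as pd

  sessions = pd.DataFrame([
      # profile is one of 'active', 'passive', 'unknown'
      {'profile': 'unknown', 'found': True,  'hints_asked': 3},
      {'profile': 'unknown', 'found': True,  'hints_asked': 2},
      {'profile': 'passive', 'found': False, 'hints_asked': 1},
  ])

  per_profile = sessions.groupby('profile').agg(
      samples=('found', 'size'),
      success_rate=('found', 'mean'),        # fraction of sessions where the image was found
      hints_requested=('hints_asked', 'mean'),
  )
  print(per_profile)

The same grouping, applied to other per-session counters, produces the cognitive-effort and dictionary-usage figures discussed in the next sections.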
3.2 Cognitive Effort

We have grouped under the name "cognitive effort" some of the most common interactions that occur in a traditional search interface, namely: launching queries, exploring the ranking of results beyond the first page (each page contains 20 items), and using relevance feedback (words provided by Flickr related to the query terms, and the tags associated to each image retrieved in the ranking of results). Table 4 shows the figures related to these interactions for each user profile in both FlickLing's monolingual and multilingual environments.

In the iCLEF08 logs, as expected, active and passive users launch more queries in the monolingual environment, while unknown users, who presumably need the translation functionalities to find the image, launch more multilingual queries using FlickLing's facilities. As far as ranking exploration is concerned, the same pattern appears: active and passive users cover more ranking pages when querying in the monolingual environment, while unknown users explore the ranking more deeply when querying in the multilingual one.

Analyzing the iCLEF09 results, we cannot draw any clear conclusions, but if we ignore the 18 samples corresponding to passive users, we find that unknown users again performed more interactions in the multilingual environment: more queries launched and more ranking explorations.

Usage of relevance feedback facilities, as shown in previous work (see [5]), is very low in both log collections. Still, even with small variations, active and passive players used relevance feedback more often in monolingual searches, while unknown players used it more often in the multilingual environment.

Analyzing the aggregate data we can maintain the following conclusion: active and passive users employed more cognitive effort in monolingual searches, while unknown users needed more cognitive effort in multilingual searches in order to reach a similar performance, as shown in Section 3.1.

  iCLEF08
  competence   typed queries     ranking exploration   relevance feedback
               mono    multi     mono    multi         mono    multi
  active       4.03    3.28      2.09    1.92          0.03    0.03
  passive      4.16    3.31      2.83    2.24          0.05    0.02
  unknown      3.81    4.02      2.36    2.81          0.07    0.09

  iCLEF09
  competence   typed queries     ranking exploration   relevance feedback
               mono    multi     mono    multi         mono    multi
  active        -       -         -       -             -       -
  passive      4.72    11.06     2.78    11.11         0       0
  unknown      3.48    3.89      1.76    2.43          0.01    0.03

  iCLEF08 + iCLEF09
  competence   typed queries     ranking exploration   relevance feedback
               mono    multi     mono    multi         mono    multi
  active       4.03    3.28      2.09    1.92          0.03    0.03
  passive      4.18    3.56      2.83    2.53          0.05    0.02
  unknown      3.57    3.91      1.96    2.55          0.03    0.05

Table 4: Cognitive effort according to language skills: typed queries, ranking exploration and usage of relevance feedback

3.3 Usage of Specific Cross-Language Refinement Facilities

The dictionaries used by FlickLing were not optimal: in order to cover the six languages considered in the experiment, freely available general-purpose dictionaries were used. To rectify some of the translation errors, FlickLing allows users to promote and penalize the translations appearing in the general dictionaries of its multilingual environment. These changes are incorporated into a personal dictionary for each user and do not affect other players' translations. When characterizing the search sessions, we also took into consideration the usage of this functionality by our users. In general, the usage of the personal dictionary was low.
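To make the mechanism just described more concrete, the following is a minimal sketch of how a per-user dictionary can override a shared general dictionary. It is only an illustration of the idea, under hypothetical names; it is not FlickLing's actual implementation.

  # Illustrative per-user translation dictionary layered on top of a
  # shared general dictionary; promotions and penalizations only affect
  # this user's own translations.
  class PersonalDictionary:
      def __init__(self, general):
          self.general = general      # shared dict: term -> [translations]
          self.promoted = {}          # term -> preferred translation
          self.penalized = set()      # (term, translation) pairs to hide

      def promote(self, term, translation):
          self.promoted[term] = translation

      def penalize(self, term, translation):
          self.penalized.add((term, translation))

      def translations(self, term):
          candidates = [t for t in self.general.get(term, [])
                        if (term, t) not in self.penalized]
          preferred = self.promoted.get(term)
          if preferred in candidates:
              candidates.remove(preferred)
              candidates.insert(0, preferred)
          return candidates

  general = {'perro': ['dog', 'hound']}        # toy general dictionary
  user_dict = PersonalDictionary(general)
  user_dict.penalize('perro', 'hound')
  print(user_dict.translations('perro'))       # ['dog']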
Table 5 shows the average percentage of search sessions in which users manipulated their personal dictionary by adding new translations, promoting good translation options and removing bad ones, along with the average number of query terms modified by these manipulations.

  iCLEF08
  competence   dictionary manipulations   query terms modified
  active              0.06                      0.04
  passive             0.05                      0.03
  unknown             0.17                      0.11

  iCLEF09
  competence   dictionary manipulations   query terms modified
  active               -                         -
  passive             6.56                      1.67
  unknown             0.4                       0.16

  iCLEF08 + iCLEF09
  competence   dictionary manipulations   query terms modified
  active              0.06                      0.04
  passive             0.27                      0.08
  unknown             0.33                      0.14

Table 5: Usage of specific cross-language refinement facilities according to language skills

In iCLEF08, unknown users manipulated their personal dictionary about three times more often (0.17) than active (0.06) and passive (0.05) players, and consequently the number of query terms modified was also higher (0.11). Comparing both log collections, we observe that in iCLEF09, where the usage of cross-language facilities was expected to be higher, the manipulation of the personal dictionary indeed increased (0.4). As far as the aggregate data are concerned, we can observe that the less competence a user has in the annotation language, the more she uses these cross-language facilities.

4 Analysis According to Search Session's Outcome

In the following sections we analyze users' behavior according to the final outcome of the search sessions. In order to find some correlations regarding the most successful strategies used by our users, we divide the sessions into two categories: on one hand, "success" refers to those sessions where users were, with or without hints, able to find the proposed target image; on the other hand, "fail" refers to those sessions where the user decided to quit before finding the image.

4.1 Users' Behavior

As in Section 3.1, we analyze users' behavior, but focusing now on the final outcome of the search sessions. Looking at Table 6, the first detail to note is the number and percentage of samples in each category: 81.95% of success samples in iCLEF08, 89.77% in iCLEF09 and 84.34% in the aggregate results confirm that finding the proposed images was an easy task.

  iCLEF08
  result     samples      %        # hints requested
  success      2,983    81.95%           2.32
  fail           657    18.05%           1.74

  iCLEF09
  result     samples      %        # hints requested
  success      1,439    89.77%           2.38
  fail           164    10.23%           2.77

  iCLEF08 + iCLEF09
  result     samples      %        # hints requested
  success      4,422    84.34%           2.34
  fail           821    15.66%           1.95

Table 6: Users' behavior according to search session outcome: samples, percentage of sessions and average hints requested

Regarding the average number of hints requested, users in successful sessions asked for 2.32 and 2.38 hints in iCLEF08 and iCLEF09, respectively. Users in failed sessions asked for a similar quantity of hints in iCLEF09 (2.77), while in iCLEF08 the number of hints is lower (1.74). Finally, in the aggregate results we can observe that asking for hints seems to have been a good strategy to find the target image, in spite of the score loss, since users in successful sessions asked for 2.34 hints compared to 1.95 in failed sessions.

4.2 Cognitive Effort

Analyzing the cognitive effort with respect to the outcome of the search session, our aim is to find out which strategies were most effective for finding the images in the proposed iCLEF experiment.
As shown in Table 7, in the iCLEF08 logs, successful users launched more queries in the monolingual environment than in the multilingual one (4.05 vs. 3.36), while unsuccessful players do not show differences (3.76 vs. 3.79). On the other hand, in the iCLEF09 logs, successful users launched more multilingual queries than monolingual ones (4.02 vs. 3.65). This can be explained, as mentioned above, by the nature of the image collection, which was designed to force multilingual searches. This fact can also be seen in the number of ranking explorations in the multilingual environment, slightly higher in iCLEF09 than in iCLEF08 (2.48 and 2.93 vs. 2.13 and 2.26). Lastly, in general, users in failed sessions seem to have performed more interactions in the monolingual environment.

  iCLEF08
  result     typed queries     ranking exploration   relevance feedback
             mono    multi     mono    multi         mono    multi
  success    4.05    3.36      2.22    2.13          0.05    0.04
  fail       3.76    3.79      2.39    2.26          0.05    0.02

  iCLEF09
  result     typed queries     ranking exploration   relevance feedback
             mono    multi     mono    multi         mono    multi
  success    3.65    4.02      1.89    2.48          0.02    0.03
  fail       1.96    3.23      0.79    2.93          0.01    0.02

  iCLEF08 + iCLEF09
  result     typed queries     ranking exploration   relevance feedback
             mono    multi     mono    multi         mono    multi
  success    3.92    3.58      2.11    2.24          0.04    0.04
  fail       3.4     3.67      2.07    2.4           0.04    0.02

Table 7: Cognitive effort according to the search session outcome: typed queries, ranking exploration and usage of relevance feedback

As the last columns of the table show, the usage of relevance feedback was very low in both categories, being higher in the monolingual environment in iCLEF08 and in the multilingual one in iCLEF09. In general, although with small differences, successful users tended to use relevance feedback more frequently.

4.3 Usage of Specific Cross-Language Refinement Facilities

Finally, regarding the manipulation of the personal dictionary (see Table 8), successful users in iCLEF08 used it slightly more often than those who failed (0.08 vs. 0.06). In iCLEF09 the overall usage is much higher, but the pattern is reversed: unsuccessful players tended to manipulate their dictionaries more often (0.62 vs. 0.46). In the aggregate data for both logs we can observe that successful players used this functionality more frequently (0.2 vs. 0.17).

5 Questionnaire Analysis: Users' Perception of the Task

Along with the interactions of the users and the information about the search sessions, the iCLEF logs also contain two types of questionnaires: one is shown every time the user finishes a search session and contains questions about the target image and the development of the search; the other is shown when the user has completed 15 search sessions and raises overall questions about the task itself, the usefulness of the interface functionalities and the user's performance. In this analysis we focus only on the latter, especially on the following questions:

Which, in your opinion, are the most challenging aspects of the task? 83% of participants from iCLEF08 and 85% from iCLEF09 agree or strongly agree that "Selecting/finding appropriate translations for the terms in my query" was the most challenging aspect of the task.
  iCLEF08
  result     dictionary manipulations   query terms modified
  success           0.08                      0.05
  fail              0.06                      0.05

  iCLEF09
  result     dictionary manipulations   query terms modified
  success           0.46                      0.17
  fail              0.62                      0.18

  iCLEF08 + iCLEF09
  result     dictionary manipulations   query terms modified
  success           0.2                       0.09
  fail              0.17                      0.07

Table 8: Usage of specific cross-language refinement facilities according to the search session outcome

Users from both years also agree or strongly agree, in about 80% of the cases, with other answers such as "Finding the correct terms to express an image in my own native language", "Handling multiple target languages at the same time" and "Finding the target image in very large sets of results".

Which interface facilities did you find most useful? In both years, cross-language functionalities such as the automatic translation and the possibility of maintaining a personal dictionary are valued more highly than relevance feedback facilities, especially among the iCLEF09 users. 80% of the users from 2009 agree on the usefulness of the personal dictionary, compared to 59% who agree on the usefulness of the additional query terms suggested by the system.

Which interface facilities did you miss? Up to seven different facilities not implemented in the current version of FlickLing are proposed in this question. Among the iCLEF08 users, the most popular answers, with an agreement rate of about 75%, were "The classification of search results in different tabs according to the image caption languages" and "A system able to select the translations for my query terms better". As far as the iCLEF09 users are concerned, the most popular answers, with more than 80% of support, were, along with "A system able to select the translations for my query terms better", "Bilingual dictionaries with a better coverage" and "Detection and translation of multi-word expressions". These answers seem to be in accordance with the fact that iCLEF09 users needed to interact more frequently in a multilingual environment with cross-language tools that could be improved.

How did you select/find the best translations for your query terms? Again, the answers to this question reflect the different user and image profiles in the two campaigns. While the most popular answer among the iCLEF08 users was "Using my knowledge of target languages whenever possible" (around 90%), iCLEF09 users opted for "Using additional dictionaries and other on-line sources" in 82% of the cases.

6 Conclusions and Future Work

In this paper we have summarized the analysis performed on the logs of multilingual image search provided by iCLEF09 and its comparison with the logs released in the iCLEF08 campaign. We have processed more than one million log lines in order to identify and characterize 5,243 individual search sessions. Each session starts when a registered user is shown a target image and finishes when the user finds the image or decides to give up. Besides, the logs collect all the interactions that occurred in the meantime: monolingual and multilingual queries launched, query refinements, navigation across the results ranking, hints shown by the system, usage of the personal dictionaries and other cross-language facilities, etc. In this work we have focused on the analysis of users' behavior and their performance, trying to find possible correlations between: a) the language skills of the users and the annotation language of the target images; and b) the final outcome of the search session.
Among the conclusions observed in this work, we can mention:

• The proposed task was easy, since all user profiles reach a success rate of more than 80%. Users with no competence in the annotation language of the image tend to ask for more hints.

• Users with some knowledge of the annotation language of the images employ more cognitive effort in monolingual searches, while users without such skills need more cognitive effort in multilingual searches in order to reach a similar performance.

• As expected, the less competence a user has in the annotation language, the more she uses cross-language facilities.

• Given the features of the two image collections, in iCLEF08, where most of the images were annotated in known languages, successful users launched more queries in the monolingual environment. On the other hand, in iCLEF09, where multilingual needs were forced on purpose, successful users launched more multilingual queries.

• Usage of relevance feedback is remarkably low, but successful users tended to use it more frequently.

• Questionnaires show that cross-language facilities are perceived very positively when proposed in a multilingual search scenario.

• The answers collected in the questionnaires are in accordance with the fact that iCLEF09 users needed to interact more frequently in a multilingual environment with cross-language tools that could be improved.

As part of our future work, we are currently widening this study by analyzing users' behavior over time as they advance through the experiment and complete more and more search sessions, in order to find useful correlations about how they learn to interact with the system and how they test different search strategies.

Acknowledgements

This work has been partially supported by the Regional Government of Madrid under the Research Network MAVIR (S-0505/TIC-0267) and the Spanish Government under project Text-Mess (TIN2006-15265-C06-02). We would also like to thank Javier Artiles for his intensive work during the implementation of the FlickLing interface and all the collaborators involved in collecting the image corpus and the testing stage.

References

[1] Gonzalo, J., Clough, P., Karlgren, J.: Overview of iCLEF 2008: Search Log Analysis for Multilingual Image Retrieval. In: Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Aarhus, Denmark. LNCS vol. 5706, Springer Verlag. 2009.

[2] Gonzalo, J., Clough, P., Karlgren, J.: Overview of iCLEF 2009. This volume.

[3] Peinado, V., Artiles, J., Gonzalo, J., Barker, E., López-Ostenero, F.: FlickLing: A Multilingual Search Interface for Flickr. In: Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark. 2008.

[4] Peinado, V., Gonzalo, J., Artiles, J., López-Ostenero, F.: UNED at iCLEF 2008: Analysis of a Large Log of Multilingual Image Searches in Flickr. In: Working Notes for the CLEF 2008 Workshop, 17-18 September, Aarhus, Denmark. 2008.

[5] Peinado, V., Gonzalo, J., Artiles, J., López-Ostenero, F.: Log Analysis of Multilingual Image Search in Flickr. In: Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Aarhus, Denmark. LNCS vol. 5706, pp. 236-242. Springer Verlag. 2009.