=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Image-CaputoEt2014
|storemode=property
|title=Overview of the ImageCLEF 2014 Domain Adaptation Task
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Image-CaputoEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/CaputoP14
}}
==Overview of the ImageCLEF 2014 Domain Adaptation Task==
Overview of the ImageCLEF 2014 Domain Adaptation Task

Barbara Caputo (1) and Novi Patricia (2)

(1) University of Rome La Sapienza, Italy
(2) Idiap Research Institute, Switzerland

Abstract. This paper describes the first edition of the Domain Adaptation Task at ImageCLEF 2014. Domain adaptation refers to the challenge of leveraging, when working on a new data collection, knowledge acquired while learning to recognize given classes on a different database. We describe the scientific motivations behind the task, the research challenge on which the 2014 edition focused, the data and evaluation metric, and the results obtained by participants. After a discussion of the lessons learned during this first edition, we conclude with possible ideas for future editions of the task.

1 Introduction and Motivation

The amount of freely available and annotated image collections has increased dramatically over the last years, thanks to the diffusion of high-quality cameras and to the introduction of new and cheap annotation tools such as Mechanical Turk [3]. Attempts to leverage over and across such large data sources have proved challenging. Indeed, tools like Google Goggles (http://www.google.com/mobile/goggles) are able to reliably recognize limited classes of objects, like books or wine labels, but are not able to generalize across generic objects like food items, clothing items and so on. Several authors showed that, for a given task, training on one dataset (e.g. Pascal VOC 07) and testing on another (e.g. ImageNet) produces very poor results, although the set of depicted object categories is the same [10,13,6,12]. In other words, existing object categorization methods do not generalize well across databases.

This problem is known in the literature as the domain adaptation challenge, as it is known in machine learning for speech and language processing [1,5]. A source domain (S) usually contains a large amount of labeled images, while a target domain (T) refers broadly to a dataset that is assumed to have different characteristics from the source, and few or no labeled samples. Formally, two domains differ when their probability distributions differ: P_S(x, y) ≠ P_T(x, y), where x ∈ X indicates the generic image sample and y ∈ Y the corresponding class label. Within this context, the across-dataset generalization problem stems from an intrinsic difference between the underlying distributions of the data. Addressing this issue would have a tremendous impact on the generality and adaptability of any vision-based annotation system.

Current research in domain adaptation focuses on a scenario where:
– (a) the prior domain (source) consists of one or at most two databases;
– (b) the labels of the source and the target domain are the same; and
– (c) the number of annotated training data for the target domain is limited.

The goal of the Domain Adaptation Task, initiated in 2014 under the ImageCLEF umbrella [4], is to push the state of the art in domain adaptation towards more realistic settings, relaxing these assumptions. Our ambition is to provide, over the years, challenging problems and data collections that might stimulate and support novel research in the field.

In the rest of the paper we describe the 2014 Domain Adaptation Task (section 2.1), the data and features provided to the participants (section 2.2), and the evaluation metric adopted (section 2.3). Section 3 describes the results obtained, while section 4 provides an in-depth discussion of those results and identifies possible new directions for the 2015 edition of the task. Conclusions are given in section 5.
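Before moving on, it may help to restate the adaptation setting above in a compact form. The following formalization is ours and does not appear in the original task description: the symbols D_{S_k}, D_T, K and m are introduced here only to combine the distribution-mismatch definition with the semi-supervised, multi-source protocol of section 2.

<pre>
% Hedged formalization (our notation, not from the paper): K labelled source
% sets, a small labelled target sample, mismatched distributions, and the
% target-risk objective that participating classifiers implicitly aim to minimize.
\[
  \mathcal{D}_{S_k} = \{(x_i, y_i)\}_{i=1}^{n_k} \sim P_{S_k}(x, y), \quad k = 1, \dots, K,
  \qquad
  \mathcal{D}_{T} = \{(x_j, y_j)\}_{j=1}^{m} \sim P_T(x, y), \quad m \ll n_k,
\]
\[
  P_{S_k}(x, y) \neq P_T(x, y) \ \text{for all } k,
  \qquad
  f^{*} = \arg\min_{f} \; \mathbb{E}_{(x, y) \sim P_T}\big[\mathbf{1}\{f(x) \neq y\}\big].
\]
</pre>

In the 2014 setting K = 4 and m is small but non-zero (semi-supervised adaptation), as detailed in the next section.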
2 The 2014 Domain Adaptation Task

In this section we describe the Domain Adaptation Task proposed in the ImageCLEF 2014 lab. We first outline the research challenge we aimed at addressing (section 2.1). Then we describe the data collection used and the features provided to all participants (section 2.2), and the evaluation metric used (section 2.3).

2.1 The Research Challenge

In the 2014 version (first edition) of the Domain Adaptation Task, we focused on the number of sources available to the system. Current experimental settings, widely used in the community, typically consider one source and one target [10], or at most two sources and one target [6,11]. This scenario is unrealistic: with the wide abundance of annotated resources and data collections made available to users, and with the fast progress being made in the image annotation community, it is likely that systems will be able to access more and more databases, and therefore to leverage over a much larger number of sources than the two considered in the most challenging settings today. To push research towards more realistic scenarios, the 2014 edition of the Domain Adaptation Task proposed an experimental setup with four sources, built by exploiting existing available resources such as the ImageNet and Caltech-256 [7] databases. Participants were thus requested to build recognition systems for the target classes by leveraging over such source knowledge. We considered a semi-supervised setting, i.e. a setting where the target data, for each class, is limited but annotated. In the next section we describe in detail the data used for the sources, the classes contained both in the source and the target, and the target data provided to participants.

2.2 Data and Features

Source and Target Data. To define the source and target data, we considered five publicly available databases:
– the Caltech-256 database, consisting of 256 object categories, with a total of 30,607 images;
– the ImageNet ILSVRC2012 database, organized according to the WordNet hierarchy, with an average of 500 images per node;
– the PASCAL VOC2012 database, an image data set for object class recognition with 20 object classes;
– the Bing database, containing all 256 categories from the Caltech-256 one, augmented with 300 web images per category collected through textual search using Bing;
– and the SUN database, a scene understanding database that contains 899 categories and 130,519 images.

We then selected twelve classes common to all the datasets listed above: aeroplane, bike, bird, boat, bottle, bus, car, dog, horse, monitor, motorbike, and people. Figure 1 illustrates the images contained for each class in each of the considered datasets.

Fig. 1. Exemplar images for the 12 classes from the five selected public databases.

As sources, we considered 50 images representing each of the classes listed above from the Caltech-256, ImageNet, PASCAL and Bing databases. The 50 images were randomly selected from all those contained in each data collection, for a total of 600 images per source. As target, we used images taken from the SUN database for each class. We randomly selected 5 images per class for training and 50 images per class for testing; these data were given to all participants as the validation set. The test set consisted of 50 images for each class, for a total of 600, manually collected by us using the class names as textual queries with standard search engines.
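To make the sampling protocol concrete, the sketch below reproduces the split sizes described above. It is illustrative only and was not used to build the benchmark: the list_images helper, the dataset name strings and the random seed are assumptions, and the official 600-image test set was collected manually from web searches rather than sampled from SUN.

<pre>
import random

# The 12 common classes and the 4 source databases named in section 2.2.
CLASSES = ["aeroplane", "bike", "bird", "boat", "bottle", "bus",
           "car", "dog", "horse", "monitor", "motorbike", "people"]
SOURCES = ["Caltech-256", "ImageNet", "PASCAL-VOC2012", "Bing"]

def list_images(dataset, class_name):
    """Hypothetical helper: return all image identifiers of one class in one dataset."""
    raise NotImplementedError

def build_splits(seed=0):
    rng = random.Random(seed)
    # Each source: 50 randomly drawn images per class, i.e. 600 images per source.
    sources = {ds: {c: rng.sample(list_images(ds, c), 50) for c in CLASSES}
               for ds in SOURCES}
    # Target (SUN): 5 labelled training images and 50 test images per class,
    # released to the participants as validation data.
    target_train, target_test = {}, {}
    for c in CLASSES:
        pool = rng.sample(list_images("SUN", c), 55)
        target_train[c], target_test[c] = pool[:5], pool[5:]
    return sources, target_train, target_test
</pre>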
Features. Instead of making the images directly available to participants, we decided to release pre-computed features only, in order to keep the focus on the learning aspects of the algorithms in this year's competition. Thus, we represented every image with dense SIFT descriptors (PHOW features) computed at points on a regular grid with a spacing of 128 pixels [2]. At each grid point the descriptors were computed over four patches with different radii, hence each point was represented by four SIFT descriptors. The dense features were vector quantized into 256 visual words using k-means clustering on a randomly chosen subset of the Caltech-256 database. Finally, all images were converted to 2×2 spatial histograms over the 256 visual words, resulting in 1024-dimensional feature vectors. The software used for computing such features is available at www.vlfeat.org.
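A pipeline of this kind can be approximated with off-the-shelf tools. The sketch below uses OpenCV and scikit-learn rather than the VLFeat code actually employed, so it will not reproduce the distributed feature files exactly; the grid step and patch sizes are illustrative assumptions, while the 256-word vocabulary and the 2×2 spatial histogram follow the description above.

<pre>
import numpy as np
import cv2                                   # OpenCV, for SIFT descriptors
from sklearn.cluster import MiniBatchKMeans  # visual vocabulary learning

def dense_sift(gray, step=8, sizes=(4, 8, 12, 16)):
    """Dense SIFT on a regular grid, one descriptor per patch size (PHOW-style).

    gray: 2-D uint8 grayscale image. Returns (grid points, 128-D descriptors).
    """
    sift = cv2.SIFT_create()
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(s))
           for y in range(step, h - step, step)
           for x in range(step, w - step, step)
           for s in sizes]
    _, desc = sift.compute(gray, kps)
    pts = np.array([kp.pt for kp in kps])
    return pts, desc

def spatial_bow(gray, kmeans, grid=2):
    """2x2 spatial histogram of visual words -> grid*grid*n_words dimensions."""
    pts, desc = dense_sift(gray)
    words = kmeans.predict(desc.astype(np.float32))
    h, w = gray.shape
    hist = np.zeros((grid, grid, kmeans.n_clusters))
    for (x, y), v in zip(pts, words):
        hist[min(int(y * grid / h), grid - 1),
             min(int(x * grid / w), grid - 1), v] += 1
    hist = hist.ravel()                 # 1024 dimensions for 256 words and a 2x2 grid
    return hist / max(hist.sum(), 1)    # L1 normalization

# Vocabulary: 256 k-means centers fitted on descriptors from a subset of Caltech-256,
# e.g. kmeans = MiniBatchKMeans(n_clusters=256, random_state=0).fit(stacked_descriptors)
</pre>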
2.3 Evaluation Metrics

We asked participants to provide the class name for each of the 600 test images released. Results were compared with the ground truth, and a score was assigned as follows:
– each correctly classified image received 1 point;
– each misclassified image received 0 points.

Together with the validation data, we provided all participants with a Matlab script for evaluating the performance of their algorithms before the official submission, i.e. on the validation data. The script had been tested under Matlab (ver 8.1.0.64) and Octave (ver 3.6.2).

3 Results

While 19 groups registered to the domain adaptation task to receive access to the training and validation data, only 3 groups eventually submitted runs: the XRCE group, the Hubert Curien Lab group and the Idiap group (organizers). They submitted the following algorithms:

– The XRCE group submitted a set of runs based on several heterogeneous domain adaptation methods, whose predictions were subsequently fused. By combining the output of instance-based approaches and a metric learning one with a brute-force SVM prediction, they obtained a set of heterogeneous classifiers, all producing class predictions for the target domain instances. These predictions were combined through different versions of majority voting in order to improve the overall accuracy (a minimal illustration of such a voting scheme is sketched after Table 2).
– The Hubert Curien Lab group did not submit any working notes, nor did they send any details about their algorithm. We are therefore not able to describe it.
– The Idiap group submitted a baseline run using a recently introduced learning to learn algorithm [9]. The approach considers source classifiers as experts, and it combines their confidence outputs with a high-level cue integration scheme, as opposed to the mid-level one proposed in [8]. The algorithm is called High-level Learning to Learn (H-L2L). As our goal was not to obtain the best possible performance but rather to provide an off-the-shelf baseline against which to compare the results of the other participants, we did not perform any parameter tuning.

Table 1 reports the final ranking among groups. Table 2 reports the results obtained by the best run submitted by each group, for each of the 12 target classes. We see that XRCE obtained the best score, followed by the Hubert Curien Lab. The Idiap baseline obtained the worst score, clearly pointing towards the importance of parameter selection in this kind of benchmark evaluation.

Table 1. Ranking and best score obtained by the three groups that submitted runs.

Rank  Group                    Score
1     XRCE                     228
2     Hubert Curien Lab Group  158
3     Idiap                    45

Table 2. Class-by-class score obtained by the three groups that submitted runs.

class       Score XRCE  Score Hubert Curien  Score Idiap
aeroplane   41          36                   3
bike        12          7                    1
bird        15          15                   0
boat        18          5                    4
bottle      20          25                   3
bus         23          10                   6
car         17          13                   7
dog         8           8                    3
horse       17          6                    2
monitor     28          15                   3
motorbike   12          7                    3
people      17          11                   10
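As a rough illustration of the fusion strategy adopted by the winning entry, the snippet below implements plain majority voting over the hard predictions of several independently trained classifiers, together with the scoring rule of section 2.3. This is a sketch under our own assumptions (the prediction arrays are placeholders); the actual XRCE runs used more elaborate voting variants and are not reproduced here.

<pre>
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Fuse hard predictions from several classifiers.

    predictions: list of 1-D sequences, one per classifier, each of length
    n_samples and containing predicted class labels. Ties are broken in
    favour of the earliest classifier in the list.
    """
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        fused.append(next(v for v in votes if counts[v] == best))
    return np.array(fused)

def task_score(y_pred, y_true):
    """Task scoring rule: 1 point per correctly classified test image, 0 otherwise."""
    return int(np.sum(np.asarray(y_pred) == np.asarray(y_true)))
</pre>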
4 Analysis and Discussion

The clear success of the XRCE group, obtained by combining several domain adaptation methods presented in the literature, seems to indicate that current methods, taken individually, are not able to effectively address the problem of leveraging over multiple sources. Ensemble methods, chosen by at least two teams, appear instead to be a viable option in this setting, whether used to combine the output of various domain adaptation algorithms or to combine the output confidences of several sources.

The choice to provide participants only with the features computed from each image, and not the images themselves, forced groups to focus on the learning aspects of the problem, but perhaps did not allow for enough flexibility in attacking it. We do not plan to repeat this choice in future editions of the task.

A last remark should be made on the scarce participation in the task. Even though only three groups eventually submitted runs, 19 groups expressed interest and registered in order to access the training and validation data. We believe that this is an indicator of enough interest to push us to organize the task again next year, also collecting feedback from the participating and registered groups in order to identify possible problems in the current edition and to offer a more engaging edition of the task in the future.

5 Conclusions

The first edition of the Domain Adaptation Task, organized under the ImageCLEF umbrella, focused on the problem of building a classifier in a target domain while leveraging over four different sources. Nineteen groups registered for the task, and eventually three groups submitted runs, with XRCE winning the competition with an ensemble-learning based method. For the 2015 edition of the task, we plan to make available to participants the raw images, as opposed to the pre-computed features released in 2014, so as to allow for a wider generality of approaches. We will continue to propose data supporting the problem of leveraging from multiple sources, possibly by augmenting the number of classes (which was 12 in the 2014 edition), and/or allowing for a partial overlap of classes between sources and between sources and target, as proposed in [12]. In order to significantly increase the number of participants in the task next year, we will contact all groups that registered for the task and ask their preferences among these different options.

Acknowledgments

This work was partially supported by the Swiss National Science Foundation project Situated Vision (SIVI).

References

1. Blitzer, J., McDonald, R., Pereira, F.: Domain adaptation with structural correspondence learning. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2006)
2. Bosch, A., Zisserman, A.: Image classification using random forests and ferns. In: Proc. CVPR (2007)
3. Buhrmester, M., Kwang, T., Gosling, S.D.: Amazon's Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science 6(1), 3–5 (2011)
4. Caputo, B., Müller, H., Martinez-Gomez, J., Villegas, M., Acar, B., Patricia, N., Marvasti, N., Üsküdarlı, S., Paredes, R., Cazorla, M., Garcia-Varea, I., Morell, V.: ImageCLEF 2014: Overview and analysis of the results. In: CLEF Proceedings. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2014)
5. Daumé III, H.: Frustratingly easy domain adaptation. In: Association for Computational Linguistics Conference (ACL) (2007)
6. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: Proc. CVPR (2012). Extended version considering its additional material
7. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Tech. Rep. UCB/CSD-04-1366, California Institute of Technology (2007)
8. Jie, L., Tommasi, T., Caputo, B.: Multiclass transfer learning from unconstrained priors. In: Proc. ICCV (2011)
9. Patricia, N., Caputo, B.: Learning to learn, from transfer learning to domain adaptation: a unifying perspective. In: Proc. CVPR (2014)
10. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Proc. ECCV (2010)
11. Tommasi, T., Caputo, B.: Frustratingly easy NBNN domain adaptation. In: Proc. ICCV (2013)
12. Tommasi, T., Quadrianto, N., Caputo, B., Lampert, C.: Beyond dataset bias: Multi-task unaligned shared knowledge transfer. In: Proc. ACCV (2012)
13. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Proc. CVPR (2011)